# Algorithm Principles

![image.png](attachment:72329d52-d4ae-47a5-9c7d-621f03968d70.png)

![image.png](attachment:image.png)

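PCA (principal component analysis) projects the data onto a small number of mutually orthogonal directions along which the variance is largest, so that the reduced representation retains as much of the information in the samples as possible. A larger variance along a direction means the samples are more spread out along it and therefore easier to tell apart, which is what we want when training a model.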
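To make "orthogonal directions of largest variance" concrete, here is a minimal NumPy sketch that computes principal directions from the eigendecomposition of the covariance matrix. The helper name `pca_eig` and the toy matrix `X_demo` are illustrative assumptions, not part of the notebook; scikit-learn's `PCA` uses an SVD internally, but the resulting directions agree up to sign.

```python
import numpy as np

def pca_eig(X, n_components):
    """Hypothetical helper: PCA via eigendecomposition of the covariance matrix."""
    X_centered = X - X.mean(axis=0)              # PCA works on mean-centered data
    cov = np.cov(X_centered, rowvar=False)       # feature-by-feature covariance
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order].T             # rows are principal directions
    explained_ratio = eigvals[order] / eigvals.sum()
    return X_centered @ components.T, components, explained_ratio

X_demo = np.array([[-1, -1], [-2, -1], [-3, -2],
                   [1, 1], [2, 1], [3, 2]], dtype=float)
Z, components, ratio = pca_eig(X_demo, n_components=2)

print(components @ components.T)  # close to the identity: directions are orthonormal
print(ratio)                      # share of total variance along each direction
```

The eigenvectors are the directions, and each eigenvalue is the variance of the data along its direction, so sorting by eigenvalue is what "largest variance first" means in practice.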

# Data Preparation

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import mglearn
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import roc_auc_score

In [2]:
# Load the data
data = pd.read_csv("../data/breast_cancer.csv")
print(data.head())

       0      1       2       3        4        5       6        7       8  \
0  17.99  10.38  122.80  1001.0  0.11840  0.27760  0.3001  0.14710  0.2419   
1  20.57  17.77  132.90  1326.0  0.08474  0.07864  0.0869  0.07017  0.1812   
2  19.69  21.25  130.00  1203.0  0.10960  0.15990  0.1974  0.12790  0.2069   
3  11.42  20.38   77.58   386.1  0.14250  0.28390  0.2414  0.10520  0.2597   
4  20.29  14.34  135.10  1297.0  0.10030  0.13280  0.1980  0.10430  0.1809   

         9  ...     21      22      23      24      25      26      27  \
0  0.07871  ...  17.33  184.60  2019.0  0.1622  0.6656  0.7119  0.2654   
1  0.05667  ...  23.41  158.80  1956.0  0.1238  0.1866  0.2416  0.1860   
2  0.05999  ...  25.53  152.50  1709.0  0.1444  0.4245  0.4504  0.2430   
3  0.09744  ...  26.50   98.87   567.7  0.2098  0.8663  0.6869  0.2575   
4  0.05883  ...  16.67  152.20  1575.0  0.1374  0.2050  0.4000  0.1625   

       28       29  label  
0  0.4601  0.11890      0  
1  0.2750  0.08902      0  
2 

In [3]:
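# Prepare the data: drop rows with missing values, then separate features and label
data = data.dropna()
y = data['label']
x = data.drop(['label'], axis=1).astype('float64')

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)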

# Model Training

In [5]:
model = PCA(n_components=10)
model.fit(x_train)
print(model.explained_variance_ratio_)  # fraction of total variance explained by each of the 10 components

[9.81812429e-01 1.61093766e-02 1.85615255e-03 1.26330735e-04
 8.39206078e-05 6.29071238e-06 3.99000263e-06 8.55317401e-07
 3.70404123e-07 1.88175484e-07]
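The first component explains about 98% of the variance here because the features are on very different scales (some columns are in the thousands, others below 1), and PCA on unscaled data is dominated by the largest-magnitude columns. A common remedy is to standardize the features first. The sketch below assumes the CSV holds the same data as scikit-learn's built-in `load_breast_cancer` dataset, which is used as a stand-in; the names `X_bc`, `scaled_pca`, and `ratios` are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for ../data/breast_cancer.csv (assumption: same dataset)
X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_bc, y_bc, test_size=0.25, random_state=42)

# Standardize each feature to zero mean and unit variance, then apply PCA
scaled_pca = make_pipeline(StandardScaler(), PCA(n_components=10))
scaled_pca.fit(X_tr)

ratios = scaled_pca.named_steps["pca"].explained_variance_ratio_
print(ratios.round(3))           # per-component share of variance
print(ratios.cumsum().round(3))  # cumulative share
```

After standardization the variance is spread over several components instead of being concentrated in one, which usually reflects the structure of the data better than its units do.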


# Saving the Model

In [10]:
# Method 1: joblib
import joblib

# Save the model
joblib.dump(model, '../outputs/best_models/pca.pkl')

# Load the model
model = joblib.load('../outputs/best_models/pca.pkl')

In [11]:
# Method 2: pickle
import pickle

# Save the model
with open('../outputs/best_models/pca.pkl', 'wb') as f:
    pickle.dump(model, f)

# Load the model
with open('../outputs/best_models/pca.pkl', 'rb') as f:
    model = pickle.load(f)
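Both methods write to the same path, so the second overwrites the first. Whichever is used, it is worth checking that the restored model behaves like the original. The sketch below does a round trip through a temporary directory on synthetic data; the names `X_tmp`, `pca_tmp`, and `restored` are illustrative and not part of the notebook.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data, used only to have a fitted model to save
rng = np.random.default_rng(0)
X_tmp = rng.normal(size=(50, 5))
pca_tmp = PCA(n_components=2).fit(X_tmp)

# Save to and load from a temporary directory so nothing is left on disk
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pca.joblib")
    joblib.dump(pca_tmp, path)
    restored = joblib.load(path)

# The restored model should produce the same projection as the original
same = np.allclose(pca_tmp.transform(X_tmp), restored.transform(X_tmp))
print(same)
```

Note that pickle-based files should only be loaded from trusted sources, and a model saved with one scikit-learn version is not guaranteed to load correctly in another.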

# Model Prediction

In [15]:
# Transform the data: project the training set onto the 10 principal components
x_new = model.transform(x_train)

In [18]:
x_new.shape

(426, 10)
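The number of components (10) was fixed by hand above. scikit-learn's `PCA` also accepts a float between 0 and 1 for `n_components`, in which case it keeps the smallest number of components whose cumulative explained variance reaches that fraction. The sketch below assumes the built-in `load_breast_cancer` dataset as a stand-in for the CSV and standardizes it first; the 0.95 threshold is an arbitrary illustrative choice.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in for the notebook's CSV (assumption: same dataset)
X_bc, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X_bc)

# Keep as many components as needed to explain at least 95% of the variance
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(X_scaled)

print(pca_95.n_components_)                    # number of components actually kept
print(pca_95.explained_variance_ratio_.sum())  # cumulative variance they explain
```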

# Model Application

In [32]:
# Reduce the dimensionality first, then classify
pca = PCA(n_components=10)
pca.fit(x_train)
x_new = pca.transform(x_train)

# Classification model
knn = KNeighborsClassifier()
param_grid = {'n_neighbors': [2, 3, 4]}

# Tune the hyperparameter and train the model with GridSearchCV
gsearch = GridSearchCV(knn, param_grid)
knn = gsearch.fit(x_new, y_train)

# Print the best estimator found
print('KNN params:', knn.best_estimator_)

KNN params: KNeighborsClassifier(n_neighbors=3)


In [33]:
x_new = pca.transform(x_test)
prediction = knn.predict(x_new)

In [34]:
# Compute the accuracy
acc = accuracy_score(y_test, prediction)
print("acc:", acc)

acc: 0.9300699300699301


In [35]:
print(classification_report(y_test, prediction))

              precision    recall  f1-score   support

           0       0.91      0.91      0.91        54
           1       0.94      0.94      0.94        89

    accuracy                           0.93       143
   macro avg       0.93      0.93      0.93       143
weighted avg       0.93      0.93      0.93       143
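In the cells above, PCA is fitted once on the whole training set before the grid search, so each cross-validation fold sees components that were computed partly from its own validation rows. One way to avoid this is to put PCA and the classifier in a single `Pipeline`, so that PCA is refitted inside every fold. This is a sketch under the assumption that scikit-learn's built-in `load_breast_cancer` dataset matches the CSV; the step names `"pca"` and `"knn"` are arbitrary labels.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Stand-in for the notebook's CSV (assumption: same dataset)
X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_bc, y_bc, test_size=0.25, random_state=42)

# PCA and KNN are fitted together, so cross-validation refits PCA on each fold
pipe = Pipeline([
    ("pca", PCA(n_components=10)),
    ("knn", KNeighborsClassifier()),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
search = GridSearchCV(pipe, {"knn__n_neighbors": [2, 3, 4]})
search.fit(X_tr, y_tr)

# predict() applies the fitted PCA to the test set before classifying
pipe_acc = accuracy_score(y_te, search.predict(X_te))
print(search.best_params_)
print(pipe_acc)
```

The pipeline also removes the need to call `pca.transform` on the test set by hand, which makes it harder to forget or to apply inconsistently.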



# Feature Reconstruction

In [4]:
# Build a small example matrix
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])

# Compress the 2-feature data to a single feature with PCA
pca = PCA(n_components=1)
pca.fit(X)

X_pca = pca.transform(X)
print("X_pca:\n", X_pca)

X_pca:
 [[ 1.38340578]
 [ 2.22189802]
 [ 3.6053038 ]
 [-1.38340578]
 [-2.22189802]
 [-3.6053038 ]]


Next, apply the inverse transform and print the reconstructed values:

In [5]:
X_origin = pca.inverse_transform(X_pca)
print("X_origin:\n", X_origin)

X_origin:
 [[-1.15997501 -0.75383654]
 [-1.86304424 -1.21074232]
 [-3.02301925 -1.96457886]
 [ 1.15997501  0.75383654]
 [ 1.86304424  1.21074232]
 [ 3.02301925  1.96457886]]
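For a `PCA` fitted without whitening (the default), `inverse_transform` maps the reduced coordinates back by multiplying with the component matrix and adding the feature means. The sketch below checks that relation on the same toy matrix; the names `X_small`, `pca_1d`, and `manual` are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

X_small = np.array([[-1, -1], [-2, -1], [-3, -2],
                    [1, 1], [2, 1], [3, 2]], dtype=float)

pca_1d = PCA(n_components=1).fit(X_small)
Z = pca_1d.transform(X_small)  # shape (6, 1): coordinates along the first component

# Manual reconstruction: scale the component by each coordinate, then add the mean back
manual = Z @ pca_1d.components_ + pca_1d.mean_

matches = np.allclose(manual, pca_1d.inverse_transform(Z))
print(matches)
```

Every reconstructed point therefore lies on the line through the data mean in the direction of the first component, which is why the values differ from the original matrix.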


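Reducing the dimensionality loses information, so the reconstruction above only approximates the original data. If PCA keeps all of the original dimensions, the inverse transform recovers the original values (up to floating-point error):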

In [2]:
# Build a small example matrix
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])

# Keep both components, so PCA does not reduce the dimensionality
pca = PCA(n_components=2)
pca.fit(X)

X_pca = pca.transform(X)
X_origin = pca.inverse_transform(X_pca)
print("X_origin:\n", X_origin)

X_origin:
 [[-1. -1.]
 [-2. -1.]
 [-3. -2.]
 [ 1.  1.]
 [ 2.  1.]
 [ 3.  2.]]
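The information lost at each setting can be quantified as the mean squared error between the original matrix and its reconstruction. This is a small sketch on the same toy data; the dictionary `errors` and the choice of mean squared error as the measure are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

X_small = np.array([[-1, -1], [-2, -1], [-3, -2],
                    [1, 1], [2, 1], [3, 2]], dtype=float)

errors = {}
for k in (1, 2):
    pca_k = PCA(n_components=k).fit(X_small)
    # Project down to k dimensions, then map back to the original 2-D space
    X_rec = pca_k.inverse_transform(pca_k.transform(X_small))
    errors[k] = np.mean((X_small - X_rec) ** 2)

print(errors)  # the error for k=2 is zero up to floating-point rounding
```

With one component the error is small but nonzero, because the discarded second direction carries under 1% of the variance in this example; with both components nothing is discarded.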
