## How to normalize the test dataset

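* The test data simulates the real-world environment the model will run in.
* Once the model is built, we often cannot obtain all the test data at once, so the mean and standard deviation of the test set cannot be computed.
* Therefore, save the mean and standard deviation of the training set, and normalize every test sample with those training statistics.
* Every sample fed to the model afterwards must go through the same normalization; the scaling step is part of the algorithm.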

    (X_test - mean_train) / std_train 
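
The formula above can be sketched directly in NumPy. This is a minimal illustration on made-up toy data (the arrays below are hypothetical, not taken from the notebook); the point is only that `mean_train` and `std_train` come from the training set and are reused unchanged for the test set.

```python
import numpy as np

# Hypothetical toy data: rows are samples, columns are features.
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, 40.0]])

# Statistics are computed from the training set only.
mean_train = X_train.mean(axis=0)
std_train = X_train.std(axis=0)

# The same training statistics are applied to both sets.
X_train_scaled = (X_train - mean_train) / std_train
X_test_scaled = (X_test - mean_train) / std_train
```

After scaling, each training column has mean 0 and standard deviation 1, while test values are expressed relative to the training distribution and may fall well outside that range.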

## Scaler in Scikit-Learn

In [2]:
import numpy as np
from sklearn import datasets

In [3]:
iris = datasets.load_iris()

In [4]:
X = iris.data
y = iris.target

In [23]:
X[:10,:]

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1]])

First, split the data with train_test_split

In [6]:
from sklearn.model_selection import train_test_split

In [8]:
X_train,X_test,y_train,y_test = train_test_split(X, y, test_size=0.2, random_state=666)

### StandardScaler in Scikit-learn

In [9]:
from sklearn.preprocessing import StandardScaler

In [12]:
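standardScale = StandardScaler()
# Compute and store the mean and standard deviation of each feature in the training set
# (fit ignores y, so passing y_train is optional)
standardScale.fit(X_train, y_train)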

StandardScaler(copy=True, with_mean=True, with_std=True)

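Print the mean and the standard deviation of each feature (`scale_` holds the standard deviation, not the variance)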

In [13]:
print(standardScale.mean_)

[5.83416667 3.0825     3.70916667 1.16916667]


In [14]:
print(standardScale.scale_)

[0.81019502 0.44076874 1.76295187 0.75429833]
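
As a quick sanity check, the fitted attributes can be compared against plain NumPy statistics. This is a sketch that reloads iris and repeats the split above so it runs on its own; the variable name `scaler` and the `same_*` flags are introduced here for illustration only.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=666)

scaler = StandardScaler().fit(X_train)

# mean_ is the per-feature mean of the training set
same_mean = np.allclose(scaler.mean_, X_train.mean(axis=0))
# scale_ is the per-feature population standard deviation (ddof=0)
same_std = np.allclose(scaler.scale_, X_train.std(axis=0))
# var_ stores the variance, i.e. scale_ squared
same_var = np.allclose(scaler.var_, scaler.scale_ ** 2)
```

All three flags should come out `True` for this dataset, which confirms that `scale_` is a standard deviation and that the variance lives in `var_`.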


#### First, normalize the training set

In [15]:
X_train = standardScale.transform(X_train)
X_train

array([[-0.90616043,  0.94720873, -1.30982967, -1.28485856],
       [-1.15301457, -0.18717298, -1.30982967, -1.28485856],
       [-0.16559799, -0.64092567,  0.22169257,  0.17345038],
       [ 0.45153738,  0.72033239,  0.95909217,  1.49918578],
       [-0.90616043, -1.3215547 , -0.40226093, -0.0916967 ],
       [ 1.43895396,  0.2665797 ,  0.56203085,  0.30602392],
       [ 0.3281103 , -1.09467835,  1.07253826,  0.30602392],
       [ 2.1795164 , -0.18717298,  1.63976872,  1.2340387 ],
       [-0.78273335,  2.30846679, -1.25310662, -1.4174321 ],
       [ 0.45153738, -2.00218372,  0.44858475,  0.43859746],
       [ 1.80923518, -0.41404933,  1.46959958,  0.83631808],
       [ 0.69839152,  0.2665797 ,  0.90236912,  1.49918578],
       [ 0.20468323,  0.72033239,  0.44858475,  0.571171  ],
       [-0.78273335, -0.86780201,  0.10824648,  0.30602392],
       [-0.53587921,  1.40096142, -1.25310662, -1.28485856],
       [-0.65930628,  1.40096142, -1.25310662, -1.28485856],
       [-1.0295875 ,  0.
       ... (output truncated)

In [17]:
X_test_standard = standardScale.transform(X_test)
X_test_standard

array([[-0.28902506, -0.18717298,  0.44858475,  0.43859746],
       [-0.04217092, -0.64092567,  0.78892303,  1.63175932],
       [-1.0295875 , -1.77530738, -0.2320918 , -0.22427024],
       [-0.04217092, -0.86780201,  0.78892303,  0.96889162],
       [-1.52329579,  0.03970336, -1.25310662, -1.28485856],
       [-0.41245214, -1.3215547 ,  0.16496953,  0.17345038],
       [-0.16559799, -0.64092567,  0.44858475,  0.17345038],
       [ 0.82181859, -0.18717298,  0.84564608,  1.10146516],
       [ 0.57496445, -1.77530738,  0.39186171,  0.17345038],
       [-0.41245214, -1.09467835,  0.39186171,  0.04087684],
       [ 1.06867274,  0.03970336,  0.39186171,  0.30602392],
       [-1.64672287, -1.77530738, -1.36655271, -1.15228502],
       [-1.27644165,  0.03970336, -1.19638358, -1.28485856],
       [-0.53587921,  0.72033239, -1.25310662, -1.01971148],
       [ 1.68580811,  1.17408507,  1.35615349,  1.76433286],
       [-0.04217092, -0.86780201,  0.22169257, -0.22427024],
       [-1.52329579,  1.
       ... (output truncated)
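
To make explicit what `transform` does to the test set, the sketch below compares it with the manual formula using the scaler's stored training statistics. It repeats the data loading and split so it is self-contained; `scaler`, `X_test_manual`, `matches`, and `test_means` are names chosen here for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=666)

# Fit on the training set only, then transform the test set
scaler = StandardScaler().fit(X_train)
X_test_standard = scaler.transform(X_test)

# Manual version of the same step, using training-set statistics
X_test_manual = (X_test - scaler.mean_) / scaler.scale_
matches = np.allclose(X_test_standard, X_test_manual)

# The test set is not re-centred on its own statistics, so its
# column means are generally not exactly zero.
test_means = X_test_standard.mean(axis=0)
```

Because the scaler never sees the test data during `fit`, the transformed test columns need not have zero mean or unit variance, and that is expected.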

Test the kNN classifier with the normalized data

In [18]:
from sklearn.neighbors import KNeighborsClassifier

In [19]:
knn_clf = KNeighborsClassifier(n_neighbors=3)

In [20]:
knn_clf.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')

In [24]:
knn_clf.score(X_test_standard, y_test)

1.0

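**One point to stress: if the training set was normalized, the test set must be normalized in the same way before prediction; otherwise the predictions are badly wrong, as the score on the raw test set shows.**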

In [25]:
knn_clf.score(X_test, y_test)

0.3333333333333333
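
One way to avoid forgetting the scaling step is to bundle the scaler and the classifier into a single estimator. The notebook does not do this; the following is a sketch using scikit-learn's `make_pipeline`, with `pipe` and `pipe_score` as names introduced here.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=666)

# The pipeline fits the scaler on the training data, then fits kNN
# on the scaled training data.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
pipe.fit(X_train, y_train)

# score() scales the raw X_test with the training statistics
# before handing it to kNN, so raw data can be passed in safely.
pipe_score = pipe.score(X_test, y_test)
```

With this design the raw test set is passed straight to `score` or `predict`, and the training-set mean and standard deviation travel with the model, which matches the idea that normalization is part of the algorithm.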