## How should the test set be normalized?

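In general, we normalize the training set (standardization is the usual choice). Should the test set go through the same procedure, computing its own mean and variance?

In fact, that is not advisable. The model we train is meant to be used in a real environment, and the test data stands in for that real (production) environment.

So the test set cannot be normalized by recomputing statistics the way we did for the training set, for the following reason:

- In a real environment, it is very likely impossible to obtain the mean and variance of all the test data, for example when samples arrive one at a time.

The correct approach is to normalize the test set directly with the mean and variance of the training set, so we need to save the training set's mean and variance.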
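As a rough illustration of this idea, the sketch below standardizes a toy training set and then reuses the saved training statistics on a toy test set. The arrays and variable names are made up for illustration and are not part of the original notebook.

```python
import numpy as np

# Toy data, invented for illustration: two features on very different scales.
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])
X_test = np.array([[2.5, 250.0], [5.0, 500.0]])

# Statistics are computed on the training set only, then kept.
train_mean = X_train.mean(axis=0)
train_std = X_train.std(axis=0)

# The same saved statistics are applied to both sets.
X_train_scaled = (X_train - train_mean) / train_std
X_test_scaled = (X_test - train_mean) / train_std

print(X_test_scaled)
```

The test set's own mean and standard deviation are never computed, so the same code would work for a single incoming sample.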

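Scikit-learn provides Scaler classes for handling data normalization.

## The Scaler classes in Scikit-learn

The Scaler classes follow the same design as Scikit-learn's other machine learning algorithms (`fit` first, then apply). Let's try it in practice.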

In [85]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler  # standardization class from the preprocessing module

# Build the data (iris dataset)
iris_data = datasets.load_iris()
X = iris_data.data
y = iris_data.target
X[:10, :]

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1]])

In [86]:
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100)

In [87]:
# Fit the scaler on the training set

standardScale = StandardScaler()
standardScale.fit(X_train)  # compute each feature's mean and standard deviation

StandardScaler(copy=True, with_mean=True, with_std=True)

In [88]:
# Inspect the means
standardScale.mean_

array([5.78916667, 3.05      , 3.72833333, 1.1875    ])

In [89]:
# Inspect the standard deviations (scale_ holds the standard deviation, not the variance)
standardScale.scale_


array([0.79987456, 0.43397389, 1.71610913, 0.74280577])
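Note that `scale_` holds the per-feature standard deviation, while the variance is stored in `var_`. The short check below is a sketch that fits a fresh scaler on the full iris data (not the split used above) to show how the two attributes relate.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

X = datasets.load_iris().data
scaler = StandardScaler().fit(X)

# var_ is the per-feature variance; scale_ is its square root.
same_as_sqrt_var = np.allclose(scaler.scale_, np.sqrt(scaler.var_))

# The standard deviation is the population one (ddof=0), like np.std's default.
same_as_np_std = np.allclose(scaler.scale_, X.std(axis=0))

print(same_as_sqrt_var, same_as_np_std)
```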

In [90]:
# Normalize the training set
X_train = standardScale.transform(X_train)
X_train[:10,:]


array([[-0.36151502, -1.49778598, -0.01651022, -0.2524213 ],
       [-0.11147581, -0.57607153,  0.21657519,  0.15145278],
       [ 0.263583  , -1.9586432 ,  0.74101736,  0.42070217],
       [-1.23665225, -0.11521431, -1.35675132, -1.46404355],
       [-0.48653462,  1.9586432 , -1.41502267, -1.06016947],
       [ 0.76366141,  0.34564292,  0.4496606 ,  0.42070217],
       [-0.86159344,  1.72821459, -1.24020862, -1.32941885],
       [-0.36151502, -1.26735736,  0.15830384,  0.15145278],
       [ 0.63864181,  0.80650014,  1.09064548,  1.63232442],
       [ 0.01354379, -0.57607153,  0.79928872,  1.63232442]])

In [91]:
# Normalize the test set (using the statistics fitted on the training set)
X_test_standard = standardScale.transform(X_test)
X_test_standard[:10, :]

array([[ 0.76366141, -0.57607153,  1.09064548,  1.22845033],
       [-1.23665225,  0.80650014, -1.24020862, -1.32941885],
       [ 2.38891626, -1.03692875,  1.84817306,  1.49769972],
       [-0.11147581,  3.11078626, -1.29847997, -1.06016947],
       [ 0.63864181, -0.80650014,  0.68274601,  0.82457625],
       [ 2.38891626, -0.11521431,  1.38200224,  1.49769972],
       [-0.73657383,  2.41950042, -1.29847997, -1.46404355],
       [-1.11163264, -0.11521431, -1.35675132, -1.32941885],
       [ 0.88868102, -0.11521431,  1.03237413,  0.82457625],
       [-1.23665225, -0.11521431, -1.35675132, -1.19479416]])

In [92]:
from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=3, p=2,
                     weights='uniform')

In [93]:
# Model score on the normalized test set
knn_clf.score(X_test_standard, y_test)  # accuracy of 100%

1.0

Note: the model was trained on normalized data, so you must not pass in the unnormalized test set!

In [94]:
knn_clf.score(X_test, y_test)  # the score is very low

0.43333333333333335
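One way to make this mistake harder is to bundle the scaler and the classifier into a single estimator with `make_pipeline` from `sklearn.pipeline`. This is a sketch of an alternative, not what the original notebook does; it reuses the same split parameters as above.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=100
)

# fit() fits the scaler on X_train only, then trains KNN on the scaled data.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
pipe.fit(X_train, y_train)

# score() applies the stored training statistics to X_test before predicting,
# so the raw test set can be passed in directly.
score = pipe.score(X_test, y_test)
print(score)
```

With a pipeline the raw `X_test` is the right input, because the scaling step travels with the model.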

Testing our own Scaler class

In [95]:
# Build a new dataset (not yet normalized)
data_iris = datasets.load_iris()
X2 = data_iris.data
y2 = data_iris.target

X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y2)

In [96]:
%run preprocessing/preprocessing.py

# Import our own StandardScaler class (this shadows the sklearn StandardScaler imported earlier)
from preprocessing.preprocessing import StandardScaler

my_standardScale = StandardScaler()

In [97]:
# fit and transform on the training set
my_standardScale.fit(X_train2)
X_train2 = my_standardScale.transform(X_train2)
X_train2[:10, :]

array([[ 0.47198385,  0.79152547,  0.9544239 ,  1.47445467],
       [-0.97629537,  1.02238707, -1.38053324, -1.15514602],
       [ 0.71336373,  0.32980228,  0.44187233,  0.4226144 ],
       [-0.7349155 ,  1.02238707, -1.26663289, -1.28662605],
       [ 0.83405366, -0.13192091,  1.01137408,  0.8170545 ],
       [-1.82112492, -0.13192091, -1.49443359, -1.41810609],
       [ 2.28233289, -0.59364411,  1.69477617,  1.08001457],
       [ 2.28233289, -0.13192091,  1.35307512,  1.47445467],
       [ 1.07543353, -0.13192091,  0.72662321,  0.68557447],
       [-0.37284569, -1.0553673 ,  0.38492216,  0.02817429]])

In [98]:
# Test set: transform only, reusing the statistics fitted on the training set
X_test_standard2 = my_standardScale.transform(X_test2)
X_test_standard2[:10, :]

array([[ 0.10991405,  0.32980228,  0.61272286,  0.8170545 ],
       [ 1.3168134 ,  0.09894068,  0.66967303,  0.4226144 ],
       [ 1.92026308, -0.59364411,  1.35307512,  0.94853453],
       [-0.25215576, -0.59364411,  0.66967303,  1.08001457],
       [-1.33836518,  0.32980228, -1.38053324, -1.28662605],
       [ 0.83405366, -0.59364411,  0.49882251,  0.4226144 ],
       [-0.85560544,  0.56066388, -1.15273254, -0.89218595],
       [-1.21767524,  0.79152547, -1.20968272, -1.28662605],
       [-0.97629537,  1.25324867, -1.32358307, -1.28662605],
       [ 1.07543353,  0.56066388,  1.12527443,  1.73741474]])

In [99]:
# Check how our own Scaler class performs
knn_clf2 = KNeighborsClassifier(n_neighbors=3)
knn_clf2.fit(X_train2, y_train2)
knn_clf2.score(X_test_standard2, y_test2) 

0.9736842105263158

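This score is below the 100% obtained earlier, but the two runs are not directly comparable: this split uses the defaults of `train_test_split` (a 25% test set and no fixed `random_state`), whereas the earlier one used `test_size=0.2, random_state=100`. The gap therefore does not by itself show that the hand-written Scaler class is worse than Scikit-learn's.

The main goal here is to learn how data normalization works and the design behind Scikit-learn's encapsulation.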
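The contents of `preprocessing/preprocessing.py` are not shown here. As a minimal sketch of what such a class might look like, the hypothetical `MyStandardScaler` below mimics the `fit` / `transform` interface with NumPy; the class name and the toy arrays are assumptions for illustration only.

```python
import numpy as np


class MyStandardScaler:
    """Hypothetical minimal scaler; not the author's preprocessing.py."""

    def __init__(self):
        self.mean_ = None
        self.scale_ = None

    def fit(self, X):
        # Learn per-feature mean and standard deviation from the training data.
        X = np.asarray(X, dtype=float)
        self.mean_ = X.mean(axis=0)
        self.scale_ = X.std(axis=0)
        return self

    def transform(self, X):
        # Apply the stored statistics; nothing is re-estimated here.
        if self.mean_ is None or self.scale_ is None:
            raise RuntimeError("call fit() before transform()")
        X = np.asarray(X, dtype=float)
        return (X - self.mean_) / self.scale_


# Toy data, invented for illustration.
X_train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
X_test = np.array([[3.0, 20.0]])

scaler = MyStandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(X_test_scaled)
```

This sketch divides by zero for a constant feature; Scikit-learn's `StandardScaler` guards against that case by using a scale of 1 for such features.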
