## 08 Scaler in Scikit-learn

Do not normalize the test set with the test set's own mean and std.
![IMAGE](https://farm2.staticflickr.com/1755/42006075894_66a3fb4ccc_o.png)
Use the parameters computed from the training set instead.
![IMAGE](https://farm2.staticflickr.com/1753/27855249217_bde37ce30b_o.png)
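The test set simulates the real environment:
* In a real environment you often cannot obtain the mean and variance of all the test data. If only a single sample arrives, for example, its own mean and variance are meaningless.
* Normalizing the data is itself part of the algorithm.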

> So the parameters of the training set (mean and variance) must be saved → sklearn wraps this up for us in its Scaler classes
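
To make the point above concrete, here is a minimal NumPy sketch. The toy arrays `X_train_demo` and `x_new` are made up for illustration and are not part of the original notebook. The statistics are computed once on the training data, stored, and reused for a sample that arrives later.

```python
import numpy as np

# Hypothetical toy data: 4 training samples with 2 features each
X_train_demo = np.array([[1.0, 10.0],
                         [2.0, 20.0],
                         [3.0, 30.0],
                         [4.0, 40.0]])

# A single sample that arrives later, as it would in a real environment
x_new = np.array([[2.5, 25.0]])

# Compute the statistics on the training set only, and keep them
train_mean = X_train_demo.mean(axis=0)
train_std = X_train_demo.std(axis=0)

# Reuse the stored training statistics for the new sample
x_new_scaled = (x_new - train_mean) / train_std

# The sample's own statistics are useless: with one row, the std is 0
own_std = x_new.std(axis=0)
```

Because `own_std` is zero for a single sample, dividing by it is not an option, which is one reason the training statistics have to be kept.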

![IMAGE](https://farm2.staticflickr.com/1734/42006279704_77cf75b2e9_o.png)

In [3]:
import numpy as np
from sklearn import datasets

In [4]:
iris = datasets.load_iris()

In [5]:
X = iris.data
y = iris.target

In [6]:
X[:10,:]

array([[ 5.1,  3.5,  1.4,  0.2],
       [ 4.9,  3. ,  1.4,  0.2],
       [ 4.7,  3.2,  1.3,  0.2],
       [ 4.6,  3.1,  1.5,  0.2],
       [ 5. ,  3.6,  1.4,  0.2],
       [ 5.4,  3.9,  1.7,  0.4],
       [ 4.6,  3.4,  1.4,  0.3],
       [ 5. ,  3.4,  1.5,  0.2],
       [ 4.4,  2.9,  1.4,  0.2],
       [ 4.9,  3.1,  1.5,  0.1]])

In [7]:
from sklearn.model_selection import train_test_split

In [8]:
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=666)

### StandardScaler in scikit-learn

In [9]:
from sklearn.preprocessing import StandardScaler 
# ❤️ Note that this is imported from the preprocessing package

In [10]:
standardScalar = StandardScaler() 
# ❤️ Lowercase first letter: this creates an instance

In [11]:
standardScalar.fit(X_train) # fit returns the scaler itself

StandardScaler(copy=True, with_mean=True, with_std=True)

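#### An important coding convention:
Variables such as `.mean_` and `.scale_` are not passed in by the user; they are computed during fitting. The user can still read them, which is why their names end with an underscore.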
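
A small sketch of this convention, using a made-up 2x2 array rather than the iris data: constructor arguments such as `with_mean` exist as soon as the object is created, while the underscore-suffixed attributes appear only after `fit`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Constructor parameters (no trailing underscore) exist immediately
has_param = hasattr(scaler, "with_mean")

# Learned attributes (trailing underscore) do not exist before fit
before = hasattr(scaler, "mean_")

# Hypothetical toy data, used only for this illustration
scaler.fit(np.array([[1.0, 2.0],
                     [3.0, 6.0]]))

# After fit, the learned attributes are available
after = hasattr(scaler, "mean_")
learned_mean = scaler.mean_
```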

In [12]:
standardScalar.mean_

array([ 5.83416667,  3.0825    ,  3.70916667,  1.16916667])

This returns a NumPy array with one entry for each of the 4 features.

In [13]:
standardScalar.scale_
# .std_ is deprecated; .scale_ now holds the standard deviation

array([ 0.81019502,  0.44076874,  1.76295187,  0.75429833])
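
As a side note, `scale_` is the population standard deviation (NumPy's default `ddof=0`), not the sample standard deviation. The sketch below checks this on a made-up array; `X_demo` is hypothetical and not taken from the notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 3 samples, 2 features
X_demo = np.array([[1.0, 2.0],
                   [2.0, 4.0],
                   [3.0, 9.0]])

scaler = StandardScaler().fit(X_demo)

pop_std = np.std(X_demo, axis=0)             # divides by n (ddof=0)
sample_std = np.std(X_demo, axis=0, ddof=1)  # divides by n - 1

matches_population = np.allclose(scaler.scale_, pop_std)
matches_sample = np.allclose(scaler.scale_, sample_std)
```

Here `matches_population` should be true and `matches_sample` false, so a hand-written `np.std(X, axis=0)` reproduces `scale_` without any extra arguments.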

In [14]:
standardScalar.transform(X_train)
# ❤️ .transform returns the normalized data

array([[-0.90616043,  0.94720873, -1.30982967, -1.28485856],
       [-1.15301457, -0.18717298, -1.30982967, -1.28485856],
       [-0.16559799, -0.64092567,  0.22169257,  0.17345038],
       [ 0.45153738,  0.72033239,  0.95909217,  1.49918578],
       [-0.90616043, -1.3215547 , -0.40226093, -0.0916967 ],
       [ 1.43895396,  0.2665797 ,  0.56203085,  0.30602392],
       [ 0.3281103 , -1.09467835,  1.07253826,  0.30602392],
       [ 2.1795164 , -0.18717298,  1.63976872,  1.2340387 ],
       [-0.78273335,  2.30846679, -1.25310662, -1.4174321 ],
       [ 0.45153738, -2.00218372,  0.44858475,  0.43859746],
       [ 1.80923518, -0.41404933,  1.46959958,  0.83631808],
       [ 0.69839152,  0.2665797 ,  0.90236912,  1.49918578],
       [ 0.20468323,  0.72033239,  0.44858475,  0.571171  ],
       [-0.78273335, -0.86780201,  0.10824648,  0.30602392],
       [-0.53587921,  1.40096142, -1.25310662, -1.28485856],
       [-0.65930628,  1.40096142, -1.25310662, -1.28485856],
       [-1.0295875 ,  0.

In [15]:
X_train[:10,:]

array([[ 5.1,  3.5,  1.4,  0.2],
       [ 4.9,  3. ,  1.4,  0.2],
       [ 5.7,  2.8,  4.1,  1.3],
       [ 6.2,  3.4,  5.4,  2.3],
       [ 5.1,  2.5,  3. ,  1.1],
       [ 7. ,  3.2,  4.7,  1.4],
       [ 6.1,  2.6,  5.6,  1.4],
       [ 7.6,  3. ,  6.6,  2.1],
       [ 5.2,  4.1,  1.5,  0.1],
       [ 6.2,  2.2,  4.5,  1.5]])

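We can see that X_train itself has not changed: transform returns a new array, so we need to assign the result with `X_train = ...` (or store it in another variable).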

In [16]:
X_train = standardScalar.transform(X_train)

In [17]:
X_train[:10,:]

array([[-0.90616043,  0.94720873, -1.30982967, -1.28485856],
       [-1.15301457, -0.18717298, -1.30982967, -1.28485856],
       [-0.16559799, -0.64092567,  0.22169257,  0.17345038],
       [ 0.45153738,  0.72033239,  0.95909217,  1.49918578],
       [-0.90616043, -1.3215547 , -0.40226093, -0.0916967 ],
       [ 1.43895396,  0.2665797 ,  0.56203085,  0.30602392],
       [ 0.3281103 , -1.09467835,  1.07253826,  0.30602392],
       [ 2.1795164 , -0.18717298,  1.63976872,  1.2340387 ],
       [-0.78273335,  2.30846679, -1.25310662, -1.4174321 ],
       [ 0.45153738, -2.00218372,  0.44858475,  0.43859746]])
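
For the training set, the separate `fit` and `transform` calls can also be combined with `fit_transform`. The sketch below uses a made-up array (`X_demo` is hypothetical) and also confirms that the input array is left untouched under the default `copy=True`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with features on very different scales
X_demo = np.array([[1.0, 100.0],
                   [2.0, 200.0],
                   [3.0, 300.0]])
X_before = X_demo.copy()

scaler = StandardScaler()

# Equivalent to scaler.fit(X_demo).transform(X_demo)
X_scaled = scaler.fit_transform(X_demo)

# The original array is not modified; the result is a new array
unchanged = np.array_equal(X_demo, X_before)
```

`fit_transform` is only meant for the training data. The test data should still go through `transform` alone, so that it reuses the training statistics.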

Next we normalize X_test, still using the same standardScalar instance.

In [18]:
X_test_standard = standardScalar.transform(X_test) 

In [19]:
X_test_standard[:10,:]

array([[-0.28902506, -0.18717298,  0.44858475,  0.43859746],
       [-0.04217092, -0.64092567,  0.78892303,  1.63175932],
       [-1.0295875 , -1.77530738, -0.2320918 , -0.22427024],
       [-0.04217092, -0.86780201,  0.78892303,  0.96889162],
       [-1.52329579,  0.03970336, -1.25310662, -1.28485856],
       [-0.41245214, -1.3215547 ,  0.16496953,  0.17345038],
       [-0.16559799, -0.64092567,  0.44858475,  0.17345038],
       [ 0.82181859, -0.18717298,  0.84564608,  1.10146516],
       [ 0.57496445, -1.77530738,  0.39186171,  0.17345038],
       [-0.41245214, -1.09467835,  0.39186171,  0.04087684]])
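
One consequence worth noting: since the test set is scaled with the training statistics, its columns will generally not have exactly zero mean and unit standard deviation. The NumPy sketch below illustrates this on synthetic data; the random data and the 120/30 split are assumptions made only for this example.

```python
import numpy as np

# Synthetic data standing in for a real dataset (assumption for illustration)
rng = np.random.default_rng(0)
X_all = rng.normal(loc=5.0, scale=2.0, size=(150, 4))
X_tr, X_te = X_all[:120], X_all[120:]

# Statistics come from the training part only
mean_ = X_tr.mean(axis=0)
scale_ = X_tr.std(axis=0)

X_tr_s = (X_tr - mean_) / scale_
X_te_s = (X_te - mean_) / scale_

train_col_means = X_tr_s.mean(axis=0)  # zero up to floating-point error
test_col_means = X_te_s.mean(axis=0)   # close to zero, but not exactly
```

This is expected behaviour and not a bug: the test data is being measured on the scale the model learned from.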

#### kNN classification on the normalized data

In [20]:
from sklearn.neighbors import KNeighborsClassifier

In [21]:
knn_clf = KNeighborsClassifier(n_neighbors=3)

We call fit with the normalized X_train and y_train.

In [22]:
knn_clf.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')

In [23]:
knn_clf.score(X_test_standard, y_test)

1.0

> Once the training data has been normalized, the test data must be normalized as well; otherwise the score will be very low

In [24]:
knn_clf.score(X_test, y_test)

0.33333333333333331
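
One way to avoid forgetting to scale the test data is to bundle the scaler and the classifier into a single estimator. The notebook does not do this; the sketch below is an alternative using scikit-learn's `make_pipeline`, with the same split parameters as above.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=666)

# The pipeline fits the scaler on X_train, then fits kNN on the scaled data
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
pipe.fit(X_train, y_train)

# score() applies the stored training statistics to X_test before predicting
accuracy = pipe.score(X_test, y_test)
```

With this setup the raw `X_test` can be passed in directly, because the scaling step is part of the model.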

### Implementing our own StandardScaler

See the code [here](playML/preprocessing.py):
```python
import numpy as np


class StandardScaler:

    def __init__(self):
        self.mean_ = None
        self.scale_ = None

    def fit(self, X):
        """Compute the per-feature mean and standard deviation of the training set X."""
        # For now we only handle 2-D matrices
        assert X.ndim == 2, "The dimension of X must be 2"

        self.mean_ = np.array([np.mean(X[:,i]) for i in range(X.shape[1])])
        self.scale_ = np.array([np.std(X[:,i]) for i in range(X.shape[1])])

        return self

    def transform(self, X):
        """Standardize X using the mean and standard deviation stored in this StandardScaler."""
        assert X.ndim == 2, "The dimension of X must be 2"
        # fit must be called before transform, so that mean_ and scale_ exist
        assert self.mean_ is not None and self.scale_ is not None, \
               "must fit before transform!"
        assert X.shape[1] == len(self.mean_), \
               "the feature number of X must be equal to mean_ and scale_"

        resX = np.empty(shape=X.shape, dtype=float)
        # ❤️ Allocate an empty float array, then standardize it column by column
        for col in range(X.shape[1]):
            resX[:,col] = (X[:,col] - self.mean_[col]) / self.scale_[col]
        return resX


```
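
The column-by-column loops above can also be written with NumPy's `axis=0` reductions and broadcasting. The following is an alternative sketch, not the code in `playML/preprocessing.py`; the class name `VectorizedStandardScaler` and the toy array are made up for this example.

```python
import numpy as np


class VectorizedStandardScaler:
    """Hypothetical variant of the scaler above, without explicit loops."""

    def __init__(self):
        self.mean_ = None
        self.scale_ = None

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        assert X.ndim == 2, "The dimension of X must be 2"
        # axis=0 reduces over the rows, giving one value per feature
        self.mean_ = X.mean(axis=0)
        self.scale_ = X.std(axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        assert self.mean_ is not None and self.scale_ is not None, \
            "must fit before transform!"
        assert X.ndim == 2 and X.shape[1] == len(self.mean_), \
            "X must be 2-D with the same number of features as in fit"
        # Broadcasting applies the per-feature statistics to every row
        return (X - self.mean_) / self.scale_


# Toy data for illustration only
X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 60.0]])
scaled = VectorizedStandardScaler().fit(X_demo).transform(X_demo)
```

Like the loop version, this sketch does not guard against a constant feature, whose standard deviation of 0 would lead to a division by zero.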

In [25]:
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=666)

In [26]:
from playML.preprocessing import StandardScaler

my_standardScalar = StandardScaler() 
my_standardScalar.fit(X_train)

<playML.preprocessing.StandardScaler at 0x10db9a3c8>

In [27]:
my_standardScalar.mean_

array([ 5.83416667,  3.0825    ,  3.70916667,  1.16916667])

In [28]:
my_standardScalar.scale_

array([ 0.81019502,  0.44076874,  1.76295187,  0.75429833])

In [29]:
X_train = my_standardScalar.transform(X_train)

In [30]:
X_train[:10,:]

array([[-0.90616043,  0.94720873, -1.30982967, -1.28485856],
       [-1.15301457, -0.18717298, -1.30982967, -1.28485856],
       [-0.16559799, -0.64092567,  0.22169257,  0.17345038],
       [ 0.45153738,  0.72033239,  0.95909217,  1.49918578],
       [-0.90616043, -1.3215547 , -0.40226093, -0.0916967 ],
       [ 1.43895396,  0.2665797 ,  0.56203085,  0.30602392],
       [ 0.3281103 , -1.09467835,  1.07253826,  0.30602392],
       [ 2.1795164 , -0.18717298,  1.63976872,  1.2340387 ],
       [-0.78273335,  2.30846679, -1.25310662, -1.4174321 ],
       [ 0.45153738, -2.00218372,  0.44858475,  0.43859746]])

### Min-max normalization in Scikit-Learn

MinMaxScaler: [http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html)


Exercise: try implementing your own MinMaxScaler :)
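
As a starting point for the exercise, here is one possible minimal sketch that follows the same fit/transform pattern. The class name `SimpleMinMaxScaler` and the toy arrays are made up for this example, and it is not scikit-learn's implementation. It maps each feature to [0, 1] using the training minimum and maximum, and it assumes no feature is constant.

```python
import numpy as np


class SimpleMinMaxScaler:
    """Hypothetical min-max scaler; assumes every feature has max > min."""

    def __init__(self):
        self.data_min_ = None
        self.data_max_ = None

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        assert X.ndim == 2, "The dimension of X must be 2"
        # Per-feature minimum and maximum of the training data
        self.data_min_ = X.min(axis=0)
        self.data_max_ = X.max(axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        assert self.data_min_ is not None and self.data_max_ is not None, \
            "must fit before transform!"
        assert X.ndim == 2 and X.shape[1] == len(self.data_min_), \
            "X must be 2-D with the same number of features as in fit"
        return (X - self.data_min_) / (self.data_max_ - self.data_min_)


# Toy data for illustration only
X_tr = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 30.0]])
X_te = np.array([[4.0, 15.0]])

scaler = SimpleMinMaxScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)
```

Because the test sample is scaled with the training minimum and maximum, a value outside the training range (the 4.0 here) ends up outside [0, 1].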