## ScikitLearn

## Standardization, or mean removal and variance scaling

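Standardization of datasets is a common requirement for most machine learning estimators implemented in scikit-learn. They may behave badly if the individual features do not more or less look like standard normally distributed data (zero mean and unit variance).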

In practice we often ignore the shape of the distribution: we center each feature by removing its mean, then scale it by dividing non-constant features by their standard deviation.
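
As a minimal sketch of the two steps just described (centering, then dividing by the standard deviation), the computation can be written directly in NumPy. The array `X_demo` and the variable names are illustrative. The result is expected to match `preprocessing.scale`, which uses the population standard deviation (`ddof=0`).

```python
import numpy as np
from sklearn import preprocessing

# Illustrative data: three samples (rows), three features (columns).
X_demo = np.array([[1., -1.,  2.],
                   [2.,  0.,  0.],
                   [0.,  1., -1.]])

# Step 1: center each feature by subtracting its column mean.
centered = X_demo - X_demo.mean(axis=0)

# Step 2: divide each feature by its column standard deviation (ddof=0).
manual = centered / X_demo.std(axis=0)

# The manual result should agree with scikit-learn's helper.
print(np.allclose(manual, preprocessing.scale(X_demo)))
```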

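For instance, many elements used in the objective function of a learning algorithm (such as the RBF kernel of an SVM, or the l1 and l2 regularizers of linear models) assume that all features are centered around zero and have variance of the same order of magnitude. If one feature has a variance several orders of magnitude larger than the others, it can dominate the objective function and prevent the estimator from learning from the other features as expected.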

In [1]:
from sklearn import preprocessing
import numpy as np

In [2]:
X_train = np.array(
[
    [ 1., -1.,  2.],
    [ 2.,  0.,  0.],
    [ 0.,  1., -1.]
])

In [3]:
X_train

array([[ 1., -1.,  2.],
       [ 2.,  0.,  0.],
       [ 0.,  1., -1.]])

In [5]:
np.mean(X_train)

0.4444444444444444

In [6]:
np.std(X_train)

1.0657403385139377

In [7]:
X_scale = preprocessing.scale(X_train)

In [8]:
X_scale

array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])

In [9]:
np.mean(X_scale)

4.9343245538895844e-17

In [10]:
np.std(X_scale)

1.0

In [11]:
X_scale.mean(axis=0)

array([0., 0., 0.])

In [12]:
X_scale.std(axis=0)

array([1., 1., 1.])

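**Standardization rescales each feature to zero mean and unit variance.**

**Min-max scaling (often also called normalization) maps each feature onto the [0, 1] interval.**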

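The preprocessing module also provides the utility class StandardScaler, which implements the Transformer API: it computes the mean and standard deviation on a training set so that the same transformation can later be reapplied to the test set. This class is therefore suitable for use in the early steps of a sklearn.pipeline.Pipeline: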

In [13]:
from sklearn.preprocessing import StandardScaler

In [14]:
standard_scaler = StandardScaler()

In [15]:
standard_scaler

StandardScaler(copy=True, with_mean=True, with_std=True)

In [16]:
X_s_scaled = standard_scaler.fit_transform(X_train)

In [17]:
X_s_scaled == X_scale

array([[ True,  True,  True],
       [ True,  True,  True],
       [ True,  True,  True]])

In [18]:
standard_scaler.mean_

array([1.        , 0.        , 0.33333333])

In [19]:
standard_scaler.scale_

array([0.81649658, 0.81649658, 1.24721913])
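
The point of the Transformer API is that statistics learned on the training set are reused unchanged on new data. The sketch below illustrates this, first by hand and then inside a pipeline. The toy arrays, the labels, and the choice of `LogisticRegression` as the final step are assumptions made for the example only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative toy data; the labels are made up for the example.
X_tr = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.], [1., 1., 1.]])
y_tr = np.array([0, 1, 0, 1])
X_te = np.array([[-1., 1., 0.]])

# Fit the scaler on the training set only ...
scaler = StandardScaler().fit(X_tr)
# ... then reuse the stored mean_ and scale_ on unseen data.
X_te_scaled = scaler.transform(X_te)
print(np.allclose(X_te_scaled, (X_te - scaler.mean_) / scaler.scale_))

# Inside a pipeline, the scaler is fitted as the first step of fit().
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_tr, y_tr)
print(model.predict(X_te).shape)
```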

### Scaling features to a range

In [20]:
X_train

array([[ 1., -1.,  2.],
       [ 2.,  0.,  0.],
       [ 0.,  1., -1.]])

In [21]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import MaxAbsScaler

In [22]:
minmaxscaler = MinMaxScaler().fit(X_train)
minmaxscaler

MinMaxScaler(copy=True, feature_range=(0, 1))

In [23]:
X_minmax_scaled = minmaxscaler.transform(X_train)
X_minmax_scaled

array([[0.5       , 0.        , 1.        ],
       [1.        , 0.5       , 0.33333333],
       [0.        , 1.        , 0.        ]])

In [24]:
minmaxscaler.min_

array([0.        , 0.5       , 0.33333333])

In [25]:
minmaxscaler.scale_

array([0.5       , 0.5       , 0.33333333])

In [26]:
np.max(X_minmax_scaled)

1.0

In [27]:
np.min(X_minmax_scaled)

0.0

In [28]:
max_scaler = MaxAbsScaler().fit(X_train)

In [29]:
max_scaler.max_abs_

array([2., 1., 2.])

In [30]:
max_scaler.scale_

array([2., 1., 2.])

In [31]:
X_max_abs_scaled=max_scaler.transform(X_train)

In [32]:
np.max(X_max_abs_scaled)

1.0

In [33]:
np.min(X_max_abs_scaled)

-1.0
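
The fitted attributes shown above can be read as the parameters of a simple linear map. As a hedged sketch (the array `X_demo` repeats the training data and the variable names are illustrative), the transforms can be reproduced by hand:

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler

X_demo = np.array([[1., -1.,  2.],
                   [2.,  0.,  0.],
                   [0.,  1., -1.]])

mm = MinMaxScaler().fit(X_demo)
# scale_ is 1 / (column max - column min) for the default range (0, 1);
# min_ is the offset that moves the column minimum to 0.
manual_minmax = X_demo * mm.scale_ + mm.min_
print(np.allclose(manual_minmax, mm.transform(X_demo)))

ma = MaxAbsScaler().fit(X_demo)
# MaxAbsScaler only divides by the largest absolute value per column,
# so zeros stay zero and the result lies in [-1, 1].
manual_maxabs = X_demo / ma.max_abs_
print(np.allclose(manual_maxabs, ma.transform(X_demo)))
```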

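### Scaling sparse data

Centering sparse data would destroy the sparseness structure in the data, and is therefore rarely a sensible thing to do. Scaling sparse inputs can still make sense, however, especially when the features are on different scales.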

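MaxAbsScaler and maxabs_scale were designed specifically for scaling sparse data and are the recommended way to do it. However, scale and StandardScaler can also accept scipy.sparse matrices as input, as long as with_mean=False is passed explicitly to the constructor. Otherwise a ValueError is raised, because silently centering would break the sparsity and would often crash the execution by unintentionally allocating excessive amounts of memory. RobustScaler cannot be fitted to sparse inputs, but you can use its transform method on sparse inputs.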
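
A minimal sketch of this behaviour follows, using a small hand-made `scipy.sparse` matrix. The matrix `X_sparse` and the variable names are illustrative.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler, StandardScaler

# Illustrative sparse matrix (CSR format) with mostly zero entries.
X_sparse = sparse.csr_matrix(np.array([[0., 4., 0.],
                                       [2., 0., 0.],
                                       [0., 0., -8.]]))

# MaxAbsScaler keeps the input sparse: zeros are left untouched.
X_maxabs = MaxAbsScaler().fit_transform(X_sparse)
print(sparse.issparse(X_maxabs))
print(X_maxabs.toarray())

# StandardScaler works on sparse input only without centering.
X_std = StandardScaler(with_mean=False).fit_transform(X_sparse)
print(sparse.issparse(X_std))

# With the default with_mean=True, fitting is expected to raise ValueError.
try:
    StandardScaler().fit(X_sparse)
    raised = False
except ValueError:
    raised = True
print(raised)
```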

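### Scaling data with outliers

If your data contains many outliers, scaling with the mean and variance of the data is unlikely to work well. In these cases you can use robust_scale and RobustScaler as replacements. They use more robust estimates for the center and range of the data.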
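
To make the difference concrete, here is a small sketch on a made-up feature containing one extreme value. With its default settings, `RobustScaler` centers on the median and scales by the interquartile range, so the outlier barely affects the fitted parameters.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Illustrative single feature with one extreme outlier (100).
X_out = np.array([[1.], [2.], [3.], [4.], [100.]])

robust = RobustScaler().fit(X_out)
# By default RobustScaler centers on the median and scales by the
# interquartile range (75th minus 25th percentile).
print(robust.center_, robust.scale_)

standard = StandardScaler().fit(X_out)
# The mean and standard deviation are pulled towards the outlier.
print(standard.mean_, standard.scale_)
```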

### Scaling vs Whitening

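It is sometimes not enough to center and scale the features independently, because a downstream model may make further assumptions about the linear independence of the features.

To address this, you can use sklearn.decomposition.PCA with whiten=True to further remove the linear correlation across features.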
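
As a hedged illustration, the sketch below whitens two deliberately correlated synthetic features and checks that the sample covariance of the result is close to the identity matrix. The data and the names `X_corr` and `Z` are assumptions for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Illustrative data: the second feature is strongly correlated with the first.
a = rng.normal(size=200)
X_corr = np.column_stack([a, 2.0 * a + 0.5 * rng.normal(size=200)])

Z = PCA(whiten=True).fit_transform(X_corr)

# After whitening, the sample covariance should be close to the identity:
# unit variance per component and no linear correlation between components.
cov = np.cov(Z, rowvar=False)
print(np.allclose(cov, np.eye(2), atol=1e-6))
```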

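## Non-linear transformation

Like the scalers, QuantileTransformer puts each feature into the same range or distribution. However, because it performs a rank transformation, it smooths out unusual distributions and is less influenced by outliers than scaling methods. It does, however, distort correlations and distances within and across features.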

### Mapping to a uniform distribution

In [34]:
from sklearn.datasets import load_iris

In [35]:
from sklearn.model_selection import train_test_split

In [36]:
Iris = load_iris()

In [37]:
X = Iris.data
y = Iris.target

In [38]:
X_train,X_test,y_train,y_test =  train_test_split(X,y)

In [39]:
quantile_transformer  = preprocessing.QuantileTransformer().fit(X)

In [40]:
X_train_quantile = quantile_transformer.transform(X_train)

In [41]:
X_train[:10,:]

array([[6.1, 2.9, 4.7, 1.4],
       [6.8, 3.2, 5.9, 2.3],
       [5.6, 2.9, 3.6, 1.3],
       [7.9, 3.8, 6.4, 2. ],
       [7.7, 2.8, 6.7, 2. ],
       [5.5, 2.3, 4. , 1.3],
       [5.1, 3.5, 1.4, 0.2],
       [5.5, 2.5, 4. , 1.3],
       [6.5, 3. , 5.8, 2.2],
       [6.3, 3.3, 4.7, 1.6]])

In [42]:
X_train_quantile[:10,:]

array([[0.61409396, 0.34563758, 0.61744966, 0.54697987],
       [0.87919463, 0.67114094, 0.92281879, 0.93624161],
       [0.41275168, 0.34563758, 0.36912752, 0.47651007],
       [1.        , 0.94295302, 0.97315436, 0.82885906],
       [0.97986577, 0.26845638, 0.98657718, 0.82885906],
       [0.36912752, 0.03691275, 0.4261745 , 0.47651007],
       [0.24161074, 0.8557047 , 0.11409396, 0.12751678],
       [0.36912752, 0.09731544, 0.4261745 , 0.47651007],
       [0.78187919, 0.46644295, 0.90604027, 0.89932886],
       [0.69463087, 0.73489933, 0.61744966, 0.67114094]])

In [43]:
np.percentile(X_train[:,0],[25,50,75,100])

array([5.1, 5.8, 6.4, 7.9])

In [44]:
np.percentile(X_train_quantile[:,0],[25,50,75,100])

array([0.24161074, 0.51006711, 0.74496644, 1.        ])
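
The cells above fit the transformer on the full array `X`. A variant that learns the quantiles from the training split only, and sets `n_quantiles` explicitly so it does not exceed the number of fitted samples, might look like the sketch below. The variable names and `random_state=0` are choices made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import QuantileTransformer

X_all, y_all = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, random_state=0)

# n_quantiles cannot usefully exceed the number of fitted samples,
# so it is set explicitly here instead of relying on the default.
qt = QuantileTransformer(n_quantiles=X_tr.shape[0], random_state=0)

# Learn the quantiles from the training split only ...
X_tr_q = qt.fit_transform(X_tr)
# ... and reuse them on the held-out split.
X_te_q = qt.transform(X_te)

print(X_tr_q.min(), X_tr_q.max())
```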

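### Mapping to a Gaussian distribution

In many modeling scenarios, normality of the features in a dataset is desirable. Power transforms, implemented by PowerTransformer, are a family of parametric, monotonic transformations that aim to map data from any distribution to as close to a Gaussian distribution as possible, in order to stabilize variance and minimize skewness.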

In [45]:
pt = preprocessing.PowerTransformer(method='box-cox').fit(X_train)

In [46]:
pt

PowerTransformer(copy=True, method='box-cox', standardize=True)

In [47]:
X_train_power = pt.transform(X_train)

In [48]:
X_train_power[:,0]

array([ 0.4402802 ,  1.19381818, -0.17643512,  2.18148326,  2.01674981,
       -0.30908469, -0.87548541, -0.30908469,  0.88481467,  0.66748855,
       -0.87548541, -1.67770954,  0.55516737, -1.67770954, -1.50776105,
       -0.17643512,  0.07917161, -0.04706179,  0.9899981 ,  1.19381818,
        0.07917161, -0.87548541, -1.18258664,  0.20239348, -0.72818616,
        0.55516737,  0.77734045,  0.88481467,  1.19381818, -1.18258664,
        0.4402802 , -1.34280206, -1.34280206, -0.30908469,  2.01674981,
       -1.34280206, -0.72818616,  0.9899981 , -2.2202397 , -0.04706179,
        0.66748855, -0.87548541, -0.72818616,  1.38941211,  0.4402802 ,
        0.32272491, -1.34280206,  0.77734045, -0.30908469,  0.77734045,
       -0.04706179, -0.87548541,  0.4402802 , -1.02688567,  1.57733306,
        0.66748855,  0.77734045,  1.09297316, -1.02688567,  0.07917161,
        0.32272491, -0.44515534,  0.55516737,  1.29260756, -1.02688567,
       -0.87548541, -1.02688567,  1.75808294,  0.32272491,  0.77
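
To see the effect on skewness more directly, here is a small sketch on synthetic log-normal data rather than on the iris features. The data, the names, and the use of `scipy.stats.skew` as the skewness measure are assumptions for the example. Note that Box-Cox requires strictly positive inputs.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.RandomState(0)
# Illustrative right-skewed, strictly positive data (log-normal draws).
X_skewed = rng.lognormal(size=(1000, 1))

# Box-Cox requires strictly positive inputs; 'yeo-johnson' also accepts
# zero and negative values.
pt_demo = PowerTransformer(method='box-cox')
X_gauss = pt_demo.fit_transform(X_skewed)

# Skewness should move much closer to 0 after the transform.
print(skew(X_skewed[:, 0]), skew(X_gauss[:, 0]))
# standardize=True (the default) also gives zero mean and unit variance.
print(X_gauss.mean(), X_gauss.std())
```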

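## Normalization

Normalization is the process of scaling individual samples to have unit norm. It is useful if you plan to use a quadratic form, such as the dot product or any other kernel, to quantify the similarity of any pair of samples.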

In [49]:
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]

In [50]:
X_norm = preprocessing.normalize(X,norm='l2')

In [51]:
X_norm

array([[ 0.40824829, -0.40824829,  0.81649658],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.        ,  0.70710678, -0.70710678]])

In [52]:
normalizer = preprocessing.Normalizer().fit(X)
normalizer

Normalizer(copy=True, norm='l2')

In [53]:
normalizer.transform(X)

array([[ 0.40824829, -0.40824829,  0.81649658],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.        ,  0.70710678, -0.70710678]])
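
As a minimal sketch of what l2 normalization computes, each row can be divided by its own Euclidean length. The array `X_demo` repeats the data above and the variable names are illustrative.

```python
import numpy as np
from sklearn import preprocessing

X_demo = np.array([[1., -1.,  2.],
                   [2.,  0.,  0.],
                   [0.,  1., -1.]])

# Euclidean (l2) length of each sample, kept as a column for broadcasting.
row_norms = np.linalg.norm(X_demo, axis=1, keepdims=True)
manual = X_demo / row_norms

print(np.allclose(manual, preprocessing.normalize(X_demo, norm='l2')))
# For unit-norm rows the dot product equals the cosine similarity.
print(manual @ manual.T)
```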

## Encoding categorical features

In [54]:
enc = preprocessing.OrdinalEncoder()

In [55]:
enc

OrdinalEncoder(categories='auto', dtype=<class 'numpy.float64'>)

In [56]:
X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
enc.fit(X)
enc

OrdinalEncoder(categories='auto', dtype=<class 'numpy.float64'>)

In [57]:
enc.categories_

[array(['female', 'male'], dtype=object),
 array(['from Europe', 'from US'], dtype=object),
 array(['uses Firefox', 'uses Safari'], dtype=object)]

In [58]:
enc.transform([['female', 'from US', 'uses Safari']])

array([[0., 1., 1.]])

In [59]:
test_enc = [['female', 'from US', 'uses Safari']]

In [60]:
enc = preprocessing.OneHotEncoder()
enc.fit(X)

OneHotEncoder(categorical_features=None, categories=None, drop=None,
              dtype=<class 'numpy.float64'>, handle_unknown='error',
              n_values=None, sparse=True)

In [61]:
test_enc_trans = enc.transform(test_enc)

In [62]:
print(test_enc_trans)

  (0, 0)	1.0
  (0, 3)	1.0
  (0, 5)	1.0


In [63]:
test_enc_trans.toarray()

array([[1., 0., 0., 1., 0., 1.]])

In [64]:
enc = preprocessing.OneHotEncoder(handle_unknown='ignore')
enc.fit(X)

OneHotEncoder(categorical_features=None, categories=None, drop=None,
              dtype=<class 'numpy.float64'>, handle_unknown='ignore',
              n_values=None, sparse=True)

In [65]:
test_enc2 = [['female', 'from Asia', 'uses Chrome']]

In [66]:
enc.transform(test_enc2).toarray()

array([[1., 0., 0., 0., 0., 0.]])

In [67]:
enc = preprocessing.OneHotEncoder(drop='first').fit(X)
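
The last cell produces no output, so the effect of `drop='first'` is easy to miss. As a hedged sketch (the names `X_cat`, `enc_drop`, and `encoded` are illustrative), the encoder drops the first category of every feature, which avoids perfectly collinear columns:

```python
from sklearn.preprocessing import OneHotEncoder

X_cat = [['male', 'from US', 'uses Safari'],
         ['female', 'from Europe', 'uses Firefox']]

# drop='first' removes the first (alphabetically sorted) category
# of each feature, leaving one column per remaining category.
enc_drop = OneHotEncoder(drop='first').fit(X_cat)

encoded = enc_drop.transform([['female', 'from US', 'uses Safari']]).toarray()
print(encoded)
# → [[0. 1. 1.]]
```

Here `female` is the dropped category of the first feature, so it is encoded as 0, while `from US` and `uses Safari` are the remaining categories of their features and are encoded as 1.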

## Discretization

In [68]:
X = np.array([[ -3., 5., 15 ],
              [  0., 6., 14 ],
              [  6., 3., 11 ]])

In [69]:
kbinsDiscretizer = preprocessing.KBinsDiscretizer(n_bins=[3,2,2],encode='ordinal')

In [70]:
kbinsDiscretizer.fit(X)

KBinsDiscretizer(encode='ordinal', n_bins=[3, 2, 2], strategy='quantile')

In [71]:
kbinsDiscretizer.transform(X)

array([[0., 1., 1.],
       [1., 1., 1.],
       [2., 0., 0.]])

**Discretization strategies: uniform, quantile, kmeans**

KBinsDiscretizer implements different binning strategies, which can be selected with the strategy parameter. The 'uniform' strategy uses constant-width bins. The 'quantile' strategy uses quantile values so that the bins in each feature are equally populated. The 'kmeans' strategy defines bins based on a k-means clustering procedure performed on each feature independently.
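
The three strategies can be compared by inspecting the fitted `bin_edges_`. The sketch below uses a made-up, unevenly spread feature; the data and names are illustrative, and recent scikit-learn versions may emit a warning about the default quantile method.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Illustrative single feature: a dense group, a small group, one large value.
values = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 40, 41, 90]
X_feat = np.array(values, dtype=float).reshape(-1, 1)

edges = {}
for strategy in ('uniform', 'quantile', 'kmeans'):
    disc = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy=strategy)
    disc.fit(X_feat)
    # bin_edges_ holds one array of edges per feature.
    edges[strategy] = disc.bin_edges_[0]
    print(strategy, edges[strategy])
```

With `'uniform'` the edges are evenly spaced between the minimum and maximum (0, 30, 60, 90 here), whereas the other two strategies adapt the edges to where the data actually lies.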



### Feature binarization

In [72]:
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]

In [73]:
binarizer = preprocessing.Binarizer()
binarizer

Binarizer(copy=True, threshold=0.0)

In [74]:
binarizer.fit(X)

Binarizer(copy=True, threshold=0.0)

In [75]:
binarizer.transform(X)

array([[1., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.]])

## Generating polynomial features

In [76]:
X = np.arange(6).reshape(3,2)

In [77]:
X

array([[0, 1],
       [2, 3],
       [4, 5]])

In [78]:
poly_gen = preprocessing.PolynomialFeatures()
poly_gen.fit(X)

PolynomialFeatures(degree=2, include_bias=True, interaction_only=False,
                   order='C')

In [79]:
poly_gen.transform(X)

array([[ 1.,  0.,  1.,  0.,  0.,  1.],
       [ 1.,  2.,  3.,  4.,  6.,  9.],
       [ 1.,  4.,  5., 16., 20., 25.]])

In [80]:
X = np.arange(9).reshape(3,3)

In [81]:
poly_gen = preprocessing.PolynomialFeatures(interaction_only=True)
poly_gen.fit(X)

PolynomialFeatures(degree=2, include_bias=True, interaction_only=True,
                   order='C')

In [82]:
poly_gen.transform(X)

array([[ 1.,  0.,  1.,  2.,  0.,  0.,  2.],
       [ 1.,  3.,  4.,  5., 12., 15., 20.],
       [ 1.,  6.,  7.,  8., 42., 48., 56.]])
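
To make the column order explicit, the degree-2 expansion of two features `(a, b)` can be rebuilt by hand as `1, a, b, a^2, a*b, b^2`. This is a minimal sketch with illustrative variable names:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_demo = np.arange(6).reshape(3, 2)
a, b = X_demo[:, 0], X_demo[:, 1]

# Degree-2 expansion of two features (a, b): 1, a, b, a^2, a*b, b^2.
manual = np.column_stack([np.ones(len(a)), a, b, a**2, a * b, b**2])
print(np.allclose(manual, PolynomialFeatures(degree=2).fit_transform(X_demo)))

# With interaction_only=True the pure powers a^2 and b^2 are dropped.
manual_inter = np.column_stack([np.ones(len(a)), a, b, a * b])
inter = PolynomialFeatures(degree=2, interaction_only=True).fit_transform(X_demo)
print(np.allclose(manual_inter, inter))
```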

## Custom transformers

In [84]:
from sklearn.preprocessing import FunctionTransformer
transformer = FunctionTransformer(np.log1p, validate=True)
X = np.array([[0, 1], [2, 3]])
transformer.transform(X)

array([[0.        , 0.69314718],
       [1.09861229, 1.38629436]])
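
A transformer built this way can also carry an inverse. As a hedged sketch, pairing `np.log1p` with `np.expm1` through the `inverse_func` parameter lets transformed values be mapped back to the original scale. The variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Pair the forward function with its inverse so results can be mapped back.
log_tf = FunctionTransformer(np.log1p, inverse_func=np.expm1, validate=True)

X_demo = np.array([[0., 1.], [2., 3.]])
X_log = log_tf.fit_transform(X_demo)
X_back = log_tf.inverse_transform(X_log)

print(np.allclose(X_back, X_demo))
```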