**Feature engineering operations in scikit-learn**
- @Author: Rui Zhu
- @Date: 2024-07-02
- @Follow: [6. Dataset transformations](https://scikit-learn.org/stable/data_transforms.html#dataset-transformations)

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## 6.1 Feature Scaling
* follow: https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-data

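### 6.1.1 Standardization
* Rescales each feature to zero mean and unit variance
* $X_{\rm scaled} = \frac{X - \mu}{\sigma}$, where $\mu$ and $\sigma$ are the per-feature mean and standard deviation of the training data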

In [16]:
from sklearn.preprocessing import StandardScaler

X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.], 
                    [ 1.,  0.,  1.]])

scaler = StandardScaler().fit(X_train)
X_scaled = scaler.transform(X_train)

print(X_scaled)
print(f"mean: {X_scaled.mean(axis=0)}")
print(f"std: {X_scaled.std(axis=0)}")

[[ 0.         -1.41421356  1.34164079]
 [ 1.41421356  0.         -0.4472136 ]
 [-1.41421356  1.41421356 -1.34164079]
 [ 0.          0.          0.4472136 ]]
mean: [0.00000000e+00 0.00000000e+00 1.38777878e-17]
std: [1. 1. 1.]


In [13]:
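"""
The operation above is equivalent to the following manual computation.
np.std uses ddof=0 by default, which matches StandardScaler.
"""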
x_mean = np.mean(X_train, axis=0)
x_std = np.std(X_train, axis=0)
X_scaled_manual = (X_train - x_mean) / x_std
X_scaled_manual

array([[ 0.        , -1.41421356,  1.34164079],
       [ 1.41421356,  0.        , -0.4472136 ],
       [-1.41421356,  1.41421356, -1.34164079],
       [ 0.        ,  0.        ,  0.4472136 ]])
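In practice the scaler is usually fitted on the training set and then reused on unseen data. The following is a minimal sketch of that pattern; `X_new` is a hypothetical sample made up for illustration, while `mean_` and `scale_` are the attributes where a fitted `StandardScaler` stores its learned statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.],
                    [1., 0., 1.]])
# Hypothetical unseen sample (not from the original notebook)
X_new = np.array([[1., 1., 0.]])

scaler = StandardScaler().fit(X_train)

# Statistics learned from the training data
print(scaler.mean_)   # per-feature mean
print(scaler.scale_)  # per-feature standard deviation (ddof=0)

# New data is scaled with the training statistics, not its own
X_new_scaled = scaler.transform(X_new)
X_new_manual = (X_new - scaler.mean_) / scaler.scale_
print(np.allclose(X_new_scaled, X_new_manual))
```

Reusing the training statistics keeps the training and test features on the same scale and avoids leaking information from the test set into preprocessing.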

### 6.1.2 Min-max scaling

* By default, each feature is scaled to the range [0, 1]

In [17]:
from sklearn.preprocessing import MinMaxScaler
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])
scaler = MinMaxScaler()
X_train_minmax = scaler.fit_transform(X_train)
X_train_minmax

array([[0.5       , 0.        , 1.        ],
       [1.        , 0.5       , 0.33333333],
       [0.        , 1.        , 0.        ]])

In [18]:
"""
The operation above is equivalent to the following manual computation
"""
x_min = np.min(X_train, axis=0)
x_max = np.max(X_train, axis=0)
X_train_minmax_manual = (X_train - x_min) / (x_max - x_min)
X_train_minmax_manual

array([[0.5       , 0.        , 1.        ],
       [1.        , 0.5       , 0.33333333],
       [0.        , 1.        , 0.        ]])

* Scaling to the range [-1, 1]

In [20]:
scaler = MinMaxScaler(feature_range=(-1, 1))
X_train_minmax = scaler.fit_transform(X_train)
X_train_minmax

array([[ 0.        , -1.        ,  1.        ],
       [ 1.        ,  0.        , -0.33333333],
       [-1.        ,  1.        , -1.        ]])

In [21]:
"""
The operation above is equivalent to the following manual computation
"""
scale_min, scale_max = -1, 1

x_min = np.min(X_train, axis=0)
x_max = np.max(X_train, axis=0)
X_train_minmax_manual = (X_train - x_min) / (x_max - x_min)
X_train_minmax_manual = X_train_minmax_manual * (scale_max - scale_min) + scale_min
X_train_minmax_manual

array([[ 0.        , -1.        ,  1.        ],
       [ 1.        ,  0.        , -0.33333333],
       [-1.        ,  1.        , -1.        ]])
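The target range only holds for the data the scaler was fitted on. As a rough sketch, assuming a hypothetical new sample `X_new` whose first feature exceeds the training maximum, the transformed value falls outside [-1, 1], and `inverse_transform` recovers the original values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
# Hypothetical unseen sample; the first feature (3.0) is above the training max (2.0)
X_new = np.array([[3., 0., 0.5]])

scaler = MinMaxScaler(feature_range=(-1, 1)).fit(X_train)

# Per-feature minimum and maximum seen during fit
print(scaler.data_min_, scaler.data_max_)

# Values outside the training range map outside the target range
X_new_scaled = scaler.transform(X_new)
print(X_new_scaled)

# inverse_transform undoes the scaling
X_back = scaler.inverse_transform(X_new_scaled)
print(np.allclose(X_back, X_new))
```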

### 6.1.3 Max-abs scaling
* Divides each feature by its maximum absolute value, so the scaled values lie in [-1, 1]

In [22]:
from sklearn.preprocessing import MaxAbsScaler

scaler = MaxAbsScaler()
X_train_maxabs = scaler.fit_transform(X_train)
X_train_maxabs

array([[ 0.5, -1. ,  1. ],
       [ 1. ,  0. ,  0. ],
       [ 0. ,  1. , -0.5]])

In [23]:
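"""
The MaxAbsScaler operation is equivalent to the following manual computation
"""
# Per-feature maximum of the absolute values
x_max_abs = np.max(np.abs(X_train), axis=0)

X_train_maxabs_manual = X_train / x_max_abs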
X_train_maxabs_manual

array([[ 0.5, -1. ,  1. ],
       [ 1. ,  0. ,  0. ],
       [ 0. ,  1. , -0.5]])
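Because max-abs scaling only divides and never shifts the data, zeros stay zeros, which makes it a natural fit for sparse input. The sketch below assumes SciPy is available and reuses the same values as above in a CSR matrix; `max_abs_` is the attribute holding the fitted per-feature maxima.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler

# Same values as X_train above, stored as a sparse CSR matrix
X_sparse = sparse.csr_matrix(np.array([[1., -1., 2.],
                                       [2., 0., 0.],
                                       [0., 1., -1.]]))

scaler = MaxAbsScaler().fit(X_sparse)
print(scaler.max_abs_)  # per-feature maximum absolute value

X_scaled = scaler.transform(X_sparse)

# The output stays sparse and keeps the same number of stored non-zeros
print(sparse.issparse(X_scaled))
print(X_scaled.nnz == X_sparse.nnz)
print(X_scaled.toarray())
```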