## 6.3 Preprocessing data
The `sklearn.preprocessing` package provides utilities that change **raw feature vectors** into a representation more suitable for the downstream estimators.

In general, many learning algorithms, such as linear models, benefit from standardization of the data set.

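> Data transformations

- Normalization: min-max or max-abs scaling. Keeps values small, which makes optimization easier.
- Standardization: z-score scaling to zero mean and unit variance. Approximately normal data becomes approximately standard normal.
- Log transform: reduces the influence of outliers and can bring right-skewed data closer to normal.
- Non-linear transforms: make the data easier for a linear model to fit, e.g. Box-Cox.
- Binarization: map values to 0 or 1.
- Orthogonalization: e.g. PCA, which also reduces dimensionality.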

### 6.3.1. Standardization, or mean removal and variance scaling
Standardization of datasets is a **common requirement** for many machine learning estimators implemented in scikit-learn; they might **behave badly** if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.

In [4]:
from sklearn import preprocessing
import numpy as np
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])
scaler = preprocessing.StandardScaler().fit(X_train)

In [5]:
scaler.mean_ # mean by column

array([1.        , 0.        , 0.33333333])

In [7]:
scaler.scale_ # standard deviation by column

array([0.81649658, 0.81649658, 1.24721913])

In [9]:
X_scaled = scaler.transform(X_train)
X_scaled

array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])

The scaled data has zero mean and unit variance:

In [11]:
X_scaled.mean(axis=0) # mean by column

array([0., 0., 0.])

In [13]:
X_scaled.std(axis=0) # std by column

array([1., 1., 1.])
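
As a cross-check, the standardization above can be reproduced by hand. This is a minimal sketch, assuming `StandardScaler` uses the population standard deviation (`ddof=0`), which agrees with the `scale_` values printed earlier; the names `X_manual` and `X_sklearn` are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1., -1.,  2.],
                    [2.,  0.,  0.],
                    [0.,  1., -1.]])

# z-score by hand: subtract the column mean, divide by the column std (ddof=0)
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
X_manual = (X_train - mean) / std

# Compare with the scaler fitted on the same data
X_sklearn = StandardScaler().fit_transform(X_train)
print(np.allclose(X_manual, X_sklearn))
```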

`StandardScaler` can **re-apply** the same transformation to the test set: `transform` uses the **same mean and std** learned from the training set instead of recomputing them on the test data.

In [3]:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train) 

pipe.score(X_test, y_test) # re-applying scaling on test data

0.96

#### 6.3.1.1. Scaling features to a range
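`MinMaxScaler` scales each feature to a given range, $[0,1]$ by default; `MaxAbsScaler` scales each feature so that its maximum absolute value is 1, which puts the training data in $[-1,1]$.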

In [10]:
import numpy as np
from sklearn import preprocessing
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
X_train_minmax

array([[0.5       , 0.        , 1.        ],
       [1.        , 0.5       , 0.33333333],
       [0.        , 1.        , 0.        ]])

In [12]:
print(min_max_scaler.data_min_, min_max_scaler.data_max_)

[ 0. -1. -1.] [2. 1. 2.]


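The scaler can then be applied to test data, using the **same min/max** learned from the training set. Test values that fall outside the training range are therefore mapped outside $[0,1]$.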

In [13]:
X_test = np.array([[-3,-1,4]])
X_test_minmax=min_max_scaler.transform(X_test)
X_test_minmax

array([[-1.5       ,  0.        ,  1.66666667]])
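
The out-of-range result follows directly from the min-max formula, `(x - min) / (max - min)`, evaluated with the training statistics. The sketch below recomputes it by hand and also shows the `clip` option, assuming a scikit-learn version in which `MinMaxScaler` accepts `clip=True`; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1., -1.,  2.],
                    [2.,  0.,  0.],
                    [0.,  1., -1.]])
X_test = np.array([[-3., -1., 4.]])

# Min-max formula using statistics learned from the training data only
data_min = X_train.min(axis=0)
data_max = X_train.max(axis=0)
X_test_manual = (X_test - data_min) / (data_max - data_min)

# clip=True keeps transformed values inside feature_range ([0, 1] by default)
clipped_scaler = MinMaxScaler(clip=True).fit(X_train)
X_test_clipped = clipped_scaler.transform(X_test)

print(X_test_manual)
print(X_test_clipped)
```

Clipping hides how far a test value lies from the training range, so whether to use it depends on the downstream model.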

`MaxAbsScaler` divides each feature by its maximum absolute value, $x / \max|x|$, so the training data lies in $[-1,1]$.

In [16]:
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])
max_abs_scaler = preprocessing.MaxAbsScaler()
X_train_maxabs = max_abs_scaler.fit_transform(X_train)
X_train_maxabs

array([[ 0.5, -1. ,  1. ],
       [ 1. ,  0. ,  0. ],
       [ 0. ,  1. , -0.5]])

In [17]:
max_abs_scaler.max_abs_

array([2., 1., 2.])

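In [1]:
import numpy as np  # the kernel was restarted, so NumPy must be imported again
# max_abs_scaler is the scaler fitted in the cell above
X_test = np.array([[-3,-1,4]])
X_test_maxabs = max_abs_scaler.transform(X_test)
X_test_maxabs

array([[-1.5, -1. ,  2. ]])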

In [None]:
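With `output_distribution='normal'`, `QuantileTransformer` does produce normally distributed output; a quantile transform is not limited to a uniform output. The notes below explain how this works and why the transform does not simply stop at the uniform distribution.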

---

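### Basic principle of the quantile transform
The core idea is to use the **quantiles** of the data to map the original values onto a target distribution through a monotonic transformation. The process has two steps:
1. **Map to a uniform distribution**  
   - Compute the quantile (rank) of each data point and map it to a value between 0 and 1.
   - For example, `[10, 20, 30, 100, 1000]` is mapped to `[0.0, 0.25, 0.5, 0.75, 1.0]`.
   - With `output_distribution='uniform'` the transform stops here, and the result is uniformly distributed.

2. **Map from the uniform distribution to the target distribution**  
   - With `output_distribution='normal'` there is one more step: the uniform values are passed through the inverse cumulative distribution function of the normal distribution (inverse CDF, also called the quantile function or PPF, percent point function).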

---

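### How is the normal distribution obtained?
To produce standard normal output (mean 0, standard deviation 1), `QuantileTransformer` proceeds as follows:

#### Step by step:
1. **Compute the quantiles of the original data (uniform distribution)**  
   - Sort the input and compute the quantile of each value (a number between 0 and 1).
   - For example, `[10, 20, 30, 100, 1000]` becomes `[0.0, 0.25, 0.5, 0.75, 1.0]`.

2. **Apply the inverse CDF of the standard normal distribution**  
   - The inverse CDF takes a probability between 0 and 1 (a uniform value) and returns the corresponding normal quantile.
   - For example:
     - Uniform 0.5 → normal 0 (the median of the normal distribution).
     - Uniform 0.975 → normal about 1.96 (the 97.5% quantile).
     - Uniform 0.025 → normal about -1.96.
   - For `[0.0, 0.25, 0.5, 0.75, 1.0]`:
     - 0.0 (the minimum) → negative infinity in theory; in practice it is clipped to a finite value (about -5.2 in scikit-learn).
     - 0.25 → about -0.67 (the 25% quantile).
     - 0.5 → 0.0 (the 50% quantile).
     - 0.75 → about 0.67 (the 75% quantile).
     - 1.0 (the maximum) → positive infinity in theory; in practice it is clipped to a finite value (about 5.2 in scikit-learn).

3. **Result**  
   - The transformed values are roughly `[-5.2, -0.67, 0.0, 0.67, 5.2]` (the exact values depend on the number of quantiles and on the interpolation).
   - With a realistically large sample, the transformed values approximately follow a standard normal distribution: mean ≈ 0, standard deviation ≈ 1.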

---

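### Mathematical basis
- **From uniform to normal**  
  - If $U$ is uniformly distributed on $[0, 1]$, then $Z = \Phi^{-1}(U)$ is a standard normal random variable, where $\Phi^{-1}$ is the inverse CDF of the standard normal distribution.
  - `QuantileTransformer` uses this fact: it first maps the data to a uniform variable $U$ and then applies $\Phi^{-1}$ to obtain a normal distribution.

- **Monotonicity is preserved**  
  - Because the inverse CDF is monotonically increasing, the whole transformation keeps the relative order (ranks) of the data.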
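
The two-step mapping (ranks to uniform values, then uniform values through the normal inverse CDF) can be sketched by hand with NumPy and the standard library. This is an illustration under stated assumptions, not scikit-learn's implementation: the rank-based empirical CDF and the clipping constant `eps` are choices made here so that the extremes stay finite.

```python
from statistics import NormalDist

import numpy as np

data = np.array([10, 20, 30, 100, 1000], dtype=float)

# Step 1: rank each value and rescale the ranks to [0, 1] (empirical CDF)
ranks = data.argsort().argsort()
u = ranks / (len(data) - 1)

# Step 2: apply the standard normal inverse CDF.
# The inverse CDF is infinite at 0 and 1, so clip slightly inside (0, 1);
# eps is an assumption made for this sketch.
eps = 1e-7
u_clipped = np.clip(u, eps, 1 - eps)
inv_cdf = NormalDist().inv_cdf
z = np.array([inv_cdf(p) for p in u_clipped.tolist()])

print(u)
print(z)
```

The interior points land on the familiar normal quantiles (about ±0.67 for 25% and 75%), while the two extremes are determined entirely by the clipping constant.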

---

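### Why is the output not always uniform?
- With `output_distribution='uniform'`, the transform stops after the first step and the output is uniform (0 to 1).
- With `output_distribution='normal'`, the second step is added: the normal inverse CDF "stretches" the uniform values into the shape of a normal distribution.
- **Difference**:
  - Uniform: values are spread evenly over a fixed interval (0 to 1), with no bell curve.
  - Normal: values concentrate around the mean (0) and thin out toward both tails, giving the typical bell shape.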

---

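### An example
Original data: `[10, 20, 30, 100, 1000]`
1. **Uniform distribution (step 1)**: `[0.0, 0.25, 0.5, 0.75, 1.0]`  
   - With `output_distribution='uniform'`, this is the result.
2. **Normal distribution (step 2)**:  
   - 0.0 → about -5.2 (the 0% quantile is negative infinity, so it is clipped to a finite value).
   - 0.25 → about -0.67.
   - 0.5 → 0.0.
   - 0.75 → about 0.67.
   - 1.0 → about 5.2 (the 100% quantile is positive infinity, so it is clipped to a finite value).
   - Result: `[-5.2, -0.67, 0.0, 0.67, 5.2]`.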

---

### Checking with code
```python
from sklearn.preprocessing import QuantileTransformer
import numpy as np

data = np.array([10, 20, 30, 100, 1000]).reshape(-1, 1)

# Uniform output; n_quantiles is set to the sample count to avoid a warning
qt_uniform = QuantileTransformer(n_quantiles=5, output_distribution='uniform')
uniform_data = qt_uniform.fit_transform(data)
print("Uniform:", uniform_data.flatten())

# Normal output: the uniform values are passed through the normal inverse CDF
qt_normal = QuantileTransformer(n_quantiles=5, output_distribution='normal')
normal_data = qt_normal.fit_transform(data)
print("Normal:", normal_data.flatten())
```

**Output**:
```
Uniform: [0.   0.25 0.5  0.75 1.  ]
Normal: [-5.19933758 -0.67448975  0.          0.67448975  5.19933758]
```

---

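### When normal output is useful
- Many machine learning models (e.g. linear regression, logistic regression) tend to work better when the input features are roughly normally distributed.
- The steps above reshape skewed data (such as the right-skewed `[10, 20, 30, 100, 1000]`) toward a normal distribution, which may improve model performance.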


### 6.3.2. Non-linear transformation
The **quantile transform** is based on the formula $G^{-1}(F(X))$.
- $F$ is the cumulative distribution function (CDF) of the feature:
$$F(x) = P(X \le x)$$
- $G^{-1}$ is the **quantile function** of the **desired output distribution** $G$. If $G(x)=p$, then $G^{-1}(p)=x$.
  
This formula relies on the following two facts:
1. If $X$ is a random variable with a continuous CDF $F$, then $F(X)$ is uniformly distributed on $[0,1]$.
2. If $U$ is a random variable with uniform distribution on $[0,1]$, then $G^{-1}(U)$ has distribution $G$.
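
Both facts can be checked numerically. The sketch below is only an illustration: it assumes an exponential distribution for $X$ (so that $F(x) = 1 - e^{-x}$ is known in closed form) and a standard normal target $G$; the sample size and the seed are arbitrary.

```python
from statistics import NormalDist

import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=20_000)  # X ~ Exp(1), a skewed distribution

# Fact 1: U = F(X) is uniform on [0, 1]; for Exp(1), F(x) = 1 - exp(-x)
u = 1.0 - np.exp(-x)
u = np.clip(u, 1e-12, 1 - 1e-12)  # guard against exact 0 or 1 before inverting

# Fact 2: Z = G^{-1}(U) follows G; here G is the standard normal distribution
inv_cdf = NormalDist().inv_cdf
z = np.array([inv_cdf(p) for p in u.tolist()])

print(u.mean(), u.var())  # close to 1/2 and 1/12 for a uniform variable
print(z.mean(), z.std())  # close to 0 and 1 for a standard normal variable
```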

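For example:   
for $[10,20,30,100,1000]$, the empirical CDF values are $[0,0.25,0.5,0.75,1.0]$.  
If the target $G$ is Gaussian, the output is approximately $[-5.2, -0.67, 0, 0.67, 5.2]$:
- $G^{-1}(0.25) = -0.67$
- $G^{-1}(0.5) = 0$
- $G^{-1}(0) = -\infty$ and $G^{-1}(1) = +\infty$ in theory, so scikit-learn clips the two extremes to finite values (about $\pm 5.2$).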

**Power transforms** (such as $\log x$ or $x^k$) make a distribution closer to Gaussian. For example, $\ln(10) \approx 2.30$ and $\ln(1000) \approx 6.91$, so the logarithm compresses the range $[10, 1000]$ to about $[2.30, 6.91]$.

#### 6.3.2.1. Mapping to a Uniform distribution

In [11]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
quantile_transformer = preprocessing.QuantileTransformer(random_state=0)
X_train_trans = quantile_transformer.fit_transform(X_train)
X_test_trans = quantile_transformer.transform(X_test)



In [None]:
X_train

In [9]:
X_train_trans

array([[0.54954955, 0.43693694, 0.48198198, 0.59009009],
       [0.50900901, 0.16216216, 0.42342342, 0.41441441],
       [0.85135135, 0.43693694, 0.80630631, 0.85135135],
       [0.06756757, 0.66666667, 0.04504505, 0.13513514],
       [0.87837838, 0.57657658, 0.71621622, 0.93243243],
       [0.17567568, 0.86936937, 0.27927928, 0.32432432],
       [0.31981982, 0.91441441, 0.1981982 , 0.13513514],
       [0.17567568, 0.        , 0.35135135, 0.35585586],
       [0.74324324, 0.43693694, 0.80630631, 0.7027027 ],
       [0.80630631, 0.74774775, 0.86486486, 1.        ],
       [0.57657658, 0.01351351, 0.67117117, 0.59009009],
       [0.80630631, 0.11261261, 0.89189189, 0.7027027 ],
       [0.4009009 , 0.11261261, 0.4009009 , 0.38738739],
       [0.97747748, 0.43693694, 0.94144144, 0.93243243],
       [0.65315315, 0.74774775, 0.5990991 , 0.63063063],
       [0.36036036, 0.06306306, 0.38738739, 0.38738739],
       [0.65315315, 0.20720721, 0.63963964, 0.7027027 ],
       [0.65315315, 0.26576577,

In [10]:
import numpy as np
np.percentile(X_train_trans[:,0],[0,25,50,75,100])

array([0.        , 0.23873874, 0.50900901, 0.74324324, 1.        ])

In [13]:
np.percentile(X_train[:,0],[0,25,50,75,100])


array([4.3, 5.1, 5.8, 6.5, 7.9])

#### 6.3.2.2. Mapping to a Gaussian distribution

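`PowerTransformer` applies a parametric, monotonic power transform (Box-Cox or Yeo-Johnson) to make the data more Gaussian-like.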
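
A minimal sketch of `PowerTransformer` on right-skewed data follows. The lognormal sample, the seed, and the `skewness` helper are assumptions made for illustration; only `PowerTransformer`, its `method` argument, and its `lambdas_` attribute come from scikit-learn.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
# Right-skewed, strictly positive sample (illustrative data)
X_skewed = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))

def skewness(a):
    """Sample skewness: the third standardized moment."""
    a = np.asarray(a, dtype=float).ravel()
    return np.mean((a - a.mean()) ** 3) / a.std() ** 3

# Box-Cox requires strictly positive input; Yeo-Johnson also accepts
# zero and negative values. The output is standardized by default.
pt = PowerTransformer(method="box-cox")
X_gauss = pt.fit_transform(X_skewed)

print("lambda:", pt.lambdas_)
print("skewness before:", skewness(X_skewed))
print("skewness after: ", skewness(X_gauss))
```

The skewness should drop markedly after the transform, which is the sense in which the result is "more Gaussian".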


🔗Map data to a normal distribution

### 6.3.3. Normalization
Normalization is the process of scaling individual samples to have **unit norm** (length).  
Normalization makes it easier to **compare the similarity** between samples, especially when a dot product or another kernel is used to compute the similarity.  

The `normalize` function provides this operation with a choice of `norm` (`l1`, `l2`, or `max`).   
Note that it normalizes each row (sample), not each column.

In [4]:
from sklearn import preprocessing
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]
X_normalized = preprocessing.normalize(X, norm='l2')
X_normalized

array([[ 0.40824829, -0.40824829,  0.81649658],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.        ,  0.70710678, -0.70710678]])
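
To make the row-wise behavior and the link to similarity concrete, here is a small sketch that reproduces the L2 normalization by hand and checks that dot products of unit-norm rows equal cosine similarities. The variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import normalize

X = np.array([[1., -1.,  2.],
              [2.,  0.,  0.],
              [0.,  1., -1.]])

# L2-normalize each row by hand: divide by the row's Euclidean length
row_norms = np.linalg.norm(X, axis=1, keepdims=True)
X_manual = X / row_norms

X_unit = normalize(X, norm="l2")
print(np.allclose(X_manual, X_unit))

# For unit-length rows, the plain dot product equals the cosine similarity
cosine = (X @ X.T) / (row_norms * row_norms.T)
print(np.allclose(X_unit @ X_unit.T, cosine))
```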

There is also a `Normalizer` class. It is a transformer, so it can be used in a `Pipeline`.

In [8]:
normalizer = preprocessing.Normalizer().fit(X)
normalizer

In [10]:
normalizer.transform(X)

array([[ 0.40824829, -0.40824829,  0.81649658],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.        ,  0.70710678, -0.70710678]])

### 6.3.4. Encoding categorical features
To convert categorical features to integer codes, we can use `OrdinalEncoder`. 

For example, suppose a person has the features `["male", "female"]`, `["from Europe", "from US", "from Asia"]`, and `["uses Firefox", "uses Chrome", "uses Safari", "uses Internet Explorer"]`. Then `["male", "from US", "uses Internet Explorer"]` could be expressed as `[0, 1, 3]`, while `["female", "from Asia", "uses Chrome"]` would be `[1, 2, 1]`.
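
The codes in this example follow the order in which the categories are listed. By default `OrdinalEncoder` sorts the categories it finds, so reproducing exactly these codes requires passing the order explicitly. A minimal sketch, assuming the `categories` parameter is used for that purpose (the variable names are illustrative):

```python
from sklearn.preprocessing import OrdinalEncoder

genders = ["male", "female"]
regions = ["from Europe", "from US", "from Asia"]
browsers = ["uses Firefox", "uses Chrome", "uses Safari", "uses Internet Explorer"]

# Passing `categories` fixes the integer code of each value (its list position)
enc_fixed = OrdinalEncoder(categories=[genders, regions, browsers])

samples = [["male", "from US", "uses Internet Explorer"],
           ["female", "from Asia", "uses Chrome"]]
codes = enc_fixed.fit_transform(samples)
print(codes)
```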

In [22]:
enc = preprocessing.OrdinalEncoder()
X= [
    ['male','from us','uses safari'],
    ['female','from eu','uses chrome']
]
enc.fit(X)

In [23]:
enc.transform([['female','from us', 'uses chrome']])

array([[0., 1., 0.]])

In [24]:
enc.transform([[np.nan, np.nan, 'uses chrome']])

ValueError: Found unknown categories [nan] in column 0 during transform

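The error above occurs because `nan` was not seen during `fit`. When missing values are present in the training data, the `encoded_missing_value` parameter sets the code they are mapped to: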

In [25]:
import numpy as np
enc = preprocessing.OrdinalEncoder(encoded_missing_value=-1)
X = [['male'], ['female'], [np.nan], ['female']]
enc.fit_transform(X)

array([[ 1.],
       [ 0.],
       [-1.],
       [ 0.]])
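
Categories that never appeared during `fit` are a separate case from missing values. One way to handle them, sketched here under the assumption that the installed scikit-learn supports `handle_unknown="use_encoded_value"`, is to map every unseen category to a reserved code; `"from asia"` below is an unseen value chosen for illustration.

```python
from sklearn.preprocessing import OrdinalEncoder

X = [["male", "from us", "uses safari"],
     ["female", "from eu", "uses chrome"]]

# Unseen categories are mapped to `unknown_value` instead of raising an error
enc_unknown = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
enc_unknown.fit(X)

encoded = enc_unknown.transform([["female", "from asia", "uses chrome"]])
print(encoded)
```

The reserved code must not collide with a code already assigned to a known category, which is why a negative value such as -1 is a common choice.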