# Data Preprocessing with Scikit-Learn

># 1. Scaling
* `scale(X)`: standardize each feature (mean=0, variance=1)
* `robust_scale(X)`: center on the median and scale by the interquartile range, which limits the influence of outliers
* `minmax_scale(X)`: min$\rightarrow$0, max$\rightarrow$1
* `maxabs_scale(X)`: maximum absolute value $\rightarrow$ 1

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(data1)                 # learn the per-column mean and standard deviation
data2 = scaler.transform(data1)   # apply the learned scaling
```
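To make the four scaling functions listed above concrete, here is a minimal sketch on a made-up single-feature array that contains one outlier. The array `X` and the variable names are illustrative assumptions, not part of the original notes.

```python
import numpy as np
from sklearn.preprocessing import scale, robust_scale, minmax_scale, maxabs_scale

# Hypothetical data: one feature, with 100.0 acting as an outlier
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

X_std = scale(X)          # per column: zero mean, unit variance
X_rob = robust_scale(X)   # per column: (x - median) / IQR
X_mm = minmax_scale(X)    # per column: min -> 0, max -> 1
X_ma = maxabs_scale(X)    # per column: divide by the largest absolute value

print(X_std.mean(axis=0), X_std.std(axis=0))
print(X_rob.ravel())      # the median (3.0) maps to 0
print(X_mm.min(), X_mm.max())
print(np.abs(X_ma).max())
```

With `robust_scale` the four ordinary points stay within $[-1, 0.5]$, while `scale` and `minmax_scale` squeeze them together because the outlier dominates the mean, variance, and range.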

># 2. Normalization
* Rescale each sample (row) so that its magnitude $\rightarrow$ 1 (L2 norm by default)

```python
from sklearn.preprocessing import normalize

y = normalize(x)
```
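A small check of the row-wise behaviour, assuming a made-up 2x2 array `x`:

```python
import numpy as np
from sklearn.preprocessing import normalize

# Hypothetical data: two samples (rows) with two features each
x = np.array([[3.0, 4.0],
              [1.0, 0.0]])

y = normalize(x)                       # defaults: norm="l2", axis=1 (per row)
row_norms = np.linalg.norm(y, axis=1)  # every row should now have length 1

print(y)          # first row becomes [0.6, 0.8]
print(row_norms)
```

Unlike the scalers in section 1, which work per feature (column), `normalize` works per sample (row) by default.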

># 3. Encoding

### One-Hot-Encoding
* Category (integer or string) $\rightarrow$ K-dim 0/1 vector, where K is the number of categories

```python
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
ohe.fit(X)
ohe.categories_  # categories found per column (n_values_, feature_indices_, active_features_ were removed)
ohe.transform(X).toarray()  # the result is a sparse matrix; convert it to a dense array
```
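As an illustration, the sketch below encodes a made-up single-column `X` with three distinct integer categories. It assumes a scikit-learn release that exposes `categories_`.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: one categorical column with values 0, 1, 2
X = np.array([[0], [1], [2], [1]])

ohe = OneHotEncoder()
encoded = ohe.fit_transform(X).toarray()  # sparse result converted to dense

print(ohe.categories_)  # one array of categories per input column
print(encoded)          # each row has a single 1 in the column of its category
```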

### Label Binarizer
* OHE for target labels (strings or integers)

```python
from sklearn.preprocessing import LabelBinarizer
lb = LabelBinarizer()
lb.fit(y)
lb.classes_
lb.transform(y)
```
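A minimal sketch with made-up string labels `y`, showing the sorted `classes_` and the round trip back to the original labels:

```python
from sklearn.preprocessing import LabelBinarizer

# Hypothetical target labels
y = ["cat", "dog", "bird", "dog"]

lb = LabelBinarizer()
onehot = lb.fit_transform(y)             # one column per class, in sorted order
restored = lb.inverse_transform(onehot)  # back to the original strings

print(lb.classes_)  # ['bird' 'cat' 'dog']
print(onehot)
print(restored)
```

Note that with only two classes `LabelBinarizer` returns a single 0/1 column rather than two columns.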

### Label Encoding
* Labels $\rightarrow$ integers

```python
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(z)
le.classes_
le.transform([1, 1, 2, 6]) 
le.inverse_transform([0, 0, 1, 2])
```
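For the calls above to work, `z` has to contain the labels 1, 2, and 6. A self-contained version, assuming `z = [1, 2, 2, 6]`:

```python
from sklearn.preprocessing import LabelEncoder

# Assumed training labels; only the distinct values 1, 2, 6 matter
z = [1, 2, 2, 6]

le = LabelEncoder()
le.fit(z)

codes = le.transform([1, 1, 2, 6])          # labels -> positions in classes_
back = le.inverse_transform([0, 0, 1, 2])   # positions -> labels

print(le.classes_)  # [1 2 6]
print(codes)        # [0 0 1 2]
print(back)         # [1 1 2 6]
```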

### Binarizer
* 0/1 based on threshold (default = 0)

```python
from sklearn.preprocessing import Binarizer
bn = Binarizer(threshold=1.1)
bn.fit(X)        # stateless: fit only validates the input
bn.transform(X)  # values > threshold become 1, all others 0
```
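The comparison is strict, so a value equal to the threshold maps to 0. A quick sketch on a made-up array `X`:

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical data; note the entry exactly equal to the threshold (1.1)
X = np.array([[0.5, 1.1, 2.0],
              [1.2, -3.0, 1.0]])

bn = Binarizer(threshold=1.1)
B = bn.fit_transform(X)  # 1 where X > 1.1, otherwise 0

print(B)
```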

># 4. Imputer
* Fills in missing values
* Parameters
  * `missing_values`: the placeholder that marks a missing entry (e.g. `np.nan`)
  * `strategy`: how to fill them
    * `mean`
    * `median`
    * `most_frequent`

```python
import numpy as np
from sklearn.impute import SimpleImputer  # replaces the removed preprocessing.Imputer

imp = SimpleImputer(missing_values=np.nan, strategy='mean')  # fills column by column
imp.fit_transform(X)
```
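To compare strategies, the sketch below fills a made-up array with the column mean and the column median. It assumes a recent scikit-learn release, where the imputer is `sklearn.impute.SimpleImputer`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value per column
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [2.0, np.nan],
              [9.0, 6.0]])

mean_imp = SimpleImputer(missing_values=np.nan, strategy="mean")
median_imp = SimpleImputer(missing_values=np.nan, strategy="median")

X_mean = mean_imp.fit_transform(X)      # NaNs replaced by the column means
X_median = median_imp.fit_transform(X)  # NaNs replaced by the column medians

print(mean_imp.statistics_)    # per-column fill values: [4. 4.]
print(median_imp.statistics_)  # per-column fill values: [2. 4.]
```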

># 5. Polynomial Features
$$ x \;\; \rightarrow \;\; [ 1, x, x^2, x^3, \cdots ] $$

$$ [x_1, x_2] \;\; \rightarrow \;\; [ 1, x_1, x_2, x_1^2, x_1 \cdot x_2, x_2^2 ] \quad (\text{degree} = 2) $$
* Parameters
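  * `degree`: maximum degree of the generated terms
  * `interaction_only`: keep only products of distinct features (no powers such as $x_1^2$)
  * `include_bias`: include the constant column of ones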

```python
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(2)
poly.fit_transform(X)
```
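The effect of the three parameters can be seen on a single made-up sample $[x_1, x_2] = [2, 3]$:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data: one sample with x1 = 2, x2 = 3
X = np.array([[2.0, 3.0]])

full = PolynomialFeatures(degree=2).fit_transform(X)
inter = PolynomialFeatures(degree=2, interaction_only=True).fit_transform(X)
no_bias = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)

print(full)     # [1, x1, x2, x1^2, x1*x2, x2^2] -> [[1. 2. 3. 4. 6. 9.]]
print(inter)    # [1, x1, x2, x1*x2]             -> [[1. 2. 3. 6.]]
print(no_bias)  # [x1, x2, x1^2, x1*x2, x2^2]    -> [[2. 3. 4. 6. 9.]]
```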

># 6. Function Transformer
$$ x \;\; \rightarrow \;\; [ f_1(x),  f_2(x),  f_3(x),  \cdots ] $$

```python
from sklearn.preprocessing import FunctionTransformer

def all_but_first_column(X):
    return X[:, 1:]
    
FunctionTransformer(all_but_first_column).fit_transform(X)
```
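The mapping $x \rightarrow [f_1(x), f_2(x), \cdots]$ can be sketched by stacking several element-wise functions side by side. The helper `expand_features` and the choice of functions (identity, `log1p`, square) are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

def expand_features(X):
    """Hypothetical helper: return [X, log(1 + X), X^2] stacked column-wise."""
    return np.hstack([X, np.log1p(X), X ** 2])

# Hypothetical non-negative data, so that log1p is well defined
X = np.array([[0.0, 1.0],
              [2.0, 3.0]])

ft = FunctionTransformer(expand_features)
X_new = ft.fit_transform(X)

print(X_new.shape)  # (2, 6): two original columns times three functions
```

Wrapping the function in a `FunctionTransformer` gives it the usual `fit`/`transform` interface, so it can be used like the other transformers in these notes.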