# Data Preprocessing

Real-world data is rarely clean.
- Sklearn provides transformers to preprocess the data.
- Sklearn provides `Pipeline` to chain multiple transformers together and apply them uniformly across the train, eval and test sets.
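
As a rough illustration of chaining, the sketch below combines an imputer and a scaler in a `Pipeline`. The toy arrays and the step names `"impute"` and `"scale"` are assumptions made up for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): one missing value in train, one in test
X_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0]])
X_test = np.array([[np.nan, 20.0]])

# Each step is a (name, transformer) pair; the names are arbitrary
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Parameters are learnt on the training set only ...
X_train_t = pipe.fit_transform(X_train)
# ... and the same learnt transformation is applied to the test set
X_test_t = pipe.transform(X_test)
print(X_test_t)
```

The test row is imputed and scaled with the statistics learnt from the training set, so it is never used to estimate them.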

### Typical problems include
- Missing values
- Numerical values are not on the same scale
- Categorical values need to be represented in a sensible numerical manner
- Too many features, so some need to be removed
- Features need to be extracted from non-numerical data

Sklearn provides a library of transformers for data preprocessing.
- Data cleaning (`sklearn.preprocessing`, `sklearn.impute`) such as standardization, missing value imputation, etc.
- Feature extraction (`sklearn.feature_extraction`)
- Feature reduction (`sklearn.decomposition.PCA`)
- Feature expansion (`sklearn.kernel_approximation`)

### Transformer methods

- `fit()` method learns the transformation parameters (for example, the mean and standard deviation for scaling) from a training set.
- `transform()` method applies the learnt transformation to new data.
- `fit_transform()` performs the function of both `fit()` and `transform()` and is more convenient and efficient to use.
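
The three methods above can be sketched with `StandardScaler`. The toy arrays are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

scaler = StandardScaler()
scaler.fit(X_train)                       # learns mean_ and scale_ from the training set
X_test_scaled = scaler.transform(X_test)  # reuses the learnt parameters on new data

# fit_transform() is fit() followed by transform() on the same data
X_train_scaled = StandardScaler().fit_transform(X_train)

print(scaler.mean_, scaler.scale_)
print(X_test_scaled)
```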

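`sklearn.feature_extraction` has useful APIs to extract features from data:
- `DictVectorizer` - Converts a list of dictionaries (feature name to value) into a matrix.
- `FeatureHasher` - Applies the hashing trick to map feature names to column indices of a fixed-size sparse matrix, without storing a vocabulary.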
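
A minimal sketch of both extractors follows. The `records` list and the choice of `n_features=8` are assumptions made up for this example.

```python
from sklearn.feature_extraction import DictVectorizer, FeatureHasher

# Hypothetical records: one string-valued and one numeric feature
records = [
    {"city": "Pune", "temperature": 31.0},
    {"city": "Delhi", "temperature": 24.0},
]

# DictVectorizer one-hot encodes string values and passes numbers through
dv = DictVectorizer(sparse=False)
X_dict = dv.fit_transform(records)
print(dv.feature_names_)  # ['city=Delhi', 'city=Pune', 'temperature']
print(X_dict)

# FeatureHasher maps the same records into a fixed number of columns
fh = FeatureHasher(n_features=8, input_type="dict")
X_hashed = fh.fit_transform(records)
print(X_hashed.shape)  # (2, 8)
```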

### Feature extraction from images and text
- `sklearn.feature_extraction.image.*` has useful APIs to extract features from image data. Find out more in the sklearn user guide section on feature extraction from images.
- `sklearn.feature_extraction.text.*` has useful APIs to extract features from text data. Find out more in the sklearn user guide section on feature extraction from text.

## Data Cleaning
Missing values occur due to errors in data capture, such as sensor malfunctions and measurement errors.

- The `sklearn.impute` API provides functionality to fill missing values in a dataset. Two imputers are covered here:
    - `SimpleImputer`
    - `KNNImputer`
- `MissingIndicator` provides indicators for missing values.

### SimpleImputer

- Fills missing values with one of the following strategies: `'mean'`, `'median'`, `'most_frequent'` and `'constant'`.

In [1]:
# si = SimpleImputer(strategy='mean')
# si.fit_transform(X)

### KNNImputer

- Uses k-nearest neighbours approach to fill missing values in a dataset.
- The missing value of an attribute in a specific example is filled with the mean value of the same attribute over the `n_neighbors` closest neighbors.
- The nearest neighbours are decided based on a Euclidean distance that skips missing coordinates (`nan_euclidean`, the default metric).

In [None]:
# knni = KNNImputer(n_neighbors=2, weights='uniform')
# knni.fit_transform(X)

It is useful to indicate the presence of missing values in the dataset.
- `MissingIndicator` helps us get those indications.
    - It returns a binary matrix in which `True` values correspond to missing entries in the original dataset.
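
A small sketch of `MissingIndicator` on toy data (the array is made up for illustration). With the default settings, only the columns that contained missing values during `fit` appear in the output.

```python
import numpy as np
from sklearn.impute import MissingIndicator

# Hypothetical data: columns 0 and 1 contain missing values, column 2 does not
X = np.array([[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]])

indicator = MissingIndicator()
mask = indicator.fit_transform(X)

print(indicator.features_)  # indices of the columns that had missing values
print(mask)
```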

### Using SimpleImputer

In [4]:
import numpy as np
from sklearn.impute import SimpleImputer

X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
mean_imputer = SimpleImputer(strategy="mean")
mean_imputer.fit(X)
mean_imputer.transform(X)

array([[ 7.,  2.,  3.],
       [ 4.,  2.,  6.],
       [10.,  2.,  9.]])

### Using KNNImputer

In [6]:
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1,2,np.nan], [3,4,3], [np.nan,6,5], [8,8,7]])
imputer = KNNImputer(n_neighbors=2)
imputer.fit_transform(X)

array([[1. , 2. , 4. ],
       [3. , 4. , 3. ],
       [5.5, 6. , 5. ],
       [8. , 8. , 7. ]])

## Categorical Transformers

### OneHotEncoder

- Encodes a categorical feature or label as a one-hot numeric array.
- Creates one binary column for each of the K unique values.
- In each row, exactly one column has 1 in it and the rest have 0.

In [None]:
# ohe = OneHotEncoder()
# ohe.fit_transform(X)
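
A minimal sketch on a made-up colour column. `OneHotEncoder` returns a sparse matrix by default, so the example converts it to a dense array for printing.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single categorical feature
X = np.array([["red"], ["green"], ["blue"], ["green"]])

ohe = OneHotEncoder()
X_onehot = ohe.fit_transform(X).toarray()

print(ohe.categories_)  # categories are sorted: blue, green, red
print(X_onehot)
```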

### LabelEncoder
Encodes target labels with values between 0 and K - 1, where K is the number of distinct values.

In [None]:
# le = LabelEncoder()
# le.fit_transform(y)
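
A short sketch with made-up labels. The classes are sorted before being numbered, and `inverse_transform` recovers the original labels.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical target labels
y = ["cat", "dog", "bird", "dog"]

le = LabelEncoder()
y_encoded = le.fit_transform(y)
y_decoded = le.inverse_transform(y_encoded)

print(le.classes_)  # ['bird' 'cat' 'dog']
print(y_encoded)    # [1 2 0 2]
```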

### OrdinalEncoder

Encodes categorical features with values between 0 and K - 1, where K is the number of distinct values.

In [7]:
# oe = OrdinalEncoder()
# oe.fit_transform(X)
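
A sketch with a made-up size feature. Passing `categories` explicitly is an assumption of this example, used so that the integer codes follow the natural order small < medium < large instead of alphabetical order.

```python
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordered categorical feature
X = [["small"], ["large"], ["medium"], ["small"]]

# One list of categories per feature, in the desired order
oe = OrdinalEncoder(categories=[["small", "medium", "large"]])
X_ordinal = oe.fit_transform(X)

print(X_ordinal.ravel())  # [0. 2. 1. 0.]
```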

### LabelBinarizer

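- Several regression and binary classification algorithms can be extended to the multi-class setup in a one-vs-all fashion.
- This involves training a single regressor or classifier per class.
- `LabelBinarizer` supports this by converting multi-class labels into a binary indicator matrix with one column per class.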

In [8]:
# lb = LabelBinarizer()
# lb.fit_transform(y)
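
A short sketch with made-up labels, showing the one-column-per-class output that a one-vs-all setup would consume.

```python
from sklearn.preprocessing import LabelBinarizer

# Hypothetical multi-class labels
y = ["cat", "dog", "bird", "dog"]

lb = LabelBinarizer()
Y = lb.fit_transform(y)

print(lb.classes_)  # ['bird' 'cat' 'dog']
print(Y)            # column j is the binary target for class j
```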

### MultiLabelBinarizer

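Encodes a collection of label sets (one set per example) as a binary indicator matrix with one column per class, where K is the number of classes. A 1 in column j means the example carries class j.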

In [None]:
# mlb = MultiLabelBinarizer()
# mlb.fit_transform(y)
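
A minimal sketch in which each example can carry several labels at once. The genre labels are made up for illustration.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical label sets, one set per example
y = [{"sci-fi", "thriller"}, {"comedy"}, {"comedy", "sci-fi"}]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y)

print(mlb.classes_)  # ['comedy' 'sci-fi' 'thriller']
print(Y)
```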

### add_dummy_feature

Augments the dataset with a column vector in which every value is 1. This is useful for fitting a bias (intercept) term.

In [9]:
# add_dummy_feature(X)
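
A short sketch on a made-up matrix. The column of ones is added as the first column.

```python
import numpy as np
from sklearn.preprocessing import add_dummy_feature

# Hypothetical feature matrix
X = np.array([[7, 1], [1, 8], [2, 0]])

X_aug = add_dummy_feature(X)
print(X_aug)
```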

## Numerical Transformers

- Feature Scaling
- Polynomial Transformation
- Discretization

### Feature Scaling

- Numerical features with different scales lead to slower convergence of iterative optimization procedures.
- It is a good practice to scale numerical features so that all of them are on the same scale.
- Three feature scaling APIs are available in sklearn:
    - StandardScaler
        - $$ x' = \frac{x - \mu}{\sigma} $$
        - Here $\mu$ and $\sigma$ are the mean and standard deviation of the feature.
    - MinMaxScaler
        - $$ x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} $$
        - All the values fall within the range [0, 1].
    - MaxAbsScaler
        - It transforms the original feature vector x into a new feature vector x' so that all values fall within the range [-1, 1].
        - $$ x' = \frac{x}{\text{MaxAbsoluteValue}} $$
        - $$ \text{MaxAbsoluteValue} = \max(x_{\max}, |x_{\min}|) $$

In [11]:
## StandardScaler

# ss = StandardScaler()
# ss.fit_transform(X)

## MinMaxScaler

# mms = MinMaxScaler()
# mms.fit_transform(X)

## MaxAbsScaler

# mas = MaxAbsScaler()
# mas.fit_transform(X)
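
To compare the three scalers side by side, here is a small sketch on a single made-up feature column.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, StandardScaler

# Hypothetical feature with min -4, max 6 and mean 1
X = np.array([[-4.0], [0.0], [2.0], [6.0]])

X_std = StandardScaler().fit_transform(X)     # zero mean, unit variance
X_minmax = MinMaxScaler().fit_transform(X)    # (x + 4) / 10, range [0, 1]
X_maxabs = MaxAbsScaler().fit_transform(X)    # x / 6, range [-1, 1]

print(X_std.ravel())
print(X_minmax.ravel())  # [0.  0.4 0.6 1. ]
print(X_maxabs.ravel())
```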


### FunctionTransformer

Constructs transformed features by applying a user-defined function.

In [None]:
# ft = FunctionTransformer(np.log2)
# ft.fit_transform(X)
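
A minimal sketch applying `np.log2` element-wise. The input values are chosen as powers of two so the result is easy to check by hand.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Hypothetical positive-valued features
X = np.array([[1.0, 2.0], [4.0, 8.0]])

ft = FunctionTransformer(np.log2)
X_log = ft.fit_transform(X)

print(X_log)  # [[0. 1.] [2. 3.]]
```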

### Polynomial Transformation

Generates a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree.

In [13]:
# pf = PolynomialFeatures(degree=2)
# pf.fit_transform(X)
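
For a single made-up sample with two features (x1 = 2, x2 = 3), a degree-2 expansion produces the columns 1, x1, x2, x1^2, x1*x2, x2^2:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical sample with two features
X = np.array([[2.0, 3.0]])

pf = PolynomialFeatures(degree=2)
X_poly = pf.fit_transform(X)

print(X_poly)  # [[1. 2. 3. 4. 6. 9.]]
```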

### KBinsDiscretizer

- Divides a continuous variable into bins.
- One-hot encoding or ordinal encoding is then applied to the bin labels.

In [14]:
# kbd = KBinsDiscretizer(
#     n_bins=5,
#     strategy='uniform',
#     encode='ordinal')
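
A small sketch using the same settings on made-up values between 0 and 10. With `strategy='uniform'` the five bins are equally wide, so the edges fall at 0, 2, 4, 6, 8 and 10.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical continuous feature
X = np.array([[0.0], [2.5], [5.0], [7.5], [10.0]])

kbd = KBinsDiscretizer(n_bins=5, strategy="uniform", encode="ordinal")
X_binned = kbd.fit_transform(X)

print(kbd.bin_edges_[0])  # edges of the equal-width bins
print(X_binned.ravel())   # ordinal bin index of each value
```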

### Outlier Removal

- IQR method

In [26]:
import matplotlib.pyplot as plt
import numpy as np

data = np.array([1, 12, 14, 15, 17, 19, 21, 23, 26, 36])

# Interquartile range: spread of the middle 50% of the data
q1 = np.quantile(data, 0.25)
q3 = np.quantile(data, 0.75)
iqr = q3 - q1

# Points outside [q1 - 1.5*IQR, q3 + 1.5*IQR] are treated as outliers
lower_bound = q1 - 1.5*iqr
upper_bound = q3 + 1.5*iqr

print(lower_bound, upper_bound)

# plt.boxplot(data)

1.875 34.875
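
Once the bounds are known, removing the outliers is a boolean mask away. The sketch below recomputes the bounds on the same data so it can run on its own. The names `is_outlier`, `outliers` and `cleaned` are introduced here for illustration.

```python
import numpy as np

data = np.array([1, 12, 14, 15, 17, 19, 21, 23, 26, 36])

q1, q3 = np.quantile(data, [0.25, 0.75])
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Flag every point that falls outside the IQR fences
is_outlier = (data < lower_bound) | (data > upper_bound)
outliers = data[is_outlier]
cleaned = data[~is_outlier]

print(outliers)  # [ 1 36]
print(cleaned)
```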


- Z-Score method
    - Calculate the mean and standard deviation, then compute the Z-score of each point: (x - mu) / sigma
    - Any point whose absolute Z-score exceeds a chosen threshold (commonly 3) is treated as an outlier
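
A minimal sketch of the Z-score method. The data and the threshold of 2.5 are assumptions chosen for this tiny sample; larger datasets often use a threshold of 3.

```python
import numpy as np

# Hypothetical data with one obviously extreme value
data = np.array([10, 12, 11, 13, 12, 11, 10, 12, 11, 100])
threshold = 2.5  # illustrative choice, not a fixed rule

mu = data.mean()
sigma = data.std()
z_scores = (data - mu) / sigma

# Keep the points whose absolute Z-score exceeds the threshold
outliers = data[np.abs(z_scores) > threshold]
print(outliers)  # [100]
```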