Real-world training data is rarely clean. It often has issues such as NaN values and non-numeric attributes.

To preprocess this kind of data, sklearn provides a rich set of `transformers`.

sklearn also provides `Pipeline` (in `sklearn.pipeline`), which makes it easier to chain multiple transformers together and apply them uniformly across the training, evaluation, and test sets.
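To make the chaining idea concrete, here is a minimal sketch that imputes and then scales. The toy arrays and the step names `"impute"` and `"scale"` are assumptions made up for illustration; they are not from these notes.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: each set has one missing value.
X_train = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, np.nan]])

# Chain two transformers: impute missing values, then standardize.
preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Learn the column means and scaling statistics from the training set only.
X_train_clean = preprocess.fit_transform(X_train)

# Reuse exactly those learnt statistics on the test set.
X_test_clean = preprocess.transform(X_test)
```

Calling `transform` (not `fit_transform`) on the test set is what keeps the preprocessing uniform: the test data is filled and scaled with statistics learnt from the training data.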

## Typical Problems:
  1. Missing values in features
  2. Features are not on the same scale
  3. Categorical attributes need a sensible numerical representation
  4. Too many features, so they need to be reduced
  5. Features must be extracted from non-numerical data such as images or text

## Sklearn library:

  1. Data Cleaning: `sklearn.preprocessing` (standardization, scaling, encoding) and `sklearn.impute` (missing value imputation)
  2. Feature Extraction: `sklearn.feature_extraction`
  3. Feature reduction: `sklearn.decomposition` (e.g. `PCA`)
  4. Feature Expansion: `sklearn.kernel_approximation`

## Transformer Methods:
1. `fit()`: learns the transformation parameters from the training set
2. `transform()`: applies the learnt transformation to new data
3. `fit_transform()`: performs both `fit` and `transform` on the same data in one call, which is more convenient and can be more efficient
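The three methods can be sketched with `StandardScaler`, which is picked here only as a convenient example transformer. The toy arrays are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with a single feature.
X_train = np.array([[1.0], [2.0], [3.0]])
X_new = np.array([[4.0]])

scaler = StandardScaler()
scaler.fit(X_train)                     # learns mean_ and scale_ from the training data
X_new_scaled = scaler.transform(X_new)  # applies the learnt mean and scale to new data

# fit_transform does both steps on the same data in one call.
X_train_scaled = StandardScaler().fit_transform(X_train)
```

Note that `X_new` is scaled with the mean and standard deviation of `X_train`, not its own.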

# Part 1: Feature Extraction

`sklearn.feature_extraction` has useful APIs to extract features from data:

| `DictVectorizer` | `FeatureHasher` |
|---|---|
| Converts a list of mappings from feature name to value into a matrix | High-speed, low-memory vectorizer that uses the feature hashing technique |
| Builds a `hash table` of the features | Applies a `hash function` to the features to determine their column index in the sample matrix directly |
| | The hasher does not remember what the input features looked like and thus has no `inverse_transform` method |
| | Output of this transformer is a `scipy.sparse` matrix |
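For comparison, here is a minimal `FeatureHasher` sketch on the same kind of input. The value `n_features=8` is an arbitrary assumption chosen to keep the output small; the default is much larger.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction import FeatureHasher

# Same kind of input as for DictVectorizer: a list of {feature name: value} mappings.
data = [{'age': 4, 'height': 96}, {'age': 5, 'height': 101}]

# n_features fixes the number of output columns up front (8 is an arbitrary choice).
hasher = FeatureHasher(n_features=8, input_type="dict")

# The hasher is stateless, so transform can be called without fitting first.
X_hashed = hasher.transform(data)

print(issparse(X_hashed))                    # the output is a scipy.sparse matrix
print(X_hashed.shape)                        # one row per sample, n_features columns
print(hasattr(hasher, "inverse_transform"))  # feature names are not stored
```

Because column indices come from a hash function, two different feature names can collide in the same column, which is the price paid for the low memory use.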

```
# DictVectorizer illustration
from sklearn.feature_extraction import DictVectorizer

data = [{'age': 4, 'height': 96}, {'age': 5, 'height': 101}]

dv = DictVectorizer(sparse=False)  # return a dense NumPy array instead of a sparse matrix
dv.fit_transform(data)
```

```
array([[  4.,  96.],
       [  5., 101.]])
```

## Feature extraction from Images and Texts

### `sklearn.feature_extraction.image.*` class
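As one small illustration of this module, `extract_patches_2d` cuts an image into overlapping patches. The 4x4 integer "image" below is a made-up stand-in for real pixel data.

```python
import numpy as np
from sklearn.feature_extraction.image import extract_patches_2d

# Hypothetical 4x4 single-channel image with pixel values 0..15.
image = np.arange(16).reshape(4, 4)

# Extract every overlapping 2x2 patch from the image.
patches = extract_patches_2d(image, patch_size=(2, 2))

print(patches.shape)  # (number of patches, patch height, patch width)
print(patches[0])     # the top-left patch
```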

### `sklearn.feature_extraction.text.*` class

# Part 2: Cleaning Data

## Handling Missing Values:

Missing values typically occur due to errors in data capture.


### `sklearn.impute`

`MissingIndicator` provides a binary indicator of which values are missing.

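| `SimpleImputer` | `KNNImputer` |
|---|---|
| Fills the missing value with the `mean`, `median`, `most_frequent` value, or a `constant` | Uses a k-nearest neighbours approach to fill the missing value |
| | The filled value is the mean (uniform or distance-weighted) of that feature over the `n_neighbors` nearest neighbours |
| | The nearest neighbours are chosen by Euclidean distance computed over the non-missing features (`nan_euclidean`) |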




```
# Illustration of SimpleImputer with the mean strategy
import numpy as np
from sklearn.impute import SimpleImputer

data = [[1, 2], [np.nan, 3], [7, 6]]  # made-up example data with one missing value
si = SimpleImputer(strategy='mean')
si.fit_transform(data)
```



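```
# Illustration of KNNImputer
import numpy as np
from sklearn.impute import KNNImputer

data = [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]]  # made-up example data
knni = KNNImputer(n_neighbors=2, weights='uniform')  # the parameter is spelled n_neighbors
knni.fit_transform(data)
```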




### `MissingIndicator`

Returns a binary matrix in which an entry is `True` where the corresponding value is missing.
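A minimal sketch of the indicator on made-up data. Passing `features="all"` is a choice made here so that every column appears in the mask; by default only columns that had missing values during `fit` are reported.

```python
import numpy as np
from sklearn.impute import MissingIndicator

# Hypothetical toy data with two missing entries.
X = np.array([[4.0, 1.0], [np.nan, 5.0], [8.0, np.nan]])

# features="all" keeps one mask column per input column.
indicator = MissingIndicator(features="all")
mask = indicator.fit_transform(X)

print(mask)  # boolean matrix, True where X is missing
```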

### SimpleImputer Strategies:

1. `mean`: replaces missing values with the mean of each column. Can only be used with numeric data.
2. `median`: replaces missing values with the median of each column. Can only be used with numeric data.
3. `most_frequent`: replaces missing values with the most frequent value of each column. Can be used with strings or numeric data. If several values are tied for most frequent, the smallest one is used.
4. `constant`: replaces missing values with `fill_value`. Can be used with strings or numeric data.

```
## Example

import numpy as np
from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='median')
X = [[4, 1], [np.nan, 5], [8, 0]]
imp.fit_transform(X)
```

```
array([[4., 1.],
       [6., 5.],
       [8., 0.]])
```
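The `most_frequent` and `constant` strategies also accept string data. The sketch below uses a made-up object array, and the placeholder `"missing"` is an arbitrary assumption for `fill_value`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical string-valued data; np.nan marks the missing entries.
X_str = np.array(
    [["a", "x"], [np.nan, "y"], ["a", np.nan], ["b", "y"]],
    dtype=object,
)

# Fill each column with its most frequent value.
mode_imputer = SimpleImputer(strategy="most_frequent")
X_mode = mode_imputer.fit_transform(X_str)

# Fill every missing entry with a fixed placeholder.
const_imputer = SimpleImputer(strategy="constant", fill_value="missing")
X_const = const_imputer.fit_transform(X_str)

print(X_mode)
print(X_const)
```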