# sklearn.preprocessing
* Transformers change data from one format to another
* http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing
* sklearn.preprocessing consists of common utility functions & transformer classes that convert raw data into a format machine learning algorithms can consume (just numbers)
* Learning algorithms benefit from standardization of the dataset

#### Standardization, or mean removal and variance scaling

In [3]:
from sklearn import preprocessing
import numpy as np

In [4]:
X_train = np.array([[ 1., -1.,  2.],
                   [ 2.,  0.,  0.],
                   [ 0.,  1., -1.]])

In [6]:
X_scaled = preprocessing.scale(X_train)

In [7]:
X_scaled

array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])

In [11]:
X_scaled.mean(axis=0)

array([ 0.,  0.,  0.])

In [12]:
X_scaled.std(axis=0)

array([ 1.,  1.,  1.])
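
What `preprocessing.scale` does here can be reproduced by hand: subtract each column's mean and divide by each column's standard deviation. A minimal sketch, where the names `col_mean`, `col_std` and `X_manual` are only illustrative:

```python
import numpy as np
from sklearn import preprocessing

X_train = np.array([[1., -1.,  2.],
                    [2.,  0.,  0.],
                    [0.,  1., -1.]])

# Per-column mean and population standard deviation (NumPy default, ddof=0)
col_mean = X_train.mean(axis=0)
col_std = X_train.std(axis=0)

# Standardize: zero mean and unit variance per column
X_manual = (X_train - col_mean) / col_std

# Compare with the library result
print(np.allclose(X_manual, preprocessing.scale(X_train)))
```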

#### StandardScaler
* Rescales each column so that its mean is 0 & its std is 1

In [13]:
#Create a transformer object
transformer = preprocessing.StandardScaler()

In [15]:
#Learn the per-column mean & std used for the transformation
transformer.fit(X_train)

StandardScaler(copy=True, with_mean=True, with_std=True)

In [16]:
#Apply the data transformation
transformer.transform(X_train)

array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])

In [14]:
#Combination of the above two
transformer.fit_transform(X_train)

array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])
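
The reason to keep `fit` and `transform` separate is that the statistics learned from the training data can be reused on new data. A short sketch, assuming a hypothetical unseen sample `X_test` that is not part of the original notes:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1., -1.,  2.],
                    [2.,  0.,  0.],
                    [0.,  1., -1.]])
X_test = np.array([[-1., 1., 0.]])  # hypothetical new sample

# fit() stores the training statistics on the transformer
scaler = StandardScaler().fit(X_train)
print(scaler.mean_)   # per-column means learned from X_train
print(scaler.scale_)  # per-column standard deviations learned from X_train

# transform() reuses those statistics; nothing is re-learned from X_test
X_test_scaled = scaler.transform(X_test)
print(X_test_scaled)
```

The test sample is scaled with the training mean and std, so it will generally not have mean 0 or std 1 itself.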

#### MinMaxScaler
* Rescales each column to the range [0, 1] (inclusive)

In [18]:
minmaxscaler = preprocessing.MinMaxScaler()
minmaxscaler.fit_transform(X_train)

array([[ 0.5       ,  0.        ,  1.        ],
       [ 1.        ,  0.5       ,  0.33333333],
       [ 0.        ,  1.        ,  0.        ]])
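
With default settings, the min-max result matches the formula `(x - min) / (max - min)` applied column by column. A minimal sketch (`X_manual` is an illustrative name):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1., -1.,  2.],
                    [2.,  0.,  0.],
                    [0.,  1., -1.]])

# Per-column minimum and maximum
col_min = X_train.min(axis=0)
col_max = X_train.max(axis=0)

# Map each column linearly so that min -> 0 and max -> 1
X_manual = (X_train - col_min) / (col_max - col_min)

print(np.allclose(X_manual, MinMaxScaler().fit_transform(X_train)))
```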

#### MaxAbsScaler
* Divides each column by its maximum absolute value, so all data falls in the range [-1, 1]

In [19]:
maxabsscaler = preprocessing.MaxAbsScaler()
maxabsscaler.fit_transform(X_train)

array([[ 0.5, -1. ,  1. ],
       [ 1. ,  0. ,  0. ],
       [ 0. ,  1. , -0.5]])
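
The same output can be obtained by dividing each column by its largest absolute value. Nothing is subtracted, so zeros stay zeros. A minimal sketch:

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

X_train = np.array([[1., -1.,  2.],
                    [2.,  0.,  0.],
                    [0.,  1., -1.]])

# Largest absolute value in each column
col_max_abs = np.abs(X_train).max(axis=0)

# Pure division, no shift: the sign and the zeros are preserved
X_manual = X_train / col_max_abs

print(np.allclose(X_manual, MaxAbsScaler().fit_transform(X_train)))
```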

#### Scaling Sparse Matrix
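* Zeros may not remain zeros after scaling a sparse matrix: centering (subtracting the mean) turns them into non-zero values and destroys the sparsity
* Don't apply scalers that center the data to a sparse matrix; MaxAbsScaler and StandardScaler(with_mean=False) only divide, so zeros are kept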
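
Scalers that only divide, and never subtract, leave zeros untouched and so keep a sparse matrix sparse. A hedged sketch, assuming SciPy is available and using a small made-up matrix `X_sparse`:

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler, StandardScaler

# Hypothetical sparse input with three non-zero entries
X_sparse = sparse.csr_matrix(np.array([[0., 2.,  0.],
                                       [4., 0.,  0.],
                                       [0., 0., -8.]]))

# MaxAbsScaler only divides, so the result is still sparse
X_maxabs = MaxAbsScaler().fit_transform(X_sparse)

# StandardScaler works on sparse input only when centering is switched off
X_std = StandardScaler(with_mean=False).fit_transform(X_sparse)

# With the default with_mean=True, sparse input is rejected
try:
    StandardScaler().fit_transform(X_sparse)
    centering_rejected = False
except ValueError:
    centering_rejected = True

print(X_maxabs.nnz, X_std.nnz, centering_rejected)
```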

#### Scaling Outliers
* Scaling can distort the relationships between features
* PCA - used for finding important features. It can be affected, as it might misread the relationships between features. To handle this, use whiten=True
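
A sketch of what `whiten=True` does, assuming it refers to the `PCA` class in `sklearn.decomposition`. The correlated data below is made up for illustration. After whitening, the components are uncorrelated and each has unit variance, so their covariance matrix is close to the identity:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
base = rng.normal(size=(200, 1))
# Hypothetical data: three strongly correlated columns plus a little noise
X = np.hstack([base, 2 * base, -base]) + 0.1 * rng.normal(size=(200, 3))

# Project onto principal components and rescale each to unit variance
pca = PCA(whiten=True)
X_white = pca.fit_transform(X)

# Sample covariance of the whitened components
cov = np.cov(X_white, rowvar=False)
print(np.round(cov, 3))
```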

#### Non-linear transformation
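
The notes leave this section empty. One possible illustration, offered only as an assumption about what was intended, is `QuantileTransformer`, which maps a skewed feature onto a uniform distribution on [0, 1] while preserving the order of the values. The log-normal sample is invented for the example:

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.RandomState(0)
# Hypothetical skewed feature
X_skewed = rng.lognormal(size=(100, 1))

# n_quantiles must not exceed the number of samples
qt = QuantileTransformer(n_quantiles=100, random_state=0)
X_uniform = qt.fit_transform(X_skewed)

print(X_uniform.min(), X_uniform.max())
```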

#### Normalization
* Rescales each sample (row) to unit norm (L2 by default), so large values within a row are cut down

In [22]:
norm = preprocessing.Normalizer()
norm.fit_transform(X_train)

array([[ 0.40824829, -0.40824829,  0.81649658],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.        ,  0.70710678, -0.70710678]])
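
Unlike the scalers above, `Normalizer` works row by row. With the default L2 norm, each row is divided by its Euclidean length, which can be checked by hand:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X_train = np.array([[1., -1.,  2.],
                    [2.,  0.,  0.],
                    [0.,  1., -1.]])

# Euclidean (L2) length of every row, kept as a column for broadcasting
row_norms = np.linalg.norm(X_train, axis=1, keepdims=True)

# Each row now has length 1
X_manual = X_train / row_norms

print(np.allclose(X_manual, Normalizer().fit_transform(X_train)))
```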

#### Binarization
* Thresholds numerical data into 0 & 1: values above the threshold (0 by default) become 1, all others become 0

In [23]:
binarizer = preprocessing.Binarizer()
binarizer.fit_transform(X_train)

array([[ 1.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 0.,  1.,  0.]])
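
The cut-off can be changed with the `threshold` parameter. A small sketch using `threshold=1.0`, a value picked only for illustration; only entries strictly greater than the threshold become 1:

```python
import numpy as np
from sklearn.preprocessing import Binarizer

X_train = np.array([[1., -1.,  2.],
                    [2.,  0.,  0.],
                    [0.,  1., -1.]])

# Values strictly above 1.0 map to 1, everything else to 0
X_bin = Binarizer(threshold=1.0).fit_transform(X_train)

# Equivalent NumPy expression
X_manual = (X_train > 1.0).astype(float)

print(X_bin)
```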

### Categorical Data
* Two types of data - categorical & numerical
* Categorical data needs to be converted to numerical data

##### OneHotEncoder

In [27]:
enc = preprocessing.OneHotEncoder()
enc.fit_transform([[3], [0], [0], [1]]).toarray()

array([[ 0.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 0.,  1.,  0.]])

In [32]:
#shows which category values became columns (active_features_ exists only in older scikit-learn releases)
enc.active_features_

array([0, 1, 3], dtype=int64)

In [41]:
enc = preprocessing.OneHotEncoder(n_values=[2,3,4])
#n_values sets the number of categories per feature: 2 cols for the 1st feature, 3 for the 2nd & 4 for the last (older scikit-learn API)
enc.fit_transform([[0, 0, 3], [1, 1, 1], [0, 2, 1], [1, 1, 0]]).toarray()

array([[ 1.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  1.],
       [ 0.,  1.,  0.,  1.,  0.,  0.,  1.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  1.,  0.,  1.,  0.,  0.],
       [ 0.,  1.,  0.,  1.,  0.,  1.,  0.,  0.,  0.]])

In [34]:
enc.active_features_

array([0, 1, 2, 3, 4, 5, 6, 7, 8], dtype=int64)
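
The `n_values` parameter and the `active_features_` attribute used above belong to older scikit-learn releases. A hedged sketch of the same encoding on a more recent version, assuming the `categories` parameter and the `categories_` attribute are available in the installed release:

```python
from sklearn.preprocessing import OneHotEncoder

# List the allowed values of each feature explicitly:
# 2 values for the 1st feature, 3 for the 2nd, 4 for the 3rd
enc = OneHotEncoder(categories=[[0, 1], [0, 1, 2], [0, 1, 2, 3]])
encoded = enc.fit_transform([[0, 0, 3], [1, 1, 1], [0, 2, 1], [1, 1, 0]]).toarray()

print(encoded)
# One array of category values per input feature
print(enc.categories_)
```

The resulting matrix has the same 2 + 3 + 4 = 9 columns as the output above.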

##### DictVectorizer

In [43]:
from sklearn.feature_extraction import DictVectorizer
dv = DictVectorizer()

In [68]:
import pandas as pd

In [56]:
measurements = [
    {'city': 'Dubai', 'temperature': 33.},
    {'city': 'London', 'temperature': 12.},
    {'city': 'San Francisco', 'temperature': 18.},
]

df = pd.DataFrame(measurements)

In [61]:
dv.fit_transform(measurements).toarray()

array([[  1.,   0.,   0.,  33.],
       [  0.,   1.,   0.,  12.],
       [  0.,   0.,   1.,  18.]])

In [67]:
#orient='records' converts the DataFrame into a list of dicts, one per row
dv.fit_transform(df.to_dict(orient='records')).toarray()

array([[  1.,   0.,   0.,  33.],
       [  0.,   1.,   0.,  12.],
       [  0.,   0.,   1.,  18.]])

In [69]:
dv.feature_names_

['city=Dubai', 'city=London', 'city=San Francisco', 'temperature']
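
Once fitted, the vectorizer keeps its column layout. A category that was not seen during `fit` has no column, so it is silently dropped at transform time. A small sketch with a hypothetical new record for 'Paris':

```python
from sklearn.feature_extraction import DictVectorizer

measurements = [
    {'city': 'Dubai', 'temperature': 33.},
    {'city': 'London', 'temperature': 12.},
    {'city': 'San Francisco', 'temperature': 18.},
]

dv = DictVectorizer()
dv.fit(measurements)

# 'city=Paris' was never seen, so all city columns stay 0;
# the numeric temperature is passed through unchanged
unseen = dv.transform([{'city': 'Paris', 'temperature': 20.}]).toarray()
print(unseen)
```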

##### CountVectorizer

In [85]:
corpus = [
     'This is the first document awesome food.',
     'This is the second second document.',
     'And the third one the is mission impossible.',
     'Is this the first document?',
]

In [86]:
df = pd.DataFrame({'Text':corpus})
df

                                           Text
0      This is the first document awesome food.
1           This is the second second document.
2  And the third one the is mission impossible.
3                   Is this the first document?


In [87]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

In [88]:
cv.fit_transform(df.Text).toarray()

array([[0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1],
       [0, 0, 1, 0, 0, 0, 1, 0, 0, 2, 1, 0, 1],
       [1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 2, 1, 0],
       [0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1]], dtype=int64)

In [89]:
cv.vocabulary_

{'and': 0,
 'awesome': 1,
 'document': 2,
 'first': 3,
 'food': 4,
 'impossible': 5,
 'is': 6,
 'mission': 7,
 'one': 8,
 'second': 9,
 'the': 10,
 'third': 11,
 'this': 12}

In [94]:
cv = CountVectorizer(stop_words=['the','is'])
cv.fit_transform(df.Text).toarray()

array([[0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1],
       [0, 0, 1, 0, 0, 0, 0, 0, 2, 0, 1],
       [1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0],
       [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1]], dtype=int64)

In [95]:
cv.vocabulary_

{'and': 0,
 'awesome': 1,
 'document': 2,
 'first': 3,
 'food': 4,
 'impossible': 5,
 'mission': 6,
 'one': 7,
 'second': 8,
 'third': 9,
 'this': 10}

In [96]:
cv = CountVectorizer(stop_words='english')
cv.fit_transform(df.Text).toarray()

array([[1, 1, 1, 0, 0, 0],
       [0, 1, 0, 0, 0, 2],
       [0, 0, 0, 1, 1, 0],
       [0, 1, 0, 0, 0, 0]], dtype=int64)

In [100]:
cv = CountVectorizer(vocabulary=['mission','food'])
cv.fit_transform(df.Text).toarray()

array([[0, 1],
       [0, 0],
       [1, 0],
       [0, 0]], dtype=int64)
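
A fitted `CountVectorizer` can count words in new text using the vocabulary it already learned. Words outside that vocabulary are ignored. A minimal sketch with a made-up sentence:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document awesome food.',
    'This is the second second document.',
    'And the third one the is mission impossible.',
    'Is this the first document?',
]

cv = CountVectorizer()
cv.fit(corpus)

# 'was' is not in the learned vocabulary, so it is not counted
new_counts = cv.transform(['The first mission was the second mission']).toarray()[0]

# Look up individual counts through the word -> column index mapping
print(new_counts[cv.vocabulary_['mission']])
print(new_counts[cv.vocabulary_['the']])
```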

#### Imputation of missing values
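
The notes leave this section empty. As one possible illustration, assuming a scikit-learn version that ships `sklearn.impute.SimpleImputer`, missing entries (`np.nan`) can be replaced by the mean of their column. The small array is invented for the example:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value in the first column
X_missing = np.array([[1.,     2.],
                      [np.nan, 3.],
                      [7.,     6.]])

# Replace each NaN with the mean of the observed values in its column
imputer = SimpleImputer(strategy='mean')
X_filled = imputer.fit_transform(X_missing)

print(X_filled)
```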

#### Generating polynomial features
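
The notes leave this section empty. A short sketch with `PolynomialFeatures`, using a made-up single sample: for two inputs `a` and `b`, degree 2 produces the columns `1, a, b, a^2, a*b, b^2`.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical sample with a=2, b=3
X = np.array([[2., 3.]])

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

print(X_poly)
```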

#### Custom Transformer
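
The notes leave this section empty. One simple way to build a custom transformer, shown here only as a sketch, is to wrap an ordinary function with `FunctionTransformer`. The choice of `np.log1p` and the sample array are assumptions for illustration:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Hypothetical non-negative data
X = np.array([[0., 1.],
              [2., 3.]])

# Wrap a plain function so it offers fit / transform like any transformer
log_transformer = FunctionTransformer(np.log1p)
X_log = log_transformer.fit_transform(X)

print(X_log)
```

Because the wrapped object follows the usual fit / transform interface, it can be used wherever the built-in transformers above are used.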