# Pre Processing

#### Rescaling Data

1. MinMaxScaler
2. MaxAbsScaler
3. RobustScaler
4. StandardScaler

# Rescaling Data

When data is made up of **attributes with varying scales**, many machine learning algorithms benefit from rescaling the attributes so that they all share the same scale. This helps the optimization algorithms at the core of many learners, such as gradient descent.

It also helps algorithms that weight their inputs, such as regression and neural networks, and algorithms that use distance measures, such as K-Nearest Neighbors.
Data can be rescaled using several techniques, some of which are covered below.

When features differ widely in scale or units, classifiers and regressors that rely on Euclidean distance, such as k-nearest neighbours, tend to perform poorly or sub-optimally, because the feature with the largest numeric range dominates the distance. The same goes for models trained with gradient-descent-based optimisation, such as logistic regression, Support Vector Machines and neural networks. Tree-based models are the main exception: they split on one feature at a time, so they are largely insensitive to feature scale.
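
The effect on distance-based methods can be sketched with a tiny example. The data below is hypothetical (an age column in years and an income column in dollars), and StandardScaler is used only as one possible rescaling; any of the scalers in this notebook would illustrate the same point.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: column 0 is age in years, column 1 is income in dollars.
X = np.array([
    [25.0, 50_000.0],
    [60.0, 50_500.0],
    [26.0, 58_000.0],
])

def euclidean(a, b):
    """Euclidean distance between two 1-D arrays."""
    return float(np.linalg.norm(a - b))

# Raw distances from the first sample: the income column dominates.
d_raw_01 = euclidean(X[0], X[1])  # 35 years apart, 500 dollars apart
d_raw_02 = euclidean(X[0], X[2])  # 1 year apart, 8000 dollars apart

# After rescaling, both features contribute on a comparable footing.
X_scaled = StandardScaler().fit_transform(X)
d_scaled_01 = euclidean(X_scaled[0], X_scaled[1])
d_scaled_02 = euclidean(X_scaled[0], X_scaled[2])

print("raw:   ", d_raw_01, d_raw_02)
print("scaled:", d_scaled_01, d_scaled_02)
```

On the raw data the second distance is many times larger than the first purely because of the income column; after rescaling the two distances are of similar size.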

**NOTE:**
1. Before scaling, check for outliers and treat them.
2. See the EDA Notebook for various outlier treatment methods.
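
A minimal sketch of why outliers matter before scaling. The feature values are made up, and clipping to the 1.5 * IQR fences is only one common treatment, assumed here for illustration; it is not necessarily the method used in the EDA Notebook.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature: five typical values plus one extreme outlier.
values = np.array([[10.0], [12.0], [11.0], [13.0], [12.0], [1000.0]])

# Without treatment, the outlier pins the maximum and squeezes
# every typical value into a tiny interval near 0.
scaled = MinMaxScaler().fit_transform(values)
print(scaled.ravel())

# One possible treatment (an assumption, not the only option):
# clip values to the 1.5 * IQR fences before scaling.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
clipped = np.clip(values, q1 - 1.5 * iqr, q3 + 1.5 * iqr)

scaled_clipped = MinMaxScaler().fit_transform(clipped)
print(scaled_clipped.ravel())
```

After clipping, the typical values spread over a much larger part of the [0, 1] range.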

### MinMaxScaler

Transform features by **scaling each feature to a given range**.
This estimator scales and translates each feature individually so that it lies in the given range on the training set, e.g. between zero and one.<br><br>The transformation is given by:<br>`X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))`<br>`X_scaled = X_std * (max - min) + min`<br>where `min, max = feature_range`.
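
The transformation above can be checked by hand with NumPy. This is a small sketch; the target range of -1 to 1 and the names `lo` and `hi` are chosen only for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[-1, 2], [-0.5, 6], [0, 10], [1, 18]], dtype=float)
lo, hi = -1.0, 1.0  # illustrative target range (min, max)

# Apply the two-step formula column by column.
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_manual = X_std * (hi - lo) + lo

# The same result from scikit-learn, using feature_range.
X_sklearn = MinMaxScaler(feature_range=(lo, hi)).fit_transform(X)

print(X_manual)
print(np.allclose(X_manual, X_sklearn))
```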

In [20]:
from sklearn.preprocessing import MinMaxScaler

# Four samples, two features
data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
scaler = MinMaxScaler()
print(scaler.fit(data))        # fit learns the per-feature min and max
print(scaler.transform(data))  # maps each feature onto [0, 1]

MinMaxScaler()
[[0.   0.  ]
 [0.25 0.25]
 [0.5  0.5 ]
 [1.   1.  ]]


In [21]:
df

Unnamed: 0,size,price,color,class,boh
0,XXL,8.0,black,class 1,22.0
1,L,9.0,gray,class 2,20.0
2,XL,10.0,blue,class 2,19.0
3,M,9.0,orange,class 1,17.0
4,M,11.0,green,class 3,20.0
5,M,7.0,red,class 1,22.0


In [22]:
# MinMaxScaler on a column of a DataFrame
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df['price'] = scaler.fit_transform(df[['price']])
df

Unnamed: 0,size,price,color,class,boh
0,XXL,0.25,black,class 1,22.0
1,L,0.5,gray,class 2,20.0
2,XL,0.75,blue,class 2,19.0
3,M,0.5,orange,class 1,17.0
4,M,1.0,green,class 3,20.0
5,M,0.0,red,class 1,22.0



### MaxAbsScaler

This estimator **scales each feature individually** so that the maximal absolute value of each feature in the training set is 1.0. It does not shift/center the data, and thus does not destroy any sparsity.<br><br>
This scaler can also be applied to sparse CSR or CSC matrices.
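
A short sketch of the sparse case, assuming SciPy is available. The matrix is a made-up example; the point is that the result stays sparse and keeps the same number of stored non-zero entries.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler

# Small illustrative matrix stored in CSR format.
X_sparse = sparse.csr_matrix(np.array([
    [1.0, 0.0, 2.0],
    [2.0, 0.0, 0.0],
    [0.0, 1.0, -1.0],
]))

# Each column is divided by its maximum absolute value; zeros stay zeros.
X_scaled = MaxAbsScaler().fit_transform(X_sparse)

print(sparse.issparse(X_scaled))    # the output is still a sparse matrix
print(X_sparse.nnz, X_scaled.nnz)   # stored non-zeros before and after
print(X_scaled.toarray())
```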

In [24]:
from sklearn.preprocessing import MaxAbsScaler

X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]
# fit learns the maximum absolute value of each column
transformer = MaxAbsScaler().fit(X)
# transform divides each column by its maximum absolute value
transformer.transform(X)


array([[ 0.5, -1. ,  1. ],
       [ 1. ,  0. ,  0. ],
       [ 0. ,  1. , -0.5]])

In [25]:
df

Unnamed: 0,size,price,color,class,boh
0,XXL,0.25,black,class 1,22.0
1,L,0.5,gray,class 2,20.0
2,XL,0.75,blue,class 2,19.0
3,M,0.5,orange,class 1,17.0
4,M,1.0,green,class 3,20.0
5,M,0.0,red,class 1,22.0


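In [26]:
# MaxAbsScaler on a column of a DataFrame
# price already has a maximum absolute value of 1.0, so its values are unchanged
from sklearn.preprocessing import MaxAbsScaler
transformer = MaxAbsScaler().fit(df[['price']])
df['price'] = transformer.transform(df[['price']])
df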

Unnamed: 0,size,price,color,class,boh
0,XXL,0.25,black,class 1,22.0
1,L,0.5,gray,class 2,20.0
2,XL,0.75,blue,class 2,19.0
3,M,0.5,orange,class 1,17.0
4,M,1.0,green,class 3,20.0
5,M,0.0,red,class 1,22.0



### RobustScaler

Scale features using statistics that are **robust to outliers**. RobustScaler transforms the feature vector by subtracting the median and then dividing by the interquartile range (75th percentile minus 25th percentile).<br><br>
**Centering and scaling happen independently on each feature** by computing the relevant statistics on the samples in the training set. The median and interquartile range are then stored to be used on later data using the transform method.<br>
Standardization of a dataset is a common requirement for many machine learning estimators. Typically this is done by removing the mean and scaling to unit variance. However, outliers can often influence the sample mean / variance in a negative way. In such cases, the median and the interquartile range often give better results.<br><br>**Use RobustScaler to reduce the effects of outliers**, relative to MinMaxScaler.
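
The median and interquartile range computation can be reproduced by hand, and the stored statistics reused on later data. This is a sketch with default RobustScaler settings; `X_new` is a hypothetical unseen sample.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

X_train = np.array([[1.0, -2.0, 2.0],
                    [-2.0, 1.0, 3.0],
                    [4.0, 1.0, -2.0]])

# Per-feature median and interquartile range, computed manually.
median = np.median(X_train, axis=0)
q25, q75 = np.percentile(X_train, [25, 75], axis=0)
iqr = q75 - q25
X_manual = (X_train - median) / iqr

# RobustScaler stores the same statistics in center_ and scale_.
scaler = RobustScaler().fit(X_train)
print(np.allclose(scaler.center_, median))
print(np.allclose(scaler.scale_, iqr))
print(np.allclose(scaler.transform(X_train), X_manual))

# Later data is transformed with the statistics learned on the training set.
X_new = np.array([[10.0, 0.0, 0.0]])  # hypothetical unseen sample
X_new_scaled = scaler.transform(X_new)
print(X_new_scaled)
```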

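In [27]:
from sklearn.preprocessing import RobustScaler

X = [[ 1., -2.,  2.],
     [-2.,  1.,  3.],
     [ 4.,  1., -2.]]
# fit learns the median and interquartile range of each column
transformer = RobustScaler().fit(X)
# transform subtracts the median and divides by the interquartile range
transformer.transform(X)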


array([[ 0. , -2. ,  0. ],
       [-1. ,  0. ,  0.4],
       [ 1. ,  0. , -1.6]])

In [28]:
df

Unnamed: 0,size,price,color,class,boh
0,XXL,0.25,black,class 1,22.0
1,L,0.5,gray,class 2,20.0
2,XL,0.75,blue,class 2,19.0
3,M,0.5,orange,class 1,17.0
4,M,1.0,green,class 3,20.0
5,M,0.0,red,class 1,22.0


In [29]:
# RobustScaler on a column of a DataFrame
from sklearn.preprocessing import RobustScaler
transformer = RobustScaler().fit(df[['boh']])
df['boh'] = transformer.transform(df[['boh']])
df

Unnamed: 0,size,price,color,class,boh
0,XXL,0.25,black,class 1,0.888889
1,L,0.5,gray,class 2,0.0
2,XL,0.75,blue,class 2,-0.444444
3,M,0.5,orange,class 1,-1.333333
4,M,1.0,green,class 3,0.0
5,M,0.0,red,class 1,0.888889



### StandardScaler

StandardScaler standardizes a feature by subtracting the mean and then scaling to unit variance, that is, dividing all the values by the standard deviation. Strictly speaking this is standardization rather than scaling to a fixed range: the output is not bounded to an interval such as [0, 1].

**When to use**

Use it when a model expects zero-mean, unit-variance inputs, ideally on features that are already roughly normally distributed. It shifts and rescales a feature but does not change the shape of its distribution.

**NOTE**
1. The resulting distribution has a mean of 0 and a standard deviation of 1.
2. If the feature contains outliers, they distort the mean and inflate the standard deviation, so most of the data is squeezed into a small interval.
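
Both notes can be illustrated with a small sketch. The arrays are made up for illustration; the manual formula uses the population standard deviation (`ddof=0`), which is what StandardScaler uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1, 1], [2, 5], [3, 10], [4, 15]], dtype=float)

# Manual standardization: subtract the mean, divide by the standard deviation.
mean = X.mean(axis=0)
std = X.std(axis=0)  # ddof=0 (population standard deviation)
X_manual = (X - mean) / std

X_sklearn = StandardScaler().fit_transform(X)
print(np.allclose(X_manual, X_sklearn))
print(X_sklearn.mean(axis=0), X_sklearn.std(axis=0))

# Effect of an outlier: the four typical values end up in a very narrow band.
with_outlier = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])
z = StandardScaler().fit_transform(with_outlier).ravel()
print(z)
```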


In [30]:
import pandas as pd
import scipy.stats as ss
from sklearn.preprocessing import StandardScaler

data = [[1, 1, 1, 1, 1], [2, 5, 10, 50, 100], [3, 10, 20, 150, 200], [4, 15, 40, 200, 300]]

df = pd.DataFrame(data, columns=['N0', 'N1', 'N2', 'N3', 'N4']).astype('float64')

sc_X = StandardScaler()
# fit_transform returns a NumPy array, not a DataFrame
df = sc_X.fit_transform(df)

# To get a DataFrame back for further analysis:
# df = pd.DataFrame(df, columns=['N0', 'N1', 'N2', 'N3', 'N4'])

# Summary statistics can be obtained for each scaled column
num_cols = df.shape[1]
for i in range(num_cols):
    col = df[:, i]
    col_stats = ss.describe(col)
    print(col_stats)

DescribeResult(nobs=4, minmax=(-1.3416407864998738, 1.3416407864998738), mean=0.0, variance=1.3333333333333333, skewness=0.0, kurtosis=-1.3599999999999999)
DescribeResult(nobs=4, minmax=(-1.2828087129930659, 1.3778315806221817), mean=-5.551115123125783e-17, variance=1.3333333333333337, skewness=0.11003776770595125, kurtosis=-1.394993095506219)
DescribeResult(nobs=4, minmax=(-1.155344148338584, 1.53471088361394), mean=0.0, variance=1.3333333333333333, skewness=0.48089217736510326, kurtosis=-1.1471008824318165)
DescribeResult(nobs=4, minmax=(-1.2604572012883055, 1.2668071116222517), mean=-5.551115123125783e-17, variance=1.3333333333333333, skewness=0.0056842140599118185, kurtosis=-1.6438177182479734)
DescribeResult(nobs=4, minmax=(-1.338945389819976, 1.3434309690153527), mean=5.551115123125783e-17, variance=1.3333333333333333, skewness=0.005374558840039456, kurtosis=-1.3619131970819205)
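
The variance of about 1.33 in the output above, rather than 1, comes from a difference in conventions: StandardScaler divides by the population standard deviation (`ddof=0`), while `scipy.stats.describe` reports the sample variance (`ddof=1`) by default. With 4 samples the ratio is 4/3. A small sketch on a single made-up column:

```python
import numpy as np
import scipy.stats as ss
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])
z = StandardScaler().fit_transform(X).ravel()

var_population = np.var(z, ddof=0)                  # what StandardScaler normalizes to 1
var_sample = np.var(z, ddof=1)                      # n / (n - 1) times larger
var_describe = ss.describe(z).variance              # describe defaults to ddof=1
var_describe_pop = ss.describe(z, ddof=0).variance  # matches the population variance

print(var_population, var_sample, var_describe, var_describe_pop)
```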
