# Intro to Data Science
## Part III. - Data Transformation

### Table of contents

##### Data Transformation
- <a href="#What-is-Data-Transformation?">Theory</a>
- <a href="#1.-Numerical-features">Numerical features</a>
- <a href="#2.-Nominal-features">Nominal features</a>

##### Models:
- <a href="#Intermission---instance-based-classifiers">kNN</a>
- <a href="#Intermission-II.---Model-of-the-week">Decision tree</a>

## What is Data Transformation?
During data transformation the goal is to prepare the data to be usable in the modelling steps. These transformations include normalization, standardization, text processing, generating complex features from basic ones, or any kind of data mapping.

_"...a data transformation converts a set of data values from the data format of a source data system into the data format of a destination data system._

_Data transformation can be divided into two steps:_
1. _data mapping maps data elements from the source data system to the destination data system and captures any transformation that must occur_
2. _code generation that creates the actual transformation program"_
from: <a href="https://en.wikipedia.org/wiki/Data_transformation">Wikipedia</a>

### Why is it important?

Most models are sensitive to the scale, distribution, and encoding of their input, so the data has to be transformed into a more suitable format. Unfortunately the data you start with is usually in terrible shape:

- It has missing values
- It is full of outliers
- The data is distorted by noise
- The features are in different scales
- The features are correlated/redundant/uninformative


### Tools

- scaling/binarizing
- normalizing/standardizing
- outlier detecting
- filtering
- mathematical transformations
- representational changes
- etc.

---

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

import numpy as np
import scipy.sparse as sp
import pandas as pd

from copy import deepcopy

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

In [None]:
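def set2val(row, col, val):
    # Overwrite the chosen column only for rows flagged in 'tmp'
    if row['tmp']:
        row[col] = val
    return row

def add_missing(df, cols, inf=False, percent=.2):
    # Work on a copy so the helper columns never leak into the caller's frame
    df = df.copy()
    nrows, _ = df.shape
    missing = np.nan if not inf else np.inf
    # Flag roughly `percent` of the rows and pick one target column per row
    df['tmp'] = np.random.rand(nrows)
    df['tmp'] = np.where(df['tmp'] < percent, 1, 0)
    df['tgt_col'] = np.random.choice(cols, nrows)
    df = df.apply(lambda x: set2val(x, x['tgt_col'], missing), axis=1)
    del df['tmp']
    del df['tgt_col']
    return df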

def apply_scaler(df, cols, scaler):
    df = df.copy()
    df[cols] = scaler.transform(df[cols])
    return df
        
def gridplot(X, y=None):
    if y is not None:
        data = pd.concat((X, y), axis=1)
        fig = sns.PairGrid(data, vars=numerical_cols, hue='Label')
    else:
        fig = sns.PairGrid(X, vars=numerical_cols)
    fig = fig.map_diag(plt.hist)
    fig = fig.map_offdiag(plt.scatter)
    fig = fig.add_legend()
    return fig

## Intermission - instance-based classifiers

### K-Nearest Neighbour classification

#### Philosophy
<img src="https://github.com/fulibacsi/notebooks/raw/d72172b305e12170c2e8bfe6d0e9e646103a8f9b/szisz/data/pics/kacsa.png" width=400 align="left">

<br style="clear:left;"/>


If it looks like a duck and quacks like a duck, it is probably a duck. 

#### Taxonomy & Definition
>_"In machine learning, instance-based learning (sometimes called memory-based learning) is a family of learning algorithms that, instead of performing explicit generalization, compares new problem instances with instances seen in training, which have been stored in memory."_ - <a href="https://en.wikipedia.org/wiki/Instance-based_learning">Wiki</a>


#### Algorithm
The basic <a href="http://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbors-classification">`kNN`</a> algorithm simply stores the training points with their labels; no coefficients are fitted. During classification, a simple majority vote among the neighbours determines the class label.

It is called `k` nearest neighbours for a reason: `k` is the number of closest data points considered in the majority vote. When a new data point arrives, the algorithm finds its `k` nearest data points in the training set. The labels of these `k` points are then used to determine the new entry's label. It is possible to provide a `weights` parameter as well. In this case the labels are weighted according to the weighting function. The most common weighting method is distance-based: the closer a labelled point is, the more weight it gets.

There are different strategies for setting the <a href="https://www.analyticsvidhya.com/blog/2014/10/introduction-k-neighbours-algorithm-clustering/">ideal `k` value</a>. `k = 1` is a special case where the new data entry gets the label of the closest training point.
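The voting procedure described above can be sketched in a few lines of NumPy. This is only an illustration under simple assumptions (Euclidean distance, a hypothetical helper called `knn_predict`, made-up toy data); it is not how scikit-learn implements `KNeighborsClassifier` internally.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3, weights="uniform"):
    """Classify x_new by a vote of its k nearest training points."""
    # Euclidean distance from the new point to every stored training point
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    votes = {}
    for i in nearest:
        # Uniform voting counts each neighbour once; distance weighting gives
        # closer neighbours a larger say (the epsilon avoids division by zero)
        w = 1.0 if weights == "uniform" else 1.0 / (dists[i] + 1e-12)
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + w
    # The label with the largest (weighted) vote wins
    return max(votes, key=votes.get)

# Toy data: three points near the origin (class 0), two far away (class 1)
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0]])
y_toy = np.array([0, 0, 0, 1, 1])

pred_near_origin = knn_predict(X_toy, y_toy, np.array([0.2, 0.1]), k=3)
pred_far = knn_predict(X_toy, y_toy, np.array([4.0, 4.0]), k=3, weights="distance")
```

With `k=1` the function reduces to the special case of copying the label of the single closest training point.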

#### Shortcomings

If the data is high dimensional, the <a href="https://en.wikipedia.org/wiki/Curse_of_dimensionality">curse of dimensionality</a> affects the algorithm's performance.  
There is no clear method to determine the best distance metric; it always depends on the data.  
For best performance, the data should be preprocessed and transformed.
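Why preprocessing matters for a distance-based model can be seen in a tiny example. The numbers below are invented for illustration (two features loosely modelled on income and credit history, and an assumed income range of 0 to 10000); the point is only that the feature with the largest scale dominates the Euclidean distance until the features are brought to a common range.

```python
import numpy as np

# Three hypothetical applicants: [income, credit_history]
a = np.array([5000.0, 1.0])
b = np.array([5100.0, 0.0])   # similar income, opposite credit history
c = np.array([9000.0, 1.0])   # very different income, same credit history

# On the raw scale the income difference decides everything
raw_ab = np.linalg.norm(a - b)
raw_ac = np.linalg.norm(a - c)

# After dividing each feature by its assumed range, both features
# contribute on a comparable scale
scale = np.array([10000.0, 1.0])
scaled_ab = np.linalg.norm(a / scale - b / scale)
scaled_ac = np.linalg.norm(a / scale - c / scale)

print(raw_ab < raw_ac)        # b is the nearer neighbour of a before scaling
print(scaled_ab > scaled_ac)  # c is the nearer neighbour of a after scaling
```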

In [None]:
from sklearn.neighbors import KNeighborsClassifier

pipe = Pipeline(steps=[('knn', KNeighborsClassifier())])

### Reading the loan dataset

In [None]:
df = pd.read_csv('./data/loan.csv', index_col=0)
# Transform target values early for plotting reasons
df['Label'] = LabelEncoder().fit_transform(df['Target'].values)

# Intentionally left 'Loan_ID' out
nominal_cols = ['Gender', 'Married', 'Education',
                'Dependents', 'Self_Employed', 'Property_Area']
numerical_cols = ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
                  'Loan_Amount_Term', 'Credit_History']
target_col = ['Label']

X = df[numerical_cols + nominal_cols]
y = df[target_col]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3., random_state=41)

---

## 1. Numerical features


### <a href="http://pandas.pydata.org/pandas-docs/stable/missing_data.html">missing values</a>

In [None]:
missing = add_missing(df, numerical_cols)
missing.describe()

- dropping NAs

In [None]:
dropped = missing.dropna(axis=0)
dropped.shape

- fill NAs

In [None]:
filled = missing.fillna(value=0)
filled.describe()

- interpolate NAs

In [None]:
interpolated = missing.interpolate(method='nearest')
interpolated.describe()

### <a href="http://pandas.pydata.org/pandas-docs/stable/missing_data.html#values-considered-missing">infinite values</a>

In [None]:
missing = add_missing(df, numerical_cols, inf=True)
missing.describe()

In [None]:
# Replace infinite values with NaN so that pandas treats them as missing
print(missing.replace([np.inf, -np.inf], np.nan).dropna(axis=0).shape)

In [None]:
print(missing.replace([np.inf, -np.inf], np.nan).fillna(value=0).describe())

In [None]:
print(missing.replace([np.inf, -np.inf], np.nan).interpolate().describe())

### <a href="http://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling">different scales</a>

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
pipe.fit(X_train[numerical_cols], y_train)
accuracy_score(y_test.values, pipe.predict(X_test[numerical_cols]))

---

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
scaled_pipe = deepcopy(pipe)
minmax = MinMaxScaler()
scaled_pipe.steps.insert(0, ('minmax', minmax))

scaled_pipe.fit(X_train[numerical_cols], y_train)
accuracy_score(y_test.values, scaled_pipe.predict(X_test[numerical_cols]))

In [None]:
gridplot(apply_scaler(X_train, numerical_cols, minmax), y_train)

#### Exercise: Try the same experiment with logistic regression
- Create a pipe containing a logistic regressor and measure its accuracy
- Create another pipe with a MinMaxScaler and a logistic regressor. Compare the results and try to explain the difference.
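For reference, the arithmetic behind min-max scaling is short enough to write out by hand. The sketch below uses a hypothetical helper `minmax_scale` and made-up numbers; it assumes no column is constant (otherwise the denominator would be zero) and mirrors the idea of learning the range on the training data and reusing it on the test data.

```python
import numpy as np

def minmax_scale(X_fit, X_apply):
    """Rescale X_apply column-wise using the min and max learned from X_fit."""
    col_min = X_fit.min(axis=0)
    col_max = X_fit.max(axis=0)
    # Values inside the fitted range land in [0, 1]
    return (X_apply - col_min) / (col_max - col_min)

train = np.array([[1000.0, 10.0], [3000.0, 20.0], [5000.0, 40.0]])
test = np.array([[2000.0, 25.0], [6000.0, 10.0]])

train_scaled = minmax_scale(train, train)
# Test values outside the training range fall outside [0, 1]
test_scaled = minmax_scale(train, test)
```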

### <a href="http://scikit-learn.org/stable/modules/preprocessing.html#normalization">unnormalized data</a>

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
normalized_pipe = deepcopy(pipe)
standard = StandardScaler()
normalized_pipe.steps.insert(0, ('standard', standard))

normalized_pipe.fit(X_train[numerical_cols], y_train)
accuracy_score(y_test.values, normalized_pipe.predict(X_test[numerical_cols]))

In [None]:
gridplot(apply_scaler(X_train, numerical_cols, standard), y_train)

#### Exercise: Try the same experiment with logistic regression
- Create a pipe containing a logistic regressor and measure its accuracy
- Create another pipe with a StandardScaler and a logistic regressor. Compare the results and try to explain the difference.
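Standardization subtracts the column mean and divides by the column standard deviation. A minimal NumPy sketch follows, with a hypothetical helper `standardize` and toy values; it uses the population standard deviation (`ddof=0`), which is what `StandardScaler` uses as well.

```python
import numpy as np

def standardize(X_fit, X_apply):
    """Centre and scale X_apply column-wise using statistics learned from X_fit."""
    mean = X_fit.mean(axis=0)
    std = X_fit.std(axis=0)  # population standard deviation (ddof=0)
    return (X_apply - mean) / std

train = np.array([[1000.0, 10.0], [3000.0, 20.0], [5000.0, 40.0]])
train_std = standardize(train, train)

# Every column of the fitted data now has mean 0 and standard deviation 1
col_means = train_std.mean(axis=0)
col_stds = train_std.std(axis=0)
```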

### correlated features

In [None]:
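# Correlations are only defined for numeric columns, so select them explicitly
sns.heatmap(df[numerical_cols + target_col].corr(), robust=True)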

Not now. More about this topic in the next issue of DS101. Cough-cough-<a href="http://scikit-learn.org/stable/modules/preprocessing.html#scaling-data-with-outliers" style="color: black; text-decoration: none; cursor: default;">PCA</a>-cough.

### <a href="http://scikit-learn.org/stable/modules/preprocessing.html#scaling-data-with-outliers">outliers</a>

<img src="https://i0.wp.com/flowingdata.com/wp-content/uploads/2014/09/outlier.gif" align="left" width="400">

<br style="clear:left;"/>

In [None]:
from sklearn.preprocessing import RobustScaler

In [None]:
robust_pipe = deepcopy(pipe)
robust = RobustScaler()
robust_pipe.steps.insert(0, ('robust', robust))

robust_pipe.fit(X_train[numerical_cols], y_train)
accuracy_score(y_test.values, robust_pipe.predict(X_test[numerical_cols]))

#### Exercise: Try the same experiment with logistic regression
- Create a pipe containing a logistic regressor and measure its accuracy
- Create another pipe with a RobustScaler and a logistic regressor. Compare the results and try to explain the difference.
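To see why a robust scaler helps with outliers, compare it with min-max scaling on a column that contains one extreme value. The sketch below is an illustration with invented incomes and a hypothetical helper `robust_scale`; it centres on the median and divides by the interquartile range, which are the defaults of `RobustScaler`.

```python
import numpy as np

def robust_scale(X_fit, X_apply):
    """Centre on the median and scale by the interquartile range (IQR)."""
    median = np.median(X_fit, axis=0)
    q1, q3 = np.percentile(X_fit, [25, 75], axis=0)
    return (X_apply - median) / (q3 - q1)

# Five ordinary incomes and one extreme outlier
incomes = np.array([[2000.0], [2500.0], [3000.0], [3500.0], [4000.0], [100000.0]])

robust = robust_scale(incomes, incomes)
minmax = (incomes - incomes.min(axis=0)) / (incomes.max(axis=0) - incomes.min(axis=0))

# Spread of the five ordinary values under each scaling
robust_spread = robust[:5].max() - robust[:5].min()
minmax_spread = minmax[:5].max() - minmax[:5].min()
```

Under min-max scaling the single outlier squeezes the ordinary values into a tiny interval near zero, while the median and IQR are barely affected by it.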

### <a href="http://scikit-learn.org/stable/modules/preprocessing.html#feature-binarization">binarization</a>

In [None]:
from sklearn.preprocessing import Binarizer

In [None]:
binarizer = Binarizer(threshold=101.0)
# The binarizer returns a 2D array; flatten it before storing it as one column
X_train['BinLoanAmount'] = binarizer.fit_transform(X_train[['LoanAmount']]).ravel()
X_test['BinLoanAmount'] = binarizer.transform(X_test[['LoanAmount']]).ravel()

In [None]:
numwithbincols = [col for col in numerical_cols if col != 'LoanAmount'] + ['BinLoanAmount']
pipe.fit(X_train[numwithbincols], y_train)

In [None]:
accuracy_score(y_test.values, pipe.predict(X_test[numwithbincols]))

---

## Intermission II. - Model of the week

### Decision Trees

Decision trees are a type of supervised machine learning algorithm that can be used to predict both categorical and continuous values (when the target is continuous they are called *regression trees*). **Basically, the algorithm divides the training population along the values of its attributes and assigns a prediction value to each of the resulting subpopulations.** For a **categorical** target variable, the assigned prediction is **the mode of the target variable values** in the particular subpopulation. For a continuous target it is **the mean of the values**.  
A familiar example, from the <a href="https://en.wikipedia.org/wiki/Decision_tree_learning">Wikipedia page on decision trees</a>:  
<img src="https://upload.wikimedia.org/wikipedia/commons/f/f3/CART_tree_titanic_survivors.png">  
A tree showing survival of passengers on the Titanic ("sibsp" is the number of spouses or siblings aboard). The figures under the leaves show the probability of outcome and the percentage of observations in the leaf.

#### So, where's the trick?  
In deciding *where to split the population*. **The goal is to make these subpopulations as homogeneous in the target variable as we can.** (So naturally, if the target variable is completely unrelated to an attribute, the decision tree won't split the population according to that attribute.) There are multiple *metrics* used to decide which split is good at a particular point in the tree, but most of them calculate an **"impurity"** of the parent node and choose the split that reduces this impurity the most. Most of the time only the local node is considered, meaning **there is no guarantee that the final tree will be the best globally**.  
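One common impurity metric is the Gini impurity. The sketch below is a simplified illustration rather than scikit-learn's implementation: it uses hypothetical helpers `gini` and `best_threshold`, a single numerical feature, and toy data, and it greedily picks the threshold that reduces the impurity of the parent node the most.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions (0 = pure node)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_threshold(x, y):
    """Greedy search for the threshold on one feature that most reduces impurity."""
    parent = gini(y)
    best_t, best_gain = None, 0.0
    values = np.unique(x)
    # Candidate thresholds: midpoints between consecutive distinct values
    for t in (values[:-1] + values[1:]) / 2.0:
        left, right = y[x <= t], y[x > t]
        # Child impurities are weighted by the share of samples in each child
        children = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        gain = parent - children
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

x_toy = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y_toy = np.array([0, 0, 0, 1, 1, 1])
threshold, gain = best_threshold(x_toy, y_toy)
```

Because each node only looks at its own best split, repeating this search recursively is a greedy procedure, which is why the resulting tree is not guaranteed to be globally optimal.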

##### How big/complex a tree do we want to build? Or: When should a node be declared a leaf?
As one of the biggest weaknesses of decision/regression trees is **overfitting**, these are very important questions. There are multiple ways to limit a tree: 
- Set a minimum number of samples required before a node can be split
- Set a minimum number of samples that each leaf must contain
- Set a threshold for the impurity decrease, below which no splits are made
- Set a maximum depth of the tree
- Set the maximum number of variables considered at each split
- Set a maximum number of leaves
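In scikit-learn these limits correspond to constructor parameters of `DecisionTreeClassifier`. The values below are arbitrary placeholders chosen for illustration, and the data is random toy data, so treat this as a sketch of where the knobs are rather than as recommended settings.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

limited_tree = DecisionTreeClassifier(
    min_samples_split=20,        # minimum samples needed to attempt a split
    min_samples_leaf=5,          # minimum samples every leaf must keep
    min_impurity_decrease=0.01,  # skip splits that reduce impurity by less
    max_depth=4,                 # maximum depth of the tree
    max_features=3,              # attributes considered at each split
    max_leaf_nodes=10,           # maximum number of leaves
    random_state=0,
)

# Toy data: 200 samples, 5 features, label driven by the first two features
rng = np.random.RandomState(0)
X_toy = rng.rand(200, 5)
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 1.0).astype(int)

limited_tree.fit(X_toy, y_toy)
depth = limited_tree.get_depth()
n_leaves = limited_tree.get_n_leaves()
```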

Another answer to overfitting is **tree pruning**. This basically means letting the tree grow until every leaf has only a few observations and then, starting from the top or the bottom, getting rid of the splits that don't contribute much to the accuracy of the model.
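One concrete form of pruning available in scikit-learn (version 0.22 or later) is minimal cost-complexity pruning, controlled by the `ccp_alpha` parameter. The sketch below uses noisy toy data and an arbitrary `ccp_alpha` value picked for illustration; it only shows that the pruned tree ends up with fewer leaves than the fully grown one.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X_toy = rng.rand(300, 2)
# The label depends on the first feature; 10% of the labels are flipped as noise
y_toy = (X_toy[:, 0] > 0.5).astype(int)
flip = rng.rand(300) < 0.1
y_toy[flip] = 1 - y_toy[flip]

# Fully grown tree: keeps splitting until it has memorised the noise
full_tree = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)
# Pruned tree: splits whose benefit is below the complexity penalty are removed
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X_toy, y_toy)

n_leaves_full = full_tree.get_n_leaves()
n_leaves_pruned = pruned_tree.get_n_leaves()
print(n_leaves_full, n_leaves_pruned)
```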

#### And why are trees better than, say, logistic regression?
The most important case is when **there is a non-linear relation between the attributes and the target variable**. Another advantage of decision trees is that, despite this flexibility, they are very easy to interpret (just look at the picture above).

#### What is better than a tree? Multiple trees!
To tackle overfitting, an early technique was to randomly sample the training set and build multiple trees from the different samples. The prediction is then made using the "votes" of the trees (either their mean or their mode). This is called **bagging**. **Random forests** enhance this procedure by excluding random attributes from the candidate splits at each node. The reasoning behind this is that if a few attributes have very good explanatory power, they would be used at most of the splits, so the different trees would be strongly correlated, reducing the effect of bagging.  
An important thing to remember:  
> When in doubt, use <s>brute force</s> random forest.
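The bagging idea can be sketched by hand with a few scikit-learn trees. The helpers `bagged_trees` and `vote` below are hypothetical names, the data is synthetic, and the number of trees is arbitrary; in practice `RandomForestClassifier` does all of this for you.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_trees(X, y, n_trees=25, seed=0):
    """Fit n_trees decision trees, each on its own bootstrap sample."""
    rng = np.random.RandomState(seed)
    trees = []
    for _ in range(n_trees):
        # Bootstrap sample: draw len(X) rows with replacement
        idx = rng.randint(0, len(X), size=len(X))
        # max_features="sqrt" adds the random-forest twist: each split
        # only considers a random subset of the attributes
        tree = DecisionTreeClassifier(max_features="sqrt",
                                      random_state=rng.randint(10**6))
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def vote(trees, X):
    """Majority vote (mode) of the individual tree predictions."""
    preds = np.array([t.predict(X) for t in trees]).astype(int)
    return np.array([np.bincount(col).argmax() for col in preds.T])

# Synthetic data: two well separated Gaussian blobs with 4 features
rng = np.random.RandomState(1)
X_toy = np.vstack([rng.normal(0, 1, (100, 4)), rng.normal(3, 1, (100, 4))])
y_toy = np.array([0] * 100 + [1] * 100)

forest = bagged_trees(X_toy, y_toy)
y_vote = vote(forest, X_toy)
train_accuracy = (y_vote == y_toy).mean()
```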

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
tree = DecisionTreeClassifier()

---

## 2. Nominal Features

### Replacing values

In [None]:
replace_map = {'Dependents': {'0': 0, '1': 1, '2': 2, '3+': 4}}
X_train.replace(replace_map).head()

### Label encoding

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
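# Copy the slices so that assigning encoded columns does not touch X_train/X_test
encodedX_train = X_train[nominal_cols].copy()
encodedX_test = X_test[nominal_cols].copy()
for col in nominal_cols:
    encoder = LabelEncoder()
    # Fit on the full column (1D) so every category in either split gets a code
    encoder.fit(X[col])
    encodedX_train[col] = encoder.transform(encodedX_train[col])
    encodedX_test[col] = encoder.transform(encodedX_test[col])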

pipe.fit(encodedX_train, y_train)
accuracy_score(y_test.values, pipe.predict(encodedX_test))

#### Exercise: Try the same experiment with decision tree
- Create a pipe containing a decision tree and measure its accuracy with the encoded data

### <a href="http://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features">One-hot encoding</a>

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
# Dense output; on scikit-learn versions older than 1.2 use sparse=False instead
onehot = OneHotEncoder(sparse_output=False)
onehotpipe = deepcopy(pipe)
onehotpipe.steps.insert(0, ('hot', onehot))

onehotpipe.fit(encodedX_train, y_train)
accuracy_score(y_test.values, onehotpipe.predict(encodedX_test))

#### Exercise: Try the same experiment with decision tree
- Create a pipe containing a decision tree and measure its accuracy

### Label binarizing

In [None]:
from sklearn.preprocessing import LabelBinarizer

#### Exercise: Try out LabelBinarizer

## 3. Putting it all together

In [None]:
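# Normalizing numerical cols (fit on the training split only, apply to both splits)
standard = StandardScaler().fit(X_train[numerical_cols])
X_train[numerical_cols] = standard.transform(X_train[numerical_cols])
X_test[numerical_cols] = standard.transform(X_test[numerical_cols])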

# Label encoding nominal cols
for col in nominal_cols:
    encoder = LabelEncoder()
    encoder.fit(X[col])
    X_train[col] = encoder.transform(X_train[col])
    X_test[col] = encoder.transform(X_test[col])

In [None]:
pipe.fit(X_train, y_train)
y_hat = pipe.predict(X_test)
accuracy_score(y_test, y_hat)

#### Exercise: Try out the different classifiers with the preprocessed data!