# Intro to Data Science
## Part III. - Data Transformation

### Table of contents
- <a href="#What-is-Data-Transformation?">Theory</a>
- <a href="#1.-Numerical-features">Numerical features</a>
- <a href="#2.-Nominal-features">Nominal features</a>
- <a href="#3.-FeatureUnions">Feature Unions</a>

## What is Data Transformation?
The goal of data transformation is to prepare the data for the modelling steps. Typical transformations include normalization, standardization, text processing, generating complex features from basic ones, or any other kind of data mapping.

_"...a data transformation converts a set of data values from the data format of a source data system into the data format of a destination data system._

_Data transformation can be divided into two steps:_
1. _data mapping maps data elements from the source data system to the destination data system and captures any transformation that must occur_
2. _code generation that creates the actual transformation program"_
from: <a href="https://en.wikipedia.org/wiki/Data_transformation">Wikipedia</a>

### Why is it important?

Most models are sensitive to the shape of their input data, so you must first transform it into a more suitable format. Unfortunately, the data you start with is usually in terrible shape:

- It has missing values
- It is full of outliers
- The data is distorted by noise
- The features are in different scales
- The features are correlated/redundant/uninformative


### Tools

- scaling/binarizing
- normalizing/standardizing
- outlier detecting
- filtering
- mathematical transformations
- representational changes
- etc.

---

In [None]:
%matplotlib inline
import collections

import numpy as np
import scipy.sparse as sp
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

## Intermission - prototype based classifiers

### K-Nearest Neighbour classification
If it looks like a duck and quacks like a duck, it is probably a duck.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
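
The duck test can be sketched on a toy example. The points and labels here are made up for illustration and are not part of the loan dataset; the idea is only that a new point gets the majority label of its `n_neighbors` closest training points.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data (made up): two well-separated clusters in 2D
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array(['duck', 'duck', 'duck', 'goose', 'goose', 'goose'])

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_toy, y_toy)

# Each new point is labelled by the majority vote of its 3 nearest neighbours
knn_pred = knn.predict([[0.3, 0.3], [4.8, 5.0]])
print(knn_pred)
```

Because the vote is based on distances, features on very different scales can dominate it, which is one reason the scaling steps in this notebook matter for this model.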

### Reading the loan dataset

In [None]:
df = pd.read_csv('./data/loan.csv', index_col=0)

numerical_cols = ['ApplicantIncome', 'CoapplicantIncome', 'Dependents',
                  'LoanAmount', 'Loan_Amount_Term']
nominal_cols = ['Loan_ID', 'Gender', 'Married', 'Education',
                'Credit_History', 'Self_Employed', 'Property_Area']
target_col = ['Target']

X = df[numerical_cols + nominal_cols]
y = df[target_col]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3., random_state=41)

---

## 1. Numerical features


### <a href="http://pandas.pydata.org/pandas-docs/stable/missing_data.html">missing values</a>

In [None]:
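# An empty DataFrame cannot be described; work on a copy of the numerical training features
missing = X_train[numerical_cols].copy()
missing.describe()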

- dropping NAs

In [None]:
dropped = missing.dropna(axis=0)
dropped.shape

- fill NAs

In [None]:
filled = missing.fillna(value=0)
filled.describe()

- interpolate NAs

In [None]:
interpolated = missing.interpolate(method='nearest')
interpolated.describe()
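
To compare the three strategies side by side, here is a small sketch on a made-up frame (the column names and values are assumptions, not the loan data). Filling with the column median instead of `0` is also an assumption of this sketch; it is often a gentler default because `0` can be far from the typical value.

```python
import numpy as np
import pandas as pd

# Made-up frame with one gap per column
gaps = pd.DataFrame({'income': [1000.0, np.nan, 3000.0, 4000.0],
                     'loan':   [10.0, 20.0, np.nan, 40.0]})

gaps_dropped = gaps.dropna(axis=0)               # keep only complete rows
gaps_filled = gaps.fillna(gaps.median())         # per-column median
gaps_interp = gaps.interpolate(method='linear')  # midpoint of the neighbouring rows

print(gaps_dropped.shape)
print(gaps_filled)
print(gaps_interp)
```

Dropping loses half of the rows here, while filling and interpolating keep every row but invent values, so the right choice depends on how much data you can afford to lose.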

### <a href="http://pandas.pydata.org/pandas-docs/stable/missing_data.html#values-considered-missing">infinite values</a>

In [None]:
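# 'mode.use_inf_as_null' is gone from current pandas; convert infinities to NaN explicitly instead
missing = missing.replace([np.inf, -np.inf], np.nan)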

In [None]:
np.finfo('d')

In [None]:
dropped = missing.dropna(axis=0)
dropped.shape

In [None]:
filled = missing.fillna(value=0)
filled.describe()

In [None]:
interpolated = missing.interpolate()
interpolated.describe()
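
A minimal sketch of why infinities need separate handling, on a made-up column: by default pandas does not count `inf` as missing, so `dropna` and `fillna` leave it alone until it is converted to `NaN`.

```python
import numpy as np
import pandas as pd

# Made-up column containing both infinities
ratios = pd.DataFrame({'ratio': [1.0, np.inf, -np.inf, 4.0]})

# By default, infinities are not treated as missing
n_missing_before = int(ratios['ratio'].isna().sum())

# After the explicit conversion, the usual NA tools apply
ratios_clean = ratios.replace([np.inf, -np.inf], np.nan)
n_missing_after = int(ratios_clean['ratio'].isna().sum())

print(n_missing_before, n_missing_after)
```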

### <a href="http://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling">different scales</a>

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
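scaler = MinMaxScaler()
# `scaled` was never defined; assume the numerical training columns with NAs filled by 0
scaled = X_train[numerical_cols].fillna(0)
scaled[numerical_cols] = scaler.fit_transform(scaled[numerical_cols])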
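
What `MinMaxScaler` does with its default settings can be checked by hand: each column is shifted by its minimum and divided by its range, so the values end up in `[0, 1]`. The numbers here are made up.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single feature
incomes = np.array([[1000.0], [2000.0], [5000.0]])

# (x - min) / (max - min), computed column by column
col_min = incomes.min(axis=0)
col_max = incomes.max(axis=0)
by_hand = (incomes - col_min) / (col_max - col_min)

by_sklearn = MinMaxScaler().fit_transform(incomes)
print(by_hand.ravel(), by_sklearn.ravel())
```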

In [None]:
# TODO: move somewhere where it makes at least minimal sense
def gridplot(X, y=None):
    g = sns.PairGrid(X, hue=y)
    g = g.map_diag(plt.hist)
    g = g.map_offdiag(plt.scatter)
    return g

In [None]:
# TODO: transform y_train to numerical value to color the dots by classes
gridplot(X_train[numerical_cols])

### <a href="http://scikit-learn.org/stable/modules/preprocessing.html#normalization">unnormalized data</a>

In [None]:
from sklearn.preprocessing import StandardScaler
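
A short sketch of how `StandardScaler` might be used, on synthetic data (the distribution parameters are arbitrary). The scaler learns the mean and standard deviation on the training part only and reuses them on the test part, so no information leaks from the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data: two features centred around 50 with spread 10
rng = np.random.RandomState(0)
train_part = rng.normal(loc=50.0, scale=10.0, size=(200, 2))
test_part = rng.normal(loc=50.0, scale=10.0, size=(50, 2))

std_scaler = StandardScaler().fit(train_part)   # statistics come from train only
train_std = std_scaler.transform(train_part)
test_std = std_scaler.transform(test_part)      # same shift and scale reused

print(train_std.mean(axis=0), train_std.std(axis=0))
```

The test part will be close to zero mean and unit variance, but not exactly, because it was scaled with the training statistics.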

### correlated features

In [None]:
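# Correlation is only defined for numeric columns, so select them explicitly
sns.heatmap(df.select_dtypes(include='number').corr(), robust=True)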

Not now. More about this topic in the next issue of DS101. Cough-cough-<a href="http://scikit-learn.org/stable/modules/preprocessing.html#scaling-data-with-outliers" style="color: black; text-decoration: none; cursor: default;">PCA</a>-cough.

### <a href="http://scikit-learn.org/stable/modules/preprocessing.html#scaling-data-with-outliers">outliers</a>

<img src="https://i0.wp.com/flowingdata.com/wp-content/uploads/2014/09/outlier.gif" align="left" width="400">

<br style="clear:left;"/>
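
The linked scikit-learn section suggests `RobustScaler` for data with outliers. A sketch on made-up values: it centres on the median and scales by the interquartile range, so one extreme value does not squash the rest, unlike mean and variance scaling.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Made-up feature with a single extreme outlier
values = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

robust = RobustScaler().fit_transform(values)      # (x - median) / IQR
standard = StandardScaler().fit_transform(values)  # (x - mean) / std

print(robust.ravel())
print(standard.ravel())
```

With standard scaling the four ordinary values become nearly indistinguishable, while the robust version keeps them spread out and leaves the outlier clearly visible.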

### <a href="http://scikit-learn.org/stable/modules/preprocessing.html#feature-binarization">binarization</a>

In [None]:
from sklearn.preprocessing import Binarizer

In [None]:
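binarizer = Binarizer()
# Binarizer needs numeric input without NAs, so restrict to the numerical columns
binarizer.fit_transform(df[numerical_cols].fillna(0))[:15]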
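
A minimal sketch of what binarization does, on made-up values: everything strictly above `threshold` becomes `1`, everything else `0`. Turning a co-applicant income into a "has any co-applicant income" flag is an illustrative choice, not something the dataset prescribes.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Made-up co-applicant incomes
coapplicant_income = np.array([[0.0], [1500.0], [0.0], [4200.0]])

# Values strictly greater than the threshold map to 1, the rest to 0
has_coapplicant_income = Binarizer(threshold=0.0).fit_transform(coapplicant_income)
print(has_coapplicant_income.ravel())
```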

---

## 2. Nominal features

### Replacing values

```python
cleanup_nums = {"num_doors":     {"four": 4, "two": 2},
                "num_cylinders": {"four": 4, "six": 6, "five": 5, "eight": 8,
                                  "two": 2, "twelve": 12, "three":3 }}
obj_df.replace(cleanup_nums, inplace=True)
obj_df.head()
```

### Label encoding

In [None]:
from sklearn.preprocessing import LabelEncoder
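
A small sketch of `LabelEncoder` on made-up category values. It maps each distinct label to an integer, numbered in sorted order of the labels.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up nominal values
areas = ['Urban', 'Rural', 'Semiurban', 'Urban']

label_encoder = LabelEncoder()
area_codes = label_encoder.fit_transform(areas)

print(list(label_encoder.classes_))                 # the learned labels, sorted
print(list(area_codes))                             # integer code per sample
print(list(label_encoder.inverse_transform([0])))   # back to the original label
```

Note that the integer codes imply an order and distances between categories that do not really exist, which can mislead distance-based models such as KNN. That is the motivation for one-hot encoding.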

### <a href="http://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features">One-hot encoding</a>

In [None]:
from sklearn.preprocessing import OneHotEncoder

__\# TODO:__ replace outdated cells 
```python
# Toy categorical column: 100 samples, each one of the codes 0, 1 or 2
categorical = np.array([[np.random.choice([0, 1, 2])]
                        for _ in range(100)])

# One-hot encode: one output column per distinct code (sparse by default)
encoder = OneHotEncoder()
encoded = encoder.fit_transform(categorical)
encoded.shape

categorical[:15]

encoded[:15].todense()
```
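
The same idea on string categories, as a sketch with made-up values. Setting `handle_unknown='ignore'` is an assumption of this example: it makes the encoder emit an all-zero row for a category it did not see during `fit` instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up nominal column (one feature, four samples)
area_column = np.array([['Urban'], ['Rural'], ['Semiurban'], ['Urban']])

onehot_encoder = OneHotEncoder(handle_unknown='ignore')
onehot = onehot_encoder.fit_transform(area_column).toarray()

# A category that was not present during fit becomes a row of zeros
unseen = onehot_encoder.transform(np.array([['Suburban']])).toarray()

print(onehot_encoder.categories_)
print(onehot)
print(unseen)
```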

### Label binarizing

In [None]:
from sklearn.preprocessing import LabelBinarizer
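
A sketch of `LabelBinarizer` on made-up labels. With two classes it produces a single 0/1 column, with more classes it behaves like a one-hot encoder for labels.

```python
from sklearn.preprocessing import LabelBinarizer

# Two classes: a single indicator column
target_binarizer = LabelBinarizer()
binary_target = target_binarizer.fit_transform(['Y', 'N', 'Y', 'Y'])

# Three classes: one indicator column per class
multi_target = LabelBinarizer().fit_transform(['Urban', 'Rural', 'Semiurban'])

print(list(target_binarizer.classes_))
print(binary_target.ravel())
print(multi_target)
```

A binarized target like this is one way to turn a yes/no label into numbers, for example to colour plots by class.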

## 3. FeatureUnions

In [None]:
from sklearn.pipeline import FeatureUnion
from sklearn.metrics import mean_squared_error

In [None]:
pipe = Pipeline(steps=[
    # TODO: add preprocessed features 
    ('knn', KNeighborsClassifier())    
])

In [None]:
# Use binary metric instead of rmse?
baseline = np.ones((len(y_train), 1)) * y_train.mean()
mean_squared_error(y_train, baseline)

In [None]:
pipe.fit(X_train, y_train)
y_hat = pipe.predict(X_test)
mean_squared_error(y_test, y_hat)
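
One possible shape for the finished pipeline, sketched on a made-up frame rather than the loan data. The `select` helper, the column names and the choice of accuracy as the metric are all assumptions of this sketch. Each branch of the `FeatureUnion` picks its own columns and preprocesses them, and the results are stacked side by side before reaching the classifier.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler, OneHotEncoder


def select(columns):
    """Hypothetical helper: a transformer that keeps only the given columns."""
    return FunctionTransformer(lambda frame: frame[columns], validate=False)


# Made-up data: one numerical and one nominal feature
loans_toy = pd.DataFrame({
    'income': [1000.0, 1200.0, 900.0, 5000.0, 5200.0, 4800.0],
    'area':   ['Rural', 'Rural', 'Rural', 'Urban', 'Urban', 'Urban'],
})
target_toy = np.array(['N', 'N', 'N', 'Y', 'Y', 'Y'])

features = FeatureUnion([
    ('numerical', Pipeline([('select', select(['income'])),
                            ('scale', MinMaxScaler())])),
    ('nominal', Pipeline([('select', select(['area'])),
                          ('onehot', OneHotEncoder(handle_unknown='ignore'))])),
])

toy_pipe = Pipeline(steps=[
    ('features', features),
    ('knn', KNeighborsClassifier(n_neighbors=3)),
])

toy_pipe.fit(loans_toy, target_toy)
toy_pred = toy_pipe.predict(loans_toy)
toy_accuracy = accuracy_score(target_toy, toy_pred)
print(toy_accuracy)
```

The score here is computed on the training rows only to show the mechanics; a real evaluation would use the held-out test split. scikit-learn's `ColumnTransformer` covers the same column-wise use case and may be more convenient.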

---

## Model of the week: Decision trees

Decision trees are a type of supervised machine learning algorithm that can predict both categorical and continuous values (in the latter case they are called *regression trees*). **Basically, the algorithm divides the training population along the values of its attributes and assigns a prediction value to each resulting subpopulation.** For a **categorical** target variable, the assigned prediction is **the mode of the target variable values** in the particular subpopulation. For a continuous target, it is **the mean of the values**.  
A familiar example, from the <a href="https://en.wikipedia.org/wiki/Decision_tree_learning">Wikipedia page on decision tree learning</a>:  
<img src="https://upload.wikimedia.org/wikipedia/commons/f/f3/CART_tree_titanic_survivors.png">  
A tree showing survival of passengers on the Titanic ("sibsp" is the number of spouses or siblings aboard). The figures under the leaves show the probability of outcome and the percentage of observations in the leaf.
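
The "mode for classification, mean for regression" rule can be seen on a tree with a single split. This is a sketch on made-up one-dimensional data; `max_depth=1` is chosen only to keep the leaves impure enough to show the effect.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Made-up data: one attribute, a categorical and a continuous target
x = np.array([[1], [2], [3], [10], [11], [12]])
labels = np.array(['no', 'yes', 'no', 'yes', 'yes', 'yes'])
amounts = np.array([1.0, 2.0, 3.0, 21.0, 22.0, 23.0])

# A single split separates the small x values from the large ones
clf_stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(x, labels)
reg_stump = DecisionTreeRegressor(max_depth=1, random_state=0).fit(x, amounts)

# Classification: each leaf predicts the most frequent label it contains
print(clf_stump.predict([[2], [11]]))
# Regression: each leaf predicts the mean of the targets it contains
print(reg_stump.predict([[2], [11]]))
```

The training point `x = 2` is labelled `'yes'`, but the classifier predicts `'no'` for it, because two of the three samples in its leaf are `'no'`.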

### So, where's the trick?  
At deciding *where to split the population*. **The goal is to make these subpopulations as homogeneous in the target variable as we can.** (So naturally, if the target variable is completely unrelated to an attribute, the decision tree won't split the population on that attribute.) There are multiple *metrics* for deciding which split is good at a particular point in the tree, but most of them calculate an **"impurity"** of the parent node and choose the split that reduces this impurity the most. Most of the time only the local node is considered, meaning **there is no guarantee that the final tree is the best one globally**.  
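
As an illustration of an impurity metric, here is a sketch using Gini impurity, which is one common choice and not the only one. The helper names and the example labels are made up. A split is scored by how much it lowers the impurity of the parent, with the children weighted by their size.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted


parent = ['Y'] * 5 + ['N'] * 5

# A split that separates the classes perfectly
good_split = impurity_decrease(parent, ['Y'] * 5, ['N'] * 5)

# A split that leaves both children almost as mixed as the parent
poor_split = impurity_decrease(parent,
                               ['Y', 'Y', 'Y', 'N', 'N'],
                               ['Y', 'Y', 'N', 'N', 'N'])

print(good_split, poor_split)
```

A greedy tree builder would pick the first split, since it removes all of the parent's impurity while the second removes almost none.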

#### How big/complex a tree do we want to build? Or: When should a node be declared a leaf?
As one of the biggest weaknesses of decision/regression trees is **overfitting**, these are very important questions. There are multiple ways to limit a tree: 
- Set a minimum sample number at which a split can be made
- Set a minimum number of samples that every leaf must contain
- Set a threshold for the impurity decrease, below which no splits are made
- Set a maximum depth of the tree
- Set the maximum number of variables used for splitting
- Set a maximum number of leaves
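
In scikit-learn these limits correspond to constructor parameters of `DecisionTreeClassifier`. The sketch uses the bundled iris data and arbitrary parameter values, both chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)

limited = DecisionTreeClassifier(
    min_samples_split=10,        # a node needs at least 10 samples to be split
    min_samples_leaf=5,          # every leaf must keep at least 5 samples
    min_impurity_decrease=0.01,  # skip splits that reduce impurity by less than this
    max_depth=3,                 # no more than 3 levels of splits
    max_features=2,              # consider only 2 attributes at each split
    max_leaf_nodes=6,            # at most 6 leaves in total
    random_state=0,
).fit(X_iris, y_iris)

unlimited = DecisionTreeClassifier(random_state=0).fit(X_iris, y_iris)

print(limited.get_depth(), limited.get_n_leaves())
print(unlimited.get_depth(), unlimited.get_n_leaves())
```

In practice you would rarely set all of them at once; one or two limits, tuned on validation data, are usually enough.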

Another answer to overfitting is **tree pruning**. This basically means letting the tree grow until every leaf holds only a few observations, and then, starting from the top or the bottom, removing the splits that contribute little to the accuracy of the model.
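
scikit-learn offers one particular flavour of this, minimal cost-complexity pruning, controlled by the `ccp_alpha` parameter. The sketch uses the bundled iris data, and the value `0.02` is an arbitrary choice for illustration; in practice it would be picked by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)

# Grow the full tree first
full_tree = DecisionTreeClassifier(random_state=0).fit(X_iris, y_iris)

# Candidate alpha values at which subtrees would be pruned away
pruning_path = full_tree.cost_complexity_pruning_path(X_iris, y_iris)

# A larger ccp_alpha removes more of the weakest splits
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X_iris, y_iris)

print(pruning_path.ccp_alphas)
print(full_tree.get_n_leaves(), pruned_tree.get_n_leaves())
```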

### And why are trees better than say, logistic regression?
The most important case is when **there is a non-linear relationship between the attributes and the target variable**. Another advantage of decision trees is that, despite handling such relationships, they are very easy to interpret (just look at the picture above).

### What is better than a tree? Multiple trees!
To tackle overfitting, an early technique was to draw random samples from the training set and build a separate tree on each sample. The prediction is then made from the "votes" of the trees (either their mean or their mode). This is called **bagging**. **Random forests** enhance this procedure by excluding a random set of attributes from the candidate splits at each node. The reasoning is that if a few attributes have very strong explanatory power, they would be used at most of the splits, so the different trees would be strongly correlated, which reduces the benefit of bagging.  
An important thing to remember:  
> When in doubt, use <s>brute force</s> random forest.
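
A sketch of the three approaches side by side, on synthetic data generated with `make_classification`. The dataset, the number of trees and the other settings are arbitrary assumptions, and the resulting scores say nothing general about which method is better.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification problem
X_syn, y_syn = make_classification(n_samples=500, n_features=20,
                                   n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn,
                                          test_size=1/3., random_state=0)

# One tree trained on all of the training data
single = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Bagging: every tree sees a different bootstrap sample of the rows
bagged = BaggingClassifier(DecisionTreeClassifier(random_state=0),
                           n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Random forest: bootstrap samples plus a random subset of attributes per split
forest = RandomForestClassifier(n_estimators=50, max_features='sqrt',
                                random_state=0).fit(X_tr, y_tr)

scores = {'single tree': single.score(X_te, y_te),
          'bagging': bagged.score(X_te, y_te),
          'random forest': forest.score(X_te, y_te)}
print(scores)
```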