# Intro to Data Science
## Part III. - Data Transformation

### Table of contents
- <a href="#What-is-Data-Transformation?">Theory</a>
- <a href="#1.-Numerical-features">Numerical features</a>
- <a href="#2.-Nominal-features">Nominal features</a>
- <a href="#3.-FeatureUnions">Feature Unions</a>

## What is Data Transformation?
During data transformation the goal is to prepare the data to be usable in the modelling steps. These transformations include normalization, standardization, text processing, generating complex features from basic ones, or any kind of data mapping.

_"...a data transformation converts a set of data values from the data format of a source data system into the data format of a destination data system._

_Data transformation can be divided into two steps:_
1. _data mapping maps data elements from the source data system to the destination data system and captures any transformation that must occur_
2. _code generation that creates the actual transformation program"_
from: <a href="https://en.wikipedia.org/wiki/Data_transformation">Wikipedia</a>

### Why is it important?

Most of the models are sensitive to data, so you must transform it into a more desired format. Unfortunately the data you start with is usually in terrible shape:

- It has missing values
- It is full of outliers
- The data is distorted by noise
- The features are in different scales
- The features are correlated/redundant/uninformative


### Tools

- scaling/binarizing
- normalizing/standardizing
- outlier detecting
- filtering
- mathematical transformations
- representational changes
- etc.

---

In [None]:
%matplotlib inline
import collections

import numpy as np
import scipy.sparse as sp
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

In [None]:
def set2val(row, col, val):
    if row['tmp']:
        row[col] = np.nan
    return row

def add_missing(df, cols, inf=False, percent=.2):
    nrows, _ = df.shape
    missing = np.nan if not inf else np.inf
    df['tmp'] = np.random.rand(nrows)
    df['tmp'] = np.where(df['tmp'] < percent, 1, 0)
    df['tgt_col'] = np.random.choice(cols, nrows)
    df = df.apply(lambda x: set2val(x, x['tgt_col'], missing), axis=1)
    del df['tmp']
    del df['tgt_col']
    return df

def apply_scaler(df, cols, scaler):
    df = df.copy()
    df[cols] = scaler.transform(df[cols])
    return df
        
def gridplot(X, y=None):
    if y is not None:
        data = pd.concat((X, y), axis=1)
        fig = sns.PairGrid(data, vars=numerical_cols, hue='Label')
    else:
        fig = sns.PairGrid(X, vars=numerical_cols)
    fig = fig.map_diag(plt.hist)
    fig = fig.map_offdiag(plt.scatter)
    fig = fig.add_legend()
    return fig

## Intermission - prototype based classifiers

### K-Nearest Neighbour classification
If it looks like a duck and quacks like a duck, it is probably a duck. 

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
pipe = Pipeline(steps=[('knn', KNeighborsClassifier())])

### Reading the loan dataset

In [246]:
df = pd.read_csv('./data/loan.csv', index_col=0)
# Transform target values early for plotting reasons
df['Label'] = LabelEncoder().fit_transform(df['Target'].values)

# Intentionally left 'Loan_ID' out
nominal_cols = ['Gender', 'Married', 'Education',
                'Dependents', 'Self_Employed', 'Property_Area']
numerical_cols = ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
                  'Loan_Amount_Term', 'Credit_History']
target_col = ['Label']

X = df[numerical_cols + nominal_cols]
y = df[target_col]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3., random_state=41)

---

## 1. Numerical features


### <a href="http://pandas.pydata.org/pandas-docs/stable/missing_data.html">missing values</a>

In [None]:
missing = add_missing(df, numerical_cols)
missing.describe()

- dropping NAs

In [None]:
dropped = missing.dropna(axis=0)
dropped.shape

- fill NAs

In [None]:
filled = missing.fillna(value=0)
filled.describe()

- intepolate NAs

In [None]:
interpolated = missing.interpolate(method='nearest')
interpolated.describe()

### <a href="http://pandas.pydata.org/pandas-docs/stable/missing_data.html#values-considered-missing">infinite values</a>

In [None]:
missing = add_missing(df, numerical_cols, inf=True)
missing.describe()

In [None]:
with pd.option_context('mode.use_inf_as_null', True):
    print missing.dropna(axis=0).shape

In [None]:
with pd.option_context('mode.use_inf_as_null', True):
    print missing.fillna(value=0).describe()

In [None]:
with pd.option_context('mode.use_inf_as_null', True):
    print missing.interpolate().describe()

### <a href="http://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling">different scales</a>

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
pipe.fit(X_train[numerical_cols], y_train)
accuracy_score(y_test.values, pipe.predict(X_test[numerical_cols]))

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
from copy import deepcopy

In [None]:
scaled_pipe = deepcopy(pipe)
minmax = MinMaxScaler()
scaled_pipe.steps.insert(0, ('minmax', minmax))

scaled_pipe.fit(X_train[numerical_cols], y_train)
accuracy_score(y_test.values, scaled_pipe.predict(X_test[numerical_cols]))

In [None]:
gridplot(apply_scaler(X_train, numerical_cols, minmax), y_train)

#### Excercise: Try the same experiment with logistic regression
- Create a pipe containing a logistic regressor and measure its accuracy
- Create an another pipe with minmaxscaler and logistic regressor. Compare the results and try to explain the difference.

### <a href="http://scikit-learn.org/stable/modules/preprocessing.html#normalization">unnormalized data</a>

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
normalized_pipe = deepcopy(pipe)
standard = StandardScaler()
normalized_pipe.steps.insert(0, ('standard', standard))

normalized_pipe.fit(X_train[numerical_cols], y_train)
accuracy_score(y_test.values, normalized_pipe.predict(X_test[numerical_cols]))

In [None]:
gridplot(apply_scaler(X_train, numerical_cols, standard), y_train)

#### Excercise: Try the same experiment with logistic regression
- Create a pipe containing a logistic regressor and measure its accuracy
- Create an another pipe with minmaxscaler and logistic regressor. Compare the results and try to explain the difference.

### correlated features

In [None]:
sns.heatmap(df.corr(), robust=True)

Not now. More about this topic in the next issue of DS101. Cough-cough-<a href="http://scikit-learn.org/stable/modules/preprocessing.html#scaling-data-with-outliers" style="color: black; text-decoration: none; cursor: default;">PCA</a>-cough.

### <a href="http://scikit-learn.org/stable/modules/preprocessing.html#scaling-data-with-outliers">outliers</a>

<img src="https://i0.wp.com/flowingdata.com/wp-content/uploads/2014/09/outlier.gif" align="left" width="400">

<br style="clear:left;"/>

In [None]:
from sklearn.preprocessing import RobustScaler

In [None]:
robust_pipe = deepcopy(pipe)
robust = RobustScaler()
normalized_pipe.steps.insert(0, ('robust', robust))

robust_pipe.fit(X_train[numerical_cols], y_train)
accuracy_score(y_test.values, robust_pipe.predict(X_test[numerical_cols]))

#### Excercise: Try the same experiment with logistic regression
- Create a pipe containing a logistic regressor and measure its accuracy
- Create an another pipe with minmaxscaler and logistic regressor. Compare the results and try to explain the difference.

### <a href="http://scikit-learn.org/stable/modules/preprocessing.html#feature-binarization">binarization</a>

In [None]:
from sklearn.preprocessing import Binarizer

In [None]:
binarizer = Binarizer(threshold=101.0)
X_train['BinLoanAmount'] = binarizer.fit_transform(X_train[['LoanAmount']])
X_test['BinLoanAmount'] = binarizer.transform(X_test[['LoanAmount']])

In [None]:
numwithbincols = [col for col in numerical_cols if not col == 'LoanAmount'] + ['BinLoanAmount']
pipe.fit(X_train[numwithbincols], y_train)

In [None]:
accuracy_score(y_test.values, pipe.predict(X_test[numwithbincols]))

---

## Intermission II. - Model of the week

### Decision Trees

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
tree = DecisionTreeClassifier()

---

## 2. Nominal Features

### Replacing values

In [None]:
replace_map = {'Dependents': {'0': 0, '1': 1, '2': 2, '3+': 4}}
X_train.replace(replace_map).head()

### <a href="">Label encoding</a>

In [None]:
from sklearn.preprocessing import LabelEncoder

In [253]:
encoder = LabelEncoder()
encodedX_train = X_train[nominal_cols].apply(encoder.fit_transform)
encodedX_test = X_test[nominal_cols].apply(encoder.fit_transform)

pipe.fit(encodedX_train, y_train)
accuracy_score(y_test.values, pipe.predict(encodedX_test))

0.64375000000000004

### <a href="http://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features">One-hot encoding</a>

In [None]:
from sklearn.preprocessing import OneHotEncoder

### Label binarizing

In [None]:
from sklearn.preprocessing import LabelBinarizer

## 3. FeatureUnions

In [None]:
from sklearn.pipeline import FeatureUnion
from sklearn.metrics import mean_squared_error

In [None]:
pipe = Pipeline(steps=[
    # TODO: add prepocessed features 
    ('knn', KNeighborsClassifier())    
])

In [None]:
# Use binary metric instead of rmse?
baseline = np.ones((len(y_train), 1)) * y_train.mean()
mean_squared_error(y_train, baseline)

In [None]:
pipe.fit(X_train, y_train)
y_hat = pipe.predict(X_test)
mean_squared_error(y_test, y_hat)

---

## : Decision trees

__\# TODO!__