# Intro to Data Science @ SzISz Part III.
## Data Transformation

### Table of contents
- <a href="#What-is-Data-Transformation?">Theory</a>
- <a href="#Numerical-features">Numerical Transformations</a>
- <a href="#Textual-Transformations">Textual Transformations</a>
- <a href="#Pipelines-and-FeatureUnions">Pipelines and Feature Unions</a>

## What is Data Transformation?
Data transformation prepares the data so that the modelling steps can use it. Typical transformations include normalization, standardization, text processing, generating complex features from basic ones, and other kinds of data mapping.

_"...a data transformation converts a set of data values from the data format of a source data system into the data format of a destination data system._

_Data transformation can be divided into two steps:_
1. _data mapping maps data elements from the source data system to the destination data system and captures any transformation that must occur_
2. _code generation that creates the actual transformation program"_
from: <a href="https://en.wikipedia.org/wiki/Data_transformation">Wikipedia</a>
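
As a minimal sketch of two of the transformations named above, the cell below applies min-max normalization and z-score standardization to a small made-up array with plain NumPy. The array `x` and the variable names are assumptions for illustration, not part of the original material.

```python
import numpy as np

# Hypothetical toy feature with values on an arbitrary scale.
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Min-max normalization: rescale the values to the [0, 1] interval.
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): shift to zero mean and scale to unit variance.
x_standard = (x - x.mean()) / x.std()

print(x_minmax)
print(x_standard)
```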

## Why is it important?

Most models are sensitive to the shape and scale of their input, so the data must first be transformed into a suitable format. Unfortunately, the data you start with is usually in terrible shape:

- It has missing values
- It is full of outliers
- The data is distorted by noise
- The features are in different scales
- The features are correlated/redundant/uninformative


## Tools

- scaling/binarizing
- normalizing/standardizing
- outlier detecting
- filtering
- mathematical transformations
- representational changes
- etc.

---

## Imports and custom functions

In [None]:
%matplotlib inline
import collections

import numpy as np
import scipy.sparse as sp
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import make_classification

In [None]:
def input_missing(X, random_state=42):
    """Randomly replace values in an np.ndarray with np.nan, np.inf or -np.inf.
    
    Parameters:
    -----------
    X : np.ndarray
        Target array in which the values will be randomly replaced.
        
    random_state : int
        Random seed initializing the pseudo-random number generator.
        
    Returns:
    --------
    X' : np.ndarray
        Array with nans and infs.
    """
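    custom_random = np.random.RandomState(random_state)
    # Draw a uniform random number in [0, 1) for every element of X.
    mask = custom_random.rand(*X.shape)
    # About 90% of the elements are kept (multiplied by 1).
    mask[mask < 0.9] = 1.
    # About 5% become NaN, and about 2.5% each are multiplied by +inf and -inf.
    mask[(mask >= 0.9) & (mask < 0.95)] = np.nan
    mask[(mask >= 0.95) & (mask < 0.975)] = np.inf
    mask[(mask >= 0.975) & (mask < 1.)] = -np.inf
    return X * mask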

def change_scale(X, factor=5., columns=1, random_state=42):
    """Randomly multiply a column or columns in an np.ndarray.
    
    Parameters:
    -----------
    X : np.ndarray
        Target array in which a column will be multiplied.
    
    factor : float
        The multiplication factor
        
    columns : int or array-like
        Number of columns to multiply if int type, else the column indices which are multiplied.
        
    random_state : int
        Random seed initializing the pseudo-random number generator.
        
    Returns:
    --------
    X' : np.ndarray
        Array with the selected columns multiplied by the factor.
    """
    custom_random = np.random.RandomState(random_state)
    X_comma = X.copy()
    rows, cols = X.shape
    if isinstance(columns, (int, np.integer)):
        # An int means "pick this many random column indices" (repeats are possible).
        columns = [custom_random.randint(cols) for _ in range(columns)]
    for column in columns:
        X_comma[:, column] *= factor
    return X_comma

def add_outlier(X, value=10, num=1, random_state=42):
    """Add a specified number of outliers to the input np.ndarray.
    
    Parameters:
    -----------
    X : np.ndarray
        Target array into which the outliers will be inserted.
    
    value : float
        The value of the outlier.
        
    num : int
        The number of outliers to be placed.
        
    random_state : int
        Random seed initializing the pseudo-random number generator.
        
    Returns:
    --------
    X' : np.ndarray
        Array with outliers.
    """
    custom_random = np.random.RandomState(random_state)
    X_comma = X.copy()
    rows, cols = X_comma.shape
    for _ in range(num):
        row = custom_random.randint(rows)
        col = custom_random.randint(cols)
        X_comma[row, col] += value
    
    return X_comma

def binarize(X, bins=2):
    """Binarize matrix elements based on the values.
    
    Parameters:
    -----------
    X : np.ndarray
        Target array to binarize.
        
    bins : int
        Number of values to appear in the binarized matrix.
        
    Returns:
    --------
    X_comma : np.ndarray
        Binarized matrix
    """
    X_comma = X.copy()
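    # Split the value range into `bins` equal-width intervals.
    delims = np.linspace(X_comma.min(), X_comma.max(), bins+1)
    delims = zip(delims, delims[1:])
    # Replace every value with the index of the interval it falls into;
    # a value on a shared boundary ends up in the higher interval.
    for bin_val, (start, end) in enumerate(delims):
        X_comma[(start <= X) & (X <= end)] = bin_val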
    return X_comma

In [None]:
def plot_row(axis, df, column, labels):
    """Scatterplot a column against all of the columns in a pd.DataFrame.
    Colors the points based on labels.
    
    Parameters:
    -----------
    axis : iterable of matplotlib.axes
        List of target axis.
        
    df : pd.DataFrame
        DataFrame to plot.
        
    column : str
        The DataFrame's column to plot against.
        
    labels : iterable
        The labels for the rows in the DataFrame.
    """
    for i, col_ax in zip(df.columns.values, axis):
        col_ax.scatter(df[column], df[i], c=labels, cmap='magma')

def gridplot(df, labels, columns=None, figsize=(12,12)):
    """Generate a gridplot over a pd.DataFrame's columns.
    If the columns parameter is specified, plot against that column(s).
    
    Parameters:
    -----------
    df : pd.DataFrame
        DataFrame to plot.
        
    labels : iterable
        The labels for the rows in the DataFrame.
        
    columns : None, pd.DataFrame columnname, iterable over pd.DF colnames
        The columns to plot against. If None, plot every column against every column.
        
    figsize : tuple of ints
        The size of the resulting plot.
    """
    if columns is None:
        columns = df.columns.values
    if np.isscalar(columns):
        # A single column name (int or str) is wrapped into a list.
        columns = [columns]
        
    ncols = len(df.columns)
    nrows = len(columns)

    # squeeze=False always returns a 2D array of axes, even for a single row or column.
    fig, ax = plt.subplots(nrows=nrows, ncols=ncols, sharex="col", sharey="row",
                           figsize=figsize, squeeze=False)

    for col, row_ax in zip(columns, ax):
        plot_row(axis=row_ax, df=df, column=col, labels=labels)

    plt.show()

## Data generation

In [None]:
data, labels = make_classification(n_features=10, random_state=42)
df = pd.DataFrame(data)

---

## Numerical features


### <a href="http://pandas.pydata.org/pandas-docs/stable/missing_data.html">missing values</a>

In [None]:
missing = pd.DataFrame(input_missing(data))
missing.describe()

In [None]:
dropped = missing.dropna()
dropped.shape

In [None]:
filled = missing.fillna(value=0)
filled.describe()

In [None]:
interpolated = missing.interpolate(method='nearest')
interpolated.describe()
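
Dropping, zero-filling and interpolating are not the only options. As a hedged sketch, the cell below fills each gap with the median of its own column on a small made-up frame. The frame `toy` and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two missing entries.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 5.0],
    "b": [10.0, 20.0, np.nan, 40.0],
})

# Fill every NaN with the median of its own column
# (the median skips missing values by default).
median_filled = toy.fillna(toy.median())

print(median_filled)
```

Whether the median is a sensible replacement depends on the feature; it is less sensitive to extreme values than the mean.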

### <a href="http://pandas.pydata.org/pandas-docs/stable/missing_data.html#values-considered-missing">infinite values</a>

In [None]:
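# The 'mode.use_inf_as_null' option was renamed and later deprecated in newer
# pandas versions, so the infinite values are converted to NaN explicitly.
missing = missing.replace([np.inf, -np.inf], np.nan)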

In [None]:
dropped = missing.dropna(axis=0)
dropped.shape

In [None]:
filled = missing.fillna(value=0)

In [None]:
interpolated = missing.interpolate()

### <a href="http://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling">different scales</a>

In [None]:
scaled = pd.DataFrame(change_scale(data))
scaled.describe()

In [None]:
gridplot(scaled, labels)

In [None]:
gridplot(scaled, labels, columns=6, figsize=(20,3))

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()
scaled[[6]] = scaler.fit_transform(scaled[[6]].values)

In [None]:
gridplot(scaled, labels)
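
To make explicit what `StandardScaler` does, here is a small sketch on made-up data. It checks that the scaler's output matches subtracting the column mean and dividing by the column standard deviation. The array `X_toy` and the name `scaler_toy` are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: the second column is on a much larger scale.
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0]])

scaler_toy = StandardScaler()
X_scaled = scaler_toy.fit_transform(X_toy)

# The same transformation written out by hand (population std, ddof=0).
X_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(X_scaled, X_manual))
```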

### correlated features

In [None]:
sns.heatmap(df.corr(), robust=True)

Not now. More about this topic in the next issue. Cough-cough-<a href="http://scikit-learn.org/stable/modules/preprocessing.html#scaling-data-with-outliers" style="color: black; text-decoration: none; cursor: default;">PCA</a>-cough.

### <a href="http://scikit-learn.org/stable/modules/preprocessing.html#scaling-data-with-outliers">outliers</a>

In [None]:
outlied = pd.DataFrame(add_outlier(data, value=100))
outlied.describe()

In [None]:
gridplot(outlied, labels)

### <a href="https://www.youtube.com/watch?v=MymAUbwSX80" style="color: black; text-decoration: none; cursor: default;">ACT NOW!</a> Write a function which removes the outlier from a dataframe!

In [None]:
def remove_outlier(df):
    """Removes the outlier from the given dataframe.
    
    Parameters:
    -----------
    df : pd.DataFrame
        The dataframe with outliers.
        
    Returns:
    --------
    df' : pd.DataFrame
        The cleaned dataframe.
    """
    # TODO: YOUR MAGIC
    return df
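
One possible approach to the exercise, offered as a sketch rather than the intended solution, is to drop every row in which some value lies far from its column mean. The function name `remove_outlier_zscore`, the toy data and the threshold of 5 standard deviations are assumptions chosen for this illustration.

```python
import numpy as np
import pandas as pd

def remove_outlier_zscore(df, threshold=5.0):
    """Drop rows holding a value more than `threshold` standard
    deviations away from its column mean."""
    z_scores = (df - df.mean()) / df.std()
    keep = (z_scores.abs() <= threshold).all(axis=1)
    return df[keep]

# Hypothetical data: 100 well-behaved rows plus one extreme value.
rng = np.random.RandomState(42)
toy = pd.DataFrame(rng.normal(size=(100, 3)))
toy.iloc[10, 1] += 100

cleaned = remove_outlier_zscore(toy)
print(toy.shape, cleaned.shape)
```

The threshold is a judgment call: a single extreme value also inflates the column's standard deviation, so a cut-off that is too high can let it through.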

### <a href="http://scikit-learn.org/stable/modules/preprocessing.html#feature-binarization">binarization</a>

In [None]:
from sklearn.preprocessing import Binarizer

In [None]:
binarizer = Binarizer()
binarizer.fit_transform(df)[:15]
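
`Binarizer` maps every value above its `threshold` (0.0 by default) to 1 and everything else to 0. A small sketch on made-up values, where `X_toy` is an assumption for illustration:

```python
import numpy as np
from sklearn.preprocessing import Binarizer

X_toy = np.array([[-1.5, 0.0, 0.3],
                  [2.0, -0.1, 1.0]])

# Default threshold is 0.0: only strictly positive values become 1.
default_bin = Binarizer().fit_transform(X_toy)

# A custom threshold moves the cut-off point.
custom_bin = Binarizer(threshold=0.5).fit_transform(X_toy)

print(default_bin)
print(custom_bin)
```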

## Textual Transformations

### <a href="http://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features">Categorical values</a>

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
categorical = binarize(data)
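
The `OneHotEncoder` imported above could then be applied to such binned values. A minimal sketch on a made-up categorical column follows. `X_cat` is an assumption for illustration, and the result is converted with `toarray()` only to make it printable.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single feature with three categories encoded as 0, 1, 2.
X_cat = np.array([[0], [2], [1], [2]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default.
one_hot = encoder.fit_transform(X_cat).toarray()

print(encoder.categories_)
print(one_hot)
```

Each row contains exactly one 1, in the column that belongs to its category.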

### <a href="http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html#bags-of-words">Bag of words</a>

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
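
A bag-of-words model represents each document by how often each word occurs in it. As a hedged sketch, the cell below runs `CountVectorizer` with its default settings on a made-up two-sentence corpus. The list `docs` is an assumption for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus.
docs = [
    "data science is fun",
    "data transformation prepares data",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse document-term matrix

# vocabulary_ maps each token to its column index in the matrix.
print(sorted(vectorizer.vocabulary_))
print(counts.toarray())
```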

### <a href="http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html#from-occurrences-to-frequencies">Tf-Idf</a>

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
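
Tf-idf weighting scales the raw counts down for words that appear in many documents. A minimal sketch with the default `TfidfVectorizer` settings on a made-up corpus follows. The list `docs` and the variable names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus in which "data" occurs in every document.
docs = [
    "data science is fun",
    "data transformation prepares data",
    "science needs data",
]

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs).toarray()

data_col = tfidf.vocabulary_["data"]
fun_col = tfidf.vocabulary_["fun"]

# In the first document both words occur once, but "data" appears in every
# document, so it receives a lower weight than the rarer word "fun".
print(weights[0, data_col], weights[0, fun_col])
```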

## Pipelines and FeatureUnions

In [None]:
from sklearn.pipeline import Pipeline, FeatureUnion
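
A `Pipeline` chains transformation steps one after another, while a `FeatureUnion` runs several transformers side by side and concatenates their outputs. The cell below is a sketch under assumptions: the step names and the choice of `Binarizer` and `MinMaxScaler` as parallel branches are illustrative, not a recommended feature set.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler, Binarizer, MinMaxScaler

X, y = make_classification(n_features=10, random_state=42)

pipeline = Pipeline([
    # Step 1: standardize every column.
    ("scale", StandardScaler()),
    # Step 2: build two feature sets from the scaled data and stack them.
    ("features", FeatureUnion([
        ("binary", Binarizer()),
        ("minmax", MinMaxScaler()),
    ])),
])

X_transformed = pipeline.fit_transform(X)
print(X.shape, X_transformed.shape)
```

The first ten output columns come from the `Binarizer` branch and the last ten from the `MinMaxScaler` branch, in the order the transformers are listed.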