# Data Preprocessing: Discretization and Normalization

In [1]:
%matplotlib notebook
# Only needed on Python 2; kept ahead of the other imports
from __future__ import print_function

from sklearn import datasets
from kemlglearn.preprocessing import Discretizer
import matplotlib.pyplot as plt
import seaborn as sns
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets


iris = datasets.load_iris()
# One color per class, then one color per example according to its class label
col = ['r', 'g', 'b']
lc = [col[i] for i in iris['target']]

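## Attribute Discretization

This plot shows two attributes of the Iris dataset: petal length (column 2) on the horizontal axis and sepal width (column 1) on the vertical axis. Each point is colored by its class.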

In [2]:
fig = plt.figure(figsize=(8,8))
plt.scatter(iris['data'][:, 2], iris['data'][:, 1], c=lc,s=100);


Now we discretize the attributes into equal-width bins (5 by default in this example; the slider selects 3, 5, 7 or 9). With 5 bins, the plot shows the 5x5 grid of possible value combinations, and a single combination can hold many data points.

In [3]:
@interact(bins = (3,9,2))
def g(bins=5):
    disc = Discretizer(bins=bins, method='equal')
    disc.fit(iris['data'])
    irisdisc = disc.transform(iris['data'], copy=True)
    fig = plt.figure(figsize=(8,8))
    plt.scatter(irisdisc[:, 2], irisdisc[:, 1], c=lc,s=100);
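
As a rough illustration of what equal-width binning does, the sketch below uses only NumPy. It is not taken from `kemlglearn`, and the `Discretizer` class may treat bin edges differently. The function name `equal_width_bins` and the toy values are hypothetical, chosen only for this example.

```python
import numpy as np

def equal_width_bins(values, bins=5):
    """Map each value to a bin index in [0, bins - 1] using equal-width intervals.

    Hypothetical helper for illustration; not part of kemlglearn.
    """
    values = np.asarray(values, dtype=float)
    # bins + 1 evenly spaced edges between the minimum and the maximum
    edges = np.linspace(values.min(), values.max(), bins + 1)
    # Using only the interior edges makes np.digitize return indices 0..bins-1
    return np.digitize(values, edges[1:-1])

# Toy values (made up for this example), not Iris measurements
toy = np.array([1.0, 1.2, 1.5, 4.0, 4.5, 5.0, 6.0, 6.9])
codes = equal_width_bins(toy, bins=5)
print(codes)  # -> [0 0 0 2 2 3 4 4]
```

Every interval has the same width, so a bin can stay empty (bin 1 here) while another collects several points.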

Equal-frequency discretization also yields a 5x5 grid of combinations, but the data points are distributed differently across the cells.

In [4]:
@interact(bins = (3,9,2))
def g(bins=5):
    disc = Discretizer(bins=bins, method='frequency')
    disc.fit(iris['data'])
    irisdisc = disc.transform(iris['data'], copy=True)
    fig = plt.figure(figsize=(8,8))
    plt.scatter(irisdisc[:, 2], irisdisc[:, 1], c=lc,s=100);
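
To see why the distribution changes, the following NumPy sketch places the bin edges at quantiles, so each bin receives about the same number of points. It is an assumption about how an equal-frequency discretizer can work, not the `kemlglearn` implementation. The function name `equal_frequency_bins` and the toy values are hypothetical.

```python
import numpy as np

def equal_frequency_bins(values, bins=5):
    """Map each value to a bin index in [0, bins - 1] using quantile-based edges.

    Hypothetical helper for illustration; not part of kemlglearn.
    """
    values = np.asarray(values, dtype=float)
    # Interior quantile levels, e.g. 0.2, 0.4, 0.6, 0.8 for 5 bins
    levels = np.linspace(0, 1, bins + 1)[1:-1]
    edges = np.quantile(values, levels)
    return np.digitize(values, edges)

# Toy values (made up for this example): many small values, a few large ones
toy = np.array([1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 4.0, 5.0, 6.0, 6.9])

freq_counts = np.bincount(equal_frequency_bins(toy, bins=5), minlength=5)
width_counts = np.histogram(toy, bins=5)[0]  # equal-width bins for comparison

print(freq_counts)   # -> [2 2 2 2 2]
print(width_counts)  # -> [6 0 1 1 2]
```

On skewed data, equal-width bins crowd most points into one interval, while quantile-based edges spread them evenly.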

## Data Normalization

Data normalization rescales the attributes so that differences in their scales do not influence comparisons between examples. First we apply range (min-max) normalization to the Iris dataset. The only visible difference in the plot is the axis scale, because the plot already fits the axis limits to the range of each attribute and so keeps its 1:1 proportions.

In [5]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
mx = MinMaxScaler()
fdata = mx.fit_transform(iris['data'])
fig = plt.figure(figsize=(8,8))
plt.scatter(fdata[:, 2], fdata[:, 1], c=lc,s=100);

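
With its default `feature_range=(0, 1)`, `MinMaxScaler` maps each column with `(x - min) / (max - min)`. The NumPy sketch below writes that formula out by hand. The function name `min_max_scale`, the guard for constant columns, and the toy matrix are assumptions made for this illustration.

```python
import numpy as np

def min_max_scale(X):
    """Rescale each column of X to the range [0, 1].

    Hypothetical helper for illustration.
    """
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # Guard chosen for this sketch: avoid dividing by zero on constant columns
    col_range[col_range == 0] = 1.0
    return (X - col_min) / col_range

# Toy matrix (made up for this example): two columns on very different scales
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [4.0, 50.0]])
scaled = min_max_scale(toy)
print(scaled)
```

The mapping is linear within each column, so the relative positions of the points are preserved. This is why the scatter plot looks the same apart from the axis values.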

Next we apply standard score (z-score) normalization, which assumes roughly Gaussian data. As before, the only visible change is the scale of the plot axes.

In [6]:
std = StandardScaler()
fdata = std.fit_transform(iris['data'])
fig = plt.figure(figsize=(8,8))
plt.scatter(fdata[:, 2], fdata[:, 1], c=lc,s=100);

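
The standard score subtracts the column mean and divides by the column standard deviation. `StandardScaler` uses the population standard deviation (`ddof=0`), which is also the NumPy default. The sketch below is a hand-written version of that formula. The function name `standard_score`, the guard for constant columns, and the toy matrix are assumptions made for this illustration.

```python
import numpy as np

def standard_score(X):
    """Center each column of X to mean 0 and scale it to unit standard deviation.

    Hypothetical helper for illustration.
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)  # population standard deviation (ddof=0)
    # Guard chosen for this sketch: avoid dividing by zero on constant columns
    std[std == 0] = 1.0
    return (X - mean) / std

# Toy matrix (made up for this example): two columns on very different scales
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [4.0, 50.0],
                [5.0, 40.0]])
z = standard_score(toy)
print(z.mean(axis=0))  # approximately 0 for every column
print(z.std(axis=0))   # approximately 1 for every column
```

Unlike min-max scaling, the result is not confined to a fixed interval. Values are expressed as a number of standard deviations from the column mean.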