# Data pre-processing in Python

This exercise shows basic pre-processing techniques (normalization, standardization, discretization, binarization, imputation of missing values, one-hot encoding) using `scikit-learn` and `pandas`.

In [None]:
import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt

from sklearn import datasets, preprocessing

## Fisher's irises

The function `load_iris()` creates an object which contains the famous [Fisher's irises](https://en.wikipedia.org/wiki/Iris_flower_data_set) dataset.

In [None]:
iris = datasets.load_iris()

print('Feature names: ', iris.feature_names)
print('Labels: ', iris.target_names)

print('shape of the data: ', iris.data.shape)
print('shape of labels: ', iris.target.shape)

In this exercise we will use `pandas` to store the data and intermediate computations. 

In [None]:
# creating a dataframe based on a NumPy array of feature values
df = pd.DataFrame(iris.data)

# adding new column
df['target'] = iris.target

# renaming the columns: feature names followed by 'target'
df.columns = iris.feature_names + ['target']

df.head(n=10)
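
As an aside, the same table can be obtained in one step. The sketch below assumes scikit-learn 0.23 or newer, where `load_iris()` accepts the `as_frame=True` argument; the names `iris_frame` and `df_alt` are illustrative and are not used elsewhere in this exercise.

```python
from sklearn import datasets

# as_frame=True asks scikit-learn to return pandas objects instead of NumPy arrays
iris_frame = datasets.load_iris(as_frame=True)

# .frame holds the four feature columns followed by a 'target' column
df_alt = iris_frame.frame

print(df_alt.shape)
print(df_alt.columns.tolist())
```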

Each column in the `DataFrame` is a `pd.Series` object with [a rich API](https://pandas.pydata.org/docs/reference/series.html). 

In [None]:
df['sepal length (cm)'].describe()

We will use the `apply()` method to apply *ad hoc* functions to each element of a given Series.

In [None]:
df['sepal length (cm)'].head().apply(lambda x: x > 5.0)
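
For simple element-wise tests like the one above, a vectorized comparison gives the same result without an anonymous function. A minimal sketch on a small hand-made Series (the values and variable names are chosen only for illustration):

```python
import pandas as pd

s = pd.Series([4.9, 5.0, 5.1, 6.3])

# element-wise test through apply() and an anonymous function
flags_apply = s.apply(lambda v: v > 5.0)

# the same test written as a vectorized comparison
flags_vector = s > 5.0

print(flags_apply.tolist())
print(flags_vector.tolist())
```

`apply()` remains useful when the function cannot be expressed with built-in vectorized operations.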

Plots can be drawn easily with [Matplotlib](https://matplotlib.org), a handy library for simple data visualization.

In [None]:
x = df['sepal length (cm)'][:]
y = df['sepal width (cm)'][:]
t = df['target']

plt.scatter(x, y, c=t)
plt.show()

A similar effect can be obtained by calling the `plot()` method of a `pandas.DataFrame` directly. In the following example `iloc` stands for *integer-location based indexing* and selects all rows (`:`) and the second and third columns (`[1,2]`; columns are 0-indexed). The arguments `x=0` and `y=1` are positions within the selected columns.

In [None]:
df.iloc[:,[1,2]].plot(kind='scatter', x=0, y=1)
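
The difference between positional and label-based selection is easier to see on a tiny table. The following sketch uses a hypothetical three-column DataFrame named `toy`, which is not part of the Iris data:

```python
import pandas as pd

toy = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

# all rows, second and third columns selected by position
by_position = toy.iloc[:, [1, 2]]

# the same selection written with column labels
by_label = toy.loc[:, ['b', 'c']]

print(by_position.columns.tolist())
print(by_position.equals(by_label))
```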

## Normalization

The first operation is the linear normalization performed by the [MinMaxScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler) class. This class performs the following transformation of an attribute:

$$v' = \frac{v - min}{max - min} \cdot (max' - min') + min'$$

where $min$ and $max$ are the original minimum and maximum values of the attribute, $min'$ and $max'$ are the minimum and maximum values in the new scale, $v$ is the original value of the attribute, and $v'$ is its new value.
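
To make the formula concrete, the sketch below applies it by hand to a single toy attribute and compares the result with `MinMaxScaler`. The three values and the target range of (-1, 1) are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn import preprocessing

v = np.array([[2.0], [4.0], [10.0]])   # one attribute, three instances
new_min, new_max = -1.0, 1.0           # target scale, chosen for illustration

# the formula applied by hand
manual = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# the same transformation through MinMaxScaler
scaler = preprocessing.MinMaxScaler(feature_range=(new_min, new_max))
scaled = scaler.fit_transform(v)

print(manual.ravel())
print(np.allclose(manual, scaled))
```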

Since we are only transforming features (and not the label column), in the first step we will save these feature columns (and their names) to new variables.

In [None]:
X = df.iloc[:, :-1]
cols = df.columns[:-1]

The following example shows how to normalize the entire table.

In [None]:
norm = preprocessing.MinMaxScaler(feature_range=(0,1)).fit(X)
X_minmax = pd.DataFrame(norm.transform(X), columns=cols)

X_minmax.head()

In [None]:
X_minmax.describe()

In [None]:
x = X_minmax['sepal length (cm)'][:]
y = X_minmax['sepal width (cm)'][:]
t = df['target']

plt.scatter(x, y, c=t)
plt.show()

## Standardization

Another type of feature normalization is standardization, after which each feature has a mean of 0 and a standard deviation of 1. In the `scikit-learn` library this operation can be achieved using the [StandardScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler) class, which performs the following transformation:

$$v' = \frac{v-\mu}{\sigma}$$

where $\mu$ is the mean value of the feature, and $\sigma$ is its standard deviation.

In [None]:
scale = preprocessing.StandardScaler().fit(X)
X_scaled = pd.DataFrame(scale.transform(X), columns=cols)

X_scaled.head()

In [None]:
X_scaled.describe()
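
The `std` row in the summary above can come out slightly larger than 1. `StandardScaler` divides by the population standard deviation (`ddof=0`), while `describe()` reports the sample standard deviation (`ddof=1`). A minimal sketch on a toy column (the values are illustrative) shows both:

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

v = pd.DataFrame({'feature': [1.0, 2.0, 3.0, 4.0, 5.0]})

# manual standardization with the population standard deviation (ddof=0)
manual = (v - v.mean()) / v.std(ddof=0)

scaled = pd.DataFrame(preprocessing.StandardScaler().fit_transform(v), columns=v.columns)

print(np.allclose(manual, scaled))
print(scaled['feature'].std(ddof=0))   # population standard deviation
print(scaled['feature'].std())         # sample standard deviation, as in describe()
```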

In [None]:
x = X_scaled['sepal length (cm)'][:]
y = X_scaled['sepal width (cm)'][:]
t = df['target']

plt.scatter(x, y, c=t)
plt.show()

## Discretization 

An alternative to manual binning of numerical attributes (`scikit-learn` does not provide an explicit class for that task) is automatic range detection with the [KBinsDiscretizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html#sklearn.preprocessing.KBinsDiscretizer) class. This class divides the attribute into *k* ranges. With `strategy='kmeans'`, used below, the ranges come from a one-dimensional k-means clustering, so each value is assigned to the bin of its nearest cluster center.

In [None]:
kbin = preprocessing.KBinsDiscretizer(n_bins=3, strategy='kmeans', encode='ordinal').fit(df[['sepal length (cm)']])

df_kbinned = pd.DataFrame(kbin.transform(df[['sepal length (cm)']]))

x = df['sepal length (cm)'][:]
y = df_kbinned[0]  # the single column of ordinal bin codes
t = df['target']

plt.scatter(x, y, c=t)
plt.show()
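
The detected ranges can be inspected through the `bin_edges_` attribute of a fitted discretizer. The sketch below compares the `'uniform'` and `'kmeans'` strategies on a small hand-made attribute; the values and the dictionaries `edges` and `codes` are illustrative only.

```python
import numpy as np
from sklearn import preprocessing

values = np.array([[0.0], [1.0], [2.0], [3.0], [9.0], [10.0]])

edges, codes = {}, {}
for strategy in ('uniform', 'kmeans'):
    kb = preprocessing.KBinsDiscretizer(n_bins=2, strategy=strategy, encode='ordinal')
    codes[strategy] = kb.fit_transform(values).ravel()
    # bin_edges_ holds one array of edges per input feature
    edges[strategy] = kb.bin_edges_[0]
    print(strategy, edges[strategy], codes[strategy])
```

With `'uniform'` the inner edge lies halfway between the minimum and the maximum, whereas `'kmeans'` places it between the two groups of values.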

## Binarization

Sometimes a feature must be transformed into a binary flag which denotes the result of a logical test conducted on the values of the feature. For a numerical threshold this can be easily done using the [Binarizer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html) class: values greater than the threshold become 1 and all other values become 0.

In [None]:
binarize = preprocessing.Binarizer(threshold=3).fit(X)

X_binned = pd.DataFrame(binarize.transform(X), columns=cols)

pd.concat([df,X_binned], axis=1).head()
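
The behavior at the threshold itself is worth checking. The sketch below, on three illustrative values around a threshold of 3, compares `Binarizer` with a plain NumPy comparison:

```python
import numpy as np
from sklearn import preprocessing

values = np.array([[2.9, 3.0, 3.1]])

# Binarizer maps values strictly greater than the threshold to 1, the rest to 0
flags = preprocessing.Binarizer(threshold=3).fit_transform(values)

# the equivalent comparison written directly in NumPy
manual = (values > 3).astype(float)

print(flags)
print(np.array_equal(flags, manual))
```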

## Displaying histograms

Simple counting of values in a feature can be done using:

- `pandas.Series.value_counts()`
- `collections.Counter`

The easiest way to plot a histogram is to use `pandas.Series.hist()`.

In [None]:
X_binned['sepal width (cm)'].value_counts()

In [None]:
from collections import Counter

Counter(X_binned['sepal width (cm)'].values)

In [None]:
X_binned['sepal width (cm)'].hist()

## Imputation of missing values

Missing values can significantly distort the results of the analysis, and many learning algorithms do not accept input data which contains them. [SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) replaces missing values with the mean, the median, or the mode of the attribute, computed from its non-missing values.

In [None]:
from sklearn.impute import SimpleImputer

matrix = np.array([[ 1, 2, np.nan], [np.nan, 4, 5], [6, np.nan, 7]])

# alternatives to 'mean' are 'median', 'most_frequent' and 'constant'
imp = SimpleImputer(missing_values=np.nan, strategy='mean').fit(matrix)

print(matrix)
print()
print(imp.transform(matrix))
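
The fill values learned during `fit()` are stored in the `statistics_` attribute. The sketch below uses a small illustrative matrix (named `toy`, not part of the exercise data) in which the mean and the median of the first column differ:

```python
import numpy as np
from sklearn.impute import SimpleImputer

toy = np.array([[1, 10], [2, np.nan], [9, 30], [np.nan, 50]])

learned = {}
for strategy in ('mean', 'median'):
    imp = SimpleImputer(missing_values=np.nan, strategy=strategy).fit(toy)
    # statistics_ stores the fill value learned for each column
    learned[strategy] = imp.statistics_
    print(strategy, learned[strategy])
```

The median is less sensitive to the outlying value 9 in the first column, which is one reason to prefer it for skewed attributes.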

## Label encoding

A very useful class is the [OneHotEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder) which transforms categorical attributes into a set of binary features using the one-hot encoding. The transformation creates *k* new features, where *k* is the number of distinct values of the transformed attribute.

In [None]:
df_target = df['target'].values

print(df_target)

In [None]:
one_hot = preprocessing.OneHotEncoder(categories='auto').fit(df_target.reshape(-1,1))

one_hot.transform(df_target.reshape(-1,1)).toarray()

In [None]:
one_hot.inverse_transform(np.array([[1,0,0]]))
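
The encoder also works on string categories. The sketch below assumes a small hand-made column of species names (not taken from the `DataFrame` above) and shows the learned categories, the encoded rows, and the inverse mapping:

```python
import numpy as np
from sklearn import preprocessing

species = np.array(['setosa', 'virginica', 'versicolor', 'setosa']).reshape(-1, 1)

enc = preprocessing.OneHotEncoder(categories='auto').fit(species)

# categories_ lists the distinct values found in each input column, in sorted order
print(enc.categories_[0])

# the default output is sparse; toarray() converts it to a dense NumPy array
encoded = enc.transform(species).toarray()
print(encoded)

# inverse_transform maps one-hot rows back to the original category
print(enc.inverse_transform(encoded[:1]))
```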

## Exercise

Look up the docs for the [Normalizer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html#sklearn.preprocessing.Normalizer) class, which normalizes individual instances (rows) of the training set. Normalize the *Iris* dataset and observe what happens when you modify the value of the `norm` argument of the class constructor.

*Hint*: use the [DataFrame.sum()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sum.html) method.