# PREPROCESSING

# Data checking

## Get familiar with your data

Before doing any calculation, just have a look at your data. Even with curated data sets, the answer to one or more of the following questions could be "yes":
- Is there anything wrong with the data?
- Are there any quirks with the data?
- Do I need to fix or remove any of the data?

Let's read some data!

In [None]:
import pandas as pd

init_data = pd.read_csv('init_data.csv')
init_data

By visual inspection (we see the first 30 and the last 30), we can detect some possible quirks:

- v5 seems to be constant
- v6 seems to be -v1

We can try to confirm the suspicions.

In [None]:
init_data.describe()

Indeed:
- v5 has zero standard deviation, so it is constant. We remove it.
- v6 and v1 have opposite means, the same std, and mirrored min, max, and quartiles (the min of one is minus the max of the other, and so on). We check whether v6 is equal to -v1. If so, we remove v6.

In [None]:
list(init_data.v6.values) == list(init_data.v1.values * (-1))

In [None]:
init_data.drop('v5', axis=1, inplace=True)
init_data.drop('v6', axis=1, inplace=True)

Something else? What about v4 and v7? 
A picture is worth a thousand words

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Recent seaborn versions use `height` instead of the former `size` argument.
sns.pairplot(init_data, hue='c', vars=['v1', 'v2', 'v3', 'v4', 'v7'], height=2.5, palette=None);
plt.show()

v4 and v7 show the same relationship with v1, v2, and v3, apart from a shift along the axis.

It seems that v7 is equal to v4+2.5. Let's check it.

In [None]:
list(init_data.v7.values) == list(init_data.v4.values + 2.5)
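A word of caution about the check above: comparing floats for exact equality can fail because of rounding, even when two columns are "the same" for all practical purposes. The following is a minimal sketch with made-up values (the arrays `v4` and `v7` here are hypothetical, not read from `init_data.csv`) showing how `np.allclose` compares within a tolerance instead.

```python
import numpy as np

# Hypothetical values, only for illustration.
v4 = np.array([0.1, 0.2, 0.3, 1.1])
# Hypothetical v7: v4 shifted by 2.5, plus a tiny rounding-sized error.
v7 = v4 + 2.5 + 1e-12

# Exact element-by-element comparison, as in the cell above.
exact_match = list(v7) == list(v4 + 2.5)

# Comparison within NumPy's default tolerances (rtol=1e-05, atol=1e-08).
tolerant_match = bool(np.allclose(v7, v4 + 2.5))

print(exact_match, tolerant_match)
```

If the exact comparison returns False for columns that look identical in the plots, a tolerant comparison of this kind is worth trying before concluding that they differ.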

We remove v7.

In [None]:
init_data.drop('v7', axis=1, inplace=True)
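The quirks found so far (a constant column and columns that are shifted or negated copies of others) can also be searched for automatically. The sketch below is one possible way to do it, on a hypothetical toy frame built to mimic them; the column names and values are assumptions, not the contents of `init_data.csv`. A correlation of ±1 only tells us that two columns are affinely related; the exact relation still has to be checked as we did above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame mimicking the quirks found above.
rng = np.random.default_rng(0)
v1 = rng.normal(size=50)
v4 = rng.normal(size=50)
toy = pd.DataFrame({"v1": v1, "v4": v4, "v5": 1.0, "v6": -v1, "v7": v4 + 2.5})

# Constant columns have a single distinct value.
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]

# Pairs whose absolute correlation is (numerically) 1 are affine functions of each other.
corr = toy.drop(columns=constant_cols).corr().abs()
cols = list(corr.columns)
redundant_pairs = [
    (a, b)
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if np.isclose(corr.loc[a, b], 1.0)
]

print(constant_cols)
print(redundant_pairs)
```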

Our data set, as it is now, is a famous data set used by Fisher in 1936, called Iris.
The variables v1 to v4 correspond to measurements of sepal length, sepal width, petal length, and petal width of three species of Iris plants (Iris Setosa, Iris Versicolor, and Iris Virginica; corresponding to 1, 2 and 3 in our variable c).

In [None]:
iris = init_data
iris.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
iris.species.replace([1, 2, 3], ['Setosa', 'Versicolor', 'Virginica'], inplace=True)
sns.pairplot(iris, hue='species', vars=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'], height=2.5, palette=None);
plt.show()

This data set was conceived for determining the species of an Iris plant from the four measurements. Therefore, it is a classification problem with three classes, corresponding to the three species.

It seems that Setosa is separable from the rest just by looking at petal lengths or widths.

The separation between Versicolor and Virginica, however, is not as obvious.

## Missing values and outliers

Now, we will have a look at outliers and missing values. We consider the following (naïve) approaches:
- Outliers: Remove rows containing outliers. We will consider both individual and collective outlier detection.
- Missing values (nan): Imputation using the mean value, or row removal. "nan" stands for "not a number".

Open question 1: Which one would you perform first?
- Outliers + nan? => Removing a row because of an outlier changes the mean of every column, not only of the one containing the outlier, and those means are then used for the nan imputation. (We assume that nan are ignored when the mean and std are first computed.)
- nan + outliers? => Imputing with the mean leaves the column mean unchanged, but it reduces the std, which affects the subsequent outlier detection.
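The second point can be checked with a small worked example. This is only a sketch on a made-up column (the values are hypothetical), using the sample standard deviation from the standard library.

```python
import math
import statistics

# Hypothetical column with two missing values.
column = [4.0, 5.0, math.nan, 7.0, math.nan, 8.0]

# Mean and std computed while ignoring the nan.
observed = [x for x in column if not math.isnan(x)]
mean_observed = statistics.mean(observed)
std_observed = statistics.stdev(observed)

# Mean imputation: every nan is replaced by the mean of the observed values.
imputed = [mean_observed if math.isnan(x) else x for x in column]
mean_imputed = statistics.mean(imputed)
std_imputed = statistics.stdev(imputed)

print("mean before/after:", mean_observed, mean_imputed)
print("std  before/after:", round(std_observed, 3), round(std_imputed, 3))
```

The mean stays the same, while the std shrinks because the imputed values sit exactly at the center of the distribution.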

Let's see what happens in our Iris data in both outliers and missing values treatments separately.

### Missing values

We have artificially introduced some nan in petal length and petal width.

In [None]:
nan_data = pd.read_csv('nan_data.csv')
nan_data.species.replace(['Setosa', 'Versicolor', 'Virginica'], [1, 2, 3], inplace=True)
nan_data.describe()

In the count for petal length and width we can see that there are 6 nan in each. 

In [None]:
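# SimpleImputer replaces the former sklearn.preprocessing.Imputer, removed in recent scikit-learn versions.
from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy='mean')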
nan_imputed_data = pd.DataFrame(data=imp.fit_transform(nan_data))
nan_imputed_data.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
nan_imputed_data.describe()

Let's see how the plots look now!

In [None]:
nan_imputed_data.species.replace([1, 2, 3], ['Setosa', 'Versicolor', 'Virginica'], inplace=True)
sns.pairplot(nan_imputed_data, hue='species', vars=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'], height=2.5);
plt.show()

No obvious separation is now possible.

What if we had just ignored those rows?

In [None]:
# dropna returns a new frame, so nan_data itself is left untouched.
nan_removed_data = nan_data.dropna(axis=0, how='any')
nan_removed_data.describe()


In [None]:
nan_removed_data.species.replace([1, 2, 3], ['Setosa', 'Versicolor', 'Virginica'], inplace=True)
sns.pairplot(nan_removed_data, hue='species', vars=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'], height=2.5);
plt.show()

Since we introduced the nan artificially, we will not use nan_removed_data any further.

### Outliers

#### Collectively

We first check collectively, using the Mahalanobis distance.

In [None]:
from sklearn.covariance import EllipticEnvelope
outlier_data = pd.read_csv('iris_data.csv')
outlier_data.species.replace(['Setosa', 'Versicolor', 'Virginica'], [1, 2, 3], inplace=True)
# Fit a robust Gaussian envelope; predict() returns -1 for outliers and 1 for inliers.
elip_env = EllipticEnvelope().fit(outlier_data)
detection = elip_env.predict(outlier_data)
outlier_positions_mah = [x for x in range(outlier_data.shape[0]) if detection[x] == -1]
# Test the list of positions: `detection is []` would always be False.
if not outlier_positions_mah:
    print("There are no outliers in the data.")
else:
    print("The outliers found are in positions:\n" + str(outlier_positions_mah))
    classes_names = ['Setosa', 'Versicolor', 'Virginica']
    print("They correspond respectively to classes:\n" +
          str([classes_names[x-1] for x in outlier_data.species.values[outlier_positions_mah]]))

Graphically,

In [None]:
outlier_data.species.values[outlier_positions_mah] += 3
outlier_data.species.replace([1, 2, 3, 4, 5, 6], 
                             ['Setosa', 'Versicolor', 'Virginica', 'Outliers Setosa',
                              'Outliers Versicolor', 'Outliers Virginica'], inplace=True)
sns.pairplot(outlier_data, hue='species', vars=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'], height=2.5);
plt.show()

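No single variable is responsible on its own for those 15 rows being flagged as collective outliers. (By default, EllipticEnvelope assumes that 10% of the rows are outliers, contamination=0.1, which would explain why exactly 15 rows are flagged if the file has the usual 150 Iris rows.)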

Why? Think of a 3D ellipsoid whose axis lengths are a, b, and c, inscribed in a box whose side lengths are also a, b, and c.
A point in a corner of the box is inside the box, but outside the ellipsoid.
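The box-versus-ellipsoid picture can be made concrete with a few lines of code. This is only a sketch: the lengths `a`, `b`, `c` and the test point are hypothetical, and both shapes are assumed to be centered at the origin. The quantity computed in `inside_ellipsoid` plays the role of a squared Mahalanobis distance for uncorrelated variables.

```python
# Hypothetical full axis lengths of the ellipsoid (and side lengths of the box).
a, b, c = 3.0, 2.0, 1.0

def inside_box(x, y, z):
    """Each coordinate is checked on its own, as in a variable-by-variable test."""
    return abs(x) <= a / 2 and abs(y) <= b / 2 and abs(z) <= c / 2

def inside_ellipsoid(x, y, z):
    """All coordinates are combined into a single distance, as in a collective test."""
    return (2 * x / a) ** 2 + (2 * y / b) ** 2 + (2 * z / c) ** 2 <= 1.0

# A point close to one corner of the box.
corner = (0.9 * a / 2, 0.9 * b / 2, 0.9 * c / 2)

print(inside_box(*corner), inside_ellipsoid(*corner))
```

The point is unremarkable in every single coordinate, yet it is far from the center once the three coordinates are considered together.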

#### Individually

Now we check individually, i.e. variable by variable. We opt for the robust option based on boxplots, i.e.
$$x \in X \text{ is an outlier} \:\:\Leftrightarrow \:\: x\notin \left[Q_1 - 1.5 \cdot IQR,\; Q_3 + 1.5 \cdot IQR\right]$$
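As a minimal sketch of this rule, the helper below (the name `iqr_outlier_mask` and the sample values are hypothetical) flags the values falling outside the whiskers. It uses `np.percentile`, whose default linear interpolation matches the quartiles reported by `describe()`.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean array that is True where a value lies outside the whiskers."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Hypothetical measurements with one very large and one very small value.
sample = [2.9, 3.0, 3.1, 3.0, 3.2, 2.8, 3.1, 4.4, 2.0]
mask = iqr_outlier_mask(sample)

print([v for v, is_outlier in zip(sample, mask) if is_outlier])
```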


In [None]:
outlier_data.species.replace(['Setosa', 'Versicolor', 'Virginica', 'Outliers Setosa', 'Outliers Versicolor', 'Outliers Virginica'], [1, 2, 3, 1, 2, 3], inplace=True)
ax = sns.boxplot(data=outlier_data[outlier_data.columns[:-1]], orient="h", palette="Set2", linewidth=2.5)
plt.show()

A closer look at sepal width:

In [None]:
# For a horizontal boxplot the variable goes on the x axis.
ax2 = sns.boxplot(x="sepal_width", data=outlier_data, orient="h", color=sns.color_palette("Set2")[1], linewidth=2.5)
plt.show()

We find them:

In [None]:
# Quartiles of sepal width (same linear interpolation as describe()).
q1, q3 = outlier_data.sepal_width.quantile([0.25, 0.75])
IQR = q3 - q1
whiskers = [q1 - (1.5 * IQR), q3 + (1.5 * IQR)]
outlier_positions_box = [x for x in range(outlier_data.shape[0]) if outlier_data.sepal_width.values[x] < whiskers[0] or outlier_data.sepal_width.values[x] > whiskers[1]]
print("The outliers found are in positions:\n" + str(outlier_positions_box))
print("They correspond respectively to sepal widths:\n" + str(outlier_data.sepal_width.values[outlier_positions_box]))
classes_names = ['Setosa', 'Versicolor', 'Virginica']
print("They correspond respectively to classes:\n" + str([classes_names[x-1] for x in outlier_data.species.values[outlier_positions_box]])) 

Graphically,

In [None]:
outlier_data.species.values[outlier_positions_box] += 3
outlier_data.species.replace([1, 2, 3, 4, 5, 6], 
                             ['Setosa', 'Versicolor', 'Virginica', 'Outliers Setosa',
                              'Outliers Versicolor', 'Outliers Virginica'], inplace=True)
sns.pairplot(outlier_data, hue='species', vars=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'], height=2.5);
plt.show()

Open question 2: Which outlier treatment order would you choose?
- First individual, then collective? => Then the ellipsoid for the collective detection changes after the individual one.
- First collective, then individual? => Then the boxplots change after the collective detection.
- Both in parallel? => Then the outliers still present affect both the ellipsoid and the boxplots.

We will do it here in parallel.

In [None]:
import numpy as np
outlier_data.species.replace(['Setosa', 'Versicolor', 'Virginica', 'Outliers Setosa', 'Outliers Versicolor', 'Outliers Virginica'], [1, 2, 3, 1, 2, 3], inplace=True)
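# Work on a copy so that outlier_data itself is not modified.
outlier_free_data = outlier_data.copy()
# Union of both detections, sorted and without duplicates.
outlier_positions = sorted(set(outlier_positions_mah + outlier_positions_box))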
outlier_free_data.drop(outlier_free_data.index[outlier_positions], inplace=True)
outlier_free_data.describe()

Graphically,

In [None]:
outlier_free_data.species.replace([1, 2, 3], ['Setosa', 'Versicolor', 'Virginica'], inplace=True)
sns.pairplot(outlier_free_data, hue='species', vars=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'], height=2.5);
plt.show()

Notice that the three classes are now almost separated.

# Discretization

There is no implementation of the Fayyad-Irani MDLP discretization algorithm in any standard Python package. Therefore, we have a perfect excuse to use Orange.
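To give an idea of what the algorithm does, here is a minimal sketch of a single MDLP step: it looks for the cut point with the largest information gain and accepts it only if the gain exceeds the MDL-based threshold of Fayyad and Irani. This is not Orange's implementation; the function names and the toy data are hypothetical, and the full algorithm would apply the same step recursively to each side of an accepted cut.

```python
import math
from collections import Counter

def entropy(labels):
    """Class entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_mdlp_cut(values, labels):
    """Return the accepted cut point for one split, or None if MDLP rejects it."""
    pairs = sorted(zip(values, labels))
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    n = len(ys)
    total_entropy = entropy(ys)
    best = None  # (gain, cut, left_labels, right_labels)
    for i in range(1, n):
        if xs[i] == xs[i - 1]:
            continue  # no cut is possible between equal values
        left, right = ys[:i], ys[i:]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        gain = total_entropy - weighted
        if best is None or gain > best[0]:
            best = (gain, (xs[i - 1] + xs[i]) / 2, left, right)
    if best is None:
        return None
    gain, cut, left, right = best
    # MDL stopping criterion: number of classes overall and on each side.
    k, k1, k2 = len(set(ys)), len(set(left)), len(set(right))
    delta = math.log2(3 ** k - 2) - (
        k * total_entropy - k1 * entropy(left) - k2 * entropy(right)
    )
    threshold = (math.log2(n - 1) + delta) / n
    return cut if gain > threshold else None

# Hypothetical toy data: the classes are cleanly separated at 3.5.
print(best_mdlp_cut([1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1]))
# Hypothetical toy data: alternating classes, no cut is worth its cost.
print(best_mdlp_cut([1, 2, 3, 4, 5, 6], [0, 1, 0, 1, 0, 1]))
```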

But first, a note about the csv file format that Orange expects.

Notice that a usual csv file looks like this:

sepal_length,sepal_width,petal_length,petal_width,species

7.7,2.8,6.7,2.0,Virginica

6.0,2.2,4.0,1.0,Versicolor

6.6,3.0,4.4,1.4,Versicolor

6.7,3.3,5.7,2.1,Virginica

4.9,3.1,1.5,0.1,Setosa

...

For Orange you need to add 2 extra rows, containing:
- First, the "type of data" of each attribute ("c" for "continuous", "d" for "discrete", and "s" for "string").
- Second, the "attribute kind" ("class" for the attribute of interest, and "meta" for "metadata", i.e. attributes providing some extra information like, for instance, an index).

In our previous example:

sepal_length,sepal_width,petal_length,petal_width,species

c,c,c,c,d

, , , ,class

7.7,2.8,6.7,2.0,Virginica

6.0,2.2,4.0,1.0,Versicolor

6.6,3.0,4.4,1.4,Versicolor

6.7,3.3,5.7,2.1,Virginica

4.9,3.1,1.5,0.1,Setosa

...
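Such a file can be produced from Python by writing the two extra rows right after the header. The following is a small sketch using only the standard library; the in-memory buffer and the two data rows are just for illustration, and in practice one would write to a file instead.

```python
import csv
import io

header = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
types = ["c", "c", "c", "c", "d"]    # four continuous attributes, one discrete
kinds = ["", "", "", "", "class"]    # only the attribute of interest is flagged
rows = [
    [7.7, 2.8, 6.7, 2.0, "Virginica"],
    [6.0, 2.2, 4.0, 1.0, "Versicolor"],
]

# Write the header, the two Orange-specific rows, and then the data.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(header)
writer.writerow(types)
writer.writerow(kinds)
writer.writerows(rows)
orange_csv = buffer.getvalue()

print(orange_csv)
```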

Preprocessing is not a goal in itself. Its aim is to prepare the data for a subsequent task (learning). The task we will choose is a decision tree.

In [None]:
# IPython.display is the public module; importing from IPython.core.display is deprecated.
from IPython.display import Image, display
print("Some data")
display(Image(filename='DecisionTreeSampleData.png'))

print("Decision tree levels")
display(Image(filename='DecisionTreeLevels.png'))

We will compare the performance of our four Iris datasets:
- Raw data
- Raw discretized data
- Cleaned data
- Cleaned discretized data

Orange screen shot:

In [None]:
display(Image(filename='OrangeScreenShot1.png'))

# Feature extraction

We will explore principal component analysis (PCA). We will see how many principal components (PCs) are selected when we want to capture at least 95% of the total variance, as well as the linear combinations defining them. We do not need to perform mean centering because PCA from Scikit-Learn does it internally.
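To see what "capturing at least 95% of the total variance" means in practice, the sketch below does the computation by hand with NumPy on hypothetical random data (it is not the Iris data, and it is an illustration of the idea rather than Scikit-Learn's code): center the data, take the eigenvalues of the covariance matrix, and keep the smallest number of components whose cumulative share of the variance reaches the threshold.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 4 features, two with a large spread and two with a small one.
X = rng.normal(size=(200, 4)) * np.array([3.0, 2.0, 0.2, 0.1])

X_centered = X - X.mean(axis=0)                # mean centering
cov = np.cov(X_centered, rowvar=False)         # 4 x 4 covariance matrix
eigenvalues = np.linalg.eigvalsh(cov)[::-1]    # variances of the PCs, in decreasing order

explained_ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained_ratio)

# Smallest number of components whose cumulative ratio reaches 0.95.
n_selected = int(np.searchsorted(cumulative, 0.95) + 1)

print(np.round(cumulative, 3))
print(n_selected)
```

With the fitted Scikit-Learn object, the same ratios are available in `pca.explained_variance_ratio_`.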

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=0.95)

First with raw data:

In [None]:
raw_data = pd.read_csv('iris_data.csv')
raw_data.species.replace(['Setosa', 'Versicolor', 'Virginica'], [1, 2, 3], inplace=True)
X_raw = raw_data[raw_data.columns[:-1]]
pca.fit(X_raw)
X_reduced_raw = pca.transform(X_raw)
raw_pca_data = pd.DataFrame(data=X_reduced_raw, columns=["PC1", "PC2"])
raw_pca_data = pd.concat([raw_pca_data, raw_data[raw_data.columns[-1]]], axis=1)
print("Number of principal components selected: " + str(X_reduced_raw.shape[1]))
print("Meaning of the 2 components:")
for component in pca.components_:
    print(" + ".join("%.3f x %s" % (value, name) for value, name in zip(component, ["sepal_length", "sepal_width", "petal_length", "petal_width"])))

Now with clean data

In [None]:
clean_data = pd.read_csv('iris_clean_data.csv')
clean_data.species.replace(['Setosa', 'Versicolor', 'Virginica'], [1, 2, 3], inplace=True)
X_clean = clean_data[clean_data.columns[:-1]]
pca.fit(X_clean)
X_reduced_clean = pca.transform(X_clean)
clean_pca_data = pd.DataFrame(data=X_reduced_clean, columns=["PC1", "PC2"])
clean_pca_data = pd.concat([clean_pca_data, clean_data[clean_data.columns[-1]]], axis=1)
print("Number of principal components selected: " + str(X_reduced_clean.shape[1]))
print("Meaning of the 2 components:")
for component in pca.components_:
    print(" + ".join("%.3f x %s" % (value, name) for value, name in zip(component, ["sepal_length", "sepal_width", "petal_length", "petal_width"])))

Let's compare both by means of scatter plots

In [None]:
fig, (ax1, ax2) = plt.subplots(figsize=(16, 6), ncols=2)
fig.subplots_adjust(left=0.0625, right=0.95, wspace=0.3)
y_raw = raw_data.species.values
y_clean = clean_data.species.values
ax1.scatter(X_reduced_raw[:, 0], X_reduced_raw[:, 1], c=y_raw, alpha = 1.0)
ax1.set_title('Raw data')
ax1.set_xlabel('PC1')
ax1.set_ylabel('PC2')
ax2.scatter(X_reduced_clean[:, 0], X_reduced_clean[:, 1], c=y_clean, alpha = 1.0)
ax2.set_title('Clean data')
ax2.set_xlabel('PC1')
ax2.set_ylabel('PC2')
plt.show()

We can expect better performance with the clean data, at least for low depth levels. We check it in Orange.

Orange screen shot:

In [None]:
display(Image(filename='OrangeScreenShot2.png'))

### Comparison with feature selections

As we have only 4 features, there are only 6 possible subsets of size 2. We consider all six 2-feature subsets of the clean data.
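The six pairs can be enumerated directly. This short sketch uses `itertools.combinations` with the feature names of our data set.

```python
from itertools import combinations

features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# All unordered pairs of distinct features.
pairs = list(combinations(features, 2))

for pair in pairs:
    print(pair)
print(len(pairs))
```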

Questions: Which pair do you think will work best? Better than the PCs? Better than the original 4-feature data?

Answer: It depends on the algorithm you use.

The right questions: Which pair do you think will work best with decision trees? Better than the PCs? Better than the original 4-feature data?

Here is a reminder of the 2-feature pairs:

In [None]:
display(Image(filename='FSS2Example.png'))

We will compare them with PCA and also with the original clean data in Orange.

Orange screen shot:

In [None]:
display(Image(filename='OrangeScreenShot3.png'))

# Imbalanced data

There is a powerful Python package called Imbalanced-Learn, developed by some of the Scikit-Learn developers.

It is developed on GitHub (see https://github.com/scikit-learn-contrib/imbalanced-learn), and there is also an official website (see http://imbalanced-learn.org/en/stable/) where you can find all the info you might need.

I strongly recommend reading the user guide (see http://imbalanced-learn.org/en/stable/user_guide.html), as well as the general examples that complement it (see http://imbalanced-learn.org/en/stable/auto_examples/index.html).