# Project 2: Classification
---

This notebook provides a solution to Project 2 of the module Introduction to Machine Learning 2019 @ ETHZ.

---


## Environmental Set-Up

We first set up the environment, load the packages required later, and fix the random seed globally.

In [3]:
import warnings
import pandas as pd
import numpy as np
import seaborn as sn
import sklearn as sl
import datetime
import random
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.model_selection import train_test_split, KFold
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import ShuffleSplit
from sklearn import preprocessing
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA, FastICA
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, RFECV, RFE
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from skrebate import ReliefF, MultiSURFstar


%matplotlib inline
sn.set_context('notebook')
%config InlineBackend.figure_format = 'retina'
random.seed(1234)
np.random.seed(1234)  # scikit-learn draws from NumPy's global RNG when no random_state is given
warnings.filterwarnings('ignore')
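
Python's `random` module and NumPy keep separate global generators, so seeding one does not seed the other. A minimal sketch of that distinction, reusing the seed value from above (the variable names `first`, `second` and `third` are illustrative):

```python
import random
import numpy as np

# Seed both generators, then draw from NumPy
random.seed(1234)
np.random.seed(1234)
first = np.random.rand(3)

# Re-seeding NumPy reproduces the same draw
np.random.seed(1234)
second = np.random.rand(3)

# Re-seeding only Python's generator leaves NumPy's state untouched,
# so this draw continues the NumPy stream instead of restarting it
random.seed(1234)
third = np.random.rand(3)

print(np.allclose(first, second), np.allclose(first, third))
```

Estimators and splitters that receive an explicit integer `random_state`, as in the cells below, do not depend on the global state at all.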


---
## Project 2

The following section solves Project 2 of the Introduction to Machine Learning course 2019.

---

### Formatting the data

To start off, we load the data from the file system into a pandas data frame.

In [6]:
# Get train data
train = pd.read_hdf('/path/to/train/data/h5', 'train')
train.head()

We quickly inspect the first rows to make sure the data has been loaded correctly and cast into a pandas data frame. Next, we load the sample submission file (to get an idea of the format) and the test data into memory.

In [7]:
# Get the sample prediction file format.
# The sample predictions will simply be replaced with the ones obtained from
# the custom model.

submission = pd.read_csv('/path/to/sample/submission/csv', index_col=0, float_precision='high')
submission.head()

In [9]:
X_test = pd.read_hdf('/path/to/test/data/h5', 'test')
X_test.head()

That looks good. To simplify data handling in the following steps, we separate the label from the features.

In [10]:
X_train = train.iloc[:, 1:]
y_train = train.iloc[:, 0]

---

### Exploratory Data Analysis

Before modeling the data, we take a first look at it. First, we look at the distribution of the labels in the training data, since whether the classes are balanced heavily influences the choice of algorithms we consider later on.

In [19]:
n, bins, patches = plt.hist(np.array(y_train), [-0.25, 0.25, 0.75,1.25,1.75, 2.25],
                            facecolor='b', alpha=0.75, align="mid")


plt.xlabel('Class Label')
plt.ylabel('Count')
plt.title('Histogram of Class Labels in the Training Data')
plt.axis([-0.5, 3, 0, 700])
plt.grid(True)
plt.show()
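
As a numeric complement to the histogram, the class counts can also be tabulated directly. A sketch on a made-up label vector (`labels` is hypothetical toy data, not the project data; `imbalance_ratio` is an illustrative summary, not a standard metric name):

```python
import pandas as pd

# Hypothetical labels for three classes
labels = pd.Series([0, 1, 2, 0, 1, 2, 0, 1, 2, 2], name="y")

counts = labels.value_counts().sort_index()      # absolute count per class
shares = counts / counts.sum()                   # relative frequency per class
imbalance_ratio = counts.max() / counts.min()    # 1.0 means perfectly balanced

print(counts)
print(shares.round(2))
print("max/min ratio:", imbalance_ratio)
```

On the project data the same calls would be applied to `y_train`.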

We see that the number of training samples per class is roughly the same, about 650 each. This is good, since many classification algorithms are sensitive to class imbalance. We now investigate the correlation structure between the features.

In [20]:
corr = X_train.corr()
print(X_train.shape)

f, ax = plt.subplots(figsize=(10, 8))
ax.set_title("Heatmap of the correlation structure")
sn.heatmap(
    corr,
    mask=np.zeros_like(corr, dtype=bool),
    cmap=sn.diverging_palette(220, 10, as_cmap=True),
    square=True,
    ax=ax)
plt.subplots_adjust(bottom=0.25)
plt.show()

We see a few strong correlations, e.g. between $x_9$ and $x_{19}$ and between $x_{10}$ and $x_{20}$. In addition, $x_{18}$ and $x_{20}$ appear to be very strongly correlated, and there is a somewhat weaker but still substantial correlation between $x_{14}$ and $x_{15}$. These observations suggest that feature reduction techniques might stabilize the solution, as our features are likely subject to multicollinearity.

While the visual inspection gave us a first idea, let us numerically compute the pairwise correlations and list the 10 most correlated variable pairs.

In [21]:
def get_redundant_pairs(df):
    '''Get diagonal and lower triangular pairs of correlation matrix'''
    pairs_to_drop = set()
    cols = df.columns
    for i in range(0, df.shape[1]):
        for j in range(0, i+1):
            pairs_to_drop.add((cols[i], cols[j]))
    return pairs_to_drop

def get_top_abs_correlations(df, n=5):
    au_corr = df.corr().abs().unstack()
    labels_to_drop = get_redundant_pairs(df)
    au_corr = au_corr.drop(labels=labels_to_drop).sort_values(ascending=False)
    return au_corr[0:n]
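
The same ranking can also be obtained without building an explicit set of pairs, by masking the correlation matrix with `np.triu`. A sketch on toy data (`toy` and `top_abs_correlations` are illustrative names, and column `b` is constructed as an exact linear function of `a`):

```python
import numpy as np
import pandas as pd

def top_abs_correlations(df, n=5):
    """Return the n largest absolute pairwise correlations, each pair once."""
    corr = df.corr().abs()
    # Strict upper triangle: drops the diagonal and the mirrored pairs
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(ascending=False).head(n)

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2.0 * a + 1.0,          # exact linear function of a
    "c": rng.normal(size=200),   # independent noise
})

top = top_abs_correlations(toy, n=3)
print(top)
```

The pair `(a, b)` comes out on top with an absolute correlation of 1 up to floating-point error, which is the pattern to look for in the real feature set.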

In [22]:
print("Top Absolute Correlations")
print(get_top_abs_correlations(X_train, 10))

As indicated by the heat map, four pairs have a correlation of 1, that is, one variable is an exact linear function of the other. Hence, we drop $x_{18}$, $x_{19}$ and $x_{20}$: they carry no additional information once $x_9$ and $x_{10}$ are included, but are likely to increase the variance of our estimators, since additional weights would have to be fitted for them. Multicollinearity is also known to cause unstable solutions, which we would like to avoid.

In [23]:
X_train = X_train.drop(['x18','x19', 'x20'], axis=1)
X_test = X_test.drop(['x18','x19', 'x20'], axis=1)

Let us quickly validate that this resolves the issue of the perfectly correlated variables.

In [24]:
corr = X_train.corr()
print(X_train.shape)

f, ax = plt.subplots(figsize=(10, 8))
ax.set_title("Heatmap of the correlation structure")
sn.heatmap(
    corr,
    mask=np.zeros_like(corr, dtype=bool),
    cmap=sn.diverging_palette(220, 10, as_cmap=True),
    square=True,
    ax=ax)
plt.subplots_adjust(bottom=0.25)
plt.show()

In [25]:
print("Top Absolute Correlations")
print(get_top_abs_correlations(X_train, 10))

This is the case.

---
### Initial Experiments - SVC

To get an idea of whether further feature preprocessing or selection steps are required, we quickly fit an SVM to the data and inspect the train and test scores as estimated by a 5-fold cross-validation. As the SVM is sensitive to the scale of the data, we first standardize the data.
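
The scale sensitivity can be made concrete with a sketch on synthetic data (not the project data): a pure-noise feature on a much larger scale is appended, and an RBF-kernel `SVC` is cross-validated with and without a `StandardScaler` in the pipeline. All names below (`X_toy`, `scaled_score`, and so on) are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy data: four informative features plus one noise feature scaled by 1000
X_toy, y_toy = make_classification(n_samples=400, n_features=4, n_informative=4,
                                   n_redundant=0, random_state=0)
rng = np.random.default_rng(0)
X_toy = np.column_stack([X_toy, 1000.0 * rng.normal(size=400)])

unscaled = SVC(kernel="rbf")
scaled = Pipeline(steps=[("SC", StandardScaler()), ("SVC", SVC(kernel="rbf"))])

# Inside the pipeline the scaler is re-fitted on the training part of every fold
unscaled_score = cross_val_score(unscaled, X_toy, y_toy, cv=5).mean()
scaled_score = cross_val_score(scaled, X_toy, y_toy, cv=5).mean()
print(f"unscaled: {unscaled_score:.3f}, scaled: {scaled_score:.3f}")
```

Placing the scaler inside the `Pipeline`, as this notebook does, also means that no information from the held-out part of a fold leaks into the scaling.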

In [12]:
sc = StandardScaler()
svc = SVC(random_state=1234, max_iter = -1)
pip = Pipeline(steps=[('SC',sc), ('SVC', svc)])

# Define GridSearch parameter
param_dict = { 'SVC__C':[1,3], 'SVC__kernel':['rbf'], 
              'SVC__gamma':[1, 0.5],
             'SVC__decision_function_shape':['ovo']}
# Run GridSearch
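# shuffle=True is needed for random_state to take effect (recent scikit-learn raises an error otherwise)
clf = GridSearchCV(pip, param_dict, cv=KFold(n_splits=5, shuffle=True, random_state=1234),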
                   return_train_score=True, verbose=100, n_jobs= -1)
clf.fit(np.array(X_train), np.array(y_train))

print('Mean CV test score: ')
print(clf.cv_results_['mean_test_score'])
print(' ')
print('Std CV test score: ')
print(clf.cv_results_['std_test_score'])
print(' ')

print('Mean CV train score: ')
print(clf.cv_results_['mean_train_score'])
print(' ')
print('Std CV train score: ')
print(clf.cv_results_['std_train_score'])
print(' ')

In [32]:
print('Best estimator parameter: ')
print(clf.best_params_)

print('Mean CV test score for the best estimator: ')
max_id = np.argmax(clf.cv_results_['mean_test_score'])
print(clf.cv_results_['mean_test_score'][max_id])
print(' ')
print('Std CV test score for the best estimator: ')
print(clf.cv_results_['std_test_score'][max_id])
print(' ')

print('Mean CV train score for the best estimator: ')
print(clf.cv_results_['mean_train_score'][max_id])
print(' ')
print('Std CV train score for the best estimator: ')
print(clf.cv_results_['std_train_score'][max_id])
print(' ')
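
The printed arrays can be easier to compare as a table with one row per parameter combination and an explicit train-test gap. A sketch on synthetic data (`make_classification` stands in for the project data; the column name `gap` and the variables `grid` and `table` are illustrative choices):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_toy, y_toy = make_classification(n_samples=300, n_features=10, n_informative=5,
                                   n_classes=3, random_state=0)

pipe = Pipeline(steps=[("SC", StandardScaler()), ("SVC", SVC())])
grid = GridSearchCV(pipe, {"SVC__C": [1, 3], "SVC__gamma": [1, 0.5]},
                    cv=KFold(n_splits=5, shuffle=True, random_state=0),
                    return_train_score=True)
grid.fit(X_toy, y_toy)

# One row per parameter combination, with the train-test gap as an overfitting indicator
table = pd.DataFrame(grid.cv_results_)[
    ["param_SVC__C", "param_SVC__gamma", "mean_train_score", "mean_test_score"]
].copy()
table["gap"] = table["mean_train_score"] - table["mean_test_score"]
print(table.sort_values("mean_test_score", ascending=False))
```

A large `gap` combined with a train score close to 1 is the signature of overfitting discussed next.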

We see that for many folds the train accuracy is 1, while the test accuracy is much lower. This implies that our current approach overfits. For that reason, we use recursive feature elimination with cross-validation to determine the most important features: the feature importances of a RandomForestClassifier serve as the ranking criterion, and a 10-fold cross-validation determines the optimal number of features to keep.

---
### Feature Selection

In [13]:
rfc = RandomForestClassifier(n_estimators=1000, random_state=1234)
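# shuffle=True is needed for random_state to take effect (recent scikit-learn raises an error otherwise)
rfe = RFECV(estimator=rfc, verbose=1, cv=KFold(n_splits=10, shuffle=True, random_state=1234))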
rfe.fit(X_train, y_train)



Let us inspect the results: the number of features determined to be optimal and which features those are.

In [34]:
print('#_features:', rfe.n_features_)
print('support:', rfe.support_)
X_train_rfe = rfe.transform(X_train)
X_test_rfe = rfe.transform(X_test)
X_train_rfe.shape

In [38]:
rfe.ranking_
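
How `RFECV` arrives at these attributes can be seen on a small synthetic problem. The sketch below assumes `make_classification` data and a much smaller forest than the one above to keep the run short; it is not a reproduction of the project result, and the names `forest`, `selector` and `X_reduced` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import KFold

# Toy data: 3 informative features out of 8
X_toy, y_toy = make_classification(n_samples=300, n_features=8, n_informative=3,
                                   n_redundant=0, n_classes=3,
                                   n_clusters_per_class=1, random_state=0)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
# Remove one feature per step (the least important one) and score each subset size by CV
selector = RFECV(estimator=forest, step=1,
                 cv=KFold(n_splits=3, shuffle=True, random_state=0))
selector.fit(X_toy, y_toy)

print("selected:", selector.n_features_)
print("support:", selector.support_)
print("ranking:", selector.ranking_)   # selected features have rank 1

X_reduced = selector.transform(X_toy)  # keeps only the selected columns
```

Features with rank 1 are the ones kept; higher ranks indicate features eliminated earlier in the recursion.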


---

### SVC with reduced feature set

We see that recursive feature selection determined 9 features to be optimal. This is far fewer than the original 17. Hence, let us check how the SVM now performs on that subset of features. Again, we include a standardization step in the pipeline for the reasons mentioned previously.

In [44]:
sc = StandardScaler()
svc = SVC(random_state=1234, max_iter = -1)
pip = Pipeline(steps=[('SC',sc), ('SVC', svc)])

# Define GridSearch parameter
param_dict = { 'SVC__C':[1,2,3], 'SVC__kernel':['rbf'], 
              'SVC__gamma':[1, 0.75,  0.5, 0.4, 0.3, 0.25],
             'SVC__decision_function_shape':['ovo']}
# Run GridSearch
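# shuffle=True is needed for random_state to take effect (recent scikit-learn raises an error otherwise)
clf = GridSearchCV(pip, param_dict, cv=KFold(n_splits=10, shuffle=True, random_state=1234),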
                   return_train_score=True, verbose=1, n_jobs= -1)
clf.fit(np.array(X_train_rfe), np.array(y_train))

print('Mean CV test score: ')
print(clf.cv_results_['mean_test_score'])
print(' ')
print('Std CV test score: ')
print(clf.cv_results_['std_test_score'])
print(' ')

print('Mean CV train score: ')
print(clf.cv_results_['mean_train_score'])
print(' ')
print('Std CV train score: ')
print(clf.cv_results_['std_train_score'])
print(' ')

In [45]:
print('Best estimator parameter: ')
print(clf.best_params_)

print('Mean CV test score for the best estimator: ')
max_id = np.argmax(clf.cv_results_['mean_test_score'])
print(clf.cv_results_['mean_test_score'][max_id])
print(' ')
print('Std CV test score for the best estimator: ')
print(clf.cv_results_['std_test_score'][max_id])
print(' ')

print('Mean CV train score for the best estimator: ')
print(clf.cv_results_['mean_train_score'][max_id])
print(' ')
print('Std CV train score for the best estimator: ')
print(clf.cv_results_['std_train_score'][max_id])
print(' ')

The results look promising: while we no longer get perfect training fits, the test accuracy estimate is now much higher than before. The discrepancy between the train and test score estimates from the 10-fold cross-validation, as well as the standard deviation of the scores, is also much lower. Hence, it seems that we were able to reduce overfitting drastically. However, a noticeable gap remains. So let us shift the grid search towards larger values of the parameter C (around 3 and upwards), refine the grid for gamma, and check how the results change. Note that in scikit-learn's SVC, C multiplies the penalty on margin violations, so the strength of the regularization is inversely proportional to C; whether larger values help here has to be judged from the cross-validation scores.
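
The effect of C on the train and test scores can be sketched on synthetic data (`make_classification`, the fixed `gamma=0.1` and the three values of C are illustrative assumptions, not settings taken from this project). The loop only reports the cross-validated scores for a very small, a moderate and a very large C:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_toy, y_toy = make_classification(n_samples=300, n_features=10, n_informative=5,
                                   n_classes=3, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

train_scores, test_scores = [], []
for C in [0.01, 1, 100]:
    pipe = Pipeline(steps=[("SC", StandardScaler()),
                           ("SVC", SVC(C=C, kernel="rbf", gamma=0.1))])
    res = cross_validate(pipe, X_toy, y_toy, cv=cv, return_train_score=True)
    train_scores.append(res["train_score"].mean())
    test_scores.append(res["test_score"].mean())
    print(f"C={C}: train={train_scores[-1]:.3f}, test={test_scores[-1]:.3f}")
```

Comparing the train and test columns for each C shows whether a change in C narrows or widens the gap on a given data set.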

In [58]:
sc = StandardScaler()
svc = SVC(random_state=1234, max_iter = -1)
pip = Pipeline(steps=[('SC',sc), ('SVC', svc)])

# Define GridSearch parameter
param_dict = { 'SVC__C':[2.75,3, 3.25, 4,5], 'SVC__kernel':['rbf'], 
              'SVC__gamma':[1, 0.75, 0.6, 0.55,  0.5, 0.4, 0.3, 0.25],
             'SVC__decision_function_shape':['ovo', 'ovr']}
# Run GridSearch
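# shuffle=True is needed for random_state to take effect (recent scikit-learn raises an error otherwise)
clf = GridSearchCV(pip, param_dict, cv=KFold(n_splits=10, shuffle=True, random_state=1234),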
                   return_train_score=True, verbose=1, n_jobs= -1)
clf.fit(np.array(X_train_rfe), np.array(y_train))

print('Mean CV test score: ')
print(clf.cv_results_['mean_test_score'])
print(' ')
print('Std CV test score: ')
print(clf.cv_results_['std_test_score'])
print(' ')

print('Mean CV train score: ')
print(clf.cv_results_['mean_train_score'])
print(' ')
print('Std CV train score: ')
print(clf.cv_results_['std_train_score'])
print(' ')

In [59]:
print('Best estimator parameter: ')
print(clf.best_params_)

print('Mean CV test score for the best estimator: ')
max_id = np.argmax(clf.cv_results_['mean_test_score'])
print(clf.cv_results_['mean_test_score'][max_id])
print(' ')
print('Std CV test score for the best estimator: ')
print(clf.cv_results_['std_test_score'][max_id])
print(' ')

print('Mean CV train score for the best estimator: ')
print(clf.cv_results_['mean_train_score'][max_id])
print(' ')
print('Std CV train score for the best estimator: ')
print(clf.cv_results_['std_train_score'][max_id])
print(' ')

These results look even better, especially in terms of the ratio of the test accuracy estimate to its standard deviation estimate. Hence, we construct a submission based on this model.

---

## Submission

In [60]:
y_pred = clf.predict(X_test_rfe)
submission["y"] = y_pred
submission.head()

---

## Export data

Finally, we write our submission data frame to a timestamped CSV file that we can upload to the submission platform.

In [61]:
# Timestamp without spaces or colons, so the file name is valid on all platforms
ts = datetime.datetime.utcnow().strftime('%Y-%m-%d_%H-%M-%S')
filename = 'submission_name'
fname = filename + ts + '.csv'

submission.to_csv(fname, float_format='%.64f', index=True, header=True)