# Introduction to Machine Learning
---

This notebook serves as a guide and sandbox for the projects that form the practical part of the module Introduction to Machine Learning 2019 @ ETHZ.

---


## Environmental Set-Up

The required libraries are loaded below.

In [0]:
import pandas as pd
import numpy as np
import seaborn as sn
import sklearn as sl
import datetime
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import ShuffleSplit
from sklearn import preprocessing
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

%matplotlib inline
sn.set_context('notebook')
%config InlineBackend.figure_format = 'retina'

If a required package is not among those pre-installed in Google Colab, shell commands can be run by prefixing them with "!" to install it:

In [0]:
# !pip install pandas

# !apt-get install pandas
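
Before installing anything, it can help to check which packages are already importable. The following is a minimal sketch using only the standard library; the list `required` is an assumption based on the imports above (note that scikit-learn is imported as `sklearn` but installed as `scikit-learn`).

```python
import importlib.util

# Assumed list of top-level import names used in this notebook.
required = ["pandas", "numpy", "seaborn", "sklearn", "matplotlib"]

# find_spec returns None when a top-level package cannot be found.
missing = [name for name in required if importlib.util.find_spec(name) is None]

print("Missing packages:", missing)
```

Any name reported as missing can then be installed with a `!pip install ...` line as shown above.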

---

## Load in the data

The following upload UI can be used to upload data files to the environment. Once the upload is complete, the files can be accessed via their respective filenames.

If the size of the files exceeds the disk capacity indicated in the top right corner, Google Colab usually allocates more disk space automatically. If it does not, one can change the runtime via the "Runtime" tab and allocate a GPU or TPU, which also upgrades the disk space to 300GB.

A GPU or TPU should, however, only be used if required, as the allocated resources depend on the number of users currently accessing the Google Cloud infrastructure.

In [0]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))
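
The dictionary returned by `files.upload()` maps each filename to the raw bytes of the file, so the content can also be parsed directly without going through the file system. The sketch below illustrates this with a stand-in dictionary; the filename `example.csv` and its columns are made up for the illustration and are not part of the course data.

```python
import io

import pandas as pd

# Stand-in for the dict returned by files.upload(): filename -> raw bytes.
# The filename and columns are hypothetical.
uploaded_demo = {"example.csv": b"Id,y,x1\n0,1.5,2.0\n1,2.5,3.0\n"}

# Parse every uploaded CSV straight from memory into a dataframe.
frames = {
    name: pd.read_csv(io.BytesIO(content), index_col=0)
    for name, content in uploaded_demo.items()
}

print(frames["example.csv"])
```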

---
## Example Workflow
---

### Exploratory Data Analysis

Before algorithms are applied, one can try to visualize the data or lower-dimensional representations of it.
The following example uses a sample data set from the sklearn package: the wine data set, which consists of 13 real-valued features and is divided into 3 classes ("target").

First the data set is loaded.

In [0]:
data = load_wine()

# The data is converted into a pandas dataframe.

data = pd.DataFrame(data= np.c_[data['data'], data['target']],
                     columns= data['feature_names'] + ['target'])
data.head()

We now construct a heatmap displaying the Pearson correlation coefficients between the individual features to get an idea of which features are highly correlated and hence somewhat redundant.

In [0]:
features = data.iloc[:, :-1]
f, ax = plt.subplots(figsize=(10, 8))
ax.set_title("Heatmap of the correlation structure")
corr = features.corr()
sn.heatmap(
    corr,
    mask=np.zeros_like(corr, dtype=bool),  # np.bool was removed in NumPy 1.24
    cmap=sn.diverging_palette(220, 10, as_cmap=True),
    square=True,
    ax=ax)
plt.subplots_adjust(bottom=0.25)
plt.show()
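
Reading strongly correlated pairs off a heatmap by eye gets tedious as the number of features grows. As a complement, the following sketch lists the feature pairs whose absolute correlation exceeds a threshold. The threshold of 0.7 is an arbitrary choice for illustration, and the variable names are not taken from the cells above.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine()
wine_features = pd.DataFrame(wine.data, columns=wine.feature_names)
wine_corr = wine_features.corr()

# Keep only the strict upper triangle so that each pair appears exactly once.
upper_mask = np.triu(np.ones(wine_corr.shape, dtype=bool), k=1)
pairs = wine_corr.where(upper_mask).stack().dropna()

# Sort pairs by absolute correlation, strongest first.
pairs = pairs.sort_values(key=np.abs, ascending=False)

threshold = 0.7  # arbitrary cut-off chosen for this illustration
strong_pairs = pairs[pairs.abs() > threshold]
print(strong_pairs)
```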

If one is instead interested in the distribution of a particular feature, a histogram is often helpful. As an example, let us inspect a histogram of the alcohol feature.

In [0]:
alcohol = features["alcohol"]
n, bins, patches = plt.hist(alcohol, 20, density=True, facecolor='b')


plt.xlabel('Alcohol (% vol.)')
plt.ylabel('Probability density')
plt.title('Histogram of alcohol')
plt.grid(True)
plt.show()

For some use cases it might be useful to plot one feature against another, which is easily done as shown in the following example. Note that the output is not informative here: a line plot connects the points in row order, which only makes sense for ordered data such as time series.

In [0]:
ash = features["ash"]

plt.plot(alcohol, ash)

plt.xlabel("alcohol")
plt.ylabel("ash")
plt.title("Ash values plotted against the alcohol values")
plt.show()
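
For unordered data such as the wine samples, a scatter plot is usually the more informative way to plot one feature against another. The following sketch reloads the data so that it runs on its own and colours the points by class; the colour map and figure size are arbitrary choices.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine

# as_frame=True returns pandas objects (available in scikit-learn >= 0.23).
wine_df = load_wine(as_frame=True).frame  # 13 features plus 'target'

fig, ax = plt.subplots(figsize=(6, 4))
points = ax.scatter(
    wine_df["alcohol"], wine_df["ash"],
    c=wine_df["target"], cmap="viridis", s=20)
ax.set_xlabel("alcohol")
ax.set_ylabel("ash")
ax.set_title("Ash against alcohol, coloured by class")
fig.colorbar(points, ax=ax, label="target")
plt.show()
```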

---
### Model Fitting and Scoring

We now briefly show by example how to use the data, in the form of a pandas dataframe, to fit machine learning models. In this case we use the Support Vector Machine classifier from the sklearn library.

In [0]:
labels = data["target"]

clf =  make_pipeline(preprocessing.StandardScaler(), svm.SVC(C=1))
cross_val_score(clf, features, labels, cv=5)



While this enables us to get a performance estimate using k-fold cross-validation, it would be handier to test a number of different specifications and get the CV scores for all of them. To this end we can use the GridSearchCV class from sklearn. The following example shows how it can be used.

In [0]:
#Set pipeline
SC = preprocessing.StandardScaler()
SVC = svm.SVC()
pip = Pipeline(steps=[('SC', SC), ('SVC', SVC)])

#Define GridSearch parameter
param_dict = {'SVC__kernel':['linear', 'rbf'], 
              'SVC__C':[1, 10]}

# Run GridSearch
clf = GridSearchCV(pip, param_dict, cv=3)
clf.fit(features, labels)

print('Mean CV test score: ')
print(clf.cv_results_['mean_test_score'])
print(' ')
print('Std CV test score: ')
print(clf.cv_results_['std_test_score'])
print(' ')
print('Best estimator parameter: ')
print(clf.best_params_)

When selecting the best parameters, one should be careful with the interpretation of the reported scores. Note that the CV score is usually biased towards being pessimistic, as the model is not trained on the full training data set in each fold. Moreover, to avoid overfitting, the mean CV test score should never be considered without the associated standard error of the results.
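
To look at mean scores and their uncertainty side by side, the entries of `cv_results_` can be collected in a dataframe. The sketch below repeats the grid search from above so that it runs on its own. The standard error is approximated as the fold standard deviation divided by the square root of the number of folds, and the "within one standard error of the best" flag is a common heuristic added here for illustration, not something prescribed by sklearn.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing, svm
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = load_wine(return_X_y=True)
pipe = Pipeline(steps=[("SC", preprocessing.StandardScaler()),
                       ("SVC", svm.SVC())])
grid = {"SVC__kernel": ["linear", "rbf"], "SVC__C": [1, 10]}
n_folds = 3
search = GridSearchCV(pipe, grid, cv=n_folds).fit(X, y)

cols = ["params", "mean_test_score", "std_test_score"]
results = pd.DataFrame(search.cv_results_)[cols].copy()

# Rough standard error of the mean score across folds.
results["sem_test_score"] = results["std_test_score"] / np.sqrt(n_folds)

# Flag candidates whose mean score lies within one standard error of the best.
best = results.loc[results["mean_test_score"].idxmax()]
cutoff = best["mean_test_score"] - best["sem_test_score"]
results["within_one_se"] = results["mean_test_score"] >= cutoff

print(results.sort_values("mean_test_score", ascending=False))
```

If several candidates are flagged, their scores are hard to tell apart given the fold-to-fold variation, and preferring the simpler model among them is a reasonable choice.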

---
## Project 0

The following section solves Project 0 of the Introduction to Machine Learning course 2019. It is assumed that the environment is set up and that the train and test data are already uploaded.

---

### Formatting the data

Although the data is already uploaded, we read it into pandas dataframes, which are convenient to work with.

In [0]:
# Get train data
train = pd.read_csv('train.csv', index_col=0)
train.head()

In [0]:
# Get test data for which predictions should be made
test = pd.read_csv('test.csv', index_col=0)
test.head()

In [0]:
'''
Get sample prediction file format.
Sample predictions will be simply replaced with the ones obtained from the
custom model.
''' 

submission = pd.read_csv('sample.csv', index_col=0)
submission.head()

In [0]:
X_train = train.iloc[:, 1:]
y_train = train.iloc[:, 0]

X_test = test

----

### EDA

Since this is a regression problem, we will use a linear regression model for simplicity. Because the predictions of that model are invariant to the scale of the data, we do not need to standardize it. With only 10 features, there is also little need for feature reduction. Nonetheless, we first inspect the data for multicollinearity.

In [0]:
corr = X_train.corr()
print(X_train.shape)

f, ax = plt.subplots(figsize=(10, 8))
ax.set_title("Heatmap of the correlation structure")
sn.heatmap(
    corr,
    mask=np.zeros_like(corr, dtype=bool),  # np.bool was removed in NumPy 1.24
    cmap=sn.diverging_palette(220, 10, as_cmap=True),
    square=True,
    ax=ax)
plt.subplots_adjust(bottom=0.25)
plt.show()

We see that the features are essentially uncorrelated. Hence we keep all of them, recalling that we have only 10 features for 10,000 samples.
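
The scale invariance of ordinary least squares mentioned at the start of this section can be checked numerically. The following sketch uses synthetic data (not the project data) with features on very different scales and compares the predictions of a plain linear regression with those of a pipeline that standardizes first. Note that this invariance holds for unregularized linear regression; it does not carry over to penalized variants such as ridge or lasso.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features on very different scales, plus a noisy linear target.
X_syn = rng.normal(size=(200, 10)) * rng.uniform(0.1, 100.0, size=10)
y_syn = X_syn @ rng.normal(size=10) + rng.normal(scale=0.1, size=200)

plain = LinearRegression().fit(X_syn, y_syn)
scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X_syn, y_syn)

# Compare predictions on new points from both models.
X_new = rng.normal(size=(5, 10)) * 10.0
same_predictions = np.allclose(plain.predict(X_new), scaled.predict(X_new))
print("Predictions agree:", same_predictions)
```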

---

### Model Fitting and Selection

In [0]:
#Set pipeline
LR = LinearRegression()
pip = Pipeline(steps=[('LR', LR)])

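#Define GridSearch parameter
# The `normalize` option was removed from LinearRegression in scikit-learn 1.2,
# so the grid varies `fit_intercept` instead.
param_dict = {'LR__fit_intercept':[True, False]}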

# Run GridSearch
clf = GridSearchCV(pip, param_dict, cv=8)
clf.fit(X_train, y_train)

print('Mean CV test score: ')
print(clf.cv_results_['mean_test_score'])
print(' ')
print('Std CV test score: ')
print(clf.cv_results_['std_test_score'])
print(' ')
print('Best estimator parameter: ')
print(clf.best_params_)

The results look very promising. Let us quickly inspect the coefficients.

In [0]:
fitted_pip = clf.best_estimator_
fitted_pip.named_steps['LR'].coef_

It seems that the label of the artificial data set was simply constructed by taking the average of the 10 predictors, in which case every coefficient is approximately 0.1.
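
To see what such coefficients look like, the sketch below builds a synthetic data set whose label is exactly the row mean of 10 features and fits a linear regression to it. The data here is generated for illustration only and is not the project data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in: the label is the mean of the 10 features of each row.
X_demo = rng.normal(size=(1000, 10))
y_demo = X_demo.mean(axis=1)

model = LinearRegression().fit(X_demo, y_demo)

# Each coefficient should be close to 1/10 and the intercept close to 0.
print(np.round(model.coef_, 6))
print(round(float(model.intercept_), 6))
```

On the project data, the same hypothesis could be checked directly by comparing `X_train.mean(axis=1)` with `y_train`, for instance with `np.allclose`.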

---

### Predictions

We will now use the fitted model to get the predictions for our test set.


In [0]:
y_pred = fitted_pip.predict(X_test)

submission['y'] = y_pred
submission.head()

---

## Export data

The following cell exports a dataframe defined in the current environment to a CSV file and downloads that file.

In [0]:
from google.colab import files

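# Build a filename-safe UTC timestamp (no spaces or colons);
# datetime.utcnow() is deprecated as of Python 3.12.
ts = datetime.datetime.now(datetime.timezone.utc).strftime('%Y-%m-%d_%H-%M-%S')
fname = 'y_pred_' + ts + '.csv'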

with open(fname, 'w') as f:
  submission.to_csv(f, float_format='%.90f')

files.download(fname)