# Project 1A: Ridge Regression
---

This notebook provides the solution to Project 1A of the module Introduction to Machine Learning 2019.

---


## Environmental Set-Up

The required libraries are loaded below.

In [0]:
import pandas as pd
import numpy as np
import seaborn as sn
import sklearn as sl
import datetime
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import ShuffleSplit
from sklearn import preprocessing
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.metrics import make_scorer
from sklearn.model_selection import KFold

%matplotlib inline
sn.set_context('notebook')
%config InlineBackend.figure_format = 'retina'

---

## Load in the data

We first upload the data to the temporary cloud storage provided by the Google Colab environment.

In [2]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))


---
## Project 1A

The following section solves Project 1A of the Introduction to Machine Learning 2019 course. It assumes that the environment is set up and that the train and test data are already uploaded.

---

### Formatting the data

The data is already uploaded; we now read it into pandas DataFrames for convenient handling.

In [3]:
# Get train data
train = pd.read_csv('train.csv', index_col=0)
train.head()

In [4]:
train.shape

In [5]:
'''
Get sample prediction file format.
Sample predictions will be simply replaced with the ones obtained from the
custom model.
''' 

submission = pd.read_csv('sample.csv', header=None)
submission.head()

In [0]:
X_train = train.iloc[:, 1:]
y_train = train.iloc[:, 0]

---

### EDA

Since this is a regression problem, we use a linear model for simplicity. The predictions of an unregularized least-squares fit do not change when features are rescaled, so we do not standardize the data; note that this invariance does not strictly hold for the ridge penalty used below, whose effect depends on the feature scale. With only 10 features, there is also little need for feature reduction. Nonetheless, we first inspect the features for multicollinearity.

In [7]:
corr = X_train.corr()
print(X_train.shape)

f, ax = plt.subplots(figsize=(10, 8))
ax.set_title("Heatmap of the correlation structure")
sn.heatmap(
    corr,
    mask=np.zeros_like(corr, dtype=bool),  # np.bool was removed in recent NumPy versions
    cmap=sn.diverging_palette(220, 10, as_cmap=True),
    square=True,
    ax=ax)
plt.subplots_adjust(bottom=0.25)
plt.show()

We see that there is no strong correlation structure between the features. Hence, and especially because of the task description, we keep all of them, recalling that we have only 10 features for 10,000 samples.
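
The visual impression from the heatmap can be backed by a number: the largest absolute off-diagonal correlation. The sketch below is only an illustration on synthetic data (the `*_demo` names and the artificially correlated column `x4` are assumptions, not part of the project data); applied to `X_train`, the same steps would report the most strongly correlated feature pair.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for X_train (assumption: the real data is not used here).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(500, 4)), columns=["x1", "x2", "x3", "x4"])
# Make x4 nearly a copy of x1 so that one strong correlation exists.
X_demo["x4"] = X_demo["x1"] + 0.1 * rng.normal(size=500)

corr_demo = X_demo.corr().abs()
# Mask the diagonal, which is always 1 and carries no information.
off_diag = corr_demo.where(~np.eye(len(corr_demo), dtype=bool))

# Most strongly correlated pair of distinct features and its correlation.
pair = off_diag.stack().idxmax()
max_corr = off_diag.max().max()
print(pair, round(max_corr, 3))
```

A value close to 1 would signal near-duplicate features; small values support keeping all features as they are.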

---

### Model Fitting and Selection

We now fit the model following the procedure dictated by the task: 10-fold cross-validation over a preset set of regularization parameters.

In [0]:
# Set pipeline
RR = Ridge(fit_intercept=False)
pip = Pipeline(steps=[('RR', RR)])

# Define GridSearch parameter
param_dict = {'RR__alpha':[0.1, 1, 10, 100, 1000]}
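
A side note on the `alpha` grid: the ridge penalty acts on the raw coefficients, so a given `alpha` regularizes differently when a feature is rescaled, whereas an unpenalized least-squares fit yields the same predictions. The following is a minimal sketch on synthetic data; the data, the factor of 100, and the helper `max_prediction_shift` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

# Rescale the first feature, e.g. a change of units by a factor of 100.
X_scaled = X_demo.copy()
X_scaled[:, 0] *= 100.0

def max_prediction_shift(model_factory):
    """Largest change in fitted values caused by rescaling one feature."""
    pred_raw = model_factory().fit(X_demo, y_demo).predict(X_demo)
    pred_scaled = model_factory().fit(X_scaled, y_demo).predict(X_scaled)
    return np.max(np.abs(pred_raw - pred_scaled))

ols_shift = max_prediction_shift(lambda: LinearRegression(fit_intercept=False))
ridge_shift = max_prediction_shift(lambda: Ridge(alpha=100.0, fit_intercept=False))
print(f"OLS shift:   {ols_shift:.2e}")
print(f"Ridge shift: {ridge_shift:.2e}")
```

The least-squares shift should be at the level of floating-point noise, while the ridge shift is clearly nonzero. This does not change the procedure used in this notebook; it only shows that the reported errors are tied to the scale of the data as given.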

In [0]:
# Define the scoring function
def rmse(y, y_pred):
  RMSE = mean_squared_error(y, y_pred)**0.5
  return RMSE
# greater_is_better=False makes the scorer return the negative RMSE
rmse_scorer = make_scorer(rmse, greater_is_better=False)
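
To make the sign convention concrete, the sketch below compares a plain RMSE with the value returned by a scorer built with `greater_is_better=False`. It runs on synthetic data, and the `*_demo` names are assumptions introduced so that nothing in the notebook is overwritten.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer, mean_squared_error

def rmse_demo(y, y_pred):
    # Root mean squared error: lower is better.
    return mean_squared_error(y, y_pred) ** 0.5

# The scorer negates the metric so that "higher is better" holds for model selection.
rmse_scorer_demo = make_scorer(rmse_demo, greater_is_better=False)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, 2.0, 3.0]) + rng.normal(size=100)

model = Ridge(alpha=1.0, fit_intercept=False).fit(X_demo, y_demo)
plain = rmse_demo(y_demo, model.predict(X_demo))
scored = rmse_scorer_demo(model, X_demo, y_demo)  # scorers are called as (estimator, X, y)
print(plain, scored)
```

The two printed numbers should differ only in sign, which is why the scores need to be multiplied by -1 before they are reported as RMSE values.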

In [25]:
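# Run GridSearch
# KFold without shuffling is deterministic, so no random_state is needed
# (recent scikit-learn versions raise an error if it is set with shuffle=False)
clf = GridSearchCV(pip, param_dict, cv=KFold(n_splits=10),
                   scoring=rmse_scorer)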
clf.fit(X_train, y_train)

print('Mean CV test score: ')
print(clf.cv_results_['mean_test_score'])
print(' ')
print('Std CV test score: ')
print(clf.cv_results_['std_test_score'])
print(' ')
print('Best estimator parameter: ')
print(clf.best_params_)

The mean CV test scores above are the negatives of the desired RMSE values, because the scorer flips the sign.
We are not interested in the optimal parameter, only in the mean validation error for each regularization parameter, so we can construct the submission file directly.
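
As a sanity check on what the mean test score represents, the grid search can be reproduced with an explicit loop: for each `alpha`, fit on nine folds, compute the RMSE on the held-out fold, and average over the ten folds. The sketch below does this on synthetic data (all `*_demo` names and the data itself are assumptions) and compares the result with the negated `mean_test_score`.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import GridSearchCV, KFold

def rmse_demo(y, y_pred):
    return mean_squared_error(y, y_pred) ** 0.5

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ rng.normal(size=5) + rng.normal(size=200)

alphas_demo = [0.1, 1, 10, 100, 1000]
cv_demo = KFold(n_splits=10)  # no shuffling, so the folds are deterministic

# Manual loop: fit on nine folds, score on the held-out fold, then average.
manual = []
for alpha in alphas_demo:
    fold_rmse = []
    for train_idx, val_idx in cv_demo.split(X_demo):
        model = Ridge(alpha=alpha, fit_intercept=False)
        model.fit(X_demo[train_idx], y_demo[train_idx])
        fold_rmse.append(rmse_demo(y_demo[val_idx], model.predict(X_demo[val_idx])))
    manual.append(np.mean(fold_rmse))

# Same computation through GridSearchCV with a sign-flipped scorer.
search = GridSearchCV(
    Ridge(fit_intercept=False),
    {"alpha": alphas_demo},
    cv=cv_demo,
    scoring=make_scorer(rmse_demo, greater_is_better=False),
)
search.fit(X_demo, y_demo)
from_search = -search.cv_results_["mean_test_score"]

print(np.allclose(manual, from_search))
```

Note that this averages the per-fold RMSE values, which is in general not the same as the RMSE computed over all held-out predictions pooled together.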

---

### Submission


In [26]:
rmses = -1*clf.cv_results_['mean_test_score']
print(rmses)

submission.iloc[:, 0] = rmses
submission.head()

---

## Export data

We now use the Google Colab API to download the submission file in the desired CSV format. That's it. We are done.

In [0]:
from google.colab import files

ts = str(datetime.datetime.utcnow())
ts = ts.replace(' ', '_')
fname = 'rmse_pred_'+ts+'.csv'

with open(fname, 'w') as f:
  submission.to_csv(f, float_format='%.64f', index=False, header=False)

files.download(fname)
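
Outside of Colab, the export step can be checked with a simple round trip: write the values without index and header, then read them back. The sketch below is a hedged illustration with made-up values (`rmses_demo` is hypothetical, not a result of this project); it also uses a timestamp format without spaces or colons, which is an assumption about what is convenient for file names rather than a requirement of the task.

```python
import datetime
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical stand-in for the five mean RMSE values.
rmses_demo = np.array([1.0, 1.5, 2.25, 3.0, 4.125])
submission_demo = pd.DataFrame(rmses_demo)

# Timestamp without spaces or colons, which some file systems reject.
ts_demo = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%d_%H%M%S")
path_demo = os.path.join(tempfile.mkdtemp(), f"rmse_pred_{ts_demo}.csv")

# One value per line, no index column and no header row.
submission_demo.to_csv(path_demo, index=False, header=False)

# Reading the file back should reproduce the values.
round_trip = pd.read_csv(path_demo, header=None).iloc[:, 0].to_numpy()
print(np.allclose(round_trip, rmses_demo))
```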