# Project 1B: LinearRegression with feature transformations
---

This notebook provides a solution to Project 1B of the course Introduction to Machine Learning 2019 at ETHZ.

---


## Environmental Set-Up

We first set up the environment, load the packages required later, and fix the random seed globally.

In [0]:
import warnings
import pandas as pd
import numpy as np
import seaborn as sn
import sklearn as sl
import datetime
import random
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import ShuffleSplit
from sklearn import preprocessing
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression, ElasticNet
from sklearn.linear_model import Ridge, Lasso, HuberRegressor, LassoLarsIC
from sklearn.metrics import mean_squared_error
from sklearn.metrics import make_scorer
from sklearn.svm import SVR, LinearSVR
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.feature_selection import RFECV, RFE, SelectKBest, f_regression
from sklearn.model_selection import KFold


%matplotlib inline
sn.set_context('notebook')
%config InlineBackend.figure_format = 'retina'
random.seed(1234)
np.random.seed(1234)  # NumPy keeps its own global random state

---

## Load in the data

We now use the Google Colab API to upload the data and the sample submission from the local disk into the temporary cloud storage attached to this PaaS (platform as a service) environment, so that the notebook can access them.

In [0]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))


---
## Project 1B

The following section solves Project 1B of the Introduction to Machine Learning course 2019.

---

### Formatting the data

Although the data is loaded, we still read it into pandas data frames, which are more convenient to work with.

In [0]:
# Get the sample prediction file format.
# The sample predictions will simply be replaced with the ones obtained from
# the custom model.

submission = pd.read_csv('sample.csv', header=None, float_precision='high')
submission.head()

In [0]:
# Get train data
train = pd.read_csv('train.csv', index_col=0, float_precision='high')
train.head()

We quickly inspect the shape of the data to make sure it has been loaded correctly and cast into a pandas data frame.

In [0]:
train.shape

That looks very good. We separate the label from the features to simplify the implementation and data handling in the following.

In [0]:
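# Copy the slice so that adding feature columns later does not write into a
# view of `train` (avoids pandas' SettingWithCopyWarning)
X_train = train.iloc[:, 1:].copy()
y_train = train.iloc[:, 0]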

---
## Feature Engineering

As required by the task, we construct additional features that are non-linear transformations of the five given predictors and add them to the training data frame: the squares $x_i^2$ ($\phi_6,\dots,\phi_{10}$), the exponentials $e^{x_i}$ ($\phi_{11},\dots,\phi_{15}$) and the cosines $\cos(x_i)$ ($\phi_{16},\dots,\phi_{20}$). Finally, we add a constant bias feature ($\phi_{21}$) to the set of predictors.

In [0]:
# Add quadratic version of the features
for i in range(5):
  feature_name = 'phi'+str(X_train.shape[1]+1)
  X_train[feature_name] = X_train.iloc[:,i]**2

# Add exponential version of the features
for i in range(5):
  feature_name = 'phi'+str(X_train.shape[1]+1)
  X_train[feature_name] = np.exp(X_train.iloc[:,i])
  
# Add cosine version of the features
for i in range(5):
  feature_name = 'phi'+str(X_train.shape[1]+1)
  X_train[feature_name] = np.cos(X_train.iloc[:,i])

# Add constant feature

feature_name = 'phi'+str(X_train.shape[1]+1)
X_train[feature_name] = 1
  
  
X_train.describe()

In [0]:
y_train.describe()
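
The feature construction above can also be written as one vectorized mapping. The following is a minimal sketch, assuming the five raw predictors are available as a plain NumPy array; the function name `build_features` and the toy input are illustrative and not part of the original notebook.

```python
import numpy as np

def build_features(X):
    """Map an (n, 5) array to 21 columns: linear, squared, exp, cos, constant."""
    X = np.asarray(X, dtype=float)
    constant = np.ones((X.shape[0], 1))
    return np.hstack([X, X ** 2, np.exp(X), np.cos(X), constant])

# Toy input (hypothetical values, not taken from train.csv)
X_demo = np.array([[0.0, 1.0, 2.0, -1.0, 0.5]])
Phi_demo = build_features(X_demo)
print(Phi_demo.shape)
```

The column order matches the naming used in this notebook: columns 0 to 4 hold $x_1,\dots,x_5$, followed by five squares, five exponentials, five cosines and the constant.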

---

### Model Fitting and Selection

Since the data is now loaded, we start with the simplest model and fit a linear regression on the five original features. This model has no hyperparameters, so no grid search over them is needed. Nonetheless, we use the `GridSearchCV` class from the sklearn package as a convenient way to run a 10-fold cross-validation and get an idea of the model's performance.


In [0]:
# We first define the score function as the RMSE since this is the grading
# metric defined by the task
def rmse(y, y_pred):
  return mean_squared_error(y, y_pred)**0.5

# greater_is_better=False makes sklearn report the negated RMSE
rmse_scorer = make_scorer(rmse, greater_is_better=False)
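
One consequence of `greater_is_better=False` is easy to miss: the scorer returns the negated RMSE, so the CV scores printed in this notebook are negative and the "best" score is the one closest to zero. A small sketch on toy data (the `_demo` names and the toy arrays are assumptions made for illustration only):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_squared_error

def rmse_demo(y, y_pred):
    return mean_squared_error(y, y_pred) ** 0.5

rmse_scorer_demo = make_scorer(rmse_demo, greater_is_better=False)

# Toy data that a straight line cannot fit perfectly
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0.0, 1.0, 2.0, 4.0])
model = LinearRegression().fit(X_toy, y_toy)

score = rmse_scorer_demo(model, X_toy, y_toy)    # value seen by GridSearchCV
plain = rmse_demo(y_toy, model.predict(X_toy))   # ordinary, positive RMSE
print(score, plain)
```

To read the printed CV scores as errors, flip their sign.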

In [0]:
LR = LinearRegression()
pip = Pipeline(steps=[('LR', LR)])

# Define GridSearch parameter
param_dict = {'LR__fit_intercept':[False]}
# Run GridSearch
# Unshuffled folds; random_state only has an effect together with shuffle=True
clf = GridSearchCV(pip, param_dict, cv=KFold(n_splits=10),
                   scoring=rmse_scorer, return_train_score=True)
clf.fit(X_train.iloc[:,0:5], y_train)

print('Mean CV test score: ')
print(clf.cv_results_['mean_test_score'])
print(' ')
print('Std CV test score: ')
print(clf.cv_results_['std_test_score'])
print(' ')

print('Mean CV train score: ')
print(clf.cv_results_['mean_train_score'])
print(' ')
print('Std CV train score: ')
print(clf.cv_results_['std_train_score'])
print(' ')

print('Best estimator parameter: ')
print(clf.best_params_)

The performance is not quite satisfactory. For reference, this value would pass the easy baseline, but the error would have to decrease by roughly 4% to pass the medium baseline. More sophisticated approaches are required.
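
To make explicit what the cross-validation above computes, here is a minimal NumPy sketch of an unshuffled k-fold loop around an intercept-free least-squares fit. The helper name `kfold_rmse` and the synthetic data are assumptions for illustration; the notebook itself relies on `GridSearchCV` and `KFold`.

```python
import numpy as np

def kfold_rmse(X, y, n_splits=10):
    """Per-fold RMSE of ordinary least squares without intercept."""
    n = X.shape[0]
    folds = np.array_split(np.arange(n), n_splits)  # contiguous, unshuffled
    scores = []
    for test_idx in folds:
        train_mask = np.ones(n, dtype=bool)
        train_mask[test_idx] = False
        # Fit on the training folds only
        w, *_ = np.linalg.lstsq(X[train_mask], y[train_mask], rcond=None)
        # Evaluate on the held-out fold
        resid = y[test_idx] - X[test_idx] @ w
        scores.append(np.sqrt(np.mean(resid ** 2)))
    return np.array(scores)

# Synthetic data (not the course data): linear signal plus small noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y_toy = X_toy @ w_true + rng.normal(scale=0.1, size=200)

fold_scores = kfold_rmse(X_toy, y_toy)
print(fold_scores.mean(), fold_scores.std())
```

The mean and standard deviation of `fold_scores` correspond to what `mean_test_score` and `std_test_score` report, up to the sign flip introduced by the scorer.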

---
### Ridge Regression

One possible explanation for the rather poor performance of our linear regression model is that it captures too much of the random noise. We therefore consider a different approach and try to make the solution less sensitive to noise by using Ridge regression, fitted on all 21 features, which thanks to its L2 regularization is less prone to overfitting than plain linear regression. This time we have a hyperparameter to tune, the regularization parameter $\alpha$, and we do so with a grid search, again using 10-fold cross-validation.


In [0]:
# Set pipeline
RR = Ridge(fit_intercept=False, random_state=1234)
pip = Pipeline(steps=[('RR', RR)])

# Define GridSearch parameters; following common practice, we search over
# several orders of magnitude
param_dict = {'RR__alpha':[0.1, 1 , 10, 100, 1000]}

In [0]:
# Run GridSearch
clf = GridSearchCV(pip, param_dict, cv=KFold(n_splits=10),
                   scoring=rmse_scorer, return_train_score=True)
clf.fit(X_train, y_train)

print('Mean CV test score: ')
print(clf.cv_results_['mean_test_score'])
print(' ')
print('Std CV test score: ')
print(clf.cv_results_['std_test_score'])
print(' ')

print('Mean CV train score: ')
print(clf.cv_results_['mean_train_score'])
print(' ')
print('Std CV train score: ')
print(clf.cv_results_['std_train_score'])
print(' ')


print('Best estimator parameter: ')
print(clf.best_params_)

We see that the best estimator is obtained for $\alpha = 100$. The mean cross-validation error is roughly 9.971 and would hence be well below the hard baseline. However, this estimate also has the highest variance among the models tested with different values of $\alpha$. Nonetheless, we will give it a shot: we take the coefficients of the model fitted to the whole data set with regularization parameter $\alpha=100$ and submit them to see how the model performs on the public test set.

In [0]:
# Extract coefficients
fitted_pip = clf.best_estimator_
RR_coefs = fitted_pip.named_steps['RR'].coef_
RR_coefs

Note that the code that transforms this array of coefficients into the desired CSV submission format is given in the Submission section of this notebook.
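
The effect of the L2 penalty described above can be illustrated with the closed-form Ridge solution $w = (X^\top X + \alpha I)^{-1} X^\top y$ for the intercept-free case. This is a sketch on synthetic data; `ridge_closed_form` is a hypothetical helper, not the solver that sklearn's `Ridge` necessarily uses internally.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Minimize ||y - Xw||^2 + alpha * ||w||^2 without an intercept."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

# Synthetic data (not the course data)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=50)

# Same orders of magnitude as the grid used in this section
alphas = (0.1, 1, 10, 100, 1000)
norms = [np.linalg.norm(ridge_closed_form(X_toy, y_toy, a)) for a in alphas]
print(norms)
```

The norm of the weight vector shrinks as $\alpha$ grows, which is the mechanism that makes the fit less sensitive to noise at the cost of additional bias.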

---

### Lasso

The results are better but not yet satisfactory; in fact, we obtain a public test score of 10.343 using this approach.

However, regularization seems to yield better results, which supports our impression that the data is quite noisy and that our current models capture too much of that random noise.

Hence we will use a linear model with L1 regularization, which generally drives weights to exactly 0 more readily and thereby makes our model more stable and less likely to overfit. We hope this reduces the variance of our model, which appears to be present given the large gap between the public test score and our mean cross-validation score, as well as the standard errors of the cross-validation scores.

In [0]:
# Set pipeline
Ls = Lasso(fit_intercept=False, max_iter=100000, random_state = 1234)
pip = Pipeline(steps=[('Lasso', Ls)])

# Define GridSearch parameter
param_dict = {'Lasso__alpha':[0.001, 0.1, 0.2, 0.5, 1, 1.5, 2, 5, 10]}

# Run GridSearch
clf = GridSearchCV(pip, param_dict, cv=KFold(n_splits=10),
                   scoring=rmse_scorer, return_train_score=True)
clf.fit(X_train, y_train)

print('Mean CV test score: ')
print(clf.cv_results_['mean_test_score'])
print(' ')
print('Std CV test score: ')
print(clf.cv_results_['std_test_score'])
print(' ')

print('Mean CV train score: ')
print(clf.cv_results_['mean_train_score'])
print(' ')
print('Std CV train score: ')
print(clf.cv_results_['std_train_score'])
print(' ')


print('Best estimator parameter: ')
print(clf.best_params_)

In [0]:
clf.cv_results_

We see that the optimal regularization parameter according to our cross-validation estimates of the RMSE is 0.1. The reported mean cross-validation error looks better than for Ridge regression, although the standard error of the RMSE across the folds is slightly higher than for Ridge regression.

Nevertheless, we will construct a submission based on the weights of the Lasso model fitted to the whole data set with $\alpha=0.1$ and check the performance of that model on the public test set.

In [0]:
fitted_pip = clf.best_estimator_
Lasso_coefs = fitted_pip.named_steps['Lasso'].coef_
Lasso_coefs
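
Why the L1 penalty produces exact zeros while the L2 penalty does not can be seen in the idealized case of an orthonormal design, where the Lasso solution is the soft-thresholded least-squares solution and the Ridge solution is a uniformly rescaled one. The sketch below only illustrates that special case; the weights and the threshold are made-up numbers, and sklearn's `alpha` is scaled differently.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink each entry of z towards zero by t, clipping at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

# Hypothetical least-squares weights
w_ols = np.array([3.0, -0.4, 0.05, -2.0])

w_l1 = soft_threshold(w_ols, 0.5)  # L1: small weights become exactly zero
w_l2 = w_ols / (1.0 + 0.5)         # L2: all weights shrink but stay non-zero

print(w_l1)
print(w_l2)
```

Counting the non-zero entries of `w_l1` and `w_l2` shows the implicit feature selection that the text attributes to the Lasso.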

---

The public test score is better than what we obtained for Ridge regression and, at 10.1123, well below the medium baseline but still above the hard baseline.

Nonetheless, the fact that Lasso shrank the weights of 8 features to 0, effectively excluding them, together with the better public test score and the better cross-validation error, suggests that our model still suffers from high variance and that a more sophisticated feature selection might be promising.

Before doing so, however, we will try larger values of $\alpha$ (a grid from $0.7$ to $10$), since this also provides a more drastic feature selection by shrinking more weights to zero.

In [0]:
# Set pipeline
Ls = Lasso(fit_intercept=False, max_iter=100000, random_state = 1234)
pip = Pipeline(steps=[('Lasso', Ls)])

# Define GridSearch parameter
param_dict = {'Lasso__alpha':[0.7,0.8, 0.9,1, 1.5, 2, 5, 10]}

# Run GridSearch
clf = GridSearchCV(pip, param_dict, cv=KFold(n_splits=10),
                   scoring=rmse_scorer, return_train_score=True)
clf.fit(X_train, y_train)

print('Mean CV test score: ')
print(clf.cv_results_['mean_test_score'])
print(' ')
print('Std CV test score: ')
print(clf.cv_results_['std_test_score'])
print(' ')

print('Mean CV train score: ')
print(clf.cv_results_['mean_train_score'])
print(' ')
print('Std CV train score: ')
print(clf.cv_results_['std_train_score'])
print(' ')


print('Best estimator parameter: ')
print(clf.best_params_)

The results look promising: the mean CV test score is still quite good, and the associated standard error is slightly reduced. So let us have a look at the coefficients.

In [0]:
fitted_pip = clf.best_estimator_
Lasso_coefs = fitted_pip.named_steps['Lasso'].coef_
Lasso_coefs

We see that only a few features remain: one linear, three quadratic and two exponential. Since this model is less complex, we anticipate more bias but less variance, and hence a smaller deviation between the CV test error estimate and the actual test error. Let us verify that by creating a submission using those coefficients.

---
### Manual Feature Selection

Recalling how the features were constructed, we will first build different subsets of features based on their functional form.

First, however, let us inspect the correlation structure of the features to see whether we face multicollinearity, which would lead to very unstable models. This could be one cause of the overfitting our models still show, as indicated by the large gap between our cross-validation error estimate and the public test score.

In [0]:
corr = X_train.corr()
print(X_train.shape)

f, ax = plt.subplots(figsize=(10, 8))
ax.set_title("Heatmap of the correlation structure")
sn.heatmap(
    corr,
    mask=np.zeros_like(corr, dtype=bool),
    cmap=sn.diverging_palette(220, 10, as_cmap=True),
    square=True,
    ax=ax)
plt.subplots_adjust(bottom=0.25)
plt.show()

What we see is that $x_1,\dots,x_5,\phi_6,\dots,\phi_{10}$ are fairly strongly correlated with $\phi_{11},\dots,\phi_{20}$; hence we will try a setting in which we include only $x_1,\dots,x_5,\phi_6,\dots,\phi_{10}$ together with the constant feature $\phi_{21}$.
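
Reading correlated groups off a heatmap is somewhat subjective, so it can help to list the strongly correlated pairs numerically. The following is a sketch on synthetic data; the helper `strongly_correlated_pairs`, the threshold of 0.9 and the uniform toy inputs are assumptions and say nothing about the actual training data.

```python
import numpy as np

def strongly_correlated_pairs(X, names, threshold=0.9):
    """Return (name_i, name_j, corr) for all column pairs with |corr| >= threshold."""
    corr = np.corrcoef(X, rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs

# Synthetic example: two independent inputs plus exp of the first one
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(500, 2))
X_toy = np.column_stack([x, np.exp(x[:, 0])])

pairs = strongly_correlated_pairs(X_toy, ["x1", "x2", "exp(x1)"])
print(pairs)
```

On a narrow input range a feature and its exponential are almost linearly related, which is the kind of redundancy the heatmap is meant to reveal.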

---
##### Quadratic Regression

Doing so yields a quadratic regression model. Before fitting the model, let us check whether removing all other predictors yields a nicer correlation structure among the remaining ones.

In [0]:
X_train_quad = X_train[['x1', 'x2', 'x3', 'x4', 'x5','phi6', 'phi7', 'phi8', 
                        'phi9', 'phi10', 'phi21']]
corr = X_train_quad.corr()

f, ax = plt.subplots(figsize=(10, 8))
ax.set_title("Heatmap of the correlation structure")
sn.heatmap(
    corr,
    mask=np.zeros_like(corr, dtype=bool),
    cmap=sn.diverging_palette(220, 10, as_cmap=True),
    square=True,
    ax=ax)
plt.subplots_adjust(bottom=0.25)
plt.show()

This is very much the case, as we no longer see any strongly correlated features. Hence we will now fit a Ridge regression model to this set of features.

In [0]:
# Set pipeline
RR = Ridge(fit_intercept=False, random_state=1234, max_iter=10000)
pip = Pipeline(steps=[('RR', RR)])

# Define GridSearch parameter
param_dict = {'RR__alpha':[0, 1, 10, 100, 1000]}
# Run GridSearch
clf = GridSearchCV(pip, param_dict, cv=KFold(n_splits=10),
                   scoring=rmse_scorer, return_train_score=True)
clf.fit(X_train_quad, y_train)

print('Mean CV test score: ')
print(clf.cv_results_['mean_test_score'])
print(' ')
print('Std CV test score: ')
print(clf.cv_results_['std_test_score'])
print(' ')

print('Mean CV train score: ')
print(clf.cv_results_['mean_train_score'])
print(' ')
print('Std CV train score: ')
print(clf.cv_results_['std_train_score'])
print(' ')

print('Best estimator parameter: ')
print(clf.best_params_)

The results look better in the sense that the standard error of our cross-validation scores is smaller than in our previous approaches. The errors themselves, however, are noticeably higher. This is not surprising: by removing a set of variables we increase the bias while reducing the variance. The reported score would nevertheless still clear the hard baseline by a wide margin if the model performed similarly on the private test set. We will construct the array of coefficients and submit it to get an idea of the performance on the public test set.

In [0]:
fitted_pip = clf.best_estimator_
QuadR_coefs = fitted_pip.named_steps['RR'].coef_
QuadR_coefs = np.concatenate((QuadR_coefs[0:10], np.repeat(0,10), np.array([QuadR_coefs[10]])),axis=0)
QuadR_coefs
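
The zero-padding in the cell above depends on the position of each feature block, which is easy to get wrong when the subset changes. A name-based variant is sketched below, assuming the column names `x1`..`x5` and `phi6`..`phi21` used in this notebook; `expand_coefs` and the demo values are hypothetical.

```python
import numpy as np

# Column names as they appear in X_train in this notebook
ALL_COLUMNS = [f"x{i}" for i in range(1, 6)] + [f"phi{i}" for i in range(6, 22)]

def expand_coefs(subset_coefs, subset_columns, all_columns=ALL_COLUMNS):
    """Place subset coefficients at their positions in the full 21-vector."""
    full = np.zeros(len(all_columns))
    for name, value in zip(subset_columns, subset_coefs):
        full[all_columns.index(name)] = value
    return full

# Demo with the linear + exponential + constant subset and made-up coefficients
subset_demo = (["x1", "x2", "x3", "x4", "x5"]
               + ["phi11", "phi12", "phi13", "phi14", "phi15", "phi21"])
full_demo = expand_coefs(np.arange(1, 12), subset_demo)
print(full_demo)
```

With such a helper the same call works for every feature subset tried in this section, for example `expand_coefs(coefs, X_train_quad.columns)`.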


The results are better than what we had for the Lasso regression on the whole data set, but not yet satisfactory. Motivated by the still remarkable gap between the cross-validation error estimate and the public test score, we aim for even more regularization by replacing the Ridge regression with the Lasso.

In [0]:
# Set pipeline
Ls = Lasso(fit_intercept=False, random_state=1234, max_iter=10000)
pip = Pipeline(steps=[('Lasso', Ls)])

# Define GridSearch parameter
param_dict = {'Lasso__alpha':[0.01, 0.1, 0.5, 1]}
# Run GridSearch
clf = GridSearchCV(pip, param_dict, cv=KFold(n_splits=10),
                   scoring=rmse_scorer, return_train_score=True)
clf.fit(X_train_quad, y_train)

print('Mean CV test score: ')
print(clf.cv_results_['mean_test_score'])
print(' ')
print('Std CV test score: ')
print(clf.cv_results_['std_test_score'])
print(' ')

print('Mean CV train score: ')
print(clf.cv_results_['mean_train_score'])
print(' ')
print('Std CV train score: ')
print(clf.cv_results_['std_train_score'])
print(' ')

print('Best estimator parameter: ')
print(clf.best_params_)

According to the cross-validation results, the Lasso performs marginally better.
Let us inspect the estimated coefficients.

In [0]:
fitted_pip = clf.best_estimator_
QuadL_coefs = fitted_pip.named_steps['Lasso'].coef_
QuadL_coefs = np.concatenate((QuadL_coefs[0:10], np.repeat(0,10), np.array([QuadL_coefs[10]])),axis=0)
QuadL_coefs

We see that Lasso basically excluded $\phi_7$. We will construct a submission based on these coefficients as well, to see whether the stronger regularization yields a public test score that is more consistent with the cross-validation error estimates. With a public test score of roughly 10.090, this is the case. However, the score is still not satisfactory. Without much further explanation, we will now check the other obvious feature subsets in a similar manner.

---
##### Exponential Function Fitting

We will now try using only $x_1,...,x_5,\phi_{11},...,\phi_{15}$ together with the constant feature $\phi_{21}$.

In [0]:
X_train_exp = X_train[['x1', 'x2', 'x3', 'x4', 'x5','phi11', 'phi12', 'phi13', 'phi14', 'phi15', 'phi21']]
corr = X_train_exp.corr()

f, ax = plt.subplots(figsize=(10, 8))
ax.set_title("Heatmap of the correlation structure")
sn.heatmap(
    corr,
    mask=np.zeros_like(corr, dtype=bool),
    cmap=sn.diverging_palette(220, 10, as_cmap=True),
    square=True,
    ax=ax)
plt.subplots_adjust(bottom=0.25)
plt.show()

We see quite strong correlations between the original linear features and the constructed exponential features. We will again fit a Lasso regression and look at the results.

In [0]:
# Set pipeline
Ls = Lasso(fit_intercept=False, random_state=1234, max_iter=10000)
pip = Pipeline(steps=[('Lasso', Ls)])

# Define GridSearch parameter
param_dict = {'Lasso__alpha':[0.5, 1, 2, 5, 10]}
# Run GridSearch
clf = GridSearchCV(pip, param_dict, cv=KFold(n_splits=10),
                   scoring=rmse_scorer, return_train_score=True)
clf.fit(X_train_exp, y_train)

print('Mean CV test score: ')
print(clf.cv_results_['mean_test_score'])
print(' ')
print('Std CV test score: ')
print(clf.cv_results_['std_test_score'])
print(' ')

print('Mean CV train score: ')
print(clf.cv_results_['mean_train_score'])
print(' ')
print('Std CV train score: ')
print(clf.cv_results_['std_train_score'])
print(' ')

print('Best estimator parameter: ')
print(clf.best_params_)

In [0]:
fitted_pip = clf.best_estimator_
ExpL_coefs = fitted_pip.named_steps['Lasso'].coef_
ExpL_coefs = np.concatenate((ExpL_coefs[0:5], np.repeat(0,5), ExpL_coefs[5:10],np.repeat(0,5), np.array([ExpL_coefs[10]])),axis=0)
ExpL_coefs

The scores look promising, especially because the train and test errors are almost the same, the standard error of the cross-validation is not too high, and, thanks to the strong regularization, we are left with a very simple model that is less likely to overfit than more complex models.

Since this model also yields a public test score of 10.009, and thereby the smallest difference between the cross-validation error estimate and the public test score, it is a promising candidate for the final submission.

---
##### Cosine Regression

Last but not least we also consider only a cosine regression.



In [0]:
X_train_cos = X_train[['x1', 'x2', 'x3', 'x4', 'x5','phi16', 'phi17', 'phi18', 'phi19', 'phi20', 'phi21']]
corr = X_train_cos.corr()

f, ax = plt.subplots(figsize=(10, 8))
ax.set_title("Heatmap of the correlation structure")
sn.heatmap(
    corr,
    mask=np.zeros_like(corr, dtype=bool),
    cmap=sn.diverging_palette(220, 10, as_cmap=True),
    square=True,
    ax=ax)
plt.subplots_adjust(bottom=0.25)
plt.show()

In [0]:
# Set pipeline
RR = Ridge(fit_intercept=False, random_state=1234, max_iter=10000)
pip = Pipeline(steps=[('RR', RR)])

# Define GridSearch parameter
param_dict = {'RR__alpha':[0, 0.1, 1, 100, 1000]}
# Run GridSearch
clf = GridSearchCV(pip, param_dict, cv=5, scoring=rmse_scorer, return_train_score=True)
clf.fit(X_train_cos, y_train)

print('Mean CV test score: ')
print(clf.cv_results_['mean_test_score'])
print(' ')
print('Std CV test score: ')
print(clf.cv_results_['std_test_score'])
print(' ')

print('Mean CV train score: ')
print(clf.cv_results_['mean_train_score'])
print(' ')
print('Std CV train score: ')
print(clf.cv_results_['std_train_score'])
print(' ')

print('Best estimator parameter: ')
print(clf.best_params_)

The reported scores are less promising, although the associated standard errors are also much smaller.

---
##### Quadratic-Exponential Regression

We will now also consider the other subsets starting with that having the exponential, linear and quadratic terms included.

In [0]:
X_train_exp_quad = X_train[['x1', 'x2', 'x3', 'x4', 'x5','phi6', 'phi7', 'phi8', 'phi9', 'phi10','phi11', 'phi12', 'phi13', 'phi14', 'phi15', 'phi21']]
corr = X_train_exp_quad.corr()

f, ax = plt.subplots(figsize=(10, 8))
ax.set_title("Heatmap of the correlation structure")
sn.heatmap(
    corr,
    mask=np.zeros_like(corr, dtype=bool),
    cmap=sn.diverging_palette(220, 10, as_cmap=True),
    square=True,
    ax=ax)
plt.subplots_adjust(bottom=0.25)
plt.show()

In [0]:
# Set pipeline
RR = Ridge(fit_intercept=False, random_state=1234, max_iter=10000)
pip = Pipeline(steps=[('RR', RR)])

# Define GridSearch parameter
param_dict = {'RR__alpha':[100, 200, 300]}
# Run GridSearch
clf = GridSearchCV(pip, param_dict, cv=5, scoring=rmse_scorer, return_train_score=True)
clf.fit(X_train_exp_quad, y_train)

print('Mean CV test score: ')
print(clf.cv_results_['mean_test_score'])
print(' ')
print('Std CV test score: ')
print(clf.cv_results_['std_test_score'])
print(' ')

print('Mean CV train score: ')
print(clf.cv_results_['mean_train_score'])
print(' ')
print('Std CV train score: ')
print(clf.cv_results_['std_train_score'])
print(' ')

print('Best estimator parameter: ')
print(clf.best_params_)

We obtain CV performance estimates very similar to those for the exponential and linear features alone. This suggests that it is not worth including the quadratic features when the exponential features are already included.

---
##### Quadratic and Cosine Regression

In [0]:
X_train_cos_quad = X_train[['x1', 'x2', 'x3', 'x4', 'x5','phi6', 'phi7', 'phi8', 'phi9', 'phi10','phi16', 'phi17', 'phi18', 'phi19', 'phi20', 'phi21']]
corr = X_train_cos_quad.corr()

f, ax = plt.subplots(figsize=(10, 8))
ax.set_title("Heatmap of the correlation structure")
sn.heatmap(
    corr,
    mask=np.zeros_like(corr, dtype=bool),
    cmap=sn.diverging_palette(220, 10, as_cmap=True),
    square=True,
    ax=ax)
plt.subplots_adjust(bottom=0.25)
plt.show()

In [0]:
# Set pipeline
RR = Ridge(fit_intercept=False, random_state=1234, max_iter=10000)
pip = Pipeline(steps=[('RR', RR)])

# Define GridSearch parameter
param_dict = {'RR__alpha':[0, 0.1, 1, 10, 100, 200, 300, 500, 1000]}
# Run GridSearch
clf = GridSearchCV(pip, param_dict, cv=5, scoring=rmse_scorer, return_train_score=True)
clf.fit(X_train_cos_quad, y_train)

print('Mean CV test score: ')
print(clf.cv_results_['mean_test_score'])
print(' ')
print('Std CV test score: ')
print(clf.cv_results_['std_test_score'])
print(' ')

print('Mean CV train score: ')
print(clf.cv_results_['mean_train_score'])
print(' ')
print('Std CV train score: ')
print(clf.cv_results_['std_train_score'])
print(' ')

print('Best estimator parameter: ')
print(clf.best_params_)

The results are very similar to those obtained using only the quadratic and linear features. The optimal penalty parameter $\alpha$ changes, but the associated mean CV scores and standard errors are of the same magnitude. Hence it does not seem promising to also include the cosine features if we already use the quadratic and linear features.

---

##### Cosine and Exponential Regression

In [0]:
X_train_cos_exp = X_train[['x1', 'x2', 'x3', 'x4', 'x5','phi11', 'phi12', 'phi13', 'phi14', 'phi15','phi16', 'phi17', 'phi18', 'phi19', 'phi20', 'phi21']]
corr = X_train_cos_exp.corr()

f, ax = plt.subplots(figsize=(10, 8))
ax.set_title("Heatmap of the correlation structure")
sn.heatmap(
    corr,
    mask=np.zeros_like(corr, dtype=bool),
    cmap=sn.diverging_palette(220, 10, as_cmap=True),
    square=True,
    ax=ax)
plt.subplots_adjust(bottom=0.25)
plt.show()

In [0]:
# Set pipeline
RR = Ridge(fit_intercept=False, random_state=1234, max_iter=10000)
pip = Pipeline(steps=[('RR', RR)])

# Define GridSearch parameter
param_dict = {'RR__alpha':[0, 0.1, 1, 10, 100, 200, 300, 500, 1000]}
# Run GridSearch
clf = GridSearchCV(pip, param_dict, cv=5, scoring=rmse_scorer, return_train_score=True)
clf.fit(X_train_cos_exp, y_train)

print('Mean CV test score: ')
print(clf.cv_results_['mean_test_score'])
print(' ')
print('Std CV test score: ')
print(clf.cv_results_['std_test_score'])
print(' ')

print('Mean CV train score: ')
print(clf.cv_results_['mean_train_score'])
print(' ')
print('Std CV train score: ')
print(clf.cv_results_['std_train_score'])
print(' ')

print('Best estimator parameter: ')
print(clf.best_params_)

The results are far worse than those obtained using only the linear features and their exponentially transformed versions. Thus, including the cosine features does not seem to yield any benefit but rather makes the performance worse.

### Automated Feature Selection

#### Support Vector Regression with Recursive Feature Selection

In the following we aim for even more robust solutions by using Support Vector Regression combined with recursive feature elimination. Note that in the cell below `RFECV` ranks the features by the coefficients of the `LinearSVR` itself; the Random Forest and Extra Trees regressors are defined but not passed to `RFECV`.

In [0]:
# Define Regressors
SVR = LinearSVR(fit_intercept=False, random_state=1234, max_iter=10000)
RFR = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=1234)
ETR = ExtraTreesRegressor(n_estimators=100, random_state=1234, oob_score=True)

# Run Recursive Feature Elimination with RMSE as score
RFE_SVR = RFECV(SVR, cv=10, scoring=rmse_scorer)
RFE_SVR = RFE_SVR.fit(X_train, y_train)
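
The idea behind recursive feature elimination can be sketched in a few lines: fit a linear model, drop the feature with the smallest absolute coefficient, and repeat. The sketch below is a simplification under stated assumptions: it uses ordinary least squares instead of `LinearSVR`, a fixed number of features to keep instead of a cross-validated choice, and synthetic data; `recursive_elimination` is a hypothetical helper.

```python
import numpy as np

def recursive_elimination(X, y, n_keep):
    """Return the indices of the n_keep columns that survive elimination."""
    active = list(range(X.shape[1]))
    while len(active) > n_keep:
        # Refit on the currently active columns
        w, *_ = np.linalg.lstsq(X[:, active], y, rcond=None)
        # Remove the column whose coefficient has the smallest magnitude
        active.pop(int(np.argmin(np.abs(w))))
    return active

# Synthetic data: only columns 1 and 4 carry signal
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 6))
y_toy = 3.0 * X_toy[:, 1] - 2.0 * X_toy[:, 4] + rng.normal(scale=0.1, size=300)

kept = recursive_elimination(X_toy, y_toy, n_keep=2)
print(kept)
```

Coefficient magnitudes are only comparable when the features live on similar scales, which is worth keeping in mind here because the exponential features can be much larger than the cosine features.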

In [0]:
mask = RFE_SVR.support_
mask

We see that the recursive backward selection keeps only $x_2, x_1^2, x_5^2, \exp(x_1), \exp(x_3), \cos(x_2), \cos(x_4)$ and the intercept. These are the features in the columns with indices 1, 5, 9, 10, 12, 16, 18 and 20.

Let us check the respective cv_test scores for the SVR fitted on that data set i.e. the mean score and the standard error of it.

In [0]:
RFE_SVR.grid_scores_

In [0]:
RFE_SVR.estimator_.coef_

Let us take those coefficients and just add one submission to get an idea how that would perform.

In [0]:
RFE_SVR_coefs = []
coef_idcs = [1,5,9,10,12,16,18, 20]
tmp=0
for i in range(21):
  if i in coef_idcs:
    RFE_SVR_coefs.append(RFE_SVR.estimator_.coef_[tmp])
    tmp += 1
  else:
    RFE_SVR_coefs.append(0)

RFE_SVR_coefs = np.array(RFE_SVR_coefs)
RFE_SVR_coefs
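
The loop above hard-codes the selected indices. Since the selector already exposes a boolean `support_` mask, the same expansion can be done with boolean indexing. A minimal sketch with made-up coefficient values (`scatter_coefs` is a hypothetical helper; the mask is rebuilt here from the indices listed above):

```python
import numpy as np

def scatter_coefs(mask, selected_coefs):
    """Write the coefficients of the selected features into a full-length vector."""
    mask = np.asarray(mask, dtype=bool)
    full = np.zeros(mask.shape[0])
    full[mask] = selected_coefs  # positions follow the order of True entries
    return full

# Mask equivalent to the selected column indices from the text
mask_demo = np.zeros(21, dtype=bool)
mask_demo[[1, 5, 9, 10, 12, 16, 18, 20]] = True

coefs_demo = np.arange(1.0, 9.0)  # eight placeholder coefficients
full_demo_mask = scatter_coefs(mask_demo, coefs_demo)
print(full_demo_mask)
```

In the notebook this would correspond to a call such as `scatter_coefs(RFE_SVR.support_, RFE_SVR.estimator_.coef_)`, which stays correct if the selected set changes.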

#### Ridge Regression with Recursive Feature Selection


In [0]:
# Define Regressors
RR = Ridge(fit_intercept=False, alpha=100)

# Run Recursive Feature Elimination with RMSE as score
RFE_RR = RFECV(RR, cv=10, scoring=rmse_scorer)
RFE_RR = RFE_RR.fit(X_train, y_train)

In [0]:
mask = RFE_RR.support_
mask

In [0]:
RFE_RR.ranking_

In [0]:
RFE_RR.grid_scores_

In [0]:
print('mean test score:')
print(np.max(RFE_RR.grid_scores_))

In [0]:
RFE_RR_coefs = []
tmp = 0
for i in range(21):
  if mask[i]:
    RFE_RR_coefs.append(RFE_RR.estimator_.coef_[tmp])
    tmp += 1
  else:
    RFE_RR_coefs.append(0)
RFE_RR_coefs

---

#### Extensive Feature Subset Selection GridSearch


In [0]:
# Set pipeline
RFR = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=1234)
rfe = RFE(RFR)
RR = Ridge(fit_intercept=False, random_state=1234, max_iter=1000)
pip = Pipeline(steps=[('RFE', rfe),('RR', RR)])

# Define GridSearch parameter
param_dict = {'RFE__n_features_to_select':np.arange(0,11)+6,'RR__alpha':[100, 200, 300]}
# Run GridSearch
clf = GridSearchCV(pip, param_dict, cv=5, scoring=rmse_scorer, return_train_score=True)
clf.fit(X_train, y_train)

print('Mean CV test score: ')
print(clf.cv_results_['mean_test_score'])
print(' ')
print('Std CV test score: ')
print(clf.cv_results_['std_test_score'])
print(' ')

print('Mean CV train score: ')
print(clf.cv_results_['mean_train_score'])
print(' ')
print('Std CV train score: ')
print(clf.cv_results_['std_train_score'])
print(' ')

print('Best estimator parameter: ')
print(clf.best_params_)

In [0]:
mask = clf.best_estimator_.named_steps['RFE'].support_
coefs = clf.best_estimator_.named_steps['RR'].coef_

RFE_RR_coefs = []
tmp = 0
for i in range(21):
  if mask[i]:
    RFE_RR_coefs.append(coefs[tmp])
    tmp += 1
  else:
    RFE_RR_coefs.append(0)
np.array(RFE_RR_coefs)

In [0]:
mask

In [0]:
# Note: feature 21 is constant across all data points, so its F-score is
# ill-defined. It is therefore dropped from the inputs below; uncomment the
# following line to silence any remaining warnings.
#warnings.filterwarnings('ignore')

# Set pipeline
SKB = SelectKBest(f_regression)
RR = Ridge(fit_intercept=True, random_state=1234, max_iter=1000)
LR = LinearRegression()
SVR = LinearSVR(random_state=1234)
pip = Pipeline(steps=[('SKB', SKB),('RR', SVR)])

# Define GridSearch parameter
param_dict = {'SKB__k':np.arange(1,21), 'RR__C':[0.1,1,2,3,5,10,15,20]}
# Run GridSearch
clf = GridSearchCV(pip, param_dict, cv=10, scoring=rmse_scorer, return_train_score=True)
clf.fit(X_train.iloc[:,:-1], y_train)

print('Mean CV test score: ')
print(clf.cv_results_['mean_test_score'])
print(' ')
print('Std CV test score: ')
print(clf.cv_results_['std_test_score'])
print(' ')

print('Mean CV train score: ')
print(clf.cv_results_['mean_train_score'])
print(' ')
print('Std CV train score: ')
print(clf.cv_results_['std_train_score'])
print(' ')

print('Best estimator parameter: ')
print(clf.best_params_)

In [0]:
# Look up the row of cv_results_ that belongs to the best parameter setting,
# so that all four numbers describe the same model
best = clf.best_index_

print('Test error of best estimator')
print(clf.cv_results_['mean_test_score'][best])

print('')

print('Test error std of best estimator')
print(clf.cv_results_['std_test_score'][best])

print('')

print('Train error of best estimator')
print(clf.cv_results_['mean_train_score'][best])

print('')

print('Train error std of best estimator')
print(clf.cv_results_['std_train_score'][best])

The metrics look promising. Let us inspect which features were selected, namely the 14 with the highest scores plus the intercept.

In [0]:
best_estimator = clf.best_estimator_
scores = best_estimator.named_steps['SKB'].scores_
print(scores)

Let us check which features those scores refer to.

In [0]:
ind = np.argpartition(scores, -14)[-14:]
sorted_idc = np.sort(ind)
print(sorted_idc)

We see that they refer to $x_1,\dots,x_4, x_1^2, x_3^2, x_5^2, e^{x_1},\dots,e^{x_4}, \cos(x_1),\dots,\cos(x_3), \cos(x_5)$.

Let us submit the determined coefficients for those.

In [0]:
SKB_RR_coefs =[]
tmp = 0
for i in range(20):
  if i in sorted_idc:
    SKB_RR_coefs.append(best_estimator.named_steps['RR'].coef_[tmp])
    tmp += 1
  else:
    SKB_RR_coefs.append(0.0)
SKB_RR_coefs.append(best_estimator.named_steps['RR'].intercept_)

SKB_RR_coefs = np.array(SKB_RR_coefs)
SKB_RR_coefs
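
The univariate scores used for the selection above can be understood through the textbook relation between the Pearson correlation $r$ of a feature with the target and the F-statistic of a one-feature linear fit, $F = \frac{r^2}{1-r^2}(n-2)$. The sketch below illustrates that relation on synthetic data; it is not a reimplementation of sklearn's `f_regression`, and `univariate_f_scores` is a hypothetical helper.

```python
import numpy as np

def univariate_f_scores(X, y):
    """F-statistic of a one-feature linear fit, computed per column of X."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Pearson correlation of each column with the target
    r = (Xc * yc[:, None]).sum(axis=0) / (
        np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    dof = X.shape[0] - 2
    return r ** 2 / (1.0 - r ** 2) * dof

# Synthetic data: column 2 is strongly, column 0 moderately related to y
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 4))
y_toy = 2.0 * X_toy[:, 2] + 1.0 * X_toy[:, 0] + rng.normal(size=500)

f_demo = univariate_f_scores(X_toy, y_toy)
top2 = np.sort(np.argpartition(f_demo, -2)[-2:])  # same trick as in the notebook
print(f_demo, top2)
```

For a constant column the centered feature is all zeros, so the correlation in this sketch becomes 0/0. This is the reason the constant feature is excluded before the selection step. Note also that each feature is scored in isolation, so redundant, highly correlated features can all receive high scores.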

---

### Submission

We construct the submission from the coefficients of the chosen trial.

In [0]:
submission.iloc[:,0]= Lasso_coefs
submission


---

## Export data

Finally, we use the Google Colab API to download our submission data frame in the form of a CSV file that we can upload to the submission platform.

In [0]:
from google.colab import files

ts = str(datetime.datetime.utcnow())
ts = ts.replace(' ', '_')
fname = 'Lasso_full_alpha08_cv10_9980_std074'+ts+'.csv'

with open(fname, 'w') as f:
  submission.to_csv(f, float_format='%.64f', index=False, header=False)

files.download(fname)