# Regression exercise

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn
import pandas

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error


## Loading and visualizing the data

Let's have a look at the dataset.

In [None]:
training_data = pandas.read_csv('../data/regression_training.csv')
seaborn.pairplot(training_data)
plt.show()

We have to divide our dataframe into features and targets, and we'll split some samples off for validation.

In [None]:
def split_targets(data, target_key):
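    '''Split a dataframe into features and target.

    Returns a tuple (features, target), where features is the dataframe
    without the target column and target is a copy of that column.
    '''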
    return data.drop(columns=target_key), data[target_key].copy()



# For visualisation purposes, we just use one validation split here.
train_features, val_features, train_target, val_target = train_test_split(*split_targets(training_data, 'target'),
                                                                          test_size=0.2)

print('Train features: ', train_features.shape)
print('Validation features: ', val_features.shape)

Let's fit a simple linear model:

In [None]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(train_features, train_target)

How well does it work?

In [None]:
def plot_model_predictions(model, features, targets, name):
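    '''Compare predictions vs. reference targets.

    Adds a scatter plot of predicted over reference values to the current
    figure and reports the mean squared error in the legend label.
    '''
    pred = model.predict(features)
    # scikit-learn's signature is (y_true, y_pred); MSE is symmetric, so the value is unchanged.
    mse = mean_squared_error(targets, pred)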
    
    plt.scatter(targets, pred, marker='+', label='MSE ({}): {:1.2f}'.format(name, mse))
    val_range = [min(targets.min(), pred.min()), max(targets.max(), pred.max())]
    plt.xlim(val_range)
    plt.ylim(val_range)
    plt.plot(val_range, val_range, color='lightgrey')
    plt.ylabel('Predicted')
    plt.xlabel('Reference')



plot_model_predictions(model, train_features, train_target, 'training')
plot_model_predictions(model, val_features, val_target, 'validation')
plt.legend()
plt.show()


That doesn't look great. Can we do better? How about a more complex model?

One simple way to achieve this is to add more features, e.g. with the ```PolynomialFeatures``` transformation as described [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html).

This can quite easily be combined with the estimator using the [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html). Note that we also added a [standard scaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler).
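To make the transformation concrete, here is a minimal sketch on a single made-up sample (not taken from the exercise data). For degree 2 and two features, `PolynomialFeatures` keeps the original columns and appends their squares and the interaction term.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# A single, made-up sample with two features a=2 and b=3.
sample = np.array([[2.0, 3.0]])

expansion = PolynomialFeatures(degree=2, include_bias=False)
expanded = expansion.fit_transform(sample)

# Columns are ordered as a, b, a^2, a*b, b^2.
print(expanded)  # [[2. 3. 4. 6. 9.]]
```

The number of generated columns grows quickly with the degree and the number of input features, which is worth keeping in mind for the questions below.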

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge

def create_polynomial_model(degree):
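    '''Create a pipeline that standardizes the features, expands them to
    polynomial features of the given degree and fits a linear regression.
    '''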
    return Pipeline([('scaling', StandardScaler()),
                     ('polynomial_features', PolynomialFeatures(degree=degree, include_bias=False)),
                     ('linear_regression', LinearRegression())])

model = create_polynomial_model(3)
model.fit(train_features, train_target)

plot_model_predictions(model, train_features, train_target, 'training')
plot_model_predictions(model, val_features, val_target, 'validation')
plt.legend()
plt.show()



- What happened here? How do the training and validation errors behave as the polynomial degree, and with it the number of features, increases?
- What happens if you replace the ```LinearRegression``` with a regularized regression? E.g. a [ridge](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html) or a [lasso](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html) regression?
- Do you find any non-linear regression models in scikit-learn? If yes, try it out and compare it to the (regularized) least squares regression from above.
- Is there a way to determine which features are important for the task?
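One possible way to approach the first two questions is to compare models with cross-validation instead of a single validation split. The sketch below is only an illustration under stated assumptions: it uses synthetic stand-in data rather than the exercise dataset, and the helper names `make_model` and `cv_mse`, the degrees and the `alpha` value are arbitrary choices, not part of the original exercise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (NOT the exercise dataset): a noisy quadratic target.
rng = np.random.default_rng(0)
features = rng.normal(size=(60, 3))
target = features[:, 0] ** 2 + features[:, 1] + rng.normal(scale=0.1, size=60)


def make_model(degree, regressor):
    '''Scale, expand to polynomial features, then fit the given regressor.'''
    return Pipeline([('scaling', StandardScaler()),
                     ('polynomial_features', PolynomialFeatures(degree=degree, include_bias=False)),
                     ('regression', regressor)])


def cv_mse(model, features, target, folds=5):
    '''Mean cross-validated MSE (scikit-learn reports it negated).'''
    scores = cross_val_score(model, features, target, cv=folds,
                             scoring='neg_mean_squared_error')
    return -scores.mean()


results = {}
for degree in (1, 2, 5):
    results[(degree, 'linear')] = cv_mse(make_model(degree, LinearRegression()), features, target)
    results[(degree, 'ridge')] = cv_mse(make_model(degree, Ridge(alpha=1.0)), features, target)

for (degree, name), mse in results.items():
    print('degree={} {:>6}: CV MSE = {:.3f}'.format(degree, name, mse))
```

With the exercise data, `features` and `target` would be replaced by the output of `split_targets(training_data, 'target')`. The regularization strength would normally be tuned as well rather than fixed.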

In [None]:
# TODO ...

## Final test

Finally, let's load the test dataset and see how well our model does.

In [None]:
raise RuntimeError("Are you sure you already want to test your regression model?")

plot_model_predictions(model, *split_targets(pandas.read_csv('../data/regression_test.csv'), 'target'), 'test')
plt.legend()
plt.show()

## More data

Now that you are happy with your model's performance, you tell everybody about it. A colleague approaches you, saying that they take the same measurements and that using your regression model would save them a lot of time. So you give it a try:

In [None]:
more_data = pandas.read_csv('../data/regression_more.csv')
more_features, more_targets = split_targets(more_data, 'target')

plot_model_predictions(model, more_features, more_targets, 'more')
plt.legend()
plt.show()

- How does it compare to your test evaluation? Is this normal? If not, what happened?
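One diagnostic that may help here, though it is not necessarily the explanation for this dataset, is to check whether the new samples fall inside the range of the features the model was trained on. The sketch below uses synthetic stand-in data and a hypothetical helper `fraction_outside_training_range`; neither is part of the original exercise.

```python
import numpy as np
import pandas


def fraction_outside_training_range(train_features, new_features):
    '''Per feature, the fraction of new samples outside the training min/max.'''
    lower = train_features.min()
    upper = train_features.max()
    outside = (new_features.lt(lower, axis='columns')
               | new_features.gt(upper, axis='columns'))
    return outside.mean()


# Synthetic stand-in data (NOT the exercise files): feature 'b' is shifted.
rng = np.random.default_rng(0)
train = pandas.DataFrame({'a': rng.uniform(0, 1, 200), 'b': rng.uniform(0, 1, 200)})
new = pandas.DataFrame({'a': rng.uniform(0, 1, 200), 'b': rng.uniform(2, 3, 200)})

outside_fraction = fraction_outside_training_range(train, new)
print(outside_fraction)
```

In the notebook, the same function could be called with `train_features` and `more_features`. A large fraction for some feature would suggest that the model is being asked to extrapolate.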