# Model selection 

The goal of this notebook is to start using scikit-learn for model selection. We will apply a simple regression algorithm, k-nearest neighbors, to the `vinho verde` data set, and try to find the best value for k.

You will also learn how to standardize features, and use the `pandas` library for simple data manipulation.

This notebook was created by [Chloé-Agathe Azencott](http://cazencott.info).

Throughout the lab, do not hesitate to refer heavily to the [scikit-learn documentation](http://scikit-learn.org/stable/documentation.html).

This notebook was created using
* python 3.4.3
* numpy 1.15.0
* matplotlib 2.2.2
* scikit-learn 0.19.2
* pandas 0.22.0

You can check your version of Python by running
```python
import sys
print(sys.version)
```

and the version of any module by running
```python
import <module name>
print(<module name>.__version__)
```
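
As a concrete instance of the template above, the following sketch prints the versions of the libraries listed earlier. It assumes they are installed under their usual import names (note that scikit-learn is imported as `sklearn`); the `versions` dictionary is only an illustrative helper.

```python
import importlib

# Import names of the libraries used in this notebook
# (scikit-learn is imported as `sklearn`).
module_names = ['numpy', 'matplotlib', 'sklearn', 'pandas']

versions = {}
for name in module_names:
    try:
        module = importlib.import_module(name)
        versions[name] = module.__version__
    except ImportError:
        # The module is not installed in this environment.
        versions[name] = None
    print(name, versions[name])
```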

## 1. Loading our data science libraries

In [None]:
%pylab inline
import pandas

## 2. Loading the data

The `vinho verde` data set contains physico-chemical information on a number of Portuguese wines, as well as their rating by human tasters. 

Our goal is to use these data to automatically predict the rating of a wine, so as to assist oenologists, improve wine production, and target the taste of niche consumers.

This data set has been made available on the UCI Machine Learning Repository, one of the oldest and best-known repositories of machine learning problems.

It is available from: http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/ (but already in your repository; we will focus on white wines here).

In [None]:
data = pandas.read_csv('data/winequality-white.csv', sep=";")

In [None]:
type(data)

We have loaded the data in a _pandas DataFrame_ object. Let us examine what information is available:

In [None]:
data.head()

The data contains 12 columns. The first 11 (from `fixed acidity` to `alcohol`) are physico-chemical features of the wines; the last one is their rating (or quality).

Let us extract from this data a numpy array that contains the design matrix X:

In [None]:
X = data.values[:, :-1]
print(X.shape)

__Question 1:__ Extract from this data a one-dimensional numpy array that contains the labels y.

In [None]:
# TODO
y = data.values[:, -1]
print(y.shape)
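
Positional slicing works because the rating is the last column. As an alternative, the same split can be written with column names, which does not depend on column order. The sketch below uses a small hypothetical DataFrame (`toy`, with made-up values) rather than the real file, and assumes the label column is named `quality`, as in the wine data.

```python
import pandas

# Hypothetical toy data with the same layout idea as the wine data:
# feature columns followed by a 'quality' label column.
toy = pandas.DataFrame({'alcohol': [8.8, 9.5, 10.1],
                        'pH': [3.0, 3.3, 3.26],
                        'quality': [6, 6, 6]})

# Design matrix: every column except the label.
X_toy = toy.drop(columns='quality').values
# Labels: the 'quality' column as a one-dimensional array.
y_toy = toy['quality'].values

print(X_toy.shape, y_toy.shape)
```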

Let us now plot a histogram of the values taken by each of our features:

In [None]:
# create a figure of size 16x12
fig = plt.figure(figsize=(16, 12))

for feat_idx in range(X.shape[1]):
    # create a subplot in the (feat_idx+1) position of a 3x4 grid
    ax = fig.add_subplot(3, 4, (feat_idx+1))
    # plot the histogram of feat_idx
    h = ax.hist(X[:, feat_idx], bins=50, color='steelblue',
                edgecolor='none')
    # use the name of the feature as a title for each histogram
    ax.set_title(data.columns[feat_idx], fontsize=14)

__Question 2:__
What are the ranges of values taken by the different features? What do you think will happen when one computes the Euclidean distance between two samples: will `free sulfur dioxide` be accounted for in the same way as `sulphates`? How will this affect the k-nearest neighbors algorithm?

__Answer:__

## 3. Model selection

### 3.1 Train and test sets

We will consider this problem as a regression problem. Note that it is not necessarily the best way to address this problem! In particular, a regression prediction will not be restricted to integer scores on a limited scale.

Let us start by separating our data into a training and a test set (containing 30% of the data). We will use cross-validation _on the training set_ to select the value of k, and we will report the performance of the selected model on the test set, as an approximation to the _generalization performance_ of our model.

This can be done with scikit-learn's [model_selection.train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).

In [None]:
from sklearn import model_selection

X_train, X_test, y_train, y_test = \
    model_selection.train_test_split(X, y,
                                    test_size=0.3 # 30% of the data goes to the test set
                                    )

In [None]:
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
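
The split above is random, so the selected k and the reported scores can change from one run to the next. If reproducibility matters, `train_test_split` accepts a `random_state` argument. A minimal sketch on hypothetical toy arrays (the seed value 42 is arbitrary):

```python
import numpy as np
from sklearn import model_selection

# Hypothetical toy data: 20 samples, 2 features.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.arange(20)

# Fixing random_state makes the shuffling, and hence the split, repeatable.
split_a = model_selection.train_test_split(X_toy, y_toy,
                                           test_size=0.3, random_state=42)
split_b = model_selection.train_test_split(X_toy, y_toy,
                                           test_size=0.3, random_state=42)

# Both calls return identical training and test sets.
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```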

### 3.2 Feature standardization

As we have observed in Section 2, the features need to be _standardized_ so as to be more or less on the same scale. We will achieve this by rescaling each of them to have mean 0 and standard deviation 1. They will therefore all have the same importance in the eyes of a distance-based learning algorithm.

We will use scikit-learn's [preprocessing.StandardScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).

In [None]:
from sklearn import preprocessing

# Create a standardizer object and fit it to the training data.
std_scale = preprocessing.StandardScaler().fit(X_train)

# Apply the standardization to the training and the test data.
X_train_std = std_scale.transform(X_train)
X_test_std = std_scale.transform(X_test)
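
To make the transformation explicit, the sketch below reproduces it by hand on hypothetical toy data: each feature has the training mean subtracted and is divided by the training standard deviation. The toy arrays and their scales are made up for illustration only.

```python
import numpy as np
from sklearn import preprocessing

rng = np.random.RandomState(0)

# Hypothetical toy data: two features on very different scales.
X_toy_train = np.column_stack([rng.normal(35.0, 17.0, size=200),
                               rng.normal(0.5, 0.1, size=200)])
X_toy_test = np.column_stack([rng.normal(35.0, 17.0, size=50),
                              rng.normal(0.5, 0.1, size=50)])

scaler = preprocessing.StandardScaler().fit(X_toy_train)

# Manual standardization using statistics of the training set only.
mu = X_toy_train.mean(axis=0)
sigma = X_toy_train.std(axis=0)  # population standard deviation (ddof=0)
X_toy_test_manual = (X_toy_test - mu) / sigma

# The scaler applies the same formula.
print(np.allclose(scaler.transform(X_toy_test), X_toy_test_manual))

# The standardized training features have mean 0 and standard deviation 1.
X_toy_train_std = scaler.transform(X_toy_train)
print(X_toy_train_std.mean(axis=0).round(3), X_toy_train_std.std(axis=0).round(3))
```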

__Question 3:__ Why did we fit the standardizer (i.e. compute the mean and standard deviation of each feature) on the training set only?

__Answer:__

__Question 4:__ Visualize the scaled data again to check that the standardization had the intended effect.

In [None]:
# TODO

### 3.3 Model selection by cross-validation

We will now choose the optimal value of k (the number of neighbors) in a k-nearest neighbor algorithm using a cross-validation procedure _on the training set_.

This can be done using scikit-learn's [model_selection.GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).

In addition, we will need scikit-learn's [neighbors.KNeighborsRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html#sklearn.neighbors.KNeighborsRegressor) to implement regression with a k-nearest neighbors algorithm.

In [None]:
from sklearn import neighbors

# Set the values of the hyperparameter to test
param_grid = {'n_neighbors': [3, 5, 7, 9, 11, 13, 15]}

# Pick a score to optimize, here the coefficient of determination R2
score = 'r2'

# Build a kNN regressor with hyperparameter search by cross-validation:
predictor = model_selection.GridSearchCV(neighbors.KNeighborsRegressor(), # a kNN
                                   param_grid, # hyperparameters to test
                                   cv=5, # number of folds of the cross-validation
                                   scoring=score # score to optimize
                                  )

# Optimize the regressor on the standardized training data
predictor.fit(X_train_std, y_train)

# Print optimal hyperparameter(s)
print("Best hyperparameter(s) on the training set:",
       predictor.best_params_)

# Print performance
print("Cross-validation results:")
for mean, std, params in zip(predictor.cv_results_['mean_test_score'], # mean score over the validation folds
                             predictor.cv_results_['std_test_score'], # standard deviation
                             predictor.cv_results_['params'] # value of the hyperparameter
                            ):
    print("\t%s = %0.3f (+/-%0.03f) for %r" % (score, # performance
                                              mean, # mean score
                                              std * 2, # error bar
                                              params # hyperparameter
                                              ))
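
To see what the grid search does, the same selection can be sketched by hand: for each candidate k, compute the cross-validated score and keep the k with the highest mean. This is an illustrative sketch on synthetic data from `make_regression`, not a description of the internals of `GridSearchCV`; the names `candidate_ks`, `cv_means` and `best_k` are hypothetical.

```python
from sklearn import datasets, model_selection, neighbors

# Synthetic regression data, used here instead of the wine data.
X_toy, y_toy = datasets.make_regression(n_samples=200, n_features=5,
                                        noise=10.0, random_state=0)

candidate_ks = [3, 5, 7]
cv_means = {}
for k in candidate_ks:
    knn = neighbors.KNeighborsRegressor(n_neighbors=k)
    # One R2 score per fold of a 5-fold cross-validation.
    fold_scores = model_selection.cross_val_score(knn, X_toy, y_toy,
                                                  cv=5, scoring='r2')
    cv_means[k] = fold_scores.mean()

# Keep the value of k with the highest mean cross-validated score.
best_k = max(cv_means, key=cv_means.get)
print(cv_means)
print("Selected k:", best_k)
```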

### 3.4 Generalization performance

We can now report the performance of our best model on the test set. By default, `GridSearchCV` refits the best model on the whole training set, so calling `predictor.predict` uses it directly.

We will use scikit-learn's [metrics](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics) module to this end.

In [None]:
from sklearn import metrics

y_pred = predictor.predict(X_test_std)
print("R2 on the test set: %0.3f" % metrics.r2_score(y_test, y_pred))

__Question 5:__ Compute the root mean squared error on the test set.

In [None]:
# TODO
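
As a reminder, the root mean squared error is the square root of the mean of the squared differences between predictions and true values. One possible way to compute it, sketched on hypothetical toy arrays rather than on the wine predictions, is to take the square root of `metrics.mean_squared_error`:

```python
import numpy as np
from sklearn import metrics

# Hypothetical true ratings and predictions (made-up values).
y_true_toy = np.array([5.0, 6.0, 7.0, 5.0])
y_pred_toy = np.array([5.5, 6.0, 6.0, 5.5])

# Mean squared error, then its square root.
mse_toy = metrics.mean_squared_error(y_true_toy, y_pred_toy)
rmse_toy = np.sqrt(mse_toy)

print("RMSE on the toy data: %0.3f" % rmse_toy)
```

Unlike R2, the RMSE is expressed in the same unit as the ratings, which makes it easier to interpret as a typical prediction error.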