# Cross validation
So far, we have trained one model with one parameter setting. Usually we want to compare different models and parameter settings. We don't want to use our test set for parameter optimization, so we can do yet another split: we split the training data into a training set and a validation set, and use the latter for parameter optimization. A more sophisticated way to do this is *cross validation*. Here we split our data into N parts, for example `X1, X2, X3` for N = 3. Then we use `X1+X2` for training and `X3` for validating, `X1+X3` for training and `X2` for validating, and `X2+X3` for training and `X1` for validating. Each part is used for validation exactly once, and the N validation scores are averaged.
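
To make the splitting scheme concrete, here is a minimal sketch in plain Python. The helper `k_fold_indices` is a hypothetical name used only for illustration; it is not part of scikit-learn, whose own splitters also handle shuffling and stratification.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) once per fold."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        # The current block is held out for validation, the rest is for training
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


# Six samples in three parts: every sample is validated on exactly once
folds = list(k_fold_indices(n_samples=6, n_folds=3))
for train, validation in folds:
    print("train:", train, "validate:", validation)
```

The order in which the parts are held out does not matter; what matters is that the training and validation indices never overlap within a fold.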

### Exercise
- Why do we not want to use the test set for parameter optimization?
- What are the advantages and disadvantages of cross validation compared to a single train-validation split?

Luckily, cross validation is easy in scikit-learn and requires little coding, especially if we already have a pipeline like the one we built earlier. Let's make that pipeline again:

In [5]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
import pandas as pd

In [2]:
pipe = Pipeline([
    ('scale', MinMaxScaler()),
    ('model', KNeighborsClassifier()) # Now we leave out the parameter that we are going to tune!
])

Let's see what parameters we could in theory tune:

In [3]:
pipe.get_params()

{'memory': None,
 'steps': [('scale', MinMaxScaler()), ('model', KNeighborsClassifier())],
 'verbose': False,
 'scale': MinMaxScaler(),
 'model': KNeighborsClassifier(),
 'scale__clip': False,
 'scale__copy': True,
 'scale__feature_range': (0, 1),
 'model__algorithm': 'auto',
 'model__leaf_size': 30,
 'model__metric': 'minkowski',
 'model__metric_params': None,
 'model__n_jobs': None,
 'model__n_neighbors': 5,
 'model__p': 2,
 'model__weights': 'uniform'}
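
The keys in this dictionary follow the pattern `<step name>__<parameter name>`, and the same names are what a parameter grid refers to. As a small sketch of that convention (the value 3 is an arbitrary choice for illustration), a parameter can also be set by hand with `set_params`:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

pipe = Pipeline([
    ('scale', MinMaxScaler()),
    ('model', KNeighborsClassifier()),
])

# 'model__n_neighbors' addresses the n_neighbors parameter of the 'model' step
pipe.set_params(model__n_neighbors=3)

n_neighbors = pipe.get_params()['model__n_neighbors']
print(n_neighbors)
# The underlying estimator object in the pipeline has been updated as well
print(pipe.named_steps['model'].n_neighbors)
```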

In [4]:
model = GridSearchCV(
    estimator=pipe,
    cv=3,  # 3-fold cross validation
    param_grid={
        'model__n_neighbors': [1, 2, 3, 4, 5],  # candidate values to compare
    },
)

In [7]:
# Read in our training data again
penguins_train = pd.read_csv('data/penguins_train_nona.csv')
numerical_features = penguins_train.columns[2:6]
X = penguins_train[numerical_features]
y = penguins_train['species']

In [8]:
# Fit the model
model.fit(X,y)

GridSearchCV(cv=3,
             estimator=Pipeline(steps=[('scale', MinMaxScaler()),
                                       ('model', KNeighborsClassifier())]),
             param_grid={'model__n_neighbors': [1, 2, 3, 4, 5]})

We can inspect the results as follows:

In [9]:
cv_results = pd.DataFrame(model.cv_results_)
cv_results

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_model__n_neighbors | params | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.006505 | 0.000289 | 0.00852 | 0.000756 | 1 | {'model__n_neighbors': 1} | 0.978022 | 0.978022 | 0.989011 | 0.981685 | 0.00518 | 1 |
| 1 | 0.01484 | 0.008781 | 0.015364 | 0.005134 | 2 | {'model__n_neighbors': 2} | 1.0 | 0.945055 | 0.989011 | 0.978022 | 0.023739 | 4 |
| 2 | 0.009235 | 0.003393 | 0.009098 | 0.000655 | 3 | {'model__n_neighbors': 3} | 0.989011 | 0.956044 | 0.989011 | 0.978022 | 0.015541 | 4 |
| 3 | 0.007702 | 0.001527 | 0.007836 | 0.000137 | 4 | {'model__n_neighbors': 4} | 1.0 | 0.956044 | 0.989011 | 0.981685 | 0.018678 | 1 |
| 4 | 0.009074 | 0.003549 | 0.010324 | 0.003781 | 5 | {'model__n_neighbors': 5} | 1.0 | 0.956044 | 0.989011 | 0.981685 | 0.018678 | 1 |


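This is a lot of information, but it basically tells us, for each parameter setting, the score on each cross validation split, together with the mean and standard deviation over the splits. In this run, `n_neighbors` values of 1, 4 and 5 tie for the best mean score (0.981685). By default, this score is the mean accuracy, but we could provide a different metric through the `scoring` argument of `GridSearchCV`.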
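
As a sketch of how a different metric could be plugged in, the example below repeats the grid search with `scoring='balanced_accuracy'` and reads the winning setting from `best_params_` and `best_score_`. To keep it self-contained it uses the iris dataset bundled with scikit-learn as stand-in data; that dataset and the choice of balanced accuracy are assumptions made for illustration, not part of the penguin analysis above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Stand-in data (assumption): iris ships with scikit-learn, so no file is needed
X_demo, y_demo = load_iris(return_X_y=True)

pipe = Pipeline([
    ('scale', MinMaxScaler()),
    ('model', KNeighborsClassifier()),
])

search = GridSearchCV(
    estimator=pipe,
    cv=3,
    param_grid={'model__n_neighbors': [1, 2, 3, 4, 5]},
    scoring='balanced_accuracy',  # replaces the default mean accuracy
)
search.fit(X_demo, y_demo)

# Parameter setting with the best mean validation score, and that score
print(search.best_params_)
print(search.best_score_)
```

With classes of very different sizes, a metric such as balanced accuracy can rank the candidate settings differently from plain accuracy, so it is worth choosing the metric before looking at the results.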

### Exercise: different models
Look at the scikit-learn documentation and choose a different model. Create a pipeline for it and use `GridSearchCV` to search over different parameter values. What do you find?