# Nested Cross-Validation

## Definition and explanation

The method splits the data in an organized way. The dataset is divided into N groups (here N = 10), and the procedure runs N iterations in which each group serves once as the test set. The remaining groups form the training set, which is itself split in two: the actual training set and a validation set for the model. An inner loop fits the models on this split, that is, it selects the best hyperparameters for each of them. At the end of the algorithm we know the best parameter and have an estimate of the model's real performance that is not inflated by overfitting, because the test fold was never used to choose the hyperparameters.
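
The splitting scheme described above can be sketched in plain Python. This is only an illustration of the index bookkeeping, not the notebook's implementation: the helper names `k_fold_indices` and `nested_splits` are hypothetical, the folds are contiguous and unshuffled, and no stratification is done (scikit-learn's `StratifiedKFold` handles shuffling and stratification in the cells below).

```python
from typing import Iterator, List, Tuple


def k_fold_indices(indices: List[int], k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, held_out) index lists for k contiguous folds."""
    n = len(indices)
    # Fold sizes differ by at most one when n is not a multiple of k.
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        held_out = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, held_out
        start += size


def nested_splits(n_samples: int, outer_k: int, inner_k: int):
    """Yield (inner_train, validation, test) for every outer/inner fold pair."""
    for outer_train, test in k_fold_indices(list(range(n_samples)), outer_k):
        # The inner folds are built from the outer training indices only,
        # so the test fold never takes part in hyperparameter selection.
        for inner_train, validation in k_fold_indices(outer_train, inner_k):
            yield inner_train, validation, test


splits = list(nested_splits(n_samples=20, outer_k=5, inner_k=4))
print(len(splits))  # 20 (5 outer folds x 4 inner folds)

inner_train, validation, test = splits[0]
print(test, validation)  # [0, 1, 2, 3] [4, 5, 6, 7]
```

For each outer fold, the hyperparameters would be chosen by comparing scores on the `validation` folds, and only the final model would be scored on `test`.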

Goal of this notebook: show why this practice matters when there are few samples (using KNN), then apply it to the other methods in order to compare them.

## KNN example

Read the dataset and import the libraries.

In [2]:
import pandas as pd
data = pd.read_csv('../../data/penguins_for_machine_learning.csv')
data.head()

| Unnamed: 0 | Biscoe | Dream | Torgersen | species | bill_length | bill_depth | flipper_length | body_mass | del15 | del13 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 | Adelie | 39.1 | 18.7 | 181.0 | 3750.0 | 8.859733 | -25.804194 |
| 1 | 0.0 | 0.0 | 1.0 | Adelie | 39.5 | 17.4 | 186.0 | 3800.0 | 8.94956 | -24.69454 |
| 2 | 0.0 | 0.0 | 1.0 | Adelie | 40.3 | 18.0 | 195.0 | 3250.0 | 8.36821 | -25.33302 |
| 3 | 0.0 | 0.0 | 1.0 | Adelie | 36.7 | 19.3 | 193.0 | 3450.0 | 8.76651 | -25.32426 |
| 4 | 0.0 | 0.0 | 1.0 | Adelie | 39.3 | 20.6 | 190.0 | 3650.0 | 8.66496 | -25.29805 |


Source: https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html

In [3]:
import numpy as np
from sklearn.model_selection import StratifiedKFold

# classifier
from sklearn.neighbors import KNeighborsClassifier

# others
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

# features and target; 'Unnamed: 0' is assumed to be a row index left over
# from the CSV export, so it is dropped along with the label column
X = data.drop(columns=['Unnamed: 0', 'species'])
y = data['species']
cv_outer = StratifiedKFold(n_splits=10, random_state=1729, shuffle=True)

# candidate values of k and the base classifier (identical for every fold)
n_neighbors_list = np.arange(1, 100, 5)
param_grid = {"n_neighbors": n_neighbors_list}
cls = KNeighborsClassifier()

# one result row per outer fold
results = []

# (1) outer cross-validation
for idx_train, idx_test in cv_outer.split(X, y):
  # split() yields positional indices, so use .iloc; X[idx_train] would look
  # the integers up as column labels and raise a KeyError
  X_outer_train, y_outer_train = X.iloc[idx_train], y.iloc[idx_train]
  X_outer_test, y_outer_test = X.iloc[idx_test], y.iloc[idx_test]

  # (2) inner cross-validation: choose k using the outer training set only
  search_knn = GridSearchCV(cls, param_grid, scoring="accuracy", cv=10)
  search_knn.fit(X_outer_train, y_outer_train)

  # train the KNN model with the optimal k just found
  # (with the default refit=True, search_knn.best_estimator_ is the same model)
  best_k = search_knn.best_params_["n_neighbors"]
  knn_cv_model = KNeighborsClassifier(n_neighbors=best_k)
  knn_cv_model.fit(X_outer_train, y_outer_train)

  # (1.1) accuracy on the held-out outer fold
  y_pred_knn = knn_cv_model.predict(X_outer_test)
  acc_knn = accuracy_score(y_outer_test, y_pred_knn)
  results.append({"model": "KNN", "best_k": best_k, "accuracy": acc_knn})

# DataFrame.append was removed in pandas 2.0, so build the frame once at the end
err_cv_nested = pd.DataFrame(results)
err_cv_nested

In [None]:
from matplotlib import pyplot as plt
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold
from sklearn.neighbors import KNeighborsClassifier

# Number of random trials
NUM_TRIALS = 30

# Features and target; 'Unnamed: 0' is assumed to be a row index left over
# from the CSV export, so it is dropped along with the label column
X = data.drop(columns=['Unnamed: 0', 'species'])
y = data['species']

# Set up possible values of parameters to optimize over
n_neighbors_list = np.arange(1, 100, 5)
param_grid = {"n_neighbors": n_neighbors_list}

# We will use a k-nearest-neighbors classifier
cls = KNeighborsClassifier()

# Arrays to store scores
non_nested_scores = np.zeros(NUM_TRIALS)
nested_scores = np.zeros(NUM_TRIALS)

# Loop for each trial
for i in range(NUM_TRIALS):

    # Choose cross-validation techniques for the inner and outer loops,
    # independently of the dataset.
    # E.g "GroupKFold", "LeaveOneOut", "LeaveOneGroupOut", etc.
    inner_cv = KFold(n_splits=4, shuffle=True, random_state=i)
    outer_cv = KFold(n_splits=4, shuffle=True, random_state=i)

    # Non-nested parameter search and scoring: the score is read directly
    # from the same folds that were used to pick k
    clf = GridSearchCV(estimator=cls, param_grid=param_grid, cv=inner_cv)
    clf.fit(X, y)
    non_nested_scores[i] = clf.best_score_

    # Nested CV with parameter optimization: the whole search is re-run
    # inside each outer training set and scored on the outer test fold
    nested_score = cross_val_score(clf, X=X, y=y, cv=outer_cv)
    nested_scores[i] = nested_score.mean()
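
Once `non_nested_scores` and `nested_scores` are filled, the per-trial gap between them shows how optimistic the non-nested estimate is. A minimal sketch of that summary follows; the helper name `summarize_gap` is hypothetical, the example scores are made up for illustration and are not results from this notebook, and the sample standard deviation from the standard library is used rather than NumPy's population default.

```python
from statistics import mean, stdev


def summarize_gap(non_nested, nested):
    """Return (mean, sample std. dev.) of the per-trial score difference."""
    diffs = [float(a) - float(b) for a, b in zip(non_nested, nested)]
    return mean(diffs), stdev(diffs)


# Hypothetical scores for three trials, for illustration only
example_non_nested = [0.80, 0.82, 0.78]
example_nested = [0.75, 0.80, 0.76]

avg_gap, gap_sd = summarize_gap(example_non_nested, example_nested)
print(f"average gap: {avg_gap:.3f}, std. dev.: {gap_sd:.3f}")
```

In the notebook this would be called as `summarize_gap(non_nested_scores, nested_scores)`; a positive average gap means the non-nested score overstates performance.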


