<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#model-selection" data-toc-modified-id="model-selection-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>model selection</a></span><ul class="toc-item"><li><span><a href="#Finding-the-indices-for-KFold" data-toc-modified-id="Finding-the-indices-for-KFold-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Finding the indices for <code>KFold</code></a></span></li><li><span><a href="#Finding-indices-with-GroupKFold" data-toc-modified-id="Finding-indices-with-GroupKFold-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Finding indices with <code>GroupKFold</code></a></span></li></ul></li></ul></div>

# model selection


In [556]:
import sklearn
import numpy as np


## Finding the indices for `KFold`

The class `KFold` allows us to generate the train and validation indicies for performing training and validation indicies with different parts of our data.


In [557]:
from sklearn.model_selection import KFold
X = np.random.randn(100,4)
X.shape

(100, 4)

In [565]:
folds = KFold(10,shuffle=False)
splits = folds.split(X)
for fold, (tr_ind,va_ind) in enumerate(splits):
    print(fold, va_ind)

0 [0 1 2 3 4 5 6 7 8 9]
1 [10 11 12 13 14 15 16 17 18 19]
2 [20 21 22 23 24 25 26 27 28 29]
3 [30 31 32 33 34 35 36 37 38 39]
4 [40 41 42 43 44 45 46 47 48 49]
5 [50 51 52 53 54 55 56 57 58 59]
6 [60 61 62 63 64 65 66 67 68 69]
7 [70 71 72 73 74 75 76 77 78 79]
8 [80 81 82 83 84 85 86 87 88 89]
9 [90 91 92 93 94 95 96 97 98 99]


If shuffle is true then we sample all rows from our dataset randomly

In [4]:

folds = KFold(10,shuffle=True)
splits = folds.split(X)
for tr_ind,va_ind in splits:
    print(va_ind)

[11 20 24 54 60 72 80 84 85 87]
[ 6 15 30 31 34 40 50 68 71 97]
[ 7 17 19 25 29 32 35 67 75 76]
[ 5 12 27 43 44 59 64 70 79 82]
[ 3  4 16 36 51 55 61 77 78 93]
[ 9 14 18 21 33 42 49 56 58 65]
[ 0  1 22 41 45 81 83 89 92 94]
[13 46 47 53 57 62 86 90 96 99]
[ 2  8 23 38 39 52 63 73 88 95]
[10 26 28 37 48 66 69 74 91 98]


# Training models in crosvalidation using GridsearchCV

In [5]:
from sklearn.model_selection import GridSearchCV
from sklearn import datasets
from sklearn.datasets import *

In [6]:
dataset = sklearn.datasets.california_housing.fetch_california_housing()



In [611]:
X = dataset.data
y = dataset.target

X.shape, y.shape

((20640, 8), (20640,))

In [612]:
X_tr, X_te, y_tr, y_te = sklearn.model_selection.train_test_split(X,y, random_state=1234)

X_tr.shape, y_tr.shape, X_te.shape, y_te.shape

((15480, 8), (15480,), (5160, 8), (5160,))

In [613]:
from sklearn import ensemble
from sklearn.ensemble import *

rf = sklearn.ensemble.RandomForestRegressor()

param_grid = {"max_depth":[5,None,10], "max_features":["auto",0.5]}

In [615]:
rf_grid = sklearn.model_selection.GridSearchCV(rf, 
                                               param_grid=param_grid,
                                               cv=4,
                                               n_jobs=-1)

In [616]:
rf_grid.fit(X_tr, y_tr)



GridSearchCV(cv=4, error_score='raise-deprecating',
             estimator=RandomForestRegressor(bootstrap=True, criterion='mse',
                                             max_depth=None,
                                             max_features='auto',
                                             max_leaf_nodes=None,
                                             min_impurity_decrease=0.0,
                                             min_impurity_split=None,
                                             min_samples_leaf=1,
                                             min_samples_split=2,
                                             min_weight_fraction_leaf=0.0,
                                             n_estimators='warn', n_jobs=None,
                                             oob_score=False, random_state=None,
                                             verbose=0, warm_start=False),
             iid='warn', n_jobs=-1,
             param_grid={'max_depth': [5, None, 10],
     

In [421]:
rf_grid.best_estimator_

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features=0.5, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=10,
                      n_jobs=None, oob_score=False, random_state=None,
                      verbose=0, warm_start=False)

## Saved metrics in a `GridSearchCV` object

After fitting a `GridsearchCV` object we can inspect the information saved during crossvalidation inside the field `.cv_results_`.

In [172]:
list(rf_grid.cv_results_.keys())

['mean_fit_time',
 'std_fit_time',
 'mean_score_time',
 'std_score_time',
 'param_max_depth',
 'param_max_features',
 'params',
 'split0_test_score',
 'split1_test_score',
 'split2_test_score',
 'split3_test_score',
 'mean_test_score',
 'std_test_score',
 'rank_test_score']

###  Configurations of  Hyperparameters tested

All the combinations of parameters tested are kept in `'params'`

In [173]:
rf_grid.cv_results_["params"]

[{'max_depth': 5, 'max_features': 'auto'},
 {'max_depth': 5, 'max_features': 0.5},
 {'max_depth': None, 'max_features': 'auto'},
 {'max_depth': None, 'max_features': 0.5},
 {'max_depth': 10, 'max_features': 'auto'},
 {'max_depth': 10, 'max_features': 0.5}]

The different scores for each of the CV splits are found in `splitsK_test_score`.

In this case we have 4 arrays because we used `cv=4` when we instanciated `rf_grid`.

```
'split0_test_score'
'split1_test_score'
'split2_test_score'
'split3_test_score'
 ```
 
 Each array contains  possition `k` the test results of the k'th combination of hyperparameters.

In [174]:
rf_grid.n_splits_

4

In [175]:
all_scores = [rf_grid.cv_results_[f"split{i}_test_score" ] for i in range(4)]
all_scores

[array([0.67904653, 0.68113888, 0.78396571, 0.7939842 , 0.77155897,
        0.77483569]),
 array([0.65584192, 0.64976763, 0.76681051, 0.79086791, 0.75872302,
        0.76166767]),
 array([0.65937317, 0.67534806, 0.77321602, 0.78566242, 0.76626231,
        0.76819501]),
 array([0.6615118 , 0.65283713, 0.785738  , 0.78951341, 0.76382912,
        0.77557798])]

In [176]:
rf_grid.best_score_

0.7900069842686165

Notice that `rf_grid.best_score_` is different than `np.max(all_scores)`.

The best score is found using the mean value of the crossvalidation scores.

Therefore `rf_grid.best_score_= np.max(rf_grid.cv_results_["mean_test_score"])`

In [193]:
np.max(rf_grid.cv_results_["mean_test_score"]), rf_grid.best_score_

(0.7900069842686165, 0.7900069842686165)

In [194]:
rf_grid.cv_results_["mean_test_score"]

array([0.66394335, 0.66477292, 0.77743256, 0.79000698, 0.76509335,
       0.77006909])

Notice that we can compute `mean_test_score` simpy computing the mean
over the different results in the different splits of the crossvalidation process.

In [179]:
np.array(all_scores).mean(axis=0)

array([0.66394335, 0.66477292, 0.77743256, 0.79000698, 0.76509335,
       0.77006909])

## Ranking results of the crossvalidation process

All combinations of hyperparameters are stored in **`.cv_results_["params"]`**

We can visuallize in a single dataframe the differnet pa

In [257]:
results = pd.DataFrame({"params": rf_grid.cv_results_["params"], 
                        "std_test_score": rf_grid.cv_results_["std_test_score"],
                        "mean_test_score": rf_grid.cv_results_["mean_test_score"],
                       })

In [265]:
results = results.sort_values(by=["mean_test_score"], ascending=False)
results

Unnamed: 0,params,std_test_score,mean_test_score
3,"{'max_depth': None, 'max_features': 0.5}",0.002987,0.790007
2,"{'max_depth': None, 'max_features': 'auto'}",0.007783,0.777433
5,"{'max_depth': 10, 'max_features': 0.5}",0.005638,0.770069
4,"{'max_depth': 10, 'max_features': 'auto'}",0.004619,0.765093
1,"{'max_depth': 5, 'max_features': 0.5}",0.013668,0.664773
0,"{'max_depth': 5, 'max_features': 'auto'}",0.008952,0.663943


We can also see the order (rank) of each parameter configuration in **`.cv_results_["rank_test_score"]`**.

In [262]:
rank_test_score = rf_grid.cv_results_["rank_test_score"]
rank_test_score

array([6, 5, 2, 1, 4, 3], dtype=int32)

Notice that **results of the grid search are sorted by `rf_grid.cv_results_["rank_test_score"]`.**

In [279]:
[rf_grid.cv_results_["params"][k-1] for k in rank]

[{'max_depth': 10, 'max_features': 0.5},
 {'max_depth': 10, 'max_features': 'auto'},
 {'max_depth': None, 'max_features': 0.5},
 {'max_depth': None, 'max_features': 'auto'},
 {'max_depth': 5, 'max_features': 0.5},
 {'max_depth': 5, 'max_features': 'auto'}]

# Generating all possible combinations of hyperparameters

In [280]:
from itertools import product

In [407]:
param_grid

{'max_depth': [5, None, 10], 'max_features': ['auto', 0.5]}

The function product can take as input several iterators and it will generate
all the combinations of the values in the iterators

In [408]:
[x for x in product(["a","b","c"],[1,2])]

[('a', 1), ('a', 2), ('b', 1), ('b', 2), ('c', 1), ('c', 2)]

We can use it to generate the combinations of different hyperparamaters

In [409]:
[x for x in product(param_grid["max_depth"], param_grid["max_features"])]

[(5, 'auto'), (5, 0.5), (None, 'auto'), (None, 0.5), (10, 'auto'), (10, 0.5)]

the `*` notation allows us to generate a correctly formated input for `product`

In [410]:
combination_params_values = [x for x in product(*param_grid.values())]
combination_params_values

[(5, 'auto'), (5, 0.5), (None, 'auto'), (None, 0.5), (10, 'auto'), (10, 0.5)]

Notice that we can then write the name of the param for each component

In [411]:
params_keys = list(param_grid.keys())

for values in combination_params_values:
    print(dict(zip(param_grid.keys(),values)))

{'max_depth': 5, 'max_features': 'auto'}
{'max_depth': 5, 'max_features': 0.5}
{'max_depth': None, 'max_features': 'auto'}
{'max_depth': None, 'max_features': 0.5}
{'max_depth': 10, 'max_features': 'auto'}
{'max_depth': 10, 'max_features': 0.5}


We can do all this in a single function that will generate the list with the combinations we want to explore given an space of hyperparameter values.

In [422]:
def generate_params(param_grid):
    combination_params_values = [x for x in product(*param_grid.values())]
    params = []
    for values in combination_params_values:
        params.append(dict(zip(param_grid.keys(),values)))
    return params

In [423]:
generate_params(param_grid)

[{'max_depth': 5, 'max_features': 'auto'},
 {'max_depth': 5, 'max_features': 0.5},
 {'max_depth': None, 'max_features': 'auto'},
 {'max_depth': None, 'max_features': 0.5},
 {'max_depth': 10, 'max_features': 'auto'},
 {'max_depth': 10, 'max_features': 0.5}]

Notice that this is the same as 

In [424]:
rf_grid.cv_results_["params"]

[{'max_depth': 5, 'max_features': 'auto'},
 {'max_depth': 5, 'max_features': 0.5},
 {'max_depth': None, 'max_features': 'auto'},
 {'max_depth': None, 'max_features': 0.5},
 {'max_depth': 10, 'max_features': 'auto'},
 {'max_depth': 10, 'max_features': 0.5}]

## Creating your own `GridSearchCV` 

In [426]:
param_grid = {"max_depth":[5,None,10], "max_features":["auto",0.5]}

In [427]:
generate_params(param_grid)

[{'max_depth': 5, 'max_features': 'auto'},
 {'max_depth': 5, 'max_features': 0.5},
 {'max_depth': None, 'max_features': 'auto'},
 {'max_depth': None, 'max_features': 0.5},
 {'max_depth': 10, 'max_features': 'auto'},
 {'max_depth': 10, 'max_features': 0.5}]

In [430]:
rf = sklearn.ensemble.RandomForestRegressor()

### Training using crosvalidation a given model

We want to have code that can train for each combination in `param_grid` we would like to train `cv` models and save the scores
of the different fitted models.

In [434]:
param_grid

{'max_depth': [5, None, 10], 'max_features': ['auto', 0.5]}

In [517]:
rf

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features=0.34, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators='warn',
                      n_jobs=None, oob_score=False, random_state=None,
                      verbose=0, warm_start=False)

In [641]:
def model_str(model, cv, params_combination):

    final_name = type(model).__name__ + "__fold{:02d}".format(cv)
    for item in params_combination.items():
        final_name += "__"
        final_name += item[0]+"=" +str(item[1])
        
    return final_name + ".joblib"

In [650]:
params_combinations =generate_params(param_grid)

model_str(rf,2, params_combinations[0])

'RandomForestRegressor__fold02__max_depth=5__max_features=auto.joblib'

In [669]:
import os
os.listdir("./")

['generating_validation_indices.ipynb', '.ipynb_checkpoints']

In [None]:
asdq[wpk ]

In [665]:
os.path.lexists("./alal")

False

In [737]:
from copy import deepcopy
import joblib
from joblib import dump, load
from pathlib import Path

def CVfit(X, y, model, cv, param_grid, score, path_folder="", save_model=False, verbose=0):
    
    cv_results_params = generate_params(param_grid)
    folds = KFold(cv,shuffle=False)
    cv_results_ = {}
    cv_results_["params"] = cv_results_params
    
    for fold in range(cv):
        cv_results_[f"split{fold}_test_score"] = np.array([])
    
    # Create target Directory if don't exist
    if not os.path.exists(path_folder):
        os.mkdir(path_folder)
        print("Directory " , path_folder ,  " Created ")
    else:    
        print("Directory " , path_folder ,  " already exists")

    for param_combination in cv_results_params:
        splits = folds.split(X)
        model_current = model.__class__(**param_combination)
        
        for fold, (tr_ind, va_ind) in enumerate(splits):
            fold_str = f"split{fold}_test_score"
            name = model_str(model_current, fold, param_combination)
                
            model_current.fit(X[tr_ind], y[tr_ind])
            test_score_fold       = model_current.score(X[va_ind], y[va_ind])
            cv_results_[fold_str] = np.append(cv_results_[fold_str], test_score_fold) 
            
            if save_model:
                file_path = os.path.join(path_folder, name) 
                fileName = Path(file_path)
            
                if fileName.is_file():
                    print(f"file {file_path} already exists")
                else:
                    joblib.dump(model_current, file_path) 
            
            if verbose ==1:
                print(name,f" --> test_score={test_score_fold}")
                
    
    test_scores = [cv_results_[f"split{i}_test_score" ] for i in range(4)]
    test_scores_arr = np.array(test_scores)
    
    cv_results_["best_test_score"] = np.max(test_scores_arr)
    cv_results_["mean_test_score"] = test_scores_arr.mean(axis=0)
    cv_results_["std_test_score"] = test_scores_arr.mean(axis=0)

    return cv_results_ 

In [738]:
cv = 4
rf = sklearn.ensemble.RandomForestRegressor()
path_models = "./saved_models"
cv_results_ = CVfit(X, y, rf, cv, param_grid, score=None, path_folder=path_models, save_model=True, verbose=1)

Directory  ./saved_models  Created 




RandomForestRegressor__fold00__max_depth=5__max_features=auto.joblib  --> test_score=0.5321063478434982
RandomForestRegressor__fold01__max_depth=5__max_features=auto.joblib  --> test_score=0.6307594302923779
RandomForestRegressor__fold02__max_depth=5__max_features=auto.joblib  --> test_score=0.5389194636864635
RandomForestRegressor__fold03__max_depth=5__max_features=auto.joblib  --> test_score=0.47436561520593346
RandomForestRegressor__fold00__max_depth=5__max_features=0.5.joblib  --> test_score=0.5060005114265005




RandomForestRegressor__fold01__max_depth=5__max_features=0.5.joblib  --> test_score=0.6012993005607387
RandomForestRegressor__fold02__max_depth=5__max_features=0.5.joblib  --> test_score=0.5084353852173539
RandomForestRegressor__fold03__max_depth=5__max_features=0.5.joblib  --> test_score=0.4668554676294122




RandomForestRegressor__fold00__max_depth=None__max_features=auto.joblib  --> test_score=0.5263879121570081
RandomForestRegressor__fold01__max_depth=None__max_features=auto.joblib  --> test_score=0.720771065413589
RandomForestRegressor__fold02__max_depth=None__max_features=auto.joblib  --> test_score=0.5983156280021272
RandomForestRegressor__fold03__max_depth=None__max_features=auto.joblib  --> test_score=0.5953685434050462




RandomForestRegressor__fold00__max_depth=None__max_features=0.5.joblib  --> test_score=0.512623688569302
RandomForestRegressor__fold01__max_depth=None__max_features=0.5.joblib  --> test_score=0.7186094090974982
RandomForestRegressor__fold02__max_depth=None__max_features=0.5.joblib  --> test_score=0.5905972557447783
RandomForestRegressor__fold03__max_depth=None__max_features=0.5.joblib  --> test_score=0.5641271356351979




RandomForestRegressor__fold00__max_depth=10__max_features=auto.joblib  --> test_score=0.5519698099942202
RandomForestRegressor__fold01__max_depth=10__max_features=auto.joblib  --> test_score=0.7160021728840288
RandomForestRegressor__fold02__max_depth=10__max_features=auto.joblib  --> test_score=0.607913719471272
RandomForestRegressor__fold03__max_depth=10__max_features=auto.joblib  --> test_score=0.597689159687149




RandomForestRegressor__fold00__max_depth=10__max_features=0.5.joblib  --> test_score=0.5671881101694052
RandomForestRegressor__fold01__max_depth=10__max_features=0.5.joblib  --> test_score=0.706890362485453
RandomForestRegressor__fold02__max_depth=10__max_features=0.5.joblib  --> test_score=0.591955201423953
RandomForestRegressor__fold03__max_depth=10__max_features=0.5.joblib  --> test_score=0.5643665159026661


In [739]:
models = os.listdir(path_models)
models

['RandomForestRegressor__fold00__max_depth=None__max_features=0.5.joblib',
 'RandomForestRegressor__fold01__max_depth=5__max_features=auto.joblib',
 'RandomForestRegressor__fold01__max_depth=None__max_features=auto.joblib',
 'RandomForestRegressor__fold02__max_depth=5__max_features=auto.joblib',
 'RandomForestRegressor__fold00__max_depth=None__max_features=auto.joblib',
 'RandomForestRegressor__fold00__max_depth=10__max_features=0.5.joblib',
 'RandomForestRegressor__fold03__max_depth=10__max_features=auto.joblib',
 'RandomForestRegressor__fold02__max_depth=10__max_features=auto.joblib',
 'RandomForestRegressor__fold03__max_depth=10__max_features=0.5.joblib',
 'RandomForestRegressor__fold03__max_depth=None__max_features=0.5.joblib',
 'RandomForestRegressor__fold00__max_depth=10__max_features=auto.joblib',
 'RandomForestRegressor__fold01__max_depth=10__max_features=auto.joblib',
 'RandomForestRegressor__fold03__max_depth=None__max_features=auto.joblib',
 'RandomForestRegressor__fold02__m

In [740]:
model = joblib.load(os.path.join(path_models, models[0]))
model

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features=0.5, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=10,
                      n_jobs=None, oob_score=False, random_state=None,
                      verbose=0, warm_start=False)

In [741]:
cv_results_

{'params': [{'max_depth': 5, 'max_features': 'auto'},
  {'max_depth': 5, 'max_features': 0.5},
  {'max_depth': None, 'max_features': 'auto'},
  {'max_depth': None, 'max_features': 0.5},
  {'max_depth': 10, 'max_features': 'auto'},
  {'max_depth': 10, 'max_features': 0.5}],
 'split0_test_score': array([0.53210635, 0.50600051, 0.52638791, 0.51262369, 0.55196981,
        0.56718811]),
 'split1_test_score': array([0.63075943, 0.6012993 , 0.72077107, 0.71860941, 0.71600217,
        0.70689036]),
 'split2_test_score': array([0.53891946, 0.50843539, 0.59831563, 0.59059726, 0.60791372,
        0.5919552 ]),
 'split3_test_score': array([0.47436562, 0.46685547, 0.59536854, 0.56412714, 0.59768916,
        0.56436652]),
 'best_test_score': 0.720771065413589,
 'mean_test_score': array([0.54403771, 0.52064767, 0.61021079, 0.59648937, 0.61839372,
        0.60760005]),
 'std_test_score': array([0.54403771, 0.52064767, 0.61021079, 0.59648937, 0.61839372,
        0.60760005])}


# Finding indices with `GroupKFold`

In [315]:
from sklearn.model_selection import GroupKFold

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([1, 2, 3, 4])
groups = np.array([0, 0, 2, 2])
group_kfold = GroupKFold(n_splits=2)
group_kfold.get_n_splits(X, y, groups)

2

In [14]:
print(group_kfold)

GroupKFold(n_splits=2)


In [17]:
for train_index, test_index in group_kfold.split(X, y, groups):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print(X_train, X_test, y_train, y_test)
    print("\n")

TRAIN: [0 1] TEST: [2 3]
[[1 2]
 [3 4]] [[5 6]
 [7 8]] [1 2] [3 4]


TRAIN: [2 3] TEST: [0 1]
[[5 6]
 [7 8]] [[1 2]
 [3 4]] [3 4] [1 2]


