
# Model selection


In [1]:
import sklearn
import numpy as np
import pandas as pd


## Finding the indices for `KFold`

The `KFold` class generates train and validation indices so that we can train and validate a model on different parts of our data.


In [2]:
from sklearn.model_selection import KFold
X = np.random.randn(100,4)
X.shape

(100, 4)

In [3]:
folds = KFold(10,shuffle=False)
splits = folds.split(X)
for fold, (tr_ind,va_ind) in enumerate(splits):
    print(fold, va_ind)

0 [0 1 2 3 4 5 6 7 8 9]
1 [10 11 12 13 14 15 16 17 18 19]
2 [20 21 22 23 24 25 26 27 28 29]
3 [30 31 32 33 34 35 36 37 38 39]
4 [40 41 42 43 44 45 46 47 48 49]
5 [50 51 52 53 54 55 56 57 58 59]
6 [60 61 62 63 64 65 66 67 68 69]
7 [70 71 72 73 74 75 76 77 78 79]
8 [80 81 82 83 84 85 86 87 88 89]
9 [90 91 92 93 94 95 96 97 98 99]
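As a quick sanity check on the splits printed above, the following sketch verifies that the validation folds are disjoint from their training folds and that, taken together, they cover every row exactly once. The names `X_demo`, `folds_demo`, `seen` and `covers_all_rows` are made up for this illustration and are not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.zeros((100, 4))          # hypothetical data; only the number of rows matters
folds_demo = KFold(n_splits=10, shuffle=False)

seen = []
for tr_ind, va_ind in folds_demo.split(X_demo):
    # Train and validation indices never overlap within a fold
    assert np.intersect1d(tr_ind, va_ind).size == 0
    seen.extend(va_ind.tolist())

# Across folds, every row is used for validation exactly once
covers_all_rows = sorted(seen) == list(range(len(X_demo)))
print(covers_all_rows)
```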


If `shuffle=True`, the rows are shuffled before being split into folds, so each validation fold contains a random subset of the row indices.

In [4]:

folds = KFold(10,shuffle=True)
splits = folds.split(X)
for tr_ind,va_ind in splits:
    print(va_ind)

[ 5 19 27 69 70 71 72 75 76 78]
[ 0  3  7 22 26 29 63 73 74 90]
[18 35 43 48 52 65 77 81 83 92]
[ 8 11 13 24 34 36 39 57 84 97]
[15 16 20 21 31 33 44 62 98 99]
[ 9 10 45 47 53 58 60 87 93 96]
[ 1  2 12 28 38 55 59 61 66 86]
[14 42 49 50 51 54 64 80 88 95]
[ 4  6 23 30 46 68 82 89 91 94]
[17 25 32 37 40 41 56 67 79 85]
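Because the shuffled folds change on every run, it can be useful to fix the seed. The sketch below assumes we want reproducible folds and uses the `random_state` argument of `KFold` for that; `X_demo` and the helper `validation_folds` are hypothetical names introduced only for this example.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.zeros((100, 4))  # hypothetical data

def validation_folds(seed):
    """Return the validation indices of each fold for a given seed."""
    kf = KFold(n_splits=10, shuffle=True, random_state=seed)
    return [va_ind for _, va_ind in kf.split(X_demo)]

run_a = validation_folds(seed=0)
run_b = validation_folds(seed=0)

# Same seed -> identical folds
same_folds = all(np.array_equal(a, b) for a, b in zip(run_a, run_b))
print(same_folds)
```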


##  `LeaveOneGroupOut`

- Provides train/test indices to split data according to a third-party provided group. 

- This group information can be used to encode arbitrary domain specific stratifications of the samples as integers.
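To make the two points above concrete, here is a minimal sketch on toy data, assuming three made-up groups encoded as integers. `X_demo` and `groups_demo` are hypothetical; each split holds out all the samples of exactly one group.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

X_demo = np.arange(12).reshape(6, 2)         # 6 hypothetical samples
groups_demo = np.array([0, 0, 1, 1, 1, 2])   # hypothetical group label per sample

logo = LeaveOneGroupOut()
held_out = []
for tr_ind, te_ind in logo.split(X_demo, groups=groups_demo):
    # Each test fold contains exactly one group
    held_out.append(np.unique(groups_demo[te_ind]).tolist())
    print("train:", tr_ind, "test:", te_ind)

# One split per distinct group
n_splits = logo.get_n_splits(groups=groups_demo)
```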


Now let us load the Boston housing dataset and build a group label that we will call `not_24`.


In [5]:
from sklearn import datasets
dataset = sklearn.datasets.load_boston()

In [6]:
X = dataset.data
y = dataset.target
X_df = pd.DataFrame(X)

In [7]:
X_df.head()

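|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |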


In [8]:
X_df[8].unique()

array([ 1.,  2.,  3.,  5.,  4.,  8.,  6.,  7., 24.])

In [9]:
X_df.shape

(506, 13)

Let us consider a case where we want the examples to be divided into two groups, A and B:

- A: examples whose value in column 8 is in `[1, 2, 3, 4, 5, 6, 7, 8]`
- B: examples whose value in column 8 is `24`.



In [10]:
not_24 = X_df[8] != 24
not_24 = np.array(not_24.values, dtype="int")

In [11]:
X.shape, not_24.shape

((506, 13), (506,))

In [12]:
not_24

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

In [13]:
from sklearn import linear_model
from sklearn import neural_network

cv   = sklearn.model_selection.LeaveOneGroupOut()
clf  = sklearn.linear_model.LinearRegression()
grid = sklearn.model_selection.GridSearchCV(clf, 
                                            param_grid={},
                                            cv=cv, 
                                            return_train_score=True, iid=True, n_jobs=-1)

grid.fit(X,y, groups=not_24)

GridSearchCV(cv=LeaveOneGroupOut(), error_score='raise-deprecating',
             estimator=LinearRegression(copy_X=True, fit_intercept=True,
                                        n_jobs=None, normalize=False),
             iid=True, n_jobs=-1, param_grid={}, pre_dispatch='2*n_jobs',
             refit=True, return_train_score=True, scoring=None, verbose=0)

In [14]:
grid.cv_results_

{'mean_fit_time': array([0.0500716]),
 'std_fit_time': array([0.00098145]),
 'mean_score_time': array([0.00226498]),
 'std_score_time': array([5.00679016e-06]),
 'params': [{}],
 'split0_test_score': array([-6.59688586]),
 'split1_test_score': array([-6.75857278e+19]),
 'mean_test_score': array([-4.99546684e+19]),
 'std_test_score': array([2.96774953e+19]),
 'rank_test_score': array([1], dtype=int32),
 'split0_train_score': array([0.86686183]),
 'split1_train_score': array([0.68323554]),
 'mean_train_score': array([0.77504868]),
 'std_train_score': array([0.09181314])}

In [15]:
grid.cv_results_['split0_train_score'], grid.cv_results_['split1_test_score'] 

(array([0.86686183]), array([-6.75857278e+19]))

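Notice that one of the test scores (the one in split1) is hugely negative.

This happens because `LeaveOneGroupOut` with two groups runs only two experiments. In one of them the examples from B (the smaller group) are used as the training set and the examples from A (the much larger group) are used as the validation set, so the model has to predict data unlike anything it was trained on. The default score of `LinearRegression` is R², which has no lower bound.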
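A score such as `-6.75857278e+19` is possible because the `.score` method of scikit-learn regressors returns R², which is 1 for perfect predictions, 0 for a model that always predicts the mean of the targets, and arbitrarily negative for predictions that are far off. The toy numbers below are made up purely to illustrate that behaviour with `sklearn.metrics.r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])   # made-up targets

# Predicting the mean of y_true gives R^2 = 0
r2_mean = r2_score(y_true, np.full_like(y_true, y_true.mean()))

# Predictions that are far off give a very negative R^2
r2_bad = r2_score(y_true, 1000.0 * y_true)
print(r2_mean, r2_bad)
```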



In [16]:
from sklearn import linear_model
from sklearn import neural_network

cv   = sklearn.model_selection.LeaveOneGroupOut()
clf  = sklearn.linear_model.LinearRegression()
grid = sklearn.model_selection.GridSearchCV(clf, param_grid={}, cv=cv, 
                                            return_train_score=True, iid=True, n_jobs=-1)

values_in_col8 = np.array(X_df[8].values, dtype="int")

grid.fit(X,y, groups=values_in_col8)

GridSearchCV(cv=LeaveOneGroupOut(), error_score='raise-deprecating',
             estimator=LinearRegression(copy_X=True, fit_intercept=True,
                                        n_jobs=None, normalize=False),
             iid=True, n_jobs=-1, param_grid={}, pre_dispatch='2*n_jobs',
             refit=True, return_train_score=True, scoring=None, verbose=0)

In [17]:
grid.cv_results_

{'mean_fit_time': array([0.0049252]),
 'std_fit_time': array([0.00366666]),
 'mean_score_time': array([0.00155658]),
 'std_score_time': array([0.00077298]),
 'params': [{}],
 'split0_test_score': array([0.70028178]),
 'split1_test_score': array([0.77076574]),
 'split2_test_score': array([0.64766753]),
 'split3_test_score': array([0.70385404]),
 'split4_test_score': array([0.64780526]),
 'split5_test_score': array([-1.95031015]),
 'split6_test_score': array([0.3699049]),
 'split7_test_score': array([0.69561196]),
 'split8_test_score': array([-6.59688586]),
 'mean_test_score': array([-1.36260276]),
 'std_test_score': array([3.16246777]),
 'rank_test_score': array([1], dtype=int32),
 'split0_train_score': array([0.74100996]),
 'split1_train_score': array([0.73639286]),
 'split2_train_score': array([0.7369053]),
 'split3_train_score': array([0.73722303]),
 'split4_train_score': array([0.71793616]),
 'split5_train_score': array([0.74817225]),
 'split6_train_score': array([0.74326855]),
 'sp

# Training models with cross-validation using `GridSearchCV`

In [18]:
from sklearn.model_selection import GridSearchCV
from sklearn import datasets
from sklearn.datasets import *

In [19]:
dataset = sklearn.datasets.fetch_california_housing()

In [20]:
X = dataset.data
y = dataset.target

X.shape, y.shape

((20640, 8), (20640,))

In [21]:
X_tr, X_te, y_tr, y_te = sklearn.model_selection.train_test_split(X,y, random_state=1234)

X_tr.shape, y_tr.shape, X_te.shape, y_te.shape

((15480, 8), (15480,), (5160, 8), (5160,))

In [22]:
from sklearn import ensemble
from sklearn.ensemble import *

rf = sklearn.ensemble.RandomForestRegressor()

param_grid = {"max_depth":[5,None,10], "max_features":["auto",0.5]}

In [23]:
rf_grid = sklearn.model_selection.GridSearchCV(rf, 
                                               param_grid=param_grid,
                                               cv=4,
                                               n_jobs=-1)

In [24]:
rf_grid.fit(X_tr, y_tr)



GridSearchCV(cv=4, error_score='raise-deprecating',
             estimator=RandomForestRegressor(bootstrap=True, criterion='mse',
                                             max_depth=None,
                                             max_features='auto',
                                             max_leaf_nodes=None,
                                             min_impurity_decrease=0.0,
                                             min_impurity_split=None,
                                             min_samples_leaf=1,
                                             min_samples_split=2,
                                             min_weight_fraction_leaf=0.0,
                                             n_estimators='warn', n_jobs=None,
                                             oob_score=False, random_state=None,
                                             verbose=0, warm_start=False),
             iid='warn', n_jobs=-1,
             param_grid={'max_depth': [5, None, 10],
     

In [25]:
rf_grid.best_estimator_

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features=0.5, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=10,
                      n_jobs=None, oob_score=False, random_state=None,
                      verbose=0, warm_start=False)

## Saved metrics in a `GridSearchCV` object

After fitting a `GridSearchCV` object we can inspect the information saved during cross-validation in the attribute `.cv_results_`.

In [26]:
list(rf_grid.cv_results_.keys())

['mean_fit_time',
 'std_fit_time',
 'mean_score_time',
 'std_score_time',
 'param_max_depth',
 'param_max_features',
 'params',
 'split0_test_score',
 'split1_test_score',
 'split2_test_score',
 'split3_test_score',
 'mean_test_score',
 'std_test_score',
 'rank_test_score']

### Hyperparameter configurations tested

All the combinations of parameters tested are kept in `'params'`.

In [27]:
rf_grid.cv_results_["params"]

[{'max_depth': 5, 'max_features': 'auto'},
 {'max_depth': 5, 'max_features': 0.5},
 {'max_depth': None, 'max_features': 'auto'},
 {'max_depth': None, 'max_features': 0.5},
 {'max_depth': 10, 'max_features': 'auto'},
 {'max_depth': 10, 'max_features': 0.5}]

The scores for each of the CV splits are found in `split{k}_test_score`, where `k` is the index of the split.

In this case we have 4 arrays because we used `cv=4` when we instantiated `rf_grid`.

```
'split0_test_score'
'split1_test_score'
'split2_test_score'
'split3_test_score'
```

Each array contains, at position `i`, the test score of the `i`-th combination of hyperparameters on that split.
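The layout described above can be checked on a small, fast example. This is only a sketch on synthetic data: `X_demo`, `y_demo`, `grid_demo` and the choice of `DecisionTreeRegressor` are assumptions made to keep the block quick and self-contained, not part of the notebook's experiment.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X_demo = rng.randn(80, 3)                      # hypothetical data
y_demo = X_demo[:, 0] + 0.1 * rng.randn(80)

grid_demo = GridSearchCV(DecisionTreeRegressor(random_state=0),
                         param_grid={"max_depth": [2, 4, None],
                                     "min_samples_leaf": [1, 5]},
                         cv=4)
grid_demo.fit(X_demo, y_demo)

n_configs = len(grid_demo.cv_results_["params"])   # 3 * 2 = 6 configurations
split_keys = [k for k in grid_demo.cv_results_
              if k.startswith("split") and k.endswith("_test_score")]
# One array per split, each with one entry per configuration
split_shapes = [grid_demo.cv_results_[k].shape for k in split_keys]
print(split_keys, split_shapes)
```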

In [28]:
rf_grid.n_splits_

4

In [29]:
all_scores = [rf_grid.cv_results_[f"split{i}_test_score" ] for i in range(4)]
all_scores

[array([0.67252297, 0.69078351, 0.7875498 , 0.79336746, 0.76635207,
        0.77253906]),
 array([0.65211501, 0.64603259, 0.77528091, 0.78702763, 0.75597663,
        0.77191017]),
 array([0.65956146, 0.66037908, 0.77395936, 0.78176884, 0.76153541,
        0.76974942]),
 array([0.65125066, 0.66191977, 0.78400208, 0.79228301, 0.77446017,
        0.77307928])]

In [30]:
rf_grid.best_score_

0.7886117316641134

Notice that `rf_grid.best_score_` differs from `np.max(all_scores)`.

The best score is the highest mean of the cross-validation scores over the splits, not the highest score of any single split.

Therefore `rf_grid.best_score_ == np.max(rf_grid.cv_results_["mean_test_score"])`.

In [31]:
np.max(rf_grid.cv_results_["mean_test_score"]), rf_grid.best_score_

(0.7886117316641134, 0.7886117316641134)

In [32]:
rf_grid.cv_results_["mean_test_score"]

array([0.65886253, 0.66477874, 0.78019804, 0.78861173, 0.76458107,
       0.77181948])

Notice that we can compute `mean_test_score` simply by taking the mean
over the results of the different splits of the cross-validation process.

In [33]:
np.array(all_scores).mean(axis=0)

array([0.65886253, 0.66477874, 0.78019804, 0.78861173, 0.76458107,
       0.77181948])
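The difference between the best mean score and the best single-split score can be reproduced from the numbers printed in cell 29 alone. The sketch below copies those values into a hypothetical array `all_scores_demo` so that it runs without refitting anything.

```python
import numpy as np

# Per-split test scores copied from the output of cell 29 (one row per split)
all_scores_demo = np.array([
    [0.67252297, 0.69078351, 0.7875498, 0.79336746, 0.76635207, 0.77253906],
    [0.65211501, 0.64603259, 0.77528091, 0.78702763, 0.75597663, 0.77191017],
    [0.65956146, 0.66037908, 0.77395936, 0.78176884, 0.76153541, 0.76974942],
    [0.65125066, 0.66191977, 0.78400208, 0.79228301, 0.77446017, 0.77307928],
])

mean_per_config = all_scores_demo.mean(axis=0)   # one mean per hyperparameter configuration
best_index = int(np.argmax(mean_per_config))     # configuration with the best mean score
best_mean = mean_per_config[best_index]
best_single = all_scores_demo.max()              # best score of any single split
print(best_index, best_mean, best_single)
```

The best mean belongs to configuration 3, and it is lower than the best single-split score, which is why `best_score_` and `np.max(all_scores)` disagree.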

# Generating all possible combinations of hyperparameters

In [34]:
from itertools import product

In [35]:
param_grid

{'max_depth': [5, None, 10], 'max_features': ['auto', 0.5]}

The function `product` takes several iterables as input and generates
all the combinations of their values.

In [36]:
[x for x in product(["a","b","c"],[1,2])]

[('a', 1), ('a', 2), ('b', 1), ('b', 2), ('c', 1), ('c', 2)]

We can use it to generate the combinations of different hyperparameters.

In [37]:
[x for x in product(param_grid["max_depth"], param_grid["max_features"])]

[(5, 'auto'), (5, 0.5), (None, 'auto'), (None, 0.5), (10, 'auto'), (10, 0.5)]

The `*` notation unpacks the values of `param_grid` into separate arguments, which is the input format that `product` expects.

In [38]:
combination_params_values = [x for x in product(*param_grid.values())]
combination_params_values

[(5, 'auto'), (5, 0.5), (None, 'auto'), (None, 0.5), (10, 'auto'), (10, 0.5)]

Notice that we can then write the name of the param for each component

In [39]:
params_keys = list(param_grid.keys())

for values in combination_params_values:
    print(dict(zip(params_keys, values)))

{'max_depth': 5, 'max_features': 'auto'}
{'max_depth': 5, 'max_features': 0.5}
{'max_depth': None, 'max_features': 'auto'}
{'max_depth': None, 'max_features': 0.5}
{'max_depth': 10, 'max_features': 'auto'}
{'max_depth': 10, 'max_features': 0.5}


We can do all this in a single function that generates the list of combinations we want to explore, given a space of hyperparameter values.

In [40]:
def generate_params(param_grid):
    combination_params_values = [x for x in product(*param_grid.values())]
    params = []
    for values in combination_params_values:
        params.append(dict(zip(param_grid.keys(),values)))
    return params

In [41]:
generate_params(param_grid)

[{'max_depth': 5, 'max_features': 'auto'},
 {'max_depth': 5, 'max_features': 0.5},
 {'max_depth': None, 'max_features': 'auto'},
 {'max_depth': None, 'max_features': 0.5},
 {'max_depth': 10, 'max_features': 'auto'},
 {'max_depth': 10, 'max_features': 0.5}]

Notice that this is the same as `rf_grid.cv_results_["params"]`:

In [42]:
rf_grid.cv_results_["params"]

[{'max_depth': 5, 'max_features': 'auto'},
 {'max_depth': 5, 'max_features': 0.5},
 {'max_depth': None, 'max_features': 'auto'},
 {'max_depth': None, 'max_features': 0.5},
 {'max_depth': 10, 'max_features': 'auto'},
 {'max_depth': 10, 'max_features': 0.5}]
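scikit-learn also ships a helper, `sklearn.model_selection.ParameterGrid`, that expands a grid into a list of dictionaries. The sketch below compares it with the `product`-based approach; `generate_params_demo` is a condensed copy of the logic shown earlier, repeated only so that the block runs on its own. Note that `ParameterGrid` sorts the parameter names alphabetically, so the two lists are only guaranteed to come out in the same order when the keys of the grid are already sorted, as they are here.

```python
from itertools import product
from sklearn.model_selection import ParameterGrid

param_grid_demo = {"max_depth": [5, None, 10], "max_features": ["auto", 0.5]}

def generate_params_demo(param_grid):
    """Expand a grid into a list of dicts, one per combination of values."""
    return [dict(zip(param_grid.keys(), values))
            for values in product(*param_grid.values())]

from_sklearn = list(ParameterGrid(param_grid_demo))
from_product = generate_params_demo(param_grid_demo)
print(from_sklearn == from_product)
```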

## Ranking results of the crossvalidation process

All combinations of hyperparameters are stored in **`.cv_results_["params"]`**

We can visualize the different parameter configurations and their scores in a single dataframe.

In [43]:
results = pd.DataFrame({"params": rf_grid.cv_results_["params"], 
                        "std_test_score": rf_grid.cv_results_["std_test_score"],
                        "mean_test_score": rf_grid.cv_results_["mean_test_score"],
                       })

In [44]:
results = results.sort_values(by=["mean_test_score"], ascending=False)
results

|   | params | std_test_score | mean_test_score |
|---|---|---|---|
| 3 | {'max_depth': None, 'max_features': 0.5} | 0.004621 | 0.788612 |
| 2 | {'max_depth': None, 'max_features': 'auto'} | 0.005736 | 0.780198 |
| 5 | {'max_depth': 10, 'max_features': 0.5} | 0.001265 | 0.771819 |
| 4 | {'max_depth': 10, 'max_features': 'auto'} | 0.006783 | 0.764581 |
| 1 | {'max_depth': 5, 'max_features': 0.5} | 0.016242 | 0.664779 |
| 0 | {'max_depth': 5, 'max_features': 'auto'} | 0.008523 | 0.658863 |


We can also see the order (rank) of each parameter configuration in **`.cv_results_["rank_test_score"]`**.

In [45]:
rank_test_score = rf_grid.cv_results_["rank_test_score"]
rank_test_score

array([6, 5, 2, 1, 4, 3], dtype=int32)

Notice that **the results of the grid search are stored in the order of `params`, not sorted by rank**: `rank_test_score[i]` is the rank of configuration `i`. To list the configurations from best to worst, we index `params` with `np.argsort(rank_test_score)`.

In [46]:
[rf_grid.cv_results_["params"][i] for i in np.argsort(rank_test_score)]

[{'max_depth': None, 'max_features': 0.5},
 {'max_depth': None, 'max_features': 'auto'},
 {'max_depth': 10, 'max_features': 0.5},
 {'max_depth': 10, 'max_features': 'auto'},
 {'max_depth': 5, 'max_features': 0.5},
 {'max_depth': 5, 'max_features': 'auto'}]

##### Ranking the results of cross-validation: 1: best result, 2: second-best result, etc.

We can find in `rank_test_score` a ranking of the best (1) to the worst (6) results

In [47]:
rf_grid.cv_results_["rank_test_score"]

array([6, 5, 2, 1, 4, 3], dtype=int32)

We can generate this vector using `scipy.stats.rankdata` in order to rank the solutions according to `mean_test_score`

In [48]:
import scipy.stats
rank = scipy.stats.rankdata(-rf_grid.cv_results_["mean_test_score"],method='ordinal')
rank

array([6, 5, 2, 1, 4, 3])
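When there are no ties, the same ranking can also be obtained with NumPy alone by applying `argsort` twice. The sketch below uses the `mean_test_score` values printed in cell 32, copied into a hypothetical array `mean_scores_demo`, and compares the result with `scipy.stats.rankdata`.

```python
import numpy as np
from scipy.stats import rankdata

# mean_test_score values copied from the output of cell 32
mean_scores_demo = np.array([0.65886253, 0.66477874, 0.78019804,
                             0.78861173, 0.76458107, 0.77181948])

rank_scipy = rankdata(-mean_scores_demo, method="ordinal")

order = np.argsort(-mean_scores_demo)   # indices of the configurations, best first
rank_numpy = np.argsort(order) + 1      # rank of each configuration (1 = best)
print(order, rank_numpy)
```

Here `order` answers "which configuration is in position k?", while `rank_numpy` answers "which position does configuration i occupy?".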

## Creating your own `GridSearchCV` 

In [49]:
param_grid = {"max_depth":[5,None,10], "max_features":["auto",0.5]}

In [50]:
generate_params(param_grid)

[{'max_depth': 5, 'max_features': 'auto'},
 {'max_depth': 5, 'max_features': 0.5},
 {'max_depth': None, 'max_features': 'auto'},
 {'max_depth': None, 'max_features': 0.5},
 {'max_depth': 10, 'max_features': 'auto'},
 {'max_depth': 10, 'max_features': 0.5}]

In [51]:
rf = sklearn.ensemble.RandomForestRegressor()

### Training a given model using cross-validation

We want code that, for each combination in `param_grid`, trains `cv` models (one per fold) and saves the scores of the fitted models.
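Stripped of logging and saving, the idea boils down to two nested loops: one over hyperparameter combinations and one over folds. The following is a minimal sketch on synthetic data, not the full implementation. It assumes scikit-learn's `ParameterGrid` to expand the grid and `sklearn.base.clone` to obtain a fresh, unfitted model for each fit; `X_demo`, `y_demo`, `base_model` and `grid_demo` are hypothetical names.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold, ParameterGrid
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X_demo = rng.randn(60, 3)                       # hypothetical data
y_demo = X_demo[:, 0] + 0.1 * rng.randn(60)

base_model = DecisionTreeRegressor(random_state=0)
grid_demo = {"max_depth": [2, None]}            # hypothetical grid
n_folds = 3

# scores[i, j] = score of configuration i on the validation part of fold j
scores = np.zeros((len(ParameterGrid(grid_demo)), n_folds))
for i, params in enumerate(ParameterGrid(grid_demo)):
    for j, (tr_ind, va_ind) in enumerate(KFold(n_folds).split(X_demo)):
        model = clone(base_model).set_params(**params)   # fresh, unfitted copy
        model.fit(X_demo[tr_ind], y_demo[tr_ind])
        scores[i, j] = model.score(X_demo[va_ind], y_demo[va_ind])

# Mean score per configuration, as in `mean_test_score`
print(scores.mean(axis=1))
```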

In [52]:
param_grid

{'max_depth': [5, None, 10], 'max_features': ['auto', 0.5]}

In [53]:
rf

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators='warn',
                      n_jobs=None, oob_score=False, random_state=None,
                      verbose=0, warm_start=False)

In [54]:
def model_str(model, cv, params_combination):

    final_name = type(model).__name__ + "__fold{:02d}".format(cv)
    for item in params_combination.items():
        final_name += "__"
        final_name += item[0]+"=" +str(item[1])
        
    return final_name + ".joblib"


In [55]:
params_combinations = generate_params(param_grid)
model_str(rf,2, params_combinations[0])

'RandomForestRegressor__fold02__max_depth=5__max_features=auto.joblib'

In [56]:
import os
os.listdir("./")

['.ipynb_checkpoints',
 'generating_validation_indices.ipynb',
 'gridsearch_cv.ipynb']

In [57]:
from copy import deepcopy
import joblib
from joblib import dump, load
from pathlib import Path
from sklearn.metrics import roc_auc_score
import datetime


def CVfit(X, y, model, cv, param_grid, scorer, path_folder="",
          save_models=False, save_summary=True, add_time=False, verbose=0):
    
    """
    
    Important notes:
    
    - By default this method uses the model.score function to score the test results.
    
    - If `scorer="roc_auc_score"` or `scorer=sklearn.metrics.roc_auc_score` then the method uses 
    `model.predict_proba` instead of `model.predict` when scoring the predictions.
    
    """
    
    cv_results_params = generate_params(param_grid)
    folds = KFold(cv,shuffle=False)
    cv_results_ = {}
    cv_results_["params"] = cv_results_params
    
    if add_time:
        now = datetime.datetime.now()
        time_id = f"__{now.hour}:{now.minute}:{now.second}__{now.day}-{now.month}-{now.year}"
        path_folder = path_folder +  time_id       
    
    for fold in range(cv):
        cv_results_[f"split{fold}_test_score"] = np.array([])
    
    # Create the target directory if it doesn't exist
    if not os.path.exists(path_folder):
        os.mkdir(path_folder)
        print("Directory " , path_folder ,  " Created ")
    else:    
        print("WARNING: Directory " , path_folder ,  " already exists")

    
    cvfit_log_file = open(os.path.join(path_folder, 'cvfit_log.txt'), "w")

        
    for param_combination in cv_results_params:
        splits = folds.split(X)
        model_current = model.__class__(**param_combination)
        
        for fold, (tr_ind, va_ind) in enumerate(splits):
            fold_str = f"split{fold}_test_score"
            name = model_str(model_current, fold, param_combination)
                
            model_current.fit(X[tr_ind], y[tr_ind])
            
            if scorer:
                # Compare against the string first: a str has no `__name__` attribute.
                if scorer == "roc_auc_score" or getattr(scorer, "__name__", "") == "roc_auc_score":
                    # sklearn metrics take (y_true, y_score). Assuming a binary problem,
                    # the score is the probability of the positive class (column 1).
                    test_score_fold = roc_auc_score(y[va_ind], model_current.predict_proba(X[va_ind])[:, 1])
                else:
                    test_score_fold = scorer(y[va_ind], model_current.predict(X[va_ind]))
            else:
                test_score_fold       = model_current.score(X[va_ind], y[va_ind])
            
            cv_results_[fold_str] = np.append(cv_results_[fold_str], test_score_fold) 
            
            line_for_log = name + f" --> test_score={test_score_fold}"
            cvfit_log_file.write(line_for_log + "\n")
            
            if save_models:
                file_path = os.path.join(path_folder, name) 
                fileName = Path(file_path)
            
                if fileName.is_file():
                    print("WARNING: An exact model with the same hyperparameters is trying to be saved")
                    print(f"file {file_path} already exists")
                    return None
                else:
                    joblib.dump(model_current, file_path) 
            
               
            if verbose ==1:
                print(name,f" --> test_score={test_score_fold}")
            
    cvfit_log_file.close()
    test_scores = [cv_results_[f"split{i}_test_score"] for i in range(cv)]
    test_scores_arr = np.array(test_scores)
    
    #cv_results_["best_test_score"] = np.max(test_scores_arr)
    cv_results_["mean_test_score"] = test_scores_arr.mean(axis=0)
    cv_results_["std_test_score"] = test_scores_arr.std(axis=0)

    rank = scipy.stats.rankdata(-cv_results_["mean_test_score"], method='ordinal')
    cv_results_["rank_test_score"] = rank
    # Save in key "best_params" the best combination of hyperparams found
    # index_best  = np.argmax(rf_grid.cv_results_["mean_test_score"])
    # best_params = rf_grid.cv_results_["params"][ind_best]
    # cv_results_["best_params"] = best_params

    if save_summary:
        cv_results_df = pd.DataFrame(cv_results_).sort_values(["rank_test_score"])
        cv_results_df.to_csv( os.path.join(path_folder, "cv_results_df.csv"),index=False)
        
    return cv_results_ 

In [58]:
cv = 4
rf = sklearn.ensemble.RandomForestRegressor()
path_models = "./saved_models"
cv_results_ = CVfit(X, y, rf, cv, param_grid, scorer=None, 
                    path_folder=path_models, 
                    save_models=True, add_time=False, save_summary=True, verbose=1)

Directory  ./saved_models  Created 




RandomForestRegressor__fold00__max_depth=5__max_features=auto.joblib  --> test_score=0.5330377763177943
RandomForestRegressor__fold01__max_depth=5__max_features=auto.joblib  --> test_score=0.6399112735269173
RandomForestRegressor__fold02__max_depth=5__max_features=auto.joblib  --> test_score=0.5222067443053595
RandomForestRegressor__fold03__max_depth=5__max_features=auto.joblib  --> test_score=0.47601071757564184




RandomForestRegressor__fold00__max_depth=5__max_features=0.5.joblib  --> test_score=0.4606569463355764
RandomForestRegressor__fold01__max_depth=5__max_features=0.5.joblib  --> test_score=0.599632584300654
RandomForestRegressor__fold02__max_depth=5__max_features=0.5.joblib  --> test_score=0.5179928435599687
RandomForestRegressor__fold03__max_depth=5__max_features=0.5.joblib  --> test_score=0.46375031633979513




RandomForestRegressor__fold00__max_depth=None__max_features=auto.joblib  --> test_score=0.5247570322610484
RandomForestRegressor__fold01__max_depth=None__max_features=auto.joblib  --> test_score=0.7202141014459633
RandomForestRegressor__fold02__max_depth=None__max_features=auto.joblib  --> test_score=0.599081367417376
RandomForestRegressor__fold03__max_depth=None__max_features=auto.joblib  --> test_score=0.6010311839168737




RandomForestRegressor__fold00__max_depth=None__max_features=0.5.joblib  --> test_score=0.5474184006015956
RandomForestRegressor__fold01__max_depth=None__max_features=0.5.joblib  --> test_score=0.7013116201508998
RandomForestRegressor__fold02__max_depth=None__max_features=0.5.joblib  --> test_score=0.6043965742745478
RandomForestRegressor__fold03__max_depth=None__max_features=0.5.joblib  --> test_score=0.5369266095731109




RandomForestRegressor__fold00__max_depth=10__max_features=auto.joblib  --> test_score=0.5183564042822679
RandomForestRegressor__fold01__max_depth=10__max_features=auto.joblib  --> test_score=0.7211539072873407
RandomForestRegressor__fold02__max_depth=10__max_features=auto.joblib  --> test_score=0.5959197231643528
RandomForestRegressor__fold03__max_depth=10__max_features=auto.joblib  --> test_score=0.610742878893817




RandomForestRegressor__fold00__max_depth=10__max_features=0.5.joblib  --> test_score=0.49945039083206033
RandomForestRegressor__fold01__max_depth=10__max_features=0.5.joblib  --> test_score=0.7205516572163669
RandomForestRegressor__fold02__max_depth=10__max_features=0.5.joblib  --> test_score=0.6074610911124074
RandomForestRegressor__fold03__max_depth=10__max_features=0.5.joblib  --> test_score=0.5731898778110219



This function returns a dict `cv_results_` containing the evaluation metrics of the cross-validation process.

In [59]:
cv_results_.keys()

dict_keys(['params', 'split0_test_score', 'split1_test_score', 'split2_test_score', 'split3_test_score', 'mean_test_score', 'std_test_score', 'rank_test_score'])

In [60]:
rf_grid.cv_results_.keys()

dict_keys(['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time', 'param_max_depth', 'param_max_features', 'params', 'split0_test_score', 'split1_test_score', 'split2_test_score', 'split3_test_score', 'mean_test_score', 'std_test_score', 'rank_test_score'])

In [61]:
os.listdir()

['.ipynb_checkpoints',
 'generating_validation_indices.ipynb',
 'gridsearch_cv.ipynb',
 'saved_models']

Since we executed with `save_summary=True` we will have `cv_results_df.csv` inside the folder where the models are saved.

In [62]:
cv_results_df = pd.read_csv( os.path.join(path_models,"cv_results_df.csv"))

In [63]:
cv_results_df

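|   | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|
| 0 | {'max_depth': 10, 'max_features': 'auto'} | 0.518356 | 0.721154 | 0.59592 | 0.610743 | 0.611543 | 0.611543 | 1 |
| 1 | {'max_depth': None, 'max_features': 'auto'} | 0.524757 | 0.720214 | 0.599081 | 0.601031 | 0.611271 | 0.611271 | 2 |
| 2 | {'max_depth': 10, 'max_features': 0.5} | 0.49945 | 0.720552 | 0.607461 | 0.57319 | 0.600163 | 0.600163 | 3 |
| 3 | {'max_depth': None, 'max_features': 0.5} | 0.547418 | 0.701312 | 0.604397 | 0.536927 | 0.597513 | 0.597513 | 4 |
| 4 | {'max_depth': 5, 'max_features': 'auto'} | 0.533038 | 0.639911 | 0.522207 | 0.476011 | 0.542792 | 0.542792 | 5 |
| 5 | {'max_depth': 5, 'max_features': 0.5} | 0.460657 | 0.599633 | 0.517993 | 0.46375 | 0.510508 | 0.510508 | 6 |

Note that in the output shown here the `std_test_score` column is identical to `mean_test_score`: this output was produced with the mean in place of the standard deviation. The spread across splits should be computed with `.std(axis=0)`.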


We can also inspect the log written by `CVfit`, which contains the same information that is printed when `verbose=1`.

In [64]:

!cat saved_models/cvfit_log.txt

RandomForestRegressor__fold00__max_depth=5__max_features=auto.joblib --> test_score=0.5330377763177943
RandomForestRegressor__fold01__max_depth=5__max_features=auto.joblib --> test_score=0.6399112735269173
RandomForestRegressor__fold02__max_depth=5__max_features=auto.joblib --> test_score=0.5222067443053595
RandomForestRegressor__fold03__max_depth=5__max_features=auto.joblib --> test_score=0.47601071757564184
RandomForestRegressor__fold00__max_depth=5__max_features=0.5.joblib --> test_score=0.4606569463355764
RandomForestRegressor__fold01__max_depth=5__max_features=0.5.joblib --> test_score=0.599632584300654
RandomForestRegressor__fold02__max_depth=5__max_features=0.5.joblib --> test_score=0.5179928435599687
RandomForestRegressor__fold03__max_depth=5__max_features=0.5.joblib --> test_score=0.46375031633979513
RandomForestRegressor__fold00__max_depth=None__max_features=auto.joblib --> test_score=0.5247570322610484
RandomForestRegressor__fold01__max_depth=None__max_features=auto

### TODO: Make a function that given cv_results_df and path_folder...

- Loads the top-K models and creates a "metamodel". To do so I need to create a new `MetaModel` class.

    - The metamodel is essentially a list of models: `[m1, m2, m3, ...]`

    - `metamodel.predict(X)` simply does `np.mean([m1.predict(X), m2.predict(X), ...], axis=0)`

    - The metamodel should have a weighted-average prediction method.

        - The `mean_test_score` could be used as the weight of each model in the average
        - The `std_test_score` could be used to assign a confidence to the prediction of the metamodel

The models will be saved in `path_folder`.
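One possible shape for such a class is sketched below. This is only an illustration of the averaging idea in the TODO, not an existing implementation: the class name `MetaModel`, its `weights` argument, the toy data and the two toy models are all assumptions. In practice the models would be loaded from `path_folder` and the weights could come from `mean_test_score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

class MetaModel:
    """Hypothetical ensemble that averages the predictions of fitted models."""

    def __init__(self, models, weights=None):
        self.models = list(models)
        # Default to equal weights; weights could come from `mean_test_score`
        if weights is None:
            weights = np.ones(len(self.models))
        self.weights = np.asarray(weights, dtype=float)

    def predict(self, X):
        # Stack predictions into shape (n_models, n_samples), then average over models
        preds = np.stack([m.predict(X) for m in self.models])
        return np.average(preds, axis=0, weights=self.weights)

rng = np.random.RandomState(0)
X_demo = rng.randn(50, 2)                    # hypothetical data
y_demo = X_demo[:, 0] - 2 * X_demo[:, 1]

m1 = LinearRegression().fit(X_demo, y_demo)
m2 = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_demo, y_demo)

meta = MetaModel([m1, m2])                           # plain average
weighted = MetaModel([m1, m2], weights=[0.9, 0.6])   # made-up scores used as weights
print(meta.predict(X_demo[:3]), weighted.predict(X_demo[:3]))
```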

In [65]:
models = os.listdir(path_models)
models

['cv_results_df.csv',
 'cvfit_log.txt',
 'RandomForestRegressor__fold00__max_depth=10__max_features=0.5.joblib',
 'RandomForestRegressor__fold00__max_depth=10__max_features=auto.joblib',
 'RandomForestRegressor__fold00__max_depth=5__max_features=0.5.joblib',
 'RandomForestRegressor__fold00__max_depth=5__max_features=auto.joblib',
 'RandomForestRegressor__fold00__max_depth=None__max_features=0.5.joblib',
 'RandomForestRegressor__fold00__max_depth=None__max_features=auto.joblib',
 'RandomForestRegressor__fold01__max_depth=10__max_features=0.5.joblib',
 'RandomForestRegressor__fold01__max_depth=10__max_features=auto.joblib',
 'RandomForestRegressor__fold01__max_depth=5__max_features=0.5.joblib',
 'RandomForestRegressor__fold01__max_depth=5__max_features=auto.joblib',
 'RandomForestRegressor__fold01__max_depth=None__max_features=0.5.joblib',
 'RandomForestRegressor__fold01__max_depth=None__max_features=auto.joblib',
 'RandomForestRegressor__fold02__max_depth=10__max_features=0.5.joblib',
 

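We can load a model from disk using **`joblib.load`**. Note that `models[0]` here is `cv_results_df.csv`, which is not a serialized model, so the call below raises a `ValueError`. Only the `.joblib` files should be passed to `joblib.load`.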

In [66]:
model = joblib.load(os.path.join(path_models, models[0]))
model

ValueError: invalid literal for int() with base 10: b'arams,split0_test_score,split1_test_score,split2_test_score,split3_test_score,mean_test_score,std_test_score,rank_test_score'
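One way to avoid this error is to keep only the files that end in `.joblib` before loading. The sketch below recreates the situation in a temporary directory, which stands in for `path_models`; the file name `model.joblib` and the unfitted `LinearRegression` are placeholders chosen for the example.

```python
import os
import tempfile
import joblib
from sklearn.linear_model import LinearRegression

with tempfile.TemporaryDirectory() as folder:   # stand-in for `path_models`
    # Recreate the situation above: one serialized model next to a CSV summary
    joblib.dump(LinearRegression(), os.path.join(folder, "model.joblib"))
    with open(os.path.join(folder, "cv_results_df.csv"), "w") as f:
        f.write("params,mean_test_score\n")

    # Keep only serialized models, then load them
    model_files = sorted(f for f in os.listdir(folder) if f.endswith(".joblib"))
    loaded = [joblib.load(os.path.join(folder, f)) for f in model_files]

print(model_files, type(loaded[0]).__name__)
```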


# Finding indices with `GroupKFold`

In [67]:
from sklearn.model_selection import GroupKFold

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([1, 2, 3, 4])
groups = np.array([0, 0, 5, 5])
group_kfold = GroupKFold(n_splits=2)
group_kfold.get_n_splits(X, y, groups)

2

In [68]:
print(group_kfold)

GroupKFold(n_splits=2)


In [69]:
for train_index, test_index in group_kfold.split(X, y, groups):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print(X_train, X_test, y_train, y_test)
    print("\n")

TRAIN: [0 1] TEST: [2 3]
[[1 2]
 [3 4]] [[5 6]
 [7 8]] [1 2] [3 4]


TRAIN: [2 3] TEST: [0 1]
[[5 6]
 [7 8]] [[1 2]
 [3 4]] [3 4] [1 2]
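The property that matters with `GroupKFold` is that a group never appears on both sides of a split. The sketch below checks this on a slightly larger toy example; `X_demo` and `groups_demo` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X_demo = np.arange(16).reshape(8, 2)                 # hypothetical data
groups_demo = np.array([0, 0, 1, 1, 2, 2, 3, 3])     # hypothetical group labels

gkf = GroupKFold(n_splits=2)
groups_are_disjoint = True
for train_index, test_index in gkf.split(X_demo, groups=groups_demo):
    # Groups present in both the train and the test side of this split
    shared = np.intersect1d(groups_demo[train_index], groups_demo[test_index])
    groups_are_disjoint = groups_are_disjoint and shared.size == 0

print(groups_are_disjoint)
```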




In [70]:
from sklearn import linear_model
from sklearn import neural_network

cv   = sklearn.model_selection.LeaveOneGroupOut()
clf  = sklearn.linear_model.LinearRegression()
grid = sklearn.model_selection.GridSearchCV(clf, param_grid={}, cv=cv, 
                                            return_train_score=True, iid=True, n_jobs=-1)

values_in_col8 = np.array(X_df[8].values, dtype="int")

grid.fit(X,y, groups=values_in_col8)


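ValueError: Found input variables with inconsistent numbers of samples: [4, 4, 506]

This error occurs because `X` and `y` were redefined in cell 67 as toy arrays with 4 samples, while `values_in_col8` still holds the 506 group labels from the Boston dataset.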