<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#model-selection" data-toc-modified-id="model-selection-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>model selection</a></span><ul class="toc-item"><li><span><a href="#Finding-the-indices-for-KFold" data-toc-modified-id="Finding-the-indices-for-KFold-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Finding the indices for <code>KFold</code></a></span></li></ul></li><li><span><a href="#Training-models-in-crosvalidation-using-GridsearchCV" data-toc-modified-id="Training-models-in-crosvalidation-using-GridsearchCV-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Training models in crosvalidation using GridsearchCV</a></span><ul class="toc-item"><li><span><a href="#Saved-metrics-in-a-GridSearchCV-object" data-toc-modified-id="Saved-metrics-in-a-GridSearchCV-object-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Saved metrics in a <code>GridSearchCV</code> object</a></span><ul class="toc-item"><li><span><a href="#Configurations-of--Hyperparameters-tested" data-toc-modified-id="Configurations-of--Hyperparameters-tested-2.1.1"><span class="toc-item-num">2.1.1&nbsp;&nbsp;</span>Configurations of  Hyperparameters tested</a></span></li></ul></li></ul></li><li><span><a href="#Generating-all-possible-combinations-of-hyperparameters" data-toc-modified-id="Generating-all-possible-combinations-of-hyperparameters-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Generating all possible combinations of hyperparameters</a></span><ul class="toc-item"><li><span><a href="#Ranking-results-of-the-crossvalidation-process" data-toc-modified-id="Ranking-results-of-the-crossvalidation-process-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Ranking results of the crossvalidation process</a></span></li><li><span><a href="#Creating-your-own-GridSearchCV" data-toc-modified-id="Creating-your-own-GridSearchCV-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Creating your own <code>GridSearchCV</code></a></span><ul class="toc-item"><li><span><a href="#Training-using-crosvalidation-a-given-model" data-toc-modified-id="Training-using-crosvalidation-a-given-model-3.2.1"><span class="toc-item-num">3.2.1&nbsp;&nbsp;</span>Training using crosvalidation a given model</a></span></li><li><span><a href="#TODO:-Make-a-function-that-given-cv_results_df-and-path_folder..." data-toc-modified-id="TODO:-Make-a-function-that-given-cv_results_df-and-path_folder...-3.2.2"><span class="toc-item-num">3.2.2&nbsp;&nbsp;</span>TODO: Make a function that given cv_results_df and path_folder...</a></span></li></ul></li></ul></li><li><span><a href="#Finding-indices-with-GroupKFold" data-toc-modified-id="Finding-indices-with-GroupKFold-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Finding indices with <code>GroupKFold</code></a></span></li></ul></div>

# model selection


In [43]:
import sklearn
import numpy as np
import pandas as pd


## Finding the indices for `KFold`

The class `KFold` allows us to generate the train and validation indicies for performing training and validation indicies with different parts of our data.


In [2]:
from sklearn.model_selection import KFold
X = np.random.randn(100,4)
X.shape

(100, 4)

In [3]:
folds = KFold(10,shuffle=False)
splits = folds.split(X)
for fold, (tr_ind,va_ind) in enumerate(splits):
    print(fold, va_ind)

0 [0 1 2 3 4 5 6 7 8 9]
1 [10 11 12 13 14 15 16 17 18 19]
2 [20 21 22 23 24 25 26 27 28 29]
3 [30 31 32 33 34 35 36 37 38 39]
4 [40 41 42 43 44 45 46 47 48 49]
5 [50 51 52 53 54 55 56 57 58 59]
6 [60 61 62 63 64 65 66 67 68 69]
7 [70 71 72 73 74 75 76 77 78 79]
8 [80 81 82 83 84 85 86 87 88 89]
9 [90 91 92 93 94 95 96 97 98 99]


If shuffle is true then we sample all rows from our dataset randomly

In [4]:

folds = KFold(10,shuffle=True)
splits = folds.split(X)
for tr_ind,va_ind in splits:
    print(va_ind)

[ 4  9 26 33 44 70 74 82 93 98]
[10 21 38 47 49 51 64 71 78 90]
[12 14 18 39 53 65 79 80 83 88]
[ 0  8 11 31 35 40 72 73 95 97]
[20 46 48 52 59 63 68 69 85 91]
[ 6 27 54 55 56 60 75 81 89 92]
[17 19 24 36 42 45 50 62 84 96]
[22 28 32 34 37 43 58 61 66 87]
[ 2  5  7 13 15 16 25 41 57 99]
[ 1  3 23 29 30 67 76 77 86 94]


# Training models in crosvalidation using GridsearchCV

In [5]:
from sklearn.model_selection import GridSearchCV
from sklearn import datasets
from sklearn.datasets import *

In [6]:
dataset = sklearn.datasets.california_housing.fetch_california_housing()



In [7]:
X = dataset.data
y = dataset.target

X.shape, y.shape

((20640, 8), (20640,))

In [8]:
X_tr, X_te, y_tr, y_te = sklearn.model_selection.train_test_split(X,y, random_state=1234)

X_tr.shape, y_tr.shape, X_te.shape, y_te.shape

((15480, 8), (15480,), (5160, 8), (5160,))

In [9]:
from sklearn import ensemble
from sklearn.ensemble import *

rf = sklearn.ensemble.RandomForestRegressor()

param_grid = {"max_depth":[5,None,10], "max_features":["auto",0.5]}

In [10]:
rf_grid = sklearn.model_selection.GridSearchCV(rf, 
                                               param_grid=param_grid,
                                               cv=4,
                                               n_jobs=-1)

In [11]:
rf_grid.fit(X_tr, y_tr)



GridSearchCV(cv=4, error_score='raise-deprecating',
             estimator=RandomForestRegressor(bootstrap=True, criterion='mse',
                                             max_depth=None,
                                             max_features='auto',
                                             max_leaf_nodes=None,
                                             min_impurity_decrease=0.0,
                                             min_impurity_split=None,
                                             min_samples_leaf=1,
                                             min_samples_split=2,
                                             min_weight_fraction_leaf=0.0,
                                             n_estimators='warn', n_jobs=None,
                                             oob_score=False, random_state=None,
                                             verbose=0, warm_start=False),
             iid='warn', n_jobs=-1,
             param_grid={'max_depth': [5, None, 10],
     

In [12]:
rf_grid.best_estimator_

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features=0.5, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=10,
                      n_jobs=None, oob_score=False, random_state=None,
                      verbose=0, warm_start=False)

## Saved metrics in a `GridSearchCV` object

After fitting a `GridsearchCV` object we can inspect the information saved during crossvalidation inside the field `.cv_results_`.

In [13]:
list(rf_grid.cv_results_.keys())

['mean_fit_time',
 'std_fit_time',
 'mean_score_time',
 'std_score_time',
 'param_max_depth',
 'param_max_features',
 'params',
 'split0_test_score',
 'split1_test_score',
 'split2_test_score',
 'split3_test_score',
 'mean_test_score',
 'std_test_score',
 'rank_test_score']

###  Configurations of  Hyperparameters tested

All the combinations of parameters tested are kept in `'params'`

In [14]:
rf_grid.cv_results_["params"]

[{'max_depth': 5, 'max_features': 'auto'},
 {'max_depth': 5, 'max_features': 0.5},
 {'max_depth': None, 'max_features': 'auto'},
 {'max_depth': None, 'max_features': 0.5},
 {'max_depth': 10, 'max_features': 'auto'},
 {'max_depth': 10, 'max_features': 0.5}]

The different scores for each of the CV splits are found in `splitsK_test_score`.

In this case we have 4 arrays because we used `cv=4` when we instanciated `rf_grid`.

```
'split0_test_score'
'split1_test_score'
'split2_test_score'
'split3_test_score'
 ```
 
 Each array contains  possition `k` the test results of the k'th combination of hyperparameters.

In [15]:
rf_grid.n_splits_

4

In [16]:
all_scores = [rf_grid.cv_results_[f"split{i}_test_score" ] for i in range(4)]
all_scores

[array([0.676594  , 0.67458859, 0.78570293, 0.79470538, 0.77237557,
        0.77456112]),
 array([0.64978134, 0.64974643, 0.77543228, 0.78006514, 0.75181548,
        0.77282568]),
 array([0.6571138 , 0.65878861, 0.77061052, 0.78591699, 0.7627303 ,
        0.76966836]),
 array([0.65224242, 0.66934132, 0.78320904, 0.78694664, 0.76129547,
        0.7840957 ])]

In [17]:
rf_grid.best_score_

0.7869085345386662

Notice that `rf_grid.best_score_` is different than `np.max(all_scores)`.

The best score is found using the mean value of the crossvalidation scores.

Therefore `rf_grid.best_score_= np.max(rf_grid.cv_results_["mean_test_score"])`

In [18]:
np.max(rf_grid.cv_results_["mean_test_score"]), rf_grid.best_score_

(0.7869085345386662, 0.7869085345386662)

In [19]:
rf_grid.cv_results_["mean_test_score"]

array([0.65893289, 0.66311624, 0.77873869, 0.78690853, 0.76205421,
       0.77528771])

Notice that we can compute `mean_test_score` simpy computing the mean
over the different results in the different splits of the crossvalidation process.

In [20]:
np.array(all_scores).mean(axis=0)

array([0.65893289, 0.66311624, 0.77873869, 0.78690853, 0.76205421,
       0.77528771])

# Generating all possible combinations of hyperparameters

In [33]:
from itertools import product

In [34]:
param_grid

{'max_depth': [5, None, 10], 'max_features': ['auto', 0.5]}

The function product can take as input several iterators and it will generate
all the combinations of the values in the iterators

In [35]:
[x for x in product(["a","b","c"],[1,2])]

[('a', 1), ('a', 2), ('b', 1), ('b', 2), ('c', 1), ('c', 2)]

We can use it to generate the combinations of different hyperparamaters

In [36]:
[x for x in product(param_grid["max_depth"], param_grid["max_features"])]

[(5, 'auto'), (5, 0.5), (None, 'auto'), (None, 0.5), (10, 'auto'), (10, 0.5)]

the `*` notation allows us to generate a correctly formated input for `product`

In [44]:
combination_params_values = [x for x in product(*param_grid.values())]
combination_params_values

[(5, 'auto'), (5, 0.5), (None, 'auto'), (None, 0.5), (10, 'auto'), (10, 0.5)]

Notice that we can then write the name of the param for each component

In [45]:
params_keys = list(param_grid.keys())

for values in combination_params_values:
    print(dict(zip(param_grid.keys(),values)))

{'max_depth': 5, 'max_features': 'auto'}
{'max_depth': 5, 'max_features': 0.5}
{'max_depth': None, 'max_features': 'auto'}
{'max_depth': None, 'max_features': 0.5}
{'max_depth': 10, 'max_features': 'auto'}
{'max_depth': 10, 'max_features': 0.5}


We can do all this in a single function that will generate the list with the combinations we want to explore given an space of hyperparameter values.

In [46]:
def generate_params(param_grid):
    combination_params_values = [x for x in product(*param_grid.values())]
    params = []
    for values in combination_params_values:
        params.append(dict(zip(param_grid.keys(),values)))
    return params

In [47]:
generate_params(param_grid)

[{'max_depth': 5, 'max_features': 'auto'},
 {'max_depth': 5, 'max_features': 0.5},
 {'max_depth': None, 'max_features': 'auto'},
 {'max_depth': None, 'max_features': 0.5},
 {'max_depth': 10, 'max_features': 'auto'},
 {'max_depth': 10, 'max_features': 0.5}]

Notice that this is the same as 

In [48]:
rf_grid.cv_results_["params"]

[{'max_depth': 5, 'max_features': 'auto'},
 {'max_depth': 5, 'max_features': 0.5},
 {'max_depth': None, 'max_features': 'auto'},
 {'max_depth': None, 'max_features': 0.5},
 {'max_depth': 10, 'max_features': 'auto'},
 {'max_depth': 10, 'max_features': 0.5}]

## Ranking results of the crossvalidation process

All combinations of hyperparameters are stored in **`.cv_results_["params"]`**

We can visuallize in a single dataframe the differnet pa

In [49]:
results = pd.DataFrame({"params": rf_grid.cv_results_["params"], 
                        "std_test_score": rf_grid.cv_results_["std_test_score"],
                        "mean_test_score": rf_grid.cv_results_["mean_test_score"],
                       })

In [50]:
results = results.sort_values(by=["mean_test_score"], ascending=False)
results

Unnamed: 0,params,std_test_score,mean_test_score
3,"{'max_depth': None, 'max_features': 0.5}",0.005211,0.786909
2,"{'max_depth': None, 'max_features': 'auto'}",0.006031,0.778739
5,"{'max_depth': 10, 'max_features': 0.5}",0.005379,0.775288
4,"{'max_depth': 10, 'max_features': 'auto'}",0.007287,0.762054
1,"{'max_depth': 5, 'max_features': 0.5}",0.00959,0.663116
0,"{'max_depth': 5, 'max_features': 'auto'}",0.010533,0.658933


We can also see the order (rank) of each parameter configuration in **`.cv_results_["rank_test_score"]`**.

In [51]:
rank_test_score = rf_grid.cv_results_["rank_test_score"]
rank_test_score

array([6, 5, 2, 1, 4, 3], dtype=int32)

Notice that **results of the grid search are sorted by `rf_grid.cv_results_["rank_test_score"]`.**

In [277]:
[rf_grid.cv_results_["params"][k-1] for k in rank_test_score]

[{'max_depth': 10, 'max_features': 0.5},
 {'max_depth': 10, 'max_features': 'auto'},
 {'max_depth': 5, 'max_features': 0.5},
 {'max_depth': 5, 'max_features': 'auto'},
 {'max_depth': None, 'max_features': 0.5},
 {'max_depth': None, 'max_features': 'auto'}]

##### Ranking the results of crossvalidation: 1:best result, 2:second best result etc...

We can find in `rank_test_score` a ranking of the best (1) to the worst (6) results

In [285]:
rf_grid.cv_results_["rank_test_score"]

array([6, 5, 2, 1, 4, 3], dtype=int32)

We can generate this vector using `scipy.stats.rankdata` in order to rank the solutions according to `mean_test_score`

In [324]:
import scipy
rank = scipy.stats.rankdata(-rf_grid.cv_results_["mean_test_score"],method='ordinal')
rank

array([6, 5, 2, 1, 4, 3])

## Creating your own `GridSearchCV` 

In [56]:
param_grid = {"max_depth":[5,None,10], "max_features":["auto",0.5]}

In [58]:
generate_params(param_grid)

[{'max_depth': 5, 'max_features': 'auto'},
 {'max_depth': 5, 'max_features': 0.5},
 {'max_depth': None, 'max_features': 'auto'},
 {'max_depth': None, 'max_features': 0.5},
 {'max_depth': 10, 'max_features': 'auto'},
 {'max_depth': 10, 'max_features': 0.5}]

In [59]:
rf = sklearn.ensemble.RandomForestRegressor()

### Training using crosvalidation a given model

We want to have code that can train for each combination in `param_grid` we would like to train `cv` models and save the scores
of the different fitted models.

In [60]:
param_grid

{'max_depth': [5, None, 10], 'max_features': ['auto', 0.5]}

In [61]:
rf

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators='warn',
                      n_jobs=None, oob_score=False, random_state=None,
                      verbose=0, warm_start=False)

In [405]:
def model_str(model, cv, params_combination):

    final_name = type(model).__name__ + "__fold{:02d}".format(cv)
    for item in params_combination.items():
        final_name += "__"
        final_name += item[0]+"=" +str(item[1])
        
    return final_name + ".joblib"


In [429]:
params_combinations = generate_params(param_grid)
model_str(rf,2, params_combinations[0])

'RandomForestRegressor__fold02__max_depth=5__max_features=auto.joblib'

In [439]:
import os
os.listdir("./")

['generating_validation_indices.ipynb', '.ipynb_checkpoints']

In [440]:
from copy import deepcopy
import joblib
from joblib import dump, load
from pathlib import Path
from sklearn.metrics import roc_auc_score
import datetime


def CVfit(X, y, model, cv, param_grid, scorer, path_folder="",
          save_models=False, save_summary=True, add_time=False, verbose=0):
    
    """
    
    Important notes:
    
    - By default this method uses the model.score function to score the test results.
    
    - If `scorer="roc_auc_score"` or `scorer=sklearn.metrics.roc_auc_score` then the method uses 
    `model.predict_proba` instead of `model.predict` when scoring the predictions.
    
    """
    
    cv_results_params = generate_params(param_grid)
    folds = KFold(cv,shuffle=False)
    cv_results_ = {}
    cv_results_["params"] = cv_results_params
    
    if add_time:
        now = datetime.datetime.now()
        time_id = f"__{now.hour}:{now.minute}:{now.second}__{now.day}-{now.month}-{now.year}"
        path_folder = path_folder +  time_id       
    
    for fold in range(cv):
        cv_results_[f"split{fold}_test_score"] = np.array([])
    
    # Create target Directory if don't exist
    if not os.path.exists(path_folder):
        os.mkdir(path_folder)
        print("Directory " , path_folder ,  " Created ")
    else:    
        print("WARNING: Directory " , path_folder ,  " already exists")

    
    cvfit_log_file = open(os.path.join(path_folder, 'cvfit_log.txt'), "w")

        
    for param_combination in cv_results_params:
        splits = folds.split(X)
        model_current = model.__class__(**param_combination)
        
        for fold, (tr_ind, va_ind) in enumerate(splits):
            fold_str = f"split{fold}_test_score"
            name = model_str(model_current, fold, param_combination)
                
            model_current.fit(X[tr_ind], y[tr_ind])
            
            if scorer:
                if scorer.__name__ ==  "roc_auc_score":
                    test_score_fold = scorer(model_current.predict_proba(X[va_ind]), y[va_ind])
                elif scorer == "roc_auc_score":
                    test_score_fold = roc_auc_score(model_current.predict_proba(X[va_ind]), y[va_ind])
                else:
                    test_score_fold = scorer(model_current.predict(X[va_ind]), y[va_ind])
            else:
                test_score_fold       = model_current.score(X[va_ind], y[va_ind])
            
            cv_results_[fold_str] = np.append(cv_results_[fold_str], test_score_fold) 
            
            line_for_log = name + f" --> test_score={test_score_fold}"
            cvfit_log_file.write(line_for_log + "\n")
            
            if save_models:
                file_path = os.path.join(path_folder, name) 
                fileName = Path(file_path)
            
                if fileName.is_file():
                    print("WARNING: An exact model with the same hyperparameters is trying to be saved")
                    print(f"file {file_path} already exists")
                    return None
                else:
                    joblib.dump(model_current, file_path) 
            
               
            if verbose ==1:
                print(name,f" --> test_score={test_score_fold}")
            
    cvfit_log_file.close()
    test_scores = [cv_results_[f"split{i}_test_score"] for i in range(4)]
    test_scores_arr = np.array(test_scores)
    
    #cv_results_["best_test_score"] = np.max(test_scores_arr)
    cv_results_["mean_test_score"] = test_scores_arr.mean(axis=0)
    cv_results_["std_test_score"] = test_scores_arr.mean(axis=0)

    rank = scipy.stats.rankdata(-cv_results_["mean_test_score"], method='ordinal')
    cv_results_["rank_test_score"] = rank
    # Save in key "best_params" the best combination of hyperparams found
    # index_best  = np.argmax(rf_grid.cv_results_["mean_test_score"])
    # best_params = rf_grid.cv_results_["params"][ind_best]
    # cv_results_["best_params"] = best_params

    if save_summary:
        cv_results_df = pd.DataFrame(cv_results_).sort_values(["rank_test_score"])
        cv_results_df.to_csv( os.path.join(path_folder, "cv_results_df.csv"),index=False)
        
    return cv_results_ 

In [441]:
cv = 4
rf = sklearn.ensemble.RandomForestRegressor()
path_models = "./saved_models"
cv_results_ = CVfit(X, y, rf, cv, param_grid, scorer=None, 
                    path_folder=path_models, 
                    save_models=True, add_time=False, save_summary=True, verbose=1)

Directory  ./saved_models  Created 




RandomForestRegressor__fold00__max_depth=5__max_features=auto.joblib  --> test_score=0.5410553804345379
RandomForestRegressor__fold01__max_depth=5__max_features=auto.joblib  --> test_score=0.6293521452137196
RandomForestRegressor__fold02__max_depth=5__max_features=auto.joblib  --> test_score=0.5310345264500143
RandomForestRegressor__fold03__max_depth=5__max_features=auto.joblib  --> test_score=0.4779886097279118




RandomForestRegressor__fold00__max_depth=5__max_features=0.5.joblib  --> test_score=0.48391586439129697
RandomForestRegressor__fold01__max_depth=5__max_features=0.5.joblib  --> test_score=0.6053155529144562
RandomForestRegressor__fold02__max_depth=5__max_features=0.5.joblib  --> test_score=0.5200855437708678
RandomForestRegressor__fold03__max_depth=5__max_features=0.5.joblib  --> test_score=0.47701358070600564




RandomForestRegressor__fold00__max_depth=None__max_features=auto.joblib  --> test_score=0.5219805060595394
RandomForestRegressor__fold01__max_depth=None__max_features=auto.joblib  --> test_score=0.7201074668026454
RandomForestRegressor__fold02__max_depth=None__max_features=auto.joblib  --> test_score=0.5950860431097469
RandomForestRegressor__fold03__max_depth=None__max_features=auto.joblib  --> test_score=0.5884795157097984




RandomForestRegressor__fold00__max_depth=None__max_features=0.5.joblib  --> test_score=0.5427432321265331
RandomForestRegressor__fold01__max_depth=None__max_features=0.5.joblib  --> test_score=0.7116653483733969
RandomForestRegressor__fold02__max_depth=None__max_features=0.5.joblib  --> test_score=0.5816723574974257
RandomForestRegressor__fold03__max_depth=None__max_features=0.5.joblib  --> test_score=0.5650162821608253




RandomForestRegressor__fold00__max_depth=10__max_features=auto.joblib  --> test_score=0.5551012007678806
RandomForestRegressor__fold01__max_depth=10__max_features=auto.joblib  --> test_score=0.7104417896554114
RandomForestRegressor__fold02__max_depth=10__max_features=auto.joblib  --> test_score=0.5979676016606522
RandomForestRegressor__fold03__max_depth=10__max_features=auto.joblib  --> test_score=0.5895472341329304




RandomForestRegressor__fold00__max_depth=10__max_features=0.5.joblib  --> test_score=0.5363102282995238
RandomForestRegressor__fold01__max_depth=10__max_features=0.5.joblib  --> test_score=0.705049207022969
RandomForestRegressor__fold02__max_depth=10__max_features=0.5.joblib  --> test_score=0.6008648914338821
RandomForestRegressor__fold03__max_depth=10__max_features=0.5.joblib  --> test_score=0.5431611312564343



This function returns a dict `cv_results` containing the evaluation metrics of the crossvalidation process

In [442]:
cv_results_.keys()

dict_keys(['params', 'split0_test_score', 'split1_test_score', 'split2_test_score', 'split3_test_score', 'mean_test_score', 'std_test_score', 'rank_test_score'])

In [443]:
rf_grid.cv_results_.keys()

dict_keys(['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time', 'param_max_depth', 'param_max_features', 'params', 'split0_test_score', 'split1_test_score', 'split2_test_score', 'split3_test_score', 'mean_test_score', 'std_test_score', 'rank_test_score'])

In [444]:
os.listdir()

['saved_models', 'generating_validation_indices.ipynb', '.ipynb_checkpoints']

Since we executed with `save_summary=True` we will have `cv_results_df.csv` inside the folder where the models are saved.

In [445]:
cv_results_df = pd.read_csv( os.path.join(path_models,"cv_results_df.csv"))

In [451]:
cv_results_df

Unnamed: 0,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,mean_test_score,std_test_score,rank_test_score
0,"{'max_depth': 10, 'max_features': 'auto'}",0.555101,0.710442,0.597968,0.589547,0.613264,0.613264,1
1,"{'max_depth': None, 'max_features': 'auto'}",0.521981,0.720107,0.595086,0.58848,0.606413,0.606413,2
2,"{'max_depth': None, 'max_features': 0.5}",0.542743,0.711665,0.581672,0.565016,0.600274,0.600274,3
3,"{'max_depth': 10, 'max_features': 0.5}",0.53631,0.705049,0.600865,0.543161,0.596346,0.596346,4
4,"{'max_depth': 5, 'max_features': 'auto'}",0.541055,0.629352,0.531035,0.477989,0.544858,0.544858,5
5,"{'max_depth': 5, 'max_features': 0.5}",0.483916,0.605316,0.520086,0.477014,0.521583,0.521583,6


We can also inspect the log of the cvfit and see the information that is printed when verbose=1

In [450]:

!cat saved_models/cvfit_log.txt

RandomForestRegressor__fold00__max_depth=5__max_features=auto.joblib --> test_score=0.5410553804345379
RandomForestRegressor__fold01__max_depth=5__max_features=auto.joblib --> test_score=0.6293521452137196
RandomForestRegressor__fold02__max_depth=5__max_features=auto.joblib --> test_score=0.5310345264500143
RandomForestRegressor__fold03__max_depth=5__max_features=auto.joblib --> test_score=0.4779886097279118
RandomForestRegressor__fold00__max_depth=5__max_features=0.5.joblib --> test_score=0.48391586439129697
RandomForestRegressor__fold01__max_depth=5__max_features=0.5.joblib --> test_score=0.6053155529144562
RandomForestRegressor__fold02__max_depth=5__max_features=0.5.joblib --> test_score=0.5200855437708678
RandomForestRegressor__fold03__max_depth=5__max_features=0.5.joblib --> test_score=0.47701358070600564
RandomForestRegressor__fold00__max_depth=None__max_features=auto.joblib --> test_score=0.5219805060595394
RandomForestRegressor__fold01__max_depth=None__max_features=aut

### TODO: Make a function that given cv_results_df and path_folder...

- Loads different models (topK) and creates a "metamodel". To do so I need to create a new MetaModel class.

    - The metamodel is essentially a list of models: `[m1,m2,m3,...]`
    
    - The metamodel.predict(X) simply does `np.mean(m1.predict(X), m2.predict(X),....)`
    
    - The metamodel should have a weighted average prediction method. 
    
          - The `mean_test_score` could be used to assign a weight into the average
          - The `std_test_score` could be used to assign a confidence of the prediction of the metamodel
      

The models will be saved in `path_folder`

In [408]:
models = os.listdir(path_models)
models

['RandomForestRegressor__fold00__max_depth=10__max_features=auto.joblib',
 'RandomForestRegressor__fold00__max_depth=10__max_features=0.5.joblib',
 'RandomForestRegressor__fold03__max_depth=10__max_features=auto.joblib',
 'RandomForestRegressor__fold02__max_depth=10__max_features=0.5.joblib',
 'RandomForestRegressor__fold03__max_depth=5__max_features=auto.joblib',
 'RandomForestRegressor__fold03__max_depth=5__max_features=0.5.joblib',
 'RandomForestRegressor__fold00__max_depth=5__max_features=0.5.joblib',
 'RandomForestRegressor__fold01__max_depth=5__max_features=auto.joblib',
 'RandomForestRegressor__fold01__max_depth=10__max_features=0.5.joblib',
 'RandomForestRegressor__fold03__max_depth=None__max_features=0.5.joblib',
 'RandomForestRegressor__fold01__max_depth=10__max_features=auto.joblib',
 'cv_results_df.csv',
 'RandomForestRegressor__fold00__max_depth=5__max_features=auto.joblib',
 'RandomForestRegressor__fold02__max_depth=None__max_features=auto.joblib',
 'RandomForestRegressor

We can load a model from disk using **`joblib.load`**

In [410]:
model = joblib.load(os.path.join(path_models, models[0]))
model

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=10,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=10,
                      n_jobs=None, oob_score=False, random_state=None,
                      verbose=0, warm_start=False)


# Finding indices with `GroupKFold`

In [315]:
from sklearn.model_selection import GroupKFold

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([1, 2, 3, 4])
groups = np.array([0, 0, 2, 2])
group_kfold = GroupKFold(n_splits=2)
group_kfold.get_n_splits(X, y, groups)

2

In [14]:
print(group_kfold)

GroupKFold(n_splits=2)


In [17]:
for train_index, test_index in group_kfold.split(X, y, groups):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print(X_train, X_test, y_train, y_test)
    print("\n")

TRAIN: [0 1] TEST: [2 3]
[[1 2]
 [3 4]] [[5 6]
 [7 8]] [1 2] [3 4]


TRAIN: [2 3] TEST: [0 1]
[[5 6]
 [7 8]] [[1 2]
 [3 4]] [3 4] [1 2]


