# About
Hyperparameter optimization is required to get the most out of your machine learning models.

Hyperparameters are points of choice or configuration that allow a machine learning model to be customized for a specific task or dataset.

Parameters are different from hyperparameters. Parameters are learned automatically; hyperparameters are set manually to help guide the learning process.

Choosing a hyperparameter grid is probably the most difficult part of hyperparameter tuning: it's nearly impossible ahead of time to say which values of hyperparameters will work well and the optimal settings will depend on the dataset. Moreover, the hyperparameters have complex interactions with each other which means that just tuning one at a time doesn't work because when we start changing other hyperparameters that will affect the one we just tuned!

! https://practicaldatascience.co.uk/machine-learning/how-to-use-model-selection-and-hyperparameter-tuning

# Libraries

In [1]:
%run "/home/cesar/Python_NBs/HDL_Project/HDL_Project/global_fv.ipynb"

User information is ready!


In [2]:
import os
os.getcwd()

# Save trained models
import joblib

# Data
from sklearn.model_selection import train_test_split
from sklearn.utils.multiclass import type_of_target

# Hypertuning tools
from sklearn.model_selection import KFold
from sklearn.model_selection import RandomizedSearchCV

# Metrics
from sklearn.metrics import SCORERS

# Nonlinear models
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn import svm
from sklearn.gaussian_process import GaussianProcessRegressor

# Ensemble models
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor

# Clone of time class
s = t

# Random seed
np.random.seed(101)

os.getcwd()

'/home/cesar/Python_NBs/HDL_Project/HDL_Project/2_Models/Multivariate/ML'

# User-Defined Functions

In [3]:
def hyper_tuning(name, model, space, X, y):
    # The searching algorithm includes a “cv” argument that allows:
    # a) An integer number of folds to be specified, e.g. 5
    #cross_val = 5
    # b) A configured cross-validation object.
    kfold = KFold(n_splits=3, shuffle=False)

    # The scoring metric must be maximizing, meaning better models result in larger scores.
    scoring_metric = 'neg_mean_squared_error'

    # Search for best hyperparameters
    grid = RandomizedSearchCV(estimator=model, 
                              param_distributions=search_space, 
                              cv=kfold, 
                              n_iter=100,
                              scoring=scoring_metric)

    result = grid.fit(X_test, y_test)
    
    # Save the trained model
    filename = 'trained_ml_models_mvi/{}.sav'.format(name)
    joblib.dump(result, filename)

    return result

In [None]:
# Evaluate a single model
def single_model_evaluation(X_test, y_test, name):
    # Load the trained model
    filename = 'trained_ml_models_mvi/{}.sav'.format(name)
    model = joblib.load(filename)

    # make predictions
    y_prediction = model.predict(X_test)
    
    metrics = dict()
    # evaluate predictions
    metrics["RMSE"] = mean_squared_error(y_test, y_prediction, squared=False)
    metrics["MAE"] = mean_absolute_error(y_test, y_prediction)
    metrics["MAPE (%)"] = mean_absolute_percentage_error(y_test, y_prediction) *100
    metrics["R^2 (%)"] = r2_score(y_test, y_prediction) * 100
    metrics["Max Error"] = max_error(y_test, y_prediction)    
    
    return metrics

In [None]:
# Load the trained model
filename = 'trained_ml_models_mvi/{}.sav'.format("KNN")
model = joblib.load(filename)

# make predictions
y_prediction = model.predict(X_test)

metrics = dict()
# evaluate predictions
metrics["RMSE"] = mean_squared_error(y_test, y_prediction, squared=False)
metrics["MAE"] = mean_absolute_error(y_test, y_prediction)
metrics["MAPE (%)"] = mean_absolute_percentage_error(y_test, y_prediction) *100
metrics["R^2 (%)"] = r2_score(y_test, y_prediction) * 100
metrics["Max Error"] = max_error(y_test, y_prediction)    

print(y_prediction)
metrics

# Data

## Sample preparation

In [4]:
sql_table = "MVI_sima_station_CE"
target = "pm25"

# Define columns of interest from sql table
#     Select all columns:
column = "*"
#     Select specific columns:
#column = "datetime, prs, rainf, rh, sr, tout, wdr, wsr, " + str(target)

# Filter data with WHERE command
sql_where = "where datetime >=\'2021-04-17 23:00:00\'"
#"where datetime > \'2020-04-20\'"

# Initialize class to create multivariate samples:
multi_ts = multivariate_samples(sql_table, target, column, sql_where)

# Datasets can't be trained with sample batches by default. So parameter is 1.
X, y, _ = multi_ts.samples_creation(1, target)

# Training and test datasets are prepared, avoiding shuffling because it is a time series.
X_train, X_test, y_train, y_test = train_test_split(X[:,0,:], y, test_size = 0.30, shuffle= False)

In [5]:
type_of_target(y_train)

'continuous'

# Hyperparameter tuning

## Objective function

In [36]:
#sorted(SCORERS.keys())

# Random Search
RandomizedSearchCV for random search evaluates models for a given hyperparameter vector using cross-validation, hence the “CV” suffix of each class name.

It requires two arguments. 
1. The first is the model that you are optimizing. This is an instance of the model with values of hyperparameters set that you want to optimize. 
2. The second is the search space. This is defined as a dictionary where the names are the hyperparameter arguments to the model and the values are discrete values or a distribution of values to sample in the case of a random search.

## K-Nearest Neighbors
KNeighborsRegressor()

In [7]:
# Select an algorithm
model = KNeighborsRegressor()
model.get_params()

{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'minkowski',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 5,
 'p': 2,
 'weights': 'uniform'}

In [8]:
# define search space
search_space = [{
    'n_neighbors': list(range(1,10)),
    'weights': list(['uniform', 'distance']),
    'algorithm': list(['auto', 'ball_tree', 'kd_tree', 'brute']),
    'leaf_size': list(range(15, 45)),
    'p': list([1,2]),
    'metric': list(['euclidean', 'manhattan','chebyshev', 'minkowski']),
    # The search can be made parallel using various if not all of your CPU cores 
    # We can set it to -1 to automatically use all of the cores in the system.
    'n_jobs': list([-1])
}]

In [9]:
t.tic()
result_KNN = hyper_tuning("KNN", model, search_space, X_train, y_train)
t.toc(restart=True)
# Get the results
print(result_KNN.best_score_)
print("")
print(result_KNN.best_estimator_)
print("")
print(result_KNN.best_params_)

Elapsed time is 6.124174 seconds.
-126.0946106700254

KNeighborsRegressor(algorithm='kd_tree', leaf_size=18, n_jobs=-1, n_neighbors=9,
                    p=1, weights='distance')

{'weights': 'distance', 'p': 1, 'n_neighbors': 9, 'n_jobs': -1, 'metric': 'minkowski', 'leaf_size': 18, 'algorithm': 'kd_tree'}


## Classification and Regression Tree
DecisionTreeRegressor()

In [10]:
# Select an algorithm
model = DecisionTreeRegressor()
model.get_params()

{'ccp_alpha': 0.0,
 'criterion': 'squared_error',
 'max_depth': None,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'random_state': None,
 'splitter': 'best'}

In [11]:
# define search space
search_space = [{
    'criterion': list(['squared_error', 'friedman_mse', 'absolute_error', 'poisson'])
    , 'splitter': list(['best', 'random'])
    , 'max_depth': list(range(1,10))
    , 'min_samples_split': list(range(2,10))
    , 'min_samples_leaf': list(range(1,10))
    , 'min_weight_fraction_leaf': list(np.linspace(0.0,0.5))
}]

In [12]:
t.tic()
result_DTR = hyper_tuning("DecisionTrees", model, search_space, X_train, y_train)
t.toc(restart=True)

# Get the results
print(result_DTR.best_score_)
print("")
print(result_DTR.best_estimator_)
print("")
print(result_DTR.best_params_)

Elapsed time is 3.182892 seconds.
-146.27069594910802

DecisionTreeRegressor(criterion='friedman_mse', max_depth=9, min_samples_leaf=9,
                      min_samples_split=7,
                      min_weight_fraction_leaf=0.02040816326530612)

{'splitter': 'best', 'min_weight_fraction_leaf': 0.02040816326530612, 'min_samples_split': 7, 'min_samples_leaf': 9, 'max_depth': 9, 'criterion': 'friedman_mse'}


66 fits failed out of a total of 300.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
66 fits failed with the following error:
Traceback (most recent call last):
  File "/home/cesar/anaconda3/envs/hdl_project/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/cesar/anaconda3/envs/hdl_project/lib/python3.8/site-packages/sklearn/tree/_classes.py", line 1315, in fit
    super().fit(
  File "/home/cesar/anaconda3/envs/hdl_project/lib/python3.8/site-packages/sklearn/tree/_classes.py", line 178, in fit
    raise ValueError(
ValueError: Some value(s) of y are negative which is not allowed for Poisson regression.

 -274.1872302          

## Support Vector Regression - Polynomial
svm.SVR(kernel='poly')

In [13]:
# Select an algorithm
model = svm.SVR()
model.get_params()

{'C': 1.0,
 'cache_size': 200,
 'coef0': 0.0,
 'degree': 3,
 'epsilon': 0.1,
 'gamma': 'scale',
 'kernel': 'rbf',
 'max_iter': -1,
 'shrinking': True,
 'tol': 0.001,
 'verbose': False}

In [14]:
# define search space
search_space = [{
    'kernel': list(['poly'])
    # `degree` is a parameter used when kernel is set to ‘poly’.
    , 'degree': list([0, 2, 3, 4, 5, 6])
    # Gamma is a parameter for non linear hyperplanes. 
    # The higher the gamma value it tries to exactly fit the training data set
    , 'gamma' : list([0.1, 1, 10, 100])
    # C is the penalty parameter of the error term. 
    # It controls the trade off between smooth decision boundary and classifying the training points correctly.
    , 'C': list([0.1, 1, 10, 100, 1000])
}]

In [15]:
if(False):
    t.tic()
    result_SVM_poly = hyper_tuning("SVR_Poly", model, search_space, X_train, y_train)
    t.toc(restart=True)

    # Get the results
    print(result_SVM_poly.best_score_)
    print("")
    print(result_SVM_poly.best_estimator_)
    print("")
    print(result_SVM_poly.best_params_)

## Support Vector Regression - RBF
svm.SVR(kernel='rbf')

In [16]:
# Select an algorithm
model = svm.SVR()
model.get_params()

{'C': 1.0,
 'cache_size': 200,
 'coef0': 0.0,
 'degree': 3,
 'epsilon': 0.1,
 'gamma': 'scale',
 'kernel': 'rbf',
 'max_iter': -1,
 'shrinking': True,
 'tol': 0.001,
 'verbose': False}

In [17]:
# define search space
search_space = [{
    'kernel': list(['rbf'])
    # Gamma is a parameter for non linear hyperplanes. 
    # The higher the gamma value it tries to exactly fit the training data set
    , 'gamma' : list([0.1, 1, 10, 100])
    # C is the penalty parameter of the error term. 
    # It controls the trade off between smooth decision boundary and classifying the training points correctly.
    , 'C': list([0.1, 1, 10, 100, 1000])
}]

In [18]:
t.tic()
result_SVM_RBF = hyper_tuning("SVR_RBF", model, search_space, X_train, y_train)
t.toc(restart=True)

# Get the results
print(result_SVM_RBF.best_score_)
print("")
print(result_SVM_RBF.best_estimator_)
print("")
print(result_SVM_RBF.best_params_)



Elapsed time is 28.680515 seconds.
-85.89965613835875

SVR(C=100, gamma=1)

{'kernel': 'rbf', 'gamma': 1, 'C': 100}


## Support Vector Regression - Linear
svm.SVR(kernel='linear')

In [19]:
# Select an algorithm
model = svm.SVR()
model.get_params()

{'C': 1.0,
 'cache_size': 200,
 'coef0': 0.0,
 'degree': 3,
 'epsilon': 0.1,
 'gamma': 'scale',
 'kernel': 'rbf',
 'max_iter': -1,
 'shrinking': True,
 'tol': 0.001,
 'verbose': False}

In [20]:
# define search space
search_space = [{
    'kernel': list(['linear'])
    # Gamma is a parameter for non linear hyperplanes. 
    # The higher the gamma value it tries to exactly fit the training data set
    , 'gamma' : list([0.1, 1, 10, 100])
    # C is the penalty parameter of the error term. 
    # It controls the trade off between smooth decision boundary and classifying the training points correctly.
    , 'C': list([0.1, 1, 10, 100, 1000])
}]

In [21]:
t.tic()
result_SVM_Linear = hyper_tuning("SVR_Linear", model, search_space, X_train, y_train)
t.toc(restart=True)

# Get the results
print(result_SVM_Linear.best_score_)
print("")
print(result_SVM_Linear.best_estimator_)
print("")
print(result_SVM_Linear.best_params_)



Elapsed time is 17.232813 seconds.
-100.96341745684497

SVR(C=100, gamma=0.1, kernel='linear')

{'kernel': 'linear', 'gamma': 0.1, 'C': 100}


## Random Forest
RandomForestRegressor()

In [22]:
# Select an algorithm
model = RandomForestRegressor()
model.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'squared_error',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [23]:
# define search space
search_space = [{
    # `n_estimators` represents the number of trees in the forest. 
    # Usually the higher the number of trees the better to learn the data. It is also computationally expensive.
    'n_estimators': list([100, 200, 300, 400, 500])
    # `max_depth` represents the depth of each tree in the forest. 
    # The deeper the tree, the more splits it has and it captures more information about the data.
    , 'max_depth': list(np.linspace(1, 32, 32, endpoint=True))
    # `min_samples_split` represents the minimum number of samples required to split an internal node. 
    , 'min_samples_split': list([2, 3, 4, 5, 6, 7, 8, 9, 10]) # list(np.linspace(1, 1, 10, endpoint=True))
    # `min_samples_leaf` The minimum number of samples required to be at a leaf node.
    #, 'min_samples_leafs': list([1, 2, 4])
    # `max_features`: Represents the number of features to consider when looking for the best split.
    , 'max_features': list(range(1,X_train.shape[1]))
}]

In [24]:
t.tic()
result_RF = hyper_tuning("RandomForest", model, search_space, X_train, y_train)
t.toc(restart=True)

# Get the results 
print(result_RF.best_score_)
print("")
print(result_RF.best_estimator_)
print("")
print(result_RF.best_params_)

Elapsed time is 326.752128 seconds.
-95.08136703129003

RandomForestRegressor(max_depth=19.0, max_features=9, min_samples_split=8,
                      n_estimators=200)

{'n_estimators': 200, 'min_samples_split': 8, 'max_features': 9, 'max_depth': 19.0}


## Extra-trees regressor
ExtraTreesRegressor()

In [25]:
# Select an algorithm
model = ExtraTreesRegressor()
model.get_params()

{'bootstrap': False,
 'ccp_alpha': 0.0,
 'criterion': 'squared_error',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [26]:
# define search space
search_space = [{
    # `n_estimators` represents the number of trees in the forest. 
    # Usually the higher the number of trees the better to learn the data. It is also computationally expensive.
    'n_estimators': list([1, 2, 4, 8, 16, 32, 64, 100, 200])
    , 'criterion': ['squared_error']
    # `max_depth` represents the depth of each tree in the forest. 
    # The deeper the tree, the more splits it has and it captures more information about the data.
    , 'max_depth': list(np.linspace(1, 32, 32, endpoint=True))
    # `min_samples_split` represents the minimum number of samples required to split an internal node. 
    , 'min_samples_split': list([2, 3, 4, 5, 6, 7, 8, 9, 10]) # list(np.linspace(1, 1, 10, endpoint=True))
    # `min_samples_leaf` The minimum number of samples required to be at a leaf node.
    #, 'min_samples_leafs': list(np.linspace(0.1, 0.5, 5, endpoint=True))
    # `max_features`: Represents the number of features to consider when looking for the best split.
    , 'max_features': list(range(1,X_train.shape[1]))

}]

In [27]:
t.tic()
result_ETR = hyper_tuning("ExtraTrees", model, search_space, X_train, y_train)
t.toc(restart=True)

# Get the results
print(result_ETR.best_score_)
print("")
print(result_ETR.best_estimator_)
print("")
print(result_ETR.best_params_)

Elapsed time is 20.617258 seconds.
-92.2642297906054

ExtraTreesRegressor(max_depth=17.0, max_features=12)

{'n_estimators': 100, 'min_samples_split': 2, 'max_features': 12, 'max_depth': 17.0, 'criterion': 'squared_error'}


## XG Boost 
XGBRegressor()

In [28]:
# Select an algorithm
model = XGBRegressor()
model.get_params()

{'objective': 'reg:squarederror',
 'base_score': None,
 'booster': None,
 'colsample_bylevel': None,
 'colsample_bynode': None,
 'colsample_bytree': None,
 'gamma': None,
 'gpu_id': None,
 'importance_type': 'gain',
 'interaction_constraints': None,
 'learning_rate': None,
 'max_delta_step': None,
 'max_depth': None,
 'min_child_weight': None,
 'missing': nan,
 'monotone_constraints': None,
 'n_estimators': 100,
 'n_jobs': None,
 'num_parallel_tree': None,
 'random_state': None,
 'reg_alpha': None,
 'reg_lambda': None,
 'scale_pos_weight': None,
 'subsample': None,
 'tree_method': None,
 'validate_parameters': None,
 'verbosity': None}

In [29]:
# define search space
search_space = [{
    'max_depth': [3, 5, 6, 10, 15, 20]
    , 'learning_rate': [0.01, 0.1, 0.2, 0.3]
    , 'subsample': np.arange(0.5, 1.0, 0.1)
    , 'colsample_bytree': np.arange(0.4, 1.0, 0.1)
    , 'colsample_bylevel': np.arange(0.4, 1.0, 0.1)
    , 'n_estimators': [100, 500, 1000, 1500, 2000]
}]

In [30]:
t.tic()
result_XGB = hyper_tuning("XGBoost", model, search_space, X_train, y_train)
t.toc(restart=True)

# Get the results
print(result_XGB.best_score_)
print("")
print(result_XGB.best_estimator_)
print("")
print(result_XGB.best_params_)

Elapsed time is 404.661776 seconds.
-87.43927484514329

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=0.4,
             colsample_bynode=1, colsample_bytree=0.7, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.01, max_delta_step=0, max_depth=5,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=2000, n_jobs=8, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=0.5,
             tree_method='exact', validate_parameters=1, verbosity=None)

{'subsample': 0.5, 'n_estimators': 2000, 'max_depth': 5, 'learning_rate': 0.01, 'colsample_bytree': 0.7, 'colsample_bylevel': 0.4}


# Loading and evaluating models

In [32]:
# Evaluate a dict of models {name:object}, returns {name:score}
def multiple_model_evaluation(X_test, y_test, models_list):
    metrics_df = pd.DataFrame()
    
    for name in models_list:
        # evaluate the model
        s.tic()
        tmp_df = pd.DataFrame(single_model_evaluation(X_test, y_test, name), index=[0])
        tmp_df.insert(0, "Model Name", name, True)
        tmp_df.insert(0, "Type", "ML", True)
        metrics_df = metrics_df.append(tmp_df)
        print("> {}.".format(name))
        s.toc(restart=True)
        
    return metrics_df.reset_index(drop = True)

In [33]:
# get model list
models_list = ["KNN", "DecisionTrees", "SVR_RBF", "SVR_Linear", "RandomForest", "ExtraTrees", "XGBoost"]

# evaluate models
t.tic() #Start timer
results = multiple_model_evaluation(X_test, y_test, models_list)
t.toc() #Time elapsed since t.tic()

results

> KNN.
Elapsed time is 0.081322 seconds.
> DecisionTrees.
Elapsed time is 0.005863 seconds.
> SVR_RBF.
Elapsed time is 0.317110 seconds.
> SVR_Linear.
Elapsed time is 0.153042 seconds.
> RandomForest.
Elapsed time is 0.115748 seconds.
> ExtraTrees.
Elapsed time is 0.079571 seconds.
> XGBoost.
Elapsed time is 0.238433 seconds.
Elapsed time is 0.000695 seconds.


Unnamed: 0,Type,Model Name,RMSE,MAE,MAPE (%),R^2 (%),Max Error
0,ML,KNN,0.0,0.0,0.0,100.0,0.0
1,ML,DecisionTrees,9.509687,6.819189,31.090354,66.813768,75.633333
2,ML,SVR_RBF,7.328579,4.450548,16.945624,80.29098,80.340007
3,ML,SVR_Linear,9.622198,6.519342,27.533241,66.023857,78.242506
4,ML,RandomForest,3.947193,2.599079,11.206802,94.282552,49.621498
5,ML,ExtraTrees,0.843669,0.507079,2.58201,99.738802,8.430837
6,ML,XGBoost,3.567968,2.620707,11.941028,95.32838,21.826057


[ 9.   12.   10.   ... 22.25 25.19 27.87]


{'RMSE': 0.0, 'MAE': 0.0, 'MAPE (%)': 0.0, 'R^2 (%)': 100.0, 'Max Error': 0.0}

In [35]:
y_test

array([ 9.  , 12.  , 10.  , ..., 22.25, 25.19, 27.87])

# Sources:
## Main 
https://practicaldatascience.co.uk/machine-learning/how-to-use-model-selection-and-hyperparameter-tuning


* sklearn.model_selection.RandomizedSearchCV
    - https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html 
    - https://scikit-learn.org/stable/modules/grid_search.html?highlight=randomsearchcv
* sklearn.model_selection.KFold
    - https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html
    - https://machinelearningmastery.com/k-fold-cross-validation/


## Models
* KNN
    - https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html
    - https://scikit-learn.org/stable/modules/generated/sklearn.metrics.DistanceMetric.html#sklearn.metrics.DistanceMetric
* DecisionTreeRegressor()
    - https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html
* pmdarima
    - https://towardsdatascience.com/efficient-time-series-using-pythons-pmdarima-library-f6825407b7f0
* SVM
    - https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html?highlight=svm%20svr%20kernel%20poly
    - https://medium.com/all-things-ai/in-depth-parameter-tuning-for-svc-758215394769
* Random Forest
    - https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
    - https://medium.com/all-things-ai/in-depth-parameter-tuning-for-random-forest-d67bb7e920d
* XGBoost
    - https://towardsdatascience.com/xgboost-fine-tune-and-optimize-your-model-23d996fab663
    
* Gaussian NB
    - https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html
    - https://medium.com/analytics-vidhya/how-to-improve-naive-bayes-9fa698e14cba
    - https://www.analyticsvidhya.com/blog/2021/01/gaussian-naive-bayes-with-hyperpameter-tuning/
    
## Metrics
* Metrics and scoring: quantifying the quality of predictions
    - https://scikit-learn.org/stable/modules/model_evaluation.html
    - https://openclassrooms.com/en/courses/6401081-improve-the-performance-of-a-machine-learning-model/6539936-improve-your-feature-selection