# About
Hyperparameter optimization is required to get the most out of your machine learning models.

Hyperparameters are points of choice or configuration that allow a machine learning model to be customized for a specific task or dataset.

Parameters are different from hyperparameters. Parameters are learned automatically; hyperparameters are set manually to help guide the learning process.
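The distinction can be made concrete with a small sketch. The toy data and variable names below are made up for illustration and are not part of this project: `max_depth` is set by hand before training, while the split threshold and the tree structure are learned by `fit()`.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy step-shaped data (illustrative only): y jumps from 0 to 10 at x = 5
X_toy = np.arange(10, dtype=float).reshape(-1, 1)
y_toy = np.where(X_toy.ravel() < 5, 0.0, 10.0)

# Hyperparameter: chosen by hand before training
tree = DecisionTreeRegressor(max_depth=1, random_state=0)
tree.fit(X_toy, y_toy)

# Parameters: estimated from the data during fit()
root_threshold = float(tree.tree_.threshold[0])  # where the root node splits
node_count = int(tree.tree_.node_count)          # root plus two leaves at depth 1

print("hyperparameter max_depth:", tree.get_params()["max_depth"])
print("learned root threshold:", root_threshold)
print("learned node count:", node_count)
```

Changing `max_depth` changes what the tree is allowed to learn; the threshold itself is never set by the user.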

Choosing a hyperparameter grid is probably the most difficult part of hyperparameter tuning: it is nearly impossible to say ahead of time which values will work well, and the optimal settings depend on the dataset. Moreover, hyperparameters interact in complex ways, so tuning them one at a time does not work: changing a second hyperparameter shifts the best value of the one just tuned.


# Libraries

In [1]:
%run "/home/cesar/Python_NBs/HDL_Project/HDL_Project/global_fv.ipynb"

User information is ready!


In [2]:
import os

# Save trained models
import joblib

# Data
from sklearn.model_selection import train_test_split
from sklearn.utils.multiclass import type_of_target

# Hypertuning tools
from sklearn.model_selection import KFold
from sklearn.model_selection import RandomizedSearchCV

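# Metrics (the SCORERS dict is gone from recent scikit-learn releases; get_scorer_names replaces it)
from sklearn.metrics import get_scorer_names
from sklearn.metrics import (max_error, mean_absolute_error,
                             mean_absolute_percentage_error,
                             mean_squared_error, r2_score)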

# Nonlinear models
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn import svm
from sklearn.gaussian_process import GaussianProcessRegressor

# Ensemble models
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor

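# Independent copy of the timer `t` (defined outside this cell), so that nested
# timings do not reset each other; a plain `s = t` would only alias the same object
import copy
s = copy.copy(t)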

# Random seed
np.random.seed(101)

os.getcwd()

'/home/cesar/Python_NBs/HDL_Project/Mini HDL/Baseline_ML_Pollution_Concentration_MMA/2_Models'

# User-Defined Functions

In [3]:
def hyper_tuning(name, model, space, X, y):
    # The search accepts a `cv` argument that can be either:
    # a) an integer number of folds, e.g. 5
    #cross_val = 5
    # b) a configured cross-validation object.
    # No shuffling, so every fold is a contiguous block of the time series.
    kfold = KFold(n_splits=3, shuffle=False)

    # scikit-learn maximizes the score, so error metrics are negated:
    # a larger value (closer to zero) means a better model.
    scoring_metric = 'neg_mean_squared_error'

    # Search for best hyperparameters, sampling from the `space` argument
    grid = RandomizedSearchCV(estimator=model,
                              param_distributions=space,
                              cv=kfold,
                              n_iter=100,
                              scoring=scoring_metric)

    # Fit on the data passed in (the training set), never on the test set
    result = grid.fit(X, y)

    # Save the fitted search object; create the output folder if it is missing
    os.makedirs('trained_ml_models_mvi', exist_ok=True)
    filename = 'trained_ml_models_mvi/{}.sav'.format(name)
    joblib.dump(result, filename)

    return result
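Because the scoring metric is `neg_mean_squared_error`, `best_score_` is a negated error and is therefore zero or negative. A minimal sketch of how to read it, on synthetic data that stands in for the real samples (the names `X_demo`, `y_demo` and the small search space are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import KFold, RandomizedSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data (illustrative only)
rng = np.random.RandomState(0)
X_demo = rng.rand(60, 2)
y_demo = X_demo[:, 0] + 0.1 * rng.randn(60)

search = RandomizedSearchCV(
    estimator=KNeighborsRegressor(),
    param_distributions={"n_neighbors": [1, 3, 5, 7]},
    n_iter=4,
    cv=KFold(n_splits=3, shuffle=False),
    scoring="neg_mean_squared_error",
    random_state=0,
)
search.fit(X_demo, y_demo)

# best_score_ is the mean negated MSE over the folds, so flip the sign to get an MSE
best_mse = -search.best_score_
# Square root of the mean fold MSE (not the same as the mean of per-fold RMSEs)
best_rmse_estimate = np.sqrt(best_mse)

print(search.best_params_, best_mse, best_rmse_estimate)
```

The same sign flip applies to the `best_score_` values printed in the tuning cells of this notebook.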

In [4]:
# Evaluate a single model
def single_model_evaluation(X_test, y_test, name):
    # Load the trained model
    filename = 'trained_ml_models_mvi/{}.sav'.format(name)
    model = joblib.load(filename)

    # make predictions
    y_prediction = model.predict(X_test)
    
    metrics = dict()
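    # evaluate predictions
    # (the square root of the MSE works on every scikit-learn version; the
    # `squared=False` shortcut is no longer available in recent releases)
    metrics["RMSE"] = np.sqrt(mean_squared_error(y_test, y_prediction))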
    metrics["MAE"] = mean_absolute_error(y_test, y_prediction)
    metrics["MAPE (%)"] = mean_absolute_percentage_error(y_test, y_prediction) *100
    metrics["R^2 (%)"] = r2_score(y_test, y_prediction) * 100
    metrics["Max Error"] = max_error(y_test, y_prediction)    
    
    return metrics
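To make the five metrics easier to interpret, here is a small worked sketch on hand-made numbers (the arrays are illustrative, not project data). The prediction errors are 2, -2, 0 and 4, so each metric can be checked by hand.

```python
import numpy as np
from sklearn.metrics import (max_error, mean_absolute_error,
                             mean_absolute_percentage_error,
                             mean_squared_error, r2_score)

# Illustrative values only
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 30.0, 44.0])

toy_metrics = {
    # squared errors 4, 4, 0, 16 -> MSE 6 -> RMSE sqrt(6)
    "RMSE": float(np.sqrt(mean_squared_error(y_true, y_pred))),
    # absolute errors 2, 2, 0, 4 -> mean 2
    "MAE": float(mean_absolute_error(y_true, y_pred)),
    # relative errors 0.2, 0.1, 0, 0.1 -> mean 0.1 -> 10 %
    "MAPE (%)": float(mean_absolute_percentage_error(y_true, y_pred) * 100),
    # 1 - 24 / 500 = 0.952 -> 95.2 %
    "R^2 (%)": float(r2_score(y_true, y_pred) * 100),
    # largest absolute error
    "Max Error": float(max_error(y_true, y_pred)),
}
print(toy_metrics)
```

RMSE, MAE and Max Error are in the units of the target, while MAPE and R^2 are unit-free, which is why the last two are reported as percentages.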

In [5]:
# Quick check of one saved model. Run this only after the "Data" section and the
# KNN tuning cell; until then the model file does not exist (see the error below).
model = joblib.load('trained_ml_models_mvi/{}.sav'.format("KNN"))
print(model.predict(X_test))

# Reuse the helper instead of repeating its body
metrics = single_model_evaluation(X_test, y_test, "KNN")
metrics

FileNotFoundError: [Errno 2] No such file or directory: 'trained_ml_models_mvi/KNN.sav'

# Data

## Sample preparation

In [None]:
sql_table = "MVI_sima_station_CE"
target = "pm25"

# Columns to read from the SQL table.
# NOx is left out because it is highly correlated with NO.
column = "datetime, co, no, no2, o3, pm10, pm25, prs, rainf, rh, so2, sr, tout, wdr, wsr"

# Alternatives:
#     all columns:
#column = "*"
#     the full list, including NOx:
#column = "datetime, co, no, no2, nox, o3, pm10, pm25, prs, rainf, rh, so2, sr, tout, wdr, wsr "

# Filter data with WHERE command
sql_where = "where datetime >=\'2021-04-17 23:00:00\'"
#"where datetime > \'2020-04-20\'"

# Initialize class to create multivariate samples:
multi_ts = multivariate_samples(sql_table, target, column, sql_where)

# The models below cannot be trained on batched samples, so the batch parameter is 1.
X, y, _ = multi_ts.samples_creation(1, target)

# Split into training and test sets without shuffling, because the data is a time series.
X_train, X_test, y_train, y_test = train_test_split(X[:, 0, :], y, test_size=0.30, shuffle=False)
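A minimal sketch of what `shuffle=False` does, using ten numbered rows as a stand-in for the time-ordered samples (the `*_demo` names are illustrative): the first 70 % of the rows become the training set and the last 30 % the test set, so the model is always evaluated on observations that come after the ones it was trained on.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten time-ordered rows; the value of each row is its position in time
X_demo = np.arange(10).reshape(-1, 1)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, shuffle=False
)

print("train:", y_tr.tolist())
print("test: ", y_te.tolist())
```

With shuffling enabled, future rows would leak into the training set, which tends to make the test error look better than it would be in practice.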

In [None]:
type_of_target(y_train)
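`type_of_target` reports how scikit-learn interprets a target array. A short sketch with made-up arrays: real-valued targets such as concentration readings are reported as `'continuous'`, which is what the regressors in this notebook expect, while integer labels with more than two values are reported as `'multiclass'`.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Illustrative arrays, not project data
continuous_kind = type_of_target(np.array([12.4, 8.9, 15.1, 30.7]))
multiclass_kind = type_of_target(np.array([0, 1, 2, 1]))

print(continuous_kind, multiclass_kind)
```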

# Hyperparameter tuning

## Objective function

In [None]:
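# List the available scoring names (the SCORERS dict is gone from recent scikit-learn releases):
#from sklearn.metrics import get_scorer_names
#sorted(get_scorer_names())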

# Random Search
`RandomizedSearchCV` performs a random search: it samples hyperparameter combinations from the search space and evaluates each one with cross-validation, hence the “CV” suffix in the class name.

It requires two arguments. 
1. The first is the model to optimize: an estimator instance. Any hyperparameter that is not listed in the search space keeps the value set on this instance.
2. The second is the search space: a dictionary whose keys are the model's hyperparameter names and whose values are either lists of discrete values or distributions to sample from.
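The two arguments can be sketched as follows, on synthetic data. The names `X_demo`, `y_demo` and `space_demo` are illustrative, and `scipy.stats.randint` is used here only to show a distribution next to a plain list; the search spaces later in this notebook use lists throughout.

```python
import numpy as np
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (illustrative only)
rng = np.random.RandomState(0)
X_demo = rng.rand(80, 3)
y_demo = 2.0 * X_demo[:, 0] + 0.1 * rng.randn(80)

# Argument 2: the search space
space_demo = {
    "splitter": ["best", "random"],      # list: one value is picked uniformly
    "max_depth": randint(1, 10),         # distribution: integers from 1 to 9
    "min_samples_leaf": randint(1, 10),  # distribution: integers from 1 to 9
}

search = RandomizedSearchCV(
    DecisionTreeRegressor(random_state=0),  # Argument 1: the model instance
    space_demo,
    n_iter=10,                               # number of sampled combinations
    cv=3,
    scoring="neg_mean_squared_error",
    random_state=0,
)
search.fit(X_demo, y_demo)

n_tried = len(search.cv_results_["params"])
print(n_tried, search.best_params_)
```

Only `n_iter` combinations are evaluated, regardless of how large the space is, which is what keeps random search affordable compared with an exhaustive grid.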

## K-Nearest Neighbors
KNeighborsRegressor()

In [None]:
# Select an algorithm
model = KNeighborsRegressor()
model.get_params()

In [None]:
# define search space
search_space = [{
    'n_neighbors': list(range(1, 10)),
    'weights': ['uniform', 'distance'],
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
    'leaf_size': list(range(15, 45)),
    # `p` is the Minkowski power; it only has an effect when metric='minkowski'
    'p': [1, 2],
    'metric': ['euclidean', 'manhattan', 'chebyshev', 'minkowski'],
    # Not a tuned hyperparameter: the single value -1 makes every candidate
    # use all CPU cores in the system for its neighbor queries.
    'n_jobs': [-1]
}]

In [None]:
t.tic()
result_KNN = hyper_tuning("KNN", model, search_space, X_train, y_train)
t.toc(restart=True)
# Get the results
print(result_KNN.best_score_)
print("")
print(result_KNN.best_estimator_)
print("")
print(result_KNN.best_params_)

## Classification and Regression Tree
DecisionTreeRegressor()

In [None]:
# Select an algorithm
model = DecisionTreeRegressor()
model.get_params()

In [None]:
# define search space
search_space = [{
    'criterion': list(['squared_error', 'friedman_mse', 'absolute_error', 'poisson'])
    , 'splitter': list(['best', 'random'])
    , 'max_depth': list(range(1,10))
    , 'min_samples_split': list(range(2,10))
    , 'min_samples_leaf': list(range(1,10))
    , 'min_weight_fraction_leaf': list(np.linspace(0.0,0.5))
}]

In [None]:
t.tic()
result_DTR = hyper_tuning("DecisionTrees", model, search_space, X_train, y_train)
t.toc(restart=True)

# Get the results
print(result_DTR.best_score_)
print("")
print(result_DTR.best_estimator_)
print("")
print(result_DTR.best_params_)

## Support Vector Regression - Polynomial
svm.SVR(kernel='poly')

In [None]:
# Select an algorithm
model = svm.SVR()
model.get_params()

In [None]:
# define search space
search_space = [{
    'kernel': ['poly']
    # `degree` is only used when the kernel is 'poly'.
    , 'degree': [0, 2, 3, 4, 5, 6]
    # `gamma` is the kernel coefficient of the nonlinear kernels.
    # The higher gamma is, the more closely the model tries to fit the training set.
    , 'gamma': [0.1, 1, 10, 100]
    # `C` is the penalty parameter of the error term.
    # It controls the trade-off between a smooth regression function and fitting the training points closely.
    , 'C': [0.1, 1, 10, 100, 1000]
}]

In [None]:
if False:  # the polynomial-kernel search is switched off; set to True to run it
    t.tic()
    result_SVM_poly = hyper_tuning("SVR_Poly", model, search_space, X_train, y_train)
    t.toc(restart=True)

    # Get the results
    print(result_SVM_poly.best_score_)
    print("")
    print(result_SVM_poly.best_estimator_)
    print("")
    print(result_SVM_poly.best_params_)

## Support Vector Regression - RBF
svm.SVR(kernel='rbf')

In [None]:
# Select an algorithm
model = svm.SVR()
model.get_params()

In [None]:
# define search space
search_space = [{
    'kernel': ['rbf']
    # `gamma` is the kernel coefficient of the nonlinear kernels.
    # The higher gamma is, the more closely the model tries to fit the training set.
    , 'gamma': [0.1, 1, 10, 100]
    # `C` is the penalty parameter of the error term.
    # It controls the trade-off between a smooth regression function and fitting the training points closely.
    , 'C': [0.1, 1, 10, 100, 1000]
}]

In [None]:
t.tic()
result_SVM_RBF = hyper_tuning("SVR_RBF", model, search_space, X_train, y_train)
t.toc(restart=True)

# Get the results
print(result_SVM_RBF.best_score_)
print("")
print(result_SVM_RBF.best_estimator_)
print("")
print(result_SVM_RBF.best_params_)

## Support Vector Regression - Linear
svm.SVR(kernel='linear')

In [None]:
# Select an algorithm
model = svm.SVR()
model.get_params()

In [None]:
# define search space
search_space = [{
    'kernel': ['linear']
    # `gamma` is not searched here: it is the coefficient of the nonlinear kernels
    # and the linear kernel ignores it.
    # `C` is the penalty parameter of the error term.
    # It controls the trade-off between a smooth regression function and fitting the training points closely.
    , 'C': [0.1, 1, 10, 100, 1000]
}]

In [None]:
t.tic()
result_SVM_Linear = hyper_tuning("SVR_Linear", model, search_space, X_train, y_train)
t.toc(restart=True)

# Get the results
print(result_SVM_Linear.best_score_)
print("")
print(result_SVM_Linear.best_estimator_)
print("")
print(result_SVM_Linear.best_params_)

## Random Forest
RandomForestRegressor()

In [None]:
# Select an algorithm
model = RandomForestRegressor()
model.get_params()

In [None]:
# define search space
search_space = [{
    # `n_estimators` is the number of trees in the forest.
    # More trees usually give a more stable fit, at a higher computational cost.
    'n_estimators': [100, 200, 300, 400, 500]
    # `max_depth` is the maximum depth of each tree in the forest.
    # The deeper the tree, the more splits it has and the more detail it captures.
    # It must be an integer, so range() is used; np.linspace() would yield floats.
    , 'max_depth': list(range(1, 33))
    # `min_samples_split` is the minimum number of samples required to split an internal node.
    , 'min_samples_split': [2, 3, 4, 5, 6, 7, 8, 9, 10]
    # `min_samples_leaf` is the minimum number of samples required at a leaf node.
    #, 'min_samples_leaf': [1, 2, 4]
    # `max_features` is the number of features considered when looking for the best split.
    , 'max_features': list(range(1, X_train.shape[1]))
}]

In [None]:
t.tic()
result_RF = hyper_tuning("RandomForest", model, search_space, X_train, y_train)
t.toc(restart=True)

# Get the results 
print(result_RF.best_score_)
print("")
print(result_RF.best_estimator_)
print("")
print(result_RF.best_params_)

## Extra-trees regressor
ExtraTreesRegressor()

In [None]:
# Select an algorithm
model = ExtraTreesRegressor()
model.get_params()

In [None]:
# define search space
search_space = [{
    # `n_estimators` is the number of trees in the forest.
    # More trees usually give a more stable fit, at a higher computational cost.
    'n_estimators': [1, 2, 4, 8, 16, 32, 64, 100, 200]
    , 'criterion': ['squared_error']
    # `max_depth` is the maximum depth of each tree in the forest.
    # The deeper the tree, the more splits it has and the more detail it captures.
    # It must be an integer, so range() is used; np.linspace() would yield floats.
    , 'max_depth': list(range(1, 33))
    # `min_samples_split` is the minimum number of samples required to split an internal node.
    , 'min_samples_split': [2, 3, 4, 5, 6, 7, 8, 9, 10]
    # `min_samples_leaf` is the minimum number of samples required at a leaf node.
    #, 'min_samples_leaf': list(np.linspace(0.1, 0.5, 5, endpoint=True))
    # `max_features` is the number of features considered when looking for the best split.
    , 'max_features': list(range(1, X_train.shape[1]))

}]

In [None]:
t.tic()
result_ETR = hyper_tuning("ExtraTrees", model, search_space, X_train, y_train)
t.toc(restart=True)

# Get the results
print(result_ETR.best_score_)
print("")
print(result_ETR.best_estimator_)
print("")
print(result_ETR.best_params_)

## XGBoost
XGBRegressor()

In [None]:
# Select an algorithm
model = XGBRegressor()
model.get_params()

In [None]:
# define search space
search_space = [{
    'max_depth': [3, 5, 6, 10, 15, 20]
    , 'learning_rate': [0.01, 0.1, 0.2, 0.3]
    , 'subsample': np.arange(0.5, 1.0, 0.1)
    , 'colsample_bytree': np.arange(0.4, 1.0, 0.1)
    , 'colsample_bylevel': np.arange(0.4, 1.0, 0.1)
    , 'n_estimators': [100, 500, 1000, 1500, 2000]
}]

In [None]:
t.tic()
result_XGB = hyper_tuning("XGBoost", model, search_space, X_train, y_train)
t.toc(restart=True)

# Get the results
print(result_XGB.best_score_)
print("")
print(result_XGB.best_estimator_)
print("")
print(result_XGB.best_params_)

# Loading and evaluating models

In [None]:
# Evaluate the saved models named in `models_list`; returns a DataFrame with one row of metrics per model
def multiple_model_evaluation(X_test, y_test, models_list):
    rows = []

    for name in models_list:
        # evaluate the model
        s.tic()
        tmp_df = pd.DataFrame(single_model_evaluation(X_test, y_test, name), index=[0])
        tmp_df.insert(0, "Model Name", name, True)
        tmp_df.insert(0, "Type", "ML", True)
        rows.append(tmp_df)
        print("> {}.".format(name))
        s.toc(restart=True)

    # DataFrame.append no longer exists in pandas 2.x; concatenate the rows once instead
    return pd.concat(rows).reset_index(drop=True)

In [None]:
# get model list
models_list = ["KNN", "DecisionTrees", "SVR_RBF", "SVR_Linear", "RandomForest", "ExtraTrees", "XGBoost"]

# evaluate models
t.tic() #Start timer
results = multiple_model_evaluation(X_test, y_test, models_list)
t.toc() #Time elapsed since t.tic()

results

In [None]:
y_test

# Sources:
## Main 
https://practicaldatascience.co.uk/machine-learning/how-to-use-model-selection-and-hyperparameter-tuning


* sklearn.model_selection.RandomizedSearchCV
    - https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html 
    - https://scikit-learn.org/stable/modules/grid_search.html?highlight=randomsearchcv
* sklearn.model_selection.KFold
    - https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html
    - https://machinelearningmastery.com/k-fold-cross-validation/


## Models
* KNN
    - https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html
    - https://scikit-learn.org/stable/modules/generated/sklearn.metrics.DistanceMetric.html#sklearn.metrics.DistanceMetric
* DecisionTreeRegressor()
    - https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html
* pmdarima
    - https://towardsdatascience.com/efficient-time-series-using-pythons-pmdarima-library-f6825407b7f0
* SVM
    - https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html?highlight=svm%20svr%20kernel%20poly
    - https://medium.com/all-things-ai/in-depth-parameter-tuning-for-svc-758215394769
* Random Forest
    - https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
    - https://medium.com/all-things-ai/in-depth-parameter-tuning-for-random-forest-d67bb7e920d
* XGBoost
    - https://towardsdatascience.com/xgboost-fine-tune-and-optimize-your-model-23d996fab663
    
* Gaussian NB
    - https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html
    - https://medium.com/analytics-vidhya/how-to-improve-naive-bayes-9fa698e14cba
    - https://www.analyticsvidhya.com/blog/2021/01/gaussian-naive-bayes-with-hyperpameter-tuning/
    
## Metrics
* Metrics and scoring: quantifying the quality of predictions
    - https://scikit-learn.org/stable/modules/model_evaluation.html
    - https://openclassrooms.com/en/courses/6401081-improve-the-performance-of-a-machine-learning-model/6539936-improve-your-feature-selection