# Practical Examples - Training
This notebook presents the training functions used for different ML techniques, each driven by the hyperparameters of its model.
* Note 1: Before training you have to load your dataset and define your X and Y variables.
* Note 2: The scalers are pre-processing tools used for normalization and standardization of the data (e.g. RobustScaler or MinMaxScaler from sklearn).
* Note 3: Each problem is different, so you will probably need to explore your data, define your own pre-processing strategies, and modify the training functions for each ML technique.
* Note 4: We recommend exploring the hyperparameters of each technique and consulting the TensorFlow/Keras and scikit-learn documentation for each model.
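
As a minimal sketch of Notes 1 and 2, the cell below builds a small synthetic dataset, defines X and Y, and fits the two scalers. The column names, the synthetic data, and the choice of RobustScaler for X and MinMaxScaler for Y are assumptions for illustration only. The scalers are fitted on the training split alone, so that no statistics from the test split leak into training.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Synthetic stand-in for a real dataset (hypothetical column names).
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "feature_a": rng.normal(size=200),
    "feature_b": rng.uniform(0, 10, size=200),
})
data["target"] = (3.0 * data["feature_a"] - 0.5 * data["feature_b"]
                  + rng.normal(scale=0.1, size=200))

# Define the X and Y variables.
X = data[["feature_a", "feature_b"]]
y = data["target"]

# Hold out a test set before fitting any scaler.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Fit the scalers on the training data only.
scalerX = RobustScaler().fit(X_train)
scalerY = MinMaxScaler().fit(y_train.values.reshape(-1, 1))
```

The fitted `scalerX` and `scalerY` can then be passed, together with `X_train` and `y_train`, to any of the training functions in this notebook.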

In [None]:
#Import some libraries and frameworks
import pandas as pd
import joblib
from sklearn.model_selection import train_test_split
import tensorflow as tf
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Lasso

### Artificial Neural Networks (ANN)
The function takes the following inputs:
* X_train: Input variables for the training of the model (pandas DataFrame).
* y_train: Output variable for the training of the model (pandas Series or DataFrame).
* scalerX and scalerY: Scalers for X and Y data, already fitted (the function only calls their transform method).
* distribution: List of hidden layers and number of neurons in each one (e.g. [16,32,16] is 3 hidden layers with 16, 32, and 16 neurons, respectively).
* activation: Activation function for the ANN (must be 'relu', 'leaky_relu' or 'elu').
* regularizer: Type of regularization in each layer of the ANN (must be 'l1', 'l2' or None).
* regularizer_value: Applied regularization value (usually 0.01, 0.001 or 0.0001).
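
To get a feel for how the `distribution` list translates into model size, the following sketch counts the weights and biases of the fully connected layers for a given list. The helper name `count_dense_parameters` and the example of 8 input features are hypothetical; the count covers only the Dense layers (hidden layers plus the single-neuron output layer) and leaves out the batch normalization parameters.

```python
def count_dense_parameters(n_features, distribution):
    """Count weights and biases of the Dense layers (hypothetical helper)."""
    total = 0
    n_inputs = n_features
    # Hidden layers plus the final single-neuron output layer.
    for n_neurons in list(distribution) + [1]:
        total += n_inputs * n_neurons + n_neurons  # weights + biases
        n_inputs = n_neurons
    return total


# Example: 8 input features and three hidden layers of 16, 32 and 16 neurons.
n_params = count_dense_parameters(8, [16, 32, 16])
print(n_params)
```

Comparing this number with the number of training samples is one rough way to judge whether a proposed architecture is oversized for the dataset.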

In [None]:
def Train_ANN_Model(X_train, y_train, scalerX, scalerY,
                    distribution, activation, regularizer, regularizer_value):
    #Transformation of the data
    X_train_scaled = pd.DataFrame(scalerX.transform(X_train), columns=X_train.columns)
    y_train_scaled = scalerY.transform(y_train.values.reshape(-1, 1)).ravel()
    #Applying regularization
    if regularizer == "l1":
        regularizer_ = tf.keras.regularizers.l1(regularizer_value)
    elif regularizer == "l2":
        regularizer_ = tf.keras.regularizers.l2(regularizer_value)
    else:
        #No regularization: the Dense layers expect None here
        regularizer_ = None
    #Adding the hidden layers of the ANN
    red = []
    n_capa = 0
    for i in distribution:
        n_capa += 1
        if n_capa == 1:
            red.append(tf.keras.layers.Dense(i, input_shape=(len(X_train.columns),), kernel_regularizer=regularizer_))
        else:
            red.append(tf.keras.layers.Dense(i, kernel_regularizer=regularizer_))
        #Activation functions
        if activation == 'relu':
            red.append(tf.keras.layers.ReLU())
        elif activation == 'leaky_relu':
            red.append(tf.keras.layers.LeakyReLU(alpha=0.1))
        elif activation == 'elu':
            red.append(tf.keras.layers.ELU(alpha=1.0))
    #Applying Batch Normalization and the output layer
    red.append(tf.keras.layers.BatchNormalization())
    red.append(tf.keras.layers.Dense(1, activation='linear'))
    #Defining the model
    regressor = tf.keras.Sequential(red)
    #Compile the model
    regressor.compile(optimizer='adam', loss='mse', metrics=["mae"])
    #Split the data into training and validation
    X_t, X_v, Y_t, Y_v = train_test_split(X_train_scaled, y_train_scaled, test_size=0.3)
    #Training the model for 100 epochs
    regressor.fit(X_t, Y_t, epochs=100, validation_data=(X_v, Y_v), batch_size=32, verbose=0)
    #Return the trained model
    return regressor

### Random Forests (RF)
The function takes the following inputs:
* X_train: Input variables for the training of the model (pandas DataFrame).
* y_train: Output variable for the training of the model (pandas Series or DataFrame).
* scalerX and scalerY: Scalers for X and Y data, already fitted.
* n_estimator: Number of decision trees in the RF model.
* max_depth: Maximum depth of the trees.
* min_samples_split: Minimum number of samples required to split a node.
* min_samples_leaf: Minimum number of samples required at a leaf node.
* max_feature: Number of features to consider at every split.
* bootstrap_: Whether bootstrap samples are used to build the trees (must be True or False).

- Note 1: We highly recommend exploring how those values affect the performance of the model, to avoid problems such as overfitting or underfitting.
- Note 2: Adjust the n_jobs parameter (set to -1 in the function, which uses all available cores) according to your machine's capacity.

In [None]:
def Train_RF_Model(X_train, y_train, scalerX, scalerY,
                   n_estimator, max_depth, min_samples_split, min_samples_leaf, max_feature, bootstrap_):
    #Transformation of the data
    X_train_scaled = pd.DataFrame(scalerX.transform(X_train), columns=X_train.columns)
    y_train_scaled = scalerY.transform(y_train.values.reshape(-1, 1)).ravel()
    #Model definition
    regressor = RandomForestRegressor(n_estimators=n_estimator,
                                      max_depth=max_depth,
                                      min_samples_split=min_samples_split,
                                      min_samples_leaf=min_samples_leaf,
                                      max_features=max_feature,
                                      bootstrap=bootstrap_,
                                      n_jobs=-1)
    #Training
    regressor.fit(X_train_scaled, y_train_scaled)
    #Return the trained model
    return regressor
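
One way to follow Note 1 is to compare training and validation scores while varying a single hyperparameter. The sketch below does this for `max_depth` on synthetic data; the data, the candidate depths, and the fixed values of the other settings are assumptions chosen only for illustration. A training score that sits far above the validation score suggests overfitting, while two low scores suggest underfitting.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression problem (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 4))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=300)

X_t, X_v, y_t, y_v = train_test_split(X, y, test_size=0.3, random_state=0)

# Train one forest per candidate depth and record the R2 scores.
scores = {}
for depth in [2, 5, None]:  # None lets the trees grow without a depth limit
    model = RandomForestRegressor(n_estimators=50, max_depth=depth,
                                  n_jobs=1, random_state=0)
    model.fit(X_t, y_t)
    train_r2 = model.score(X_t, y_t)
    val_r2 = model.score(X_v, y_v)
    scores[depth] = (train_r2, val_r2)
    print(f"max_depth={depth}: train R2={train_r2:.3f}, validation R2={val_r2:.3f}")
```

The same loop can be repeated for `min_samples_leaf`, `max_features`, or any other argument of the function.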

### Gradient Boosting Machines (GBM)
The function takes the following inputs:
* X_train: Input variables for the training of the model (pandas DataFrame).
* y_train: Output variable for the training of the model (pandas Series or DataFrame).
* scalerX and scalerY: Scalers for X and Y data, already fitted.
* n_estimator: Number of decision trees (boosting stages) in the GBM model.
* max_depth: Maximum depth of the trees.
* max_feature: Number of features to consider at every split.
* subsample: Fraction of samples used to fit each tree (between 0 and 1).
* learning_rate: Shrinkage applied to the contribution of each tree (usually 0.01, 0.05 or 0.1).

- Note 1: We highly recommend exploring how those values affect the performance of the model, to avoid problems such as overfitting or underfitting.

In [None]:
def Train_GBM_Model(X_train, y_train, scalerX, scalerY,
                    n_estimator, max_depth, max_feature, subsample, learning_rate):
    #Transformation of the data
    X_train_scaled = pd.DataFrame(scalerX.transform(X_train), columns=X_train.columns)
    y_train_scaled = scalerY.transform(y_train.values.reshape(-1, 1)).ravel()
    #Model definition
    regressor = GradientBoostingRegressor(n_estimators=n_estimator,
                                          learning_rate=learning_rate,
                                          max_depth=max_depth,
                                          max_features=max_feature,
                                          subsample=subsample)
    #Training
    regressor.fit(X_train_scaled, y_train_scaled)
    #Return the trained model
    return regressor
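
For boosting, the interaction between the number of trees and the learning rate can be inspected without retraining. The sketch below uses scikit-learn's `staged_predict`, which yields the predictions after each boosting stage, to track the validation error as trees are added. The synthetic data and the hyperparameter values are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 4))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=300)

X_t, X_v, y_t, y_v = train_test_split(X, y, test_size=0.3, random_state=0)

model = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1,
                                  max_depth=3, subsample=0.8, random_state=0)
model.fit(X_t, y_t)

# Validation MSE after 1, 2, ..., 200 trees.
val_mse = [mean_squared_error(y_v, pred) for pred in model.staged_predict(X_v)]

# Number of trees with the lowest validation error.
best_n = int(np.argmin(val_mse)) + 1
print(f"Lowest validation MSE with {best_n} trees: {min(val_mse):.4f}")
```

If the best number of trees is much smaller than `n_estimators`, the extra trees are mostly fitting noise; a smaller learning rate usually shifts that optimum toward more trees.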

### LASSO Regression (LASSO)
The function takes the following inputs:
* X_train: Input variables for the training of the model (pandas DataFrame).
* y_train: Output variable for the training of the model (pandas Series or DataFrame).
* scalerX and scalerY: Scalers for X and Y data, already fitted.
* alpha: Strength of the L1 regularization (usually < 0.1).

In [None]:
def Train_LASSO_Model(X_train, y_train, scalerX, scalerY,
                      alpha):
    #Transformation of the data
    X_train_scaled = pd.DataFrame(scalerX.transform(X_train), columns=X_train.columns)
    y_train_scaled = scalerY.transform(y_train.values.reshape(-1, 1)).ravel()
    #Model definition
    regressor = Lasso(alpha=alpha)
    #Training
    regressor.fit(X_train_scaled, y_train_scaled)
    #Return the trained model
    return regressor
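
The effect of `alpha` is easiest to see through the number of coefficients that LASSO drives to exactly zero. The sketch below fits the model for several values of `alpha` on synthetic data in which only two of six features matter, then converts the predictions of one model back to the original units with `inverse_transform`. The data, the candidate `alpha` values, and the use of StandardScaler are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: only the first two of six features drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Scale X and y, mirroring what the training functions do internally.
X_scaled = StandardScaler().fit_transform(X)
scalerY = StandardScaler().fit(y.reshape(-1, 1))
y_scaled = scalerY.transform(y.reshape(-1, 1)).ravel()

# Count the non-zero coefficients for each regularization strength.
nonzero = {}
for alpha in [0.001, 0.01, 0.1, 1.0]:
    model = Lasso(alpha=alpha).fit(X_scaled, y_scaled)
    nonzero[alpha] = int(np.sum(model.coef_ != 0))
    print(f"alpha={alpha}: {nonzero[alpha]} non-zero coefficients")

# Predictions are in scaled units; convert them back to the original units.
model = Lasso(alpha=0.01).fit(X_scaled, y_scaled)
y_pred = scalerY.inverse_transform(
    model.predict(X_scaled).reshape(-1, 1)).ravel()
```

The final `inverse_transform` step applies to every model in this notebook, since all of them are trained on the scaled target.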