# Training, Evaluating and Comparing Models

In this notebook, different model architectures are trained on the main data
as well as on the integrated data. All models are evaluated and finally
compared to one another to select the best-performing one.

The training, validation, and test data are simply loaded from files here.
The data analysis and preparation were done in the notebook
`01-data_preparation.ipynb`.

## Preparations

In [13]:
# dependencies
import pandas as pd
from pathlib import Path

from sklearn.linear_model import LinearRegression

from sklearn.tree import DecisionTreeRegressor

from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import ExtraTreesRegressor

from sklearn.metrics import root_mean_squared_error

import xgboost as xgb

# paths
PATH_DATA = Path("../data/processed/main/")
PATH_MODELS = Path("../models/")

In [3]:
# load data

# train data
X_train = pd.read_csv(PATH_DATA / "sleep_data_main_train_features.csv")
y_train = pd.read_csv(PATH_DATA / "sleep_data_main_train_labels.csv")

# validation data
X_val = pd.read_csv(PATH_DATA / "sleep_data_main_val_features.csv")
y_val = pd.read_csv(PATH_DATA / "sleep_data_main_val_labels.csv")

# test data
X_test = pd.read_csv(PATH_DATA / "sleep_data_main_test_features.csv")
y_test = pd.read_csv(PATH_DATA / "sleep_data_main_test_labels.csv")


### Defining the Training Function

In [25]:
# FIXME
# this function only works for sklearn models so far
# it must work for XGBoost as well
# it may already work for XGBoost, but I have not tested that yet

# function for training any regression model and evaluating it using RMSE as metric
def train_evaluate_model_sklearn(model,
                                  X_t=X_train,
                                  y_t=y_train,
                                  X_v=X_val,
                                  y_v=y_val):
    """
    Train a model and evaluate it using RMSE as metric.
    
    Parameters
    ----------
    model: sklearn model
        The model to train.
        Model hyperparameters must be set beforehand by either initializing a
        model object with the desired parameters and passing it here, or by
        directly setting them in the function call.
    X_t: pd.DataFrame
        The training features.
        Default: X_train
    y_t: pd.Series
        The training labels.
        Default: y_train
    X_v: pd.DataFrame
        The validation features.
        Default: X_val
    y_v: pd.Series
        The validation labels.
        Default: y_val
        
    Returns
    -------
    rmse: float
        The RMSE of the model on the validation set rounded to 4 decimal places.
    """
    
    # convert y to 1d array
    y_t = y_t.squeeze()
    y_v = y_v.squeeze()

    model.fit(X_t, y_t)
    y_p = model.predict(X_v)
    rmse = root_mean_squared_error(y_v, y_p)
    rmse = round(rmse, 4)
    
    return rmse



# function for training any regression model and evaluating it using RMSE as metric
# meant to work for any model used here; the XGBoost branch is not implemented yet
def train_evaluate_model(model,
                         framework,
                         hyperparameters=None,
                         X_t=X_train,
                         y_t=y_train,
                         X_v=X_val,
                         y_v=y_val):
    """
    Train a model and evaluate it using RMSE as metric.
    
    Parameters
    ----------
    model: sklearn model or xgboost model
        The model to train.
        Model hyperparameters must be set beforehand by either initializing a
        model object with the desired parameters and passing it here, or by
        directly setting them in the function call.
    framework: str
        The framework used to train the model.
        Must be either "sklearn" or "xgboost".
    hyperparameters: dict
        The hyperparameters of the model.
        Default: None, so models are trained with default hyperparameters when
        no hyperparameters are passed.
    X_t: pd.DataFrame
        The training features.
        Default: X_train
    y_t: pd.Series
        The training labels.
        Default: y_train
    X_v: pd.DataFrame
        The validation features.
        Default: X_val
    y_v: pd.Series
        The validation labels.
        Default: y_val
        
    Returns
    -------
    rmse: float
        The RMSE of the model on the validation set rounded to 4 decimal places.
    """
    
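    # sklearn model
    if framework == "sklearn":

        # convert y to 1d array
        y_t = y_t.squeeze()
        y_v = y_v.squeeze()

        # set hyperparameters if passed
        if hyperparameters is not None:
            model.set_params(**hyperparameters)

        model.fit(X_t, y_t)
        y_p = model.predict(X_v)
        rmse = root_mean_squared_error(y_v, y_p)
        rmse = round(rmse, 4)

    # xgboost model: not implemented yet, so fail loudly instead of
    # reaching the return statement with rmse undefined
    elif framework == "xgboost":
        raise NotImplementedError("XGBoost training is not implemented yet")

    # any other value is a usage error
    else:
        raise ValueError(f"Unknown framework: {framework!r}")

    # return rmse (only reached for a supported framework)
    return rmse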
        


## Solve it using Scikit-Learn's built-in function for grid search

In [32]:
from sklearn.model_selection import GridSearchCV

# create dataframe with models and their parameter grids
models_df = pd.DataFrame([
    {
        'name': 'linear_regression',
        'model': LinearRegression(),
        'param_grid': {
            'fit_intercept': [True, False],
            'positive': [True, False]
        }
    },
    {
        'name': 'random_forest',
        'model': RandomForestRegressor(random_state=1337, n_jobs=-1),
        'param_grid': {
            'n_estimators': [50, 100, 200],
            'max_depth': [None, 10, 20],
            'min_samples_split': [2, 5, 10]
        }
    },
    {
        'name': 'xgboost',
        'model': xgb.XGBRegressor(random_state=1337, n_jobs=-1),
        'param_grid': {
            'n_estimators': [50, 100, 200],
            'max_depth': [3, 6, 9],
            'learning_rate': [0.01, 0.1, 0.3]
        }
    }
])

def train_model_with_gridsearch(row):
    """Train model using GridSearchCV and return best results."""
    # setup grid search
    grid_search = GridSearchCV(
        estimator=row['model'],
        param_grid=row['param_grid'],
        scoring='neg_root_mean_squared_error',  # scores are maximized, so sklearn negates RMSE
        cv=5,                                   # 5-fold cross-validation
        n_jobs=-1,                              # use all cores
        verbose=1
    )
    
    # fit grid search
    grid_search.fit(X_train, y_train.squeeze())
    
    # get best results
    best_rmse = -grid_search.best_score_  # convert back to positive RMSE
    
    return pd.Series({
        'best_params': grid_search.best_params_,
        'best_rmse': round(best_rmse, 4),
        'best_model': grid_search.best_estimator_
    })

# apply grid search to each model
results = models_df.apply(train_model_with_gridsearch, axis=1)
models_df = pd.concat([models_df, results], axis=1)

# print results
for _, row in models_df.iterrows():
    print(f"\nModel: {row['name']}")
    print(f"Best parameters: {row['best_params']}")
    print(f"Best RMSE: {row['best_rmse']}")

Fitting 5 folds for each of 4 candidates, totalling 20 fits
Fitting 5 folds for each of 27 candidates, totalling 135 fits
Fitting 5 folds for each of 27 candidates, totalling 135 fits

Model: linear_regression
Best parameters: {'fit_intercept': True, 'positive': True}
Best RMSE: 0.3393

Model: random_forest
Best parameters: {'max_depth': 10, 'min_samples_split': 10, 'n_estimators': 200}
Best RMSE: 0.3424

Model: xgboost
Best parameters: {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 50}
Best RMSE: 0.3376
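
The RMSE values printed above are mean cross-validation scores computed on the training data only, so the validation split has not been used at this point. The following is a minimal sketch of how the refit best estimator could additionally be scored on held-out data. Synthetic data from `make_regression` stands in for the sleep data, and the small parameter grid and all variable names are illustrative assumptions, not the notebook's actual setup.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# synthetic stand-in for the sleep data (assumption: a few hundred rows)
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=1337)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=1337)

grid_search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=1337),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 10]},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
grid_search.fit(X_tr, y_tr)

# mean RMSE over the 5 folds for the best parameter combination
cv_rmse = -grid_search.best_score_

# RMSE of the refit best model on data that was not part of the search
y_pred = grid_search.best_estimator_.predict(X_va)
val_rmse = float(np.sqrt(mean_squared_error(y_va, y_pred)))

print(f"CV RMSE: {cv_rmse:.4f}, validation RMSE: {val_rmse:.4f}")
```

If the two numbers differ a lot, the cross-validation estimate is probably noisy, which is plausible for a data set this small.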


In [34]:
models_df = pd.DataFrame([
    {
        'name': 'linear_regression',
        'model': LinearRegression(),
        'param_grid': {
            'fit_intercept': [True, False],
            'positive': [True, False]
        }
    },
    {
        'name': 'random_forest',
        'model': RandomForestRegressor(random_state=1337, n_jobs=-1),
        'param_grid': {
            'n_estimators': [50, 100, 200],
            'max_depth': [None, 10, 20],
            'min_samples_split': [2, 5, 10]
        }
    },
    {
        'name': 'xgboost',
        'model': xgb.XGBRegressor(random_state=1337, n_jobs=-1),
        'param_grid': {
            'n_estimators': [50, 100, 200],
            'max_depth': [3, 6, 9],
            'learning_rate': [0.01, 0.1, 0.3]
        }
    }
])

models_df

| | name | model | param_grid |
|---|---|---|---|
| 0 | linear_regression | LinearRegression() | {'fit_intercept': [True, False], 'positive': [... |
| 1 | random_forest | RandomForestRegressor(n_jobs=-1, random_state=... | {'n_estimators': [50, 100, 200], 'max_depth': ... |
| 2 | xgboost | XGBRegressor(base_score=None, booster=None, ca... | {'n_estimators': [50, 100, 200], 'max_depth': ... |


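Lol :D Plain linear regression is practically on par with the ensembles so far!
In the grid search above, XGBoost (RMSE 0.3376) is only marginally ahead of
linear regression (0.3393), and random forest (0.3424) is slightly behind both.

I guess that's because the data set is tiny!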

Now make the function work for XGBoost as well.

In [None]:
# make it work for XGBoost
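
One possible way to approach this, sketched under assumptions: `xgb.XGBRegressor` follows the scikit-learn estimator interface (`fit`, `predict`, `set_params`), so a single code path may be enough and the `framework` argument could become unnecessary. The helper name `train_evaluate_any` and the synthetic data are hypothetical and only serve as illustration. The XGBoost import is guarded so that the sketch also runs where XGBoost is not installed.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def train_evaluate_any(model, X_t, y_t, X_v, y_v, hyperparameters=None):
    """Hypothetical helper: fit any estimator offering fit/predict/set_params."""
    if hyperparameters is not None:
        model.set_params(**hyperparameters)
    # np.ravel flattens a single-column DataFrame, a Series, or an array to 1d
    model.fit(X_t, np.ravel(y_t))
    y_p = model.predict(X_v)
    rmse = float(np.sqrt(mean_squared_error(np.ravel(y_v), y_p)))
    return round(rmse, 4)


# synthetic stand-in for the sleep data (assumption, not the real data set)
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=1337)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=1337)

models = {"linear_regression": LinearRegression()}
try:
    import xgboost as xgb
    models["xgboost"] = xgb.XGBRegressor(n_estimators=50, random_state=1337)
except ImportError:
    pass  # without xgboost, the sklearn model still exercises the helper

scores = {
    name: train_evaluate_any(model, X_tr, y_tr, X_va, y_va)
    for name, model in models.items()
}
print(scores)
```

A separate branch would only be needed for XGBoost's native `xgb.train` interface with `DMatrix` objects, which is not used anywhere in this notebook.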

## Establish Baseline: Linear Regression

The final goal is to use a tree-based model here.
I want to use a simpler model as a baseline first, though, and see whether and by
how much a more sophisticated model can improve on it.

The way I prepared my data, the label was normalized as well.
I could have decided against that and treated this as a classification task.
I normalized it anyway to have all values in the same range.
As a result, the label values are no longer discrete, and this is now a regression task.

The simplest model for this is linear regression.

### Train a raw linear regression with default parameters to see how well it does
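
A minimal sketch of what this step could look like, assuming synthetic data from `make_regression` in place of the sleep data. It also fits a `DummyRegressor` that always predicts the training mean, which gives a lower bar against which even the linear baseline can be judged. The helper `fit_and_score` is a hypothetical name introduced only for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# synthetic stand-in for the sleep data (assumption, not the real data set)
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=1337)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=1337)


def fit_and_score(model):
    """Fit on the training split and return the RMSE on the validation split."""
    model.fit(X_tr, y_tr)
    return float(np.sqrt(mean_squared_error(y_va, model.predict(X_va))))


# predicting the training mean for every row: the weakest sensible reference
rmse_mean = fit_and_score(DummyRegressor(strategy="mean"))

# linear regression with default parameters: the actual baseline
rmse_linear = fit_and_score(LinearRegression())

print(f"mean predictor RMSE: {rmse_mean:.4f}")
print(f"linear regression RMSE: {rmse_linear:.4f}")
```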

### Tune parameters of linear regression to see how well it can get

## Train more sophisticated model: Decision Tree Regressor

This is the classic tree-based model. It consists of just one tree. It will be
interesting to see whether a single tree already improves on linear regression.
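
A single unrestricted tree can memorize a small training set, so it helps to look at training and validation RMSE side by side. The sketch below does this for a few values of `max_depth`; the synthetic data and the chosen depths are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# synthetic stand-in for the sleep data (assumption, not the real data set)
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=1337)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=1337)


def rmse(y_true, y_pred):
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


results = {}
for depth in [1, 3, 5, None]:  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=1337)
    tree.fit(X_tr, y_tr)
    results[depth] = (rmse(y_tr, tree.predict(X_tr)), rmse(y_va, tree.predict(X_va)))

for depth, (train_rmse, val_rmse) in results.items():
    print(f"max_depth={depth}: train RMSE {train_rmse:.4f}, val RMSE {val_rmse:.4f}")
```

A large gap between the two numbers for the deeper trees is the typical sign of overfitting, which `max_depth` or `min_samples_split` can then rein in.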

## Train more sophisticated model: Random Forest

This is the classic tree ensemble. It consists of many trees and may perform
better than both a single tree and the baseline.
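
For a small data set, the out-of-bag (OOB) predictions of a random forest offer an error estimate that needs no additional hold-out data: each tree is evaluated on the rows left out of its bootstrap sample. The sketch below assumes synthetic data and an arbitrary choice of 200 trees.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# synthetic stand-in for the sleep data (assumption, not the real data set)
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=1337)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=1337)

forest = RandomForestRegressor(
    n_estimators=200,
    oob_score=True,      # keep out-of-bag predictions for the training rows
    random_state=1337,
)
forest.fit(X_tr, y_tr)

# each OOB prediction averages only the trees that did not see that row
oob_rmse = float(np.sqrt(mean_squared_error(y_tr, forest.oob_prediction_)))
val_rmse = float(np.sqrt(mean_squared_error(y_va, forest.predict(X_va))))

print(f"OOB RMSE: {oob_rmse:.4f}, validation RMSE: {val_rmse:.4f}")
```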

## Train more sophisticated model: Extra Trees

This is an alternative, but very interesting, tree-based ensemble.
I have never used it before, but I am very curious how it will perform.
Its increased randomness (split thresholds are drawn at random instead of being
optimized) is supposed to reduce overfitting, which may be beneficial for the
tiny data set used here.
The data set has just a few hundred rows (a bit more if the integration works),
and the smaller the data set, the larger the risk of overfitting.
Maybe XGBoost will already be overkill here, and Extra Trees will be just right.
I am excited!
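
One way to check this expectation is to put Extra Trees and Random Forest through the same cross-validation and compare the mean RMSE. This is a sketch on synthetic data with default-like settings, both of which are assumptions; which ensemble wins on the real data remains to be seen.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

# synthetic stand-in for the sleep data (assumption, not the real data set)
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=1337)

models = {
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=1337),
    "extra_trees": ExtraTreesRegressor(n_estimators=100, random_state=1337),
}

cv_rmse = {}
for name, model in models.items():
    # sklearn returns negated RMSE so that higher scores are always better
    scores = cross_val_score(
        model, X, y, scoring="neg_root_mean_squared_error", cv=5
    )
    cv_rmse[name] = float(-scores.mean())

for name, value in cv_rmse.items():
    print(f"{name}: mean CV RMSE {value:.4f}")
```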

## Train more sophisticated model: XGBoost

XGBoost is frequently one of the best-performing models, if not the best, in
Kaggle competitions, and something like the crown of the evolution of
tree-based models. It is very powerful, but also prone to overfitting.