# Machine Learning Models

This notebook presents the detailed procedure for building Machine Learning models that classify a patient's obesity level from their lifestyle habits. The hyperparameters are fine-tuned with [Optuna](https://optuna.org) using the training dataset, and the models are evaluated on the evaluation dataset.

Section [1. Preprocessing Pipeline](#1-preprocessing-pipeline) defines the preprocessing: standard scaling is applied to the real-valued features, while the remaining features are left unchanged. The scaling puts all features on a comparable range, which avoids the bias that features with larger magnitudes could otherwise introduce.

Section [2. Models Definition](#2-models-definition) creates the models, prioritizing a diverse set of classification techniques, together with the hyperparameter search space to be fine-tuned with [Optuna](https://optuna.org). It also briefly justifies the choice of hyperparameters to optimize, based on how they control the training flexibility of each model (so that it fits the training set adequately).

Finally, section [3. Models Fitting](#3-models-fitting) tunes the hyperparameters of each model and trains the models with the best hyperparameters found. The models are evaluated with the F1 score because the dataset is imbalanced with respect to `NObeyesdad`.
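
As a small illustration of why an averaged F1 score suits an imbalanced target, the sketch below computes the support-weighted F1 (the averaging mode named in section 3) by hand and compares it with `sklearn.metrics.f1_score`. The labels are invented toy values, not the notebook's data.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy, deliberately imbalanced labels (illustrative only, not the notebook's data)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 2])

# One F1 value per class, in sorted label order
per_class = f1_score(y_true, y_pred, average=None)

# Support = number of true samples of each class
support = np.bincount(y_true)

# Weighted F1 = per-class F1 averaged with the supports as weights
weighted_manual = float(np.sum(per_class * support) / support.sum())
weighted_sklearn = float(f1_score(y_true, y_pred, average='weighted'))

print(per_class, weighted_manual, weighted_sklearn)
```

Each class contributes in proportion to its number of true samples, so the score reflects the class distribution of the evaluation set.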

# 1. Preprocessing Pipeline

After the dataset treatment carried out in the EDA, only standard scaling (standardization) is applied to the numerical features, so that most features lie in a similar range. This benefits both the training and the predictions of the models, because it reduces the bias introduced by features with different scales.
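
The effect of this step can be checked on a toy frame. The sketch below is only an illustration under assumed column names (`Age`, `Weight`, `Encoded` are placeholders, not necessarily the notebook's feature lists): the scaled columns match a manual z-score, and the remaining column passes through unchanged.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy data with hypothetical column names (illustrative only)
toy = pd.DataFrame({
    'Age': [20.0, 30.0, 40.0, 50.0],
    'Weight': [55.0, 70.0, 85.0, 100.0],
    'Encoded': [0, 1, 0, 1],
})
numerical = ['Age', 'Weight']

# Same structure as the notebook's pipeline: scale numerical, pass the rest through
preprocessing = ColumnTransformer(
    [('NumericalFeatures', StandardScaler(), numerical)],
    remainder='passthrough',
)
transformed = preprocessing.fit_transform(toy)

# Manual z-score; StandardScaler uses the population standard deviation (ddof=0)
manual = (toy[numerical] - toy[numerical].mean()) / toy[numerical].std(ddof=0)

print(np.allclose(transformed[:, :2], manual.to_numpy()))
```

The transformed columns come first in the output, followed by the passthrough columns in their original order.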

In [1]:
# Importing auxiliary libraries

import marimo as mo

# Importing libraries

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Importing Functions and Utils

import SourceModels as src

In [2]:
# Defining useful variables

PATH = './'
PATH_SAVE = PATH + 'SaveModels/'

NUM_JOBS = src.GetNumJobs()

RANDOM_STATE = 8013

In [3]:
# Loading datasets

_DatasetFilename = PATH + 'Dataset_{}.csv'

# Reading each dataset explicitly instead of assigning through globals()
Dataset_Train: pd.DataFrame = pd.read_csv(_DatasetFilename.format('Train'),engine='pyarrow')
Dataset_Evaluation: pd.DataFrame = pd.read_csv(_DatasetFilename.format('Evaluation'),engine='pyarrow')

In [4]:
# Splitting features 

NumericalFeatures , CategoricalFeatures , Target = src.SplitFeatures(Dataset_Train)
Features = [*NumericalFeatures,*CategoricalFeatures]

In [5]:
# Preprocessing pipeline

PreprocessingPipeline = ColumnTransformer(
    [
        ('NumericalFeatures',StandardScaler(),NumericalFeatures),
    ],
    remainder='passthrough',
    n_jobs=NUM_JOBS,
)

PreprocessingPipeline

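`ColumnTransformer` parameters:

| Parameter | Value |
|---|---|
| `transformers` | `[('NumericalFeatures', ...)]` |
| `remainder` | `'passthrough'` |
| `sparse_threshold` | `0.3` |
| `n_jobs` | `4` |
| `transformer_weights` | |
| `verbose` | `False` |
| `verbose_feature_names_out` | `True` |
| `force_int_remainder_cols` | `'deprecated'` |

`StandardScaler` parameters:

| Parameter | Value |
|---|---|
| `copy` | `True` |
| `with_mean` | `True` |
| `with_std` | `True` |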

# 2. Models Definition

This section defines the candidate models to be trained. Both linear and nonlinear models are used, to gain flexibility when solving the classification problem. The hyperparameters to be optimized during training and fine-tuning are also defined.

The chosen models form a small collection of techniques and ways of approaching the classification problem, with priority given to their diversity. Specifically, the following were chosen:

* [Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
* [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
* [SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)
* [AdaBoost](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html)

## 2.1. Logistic Regression

To give the hyperparameter fine-tuning more flexibility, the [`elasticnet`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) penalty is used. It is a convex combination of the `l1` and `l2` penalties, so the optimizer can give more weight to whichever penalty suits the classification problem best.
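
The convex combination can be written out directly. The helper below is a hypothetical illustration (it is not part of scikit-learn or of `SourceModels`) that follows the penalty form given in the scikit-learn user guide, where `l1_ratio` weights the `l1` term and `1 - l1_ratio` weights half the squared `l2` norm.

```python
import numpy as np

def elastic_net_penalty(weights, l1_ratio):
    """Hypothetical helper: elastic-net penalty of a coefficient vector.

    l1_ratio = 0 gives the pure l2 term, l1_ratio = 1 the pure l1 term.
    """
    weights = np.asarray(weights, dtype=float)
    l1_term = np.sum(np.abs(weights))        # ||w||_1
    l2_term = 0.5 * np.sum(weights ** 2)     # (1/2) * ||w||_2^2
    return float(l1_ratio * l1_term + (1.0 - l1_ratio) * l2_term)

w = [1.0, -2.0, 0.5]
for ratio in (0.0, 0.5, 1.0):
    print(ratio, elastic_net_penalty(w, ratio))
```

In the search space below, `Model__l1_ratio` moves between these two extremes while `Model__C` sets the inverse strength of the regularization.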

In [6]:
# Defining Logistic Regression model

from sklearn.linear_model import LogisticRegression

LogisticRegression_Model = Pipeline(
    [
        ('Preprocessing',PreprocessingPipeline),
        ('Model',LogisticRegression(
            penalty='elasticnet',
            solver='saga',
            random_state=RANDOM_STATE,
            n_jobs=NUM_JOBS,
            )
        ),
    ]
)

LogisticRegression_Parameters = {
    'Model__C':('float',[1e-10,2]),
    'Model__l1_ratio':('float',[0,1]),
}


LogisticRegression_Model

`Pipeline` parameters:

| Parameter | Value |
|---|---|
| `steps` | `[('Preprocessing', ...), ('Model', ...)]` |
| `transform_input` | |
| `memory` | |
| `verbose` | `False` |

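`LogisticRegression` parameters:

| Parameter | Value |
|---|---|
| `penalty` | `'elasticnet'` |
| `dual` | `False` |
| `tol` | `0.0001` |
| `C` | `1.0` |
| `fit_intercept` | `True` |
| `intercept_scaling` | `1` |
| `class_weight` | |
| `random_state` | `8013` |
| `solver` | `'saga'` |
| `max_iter` | `100` |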


## 2.2. Random Forest

Fine-tuning is performed on the most relevant hyperparameters of [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html), which are `n_estimators`, `max_depth` and `criterion`. They are relevant because they control the overfitting and underfitting of the model.
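
To illustrate how `max_depth` limits model capacity, the sketch below fits forests of different depths on synthetic data from `make_classification`. The data, split and hyperparameter values are assumptions chosen for illustration, so the scores say nothing about the obesity dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic 3-class problem (illustrative only)
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=6, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

results = {}
for max_depth in (1, 4, 12):
    forest = RandomForestClassifier(
        n_estimators=25, max_depth=max_depth, criterion='gini', random_state=0
    )
    forest.fit(X_train, y_train)
    # Depth actually reached by the deepest tree; never exceeds max_depth
    deepest = max(tree.get_depth() for tree in forest.estimators_)
    results[max_depth] = (
        deepest,
        forest.score(X_train, y_train),
        forest.score(X_test, y_test),
    )

for max_depth, (deepest, train_acc, test_acc) in results.items():
    print(max_depth, deepest, round(train_acc, 3), round(test_acc, 3))
```

Shallow trees tend to underfit, while deep trees fit the training split more closely. Comparing the train and test columns shows where the gap starts to open.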

In [7]:
# Defining Random Forest model

from sklearn.ensemble import RandomForestClassifier

RandomForest_Model = Pipeline(
    [
        ('Preprocessing',PreprocessingPipeline),
        ('Model',RandomForestClassifier(
            random_state=RANDOM_STATE,
            n_jobs=NUM_JOBS,
            )
        ),
    ]
)

RandomForest_Parameters = {
    'Model__n_estimators': ('int',[1,100]),
    'Model__max_depth': ('int',[1,12]),
    'Model__criterion': ('categorical',['gini','entropy'])
}


RandomForest_Model

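`RandomForestClassifier` parameters:

| Parameter | Value |
|---|---|
| `n_estimators` | `100` |
| `criterion` | `'gini'` |
| `max_depth` | |
| `min_samples_split` | `2` |
| `min_samples_leaf` | `1` |
| `min_weight_fraction_leaf` | `0.0` |
| `max_features` | `'sqrt'` |
| `max_leaf_nodes` | |
| `min_impurity_decrease` | `0.0` |
| `bootstrap` | `True` |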


## 2.3. Support Vector Machine (SVM)

The most important hyperparameter of [SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) is the `kernel`, because it controls the nonlinearity of the algorithm. For more flexibility during training, the kernel parameters `degree`, `gamma` and `coef0` are tuned as well, together with the regularization parameter `C`.
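
Not every kernel parameter affects every kernel: in `SVC`, `degree` is used only by `poly`, `coef0` by `poly` and `sigmoid`, and `gamma` by all three kernels in the search space. The sketch below checks the kernel formulas numerically with the functions in `sklearn.metrics.pairwise`. The sample points and parameter values are arbitrary.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel, rbf_kernel, sigmoid_kernel

# Arbitrary sample points and kernel parameters (illustrative only)
X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.5]])
gamma, coef0, degree = 0.5, 1.0, 3

K_poly = polynomial_kernel(X, degree=degree, gamma=gamma, coef0=coef0)
K_rbf = rbf_kernel(X, gamma=gamma)
K_sigmoid = sigmoid_kernel(X, gamma=gamma, coef0=coef0)

# Manual versions of the same formulas
gram = X @ X.T                                            # <x, x'>
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)

manual_poly = (gamma * gram + coef0) ** degree            # uses degree, gamma, coef0
manual_rbf = np.exp(-gamma * sq_dists)                    # uses gamma only
manual_sigmoid = np.tanh(gamma * gram + coef0)            # uses gamma, coef0

print(np.allclose(K_poly, manual_poly),
      np.allclose(K_rbf, manual_rbf),
      np.allclose(K_sigmoid, manual_sigmoid))
```

A consequence for the search is that some sampled values are ignored by the chosen kernel, for example `Model__degree` whenever the kernel is `rbf` or `sigmoid`.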

In [8]:
# Defining SVM model

from sklearn.svm import SVC

SVM_Model = Pipeline(
    [
        ('Preprocessing',PreprocessingPipeline),
        ('Model',SVC(
            random_state=RANDOM_STATE,
            )
        ),
    ]
)

SVM_Parameters = {
    'Model__C':('float',[1e-10,2]),
    'Model__kernel':('categorical',['poly','rbf','sigmoid']),
    'Model__degree':('int',[1,5]),
    'Model__gamma':('float',[1e-10,2]),
    'Model__coef0':('float',[0,2]),
}


SVM_Model

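`SVC` parameters:

| Parameter | Value |
|---|---|
| `C` | `1.0` |
| `kernel` | `'rbf'` |
| `degree` | `3` |
| `gamma` | `'scale'` |
| `coef0` | `0.0` |
| `shrinking` | `True` |
| `probability` | `False` |
| `tol` | `0.001` |
| `cache_size` | `200` |
| `class_weight` | |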


## 2.4. Adaptive Boosting (AdaBoost)

AdaBoost is used as an ensemble model whose main hyperparameter to optimize is the number of estimators (`n_estimators`), since it controls the overall underfitting and overfitting of the model.

In [9]:
# Defining AdaBoost model

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

AdaBoost_Model = Pipeline(
    [
        ('Preprocessing',PreprocessingPipeline),
        ('Model',AdaBoostClassifier(
            random_state=RANDOM_STATE,
            )
        ),
    ]
)

base_estimators = [DecisionTreeClassifier(max_depth=depth,random_state=RANDOM_STATE) for depth in range(1,3)]
AdaBoost_Parameters = {
    'Model__estimator':('categorical',base_estimators),
    'Model__n_estimators':('int',[1,100]),
    'Model__learning_rate':('float',[1e-12,2]),
}


AdaBoost_Model

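`AdaBoostClassifier` parameters:

| Parameter | Value |
|---|---|
| `estimator` | |
| `n_estimators` | `50` |
| `learning_rate` | `1.0` |
| `algorithm` | `'deprecated'` |
| `random_state` | `8013` |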


# 3. Models Fitting

With the models and their hyperparameter search spaces defined, model fitting is performed using [Optuna](https://optuna.org) as the framework to search for the best hyperparameters of each model, according to the search space defined in [2. Models Definition](#2-models-definition). The best hyperparameters are selected with the weighted-average F1 score, because the dataset is slightly imbalanced with respect to the target (`NObeyesdad`), as described in [Exploratory Data Analysis](../ExploratoryDataAnalysis/ExploratoryDataAnalysis.ipynb). Finally, the models are trained with the best hyperparameters and then saved.

In [10]:
# Defining containers for models and their params to optimize, 
# and variables for saving best models

ModelsName = [
    'Logistic Regression',
    'Random Forest',
    'SVM',
    'AdaBoost',
]

ModelsParams = [
    (LogisticRegression_Model , LogisticRegression_Parameters),
    (RandomForest_Model , RandomForest_Parameters),
    (SVM_Model , SVM_Parameters),
    (AdaBoost_Model , AdaBoost_Parameters)
]

BestModels = []

In [11]:
# Importing auxiliary modules to ignore warnings

import warnings
from sklearn.exceptions import ConvergenceWarning

# Fine-tuning and training of models

from copy import deepcopy

_NumTrials = 8
_Metric = src.F1_ML
with warnings.catch_warnings():
    warnings.simplefilter('ignore',category=ConvergenceWarning)
    warnings.simplefilter('ignore',category=UserWarning)

    TrainDataset_X = Dataset_Train[Features]
    TrainDataset_y = Dataset_Train[Target]
    EvaluationDataset_X = Dataset_Evaluation[Features]
    EvaluationDataset_y = Dataset_Evaluation[Target]

    for (_model , _params) , _model_name in zip(ModelsParams,ModelsName):
        # Defining optimizer
        _trainer = src.MachinLearningTrainer(
            _model,
            _params,
            _Metric,
        )

        # Fine-tuning of hyperparameters
        print(f' Start Fine-Tuning of {_model_name} '.center(50,'='))
        _best_params = _trainer(
            TrainDataset_X,
            TrainDataset_y,
            EvaluationDataset_X,
            EvaluationDataset_y,
            NumTrials=_NumTrials,
            NumJobs=NUM_JOBS,
        )

        # Training model with the best parameters
        _best_model = deepcopy(_model)
        _best_model.set_params(**_best_params)
        _best_model.fit(TrainDataset_X,TrainDataset_y)
        BestModels.append(deepcopy(_best_model))

    print('\n',' Start Models Evaluation '.center(50,'='))
    for _best_model , _model_name in zip(BestModels,ModelsName):
        _score = _Metric(_best_model,EvaluationDataset_X,EvaluationDataset_y)
        print(f'Best {_model_name} Model obtains :: {_score} Score')

[I 2025-07-21 19:05:33,282] A new study created in memory with name: OptimizeModel


==== Start Fine-Tuning of Logistic Regression ====


[I 2025-07-21 19:05:39,841] Trial 1 finished with value: 0.8918585491471113 and parameters: {'Model__C': 1.278268283081966, 'Model__l1_ratio': 0.09757419902773024}. Best is trial 1 with value: 0.8918585491471113.
[I 2025-07-21 19:05:39,919] Trial 2 finished with value: 0.8987636168211431 and parameters: {'Model__C': 0.747329156057461, 'Model__l1_ratio': 0.7819868691128737}. Best is trial 2 with value: 0.8987636168211431.
[I 2025-07-21 19:05:39,986] Trial 3 finished with value: 0.8898462202790223 and parameters: {'Model__C': 0.5159542289252547, 'Model__l1_ratio': 0.33710572218454904}. Best is trial 2 with value: 0.8987636168211431.
[I 2025-07-21 19:05:40,039] Trial 0 finished with value: 0.9109477924749446 and parameters: {'Model__C': 1.5835742990149535, 'Model__l1_ratio': 0.8118439981759853}. Best is trial 0 with value: 0.9109477924749446.
[I 2025-07-21 19:05:40,301] Trial 4 finished with value: 0.8279363878622307 and parameters: {'Model__C': 0.11462733919244109, 'Model__l1_ratio': 0.2



[I 2025-07-21 19:05:42,037] Trial 2 finished with value: 0.9762768518087762 and parameters: {'Model__n_estimators': 17, 'Model__max_depth': 8, 'Model__criterion': 'entropy'}. Best is trial 2 with value: 0.9762768518087762.
[I 2025-07-21 19:05:42,185] Trial 0 finished with value: 0.9675413677708696 and parameters: {'Model__n_estimators': 46, 'Model__max_depth': 7, 'Model__criterion': 'entropy'}. Best is trial 2 with value: 0.9762768518087762.
[I 2025-07-21 19:05:42,432] Trial 3 finished with value: 0.8520143673661975 and parameters: {'Model__n_estimators': 45, 'Model__max_depth': 3, 'Model__criterion': 'entropy'}. Best is trial 2 with value: 0.9762768518087762.
[I 2025-07-21 19:05:42,492] Trial 1 finished with value: 0.9764299883078417 and parameters: {'Model__n_estimators': 68, 'Model__max_depth': 9, 'Model__criterion': 'entropy'}. Best is trial 1 with value: 0.9764299883078417.
[I 2025-07-21 19:05:42,787] Trial 4 finished with value: 0.5883258154688386 and parameters: {'Model__n_estim



[I 2025-07-21 19:05:44,692] Trial 0 finished with value: 0.9192106189863334 and parameters: {'Model__C': 0.5072913982795977, 'Model__kernel': 'poly', 'Model__degree': 4, 'Model__gamma': 1.0668441039205656, 'Model__coef0': 1.2024623641753713}. Best is trial 2 with value: 0.9285378778537835.
[I 2025-07-21 19:05:44,696] Trial 2 finished with value: 0.9285378778537835 and parameters: {'Model__C': 1.6804613641256672, 'Model__kernel': 'poly', 'Model__degree': 2, 'Model__gamma': 0.7038475118661064, 'Model__coef0': 0.6703625213056372}. Best is trial 2 with value: 0.9285378778537835.
[I 2025-07-21 19:05:44,823] Trial 1 finished with value: 0.14591234385075363 and parameters: {'Model__C': 1.3194978297504647, 'Model__kernel': 'sigmoid', 'Model__degree': 2, 'Model__gamma': 1.9662102809548283, 'Model__coef0': 1.9552425232024104}. Best is trial 2 with value: 0.9285378778537835.
[I 2025-07-21 19:05:44,939] Trial 3 finished with value: 0.1306133392369995 and parameters: {'Model__C': 0.5118037321591196



[I 2025-07-21 19:05:45,384] Trial 1 finished with value: 0.15481256675878674 and parameters: {'Model__estimator': DecisionTreeClassifier(max_depth=1, random_state=8013), 'Model__n_estimators': 3, 'Model__learning_rate': 1.314918328694236}. Best is trial 1 with value: 0.15481256675878674.
[I 2025-07-21 19:05:46,236] Trial 2 finished with value: 0.4685265828064599 and parameters: {'Model__estimator': DecisionTreeClassifier(max_depth=1, random_state=8013), 'Model__n_estimators': 39, 'Model__learning_rate': 0.6444380111669017}. Best is trial 2 with value: 0.4685265828064599.
[I 2025-07-21 19:05:46,568] Trial 4 finished with value: 0.5914744650559728 and parameters: {'Model__estimator': DecisionTreeClassifier(max_depth=1, random_state=8013), 'Model__n_estimators': 46, 'Model__learning_rate': 0.8327294241845509}. Best is trial 4 with value: 0.5914744650559728.
[I 2025-07-21 19:05:47,194] Trial 0 finished with value: 0.9671976438587336 and parameters: {'Model__estimator': DecisionTreeClassifi


Best Logistic Regression Model obtains :: 0.9109477924749446 Score
Best Random Forest Model obtains :: 0.9793491204577431 Score
Best SVM Model obtains :: 0.9314272332080897 Score
Best AdaBoost Model obtains :: 0.9732093824605369 Score


In [12]:
# Saving models

for _best_model , _model_name in zip(BestModels,ModelsName):
    print('\n',f' Start Save {_model_name} Model '.center(50,'='))
    # src.SaveModelML(_best_model,PATH_SAVE,_model_name.replace(' ',''))





