# Obesity Estimation - Feature Engineering and Machine Learning Models

### In this notebook we implement several machine learning classification algorithms and select the best-performing model
- Decision Tree
- Random Forest
- KNN
- XGBClassifier


### We will use the following steps:
##### 1. Split the Dataset:
Split the dataset into training and test sets before applying transformations to avoid data leakage.
##### 2. Preprocessing:
- Fit the one-hot encoder on the training set only, using `fit_transform`.
- Use the same encoder to transform the test set with `transform`, ensuring consistency.
- Handle unknown categories with `handle_unknown='ignore'`.
- Fit a StandardScaler on the numerical columns of the training set.
- Use the same scaler to transform the test set with `transform`.
##### 3. Label Encoding for Target Variable:
Apply label encoding to the target column so that the label mapping is the same for the training and test splits.
This is safe because the encoder only maps class names to integers and uses no information from the features.

##### 4. Save Encoders/Scalers:
Save the fitted encoders and scalers to ensure consistent transformations for future data.

##### 5. Perform GridSearchCV with 5-fold cross-validation to compare the accuracies.

##### 6. Explore Feature Importance

##### 7. Tune Hyperparameters
Tune hyperparameters on the models that performed best in GridSearch.
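
The step on saving encoders and scalers can be sketched as follows. This is a minimal illustration that assumes `joblib` (installed alongside scikit-learn) is used for persistence; the toy data and the file name `scaler.joblib` are hypothetical and not part of this notebook.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy numeric data standing in for the training features (hypothetical values)
X_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

# Fit the scaler on the "training" data only
scaler = StandardScaler().fit(X_toy)

# Persist the fitted scaler; the file name is an arbitrary choice for this sketch
path = os.path.join(tempfile.mkdtemp(), "scaler.joblib")
joblib.dump(scaler, path)

# Later (for example at inference time), reload it and apply the same transformation
restored = joblib.load(path)
new_rows = np.array([[2.5, 25.0]])
same_output = np.allclose(scaler.transform(new_rows), restored.transform(new_rows))
print(same_output)
```

The same pattern should apply to a fitted `ColumnTransformer` or `LabelEncoder`, since both are ordinary picklable scikit-learn objects.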

In [341]:
import pandas as pd

In [355]:
# Read data and convert to a dataframe
clean_data_df = pd.read_csv(r'../data/clean_data.csv')

clean_data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2087 entries, 0 to 2086
Data columns (total 17 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Gender                     2087 non-null   object 
 1   Age                        2087 non-null   float64
 2   Height                     2087 non-null   float64
 3   Weight                     2087 non-null   float64
 4   Family_History             2087 non-null   object 
 5   High_Cal_Foods_Frequently  2087 non-null   object 
 6   Freq_Veg                   2087 non-null   float64
 7   Num_Meals                  2087 non-null   float64
 8   Snacking                   2087 non-null   object 
 9   Smoke                      2087 non-null   object 
 10  Water_Intake               2087 non-null   float64
 11  Calorie_Monitoring         2087 non-null   object 
 12  Phys_Activity              2087 non-null   float64
 13  Tech_Use                   2087 non-null   float

In [356]:
# Feature columns by preprocessing type: categorical and continuous
cat_cols = ['Gender', 'Family_History', 'High_Cal_Foods_Frequently', 'Snacking','Smoke', 'Calorie_Monitoring', 'Freq_Alcohol', 'Transportation']

num_cols = ['Age', 'Height', 'Weight', 'Freq_Veg', 'Num_Meals','Water_Intake', 'Phys_Activity', 'Tech_Use']


### Define dataframes X and y 

In [357]:
X = clean_data_df.drop('Obesity_Level',axis=1)  
y = clean_data_df['Obesity_Level'] 

X.shape, y.shape

((2087, 16), (2087,))

### 1. Train test split - stratified splitting
Stratified splitting keeps the class proportions of the target variable approximately the same in the training and test sets.

A purely random split can leave each subset with different class proportions, especially for small datasets, so we pass `stratify=y` to preserve them.

In [358]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

In [359]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((1669, 16), (418, 16), (1669,), (418,))
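
One way to check the effect of stratification is to compare class proportions across the splits. The sketch below uses a small, hypothetical imbalanced target rather than the obesity data, so the variable names and values are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80 samples of class "a", 20 of class "b"
y_toy = pd.Series(["a"] * 80 + ["b"] * 20)
X_toy = pd.DataFrame({"feature": range(100)})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, test_size=0.2, random_state=42
)

# Class proportions in each split; with stratify they mirror the full dataset
train_props = y_tr.value_counts(normalize=True).sort_index()
test_props = y_te.value_counts(normalize=True).sort_index()
print(train_props)
print(test_props)
```

Running the same comparison on `y_train` and `y_test` from the cell above would show whether the obesity classes are balanced across the two splits.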

### 2. Preprocess the Data

In [None]:
import sklearn
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

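# Named steps give the output columns their `one_hot__` and `standard_scaler__` prefixes.
# handle_unknown='ignore' encodes categories unseen during fit as all zeros instead of raising an error.
transformer = ColumnTransformer(
    transformers=[
        ('one_hot', OneHotEncoder(handle_unknown='ignore'), cat_cols),
        ('standard_scaler', StandardScaler(), num_cols)])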

transformer

  - X_train Preprocessing

In [362]:
X_train= transformer.fit_transform(X_train)

In [363]:
print(X_train.shape)
X_train[0]

(1669, 31)


array([ 0.        ,  1.        ,  0.        ,  1.        ,  0.        ,
        1.        ,  0.        ,  0.        ,  1.        ,  0.        ,
        1.        ,  0.        ,  1.        ,  0.        ,  0.        ,
        0.        ,  1.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  1.        ,  0.        ,  0.24828996,  0.70553681,
        1.04502551, -0.37588513,  0.39180083,  0.25310232,  0.30853754,
       -0.45291571])

- X_test Encoding

In [None]:
# Transform the test set with the transformer fitted on the training set
X_test_transformed = transformer.transform(X_test)
# Wrap the result in a DataFrame, reusing the original index so rows stay aligned
transformed_df = pd.DataFrame(X_test_transformed, columns=transformer.get_feature_names_out(), index=X_test.index)
# Side-by-side view of the raw numeric columns and the transformed columns (for inspection only)
comparison_df = pd.concat([X_test.drop(cat_cols, axis=1), transformed_df], axis=1)
# Keep only the transformed features so X_test has the same 31 columns as X_train
X_test = X_test_transformed
# Checking result
comparison_df.head()

Unnamed: 0,Age,Height,Weight,Freq_Veg,Num_Meals,Water_Intake,Phys_Activity,Tech_Use,one_hot__Gender_Female,one_hot__Gender_Male,...,one_hot__Transportation_Public_Transportation,one_hot__Transportation_Walking,standard_scaler__Age,standard_scaler__Height,standard_scaler__Weight,standard_scaler__Freq_Veg,standard_scaler__Num_Meals,standard_scaler__Water_Intake,standard_scaler__Phys_Activity,standard_scaler__Tech_Use
1153,19.955257,1.5891,72.713611,3.0,3.856434,2.0,1.32417,1.0,1.0,0.0,...,1.0,0.0,-0.690071,-1.202029,-0.537291,1.08428,1.51203,-0.000656,0.376859,0.550747
132,30.0,1.77,109.0,3.0,3.0,1.0,2.0,0.0,0.0,1.0,...,0.0,0.0,0.89981,0.736866,0.848928,1.08428,0.391801,-1.644954,1.182627,-1.095082
1923,20.601222,1.738717,128.114161,3.0,3.0,1.797041,1.427413,0.966181,1.0,0.0,...,1.0,0.0,-0.587828,0.401573,1.579131,1.08428,0.391801,-0.334381,0.499952,0.495086
846,16.950499,1.603501,65.0,2.96008,1.0,2.0,0.736032,1.344072,1.0,0.0,...,1.0,0.0,-1.165664,-1.047678,-0.831968,1.008997,-2.224232,-0.000656,-0.324357,1.11703
1246,29.506287,1.82697,108.751502,2.0,2.877583,2.358038,0.877295,1.817146,0.0,1.0,...,0.0,0.0,0.821665,1.347473,0.839435,-0.801583,0.231677,0.588066,-0.155934,1.895629


#### 3.  Apply Label Encoder to y_train

In [336]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y_train_encoded = le.fit_transform(y_train)
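
The test labels need the same mapping before any evaluation on the held-out set. A minimal sketch, assuming the encoder fitted on the training labels is reused; the label values below are hypothetical stand-ins for the obesity classes.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels standing in for the obesity classes
y_train_toy = ["Normal", "Obese", "Overweight", "Normal"]
y_test_toy = ["Obese", "Normal"]

le_toy = LabelEncoder()
y_train_enc = le_toy.fit_transform(y_train_toy)  # learn the mapping on the training labels
y_test_enc = le_toy.transform(y_test_toy)        # reuse the same mapping for the test labels

# Map integer labels (or predictions) back to the original class names
decoded = le_toy.inverse_transform(y_test_enc)
print(list(le_toy.classes_), y_test_enc, decoded)
```

In this notebook the analogous call would be `le.transform(y_test)`, which works as long as every test class also appears in the training labels.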


### 4. Classifier Models using GridSearch

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier

In [None]:
models={'RandomForest':RandomForestClassifier(),
        'DecisionTree':DecisionTreeClassifier(),
        'KNeighbors':KNeighborsClassifier(),
        'xgbc': XGBClassifier()}

In [339]:
param_grids = {
    'RandomForest': {
        'n_estimators': [100, 200],
        'max_depth': [None, 10, 20]
    },
    'DecisionTree': {
        'max_depth': [None, 10, 20],
        'min_samples_split': [2, 5, 10]
    },
    'KNeighbors': {
        'n_neighbors': [3, 5, 7]
    },
    'xgbc': {
        'n_estimators': [600],
        'learning_rate': [0.03],
        'objective': ['multi:softmax'],
        'verbosity': [0],
        'nthread': [-1],
        'random_state': [42]
    }
}

#### Model Results from GridSearch

In [325]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, precision_score, recall_score, f1_score
scoring = {
    'accuracy': 'accuracy',
    'precision': make_scorer(precision_score, average='weighted'),
    'recall': make_scorer(recall_score, average='weighted'),
    'f1': make_scorer(f1_score, average='weighted')}
best_models = {}

for model in models:
    print(f"\nRunning GridSearch for {model}...")
    gsv = GridSearchCV(
        estimator=models[model],
        param_grid=param_grids[model],
        cv=5,
        scoring=scoring,
        refit='accuracy'  # Primary metric for model selection
    )
    gsv.fit(X_train, y_train_encoded)
    best_models[model] = gsv.best_estimator_
    best_index = gsv.best_index_
    print(f'Best parameters for {model}: {gsv.best_params_}')
    print(f'Best accuracy: {gsv.cv_results_["mean_test_accuracy"][best_index]:.4f}')
    print(f'Best precision: {gsv.cv_results_["mean_test_precision"][best_index]:.4f}')
    print(f'Best recall: {gsv.cv_results_["mean_test_recall"][best_index]:.4f}')


Running GridSearch for RandomForest...
Best parameters for RandomForest: {'max_depth': 20, 'n_estimators': 200}
Best accuracy: 0.9377
Best precision: 0.9420
Best recall: 0.9377

Running GridSearch for DecisionTree...
Best parameters for DecisionTree: {'max_depth': 20, 'min_samples_split': 2}
Best accuracy: 0.9191
Best precision: 0.9200
Best recall: 0.9191

Running GridSearch for KNeighbors...
Best parameters for KNeighbors: {'n_neighbors': 3}
Best accuracy: 0.8388
Best precision: 0.8390
Best recall: 0.8388
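
The scores above are cross-validation averages on the training split. A natural follow-up is to score the refit best estimator on held-out data. The sketch below shows that pattern on synthetic data (not the obesity dataset) with a deliberately small grid, so all names and values in it are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the obesity dataset)
X_syn, y_syn = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, stratify=y_syn, test_size=0.2, random_state=0
)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [None, 5, 10]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

# best_estimator_ is refit on the whole training split; score it on unseen data
test_accuracy = accuracy_score(y_te, search.best_estimator_.predict(X_te))
print(search.best_params_, round(search.best_score_, 4), round(test_accuracy, 4))
```

Applied here, the equivalent step would call `best_models[model].predict(X_test)` and compare against the label-encoded test target.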


### 5. Feature Importance

In [None]:
# Feature names after preprocessing, used to label the importances
col = transformer.get_feature_names_out()

In [None]:
from sklearn.ensemble import RandomForestClassifier
import seaborn as sns
import plotly.express as px

from matplotlib import pyplot as plt
from sklearn.model_selection import cross_val_score

palette = ['#008080','#FF6347', '#E50000', '#D2691E'] # Creating color palette for plots

clf = RandomForestClassifier(max_depth=8, min_samples_leaf=3, min_samples_split=3, n_estimators=5000, random_state=13)
clf = clf.fit(X_train, y_train_encoded)

# Pair each importance with its feature name and sort from most to least important
fimp = pd.Series(data=clf.feature_importances_, index=col).sort_values(ascending=False)
plt.figure(figsize=(17,13))
plt.title("Feature importance")
ax = sns.barplot(y=fimp.index, x=fimp.values, palette=palette, orient='h')