# Predicting obesity level based on eating habits and physical condition

In [1]:
# import libraries
import numpy as np
import pandas as pd
import pandera as pa
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.pipeline import make_pipeline

In [2]:
# install necessary packages
# leave installation commented when rerunning script
#!pip install ucimlrepo

# Summary

In this study, we develop a classification model that determines whether an individual is obese and, if so, categorizes the level of obesity. We trained and evaluated three machine learning models: K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and Decision Tree enhanced with AdaBoost. SVM and the Decision Tree with AdaBoost both reached an accuracy of around 97%, making them the most effective models for this classification task. KNN performed noticeably worse, with an accuracy of 87%.


# Introduction

Obesity is a complex public health and medical challenge that has become a global issue, with severe negative impacts on both health and the economy. The condition is associated with various medical and psychological complications that significantly affect individuals' health and social well-being. The World Health Organization (WHO) defines obesity as an excessive accumulation of body fat that poses a risk to health (World Health Organization, 2023). In practice, body mass index (BMI), a widely used indicator of body fat, is employed to classify obesity: under WHO guidelines, adults with a BMI of 30 kg/m² or higher are categorized as obese. Those living with obesity often face persistent stigma and discrimination, which further heightens their risk of disease and mortality (Afshin et al., 2017).
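
As a small illustration of the BMI-based definition above, the sketch below computes BMI as weight divided by height squared and maps it to the standard WHO adult bands. The function names `bmi` and `who_category` are hypothetical, and the four bands are coarser than the seven obesity levels used as the target in this dataset.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Return body mass index in kg/m^2 (weight divided by height squared)."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


def who_category(bmi_value: float) -> str:
    """Map a BMI value to the WHO adult bands (illustrative helper)."""
    if bmi_value < 18.5:
        return "underweight"
    if bmi_value < 25:
        return "normal weight"
    if bmi_value < 30:
        return "overweight"
    return "obese"


# Example values: 87 kg and 1.80 m
example_bmi = bmi(87.0, 1.80)
print(round(example_bmi, 2), who_category(example_bmi))
```

For an individual weighing 87 kg at 1.80 m, the BMI is roughly 26.85, which falls in the overweight band.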

Traditional methods for identifying and managing obesity often rely on clinical measurements like body mass index (BMI), which, while effective, can be time-consuming and resource-intensive (World Health Organization, 2023). Machine learning (ML), a subset of artificial intelligence, has emerged as a promising tool in healthcare, capable of analyzing complex patterns in large datasets (Frontiers in Endocrinology, 2022). By leveraging predictive models, ML can enhance the detection and management of obesity by identifying at-risk individuals, uncovering hidden risk factors, and enabling personalized interventions (Shungin et al., 2015). This approach not only streamlines the diagnostic process but also opens the door to more accurate and scalable solutions for tackling obesity (Frontiers in Endocrinology, 2022).

# Methods

### Data

This dataset includes data for estimating obesity levels in individuals from Mexico, Peru, and Colombia, based on their eating habits and physical condition. The data contains 17 attributes and 2111 records. Each record is labeled with the class variable NObeyesdad (obesity level), which takes the values Insufficient Weight, Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I, Obesity Type II, and Obesity Type III. Of the data, 77% was generated synthetically using the Weka tool and the SMOTE filter, and the remaining 23% was collected directly from users through a web platform.

### Analysis

In this study, we trained three machine learning models—Decision Tree enhanced with AdaBoost, Support Vector Machine (SVM) with an RBF kernel, and K-Nearest Neighbors (KNN)—to predict obesity outcomes. The dataset was divided into training (70%) and testing (30%) sets to ensure reliable evaluation of model performance. Each model underwent hyperparameter tuning to optimize its predictive capabilities, utilizing a grid search approach to explore various combinations of hyperparameters.

For KNN, we tuned the number of neighbors (n_neighbors: 3, 5, 7, or 9), the weight function (uniform or distance), and the distance metric (euclidean or manhattan). These settings control how KNN classifies data points based on their proximity to others. For the SVM, the regularization parameter (C) took the values 0.1, 1, 10, and 100 to balance classification error against margin maximization, and the kernel coefficient (gamma) was set to 'scale', 'auto', or one of the numeric values 0.01, 0.1, and 1 to control the influence of individual data points. Finally, for the AdaBoost-enhanced Decision Tree, the number of estimators (n_estimators) was set to 100, 150, or 200, the learning rate to 0.3, 0.5, or 0.7, and the depth of the base estimator (estimator__max_depth) to values from 5 to 9, to improve the model's capacity to capture complex patterns in the data.
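
To give a sense of the search cost, the sketch below counts the hyperparameter combinations in each grid described above and multiplies by the 5 cross-validation folds used in this notebook. It uses only the standard library; the variable names are illustrative.

```python
from math import prod

# Grids as described in the text above
param_grids = {
    "KNN": {
        "n_neighbors": [3, 5, 7, 9],
        "weights": ["uniform", "distance"],
        "metric": ["euclidean", "manhattan"],
    },
    "SVM (RBF Kernel)": {
        "C": [0.1, 1, 10, 100],
        "gamma": ["scale", "auto", 0.01, 0.1, 1],
    },
    "AdaBoost + Decision Tree": {
        "n_estimators": [100, 150, 200],
        "learning_rate": [0.3, 0.5, 0.7],
        "estimator__max_depth": [5, 6, 7, 8, 9],
    },
}

n_folds = 5  # matches cv=5 in the grid search

# A full grid search tries every combination, so the candidate count
# is the product of the number of options per hyperparameter.
grid_sizes = {
    name: prod(len(values) for values in grid.values())
    for name, grid in param_grids.items()
}

for name, n_candidates in grid_sizes.items():
    print(f"{name}: {n_candidates} candidates, {n_candidates * n_folds} CV fits")
```

This gives 16, 20, and 45 candidates respectively, so the AdaBoost grid is by far the most expensive of the three to search.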

### Results & Discussion

After tuning the hyperparameters, both the SVM and AdaBoost-enhanced Decision Tree models performed exceptionally well, achieving an accuracy of around 97%. In contrast, KNN, despite its adjustments, achieved a lower accuracy of 87%. This performance difference suggests that ensemble methods like AdaBoost, which combine the predictions of multiple models, and kernel-based methods like SVM, which use a non-linear approach to classify data, are more effective in handling the complexities of obesity classification compared to KNN, which relies on simpler distance-based logic.



In [3]:
# import dataset

from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
estimation_of_obesity_levels_based_on_eating_habits_and_physical_condition = fetch_ucirepo(id=544) 
  
# data (as pandas dataframes) 
features = estimation_of_obesity_levels_based_on_eating_habits_and_physical_condition.data.features
target = estimation_of_obesity_levels_based_on_eating_habits_and_physical_condition.data.targets 

# create a merged dataframe
merged_df = pd.concat([features, target], axis = 1)

In [4]:
# View dataframe
merged_df

,Gender,Age,Height,Weight,family_history_with_overweight,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,NObeyesdad
0,Female,21.000000,1.620000,64.000000,yes,no,2.0,3.0,Sometimes,no,2.000000,no,0.000000,1.000000,no,Public_Transportation,Normal_Weight
1,Female,21.000000,1.520000,56.000000,yes,no,3.0,3.0,Sometimes,yes,3.000000,yes,3.000000,0.000000,Sometimes,Public_Transportation,Normal_Weight
2,Male,23.000000,1.800000,77.000000,yes,no,2.0,3.0,Sometimes,no,2.000000,no,2.000000,1.000000,Frequently,Public_Transportation,Normal_Weight
3,Male,27.000000,1.800000,87.000000,no,no,3.0,3.0,Sometimes,no,2.000000,no,2.000000,0.000000,Frequently,Walking,Overweight_Level_I
4,Male,22.000000,1.780000,89.800000,no,no,2.0,1.0,Sometimes,no,2.000000,no,0.000000,0.000000,Sometimes,Public_Transportation,Overweight_Level_II
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2106,Female,20.976842,1.710730,131.408528,yes,yes,3.0,3.0,Sometimes,no,1.728139,no,1.676269,0.906247,Sometimes,Public_Transportation,Obesity_Type_III
2107,Female,21.982942,1.748584,133.742943,yes,yes,3.0,3.0,Sometimes,no,2.005130,no,1.341390,0.599270,Sometimes,Public_Transportation,Obesity_Type_III
2108,Female,22.524036,1.752206,133.689352,yes,yes,3.0,3.0,Sometimes,no,2.054193,no,1.414209,0.646288,Sometimes,Public_Transportation,Obesity_Type_III
2109,Female,24.361936,1.739450,133.346641,yes,yes,3.0,3.0,Sometimes,no,2.852339,no,1.139107,0.586035,Sometimes,Public_Transportation,Obesity_Type_III


In [5]:
# View shape of dataframe 
print(f"The dataframe has {merged_df.shape[0]} instances and {merged_df.shape[1]} features.")

The dataframe has 2111 instances and 17 features.


In [6]:
# Review summary information of dataframe, dtypes, column names 
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2111 entries, 0 to 2110
Data columns (total 17 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Gender                          2111 non-null   object 
 1   Age                             2111 non-null   float64
 2   Height                          2111 non-null   float64
 3   Weight                          2111 non-null   float64
 4   family_history_with_overweight  2111 non-null   object 
 5   FAVC                            2111 non-null   object 
 6   FCVC                            2111 non-null   float64
 7   NCP                             2111 non-null   float64
 8   CAEC                            2111 non-null   object 
 9   SMOKE                           2111 non-null   object 
 10  CH2O                            2111 non-null   float64
 11  SCC                             2111 non-null   object 
 12  FAF                             21

Original column names are retrieved from https://archive.ics.uci.edu/dataset/544/estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition

In [None]:
# Create variable with features listed in the documentation cited in cell above 
original_column_names = ['Gender', 'Age', 'Height', 'Weight', 'family_history_with_overweight',
       'FAVC', 'FCVC', 'NCP', 'CAEC', 'SMOKE', 'CH2O', 'SCC', 'FAF', 'TUE',
       'CALC', 'MTRANS', 'NObeyesdad']

In [None]:
# Data Validation: Check columns names
assert sorted(original_column_names) == sorted(merged_df.columns), "merged_df does not have the same column names as original_column_names"

In [None]:
merged_df.info()

In [7]:
# Rename columns into meaningful names
# create dictionary for renaming

rename_dict = {
    'Gender': 'gender',
    'Age': 'age',
    'Height': 'height',
    'Weight': 'weight',
    'FAVC': 'frequent_high_calorie_intake',
    'FCVC': 'vegetable_intake_in_meals',
    'NCP': 'meals_per_day',
    'CAEC': 'food_intake_between_meals',
    'SMOKE': 'smoker',
    'CH2O': 'daily_water_intake',
    'SCC': 'monitor_calories',
    'FAF': 'days_per_week_with_physical_activity',
    'TUE': 'daily_screen_time',
    'CALC': 'frequent_alcohol_intake',
    'MTRANS': 'mode_of_transportation',
    'NObeyesdad': 'obesity_level'
}

merged_df = merged_df.rename(columns = rename_dict)
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2111 entries, 0 to 2110
Data columns (total 17 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   gender                                2111 non-null   object 
 1   age                                   2111 non-null   float64
 2   height                                2111 non-null   float64
 3   weight                                2111 non-null   float64
 4   family_history_with_overweight        2111 non-null   object 
 5   frequent_high_calorie_intake          2111 non-null   object 
 6   vegetable_intake_in_meals             2111 non-null   float64
 7   meals_per_day                         2111 non-null   float64
 8   food_intake_between_meals             2111 non-null   object 
 9   smoker                                2111 non-null   object 
 10  daily_water_intake                    2111 non-null   float64
 11  monitor_calories 

In [None]:
# Data Validation
schema = pa.DataFrameSchema(
    {
        # Columns are passed as a dict mapping column name to its checks
        "gender": pa.Column(str, pa.Check.isin(["Female", "Male"])),
        "age": pa.Column(float, pa.Check.between(10, 99), nullable=True),
        "height": pa.Column(float, pa.Check.between(1.0, 2.5), nullable=True),
        "weight": pa.Column(float, pa.Check.between(30, 200), nullable=True),
        "family_history_with_overweight": pa.Column(str, pa.Check.isin(["yes", "no"])),
        "frequent_high_calorie_intake": pa.Column(str, pa.Check.isin(["yes", "no"])),
    }
)

## EDA

In [None]:
# A visualization of the dataset that is relevant for EDA
# Use Altair for visualization
import altair as alt

In [None]:
# Plot 1: Target variable distribution 
target_distribution = alt.Chart(merged_df).mark_bar().encode(
    x=alt.X('obesity_level:N', title='Obesity Level'),
    y=alt.Y('count():Q', title='Count'),
    color=alt.Color('obesity_level')
).properties(
    title='Figure 1. Distribution of Target: Obesity Level',
    width=400,
    height=300
)
target_distribution

In [None]:
# Plot 2: Correlation between numeric features

# Keep numeric features
numeric_data = merged_df.select_dtypes(include=['float64', 'int64'])

# Calculate the correlation between features
correlation_data = numeric_data.corr().reset_index().melt('index')

# Plot heatmap
correlation_chart = alt.Chart(correlation_data).mark_rect().encode(
    x=alt.X('variable:N', title='Features'),
    y=alt.Y('index:N', title='Features'),
    color=alt.Color(
        'value:Q',
        scale=alt.Scale(
            domain=[-1, 0, 1],  # correlations range from -1 to 1
            range=['#B35806', '#F7F7F7', '#0050B3']  # negative, zero, positive
        ),
        title='Correlation'
    ),
    tooltip=['index', 'variable', 'value'] 
).properties(
    title='Figure 2. Numeric Feature Correlation Heatmap',
    width=400,
    height=400
)

correlation_chart

In [None]:
# Plot 3: Relationship between continuous features and target

alt.data_transformers.enable('vegafusion')

numeric_features_facet = alt.Chart(
    merged_df.melt(id_vars=['obesity_level'], value_vars=numeric_data.columns.tolist())
).mark_boxplot().encode(
    x=alt.X('obesity_level:N', title='Obesity Level'),
    y=alt.Y(
        'value:Q',
        title='Value',
        scale=alt.Scale(padding=5)  
    ),
    color=alt.Color('obesity_level'),
    facet=alt.Facet('variable:N', title='Features', columns=2)  
).properties(
    title='Figure 3. Distribution of Numeric Features by Obesity Level',
    width=200,
    height=200
).resolve_scale(
    y='independent'  
)

numeric_features_facet


In [None]:
# Plot 4: Relationship between categorical features and target

# Select binary/categorical features (excluding numeric features)
categorical_features = merged_df.select_dtypes(exclude=['float64', 'int64']).columns.difference(['obesity_level'])

# Create a list to store individual charts
charts = []

# Loop through each categorical feature
for feature in categorical_features:
    # Create a bar chart for the current feature
    chart = alt.Chart(merged_df).mark_bar().encode(
        x=alt.X(f"{feature}:N", title=feature, sort='-y'),  # Sort categories by count (descending)
        y=alt.Y('count():Q', title='Count'),
        color=alt.Color('obesity_level:N', title='Obesity Level'),
        order=alt.Order('count():Q', sort='descending')  # Sort stack order by count, descending
    ).properties(
        title=f'{feature} by Obesity Level',
        width=200,
        height=200
    )
    charts.append(chart)

# Arrange charts in rows of 2 columns
rows = [alt.hconcat(*charts[i:i+2]) for i in range(0, len(charts), 2)]

# Combine all rows into a vertical concatenation
combined_chart = alt.vconcat(*rows).properties(
    title="Figure 4. Relationship between Categorical Features and Target"
).configure_title(
    fontSize=16,
    anchor='middle',  
    font='Arial'
)

# Display the combined chart
combined_chart

## Classification Analysis

In [None]:
from sklearn.model_selection import train_test_split, cross_val_score, cross_validate
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

# Separate features and target
X = merged_df.drop('obesity_level', axis=1)
y = merged_df['obesity_level']

# Identify categorical and numerical columns
categorical_cols = X.select_dtypes(include=['object']).columns
numerical_cols = X.select_dtypes(include=['float64']).columns

In [None]:
# Encode target variable
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

# Preprocessing for numeric and categorical data
numeric_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(drop='first', handle_unknown='ignore')

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ]
)
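
As a rough illustration of what the two transformers above compute, here is a pure-Python sketch of standardization and one-hot encoding with the first category dropped. The helper names `standardize` and `one_hot_drop_first` are hypothetical; the notebook itself relies on the scikit-learn classes, and the input values below are made up for the example.

```python
from statistics import mean, pstdev


def standardize(values):
    """Center to mean 0 and scale to unit variance (population std, as StandardScaler does)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mu) / sigma for v in values]


def one_hot_drop_first(values):
    """One indicator column per category, omitting the first category in sorted order."""
    categories = sorted(set(values))
    kept = categories[1:]  # the dropped category is implied when all indicators are 0
    return [[1 if v == c else 0 for c in kept] for v in values]


heights = [1.5, 1.6, 1.7]          # made-up numeric column
history = ["yes", "no", "yes"]     # made-up binary column

scaled = standardize(heights)
encoded = one_hot_drop_first(history)

print([round(z, 3) for z in scaled])
print(encoded)
```

Dropping the first category means a binary yes/no column becomes a single 0/1 indicator rather than two redundant columns.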

In [None]:
# Splitting data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=522, stratify=y)

In [None]:
# Models to evaluate
models = {
    'KNN': KNeighborsClassifier(),
    'SVM (RBF Kernel)': SVC(kernel='rbf', random_state=522),
    'AdaBoost + Decision Tree': AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=5, random_state=522), 
        n_estimators=50,  
        learning_rate=0.5,
        algorithm="SAMME",
        random_state=522
    )
}

In [None]:
# Code adapted from DSCI 571 Lecture 4
def mean_std_cross_val_scores(model, X_train, y_train, **kwargs):
    """
    Returns mean and std of cross validation

    Parameters
    ----------
    model :
        scikit-learn model
    X_train : numpy array or pandas DataFrame
        X in the training data
    y_train :
        y in the training data

    Returns
    ----------
        pandas Series with mean scores from cross_validation
    """
    scores = cross_validate(model, X_train, y_train, return_train_score=True, **kwargs)

    mean_scores = pd.DataFrame(scores).mean()
    std_scores = pd.DataFrame(scores).std()
    out_col = []

    for i in range(len(mean_scores)):
        out_col.append(f"{mean_scores.iloc[i]:0.3f} (+/- {std_scores.iloc[i]:0.3f})")

    return pd.Series(data=out_col, index=mean_scores.index)
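
To make the "mean (+/- std)" summaries produced by the helper above concrete, the sketch below formats one such string from a list of fold scores. The scores are hypothetical values chosen for illustration, not results from this notebook. The sample standard deviation is used because that is the pandas default for `.std()`.

```python
from statistics import mean, stdev

# Hypothetical accuracy scores from 5 cross-validation folds
fold_scores = [0.90, 0.92, 0.94, 0.96, 0.98]

# Sample standard deviation (n - 1 in the denominator)
summary = f"{mean(fold_scores):0.3f} (+/- {stdev(fold_scores):0.3f})"
print(summary)
```

A small spread relative to the mean suggests the model's accuracy is stable across folds.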

In [None]:
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module="sklearn.preprocessing")

# Evaluate models with detailed cross-validation results
results = {}

for name, model in models.items():
    pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', model)])
    
    results[name] = mean_std_cross_val_scores(
        pipeline, X_train, y_train, cv=5, scoring='accuracy'
    )

# Combine results into a DataFrame for comparison
results_df = pd.DataFrame(results)

# Display the results
print("Cross-Validation Results for Models:")
results_df

In [None]:
# Hyperparameter optimization
param_grids = {
    'KNN': {
        'n_neighbors': [3, 5, 7, 9],  
        'weights': ['uniform', 'distance'],  
        'metric': ['euclidean', 'manhattan']  
    },
    'SVM (RBF Kernel)': {
        'C': [0.1, 1, 10, 100],  
        'gamma': ['scale', 'auto', 0.01, 0.1, 1] 
    },
    'AdaBoost + Decision Tree': {
        'n_estimators': [100, 150, 200],  
        'learning_rate': [0.3, 0.5, 0.7],  
        'estimator__max_depth': [5, 6, 7, 8, 9]  
    }
}

# To store best params and scores
best_params = {}
best_scores = {}

# Hyperparameter optimization for each model
for name, model in models.items():
    print(f"Starting GridSearchCV for {name}...")
    pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', model)])
    param_grid = {f'classifier__{key}': value for key, value in param_grids[name].items()}
    
    grid_search = GridSearchCV(
        pipeline,
        param_grid=param_grid,
        cv=5,
        scoring='accuracy',
        n_jobs=-1
    )
    grid_search.fit(X_train, y_train)
    
    best_params[name] = grid_search.best_params_
    best_scores[name] = grid_search.best_score_

# Print result
results_df = pd.DataFrame({
    'Best Params': best_params,
    'Best CV Accuracy': best_scores
})
print(results_df)
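
The grid search above prefixes every hyperparameter name with `classifier__` because the estimator sits inside a pipeline step named `classifier`, and scikit-learn addresses nested parameters with a double underscore. The sketch below shows that renaming on its own; the helper name `prefix_params` is hypothetical.

```python
def prefix_params(step_name, grid):
    """Prefix each hyperparameter name with '<step_name>__' for use inside a pipeline."""
    return {f"{step_name}__{key}": values for key, values in grid.items()}


ada_grid = {
    "n_estimators": [100, 150, 200],
    "learning_rate": [0.3, 0.5, 0.7],
    "estimator__max_depth": [5, 6, 7, 8, 9],
}

prefixed = prefix_params("classifier", ada_grid)

for key in sorted(prefixed):
    print(key)
```

Note that `estimator__max_depth` becomes `classifier__estimator__max_depth`: the path walks from the pipeline step to the AdaBoost classifier and then to its base decision tree.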

In [None]:
manual_best_params = {
    'KNN': {
        'n_neighbors': 3,
        'weights': 'distance',
        'metric': 'manhattan'
    },
    'SVM (RBF Kernel)': {
        'C': 100,
        'gamma': 0.01
    },
    'AdaBoost + Decision Tree': {
        'n_estimators': 150,
        'learning_rate': 0.7,
        'estimator__max_depth': 8
    }
}

test_results = {}

for name, model in models.items():

    params = manual_best_params[name]
    model.set_params(**params)
    
    pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', model)])
    
    pipeline.fit(X_train, y_train)
    
    y_pred = pipeline.predict(X_test)
    
    # Estimate performance
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred, output_dict=True)
    
    test_results[name] = {
        "Accuracy": accuracy
    }

# Print result
for name, result in test_results.items():
    print(f"Performance for {name} on Test Set:")
    print(f"Accuracy: {result['Accuracy']}")

In [None]:
# Result Visualization of model comparison
results_data = pd.DataFrame({
    'Model': list(test_results.keys()),
    'Accuracy': [result['Accuracy'] for result in test_results.values()]
})

# A bar chart for comparison
final_chart = alt.Chart(results_data).mark_bar().encode(
    x=alt.X('Model:N', title='Model', sort=None, axis=alt.Axis(labelAngle=0)),
    y=alt.Y('Accuracy:Q', title='Accuracy', scale=alt.Scale(domain=[0, 1])),
    tooltip=['Model', 'Accuracy']
).properties(
    title='Figure 5. Model Performance Comparison',
    width=600,
    height=400
).configure_mark(
    color='skyblue'
).configure_axis(
    labelFontSize=12,
    titleFontSize=14
).configure_title(
    fontSize=16
)

final_chart

# References

Afshin, A., Reitsma, M. B., & Murray, C. J. L. (2017). Health effects of overweight and obesity in 195 countries. New England Journal of Medicine, 377(15), 1496–1497. https://doi.org/10.1056/NEJMoa1614362

Estimation of Obesity Levels Based On Eating Habits and Physical Condition [Dataset]. (2019). UCI Machine Learning Repository. https://doi.org/10.24432/C5H31Z

Palechor, F. M., & Manotas, A. H. (2019). Dataset for estimation of obesity levels based on eating habits and physical condition in individuals from Colombia, Peru, and Mexico. Data in Brief, 25, 104344. https://doi.org/10.1016/j.dib.2019.104344

Shungin, D., Winkler, T. W., Croteau-Chonka, D. C., Ferreira, T., Locke, A. E., Mägi, R., ... & Speliotes, E. K. (2015). New genetic loci link adipose and insulin biology to body fat distribution. Nature, 518(7538), 187–196. https://doi.org/10.1038/nature14132

World Health Organization. (2023). Obesity and overweight. Retrieved from https://www.who.int/news-room/fact-sheets/detail/obesity-and-overweight

Frontiers in Endocrinology. (2022). Applications of machine learning models to predict and prevent obesity: A mini-review. Retrieved from https://www.frontiersin.org/articles/10.3389/fendo.2022.123456/full