# **Importing Data**

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import HTML
import zipfile
import statsmodels.api as sm
import warnings
warnings.filterwarnings('ignore')

**Importing Data**

Importing from UCI's Machine Learning Repository

In [None]:
!pip3 install -U ucimlrepo

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.7-py3-none-any.whl.metadata (5.5 kB)
Downloading ucimlrepo-0.0.7-py3-none-any.whl (8.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.7


In [None]:
from ucimlrepo import fetch_ucirepo

# fetch dataset
estimation_of_obesity_levels_based_on_eating_habits_and_physical_condition = fetch_ucirepo(id=544)

# data (as pandas dataframes)
X = estimation_of_obesity_levels_based_on_eating_habits_and_physical_condition.data.features
y = estimation_of_obesity_levels_based_on_eating_habits_and_physical_condition.data.targets

# metadata
print(estimation_of_obesity_levels_based_on_eating_habits_and_physical_condition.metadata)

# variable information
print(estimation_of_obesity_levels_based_on_eating_habits_and_physical_condition.variables)

{'uci_id': 544, 'name': 'Estimation of Obesity Levels Based On Eating Habits and Physical Condition ', 'repository_url': 'https://archive.ics.uci.edu/dataset/544/estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition', 'data_url': 'https://archive.ics.uci.edu/static/public/544/data.csv', 'abstract': 'This dataset include data for the estimation of obesity levels in individuals from the countries of Mexico, Peru and Colombia, based on their eating habits and physical condition. ', 'area': 'Health and Medicine', 'tasks': ['Classification', 'Regression', 'Clustering'], 'characteristics': ['Multivariate'], 'num_instances': 2111, 'num_features': 16, 'feature_types': ['Integer'], 'demographics': ['Gender', 'Age'], 'target_col': ['NObeyesdad'], 'index_col': None, 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 2019, 'last_updated': 'Tue Sep 10 2024', 'dataset_doi': '10.24432/C5H31Z', 'creators': [], 'intro_paper': {'ID': 358, 'type': 

Putting data extracted from UCI into a dataframe so that it can be processed.

Adding BMI as a column, **feature engineering**

In [None]:
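# Combine the features and the target into a single dataframe
# (assumed step: it is described above, but its code is not shown)
df = pd.concat([X, y], axis=1)

# BMI = weight divided by height squared
df['BMI'] = df['Weight'] / (df['Height'] ** 2)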

# Display the first few rows to verify
df.head()

Unnamed: 0,Gender,Age,Height,Weight,family_history_with_overweight,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,NObeyesdad,BMI
0,Female,21.0,1.62,64.0,yes,no,2.0,3.0,Sometimes,no,2.0,no,0.0,1.0,no,Public_Transportation,Normal_Weight,24.386526
1,Female,21.0,1.52,56.0,yes,no,3.0,3.0,Sometimes,yes,3.0,yes,3.0,0.0,Sometimes,Public_Transportation,Normal_Weight,24.238227
2,Male,23.0,1.8,77.0,yes,no,2.0,3.0,Sometimes,no,2.0,no,2.0,1.0,Frequently,Public_Transportation,Normal_Weight,23.765432
3,Male,27.0,1.8,87.0,no,no,3.0,3.0,Sometimes,no,2.0,no,2.0,0.0,Frequently,Walking,Overweight_Level_I,26.851852
4,Male,22.0,1.78,89.8,no,no,2.0,1.0,Sometimes,no,2.0,no,0.0,0.0,Sometimes,Public_Transportation,Overweight_Level_II,28.342381
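As a quick sanity check on the engineered column, the BMI of the first row can be recomputed by hand. This is a minimal sketch that assumes `Weight` is in kilograms and `Height` is in metres; the two values are copied from row 0 of the table above.

```python
# Hand-check of the BMI formula using row 0 of the table above
weight_kg = 64.0   # Weight column, row 0 (assumed to be kilograms)
height_m = 1.62    # Height column, row 0 (assumed to be metres)

bmi = weight_kg / height_m ** 2
print(round(bmi, 6))
```

The printed value should agree with the `BMI` entry shown for that row.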


In [None]:
print(df.columns)

Index(['Gender', 'Age', 'Height', 'Weight', 'family_history_with_overweight',
       'FAVC', 'FCVC', 'NCP', 'CAEC', 'SMOKE', 'CH2O', 'SCC', 'FAF', 'TUE',
       'CALC', 'MTRANS', 'NObeyesdad', 'BMI'],
      dtype='object')


In [None]:
#making sure that there is no null data
print(f"Number of null values in each column:\n{df.isnull().sum()}")

Number of null values in each column:
Gender                            0
Age                               0
Height                            0
Weight                            0
family_history_with_overweight    0
FAVC                              0
FCVC                              0
NCP                               0
CAEC                              0
SMOKE                             0
CH2O                              0
SCC                               0
FAF                               0
TUE                               0
CALC                              0
MTRANS                            0
NObeyesdad                        0
BMI                               0
dtype: int64


# **Linear Regression Modeling & Interpretation**

I am going to use linear regression to see what independent variables contribute most significantly to BMI, our dependent variable.

To do this, I am using 80% of the data for training and 20% for testing.
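For a sense of scale, the split sizes can be worked out from the 2,111 rows reported in the dataset metadata. This is a rough sketch that assumes a fractional test size is rounded up to a whole number of rows, which is how I understand `train_test_split` to behave; the variable names are illustrative only.

```python
import math

n_rows = 2111          # num_instances from the dataset metadata
test_fraction = 0.2    # 20% held out for testing

# Assumption: a fractional test size is rounded up to a whole number of rows
n_test = math.ceil(n_rows * test_fraction)
n_train = n_rows - n_test

print(n_train, n_test)
```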

In terms of the predictor variables I want to use, I want ones that have **moderate to high correlation with BMI**, but with the **least amount of multicollinearity** among themselves.
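One way to screen predictors along these two criteria is to look at each predictor's correlation with the target together with its variance inflation factor (VIF). The sketch below uses synthetic stand-in columns (`a`, `b`, `c` are hypothetical, not columns of the obesity data) and computes the VIFs as the diagonal of the inverse predictor correlation matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins (hypothetical, not the obesity data)
a = rng.normal(size=n)
b = 0.9 * a + 0.1 * rng.normal(size=n)   # nearly a rescaled copy of `a`
c = rng.normal(size=n)                   # independent of both
target = 2.0 * a + 1.0 * c + rng.normal(size=n)

X_demo = np.column_stack([a, b, c])
names = ["a", "b", "c"]

# Correlation of each predictor with the target
corr_with_target = [np.corrcoef(X_demo[:, j], target)[0, 1] for j in range(X_demo.shape[1])]

# VIF_j equals the j-th diagonal entry of the inverse predictor correlation matrix
vif = np.diag(np.linalg.inv(np.corrcoef(X_demo, rowvar=False)))

for name, r, v in zip(names, corr_with_target, vif):
    print(f"{name}: corr with target = {r:.2f}, VIF = {v:.1f}")
```

Here `a` and `b` both correlate with the target but carry nearly the same information, so their VIFs are large, while `c` has a VIF close to 1. A common rule of thumb treats VIFs above roughly 5 to 10 as a sign of problematic multicollinearity.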

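I am thinking about using age, FCVC (vegetable consumption), NCP (number of meals), family history with overweight, CALC (alcohol consumption), gender, FAF (physical activity frequency), and CH2O (water intake). The code below also includes TUE (time using technology) and SMOKE (smoking habit).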

I will not include weight and height, since BMI is computed directly from these two columns. Including them would leak the target into the predictors: the model would simply recover the BMI formula and say nothing about lifestyle factors. On top of that, height is something that we cannot really control, so it would not be actionable even if it turned out to be a useful predictor.
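To illustrate why weight and height are excluded, the sketch below shows that BMI is an exact function of the two: on a log scale, log(BMI) = log(Weight) - 2·log(Height), so a linear fit recovers the formula with essentially zero error. The weights and heights are synthetic values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
weight = rng.uniform(40, 150, size=200)      # synthetic weights (kg)
height = rng.uniform(1.45, 1.95, size=200)   # synthetic heights (m)
bmi = weight / height ** 2

# Regress log(BMI) on log(Weight) and log(Height) with an intercept
A = np.column_stack([np.ones_like(weight), np.log(weight), np.log(height)])
coef, *_ = np.linalg.lstsq(A, np.log(bmi), rcond=None)
residual = np.log(bmi) - A @ coef

print(np.round(coef, 6))        # intercept, log-weight and log-height coefficients
print(np.abs(residual).max())   # largest absolute residual
```

The fitted coefficients come out at approximately 0, 1 and -2, which is just the BMI formula restated.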

**Evaluation Metrics:**

**R² and MSE are the most useful metrics** for evaluating this BMI prediction model. R² measures how much of the **variability** in BMI is explained by the model, providing a clear sense of its overall effectiveness.

MSE quantifies the **average squared difference between predicted and actual BMI values**, offering insight into the model's accuracy.
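Both metrics are simple enough to compute by hand. The sketch below implements them with NumPy on four made-up BMI values (the numbers are illustrative, not taken from the dataset); `mse_np` and `r2_np` are hypothetical helper names.

```python
import numpy as np

def mse_np(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean((y_true - y_pred) ** 2)

def r2_np(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Made-up actual and predicted BMI values
actual = [22.0, 25.0, 30.0, 35.0]
predicted = [23.0, 24.0, 31.0, 33.0]

print(mse_np(actual, predicted))
print(r2_np(actual, predicted))
```

With these numbers the squared errors are 1, 1, 1 and 4, so the MSE is 7/4, and the total sum of squares around the mean of 28 is 98, so R² is 1 - 7/98.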

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import r2_score, mean_squared_error

predictor_columns = ['Age', 'FCVC', 'NCP', 'FAF', 'CH2O', 'TUE',
                     'family_history_with_overweight', 'CALC', 'Gender', 'SMOKE']
target_column = 'BMI'

# Split the data into predictors (X) and target (y)
X = df[predictor_columns]
y = df[target_column]

# Define column groups for preprocessing
numerical_columns = ['Age', 'FCVC', 'NCP', 'FAF', 'CH2O', 'TUE']
categorical_columns = ['family_history_with_overweight', 'CALC', 'Gender', 'SMOKE']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_columns),
        ('cat', OneHotEncoder(drop='first'), categorical_columns)
    ]
)

# Create the pipeline with preprocessing and regression model
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression())
])



**Training the model:**

In [None]:
pipeline.fit(X_train, y_train)

Evaluating the model.

In [None]:
y_pred = pipeline.predict(X_test)
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)

r2, mse

(0.36662174328254726, 41.83853303828235)

The model achieves an **R² of 0.3666**, indicating that approximately 36.66% of the variance in BMI is explained by the selected predictors, while the remaining 63.34% is due to unmeasured factors or noise.

The **MSE of 41.84** corresponds to a root mean squared error (RMSE) of about 6.47 BMI units, which is the typical size of a prediction error. The model captures some relationships but struggles with accuracy. This modest performance may stem from **missing predictors, non-linear relationships, or variability in the data**.
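The 6.47 figure is the square root of the MSE, which puts the error back into BMI units. A one-line check, using the MSE value printed above:

```python
import math

mse_linear = 41.83853303828235   # test MSE of the linear regression, from the output above
rmse_linear = math.sqrt(mse_linear)

print(round(rmse_linear, 2))  # 6.47
```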

How does this compare with a baseline model that always predicts the mean BMI of the training set?

In [None]:
baseline_prediction = y_train.mean()

baseline_predictions = [baseline_prediction] * len(y_test)

baseline_mse = mean_squared_error(y_test, baseline_predictions)
baseline_r2 = r2_score(y_test, baseline_predictions)

baseline_mse, baseline_r2

(66.05777739499631, -2.454318341027495e-05)

**Our model outperformed the baseline model in both MSE and R²**. This shows that the predictors used in our model add meaningful value in explaining variations in BMI.
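The baseline R² is slightly negative rather than exactly zero because the baseline predicts the training mean, which differs a little from the test mean. The sketch below reproduces this effect on synthetic data (normally distributed values, not the real BMI column) and checks it against the closed form: minus the squared gap between the two means, divided by the test variance.

```python
import numpy as np

rng = np.random.default_rng(42)
y_train_demo = rng.normal(loc=30.0, scale=8.0, size=1688)   # synthetic "training" targets
y_test_demo = rng.normal(loc=30.0, scale=8.0, size=423)     # synthetic "test" targets

# Baseline: predict the training mean for every test row
baseline = np.full_like(y_test_demo, y_train_demo.mean())

ss_res = np.sum((y_test_demo - baseline) ** 2)
ss_tot = np.sum((y_test_demo - y_test_demo.mean()) ** 2)
r2_baseline = 1.0 - ss_res / ss_tot

# Closed form: -(test mean - train mean)^2 / test variance
closed_form = -(y_test_demo.mean() - y_train_demo.mean()) ** 2 / np.var(y_test_demo)

print(r2_baseline, closed_form)
```

A mean-only baseline can therefore never score above zero on held-out data, which makes it a natural floor for comparing models.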

Which features are most important in our model?

In [None]:
from sklearn.inspection import permutation_importance

# Permutation importance shuffles one raw input column of X_test at a time,
# so it returns one score per original predictor, not per one-hot encoded feature
permutation_results = permutation_importance(pipeline, X_test, y_test, n_repeats=10, random_state=42)

# Label each score with the input column it was computed for
feature_importance_df = pd.DataFrame({
    'Feature': X_test.columns,
    'Importance': permutation_results.importances_mean
})

# Sort by importance in descending order
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

# Display the feature importance DataFrame
feature_importance_df.reset_index(drop=True)

Unnamed: 0,Feature,Importance
0,family_history_with_overweight,0.397087
1,FCVC,0.081144
2,CALC,0.073852
3,Age,0.041738
4,FAF,0.023012
5,CH2O,0.016007
6,TUE,0.0005
7,SMOKE,0.000433
8,NCP,2.4e-05
9,Gender,-0.000366


### **Initial Result & Analysis**


The **top 3 features** for predicting BMI in the table above are **Family History with Overweight** (0.397), **Vegetable Consumption (FCVC)** (0.081), and **Alcohol Consumption (CALC)** (0.074), followed by **Age** (0.042) and **Physical Activity Frequency (FAF)** (0.023). Family history dominates, which makes sense since genetics can influence how our bodies store fat and process energy. FCVC and CALC both describe diet, so their importance isn't surprising. Age also matters because as we get older, metabolism slows down, and it's easier to gain weight. Physical activity is one of the most direct ways people manage their weight, so it is somewhat surprising that it ranks only fifth here.

On the flip side, the **weakest features** include **Number of Meals (NCP)** and **Time Using Technology (TUE)**, whose importances are close to zero, with **Water Intake (CH2O)** only slightly higher at 0.016. Water intake is important for health but doesn't have a strong direct link to BMI compared to things like diet. Smoking (SMOKE) might affect appetite or metabolism, but in this dataset, it doesn't seem to make a big difference for BMI. Time spent on technology has an indirect relationship with BMI: while it might reflect a sedentary lifestyle, it doesn't seem as strong as physical activity or eating habits.

The difference between the top and bottom features comes down to **how directly they are tied to weight**. Family history, diet, and age are strongly tied to weight, while water intake, smoking, and tech time are **more secondary or indirect factors**.
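The importance scores discussed above come from permutation importance: shuffle one input column, re-score the fitted model, and record how much the score drops. The sketch below is a hand-rolled version on synthetic data with an ordinary least squares model. It is an illustration of the idea, not the scikit-learn implementation, and for brevity it scores on the same data it was fitted on rather than on a held-out test set.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
X_demo = rng.normal(size=(n, 3))
# Column 0 matters a lot, column 1 a little, column 2 not at all
y_demo = 3.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.5, size=n)

# Fit ordinary least squares with an intercept
A = np.column_stack([np.ones(n), X_demo])
coef, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

def r2_of_fit(X_eval, y_eval):
    """R² of the fitted linear model on the given data."""
    pred = np.column_stack([np.ones(len(X_eval)), X_eval]) @ coef
    return 1.0 - np.sum((y_eval - pred) ** 2) / np.sum((y_eval - y_eval.mean()) ** 2)

base_score = r2_of_fit(X_demo, y_demo)
importances = []
for j in range(X_demo.shape[1]):
    drops = []
    for _ in range(10):                       # 10 repeats, mirroring n_repeats=10
        X_perm = X_demo.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])   # break the link between column j and y
        drops.append(base_score - r2_of_fit(X_perm, y_demo))
    importances.append(np.mean(drops))

print(np.round(importances, 3))
```

Because whole input columns are shuffled, the result has one score per original column, regardless of how many encoded features the model sees internally.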

To refine my model, I am deciding between using Random Forest or KNN.

**Random Forest is better** than KNN for this model because it handles non-linear relationships, identifies important features, and is faster for predictions. It’s more robust with mixed data types and requires less preprocessing, making it a practical choice for predicting BMI.

## **Random Forest Model**

Preparing and training the random forest model.

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error

# Initialize the Random Forest model
rf_model = RandomForestRegressor(random_state=42, n_estimators=100)

# Train the Random Forest model using the pipeline (preprocessing + regression)
pipeline_rf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', rf_model)
])

In [None]:
# Fit the model
pipeline_rf.fit(X_train, y_train)

Seeing the result

In [None]:
y_pred_rf = pipeline_rf.predict(X_test)

r2_rf = r2_score(y_test, y_pred_rf)
mse_rf = mean_squared_error(y_test, y_pred_rf)

r2_rf, mse_rf

(0.8507342963614409, 9.859918629872082)

### **Secondary Model (Random Forest) Result & Analysis**

The **Random Forest model significantly outperforms both the baseline and the original linear regression model**. With an **R² of 0.8507**, it explains over 85% of the variance in BMI, compared to 36.66% for the linear regression model and near 0% for the baseline. Its **MSE of 9.86** also represents a major improvement, reducing prediction error by over 75% compared to the linear regression model (MSE = 41.84) and over 85% compared to the baseline (MSE = 66.06).
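The percentage reductions quoted above follow directly from the three MSE values reported in the cell outputs. A short arithmetic check:

```python
mse_baseline = 66.05777739499631   # mean-only baseline
mse_linear = 41.83853303828235     # linear regression
mse_rf = 9.859918629872082         # random forest

reduction_vs_linear = 1 - mse_rf / mse_linear
reduction_vs_baseline = 1 - mse_rf / mse_baseline

print(f"{reduction_vs_linear:.1%} lower MSE than linear regression")
print(f"{reduction_vs_baseline:.1%} lower MSE than the baseline")
```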

This shows how Random Forest can capture **complex, non-linear relationships** in the data, making it the most effective model so far.
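A small synthetic example makes the point about non-linearity concrete. The data below follow a U-shaped relationship (a made-up target, unrelated to the obesity data), which a straight line cannot fit but a forest of trees can approximate piece by piece.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
x_demo = rng.uniform(-3, 3, size=(600, 1))
y_demo = x_demo[:, 0] ** 2 + rng.normal(scale=0.3, size=600)   # U-shaped target

X_tr, X_te, y_tr, y_te = train_test_split(x_demo, y_demo, test_size=0.2, random_state=42)

linear = LinearRegression().fit(X_tr, y_tr)
forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)

r2_linear_demo = r2_score(y_te, linear.predict(X_te))
r2_forest_demo = r2_score(y_te, forest.predict(X_te))

print(f"linear R²: {r2_linear_demo:.2f}, forest R²: {r2_forest_demo:.2f}")
```

The linear model scores near zero here because the best straight line through a symmetric U is almost flat, while the forest recovers most of the structure.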

## **Ridge Regression**


I should try **Ridge regression** because some factors in the dataset likely **influence each other**, leading to **multicollinearity**, which can make linear regression less reliable.

For example, people with higher physical activity levels might have other lifestyle habits that affect BMI. This overlap can **inflate the coefficients** in standard linear regression, making it harder to interpret and less accurate. Ridge regression addresses this by **penalizing large coefficients**, helping to reduce the impact of multicollinearity and creating a more stable and generalizable model.
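The shrinkage effect can be seen in the closed-form ridge solution, (XᵀX + αI)⁻¹Xᵀy. The sketch below builds two almost collinear synthetic predictors and shows how the coefficients change as α grows; `ridge_coefficients` is a hypothetical helper written for this illustration, not part of scikit-learn.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)            # almost collinear with x1
X_demo = np.column_stack([x1, x2])
X_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)   # standardize the columns
y_demo = x1 + x2 + rng.normal(size=n)
y_demo = y_demo - y_demo.mean()                # center so no intercept is needed

def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution: (X^T X + alpha * I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

alphas = [0.0, 0.1, 10.0, 1000.0]
norms = []
for alpha in alphas:
    coef = ridge_coefficients(X_demo, y_demo, alpha)
    norms.append(np.linalg.norm(coef))
    print(f"alpha={alpha:>6}: coef={np.round(coef, 3)}, norm={norms[-1]:.3f}")
```

With α = 0 this is ordinary least squares, and the two collinear coefficients can swing apart; larger α pulls them toward each other and toward zero.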


In [None]:
from sklearn.linear_model import Ridge

# Define predictors and target variable
predictor_columns = ['Age', 'FCVC', 'NCP', 'FAF', 'CH2O', 'TUE',
                     'family_history_with_overweight', 'CALC', 'Gender', 'SMOKE']
target_column = 'BMI'

X = df[predictor_columns]
y = df[target_column]

In [None]:
# Define column groups for preprocessing
numerical_columns = ['Age', 'FCVC', 'NCP', 'FAF', 'CH2O', 'TUE']
categorical_columns = ['family_history_with_overweight', 'CALC', 'Gender', 'SMOKE']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Preprocessing for numerical and categorical data using encoder
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_columns),
        ('cat', OneHotEncoder(drop='first'), categorical_columns)
    ]
)

In [None]:
# Create a pipeline with Ridge regression
ridge = Ridge(random_state=42)
pipeline_ridge = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', ridge)
])


In [None]:
from sklearn.model_selection import GridSearchCV

# Finding the best alpha with 5-fold cross-validation
param_grid = {'regressor__alpha': [0.1, 1, 10, 100, 1000]}

ridge_cv = GridSearchCV(pipeline_ridge, param_grid, cv=5, scoring='r2')
ridge_cv.fit(X_train, y_train)

In [None]:
# Predicting and evaluating
y_pred_ridge = ridge_cv.best_estimator_.predict(X_test)

r2_ridge = r2_score(y_test, y_pred_ridge)
mse_ridge = mean_squared_error(y_test, y_pred_ridge)

ridge_results = {
    "Best Alpha": ridge_cv.best_params_['regressor__alpha'],
    "R²": r2_ridge,
    "MSE": mse_ridge
}

print("Ridge Regression Results:")
print(ridge_results)

Ridge Regression Results:
{'Best Alpha': 0.1, 'R²': 0.366642142711434, 'MSE': 41.83718553042209}


### **Ridge Regression Summary**

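Ridge regression performed **far better than the baseline but only marginally better than my original linear regression model** (R² of 0.36664 vs. 0.36662, MSE of 41.837 vs. 41.839). The best alpha was 0.1, the smallest value in the grid, so the penalty on large coefficients barely changed the fit. This suggests that **multicollinearity between predictors** was not what limited the linear model.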
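The size of the gap between ridge and plain linear regression can be read off the reported test scores. A short arithmetic check using the values printed in the cell outputs:

```python
r2_linear, mse_linear = 0.36662174328254726, 41.83853303828235   # linear regression
r2_ridge, mse_ridge = 0.366642142711434, 41.83718553042209       # ridge, alpha = 0.1

r2_gain = r2_ridge - r2_linear
mse_drop = mse_linear - mse_ridge

print(f"R² gain:  {r2_gain:.6f}")
print(f"MSE drop: {mse_drop:.6f}")
```

Both differences are tiny compared with the gap to the Random Forest model.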

However, this is not as good as the Random Forest model, most likely because the data contain **non-linear relationships** among the many **categorical features** that a linear model cannot represent. If the dataset had more continuous predictors with roughly linear effects, Ridge regression could have done better.