<a href="https://colab.research.google.com/github/dianalves00/6-7-edition/blob/main/Supervised%20Learning/Multi_Class_Prediction_of_Obesity_Risk.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# [Multi-Class Prediction of Obesity Risk](https://www.kaggle.com/competitions/playground-series-s4e2)

### Dataset Description
The dataset for this competition (both train and test) was generated from a deep learning model trained on the Obesity or CVD risk dataset. Feature distributions are close to, but not exactly the same, as the original. Feel free to use the original dataset as part of this competition, both to explore differences as well as to see whether incorporating the original in training improves model performance.

Note: This dataset is particularly well suited for visualizations, clustering, and general EDA. Show off your skills!

### Files
train.csv - the training dataset; NObeyesdad is the categorical target

test.csv - the test dataset; your objective is to predict the class of NObeyesdad for each row

In [72]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score

train_data  = "https://raw.githubusercontent.com/samsung-ai-course/6-7-edition/main/Supervised%20Learning/Datasets/playground-series-s4e2/train.csv"
test_data = "https://raw.githubusercontent.com/samsung-ai-course/6-7-edition/main/Supervised%20Learning/Datasets/playground-series-s4e2/test.csv"

# Load the dataset
train_data = pd.read_csv(train_data)

In [73]:
train_data.head()

Unnamed: 0,id,Gender,Age,Height,Weight,family_history_with_overweight,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,NObeyesdad
0,0,Male,24.443011,1.699998,81.66995,yes,yes,2.0,2.983297,Sometimes,no,2.763573,no,0.0,0.976473,Sometimes,Public_Transportation,Overweight_Level_II
1,1,Female,18.0,1.56,57.0,yes,yes,2.0,3.0,Frequently,no,2.0,no,1.0,1.0,no,Automobile,Normal_Weight
2,2,Female,18.0,1.71146,50.165754,yes,yes,1.880534,1.411685,Sometimes,no,1.910378,no,0.866045,1.673584,no,Public_Transportation,Insufficient_Weight
3,3,Female,20.952737,1.71073,131.274851,yes,yes,3.0,3.0,Sometimes,no,1.674061,no,1.467863,0.780199,Sometimes,Public_Transportation,Obesity_Type_III
4,4,Male,31.641081,1.914186,93.798055,yes,yes,2.679664,1.971472,Sometimes,no,1.979848,no,1.967973,0.931721,Sometimes,Public_Transportation,Overweight_Level_II


In [74]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20758 entries, 0 to 20757
Data columns (total 18 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              20758 non-null  int64  
 1   Gender                          20758 non-null  object 
 2   Age                             20758 non-null  float64
 3   Height                          20758 non-null  float64
 4   Weight                          20758 non-null  float64
 5   family_history_with_overweight  20758 non-null  object 
 6   FAVC                            20758 non-null  object 
 7   FCVC                            20758 non-null  float64
 8   NCP                             20758 non-null  float64
 9   CAEC                            20758 non-null  object 
 10  SMOKE                           20758 non-null  object 
 11  CH2O                            20758 non-null  float64
 12  SCC                             

In [75]:
train_data.describe()

Unnamed: 0,id,Age,Height,Weight,FCVC,NCP,CH2O,FAF,TUE
count,20758.0,20758.0,20758.0,20758.0,20758.0,20758.0,20758.0,20758.0,20758.0
mean,10378.5,23.841804,1.700245,87.887768,2.445908,2.761332,2.029418,0.981747,0.616756
std,5992.46278,5.688072,0.087312,26.379443,0.533218,0.705375,0.608467,0.838302,0.602113
min,0.0,14.0,1.45,39.0,1.0,1.0,1.0,0.0,0.0
25%,5189.25,20.0,1.631856,66.0,2.0,3.0,1.792022,0.008013,0.0
50%,10378.5,22.815416,1.7,84.064875,2.393837,3.0,2.0,1.0,0.573887
75%,15567.75,26.0,1.762887,111.600553,3.0,3.0,2.549617,1.587406,1.0
max,20757.0,61.0,1.975663,165.057269,3.0,4.0,3.0,3.0,2.0


In [76]:
train_data["NObeyesdad"].value_counts()

Unnamed: 0_level_0,count
NObeyesdad,Unnamed: 1_level_1
Obesity_Type_III,4046
Obesity_Type_II,3248
Normal_Weight,3082
Obesity_Type_I,2910
Insufficient_Weight,2523
Overweight_Level_II,2522
Overweight_Level_I,2427


Per usual, start with some EDA, bar charts and/or correlation

In [77]:
#one hot encoding on caec column
train_data = pd.get_dummies(train_data, columns=["CAEC"])
#one hot encoding on mtrans column
train_data = pd.get_dummies(train_data, columns=["MTRANS"])
#one hot encoding on calc column
train_data = pd.get_dummies(train_data, columns=["CALC"])

In [78]:
#gender
train_data["Gender"] = train_data["Gender"].astype(bool).astype(int)


In [79]:

#family history
train_data["family_history_with_overweight"] = train_data["family_history_with_overweight"].astype(bool).astype(int)
#favc
train_data["FAVC"] = train_data["FAVC"].astype(bool).astype(int)
#smoke
train_data["SMOKE"] = train_data["SMOKE"].astype(bool).astype(int)
#scc
train_data["SCC"] = train_data["SCC"].astype(bool).astype(int)

In [80]:
train_data["NObeyesdad"] = train_data["NObeyesdad"].replace({"Insufficient_Weight": 0, "Normal_Weight": 1, "Overweight_Level_I": 2, "Overweight_Level_II":3, "Obesity_Type_I":4, "Obesity_Type_II":5, "Obesity_Type_III":6})

  train_data["NObeyesdad"] = train_data["NObeyesdad"].replace({"Insufficient_Weight": 0, "Normal_Weight": 1, "Overweight_Level_I": 2, "Overweight_Level_II":3, "Obesity_Type_I":4, "Obesity_Type_II":5, "Obesity_Type_III":6})


In [81]:
train_data.head()

Unnamed: 0,id,Gender,Age,Height,Weight,family_history_with_overweight,FAVC,FCVC,NCP,SMOKE,...,CAEC_Sometimes,CAEC_no,MTRANS_Automobile,MTRANS_Bike,MTRANS_Motorbike,MTRANS_Public_Transportation,MTRANS_Walking,CALC_Frequently,CALC_Sometimes,CALC_no
0,0,1,24.443011,1.699998,81.66995,1,1,2.0,2.983297,1,...,True,False,False,False,False,True,False,False,True,False
1,1,1,18.0,1.56,57.0,1,1,2.0,3.0,1,...,False,False,True,False,False,False,False,False,False,True
2,2,1,18.0,1.71146,50.165754,1,1,1.880534,1.411685,1,...,True,False,False,False,False,True,False,False,False,True
3,3,1,20.952737,1.71073,131.274851,1,1,3.0,3.0,1,...,True,False,False,False,False,True,False,False,True,False
4,4,1,31.641081,1.914186,93.798055,1,1,2.679664,1.971472,1,...,True,False,False,False,False,True,False,False,True,False


In [82]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20758 entries, 0 to 20757
Data columns (total 27 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              20758 non-null  int64  
 1   Gender                          20758 non-null  int64  
 2   Age                             20758 non-null  float64
 3   Height                          20758 non-null  float64
 4   Weight                          20758 non-null  float64
 5   family_history_with_overweight  20758 non-null  int64  
 6   FAVC                            20758 non-null  int64  
 7   FCVC                            20758 non-null  float64
 8   NCP                             20758 non-null  float64
 9   SMOKE                           20758 non-null  int64  
 10  CH2O                            20758 non-null  float64
 11  SCC                             20758 non-null  int64  
 12  FAF                             

In [83]:
train_data.corr()

Unnamed: 0,id,Gender,Age,Height,Weight,family_history_with_overweight,FAVC,FCVC,NCP,SMOKE,...,CAEC_Sometimes,CAEC_no,MTRANS_Automobile,MTRANS_Bike,MTRANS_Motorbike,MTRANS_Public_Transportation,MTRANS_Walking,CALC_Frequently,CALC_Sometimes,CALC_no
id,1.0,,0.007634,0.012041,0.01402,,,0.002098,-0.000332,,...,0.001005,0.00516,-0.012575,-0.004036,-0.012429,0.012802,0.002244,-0.013384,0.002654,0.002141
Gender,,,,,,,,,,,...,,,,,,,,,,
Age,0.007634,,1.0,-0.011713,0.283381,,,0.034414,-0.048479,,...,0.201389,-0.051281,0.605392,0.006393,0.013754,-0.545798,-0.07864,0.036106,0.049159,-0.063896
Height,0.012041,,-0.011713,1.0,0.416677,,,-0.071546,0.191383,,...,0.128341,-0.078078,0.054758,0.013701,0.002375,-0.068578,0.040525,0.038126,0.06771,-0.083777
Weight,0.01402,,0.283381,0.416677,1.0,,,0.245682,0.095947,,...,0.426569,-0.08322,-0.002079,-0.021761,-0.023137,0.043695,-0.099298,-0.048021,0.263987,-0.254933
family_history_with_overweight,,,,,,,,,,,...,,,,,,,,,,
FAVC,,,,,,,,,,,...,,,,,,,,,,
FCVC,0.002098,,0.034414,-0.071546,0.245682,,,1.0,0.113349,,...,0.02239,-0.075708,-0.095624,-0.012134,0.006724,0.09336,-0.006344,-0.03676,0.162723,-0.154531
NCP,-0.000332,,-0.048479,0.191383,0.095947,,,0.113349,1.0,,...,-0.023038,-0.119536,0.007457,0.00285,0.004866,-0.024884,0.04557,-0.003395,0.107963,-0.110182
SMOKE,,,,,,,,,,,...,,,,,,,,,,


In [64]:
#find most important features
train_data.corr()["NObeyesdad"].sort_values(ascending=False)

Unnamed: 0,NObeyesdad
NObeyesdad,1.0
Weight,0.92125
CAEC_Sometimes,0.45095
Age,0.356211
CH2O,0.273154
FCVC,0.272933
CALC_Sometimes,0.236984
Height,0.150141
MTRANS_Public_Transportation,0.062733
NCP,0.027227


Now pre-process and perform feature engineering.

P.s Use at least 5 features.

In [91]:
#p.s this is just an implementation suggestion, feel free to split it into smaller steps
def preprocess_data(data):
    """
    Preprocess the data by handling categorical values and scaling features.
    """
    # Encode features and label
    # TODO
    # ps. we have more than 2 labels, you can use LabelEncoder for that. Check the documentation
    # extra ps. most estimators (models) of sklearn do plenty of magic by themselves, one of them is label encoding

    # Separate features and target
    X = train_data.drop("NObeyesdad", axis=1)
    y = train_data["NObeyesdad"]

    # Scale features
    scaler = StandardScaler()
    X = scaler.fit_transform(X)
    # p.s some algorithms dont 'gain' much from scaling

    return X, y

# Preprocess training data
X, y = preprocess_data(train_data)

Its training time i.e hyperparameter search, cross validation and evaluation.

1. [Logistic Regression](https://scikit-learn.org/1.5/modules/generated/sklearn.linear_model.LogisticRegression.html) - with [OVR](https://scikit-learn.org/1.5/modules/generated/sklearn.multiclass.OneVsRestClassifier.html) and [OVO](https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsOneClassifier.html)
2. KNN
3. Naive-Bayes

P.s read the documentation

In [92]:
# Preprocess training data
X, y = preprocess_data(train_data)

# Split into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
#Note: what about data leakage ? Is something wrong here?

#Remember to do hyperparameter search for example with https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

#logistic regression
model_lr = LogisticRegression()
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
grid_search = GridSearchCV(model_lr, param_grid, cv=5)
grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_
y_pred_log_reg = best_model.predict(X_val)


#knn
model = KNeighborsClassifier()
model.fit(X_train, y_train)
y_pred_knn = model.predict(X_val)

#naive bayes
model = GaussianNB()
model.fit(X_train, y_train)
y_pred_nb = model.predict(X_val)



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

In [90]:
# Evaluation Summary
def evaluate_model(y_true, y_pred, model_name):
    print(f"\nEvaluation Metrics for {model_name}:")
    print(classification_report(y_val, y_pred_nb))
    print(f"Accuracy: {accuracy_score(y_true, y_pred):.2f}")
    print(f"Precision: {precision_score(y_true, y_pred, average='weighted'):.2f}")
    print(f"Recall: {recall_score(y_true, y_pred, average='weighted'):.2f}")

evaluate_model(y_val, y_pred_log_reg, "Logistic Regression")
evaluate_model(y_val, y_pred_knn, "KNN")
evaluate_model(y_val, y_pred_nb, "Naive Bayes")


Evaluation Metrics for Logistic Regression:
              precision    recall  f1-score   support

           0       0.59      0.83      0.69       524
           1       0.51      0.17      0.25       626
           2       0.57      0.10      0.17       484
           3       0.42      0.17      0.24       514
           4       0.30      0.63      0.41       543
           5       0.71      0.88      0.79       657
           6       0.82      1.00      0.90       804

    accuracy                           0.58      4152
   macro avg       0.56      0.54      0.49      4152
weighted avg       0.58      0.58      0.53      4152

Accuracy: 0.86
Precision: 0.86
Recall: 0.86

Evaluation Metrics for KNN:
              precision    recall  f1-score   support

           0       0.59      0.83      0.69       524
           1       0.51      0.17      0.25       626
           2       0.57      0.10      0.17       484
           3       0.42      0.17      0.24       514
           4  

## Discussion
1. Do we have a balanced dataset?
2. Which model is more balanced?
3. Any class trumps the rest?

Awesome work!

Try to submit your best results to the official kaggle competition! Pay close attention to the sample submission for it to be a valid submission.