# S4E2 Challenge: Multi-Class Prediction of Obesity Risk
The goal of this Kaggle competition is to predict an individual's obesity risk category from lifestyle and physical factors; obesity is in turn a risk factor for cardiovascular disease. For this submission, I use CatBoost for classification, Optuna for hyperparameter optimization, and Plotly Express for data visualization.

### Table of Contents:
* [Import Libraries](#section-one)
* [Data Summary & Overview](#section-two)
* [Exploratory Data Analysis](#section-three)
* [Model Building](#section-four)
* [Hyperparameter Optimization](#section-five)
* [Predictions using Test Data](#section-six)

<a id="section-one"></a>
## Import Libraries

In [1]:
import numpy as np 
import pandas as pd 
import warnings

import plotly.express as px

import optuna
from catboost import CatBoostClassifier

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [2]:
warnings.filterwarnings("ignore")

px.defaults.height = 500
px.defaults.width = 750

px.defaults.color_continuous_scale = px.colors.diverging.delta
px.defaults.color_discrete_sequence = ["darkseagreen", "teal"]

optuna.logging.set_verbosity(optuna.logging.WARNING)

<a id="section-two"></a>
## Data Summary & Overview

In [3]:
train_data = pd.read_csv('/kaggle/input/playground-series-s4e2/train.csv')
test_data = pd.read_csv('/kaggle/input/playground-series-s4e2/test.csv')

print("Train data shape: ", train_data.shape)
print("Test data shape: ", test_data.shape)

Train data shape:  (20758, 18)
Test data shape:  (13840, 17)


In [4]:
print(pd.concat([test_data, train_data]).info(memory_usage=False))

<class 'pandas.core.frame.DataFrame'>
Index: 34598 entries, 0 to 20757
Data columns (total 18 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              34598 non-null  int64  
 1   Gender                          34598 non-null  object 
 2   Age                             34598 non-null  float64
 3   Height                          34598 non-null  float64
 4   Weight                          34598 non-null  float64
 5   family_history_with_overweight  34598 non-null  object 
 6   FAVC                            34598 non-null  object 
 7   FCVC                            34598 non-null  float64
 8   NCP                             34598 non-null  float64
 9   CAEC                            34598 non-null  object 
 10  SMOKE                           34598 non-null  object 
 11  CH2O                            34598 non-null  float64
 12  SCC                             34598

### Abbreviations
- FAVC: Frequent consumption of high caloric food
- FCVC: Frequency of consumption of vegetables
- NCP: Number of main meals
- CAEC: Consumption of food between meals
- CH2O: Consumption of water daily
- SCC: Calories consumption monitoring
- FAF: Physical activity frequency
- TUE: Time using technology devices
- CALC: Consumption of alcohol
- MTRANS: Transportation used
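
The abbreviations can also be kept in a lookup table, for example to give plot axes readable titles. This is a minimal sketch: the names `FEATURE_DESCRIPTIONS` and `describe` are hypothetical and are not used elsewhere in this notebook.

```python
# Hypothetical lookup table mirroring the dataset's feature descriptions.
FEATURE_DESCRIPTIONS = {
    "FAVC": "Frequent consumption of high caloric food",
    "FCVC": "Frequency of consumption of vegetables",
    "NCP": "Number of main meals",
    "CAEC": "Consumption of food between meals",
    "CH2O": "Consumption of water daily",
    "SCC": "Calories consumption monitoring",
    "FAF": "Physical activity frequency",
    "TUE": "Time using technology devices",
    "CALC": "Consumption of alcohol",
    "MTRANS": "Transportation used",
}


def describe(column: str) -> str:
    """Return a readable label, falling back to the raw column name."""
    return FEATURE_DESCRIPTIONS.get(column, column)


print(describe("CH2O"))  # abbreviated column
print(describe("Age"))   # self-explanatory column, returned unchanged
```

Such a helper could be passed to `fig.update_layout(xaxis={'title': describe(col)})` in the plotting loop.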

In [5]:
train_data.describe().T.drop(columns=['count'])

| Column | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|
| id | 10378.5 | 5992.46278 | 0.0 | 5189.25 | 10378.5 | 15567.75 | 20757.0 |
| Age | 23.841804 | 5.688072 | 14.0 | 20.0 | 22.815416 | 26.0 | 61.0 |
| Height | 1.700245 | 0.087312 | 1.45 | 1.631856 | 1.7 | 1.762887 | 1.975663 |
| Weight | 87.887768 | 26.379443 | 39.0 | 66.0 | 84.064875 | 111.600553 | 165.057269 |
| FCVC | 2.445908 | 0.533218 | 1.0 | 2.0 | 2.393837 | 3.0 | 3.0 |
| NCP | 2.761332 | 0.705375 | 1.0 | 3.0 | 3.0 | 3.0 | 4.0 |
| CH2O | 2.029418 | 0.608467 | 1.0 | 1.792022 | 2.0 | 2.549617 | 3.0 |
| FAF | 0.981747 | 0.838302 | 0.0 | 0.008013 | 1.0 | 1.587406 | 3.0 |
| TUE | 0.616756 | 0.602113 | 0.0 | 0.0 | 0.573887 | 1.0 | 2.0 |


In [6]:
train_data.describe(include='O').T.drop(columns=['count'])

| Column | unique | top | freq |
|---|---|---|---|
| Gender | 2 | Female | 10422 |
| family_history_with_overweight | 2 | yes | 17014 |
| FAVC | 2 | yes | 18982 |
| CAEC | 4 | Sometimes | 17529 |
| SMOKE | 2 | no | 20513 |
| SCC | 2 | no | 20071 |
| CALC | 3 | Sometimes | 15066 |
| MTRANS | 5 | Public_Transportation | 16687 |
| NObeyesdad | 7 | Obesity_Type_III | 4046 |


<a id="section-three"></a>
## Exploratory Data Analysis 

In [7]:
for col in train_data.select_dtypes('O'):
    if col != 'Gender':
        fig = px.histogram(data_frame=train_data, x=col, color='Gender', barmode='group')
        fig.update_layout(xaxis={'title': col}, yaxis={'title': None })
        fig.show()

In [8]:
skew = train_data.drop("id", axis=1).select_dtypes('number').skew()

fig = px.bar(skew)
fig.update_layout(title="Skew of Obesity Risk Features", 
                  xaxis={"title": None }, yaxis={"title": "Skew"},
                  showlegend=False)
fig.show()
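
To make the skew values in the bar chart easier to interpret, here is a minimal pure-Python sketch of the bias-corrected (adjusted Fisher-Pearson) sample skewness, which to my understanding is the estimator `DataFrame.skew()` uses by default. The function name `sample_skewness` and the toy lists are illustrative only.

```python
import math


def sample_skewness(values):
    """Adjusted Fisher-Pearson sample skewness; requires at least 3 values."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least three values")
    mean = sum(values) / n
    # Second and third central moments (population form, divided by n).
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        return 0.0  # constant data has no asymmetry
    g1 = m3 / m2 ** 1.5
    # Small-sample bias correction.
    return math.sqrt(n * (n - 1)) / (n - 2) * g1


symmetric = [1, 2, 3, 4, 5]       # balanced around the mean
right_tailed = [1, 1, 2, 2, 10]   # one large value stretches the right tail

print(sample_skewness(symmetric))
print(sample_skewness(right_tailed))
```

A value near zero indicates a roughly symmetric distribution, a positive value a longer right tail, and a negative value a longer left tail.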

In [9]:
corr = train_data.drop(columns=['id']).corr(numeric_only=True)

fig = px.imshow(corr, text_auto=True, aspect='auto', width=1000)
fig.update_xaxes(side="top")
fig.show()

<a id="section-four"></a>
## Model Building 

In [10]:
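def get_fitted_catboost(params):
    # Categorical features: every object-dtype column except the target (the last one)
    cat_features = train_data.select_dtypes('O').columns.values[:-1]
    # Hold out 25% of the training data for early stopping and evaluation
    X_train, X_test, y_train, y_test = train_test_split(train_data.drop(columns=['NObeyesdad']),
                                                        train_data.NObeyesdad, test_size=0.25, random_state=42)

    clf = CatBoostClassifier(**params, loss_function='MultiClass', auto_class_weights='Balanced', eval_metric='Accuracy',
                             early_stopping_rounds=100, iterations=2000, random_state=42)

    # Note: the same hold-out set drives early stopping and is later used for scoring,
    # so the accuracy measured on it is somewhat optimistic
    clf.fit(X_train, y_train,
            eval_set=(X_test, y_test),
            cat_features=cat_features,
            verbose=False)

    return clf, X_test, y_test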
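
The classifier is created with `auto_class_weights='Balanced'`, which reweights classes so that rarer obesity categories count as much as common ones. As I understand the CatBoost documentation, each class weight is the size of the largest class divided by the size of that class (assuming unit sample weights). The sketch below illustrates that rule on made-up labels; `balanced_class_weights` is a hypothetical helper, not a CatBoost function.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by (largest class count) / (its own count)."""
    counts = Counter(labels)
    largest = max(counts.values())
    return {cls: largest / count for cls, count in counts.items()}


# Made-up label distribution, for illustration only.
toy_labels = (["Normal_Weight"] * 4
              + ["Obesity_Type_I"] * 2
              + ["Obesity_Type_III"] * 1)

weights = balanced_class_weights(toy_labels)
for cls, weight in weights.items():
    print(cls, weight)
```

The most frequent class keeps a weight of 1, and every other class is scaled up in proportion to how much rarer it is.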

<a id="section-five"></a>
## Hyperparameter Optimization 

In [11]:
def objective(trial: optuna.Trial) -> float:
    params = {
        'learning_rate': trial.suggest_float('learning_rate', 1e-2, 1e-1, log=True),
        "colsample_bylevel": trial.suggest_float("colsample_bylevel", 0.01, 0.1, log=True),
        "random_strength": trial.suggest_float("random_strength", 0.5, 1.0, log=True),
        
        "depth": trial.suggest_int("depth", 1, 12),
        "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 25, 75),
        "l2_leaf_reg": trial.suggest_int("l2_leaf_reg", 1, 10),
    }
    
    clf, X_test, y_test = get_fitted_catboost(params)
    predictions = clf.predict(X_test)
    
    return accuracy_score(y_test, predictions)

In [12]:
study = optuna.create_study(study_name="obesity-accuracy", direction="maximize")
study.optimize(objective, n_trials=30, show_progress_bar=True, n_jobs=-1)

best_study = study.best_trial

print('Trials completed:', len(study.trials))
print("Best params: ", best_study.params)
print("Best value: ", best_study.value)

Trials completed: 30
Best params:  {'learning_rate': 0.0999313668518443, 'colsample_bylevel': 0.06246075115283145, 'random_strength': 0.9978264957187448, 'depth': 8, 'min_data_in_leaf': 25, 'l2_leaf_reg': 4}
Best value:  0.9021194605009634
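
The best value is plain accuracy on the 25% hold-out split: the share of hold-out rows whose predicted class matches the true one. Assuming `train_test_split` rounds the hold-out size up (0.25 × 20758 = 5189.5, so 5190 rows), the reported score works out to 4682 correct predictions. The sketch below reproduces that arithmetic; the `accuracy` helper is illustrative and only stands in for `accuracy_score`.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy check: three of four predictions are correct.
print(accuracy(["a", "b", "c", "d"], ["a", "b", "c", "x"]))

# Back out the number of correct hold-out predictions from the reported score.
n_holdout = math.ceil(0.25 * 20758)  # assumed rounding rule for the split size
n_correct = round(0.9021194605009634 * n_holdout)
print(n_holdout, n_correct, n_correct / n_holdout)
```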


In [13]:
print("Parameter importances:")

print(optuna.importance.get_param_importances(study))

Parameter importances:
{'colsample_bylevel': 0.8506451065163767, 'learning_rate': 0.04747243053224288, 'random_strength': 0.04527778696393174, 'l2_leaf_reg': 0.043226181191232387, 'depth': 0.010939975203328444, 'min_data_in_leaf': 0.0024385195928878743}


<a id="section-six"></a>
## Predictions using Test Data

In [14]:
clf = get_fitted_catboost(best_study.params)[0]
predictions = clf.predict(test_data)

In [15]:
submission = pd.read_csv('/kaggle/input/playground-series-s4e2/sample_submission.csv', index_col='id')
# predict() returns a 2-D array of shape (n_samples, 1) here, so take the first column
submission['NObeyesdad'] = predictions[:, 0]

submission.to_csv('submission.csv')

submission.head()

| id | NObeyesdad |
|---|---|
| 20758 | Obesity_Type_II |
| 20759 | Overweight_Level_I |
| 20760 | Obesity_Type_III |
| 20761 | Obesity_Type_I |
| 20762 | Obesity_Type_III |


In [16]:
fig = px.histogram(predictions, title="Frequency of Obesity Risk Predictions")
fig.update_layout(xaxis={"title": None }, yaxis={"title": None }, showlegend=False)
fig.show()