# S4E2 Challenge: Multi-Class Prediction of Obesity Risk
The goal of this Kaggle competition is to predict an individual's obesity risk class, a condition linked to cardiovascular disease, from demographic and lifestyle factors. For this submission, I use CatBoost for classification, Optuna for hyperparameter optimization, and Plotly Express for data visualization.

## Table of Contents
* [Import Libraries](#subsection1)
* [Data Overview](#subsection2)
* [Exploratory Data Analysis](#subsection3)
* [Model Building](#subsection4) 
* [Hyperparameter Optimization](#subsection5)
* [Predictions using Test Data](#subsection6)

### Import Libraries <a class="anchor"  id="subsection1"></a>

In [1]:
import numpy as np 
import pandas as pd 
import warnings

import plotly.express as px
import optuna

from catboost import CatBoostClassifier

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [2]:
warnings.filterwarnings("ignore")

px.defaults.height = 500
px.defaults.width = 500
px.defaults.color_continuous_scale = px.colors.diverging.delta
px.defaults.color_discrete_sequence = ["darkseagreen", "teal"]

optuna.logging.set_verbosity(optuna.logging.WARNING)

### Data Overview <a class="anchor"  id="subsection2"></a>

In [3]:
train_data = pd.read_csv('/kaggle/input/playground-series-s4e2/train.csv')
test_data = pd.read_csv('/kaggle/input/playground-series-s4e2/test.csv')

print("Train data shape: ", train_data.shape)
print("Test data shape: ", test_data.shape)

Train data shape:  (20758, 18)
Test data shape:  (13840, 17)


In [4]:
# info() prints its summary itself and returns None, so no print() wrapper is needed
pd.concat([test_data, train_data]).info(memory_usage=False)

<class 'pandas.core.frame.DataFrame'>
Index: 34598 entries, 0 to 20757
Data columns (total 18 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              34598 non-null  int64  
 1   Gender                          34598 non-null  object 
 2   Age                             34598 non-null  float64
 3   Height                          34598 non-null  float64
 4   Weight                          34598 non-null  float64
 5   family_history_with_overweight  34598 non-null  object 
 6   FAVC                            34598 non-null  object 
 7   FCVC                            34598 non-null  float64
 8   NCP                             34598 non-null  float64
 9   CAEC                            34598 non-null  object 
 10  SMOKE                           34598 non-null  object 
 11  CH2O                            34598 non-null  float64
 12  SCC                             34598

#### Numerical Features
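* id: Unique row identifier
* Age: Age in years
* Height: Height in meters
* Weight: Weight in kilograms
* FCVC: Frequency of vegetable consumption
* NCP: Number of main meals per day
* CH2O: Daily water consumption
* FAF: Physical activity frequency
* TUE: Time spent using technology devices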

In [5]:
train_data.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| id | 20758.0 | 10378.5 | 5992.46278 | 0.0 | 5189.25 | 10378.5 | 15567.75 | 20757.0 |
| Age | 20758.0 | 23.841804 | 5.688072 | 14.0 | 20.0 | 22.815416 | 26.0 | 61.0 |
| Height | 20758.0 | 1.700245 | 0.087312 | 1.45 | 1.631856 | 1.7 | 1.762887 | 1.975663 |
| Weight | 20758.0 | 87.887768 | 26.379443 | 39.0 | 66.0 | 84.064875 | 111.600553 | 165.057269 |
| FCVC | 20758.0 | 2.445908 | 0.533218 | 1.0 | 2.0 | 2.393837 | 3.0 | 3.0 |
| NCP | 20758.0 | 2.761332 | 0.705375 | 1.0 | 3.0 | 3.0 | 3.0 | 4.0 |
| CH2O | 20758.0 | 2.029418 | 0.608467 | 1.0 | 1.792022 | 2.0 | 2.549617 | 3.0 |
| FAF | 20758.0 | 0.981747 | 0.838302 | 0.0 | 0.008013 | 1.0 | 1.587406 | 3.0 |
| TUE | 20758.0 | 0.616756 | 0.602113 | 0.0 | 0.0 | 0.573887 | 1.0 | 2.0 |


#### Categorical Features
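* Gender: Binary gender
* family_history_with_overweight: Whether a family member has been overweight
* FAVC: Frequent consumption of high-calorie food
* CAEC: Consumption of food between meals
* SMOKE: Whether the person smokes
* SCC: Whether the person monitors calorie intake
* CALC: Alcohol consumption
* MTRANS: Main mode of transportation
* NObeyesdad (TARGET): Obesity risk class (7 levels)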

In [6]:
train_data.describe(include='O').T

| | count | unique | top | freq |
|---|---|---|---|---|
| Gender | 20758 | 2 | Female | 10422 |
| family_history_with_overweight | 20758 | 2 | yes | 17014 |
| FAVC | 20758 | 2 | yes | 18982 |
| CAEC | 20758 | 4 | Sometimes | 17529 |
| SMOKE | 20758 | 2 | no | 20513 |
| SCC | 20758 | 2 | no | 20071 |
| CALC | 20758 | 3 | Sometimes | 15066 |
| MTRANS | 20758 | 5 | Public_Transportation | 16687 |
| NObeyesdad | 20758 | 7 | Obesity_Type_III | 4046 |
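
The target has seven classes. The most frequent one, Obesity_Type_III, covers 4046 of the 20758 rows, somewhat more than an even one-seventh share (about 2965), so the classes are not perfectly balanced. The model-building cell in this notebook passes `auto_class_weights='Balanced'` to CatBoost. The sketch below illustrates one way such weights can be computed. It assumes that each class is weighted by the size of the largest class divided by its own size, which is how I read the CatBoost documentation; treat that formula as an assumption rather than a statement of the library's internals. The function name and the toy labels are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by (largest class count) / (its own count).

    Assumption: this mirrors CatBoost's 'Balanced' mode as described in
    its documentation; it is an illustration, not CatBoost's actual code.
    """
    counts = Counter(labels)
    largest = max(counts.values())
    return {label: largest / count for label, count in counts.items()}


# Toy labels (hypothetical, not the competition data).
toy_labels = ["A"] * 4 + ["B"] * 2 + ["C"] * 1
toy_weights = balanced_class_weights(toy_labels)
print(toy_weights)
```

Under this scheme the largest class keeps a weight of 1 and rarer classes are weighted up in proportion to how rare they are.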


### Exploratory Data Analysis <a class="anchor"  id="subsection3"></a>

In [7]:
for col in train_data.select_dtypes('O'):
    if col != 'Gender':
        fig = px.histogram(data_frame=train_data, x=col, color='Gender', title=col)
        fig.update_layout(xaxis={'title': None}, yaxis={'title': None })
        fig.show()

In [8]:
skew = train_data.select_dtypes('number').skew()
fig = px.bar(skew)
fig.update_layout(title="Skew of Obesity Risk Features", 
                  xaxis={"title": None }, yaxis={"title": "Skew"},
                  showlegend=False)
fig.show()
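
For reference, the bar chart above plots the value returned by pandas' `skew()`, which is documented as an unbiased skew estimate. A minimal pure-Python sketch of that bias-corrected (adjusted Fisher-Pearson) estimator follows. The function name and the toy samples are hypothetical, and the zero-variance handling is my own choice for the sketch.

```python
import math


def sample_skew(values):
    """Bias-corrected sample skewness (adjusted Fisher-Pearson), n >= 3."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least 3 values")
    mean = sum(values) / n
    # Second and third central moments (divided by n).
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        # Constant data: return 0 here (a choice made for this sketch).
        return 0.0
    g1 = m3 / m2 ** 1.5
    # Small-sample correction factor.
    return math.sqrt(n * (n - 1)) / (n - 2) * g1


symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]      # no tail on either side
right_tailed = [1.0, 1.0, 1.0, 2.0, 10.0]  # long tail to the right
print(sample_skew(symmetric), sample_skew(right_tailed))
```

A positive value indicates a longer right tail, a negative value a longer left tail, and a value near zero a roughly symmetric distribution.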

In [9]:
corr = train_data.drop(columns=['id']).corr(numeric_only=True)
fig = px.imshow(corr, text_auto=True, aspect='auto', height=500, width=1000)
fig.update_xaxes(side="top")
fig.show()

### Model Building <a class="anchor"  id="subsection4"></a>

In [10]:
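def get_fitted_catboost(params): 
    """Fit a CatBoost classifier on a 70/30 split of train_data.

    Returns the fitted model together with the hold-out features and labels.
    """
    # Categorical features: every object column except the target
    cat_features = train_data.select_dtypes('O').columns.drop('NObeyesdad').tolist()
    X_train, X_test, y_train, y_test = train_test_split(train_data.drop(columns=['NObeyesdad']), 
                                                        train_data.NObeyesdad, 
                                                        test_size=0.3, 
                                                        random_state=42) 

    clf = CatBoostClassifier(**params, 
                             loss_function='MultiClass',
                             auto_class_weights='Balanced',
                             early_stopping_rounds=50,
                             iterations=100, 
                             random_state=42)
    
    # The hold-out set doubles as the early-stopping eval_set
    clf.fit(X_train,y_train, 
            eval_set=(X_test,y_test),
            cat_features=cat_features, 
            verbose=False)

    return clf, X_test, y_test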
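
The function above passes the same hold-out set to `eval_set` (for early stopping) and returns it for scoring, so the accuracy measured on it may be slightly optimistic. One possible alternative is a three-way split that keeps a separate validation set for early stopping. The sketch below shows the idea in pure Python with per-class (stratified) shuffling. The function name, the fractions, and the toy labels are all hypothetical. In this notebook, two calls to `train_test_split` with its `stratify` argument would likely achieve the same thing.

```python
import random
from collections import defaultdict


def stratified_three_way_split(labels, val_frac=0.15, test_frac=0.15, seed=42):
    """Return (train, validation, test) index lists, split per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx, test_idx = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_val = round(len(indices) * val_frac)
        n_test = round(len(indices) * test_frac)
        # Validation is for early stopping, test is for the final score.
        val_idx.extend(indices[:n_val])
        test_idx.extend(indices[n_val:n_val + n_test])
        train_idx.extend(indices[n_val + n_test:])
    return train_idx, val_idx, test_idx


# Toy labels (hypothetical): 60 rows of one class, 40 of another.
toy_labels = ["Normal_Weight"] * 60 + ["Obesity_Type_I"] * 40
train_idx, val_idx, test_idx = stratified_three_way_split(toy_labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

With such a split, the validation rows would go to `eval_set` and the test rows would only be used for the reported accuracy.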

### Hyperparameter Optimization <a class="anchor"  id="subsection5"></a>

In [11]:
def objective(trial: optuna.Trial) -> float:
    params = {
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 1e-1, log=True),
    }
    clf, X_test, y_test = get_fitted_catboost(params)
    
    predictions = clf.predict(X_test)
    return accuracy_score(y_test, predictions)

In [12]:
study = optuna.create_study(study_name="obesity-accuracy", direction="maximize")
study.optimize(objective, n_trials=50)

best_study = study.best_trial

print('Trials completed:', len(study.trials))
print("Best params: ", best_study.params)
print("Best value: ", best_study.value)

Trials completed: 50
Best params:  {'learning_rate': 0.09961636263702213}
Best value:  0.8903339755940912


In [13]:
print("Parameter importances:")
print(optuna.importance.get_param_importances(study))

Parameter importances:
{'learning_rate': 1.0}
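
Because `learning_rate` is the only tuned parameter, its importance is trivially 1.0. A wider search space would make the importance report more informative. The sketch below adds `depth` and `l2_leaf_reg`, both of which are CatBoost parameters, using Optuna's `suggest_float` and `suggest_int` methods. The ranges are illustrative guesses, not tuned values. `MidpointTrial` is a hypothetical stand-in for `optuna.Trial` that exists only so the sketch runs without Optuna installed.

```python
import math


def suggest_catboost_params(trial):
    """Hypothetical wider search space; ranges are illustrative guesses."""
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 1e-1, log=True),
        "depth": trial.suggest_int("depth", 4, 10),
        "l2_leaf_reg": trial.suggest_float("l2_leaf_reg", 1.0, 10.0, log=True),
    }


class MidpointTrial:
    """Hypothetical stub mimicking the two Optuna Trial methods used above.

    It always returns the midpoint of the range (geometric midpoint when
    log=True), so the output is deterministic.
    """

    def suggest_float(self, name, low, high, log=False):
        return math.sqrt(low * high) if log else (low + high) / 2

    def suggest_int(self, name, low, high):
        return (low + high) // 2


example_params = suggest_catboost_params(MidpointTrial())
print(example_params)
```

Inside `objective`, the returned dictionary could replace the single-entry `params` dictionary, since `get_fitted_catboost` forwards it to `CatBoostClassifier` unchanged.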


### Predictions using Test Data <a class="anchor"  id="subsection6"></a>

In [14]:
clf = get_fitted_catboost(best_study.params)[0]
predictions = clf.predict(test_data)

In [15]:
fig = px.histogram(predictions, title="Distribution of Obesity Risk Predictions")
fig.update_layout(xaxis={"title": None },
                  yaxis={"title": "Count"},
                  showlegend=False)
fig.show()

In [16]:
submission = pd.read_csv('/kaggle/input/playground-series-s4e2/sample_submission.csv', index_col='id')
submission['NObeyesdad'] = predictions[:,0]
submission.to_csv('submission.csv')
submission.head()

| id | NObeyesdad |
|---|---|
| 20758 | Obesity_Type_II |
| 20759 | Overweight_Level_I |
| 20760 | Obesity_Type_III |
| 20761 | Obesity_Type_I |
| 20762 | Obesity_Type_III |
