<a href="https://www.kaggle.com/code/vidhikishorwaghela/multi-class-prediction-of-obesity-risk?scriptVersionId=161655664" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Multi-Class Prediction of Obesity Risk

## Overview

In this Kaggle competition (Playground Series, Season 4, Episode 2), the objective was to predict each individual's obesity risk category from demographic, physical, and lifestyle factors; obesity risk is in turn associated with cardiovascular health. Model performance was evaluated primarily on accuracy.

## Tech Stack

### Data Analysis and Manipulation
- **Pandas:** Utilized for efficient data manipulation, exploration, and cleaning. Its DataFrame structure facilitated seamless handling of the dataset.

- **NumPy:** Employed for numerical operations and array manipulations, enhancing the efficiency of mathematical computations.

### Machine Learning and Model Development
- **scikit-learn:** Leveraged for its comprehensive set of tools for machine learning tasks. Key components included:
  - `train_test_split`: For splitting the dataset into training and validation sets.
  - `RandomForestClassifier`: Chosen for its ensemble learning capabilities and suitability for classification tasks.
  - `StandardScaler`: Used for scaling numeric features.
  - `OneHotEncoder`: Applied for one-hot encoding categorical features.

### Data Preprocessing
- **Pipeline:** Implemented to streamline and automate the workflow of feature transformation and model training.

- **ColumnTransformer:** Used to apply different preprocessing steps to the numeric and categorical features.

- **SimpleImputer:** Utilized for imputing missing values, ensuring completeness of the dataset.

### Model Evaluation
- **`sklearn.metrics.accuracy_score`:** Used to compute the model's accuracy on the validation set.



In [1]:
import pandas as pd
import numpy as np
import os

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

## Dataset Exploration

The dataset comprises two main files: `train.csv` and `test.csv`. The training dataset has 20,758 rows and 18 columns, including an `id` column and the target variable `NObeyesdad`. The remaining 16 columns are features: gender, age, height, weight, family history of overweight, and lifestyle factors such as eating habits, physical activity, and mode of transportation.

Upon loading and exploring the training dataset, we observed that there are no missing values, and the data types are appropriate. Descriptive statistics provided insights into the distribution of numeric features.
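
The checks described above can be sketched on a small stand-in frame. This is a minimal illustration, not the notebook's own code: the `toy` DataFrame and its values are made up (one missing `Age` is planted on purpose), and only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for train.csv (values are made up).
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Age": [24.4, 18.0, np.nan, 31.6],  # one missing value planted on purpose
    "Weight": [81.7, 57.0, 50.2, 93.8],
    "NObeyesdad": ["Overweight_Level_II", "Normal_Weight",
                   "Insufficient_Weight", "Overweight_Level_II"],
})

# Count missing values per column; all zeros would mean nothing to impute.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Separate numeric from non-numeric columns by dtype.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, other_cols)

# Share of each target class, useful for spotting class imbalance.
class_share = toy["NObeyesdad"].value_counts(normalize=True)
print(class_share)
```

Running the same three calls on `train_data` would confirm the "no missing values" observation and show how the target classes are distributed.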


In [2]:
# Load the training and test datasets
train_data = pd.read_csv('/kaggle/input/playground-series-s4e2/train.csv')
test_data = pd.read_csv('/kaggle/input/playground-series-s4e2/test.csv')

In [3]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20758 entries, 0 to 20757
Data columns (total 18 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              20758 non-null  int64  
 1   Gender                          20758 non-null  object 
 2   Age                             20758 non-null  float64
 3   Height                          20758 non-null  float64
 4   Weight                          20758 non-null  float64
 5   family_history_with_overweight  20758 non-null  object 
 6   FAVC                            20758 non-null  object 
 7   FCVC                            20758 non-null  float64
 8   NCP                             20758 non-null  float64
 9   CAEC                            20758 non-null  object 
 10  SMOKE                           20758 non-null  object 
 11  CH2O                            20758 non-null  float64
 12  SCC                             

In [4]:
train_data.describe()

| | id | Age | Height | Weight | FCVC | NCP | CH2O | FAF | TUE |
|---|---|---|---|---|---|---|---|---|---|
| count | 20758.0 | 20758.0 | 20758.0 | 20758.0 | 20758.0 | 20758.0 | 20758.0 | 20758.0 | 20758.0 |
| mean | 10378.5 | 23.841804 | 1.700245 | 87.887768 | 2.445908 | 2.761332 | 2.029418 | 0.981747 | 0.616756 |
| std | 5992.46278 | 5.688072 | 0.087312 | 26.379443 | 0.533218 | 0.705375 | 0.608467 | 0.838302 | 0.602113 |
| min | 0.0 | 14.0 | 1.45 | 39.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 25% | 5189.25 | 20.0 | 1.631856 | 66.0 | 2.0 | 3.0 | 1.792022 | 0.008013 | 0.0 |
| 50% | 10378.5 | 22.815416 | 1.7 | 84.064875 | 2.393837 | 3.0 | 2.0 | 1.0 | 0.573887 |
| 75% | 15567.75 | 26.0 | 1.762887 | 111.600553 | 3.0 | 3.0 | 2.549617 | 1.587406 | 1.0 |
| max | 20757.0 | 61.0 | 1.975663 | 165.057269 | 3.0 | 4.0 | 3.0 | 3.0 | 2.0 |


In [5]:
train_data.head(5)

| | id | Gender | Age | Height | Weight | family_history_with_overweight | FAVC | FCVC | NCP | CAEC | SMOKE | CH2O | SCC | FAF | TUE | CALC | MTRANS | NObeyesdad |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Male | 24.443011 | 1.699998 | 81.66995 | yes | yes | 2.0 | 2.983297 | Sometimes | no | 2.763573 | no | 0.0 | 0.976473 | Sometimes | Public_Transportation | Overweight_Level_II |
| 1 | 1 | Female | 18.0 | 1.56 | 57.0 | yes | yes | 2.0 | 3.0 | Frequently | no | 2.0 | no | 1.0 | 1.0 | no | Automobile | Normal_Weight |
| 2 | 2 | Female | 18.0 | 1.71146 | 50.165754 | yes | yes | 1.880534 | 1.411685 | Sometimes | no | 1.910378 | no | 0.866045 | 1.673584 | no | Public_Transportation | Insufficient_Weight |
| 3 | 3 | Female | 20.952737 | 1.71073 | 131.274851 | yes | yes | 3.0 | 3.0 | Sometimes | no | 1.674061 | no | 1.467863 | 0.780199 | Sometimes | Public_Transportation | Obesity_Type_III |
| 4 | 4 | Male | 31.641081 | 1.914186 | 93.798055 | yes | yes | 2.679664 | 1.971472 | Sometimes | no | 1.979848 | no | 1.967973 | 0.931721 | Sometimes | Public_Transportation | Overweight_Level_II |


In [6]:
train_data.tail(5)

| | id | Gender | Age | Height | Weight | family_history_with_overweight | FAVC | FCVC | NCP | CAEC | SMOKE | CH2O | SCC | FAF | TUE | CALC | MTRANS | NObeyesdad |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20753 | 20753 | Male | 25.137087 | 1.766626 | 114.187096 | yes | yes | 2.919584 | 3.0 | Sometimes | no | 2.151809 | no | 1.330519 | 0.19668 | Sometimes | Public_Transportation | Obesity_Type_II |
| 20754 | 20754 | Male | 18.0 | 1.71 | 50.0 | no | yes | 3.0 | 4.0 | Frequently | no | 1.0 | no | 2.0 | 1.0 | Sometimes | Public_Transportation | Insufficient_Weight |
| 20755 | 20755 | Male | 20.101026 | 1.819557 | 105.580491 | yes | yes | 2.407817 | 3.0 | Sometimes | no | 2.0 | no | 1.15804 | 1.198439 | no | Public_Transportation | Obesity_Type_II |
| 20756 | 20756 | Male | 33.852953 | 1.7 | 83.520113 | yes | yes | 2.671238 | 1.971472 | Sometimes | no | 2.144838 | no | 0.0 | 0.973834 | no | Automobile | Overweight_Level_II |
| 20757 | 20757 | Male | 26.680376 | 1.816547 | 118.134898 | yes | yes | 3.0 | 3.0 | Sometimes | no | 2.003563 | no | 0.684487 | 0.713823 | Sometimes | Public_Transportation | Obesity_Type_II |


In [7]:
# Split the dataset into features (X) and target variable (y)
X = train_data.drop(['id', 'NObeyesdad'], axis=1)
y = train_data['NObeyesdad']

In [8]:
# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)


## Data Preprocessing

Before modeling, we separated the features (`X`) from the target variable (`y`), dropping the `id` column because it carries no predictive information. We then held out 20% of the rows as a validation set using `train_test_split` with `random_state=42` for reproducibility.
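
The split in the notebook is not stratified. As a hedged alternative, not what the notebook does, passing `stratify=y` keeps the class proportions the same in both splits, which can make a multi-class validation score less noisy. The sketch below uses made-up toy data; the `toy_`-prefixed names are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows with a 75/25 class imbalance.
toy_X = pd.DataFrame({"Weight": range(50, 70)})
toy_y = pd.Series(["Normal_Weight"] * 15 + ["Obesity_Type_I"] * 5,
                  name="NObeyesdad")

# stratify=toy_y asks for the same class proportions in both splits.
toy_X_train, toy_X_val, toy_y_train, toy_y_val = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

train_share = toy_y_train.value_counts(normalize=True)
val_share = toy_y_val.value_counts(normalize=True)
print(train_share)
print(val_share)
```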

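Imputation and scaling are handled by a preprocessing pipeline. Numeric features are imputed with the median and standardized with `StandardScaler`. Categorical features are imputed with the most frequent value and one-hot encoded with `OneHotEncoder`, where `handle_unknown='ignore'` encodes categories unseen during training as all zeros instead of raising an error. Because the training data has no missing values, the imputers act as a safeguard in case the test data contains any.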


In [9]:
# Define preprocessing steps
numeric_features = ['Age', 'Height', 'Weight', 'FCVC', 'NCP', 'CH2O', 'FAF', 'TUE']
categorical_features = ['Gender', 'family_history_with_overweight', 'FAVC', 'CAEC', 'SMOKE', 'SCC', 'CALC', 'MTRANS']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])


## Model Building

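We used a `RandomForestClassifier` with default hyperparameters and `random_state=42` as the predictive model. It is wrapped in a single `Pipeline` together with the preprocessing steps defined above, so the same transformations are fitted on the training data and reapplied unchanged at prediction time.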


In [10]:
# Define the model
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])

In [11]:
# Fit the model
model.fit(X_train, y_train)
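
Because preprocessing and classifier live in one pipeline, the whole object can also be cross-validated without leaking statistics from the held-out fold into the scaler. The following is a sketch of that idea, not a step from the original notebook: it uses synthetic data from `make_classification`, a numeric-only pipeline, and a reduced `n_estimators=50` to keep it quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 rows, 8 numeric features, 3 classes.
toy_X, toy_y = make_classification(
    n_samples=200, n_features=8, n_informative=5, n_classes=3,
    random_state=42,
)

toy_model = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(n_estimators=50, random_state=42)),
])

# Each fold refits the scaler and the forest on that fold's training part.
scores = cross_val_score(toy_model, toy_X, toy_y, cv=5, scoring="accuracy")
print(f"Mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

With the notebook's objects, the equivalent call would be `cross_val_score(model, X, y, cv=5, scoring="accuracy")`, at the cost of fitting the forest five times.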


## Model Training and Evaluation

The model was trained on the training split and evaluated on the 20% validation split using accuracy, the competition metric. It reached an accuracy of approximately 88.6% (0.8861) on the validation set.
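
A single accuracy figure does not show which classes get confused with each other. As an optional complement, a confusion matrix and per-class report can be sketched as follows. The labels here are made up for illustration and are not the notebook's results.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical true and predicted labels for six validation rows.
toy_true = ["Normal_Weight", "Normal_Weight", "Obesity_Type_I",
            "Obesity_Type_I", "Overweight_Level_I", "Overweight_Level_I"]
toy_pred = ["Normal_Weight", "Overweight_Level_I", "Obesity_Type_I",
            "Obesity_Type_I", "Overweight_Level_I", "Normal_Weight"]

labels = ["Normal_Weight", "Obesity_Type_I", "Overweight_Level_I"]

toy_accuracy = accuracy_score(toy_true, toy_pred)

# Rows are true classes, columns are predicted classes, in `labels` order.
toy_cm = confusion_matrix(toy_true, toy_pred, labels=labels)

print(f"Accuracy: {toy_accuracy:.3f}")
print(toy_cm)
print(classification_report(toy_true, toy_pred, labels=labels, zero_division=0))
```

In the notebook, `y_val` and `y_pred` would take the place of the toy lists.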

## Making Predictions

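Predictions were first generated for the validation set to compute the accuracy reported above. The trained model was then applied to the provided test dataset, with the `id` column dropped, to predict each individual's obesity risk category.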


In [12]:
# Make predictions on the validation set
y_pred = model.predict(X_val)

In [13]:
# Evaluate the model
accuracy = accuracy_score(y_val, y_pred)
print(f'Model Accuracy on Validation Set: {accuracy}')

Model Accuracy on Validation Set: 0.8860789980732178


In [14]:
# Apply the model to the test dataset
test_predictions = model.predict(test_data.drop('id', axis=1))

## Submission File

Finally, we created a submission file containing the `id` and the predicted `NObeyesdad` value for each row of the test dataset. This file, named `submission.csv`, can be submitted to Kaggle for evaluation.


In [15]:
# Create a submission file
submission_df = pd.DataFrame({'id': test_data['id'], 'NObeyesdad': test_predictions})
submission_df.to_csv('submission.csv', index=False)
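
Before uploading, it can be worth reading the file back and checking its shape. The sketch below is an optional sanity check under assumptions: the ids and predictions are made up, and the expected two-column layout is inferred from the submission cell above, not from the competition's sample file.

```python
import os
import tempfile

import pandas as pd

# Hypothetical test ids and predictions (made up for illustration).
toy_test = pd.DataFrame({"id": [100, 101, 102]})
toy_predictions = ["Normal_Weight", "Obesity_Type_I", "Normal_Weight"]

toy_submission = pd.DataFrame({"id": toy_test["id"],
                               "NObeyesdad": toy_predictions})

# Write to a temporary directory and read the file back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "submission.csv")
    toy_submission.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

checks = {
    "columns": list(reloaded.columns) == ["id", "NObeyesdad"],
    "row_count": len(reloaded) == len(toy_test),
    "no_missing": not reloaded.isna().any().any(),
    "unique_ids": reloaded["id"].is_unique,
}
print(checks)
```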

In summary, our approach combined data exploration, a preprocessing pipeline, and a Random Forest Classifier to predict obesity risk from individual characteristics. The resulting model reached about 88.6% accuracy on the validation set.
