# Obesity Classification Data Science Project

## Overview

This project focuses on the classification of individuals based on their obesity status. The dataset used for this project contains information collected from various sources, including medical records, surveys, and self-reported data. The goal is to analyze and classify individuals into different obesity categories using the provided data.

## Dataset

The dataset includes the following columns:

- **ID**: A unique identifier for each individual
- **Age**: The age of the individual
- **Gender**: The gender of the individual
- **Height**: The height of the individual in centimeters
- **Weight**: The weight of the individual in kilograms
- **BMI**: The body mass index of the individual, calculated as weight in kilograms divided by the square of height in meters (kg/m²)
- **Label**: The obesity classification of the individual, which can be one of the following:
  - Normal Weight
  - Overweight
  - Obese
  - Underweight
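
Because the dataset stores height in centimeters, the height has to be converted to meters before the BMI formula is applied. The snippet below is a minimal sketch of that calculation; the function name `bmi` and the sample person are illustrative assumptions and are not taken from the dataset.

```python
def bmi(weight_kg: float, height_cm: float) -> float:
    """Return body mass index in kg/m^2 from weight in kg and height in cm."""
    height_m = height_cm / 100.0  # convert centimeters to meters first
    return weight_kg / height_m ** 2


# Hypothetical person (not a row from the dataset): 70 kg, 170 cm
print(round(bmi(70, 170), 1))  # 24.2
```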

## Objective

The goal of this project is to:

1. Understand and visualize the dataset.
2. Train a classification model to accurately predict the obesity class from the input features.
3. Evaluate the performance of different classification models using metrics such as accuracy, precision, recall, and F1-score.

## Problem Statement

1. **Model Training**:
   - Build and train the classification model using algorithms such as:
     - K Nearest Neighbors
   
2. **Model Evaluation**:
   - Evaluate the models using appropriate classification metrics:
     - **Accuracy**: Percentage of correctly classified instances.
     - **Precision**: The proportion of predicted positives that are actually positive.
     - **Recall**: The proportion of actual positives that are correctly predicted.
     - **F1-Score**: The harmonic mean of precision and recall.
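
For a task with four classes, precision, recall, and F1 are computed per class and then averaged. The sketch below shows the support-weighted average, which corresponds to the `average='weighted'` option used later in this notebook. The function name `weighted_scores` and the toy labels are assumptions made for illustration.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Return accuracy plus support-weighted precision, recall and F1."""
    n = len(y_true)
    support = Counter(y_true)  # number of true samples per class
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    precision = recall = f1 = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        p_c = tp / predicted if predicted else 0.0  # per-class precision
        r_c = tp / count                            # per-class recall
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = count / n                          # class share of the data
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return accuracy, precision, recall, f1


# Toy labels for illustration only
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2, 2]
acc, prec, rec, f1 = weighted_scores(y_true, y_pred)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
```

With weighted averaging, recall is always equal to accuracy, which is why those two figures coincide in the results reported further down.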

### Load Libraries

In [24]:
# General
import pandas as pd
import numpy as np
import os
import warnings

# Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Model and Evaluation Metrics
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Model Optimization
from sklearn.model_selection import GridSearchCV

### Settings

In [4]:
# Warnings
warnings.filterwarnings("ignore")
# Path
data_path = "../data"
csv_path = os.path.join(data_path, "obesity_cleaned.csv")

### Load Data

In [5]:
df = pd.read_csv(csv_path)

In [6]:
# Check Data
df.head()

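| Unnamed: 0 | Age | Gender | Height | Weight | BMI | Label |
|---|---|---|---|---|---|---|
| 0 | 25 | 1 | 175 | 80 | 25.3 | 1 |
| 1 | 30 | 0 | 160 | 60 | 22.5 | 1 |
| 2 | 35 | 1 | 180 | 90 | 27.3 | 2 |
| 3 | 40 | 0 | 150 | 50 | 20.0 | 0 |
| 4 | 45 | 1 | 190 | 100 | 31.2 | 3 |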


### Preprocessing

In [7]:
# Separate input and output features
# Note: every column except Label is used as input, including the index column "Unnamed: 0"
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

In [8]:
# Split train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2, random_state= 42)

In [12]:
# Scaling to standardize
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
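
The scaler above is fitted on the training split only and then reused on the test split, so no information from the test rows influences the scaling. The following sketch reproduces that idea by hand for a single column; the height values are hypothetical and the variable names are chosen for illustration.

```python
from statistics import fmean, pstdev

train_heights = [150.0, 160.0, 170.0, 180.0]  # hypothetical training values (cm)
test_heights = [175.0]                        # hypothetical unseen value (cm)

# Statistics come from the training data only
mean = fmean(train_heights)
std = pstdev(train_heights)  # population standard deviation, as StandardScaler uses

# The same mean and std are applied to both splits
train_scaled = [(h - mean) / std for h in train_heights]
test_scaled = [(h - mean) / std for h in test_heights]

print([round(v, 3) for v in train_scaled])
print([round(v, 3) for v in test_scaled])
```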

### Model Training and Evaluation

In [21]:
# Define a function to train and evaluate a model
def train_evaluate(model):
    # Train the model
    model.fit(X_train_s, y_train)

    # Predict train and test with trained model
    y_train_pred = model.predict(X_train_s)
    y_test_pred = model.predict(X_test_s)

    # Print Evaluation Metrics
    print("=" * 60)
    print("EVALUATION OF TRAIN")
    print("=" * 60)
    print(f"Accuracy: {accuracy_score(y_train, y_train_pred): 0.3f}")
    print(f"Precision: {precision_score(y_train, y_train_pred, average= 'weighted'): 0.3f}")
    print(f"Recall: {recall_score(y_train, y_train_pred, average= 'weighted'): 0.3f}")
    print(f"F1: {f1_score(y_train, y_train_pred, average= 'weighted'): 0.3f}")
    print("=" * 60)
    print("EVALUATION OF TEST")
    print("=" * 60)
    print(f"Accuracy: {accuracy_score(y_test, y_test_pred): 0.3f}")
    print(f"Precision: {precision_score(y_test, y_test_pred, average= 'weighted'): 0.3f}")
    print(f"Recall: {recall_score(y_test, y_test_pred, average= 'weighted'): 0.3f}")
    print(f"F1: {f1_score(y_test, y_test_pred, average= 'weighted'): 0.3f}")

In [22]:
# Try KNN Classifier
knn = KNeighborsClassifier()
train_evaluate(knn)

EVALUATION OF TRAIN
Accuracy:  0.907
Precision:  0.912
Recall:  0.907
F1:  0.908
EVALUATION OF TEST
Accuracy:  0.864
Precision:  0.882
Recall:  0.864
F1:  0.868


### Insights

The evaluation metrics for the KNN classifier on the Obesity classification task show good overall performance, especially in terms of generalization to the testing data. Here’s a detailed analysis:

#### Training Metrics (High scores):

- **Accuracy (0.91)**: The model correctly classifies **90.7%** of the training data. This indicates that the model has learned the patterns in the training data well.
- **Precision (0.91)**: Weighted precision on the training data is **91.2%**: averaged over the classes and weighted by class size, about 91% of the predictions made for a class are correct, so the model produces few false positives.
- **Recall (0.91)**: Weighted recall is **90.7%**, meaning the model recovers about 91% of the true labels across the classes. Some cases are missed, but the result is still good.
- **F1 Score (0.91)**: The F1 score, which balances precision and recall, is **90.8%**, indicating solid performance.

#### Testing Metrics (Slightly lower):

- **Accuracy (0.86)**: The model correctly classifies **86.4%** of the testing data, which suggests it generalizes reasonably well to unseen data. The gap of about 4 percentage points from the training accuracy points to slight overfitting.
- **Precision (0.88)**: Weighted precision is **88.2%** on the test data, meaning most of the predictions made for each class are correct.
- **Recall (0.86)**: Weighted recall on the test set is **86.4%**, meaning the model recovers **86.4%** of the true labels.
- **F1 Score (0.87)**: The F1 score on the test data is **86.8%**, reflecting a model that balances precision and recall well.

#### Analysis of Performance:

- **Overfitting:** The model’s testing performance is slightly lower than its training performance, which suggests mild overfitting. This may be due to imbalance in the dataset.
- **F1 Score Balance:** The F1 score is high on both the training data (**0.91**) and the testing data (**0.87**), which suggests that precision and recall stay well balanced on unseen data.

### Model Optimization

In [25]:
# Define a function for Hyperparameter Tuning
def tune_hyperparameter(model, param_dict):
    # Define Tuner
    tuner = GridSearchCV(
        model,
        param_grid= param_dict,
        verbose= 2,
        cv= 5
    )
    # Train the tuner
    # Note: the search runs on the full, unscaled X and y, so the test rows take part in tuning
    tuner.fit(X, y)
    # Find best parameters and best score
    best_params = tuner.best_params_
    print(f"Best Accuracy: {tuner.best_score_}")

    return best_params
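
The function above fits the search on `X` and `y`, which are the full, unscaled data. One possible alternative, shown here only as a sketch, wraps the scaler and the classifier in a `Pipeline` and runs the search on the training split alone, so that scaling is refitted inside each fold and the test rows stay unseen. The data comes from `make_classification` as a synthetic stand-in, and the names `X_demo`, `y_demo`, `pipe`, and `search` are assumptions, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with four classes (not the obesity dataset)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Scaling happens inside the pipeline, so each CV fold fits its own scaler
pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier())])
param_grid = {
    "knn__n_neighbors": [5, 7, 9, 11, 13],
    "knn__weights": ["uniform", "distance"],
}

search = GridSearchCV(pipe, param_grid=param_grid, cv=5)
search.fit(X_tr, y_tr)  # only the training split is used for tuning

print(f"Best parameters: {search.best_params_}")
print(f"Cross-validated accuracy: {search.best_score_:.3f}")
print(f"Held-out accuracy: {search.score(X_te, y_te):.3f}")
```

With this layout the held-out score is computed on rows that played no part in choosing the hyperparameters.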

In [30]:
# Hyperparameter tuning for KNN Classifier

# Define parameter dictionary
param_dict = {
    "n_neighbors": [5, 7, 9, 11, 13],
    "weights": ["uniform", "distance"]
}

# Define KNN Classifier model
knn_ht = KNeighborsClassifier()

# Tune parameter
best_params = tune_hyperparameter(knn_ht, param_dict)
print(f"Best Parameter set: {best_params}")

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV] END .....................n_neighbors=5, weights=uniform; total time=   0.0s
[CV] END .....................n_neighbors=5, weights=uniform; total time=   0.0s
[CV] END .....................n_neighbors=5, weights=uniform; total time=   0.0s
[CV] END .....................n_neighbors=5, weights=uniform; total time=   0.0s
[CV] END .....................n_neighbors=5, weights=uniform; total time=   0.0s
[CV] END ....................n_neighbors=5, weights=distance; total time=   0.0s
[CV] END ....................n_neighbors=5, weights=distance; total time=   0.0s
[CV] END ....................n_neighbors=5, weights=distance; total time=   0.0s
[CV] END ....................n_neighbors=5, weights=distance; total time=   0.0s
[CV] END ....................n_neighbors=5, weights=distance; total time=   0.0s
[CV] END .....................n_neighbors=7, weights=uniform; total time=   0.0s
[CV] END .....................n_neighbors=7, wei

In [31]:
# Train the model with best parameters and check evaluation metrics
model = KNeighborsClassifier(**best_params)
train_evaluate(model)

EVALUATION OF TRAIN
Accuracy:  1.000
Precision:  1.000
Recall:  1.000
F1:  1.000
EVALUATION OF TEST
Accuracy:  0.955
Precision:  0.961
Recall:  0.955
F1:  0.955


### Conclusion

After hyperparameter tuning, the KNN classifier for obesity classification shows significant improvements across both training and testing metrics. The metrics are discussed below.

#### Training Metrics:

- **Accuracy (1.0)**: The model correctly classifies **100%** of the training data, indicating a perfect fit to the training set.
- **Precision (1.0)**: Every class prediction on the training data is correct, so there are no false positives.
- **Recall (1.0)**: Recall rises from 90.7% with the default configuration to **100%**: the model recovers every true label in the training data, leaving no false negatives.
- **F1 Score (1.0)**: The F1 score, which balances precision and recall, is a perfect **100%**, consistent with the absence of false positives and false negatives on the training data.
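
A perfect training score is what KNN produces whenever neighbors are weighted by inverse distance, because each training point is then its own zero-distance neighbor and decides its own vote. The truncated log above does not show which configuration was selected, so whether that applies here is an assumption. The sketch below is a small pure-Python KNN written for illustration only (it is not scikit-learn's implementation, and the points are made up); it shows the effect on a toy set containing one point surrounded by the other class.

```python
import math
from collections import defaultdict


def knn_predict(train_X, train_y, query, k=3, weights="distance"):
    """Predict one label with a simple k-nearest-neighbors vote."""
    neighbors = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )[:k]
    votes = defaultdict(float)
    exact = [label for d, label in neighbors if d == 0.0]
    if weights == "distance" and exact:
        # Zero distance means infinite weight: only exact matches vote
        for label in exact:
            votes[label] += 1.0
    else:
        for d, label in neighbors:
            votes[label] += 1.0 / d if weights == "distance" else 1.0
    return max(votes, key=votes.get)


def train_accuracy(train_X, train_y, weights):
    """Accuracy when the model predicts its own training points."""
    hits = sum(
        knn_predict(train_X, train_y, x, weights=weights) == y
        for x, y in zip(train_X, train_y)
    )
    return hits / len(train_y)


# Made-up points: (0.4, 0.4) belongs to class 1 but sits among class 0
points = [(0, 0), (1, 0), (0, 1), (0.4, 0.4), (5, 5), (6, 5)]
labels = [0, 0, 0, 1, 1, 1]

acc_uniform = train_accuracy(points, labels, "uniform")
acc_distance = train_accuracy(points, labels, "distance")
print(f"uniform: {acc_uniform:.3f}, distance: {acc_distance:.3f}")
```

In this toy case uniform voting misclassifies the surrounded point, while distance weighting scores every training point correctly, so a training score of 1.0 under distance weighting says little about generalization by itself.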

#### Testing Metrics (Improved performance):

- **Accuracy (0.96)**: The model achieves **95.5%** accuracy on the test data, which is a very strong result and shows good generalization to unseen data.
- **Precision (0.96)**: Precision on the test set is **96.1%**, meaning the model's predictions are highly reliable, with very few false positives.
- **Recall (0.96)**: The recall on the test set is **95.5%**, showing that the model recovers most of the true labels in the test set, though a small percentage of cases are still missed (false negatives).
- **F1 Score (0.96)**: The F1 score on the test set is **95.5%**, indicating that the model balances precision and recall well, even on unseen data.

#### Summary

- **Improved Performance:** Compared to the previous results, all metrics have improved on both the training and testing datasets. The model now identifies a higher proportion of true labels, which is essential in obesity detection, and fewer cases are missed.

- **Generalization to Unseen Data:** Test accuracy (95.5%) stays within about 4.5 percentage points of training accuracy (100%), which suggests that the model generalizes well to new data. Because the grid search was run on the full dataset, including the test rows, the test scores may be somewhat optimistic.