# Obesity Prediction Problem

**Column Description**
<ul>
    <li>Gender – Male or Female.</li>
    <li>Age – The person’s age in years.</li>
    <li>Height – Height in meters.</li>
    <li>Weight – Weight in kilograms.</li>
    <li>family_history_with_overweight – Whether the person has a family history of being overweight (yes/no).</li>
    <li>FAVC – If the person frequently consumes high-calorie foods (yes/no).</li>
    <li>FCVC – Frequency of vegetable consumption (scale from 1 to 3).</li>
    <li>NCP – Number of main meals per day.</li>
    <li>CAEC – Frequency of consuming food between meals (Never, Sometimes, Frequently, Always).</li>
    <li>SMOKE – Whether the person smokes (yes/no).</li>
    <li>CH2O – Daily water intake (scale from 1 to 3).</li>
    <li>SCC – If the person monitors their calorie intake (yes/no).</li>
    <li>FAF – Physical activity frequency (scale from 0 to 3).</li>
    <li>TUE – Time spent using technology (scale from 0 to 2).</li>
    <li>CALC – Frequency of alcohol consumption (Never, Sometimes, Frequently, Always).</li>
    <li>MTRANS – Main mode of transportation (Automobile, Bike, Motorbike, Public Transportation, Walking).</li>
    <li>NObeyesdad – Obesity level (Insufficient Weight, Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I, Obesity Type II, Obesity Type III).</li>
</ul>

In [1]:
# Importing Libraries
import pandas as pd 
import numpy as np 

# Importing visualization libraries
import matplotlib.pyplot as plt 
import seaborn as sns 
import plotly.express as px 

## Exploratory Data Analysis

In [2]:
# Importing Dataset
df = pd.read_csv("../EDA/ObesityDataSet_raw_and_data_sinthetic.csv")
df.head()

,Gender,Age,Height,Weight,family_history_with_overweight,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,NObeyesdad
0,Female,21,1.62,64.0,yes,no,2.0,3.0,Sometimes,no,2.0,no,0.0,1.0,no,Public_Transportation,Normal_Weight
1,Female,21,1.52,56.0,yes,no,3.0,3.0,Sometimes,yes,3.0,yes,3.0,0.0,Sometimes,Public_Transportation,Normal_Weight
2,Male,23,1.8,77.0,yes,no,2.0,3.0,Sometimes,no,2.0,no,2.0,1.0,Frequently,Public_Transportation,Normal_Weight
3,Male,27,1.8,87.0,no,no,3.0,3.0,Sometimes,no,2.0,no,2.0,0.0,Frequently,Walking,Overweight_Level_I
4,Male,22,1.78,89.8,no,no,2.0,1.0,Sometimes,no,2.0,no,0.0,0.0,Sometimes,Public_Transportation,Overweight_Level_II


In [3]:
# Check columns 
df.columns

Index(['Gender', 'Age', 'Height', 'Weight', 'family_history_with_overweight',
       'FAVC', 'FCVC', 'NCP', 'CAEC', 'SMOKE', 'CH2O', 'SCC', 'FAF', 'TUE',
       'CALC', 'MTRANS', 'NObeyesdad'],
      dtype='object')

In [4]:
# Check column dtypes and non-null counts
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2111 entries, 0 to 2110
Data columns (total 17 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Gender                          2111 non-null   object 
 1   Age                             2111 non-null   int64  
 2   Height                          2111 non-null   float64
 3   Weight                          2111 non-null   float64
 4   family_history_with_overweight  2111 non-null   object 
 5   FAVC                            2111 non-null   object 
 6   FCVC                            2111 non-null   float64
 7   NCP                             2111 non-null   float64
 8   CAEC                            2111 non-null   object 
 9   SMOKE                           2111 non-null   object 
 10  CH2O                            2111 non-null   float64
 11  SCC                             2111 non-null   object 
 12  FAF                             21

In [7]:
# Numerical Summary Statistics
df.describe()

,Age,Height,Weight,FCVC,NCP,CH2O,FAF,TUE
count,2111.0,2111.0,2111.0,2111.0,2111.0,2111.0,2111.0,2111.0
mean,24.315964,1.70162,86.586035,2.418986,2.685651,2.008053,1.010313,0.657861
std,6.357078,0.093368,26.191163,0.533996,0.778079,0.61295,0.850613,0.608926
min,14.0,1.45,39.0,1.0,1.0,1.0,0.0,0.0
25%,20.0,1.63,65.47,2.0,2.66,1.585,0.125,0.0
50%,23.0,1.7,83.0,2.39,3.0,2.0,1.0,0.625
75%,26.0,1.77,107.43,3.0,3.0,2.48,1.67,1.0
max,61.0,1.98,173.0,3.0,4.0,3.0,3.0,2.0
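
The summary above shows fractional values on columns described as fixed scales (for example, a median FCVC of 2.39 and a 25th-percentile CH2O of 1.585), which may reflect the synthetic portion of the data. A minimal sketch of how the observed ranges could be compared against the documented scales follows. The helper `check_ranges`, the `DOCUMENTED_RANGES` mapping, and the toy rows are hypothetical and only illustrate the idea; on the real data you would pass `df` instead of `toy`.

```python
import pandas as pd

# Scales as stated in the column description (assumed to be inclusive bounds).
DOCUMENTED_RANGES = {
    "FCVC": (1, 3),
    "CH2O": (1, 3),
    "FAF": (0, 3),
}

def check_ranges(frame, ranges):
    """Report observed min/max per column and count non-integer values."""
    rows = []
    for col, (low, high) in ranges.items():
        observed_min = frame[col].min()
        observed_max = frame[col].max()
        rows.append({
            "column": col,
            "observed_min": observed_min,
            "observed_max": observed_max,
            "within_range": bool(low <= observed_min and observed_max <= high),
            # Values with a fractional part do not sit on the integer scale.
            "non_integer_values": int((frame[col] % 1 != 0).sum()),
        })
    return pd.DataFrame(rows).set_index("column")

# Toy rows standing in for df (illustrative values only).
toy = pd.DataFrame({
    "FCVC": [2.0, 3.0, 2.39],
    "CH2O": [2.0, 1.585, 3.0],
    "FAF": [0.0, 0.125, 3.0],
})
report = check_ranges(toy, DOCUMENTED_RANGES)
print(report)
```

A non-zero `non_integer_values` count is not necessarily an error; it only flags columns where rounding or binning might be worth considering before modeling.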


In [8]:
# Categorical Summary Statistics
df.describe(include='O')

,Gender,family_history_with_overweight,FAVC,CAEC,SMOKE,SCC,CALC,MTRANS,NObeyesdad
count,2111,2111,2111,2111,2111,2111,2111,2111,2111
unique,2,2,2,4,2,2,4,5,7
top,Male,yes,yes,Sometimes,no,no,Sometimes,Public_Transportation,Obesity_Type_I
freq,1068,1726,1866,1765,2067,2015,1401,1580,351
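
`describe(include='O')` reports only the most frequent category per column. For the target, the top class `Obesity_Type_I` appears 351 times out of 2111 rows, about 16.6%, which says little about the other six classes. A small sketch of how the full distribution of one column could be tabulated is shown here. The helper name `category_shares` and the toy frame are hypothetical; on the real data the call would be `category_shares(df, "NObeyesdad")`.

```python
import pandas as pd

def category_shares(frame, column):
    """Return the count and percentage share of each category in a column."""
    counts = frame[column].value_counts()
    shares = (counts / len(frame) * 100).round(2)
    return pd.DataFrame({"count": counts, "percent": shares})

# Toy rows standing in for df (illustrative values only).
toy = pd.DataFrame({
    "NObeyesdad": [
        "Normal_Weight",
        "Obesity_Type_I",
        "Obesity_Type_I",
        "Overweight_Level_I",
    ]
})
summary = category_shares(toy, "NObeyesdad")
print(summary)
```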


In [9]:
# Check for missing values
df.isnull().sum()

Gender                            0
Age                               0
Height                            0
Weight                            0
family_history_with_overweight    0
FAVC                              0
FCVC                              0
NCP                               0
CAEC                              0
SMOKE                             0
CH2O                              0
SCC                               0
FAF                               0
TUE                               0
CALC                              0
MTRANS                            0
NObeyesdad                        0
dtype: int64
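
`isnull()` only detects `NaN`/`None` entries, so blank or whitespace-only strings in text columns would not be counted above. The following is a hedged sketch of a slightly broader check; `missing_report` and the toy frame are hypothetical names introduced for illustration, and nothing here implies that the dataset actually contains blank strings.

```python
import pandas as pd

def missing_report(frame):
    """Count NaN values and blank strings per column."""
    nan_counts = frame.isnull().sum()
    # Blank or whitespace-only strings are not NaN, so count them separately
    # for non-numeric columns only.
    blank_counts = frame.apply(
        lambda col: 0
        if pd.api.types.is_numeric_dtype(col)
        else int(col.astype(str).str.strip().eq("").sum())
    )
    return pd.DataFrame({"nan": nan_counts, "blank": blank_counts})

# Toy rows standing in for df (illustrative values only).
toy = pd.DataFrame({
    "Gender": ["Male", " ", None],
    "Age": [21, 23, 27],
})
gaps = missing_report(toy)
print(gaps)
```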

In [10]:
# Check for duplicates
df.duplicated().sum()

24
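
`duplicated()` uses `keep='first'` by default, so the count above covers only the later copies of each repeated row, not the first occurrence. Before dropping anything it can help to look at the full duplicate groups. This is a minimal sketch on a toy frame (hypothetical values); on the real data, replace `toy` with `df`.

```python
import pandas as pd

# Toy rows standing in for df: rows 0, 2 and 4 are identical.
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male", "Female"],
    "Age": [21, 23, 21, 27, 21],
    "NObeyesdad": [
        "Normal_Weight",
        "Normal_Weight",
        "Normal_Weight",
        "Overweight_Level_I",
        "Normal_Weight",
    ],
})

# Later copies only (what duplicated().sum() reports by default).
extra_copies = int(toy.duplicated().sum())

# Every row that belongs to a duplicate group, including the first occurrence.
all_copies = toy[toy.duplicated(keep=False)]

# Size of each duplicate group.
group_sizes = all_copies.groupby(list(toy.columns)).size()

print(extra_copies)
print(all_copies)
print(group_sizes)
```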

In [13]:
# Removing duplicates
df.drop_duplicates(inplace=True)

In [15]:
# Check for unique values in each column
# df.nunique()

In [16]:
# Verify that no duplicates remain
df.duplicated().sum()

0
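
With 2111 rows and 24 duplicates reported earlier, the cleaned frame should contain 2111 - 24 = 2087 rows. Note also that `drop_duplicates` keeps the original index labels, so the index has gaps afterwards. The sketch below shows one way to check the row arithmetic and rebuild a contiguous index; the toy frame is hypothetical, and resetting the index is an optional step that the notebook itself does not perform.

```python
import pandas as pd

# Toy rows standing in for df: rows 2 and 4 repeat row 0.
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male", "Female"],
    "Age": [21, 23, 21, 27, 21],
})

rows_before = len(toy)
n_dupes = int(toy.duplicated().sum())

# Drop later copies and rebuild a contiguous 0..n-1 index.
deduped = toy.drop_duplicates().reset_index(drop=True)

# The row count should shrink by exactly the number of duplicates found.
assert len(deduped) == rows_before - n_dupes
print(rows_before, n_dupes, len(deduped))
print(deduped.index.tolist())
```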