# **Packages**

In [1]:
import pandas as pd
import ipywidgets as widgets
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# **Dataset description**

Features:  

    Demographic: Gender, Age, Height, Weight, family_history_with_overweight  

    Dietary: FAVC (Frequent High Calorie Food), FCVC (Vegetable Consumption Frequency), NCP (Number of Daily Meals), CAEC (Consumption of Food Between Meals)  
    
    Lifestyle: SMOKE (Smoking Habit), CH2O (Daily Water Intake), SCC (Calorie Monitoring), FAF (Physical Activity Frequency), TUE (Technological Device Usage Time), CALC (Alcohol Consumption Frequency), MTRANS (Main Mode of Transportation)


# **Obesity dataset exploration**

In [2]:
train_set_df = pd.read_csv("train.csv")
train_set_df.head()

Unnamed: 0,ID,Age,Gender,Height,Weight,CALC,FAVC,FCVC,NCP,SCC,SMOKE,CH2O,family_history_with_overweight,FAF,TUE,CAEC,MTRANS,NObeyesdad
0,1,21.0,Female,1.62,64.0,no,no,2.0,3.0,no,no,2.0,yes,0.0,1.0,Sometimes,Public_Transportation,Normal_Weight
1,2,21.0,,1.52,56.0,Sometimes,no,3.0,3.0,yes,yes,3.0,yes,3.0,,Sometimes,Public_Transportation,Normal_Weight
2,3,,Male,,77.0,Frequently,no,2.0,3.0,no,no,2.0,yes,2.0,,Sometimes,Public_Transportation,Normal_Weight
3,4,27.0,,,87.0,Frequently,no,3.0,3.0,no,,2.0,,2.0,0.0,,Walking,Overweight_Level_I
4,5,22.0,Male,1.78,89.8,Sometimes,no,2.0,1.0,no,no,2.0,no,0.0,0.0,Sometimes,Public_Transportation,Overweight_Level_II


In [3]:
train_set_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1110 entries, 0 to 1109
Data columns (total 18 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   ID                              1110 non-null   int64  
 1   Age                             801 non-null    float64
 2   Gender                          760 non-null    object 
 3   Height                          873 non-null    float64
 4   Weight                          753 non-null    float64
 5   CALC                            1056 non-null   object 
 6   FAVC                            609 non-null    object 
 7   FCVC                            1110 non-null   float64
 8   NCP                             1108 non-null   float64
 9   SCC                             1110 non-null   object 
 10  SMOKE                           777 non-null    object 
 11  CH2O                            1110 non-null   float64
 12  family_history_with_overweight  99

In [4]:
test_obesity_df = pd.read_csv("test.csv")
test_obesity_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1001 entries, 0 to 1000
Data columns (total 17 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   ID                              1001 non-null   int64  
 1   Age                             721 non-null    float64
 2   Gender                          662 non-null    object 
 3   Height                          781 non-null    float64
 4   Weight                          684 non-null    float64
 5   CALC                            933 non-null    object 
 6   FAVC                            1001 non-null   object 
 7   FCVC                            859 non-null    float64
 8   NCP                             1001 non-null   float64
 9   SCC                             668 non-null    object 
 10  SMOKE                           1001 non-null   object 
 11  CH2O                            916 non-null    float64
 12  family_history_with_overweight  87

The dataset contains 2111 rows in total (1110 in the training set and 1001 in the test set). Each row represents one patient's information and must therefore be unique.
Apart from the patient ID (ID), all of the columns have null values. I will go column by column to see whether I can impute those values.
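
The row total and the per-column null counts can be summarized in one pass. This is a minimal sketch under stated assumptions: `toy_train` and `toy_test` are small invented stand-ins for `train.csv` and `test.csv`, and `combined_df` and `missing_share` are hypothetical names that are not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the two files (assumption: the real frames come from pd.read_csv).
toy_train = pd.DataFrame({
    "ID": [1, 2, 3],
    "Age": [21.0, np.nan, 27.0],
    "Gender": ["Female", None, "Male"],
    "NObeyesdad": ["Normal_Weight", "Normal_Weight", "Overweight_Level_I"],
})
toy_test = pd.DataFrame({
    "ID": [4, 5],
    "Age": [np.nan, 22.0],
    "Gender": ["Male", "Male"],
})

# Stack the rows; the target column does not exist in the test set, so it is NaN there.
combined_df = pd.concat([toy_train, toy_test], ignore_index=True)

# Percentage of missing values per column, largest first.
missing_share = combined_df.isna().mean().mul(100).round(2).sort_values(ascending=False)
print(len(combined_df))
print(missing_share)
```

Because the test set has no target column, `NObeyesdad` will always look partly missing after stacking, so it is probably best read separately from the feature columns.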

# **Cleaning Data**

In [27]:
columns_ob = train_set_df.columns
columns_ob

Index(['ID', 'Age', 'Gender', 'Height', 'Weight', 'CALC', 'FAVC', 'FCVC',
       'NCP', 'SCC', 'SMOKE', 'CH2O', 'family_history_with_overweight', 'FAF',
       'TUE', 'CAEC', 'MTRANS', 'NObeyesdad'],
      dtype='object')

## **Column "ID"**  
Because I have concatenated the train and test sets, I need to verify that there are no duplicates. Every patient ID must be unique.

In [29]:
len(train_set_df["ID"].unique())

1110
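
The uniqueness check can also be phrased as a direct yes/no test, with the offending rows listed when it fails. This is a sketch on invented data; `toy_df`, `ids_are_unique` and `repeated_id_rows` are hypothetical names.

```python
import pandas as pd

# Toy frame with one repeated ID (invented values).
toy_df = pd.DataFrame({"ID": [1, 2, 2, 4], "Age": [21.0, 22.0, 22.0, 30.0]})

# True only when every ID occurs exactly once.
ids_are_unique = toy_df["ID"].is_unique

# keep=False flags every occurrence of a repeated ID, not just the later ones.
repeated_id_rows = toy_df[toy_df["ID"].duplicated(keep=False)]
print(ids_are_unique)
print(repeated_id_rows)
```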

In [30]:
duplicates = train_set_df[train_set_df.duplicated(subset=columns_ob[1:], keep=False)]
duplicates

Unnamed: 0,ID,Age,Gender,Height,Weight,CALC,FAVC,FCVC,NCP,SCC,SMOKE,CH2O,family_history_with_overweight,FAF,TUE,CAEC,MTRANS,NObeyesdad
179,180,21.0,,1.62,70.0,Sometimes,yes,2.0,1.0,no,no,3.0,no,1.0,0.0,no,Public_Transportation,Overweight_Level_I
763,764,21.0,,1.62,70.0,Sometimes,yes,2.0,1.0,no,no,3.0,no,1.0,0.0,no,Public_Transportation,Overweight_Level_I


Although the two rows do not share an ID number, every other column is exactly the same.  
Because this is the only pair of rows with identical information, I will remove one of them.

In [31]:
train_set_df.drop(train_set_df.index[179], axis=0, inplace=True)
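
Dropping by position works here because only one duplicated pair exists. A possible alternative that does not depend on knowing the row position is `drop_duplicates` restricted to the non-ID columns. This is a sketch on invented data; `toy_df`, `feature_columns` and `deduplicated_df` are hypothetical names.

```python
import pandas as pd

# Toy frame: rows with ID 180 and 764 agree on every column except the ID.
toy_df = pd.DataFrame({
    "ID": [180, 181, 764],
    "Age": [21.0, 35.0, 21.0],
    "Weight": [70.0, 82.0, 70.0],
})

# Compare every column except the identifier.
feature_columns = [col for col in toy_df.columns if col != "ID"]

# Keep the first occurrence of each repeated feature combination.
deduplicated_df = toy_df.drop_duplicates(subset=feature_columns, keep="first")
print(deduplicated_df)
```

Note that `keep="first"` retains the earlier row, whereas the cell above removes the row at position 179, which is the earlier of the two. Passing `keep="last"` would reproduce that choice.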

## **Column "Age"**

In [32]:
miss_Age = train_set_df["Age"].isna().sum()

In [33]:
percentage_miss_age = round(miss_Age/len(train_set_df)*100, 2)
print(f"The missing Age represents {percentage_miss_age}% of the data.")

The missing Age represents 27.86% of the data.


This share is far too large: dropping these rows would discard more than a quarter of the training set, so I cannot simply remove them.

In [34]:
missing_age = train_set_df[train_set_df["Age"].isna()].copy()
# Boolean indexing already returns a new DataFrame, so changes made to
# missing_age do not propagate to train_set_df (and vice versa).
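
Whether edits to a filtered subset reach the original frame is worth checking rather than assuming. In this sketch on invented data (`toy_df` and `toy_missing` are hypothetical names), the subset is produced by boolean indexing, and the original frame is left unchanged after the subset is modified.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({"ID": [1, 2, 3], "Age": [21.0, np.nan, np.nan]})

# Boolean indexing builds a new frame; .copy() makes that independence explicit.
toy_missing = toy_df[toy_df["Age"].isna()].copy()

# Edit the subset only.
toy_missing["Age"] = 0.0

# The original still has its two missing ages.
print(toy_df["Age"].isna().sum())
print(toy_missing["Age"].tolist())
```

To write values back into the original frame, an assignment through `.loc` on that frame is the usual route.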

In [35]:
missing_age.info()

<class 'pandas.core.frame.DataFrame'>
Index: 309 entries, 2 to 1105
Data columns (total 18 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   ID                              309 non-null    int64  
 1   Age                             0 non-null      float64
 2   Gender                          216 non-null    object 
 3   Height                          240 non-null    float64
 4   Weight                          213 non-null    float64
 5   CALC                            300 non-null    object 
 6   FAVC                            173 non-null    object 
 7   FCVC                            309 non-null    float64
 8   NCP                             309 non-null    float64
 9   SCC                             309 non-null    object 
 10  SMOKE                           219 non-null    object 
 11  CH2O                            309 non-null    float64
 12  family_history_with_overweight  265 non-

In [36]:
print(missing_age["NObeyesdad"].value_counts(normalize=True))

NObeyesdad
Overweight_Level_I     0.265372
Insufficient_Weight    0.252427
Normal_Weight          0.245955
Overweight_Level_II    0.174757
Obesity_Type_I         0.035599
Obesity_Type_III       0.019417
Obesity_Type_II        0.006472
Name: proportion, dtype: float64


In [37]:
train_set_df["Age"].describe()

count    800.000000
mean      22.872886
std        6.386400
min       14.000000
25%       19.000000
50%       21.000000
75%       23.611663
max       55.246250
Name: Age, dtype: float64

In [40]:
non_miss_Age = train_set_df[~(train_set_df["Age"].isna())]
non_miss_Age["NObeyesdad"].value_counts(normalize=True)

NObeyesdad
Normal_Weight          0.26375
Overweight_Level_I     0.25875
Insufficient_Weight    0.24250
Overweight_Level_II    0.17125
Obesity_Type_I         0.04500
Obesity_Type_II        0.01125
Obesity_Type_III       0.00750
Name: proportion, dtype: float64
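
The two proportion tables above (rows with and without an Age value) are easier to compare side by side. One way to do that, sketched on invented data, is a row-normalized cross-tabulation; `toy_df`, `age_status` and `class_share` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy data: two rows lack an Age, four rows have one (invented values).
toy_df = pd.DataFrame({
    "Age": [21.0, np.nan, 30.0, np.nan, 25.0, 19.0],
    "NObeyesdad": ["Normal_Weight", "Normal_Weight", "Obesity_Type_I",
                   "Overweight_Level_I", "Normal_Weight", "Overweight_Level_I"],
})

# Label each row by whether its Age is missing.
age_status = pd.Series(
    np.where(toy_df["Age"].isna(), "missing", "present"),
    index=toy_df.index,
    name="age_status",
)

# Row-normalized table: one row per group, each row summing to 1.
class_share = pd.crosstab(age_status, toy_df["NObeyesdad"], normalize="index")
print(class_share.round(3))
```

If the two rows of such a table look similar, the missingness of Age is less likely to be tied to the obesity class, though this is only an informal check and not a statistical test.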

In [8]:
mental_health_df = pd.read_csv("train_mental_heath.csv")
mental_health_df.head()

Unnamed: 0,id,Name,Gender,Age,City,Working Professional or Student,Profession,Academic Pressure,Work Pressure,CGPA,Study Satisfaction,Job Satisfaction,Sleep Duration,Dietary Habits,Degree,Have you ever had suicidal thoughts ?,Work/Study Hours,Financial Stress,Family History of Mental Illness,Depression
0,0,Aaradhya,Female,49.0,Ludhiana,Working Professional,Chef,,5.0,,,2.0,More than 8 hours,Healthy,BHM,No,1.0,2.0,No,0
1,1,Vivan,Male,26.0,Varanasi,Working Professional,Teacher,,4.0,,,3.0,Less than 5 hours,Unhealthy,LLB,Yes,7.0,3.0,No,1
2,2,Yuvraj,Male,33.0,Visakhapatnam,Student,,5.0,,8.97,2.0,,5-6 hours,Healthy,B.Pharm,Yes,3.0,1.0,No,1
3,3,Yuvraj,Male,22.0,Mumbai,Working Professional,Teacher,,5.0,,,1.0,Less than 5 hours,Moderate,BBA,Yes,10.0,1.0,Yes,1
4,4,Rhea,Female,30.0,Kanpur,Working Professional,Business Analyst,,1.0,,,1.0,5-6 hours,Unhealthy,BBA,Yes,9.0,4.0,Yes,0


In [9]:
mental_health_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 140700 entries, 0 to 140699
Data columns (total 20 columns):
 #   Column                                 Non-Null Count   Dtype  
---  ------                                 --------------   -----  
 0   id                                     140700 non-null  int64  
 1   Name                                   140700 non-null  object 
 2   Gender                                 140700 non-null  object 
 3   Age                                    140700 non-null  float64
 4   City                                   140700 non-null  object 
 5   Working Professional or Student        140700 non-null  object 
 6   Profession                             104070 non-null  object 
 7   Academic Pressure                      27897 non-null   float64
 8   Work Pressure                          112782 non-null  float64
 9   CGPA                                   27898 non-null   float64
 10  Study Satisfaction                     27897 non-null   