### <b> Step 1: Importing Required Libraries </b>
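We import the Python libraries used for data manipulation and visualization:  
- **Pandas** for handling datasets.  
- **NumPy** for numerical computations.  
- **Matplotlib** and **Seaborn** for static data visualizations.  
- **Plotly Express** for interactive charts.
- **KaggleHub** for downloading the dataset.

We also silence warnings with `warnings.filterwarnings('ignore')` and define a `colors` palette for later plots.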


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import kagglehub
import warnings
warnings.filterwarnings('ignore')

colors = ["#89CFF0", "#FF69B4", "#FFD700", "#7B68EE", "#FF4500",
          "#9370DB", "#32CD32", "#8A2BE2", "#AE4532", "#20B2AA",
          "#FF69B4", "#00CED1", "#FF7F50", "#7FFF00", "#DA70D6"]



### <b> Step 2: Downloading the Dataset from Kaggle </b>

In this step, we use the **KaggleHub** library to download the dataset directly from Kaggle.  
- `kagglehub.dataset_download("kamilpytlak/personal-key-indicators-of-heart-disease")` fetches the dataset titled *"Personal Key Indicators of Heart Disease"* from Kaggle and stores it locally.  
- The variable `path` stores the local directory path where the dataset files are downloaded.  
- `print("Path to dataset files:", path)` displays the location so we can easily access and load the data in the next steps.


In [3]:
path = kagglehub.dataset_download("kamilpytlak/personal-key-indicators-of-heart-disease")

print("Path to dataset files:", path)

Path to dataset files: C:\Users\Lenovo\.cache\kagglehub\datasets\kamilpytlak\personal-key-indicators-of-heart-disease\versions\6


In [4]:
import os
Data_path = os.path.join(path, '2022', 'heart_2022_with_nans.csv')  # portable path separators
Data = pd.read_csv(Data_path)

### <b> Step 3: Exploring the Dataset </b>
We start exploring the dataset to understand its structure and content.  
- `Data.head()` → displays the first few rows.  
- `Data.info()` → shows data types and non-null counts.  
- `Data.describe()` → gives statistical summaries of numerical columns.
- `Data.columns` → lists the names of all columns (features) in the dataset.
- `Data.isna().sum()` → counts the missing values in each column.

In [5]:
Data.head()

Unnamed: 0,State,Sex,GeneralHealth,PhysicalHealthDays,MentalHealthDays,LastCheckupTime,PhysicalActivities,SleepHours,RemovedTeeth,HadHeartAttack,...,HeightInMeters,WeightInKilograms,BMI,AlcoholDrinkers,HIVTesting,FluVaxLast12,PneumoVaxEver,TetanusLast10Tdap,HighRiskLastYear,CovidPos
0,Alabama,Female,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,No,8.0,,No,...,,,,No,No,Yes,No,"Yes, received tetanus shot but not sure what type",No,No
1,Alabama,Female,Excellent,0.0,0.0,,No,6.0,,No,...,1.6,68.04,26.57,No,No,No,No,"No, did not receive any tetanus shot in the pa...",No,No
2,Alabama,Female,Very good,2.0,3.0,Within past year (anytime less than 12 months ...,Yes,5.0,,No,...,1.57,63.5,25.61,No,No,No,No,,No,Yes
3,Alabama,Female,Excellent,0.0,0.0,Within past year (anytime less than 12 months ...,Yes,7.0,,No,...,1.65,63.5,23.3,No,No,Yes,Yes,"No, did not receive any tetanus shot in the pa...",No,No
4,Alabama,Female,Fair,2.0,0.0,Within past year (anytime less than 12 months ...,Yes,9.0,,No,...,1.57,53.98,21.77,Yes,No,No,Yes,"No, did not receive any tetanus shot in the pa...",No,No


In [6]:
Data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 445132 entries, 0 to 445131
Data columns (total 40 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   State                      445132 non-null  object 
 1   Sex                        445132 non-null  object 
 2   GeneralHealth              443934 non-null  object 
 3   PhysicalHealthDays         434205 non-null  float64
 4   MentalHealthDays           436065 non-null  float64
 5   LastCheckupTime            436824 non-null  object 
 6   PhysicalActivities         444039 non-null  object 
 7   SleepHours                 439679 non-null  float64
 8   RemovedTeeth               433772 non-null  object 
 9   HadHeartAttack             442067 non-null  object 
 10  HadAngina                  440727 non-null  object 
 11  HadStroke                  443575 non-null  object 
 12  HadAsthma                  443359 non-null  object 
 13  HadSkinCancer              44

In [7]:
Data.describe()

Unnamed: 0,PhysicalHealthDays,MentalHealthDays,SleepHours,HeightInMeters,WeightInKilograms,BMI
count,434205.0,436065.0,439679.0,416480.0,403054.0,396326.0
mean,4.347919,4.382649,7.022983,1.702691,83.07447,28.529842
std,8.688912,8.387475,1.502425,0.107177,21.448173,6.554889
min,0.0,0.0,1.0,0.91,22.68,12.02
25%,0.0,0.0,6.0,1.63,68.04,24.13
50%,0.0,0.0,7.0,1.7,80.74,27.44
75%,3.0,5.0,8.0,1.78,95.25,31.75
max,30.0,30.0,24.0,2.41,292.57,99.64


In [8]:
Data.columns

Index(['State', 'Sex', 'GeneralHealth', 'PhysicalHealthDays',
       'MentalHealthDays', 'LastCheckupTime', 'PhysicalActivities',
       'SleepHours', 'RemovedTeeth', 'HadHeartAttack', 'HadAngina',
       'HadStroke', 'HadAsthma', 'HadSkinCancer', 'HadCOPD',
       'HadDepressiveDisorder', 'HadKidneyDisease', 'HadArthritis',
       'HadDiabetes', 'DeafOrHardOfHearing', 'BlindOrVisionDifficulty',
       'DifficultyConcentrating', 'DifficultyWalking',
       'DifficultyDressingBathing', 'DifficultyErrands', 'SmokerStatus',
       'ECigaretteUsage', 'ChestScan', 'RaceEthnicityCategory', 'AgeCategory',
       'HeightInMeters', 'WeightInKilograms', 'BMI', 'AlcoholDrinkers',
       'HIVTesting', 'FluVaxLast12', 'PneumoVaxEver', 'TetanusLast10Tdap',
       'HighRiskLastYear', 'CovidPos'],
      dtype='object')

In [9]:
Data.isna().sum()

State                            0
Sex                              0
GeneralHealth                 1198
PhysicalHealthDays           10927
MentalHealthDays              9067
LastCheckupTime               8308
PhysicalActivities            1093
SleepHours                    5453
RemovedTeeth                 11360
HadHeartAttack                3065
HadAngina                     4405
HadStroke                     1557
HadAsthma                     1773
HadSkinCancer                 3143
HadCOPD                       2219
HadDepressiveDisorder         2812
HadKidneyDisease              1926
HadArthritis                  2633
HadDiabetes                   1087
DeafOrHardOfHearing          20647
BlindOrVisionDifficulty      21564
DifficultyConcentrating      24240
DifficultyWalking            24012
DifficultyDressingBathing    23915
DifficultyErrands            25656
SmokerStatus                 35462
ECigaretteUsage              35660
ChestScan                    56046
RaceEthnicityCategor

### <b> Step 4: Splitting the Dataset into Logical Feature Groups </b>

In this step, we organize the dataset into multiple smaller DataFrames based on feature categories.  
This makes the analysis, cleaning, and visualization process more structured and manageable, especially since the dataset has many columns.

- **`Data_Demographics`** →
  *These features describe personal and social background.*

- **`Data_lifestyle`** →
  *Used to study how daily habits affect health risks.*

- **`Data_general`** → 
  *Helps understand overall well-being and recent checkup frequency.*

- **`Data_disease`** → 
  *These columns will be key predictors for heart disease risk.*

- **`Data_disability`** → 
  *Represents functional limitations that may correlate with health status.*

- **`Data_body`** → 
  *Useful for identifying patterns related to obesity or general physical condition.*

- **`Data_vaccine`** →  *May help assess preventive healthcare behavior.*

- **`target`** →  *These are the outcomes that predictive models can be trained on later.*

<b> Overall, this segmentation improves clarity, reduces complexity, and prepares the dataset for focused EDA and modeling. </b>


In [10]:
Data_Demographics=Data[['State','Sex','RaceEthnicityCategory','AgeCategory']]
Data_lifestyle = Data[['PhysicalActivities', 'SleepHours', 'SmokerStatus', 'AlcoholDrinkers', 'ECigaretteUsage']]
Data_general = Data[['GeneralHealth', 'PhysicalHealthDays', 'MentalHealthDays', 'LastCheckupTime']]
Data_disease = Data[['HadHeartAttack','HadAngina','HadStroke','HadAsthma','HadSkinCancer','HadCOPD','HadDepressiveDisorder','HadKidneyDisease','HadArthritis','HadDiabetes']]
Data_disability = Data[['DeafOrHardOfHearing','BlindOrVisionDifficulty','DifficultyConcentrating','DifficultyWalking','DifficultyDressingBathing','DifficultyErrands']]
Data_body = Data[['HeightInMeters','WeightInKilograms','BMI','RemovedTeeth']]
Data_vaccine = Data[['ChestScan','HIVTesting','FluVaxLast12','PneumoVaxEver','TetanusLast10Tdap','HighRiskLastYear']]
target = Data[['HadHeartAttack','CovidPos']]
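One way to keep such a split easy to audit is to hold the groups in a dictionary and check that every column is assigned exactly once. The sketch below uses a small made-up frame and hypothetical group names; it is an illustration of the idea, not the notebook's actual grouping (which, for instance, places `HadHeartAttack` in both `Data_disease` and `target`).

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only)
toy = pd.DataFrame({
    "State": ["Alabama", "Utah"],
    "Sex": ["Female", "Male"],
    "SleepHours": [8.0, 7.0],
    "BMI": [26.57, 23.3],
})

# Hypothetical group definitions: group name -> list of column names
feature_groups = {
    "demographics": ["State", "Sex"],
    "lifestyle": ["SleepHours"],
    "body": ["BMI"],
}

# Flatten the group lists, then look for columns that were left out
# or listed in more than one group
assigned = [col for cols in feature_groups.values() for col in cols]
unassigned = sorted(set(toy.columns) - set(assigned))
duplicated = sorted({col for col in assigned if assigned.count(col) > 1})

# Build one sub-DataFrame per group
subsets = {name: toy[cols] for name, cols in feature_groups.items()}

print("Unassigned columns:", unassigned)
print("Columns in several groups:", duplicated)
```

A non-empty `duplicated` list is not necessarily an error, but it makes deliberate overlaps visible.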

### <b> Detecting Outliers Using IQR </b>

This function identifies outliers in a numeric column using the **Interquartile Range (IQR)** method.

- Calculates the **25th percentile (Q1)** and **75th percentile (Q3)**.  
- Computes the **IQR** as Q3 − Q1.  
- Defines outliers as values below Q1 − 1.5×IQR or above Q3 + 1.5×IQR.  
- Returns only the rows containing these outlier values.

**Purpose:**  
<b>To detect extreme values that may distort analysis or modeling results.</b>


In [11]:
def detect_outliers(df, column):
    """
    Detect outliers in a specific column using the IQR method.
    Returns only the rows that contain outliers.
    """
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
    return outliers
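
As a quick sanity check, the same IQR rule can be applied to a small made-up column. The values below are invented for illustration; with them, Q1 = 6 and Q3 = 8, so the fences fall at 3 and 11.

```python
import pandas as pd

def detect_outliers(df, column):
    """Return the rows whose value in `column` lies outside the 1.5*IQR fences."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    return df[(df[column] < lower_bound) | (df[column] > upper_bound)]

# Invented sleep durations: most are plausible, two are extreme
toy = pd.DataFrame({"SleepHours": [1, 6, 6, 7, 7, 7, 8, 8, 24]})

outliers = detect_outliers(toy, "SleepHours")
print(sorted(outliers["SleepHours"].tolist()))  # values below 3 or above 11
```

Only the values 1 and 24 fall outside the fences here, so those two rows are returned.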

### <b> Imputing Missing Values Using Distribution </b>

This function fills missing values in a column while preserving the original distribution of categories.  

- It first identifies missing entries in the specified column.  
- If there are no missing values, it returns the column unchanged.  
- Otherwise, it calculates the relative frequency (probability) of each category.  
- Missing values are then randomly assigned based on these probabilities, ensuring the distribution remains consistent.  

**Purpose:**  
<b>To handle missing categorical data without distorting the dataset’s natural proportions. </b>

In [12]:
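def fill_column_by_distribution(df, col_name):
    """
    Fill missing values in df[col_name] by sampling from the observed
    category frequencies. Modifies df in place and returns the column.
    """
    mask = df[col_name].isna()

    if mask.sum() == 0:
        return df[col_name]

    # Relative frequency of each observed category (NaN excluded)
    probs = df[col_name].value_counts(normalize=True)
    categories = probs.index
    weights = probs.values

    # Assign through df.loc so the DataFrame itself is updated,
    # rather than a possibly detached copy of the column
    df.loc[mask, col_name] = np.random.choice(categories, size=mask.sum(), p=weights)
    return df[col_name]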
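
Because the sampling is random, results differ between runs unless a seed is fixed. The sketch below is a hypothetical, non-mutating variant (the name `fill_by_distribution` and the `seed` parameter are assumptions, not part of the notebook) that works on a single Series and uses NumPy's `default_rng` for reproducibility.

```python
import numpy as np
import pandas as pd

def fill_by_distribution(series, seed=None):
    """Return a copy of `series` with NaNs replaced by draws from the
    observed category frequencies. The input is left untouched."""
    rng = np.random.default_rng(seed)
    mask = series.isna()
    if not mask.any():
        return series.copy()

    # Observed relative frequencies (NaN is excluded by value_counts)
    probs = series.value_counts(normalize=True)
    draws = rng.choice(probs.index.to_numpy(), size=int(mask.sum()), p=probs.to_numpy())

    filled = series.copy()
    filled.loc[mask] = draws
    return filled

# Invented example: 10 observed answers and 5 missing ones
toy = pd.Series(["Good"] * 6 + ["Fair"] * 3 + ["Poor"] * 1 + [None] * 5, dtype="object")
filled = fill_by_distribution(toy, seed=0)

print(filled.isna().sum())        # no missing values remain
print(filled.value_counts(normalize=True))
```

Returning a new Series keeps the original data available for comparison, at the cost of one extra copy in memory.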


### Step: Cleaning <b>Demographic Data</b>

We checked the demographic features for missing values and inconsistencies.  
Missing `RaceEthnicityCategory` values were replaced with the most frequent category.  
Missing `AgeCategory` values were filled while keeping the original distribution.  
<b> Now, the demographic data is complete, consistent, and ready for analysis.</b>


In [13]:
Data_Demographics.sample(5)

Unnamed: 0,State,Sex,RaceEthnicityCategory,AgeCategory
136406,Kentucky,Male,"White only, Non-Hispanic",Age 35 to 39
110171,Indiana,Female,"White only, Non-Hispanic",Age 50 to 54
364866,Utah,Male,"White only, Non-Hispanic",Age 65 to 69
81021,Georgia,Male,"White only, Non-Hispanic",Age 40 to 44
280924,North Dakota,Female,"White only, Non-Hispanic",Age 65 to 69


In [14]:
Data_Demographics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 445132 entries, 0 to 445131
Data columns (total 4 columns):
 #   Column                 Non-Null Count   Dtype 
---  ------                 --------------   ----- 
 0   State                  445132 non-null  object
 1   Sex                    445132 non-null  object
 2   RaceEthnicityCategory  431075 non-null  object
 3   AgeCategory            436053 non-null  object
dtypes: object(4)
memory usage: 13.6+ MB


In [15]:
Data_Demographics.describe()

Unnamed: 0,State,Sex,RaceEthnicityCategory,AgeCategory
count,445132,445132,431075,436053
unique,54,2,5,13
top,Washington,Female,"White only, Non-Hispanic",Age 65 to 69
freq,26152,235893,320421,47099


In [16]:
Data_Demographics.isna().sum()

State                        0
Sex                          0
RaceEthnicityCategory    14057
AgeCategory               9079
dtype: int64

In [17]:
Data['RaceEthnicityCategory'] = Data['RaceEthnicityCategory'].fillna(Data['RaceEthnicityCategory'].mode()[0])

In [18]:
probs=Data['AgeCategory'].value_counts(normalize=True)
probs

AgeCategory
Age 65 to 69       0.108012
Age 60 to 64       0.102077
Age 70 to 74       0.099694
Age 55 to 59       0.084442
Age 80 or older    0.083134
Age 50 to 54       0.077156
Age 75 to 79       0.074574
Age 40 to 44       0.068666
Age 45 to 49       0.065430
Age 35 to 39       0.065419
Age 18 to 24       0.061784
Age 30 to 34       0.059183
Age 25 to 29       0.050430
Name: proportion, dtype: float64

In [19]:
fill_column_by_distribution(Data,'AgeCategory')

0         Age 80 or older
1         Age 80 or older
2            Age 55 to 59
3            Age 50 to 54
4            Age 40 to 44
               ...       
445127       Age 18 to 24
445128       Age 50 to 54
445129       Age 65 to 69
445130       Age 70 to 74
445131       Age 40 to 44
Name: AgeCategory, Length: 445132, dtype: object

In [20]:
Data[['AgeCategory','RaceEthnicityCategory']].isna().sum()

AgeCategory              0
RaceEthnicityCategory    0
dtype: int64
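
The two imputation strategies used in this step behave differently: filling with the mode increases the share of the most frequent category, whereas distribution-based sampling keeps the proportions roughly unchanged. A minimal sketch with invented values shows the shift caused by mode imputation.

```python
import pandas as pd

# Invented column: 6 x "A", 2 x "B", 2 missing
toy = pd.Series(["A"] * 6 + ["B"] * 2 + [None] * 2, dtype="object")

# Share of "A" among the observed (non-missing) values
share_before = toy.value_counts(normalize=True)["A"]

# Mode imputation: every missing value becomes "A"
mode_filled = toy.fillna(toy.mode()[0])
share_after = mode_filled.value_counts(normalize=True)["A"]

print(round(share_before, 2), round(share_after, 2))
```

Here the share of `"A"` rises from 6/8 to 8/10. The larger the missing fraction, the larger this shift becomes.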

### Step: Cleaning <b>Lifestyle Data</b>

In this step, we focused on ensuring the lifestyle-related features are complete and accurate.  
We started by checking `Data_lifestyle` for missing values and inspecting its overall structure.  
Rows with missing entries in `PhysicalActivities`, `SleepHours`, `SmokerStatus`, `AlcoholDrinkers`, or `ECigaretteUsage` were then removed from `Data` to maintain data integrity.  
Next, we identified extreme values in `SleepHours` using the <b>IQR method</b> and removed these outliers from `Data` to avoid skewing the analysis.  
After these steps, the lifestyle columns in `Data` are complete and free of `SleepHours` outliers. Note that `Data_lifestyle` was created before cleaning and still holds the original, uncleaned rows.


In [21]:
Data_lifestyle.sample(5)

Unnamed: 0,PhysicalActivities,SleepHours,SmokerStatus,AlcoholDrinkers,ECigaretteUsage
16545,No,8.0,Current smoker - now smokes every day,No,Never used e-cigarettes in my entire life
202811,Yes,7.0,Never smoked,Yes,Never used e-cigarettes in my entire life
239493,Yes,7.0,Never smoked,No,Never used e-cigarettes in my entire life
212468,Yes,7.0,Former smoker,Yes,Never used e-cigarettes in my entire life
90091,Yes,6.0,Never smoked,Yes,Never used e-cigarettes in my entire life


In [22]:
Data_lifestyle.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 445132 entries, 0 to 445131
Data columns (total 5 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   PhysicalActivities  444039 non-null  object 
 1   SleepHours          439679 non-null  float64
 2   SmokerStatus        409670 non-null  object 
 3   AlcoholDrinkers     398558 non-null  object 
 4   ECigaretteUsage     409472 non-null  object 
dtypes: float64(1), object(4)
memory usage: 17.0+ MB


In [23]:
display(Data_lifestyle.describe(include=object),Data_lifestyle.describe())

Unnamed: 0,PhysicalActivities,SmokerStatus,AlcoholDrinkers,ECigaretteUsage
count,444039,409670,398558,409472
unique,2,4,2,4
top,Yes,Never smoked,Yes,Never used e-cigarettes in my entire life
freq,337559,245955,210891,311988


Unnamed: 0,SleepHours
count,439679.0
mean,7.022983
std,1.502425
min,1.0
25%,6.0
50%,7.0
75%,8.0
max,24.0


In [24]:
Data_lifestyle.isna().sum()

PhysicalActivities     1093
SleepHours             5453
SmokerStatus          35462
AlcoholDrinkers       46574
ECigaretteUsage       35660
dtype: int64

In [25]:
Data.dropna(subset=['PhysicalActivities','SleepHours','SmokerStatus','AlcoholDrinkers','ECigaretteUsage'],inplace=True)

In [26]:
Data[['SleepHours','PhysicalActivities','SmokerStatus','AlcoholDrinkers','ECigaretteUsage']].isna().sum()

SleepHours            0
PhysicalActivities    0
SmokerStatus          0
AlcoholDrinkers       0
ECigaretteUsage       0
dtype: int64

In [27]:
outliers = detect_outliers(Data, 'SleepHours')
Data = Data.drop(outliers.index)
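
Selecting a list of columns returns a separate DataFrame, so subsets created before cleaning do not follow later row removals in the main frame. The sketch below, on invented data with hypothetical variable names, shows the effect and the simple remedy of re-deriving the subset after cleaning.

```python
import pandas as pd

data = pd.DataFrame({
    "SleepHours": [7.0, None, 8.0],
    "SmokerStatus": ["Never smoked", "Former smoker", None],
})
lifestyle_cols = ["SleepHours", "SmokerStatus"]

# Snapshot taken before cleaning
lifestyle_before = data[lifestyle_cols]

# Clean the main frame: drop rows with any missing lifestyle value
data = data.dropna(subset=lifestyle_cols)

# Re-derive the subset so it reflects the cleaned rows
lifestyle_after = data[lifestyle_cols]

print(len(lifestyle_before), len(lifestyle_after))
```

The earlier snapshot still has all three rows, while the re-derived subset has only the complete one.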

### Step: Cleaning <b>General Health Features</b>

We reviewed the general health data for completeness and accuracy.  
Rows with missing `PhysicalHealthDays` or `MentalHealthDays` were dropped.  
Missing values in `LastCheckupTime` were replaced with the most common category.  
Missing entries in `GeneralHealth` were filled using probability-based sampling to maintain the original distribution.  
These columns in `Data` are now complete and ready for further analysis.


In [28]:
Data_general.sample(5)

Unnamed: 0,GeneralHealth,PhysicalHealthDays,MentalHealthDays,LastCheckupTime
188195,Fair,0.0,30.0,Within past 5 years (2 years but less than 5 y...
434010,Fair,0.0,0.0,Within past year (anytime less than 12 months ...
200915,Excellent,0.0,0.0,5 or more years ago
228337,Good,1.0,0.0,Within past year (anytime less than 12 months ...
59303,Good,0.0,0.0,Within past year (anytime less than 12 months ...


In [29]:
Data_general.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 445132 entries, 0 to 445131
Data columns (total 4 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   GeneralHealth       443934 non-null  object 
 1   PhysicalHealthDays  434205 non-null  float64
 2   MentalHealthDays    436065 non-null  float64
 3   LastCheckupTime     436824 non-null  object 
dtypes: float64(2), object(2)
memory usage: 13.6+ MB


In [30]:
display(Data_general.describe(include=object),Data_general.describe())

Unnamed: 0,GeneralHealth,LastCheckupTime
count,443934,436824
unique,5,4
top,Very good,Within past year (anytime less than 12 months ...
freq,148444,350944


Unnamed: 0,PhysicalHealthDays,MentalHealthDays
count,434205.0,436065.0
mean,4.347919,4.382649
std,8.688912,8.387475
min,0.0,0.0
25%,0.0,0.0
50%,0.0,0.0
75%,3.0,5.0
max,30.0,30.0


In [31]:
Data_general.isna().sum()

GeneralHealth          1198
PhysicalHealthDays    10927
MentalHealthDays       9067
LastCheckupTime        8308
dtype: int64

In [32]:
Data.dropna(subset=['MentalHealthDays','PhysicalHealthDays'], inplace=True)

In [33]:
Data[['MentalHealthDays','PhysicalHealthDays']].isna().sum()

MentalHealthDays      0
PhysicalHealthDays    0
dtype: int64

In [34]:
Data['GeneralHealth'].value_counts(normalize=True)

GeneralHealth
Very good    0.346887
Good         0.321439
Excellent    0.163782
Fair         0.128635
Poor         0.039256
Name: proportion, dtype: float64

In [35]:
Data['LastCheckupTime'] = Data['LastCheckupTime'].fillna(Data['LastCheckupTime'].mode()[0])
fill_column_by_distribution(Data,'GeneralHealth')

0         Very good
1         Excellent
2         Very good
3         Excellent
4              Fair
            ...    
445124         Good
445126         Good
445128    Excellent
445130    Very good
445131    Very good
Name: GeneralHealth, Length: 371174, dtype: object

In [36]:
Data[['LastCheckupTime','GeneralHealth']].isna().sum()

LastCheckupTime    0
GeneralHealth      0
dtype: int64

### Step: Cleaning <b>Disease Features</b>

We reviewed the disease-related columns for missing values.  
Rows with missing data in these columns were removed, as the proportion of nulls was small (at most about 1.2% per column).  
`Yes`/`No` responses were converted to numeric values (1/0). For `HadDiabetes`, which has four categories, `No, pre-diabetes or borderline diabetes` was mapped to 0 and `Yes, but only during pregnancy (female)` to 1.  
All disease features are now complete, numeric, and ready for modeling.


In [37]:
Data_disease.sample(5)

Unnamed: 0,HadHeartAttack,HadAngina,HadStroke,HadAsthma,HadSkinCancer,HadCOPD,HadDepressiveDisorder,HadKidneyDisease,HadArthritis,HadDiabetes
184449,No,No,No,No,No,No,No,No,Yes,Yes
279290,Yes,Yes,No,No,No,No,No,No,Yes,Yes
389980,No,No,No,No,No,No,No,No,No,No
215257,No,No,No,No,No,No,No,No,No,No
342732,No,No,No,No,No,No,No,No,Yes,No


In [38]:
Data_disease.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 445132 entries, 0 to 445131
Data columns (total 10 columns):
 #   Column                 Non-Null Count   Dtype 
---  ------                 --------------   ----- 
 0   HadHeartAttack         442067 non-null  object
 1   HadAngina              440727 non-null  object
 2   HadStroke              443575 non-null  object
 3   HadAsthma              443359 non-null  object
 4   HadSkinCancer          441989 non-null  object
 5   HadCOPD                442913 non-null  object
 6   HadDepressiveDisorder  442320 non-null  object
 7   HadKidneyDisease       443206 non-null  object
 8   HadArthritis           442499 non-null  object
 9   HadDiabetes            444045 non-null  object
dtypes: object(10)
memory usage: 34.0+ MB


In [39]:
Data_disease.describe()

Unnamed: 0,HadHeartAttack,HadAngina,HadStroke,HadAsthma,HadSkinCancer,HadCOPD,HadDepressiveDisorder,HadKidneyDisease,HadArthritis,HadDiabetes
count,442067,440727,443575,443359,441989,442913,442320,443206,442499,444045
unique,2,2,2,2,2,2,2,2,2,4
top,No,No,No,No,No,No,No,No,No,No
freq,416959,414176,424336,376665,406504,407257,350910,422891,291351,368722


In [40]:
Data['HadDiabetes'].value_counts(dropna=False)

HadDiabetes
No                                         309016
Yes                                         49828
No, pre-diabetes or borderline diabetes      8531
Yes, but only during pregnancy (female)      3194
NaN                                           605
Name: count, dtype: int64

In [41]:
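# Note: the counts come from Data_disease (the unfiltered snapshot), while the
# denominator is len(Data) (already filtered), so these figures are approximate
missing_count = Data_disease.isna().sum()
missing_percentage = ((missing_count / len(Data)) * 100).round(3)
missing_summary = pd.DataFrame({'MissingCount': missing_count,'MissingPercentage': missing_percentage})
missing_summary[missing_summary['MissingCount'] > 0].sort_values(by='MissingPercentage', ascending=False)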

Unnamed: 0,MissingCount,MissingPercentage
HadAngina,4405,1.187
HadSkinCancer,3143,0.847
HadHeartAttack,3065,0.826
HadDepressiveDisorder,2812,0.758
HadArthritis,2633,0.709
HadCOPD,2219,0.598
HadKidneyDisease,1926,0.519
HadAsthma,1773,0.478
HadStroke,1557,0.419
HadDiabetes,1087,0.293
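
A missing-value summary is easiest to interpret when the counts and the denominator come from the same DataFrame. The helper below is a hypothetical sketch (the function name and the toy data are assumptions) that computes both from a single frame.

```python
import pandas as pd

def missing_summary(df):
    """Count and percentage of missing values per column, computed on the
    same frame, keeping only columns that have at least one missing value."""
    count = df.isna().sum()
    pct = (count / len(df) * 100).round(3)
    out = pd.DataFrame({"MissingCount": count, "MissingPercentage": pct})
    return out[out["MissingCount"] > 0].sort_values("MissingPercentage", ascending=False)

# Invented answers for three disease columns
toy = pd.DataFrame({
    "HadAngina": ["No", None, None, "Yes"],
    "HadStroke": ["No", "No", None, "No"],
    "HadAsthma": ["No", "No", "No", "No"],
})

summary = missing_summary(toy)
print(summary)
```

Passing the current state of the main frame, restricted to the columns of interest, gives percentages that match the rows actually remaining.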


In [42]:
columns = ['HadAngina','HadSkinCancer','HadHeartAttack','HadDepressiveDisorder',
           'HadArthritis','HadCOPD','HadKidneyDisease','HadAsthma','HadStroke','HadDiabetes']
Data.dropna(subset=columns, inplace=True)

In [43]:
Data[columns].isna().sum()

HadAngina                0
HadSkinCancer            0
HadHeartAttack           0
HadDepressiveDisorder    0
HadArthritis             0
HadCOPD                  0
HadKidneyDisease         0
HadAsthma                0
HadStroke                0
HadDiabetes              0
dtype: int64

In [44]:
Data[columns] = Data[columns].replace({'Yes':1,'No':0})
Data['HadDiabetes'] = Data['HadDiabetes'].replace({'No, pre-diabetes or borderline diabetes': 0,'Yes, but only during pregnancy (female)': 1})
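
An alternative way to express this encoding is a single explicit mapping applied with `Series.map`, followed by a check that no category was left unmapped. The sketch below uses invented values; the variable names are assumptions, and the 0/1 assignments mirror the ones chosen in the cell above.

```python
import pandas as pd

# One explicit mapping for all four HadDiabetes categories
diabetes_map = {
    "No": 0,
    "Yes": 1,
    "No, pre-diabetes or borderline diabetes": 0,
    "Yes, but only during pregnancy (female)": 1,
}

toy = pd.Series([
    "No",
    "Yes",
    "No, pre-diabetes or borderline diabetes",
    "Yes, but only during pregnancy (female)",
])

# map() yields NaN for any value that is not a key of the mapping,
# which makes unexpected categories easy to detect
encoded = toy.map(diabetes_map)
unmapped = toy[encoded.isna()].unique()
if len(unmapped) > 0:
    raise ValueError(f"Unmapped categories: {list(unmapped)}")

encoded = encoded.astype("int64")
print(encoded.tolist())
```

Unlike `replace`, which silently leaves unknown strings in place, this approach fails loudly if a new category appears in the data.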

In [45]:
Data[columns]

Unnamed: 0,HadAngina,HadSkinCancer,HadHeartAttack,HadDepressiveDisorder,HadArthritis,HadCOPD,HadKidneyDisease,HadAsthma,HadStroke,HadDiabetes
0,0,0,0,0,0,0,0,0,0,1
1,0,1,0,0,0,0,0,0,0,0
2,0,1,0,0,0,0,0,0,0,0
3,0,0,0,0,1,0,0,1,0,0
4,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...
445124,0,0,0,0,1,0,0,0,1,1
445126,0,0,0,0,0,0,0,0,0,0
445128,0,0,0,0,0,0,0,0,0,0
445130,0,0,1,0,0,0,0,1,0,0


In [46]:
(Data[columns]).info()

<class 'pandas.core.frame.DataFrame'>
Index: 357944 entries, 0 to 445131
Data columns (total 10 columns):
 #   Column                 Non-Null Count   Dtype
---  ------                 --------------   -----
 0   HadAngina              357944 non-null  int64
 1   HadSkinCancer          357944 non-null  int64
 2   HadHeartAttack         357944 non-null  int64
 3   HadDepressiveDisorder  357944 non-null  int64
 4   HadArthritis           357944 non-null  int64
 5   HadCOPD                357944 non-null  int64
 6   HadKidneyDisease       357944 non-null  int64
 7   HadAsthma              357944 non-null  int64
 8   HadStroke              357944 non-null  int64
 9   HadDiabetes            357944 non-null  int64
dtypes: int64(10)
memory usage: 30.0 MB


In [47]:
Data['HadDiabetes'].value_counts()

HadDiabetes
0    307633
1     50311
Name: count, dtype: int64

### Step: Cleaning <b>Disability Features</b>
We identified missing values within the disability columns.
Each column was missing between 4.6% and 5.8% of its values in the original data, so rows with missing entries were removed to maintain data quality.
All categorical responses (`Yes`/`No`) were then encoded as numeric values (1/0) for uniformity.
All disability-related features are now numeric, complete, and ready for modeling.

In [48]:
Data_disability.sample(5)

Unnamed: 0,DeafOrHardOfHearing,BlindOrVisionDifficulty,DifficultyConcentrating,DifficultyWalking,DifficultyDressingBathing,DifficultyErrands
3465,,,,,,
360526,No,No,No,No,No,No
379654,No,No,No,No,No,No
272076,No,No,No,No,No,No
97039,No,Yes,No,Yes,No,No


In [49]:
Data_disability.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 445132 entries, 0 to 445131
Data columns (total 6 columns):
 #   Column                     Non-Null Count   Dtype 
---  ------                     --------------   ----- 
 0   DeafOrHardOfHearing        424485 non-null  object
 1   BlindOrVisionDifficulty    423568 non-null  object
 2   DifficultyConcentrating    420892 non-null  object
 3   DifficultyWalking          421120 non-null  object
 4   DifficultyDressingBathing  421217 non-null  object
 5   DifficultyErrands          419476 non-null  object
dtypes: object(6)
memory usage: 20.4+ MB


In [50]:
display(pd.DataFrame({
    'Non-Null Count': Data_disability.count(),
    'Missing Count': Data_disability.isna().sum(),
    'Missing Percentage': (Data_disability.isna().sum() / len(Data_disability) * 100).round(3),
}))


Unnamed: 0,Non-Null Count,Missing Count,Missing Percentage
DeafOrHardOfHearing,424485,20647,4.638
BlindOrVisionDifficulty,423568,21564,4.844
DifficultyConcentrating,420892,24240,5.446
DifficultyWalking,421120,24012,5.394
DifficultyDressingBathing,421217,23915,5.373
DifficultyErrands,419476,25656,5.764
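
Per-column missing rates can understate how many rows a `dropna` across several columns will remove, because different rows may be missing different columns. A minimal sketch on invented data (variable names are assumptions) counts the rows that have at least one missing value before anything is dropped.

```python
import pandas as pd

toy = pd.DataFrame({
    "DifficultyWalking": ["No", None, "Yes", "No"],
    "DifficultyErrands": ["No", "No", None, "No"],
})
cols = ["DifficultyWalking", "DifficultyErrands"]

# A row is lost if any of the selected columns is missing
rows_with_any_missing = int(toy[cols].isna().any(axis=1).sum())
share_lost = rows_with_any_missing / len(toy)

print(rows_with_any_missing, share_lost)
```

Each column here is missing one value in four, yet dropping on both columns removes half of the rows.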


In [51]:
disability_columns = ['DeafOrHardOfHearing','BlindOrVisionDifficulty','DifficultyConcentrating',
                    'DifficultyWalking','DifficultyDressingBathing','DifficultyErrands']
Data.dropna(subset=disability_columns,inplace=True)

In [52]:
Data[disability_columns].isna().sum()

DeafOrHardOfHearing          0
BlindOrVisionDifficulty      0
DifficultyConcentrating      0
DifficultyWalking            0
DifficultyDressingBathing    0
DifficultyErrands            0
dtype: int64

In [53]:
for col in disability_columns:
    print(f"\n{col} : {Data[col].nunique()} unique values")
    print(Data[col].unique())



DeafOrHardOfHearing : 2 unique values
['No' 'Yes']

BlindOrVisionDifficulty : 2 unique values
['No' 'Yes']

DifficultyConcentrating : 2 unique values
['No' 'Yes']

DifficultyWalking : 2 unique values
['No' 'Yes']

DifficultyDressingBathing : 2 unique values
['No' 'Yes']

DifficultyErrands : 2 unique values
['No' 'Yes']


In [54]:
Data[disability_columns] = Data[disability_columns].replace({'Yes':1,'No':0})


### Step: Cleaning <b>Body Features</b>
We handled missing values in the `HeightInMeters`, `WeightInKilograms`, and `BMI` columns carefully to keep the data realistic and consistent.
Missing `HeightInMeters` and `WeightInKilograms` values were replaced with the column `median`, which is less sensitive to outliers than the mean. `BMI` was then recomputed as `WeightInKilograms / HeightInMeters ** 2` so that it stays consistent with the imputed values.
Extreme values were examined and retained: they are rare but represent plausible human variation.

For the `RemovedTeeth` column, which contains categorical responses (`None of them`, `1 to 5`, `6 or more, but not all`, `All`), missing values were filled with the `mode` because the missing proportion was small (about 2.6% in the original data).


In [56]:
Data_body.sample(5)

Unnamed: 0,HeightInMeters,WeightInKilograms,BMI,RemovedTeeth
333202,,,,"6 or more, but not all"
49308,1.7,90.72,31.32,1 to 5
146380,1.57,60.78,24.51,None of them
149311,1.68,90.72,32.28,"6 or more, but not all"
146268,1.65,76.2,27.96,None of them


In [57]:
Data_body.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 445132 entries, 0 to 445131
Data columns (total 4 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   HeightInMeters     416480 non-null  float64
 1   WeightInKilograms  403054 non-null  float64
 2   BMI                396326 non-null  float64
 3   RemovedTeeth       433772 non-null  object 
dtypes: float64(3), object(1)
memory usage: 13.6+ MB


In [69]:
outliers_height= detect_outliers(Data_body,'HeightInMeters')
outliers_height

Unnamed: 0,HeightInMeters,WeightInKilograms,BMI,RemovedTeeth
555,1.24,86.18,55.64,"6 or more, but not all"
874,1.22,136.08,91.55,1 to 5
902,1.22,44.91,30.21,All
1534,2.03,70.31,17.03,
1551,2.01,95.25,23.66,
...,...,...,...,...
444695,2.13,68.04,14.95,1 to 5
444721,1.27,63.50,39.37,None of them
444851,2.01,106.14,26.36,
444939,1.35,63.05,34.79,"6 or more, but not all"


In [59]:
outliers_weight= detect_outliers(Data_body, 'WeightInKilograms')
outliers_weight

Unnamed: 0,HeightInMeters,WeightInKilograms,BMI,RemovedTeeth
48,1.73,136.08,45.61,
142,1.60,145.15,56.68,
158,1.83,158.76,47.47,
208,1.63,136.08,51.49,
336,1.85,136.08,39.58,
...,...,...,...,...
444674,1.73,136.08,45.61,"6 or more, but not all"
444854,1.83,158.76,47.47,1 to 5
444987,1.83,170.10,50.86,1 to 5
445008,1.68,163.29,58.10,1 to 5


In [None]:
print("Max height:", Data_body['HeightInMeters'].max())
print("Min height:", Data_body['HeightInMeters'].min())
print("Max weight:", Data_body['WeightInKilograms'].max())
print("Min weight:", Data_body['WeightInKilograms'].min())

Max height: 2.41
Min height: 0.91
Max weight: 292.57
Min weight: 22.68


In [70]:
display(pd.DataFrame({
    'Non-Null Count': Data_body.count(),
    'Missing Count': Data_body.isna().sum(),
    'Missing Percentage': (Data_body.isna().sum() / len(Data_body) * 100).round(3),
}))

Unnamed: 0,Non-Null Count,Missing Count,Missing Percentage
HeightInMeters,416480,28652,6.437
WeightInKilograms,403054,42078,9.453
BMI,396326,48806,10.964
RemovedTeeth,433772,11360,2.552


In [75]:
print(Data_body['RemovedTeeth'].unique())

[nan 'None of them' '1 to 5' '6 or more, but not all' 'All']


In [None]:
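# Assign the result back instead of using inplace=True on a column selection
Data['HeightInMeters'] = Data['HeightInMeters'].fillna(Data['HeightInMeters'].median())
Data['WeightInKilograms'] = Data['WeightInKilograms'].fillna(Data['WeightInKilograms'].median())
# Recompute BMI for every row from the (imputed) weight and height
Data['BMI'] = (Data['WeightInKilograms'] / (Data['HeightInMeters'] ** 2)).round(2)
Data['RemovedTeeth'] = Data['RemovedTeeth'].fillna(Data['RemovedTeeth'].mode()[0])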
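
The cell above recomputes `BMI` for every row. One possible alternative, sketched below on invented values, keeps the reported `BMI` where it exists and computes it from weight and height only where it is missing. This is an assumption about what might be preferable, not what the notebook does.

```python
import pandas as pd

toy = pd.DataFrame({
    "HeightInMeters": [1.60, 2.00, None, 1.80],
    "WeightInKilograms": [64.0, 80.0, 70.0, None],
    "BMI": [25.0, 20.0, None, None],
})

# Median imputation for height and weight (medians here: 1.80 m and 70.0 kg)
toy["HeightInMeters"] = toy["HeightInMeters"].fillna(toy["HeightInMeters"].median())
toy["WeightInKilograms"] = toy["WeightInKilograms"].fillna(toy["WeightInKilograms"].median())

# BMI = weight (kg) / height (m) squared, used only to fill the gaps
computed_bmi = (toy["WeightInKilograms"] / toy["HeightInMeters"] ** 2).round(2)
toy["BMI"] = toy["BMI"].fillna(computed_bmi)

print(toy)
```

Rows that already had a `BMI` keep it, while the two rows with gaps receive a value derived from the imputed height or weight.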

In [77]:
body_columns = ['HeightInMeters','WeightInKilograms','BMI','RemovedTeeth']
Data[body_columns].sample(5)

Unnamed: 0,HeightInMeters,WeightInKilograms,BMI,RemovedTeeth
250914,1.8,74.84,23.1,None of them
440744,1.7,78.02,27.0,None of them
399694,1.83,77.11,23.03,None of them
367620,1.78,72.57,22.9,1 to 5
191658,1.63,92.08,34.66,None of them


In [76]:
Data[body_columns].isna().sum()

HeightInMeters       0
WeightInKilograms    0
BMI                  0
RemovedTeeth         0
dtype: int64