<a href="https://colab.research.google.com/github/abdel2ty/IntenseAI_Notebooks_v1/blob/main/heart_disease_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div style="background-color:#28a745; padding:15px; border-radius:10px; text-align:center; display:flex; justify-content:center; align-items:center; gap:12px;">
    <img src="attachment:91fd954c-9081-4fd6-8af8-4aa0aeac1208.jpg" alt="Evergreen Logo" style="width:70px; height:70px; border-radius:50%;">
    <h1 style="color:white; font-size:40px; font-weight:bold; margin:0;">
        Evergreen.ai
    </h1>

<div style="margin-left:auto; padding-right:10px;">
        <h2 style="color:white; font-size:40px; font-weight:600; margin:0;">
            Pandas Project
        </h2>
</div>
</div>


# 🫀 Heart Disease Analysis



In [None]:
# Step 1: Import Libraries
import pandas as pd
import numpy as np

## Load Dataset

In [None]:
df = pd.read_csv("heart_disease_dir.csv")
df

Unnamed: 0,HeartDisease,BMI,Smoking,AlcoholDrinking,Stroke,PhysicalHealth,MentalHealth,DiffWalking,Sex,AgeCategory,Race,Diabetic,PhysicalActivity,GenHealth,SleepTime,Asthma,KidneyDisease,SkinCancer
0,No,16.60,Yes,No,No,3,30,No,Female,55-59,White,Yes,Yes,Very good,5.0,Yes,No,Yes
1,No,20.34,No,No,Yes,0,0,No,Female,80 or older,White,No,Yes,Very good,7.0,No,No,
2,No,,Yes,No,No,20,30,No,Male,65-69,White,Yes,Yes,Fair,8.0,Yes,No,No
3,No,24.21,No,No,No,0,0,No,Female,75-79,White,No,No,Good,6.0,No,No,Yes
4,No,23.71,No,No,No,28,0,Yes,Female,40-44,White,No,Yes,Very good,8.0,No,No,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
319789,Yes,27.41,Yes,No,No,7,0,Yes,Male,60-64,Hispanic,Yes,No,Fair,6.0,Yes,No,No
319790,No,29.84,Yes,No,No,0,0,No,Male,35-39,Hispanic,No,Yes,Very good,5.0,Yes,No,No
319791,No,24.24,No,No,No,0,0,No,Female,45-49,Hispanic,No,Yes,Good,6.0,No,No,No
319792,No,32.81,No,No,No,0,0,No,Female,25-29,Hispanic,No,No,Good,12.0,No,No,No


## Dataset Overview



### **🔹 Heart Disease Dataset: Feature Explanations**

| **Feature**          | **Type**    | **Example Values**    | **Meaning / Description**                                                 | **Importance for Analysis**                                  |
| -------------------- | ----------- | --------------------- | ------------------------------------------------------------------------- | ------------------------------------------------------------ |
| **HeartDisease**     | Categorical | Yes / No              | Whether the person **has heart disease** or not                           | **Target variable**: the outcome we predict/analyze          |
| **BMI**              | Numerical   | 22.5, 28.7, 35.1      | **Body Mass Index**: weight relative to height                            | High/low BMI can increase heart risk                         |
| **Smoking**          | Categorical | Yes / No              | Whether the person **smokes** regularly                                   | Strong lifestyle risk factor                                 |
| **AlcoholDrinking**  | Categorical | Yes / No              | If the person **drinks alcohol heavily**                                  | Excessive drinking can increase risk                         |
| **Stroke**           | Categorical | Yes / No              | If the person **has ever had a stroke**                                   | Stroke history strongly linked to heart disease              |
| **PhysicalHealth**   | Numerical   | 0, 5, 15, 30          | **Number of days in the past 30** the person had **poor physical health** | High values → poor overall health → higher risk              |
| **MentalHealth**     | Numerical   | 0, 10, 20             | **Number of days in the past 30** with **poor mental health**             | Stress & depression are linked to heart problems             |
| **DiffWalking**      | Categorical | Yes / No              | Difficulty **walking or climbing stairs**                                 | Physical limitations → higher risk                           |
| **Sex**              | Categorical | Male / Female         | Biological sex of the person                                              | Men often have higher early risk, women later in life        |
| **AgeCategory**      | Categorical | 18-24, 45-49, 80 or older | Person's **age group**                                                | Older people have **much higher risk**                       |
| **Race**             | Categorical | White, Black, Asian   | Ethnicity of the person                                                   | Certain groups may have genetic or lifestyle risks           |
| **Diabetic**         | Categorical | Yes / No / No, borderline diabetes / Yes (during pregnancy) | Whether the person **has diabetes** | One of the strongest risk factors                            |
| **PhysicalActivity** | Categorical | Yes / No              | Whether the person exercises regularly                                    | Less activity → higher heart disease probability             |
| **GenHealth**        | Categorical | Excellent, Good, Poor | Self-reported **general health status**                                   | Strong predictor: poor health = higher heart disease         |
| **SleepTime**        | Numerical   | 4, 6, 8, 10           | **Average hours of sleep per day**                                        | Both **too little** and **too much** sleep can increase risk |
| **Asthma**           | Categorical | Yes / No              | If the person has asthma                                                  | Sometimes related to heart/lung conditions                   |
| **KidneyDisease**    | Categorical | Yes / No              | If the person has chronic kidney disease                                  | Strongly associated: kidney issues often affect the heart    |
| **SkinCancer**       | Categorical | Yes / No              | If the person has skin cancer                                             | Less direct, but can relate to overall health patterns       |

---


In [None]:
# General information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 319794 entries, 0 to 319793
Data columns (total 18 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   HeartDisease      319794 non-null  object 
 1   BMI               287815 non-null  float64
 2   Smoking           319794 non-null  object 
 3   AlcoholDrinking   319794 non-null  object 
 4   Stroke            319794 non-null  object 
 5   PhysicalHealth    319794 non-null  int64  
 6   MentalHealth      319794 non-null  int64  
 7   DiffWalking       319794 non-null  object 
 8   Sex               319794 non-null  object 
 9   AgeCategory       319794 non-null  object 
 10  Race              319794 non-null  object 
 11  Diabetic          319794 non-null  object 
 12  PhysicalActivity  319794 non-null  object 
 13  GenHealth         319794 non-null  object 
 14  SleepTime         310201 non-null  float64
 15  Asthma            319794 non-null  object 
 16  KidneyDisease     31

In [None]:
print("\nDataset Shape:", df.shape)


Dataset Shape: (319794, 18)


## Check Missing Values

In [None]:
df.isnull().sum()

HeartDisease            0
BMI                 31979
Smoking                 0
AlcoholDrinking         0
Stroke                  0
PhysicalHealth          0
MentalHealth            0
DiffWalking             0
Sex                     0
AgeCategory             0
Race                    0
Diabetic                0
PhysicalActivity        0
GenHealth               0
SleepTime            9593
Asthma                  0
KidneyDisease           0
SkinCancer          22385
dtype: int64

## **🔹 Analytical Question**

> **"What percentage of people in the dataset have heart disease versus those who do not?"**
> *(Prevalence of heart disease)*



### **Step 1: Count Total People with & without Heart Disease**


In [None]:
df["HeartDisease"].value_counts()

HeartDisease
No     292422
Yes     27372
Name: count, dtype: int64

### **Step 2: Calculate Percentage Distribution**


In [None]:
df["HeartDisease"].value_counts(normalize=True) * 100

HeartDisease
No     91.44074
Yes     8.55926
Name: proportion, dtype: float64

### **Step 3: Insights**

* Out of **319,794 people**,
  **27,372 people (\~8.56%)** reported **having heart disease**.
* **292,422 people (\~91.44%)** reported **no heart disease**.
* Heart disease is therefore the **minority class** in this dataset: roughly 1 person in 12.
* Even so, **8.56%** is a substantial share, so the next step is to analyze the **risk factors** associated with these cases.

---

## **Next Step**
  > "Which factors contribute most to heart disease?"



## **üîπ Analysis of Gender & Diabetes Impact on Heart Disease**


### **Q1: How many males vs. females are in the dataset?**


In [None]:
df.shape

(319794, 18)

In [None]:
# Count number of males and females
df["Sex"].value_counts()

Sex
Female    167804
Male      151990
Name: count, dtype: int64

In [None]:
df["Sex"].value_counts(normalize=True) * 100  # e.g. Female: 167804 / 319794 * 100

Sex
Female    52.472529
Male      47.527471
Name: proportion, dtype: float64


### **Insight**

* There are **167,804 females** (**52.47%**) and **151,990 males** (**47.53%**).
* The dataset is slightly **female-dominant**, but the split is close to even, so comparisons between the sexes rest on similarly sized groups.

### **Q2: How does gender affect the likelihood of heart disease?**

In [None]:
df.groupby(["Sex", "HeartDisease"]).size().unstack()

HeartDisease,No,Yes
Sex,Unnamed: 1_level_1,Unnamed: 2_level_1
Female,156571,11233
Male,135851,16139




#### **Analysis**

Let's calculate heart disease prevalence per gender:

* **Females** → `11,233 / 167,804 ≈ 6.7%`
* **Males** → `16,139 / 151,990 ≈ 10.6%`

#### **Insight**

* **Males** are about **1.6 times as likely** to report heart disease as **females** (10.6% / 6.7% ≈ 1.6).
* Gender appears to be an **important risk factor**.
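
The per-sex rates above can also be reproduced in code. The following is a minimal sketch: the `counts` frame is typed in by hand from the table output shown earlier and is only a stand-in for the real `groupby(...).unstack()` result.

```python
import pandas as pd

# Stand-in for df.groupby(["Sex", "HeartDisease"]).size().unstack(),
# copied from the output shown above (an assumption for illustration).
counts = pd.DataFrame(
    {"No": [156571, 135851], "Yes": [11233, 16139]},
    index=pd.Index(["Female", "Male"], name="Sex"),
)

# Prevalence per sex = "Yes" count divided by the row total, in percent.
rate_pct = counts["Yes"] / counts.sum(axis=1) * 100

# Ratio of the male rate to the female rate.
male_to_female = rate_pct["Male"] / rate_pct["Female"]

print(rate_pct.round(1))
print(round(male_to_female, 2))
```

On the real data, `pd.crosstab(df["Sex"], df["HeartDisease"], normalize="index") * 100` should give the same percentages in a single call.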

### **Q3: Does diabetes increase the risk of heart disease?**

In [None]:
df.groupby(["Diabetic", "HeartDisease"]).size().unstack()

HeartDisease,No,Yes
Diabetic,Unnamed: 1_level_1,Unnamed: 2_level_1
No,252134,17519
"No, borderline diabetes",5992,789
Yes,31845,8956
Yes (during pregnancy),2451,108



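### **Insight**

* People **without diabetes** → only **6.5%** have heart disease (17,519 / 269,653).
* People **with diabetes** → almost **22%** have heart disease (8,956 / 40,801).
* The heart disease rate among diabetics is therefore **\~3.4x** the rate among non-diabetics.
* Even **borderline diabetes** comes with a rate of about **11.6%**, nearly double (**\~1.8x**) the non-diabetic rate.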
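
The percentages in this insight follow directly from the count table. As a rough sketch, with the counts typed in by hand from the output above (the `counts`, `rate_pct` and `relative_to_no` names are introduced here for illustration only):

```python
import pandas as pd

# Counts copied from the Diabetic x HeartDisease table shown above.
counts = pd.DataFrame(
    {"No": [252134, 5992, 31845, 2451], "Yes": [17519, 789, 8956, 108]},
    index=pd.Index(
        ["No", "No, borderline diabetes", "Yes", "Yes (during pregnancy)"],
        name="Diabetic",
    ),
)

# Heart disease rate within each Diabetic category, in percent.
rate_pct = counts["Yes"] / counts.sum(axis=1) * 100

# Each category's rate relative to the non-diabetic rate.
relative_to_no = rate_pct / rate_pct["No"]

summary = pd.DataFrame(
    {"rate_pct": rate_pct.round(2), "vs_no": relative_to_no.round(2)}
)
print(summary)
```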

---

### **Final Key Insights**

1. **Gender effect**:

   * Males: **10.6%** heart disease rate
   * Females: **6.7%**
     → **Males are at higher risk**.
2. **Diabetes effect**:

   * Diabetics: **22%** heart disease rate
   * Non-diabetics: **6.5%**
     → **Diabetes is a major risk factor**.
3. These two variables (**Sex** + **Diabetic**) are each **strongly associated** with heart disease in this dataset.

---

### **🔹 Analytical Question**

> **"Our dataset is imbalanced: 91% 'No' vs 9% 'Yes'. How can we balance it to improve analysis accuracy?"**

We want to **down-sample** the **majority class** ("No") to **reduce bias** toward that class and make feature comparisons for **HeartDisease** easier to read.

---

## **Step 1: Separate Positive & Negative Cases**

### **Explanation**

* `df_yes` ‚Üí all people who **have heart disease**.
* `df_no` ‚Üí all people who **do not have heart disease**.

In [None]:
# Separate "Yes" and "No" cases
df_yes = df[df["HeartDisease"] == "Yes"]
df_no = df[df["HeartDisease"] == "No"]

In [None]:
df_yes.head(2)

Unnamed: 0,HeartDisease,BMI,Smoking,AlcoholDrinking,Stroke,PhysicalHealth,MentalHealth,DiffWalking,Sex,AgeCategory,Race,Diabetic,PhysicalActivity,GenHealth,SleepTime,Asthma,KidneyDisease,SkinCancer
5,Yes,28.87,Yes,No,No,6,0,Yes,Female,75-79,Black,No,No,Fair,12.0,No,No,
10,Yes,34.3,Yes,No,No,30,0,Yes,Male,60-64,White,Yes,No,Poor,15.0,Yes,No,No


In [None]:
df_yes.shape

(27372, 18)

In [None]:
df_no.head(2)

Unnamed: 0,HeartDisease,BMI,Smoking,AlcoholDrinking,Stroke,PhysicalHealth,MentalHealth,DiffWalking,Sex,AgeCategory,Race,Diabetic,PhysicalActivity,GenHealth,SleepTime,Asthma,KidneyDisease,SkinCancer
0,No,16.6,Yes,No,No,3,30,No,Female,55-59,White,Yes,Yes,Very good,5.0,Yes,No,Yes
1,No,20.34,No,No,Yes,0,0,No,Female,80 or older,White,No,Yes,Very good,7.0,No,No,


In [None]:
df_no.shape

(292422, 18)


#### **Step 2: Down-Sample the Majority Class ("No")**

#### **Explanation**

* Since we have **27,372 people with heart disease** (`Yes`),
  we randomly **pick the same number** of `No` cases → **27,372**.
* `random_state=42` ensures **reproducibility** (same result every time).
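
The same idea can be written without hard-coding the class size. The sketch below uses toy data and a hypothetical helper, `downsample_to_minority`, which is not part of the notebook; it relies on `DataFrameGroupBy.sample` to draw the same number of rows from every class.

```python
import pandas as pd

def downsample_to_minority(frame, label_col, random_state=42):
    """Hypothetical helper: sample every class down to the smallest class size."""
    smallest = frame[label_col].value_counts().min()
    return frame.groupby(label_col).sample(n=smallest, random_state=random_state)

# Toy data: 8 "No" rows and 2 "Yes" rows.
toy = pd.DataFrame({"HeartDisease": ["No"] * 8 + ["Yes"] * 2, "BMI": range(10)})

balanced_toy = downsample_to_minority(toy, "HeartDisease")
print(balanced_toy["HeartDisease"].value_counts().to_dict())
```

Because the minority class is sampled at its full size, all of its rows are kept and only the majority class shrinks.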


In [None]:
# Randomly sample as many "No" rows as there are "Yes" rows (27,372)
df_no_sampled = df_no.sample(n=len(df_yes), random_state=42)  # random_state for reproducibility

In [None]:
df_no_sampled.shape

(27372, 18)

#### **Step 3: Combine Yes + Sampled No Cases**

#### **Explanation**

* We merge both datasets into **one balanced dataset**.
* Now, we have:

  * **27,372 Yes cases**
  * **27,372 No cases**

In [None]:
# Combine "Yes" cases with the reduced "No" cases
df_balanced = pd.concat([df_yes, df_no_sampled])

#### **Step 4: Shuffle the Dataset**

```python
df_balanced = df_balanced.sample(frac=1, random_state=42).reset_index(drop=True)
```

#### **Explanation**

* `frac=1` → sample **100%** of the rows in random order, which shuffles the dataset.
* Resetting the index (`reset_index(drop=True)`) replaces the old row labels with a **clean sequence** starting at 0.

---

In [None]:
# Shuffle the dataset (optional but recommended)
df_balanced = df_balanced.sample(frac=1, random_state=42).reset_index(drop=True)

#### **Step 5: Verify New Distribution**


In [None]:
# Check new distribution
print(df_balanced["HeartDisease"].value_counts())
print("New shape:", df_balanced.shape)

HeartDisease
No     27372
Yes    27372
Name: count, dtype: int64
New shape: (54744, 18)


#### **🔹 Why Down-Sampling Improves Accuracy**

Originally:

* **Yes = 27,372 (\~9%)**
* **No = 292,422 (\~91%)**

If we **don't balance** the dataset:

* Models & analysis will **favor "No" cases**.
* For example, a model predicting **"No" for everyone** would still be **91% accurate**,
  yet it would **miss every real heart disease case**.

After balancing:

* **Yes = 50%**
* **No = 50%**
* We get **fairer comparisons** of features against HeartDisease: any group with more than 50% "Yes" is over-represented among heart disease cases.
* Percentages computed on the balanced data are **not population prevalence**; they are only meaningful relative to the 50% baseline.
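
The "91% accurate but useless" point can be checked with plain arithmetic. This is a small sketch using the class counts reported earlier; the always-"No" model is hypothetical.

```python
# Class counts taken from the value_counts() output earlier in the notebook.
n_yes, n_no = 27372, 292422

# A hypothetical "model" that answers "No" for every person.
true_negatives = n_no   # every actual "No" is predicted correctly
true_positives = 0      # no actual "Yes" is ever predicted

accuracy = (true_positives + true_negatives) / (n_yes + n_no)
recall = true_positives / n_yes  # share of real cases that were caught

print(f"accuracy = {accuracy:.4f}")  # accuracy = 0.9144
print(f"recall   = {recall:.4f}")    # recall   = 0.0000
```

High accuracy with zero recall is the reason accuracy alone is a poor yardstick on imbalanced data.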


### **🔹 Analytical Question**

> **"Which age groups have the highest number of people with heart disease?"**
> We want to understand **how age impacts heart disease prevalence**.

---

#### **Step 1: Code to Count Heart Disease Cases per Age Category**

#### **Explanation**

* `df_balanced["HeartDisease"]=="Yes"` → filters the dataset to include only people **with heart disease**.
* `["AgeCategory"]` → selects the age column.
* `.value_counts()` → counts how many cases exist in each age group.

In [None]:
df_balanced[df_balanced["HeartDisease"]=="Yes"]["AgeCategory"].value_counts()

AgeCategory
80 or older    5448
70-74          4847
65-69          4101
75-79          4049
60-64          3327
55-59          2202
50-54          1383
45-49           744
40-44           486
35-39           296
30-34           226
25-29           133
18-24           130
Name: count, dtype: int64



### Insights

#### **1. Older adults have the highest risk**

* The **"80 or older"** group has the highest number of heart disease cases (**5,448**).
* **70-74** (**4,847**), **65-69** (**4,101**) and **75-79** (**4,049**) follow.

#### **2. Risk increases after age 60**

* People aged **60+** account for **\~80%** of all heart disease cases (21,772 of 27,372).
* From **18 to 50**, heart disease cases remain **relatively low**.

#### **3. Younger adults are at much lower risk**

* Only **130 cases** in the **18-24** category → **<1%** of total cases.
* This suggests **age is one of the strongest predictors** for heart disease.
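
The age shares quoted in these insights can be derived from the counts. A minimal sketch, with the `cases` Series typed in by hand from the `value_counts()` output above:

```python
import pandas as pd

# Heart disease cases per age group, copied from the output above.
cases = pd.Series(
    {
        "80 or older": 5448, "70-74": 4847, "65-69": 4101, "75-79": 4049,
        "60-64": 3327, "55-59": 2202, "50-54": 1383, "45-49": 744,
        "40-44": 486, "35-39": 296, "30-34": 226, "25-29": 133, "18-24": 130,
    },
    name="count",
)

aged_60_plus = ["60-64", "65-69", "70-74", "75-79", "80 or older"]
under_40 = ["18-24", "25-29", "30-34", "35-39"]

# Share of all heart disease cases that falls in each age band, in percent.
share_60_plus = cases[aged_60_plus].sum() / cases.sum() * 100
share_under_40 = cases[under_40].sum() / cases.sum() * 100

print(round(share_60_plus, 1), round(share_under_40, 1))
```

These are shares of cases, not rates within each age group; a per-group rate would also need the number of people without heart disease in each group.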

---

###  Analytical Conclusion

* **Age is a strong risk factor** for heart disease.
* Most cases occur in people **aged 60+**.
* Very **few cases** exist in people **under 40**.
* Any predictive model should **treat age as a key feature**.

---

In [None]:
# Mark people aged 50 and over as "Elderly" using a boolean mask
elderly_groups = ["50-54", "55-59", "60-64", "65-69", "70-74", "75-79", "80 or older"]
is_elderly = df["AgeCategory"].isin(elderly_groups)
df.loc[is_elderly, "Elderly"] = "Yes"
df.loc[~is_elderly, "Elderly"] = "No"
df.head()

In [None]:
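# One-step alternative with np.where; it overwrites the column and labels the negative class "NO"
df["Elderly"] = np.where(df["AgeCategory"].isin(["50-54", "55-59", "60-64", "65-69", "70-74", "75-79", "80 or older"]), "Yes", "NO")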

In [None]:
df

Unnamed: 0,HeartDisease,BMI,Smoking,AlcoholDrinking,Stroke,PhysicalHealth,MentalHealth,DiffWalking,Sex,AgeCategory,Race,Diabetic,PhysicalActivity,GenHealth,SleepTime,Asthma,KidneyDisease,SkinCancer,Elderly
0,No,16.60,Yes,No,No,3,30,No,Female,55-59,White,Yes,Yes,Very good,5.0,Yes,No,Yes,Yes
1,No,20.34,No,No,Yes,0,0,No,Female,80 or older,White,No,Yes,Very good,7.0,No,No,,Yes
2,No,,Yes,No,No,20,30,No,Male,65-69,White,Yes,Yes,Fair,8.0,Yes,No,No,Yes
3,No,24.21,No,No,No,0,0,No,Female,75-79,White,No,No,Good,6.0,No,No,Yes,Yes
4,No,23.71,No,No,No,28,0,Yes,Female,40-44,White,No,Yes,Very good,8.0,No,No,No,NO
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
319789,Yes,27.41,Yes,No,No,7,0,Yes,Male,60-64,Hispanic,Yes,No,Fair,6.0,Yes,No,No,Yes
319790,No,29.84,Yes,No,No,0,0,No,Male,35-39,Hispanic,No,Yes,Very good,5.0,Yes,No,No,NO
319791,No,24.24,No,No,No,0,0,No,Female,45-49,Hispanic,No,Yes,Good,6.0,No,No,No,NO
319792,No,32.81,No,No,No,0,0,No,Female,25-29,Hispanic,No,No,Good,12.0,No,No,No,NO


### **🔹 Analytical Question**

> **"How does BMI\_Status affect the likelihood of heart disease?"**
> We want to check whether people who are **Normal** or **Overweight** have a higher risk of **HeartDisease**.


In [None]:
# Add BMI status column: BMI >= 25 is "Overweight"; everything else, including missing BMI, is "Normal"
df_balanced["BMI_Status"] = np.where(df_balanced["BMI"] >= 25, "Overweight", "Normal")
df_balanced

Unnamed: 0,HeartDisease,BMI,Smoking,AlcoholDrinking,Stroke,PhysicalHealth,MentalHealth,DiffWalking,Sex,AgeCategory,Race,Diabetic,PhysicalActivity,GenHealth,SleepTime,Asthma,KidneyDisease,SkinCancer,BMI_Status
0,No,19.67,No,No,No,0,0,No,Female,40-44,White,No,Yes,Excellent,8.0,No,No,No,Normal
1,No,23.44,No,No,No,0,0,No,Female,50-54,White,No,Yes,Very good,8.0,Yes,No,No,Normal
2,No,25.83,Yes,No,No,0,0,No,Male,80 or older,White,No,Yes,Very good,8.0,No,No,No,Overweight
3,Yes,33.33,No,No,No,10,0,Yes,Female,80 or older,White,No,No,Fair,8.0,No,No,Yes,Overweight
4,Yes,25.75,No,No,No,0,30,No,Female,60-64,Hispanic,No,No,Fair,3.0,No,No,No,Overweight
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
54739,No,25.09,No,No,No,0,0,No,Female,55-59,White,Yes,Yes,Good,8.0,Yes,No,No,Overweight
54740,No,37.12,Yes,No,No,7,7,Yes,Female,70-74,White,Yes,No,Good,8.0,No,No,No,Overweight
54741,No,43.58,Yes,No,No,6,15,No,Female,40-44,White,No,Yes,Good,7.0,Yes,No,No,Overweight
54742,Yes,27.46,No,No,No,0,20,No,Female,80 or older,White,No,Yes,Fair,6.0,No,No,Yes,Overweight


In [None]:
df_balanced.groupby(["BMI_Status", "HeartDisease"]).size().unstack()

HeartDisease,No,Yes
BMI_Status,Unnamed: 1_level_1,Unnamed: 2_level_1
Normal,10721,8766
Overweight,16651,18606


In [None]:
pd.crosstab(df_balanced["BMI_Status"], df_balanced["HeartDisease"], normalize="index") * 100

HeartDisease,No,Yes
BMI_Status,Unnamed: 1_level_1,Unnamed: 2_level_1
Normal,55.016165,44.983835
Overweight,47.227501,52.772499


### Insights

#### **1. Overweight people are at higher risk**

* In the balanced sample, **53%** of individuals with BMI ≥ 25 have **HeartDisease**.
* Only **45%** of individuals labeled Normal have **HeartDisease**.
* Both figures are relative to the 50% baseline of the balanced data, not population prevalence.

#### **2. BMI is an important predictor**

* There's a **clear link** between higher BMI and **heart disease prevalence**.
* Overweight individuals **dominate heart disease cases** (18,606 vs 8,766 for normal).

#### **3. Healthy weight lowers risk**

* A **normal BMI** goes with a heart disease share roughly **8 percentage points** lower than for overweight individuals (45.0% vs 52.8%).
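
A two-label split sends both underweight people and rows with a missing BMI to "Normal". As a hedged alternative, the sketch below uses `pd.cut` on toy values; the cutoffs 18.5, 25 and 30 are the commonly used BMI thresholds and are an assumption here, not something the notebook defines.

```python
import numpy as np
import pandas as pd

# Toy BMI values, including one missing entry.
bmi = pd.Series([16.6, 20.34, np.nan, 24.21, 25.0, 28.87, 34.3], name="BMI")

# right=False makes each bin closed on the left: [0, 18.5), [18.5, 25), [25, 30), [30, inf)
bmi_status = pd.cut(
    bmi,
    bins=[0, 18.5, 25, 30, np.inf],
    labels=["Underweight", "Normal", "Overweight", "Obese"],
    right=False,
)

print(pd.DataFrame({"BMI": bmi, "BMI_Status": bmi_status}))
```

With `pd.cut`, a missing BMI stays missing instead of being counted in one of the groups.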




### **🔹 Analytical Question**

> **"Does sleeping more than 7 hours reduce the risk of Heart Disease?"**

We want to see if people who sleep **more than 7 hours** ("Perfect") have fewer **HeartDisease** cases than those who sleep 7 hours or less ("Not Perfect").

---


In [None]:
# More than 7 hours is "Perfect"; 7 hours or less (and any missing SleepTime) is "Not Perfect"
df_balanced["HighSleep"] = np.where(df_balanced["SleepTime"] > 7, "Perfect", "Not Perfect")

In [None]:
df_balanced.head(1)

Unnamed: 0,HeartDisease,BMI,Smoking,AlcoholDrinking,Stroke,PhysicalHealth,MentalHealth,DiffWalking,Sex,AgeCategory,Race,Diabetic,PhysicalActivity,GenHealth,SleepTime,Asthma,KidneyDisease,SkinCancer,BMI_Status,HighSleep
0,No,19.67,No,No,No,0,0,No,Female,40-44,White,No,Yes,Excellent,8.0,No,No,No,Normal,Perfect


In [None]:
df_balanced.groupby(["HighSleep","HeartDisease"]).size().unstack(fill_value=0)

HeartDisease,No,Yes
HighSleep,Unnamed: 1_level_1,Unnamed: 2_level_1
Not Perfect,17137,15860
Perfect,10235,11512



#### **Step 3: Calculate Percentages**

In [None]:
pd.crosstab(df_balanced["HighSleep"], df_balanced["HeartDisease"], normalize="index") * 100

HeartDisease,No,Yes
HighSleep,Unnamed: 1_level_1,Unnamed: 2_level_1
Not Perfect,51.935024,48.064976
Perfect,47.063963,52.936037


#### **Insights**
* People who **sleep more than 7 hours** actually show a **slightly higher share of heart disease cases** (52.9% vs 48.1%).
* Sleep **duration alone** isn't a strong predictor; we need to analyze it alongside other features.


In [None]:
df_balanced

In [None]:
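# Bins are right-inclusive: (0, 6] is Short, (6, 8] is Normal, (8, 24] is Long; missing SleepTime stays NaN
df_balanced["SleepCategory"] = pd.cut(
    df_balanced["SleepTime"],
    bins=[0, 6, 8, 24],
    labels=["Short", "Normal", "Long"],
)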

In [None]:
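def categorize_sleep(sleep):
    """Label sleep duration: under 6 h is Short, 6 to 8 h is Normal, otherwise Long."""
    # Caution: a missing value (NaN) fails both comparisons and is labeled "Long".
    if sleep < 6:
        return "Short"
    elif sleep <= 8:
        return "Normal"
    else:
        return "Long"

# This overwrites the pd.cut version; here exactly 6 hours counts as "Normal".
df_balanced["SleepCategory"] = df_balanced["SleepTime"].apply(categorize_sleep)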
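
Because NaN fails every comparison, a plain if/elif/else chain sends missing sleep values to the final branch. The sketch below shows one possible guard on toy data; `categorize_sleep_safe` is a hypothetical variant, not a function from the notebook.

```python
import numpy as np
import pandas as pd

def categorize_sleep_safe(sleep):
    """Hypothetical NaN-aware variant: missing input stays missing."""
    if pd.isna(sleep):
        return np.nan
    if sleep < 6:
        return "Short"
    if sleep <= 8:
        return "Normal"
    return "Long"

# Toy values covering each branch plus one missing entry.
sleep_time = pd.Series([3.0, 6.0, 8.0, 12.0, np.nan], name="SleepTime")
sleep_category = sleep_time.apply(categorize_sleep_safe)

print(sleep_category.tolist())
```

Keeping missing values as NaN means they drop out of `groupby` and `crosstab` counts rather than inflating the "Long" group.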

In [None]:
df_balanced[["SleepTime", "SleepCategory"]].head()

Unnamed: 0,SleepTime,SleepCategory
0,8.0,Normal
1,8.0,Normal
2,8.0,Normal
3,8.0,Normal
4,3.0,Short


In [None]:
df_balanced.groupby(["SleepCategory","HeartDisease"]).size().unstack(fill_value=0)  # fill_value=0: combinations with no rows show 0 instead of NaN

HeartDisease,No,Yes
SleepCategory,Unnamed: 1_level_1,Unnamed: 2_level_1
Long,2983,4263
Normal,22000,19468
Short,2389,3641


In [None]:
pd.crosstab(df_balanced["SleepCategory"], df_balanced["HeartDisease"], normalize="index") * 100

HeartDisease,No,Yes
SleepCategory,Unnamed: 1_level_1,Unnamed: 2_level_1
Long,41.167541,58.832459
Normal,53.052956,46.947044
Short,39.618574,60.381426


In [None]:
df_balanced["SleepCategory"].value_counts(normalize=True) * 100

SleepCategory
Normal    75.748941
Long      13.236154
Short     11.014906
Name: proportion, dtype: float64

In [None]:
df_balanced.groupby(["GenHealth","HeartDisease"]).size().unstack(fill_value=0)  # fill_value=0: combinations with no rows show 0 instead of NaN

In [None]:
df_balanced["GenHealth"].value_counts(normalize=True) * 100

### **Insights**

#### **Strong correlation between GenHealth & HeartDisease**

* People with **Poor** health → **85% have heart disease** ✅ **(highest risk group)**.
* People with **Fair** health → **74% have heart disease**.
* People with **Good** health → **55% have heart disease**.
* People with **Very good** health → **35% have heart disease**.
* People with **Excellent** health → **only 20% have heart disease** ✅ **(lowest risk group)**.
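
Percentages like these are row percentages: the share of "Yes" within each GenHealth level. A minimal sketch on made-up toy rows (the values below are not the notebook's data) shows how such a table can be produced; declaring the column as an ordered categorical is meant to keep the levels in health order rather than alphabetical order.

```python
import pandas as pd

# Toy rows for illustration only; they do not reproduce the figures above.
toy = pd.DataFrame({
    "GenHealth": ["Poor", "Poor", "Fair", "Good", "Good",
                  "Very good", "Excellent", "Excellent"],
    "HeartDisease": ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"],
})

# Give the health levels an explicit order.
order = ["Poor", "Fair", "Good", "Very good", "Excellent"]
toy["GenHealth"] = pd.Categorical(toy["GenHealth"], categories=order, ordered=True)

# normalize="index" makes every row sum to 1; multiply by 100 for percent.
row_pct = pd.crosstab(toy["GenHealth"], toy["HeartDisease"], normalize="index") * 100
print(row_pct)
```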


---
### **🔹 Analytical Question**

> **"Does smoking significantly increase the risk of heart disease?"**

We want to analyze if **smokers** have a higher chance of getting **heart disease** compared to **non-smokers**.


In [None]:
df_balanced.groupby(["Smoking", "HeartDisease"]).size().unstack(fill_value=0)

HeartDisease,No,Yes
Smoking,Unnamed: 1_level_1,Unnamed: 2_level_1
No,16526,11335
Yes,10846,16037


In [None]:
heart_patients = df_balanced[df_balanced["HeartDisease"] == "Yes"]
heart_patients["Smoking"].value_counts(normalize=True) * 100

Smoking
Yes    58.589069
No     41.410931
Name: proportion, dtype: float64

In [None]:
smoking_comparison = pd.crosstab(df_balanced["Smoking"], df_balanced["HeartDisease"], normalize="index") * 100
smoking_comparison


### **Insights**

#### **Smoking strongly increases heart disease risk**

* Among **smokers** in the balanced sample, **59.65%** have heart disease.
* Among **non-smokers**, only **40.68%** have heart disease.
* **Conclusion → Smokers are \~1.5 times as likely** as non-smokers to have heart disease in this sample (59.65 / 40.68 ≈ 1.47).

#### **Analytical Conclusion**

* Smoking is associated with a **clearly higher** rate of heart disease.
* In the balanced sample, people who smoke are **\~50% more likely** to have heart disease than non-smokers; this ratio is specific to the 50/50 sample.
* This makes **smoking** one of the **most important lifestyle risk factors** in our dataset.
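
Since the balanced sample keeps every "Yes" row but only a fraction of the "No" rows, its percentages can be roughly converted back to original-data rates by scaling the "No" counts up again. The sketch below is an approximation under that assumption, using the counts from the smoking table above; exact rates would come from running the crosstab on the full `df`.

```python
# Counts from the balanced-sample Smoking x HeartDisease table above.
balanced = {
    "No": {"No": 16526, "Yes": 11335},   # non-smokers
    "Yes": {"No": 10846, "Yes": 16037},  # smokers
}

# Only 27,372 of the 292,422 "No" rows were sampled, so each sampled
# "No" row stands in for roughly this many original rows.
no_weight = 292422 / 27372

def estimated_rate(cell):
    """Approximate original-data heart disease rate (percent) for one group."""
    scaled_no = cell["No"] * no_weight
    return cell["Yes"] / (cell["Yes"] + scaled_no) * 100

smoker_rate = estimated_rate(balanced["Yes"])
non_smoker_rate = estimated_rate(balanced["No"])
rate_ratio = smoker_rate / non_smoker_rate

print(round(smoker_rate, 1), round(non_smoker_rate, 1), round(rate_ratio, 2))
```

Under this rough estimate the smoker rate is about twice the non-smoker rate, which is larger than the 1.5 ratio seen in the 50/50 sample, because balancing compresses differences between groups.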

In [None]:
df[df["BMI"].isnull()]

Unnamed: 0,HeartDisease,BMI,Smoking,AlcoholDrinking,Stroke,PhysicalHealth,MentalHealth,DiffWalking,Sex,AgeCategory,Race,Diabetic,PhysicalActivity,GenHealth,SleepTime,Asthma,KidneyDisease,SkinCancer,Elderly
2,No,,Yes,No,No,20,30,No,Male,65-69,White,Yes,Yes,Fair,8.0,Yes,No,No,Yes
11,No,,Yes,No,No,0,0,No,Female,55-59,White,No,Yes,Very good,5.0,No,No,No,Yes
13,No,,No,No,No,7,0,Yes,Female,80 or older,White,No,No,Good,7.0,No,No,No,Yes
14,No,,Yes,No,No,0,30,Yes,Female,60-64,White,No,No,Good,5.0,No,No,No,Yes
19,No,,No,No,No,0,0,No,Male,80 or older,White,No,Yes,Excellent,8.0,No,No,Yes,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
319768,No,,No,No,No,0,0,No,Male,65-69,Hispanic,Yes,Yes,Good,7.0,No,No,No,Yes
319769,No,,No,No,No,0,0,No,Female,45-49,Hispanic,No,No,Very good,6.0,No,No,No,NO
319777,No,,No,No,No,0,0,No,Female,25-29,Hispanic,No,No,Very good,8.0,No,No,No,NO
319780,Yes,,Yes,No,No,0,0,No,Male,35-39,Hispanic,No,Yes,Very good,7.0,No,No,No,NO


In [None]:
df_balanced["BMI"].isnull().sum()

np.int64(0)

In [None]:
# Assign the result back instead of calling fillna(..., inplace=True) on a column selection
df["BMI"] = df["BMI"].fillna(df["BMI"].mean())


In [None]:
df["BMI"].isnull().sum()

np.int64(0)

In [None]:
# Fill missing BMI with the mean BMI of the row's HeartDisease group
df_balanced['BMI'] = df_balanced['BMI'].fillna(df_balanced.groupby('HeartDisease')['BMI'].transform('mean'))

In [None]:
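# Alternative: fill with the group median (less sensitive to extreme BMI values),
# then fall back to the overall median for anything still missing
df['BMI'] = df['BMI'].fillna(df.groupby('HeartDisease')['BMI'].transform('median'))
df['BMI'] = df['BMI'].fillna(df['BMI'].median())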
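
To see what group-wise imputation does, here is a small sketch on toy data (the `toy` frame and the `BMI_filled` column are made up for illustration). `transform("median")` returns one value per row, aligned with the original index, so it can be passed straight to `fillna`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "HeartDisease": ["Yes", "Yes", "Yes", "No", "No", "No"],
    "BMI": [30.0, np.nan, 34.0, 22.0, 24.0, np.nan],
})

# Median BMI of each row's HeartDisease group (NaN values are skipped).
group_median = toy.groupby("HeartDisease")["BMI"].transform("median")

# Only the missing entries are replaced; existing values are kept.
toy["BMI_filled"] = toy["BMI"].fillna(group_median)

print(toy)
```

The missing "Yes" row receives 32.0 (median of 30 and 34) and the missing "No" row receives 23.0 (median of 22 and 24).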

In [None]:
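# Fill NaNs in SkinCancer with the most frequent value (mode).
# Caution: inplace fillna on a column selection may leave df_balanced unchanged (see the warning below).
df_balanced["SkinCancer"].fillna(df_balanced["SkinCancer"].mode()[0], inplace=True)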

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df["SkinCancer"].fillna(df["SkinCancer"].mode()[0], inplace=True)


In [None]:
print(df_balanced["SkinCancer"].isnull().sum())
print(df_balanced["SkinCancer"].value_counts())

3891
SkinCancer
No     44018
Yes     6835
Name: count, dtype: int64


In [None]:
# Fill missing SkinCancer values with the most frequent value within each HeartDisease group
df_balanced["SkinCancer"] = df_balanced.groupby("HeartDisease")["SkinCancer"].transform(
    lambda x: x.fillna(x.mode()[0]))

In [None]:
print(df_balanced["SkinCancer"].isnull().sum())

0


<div style="background-color:#28a745; padding:15px; border-radius:10px; text-align:center; display:flex; justify-content:center; align-items:center; gap:12px;">
    <img src="attachment:3f91443c-7322-410d-98ba-257cc58c136c.jpg" alt="Evergreen Logo" style="width:70px; height:70px; border-radius:50%;">
    <h1 style="color:white; font-size:40px; font-weight:bold; margin:0;">
        Evergreen.ai
    </h1>

<div style="margin-left:auto; padding-right:10px;">
        <h2 style="color:white; font-size:30px; font-weight:600; margin:0;">
            Eng / Mahmoud Talaat ->
            Thanks
        </h2>
        <h2 style="color:white; font-size:30px; font-weight:600; margin:0;">
            01146544662
        </h2>
</div>
</div>
