<a href="https://colab.research.google.com/github/affu-11/Heart-Disease-Risk-Analysis/blob/main/week_2_GNCIPL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Project Name:** Heart Disease Risk Analysis
### **Project type:** EDA
### **Objective:** The objective of this project is to analyze the UCI Heart Disease dataset using Exploratory Data Analysis (EDA). The goal is to identify important risk factors such as age, cholesterol, blood pressure, chest pain type, and maximum heart rate, and study their relationship with heart disease outcomes. Through visualizations, correlation analysis, and clustering, this project aims to provide meaningful insights that can support early detection and prevention of heart disease.


In [20]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import plotly.express as px


# **Load The Dataset**

In [21]:
from google.colab import drive
drive.mount('/content/drive')
df = pd.read_csv("/content/drive/MyDrive/Datasets/heart.csv")


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [22]:
df.head()

index,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,52,1,0,125,212,0,1,168,0,1.0,2,2,3,0
1,53,1,0,140,203,1,0,155,1,3.1,0,0,3,0
2,70,1,0,145,174,0,1,125,1,2.6,0,0,3,0
3,61,1,0,148,203,0,1,161,0,0.0,2,1,3,0
4,62,0,0,138,294,1,1,106,0,1.9,1,3,2,0


In [23]:
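df.info()            # column dtypes and non-null counts
df.describe()        # summary statistics; not rendered here, since a cell only displays its last expression
df.isnull().sum()    # missing values per column (the displayed result)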


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1025 entries, 0 to 1024
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1025 non-null   int64  
 1   sex       1025 non-null   int64  
 2   cp        1025 non-null   int64  
 3   trestbps  1025 non-null   int64  
 4   chol      1025 non-null   int64  
 5   fbs       1025 non-null   int64  
 6   restecg   1025 non-null   int64  
 7   thalach   1025 non-null   int64  
 8   exang     1025 non-null   int64  
 9   oldpeak   1025 non-null   float64
 10  slope     1025 non-null   int64  
 11  ca        1025 non-null   int64  
 12  thal      1025 non-null   int64  
 13  target    1025 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 112.2 KB


column,missing_count
age,0
sex,0
cp,0
trestbps,0
chol,0
fbs,0
restecg,0
thalach,0
exang,0
oldpeak,0
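Duplicate rows are another common data-quality issue that `isnull()` does not catch. The sketch below counts missing cells and repeated rows on a small toy frame; `summarize_quality` is a hypothetical helper introduced here for illustration, not part of the notebook, and the toy values are made up.

```python
import pandas as pd


def summarize_quality(frame: pd.DataFrame) -> dict:
    """Return basic data-quality counts for a DataFrame."""
    return {
        "rows": len(frame),
        "missing_cells": int(frame.isnull().sum().sum()),
        # duplicated() flags every repeat of an earlier, identical row
        "duplicate_rows": int(frame.duplicated().sum()),
    }


# Toy data (assumption): one repeated row and one missing value
toy = pd.DataFrame({
    "age": [52, 53, 52, 61],
    "chol": [212, 203, 212, None],
})
report = summarize_quality(toy)
print(report)
```

On the real data, `summarize_quality(df)` would show whether repeated rows need to be dropped before counting patients.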


# **Age vs Cholesterol with Outcome**

In [24]:
# Scatter plot
fig = px.scatter(
    df,
    x="age",
    y="chol",
    color="target",   # disease outcome
    hover_data=["age", "chol", "trestbps", "thalach"]  # show on hover
)

fig.show()


What It Means

**Each dot** = one patient from the dataset.

**X-axis** = age of the patient.

**Y-axis** = cholesterol level.

**Color** = `target` (whether the patient has heart disease or not).

**Blue (0)** = no heart disease

**Yellow (1)** = has heart disease

**Hover effect →** When you move your mouse over a dot, it shows:

1. Age

2. Cholesterol level

3. Resting blood pressure (trestbps)

4. Max heart rate (thalach)
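The blue and yellow colors appear because `target` is stored as an integer, so Plotly Express treats it as a continuous value and draws a color scale. A minimal sketch of an alternative, assuming a text label column is acceptable: map `target` to labels (as the notebook does later for the gender histogram) so the legend shows two discrete classes. The helper names `with_outcome_label` and `plot_age_vs_chol` and the column name `outcome` are hypothetical.

```python
import pandas as pd


def with_outcome_label(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with a text `outcome` column derived from `target`."""
    out = frame.copy()
    out["outcome"] = out["target"].map({0: "No heart disease", 1: "Heart disease"})
    return out


def plot_age_vs_chol(frame: pd.DataFrame):
    """Build the scatter with a discrete legend (requires plotly)."""
    import plotly.express as px  # imported lazily so the helper above works without plotly
    return px.scatter(
        with_outcome_label(frame),
        x="age",
        y="chol",
        color="outcome",  # string column, so Plotly uses discrete colors
        hover_data=["trestbps", "thalach"],
    )


# Toy rows (assumption) with the same column names as the dataset
toy = pd.DataFrame({
    "age": [52, 62], "chol": [212, 294],
    "trestbps": [125, 138], "thalach": [168, 106],
    "target": [0, 1],
})
labeled = with_outcome_label(toy)
print(labeled[["age", "chol", "outcome"]])
```

Calling `plot_age_vs_chol(df).show()` in the notebook would then render the same scatter with a two-entry legend instead of a color bar.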

In [25]:
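# Correlation matrix over numeric columns only, so the cell still runs
# if text columns (e.g. target_label) have been added to df
corr = df.select_dtypes(include="number").corr()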

# Plotly interactive heatmap
fig = px.imshow(
    corr,
    text_auto=True,
    color_continuous_scale="RdBu_r",
    title="Feature Correlation Heatmap"
)
fig.update_layout(width=800, height=600)
fig.show()
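A heatmap with this many cells can be hard to read by eye. One way to complement it is to list the strongest pairs directly. The sketch below does this on toy data; `strongest_pairs` is a hypothetical helper, and the toy values are invented purely to exercise it.

```python
import numpy as np
import pandas as pd


def strongest_pairs(frame: pd.DataFrame, top: int = 3) -> pd.Series:
    """List the feature pairs with the largest absolute Pearson correlation."""
    corr = frame.select_dtypes(include="number").corr()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack().dropna()
    # Sort by magnitude but keep the sign in the result
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(top)


# Toy data (assumption): thalach tracks target closely, chol only weakly
toy = pd.DataFrame({
    "thalach": [170, 160, 150, 120, 110, 100],
    "chol":    [200, 260, 210, 250, 205, 255],
    "target":  [1, 1, 1, 0, 0, 0],
})
top_pairs = strongest_pairs(toy)
print(top_pairs)
```

Pearson correlation treats coded categories such as `cp` or `thal` as if they were ordered numbers, so the values for those columns should be read with caution.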


In [26]:
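# Bin ages into decades; include_lowest=True keeps patients aged exactly 29,
# who would otherwise fall outside the first (29, 39] interval and be dropped
age_groups = pd.cut(df['age'], bins=[29,39,49,59,69,79],
                    labels=['29-39','40-49','50-59','60-69','70-79'],
                    include_lowest=True)
# Count outcomes per age group, then reshape to long format for Plotly
age_outcome = pd.crosstab(age_groups, df['target'])
age_outcome = age_outcome.reset_index().melt(id_vars='age', value_name='count', var_name='target')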
# Interactive bar plot
fig = px.bar(
    age_outcome,
    x="age",
    y="count",
    color="target",
    text="count",
    title="Heart Disease Outcome by Age Group",
    barmode="stack"
)
fig.update_layout(xaxis_title="Age Group", yaxis_title="Count")
fig.show()
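Stacked counts make it hard to compare age groups of different sizes. A row-normalised crosstab gives the share of each outcome within a group instead. The following is a sketch on invented toy data; `outcome_rate_by_age_group` is a hypothetical helper, and the first bin is labelled `29-39` on the assumption that `include_lowest=True` is used so that age 29 is kept.

```python
import pandas as pd


def outcome_rate_by_age_group(frame: pd.DataFrame) -> pd.DataFrame:
    """Row-normalised crosstab: share of each outcome within every age group."""
    groups = pd.cut(
        frame["age"],
        bins=[29, 39, 49, 59, 69, 79],
        labels=["29-39", "40-49", "50-59", "60-69", "70-79"],
        include_lowest=True,  # keep patients aged exactly 29 in the first bin
    )
    return pd.crosstab(groups, frame["target"], normalize="index")


# Toy data (assumption), not taken from heart.csv
toy = pd.DataFrame({
    "age":    [29, 35, 45, 47, 55, 58, 65, 72],
    "target": [1,  1,  1,  0,  0,  1,  0,  0],
})
rates = outcome_rate_by_age_group(toy)
print(rates.round(2))
```

Each row of `rates` sums to 1, so the column for `target == 1` can be read as the proportion of patients with heart disease in that age group.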


In [27]:
fig = px.box(
    df,
    y="chol",
    points="all",   # show all individual points
    title="Interactive Cholesterol Outliers"
)
fig.update_layout(width=600, height=500)
fig.show()
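The box plot marks outliers visually. To list them, the usual 1.5 × IQR rule can be applied directly. This is a minimal sketch on made-up cholesterol values; `iqr_outliers` is a hypothetical helper, and because quartile conventions differ between libraries, the cut-offs may not match the whiskers Plotly draws exactly.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    mask = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return values[mask]


# Toy cholesterol readings (assumption) with one extreme value
toy_chol = pd.Series([174, 203, 203, 212, 230, 245, 260, 294, 564], name="chol")
outliers = iqr_outliers(toy_chol)
print(outliers.tolist())  # → [564]
```

Applied to the notebook's data as `iqr_outliers(df["chol"])`, the result could be joined back to `target` to check whether the extreme values really coincide with the disease outcome.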

In [28]:
# Select features
features = ['age','trestbps','chol','thalach']
X = StandardScaler().fit_transform(df[features])

# PCA (reduce to 2 components)
pca = PCA(n_components=2)
pca_result = pca.fit_transform(X)

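# KMeans clustering on the scaled features (not on the PCA projection);
# n_init is set explicitly because its default changed across scikit-learn versions
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
df['Cluster'] = kmeans.fit_predict(X)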

# Add PCA results back to DataFrame
df['PCA1'] = pca_result[:,0]
df['PCA2'] = pca_result[:,1]

# Interactive scatter plot
fig = px.scatter(
    df,
    x="PCA1",
    y="PCA2",
    color="Cluster",
    hover_data=["age", "trestbps", "chol", "thalach"],  # details on hover
    title="PCA Clustering of Patients (Interactive)"
)

fig.update_layout(width=700, height=500)
fig.show()
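The scatter shows the clusters, but not how much information the two components retain or how well separated the clusters are. A hedged sketch of two common checks, the explained variance ratio and the silhouette score, follows. It runs on synthetic data standing in for `[age, trestbps, chol, thalach]`; the group means and spreads are assumptions chosen only to make the example self-contained.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic stand-in (assumption) for [age, trestbps, chol, thalach]: two loose groups
group_a = rng.normal(loc=[45, 120, 200, 170], scale=[5, 8, 20, 10], size=(50, 4))
group_b = rng.normal(loc=[65, 150, 280, 120], scale=[5, 8, 20, 10], size=(50, 4))
X = StandardScaler().fit_transform(np.vstack([group_a, group_b]))

# Share of total variance captured by each of the two principal components
pca = PCA(n_components=2).fit(X)
explained = pca.explained_variance_ratio_

# Cluster on the scaled features, then score cluster separation (-1 to 1)
labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
score = silhouette_score(X, labels)

print(f"Variance kept by 2 components: {explained.sum():.2f}")
print(f"Silhouette score for k=2: {score:.2f}")
```

In the notebook, the same two quantities are available from the fitted `pca` object and from `silhouette_score(X, df['Cluster'])`. A cross-tabulation of `Cluster` against `target` would additionally show whether the clusters line up with the disease outcome.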


In [29]:
df['target_label'] = df['target'].map({0: "No Heart Disease", 1: "Heart Disease"})

# Interactive histogram
fig = px.histogram(
    df,
    x="sex",
    color="target_label",       # use text labels instead of 0/1
    barmode="group",
    text_auto=True,
    labels={"sex": "Gender", "target_label": "Heart Disease Outcome"},
    title="Heart Disease by Gender (Interactive)",
    color_discrete_map={
        "No Heart Disease": "blue",
        "Heart Disease": "yellow"
    }
)

fig.update_xaxes(tickvals=[0, 1], ticktext=["Female", "Male"])

# Adjust plot size
fig.update_layout(width=700, height=500)
fig.show()


# **Conclusion**

In [30]:
# Conclusion matrix data
data = {
    "Feature": [
        "Age", "Sex", "Chest Pain (cp)", "Cholesterol (chol)", "Resting BP (trestbps)",
        "Fasting Blood Sugar (fbs)", "ECG (restecg)", "Max Heart Rate (thalach)",
        "Exercise Angina (exang)", "ST Depression (oldpeak)", "Slope", "CA (vessels)", "Thal"
    ],
    "Conclusion": [
        "Risk ↑ after 50 years",
        "Men at higher risk",
        "Certain chest pain strongly indicates disease",
        "High values linked but weak overall",
        "High BP contributes slightly",
        "Minimal effect",
        "Abnormal ECG increases risk",
        "Lower max HR = higher risk",
        "Exercise pain indicates disease",
        "Higher ST depression = higher risk",
        "Downsloping ST linked to disease",
        "More blocked vessels = higher risk",
        "Defects (fixed/reversible) linked to disease"
    ],
    "Risk_Impact": [
        0.7, 0.6, 0.9, 0.4, 0.5,
        0.2, 0.5, 0.8, 0.7, 0.7,
        0.6, 0.9, 0.9
    ]  # hand-assigned, subjective scores on a 0 (low) → 1 (high) scale; not computed from the data
}
df_conclusion = pd.DataFrame(data)
# Simple color-coded table using Plotly
fig = px.imshow(
    [df_conclusion["Risk_Impact"]],
    labels=dict(x="Feature", y="Risk Impact", color="Impact Level"),
    x=df_conclusion["Feature"],
    y=["Risk Impact"],
    color_continuous_scale="RdBu_r"
)
fig.update_layout(
    title="Conclusion Matrix - Heart Disease Risk Analysis",
    width=1000,
    height=400
)
fig.show()
print(df_conclusion[["Feature", "Conclusion"]])


                      Feature                                     Conclusion
0                         Age                          Risk ↑ after 50 years
1                         Sex                             Men at higher risk
2             Chest Pain (cp)  Certain chest pain strongly indicates disease
3          Cholesterol (chol)            High values linked but weak overall
4       Resting BP (trestbps)                   High BP contributes slightly
5   Fasting Blood Sugar (fbs)                                 Minimal effect
6               ECG (restecg)                    Abnormal ECG increases risk
7    Max Heart Rate (thalach)                     Lower max HR = higher risk
8     Exercise Angina (exang)                Exercise pain indicates disease
9     ST Depression (oldpeak)             Higher ST depression = higher risk
10                      Slope               Downsloping ST linked to disease
11               CA (vessels)             More blocked vessels = higher risk
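The `Risk_Impact` values in the matrix are typed in by hand. One possible data-driven alternative, offered here only as a sketch, is to use each feature's absolute correlation with `target`, rescaled so the strongest feature scores 1. `impact_from_correlation` is a hypothetical helper and the toy values are invented. Correlation is a rough proxy and is not well suited to nominal codes such as `cp` or `thal`.

```python
import pandas as pd


def impact_from_correlation(frame: pd.DataFrame, target: str = "target") -> pd.Series:
    """Absolute correlation with `target`, rescaled so the strongest feature scores 1."""
    corr = frame.select_dtypes(include="number").corr()[target].drop(target).abs()
    return (corr / corr.max()).sort_values(ascending=False)


# Toy data (assumption): oldpeak separates the classes well, fbs only weakly
toy = pd.DataFrame({
    "oldpeak": [0.0, 0.2, 0.5, 1.9, 2.6, 3.1],
    "fbs":     [0, 1, 0, 1, 1, 0],
    "target":  [0, 0, 0, 1, 1, 1],
})
impact = impact_from_correlation(toy)
print(impact.round(2))
```

Comparing such scores with the hand-assigned ones would show where the summary table agrees with the data and where it reflects judgment.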

**Conclusion:** The analysis of the UCI Heart Disease dataset shows that age, cholesterol, chest pain type, blood pressure, and maximum heart rate are key risk factors for heart disease. The risk is higher in males and in individuals above 50 years, and abnormal ECG results, exercise-induced angina, and high ST depression further increase vulnerability. The cholesterol box plot also flagged very high values as outliers, which the analysis associates with greater risk.