# Heart Disease K-Means Clustering Analysis

## Project Overview
This notebook applies K-Means clustering, an unsupervised learning method, to a heart disease dataset. The goal is to identify natural groupings of patients based on their clinical features.

**Dataset**: Heart Disease Dataset (920 patient records)

**Kaggle Link**: https://www.kaggle.com/competitions/k-means-clustering-for-heart-disease-analysis/overview

**Objective**: Use K-Means clustering to group patients into distinct clusters based on cardiovascular health indicators.

---

## Phase 1: Preprocessing & EDA

### Step 1: Setup & Data Loading

In [2]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Machine Learning libraries
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score, silhouette_samples

# Configure visualization settings
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

print("✓ Libraries imported successfully")

✓ Libraries imported successfully


In [4]:
# Load the dataset
df = pd.read_csv('heart_disease.csv')

# Display first few rows
print("Dataset loaded successfully!\n")
print(f"Dataset shape: {df.shape[0]} rows × {df.shape[1]} columns\n")
df.head(10)

Dataset loaded successfully!

Dataset shape: 920 rows × 15 columns



| | id | age | sex | dataset | cp | trestbps | chol | fbs | restecg | thalch | exang | oldpeak | slope | ca | thal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 63 | Male | Cleveland | typical angina | 145.0 | 233.0 | True | lv hypertrophy | 150.0 | False | 2.3 | downsloping | 0.0 | fixed defect |
| 1 | 1 | 67 | Male | Cleveland | asymptomatic | 160.0 | 286.0 | False | lv hypertrophy | 108.0 | True | 1.5 | flat | 3.0 | normal |
| 2 | 2 | 67 | Male | Cleveland | asymptomatic | 120.0 | 229.0 | False | lv hypertrophy | 129.0 | True | 2.6 | flat | 2.0 | reversable defect |
| 3 | 3 | 37 | Male | Cleveland | non-anginal | 130.0 | 250.0 | False | normal | 187.0 | False | 3.5 | downsloping | 0.0 | normal |
| 4 | 4 | 41 | Female | Cleveland | atypical angina | 130.0 | 204.0 | False | lv hypertrophy | 172.0 | False | 1.4 | upsloping | 0.0 | normal |
| 5 | 5 | 56 | Male | Cleveland | atypical angina | 120.0 | 236.0 | False | normal | 178.0 | False | 0.8 | upsloping | 0.0 | normal |
| 6 | 6 | 62 | Female | Cleveland | asymptomatic | 140.0 | 268.0 | False | lv hypertrophy | 160.0 | False | 3.6 | downsloping | 2.0 | normal |
| 7 | 7 | 57 | Female | Cleveland | asymptomatic | 120.0 | 354.0 | False | normal | 163.0 | True | 0.6 | upsloping | 0.0 | normal |
| 8 | 8 | 63 | Male | Cleveland | asymptomatic | 130.0 | 254.0 | False | lv hypertrophy | 147.0 | False | 1.4 | flat | 1.0 | reversable defect |
| 9 | 9 | 53 | Male | Cleveland | asymptomatic | 140.0 | 203.0 | True | lv hypertrophy | 155.0 | True | 3.1 | downsloping | 0.0 | reversable defect |


**Key observations:**
- Dataset contains **920 patient records** with **15 columns**
- Features include both numerical (age, blood pressure, cholesterol) and categorical (sex, chest pain type, ECG results) variables
- The records come from four sources: Cleveland, Hungary, Switzerland, and VA Long Beach

**Feature descriptions:**
- `id`: Patient identifier
- `age`: Age in years
- `sex`: Male/Female
- `dataset`: Source dataset
- `cp`: Chest pain type (typical angina, atypical angina, non-anginal, asymptomatic)
- `trestbps`: Resting blood pressure (mm Hg)
- `chol`: Serum cholesterol (mg/dl)
- `fbs`: Fasting blood sugar > 120 mg/dl (True/False)
- `restecg`: Resting electrocardiographic results
- `thalch`: Maximum heart rate achieved
- `exang`: Exercise induced angina (True/False)
- `oldpeak`: ST depression induced by exercise
- `slope`: Slope of peak exercise ST segment
- `ca`: Number of major vessels colored by fluoroscopy
- `thal`: Thalassemia type

### Step 2: Initial Data Exploration

In [5]:
# Display basic dataset information
print("="*80)
print("DATASET OVERVIEW")
print("="*80)
print(f"\nDataset Shape: {df.shape}")
print(f"Total Records: {df.shape[0]}")
print(f"Total Features: {df.shape[1]}")
print(f"\nMemory Usage: {df.memory_usage(deep=True).sum() / 1024:.2f} KB")

DATASET OVERVIEW

Dataset Shape: (920, 15)
Total Records: 920
Total Features: 15

Memory Usage: 406.44 KB


In [None]:
# Display data types and non-null counts
print("\n" + "="*80)
print("DATA TYPES AND NON-NULL COUNTS")
print("="*80 + "\n")
df.info()


DATA TYPES AND NON-NULL COUNTS

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 920 entries, 0 to 919
Data columns (total 15 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   id        920 non-null    int64  
 1   age       920 non-null    int64  
 2   sex       920 non-null    object 
 3   dataset   920 non-null    object 
 4   cp        920 non-null    object 
 5   trestbps  861 non-null    float64
 6   chol      890 non-null    float64
 7   fbs       830 non-null    object 
 8   restecg   918 non-null    object 
 9   thalch    865 non-null    float64
 10  exang     865 non-null    object 
 11  oldpeak   858 non-null    float64
 12  slope     611 non-null    object 
 13  ca        309 non-null    float64
 14  thal      434 non-null    object 
dtypes: float64(5), int64(2), object(8)
memory usage: 107.9+ KB


In [14]:
# Check for missing values
print("\n" + "="*80)
print("MISSING VALUES ANALYSIS")
print("="*80 + "\n")

missing_data = pd.DataFrame({
    'Column': df.columns,
    'Missing_Count': df.isnull().sum(),
    'Missing_Percentage': (df.isnull().sum() / len(df) * 100).round(2)
})

missing_data = missing_data[missing_data['Missing_Count'] > 0].sort_values('Missing_Count', ascending=False)

if len(missing_data) > 0:
    print("Columns with missing values:\n")
    print(missing_data.to_string(index=False))
    print(f"\n\nTotal missing values: {df.isnull().sum().sum()}")
    print(f"Percentage of total data: {(df.isnull().sum().sum() / (df.shape[0] * df.shape[1]) * 100):.2f}%")
else:
    print("✓ No missing values found in the dataset!")


MISSING VALUES ANALYSIS

Columns with missing values:

  Column  Missing_Count  Missing_Percentage
      ca            611              66.410
    thal            486              52.830
   slope            309              33.590
     fbs             90               9.780
 oldpeak             62               6.740
trestbps             59               6.410
  thalch             55               5.980
   exang             55               5.980
    chol             30               3.260
 restecg              2               0.220


Total missing values: 1759
Percentage of total data: 12.75%
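
One way to act on the table above is to flag columns whose missing share exceeds a cutoff before deciding whether to drop or impute them. The sketch below is a minimal illustration: the helper name `columns_above_missing_threshold`, the 50% cutoff, and the `demo` frame are all assumptions for demonstration, not part of the notebook.

```python
import numpy as np
import pandas as pd


def columns_above_missing_threshold(frame: pd.DataFrame, threshold: float) -> list:
    """Return columns whose share of missing values (0-1) exceeds `threshold`."""
    missing_share = frame.isna().mean()
    return missing_share[missing_share > threshold].index.tolist()


# Made-up illustrative data, not taken from the dataset
demo = pd.DataFrame({
    "age": [63, 67, 37, 41],
    "ca": [0.0, np.nan, np.nan, np.nan],    # 75% missing
    "chol": [233.0, np.nan, 250.0, 204.0],  # 25% missing
})

# Hypothetical cutoff: flag anything more than half empty
sparse_cols = columns_above_missing_threshold(demo, threshold=0.5)
print(sparse_cols)  # ['ca']
```

Applied to the real `df` with the same cutoff, this would flag `ca` and `thal`, the two columns the table reports as more than 50% missing.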


In [15]:
# Analyze missing values by dataset source
print("\n" + "="*80)
print("MISSING VALUES BY DATASET SOURCE")
print("="*80 + "\n")

for dataset_name in df['dataset'].unique():
    dataset_subset = df[df['dataset'] == dataset_name]
    missing_count = dataset_subset.isnull().sum().sum()
    total_cells = dataset_subset.shape[0] * dataset_subset.shape[1]
    missing_pct = (missing_count / total_cells * 100)
    
    print(f"{dataset_name}:")
    print(f"  Records: {dataset_subset.shape[0]}")
    print(f"  Missing values: {missing_count} ({missing_pct:.2f}%)")
    
    # Show which columns have missing values in this dataset
    cols_with_missing = dataset_subset.isnull().sum()
    cols_with_missing = cols_with_missing[cols_with_missing > 0].sort_values(ascending=False)
    
    if len(cols_with_missing) > 0:
        print(f"  Columns with missing data:")
        for col, count in cols_with_missing.items():
            print(f"    - {col}: {count} ({count/dataset_subset.shape[0]*100:.1f}%)")
    print()


MISSING VALUES BY DATASET SOURCE

Cleveland:
  Records: 304
  Missing values: 9 (0.20%)
  Columns with missing data:
    - ca: 5 (1.6%)
    - thal: 3 (1.0%)
    - slope: 1 (0.3%)

Hungary:
  Records: 293
  Missing values: 779 (17.72%)
  Columns with missing data:
    - ca: 290 (99.0%)
    - thal: 265 (90.4%)
    - slope: 189 (64.5%)
    - chol: 23 (7.8%)
    - fbs: 8 (2.7%)
    - trestbps: 1 (0.3%)
    - restecg: 1 (0.3%)
    - thalch: 1 (0.3%)
    - exang: 1 (0.3%)

Switzerland:
  Records: 123
  Missing values: 273 (14.80%)
  Columns with missing data:
    - ca: 118 (95.9%)
    - fbs: 75 (61.0%)
    - thal: 52 (42.3%)
    - slope: 17 (13.8%)
    - oldpeak: 6 (4.9%)
    - trestbps: 2 (1.6%)
    - restecg: 1 (0.8%)
    - thalch: 1 (0.8%)
    - exang: 1 (0.8%)

VA Long Beach:
  Records: 200
  Missing values: 698 (23.27%)
  Columns with missing data:
    - ca: 198 (99.0%)
    - thal: 166 (83.0%)
    - slope: 102 (51.0%)
    - trestbps: 56 (28.0%)
    - oldpeak: 56 (28.0%)
    - thalch: 5

In [8]:
# Identify numerical and categorical columns
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_cols = df.select_dtypes(include=['object', 'bool']).columns.tolist()

print("\n" + "="*80)
print("FEATURE TYPES")
print("="*80 + "\n")

print(f"Numerical Features ({len(numerical_cols)}):")
print(f"  {', '.join(numerical_cols)}\n")

print(f"Categorical Features ({len(categorical_cols)}):")
print(f"  {', '.join(categorical_cols)}")


FEATURE TYPES

Numerical Features (7):
  id, age, trestbps, chol, thalch, oldpeak, ca

Categorical Features (8):
  sex, dataset, cp, fbs, restecg, exang, slope, thal
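
`fbs` and `exang` hold True/False values yet are listed as categorical, because their dtype is `object` in the `df.info()` output. This is what pandas typically produces when a boolean column also contains missing entries. A minimal sketch of one option, converting such a column to pandas' nullable `boolean` dtype, follows; the `demo` Series is made-up illustrative data.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for a column like `fbs`: booleans plus one missing entry
demo = pd.Series([True, False, np.nan, True], name="fbs", dtype="object")

# The nullable "boolean" dtype keeps missing entries as <NA>
# while letting the column behave like a true boolean
converted = demo.astype("boolean")

print(converted.dtype)
print(converted.isna().sum())
```

Whether this conversion is worthwhile depends on how the column is encoded later; it is shown here only to make the dtype listing easier to interpret.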


In [9]:
# Statistical summary of numerical features
print("\n" + "="*80)
print("STATISTICAL SUMMARY - NUMERICAL FEATURES")
print("="*80 + "\n")

df[numerical_cols].describe().T


STATISTICAL SUMMARY - NUMERICAL FEATURES



| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| id | 920.0 | 459.5 | 265.725 | 0.0 | 229.75 | 459.5 | 689.25 | 919.0 |
| age | 920.0 | 53.511 | 9.425 | 28.0 | 47.0 | 54.0 | 60.0 | 77.0 |
| trestbps | 861.0 | 132.132 | 19.066 | 0.0 | 120.0 | 130.0 | 140.0 | 200.0 |
| chol | 890.0 | 199.13 | 110.781 | 0.0 | 175.0 | 223.0 | 268.0 | 603.0 |
| thalch | 865.0 | 137.546 | 25.926 | 60.0 | 120.0 | 140.0 | 157.0 | 202.0 |
| oldpeak | 858.0 | 0.879 | 1.091 | -2.6 | 0.0 | 0.5 | 1.5 | 6.2 |
| ca | 309.0 | 0.676 | 0.936 | 0.0 | 0.0 | 0.0 | 1.0 | 3.0 |
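
The summary shows a minimum of 0 for both `trestbps` and `chol`, which is not a plausible resting blood pressure or serum cholesterol reading. A minimal sketch of one possible treatment follows, under the assumption (not confirmed by the source) that these zeros are placeholders for unrecorded values; the `demo` frame is made-up illustrative data.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the real data (values are illustrative only)
demo = pd.DataFrame({
    "trestbps": [145.0, 0.0, 130.0, np.nan],
    "chol": [233.0, 0.0, 0.0, 250.0],
})

# Assumption: a reading of 0 means "not measured" in these two columns
zero_as_missing = ["trestbps", "chol"]

cleaned = demo.copy()
cleaned[zero_as_missing] = cleaned[zero_as_missing].replace(0, np.nan)

# Missing counts now include the former zeros
print(cleaned.isna().sum())
```

If the assumption holds, the mean of `chol` in particular would shift noticeably once the zeros are excluded, so it is worth checking before scaling.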


In [10]:
# Summary of categorical features
print("\n" + "="*80)
print("CATEGORICAL FEATURES SUMMARY")
print("="*80 + "\n")

for col in categorical_cols:
    print(f"\n{col.upper()}:")
    print(f"  Unique values: {df[col].nunique()}")
    print(f"  Value counts:")
    value_counts = df[col].value_counts()
    for val, count in value_counts.items():
        # Percentages are relative to all rows, so columns with missing values sum to less than 100%
        percentage = (count / len(df) * 100)
        print(f"    - {val}: {count} ({percentage:.1f}%)")


CATEGORICAL FEATURES SUMMARY


SEX:
  Unique values: 2
  Value counts:
    - Male: 726 (78.9%)
    - Female: 194 (21.1%)

DATASET:
  Unique values: 4
  Value counts:
    - Cleveland: 304 (33.0%)
    - Hungary: 293 (31.8%)
    - VA Long Beach: 200 (21.7%)
    - Switzerland: 123 (13.4%)

CP:
  Unique values: 4
  Value counts:
    - asymptomatic: 496 (53.9%)
    - non-anginal: 204 (22.2%)
    - atypical angina: 174 (18.9%)
    - typical angina: 46 (5.0%)

FBS:
  Unique values: 2
  Value counts:
    - False: 692 (75.2%)
    - True: 138 (15.0%)

RESTECG:
  Unique values: 3
  Value counts:
    - normal: 551 (59.9%)
    - lv hypertrophy: 188 (20.4%)
    - st-t abnormality: 179 (19.5%)

EXANG:
  Unique values: 2
  Value counts:
    - False: 528 (57.4%)
    - True: 337 (36.6%)

SLOPE:
  Unique values: 3
  Value counts:
    - flat: 345 (37.5%)
    - upsloping: 203 (22.1%)
    - downsloping: 63 (6.8%)

THAL:
  Unique values: 3
  Value counts:
    - normal: 196 (21.3%)
    - reversable defect: 19

In [11]:
# Check for duplicate rows
print("\n" + "="*80)
print("DUPLICATE RECORDS CHECK")
print("="*80 + "\n")

duplicates = df.duplicated().sum()
if duplicates > 0:
    print(f"⚠ Found {duplicates} duplicate rows ({(duplicates/len(df)*100):.2f}%)")
    print("\nDuplicate rows:")
    print(df[df.duplicated(keep=False)].sort_values(by=list(df.columns)))
else:
    print("✓ No duplicate rows found!")


DUPLICATE RECORDS CHECK

✓ No duplicate rows found!
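
Because `id` appears to be unique per row (it runs from 0 to 919 in the summary statistics), a full-row `duplicated()` check cannot flag two patients with identical clinical values. A minimal sketch of a complementary check that ignores the identifier follows; the `demo` frame is made-up illustrative data.

```python
import pandas as pd

# Made-up data: rows 0 and 1 differ only in `id`
demo = pd.DataFrame({
    "id": [0, 1, 2],
    "age": [63, 63, 41],
    "sex": ["Male", "Male", "Female"],
})

# Full-row check: the unique `id` hides the repeated record
full_row_dupes = demo.duplicated().sum()

# Same check restricted to the non-identifier columns
feature_cols = [col for col in demo.columns if col != "id"]
feature_dupes = demo.duplicated(subset=feature_cols).sum()

print(full_row_dupes, feature_dupes)  # 0 1
```

A match on the feature columns is not necessarily an error, since two patients can share the same values, but it is a useful thing to know before clustering.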


#### Analysis: Initial Data Exploration

**Dataset Structure:**
- 920 patient records with 15 features
- Mix of numerical (age, blood pressure, cholesterol, heart rate) and categorical (sex, chest pain type, ECG results) features
- Data from four sources: Cleveland, Hungary, Switzerland, and VA Long Beach

**Data Quality Issues:**
- **Missing values detected**, concentrated in `ca` (coronary vessels, 66.4% missing), `thal` (thalassemia, 52.8%), and `slope` (33.6%)
- Missing data varies by dataset source: VA Long Beach has far more missing cells than Cleveland (23.27% vs 0.20%)
- No duplicate rows found; the check compared full rows, including the `id` column
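
The per-source comparison can also be computed in one step with a `groupby`, which gives a source-by-column table of missing percentages. This is a sketch on made-up data; `demo` and `missing_pct_by_source` are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Made-up illustrative data with two sources
demo = pd.DataFrame({
    "dataset": ["Cleveland", "Cleveland", "VA Long Beach", "VA Long Beach"],
    "ca": [0.0, 3.0, np.nan, np.nan],
    "thal": ["normal", "normal", np.nan, "normal"],
})

# Share of missing values per column within each source, in percent
missing_pct_by_source = (
    demo.drop(columns="dataset")
        .isna()
        .astype(float)
        .groupby(demo["dataset"])
        .mean()
        .mul(100)
)

print(missing_pct_by_source)
```

Each row of the result is a source and each column a feature, so columns that are nearly empty in one source but complete in another stand out at a glance.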

**Feature Insights:**
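- **Numerical**: Age (28-77 years), resting blood pressure (~132 mm Hg avg), cholesterol (~199 mg/dl avg), max heart rate (~138 bpm avg). `trestbps` and `chol` both have a minimum of 0, which is not physiologically plausible and may mark unrecorded values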
- **Categorical**: Sex, chest pain types (4 categories), ECG results, boolean indicators (fasting blood sugar, exercise angina)
- **Non-features**: `id` and `dataset` are identifiers, not useful for clustering
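
Excluding the identifier columns before clustering could look like the sketch below. The `demo` frame is made-up illustrative data, and `non_feature_cols` is a name introduced here for illustration.

```python
import pandas as pd

# Made-up illustrative data
demo = pd.DataFrame({
    "id": [0, 1],
    "dataset": ["Cleveland", "Hungary"],
    "age": [63, 48],
    "sex": ["Male", "Female"],
})

# Identifier columns carry no clinical information, so leave them out
non_feature_cols = ["id", "dataset"]
features = demo.drop(columns=non_feature_cols)

print(features.columns.tolist())  # ['age', 'sex']
```

Dropping into a new frame rather than modifying `demo` in place keeps `dataset` available for later checks, such as comparing cluster membership against the data source.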

