# Heart Disease K-Means Clustering Analysis

## Project Overview
This notebook applies K-Means clustering, an unsupervised learning method, to a heart disease dataset. The goal is to identify natural groupings of patients based on their clinical features.

**Dataset**: Heart Disease Dataset (920 patient records)

**Kaggle Link**: https://www.kaggle.com/competitions/k-means-clustering-for-heart-disease-analysis/overview

**Objective**: Use K-Means clustering to group patients into distinct clusters based on cardiovascular health indicators.

---

## Phase 1: Preprocessing & EDA

### Step 1: Setup & Data Loading

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Machine Learning libraries
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score, silhouette_samples

# Configure visualization settings
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

print("✓ Libraries imported successfully")

✓ Libraries imported successfully


In [2]:
# Load the dataset
df = pd.read_csv('heart_disease.csv')

# Display first few rows
print("Dataset loaded successfully!\n")
print(f"Dataset shape: {df.shape[0]} rows × {df.shape[1]} columns\n")
df.head(10)

Dataset loaded successfully!

Dataset shape: 920 rows × 15 columns



| | id | age | sex | dataset | cp | trestbps | chol | fbs | restecg | thalch | exang | oldpeak | slope | ca | thal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 63 | Male | Cleveland | typical angina | 145.0 | 233.0 | True | lv hypertrophy | 150.0 | False | 2.3 | downsloping | 0.0 | fixed defect |
| 1 | 1 | 67 | Male | Cleveland | asymptomatic | 160.0 | 286.0 | False | lv hypertrophy | 108.0 | True | 1.5 | flat | 3.0 | normal |
| 2 | 2 | 67 | Male | Cleveland | asymptomatic | 120.0 | 229.0 | False | lv hypertrophy | 129.0 | True | 2.6 | flat | 2.0 | reversable defect |
| 3 | 3 | 37 | Male | Cleveland | non-anginal | 130.0 | 250.0 | False | normal | 187.0 | False | 3.5 | downsloping | 0.0 | normal |
| 4 | 4 | 41 | Female | Cleveland | atypical angina | 130.0 | 204.0 | False | lv hypertrophy | 172.0 | False | 1.4 | upsloping | 0.0 | normal |
| 5 | 5 | 56 | Male | Cleveland | atypical angina | 120.0 | 236.0 | False | normal | 178.0 | False | 0.8 | upsloping | 0.0 | normal |
| 6 | 6 | 62 | Female | Cleveland | asymptomatic | 140.0 | 268.0 | False | lv hypertrophy | 160.0 | False | 3.6 | downsloping | 2.0 | normal |
| 7 | 7 | 57 | Female | Cleveland | asymptomatic | 120.0 | 354.0 | False | normal | 163.0 | True | 0.6 | upsloping | 0.0 | normal |
| 8 | 8 | 63 | Male | Cleveland | asymptomatic | 130.0 | 254.0 | False | lv hypertrophy | 147.0 | False | 1.4 | flat | 1.0 | reversable defect |
| 9 | 9 | 53 | Male | Cleveland | asymptomatic | 140.0 | 203.0 | True | lv hypertrophy | 155.0 | True | 3.1 | downsloping | 0.0 | reversable defect |


**Key observations:**
- Dataset contains **920 patient records** with **15 columns**
- Features include both numerical (age, blood pressure, cholesterol) and categorical (sex, chest pain type, ECG results) variables
- The `dataset` column records the source of each record (e.g., Cleveland, VA Long Beach), which indicates that the data were pooled from multiple sites
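The observations above can be checked programmatically. The following is a minimal sketch, assuming a small hand-built `sample` frame that mirrors a few of the displayed rows (the frame and the variable names `numeric_cols`, `bool_cols`, `categorical_cols`, and `source_counts` are illustrative and not part of the original notebook). It separates numeric, boolean, and categorical columns and counts records per source.

```python
import pandas as pd

# Hypothetical sample mirroring a few columns of the first rows shown above
sample = pd.DataFrame({
    "age": [63, 67, 67, 37],
    "sex": ["Male", "Male", "Male", "Male"],
    "dataset": ["Cleveland"] * 4,
    "cp": ["typical angina", "asymptomatic", "asymptomatic", "non-anginal"],
    "trestbps": [145.0, 160.0, 120.0, 130.0],
    "chol": [233.0, 286.0, 229.0, 250.0],
    "fbs": [True, False, False, False],
})

# Numeric columns (int/float); boolean columns are not counted as numeric
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
bool_cols = sample.select_dtypes(include="bool").columns.tolist()

# Treat everything else as categorical (text labels)
categorical_cols = [
    c for c in sample.columns if c not in numeric_cols + bool_cols
]

# Number of records contributed by each source
source_counts = sample["dataset"].value_counts()

print("Numeric:    ", numeric_cols)
print("Boolean:    ", bool_cols)
print("Categorical:", categorical_cols)
print(source_counts)
```

Running the same three selections on the full `df` would show how many features need encoding before clustering, and `df["dataset"].value_counts()` would list every source present in the data.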

**Feature descriptions:**
- `id`: Patient identifier
- `age`: Age in years
- `sex`: Male/Female
- `dataset`: Source dataset
- `cp`: Chest pain type (typical angina, atypical angina, non-anginal, asymptomatic)
- `trestbps`: Resting blood pressure (mm Hg)
- `chol`: Serum cholesterol (mg/dl)
- `fbs`: Fasting blood sugar > 120 mg/dl (True/False)
- `restecg`: Resting electrocardiographic results
- `thalch`: Maximum heart rate achieved
- `exang`: Exercise induced angina (True/False)
- `oldpeak`: ST depression induced by exercise relative to rest
- `slope`: Slope of the peak exercise ST segment (upsloping, flat, downsloping)
- `ca`: Number of major vessels (0-3) colored by fluoroscopy
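- `thal`: Thallium stress test result (normal, fixed defect, reversable defect); frequently mislabeled as thalassemia in dataset summaries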
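One way to keep this feature list in sync with the loaded data is a small schema check. The sketch below is illustrative only: `EXPECTED_COLUMNS` is transcribed from the descriptions above, while `check_schema` and the `demo` frame are hypothetical helpers that do not appear in the original notebook.

```python
import pandas as pd

# Column names taken from the feature descriptions above
EXPECTED_COLUMNS = [
    "id", "age", "sex", "dataset", "cp", "trestbps", "chol", "fbs",
    "restecg", "thalch", "exang", "oldpeak", "slope", "ca", "thal",
]

def check_schema(frame, expected=EXPECTED_COLUMNS):
    """Return (missing, unexpected) column names relative to `expected`."""
    missing = [c for c in expected if c not in frame.columns]
    unexpected = [c for c in frame.columns if c not in expected]
    return missing, unexpected

# Hypothetical empty frame with one renamed column, for illustration only
demo_columns = [c for c in EXPECTED_COLUMNS if c != "thalch"] + ["thalach"]
demo = pd.DataFrame(columns=demo_columns)

missing, unexpected = check_schema(demo)
print("Missing:   ", missing)
print("Unexpected:", unexpected)
```

If the loaded `df` matches the 15 columns described here, `check_schema(df)` would be expected to return two empty lists. Note that `id` is an identifier rather than a clinical measurement, so it is a natural candidate to exclude before clustering.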