
# Heart Disease Prediction - UCI Dataset

## Objective
- Import and explore the dataset
- Perform data cleaning and preprocessing
- Train a classification model
- Evaluate performance metrics
- Summarize findings


In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Load dataset
df = pd.read_csv('/mnt/data/heart_disease_uci.csv')

df.head()


In [None]:

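# Basic information: column dtypes and non-null counts
df.info()

# Statistical summary, printed explicitly because a notebook cell
# only renders the value of its last expression
print(df.describe())

# Count missing values per column
print(df.isnull().sum())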


In [None]:

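# Handle missing values: fill each numeric column with its median.
# Non-numeric columns are left untouched by this step.
df = df.fillna(df.median(numeric_only=True))

# One-hot encode categorical columns, if any. drop_first=True drops one
# level per column so the dummy columns are not perfectly collinear.
# Note: a missing categorical value is encoded as an all-zero row.
df = pd.get_dummies(df, drop_first=True)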

df.head()


In [None]:

# Assumes the label column is named 'target'; adjust the name if the CSV differs
X = df.drop('target', axis=1)
y = df['target']

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

X_train.shape, X_test.shape
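
The `stratify=y` argument used in the split above keeps the class proportions the same in the train and test sets. A minimal sketch of that effect, using made-up labels (70 negatives, 30 positives) rather than the heart disease data; all `_demo` names are hypothetical:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 70 of class 0, 30 of class 1
y_demo = np.array([0] * 70 + [1] * 30)
X_demo = np.arange(100).reshape(-1, 1)

X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Share of positives in each split; with stratification both match the
# 30% share in the full label array
print("train positive share:", y_tr_demo.mean())
print("test positive share:", y_te_demo.mean())
```

Without `stratify`, the two shares would generally differ from split to split, which matters most for small or imbalanced datasets.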


In [None]:

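# Standardize features to zero mean and unit variance
scaler = StandardScaler()

# Fit the scaler on the training split only, then reuse its statistics
# on the test split so no test information leaks into training
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)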
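
To illustrate what the scaling step does, here is a small sketch on synthetic data (not the heart disease dataset; the `_demo` names are hypothetical). It checks that the training features end up with mean 0 and standard deviation 1, while the test features, transformed with the training statistics, are only approximately so:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with a non-zero mean and non-unit spread
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
X_test_demo = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

scaler_demo = StandardScaler()
X_train_demo_scaled = scaler_demo.fit_transform(X_train_demo)  # learns mean/std
X_test_demo_scaled = scaler_demo.transform(X_test_demo)        # reuses them

# Training columns: mean ~0 and std ~1 by construction
print(X_train_demo_scaled.mean(axis=0).round(6))
print(X_train_demo_scaled.std(axis=0).round(6))

# Test columns: close to 0, but not exactly, since the statistics
# came from the training split
print(X_test_demo_scaled.mean(axis=0).round(3))
```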


In [None]:

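# Logistic Regression baseline; max_iter is raised from the default of 100
# to give the solver more room to converge
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)

# Predict class labels for the held-out test set
y_pred = model.predict(X_test_scaled)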


In [None]:

accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)
print("\nConfusion Matrix:\n", cm)
print("\nClassification Report:\n", report)
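
The metrics in the classification report follow directly from the confusion matrix. The sketch below works this through on a tiny hand-made example, assuming a binary target (0 = no disease, 1 = disease); the labels and `_demo` names are illustrative, not model output:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hand-made labels and predictions, for illustration only
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_pred_demo = np.array([0, 0, 1, 0, 1, 1, 0, 0, 1, 0])

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy_manual = (tp + tn) / (tp + tn + fp + fn)
precision_manual = tp / (tp + fp)  # of predicted positives, how many are right
recall_manual = tp / (tp + fn)     # of actual positives, how many are found

print("tn, fp, fn, tp:", tn, fp, fn, tp)
print("accuracy:", accuracy_manual)
print("precision:", precision_manual, precision_score(y_true_demo, y_pred_demo))
print("recall:", recall_manual, recall_score(y_true_demo, y_pred_demo))
```

In a medical screening setting, recall on the disease class is often the number to watch, since a false negative is a missed case.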



## Summary of Analysis

- Dataset was successfully imported and explored.
- Missing values in numeric columns were filled with the column median.
- Categorical variables were encoded using one-hot encoding.
- Features were standardized before model training.
- Logistic Regression was implemented for classification.
- Model performance was evaluated using accuracy, confusion matrix, precision, recall, and F1-score.
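
The steps listed above could also be bundled into a single scikit-learn `Pipeline`. The following is a sketch of that alternative, not the notebook's own code: it runs on a synthetic DataFrame whose column names (`age`, `chol`, `cp`, `target`) are assumptions, and it differs from the notebook in that the median is learned from the training split only rather than from the full dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data; column names are hypothetical
rng = np.random.default_rng(42)
n = 300
demo = pd.DataFrame({
    "age": rng.normal(55, 9, n),
    "chol": rng.normal(240, 40, n),
    "cp": rng.choice(["typical", "atypical", "non-anginal"], n),
})
demo["target"] = (demo["age"] + rng.normal(0, 5, n) > 55).astype(int)
demo.loc[rng.choice(n, 20, replace=False), "chol"] = np.nan  # inject gaps

numeric_cols = ["age", "chol"]
categorical_cols = ["cp"]

# Numeric columns: median imputation, then standardization
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical columns: one-hot encoding with the first level dropped
preprocess = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    ("cat", OneHotEncoder(drop="first"), categorical_cols),
])
clf_demo = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    demo.drop(columns="target"), demo["target"],
    test_size=0.2, random_state=42, stratify=demo["target"],
)
clf_demo.fit(X_tr_demo, y_tr_demo)
test_accuracy_demo = clf_demo.score(X_te_demo, y_te_demo)
print("test accuracy on synthetic data:", test_accuracy_demo)
```

Keeping preprocessing inside the pipeline means every statistic (medians, means, category levels) is fitted on training data only, which also makes the whole model safe to pass to cross-validation.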

### Key Findings
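- Logistic Regression gives a reasonable baseline on the held-out 20% test split; the exact scores are those printed by the evaluation cell.
- Standardizing the features helps the Logistic Regression solver converge.
- Further improvement may be achievable with ensemble methods such as Random Forest, or with hyperparameter tuning.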
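
As a rough sketch of the last point, a Random Forest can be tuned with a small grid search. The data below comes from `make_classification`, not the heart disease CSV, and the grid values are arbitrary choices for illustration rather than recommended settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification problem as a stand-in dataset
X_rf_demo, y_rf_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=42
)
X_tr_rf, X_te_rf, y_tr_rf, y_te_rf = train_test_split(
    X_rf_demo, y_rf_demo, test_size=0.2, random_state=42, stratify=y_rf_demo
)

# Small, arbitrary grid; a real search would likely cover more values
param_grid_demo = {"n_estimators": [50, 100], "max_depth": [3, None]}

search_demo = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid_demo,
    cv=5,                # 5-fold cross-validation on the training split
    scoring="accuracy",
)
search_demo.fit(X_tr_rf, y_tr_rf)

print("best parameters:", search_demo.best_params_)
print("cross-validated accuracy:", search_demo.best_score_)
print("held-out accuracy:", search_demo.score(X_te_rf, y_te_rf))
```

Tree ensembles do not need standardized inputs, so the scaling step can be skipped for this model; whether it beats the logistic regression baseline on the real data would have to be checked on the same test split.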
