# Heart Disease Prediction Report

## 1. Introduction

The primary aim of this project is to develop and evaluate machine learning models that predict the presence or absence of heart disease from patient health metrics. Early and accurate prediction matters in healthcare because it enables timely intervention, supports personalized treatment plans, and encourages preventive care. Such models can assist clinicians in making more informed decisions and in identifying at-risk individuals.

The dataset used for this analysis is a consolidated collection from two primary sources:

-   UCI Machine Learning Repository - Heart Disease Dataset
-   Kaggle - Heart Disease Dataset by Rasel Ahmed

All patient data has been anonymized to ensure privacy and compliance with ethical data usage practices.

-   Kaggle: https://www.kaggle.com/datasets/data855/heart-disease/data
-   UCI Machine Learning Repository: https://archive.ics.uci.edu/dataset/45/heart+disease

### Data Dictionary

The dataset includes the following features, which are crucial for the predictive analysis:
-   age: age in years
-   sex: sex (1 = male; 0 = female)
-   cp: chest pain type
    -   Value 1: typical angina
    -   Value 2: atypical angina
    -   Value 3: non-anginal pain
    -   Value 4: asymptomatic
-   trestbps: resting blood pressure (in mm Hg on admission to the hospital)
-   chol: serum cholesterol in mg/dl
-   fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
-   restecg: resting electrocardiographic results
    -   Value 0: normal
    -   Value 1: having ST-T wave abnormality
    -   Value 2: showing probable or definite left ventricular hypertrophy
-   thalach: maximum heart rate achieved
-   exang: exercise induced angina (1 = yes; 0 = no)
-   oldpeak: ST depression induced by exercise relative to rest
-   slope: the slope of the peak exercise ST segment
-   ca: number of major vessels (0-3) colored by fluoroscopy
-   thal: thallium stress test result
    -   Value 3: normal
    -   Value 6: fixed defect
    -   Value 7: reversible defect
-   target: diagnosis of heart disease (1 = has heart disease; 0 = does not have heart disease)
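
The integer codes in the dictionary are easier to read in tables and plots once they are mapped to labels. The following is a minimal sketch, assuming the codes match the dictionary above (the mock data later in this report draws `cp` and `thal` from 0-based ranges, so the mapping should be checked against the actual file). `CP_LABELS`, `THAL_LABELS`, and `decode_codes` are illustrative names and not part of the original analysis.

```python
import pandas as pd

# Code-to-label mappings copied from the data dictionary above.
CP_LABELS = {
    1: "typical angina",
    2: "atypical angina",
    3: "non-anginal pain",
    4: "asymptomatic",
}
THAL_LABELS = {3: "normal", 6: "fixed defect", 7: "reversible defect"}


def decode_codes(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of the frame with human-readable label columns added."""
    out = frame.copy()
    out["cp_label"] = out["cp"].map(CP_LABELS)
    out["thal_label"] = out["thal"].map(THAL_LABELS)
    return out


# Tiny hand-made example; the values are illustrative, not real patients.
example = pd.DataFrame({"cp": [1, 4, 3], "thal": [3, 7, 6]})
decoded = decode_codes(example)
print(decoded)
```

Codes that are missing from a mapping come back as `NaN`, which makes an unexpected encoding easy to spot.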

## 2. Data Analysis

This section outlines the steps taken to prepare the data for modeling, including preprocessing, exploratory data analysis, and the rationale for these choices.

### Data Preprocessing

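The first step is to load the data and prepare it for the machine learning algorithms. In general this involves handling missing values, encoding categorical variables, and scaling numerical features. Scaling matters most for the distance- and margin-based models used later (K-Nearest Neighbors and Support Vector Machine) and also helps Logistic Regression converge. Since the dataset is relatively clean, the code in this section checks for missing values and standardizes the continuous features; the categorical features are left as integer codes.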

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_curve, auc, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
from collections import defaultdict

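# Mock data generation for demonstration purposes, as the actual file is not available.
# In a real-world scenario, you would load the data from a CSV file.
# Note: cp and thal are drawn from 0-based codes here, whereas the data dictionary lists 1-4 and 3/6/7.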
def create_mock_data(n_samples=303):
    np.random.seed(42)
    data = {
        'age': np.random.randint(29, 77, n_samples),
        'sex': np.random.randint(0, 2, n_samples),
        'cp': np.random.randint(0, 4, n_samples),
        'trestbps': np.random.randint(94, 200, n_samples),
        'chol': np.random.randint(126, 564, n_samples),
        'fbs': np.random.randint(0, 2, n_samples),
        'restecg': np.random.randint(0, 3, n_samples),
        'thalach': np.random.randint(71, 202, n_samples),
        'exang': np.random.randint(0, 2, n_samples),
        'oldpeak': np.random.uniform(0.0, 6.2, n_samples).round(2),
        'slope': np.random.randint(0, 3, n_samples),
        'ca': np.random.randint(0, 4, n_samples),
        'thal': np.random.choice([0, 1, 2, 3], n_samples),
        'target': np.random.randint(0, 2, n_samples)
    }
    df = pd.DataFrame(data)
    return df

df = create_mock_data()

# Check for missing values
print('Missing values per column:')
print(df.isnull().sum())

# Define features and target
X = df.drop('target', axis=1)
y = df['target']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Identify numerical features for scaling
numerical_cols = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']

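# Scale numerical features using StandardScaler.
# Copy the splits first so the column assignments do not trigger pandas' SettingWithCopyWarning.
X_train, X_test = X_train.copy(), X_test.copy()
# Fit the scaler on the training split only, then apply the same transform to the test split.
scaler = StandardScaler()
X_train[numerical_cols] = scaler.fit_transform(X_train[numerical_cols])
X_test[numerical_cols] = scaler.transform(X_test[numerical_cols])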

print('Scaled data head:')
print(X_train.head())
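
The preprocessing text mentions missing-value handling and categorical encoding, which the cell above does not perform. One way those steps could be combined with scaling is sketched here using scikit-learn's `ColumnTransformer`. This is an illustration under stated assumptions, not the method used in this report: the choice of which columns count as categorical, the median and most-frequent imputation strategies, and the small `demo` frame are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["age", "trestbps", "chol", "thalach", "oldpeak"]
# Assumption: these integer-coded columns are treated as nominal categories.
categorical_cols = ["cp", "restecg", "slope", "thal"]

numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer(
    [("num", numeric_steps, numeric_cols),
     ("cat", categorical_steps, categorical_cols)],
    remainder="passthrough",  # sex, fbs, exang, and ca pass through unchanged
    sparse_threshold=0.0,     # always return a dense array
)

# Small random frame standing in for the real data (illustrative only).
rng = np.random.default_rng(0)
n = 40
demo = pd.DataFrame({
    "age": rng.integers(29, 77, n),
    "sex": rng.integers(0, 2, n),
    "cp": rng.integers(0, 4, n),
    "trestbps": rng.integers(94, 200, n),
    "chol": rng.integers(126, 564, n).astype(float),
    "fbs": rng.integers(0, 2, n),
    "restecg": rng.integers(0, 3, n),
    "thalach": rng.integers(71, 202, n),
    "exang": rng.integers(0, 2, n),
    "oldpeak": rng.uniform(0.0, 6.2, n).round(2),
    "slope": rng.integers(0, 3, n),
    "ca": rng.integers(0, 4, n),
    "thal": rng.integers(0, 4, n),
})
demo.loc[0, "chol"] = np.nan  # simulate one missing measurement

train, test = demo.iloc[:30], demo.iloc[30:]
X_train_prepared = preprocessor.fit_transform(train)  # statistics learned on train only
X_test_prepared = preprocessor.transform(test)
print(X_train_prepared.shape, X_test_prepared.shape)
```

Keeping every step inside one transformer means the imputation values, category lists, and scaling statistics are all learned from the training split and then reused unchanged on the test split.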

### Exploratory Data Analysis (EDA)

EDA is a crucial step to understand the data's characteristics, identify patterns, and visualize relationships between features. This helps in validating the data and informs the choice of algorithms.

In [None]:
# Display descriptive statistics
print('Descriptive Statistics:')
print(df.describe())

# Visualize the distribution of the target variable
plt.figure(figsize=(6, 4))
sns.countplot(x='target', data=df)
plt.title('Distribution of Heart Disease (Target)')
plt.xlabel('Diagnosis (0: No Heart Disease, 1: Heart Disease)')
plt.ylabel('Count')
plt.show()

# Create a correlation matrix to visualize relationships between features
plt.figure(figsize=(12, 10))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix of Features')
plt.show()
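
A heatmap of thirteen features can be hard to scan, so it may help to rank features by the strength of their correlation with the target. The sketch below does this on a toy frame; `rank_target_correlations` and the `related`/`unrelated` columns are hypothetical names introduced only for illustration.

```python
import numpy as np
import pandas as pd


def rank_target_correlations(frame: pd.DataFrame, target: str = "target") -> pd.Series:
    """Pearson correlation of each feature with the target, strongest first."""
    corr = frame.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Toy data: one feature built from the same signal as the target, one pure noise.
rng = np.random.default_rng(1)
n = 200
signal = rng.normal(size=n)
toy = pd.DataFrame({
    "related": signal + rng.normal(scale=0.5, size=n),
    "unrelated": rng.normal(size=n),
    "target": (signal > 0).astype(int),
})

ranking = rank_target_correlations(toy)
print(ranking)
```

Pearson correlation only captures linear association, so a low value for an integer-coded categorical feature does not by itself mean the feature is uninformative.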

## 3. Model Implementation

We implement and evaluate four widely used classification algorithms: Logistic Regression, K-Nearest Neighbors, Support Vector Machine, and Random Forest. They were chosen because they are well-established baselines for tabular classification and cover different model families: a linear model, an instance-based method, a margin-based method, and a tree ensemble. Logistic Regression is the most directly interpretable of the four.

In [None]:
models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'K-Nearest Neighbors': KNeighborsClassifier(),
    'Support Vector Machine': SVC(probability=True, random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42)
}

results = defaultdict(dict)
trained_models = {}

print('Training models and making predictions...')
for name, model in models.items():
    # Train the model
    model.fit(X_train, y_train)
    trained_models[name] = model
    
    # Make predictions
    y_pred = model.predict(X_test)
    
    # Calculate metrics
    results[name]['Accuracy'] = accuracy_score(y_test, y_pred)
    results[name]['Precision'] = precision_score(y_test, y_pred)
    results[name]['Recall'] = recall_score(y_test, y_pred)
    results[name]['F1-Score'] = f1_score(y_test, y_pred)
    
    # For ROC curve and AUC, get prediction probabilities
    if hasattr(model, 'predict_proba'):
        y_proba = model.predict_proba(X_test)[:, 1]
        fpr, tpr, _ = roc_curve(y_test, y_proba)
        roc_auc = auc(fpr, tpr)
        results[name]['AUC'] = roc_auc
        results[name]['FPR'] = fpr
        results[name]['TPR'] = tpr

print('Training complete.')
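
With roughly 300 rows, a single 80/20 split leaves only about 60 test samples, so the metrics can shift noticeably from one split to another. A possible complement, sketched here and not part of the original analysis, is stratified k-fold cross-validation with the scaler placed inside a pipeline so that it is refit on each training fold. The data comes from `make_classification` as a stand-in, and the fold count of 5 is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the mock frame (303 rows, 13 features).
X_demo, y_demo = make_classification(
    n_samples=303, n_features=13, n_informative=5, random_state=42
)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "K-Nearest Neighbors": KNeighborsClassifier(),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = {}
for name, estimator in candidates.items():
    # Scaling lives inside the pipeline, so each fold fits its own scaler.
    pipeline = make_pipeline(StandardScaler(), estimator)
    scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring="f1")
    cv_scores[name] = scores
    print(f"{name}: F1 = {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the mean and spread across folds gives a rough sense of how much of the gap between two models is noise.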

## 4. Results and Discussion

This section presents the performance metrics for each model and discusses the findings. Accuracy, precision, recall, F1-score, and AUC-ROC are used to evaluate model performance.
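
To make the metric definitions concrete, the sketch below computes accuracy, precision, recall, and F1 by hand from a confusion matrix and compares them with scikit-learn's functions. The ten labels are made up for the example.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hand-made labels: 4 patients with disease (1) and 6 without (0).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)          # 8 / 10 = 0.80
precision = tp / (tp + fp)                          # 3 / 4  = 0.75
recall = tp / (tp + fn)                             # 3 / 4  = 0.75
f1 = 2 * precision * recall / (precision + recall)  # 0.75

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")

# The manual values agree with the library functions.
assert abs(accuracy - accuracy_score(y_true, y_hat)) < 1e-12
assert abs(precision - precision_score(y_true, y_hat)) < 1e-12
assert abs(recall - recall_score(y_true, y_hat)) < 1e-12
assert abs(f1 - f1_score(y_true, y_hat)) < 1e-12
```

Here the single false negative is the patient with heart disease who was predicted healthy, which is the error type recall penalizes.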

In [None]:
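# Display results in a DataFrame for clear comparison.
# The transposed frame has object dtype because it also stores the FPR/TPR arrays,
# so the scalar metrics are cast to float before printing and ranking.
metric_cols = ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'AUC']
results_df = pd.DataFrame(results).T
metrics_df = results_df[metric_cols].astype(float)
print('Model Performance Metrics:')
print(metrics_df)

# Find the best model based on F1-Score
best_model_name = metrics_df['F1-Score'].idxmax()
print(f'Best performing model based on F1-Score: {best_model_name}')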

# Plot ROC curves for all models
plt.figure(figsize=(10, 8))
for name, res in results.items():
    if 'AUC' in res:
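        # Double quotes outside: reusing single quotes inside an f-string is a SyntaxError before Python 3.12.
        plt.plot(res['FPR'], res['TPR'], label=f"{name} (AUC = {res['AUC']:.2f})")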
plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.grid()
plt.show()

# Confusion Matrix for the best model
best_model = trained_models[best_model_name]
y_pred_best = best_model.predict(X_test)
cm = confusion_matrix(y_test, y_pred_best)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['No Heart Disease', 'Heart Disease'], yticklabels=['No Heart Disease', 'Heart Disease'])
plt.title(f'Confusion Matrix for {best_model_name}')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()

## 5. Conclusions and Future Work

### Conclusions

This project developed and evaluated four machine learning models for heart disease prediction. Random Forest is reported as the most effective classifier by F1-score and AUC, meaning it offered the best balance of precision and recall among the models compared. Because the code in this report selects the best model at run time and is demonstrated on randomly generated mock data, that ranking should be confirmed against the metrics table produced on the real dataset. In a medical context, high recall is particularly important because a false negative means failing to identify a patient with heart disease; precision still needs to stay reasonable to avoid unnecessary anxiety and follow-up testing. Overall, the approach illustrates how machine learning can support clinical decision-making for early disease detection.
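
The emphasis on recall suggests that the default 0.5 decision threshold is not the only option. The sketch below shows one way the trade-off could be explored by thresholding predicted probabilities directly. It uses synthetic data and a Logistic Regression model as stand-ins, and the alternative threshold of 0.3 is an arbitrary illustrative value, not a clinical recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 400 rows, 13 features.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=13, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # estimated probability of class 1

recall_by_threshold = {}
for threshold in (0.5, 0.3):
    # Predict "disease" whenever the probability reaches the threshold.
    pred = (proba >= threshold).astype(int)
    recall_by_threshold[threshold] = recall_score(y_te, pred)
    precision = precision_score(y_te, pred, zero_division=0)
    print(f"threshold={threshold}: precision={precision:.2f}, "
          f"recall={recall_by_threshold[threshold]:.2f}")
```

Lowering the threshold can only add positive predictions, so recall never decreases, while precision usually falls. In practice the threshold would be chosen on a validation split, not on the test set.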

### Future Work

To further improve and expand upon this project, the following avenues for future research and application are recommended:

1.  Advanced Modeling: Explore more complex algorithms, such as Gradient Boosting (e.g., XGBoost, LightGBM) or deep learning models (e.g., neural networks), which can capture more intricate patterns in the data.

2.  Feature Engineering: Create new features from existing ones. For example, a BMI feature can be calculated from height and weight (if available) to provide additional context.

3.  Larger and More Diverse Datasets: The model's generalizability can be significantly improved by training on a larger, more diverse dataset that includes a wider range of patient demographics and medical history.

4.  Model Interpretability: Use techniques like SHAP (SHapley Additive exPlanations) or LIME (Local Interpretable Model-agnostic Explanations) to provide insights into how the model arrives at its predictions. Understanding which features most influence a prediction can build trust and facilitate its adoption by healthcare professionals.

5.  Deployment as a Clinical Tool: Integrate the best-performing model into a user-friendly application or a hospital's Electronic Health Record (EHR) system to provide real-time predictive scores for patients. This would require rigorous validation and adherence to medical device regulations.
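
Items 1 and 4 above could be prototyped without adding dependencies. The sketch below uses scikit-learn's `GradientBoostingClassifier` in place of XGBoost or LightGBM, and permutation importance in place of SHAP or LIME. Both are swapped-in substitutes chosen to keep the example self-contained, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 8 features, 3 of them informative.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

gb = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
test_accuracy = gb.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.3f}")

# Permutation importance: shuffle one feature at a time on held-out data
# and measure how much the ROC AUC drops.
importances = permutation_importance(
    gb, X_te, y_te, scoring="roc_auc", n_repeats=10, random_state=0
)
ranking = importances.importances_mean.argsort()[::-1]
for idx in ranking:
    print(f"feature {idx}: "
          f"{importances.importances_mean[idx]:.3f} "
          f"+/- {importances.importances_std[idx]:.3f}")
```

Permutation importance gives a global ranking of features; SHAP or LIME would still be needed for per-patient explanations.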