## Predictive Modeling of Heart Disease

### Acknowledgement of Dataset
**Heart Disease UCI:** [https://archive.ics.uci.edu/ml/datasets/Heart+Disease](https://archive.ics.uci.edu/ml/datasets/Heart+Disease)

**Database Source:**

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of
Information and Computer Science. 

**Database Creators:**
- Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
- University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
- University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
- V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.

### Features

| Feature                | Description                                                | Units        |
|------------------------|------------------------------------------------------------|--------------|
| age                    | Age                                                        | Years        |
| sex                    | Sex                                                        | -            |
| cp                     | Chest pain type (4 values)                                 | -            |
| trestbps               | Resting blood pressure (in mm Hg on admission to the hospital) | mm Hg        |
| chol                   | Serum cholesterol                                          | mg/dl        |
| fbs                    | Fasting blood sugar > 120 mg/dl (1 = true; 0 = false)      | -            |
| restecg                | Resting electrocardiographic results (values 0 = normal, 1, 2) | -            |
| thalach                | Maximum heart rate achieved                                | bpm          |
| exang                  | Exercise induced angina (1 = yes; 0 = no)                  | -            |
| oldpeak                | ST depression induced by exercise relative to rest         | -            |
| slope                  | The slope of the peak exercise ST segment (values 1 = upsloping, 2 = flat, 3 = downsloping) | -            |
| ca                     | Number of major vessels (0-3) colored by fluoroscopy       | -            |
| thal                   | 3 = normal; 6 = fixed defect; 7 = reversible defect        | -            |



In [None]:
%pip install seaborn

In [None]:
'''
Import pandas, NumPy, Matplotlib and Seaborn.
Import the following from sklearn:
  - svm;
  - GaussianNB from naive_bayes;
  - confusion_matrix, classification_report, ConfusionMatrixDisplay from metrics;
  - LogisticRegression from linear_model.
'''

%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, classification_report, ConfusionMatrixDisplay
from sklearn import svm
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

print('Libraries have been imported.')

In [None]:
# Read in the dataset heart.csv using pandas

heart = pd.read_csv('heart.csv')

In [None]:
# Read first 5 rows of the dataset

heart.head()

In [None]:
# Rename the columns for better readability

columns_names = {'cp': 'chest_pain_type',
                 'trestbps': 'resting_blood_pressure',
                 'chol': 'serum_cholesterol',
                 'fbs': 'fasting_blood_sugar',
                 'thalach': 'max_heart_rate',  # thalach (not thal) is the maximum heart rate
                 'exang': 'exercise_ang'}

# Pass columns_names to the pandas rename() method

heart.rename(columns=columns_names, inplace=True)

In [None]:
# Display the updated datatypes for all the features

heart.info()

In [None]:
# Check for NaN values

heart.isna().any()

In [None]:
# Check the number of missing values per feature

heart.isna().sum()

## Visualization
Now that the columns have been renamed and the dataset has been checked for missing values, let's visualize it.

In [None]:
# Plot the distribution of the age feature using Seaborn's histplot() method,
# with a fixed number of bins and a kernel density estimate (kde) overlaid

sns.set_style(style='whitegrid')
sns.histplot(heart['age'], color='red', bins=25, kde=True)
plt.show()

The plot shows that a large share of the observations in this dataset are for people aged between 50 and 60.
Before implementing any models, we define the feature matrix `X` and the target `y`, and then split the data into training and test sets.

In [None]:
# Define variables: X is everything but target; y is target.

X = heart.drop('target', axis=1)
y = heart['target']

In [None]:
# Display the first few X entries and compare to the cleaned dataset

X.head()

In [None]:
# Display the first few y entries and compare to the cleaned dataset

y.head()

In [None]:
# Split data to training and test sets with split 70-30
# Use random_state = 42

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

## Building Models
Run Logistic Regression, SVM, and Naive Bayes classifiers and obtain the following for every model:
1. Confusion matrix
2. Accuracy score
3. Classification report
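
The three outputs listed above can be collected by a single helper so that every classifier is evaluated in the same way. The sketch below is illustrative only: `evaluate_model` is a hypothetical name that is not used elsewhere in this notebook, and the data is generated with `make_classification` as a stand-in for `heart.csv` (the sample and feature counts are arbitrary assumptions).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split


def evaluate_model(model, X_train, X_test, y_train, y_test):
    """Fit `model` and return (confusion matrix, accuracy, report text).

    Hypothetical helper for illustration; not part of the original notebook.
    """
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
    acc = accuracy_score(y_test, y_pred)
    report = classification_report(
        y_test, y_pred, labels=[0, 1],
        target_names=['no disease', 'heart disease'],
        zero_division=0,
    )
    return cm, acc, report


# Synthetic stand-in data (assumption): binary target, 13 numeric features.
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

cm_demo, acc_demo, report_demo = evaluate_model(
    LogisticRegression(solver='liblinear', max_iter=1000), Xtr, Xte, ytr, yte
)
print(cm_demo)
print(f'Accuracy: {acc_demo:.3f}')
print(report_demo)
```

The same call should work with `svm.SVC(kernel='linear')` or `GaussianNB()` in place of the logistic regression model, since all three expose `fit` and `predict`.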

### Logistic Regression

In [None]:
# Run Logistic Regression model

logreg = LogisticRegression(solver='liblinear', max_iter=1000)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

y_pred, y_pred.shape, y_test.shape, X_test.shape

In [None]:
# Obtain and plot the confusion matrix

cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['no disease', 'heart disease'])
disp.plot(cmap=plt.cm.Reds)
disp.ax_.set_title('Confusion Matrix')
plt.show()

The confusion matrix shows 32 true negatives (correctly predicted "no disease") and 42 true positives (correctly predicted "heart disease").

In [None]:
# Compute the accuracy score
logreg_s = logreg.score(X_test, y_test)
print('Accuracy:', logreg_s*100, '%')

# Obtain the classification report
print(classification_report(y_test, y_pred, labels=[0,1], target_names=['no disease', 'heart disease']))

# Precision, recall and F1-score per class are listed in the classification report above

**Logistic Regression Results:**
- Accuracy score is approximately 0.813, meaning that 81.3% of the test observations were correctly classified (as "heart disease" or "no disease").
- Precision for "no disease" is 0.80, meaning that 80% of the instances predicted as "no disease" actually had no disease.
- Precision for "heart disease" is 0.82, meaning that 82% of the instances predicted as "heart disease" actually had heart disease.

The model therefore performs reasonably well, but its metrics need to be compared with those of the other two models to decide which of the
three performs best on the given dataset.
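
To make the link between the confusion matrix and these scores explicit, the metrics can be recomputed by hand. In the sketch below, the diagonal counts (32 and 42) are the ones reported for the logistic regression model; the off-diagonal counts (9 and 8) are an assumption, chosen because they reproduce the reported accuracy and precision values.

```python
import numpy as np

# Rows are true classes, columns are predicted classes (scikit-learn convention).
# Diagonal counts come from the text; off-diagonal counts are assumed.
cm_example = np.array([[32, 9],    # true "no disease":    32 correct, 9 misclassified
                       [8, 42]])   # true "heart disease":  8 misclassified, 42 correct

# Accuracy: correct predictions (diagonal) over all predictions.
accuracy = np.trace(cm_example) / cm_example.sum()
# Precision: correct predictions of a class over all predictions of that class (column sums).
precision = np.diag(cm_example) / cm_example.sum(axis=0)
# Recall: correct predictions of a class over all true members of that class (row sums).
recall = np.diag(cm_example) / cm_example.sum(axis=1)

print(f'Accuracy:  {accuracy:.3f}')
print('Precision:', precision.round(2))
print('Recall:   ', recall.round(2))
```

Under these assumed counts, accuracy is 74/91, precision is 32/40 and 42/51, and recall is 32/41 and 42/50 for "no disease" and "heart disease" respectively.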

### SVM Classifier

In [None]:
# Run SVM model and obtain the predictions
# create SVM classifier
svm_model = svm.SVC(kernel='linear')

# train the model using the training sets
svm_model.fit(X_train, y_train)

# predict the response for test dataset
y_pred = svm_model.predict(X_test)
y_pred, y_pred.shape, y_test.shape, X_test.shape

In [None]:
# Obtain and plot the confusion matrix

cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['no disease', 'heart disease'])
disp.plot(cmap=plt.cm.Reds)
disp.ax_.set_title('Confusion Matrix')
plt.show()

In [None]:
# Obtain the accuracy score
svm_model_s = svm_model.score(X_test, y_test)
print('Accuracy:', svm_model_s*100, '%')

# Obtain the classification report
print(classification_report(y_test, y_pred, labels=[0,1], target_names=['no disease', 'heart disease']))

# Precision, recall and F1-score per class are listed in the classification report above

**SVM Classifier Results:**
- Accuracy score is approximately 0.813, meaning that 81.3% of the test observations were correctly classified (as "heart disease" or "no disease").
- Precision for "no disease" is 0.80, meaning that 80% of the instances predicted as "no disease" actually had no disease.
- Precision for "heart disease" is 0.82, meaning that 82% of the instances predicted as "heart disease" actually had heart disease.

These values match those of the Logistic Regression model to the number of digits reported, so the two models perform equally well on this test set.

### Naive Bayes Classifier

In [None]:
# Run the Naive Bayes algorithm
# create a Gaussian Classifier
gnb_model = GaussianNB()

# train the model using the training sets
gnb_model.fit(X_train, y_train)

# predict the response for test dataset
y_pred = gnb_model.predict(X_test)
y_pred, y_pred.shape, y_test.shape, X_test.shape

In [None]:
# Obtain and plot the confusion matrix

cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['no disease', 'heart disease'])
disp.plot(cmap=plt.cm.Reds)
disp.ax_.set_title('Confusion Matrix')
plt.show()

In [None]:
# Obtain the accuracy score
gnb_model_s = gnb_model.score(X_test, y_test)
print('Accuracy:', gnb_model_s*100, '%')

# Obtain the classification report
print(classification_report(y_test, y_pred, labels=[0,1], target_names=['no disease', 'heart disease']))

# Precision, recall and F1-score per class are listed in the classification report above

**Naive Bayes Classifier Results:**
- Accuracy score is approximately 0.835, meaning that 83.5% of the test observations were correctly classified (as "heart disease" or "no disease").
- Precision for "no disease" is 0.78, meaning that 78% of the instances predicted as "no disease" actually had no disease.
- Precision for "heart disease" is 0.89, meaning that 89% of the instances predicted as "heart disease" actually had heart disease.

The accuracy of the Naive Bayes classifier is slightly higher than that of the Logistic Regression and SVM classifiers (both at 0.813).

## Classifier Performance Comparison

| Metric             | SVM     | Logistic Regression | Naive Bayes |
|--------------------|---------|---------------------|-------------|
| **Accuracy**       | 0.813   | 0.813               | 0.835       |
| **Precision (No Disease)** | 0.80    | 0.80                | 0.78        |
| **Precision (Heart Disease)** | 0.82    | 0.82                | 0.89        |
| **Recall (No Disease)**    | 0.78    | 0.78                | 0.88        |
| **Recall (Heart Disease)** | 0.84    | 0.84                | 0.80        |
| **F1-Score (No Disease)**  | 0.79    | 0.79                | 0.83        |
| **F1-Score (Heart Disease)** | 0.83    | 0.83                | 0.84        |

### Summary of Findings:
1. **Accuracy**:
   - Both the SVM and Logistic Regression classifiers achieved an accuracy of approximately 81.3%.
   - The Naive Bayes classifier outperformed the other two with an accuracy of approximately 83.5%.


2. **Precision**:
   - For predicting "no disease", SVM and Logistic Regression have higher precision (0.80) compared to Naive Bayes (0.78).
   - For predicting "heart disease", Naive Bayes has the highest precision (0.89), well above SVM and Logistic Regression (both 0.82).


3. **Recall**:
   - For predicting "no disease", Naive Bayes achieves the highest recall (0.88), meaning it correctly identifies 88% of the patients without heart disease.
   - For predicting "heart disease", SVM and Logistic Regression have higher recall (0.84) compared to Naive Bayes (0.80).


4. **F1-Score**:
   - For predicting "no disease", Naive Bayes has the highest F1-score (0.83), indicating a good balance between precision and recall.
   - For predicting "heart disease", the F1-scores are very close, with Naive Bayes slightly leading (0.84) over SVM and Logistic Regression (0.83).
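
The F1-scores in the comparison table follow from the precision and recall rows, since F1 is the harmonic mean of the two. The check below uses only the rounded values from the table, so results could differ from scikit-learn's in the last digit; the helper name `f1_from` is hypothetical.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the comparison table.
table = {
    ('SVM / Logistic Regression', 'no disease'): (0.80, 0.78),
    ('SVM / Logistic Regression', 'heart disease'): (0.82, 0.84),
    ('Naive Bayes', 'no disease'): (0.78, 0.88),
    ('Naive Bayes', 'heart disease'): (0.89, 0.80),
}

f1_scores = {key: round(f1_from(p, r), 2) for key, (p, r) in table.items()}
for (model, label), f1 in f1_scores.items():
    print(f'{model:<26} {label:<14} F1 = {f1:.2f}')
```

The recomputed values agree with the F1 rows of the table, which suggests the table entries are internally consistent.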
  
### Next Steps:
- Experimenting with more advanced machine learning models such as Random Forests, Gradient Boosting Machines (GBM), and XGBoost can provide insights into their performance compared to the current models.
- Deep learning approaches using neural networks, specifically feedforward neural networks or more sophisticated architectures like convolutional neural networks (CNNs) and recurrent neural networks (RNNs), can be explored for possibly better accuracy and predictive power.
- Conducting hyperparameter tuning using techniques such as grid search or randomized search can optimize model performance.
- Implementing cross-validation will give a more reliable estimate of how well the results generalize to new, unseen data than a single train/test split.
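
As a rough sketch of the cross-validation and grid search steps, the example below uses scikit-learn's `cross_val_score` and `GridSearchCV` with the same logistic regression settings as earlier in the notebook. The data is synthetic (`make_classification`), and the fold count and the grid of `C` values are illustrative assumptions, not tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in for the heart-disease features and target (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=42)

# 5-fold cross-validation: five accuracy estimates instead of a single one.
cv_model = LogisticRegression(solver='liblinear', max_iter=1000)
cv_scores = cross_val_score(cv_model, X_demo, y_demo, cv=5, scoring='accuracy')
print(f'CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}')

# Grid search over the regularization strength C (grid values are illustrative).
param_grid = {'C': [0.01, 0.1, 1, 10]}
search = GridSearchCV(cv_model, param_grid, cv=5, scoring='accuracy')
search.fit(X_demo, y_demo)
print('Best C:', search.best_params_['C'])
print(f'Best CV accuracy: {search.best_score_:.3f}')
```

With the real data, `X_demo` and `y_demo` would be replaced by `X` and `y`; running the search on the training split only would keep the test set untouched for the final comparison.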

### Conclusion:
Based on the comparison, the Naive Bayes classifier performs best overall on this test set, with the highest accuracy (0.835) and the highest precision for predicting heart disease (0.89). SVM and Logistic Regression also perform well and consistently, but they surpass Naive Bayes only in precision for "no disease" (0.80 vs. 0.78) and recall for "heart disease" (0.84 vs. 0.80). Therefore, for the dataset used, Naive Bayes is recommended as the best-performing classifier.
