# Machine Learning for Prediction of Heart Disease

| Feature         | Description                                                  | Type
| ---             | ---                                                          | ---                                    
| **age**         | Age                                                          | Real
| **sex**         | Sex                                                          | Binary
| **cp**          | Chest pain type (4 values)                                   | Nominal
| **trestbps**    | Resting blood pressure                                       | Real
| **chol**        | Serum cholesterol (in mg/dl)                                 | Real
| **fbs**         | Fasting blood sugar > 120 mg/dl                              | Binary
| **restecg**     | Resting electrocardiographic results (values 0,1,2)          | Nominal
| **thalach**     | Maximum heart rate achieved                                  | Real
| **exang**       | Exercise induced angina                                      | Binary
| **oldpeak**     | Oldpeak = ST depression induced by exercise relative to rest | Real
| **slope**       | The slope of the peak exercise ST segment                    | Ordered
| **ca**          | Number of major vessels (0-3) colored by fluoroscopy         | Real
| **thal**        | Thal: 3 = normal; 6 = fixed defect; 7 = reversible defect    | Nominal
| **target**      | 1 = no disease; 2 = presence of disease                      | Binary

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

detail = {"age": "Age", "sex": "Sex", "cp": "Chest Pain Type", "trestbps": "Resting Blood Pressure",
          "chol": "Serum Cholesterol", "fbs": "Fasting Blood Sugar", "restecg": "Resting ECG",
          "thalach": "Max Heart Rate", "exang": "Exercise Induced Angina", "oldpeak": "Oldpeak",
          "slope": "Slope", "ca": "Number of major vessels", "thal": "Thal", "target": "Target (0 - no disease, 1 - disease)"}

sns.set_theme(context="paper", font_scale=1.5, style="whitegrid", palette="Set2")

data = pd.read_csv("heart.dat", sep="\\s+", header=None)
data.columns = detail.keys()

numericalFeatures = ["age", "trestbps", "chol", "thalach", "oldpeak", "ca"]
categoricalFeatures = ["sex", "cp", "fbs", "restecg", "exang", "slope", "thal"]

## Data Pre-Processing


In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Check for missing values
print("Number of missing values:", data.isnull().sum().sum())
# Check for duplicates
print("Number of duplicates:", data.duplicated().sum())

X = data.iloc[:, :-1]
# Target label is converted to binary (0 - no disease, 1 - disease)
Y = data.iloc[:, -1] - 1


# Removing extreme OUTLIERS (only 1) (3 * IQR below and above the Q1 and Q3, respectively)
Q1 = X[numericalFeatures].quantile(0.25)
Q3 = X[numericalFeatures].quantile(0.75)
IQR = Q3 - Q1

# Find the outlier row(s)
outlier_mask = ((X[numericalFeatures] < (Q1 - 3 * IQR)) | (X[numericalFeatures] > (Q3 + 3 * IQR))).any(axis=1)
# Remove the outlier row(s) from X
X = X[~outlier_mask].reset_index(drop=True)
Y = Y[~outlier_mask].reset_index(drop=True)

print("Number of outliers removed:", len(data) - len(X), '\n')

X_numerical = X[numericalFeatures]
X_categorical = X[categoricalFeatures]

# Apply SCALING only to numerical variables
# Standardizing
X_stand = X.copy()
X_stand[numericalFeatures] = StandardScaler().fit_transform(X_numerical)
# Normalizing
X_norm = X.copy()
X_norm[numericalFeatures] = MinMaxScaler().fit_transform(X_numerical)

# ONE HOT ENCODING
X_oneHot = pd.get_dummies(X, columns=["cp", "restecg", "slope", "thal"])
X_oneHot_stand = pd.get_dummies(X_stand, columns=["cp", "restecg", "slope", "thal"])
X_oneHot_norm = pd.get_dummies(X_norm, columns=["cp", "restecg", "slope", "thal"])
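
The 3 * IQR rule used in the cell above can be checked by hand on a tiny example. The sketch below uses a made-up `chol` column (the values are hypothetical, not taken from `heart.dat`) and the same mask expression as the notebook.

```python
import pandas as pd

# Hypothetical toy values: six ordinary readings and one extreme one
toy = pd.DataFrame({"chol": [200, 210, 220, 230, 240, 250, 600]})

q1 = toy.quantile(0.25)   # 215.0 with pandas' default linear interpolation
q3 = toy.quantile(0.75)   # 245.0
iqr = q3 - q1             # 30.0

# A row is flagged if any column lies more than 3 * IQR outside [Q1, Q3]
lower, upper = q1 - 3 * iqr, q3 + 3 * iqr
toy_mask = ((toy < lower) | (toy > upper)).any(axis=1)

print("Bounds:", float(lower["chol"]), float(upper["chol"]))
print("Flagged rows:", toy[toy_mask].index.tolist())
```

Only the last row falls outside the bounds, so it is the only one the mask would remove.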


## Feature Correlation

### Heatmap

Although a correlation heatmap is best suited to numeric features, binary categorical features are simple enough that a numeric (0/1) correlation is still meaningful for them.
Features used:
- **numerical** - age, trestbps, chol, thalach, oldpeak, ca
- **categorical** - sex, fbs, exang
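
One way to see why binary features fit into a Pearson correlation matrix: for a 0/1 variable, the Pearson coefficient coincides with the point-biserial correlation, which only compares the two group means. The sketch below checks this numerically on synthetic data; the arrays `binary` and `numeric` are hypothetical and are not columns of the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: a 0/1 feature and a numeric variable shifted by group
binary = rng.integers(0, 2, size=200)
numeric = 2.0 * binary + rng.normal(size=200)

# Pearson correlation, treating the binary feature as a number
pearson_r = np.corrcoef(binary, numeric)[0, 1]

# Point-biserial correlation from the two group means
mean_1 = numeric[binary == 1].mean()
mean_0 = numeric[binary == 0].mean()
p = binary.mean()  # share of ones
point_biserial_r = (mean_1 - mean_0) * np.sqrt(p * (1 - p)) / numeric.std()  # population std

print(pearson_r, point_biserial_r)
```

The two values agree up to floating-point error. The same argument does not carry over to nominal features with more than two levels, which is consistent with leaving them out of the heatmap.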

In [None]:
figsize = (10, 8)
vmin = -0.75
vmax = 0.75

X_heatmap = X[numericalFeatures + ["sex", "fbs", "exang"]].copy()

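# Use Y (filtered and re-indexed together with X) so the target rows stay aligned with X_heatmap
dataCorr = pd.concat([X_heatmap, Y.rename("target")], axis=1).corr()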

upperHalf_mask = np.tril(np.ones_like(dataCorr, dtype=bool))    # remove bottom left corner

plt.figure(figsize=figsize)
plt.title("Feature Heatmap")
sns.heatmap(dataCorr, annot=True, linewidths=2,
            mask=upperHalf_mask, cmap="Spectral_r", vmin=vmin, vmax=vmax
)
# Assumes the plots/heatmap/ directory already exists
plt.savefig("plots/heatmap/heatmap.png")

By looking at the **trestbps** density distribution and boxplot, we can see that it provides little information for the classification, as the distributions for disease and no disease practically overlap.

A milder version of the same phenomenon occurs for **chol**.

The categorical feature **fbs** also appears to have little effect on separating the two classes, as the normalized bar plot shows the same proportions of target = 2 (disease) and target = 1 (no disease) for both **fbs** classes.

It could be relevant to try out the models without these 3 features.
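
The proportions behind a normalized bar plot can also be read directly from a row-normalized contingency table. The following is a minimal sketch on made-up data (the column names mirror the notebook, the values are hypothetical), showing what "same proportions for both classes" looks like numerically.

```python
import pandas as pd

# Hypothetical toy data: target is split 50/50 within each fbs value
toy = pd.DataFrame({
    "fbs":    [0, 0, 0, 0, 1, 1, 1, 1],
    "target": [1, 2, 1, 2, 1, 2, 1, 2],
})

# Each row sums to 1: share of every target class within one fbs value
proportions = pd.crosstab(toy["fbs"], toy["target"], normalize="index")
print(proportions)

# Largest difference between the two rows; 0 means identical class proportions
max_gap = (proportions.loc[0] - proportions.loc[1]).abs().max()
print("Largest gap between fbs classes:", max_gap)
```

A gap close to zero suggests the feature does little to separate the classes on its own, while a large gap points to a feature worth keeping.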

## Principal Component Analysis

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=len(X_numerical.columns))
pca.fit(X_numerical)
X_pca = pca.transform(X_numerical)

# Plot the number of components vs. cumulative explained variance
explained_variance = np.cumsum(pca.explained_variance_ratio_)
plt.figure(figsize=(4, 3))
# x-axis starts at 1 so it matches the number of components kept
plt.plot(range(1, len(explained_variance) + 1), explained_variance)
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
# plt.savefig("plots/pcVsCEV.png")

plt.show()

By plotting the *Number of Components* against the *Cumulative Explained Variance*, we can see that 3 principal components are enough to explain close to 100% of the variance, fewer than the 6 numeric features passed to the PCA.
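
Since the PCA above is fitted on the unscaled numeric features, the number of components needed depends partly on the units of each column. The sketch below illustrates this on synthetic, independent features with very different spreads (all values hypothetical), and picks the smallest number of components reaching a 99% threshold, which is an arbitrary choice made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical independent features with very different scales
toy = np.column_stack([
    rng.normal(0, 50.0, 300),
    rng.normal(0, 20.0, 300),
    rng.normal(0, 1.0, 300),
    rng.normal(0, 0.5, 300),
])

# Cumulative explained variance on raw and on standardized data
raw_cum = np.cumsum(PCA().fit(toy).explained_variance_ratio_)
scaled_cum = np.cumsum(
    PCA().fit(StandardScaler().fit_transform(toy)).explained_variance_ratio_
)

# Smallest number of components whose cumulative variance reaches 99%
threshold = 0.99
n_raw = int(np.searchsorted(raw_cum, threshold) + 1)
n_scaled = int(np.searchsorted(scaled_cum, threshold) + 1)

print("Raw data:   ", n_raw, "components")
print("Scaled data:", n_scaled, "components")
```

On the raw data the two large-scale columns dominate, so few components suffice; after standardization every independent feature carries a similar share. Whether to run the notebook's PCA on `X_stand` instead of `X` is a modeling choice this sketch does not settle.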

In [None]:
from mpl_toolkits.mplot3d import Axes3D

nPCs = 3
pca = PCA(n_components=nPCs)
# Fit on the original numeric features rather than on the already-projected X_pca
pca_result = pca.fit_transform(X_numerical)

explained_variance = pca.explained_variance_ratio_
print(f"Explained Variance: {explained_variance}")
# print(pca.components_)  # feature weight for each pc

# Convert it back to a DataFrame
pca_df = pd.DataFrame(data=pca_result, columns=["PC" + str(i + 1) for i in range(nPCs)])

X_PCAed = pd.concat([pca_df, X.drop(X_numerical.columns, axis=1)], axis=1)

print(X_PCAed.head())
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

ax.scatter(pca_df['PC1'], pca_df['PC2'], pca_df['PC3'])

ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_zlabel('PC3')

## Data Models

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_validate
from sklearn.model_selection import GridSearchCV

### Naive Bayes

In [None]:
from sklearn.naive_bayes import GaussianNB, CategoricalNB, MultinomialNB, BernoulliNB

def naiveBayes_implement(nb_model, X_bayes, stringX):
    """Cross-validate nb_model on X_bayes (5 folds) and print the mean train/test scores.

    Uses the global target Y; stringX is only a label for the printed output.
    """
    perfMetrics = ["recall", "f1", "roc_auc"]
    nb_results = cross_validate(nb_model, X_bayes, Y, cv=5, scoring=perfMetrics, return_train_score=True)
    results_df = pd.DataFrame(nb_results)
    results_df.drop(columns=['fit_time', 'score_time'], inplace=True)  # Exclude fit_time and score_time
    print(f"{stringX} \n", results_df.mean(), "\n\n")


# Drop ca
X_noCa = X_numerical.copy().drop(["ca"], axis=1)
# Drop trestbps and chol
X_noTrestBpsChol = X_numerical.copy().drop(["trestbps", "chol"], axis=1)
# Drop thalach
X_noThalach = X_numerical.copy().drop(["thalach"], axis=1)
# Drop fbs
X_noFbs = X_categorical.copy().drop(["fbs"], axis=1)
# PCAed without categorical
X_PCAed_noCategorical = X_PCAed.copy().drop(X_categorical.columns, axis=1)

naiveBayes_implement(GaussianNB(), X, "X")
naiveBayes_implement(GaussianNB(), X_noCa, "X_noCa")
naiveBayes_implement(GaussianNB(), X_numerical, "X_numerical")
naiveBayes_implement(GaussianNB(), X_noTrestBpsChol, "X_noTrestBpsChol")
naiveBayes_implement(GaussianNB(), X_noThalach, "X_noThalach")
naiveBayes_implement(CategoricalNB(), X_categorical, "X_categorical")
naiveBayes_implement(CategoricalNB(), X_noFbs, "X_noFbs")
naiveBayes_implement(GaussianNB(), X_PCAed, "X_PCAed")
naiveBayes_implement(GaussianNB(), X_PCAed_noCategorical, "X_PCAed_noCategorical")


### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

param_grid = [    
    {
    'C' : [0.01,0.1,1,10,100],
    }
]

clf = GridSearchCV(LogisticRegression(), param_grid = param_grid, cv = 5, verbose=True, n_jobs=-1, scoring=["accuracy", "precision", "recall", "f1", "roc_auc"], return_train_score=True, refit=False)
clf.fit(X_oneHot_stand, Y)
results_gs = pd.DataFrame(clf.cv_results_)
results_gs[['param_C','mean_train_recall','mean_train_f1', 'mean_train_roc_auc', 'mean_test_recall','mean_test_f1', 'mean_test_roc_auc']]
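
The grid search above receives features that were standardized on the full dataset before cross-validation, so every validation fold has already influenced the scaler. One common alternative is to put the scaler inside a pipeline so it is refitted on each training fold. The sketch below shows the idea on synthetic data from `make_classification`; the data, the `max_iter` value and the single `f1` scorer are assumptions made for the example, not the notebook's setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the heart data
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)

# The scaler is fitted only on the training part of each fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# make_pipeline names each step after its lowercased class name
toy_grid = {"logisticregression__C": [0.01, 0.1, 1, 10, 100]}

search = GridSearchCV(pipe, toy_grid, cv=5, scoring="f1")
search.fit(X_toy, y_toy)

best_C = search.best_params_["logisticregression__C"]
print("Best C:", best_C, "mean test F1:", round(search.best_score_, 3))
```

With the notebook's mixed feature types, a `ColumnTransformer` would be needed to restrict the scaler to the numeric columns; that step is left out here to keep the sketch short.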




### K-Nearest Neighbors

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Compute the F1 score for K values between 1 and 30
knn_model = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": list(range(1, 31))}, cv=5, scoring=["f1"], return_train_score=False, refit=False)
knn_model.fit(X_oneHot, Y)
results_knn = pd.DataFrame(knn_model.cv_results_)

plt.figure(figsize=(12, 6))
plt.plot(results_knn['param_n_neighbors'].astype(int), results_knn['mean_test_f1'], color='red', linestyle='dashed', marker='o', markerfacecolor='blue', markersize=10)
plt.title('Mean F1 Score vs. K Value')
plt.xlabel('K Value')
plt.ylabel('Mean F1 Score')
plt.show()

clf = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [2, 6, 8, 11]}, cv=5, scoring=["recall", "f1", "roc_auc"], return_train_score=True, refit=False)
clf.fit(X_oneHot, Y)
results_gs = pd.DataFrame(clf.cv_results_)

# Print the metrics table
print("Metrics Table:")
print(results_gs[['param_n_neighbors', 'mean_test_recall', 'mean_test_f1', 'mean_test_roc_auc']])
print(results_gs[['param_n_neighbors', 'mean_train_recall', 'mean_train_f1', 'mean_train_roc_auc']])

### Decision Trees

In [117]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier


param_grid = {"max_depth": [3, 5, 7], "min_samples_split": [2, 4, 8]}

clf = GridSearchCV(DecisionTreeClassifier(), param_grid=param_grid, cv=5, scoring=["accuracy", "precision", "recall", "f1", "roc_auc"], return_train_score=True, refit=False)
clf.fit(X_oneHot, Y)
results_gs = pd.DataFrame(clf.cv_results_)
results_gs[['param_max_depth','param_min_samples_split','mean_train_recall', 'mean_train_f1','mean_train_roc_auc', 'mean_test_recall', 'mean_test_f1','mean_test_roc_auc']]


|     | param_max_depth | param_min_samples_split | mean_train_recall | mean_train_f1 | mean_train_roc_auc | mean_test_recall | mean_test_f1 | mean_test_roc_auc
| --- | --- | --- | --- | --- | --- | --- | --- | ---
| 0 | 3 | 2 | 0.810417 | 0.837568 | 0.911711 | 0.7 | 0.729519 | 0.807217
| 1 | 3 | 4 | 0.810417 | 0.837568 | 0.911711 | 0.708333 | 0.73262 | 0.811245
| 2 | 3 | 8 | 0.810417 | 0.837568 | 0.911711 | 0.7 | 0.729519 | 0.807217
| 3 | 5 | 2 | 0.902083 | 0.939268 | 0.982127 | 0.716667 | 0.73759 | 0.766159
| 4 | 5 | 4 | 0.897917 | 0.933837 | 0.980791 | 0.683333 | 0.706881 | 0.749397
| 5 | 5 | 8 | 0.885417 | 0.913812 | 0.974278 | 0.7 | 0.716786 | 0.783046
| 6 | 7 | 2 | 0.96875 | 0.984 | 0.998792 | 0.716667 | 0.717196 | 0.752203
| 7 | 7 | 4 | 0.941667 | 0.964641 | 0.996904 | 0.683333 | 0.707278 | 0.756317
| 8 | 7 | 8 | 0.908333 | 0.927459 | 0.988256 | 0.7 | 0.713725 | 0.790704
