# Heart Disease Classification

## 1. Problem

Cardiovascular diseases are the leading cause of death globally. Identifying key risk factors and building a system that predicts heart attacks effectively is therefore important. The dataset used here records factors that may affect cardiovascular health. Because correlation does not imply causation, this study is a preliminary exploration intended to generate hypotheses for future research.

<img src="./Heart%20Disease%20Mind%20Map.png" alt="Heart Rate Image" width="800"/>

## 2. Exploratory Data Analysis

In [None]:
# Import Libraries
import pickle
import logging

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, precision_score, recall_score, f1_score
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
from tabulate import tabulate

In [None]:
# Configure logging to a file (overwritten on each run)
logging.basicConfig(filename='/HeartDisease.log',
                    filemode='w',
                    format='%(asctime)s - %(levelname)s - %(message)s',
                    level=logging.INFO)
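
The cells shown here configure logging but never write a message. A minimal sketch of how a pipeline step might be recorded, assuming a hypothetical logger name `heart_disease` and an in-memory stream in place of the log file so the example is self-contained:

```python
import io
import logging

# In-memory stream stands in for the log file (assumption, for a self-contained demo).
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(asctime)s - %(levelname)s - %(message)s"))

# "heart_disease" is a hypothetical logger name, not taken from the notebook.
logger = logging.getLogger("heart_disease")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the demo output out of the root logger

n_rows, n_cols = 303, 14  # dataset shape reported in the schema cell
logger.info("Loaded dataset with %d rows and %d columns", n_rows, n_cols)

log_text = stream.getvalue()
print(log_text)
```

With the notebook's own `basicConfig` call in place, a plain `logging.info(...)` would write to the configured file in the same format.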

In [None]:
# Import data from CSV File
df = pd.read_csv("Heart1.csv")

In [None]:
# Explore Schema
df.info(memory_usage='deep')

There are 303 rows and 14 columns. All columns are numerical (13 int64, 1 float64). The DataFrame occupies about 33.3 KB in memory.
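
The reported size can be sanity-checked by hand: int64 and float64 values each take 8 bytes, so 303 rows × 14 columns need 33,936 bytes for the data, about 33.1 KB before the index is counted. A small sketch on a synthetic frame with the same shape and dtypes (the column names are placeholders, not the real dataset columns):

```python
import numpy as np
import pandas as pd

n_rows = 303
# 13 int64 columns and 1 float64 column, mirroring the dtypes reported by df.info().
data = {f"int_col_{i}": np.zeros(n_rows, dtype="int64") for i in range(13)}
data["float_col"] = np.zeros(n_rows, dtype="float64")
demo = pd.DataFrame(data)

# Bytes used by the column data only (the index adds a small, version-dependent amount).
data_bytes = int(demo.memory_usage(index=False).sum())
print(data_bytes, round(data_bytes / 1024, 1))
```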

In [None]:
# Descriptive Statistics
df.describe()

In [None]:
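# Correlation heatmap
def find_correlations(frame):
    """Plot a heatmap of pairwise Pearson correlations between columns."""
    corr = frame.corr()
    plt.figure(figsize=(12, 10))
    g = sns.heatmap(
        corr,
        vmin=-1.0,        # correlations range from -1 ...
        vmax=1.0,         # ... to +1
        cmap="coolwarm",  # diverging palette separates negative from positive
        annot=True,
        fmt=".1f",
        linewidths=1)
    g.set_title("Correlation Chart")
    plt.show()

find_correlations(df)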

Target is positively correlated with chest pain type (0.4) and maximum heart rate achieved (0.4). There is also a positive correlation between the slope of the peak exercise ST segment and maximum heart rate achieved (0.4). The slope of the peak exercise ST segment is negatively correlated with ST depression induced by exercise relative to rest (-0.6).
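
Reading values off a heatmap is error-prone, so it can help to rank correlations with the target numerically. A sketch on synthetic data; the column names `feature_a`, `feature_b`, and `noise` are illustrative, and the same call pattern should apply to `df`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 300
target = rng.integers(0, 2, size=n)

# Synthetic features: one tracks the target, one opposes it, one is pure noise.
demo = pd.DataFrame({
    "feature_a": target + rng.normal(0, 0.5, size=n),
    "feature_b": -target + rng.normal(0, 0.5, size=n),
    "noise": rng.normal(0, 1, size=n),
    "target": target,
})

# Pearson correlation of every feature with the target, strongest magnitude first.
target_corr = demo.corr()["target"].drop("target")
ranked = target_corr.reindex(target_corr.abs().sort_values(ascending=False).index)
print(ranked.round(2))
```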

In [None]:
# Check for missing values
df.isnull().sum()

There are no missing values in the dataset. 

In [None]:
# Histogram of each column
df.hist(figsize=(15,20))
plt.show()

In [None]:
# Occurrence of CVD across the Age Category
sns.violinplot(data=df, x='target', y='age')
plt.show()

In the group without cardiovascular disease (target=0), the median age is 58, which is also the most frequent age. In the group with cardiovascular disease (target=1), the median age is 52 and the most frequent age is 53, with ages spread more widely.

In [None]:
# Composition of all patients with respect to the Sex category
pd.crosstab(df['target'], df['sex'])

Regarding sex, 93 males and 72 females in this dataset have cardiovascular disease, while 114 males and 24 females do not.
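
Raw counts are hard to compare because the dataset contains far more males than females. Dividing each count by its group total, using only the figures quoted above, gives the share of each sex with the disease:

```python
# Counts quoted in the text above (target=1 means disease present).
counts = {
    "male": {"disease": 93, "no_disease": 114},
    "female": {"disease": 72, "no_disease": 24},
}

rates = {}
for sex, c in counts.items():
    total = c["disease"] + c["no_disease"]
    rates[sex] = c["disease"] / total
    print(f"{sex}: {c['disease']}/{total} = {rates[sex]:.1%}")
```

In this sample, 75.0% of females and about 44.9% of males have the disease. In pandas, the same proportions should be obtainable with `pd.crosstab(df['target'], df['sex'], normalize='columns')`.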

In [None]:
# Relationship between cholesterol level and the target variable
chol_t = df[['chol', 'target']]
print(chol_t.corr())
sns.violinplot(data=df, x='target', y='chol')
plt.show()

In [None]:
# Boxplot of every column to spot outliers
plt.figure(figsize=(15, 8))  # create the figure before plotting into it
df.boxplot()
plt.xticks(rotation=45)
plt.show()

In [None]:
# Class balance of the target variable
sns.countplot(data=df, x='target')
plt.show()

The dependent variable (target) is roughly balanced: the crosstab above gives 138 patients with target=0 and 165 with target=1.

## 3. Feature Selection and Modeling

In [None]:
# Separate Dependent Variable from Independent Variables
y = df["target"]
X = df.drop(columns=["target"])

In [None]:
# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.3, random_state=42)
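
With only 303 rows, a purely random split can shift the class ratio between train and test. One option, not used in the notebook, is the `stratify` argument of `train_test_split`. A sketch on synthetic labels with the class counts implied by the crosstab (138 negatives, 165 positives); the features are random placeholders:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the dataset's size and class counts; features are random.
rng = np.random.default_rng(42)
y_demo = np.array([0] * 138 + [1] * 165)
X_demo = rng.normal(size=(len(y_demo), 13))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# The positive-class share should be nearly identical in both partitions.
print(len(y_tr), len(y_te))
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```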

In [None]:
# Export the column names used for training so new data can be aligned later
with open('pred_columns.pkl', 'wb') as f:
    pickle.dump(X_train.columns.tolist(), f)
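
The saved column list lets new data be put into the exact column order the model was trained on. A sketch of that round trip, using an in-memory pickle and three of the dataset's column names rather than the real file and full schema; the row values are illustrative:

```python
import pickle
import pandas as pd

# Subset of column names for illustration; the notebook stores X_train.columns instead.
train_columns = ["age", "sex", "chol"]
payload = pickle.dumps(train_columns)  # stands in for writing pred_columns.pkl

# New data arriving with shuffled column order and one extra column.
new_data = pd.DataFrame({"chol": [233], "extra": [1], "age": [63], "sex": [1]})

# Restore the list and align: extra columns are dropped, order is fixed.
loaded_columns = pickle.loads(payload)
aligned = new_data.reindex(columns=loaded_columns)
print(aligned)
```

Note that `reindex` fills any column missing from the new data with NaN, so a missing-value check before prediction is advisable.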

In [None]:
# Instantiate Logistic Regression
lr = LogisticRegression(max_iter=1000)

In [None]:
# Fit logistic regression on the training set
lr.fit(X_train, y_train)

In [None]:
# Predict on the test set
y_pred = lr.predict(X_test)

In [None]:
# Random forest model
rfc = RandomForestClassifier(n_estimators=1000)
rfc.fit(X_train,y_train)

In [None]:
# Random forest predictions on the test set
predictions = rfc.predict(X_test)

In [None]:
# Scores for Logistic Regression
lr_ac = accuracy_score(y_test, y_pred)
lr_pre = precision_score(y_test, y_pred)
lr_rec = recall_score(y_test, y_pred)
lr_f = f1_score(y_test, y_pred)
# Note: ROC AUC is computed here from hard class labels, not predicted probabilities
lr_train_roc = roc_auc_score(y_train, lr.predict(X_train))
lr_test_roc = roc_auc_score(y_test, y_pred)

# Scores for Random Forest
rf_ac = accuracy_score(y_test, predictions)
rf_pre = precision_score(y_test, predictions)
rf_rec = recall_score(y_test, predictions)
rf_f = f1_score(y_test, predictions)
# Note: ROC AUC is computed here from hard class labels, not predicted probabilities
rf_train_roc = roc_auc_score(y_train, rfc.predict(X_train))
rf_test_roc = roc_auc_score(y_test, predictions)
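
`roc_auc_score` is designed to take continuous scores. When it is given hard 0/1 predictions, as above, the ROC curve has a single interior point and the result reduces to balanced accuracy. A sketch on synthetic data (generated with `make_classification`, not the heart dataset) comparing the two inputs:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem; sizes are arbitrary.
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# AUC from hard labels versus AUC from predicted probabilities of class 1.
auc_from_labels = roc_auc_score(y_te, model.predict(X_te))
auc_from_proba = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
bal_acc = balanced_accuracy_score(y_te, model.predict(X_te))

print(round(auc_from_labels, 3), round(bal_acc, 3), round(auc_from_proba, 3))
```

Passing `predict_proba(...)[:, 1]` lets the metric evaluate ranking quality across all thresholds, which is usually what a "Train ROC" or "Test ROC" entry is meant to report.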

In [None]:
# Compare evaluation metrics for both models
m_tab = pd.DataFrame({
    "Metric": ["Accuracy Score", "Precision Score", "Recall Score", "F1 Score", "Train ROC", "Test ROC"],
    "Logistic Regression Model": [lr_ac, lr_pre, lr_rec, lr_f, lr_train_roc, lr_test_roc],
    "Random Forest Model": [rf_ac, rf_pre, rf_rec, rf_f, rf_train_roc, rf_test_roc],
})

print(tabulate(m_tab, headers='keys', tablefmt='psql', numalign="left"))
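
The accuracy, precision, recall, and F1 scores in the table all derive from the confusion matrix. A worked example with hypothetical counts (chosen to sum to 91, the test-set size implied by a 30% split of 303 rows; they are not results from these models):

```python
# Hypothetical confusion-matrix counts, not output from the notebook's models.
tp, fp, fn, tn = 40, 10, 10, 31

accuracy = (tp + tn) / (tp + fp + fn + tn)          # share of all predictions that are correct
precision = tp / (tp + fp)                          # share of predicted positives that are real
recall = tp / (tp + fn)                             # share of real positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

For a screening task like this one, recall is often the score to watch, since a false negative means a patient with the disease is missed.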

## 4. Hyperparameter Tuning

In [None]:
# Randomized search over n_estimators and max_depth (randint upper bounds are exclusive)
distributions = {"n_estimators": randint(1, 100),
                 "max_depth": randint(3,10)}
RFC = RandomForestClassifier()
RFC_clf_rs = RandomizedSearchCV(RFC, distributions, n_iter=20, verbose=False)
RFC_clf_rs.fit(X_train, y_train)
print(RFC_clf_rs.best_params_)
print(RFC_clf_rs.best_score_)
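
Note that `scipy.stats.randint(low, high)` excludes `high`, so the search above draws `n_estimators` from 1 to 99 and `max_depth` from 3 to 9. The search is also unseeded, so repeated runs can return different winners. A sketch of a seeded variant on synthetic data; the `random_state`, `cv`, and `scoring` values and the narrower `n_estimators` range are assumptions, not settings from the notebook:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data so the example runs without the CSV file.
X_demo, y_demo = make_classification(n_samples=200, n_features=13, random_state=0)

param_distributions = {
    "n_estimators": randint(10, 50),  # integers 10..49
    "max_depth": randint(3, 10),      # integers 3..9
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=5,            # kept small for a quick demo
    cv=3,
    scoring="accuracy",
    random_state=0,      # makes the sampled candidates repeatable
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```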

In [None]:
# Instantiate the final model with the best hyperparameters found by the search
RFC_final = RandomForestClassifier(**RFC_clf_rs.best_params_)

In [None]:
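# Fit the tuned model on the training set
RFC_final.fit(X_train, y_train)

In [None]:
# Predict on the test set with the tuned model
predictions1 = RFC_final.predict(X_test)

In [None]:
# Scores for the tuned Random Forest
rf_ac1 = accuracy_score(y_test, predictions1)
rf_pre1 = precision_score(y_test, predictions1)
rf_rec1 = recall_score(y_test, predictions1)
rf_f1 = f1_score(y_test, predictions1)
# Note: ROC AUC is computed here from hard class labels, not predicted probabilities
rf_train_roc1 = roc_auc_score(y_train, RFC_final.predict(X_train))
rf_test_roc1 = roc_auc_score(y_test, predictions1)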

In [None]:
# Compare evaluation metrics: logistic regression vs. the tuned random forest
m_tab = pd.DataFrame({
    "Metric": ["Accuracy Score", "Precision Score", "Recall Score", "F1 Score", "Train ROC", "Test ROC"],
    "Logistic Regression Model": [lr_ac, lr_pre, lr_rec, lr_f, lr_train_roc, lr_test_roc],
    "Tuned Random Forest Model": [rf_ac1, rf_pre1, rf_rec1, rf_f1, rf_train_roc1, rf_test_roc1],
})

print(tabulate(m_tab, headers='keys', tablefmt='psql', numalign="left"))

In [None]:
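# Save the tuned model as evaluated above; refitting an unseeded forest here
# would store a different model from the one whose scores were reported
with open('RFC_final.pkl', 'wb') as f:
    pickle.dump(RFC_final, f)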
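
To score new patients later, the pickled model can be loaded back and called with data in the training column order. A sketch of the round trip on a small synthetic model, using an in-memory pickle in place of `RFC_final.pkl`:

```python
import pickle
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic model standing in for RFC_final.
X_demo, y_demo = make_classification(n_samples=100, n_features=13, random_state=0)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)

# Serialize and restore; with a file this would be pickle.dump / pickle.load.
restored = pickle.loads(pickle.dumps(model))

# The restored model should reproduce the original predictions exactly.
same = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same)
```

Pickled estimators are tied to the scikit-learn version that produced them, so the loading environment should use the same version, and only trusted pickle files should be loaded.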