# Predicting heart disease in patients using machine learning

This project uses various Python-based machine learning and data science libraries to build a model capable of predicting whether a patient has heart disease based on their medical attributes.

We're going to take the following approach:
1. Problem definition
2. Data
3. Evaluation
4. Features
5. Modelling
6. Experimentation

## 1. Problem Definition
> Given clinical parameters about a patient, can we predict whether or not they have heart disease?


## 2. Data
The original data came from the Cleveland database in the UCI Machine Learning Repository.
https://archive.ics.uci.edu/ml/datasets/heart+Disease

There is also a version available on Kaggle.
https://www.kaggle.com/datasets/cherngs/heart-disease-cleveland-uci


## 3. Evaluation
> If we can reach 95% accuracy at predicting whether or not a patient has heart disease, we will pursue the project.
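
The 95% target can be made concrete. The following is a minimal sketch, assuming accuracy means the fraction of predictions that match the true labels. The function names and the example labels are hypothetical and do not come from the dataset.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def meets_target(y_true, y_pred, target=0.95):
    """True if accuracy reaches the project's threshold (95% by default)."""
    return accuracy(y_true, y_pred) >= target


# Made-up labels for illustration only: 9 of the 10 predictions are correct
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]

print(accuracy(y_true, y_pred))
print(meets_target(y_true, y_pred))
```

In the notebook itself, scikit-learn's `score` method on a fitted classifier would typically play the role of `accuracy` here.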


## 4. Features
This section describes each of the features in the dataset.

**Create Data Dictionary**

* age: age in years
* sex: sex (1 = male; 0 = female)
* cp: chest pain type
    * Value 0: typical angina
    * Value 1: atypical angina
    * Value 2: non-anginal pain
    * Value 3: asymptomatic
* trestbps: resting blood pressure (in mm Hg on admission to the hospital)
* chol: serum cholesterol in mg/dl
* fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
* restecg: resting electrocardiographic results
    * Value 0: normal
    * Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    * Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
* thalach: maximum heart rate achieved
* exang: exercise induced angina (1 = yes; 0 = no)
* oldpeak: ST depression induced by exercise relative to rest
* slope: the slope of the peak exercise ST segment
    * Value 0: upsloping
    * Value 1: flat
    * Value 2: downsloping
* ca: number of major vessels (0-3) colored by fluoroscopy
* thal: 0 = normal; 1 = fixed defect; 2 = reversible defect
* target (the label): 0 = no disease; 1 = disease
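
The coded categorical values are easier to read during exploration when mapped back to their labels. This is a small sketch using only the codes listed in the data dictionary; `CODE_LABELS` and `decode` are hypothetical helper names, not part of the notebook.

```python
# Code-to-label lookups copied from the data dictionary
CODE_LABELS = {
    "sex": {0: "female", 1: "male"},
    "cp": {
        0: "typical angina",
        1: "atypical angina",
        2: "non-anginal pain",
        3: "asymptomatic",
    },
    "restecg": {
        0: "normal",
        1: "ST-T wave abnormality",
        2: "probable or definite left ventricular hypertrophy",
    },
    "slope": {0: "upsloping", 1: "flat", 2: "downsloping"},
    "target": {0: "no disease", 1: "disease"},
}


def decode(feature, value):
    """Return the label for a coded value, or the value itself if no label is known."""
    return CODE_LABELS.get(feature, {}).get(value, value)


print(decode("cp", 3))    # categorical feature: looked up in the dictionary
print(decode("age", 63))  # numeric feature: returned unchanged
```

With pandas, the same lookup could be applied to a whole column, for example `df["cp"].map(CODE_LABELS["cp"])`.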

In [2]:
# Preparing the tools
# Importing

# Regular EDA and plotting libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Scikit learn modelling tools
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Evaluation tools
from sklearn.model_selection import train_test_split, cross_val_score, RandomizedSearchCV, GridSearchCV
# RocCurveDisplay replaces plot_roc_curve, which was removed in scikit-learn 1.2
from sklearn.metrics import confusion_matrix, classification_report, precision_score, recall_score, f1_score, RocCurveDisplay

In [3]:
# Load Data
df = pd.read_csv("../src/data/heart-disease.csv")
df

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0
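
The `Unnamed: 0` column in the output above suggests the CSV was probably saved together with its row index. Unless it is removed, it is carried along as if it were a feature. A minimal sketch of one way to avoid that, assuming the first CSV column really is just the saved index; the in-memory CSV below re-types the first two rows shown above so the example runs without the file.

```python
import io

import pandas as pd

# First two rows of the table above, with an unnamed first column as in the file
csv_text = """,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
"""

# Default read: the unnamed column becomes a regular column called "Unnamed: 0"
raw = pd.read_csv(io.StringIO(csv_text))

# index_col=0 tells pandas to use the first column as the row index instead
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns)[:2])
print(list(clean.columns)[:2])
```

An alternative with the same effect is to drop the column after loading with `df.drop(columns="Unnamed: 0")`.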


## 5. Modelling

In [4]:
# split data to X and y
X = df.drop("target", axis=1)
y = df["target"]

In [8]:
# split into train and test set
np.random.seed(42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.28)
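
Seeding NumPy's global generator works, but the split then depends on whatever other random calls ran before it. As an alternative sketch, `train_test_split` accepts `random_state` for a self-contained reproducible split and `stratify` to keep the class proportions equal in both sets. The synthetic arrays and the 25% test size below are illustrative assumptions, not the notebook's data or settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 samples, 2 features, balanced binary labels
X_demo = np.arange(400).reshape(200, 2)
y_demo = np.array([0] * 100 + [1] * 100)

# random_state fixes the shuffle; stratify keeps the 50/50 class ratio in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# Repeating the call with the same random_state reproduces the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, X_te2))
```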


In [9]:
X_train

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
167,62,0,0,140,268,0,0,160,0,3.6,0,2,2
250,51,1,0,140,298,0,1,122,1,4.2,1,3,3
19,69,0,3,140,239,0,1,151,0,1.8,2,2,2
143,67,0,0,106,223,0,1,142,0,0.3,2,2,2
79,58,1,2,105,240,0,0,154,1,0.6,1,0,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...
188,50,1,2,140,233,0,1,163,0,0.6,1,1,3
71,51,1,2,94,227,0,1,154,1,0.0,2,1,3
106,69,1,3,160,234,1,0,131,0,0.1,1,1,2
270,46,1,0,120,249,0,0,144,0,0.8,2,0,3


In [11]:
y_train, len(y_train)

(167    0
 250    0
 19     1
 143    1
 79     1
       ..
 188    0
 71     1
 106    1
 270    0
 102    1
 Name: target, Length: 218, dtype: int64,
 218)

Now it's crucial to choose the right model, so we are going to try three different models:
1. Logistic Regression
2. K-Nearest Neighbors Classifier
3. Random Forest Classifier

In [44]:
# all the models
models = {"LogisticRegression": LogisticRegression(), 
          "KNeighborsClassifier": KNeighborsClassifier(),
          "RandomForestClassifier": RandomForestClassifier()}


# Create a function to tune, fit and score models
def fit_and_tune_and_evaluate(models, X_train, X_test, y_train, y_test):
    '''Tunes each sklearn model, then plots its 5-fold cross-validated precision.'''
    # Set random seed
    np.random.seed(42)

    # Reassemble the full dataset from the arguments for cross-validation
    X_all = pd.concat([X_train, X_test])
    y_all = pd.concat([y_train, y_test])

    # Make a dictionary to store model scores (kept numeric so they can be plotted)
    model_scores = {}

    for name, model in models.items():
        # hyperparameter tuning
        if name == "KNeighborsClassifier":
            test_scores = {}
            for i in range(1, 21):
                model.set_params(n_neighbors=i)
                # fit data to model
                model.fit(X_train, y_train)
                test_scores[i] = model.score(X_test, y_test)
            # Keep the n_neighbors value with the highest test accuracy
            best_k = max(test_scores, key=test_scores.get)
            print(f"{name}: best n_neighbors={best_k} ({test_scores[best_k]*100:.2f}% test accuracy)")
            best_model = model.set_params(n_neighbors=best_k)

        elif name == "LogisticRegression":
            # Different hyperparameters for our LogisticRegression model
            log_reg_grid = {"C": np.logspace(-4, 4, 30),
                            "solver": ["liblinear"]}

            # Setup and fit grid hyperparameter search for LogisticRegression
            gs_log_reg = GridSearchCV(model, param_grid=log_reg_grid, cv=5, verbose=True)
            gs_log_reg.fit(X_train, y_train)
            best_model = gs_log_reg.best_estimator_

        else:
            # Create a hyperparameter grid for RandomForestClassifier
            rf_grid = {"n_estimators": np.arange(10, 1000, 50),
                       "max_depth": [None, 3, 5, 10],
                       "min_samples_split": np.arange(2, 20, 2),
                       "min_samples_leaf": np.arange(1, 20, 2)}

            # Setup and fit random hyperparameter search for RandomForestClassifier.
            # An exhaustive search would cover 7,200 combinations; n_iter=20 is an arbitrary budget.
            rs_rf = RandomizedSearchCV(model, param_distributions=rf_grid,
                                       cv=5, n_iter=20, verbose=True)
            rs_rf.fit(X_train, y_train)
            best_model = rs_rf.best_estimator_

        # Score every tuned model the same way so the bars are comparable
        cv_prec = np.mean(cross_val_score(best_model, X_all, y_all, cv=5, scoring="precision")) * 100
        model_scores[name] = cv_prec

    model_compare = pd.DataFrame(model_scores, index=["precision"])

    return model_compare.T.plot.bar()
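
The comparison above is reported as precision, which measures how many of the patients predicted to have heart disease actually have it: true positives divided by all predicted positives. A minimal pure-Python sketch of that definition follows; the `precision` helper and the example labels are hypothetical, and returning 0.0 when nothing is predicted positive is a convention chosen here.

```python
def precision(y_true, y_pred, positive=1):
    """Precision = TP / (TP + FP) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    # True positives: predicted positive and actually positive
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    # False positives: predicted positive but actually negative
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    if tp + fp == 0:
        return 0.0  # no positive predictions at all
    return tp / (tp + fp)


# Made-up labels: three positive predictions, two of them correct
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0]

print(precision(y_true, y_pred))
```

Precision ignores missed cases (false negatives), so it is often read alongside recall when the cost of missing a patient with heart disease matters.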