# Predicting heart disease using machine learning

This notebook uses Python-based machine learning and data science libraries to build a machine learning model capable of predicting whether or not someone has heart disease based on their medical report attributes.

In [2]:
approach = {
    1: 'Problem definition',
    2: 'Data',
    3: 'Evaluation',
    4: 'Features',
    5: 'Modelling',
    6: 'Experimentation'
}

## 1. Problem definition
In a statement:
> Given the clinical report parameters of a patient, can we predict whether or not they have heart disease?

## 2. Data 
The original data is the Cleveland dataset from the UCI Machine Learning Repository. https://archive.ics.uci.edu/dataset/45/heart+disease

There is also a version of it available on Kaggle. https://www.kaggle.com/datasets/redwankarimsony/heart-disease-data

## 3. Evaluation

> If we can reach 95% accuracy at predicting whether or not a patient has heart disease during the proof of concept, we'll pursue the project.
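
One way to make the evaluation goal above concrete is to treat the 95% figure as an accuracy threshold (an assumption about the intended metric) and check a model's predictions against it. The sketch below uses plain Python; the `accuracy` helper, the `TARGET_ACCURACY` name, and the label lists are made up for illustration and are not results from this dataset.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Assumed reading of the proof-of-concept goal: 95% accuracy.
TARGET_ACCURACY = 0.95

# Made-up labels for illustration only (1 = heart disease, 0 = no heart disease).
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]

score = accuracy(y_true, y_pred)
meets_goal = score >= TARGET_ACCURACY
print(f"accuracy = {score:.2f}, meets 95% goal: {meets_goal}")
```

In practice the same comparison would be made on predictions for a held-out test set rather than on hand-written lists.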

## 4. Features

This is where you'll find information about each of the features in your data.

**Create data dictionary**

1. age - age in years

2. sex - (1 = male; 0 = female)

3. cp - chest pain type
   * 0: Typical angina: chest pain related to decreased blood supply to the heart
   * 1: Atypical angina: chest pain not related to the heart
   * 2: Non-anginal pain: typically esophageal spasms (non heart related)
   * 3: Asymptomatic: chest pain not showing signs of disease

4. trestbps - resting blood pressure (in mm Hg on admission to the hospital); anything above 130-140 is typically cause for concern

5. chol - serum cholesterol in mg/dl
   * serum = LDL + HDL + 0.2 * triglycerides
   * above 200 is cause for concern

6. fbs - (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
   * fbs above 126 mg/dL signals diabetes

7. restecg - resting electrocardiographic results
   * 0: Nothing to note
   * 1: ST-T Wave abnormality
       * can range from mild symptoms to severe problems
       * signals non-normal heart beat
   * 2: Possible or definite left ventricular hypertrophy
       * Enlarged heart's main pumping chamber

8. thalach - maximum heart rate achieved

9. exang - exercise induced angina (1 = yes; 0 = no)

10. oldpeak - ST depression induced by exercise relative to rest; looks at the stress of the heart during exercise (an unhealthy heart will stress more)
  
11. slope - the slope of the peak exercise ST segment
   * 0: Upsloping: better heart rate with ST segment
   * 1: Flatsloping: minimal change (typical healthy heart)
   * 2: Downsloping: signs of unhealthy heart
     
12. ca - number of major vessels (0-3) colored by fluoroscopy
   * colored vessel means the doctor can see the blood passing through
   * the more blood movement the better (no clots)
     
13. thal - thallium stress result
    * 1, 3 = normal
    * 6 = fixed defect
    * 7 = reversible defect

* target - whether the patient has heart disease or not (the value to predict)
    * 0 = no
    * 1 = yes
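
The data dictionary above can also be kept in code so it stays next to the analysis. The following is a minimal sketch, not the notebook's own implementation: the `data_dictionary`, `serum_cholesterol`, and `fbs_flag` names are hypothetical, the column name `oldpeak` is assumed for feature 10, and the lab values are invented for illustration.

```python
# Column descriptions copied from the data dictionary above.
data_dictionary = {
    "age": "age in years",
    "sex": "1 = male; 0 = female",
    "cp": "chest pain type (0-3)",
    "trestbps": "resting blood pressure in mm Hg on admission to the hospital",
    "chol": "serum cholesterol in mg/dl",
    "fbs": "fasting blood sugar > 120 mg/dl (1 = true; 0 = false)",
    "restecg": "resting electrocardiographic results (0-2)",
    "thalach": "maximum heart rate achieved",
    "exang": "exercise induced angina (1 = yes; 0 = no)",
    "oldpeak": "ST depression induced by exercise relative to rest",
    "slope": "slope of the peak exercise ST segment (0-2)",
    "ca": "number of major vessels (0-3) colored by fluoroscopy",
    "thal": "thallium stress result",
    "target": "heart disease (1 = yes; 0 = no)",
}


def serum_cholesterol(ldl, hdl, triglycerides):
    """Serum cholesterol in mg/dl: LDL + HDL + 0.2 * triglycerides."""
    return ldl + hdl + 0.2 * triglycerides


def fbs_flag(fasting_blood_sugar):
    """Encode fasting blood sugar like the fbs column: 1 if > 120 mg/dl, else 0."""
    return 1 if fasting_blood_sugar > 120 else 0


# Hypothetical lab values, for illustration only.
chol = serum_cholesterol(ldl=130, hdl=50, triglycerides=150)
print(f"chol = {chol:.1f} mg/dl, above 200: {chol > 200}")
print(f"fbs flag for 126 mg/dl: {fbs_flag(126)}")
```

Looking up `data_dictionary["thalach"]` while exploring the data saves scrolling back to this section.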

## Preparing the tools

We're going to use pandas, Matplotlib and NumPy for data analysis and manipulation.

Import all the tools we need

In [7]:
# Regular EDA (exploratory data analysis) and plotting libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Make plots appear inside the notebook
%matplotlib inline

# Models from Scikit-learn
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Model evaluation
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import RocCurveDisplay