# Heart Disease Classification - Cleveland

Dataset: the processed Cleveland subset of the UCI Heart Disease dataset, available from the UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets/Heart+Disease


### Features in Dataset
- There are 13 features in this dataset
1. `age`: Age of the patient in years
2. `sex`: 1 = Male, 0 = Female
3. `cp`: chest pain type, 1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic
4. `trestbps`: resting blood pressure in mmHg on admission to the hospital
5. `chol`: serum cholesterol in mg/dl
6. `fbs`: if fasting blood sugar > 120 mg/dl, 1 = true, 0 = false
7. `restecg`: resting electrocardiographic results, 0 = normal, 1 = having ST-T wave abnormality, 2 = probable or definite left ventricular hypertrophy by Estes' criteria
8. `thalach`: maximum heart rate achieved
9. `exang`: exercise-induced angina, 1 = yes, 0 = no
10. `oldpeak`: ST depression induced by exercise relative to rest
11. `slope`: the slope of the peak exercise ST segment, 1 = upsloping, 2 = flat, 3 = downsloping
12. `ca`: number of major vessels (0-3) colored by fluoroscopy
13. `thal`: 3 = normal, 6 = fixed defect, 7 = reversible defect


- Categorical features: `sex`, `cp`, `restecg`, `slope`, `thal`
- Binary features: `exang`, `fbs`
- Numerical features: `age`, `trestbps`, `chol`, `thalach`, `oldpeak`, `ca`


### Class Label
- The target class label is `num`
- From Jaakko's notes: The "goal" field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4 (depending on the certainty of presence, i.e., the higher the value the higher the certainty). 

### Data preprocessing
- There are 303 records in this dataset
- Visual inspection of the dataframe shows 6 missing values, marked as '?': 4 in the `ca` column and 2 in the `thal` column
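
As an alternative to replacing '?' after loading, pandas can treat the marker as missing while parsing, through the `na_values` parameter of `pd.read_csv`. A minimal sketch on a few made-up rows follows; the inline sample and the names `sample` and `df_sample` are illustrative assumptions, not the real file.

```python
import io

import pandas as pd

# Hypothetical three-row sample; only a few of the real columns are included
sample = io.StringIO(
    "age,ca,thal,num\n"
    "63,0.0,6.0,0\n"
    "52,?,3.0,0\n"
    "38,1.0,?,1\n"
)

# na_values="?" turns the marker into NaN during parsing,
# so ca and thal are read as float columns rather than strings
df_sample = pd.read_csv(sample, na_values="?")

missing_per_column = df_sample.isnull().sum()
print(missing_per_column)
```

Parsing the marker up front keeps `ca` and `thal` numeric from the start, which avoids string-typed columns later in the workflow.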

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

In [2]:
df = pd.read_table("processed.cleveland.csv")
df

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
0,63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0
1,67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2
2,67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0,1
3,37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,3.0,0
4,41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,45.0,1.0,1.0,110.0,264.0,0.0,0.0,132.0,0.0,1.2,2.0,0.0,7.0,1
299,68.0,1.0,4.0,144.0,193.0,1.0,0.0,141.0,0.0,3.4,2.0,2.0,7.0,2
300,57.0,1.0,4.0,130.0,131.0,0.0,0.0,115.0,1.0,1.2,2.0,1.0,7.0,3
301,57.0,0.0,2.0,130.0,236.0,0.0,2.0,174.0,0.0,0.0,2.0,1.0,3.0,1


In [3]:
# Replace all '?' in the data with null values 
df = df.replace('?', np.nan)

In [5]:
# There are 4 null values in ca and 2 null values in thal
df.isnull().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          4
thal        2
num         0
dtype: int64

In [22]:
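# Fill null values in ca and thal with the most common value of each column.
# Assigning the result back avoids chained in-place modification, which newer pandas versions warn about.
df['ca'] = df['ca'].fillna(df['ca'].mode()[0])
df['thal'] = df['thal'].fillna(df['thal'].mode()[0])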
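
Mode imputation replaces each missing entry with the most frequent value in its column. The sketch below shows the effect on a small made-up frame; `toy` and its values are hypothetical and only mirror the `ca` and `thal` column names.

```python
import numpy as np
import pandas as pd

# Made-up values with one missing entry per column
toy = pd.DataFrame({
    "ca": [0.0, 0.0, 1.0, np.nan, 2.0],
    "thal": [3.0, 7.0, 3.0, 3.0, np.nan],
})

for col in ["ca", "thal"]:
    most_common = toy[col].mode()[0]  # mode() skips NaN by default
    toy[col] = toy[col].fillna(most_common)

print(toy)
```

With only 6 of 303 records affected, this simple strategy changes the data very little; with more missing values, a different imputation method may be worth considering.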

In [18]:
# Collapse labels 2, 3 and 4 into 1 so the target is binary (0 = no heart disease, 1 = heart disease)
df['num'] = df['num'].replace([2, 3, 4], 1)

In [19]:
# 164 patients have label 0 (no heart disease), 139 patients have label 1 (heart disease)
df['num'].value_counts()

0    164
1    139
Name: num, dtype: int64
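
The same binarisation can be written as a threshold on the original label, which makes the rule "any value above 0 counts as disease" explicit. A small sketch on made-up labels (the `num` series here is hypothetical):

```python
import pandas as pd

# Made-up labels in the original 0..4 range
num = pd.Series([0, 2, 1, 0, 4, 3], name="num")

# Approach used in the notebook: map 2, 3 and 4 onto 1
binary_replace = num.replace([2, 3, 4], 1)

# Equivalent threshold form: everything above 0 becomes 1
binary_threshold = (num > 0).astype(int)

print(binary_threshold.tolist())
```

Both forms give identical labels as long as `num` only contains the integers 0 to 4.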

In [23]:
# Perform one-hot encoding for categorical variables
df_processed = pd.get_dummies(df, columns=['sex','cp','restecg','slope','thal'])

In [24]:
df_processed 

,age,trestbps,chol,fbs,thalach,exang,oldpeak,ca,num,sex_0.0,...,cp_4.0,restecg_0.0,restecg_1.0,restecg_2.0,slope_1.0,slope_2.0,slope_3.0,thal_3.0,thal_6.0,thal_7.0
0,63.0,145.0,233.0,1.0,150.0,0.0,2.3,0.0,0,0,...,0,0,0,1,0,0,1,0,1,0
1,67.0,160.0,286.0,0.0,108.0,1.0,1.5,3.0,1,0,...,1,0,0,1,0,1,0,1,0,0
2,67.0,120.0,229.0,0.0,129.0,1.0,2.6,2.0,1,0,...,1,0,0,1,0,1,0,0,0,1
3,37.0,130.0,250.0,0.0,187.0,0.0,3.5,0.0,0,0,...,0,1,0,0,0,0,1,1,0,0
4,41.0,130.0,204.0,0.0,172.0,0.0,1.4,0.0,0,1,...,0,0,0,1,1,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,45.0,110.0,264.0,0.0,132.0,0.0,1.2,0.0,1,0,...,0,1,0,0,0,1,0,0,0,1
299,68.0,144.0,193.0,1.0,141.0,0.0,3.4,2.0,1,0,...,1,1,0,0,0,1,0,0,0,1
300,57.0,130.0,131.0,0.0,115.0,1.0,1.2,1.0,1,0,...,1,1,0,0,0,1,0,0,0,1
301,57.0,130.0,236.0,0.0,174.0,0.0,0.0,1.0,1,1,...,0,0,0,1,0,1,0,1,0,0
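
The dummy columns in the output carry a `.0` suffix (for example `cp_4.0`) because the category codes are stored as floats. One possible tidy-up, sketched here on a made-up frame, is to cast the codes to integers before encoding. It assumes missing values have already been filled, since NaN cannot be cast to an integer; the `toy` frame is hypothetical.

```python
import pandas as pd

# Made-up rows with a float-coded categorical column
toy = pd.DataFrame({
    "age": [63.0, 67.0, 37.0, 41.0],
    "cp": [1.0, 4.0, 3.0, 4.0],
})

# Cast codes to int so the dummy columns are named cp_1, cp_3, cp_4
toy["cp"] = toy["cp"].astype(int)

# dtype=int stores the indicators as 0/1 integers rather than booleans
encoded = pd.get_dummies(toy, columns=["cp"], dtype=int)

print(encoded)
```

Each row has exactly one indicator set to 1 among the `cp_*` columns, which is the defining property of one-hot encoding.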


### Baseline modelling
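- Reference: scikit-learn's "Choosing the right estimator" map, https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
- Models: support vector classifier (SVC), k-nearest neighbours (KNN), decision tree (DT), random forest (RF)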

In [11]:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [12]:
# Defining feature vector and class labels
X = df_processed.drop('num', axis=1)
y = df_processed['num']

In [13]:
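# Scale every feature to [0, 1]: SVC and KNN are sensitive to feature scale.
# Caveat: fitting the scaler before the train/test split lets test-set minima and maxima influence the scaling.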
scaler = MinMaxScaler()
X = scaler.fit_transform(X)

In [14]:
# Split into train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
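
One way to keep test-set statistics out of the scaling step is to split first and wrap the scaler and the classifier in a scikit-learn `Pipeline`, so the scaler is fitted on the training rows only. The sketch below uses synthetic data from `make_classification` as a stand-in for the heart disease features; the names `X_demo`, `y_demo` and `pipe` are assumptions made for illustration, and the `stratify` argument is an optional addition that keeps the class balance equal in both splits.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in with the same shape as the dataset (303 rows, 13 features)
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=42)

# Split first, so the test rows never influence preprocessing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# The pipeline fits the scaler on the training rows only,
# then reuses those minima and maxima to transform the test rows
pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("knn", KNeighborsClassifier()),
])
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print(f"Test accuracy on synthetic data: {test_accuracy:.3f}")
```

The accuracy printed here describes the synthetic data only and says nothing about the heart disease results.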

In [15]:
# Collate models to test in dictionary 
models = {
         "SVC": SVC(),
         "KNN": KNeighborsClassifier(),
         "Decision Tree": DecisionTreeClassifier(),
         "Random Forest": RandomForestClassifier()}


# Define function to fit and evaluate models 
def fit_and_evaluate_models(models, X_train, y_train, X_test, y_test):
    
    np.random.seed(42)
    scores = []

    for name, model in models.items():
        clf = model
        clf.fit(X_train, y_train) # Fit model with training data
        y_pred = clf.predict(X_test) # Generate class labels for unseen data
        
        accuracy = accuracy_score(y_test, y_pred)
        precision = precision_score(y_test, y_pred)
        recall = recall_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred)

        scores.append({'Classifier_name': name, 'Accuracy': accuracy, 'Precision' : precision, 'Recall': recall, 'F1-score' : f1})
    
    scores_table = pd.DataFrame.from_records(scores)
    return scores_table

In [16]:
fit_and_evaluate_models(models, X_train, y_train, X_test, y_test)

,Classifier_name,Accuracy,Precision,Recall,F1-score
0,SVC,0.868852,0.852941,0.90625,0.878788
1,KNN,0.885246,0.857143,0.9375,0.895522
2,Decision Tree,0.737705,0.75,0.75,0.75
3,Random Forest,0.836066,0.866667,0.8125,0.83871
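
These scores come from a single split with 61 test patients (20% of 303, rounded up), so one prediction shifts accuracy by about 1.6 percentage points. The KNN and SVC accuracies are consistent with 54 and 53 correct predictions respectively, a difference of a single patient. Stratified k-fold cross-validation gives a less noisy comparison. The sketch below runs it on synthetic data from `make_classification`; `X_demo`, `y_demo`, `candidates` and `cv_summary` are hypothetical names, and the choice of 5 folds is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in with the same shape as the dataset (303 rows, 13 features)
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=42)

# Shuffled, stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Scaling lives inside each pipeline, so it is refitted on every training fold
candidates = {
    "SVC": make_pipeline(MinMaxScaler(), SVC()),
    "KNN": make_pipeline(MinMaxScaler(), KNeighborsClassifier()),
}

cv_summary = {}
for name, model in candidates.items():
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    cv_summary[name] = (fold_scores.mean(), fold_scores.std())
    print(f"{name}: mean accuracy {fold_scores.mean():.3f} (std {fold_scores.std():.3f})")
```

Comparing the spread across folds with the gap between models indicates whether a difference of one or two patients is meaningful.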

