# Heart Disease Classification - Cleveland

Dataset: UCI Heart Disease dataset available at the UCI Machine Learning data repository - http://archive.ics.uci.edu/ml/datasets/Heart+Disease


### Features in Dataset
- There are 13 features in this dataset
1. `age`: Age of the patient in years
2. `sex`: 1 = Male, 0 = Female
3. `cp`: chest pain type, 1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic
4. `trestbps`: resting blood pressure in mmHg on admission to the hospital
5. `chol`: serum cholesterol in mg/dl
6. `fbs`: if fasting blood sugar > 120 mg/dl, 1 = true, 0 = false
7. `restecg`: resting electrocardiographic results, 0 = normal, 1 = having ST-T wave abnormality, 2 = probable or definite left ventricular hypertrophy by Estes' criteria
8. `thalach`: maximum heart rate achieved
9. `exang`: exercise-induced angina, 1 = yes, 0 = no
10. `oldpeak`: ST depression induced by exercise relative to rest
11. `slope`: the slope of the peak exercise ST segment, 1 = upsloping, 2 = flat, 3 = downsloping
12. `ca`: number of major vessels (0-3) colored by fluoroscopy
13. `thal`: 3 = normal, 6 = fixed defect, 7 = reversible defect


- Categorical features: `sex`, `cp`, `restecg`, `slope`, `thal`
- Binary features: `exang`, `fbs`
- Numerical features: `age`, `trestbps`, `chol`, `thalach`, `oldpeak`, `ca`
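
The integer codes listed above can be hard to read in tables and plots. As a minimal sketch, the codes for `cp` and `thal` could be mapped back to their labels with `Series.map`. The toy rows and the names `CP_LABELS`, `THAL_LABELS`, `toy` and `decoded` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Code-to-label mappings copied from the feature list above
CP_LABELS = {1: "typical angina", 2: "atypical angina",
             3: "non-anginal pain", 4: "asymptomatic"}
THAL_LABELS = {3: "normal", 6: "fixed defect", 7: "reversible defect"}

# Hypothetical toy rows, not taken from the real dataset
toy = pd.DataFrame({"cp": [1, 4, 3], "thal": [3, 7, 6]})

# Replace each integer code with its human-readable label
decoded = toy.assign(
    cp=toy["cp"].map(CP_LABELS),
    thal=toy["thal"].map(THAL_LABELS),
)
print(decoded)
```

Decoding like this is only for inspection; the models further down still work on the encoded values.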


### Class Label
- The predicted class label is `num`
- From Jaakko's notes: The "goal" field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4 (depending on the certainty of presence, i.e., the higher the value the higher the certainty). 
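
Because `num` ranges from 0 to 4, a binary "disease present / absent" target can be obtained by treating every non-zero value as positive. The sketch below shows this on a hypothetical series of labels (`num_demo` is made up for illustration).

```python
import pandas as pd

# Hypothetical label values covering the full 0-4 range
num_demo = pd.Series([0, 1, 2, 3, 4, 0], name="num")

# Any non-zero value means heart disease is present
binary_target = (num_demo > 0).astype(int)
print(binary_target.tolist())  # [0, 1, 1, 1, 1, 0]
```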

### Data preprocessing
- There are 303 records in this dataset
- Visual inspection of the dataframe shows 6 missing values, marked as '?': 4 in the `ca` column and 2 in the `thal` column
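
One alternative to inspecting the dataframe by eye is to let pandas treat `?` as missing at load time and then count nulls per column. The sketch below uses a hypothetical three-row excerpt held in memory (the values and the name `raw` are made up), assuming a comma-separated layout with a header row.

```python
import io
import pandas as pd

# Hypothetical excerpt in a comma-separated layout; values are illustrative
raw = io.StringIO(
    "age,ca,thal,num\n"
    "63,0.0,6.0,0\n"
    "52,?,3.0,0\n"
    "38,1.0,?,1\n"
)

# na_values="?" converts the '?' markers to NaN while parsing
toy = pd.read_csv(raw, na_values="?")

# Count missing values in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

A side effect of this approach is that `ca` and `thal` are parsed as floats rather than strings, which may or may not suit later steps.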

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

In [2]:
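# read_csv assumes a comma-separated file with a header row (read_table defaults to tab-separated)
# The file must be in the current working directory
df = pd.read_csv("processed.cleveland.csv")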
df

FileNotFoundError: [Errno 2] No such file or directory: 'processed.cleveland.csv'

In [None]:
# Replace all '?' in the data with null values 
df = df.replace('?', np.nan)

In [None]:
# There are 4 null values in ca and 2 null values in thal
df.isnull().sum()

In [None]:
# Fill null values in ca and thal with the most common value
# Assigning the result back avoids in-place fillna on a column, which newer pandas versions warn about
df['ca'] = df['ca'].fillna(df['ca'].mode()[0])
df['thal'] = df['thal'].fillna(df['thal'].mode()[0])

In [None]:
# Change target labels to 1 and 0 to create a binary classification problem
df['num'] = df['num'].replace([2, 3, 4], 1)

In [None]:
# 164 patients have label 0 (no heart disease), 139 patients have label 1 (heart disease)
df['num'].value_counts()

In [None]:
# Perform one-hot encoding for categorical variables
df_processed = pd.get_dummies(df, columns=['sex','cp','restecg','slope','thal'])

In [None]:
df_processed 

### Baseline modelling
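- Reference for choosing an estimator: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
- Models: support vector classifier (SVC), k-nearest neighbours (KNN), decision tree (DT), random forest (RF)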
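
With only 303 records, scores from a single train/test split can shift noticeably from one split to another. One possible complement, not part of the original notebook, is k-fold cross-validation. The sketch below uses synthetic data from `make_classification` as a stand-in for the real features, and the names `X_demo`, `y_demo` and `svc_pipeline` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in with the same number of records and features
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=42)

# Scaling sits inside the pipeline, so it is re-fitted on each training fold
svc_pipeline = Pipeline(steps=[("scaler", MinMaxScaler()), ("model", SVC())])

# 5-fold cross-validated accuracy
cv_scores = cross_val_score(svc_pipeline, X_demo, y_demo, cv=5, scoring="accuracy")
print(cv_scores.mean(), cv_scores.std())
```

The mean and standard deviation across folds give a rough sense of how stable a model's score is.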

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [None]:
# Defining feature vector and class labels
X = df_processed.drop('num', axis=1)
y = df_processed['num']

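In [None]:
# Split into train and test set before scaling so the test set does not influence the scaler
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Normalise features to [0, 1], since SVC and KNN are sensitive to feature scale
# Fit the scaler on the training set only, then apply it to the test set
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)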
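
The split above is purely random, so the class proportions in the test set can drift from those in the full dataset. A possible variant, shown here as a sketch on synthetic labels with the same 164/139 class counts, passes `stratify` to `train_test_split`. The names `X_demo`, `y_demo` and the placeholder feature are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the binarised target
y_demo = np.array([0] * 164 + [1] * 139)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

# stratify=y_demo keeps the class ratio (almost) identical in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print(len(y_te), y_te.mean(), y_demo.mean())
```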

In [None]:
# Collate models to test in dictionary 
models = {
         "SVC": SVC(),
         "KNN": KNeighborsClassifier(),
         "Decision Tree": DecisionTreeClassifier(),
         "Random Forest": RandomForestClassifier()}


# Define function to fit and evaluate models 
def fit_and_evaluate_models(models, X_train, y_train, X_test, y_test):
    
    np.random.seed(42)
    scores = []

    for name, model in models.items():
        clf = model
        clf.fit(X_train, y_train) # Fit model with training data
        y_pred = clf.predict(X_test) # Generate class labels for unseen data
        
        accuracy = accuracy_score(y_test, y_pred)
        precision = precision_score(y_test, y_pred)
        recall = recall_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred)

        scores.append({'Classifier_name': name, 'Accuracy': accuracy, 'Precision' : precision, 'Recall': recall, 'F1-score' : f1})
    
    scores_table = pd.DataFrame.from_records(scores)
    return scores_table

In [None]:
fit_and_evaluate_models(models, X_train, y_train, X_test, y_test)

# Using a scikit-learn pipeline instead

- Example of using a `Pipeline` to preprocess the data and fit a random forest classifier as a single estimator

In [None]:
# Libraries for preprocessing
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler

# Libraries for modelling
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

# Set up random seed
import numpy as np
np.random.seed(42)
               
# Define transformer pipeline for features
categorical_features = ["sex", "cp", "restecg", "slope", "thal"]
categorical_transformer = Pipeline(steps=[
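    ("imputer", SimpleImputer(missing_values="?", strategy="most_frequent")),
    # handle_unknown="ignore" avoids an error if a category appears only in the test set
    ("onehot", OneHotEncoder(handle_unknown="ignore"))])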

numeric_features = ["age", "trestbps", "chol", "thalach", "oldpeak", "ca"]
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(missing_values="?", strategy="most_frequent")),
    ("scaler", MinMaxScaler())])

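# Set up preprocessing steps
# remainder="passthrough" keeps the binary features (exang, fbs), which would otherwise be dropped
preprocessor = ColumnTransformer(
    transformers=[
        ("cat", categorical_transformer, categorical_features),
        ("num", numeric_transformer, numeric_features),
    ],
    remainder="passthrough")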


# Create a preprocessing and modelling pipeline 
model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", RandomForestClassifier())])

# Create datasets
X = df.drop('num', axis=1)
y = df['num']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit and score a model
model.fit(X_train, y_train)
model.score(X_test, y_test)
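
`GridSearchCV` is imported above but not used. As a hedged sketch of how it could be combined with a pipeline, the example below tunes a random forest on synthetic data. The grid values, the simplified two-step pipeline and the names `demo_model`, `param_grid` and `search` are illustrative assumptions, not the notebook's actual configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the real features and labels
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Simplified pipeline: scaling followed by the classifier
demo_model = Pipeline(steps=[
    ("scaler", MinMaxScaler()),
    ("model", RandomForestClassifier(random_state=42)),
])

# Pipeline parameters are addressed as "<step name>__<parameter>"
param_grid = {
    "model__n_estimators": [50, 100],
    "model__max_depth": [None, 5],
}

# 3-fold cross-validated search over the grid, refitted on the best setting
search = GridSearchCV(demo_model, param_grid, cv=3)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.score(X_te, y_te))
```

The same `"model__..."` naming would apply to the pipeline defined above, since its final step is also called `"model"`.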