# aiHeart: Final Classifier

- This notebook contains the code to run the final classifier for the aiHeart heart disease classification tool
- This classifier is based on the Cleveland dataset from UCI
- After feature selection, 4 features (`ca`, `thal`, `fbs` and `restecg`) were excluded and 9 features (`age`, `trestbps`, `chol`, `thalach`, `oldpeak`, `exang`, `cp`, `slope`, `sex`) were retained.
- After comparing multiple models and tuning hyperparameters, we chose `Logistic Regression` with hyperparameters `C = 11.288378916846883` and `solver = 'liblinear'`
- Refer to notebook titled `AiHeart : Heart Disease Classification - Cleveland` for the full code for our project

## 1. Data preparation and preprocessing  <a class="anchor" id="chapter3"></a>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
import pickle
# import joblib

In [2]:
df = pd.read_table("processed.cleveland.csv")
df.drop(["ca", "thal", "fbs", "restecg"], axis=1, inplace=True)
df.rename(columns={"num": "target"}, inplace=True)


In [3]:
# Change target labels to 1 and 0 to create a binary classification problem
df["target"] = df["target"].replace([2, 3, 4], 1)
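
The relabelling above can be checked on a small example. In the Cleveland data, `num` is 0 for no disease and 1 to 4 for disease present, so the mapping is equivalent to testing whether the value is above 0. The sketch below uses a made-up column (`toy` is a hypothetical name, not data from the file) to show both forms giving the same result.

```python
import pandas as pd

# Toy stand-in for the "target" column (values 0-4); not real Cleveland data.
toy = pd.DataFrame({"target": [0, 1, 2, 3, 4, 0]})

# Same mapping as the notebook cell: 2, 3 and 4 are folded into class 1.
via_replace = toy["target"].replace([2, 3, 4], 1)

# Equivalent comparison-based form: any value above 0 counts as disease.
via_compare = (toy["target"] > 0).astype(int)

binary_targets = via_replace.tolist()
print(binary_targets)
```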

In [4]:
# Change to appropriate datatypes
df["cp"] = df["cp"].astype("category")
df["slope"] = df["slope"].astype("category")
df["sex"] = df["sex"].astype("category")
df.dtypes

age          float64
sex         category
cp          category
trestbps     float64
chol         float64
thalach      float64
exang        float64
oldpeak      float64
slope       category
target         int64
dtype: object

In [5]:
df

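|     | age  | sex | cp  | trestbps | chol  | thalach | exang | oldpeak | slope | target |
|-----|------|-----|-----|----------|-------|---------|-------|---------|-------|--------|
| 0   | 63.0 | 1.0 | 1.0 | 145.0    | 233.0 | 150.0   | 0.0   | 2.3     | 3.0   | 0      |
| 1   | 67.0 | 1.0 | 4.0 | 160.0    | 286.0 | 108.0   | 1.0   | 1.5     | 2.0   | 1      |
| 2   | 67.0 | 1.0 | 4.0 | 120.0    | 229.0 | 129.0   | 1.0   | 2.6     | 2.0   | 1      |
| 3   | 37.0 | 1.0 | 3.0 | 130.0    | 250.0 | 187.0   | 0.0   | 3.5     | 3.0   | 0      |
| 4   | 41.0 | 0.0 | 2.0 | 130.0    | 204.0 | 172.0   | 0.0   | 1.4     | 1.0   | 0      |
| ... | ...  | ... | ... | ...      | ...   | ...     | ...   | ...     | ...   | ...    |
| 298 | 45.0 | 1.0 | 1.0 | 110.0    | 264.0 | 132.0   | 0.0   | 1.2     | 2.0   | 1      |
| 299 | 68.0 | 1.0 | 4.0 | 144.0    | 193.0 | 141.0   | 0.0   | 3.4     | 2.0   | 1      |
| 300 | 57.0 | 1.0 | 4.0 | 130.0    | 131.0 | 115.0   | 1.0   | 1.2     | 2.0   | 1      |
| 301 | 57.0 | 0.0 | 2.0 | 130.0    | 236.0 | 174.0   | 0.0   | 0.0     | 2.0   | 1      |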


In [6]:
# Set up preprocessing steps for numerical and categorical variables.
# Columns not listed below (here, exang) are dropped, because ColumnTransformer defaults to remainder="drop".
categorical_features = ["sex", "cp", "slope"]
numeric_features = ["age", "trestbps", "chol", "thalach", "oldpeak"]

preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
        ("num", MinMaxScaler(), numeric_features),
    ]
)

## 2. Training the model  <a class="anchor" id="chapter3"></a>

In [7]:
model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", LogisticRegression(C=11.288378916846883, solver='liblinear'))])

In [8]:
# Create feature vector X and class labels y for training
X = df.drop('target', axis=1)
y = df['target']

In [9]:
model.fit(X, y)
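
Once fitted, the pipeline accepts raw rows with the same columns as `X` and handles encoding and scaling internally. The following is a minimal sketch of scoring a single patient. It trains on synthetic stand-in data (not the Cleveland file), and names such as `toy`, `toy_model` and `new_patient` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Synthetic stand-in data with the notebook's column names; NOT the Cleveland data.
rng = np.random.default_rng(0)
n = 60
toy = pd.DataFrame({
    "age": rng.integers(30, 78, n).astype(float),
    "sex": rng.integers(0, 2, n).astype(float),
    "cp": rng.integers(1, 5, n).astype(float),
    "trestbps": rng.integers(95, 200, n).astype(float),
    "chol": rng.integers(130, 400, n).astype(float),
    "thalach": rng.integers(70, 200, n).astype(float),
    "exang": rng.integers(0, 2, n).astype(float),
    "oldpeak": rng.uniform(0, 6, n).round(1),
    "slope": rng.integers(1, 4, n).astype(float),
    "target": np.arange(n) % 2,  # alternating labels so both classes are present
})

preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex", "cp", "slope"]),
        ("num", MinMaxScaler(), ["age", "trestbps", "chol", "thalach", "oldpeak"]),
    ]
)
toy_model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", LogisticRegression(C=11.288378916846883, solver="liblinear")),
])

X_toy = toy.drop("target", axis=1)
y_toy = toy["target"]
toy_model.fit(X_toy, y_toy)

# Score one raw, unscaled row; double brackets keep it a one-row DataFrame.
new_patient = X_toy.iloc[[0]]
predicted_class = toy_model.predict(new_patient)        # array with a single 0 or 1
predicted_proba = toy_model.predict_proba(new_patient)  # columns follow toy_model.classes_
print(predicted_class, predicted_proba)
```

Because the labels here are synthetic, the predicted values carry no clinical meaning; the point is only the call pattern.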

## 3. Export model to file  <a class="anchor" id="chapter3"></a>

In [10]:
# Save the fitted pipeline to a file named 'saved_model' in the notebook's working directory.
# joblib.dump(model, 'saved_model')
with open('saved_model', 'wb') as f:
    pickle.dump(model, f)

# Load the model back from disk.
# saved_model = joblib.load('saved_model')
with open('saved_model', 'rb') as f:
    saved_model = pickle.load(f)
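
A quick way to confirm the export worked is to check that the reloaded object predicts the same as the original. The sketch below does this for a tiny made-up model written to a temporary directory. `toy_model`, `restored` and `model_path` are hypothetical names, and the small classifier merely stands in for the fitted pipeline.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up training set; stands in for the notebook's fitted pipeline.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])
toy_model = LogisticRegression(solver="liblinear").fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "saved_model"  # hypothetical location

    # Context managers ensure the file handles are closed after each step.
    with open(model_path, "wb") as f:
        pickle.dump(toy_model, f)
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

# The round-tripped model should reproduce the original predictions exactly.
same_predictions = bool(
    np.array_equal(toy_model.predict(X_toy), restored.predict(X_toy))
)
print(same_predictions)
```

Pickle files should only be loaded from trusted sources, and a pickled scikit-learn model is best loaded with the same scikit-learn version that saved it.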