## Ensemble Technique for Prediction of Heart Disease

In the last notebook, I performed extensive feature selection and ran machine learning models on **4 different sets of features**. After a lot of hyperparameter tuning and cross-validation, I got **3 good models with an F1 score of 89% and a test accuracy of around 88.5%**.

- In this notebook, I will research and create a **Max Voting Ensemble of the 3 best models (KNN, Logistic Regression and Random Forest)**, saved with their best hyperparameters.


- New input data will need to be **converted to dummy variables** and **reordered to match the column order** used during training. I then need to **scale the features using MinMax scaling** before predicting the output with the 3 models.


- **Once I have the output from the 3 models, I can use any ensembling technique to get the final prediction. I will be using a Max Voting approach here**.

In [1]:
import streamlit as st
import numpy as np
import pandas as pd
import pickle
import time
import statistics
import os
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23; use the standalone package
os.chdir(r'N:\GITHUB\Data-Science-Machine-Learning\Heart Disease Classification Web App\Streamlit Webapp')

#### Load the 3 Trained Models & Min Max Scaler

In [25]:
knn = joblib.load('./Models/KNNClassifier.pkl')  
logit = joblib.load('./Models/LogisticRegression.pkl')  
rf = joblib.load('./Models/RandomForestclf.pkl')  

In [3]:
scalerfile = './Models/scaler.sav'
# Open the file in a context manager so the handle is closed after loading
with open(scalerfile, 'rb') as f:
    scaler = pickle.load(f)
#test_scaled_set = scaler.transform(test_set)

#### Feature sets for models

In [4]:
original_features = ['age', 'sex', 'resting_BP', 'serum_cholestoral', 'fasting_blood_sugar',
       'max_heart_rate', 'exercise_induced_angina', 'oldpeak',
       'chest_pain_type_1', 'chest_pain_type_2', 'chest_pain_type_3',
       'resting_ECG_1', 'resting_ECG_2', 'slope_1', 'slope_2',
       'major_vessels_count_1', 'major_vessels_count_2',
       'major_vessels_count_3', 'major_vessels_count_4', 'thalium_stress_1',
       'thalium_stress_2', 'thalium_stress_3']


selected_features = ['sex', 'max_heart_rate', 'exercise_induced_angina', 'oldpeak',
       'chest_pain_type_1', 'chest_pain_type_2', 'chest_pain_type_3',
       'slope_2', 'major_vessels_count_1', 'major_vessels_count_2',
       'major_vessels_count_3', 'thalium_stress_3']

#### Original columns and categorical Columns from training data 

In [5]:
original_cols = ['age', 'sex', 'chest_pain_type', 'resting_BP', 'serum_cholestoral',
       'fasting_blood_sugar', 'resting_ECG', 'max_heart_rate',
       'exercise_induced_angina', 'oldpeak', 'slope', 'major_vessels_count',
       'thalium_stress']

categorical_cols = [
 'chest_pain_type',
 'resting_ECG',
 'slope',
 'major_vessels_count',
 'thalium_stress']


#### Lookup for encoding values for labels of different categorical columns 

In [6]:
cat_lookup = {'Male':1, 'Female':0, 'Typical Angina':0, 'Atypical Angina':1, 'Non-anginal pain':2, 'Asymptomatic':3, 
             'fasting blood sugar > 120 mg/dl':1, 'fasting blood sugar < 120 mg/dl':0, 
              'Nothing to Note':0, 'ST-T Wave abnormality':1, 'Left Ventricular Hypertrophy':2, 
             'Yes':1, 'No':0, 'Unslopping: Better heart rate with exercise':0, 'Flatsloping: Minimal change':1,
              'Downslopings: Signs of unhealthy heart':2, 'Normal:1':0, 'Normal:3':1, 'Fixed defect:6':2, 'Reversable defect:7':3}

#### Input Data for testing of Pipeline 

In [7]:
#This dataframe will contain the data input from the user for prediction 

df_test = pd.DataFrame([[45, 'Male', 'Typical Angina', 150, 250, 'fasting blood sugar > 120 mg/dl', 
                        'ST-T Wave abnormality', 170, 'Yes', 3.4, 'Flatsloping: Minimal change', 1, 'Fixed defect:6']], 
                       columns=['age', 'sex', 'chest_pain_type', 'resting_BP', 'serum_cholestoral',
       'fasting_blood_sugar', 'resting_ECG', 'max_heart_rate',
       'exercise_induced_angina', 'oldpeak', 'slope', 'major_vessels_count',
       'thalium_stress'])
df_test

|   | age | sex | chest_pain_type | resting_BP | serum_cholestoral | fasting_blood_sugar | resting_ECG | max_heart_rate | exercise_induced_angina | oldpeak | slope | major_vessels_count | thalium_stress |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 45 | Male | Typical Angina | 150 | 250 | fasting blood sugar > 120 mg/dl | ST-T Wave abnormality | 170 | Yes | 3.4 | Flatsloping: Minimal change | 1 | Fixed defect:6 |


#### Converting the string values to encoded values for each category

In [8]:
lookup_cols = ['sex', 'chest_pain_type','fasting_blood_sugar', 'resting_ECG',
               'exercise_induced_angina', 'slope','thalium_stress']

for col in lookup_cols:
    df_test[col] = df_test[col].apply(lambda x: cat_lookup[x])
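
The loop above raises a bare `KeyError` if a label is missing from `cat_lookup`, and the flat dictionary shares one namespace across all columns. As a minimal sketch (not part of the original pipeline), a per-column lookup with an explicit check could look like the following. The names `column_lookup` and `encode_value` are hypothetical, and only two columns are shown.

```python
# Hypothetical per-column lookup; the codes mirror those in cat_lookup.
column_lookup = {
    "sex": {"Male": 1, "Female": 0},
    "exercise_induced_angina": {"Yes": 1, "No": 0},
}


def encode_value(column, label, lookup=column_lookup):
    """Return the integer code for `label` in `column`, or raise a clear error."""
    try:
        return lookup[column][label]
    except KeyError:
        known = sorted(lookup.get(column, {}))
        raise ValueError(
            f"Unknown label {label!r} for column {column!r}; expected one of {known}"
        ) from None


row = {"sex": "Male", "exercise_induced_angina": "Yes"}
encoded = {col: encode_value(col, label) for col, label in row.items()}
print(encoded)
```

Keeping one dictionary per column means a label such as `'Yes'` can only be used where it is meaningful, which matters once the values come from user input.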

In [9]:
df_test

|   | age | sex | chest_pain_type | resting_BP | serum_cholestoral | fasting_blood_sugar | resting_ECG | max_heart_rate | exercise_induced_angina | oldpeak | slope | major_vessels_count | thalium_stress |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 45 | 1 | 0 | 150 | 250 | 1 | 1 | 170 | 1 | 3.4 | 1 | 1 | 2 |


#### Creating dummy variables for categorical features so that they match the original training dataframe format

In [10]:
cat_dummies = [col for col in original_features
              if '_' in col and '_'.join(col.split('_')[:-1]) in categorical_cols]
cat_dummies

['chest_pain_type_1',
 'chest_pain_type_2',
 'chest_pain_type_3',
 'resting_ECG_1',
 'resting_ECG_2',
 'slope_1',
 'slope_2',
 'major_vessels_count_1',
 'major_vessels_count_2',
 'major_vessels_count_3',
 'major_vessels_count_4',
 'thalium_stress_1',
 'thalium_stress_2',
 'thalium_stress_3']
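
The list comprehension keeps a column only when everything before its last underscore is a known categorical column. Here is a small stand-alone sketch of that prefix test, using `str.rpartition` instead of `split`/`join`. The helper name `dummy_prefix` and the shortened `candidates` list are hypothetical.

```python
categorical_cols = ["chest_pain_type", "resting_ECG", "slope",
                    "major_vessels_count", "thalium_stress"]


def dummy_prefix(column):
    """Return the text before the last underscore, e.g. 'slope_2' -> 'slope'."""
    prefix, _sep, _suffix = column.rpartition("_")
    return prefix


# Numeric columns such as 'resting_BP' contain underscores too, but their
# prefix ('resting') is not a categorical column, so they are filtered out.
candidates = ["age", "resting_BP", "max_heart_rate", "slope_2", "chest_pain_type_3"]
dummies = [c for c in candidates if dummy_prefix(c) in categorical_cols]
print(dummies)
```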

In [11]:
df_heart = pd.get_dummies(df_test, prefix_sep='_', columns=categorical_cols)

In [12]:
df_heart

|   | age | sex | resting_BP | serum_cholestoral | fasting_blood_sugar | max_heart_rate | exercise_induced_angina | oldpeak | chest_pain_type_0 | resting_ECG_1 | slope_1 | major_vessels_count_1 | thalium_stress_2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 45 | 1 | 150 | 250 | 1 | 170 | 1 | 3.4 | 1 | 1 | 1 | 1 | 1 |


In [13]:
## Remove additional columns
# Iterate over a copy of the column list because columns are dropped inside the loop
for col in list(df_heart.columns):
    if ('_' in col) and ('_'.join(col.split('_')[:-1]) in categorical_cols) and (col not in cat_dummies):
        print('Removing additional feature {} not used in training'.format(col))
        df_heart.drop(columns=[col], inplace=True)

Removing additional feature chest_pain_type_0 not used in training


Now we need to add the missing columns. Each missing dummy column can be set to 0, since its category value did not appear in the input row.

In [14]:
for col in cat_dummies:
    if col not in df_heart.columns:
        print('Adding missing feature {}'.format(col))
        df_heart[col] = 0

Adding missing feature chest_pain_type_1
Adding missing feature chest_pain_type_2
Adding missing feature chest_pain_type_3
Adding missing feature resting_ECG_2
Adding missing feature slope_2
Adding missing feature major_vessels_count_2
Adding missing feature major_vessels_count_3
Adding missing feature major_vessels_count_4
Adding missing feature thalium_stress_1
Adding missing feature thalium_stress_3
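
The drop-then-add steps above, together with the later reordering, amount to aligning one input row with the training column list. A minimal plain-Python sketch of that alignment follows. `align_to_training` is a hypothetical helper, and the column list is shortened for illustration. In pandas, `DataFrame.reindex(columns=..., fill_value=0)` gives a comparable result in a single call.

```python
def align_to_training(row, training_columns):
    """Keep only training columns, in training order, filling absent ones with 0."""
    return {col: row.get(col, 0) for col in training_columns}


# Shortened, illustrative column list (the full training list has 22 entries).
training_columns = ["age", "chest_pain_type_1", "chest_pain_type_2", "slope_1"]

# One-hot output for a single row: it has a dummy never seen in training
# ('chest_pain_type_0') and lacks two dummies that training expects.
row = {"age": 45, "chest_pain_type_0": 1, "slope_1": 1}

aligned = align_to_training(row, training_columns)
print(aligned)
```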


In [15]:
df_heart

|   | age | sex | resting_BP | serum_cholestoral | fasting_blood_sugar | max_heart_rate | exercise_induced_angina | oldpeak | resting_ECG_1 | slope_1 | ... | chest_pain_type_1 | chest_pain_type_2 | chest_pain_type_3 | resting_ECG_2 | slope_2 | major_vessels_count_2 | major_vessels_count_3 | major_vessels_count_4 | thalium_stress_1 | thalium_stress_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 45 | 1 | 150 | 250 | 1 | 170 | 1 | 3.4 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


### Feature scaling using the scaler fit on the original data

In [16]:
features_SS = scaler.transform(df_heart)
features_SS

array([[0.33333333, 1.        , 0.52830189, 0.28310502, 1.        ,
        0.75572519, 1.        , 0.5483871 , 1.        , 1.        ,
        1.        , 1.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        ]])

In [17]:
features_SS = pd.DataFrame(features_SS, columns=df_heart.columns)
features_SS

|   | age | sex | resting_BP | serum_cholestoral | fasting_blood_sugar | max_heart_rate | exercise_induced_angina | oldpeak | resting_ECG_1 | slope_1 | ... | chest_pain_type_1 | chest_pain_type_2 | chest_pain_type_3 | resting_ECG_2 | slope_2 | major_vessels_count_2 | major_vessels_count_3 | major_vessels_count_4 | thalium_stress_1 | thalium_stress_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.333333 | 1.0 | 0.528302 | 0.283105 | 1.0 | 0.755725 | 1.0 | 0.548387 | 1.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
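
Min-max scaling maps each feature to `(x - min) / (max - min)`, using the minimum and maximum seen during training. The sketch below reproduces a few of the scaled values above by hand. The per-feature ranges are assumptions chosen to be consistent with that output, since the fitted scaler's statistics are not shown in this notebook. A fitted scaler stores these statistics by column position, so the input columns should be in the training order when `transform` is called.

```python
def min_max_scale(value, lo, hi):
    """Scale `value` into [0, 1] given the training minimum and maximum."""
    return (value - lo) / (hi - lo)


# ASSUMED training ranges (illustrative; not read from the saved scaler).
assumed_ranges = {
    "age": (29, 77),
    "resting_BP": (94, 200),
    "serum_cholestoral": (126, 564),
    "max_heart_rate": (71, 202),
}
patient = {"age": 45, "resting_BP": 150, "serum_cholestoral": 250, "max_heart_rate": 170}

scaled = {name: min_max_scale(patient[name], *assumed_ranges[name]) for name in patient}
for name, value in scaled.items():
    print(f"{name}: {value:.6f}")
```

If the ranges were applied to the wrong columns, for example the `age` range to `resting_BP`, the result would fall outside `[0, 1]`, which is a quick sanity check for ordering mistakes.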


#### Reorder the columns for Prediction using the 3 Models

In [18]:
features_KNN = features_SS[original_features]
features_Logit = features_SS[selected_features]
features_rf = features_SS[original_features]

In [19]:
features_KNN.shape, features_Logit.shape, features_rf.shape

((1, 22), (1, 12), (1, 22))

### Predictions & Ensemble using Max Voting

Now I will use the 3 loaded models, each with its own set of features extracted from the user's input, to predict the target. Then I will use the **Max Voting Ensemble technique to get a final prediction, which is the most voted output from the 3 models**.

In [20]:
pred_knn = knn.predict(features_KNN)
pred_logit = logit.predict(features_Logit)
pred_rf = rf.predict(features_rf)

In [21]:
# predict() returns a 1-element array; index it before converting to int
print('KNN Predicted {}'.format(int(pred_knn[0])))
print('Logistic Predicted {}'.format(int(pred_logit[0])))
print('Random Forest Predicted {}'.format(int(pred_rf[0])))

KNN Predicted 0
Logistic Predicted 0
Random Forest Predicted 0


In [22]:
ensemble_pred = statistics.mode([int(pred_knn[0]), int(pred_logit[0]), int(pred_rf[0])])
ensemble_pred

0
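
Max voting returns the class predicted by the most models. As a minimal sketch (the function name `majority_vote` is hypothetical), the same result can be obtained with `collections.Counter`, which also keeps the vote counts available for inspection. With three binary classifiers a tie cannot occur. With an even number of models or more than two classes, a tie-breaking rule would be needed.

```python
from collections import Counter


def majority_vote(predictions):
    """Return (winning_label, vote_counts) for a list of class predictions."""
    counts = Counter(predictions)
    winner, _votes = counts.most_common(1)[0]
    return winner, dict(counts)


# A unanimous vote like the one above, then a hypothetical split vote.
unanimous = majority_vote([0, 0, 0])
split = majority_vote([1, 0, 1])
print(unanimous)
print(split)
```

Exposing the counts makes it possible to report how strongly the models agree, for instance treating a 2-to-1 vote as less certain than a unanimous one.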

In [23]:
if ensemble_pred == 0:
    ensemble_diagnostic = 'does not have a Heart Disease'
else:
    ensemble_diagnostic = 'might have a Heart Disease. Please perform further tests.'

In [24]:
ensemble_diagnostic

'does not have a Heart Disease'