# Diagnosing Heart Disease from UCI Dataset
## Motivation

Diagnosing serious diseases is a lasting motivation for research, because health is a major human concern. Heart disease is among the most common diseases, and many related attributes have already been collected through medical applications. In this project we propose to diagnose heart disease based on the dataset from UCI. The [UCI Dataset](https://www.kaggle.com/ronitf/heart-disease-uci) contains 76 attributes, but all published experiments refer to using a subset of 14 of them. The aim is to determine the presence of heart disease in a patient from these attributes. The diagnosis field is integer valued from 0 (no presence) to 4; the code below treats `target` as a binary label (0 or 1). The dataset will be analyzed, and several Machine Learning models will be trained and evaluated on it.



## Read The Dataset

In [0]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sea

data=pd.read_csv("heart.csv")
data.tail()

## Data Analysis
### Correlation Matrix

A correlation matrix is a table showing correlation coefficients between sets of variables. Each random variable (Xi) in the table is correlated with each of the other values in the table (Xj). This allows you to see which pairs have the highest correlation.

The features most correlated with the classes (target) are:
- cp
- thalach
- exang
- oldpeak
- slope
- ca
- thal


In [0]:
plt.figure(figsize=(20,10))
sea.heatmap(data.corr(), annot=True, linewidths=3);
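
The ranking listed above can also be read off numerically instead of from the heatmap. The following is a minimal sketch on made-up toy data; the column names `strong` and `weak` are hypothetical and only `target` mirrors the dataset. It sorts the features by the absolute value of their correlation with the target.

```python
import pandas as pd

# Toy data (invented for illustration, not taken from heart.csv).
toy = pd.DataFrame({
    "strong": [0, 1, 0, 1, 1, 0, 1, 0],
    "weak":   [3, 1, 2, 2, 3, 1, 2, 3],
    "target": [0, 1, 0, 1, 1, 0, 1, 1],
})

# Correlation of every feature with the target, without the target itself.
corr_with_target = toy.corr()["target"].drop("target")

# Rank by absolute value: a strong negative correlation is as useful as a positive one.
ranked = corr_with_target.abs().sort_values(ascending=False)
print(ranked)
```

On the real data the same two lines would be applied to `data` instead of `toy`.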

### Feature Uniqueness
The number of unique values per feature doesn't give much information in our scenario.

In [0]:
data.nunique()

## Feature Selection

You can drop the fields that you don't want to include by adding their column names to the list passed to `drop` below.


In [0]:
# Split dataset in features and target variable

# BEST FEATURES
X = data.drop(['target','trestbps','chol','fbs','restecg','age','sex', 'thal'],axis=1) #feature array

# BEST FEATURES + THAL
# X = data.drop(['target','trestbps','chol','fbs','restecg','age','sex'],axis=1) #feature array

# BEST FEATURES + AGE & SEX
# X = data.drop(['target','trestbps','chol','fbs','restecg'],axis=1) #feature array

# ALL FEATURES
# X = data.drop(['target'],axis=1) #feature array


y = data.target # response/target

print ("Features Array: ", X.shape)  # matrix
print ("Labels Array: ", y.shape) # series
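
Selecting features by listing the columns to keep is equivalent to listing the columns to drop, and can be easier to read when only a few features are used. A small sketch on a toy frame; the values are made up and only the column names come from the dataset.

```python
import pandas as pd

# Toy frame with a few of the dataset's column names; the values are invented.
toy = pd.DataFrame({
    "age": [63, 37, 41],
    "cp": [3, 2, 1],
    "thalach": [150, 187, 172],
    "target": [1, 1, 0],
})

keep = ["cp", "thalach"]                        # features we want to use
X_keep = toy[keep]                              # select by listing what to keep
X_drop = toy.drop(["target", "age"], axis=1)    # select by listing what to drop

# Both selections contain the same columns and values.
print(X_keep.equals(X_drop))
```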

## Split

Split the dataset into a training set (80%) and a test set (20%) so that the model can be trained and tested on different data. The split is random, with a fixed `random_state` so that it is reproducible.

In [0]:
import sklearn
#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics
# Split the dataset into two sets, so that the model can be trained and tested on different data
from sklearn.model_selection import train_test_split

# Split dataset into training set (80%) and test set (20%)
X_train,X_test,y_train,y_test=train_test_split(X, y, test_size=0.2, random_state=5)
print ("Training data: ", X_train.shape)
print ("Test data: ", X_test.shape)
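
With a small dataset, a purely random split can leave the two classes unevenly represented in the test set. One option, not used in this notebook, is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both parts. A sketch on synthetic data (the arrays `X_demo` and `y_demo` are stand-ins, not the heart data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([0] * 50 + [1] * 50)   # perfectly balanced classes

# stratify=y_demo keeps the 50/50 class ratio in both the training and the test set.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=5, stratify=y_demo
)

print("Training data:", X_tr.shape)
print("Test data:", X_te.shape)
print("Positives in test set:", int(y_te.sum()))
```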

## Helper Functions

In [0]:
def plot_cm(cm):
  '''
  Plot the Confusion Matrix.
  Yellow-Orange Themed & Count Included & Labeled as TN,FP,FN,TP
  '''
  plt.clf()
  plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Wistia)
  classNames = ['Negative','Positive']
  plt.ylabel('True label')
  plt.xlabel('Predicted label')
  tick_marks = np.arange(len(classNames))
  plt.xticks(tick_marks, classNames, rotation=45)
  plt.yticks(tick_marks, classNames)
  s = [['TN','FP'], ['FN','TP']]
  for i in range(2):
      for j in range(2):
          plt.text(j,i, str(s[i][j])+" = "+str(cm[i][j]))
  plt.show() 
  
  
def plot_cm_orange(cm):
  '''
  Orange-Brown Themed & Count Included
  '''
  sea.heatmap(cm,annot=True,cbar=False,cmap="Oranges",fmt="d") 
  
      
def plot_cm_purple(cm):
  '''
  Purple Themed & No Count
  '''
  fig, ax = plt.subplots()
  im = ax.imshow(cm, interpolation='nearest')
  ax.figure.colorbar(im, ax=ax)
  # We want to show all ticks...
  ax.set(xticks=np.arange(cm.shape[1]),
         yticks=np.arange(cm.shape[0]),
         title='Confusion Matrix for test set',
         ylabel='True label',
         xlabel='Predicted label')

  # Rotate the tick labels and set their alignment.
  plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
           rotation_mode="anchor");
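
The TN, FP, FN and TP labels used in `plot_cm` follow the layout of scikit-learn's binary confusion matrix: rows are true labels, columns are predicted labels. The metrics reported later can be derived directly from those four counts. A short sketch with hypothetical counts:

```python
# Hypothetical counts, laid out like sklearn's binary confusion matrix:
# [[TN, FP],
#  [FN, TP]]
cm_demo = [[20, 5],
           [3, 33]]
(tn, fp), (fn, tp) = cm_demo

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted positives that are right
recall = tp / (tp + fn)                      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

For a diagnosis task, recall is worth watching closely, since a false negative means a missed case of disease.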

# Machine Learning Models
Analyze the models separately and print the Confusion Matrices.

## Decision Tree

Apply the Decision Tree algorithm, show the accuracy and plot the Confusion Matrix.

In [0]:
# Model: 1. Create Decision Tree classifier object
from sklearn.tree import DecisionTreeClassifier
cf = DecisionTreeClassifier()
# Train the model using the training sets
cf = cf.fit(X_train,y_train)
# make Predictions on the test dataset
cf_predicted = cf.predict(X_test)

# Classification accuracy, how often is the classifier correct (percentage of correct predictions)?
# Determine the accuracy of the model (compare actual value:y_test with predicted value:cf_predicted)
print ("DECISION TREE:")
print ("Accuracy Score:")
print (metrics.accuracy_score(y_test, cf_predicted))
# Compute confusion matrix to evaluate the accuracy of a classification
cm=metrics.confusion_matrix(y_test, cf_predicted)

plot_cm(cm)

### Generate The Decision Tree

In [0]:
## generate classification tree for DecisionTreeClassifier
from sklearn import tree
import graphviz

plt.figure(figsize=(5,5))

# Create column list (used by graphviz for printing the decision tree)
# Uncomment the one that matches the features selected above;
# its length must equal the number of columns in X.


# ALL FEATURES INCLUDED
# feature_cols = ['age','sex','chest_pain','rest_bp','chol','fasting_bloodsugar','rest_ecg','max_heartrate','excercise_angina','oldpeak','slope','n_major_vasel','thal']


# ONLY BEST FEATURES INCLUDED (matches the active selection above, which drops thal)
# cp, thalach, exang, oldpeak, slope, ca
feature_cols = ['chest_pain','max_heartrate','excercise_angina','oldpeak','slope','n_major_vasel']


# BEST FEATURES + THAL
# cp, thalach, exang, oldpeak, slope, ca, thal
# feature_cols = ['chest_pain','max_heartrate','excercise_angina','oldpeak','slope','n_major_vasel','thal']


# BEST FEATURES + AGE & SEX
# age, sex, cp, thalach, exang, oldpeak, slope, ca, thal
# feature_cols = ['age', 'sex', 'chest_pain','max_heartrate','excercise_angina','oldpeak','slope','n_major_vasel','thal']


dot_data = tree.export_graphviz(cf, out_file=None, feature_names=feature_cols,  class_names=['0','1'],  filled=True, rounded=True,  special_characters=True)
graph = graphviz.Source(dot_data)
#graph.render("class") ## print pdf file
graph
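
If graphviz is not installed, scikit-learn's `export_text` prints the same tree as plain text. Taking the feature names from the training frame's columns also avoids keeping a separate name list in sync with the selected features. A minimal sketch on toy data; the values are invented and only the column names `cp` and `oldpeak` come from the dataset.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data in which cp alone separates the two classes.
toy_X = pd.DataFrame({
    "cp":      [0, 0, 0, 1, 2, 3, 1, 2],
    "oldpeak": [2.3, 0.0, 1.4, 0.8, 0.0, 1.6, 0.0, 0.5],
})
toy_y = [0, 0, 0, 1, 1, 1, 1, 1]

toy_tree = DecisionTreeClassifier(random_state=0).fit(toy_X, toy_y)

# Feature names come straight from the frame, so they always match the model.
rules = export_text(toy_tree, feature_names=list(toy_X.columns))
print(rules)
```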

## Logistic Regression

Apply the Logistic Regression algorithm, show the accuracy and plot the Confusion Matrix.

In [0]:
# Model: 2. Logistic Regression
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(solver='liblinear')
#Train the model using the training sets
lr.fit(X_train,y_train)
#Predict the response for test dataset
lr_predicted=lr.predict(X_test)
print ("LOGISTIC REGRESSION:")
print ("Accuracy Score:")
print (metrics.accuracy_score(y_test, lr_predicted))
# Compute confusion matrix to evaluate the accuracy of a classification
cm=metrics.confusion_matrix(y_test, lr_predicted)

plot_cm(cm)

## Support Vector Machine (SVM)

Apply the Support Vector Machine algorithm, show the accuracy and plot the Confusion Matrix.

In [0]:
# Model: 3. Support Vector Machine
from sklearn.svm import SVC,LinearSVC
sm = SVC(gamma='auto')
#Train the model using the training sets
sm.fit(X_train,y_train)
#Predict the response for test dataset
#sm.score(X_test, y_test)
sm_predicted=sm.predict(X_test)
print ("SUPPORT VECTOR MACHINE")
print ("Accuracy Score:")
print (metrics.accuracy_score(y_test, sm_predicted))
# Compute confusion matrix to evaluate the accuracy of a classification
cm=metrics.confusion_matrix(y_test, sm_predicted)
print ("Recall Score:")
print(metrics.recall_score(y_test,sm_predicted))
plot_cm(cm)
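
An SVM with an RBF kernel is sensitive to the scale of its inputs, and the features here live on quite different ranges. One common option, not applied in this notebook, is to standardize the features inside a pipeline so the scaler is fitted on the training data only. A sketch on synthetic data (`X_demo` and `y_demo` are stand-ins for the heart data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; one feature is blown up to mimic mixed scales.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
X_demo[:, 0] *= 100.0

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=5)

# The scaler is fitted on the training set only, then applied to the test set.
scaled_svm = make_pipeline(StandardScaler(), SVC(gamma='auto'))
scaled_svm.fit(X_tr, y_tr)

svm_acc = scaled_svm.score(X_te, y_te)
print("Accuracy with scaling:", svm_acc)
```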

## K-Nearest Neighbors (kNN)

Apply the K-Nearest Neighbors algorithm, show the accuracy and plot the Confusion Matrix.

In [0]:
# Model: 4. K-Nearest Neighbors
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
#Train the model using the training sets
knn.fit(X_train,y_train)
#Predict the response for test dataset
knn_predicted = knn.predict(X_test)
print ("K-NEAREST NEIGHBORS")
print ("Accuracy Score:")
print (metrics.accuracy_score(y_test, knn_predicted))
# Compute confusion matrix to evaluate the accuracy of a classification
cm = metrics.confusion_matrix(y_test, knn_predicted)

plot_cm(cm)
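
The value `n_neighbors=3` is one choice among many. A simple way to compare candidates without touching the test set is cross-validation on the training data. The sketch below uses synthetic data, and the list of candidate values is an assumption, not something tuned for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

candidate_k = [1, 3, 5, 7, 9]   # hypothetical candidates
mean_scores = {}
for k in candidate_k:
    knn_demo = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 cross-validation folds.
    mean_scores[k] = cross_val_score(knn_demo, X_demo, y_demo, cv=5).mean()

best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("Best k:", best_k)
```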

## Naive Bayes

Apply the Naive Bayes algorithm, show the accuracy and plot the Confusion Matrix.

In [0]:
# Model: 5. Naive Bayes
from sklearn.naive_bayes import GaussianNB
gb = GaussianNB()
gb.fit(X_train,y_train)
gb_predicted = gb.predict(X_test)
print ("NAIVE BAYES")
print ("Accuracy Score:")
print (metrics.accuracy_score(y_test, gb_predicted))
# Compute confusion matrix to evaluate the accuracy of a classification
cm = metrics.confusion_matrix(y_test, gb_predicted)

plot_cm(cm)

## Random Forest

Apply the Random Forest algorithm, show the accuracy and plot the Confusion Matrix.

In [0]:
# Model: 6. Random Forest
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X_train,y_train)
rf_predicted = rf.predict(X_test)
print ("RANDOM FOREST")
print ("Accuracy Score:")
print (metrics.accuracy_score(y_test, rf_predicted))
# Compute confusion matrix to evaluate the accuracy of a classification
cm = metrics.confusion_matrix(y_test, rf_predicted)

plot_cm(cm)
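
A fitted random forest also exposes `feature_importances_`, which offers a second view on feature relevance next to the correlation matrix. A sketch on synthetic data; the column names `f0` to `f5` are placeholders, not dataset fields.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with placeholder column names.
X_arr, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

rf_demo = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)

# Impurity-based importances, one per feature; they sum to 1.
importances = pd.Series(rf_demo.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```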

## Neural Network

Apply the Neural Network algorithm, show the accuracy and plot the Confusion Matrix.

In [0]:
# Model: 7. Neural Network
from sklearn.neural_network import MLPClassifier
nn = MLPClassifier()
nn.fit(X_train,y_train)
nn_predicted = nn.predict(X_test)
print ("NEURAL NETWORK")
print ("Accuracy Score:")
print (metrics.accuracy_score(y_test, nn_predicted))
# Compute confusion matrix to evaluate the accuracy of a classification
cm = metrics.confusion_matrix(y_test, nn_predicted)

plot_cm(cm)

# Scoring
Run every Machine Learning model 20 times and average several metrics to score the results.

In [0]:
# A Dictionary of the models to apply.
classifier_collection = {
    "Decision Tree": DecisionTreeClassifier(),
    "Logistic Regression (LibLinear)": LogisticRegression(solver='liblinear'),
    "Logistic Regression (LBFGS)": LogisticRegression(solver='lbfgs', max_iter=5000),
    "Logistic Regression (Newton-CG)": LogisticRegression(solver='newton-cg'),
    "Logistic Regression (SAGA)": LogisticRegression(solver='saga', max_iter=100000),
    "SVM (Auto)": SVC(gamma='auto'),
    "SVM (Scale)": SVC(gamma='scale'),
#     "LinearSVC": LinearSVC( max_iter=2000000),
    "Nearest Neighbors": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "Random Forest (10)": RandomForestClassifier(n_estimators=10),
    "Random Forest (100)": RandomForestClassifier(n_estimators=100),
    "Neural Network": MLPClassifier(max_iter=500)
}

# Create and evaluate models
# Evaluation criteria: accuracy_score (help you to choose between models and qualify model performance)

all_dict = {}
accuracy_score_dict = {}
confusion_matrix_dict = {}
roc_auc_dict = {}
f1_score_dict, precision_dict, recall_dict = {}, {}, {}
count=0
for classifier_model, classifier in (classifier_collection.items()):
    #print(classifier_name)
    count +=1
    acc_s, roc_auc, f1_s, precision, recall = 0,0,0,0,0
    cnt = 20
    for _ in range(cnt):
      classifier.fit(X_train,y_train)
      predicted = classifier.predict(X_test)
      acc_s += metrics.accuracy_score(y_test, predicted)
      f1_s += metrics.f1_score(y_test, predicted)
      precision += metrics.precision_score(y_test, predicted)
      recall += metrics.recall_score(y_test, predicted)
      roc_auc += metrics.roc_auc_score(y_test, predicted)
    
    classifier.fit(X_train,y_train)
    predicted = classifier.predict(X_test)
    print(classifier_model)
    print(metrics.classification_report(y_test,predicted))
    accuracy_score_dict[classifier_model] = acc_s/cnt
    f1_score_dict[classifier_model] = f1_s/cnt
    precision_dict[classifier_model] = precision/cnt
    recall_dict[classifier_model] = recall/cnt
#     confusion_matrix_dict[classifier_model] = {'matrix' : metrics.confusion_matrix(y_test, predicted)}
    roc_auc_dict[classifier_model] = roc_auc/cnt
    all_dict[classifier_model] = {'accuracy': acc_s/cnt,
                                  'f1_score': f1_s/cnt,
                                  'precision': precision/cnt,
                                  'recall': recall/cnt,
                                  'roc_auc': roc_auc/cnt}

accuracy_score_dict = [(k, accuracy_score_dict[k]) for k in sorted(accuracy_score_dict, key=accuracy_score_dict.get, reverse=True)]
roc_auc_dict = [(k, roc_auc_dict[k]) for k in sorted(roc_auc_dict, key=roc_auc_dict.get, reverse=True)]
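
Refitting on the same train/test split only averages out randomness inside the model; models without a random component, such as Naive Bayes or nearest neighbors, return the same score on every repetition. An alternative, not used in this notebook, is to repeat the split itself with `RepeatedStratifiedKFold` and `cross_validate`. A sketch on synthetic data with one of the classifiers from the dictionary above:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

# Synthetic stand-in for the full feature matrix and labels.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# 5 folds repeated 4 times gives 20 different train/test splits.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=4, random_state=0)
scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"]

results = cross_validate(
    LogisticRegression(solver='liblinear'), X_demo, y_demo, cv=cv, scoring=scoring
)

for name in scoring:
    scores = results[f"test_{name}"]
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The `roc_auc` scorer here works on the classifier's continuous scores rather than on hard 0/1 predictions, so its values are not directly comparable to those from `roc_auc_score(y_test, predicted)`.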

## Different Metrics for Models

In [0]:
def highlight_max(s):
    '''
    highlight the maximum in a Series in lime green.
    '''
    is_max = s == s.max()
    return ['background-color: limegreen; font-weight: bold;' if v else '' for v in is_max]
def custom(s):
    return ['text-align: center; padding:5px;' for _ in s]
  
cm = sea.light_palette("green", as_cmap=True)
metrics_df = pd.DataFrame.from_dict(all_dict, orient='index')
metrics_df.style.format("{:.4}").apply(highlight_max).apply(custom)