# Decision Tree to Diagnose Heart Disease

The objective is to create an intelligent agent that can suggest a diagnosis of heart disease.<br>
The task is to train a model of the human heart based on measurements taken from numerous heart disease patients.<br>
For training we will use data from a public source: https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/

## Step 1: Load the Libraries 

In [None]:
# pandas for data structures and operations for manipulating numerical tables and time series
import pandas
from pandas.plotting import scatter_matrix

# matplotlib.pyplot for data plots
import matplotlib.pyplot as plt

# sklearn for machine learning methods
from sklearn import tree
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# for numeric calculations
import numpy as np

# from utilities import visualize_classifier


## Step 2: Load a Dataset

First, we load the data from the file __processed.cleveland.data__ using pandas.<br>
The data is a table in __csv__ format.<br>
The columns contain various parameters of the human heart.<br>

In [None]:
# Create URL object
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
# url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/reprocessed.hungarian.data"

As the data has no header, we need to specify the name of each column before loading it. The meaning of each column is described in the file __heart-disease.names__.

In [None]:
# Create a header
names = ['age','sex','cp','bps','chol','fbs','ecg','hrate','ang','peak','slp','ca','thal','diag']

In [None]:
# Load the data, create a dataset object
dataset = pandas.read_csv(url, names=names, na_values=["?"])

## Step 3: Get to Know The Data

### General Overview
Investigate the dataset.<br>
Find out how many records are available, whether they are all clean, and how many classes they represent.<br>
Create diagrams to visualize the set and its descriptive statistics.

In [None]:
# See the shape: (number of rows, number of columns)
print(dataset.shape)

In [None]:
list(dataset)

In [None]:
# See how it looks (get the first 10 records)
dataset.head(10)

In [None]:
# Have the descriptive statistics calculated for the whole dataset
print(dataset.describe())

In [None]:
dataset.info()

In [None]:
# to check null values in data
dataset.isnull().sum()

In [None]:
# the same check with NumPy (works because all columns are numeric after loading)
np.isnan(dataset).sum()

In [None]:
# mask the dataset by its non-missing values (missing entries stay NaN)
dataset[dataset.notnull()]

### Clean The Dataset

In [None]:
dataset = dataset.dropna()
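
To see what the loading and cleaning steps do together, here is a minimal sketch on a tiny in-memory sample. The three rows and their values are made up for illustration and are not taken from the Cleveland file. It shows how `na_values=["?"]` turns question marks into NaN at load time and how `dropna()` then removes every row that contains one.

```python
import io
import pandas

# Hypothetical three-row sample in the same headerless CSV style (illustrative values only)
sample_csv = "63.0,1.0,0.0\n67.0,?,2.0\n41.0,0.0,?\n"
sample = pandas.read_csv(io.StringIO(sample_csv),
                         names=["age", "ca", "thal"], na_values=["?"])

# Count missing values per column before cleaning
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# Drop every row that contains at least one missing value
cleaned = sample.dropna()
print(cleaned.shape)
```

Dropping rows is a simple choice that works well when only a few records are incomplete; if many were affected, filling in the gaps instead could be worth considering.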

In [None]:
# Group by class attribute diag
# See which classes are present and how many records belong to each class
print(dataset.groupby('diag').size())

### Visualization of Dataset Statistics
1. Draw Histograms
2. Draw Scatter Plots
3. Draw Box-Whisker Plots

In [None]:
# Draw histograms for each feature
dataset.hist()
plt.show()

In [None]:
# Generate a scatter plot of cholesterol against age
plt.scatter(dataset['age'], dataset['chol'], marker="o", picker=True)
plt.title('Cholesterol by Age')
plt.xlabel('age')
plt.ylabel('cholesterol')
plt.show()

In [None]:
# Draw box-whisker plots
dataset.plot(kind='box', subplots=True, layout=(3,5), sharex=False, sharey=False)
plt.show()

These diagrams show the distribution of the values in each column.<br>
Some of them appear to follow a Normal (Gaussian) distribution.<br>
This is good to know, as it can help us choose appropriate algorithms later.
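
One rough way to back up this visual impression is to look at the sample skewness of each column: values near zero are consistent with a symmetric, bell-like shape, although they do not prove normality. The sketch below uses two synthetic columns (the names `bell_shaped` and `right_skewed` and the data are assumptions for illustration), not the heart dataset.

```python
import numpy as np
import pandas

# Synthetic data: one roughly Gaussian column and one clearly skewed column
rng = np.random.default_rng(0)
demo = pandas.DataFrame({
    "bell_shaped": rng.normal(loc=50, scale=10, size=1000),
    "right_skewed": rng.exponential(scale=10, size=1000),
})

# Sample skewness per column: close to 0 for symmetric data, positive for a long right tail
skewness = demo.skew()
print(skewness)
```

The same call, `dataset.skew()`, could be applied to the cleaned heart data to see which features deviate most from symmetry.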

### Prepare The Data For Training

In [None]:
# Convert the dataset into array
array = dataset.values

In [None]:
# Create two (sub) arrays from it
# X - features, all rows, all columns but the last one
# y - labels, all rows, the last column
X, y = array[:, :-1], array[:, -1]

In [None]:
# Separate input data into classes based on labels of diagnoses
class0 = np.array(X[y==0])
class1 = np.array(X[y==1])
class2 = np.array(X[y==2])
class3 = np.array(X[y==3])
class4 = np.array(X[y==4])

## Step 4: Training
Time to train a model.
1. Split the dataset into two: __training set__ and __test set__
2. Build the classifier by applying the __Decision Tree__ algorithm to the training set
3. Test the classifier on the test set
4. Estimate how accurate it is

In [None]:
# Split the dataset into training and test sets in proportion 8:2
#   80% of it as training data
#   20% of it as test data
set_prop = 0.2

In [None]:
#  Initialize seed parameter for the random number generator used for the split
seed = 7

In [None]:
# Split
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=set_prop, random_state=seed)
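
If the class counts printed earlier turn out to be uneven, a plain random split can leave the rarest class with very few test samples. A possible variation, sketched below on made-up labels, is to pass `stratify` so that both parts keep roughly the same class proportions. This is an optional alternative, not what the split above does; `X_demo` and `y_demo` are placeholder names.

```python
import numpy as np
from sklearn import model_selection

# Made-up imbalanced labels: 80 samples of class 0 and 20 of class 1
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(100).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = model_selection.train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=7, stratify=y_demo)

# Count how many samples of each class ended up in the test part
print(np.bincount(y_te))
```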

In [None]:
# Build a Decision Tree classifier
params = {'max_depth': 5}
classifier = DecisionTreeClassifier(**params)
# classifier = RandomForestClassifier(n_estimators = 100, max_depth = 6)
 
classifier.fit(X_train, y_train)

In [None]:
# Install the graphviz package for DT visualisation
# !pip install graphviz

In [None]:
# Draw the trained tree with the graphviz package
import graphviz
# feature_names is passed as a plain list of the 13 feature column names
dot_data = tree.export_graphviz(classifier, out_file=None,
                         feature_names=list(dataset.columns[:13]), class_names=True,
                         filled=True, rounded=True, proportion=False,
                         special_characters=True)

In [None]:
# result DT saved in file heart.pdf
graph = graphviz.Source(dot_data)
graph.render("heart") 

In [None]:
# show it here
graph 

## Step 5. Model Validation

We need a metric for the evaluation.<br>
`accuracy` is the fraction (often expressed as a percentage) of correctly predicted instances out of the total number of instances evaluated.
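
As a small worked example of this definition, with labels invented for illustration: if 4 of 5 predictions match the true labels, the accuracy is 4/5 = 0.8. The sketch computes this directly with NumPy; `accuracy_score` from sklearn is expected to return the same fraction for the same inputs.

```python
import numpy as np

# Invented true and predicted labels; only the third prediction is wrong
y_true_demo = np.array([0, 1, 2, 0, 1])
y_pred_demo = np.array([0, 1, 1, 0, 1])

# Accuracy = number of matching labels / total number of labels
accuracy_demo = np.mean(y_true_demo == y_pred_demo)
print(accuracy_demo)
```

Keep in mind that accuracy alone can look good on imbalanced data even when rare classes are predicted poorly, which is why the confusion matrix is also worth inspecting.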

In [None]:
# Set the metric
scoring = 'accuracy'

Now we can apply the model to our test set.


In [None]:
# Predict the labels of the test data
y_testp = classifier.predict(X_test)
y_testp

In [None]:
# Calculate the accuracy of the model by comparing the observed and predicted labels
print("Accuracy is", accuracy_score(y_test, y_testp))

In [None]:
# Create confusion matrix
confusion_mat = confusion_matrix(y_test,y_testp)
confusion_mat

In [None]:
confusion = pandas.crosstab(y_test,y_testp)
confusion

In [None]:
# Visualize confusion matrix
plt.imshow(confusion_mat, interpolation='nearest')
plt.title('Confusion matrix')
plt.colorbar()
ticks = np.arange(5)
plt.xticks(ticks, ticks)
plt.yticks(ticks, ticks)
plt.ylabel('True labels')
plt.xlabel('Predicted labels')
plt.show()

In [None]:
import seaborn as sns
sns.heatmap(confusion_mat, annot=True)

In [None]:
# The diagonal elements represent the number of points for which the predicted label is equal to the true label,
# while off-diagonal elements are those that are mislabeled by the classifier.
# The higher the diagonal values of the confusion matrix the better, indicating many correct predictions.
# Rows are true labels and columns are predicted labels, so for a given class:
# FN - False Negative prediction: an off-diagonal entry in the row of that class
# FP - False Positive prediction: an off-diagonal entry in the column of that class
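
For a problem with more than two classes, TP, FP, FN and TN are defined per class. The following sketch derives them from a hand-written 3x3 confusion matrix, with rows as true labels and columns as predicted labels, as in sklearn's `confusion_matrix`. The numbers are invented for illustration.

```python
import numpy as np

# Invented 3-class confusion matrix: rows = true labels, columns = predicted labels
cm_demo = np.array([[5, 1, 0],
                    [2, 3, 1],
                    [0, 2, 4]])

tp = np.diag(cm_demo)                # correctly predicted samples of each class
fp = cm_demo.sum(axis=0) - tp        # predicted as the class but belonging to another one
fn = cm_demo.sum(axis=1) - tp        # belonging to the class but predicted as another one
tn = cm_demo.sum() - (tp + fp + fn)  # everything that does not involve the class at all

for k in range(len(tp)):
    print(f"class {k}: TP={tp[k]} FP={fp[k]} FN={fn[k]} TN={tn[k]}")
```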

In [None]:
# The confusion matrix shows which errors were made in the predictions, here in text format
# print(confusion_matrix(y_test, y_testp))

In [None]:
class_names = ['Class0', 'Class1', 'Class2','Class3', 'Class4']
# Classifier performance on training dataset
print(classification_report(y_train, classifier.predict(X_train), target_names=class_names))
plt.show()

![image.png](attachment:d7165919-2fdc-4595-89c4-e123ca6565ca.png)

In [None]:
# Classifier performance on test dataset
print(classification_report(y_test, classifier.predict(X_test), target_names=class_names))
plt.show()

## <span style="color:red">Task</span>
Try to improve the model by applying the Random Forest classifier provided in sklearn.<br>
Repeat the training, testing and validation.<br>
Compare the Decision Tree and Random Forest methods.<br>
Answer the question: Which method gives better results?
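
A possible starting point for the comparison, sketched on synthetic data from `make_classification` rather than the heart dataset: both classifiers are scored with 5-fold cross-validation using the same `accuracy` metric. The parameter values (`n_estimators=100`, `max_depth=5`, `cv=5`) and the names `X_demo`, `y_demo` are assumptions chosen for illustration; substitute the real `X` and `y` to answer the question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 13 features, like the heart dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=13,
                                     n_informative=6, random_state=7)

models = {
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=7),
    "random forest": RandomForestClassifier(n_estimators=100, max_depth=5, random_state=7),
}

# Mean cross-validated accuracy per model
scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy").mean()
    print(f"{name}: mean accuracy = {scores[name]:.3f}")
```

Cross-validation averages over several splits, so the comparison depends less on one particular train/test partition than a single `train_test_split` would.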