# Problem Statement:
You are a data scientist at a medical research facility. The facility wants you to
build a machine learning model that classifies, from a patient's data, whether the
patient is at risk of a heart attack.

Heart Disease Dataset:
UCI Heart Disease Dataset
(https://archive.ics.uci.edu/ml/datasets/Heart+Disease?spm=5176.100239.blogcont54260.8.TRNGoO)

### Lab Environment:
Jupyter Notebooks

### Domain:
Healthcare

### Tasks To Be Performed:

1. Data Analysis:
a. Import the dataset
b. Get information about the dataset (mean, max, min, quartiles etc.)
c. Find the correlation between all fields

2. Data Visualization:
a. Visualize the number of patients with and without heart disease
b. Visualize the age and whether a patient has disease or not
c. Visualize correlation between all features using a heat map

3. Logistic Regression:
a. Build a simple logistic regression model:
i. Divide the dataset in 70:30 ratio
ii. Build the model on train set and predict the values on test set
iii. Build the confusion matrix and get the accuracy score

4. Decision Tree:
a. Build a decision tree model:
i. Divide the dataset in 70:30 ratio
ii. Build the model on train set and predict the values on test set
iii. Build the confusion matrix and calculate the accuracy
iv. Visualize the decision tree using the Graphviz package

5. Random Forest:
a. Build a Random Forest model:
i. Divide the dataset in 70:30 ratio
ii. Build the model on train set and predict the values on test set
iii. Build the confusion matrix and calculate the accuracy
iv. Visualize the model using the Graphviz package

6. Select the best model
a. Print the confusion matrix of all classifiers
b. Print the classification report of all classifiers
c. Calculate recall, precision, and F1 score for all the models
d. Visualize confusion matrix using heatmaps
e. Select the best model based on the best accuracies

In [1]:
! pip install graphviz



In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree
from sklearn.ensemble import RandomForestClassifier
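# Import only the metrics used below instead of a wildcard import
from sklearn.metrics import (accuracy_score, classification_report, confusion_matrix,
                             f1_score, precision_score, recall_score)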
from sklearn.preprocessing import StandardScaler

from sklearn import tree
import graphviz


A) DATA ANALYSIS

a. Import the dataset

b. Get information about dataset (mean, max, min, quartiles etc.)

c. Find the correlation between all fields.

In [3]:
df = pd.read_csv('dataset.csv')
df

FileNotFoundError: [Errno 2] No such file or directory: 'dataset.csv'
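The `FileNotFoundError` above means `dataset.csv` was not in the notebook's working directory. One possible guard is sketched below. The helper name `find_dataset` and the candidate folders are hypothetical and not part of the original notebook. It looks for the file in a few places and fails with a more descriptive message.

```python
from pathlib import Path


def find_dataset(filename, search_dirs=(".", "data")):
    """Return the first existing path to `filename` among `search_dirs`.

    `find_dataset` and the default folders are illustrative assumptions,
    not part of the original notebook.
    """
    for folder in search_dirs:
        candidate = Path(folder) / filename
        if candidate.is_file():
            return candidate
    # Report the absolute folders that were checked to ease debugging
    searched = ", ".join(str(Path(d).resolve()) for d in search_dirs)
    raise FileNotFoundError(f"{filename!r} not found; looked in: {searched}")
```

With this in place, the load cell could read `df = pd.read_csv(find_dataset('dataset.csv'))`.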

In [None]:
df.info()

In [None]:
df.isnull().sum()

In [None]:
df.duplicated().sum()

In [None]:
df = df.drop_duplicates()

In [None]:
df.describe()

In [None]:
df.corr()
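By default `df.corr()` computes the Pearson correlation coefficient for every pair of columns. As a rough illustration of what a single entry of that matrix means, here is a pure-Python sketch. The function name `pearson` and the sample values are made up for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance without the 1/n factor)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (variances without the 1/n factor)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values: a perfectly linear relationship gives a coefficient of 1
r = pearson([1, 2, 3], [2, 4, 6])
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate little linear relationship.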

B) DATA VISUALIZATION:

a. Visualize the number of patients having a heart disease and not having a heart disease.

b. Visualize the age and whether a patient has disease or not

c. Visualize correlation between all features using a heat map

In [None]:

plt.figure(figsize=(5, 7))
custom_lines = [Line2D([0], [0], color='blue', lw=2), Line2D([0], [0], color='orange', lw=2)]

# Count patients per class once; sort by label so the colours map to 0, then 1
counts = df['target'].value_counts().sort_index()
barPlot = sns.barplot(x=counts.index, y=counts.values, palette=['blue', 'orange'])

plt.xlabel('Target (0 = no heart disease, 1 = heart disease)')
plt.ylabel('Count')
# Label every bar, however seaborn groups them into containers
for container in barPlot.containers:
    barPlot.bar_label(container)
plt.legend(custom_lines, ['No Heart Disease', 'Heart Disease'])


In [None]:
f = plt.figure(figsize=(12, 4))
sns.scatterplot(x=df['age'], y=df['target'], color='red')


In [None]:
plt.figure(figsize = (10, 10))
sns.heatmap(df.corr(), annot = True)
plt.savefig('Correlation')
plt.show()

C) LOGISTIC REGRESSION

a. Build a simple logistic regression model

i. Divide the dataset in 70:30 ratio

ii. Build the model on train set and predict the values on test set

iii. Build the confusion matrix and get the accuracy score

In [None]:
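# Features are every column except the last one, which is assumed to be 'target'
x = df.iloc[:, :-1]
y = df['target']

# Split first, then fit the scaler on the training rows only so that
# test-set statistics do not leak into training
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=67)

stdScalar = StandardScaler()
x_train = stdScalar.fit_transform(x_train)
x_test = stdScalar.transform(x_test)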
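To make the 70:30 ratio concrete, the sketch below shuffles row indices with a fixed seed and cuts them into train and test parts. It is a plain-Python illustration, not scikit-learn's implementation, and it will not reproduce the exact rows chosen by `train_test_split`. The name `split_indices` is hypothetical.

```python
import math
import random


def split_indices(n_samples, test_size=0.3, seed=67):
    """Shuffle row indices and cut them into (train, test) lists."""
    indices = list(range(n_samples))
    # A seeded generator makes the shuffle repeatable between runs
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * n_samples)  # test share, rounded up
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(10)
```

For 10 rows this gives 7 training indices and 3 test indices, and the same seed always returns the same split.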

In [None]:
logisticModel = LogisticRegression()
logisticModel.fit(x_train, y_train)
y_pred1 = logisticModel.predict(x_test)

In [None]:
pred_error1 = pd.DataFrame()
pred_error1['Actual'] = y_test
pred_error1['Predicted'] = y_pred1
pred_error1

In [None]:
c1 = confusion_matrix(y_test, y_pred1)
c1

In [None]:
a1 = accuracy_score(y_test, y_pred1)*100
a1

In [None]:
report1 = classification_report(y_test, y_pred1, output_dict=True)
report1 = pd.DataFrame(report1).transpose()
print(report1)

In [None]:
precision1 = precision_score(y_test, y_pred1)*100
recall1 = recall_score(y_test, y_pred1)*100
f1Score1 = f1_score(y_test, y_pred1)*100
print("Precision:", precision1, "\nRecall Score:", recall1, "\nF1Score:", f1Score1)
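The accuracy, precision, recall, and F1 values printed above can all be derived from the four cells of the confusion matrix. The sketch below shows that arithmetic for the binary case. The function name `metrics_from_confusion` and the example counts are illustrative and are not results from this notebook.

```python
def metrics_from_confusion(cm):
    """Derive accuracy, precision, recall and F1 from a 2x2 confusion matrix.

    `cm` follows scikit-learn's layout: rows are actual classes, columns are
    predicted classes, so cm == [[tn, fp], [fn, tp]] for labels 0 and 1.
    """
    (tn, fp), (fn, tp) = cm
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    # Guard against division by zero when a class is never predicted or absent
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts for illustration only
example = metrics_from_confusion([[50, 10], [5, 35]])
```

Passing `c1` to this function should agree with `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` before they are multiplied by 100.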

D) DECISION TREE:

a. Build a decision tree model

i. Divide the dataset in 70:30 ratio

ii. Build the model on train set and predict the values on test set

iii. Build the confusion matrix and calculate the accuracy

iv. Visualize the decision tree using the graphviz package

In [None]:
treeModel = DecisionTreeClassifier(criterion='entropy', max_depth=3)
treeModel.fit(x_train, y_train)
y_pred2 = treeModel.predict(x_test)

In [None]:
pred_error2 = pd.DataFrame()
pred_error2['Actual'] = y_test
pred_error2['Predicted'] = y_pred2
pred_error2

In [None]:
c2 = confusion_matrix(y_test, y_pred2)
c2

In [None]:
a2 = accuracy_score(y_test, y_pred2)*100
a2

In [None]:
report2 = classification_report(y_test, y_pred2, output_dict=True)
report2 = pd.DataFrame(report2).transpose()
print(report2)

In [None]:
precision2 = precision_score(y_test, y_pred2)*100
recall2 = recall_score(y_test, y_pred2)*100
f1Score2 = f1_score(y_test, y_pred2)*100
print("Precision:", precision2, "\nRecall Score:", recall2, "\nF1Score:", f1Score2)

In [None]:
# Feature names: every column except the last one ('target')
col_List = list(df.columns[:-1])

In [None]:

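# Rendering with graphviz needs the Graphviz system binaries in addition to the pip package
# graph_data = tree.export_graphviz(treeModel, out_file=None,
#                                 feature_names=col_List,
#                                 class_names=['No Disease', 'Disease'],
#                                 rounded=True,
#                                 filled=True)

# graph = graphviz.Source(graph_data)
# graph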

In [None]:
# Visualize the decision tree
plt.figure(figsize=(15, 10))
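# class_names must be a list of labels in ascending class order, not a single string
tree.plot_tree(treeModel, feature_names=col_List, class_names=['No Disease', 'Disease'], filled=True)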
plt.show()
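The tree above is fitted with `criterion='entropy'`, so each split is chosen to reduce the entropy of the class labels as much as possible. The sketch below computes entropy and the information gain of one candidate split in plain Python. The function names and the tiny label lists are illustrative assumptions, not scikit-learn internals.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy reduction from splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


# Made-up labels: a split that separates the two classes perfectly
gain = information_gain([0, 0, 1, 1], [0, 0], [1, 1])
```

A balanced two-class node has an entropy of 1 bit and a pure node has 0, so the perfect split in the example recovers the full 1 bit.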

E) RANDOM FOREST:

a. Build a Random Forest model 

i. Divide the dataset in 70:30 ratio

ii. Build the model on train set and predict the values on test set

iii. Build the confusion matrix and calculate the accuracy

iv. Visualize the model using the graphviz package

In [None]:
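# Fix the seed so the forest, and therefore the reported metrics, are reproducible
randomForestModel = RandomForestClassifier(n_estimators=20, random_state=67)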
randomForestModel.fit(x_train, y_train)
y_pred3 = randomForestModel.predict(x_test)


In [None]:
pred_error3 = pd.DataFrame()
pred_error3['Actual'] = y_test
pred_error3['Predicted'] = y_pred3
pred_error3

In [None]:
c3 = confusion_matrix(y_test, y_pred3)
c3

In [None]:
a3 = accuracy_score(y_test, y_pred3)*100
a3

In [None]:
report3 = classification_report(y_test, y_pred3, output_dict=True)
report3 = pd.DataFrame(report3).transpose()
print(report3)


In [None]:
precision3 = precision_score(y_test, y_pred3)*100
recall3 = recall_score(y_test, y_pred3)*100
f1Score3 = f1_score(y_test, y_pred3)*100
print("Precision:", precision3, "\nRecall Score:", recall3, "\nF1Score:", f1Score3)

In [None]:
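# Rendering with graphviz needs the Graphviz system binaries in addition to the pip package
# graph_data = tree.export_graphviz(randomForestModel.estimators_[0], out_file=None,
#                                 feature_names=col_List,
#                                 class_names=['No Disease', 'Disease'],
#                                 rounded=True,
#                                 filled=True)

# graph = graphviz.Source(graph_data)
# graph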

In [None]:
plt.figure(figsize=(15, 10))
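# Only the first of the forest's trees is drawn; class_names must be a list of labels
tree.plot_tree(randomForestModel.estimators_[0], feature_names=col_List, class_names=['No Disease', 'Disease'], filled=True)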
plt.show()
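The plot above shows a single tree, but the forest's prediction combines all of them. scikit-learn's `RandomForestClassifier` averages the trees' class probabilities rather than counting one vote per tree. The sketch below contrasts the two approaches in plain Python. The function names and the probabilities are made up for illustration.

```python
def average_probabilities(per_tree_probs):
    """Average class-probability vectors from several trees (soft voting)."""
    n_trees = len(per_tree_probs)
    n_classes = len(per_tree_probs[0])
    return [sum(p[k] for p in per_tree_probs) / n_trees for k in range(n_classes)]


def majority_vote(per_tree_probs):
    """Hard voting: each tree votes for its most probable class."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in per_tree_probs]
    return max(set(votes), key=votes.count)


# Made-up probabilities for one patient from three hypothetical trees
probs = [[0.9, 0.1], [0.4, 0.6], [0.3, 0.7]]
avg = average_probabilities(probs)
soft_class = max(range(len(avg)), key=avg.__getitem__)
hard_class = majority_vote(probs)
```

In this example two of the three trees prefer class 1, yet the averaged probabilities favour class 0 because the first tree is very confident. The two schemes can therefore disagree.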

F) SELECT THE BEST MODEL

a. Print the confusion matrix of all classifiers

b. Print the classification report of all classifiers

c. Calculate recall, precision, and F1 score for all the models

d. Visualize confusion matrix using heatmaps 

e. Select the best model based on the best accuracies

In [None]:
modelMetrics = pd.DataFrame()
modelMetrics['Model Name'] = ['Logistic Regression', 'Decision Tree Classifier', 'Random Forest Classifier']
modelMetrics['Confusion Matrix'] = [c1, c2, c3]
modelMetrics['Precision'] = [precision1, precision2, precision3]
modelMetrics['Recall Score'] = [recall1, recall2, recall3]
modelMetrics['F1 Score'] = [f1Score1, f1Score2, f1Score3]
modelMetrics['Accuracy'] = [a1, a2, a3]



In [None]:
pd.set_option('expand_frame_repr', False)
print(modelMetrics)
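Once the metrics are collected in one table, picking a winner is a matter of choosing which column to rank by. The sketch below uses a hypothetical helper `best_model` and placeholder scores that are not results from this notebook. It shows that the ranking can change with the metric, which may matter in a screening setting where missed cases are costly.

```python
def best_model(scores, metric="accuracy"):
    """Return the model name with the highest value for `metric`."""
    return max(scores, key=lambda name: scores[name][metric])


# Placeholder numbers for illustration only; they are NOT results from this notebook
example_scores = {
    "model_a": {"accuracy": 0.80, "recall": 0.90},
    "model_b": {"accuracy": 0.75, "recall": 0.70},
    "model_c": {"accuracy": 0.85, "recall": 0.80},
}

by_accuracy = best_model(example_scores, "accuracy")
by_recall = best_model(example_scores, "recall")
```

Here `model_c` wins on accuracy while `model_a` wins on recall.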

In [None]:
f, axes = plt.subplots(1, 3, figsize=(17, 4))
titles = ['Logistic Regression', 'Decision Tree Classifier', 'Random Forest Classifier']

# One heatmap per classifier, each drawn on its own axis
for ax, cm, title in zip(axes, [c1, c2, c3], titles):
    sns.heatmap(cm, annot=True, fmt='d', ax=ax)
    ax.set_title(title)
    ax.set_xlabel('Predicted')
    ax.set_ylabel('Actual')

plt.show()


In [None]:
print("Logistic Regression Classification Report\n", report1, 
      "\n\nDecision Tree Classification Report\n", report2, 
      "\n\nRandom Forest Classification Report\n", report3)

**Based on accuracy, the Random Forest classifier performed best among the three models.**