# 55-700241 Applicable Artificial Intelligence (Not much data transformation and cleaning)
1.	Outline of the Task
The objective of the coursework is to design, implement and evaluate a Neural Network for data classification and write a report on it. The neural network is to be developed using Matlab.

2.	The data
A subset of the Heart Disease (Cleveland) data set is provided via Blackboard (file named cleveland_heart_disease_dataset_labelled.mat). The data has the following properties:
•	This is a cleaned-up subset of 14 features from a full set of 75,
•	It contains multiple classes (0: no heart disease, 1: mild heart disease, 2: severe heart disease),
•	The majority of experiments in the literature focus on distinguishing presence (1 or 2) from absence (0),
•	Current state of the art is around 90% accuracy.
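
The presence-versus-absence framing in the list above can be sketched by collapsing the three-class label into a binary one. This is a minimal illustration on a made-up frame, not part of the coursework brief; the column name `target_binary` is a hypothetical choice.

```python
import pandas as pd

# Toy stand-in for the real data; only the 'target' column matters here.
toy_df = pd.DataFrame({'target': [0, 1, 2, 0, 2, 1, 0]})

# Any non-zero class (mild or severe) is treated as "disease present".
toy_df['target_binary'] = (toy_df['target'] > 0).astype(int)

print(toy_df['target_binary'].value_counts().sort_index())
```

The same expression could be applied to the `target` column of the real data if the binary task is the one being evaluated.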

The task is to design a neural network to achieve a cross-validated classification rate as close as possible to the current state of the art.


#### Import the necessary libraries

In [1]:
#import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import RandomOverSampler

#### Data Loading

In [2]:
#load data
heart_df = pd.read_csv('./heart_dataset.csv')

In [3]:
#view the data
heart_df.head()

Unnamed: 0,Age,Sex,CP,Trestbps,Chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,1,145,233,1,2,150,0,2.3,3,0,6,0
1,67,1,4,160,286,0,2,108,1,1.5,2,3,3,1
2,37,1,3,130,250,0,0,187,0,3.5,3,0,3,0
3,41,0,2,130,204,0,2,172,0,1.4,1,0,3,0
4,56,1,2,120,236,0,0,178,0,0.8,1,0,3,0


In [4]:
#view the data types
heart_df.dtypes

Age           int64
Sex           int64
CP            int64
Trestbps      int64
Chol          int64
fbs           int64
restecg       int64
thalach       int64
exang         int64
oldpeak     float64
slope         int64
ca            int64
thal          int64
target        int64
dtype: object

In [6]:
cols = heart_df.columns
cols

Index(['Age', 'Sex', 'CP', 'Trestbps', 'Chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target'],
      dtype='object')

#### Exploratory Data Analysis

for label in cols[:-1]:
    plt.hist(heart_df[heart_df['target']==0][label], color='blue', label='No disease', alpha = 0.7, density = True)
    plt.hist(heart_df[heart_df['target']==1][label], color='red', label='Mild Heart Disease', alpha = 0.7, density = True)
    plt.hist(heart_df[heart_df['target']==2][label], color='green', label='Severe Heart Disease', alpha = 0.7, density = True)
    plt.title(label)
    plt.ylabel('Probability density')
    plt.xlabel(label)
    plt.legend()
    plt.show()

In [11]:
# check the data statistics
heart_df.describe()

Unnamed: 0,Age,Sex,CP,Trestbps,Chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
count,297.0,297.0,297.0,297.0,297.0,297.0,297.0,297.0,297.0,297.0,297.0,297.0,297.0,297.0
mean,54.542088,0.676768,3.158249,131.693603,247.350168,0.144781,0.996633,149.599327,0.326599,1.055556,1.602694,0.676768,4.73064,0.622896
std,9.049736,0.4685,0.964859,17.762806,51.997583,0.352474,0.994914,22.941562,0.469761,1.166123,0.618187,0.938965,1.938629,0.748341
min,29.0,0.0,1.0,94.0,126.0,0.0,0.0,71.0,0.0,0.0,1.0,0.0,3.0,0.0
25%,48.0,0.0,3.0,120.0,211.0,0.0,0.0,133.0,0.0,0.0,1.0,0.0,3.0,0.0
50%,56.0,1.0,3.0,130.0,243.0,0.0,1.0,153.0,0.0,0.8,2.0,0.0,3.0,0.0
75%,61.0,1.0,4.0,140.0,276.0,0.0,2.0,166.0,1.0,1.6,2.0,1.0,7.0,1.0
max,77.0,1.0,4.0,200.0,564.0,1.0,2.0,202.0,1.0,6.2,3.0,3.0,7.0,2.0


We can see that some features (for example Chol and Trestbps) take much larger values than others (for example oldpeak). So we need to scale the data so that all features are on a comparable range before training.
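
As a rough illustration of what scaling does, the sketch below standardizes two columns on very different scales. The five rows are the Chol and oldpeak values from the `head()` output above; the variable names are only for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Chol and oldpeak for the first five rows shown by heart_df.head().
x = np.array([[233, 2.3],
              [286, 1.5],
              [250, 3.5],
              [204, 1.4],
              [236, 0.8]])

# StandardScaler subtracts the column mean and divides by the column std.
scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)

print(x_scaled.mean(axis=0))  # approximately 0 for both columns
print(x_scaled.std(axis=0))   # approximately 1 for both columns
```

After this transformation both columns contribute on the same scale, which matters for distance-based models such as KNN and for gradient-based training.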

In [34]:
# Check whether the data contains any null (missing) values.
heart_df.isnull()

Unnamed: 0,Age,Sex,CP,Trestbps,Chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
292,False,False,False,False,False,False,False,False,False,False,False,False,False,False
293,False,False,False,False,False,False,False,False,False,False,False,False,False,False
294,False,False,False,False,False,False,False,False,False,False,False,False,False,False
295,False,False,False,False,False,False,False,False,False,False,False,False,False,False
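
The full boolean table is hard to read at a glance. A more compact check, sketched here on a small made-up frame with one deliberately missing value, is to count the nulls per column.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing Age value, purely for illustration.
toy_df = pd.DataFrame({'Age': [63, 67, np.nan], 'Chol': [233, 286, 250]})

# Number of missing values in each column.
null_counts = toy_df.isnull().sum()
print(null_counts)

# Single True/False answer for the whole frame.
print('Any missing values:', toy_df.isnull().values.any())
```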


#### Data Scaling

In [23]:
# define function that standardizes the features (zero mean, unit variance)
def scale_dataset(dataframe, oversample=False):
    x = dataframe[dataframe.columns[:-1]].values
    y = dataframe[dataframe.columns[-1]].values
    
    scaler = StandardScaler()
    x = scaler.fit_transform(x)
    
    # Equalize the number of samples per class in the dataset
    if oversample:
        ros = RandomOverSampler()
        x,y = ros.fit_resample(x,y)
    
    data = np.hstack((x,np.reshape(y,(-1,1))))
                     
    return data, x,y

In [25]:
#set training, validation and test set
train, valid, test = np.split(heart_df.sample(frac=1), [int(0.6*len(heart_df)), int(0.8*len(heart_df))])
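
A shuffled `np.split` does not guarantee that each split keeps the same class proportions. One possible alternative, sketched below on synthetic data, is a stratified 60/20/20 split with `train_test_split`; the frame and variable names here are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic frame with an imbalanced three-class target (50 / 30 / 20).
toy_df = pd.DataFrame({'feature': np.arange(100),
                       'target': [0] * 50 + [1] * 30 + [2] * 20})

# First carve off 60% for training, keeping the class ratios.
train_df, rest_df = train_test_split(
    toy_df, test_size=0.4, stratify=toy_df['target'], random_state=0)

# Then split the remaining 40% evenly into validation and test sets.
valid_df, test_df = train_test_split(
    rest_df, test_size=0.5, stratify=rest_df['target'], random_state=0)

print(len(train_df), len(valid_df), len(test_df))
print(train_df['target'].value_counts().sort_index())
```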

In [26]:
# check the length of the data
print('Length of No disease class:',len(train[train['target']==0]))
print('Length of Mild disease class:',len(train[train['target']==1]))
print('Length of severe disease class:',len(train[train['target']==2]))

Length of No disease class: 92
Length of Mild disease class: 52
Length of severe disease class: 34


We can see that the classes are not equal, so we use the random oversampler to add more samples and balance the classes.
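
The idea behind random oversampling can be sketched in plain NumPy: minority-class rows are duplicated at random until every class matches the majority count. This is only an illustration of the concept, not the `imblearn` implementation; the features are random placeholders and the class counts mirror the training split printed above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Class counts mirror the training split above: 92 / 52 / 34.
y = np.array([0] * 92 + [1] * 52 + [2] * 34)
x = rng.normal(size=(len(y), 3))  # placeholder features

target_size = np.bincount(y).max()
keep = []
for cls in np.unique(y):
    idx = np.flatnonzero(y == cls)
    # Draw extra row indices with replacement until the class reaches the majority size.
    extra = rng.choice(idx, size=target_size - len(idx), replace=True)
    keep.append(np.concatenate([idx, extra]))
keep = np.concatenate(keep)

x_bal, y_bal = x[keep], y[keep]
print(np.bincount(y_bal))
```

Because the extra rows are exact copies, oversampling is normally applied to the training set only, so that duplicated samples never appear in the validation or test data.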

In [27]:
# Scale each split with the function above; oversample only the training set to balance its classes
train, x_train, y_train = scale_dataset(train, oversample=True)
valid, x_valid, y_valid = scale_dataset(valid, oversample=False)
test, x_test, y_test = scale_dataset(test, oversample=False)
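
Note that `scale_dataset` creates and fits a new `StandardScaler` on every call, so each split is standardized with its own mean and variance. A common alternative, sketched here on synthetic data with hypothetical variable names, is to fit the scaler on the training features only and reuse it for the other splits.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins sized like the training (178 rows) and test (60 rows) splits.
x_train_raw = rng.normal(loc=130, scale=18, size=(178, 2))
x_test_raw = rng.normal(loc=130, scale=18, size=(60, 2))

# Learn the mean and std from the training data only.
scaler = StandardScaler().fit(x_train_raw)

# Apply the same transformation to both splits.
x_train_std = scaler.transform(x_train_raw)
x_test_std = scaler.transform(x_test_raw)

print(x_train_std.mean(axis=0))  # approximately 0
print(x_test_std.mean(axis=0))   # close to 0, but not exactly
```

This keeps the test set on the same scale the model saw during training and avoids using test-set statistics in preprocessing.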

# Model Creation

#### KNN Model

In [55]:
# import library
from sklearn import neighbors, metrics
from sklearn.metrics import f1_score, classification_report

In [56]:
#create a knn model
clf_knn = neighbors.KNeighborsClassifier(n_neighbors=1)
clf_knn.fit(x_train, y_train)

In [57]:

y_pred = clf_knn.predict(x_test)

In [58]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.77      0.69      0.73        35
           1       0.44      0.39      0.41        18
           2       0.31      0.57      0.40         7

    accuracy                           0.58        60
   macro avg       0.51      0.55      0.51        60
weighted avg       0.62      0.58      0.59        60
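
The model above uses a fixed `n_neighbors=1`. One way the held-out validation split could be used is to compare several values of k before touching the test set. The sketch below does this on synthetic data from `make_classification`, so the scores it prints say nothing about the heart disease data; the candidate values of k are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic three-class problem with 13 features, for illustration only.
X, y = make_classification(n_samples=300, n_features=13, n_informative=6,
                           n_classes=3, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Validation accuracy for each candidate k.
scores = {}
for k in (1, 3, 5, 7, 9, 11):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    scores[k] = model.score(X_val, y_val)

best_k = max(scores, key=scores.get)
print(scores)
print('Best k on the validation set:', best_k)
```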



#### Naive Bayes

In [59]:
#import library
from sklearn.naive_bayes import GaussianNB

In [60]:
clf_nb = GaussianNB()
clf_nb.fit(x_train, y_train)

In [62]:
y_pred = clf_nb.predict(x_test)

In [63]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.85      0.83      0.84        35
           1       0.43      0.33      0.38        18
           2       0.42      0.71      0.53         7

    accuracy                           0.67        60
   macro avg       0.57      0.63      0.58        60
weighted avg       0.67      0.67      0.66        60



#### Logistic Regression

In [64]:
#import libraries
from sklearn.linear_model import LogisticRegression

In [65]:
#make classifier
clf_log = LogisticRegression()
clf_log.fit(x_train, y_train)

In [66]:
y_pred = clf_log.predict(x_test)

In [67]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.91      0.89      0.90        35
           1       0.50      0.39      0.44        18
           2       0.33      0.57      0.42         7

    accuracy                           0.70        60
   macro avg       0.58      0.62      0.59        60
weighted avg       0.72      0.70      0.70        60



#### Support Vector Machines

In [68]:
#import libraries
from sklearn import svm

In [69]:
clf_svm = svm.SVC()
#fit data
clf_svm.fit(x_train,y_train)

In [70]:
y_pred = clf_svm.predict(x_test)

In [71]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.85      0.80      0.82        35
           1       0.43      0.33      0.38        18
           2       0.38      0.71      0.50         7

    accuracy                           0.65        60
   macro avg       0.55      0.62      0.57        60
weighted avg       0.67      0.65      0.65        60



#### Decision Tree

In [73]:
from sklearn.tree import DecisionTreeClassifier

In [78]:
clf_dec = DecisionTreeClassifier(random_state=0)
#fit data
clf_dec.fit(x_train, y_train)

In [79]:
#make predictions
y_pred = clf_dec.predict(x_test)

In [80]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.87      0.74      0.80        35
           1       0.62      0.72      0.67        18
           2       0.33      0.43      0.38         7

    accuracy                           0.70        60
   macro avg       0.61      0.63      0.61        60
weighted avg       0.73      0.70      0.71        60



#### Neural Net

In [81]:
from sklearn.neural_network import MLPClassifier

In [82]:
clf_nn = MLPClassifier()
clf_nn.fit(x_train, y_train)



In [83]:
y_pred = clf_nn.predict(x_test)

In [84]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.90      0.74      0.81        35
           1       0.45      0.50      0.47        18
           2       0.45      0.71      0.56         7

    accuracy                           0.67        60
   macro avg       0.60      0.65      0.61        60
weighted avg       0.71      0.67      0.68        60
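
The task statement asks for a cross-validated classification rate, whereas the reports above come from a single train/test split. A minimal sketch of stratified 5-fold cross-validation for an MLP is shown below. It runs on a synthetic binary problem, and the hidden layer size, `max_iter` and number of folds are assumptions rather than tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary data with the same number of rows and features as the notebook's data.
X, y = make_classification(n_samples=297, n_features=13, n_informative=6,
                           random_state=0)

# Putting the scaler inside the pipeline means it is refit on each training fold.
pipe = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(pipe, X, y, cv=cv, scoring='accuracy')

print('Fold accuracies:', cv_scores)
print('Mean accuracy: %.3f (std %.3f)' % (cv_scores.mean(), cv_scores.std()))
```

Reporting the mean and spread over folds gives a more stable estimate than one 60-sample test set, which is useful when comparing against published accuracy figures.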

