# Data Modeling - Classification - Exercise #1

In this exercise, you will work with the `heart.csv` dataset and build classifiers that predict whether a person has heart disease.

<div class="alert alert-info">
<h2><b>Data Dictionary:</b></h2><ul>
<li>age</li>
<li>sex</li>
<li>chest pain type (4 values)</li>
<li>resting blood pressure</li>
<li>serum cholesterol in mg/dl</li>
<li>fasting blood sugar > 120 mg/dl</li>
<li>resting electrocardiographic results (values 0,1,2)</li>
<li>maximum heart rate achieved</li>
<li>exercise induced angina</li>
<li>oldpeak = ST depression induced by exercise relative to rest</li>
<li>the slope of the peak exercise ST segment</li>
<li>number of major vessels (0-3) colored by fluoroscopy</li>
<li>thal: 0, 1 = normal; 2 = fixed defect; 3 = reversible defect</li>
</ul></div>

## Questions
### 1. Load the `heart.csv` dataset into a pandas dataframe

In [1]:
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

In [2]:
# your code here
df = pd.read_csv('heart.csv')

### 2. Perform minimal EDA and preprocessing steps that are mandatory in order to proceed to classification (do not go overboard)

In [3]:
# your code here
df.shape

(298, 14)

In [4]:
df.sample(5)

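| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 191 | 59 | 1 | 0 | 170 | 326 | 0 | 0 | 140 | 1 | 3.4 | 0 | 0 | 3 | 0 |
| 293 | 57 | 0 | 0 | 140 | 241 | 0 | 1 | 123 | 1 | 0.2 | 1 | 0 | 3 | 0 |
| 240 | 56 | 1 | 0 | 132 | 184 | 0 | 0 | 105 | 1 | 2.1 | 1 | 1 | 1 | 0 |
| 194 | 62 | 1 | 0 | 120 | 267 | 0 | 1 | 99 | 1 | 1.8 | 1 | 2 | 3 | 0 |
| 172 | 60 | 1 | 0 | 117 | 230 | 1 | 1 | 160 | 1 | 1.4 | 2 | 2 | 3 | 0 |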


In [5]:
df.isna().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64

In [6]:
df.describe()

| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 298.0 | 298.0 | 298.0 | 298.0 | 298.0 | 298.0 | 298.0 | 298.0 | 298.0 | 298.0 | 298.0 | 298.0 | 298.0 | 298.0 |
| mean | 54.510067 | 0.677852 | 0.959732 | 131.580537 | 246.90604 | 0.147651 | 0.52349 | 149.466443 | 0.328859 | 1.055369 | 1.395973 | 0.674497 | 2.312081 | 0.540268 |
| std | 9.030526 | 0.468085 | 1.033963 | 17.669293 | 51.893097 | 0.35535 | 0.526521 | 22.98383 | 0.470589 | 1.164162 | 0.617574 | 0.938202 | 0.614024 | 0.499214 |
| min | 29.0 | 0.0 | 0.0 | 94.0 | 126.0 | 0.0 | 0.0 | 71.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 48.0 | 0.0 | 0.0 | 120.0 | 211.0 | 0.0 | 0.0 | 133.0 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 |
| 50% | 56.0 | 1.0 | 1.0 | 130.0 | 241.5 | 0.0 | 1.0 | 152.5 | 0.0 | 0.8 | 1.0 | 0.0 | 2.0 | 1.0 |
| 75% | 61.0 | 1.0 | 2.0 | 140.0 | 275.0 | 0.0 | 1.0 | 165.75 | 1.0 | 1.6 | 2.0 | 1.0 | 3.0 | 1.0 |
| max | 77.0 | 1.0 | 3.0 | 200.0 | 564.0 | 1.0 | 2.0 | 202.0 | 1.0 | 6.2 | 2.0 | 3.0 | 3.0 | 1.0 |


In [7]:
assert set(df.sex) == {0,1}

In [8]:
assert set(df.restecg) == {0,1,2}

In [9]:
assert set(df.ca) == {0,1,2,3}

In [10]:
assert set(df.thal) == {0,1,2,3}

In [11]:
assert set(df.cp) == {0, 1, 2, 3}

In [12]:
scaler = StandardScaler()

scaler.fit(df[['trestbps','chol','thalach']])

df[['trestbps','chol','thalach']] = scaler.transform(df[['trestbps','chol','thalach']])



In [13]:
mm_scaler = MinMaxScaler()

mm_scaler.fit(df[['age']])
df[['age']] = mm_scaler.transform(df[['age']])


### 3. Split the data into a train set and a test set, making sure the sets are stratified according to the dependent variable.

In [14]:
# your code here
X = df.drop(columns = ['target'])
y = df.target

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.2, stratify = y)
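
Note that the scalers above were fitted on the full dataframe before the split, so statistics from the test rows influence the transformation. One way to avoid this is to split first and fit the scalers inside a pipeline. The block below is a minimal sketch under stated assumptions: `make_demo_frame` is a hypothetical helper that generates random values as a stand-in for `heart.csv` (only the column names come from this notebook), and all `demo`-style variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler


def make_demo_frame(n_rows=200, seed=0):
    """Hypothetical stand-in for heart.csv: random values, real column names."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "age": rng.integers(29, 78, n_rows),
        "trestbps": rng.integers(94, 201, n_rows),
        "chol": rng.integers(126, 565, n_rows),
        "thalach": rng.integers(71, 203, n_rows),
        "sex": rng.integers(0, 2, n_rows),
        "target": rng.integers(0, 2, n_rows),
    })


demo_df = make_demo_frame()
X_demo = demo_df.drop(columns=["target"])
y_demo = demo_df["target"]

# Split first, so the test rows never influence the scaler statistics.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

preprocess = ColumnTransformer(
    transformers=[
        ("standard", StandardScaler(), ["trestbps", "chol", "thalach"]),
        ("minmax", MinMaxScaler(), ["age"]),
    ],
    remainder="passthrough",  # leave the remaining columns untouched
)

pipe = Pipeline([("preprocess", preprocess), ("model", GaussianNB())])
pipe.fit(X_tr, y_tr)            # scalers are fitted on the training rows only
demo_pred = pipe.predict(X_te)  # test rows are transformed with training statistics
```

Passing `random_state` to `train_test_split` is optional, but it makes the split (and therefore the reported metrics) reproducible between runs.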

### 4. Train a model based on the Naive Bayes algorithm. Which version of the model did you choose? Why?

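#### Note: You may choose whichever NB model seems MOST appropriate.

Several features are continuous numerical variables, so we use the Gaussian Naive Bayes classifier, which models each feature with a per-class normal distribution. Note that the standardization applied above only rescales these features to zero mean and unit variance; it does not make them normally distributed, so the Gaussian assumption is an approximation.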
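
Whether scaling changes the shape of a feature's distribution can be checked directly. The sketch below standardizes a synthetic right-skewed sample (hypothetical data, not taken from `heart.csv`) and compares its skewness before and after `StandardScaler`; the variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical right-skewed feature (exponential draws), not real patient data.
raw = pd.Series(rng.exponential(scale=1.0, size=1000), name="skewed_feature")

# StandardScaler expects a 2-D input, hence to_frame(); ravel() flattens the result.
scaled = pd.Series(StandardScaler().fit_transform(raw.to_frame()).ravel())

skew_before = raw.skew()
skew_after = scaled.skew()

print(f"mean after scaling: {scaled.mean():.3f}")
print(f"std after scaling:  {scaled.std(ddof=0):.3f}")
print(f"skewness before: {skew_before:.3f}, after: {skew_after:.3f}")
```

Because standardization is a linear transformation, the skewness is the same before and after; only the mean and the variance change.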

In [15]:
%%time
# your code here
model = GaussianNB()
model.fit(X_train,y_train)
y_pred = model.predict(X_test)


CPU times: user 8.28 ms, sys: 58 µs, total: 8.34 ms
Wall time: 9.05 ms


### 5. Report the results of the Naive Bayes algorithm on your test set, in terms of Accuracy. Also, show the confusion matrix, and explain what was the number of FP and FN of the model.

In [16]:
# your code here
acc_gauss = accuracy_score(y_test,y_pred)
acc_gauss

0.8

In [17]:
conf_mat = confusion_matrix(y_test,y_pred)
conf_mat

array([[21,  7],
       [ 5, 27]])

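The accuracy is 0.80 (48 of the 60 test rows are classified correctly).
Reading the matrix as `[[TN, FP], [FN, TP]]`, the model makes 7 false positives (class 0 predicted as class 1) and 5 false negatives (class 1 predicted as class 0).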
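
The FP and FN counts can also be read off programmatically rather than by eye. The following sketch rebuilds label arrays that reproduce the matrix reported above (the `_demo` arrays are constructed for illustration and are not the notebook's actual `y_test` and `y_pred`) and unpacks the four cells with `ravel()`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Rebuild label arrays that reproduce the reported matrix [[21, 7], [5, 27]].
y_true_demo = np.array([0] * 28 + [1] * 32)
y_pred_demo = np.array([0] * 21 + [1] * 7 + [0] * 5 + [1] * 27)

# For binary labels {0, 1}, scikit-learn lays the matrix out as
# [[TN, FP], [FN, TP]]: rows are true classes, columns are predictions.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy_from_matrix = (tn + tp) / (tn + fp + fn + tp)
print(tn, fp, fn, tp)        # 21 7 5 27
print(accuracy_from_matrix)  # 0.8
```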

### 6. Train a model based on the Logistic Regression algorithm.

In [18]:
%%time
# your code here
model_lr = LogisticRegression()
model_lr.fit(X_train,y_train)
y_pred_lr = model_lr.predict(X_test)



CPU times: user 20.5 ms, sys: 915 µs, total: 21.5 ms
Wall time: 24.7 ms


### 7. Report the results of the Logistic Regression algorithm on your test set, in terms of Accuracy. Also, show the confusion matrix, and explain what was the number of FP and FN of the model.

In [19]:
# your code here
acc_logreg = accuracy_score(y_test,y_pred_lr)
acc_logreg

0.8333333333333334

In [20]:
conf_matr_lr = confusion_matrix(y_test,y_pred_lr)
conf_matr_lr

array([[20,  8],
       [ 2, 30]])

The accuracy is 0.83 (50 of the 60 test rows are classified correctly). The model makes 8 false positives and 2 false negatives.

### 8. Train a model based on the SVM algorithm.
We have not covered SVMs yet; for now, create a `sklearn.svm.SVC` object and call its `fit()` and `predict()` methods as you did for the Naive Bayes and Logistic Regression models. We will learn how SVM models work in the second classification lecture.

In [21]:
%%time
# your code here
model_svm = SVC()
model_svm.fit(X_train,y_train)
y_pred_svm = model_svm.predict(X_test)


CPU times: user 15.5 ms, sys: 0 ns, total: 15.5 ms
Wall time: 22.6 ms


### 9. Report the results of the SVM algorithm on your test set, in terms of Accuracy. Also, show the confusion matrix, and explain what was the number of FP and FN of the model.

In [22]:
# your code here
acc_score_svm = accuracy_score(y_test,y_pred_svm)
acc_score_svm

0.8333333333333334

In [23]:
conf_matr_svm = confusion_matrix(y_test,y_pred_svm)
conf_matr_svm

array([[20,  8],
       [ 2, 30]])

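The accuracy is 0.83 (50 of the 60 test rows are classified correctly). The model makes 8 false positives and 2 false negatives, the same as Logistic Regression.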

### 10. Measure the time it took each one of the models to train. What is the difference between the different models' results, and the time it took to train? Would you prefer one over the other? Why do you think you got the results that you did? Please elaborate.

Total CPU time reported by `%%time` for each cell (training plus prediction on the test set):

Naive Bayes: 8.34 ms

Logistic Regression: 21.5 ms

SVM: 15.5 ms
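
A single `%%time` run of a few milliseconds is sensitive to noise and also includes the `predict` call. The sketch below shows one way to time `fit` alone over several repetitions. It assumes synthetic data from `make_classification` with roughly the training-set shape used here (238 rows, 13 features); `median_fit_time` is a hypothetical helper, not part of scikit-learn.

```python
import statistics
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Hypothetical synthetic data with roughly the notebook's training-set shape.
X_syn, y_syn = make_classification(n_samples=238, n_features=13, random_state=0)


def median_fit_time(make_model, X, y, repeats=20):
    """Median wall-clock seconds of `fit` over several fresh model instances."""
    durations = []
    for _ in range(repeats):
        model = make_model()              # fresh, unfitted estimator each time
        start = time.perf_counter()
        model.fit(X, y)
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


fit_times = {
    "Naive Bayes": median_fit_time(GaussianNB, X_syn, y_syn),
    "Logistic Regression": median_fit_time(LogisticRegression, X_syn, y_syn),
    "SVM": median_fit_time(SVC, X_syn, y_syn),
}
for name, seconds in fit_times.items():
    print(f"{name}: {seconds * 1000:.2f} ms")
```

The absolute numbers depend on the machine and the data, so they should only be compared with each other within one run.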

In terms of results, the Logistic Regression and SVM models perform best, each with an accuracy of 0.83, 8 FP and 2 FN.

The Naive Bayes model performs worst, with an accuracy of 0.80, 7 FP and 5 FN.
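
Accuracy alone hides how the errors are split between the two classes. As a rough sketch, precision and recall for class 1 can be derived from the three confusion matrices reported above; `precision_recall` is a hypothetical helper written for this illustration.

```python
import numpy as np


def precision_recall(conf_mat):
    """Return (precision, recall) for the positive class of a 2x2 matrix
    laid out as [[TN, FP], [FN, TP]] (scikit-learn's convention)."""
    tn, fp, fn, tp = np.asarray(conf_mat).ravel()
    return tp / (tp + fp), tp / (tp + fn)


# Confusion matrices copied from the outputs reported in this notebook.
reported = {
    "Naive Bayes": [[21, 7], [5, 27]],
    "Logistic Regression": [[20, 8], [2, 30]],
    "SVM": [[20, 8], [2, 30]],
}

for name, mat in reported.items():
    precision, recall = precision_recall(mat)
    print(f"{name}: precision={precision:.3f}, recall={recall:.3f}")
```

Recall rises from 27/32 for Naive Bayes to 30/32 for the other two models, while precision stays close (27/34 versus 30/38), so the accuracy gain comes from fewer missed positives.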


I would prefer the SVM: it ties with Logistic Regression for the highest accuracy and the fewest misclassifications, and its cell ran faster than the Logistic Regression one. The timing gap is only a few milliseconds on a single run, however, so it is a weak argument on its own.

Gaussian Naive Bayes gives the worst result because it assumes that each feature follows a Gaussian distribution within each class, which is not the case for most of the features here. This strong assumption introduces a significant bias into the model.

The Logistic Regression model does not make assumptions about the distribution of the features. Moreover, it can take both numerical and categorical features, whereas the Gaussian Naive Bayes model is designed for continuous features.
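
With 60 test rows, the gap between 0.80 and 0.83 corresponds to only two predictions (48 versus 50 correct), so a single split is weak evidence for ranking the models. Stratified cross-validation gives a steadier comparison. The sketch below assumes synthetic data from `make_classification` as a placeholder for the real `X` and `y`; the `max_iter=1000` setting and the 5-fold choice are illustrative assumptions, not values used elsewhere in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Hypothetical synthetic data; replace with the real X and y from heart.csv.
X_cv, y_cv = make_classification(n_samples=298, n_features=13, random_state=0)

# Stratified folds keep the class balance similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
candidates = {
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
}

# One accuracy score per fold and per model.
cv_scores = {
    name: cross_val_score(model, X_cv, y_cv, cv=cv, scoring="accuracy")
    for name, model in candidates.items()
}
for name, scores in cv_scores.items():
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

If the fold-to-fold standard deviation is larger than the difference between the mean accuracies, the models cannot be reliably ranked on this amount of data.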