# Replication: A machine learning approach for the classification of cardiac arrhythmia

### Commentary

**Classifying research based on**:<br /> 
    - purpose: **applied research**, the research aims to classify different ECG signals so that their analysis becomes easier<br /> 
    - depth: **correlational**, we look at how ECG signals correlate with different diagnoses<br /> 
    - type of data used: **quantitative**, ECG signals are recorded by a machine <br /> 
    - degree of data manipulation: **observational**, the signals should be recorded without any influence of our own<br /> 
    - type of conclusion: **inductive**, knowledge is generated by observation in order to achieve generalization<br /> 
    - implementation time: **cross-sectional study**, the data were collected at a single point in time, without observation over a longer period <br/>  
    - source of data: **primary**, the research uses ECG signals that were recorded first hand <br/>
    - way of collecting data: unknown, neither the paper nor the data source describes it<br />
    
**Methods of how the data for the article were collected**: <br />
This was documentary research, since it used data that had already been collected and published.

**Whether the data were collected and published according to the rules of research ethics**: <br />
We were unable to find any information on this question.

In [120]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

## Dataset

The data originate from https://archive.ics.uci.edu/ml/datasets/Arrhythmia <br/>
For simpler data handling, the CSV version was downloaded from https://datahub.io/machine-learning/arrhythmia#data

In [121]:
data = pd.read_csv("arrhythmia_csv.csv")
data

Unnamed: 0,age,sex,height,weight,QRSduration,PRinterval,Q-Tinterval,Tinterval,Pinterval,QRS,...,chV6_QwaveAmp,chV6_RwaveAmp,chV6_SwaveAmp,chV6_RPwaveAmp,chV6_SPwaveAmp,chV6_PwaveAmp,chV6_TwaveAmp,chV6_QRSA,chV6_QRSTA,class
0,75,0,190,80,91,193,371,174,121,-16,...,0.0,9.0,-0.9,0.0,0.0,0.9,2.9,23.3,49.4,8
1,56,1,165,64,81,174,401,149,39,25,...,0.0,8.5,0.0,0.0,0.0,0.2,2.1,20.4,38.8,6
2,54,0,172,95,138,163,386,185,102,96,...,0.0,9.5,-2.4,0.0,0.0,0.3,3.4,12.3,49.0,10
3,55,0,175,94,100,202,380,179,143,28,...,0.0,12.2,-2.2,0.0,0.0,0.4,2.6,34.6,61.6,1
4,75,0,190,80,88,181,360,177,103,-16,...,0.0,13.1,-3.6,0.0,0.0,-0.1,3.9,25.4,62.8,7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
447,53,1,160,70,80,199,382,154,117,-37,...,0.0,4.3,-5.0,0.0,0.0,0.7,0.6,-4.4,-0.5,1
448,37,0,190,85,100,137,361,201,73,86,...,0.0,15.6,-1.6,0.0,0.0,0.4,2.4,38.0,62.4,10
449,36,0,166,68,108,176,365,194,116,-85,...,0.0,16.3,-28.6,0.0,0.0,1.5,1.0,-44.2,-33.2,2
450,32,1,155,55,93,106,386,218,63,54,...,-0.4,12.0,-0.7,0.0,0.0,0.5,2.4,25.0,46.6,1


In [122]:
data.shape

(452, 280)

No. **attributes** = 279 <br/> No. **patients** = 452

The article states, "<em>The dataset was obtained from the UCI - Machine Learning Repository [1] which contains the patient ECG data for 472 patients. Each record contains 279 attributes.</em>", but the imported data contain only 452 patients.

In [123]:
data.describe()

Unnamed: 0,age,sex,height,weight,QRSduration,PRinterval,Q-Tinterval,Tinterval,Pinterval,QRS,...,chV6_QwaveAmp,chV6_RwaveAmp,chV6_SwaveAmp,chV6_RPwaveAmp,chV6_SPwaveAmp,chV6_PwaveAmp,chV6_TwaveAmp,chV6_QRSA,chV6_QRSTA,class
count,452.0,452.0,452.0,452.0,452.0,452.0,452.0,452.0,452.0,452.0,...,452.0,452.0,452.0,452.0,452.0,452.0,452.0,452.0,452.0,452.0
mean,46.471239,0.550885,166.188053,68.170354,88.920354,155.152655,367.207965,169.949115,90.004425,33.676991,...,-0.278982,9.048009,-1.457301,0.003982,0.0,0.514823,1.222345,19.326106,29.47323,3.880531
std,16.466631,0.497955,37.17034,16.590803,15.364394,44.842283,33.385421,35.633072,25.826643,45.431434,...,0.548876,3.472862,2.00243,0.050118,0.0,0.347531,1.426052,13.503922,18.493927,4.407097
min,0.0,0.0,105.0,6.0,55.0,0.0,232.0,108.0,0.0,-172.0,...,-4.1,0.0,-28.6,0.0,0.0,-0.8,-6.0,-44.2,-38.6,1.0
25%,36.0,0.0,160.0,59.0,80.0,142.0,350.0,148.0,79.0,3.75,...,-0.425,6.6,-2.1,0.0,0.0,0.4,0.5,11.45,17.55,1.0
50%,47.0,1.0,164.0,68.0,86.0,157.0,367.0,162.0,91.0,40.0,...,0.0,8.8,-1.1,0.0,0.0,0.5,1.35,18.1,27.9,1.0
75%,58.0,1.0,170.0,79.0,94.0,175.0,384.0,179.0,102.0,66.0,...,0.0,11.2,0.0,0.0,0.0,0.7,2.1,25.825,41.125,6.0
max,83.0,1.0,780.0,176.0,188.0,524.0,509.0,381.0,205.0,169.0,...,0.0,23.6,0.0,0.8,0.0,2.4,6.0,88.8,115.9,16.0


### Makeup

In [124]:
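# Count the missing values in column J (not displayed, since it is not the last expression in the cell)
data[["J"]].isna().sum()
# Work on a copy and drop column J
data_tmp = data.copy()
data_tmp.drop(['J'], axis=1, inplace=True)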

In [125]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 452 entries, 0 to 451
Columns: 280 entries, age to class
dtypes: float64(125), int64(155)
memory usage: 988.9 KB


In [126]:
np.where(np.isnan(data_tmp))

(array([  4,  54,  59,  66,  91, 106, 108, 116, 133, 174, 177, 193, 200,
        204, 212, 217, 219, 238, 241, 243, 253, 279, 284, 298, 300, 308,
        310, 350, 360, 372, 412, 420], dtype=int64),
 array([13, 11, 11, 10, 10, 11, 11, 11, 11, 11, 11, 11, 10, 11, 10, 11, 11,
        10, 12, 11, 11, 11, 11, 11, 11, 11, 11, 11, 10, 10, 10, 11],
       dtype=int64))

In [127]:
# Replace every remaining NaN with the mean of its column
data_tmp2 = data_tmp.fillna(data_tmp.mean())
# Verify that no NaN values are left
np.where(np.isnan(data_tmp2))

(array([], dtype=int64), array([], dtype=int64))
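
The mean imputation used above can be illustrated on a tiny frame. This is a minimal sketch on made-up data; the column names `feat_a` and `feat_b` are hypothetical and do not come from the arrhythmia dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the ECG table; names and values are illustrative only.
toy = pd.DataFrame({
    "feat_a": [100.0, np.nan, 140.0],
    "feat_b": [np.nan, 80.0, 90.0],
    "class": [1, 2, 1],
})

# Column means are computed with NaN values skipped (the pandas default).
col_means = toy.mean()

# fillna with a Series aligns on column names, so each NaN gets its own column's mean.
toy_filled = toy.fillna(col_means)

print(toy_filled)
```

Each missing entry is replaced by the average of the observed values in the same column, which keeps the column mean unchanged but shrinks its variance.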

In [128]:
import sklearn.metrics as metrics

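def spec_sens(y_true, y_pred):
    # Note: the target has more than two classes, so this reads only the
    # top-left 2x2 block of the confusion matrix, i.e. the first two class
    # labels in sorted order (first label = negative, second = positive).
    confusion_matrix = metrics.confusion_matrix(y_true, y_pred)
    TN = confusion_matrix[0][0]
    FP = confusion_matrix[0][1]
    FN = confusion_matrix[1][0]
    TP = confusion_matrix[1][1]

    sensitivity = TP / (TP + FN)
    specificity = TN / (TN + FP)

    print("sensitivity: ", sensitivity)
    print("specifity: ", specificity)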
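
Because the class column holds more than two labels, sensitivity and specificity can also be computed per class in a one-vs-rest fashion. The sketch below shows one possible way to do that; the helper `per_class_sens_spec` and the toy labels are illustrative assumptions and are not part of the original notebook.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def per_class_sens_spec(y_true, y_pred):
    """Return {label: (sensitivity, specificity)} using one-vs-rest counts."""
    labels = np.unique(np.concatenate([y_true, y_pred]))
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    total = cm.sum()
    result = {}
    for i, label in enumerate(labels):
        tp = cm[i, i]
        fn = cm[i, :].sum() - tp   # true class i, predicted as another class
        fp = cm[:, i].sum() - tp   # other classes predicted as class i
        tn = total - tp - fn - fp
        sens = tp / (tp + fn) if (tp + fn) > 0 else float("nan")
        spec = tn / (tn + fp) if (tn + fp) > 0 else float("nan")
        result[int(label)] = (sens, spec)
    return result

# Made-up labels, three classes
y_true = np.array([1, 1, 1, 2, 2, 3])
y_pred = np.array([1, 1, 2, 2, 3, 3])
scores = per_class_sens_spec(y_true, y_pred)
for label, (sens, spec) in scores.items():
    print(label, round(sens, 3), round(spec, 3))
```

Each class is treated in turn as the positive class and all remaining classes as negative, so every row of the confusion matrix contributes to the result.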

### X and y

In [129]:
# Target: the arrhythmia class label
target = data_tmp2.loc[:, "class"]

# Features: every column except the class label
values = data_tmp2.loc[:, data_tmp2.columns != "class"]

#### Scaling data

In [130]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

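# Fit the scaler on the full dataset. Note that this happens before the
# train/test split, so the test rows also contribute to the mean and variance.
scaler.fit(values)
# Standardize all rows with the fitted statistics.
values = scaler.transform(values)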
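
Since the train/test split comes later in the notebook, the scaler here sees every row. A minimal sketch of the alternative, in which the scaler is fitted on the training rows only, is given below; the synthetic array `X` is an assumption used purely for illustration, so the numbers do not relate to the ECG data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 4))  # synthetic stand-in features
y = rng.integers(0, 2, size=100)                   # synthetic binary labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Mean and standard deviation are estimated from the training rows only.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
# The test rows reuse the training statistics, so no test information leaks in.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(6), X_test_scaled.shape)
```

With this ordering the scaled test set generally does not have exactly zero mean and unit variance, which is expected.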

## Feature Selection

The dataset will be narrowed down using **Principal Component Analysis (PCA)**.

In [131]:
from sklearn.decomposition import PCA
# Make an instance of the Model
pca = PCA().fit(values)

In [132]:
import plotly.express as px
exp_var_cumul = np.cumsum(pca.explained_variance_ratio_)
px.area(
    x=range(1, exp_var_cumul.shape[0] + 1),
    y=exp_var_cumul,
    labels={"x": "n.o. Components", "y": "Explained Variance"}
)

The graph shows that around **99% of the variance can be explained by the first 150 principal components**.

In [133]:
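# A float between 0 and 1 keeps as many components as needed to explain that fraction of the variance
pca = PCA(0.988972).fit(values)
values = pca.transform(values)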


In [134]:
print(values.shape)

(452, 150)


Using scikit-learn's PCA, we reduced the **number of features to 150** principal components.
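
The number of components needed for a given variance threshold can also be read directly from the cumulative explained variance instead of from the plot. The following is a sketch on synthetic data (the array `X` and the 0.99 threshold are assumptions for illustration); on such data the manual count should agree with what `PCA` selects when given a float.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 200 samples, 10 features built from 3 latent factors plus small noise
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 10))

# Fit a full PCA and accumulate the explained variance ratios
pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

threshold = 0.99
# Number of leading components needed for the cumulative ratio to exceed the threshold
n_components = int(np.searchsorted(cumulative, threshold, side="right") + 1)

# Passing the same float to PCA lets scikit-learn choose the count itself
pca_auto = PCA(n_components=threshold).fit(X)

print(n_components, pca_auto.n_components_)
```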

### Splitting data 

In [135]:
from sklearn.model_selection import train_test_split
train_values, test_values, train_target, test_target = train_test_split( values, target, test_size=0.3, random_state=0)
print("Number of train samples: "+ str(train_values.shape[0]))
print("Number of test samples: "+ str(test_values.shape[0]))

Number of train samples: 316
Number of test samples: 136


## Logistic Regression

In [136]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

cv = KFold(n_splits=5, shuffle=True)
logmodel = LogisticRegression(multi_class='multinomial')
log_parameters = {'solver':('newton-cg', 'lbfgs', 'sag'),'penalty':('none', 'l2') ,'C':[1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100]}
log_clf = GridSearchCV(logmodel, log_parameters, cv=cv)

# execute search
log_result = log_clf.fit(train_values, train_target)
# summarize result
print('Best Score: %s' % log_result.best_score_)
print('Best Hyperparameters: %s' % log_result.best_params_)



Best Score: 0.6833829365079365
Best Hyperparameters: {'C': 0.01, 'penalty': 'l2', 'solver': 'sag'}
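
`KFold(shuffle=True)` without a `random_state` reshuffles the folds on every run, so the reported best hyperparameters may change between executions. The sketch below, on synthetic data from `make_classification` (an assumption, not the ECG data), fixes the seed and reuses the refitted `best_estimator_` so that the evaluated model is the one the search selected.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

# Synthetic binary classification problem, for illustration only
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# A fixed random_state makes the fold assignment repeatable
cv = KFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      {"C": [0.01, 0.1, 1, 10]}, cv=cv)
search.fit(X_train, y_train)

# With the default refit=True, best_estimator_ is retrained on all of X_train
best_model = search.best_estimator_
test_accuracy = best_model.score(X_test, y_test)

print(search.best_params_, test_accuracy)
```

Reusing `best_estimator_` avoids retyping the selected hyperparameters by hand when building the final model.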


In [138]:
from sklearn.metrics import accuracy_score

# Note: C=1 here differs from the best value reported by the grid search above (C=0.01).
log_model = LogisticRegression(C=1, penalty='l2', solver='sag').fit(train_values, train_target)
print("accuracy: " + str(log_model.score(test_values, test_target)))
spec_sens(test_target, log_model.predict(test_values))
print("n.o. correctly classified samples: "+ 
      str(accuracy_score(test_target, log_model.predict(test_values), normalize=False)))

accuracy: 0.6323529411764706
sensitivity:  0.5
specifity:  0.9365079365079365
n.o. correctly classified samples: 86


Logistic Regression classifies the test set with **63.24% accuracy**, or **86/136 samples**.

## SVM

In [139]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

cv = KFold(n_splits=5, shuffle=True)
svmmodel = SVC()
svm_parameters = {'gamma': [0.001, 0.01, 1], 'C':[1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100]}
svm_clf = GridSearchCV(svmmodel, svm_parameters, cv=cv)

# execute search
svm_result = svm_clf.fit(train_values, train_target)
# summarize result
print('Best Score: %s' % svm_result.best_score_)
print('Best Hyperparameters: %s' % svm_result.best_params_)


Best Score: 0.6961805555555556
Best Hyperparameters: {'C': 10, 'gamma': 0.001}


In [140]:
# Note: C=100 here differs from the best value reported by the grid search above (C=10).
svm_model = SVC(C=100, gamma=0.001).fit(train_values, train_target)
print("accuracy: " + str(svm_model.score(test_values, test_target)))
spec_sens(test_target, svm_model.predict(test_values))
print("n.o. correctly classified samples: "+ 
      str(accuracy_score(test_target, svm_model.predict(test_values), normalize=False)))

accuracy: 0.6617647058823529
sensitivity:  0.4
specifity:  0.9393939393939394
n.o. correctly classified samples: 90


SVM classifies the test set with **66.18% accuracy**, or **90/136 samples**.

## K-Nearest Neighbors (KNN) Algorithm

In [141]:
from sklearn.neighbors import KNeighborsClassifier

cv = KFold(n_splits=5, shuffle=True)
knnmodel = KNeighborsClassifier()
knn_parameters = {'n_neighbors':[3,5,11,19], 'weights':['uniform', 'distance'], 'metric':['euclidean', 'manhattan']}
knn_clf = GridSearchCV(knnmodel, knn_parameters, cv=cv)

# execute search
knn_result = knn_clf.fit(train_values, train_target)
# summarize result
print('Best Score: %s' % knn_result.best_score_)
print('Best Hyperparameters: %s' % knn_result.best_params_)


Best Score: 0.632986111111111
Best Hyperparameters: {'metric': 'euclidean', 'n_neighbors': 3, 'weights': 'distance'}


In [142]:
# Note: n_neighbors=5 here differs from the best value reported by the grid search above (n_neighbors=3).
knn_model = KNeighborsClassifier(metric='euclidean', n_neighbors=5, weights='distance').fit(train_values, train_target)
print("accuracy: " + str(knn_model.score(test_values, test_target)))
spec_sens(test_target, knn_model.predict(test_values))
print("n.o. correctly classified samples: "+ 
      str(accuracy_score(test_target, knn_model.predict(test_values), normalize=False)))

accuracy: 0.5735294117647058
sensitivity:  0.08333333333333333
specifity:  0.971830985915493
n.o. correctly classified samples: 78


KNN classifies the test set with **57.35% accuracy**, or **78/136 samples**.

## Random Forest Classifier

In [143]:
from sklearn.ensemble import RandomForestClassifier

cv = KFold(n_splits=5, shuffle=True)
rfcmodel = RandomForestClassifier()
rfc_parameters = {'bootstrap': [True],
    'max_depth': [80, 90, 100, 110],
    'max_features': [2, 3],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [8, 10, 12],
    'n_estimators': [100, 200, 300, 1000]}
rfc_clf = GridSearchCV(rfcmodel, rfc_parameters, cv=cv)

# execute search
rfc_result = rfc_clf.fit(train_values, train_target)
# summarize result
print('Best Score: %s' % rfc_result.best_score_)
print('Best Hyperparameters: %s' % rfc_result.best_params_)

Best Score: 0.550843253968254
Best Hyperparameters: {'bootstrap': True, 'max_depth': 80, 'max_features': 3, 'min_samples_leaf': 3, 'min_samples_split': 10, 'n_estimators': 100}


In [145]:
# Train a random forest with the best hyperparameters found by the grid search above
rfc_model = RandomForestClassifier(
    bootstrap=True, max_depth=80, max_features=3,
    min_samples_leaf=3, min_samples_split=10, n_estimators=100,
).fit(train_values, train_target)
print("accuracy: " + str(rfc_model.score(test_values, test_target)))
spec_sens(test_target, rfc_model.predict(test_values))
print("n.o. correctly classified samples: "+ 
      str(accuracy_score(test_target, rfc_model.predict(test_values), normalize=False)))

accuracy: 0.5294117647058824
sensitivity:  0.0
specifity:  1.0
n.o. correctly classified samples: 72


Random Forest Classifier classifies the test set with **52.94% accuracy**, or **72/136 samples**.

## Conclusion 

When the models were cross-validated and tested, SVM achieved the highest test accuracy, 66.18%. SVM giving the best results agrees with the paper; however, the accuracy of all the models is much lower than reported there.
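
As a quick arithmetic check, the test accuracies can be recomputed from the correctly classified counts printed in the cells above (86, 90, 78 and 72 out of 136 test samples). The dictionary below simply restates those recorded outputs; the variable names are illustrative.

```python
# Correctly classified test samples, taken from the recorded cell outputs
correct_counts = {
    "Logistic Regression": 86,
    "SVM": 90,
    "KNN": 78,
    "Random Forest": 72,
}
n_test = 136  # size of the test split

# Accuracy is the share of correctly classified test samples
accuracies = {name: count / n_test for name, count in correct_counts.items()}
best_name = max(accuracies, key=accuracies.get)

for name, acc in accuracies.items():
    print(f"{name}: {acc:.2%}")
print("best:", best_name)
```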