gbganalyst/Heart-disease-paper

Performance analysis of supervised classification models on heart disease prediction

Abstract

This paper presents a predictive analysis of data on heart disease patients to identify possible risk factors associated with their heart disease status. Two independent but similar published heart disease datasets were analyzed: the Cleveland data, used to build the classification models, and the Statlog data, used to validate the results. A detailed exploratory analysis using the Chi-square test of independence was performed on the Cleveland data, after which ten standard classification models were trained for class prediction. The models were built by randomly partitioning the Cleveland data into $208$ (70%) training samples and $89$ (30%) test samples, repeated over $200$ replications. Preliminary results showed that some of the bio-clinical categorical variables are strongly associated with the patients' heart disease status (p < 0.001). The classification results on the test samples indicated that the support vector machine yielded the best predictive performance, with $85$% accuracy, $82$% sensitivity, $88$% specificity, $87$% precision, $91$% area under the ROC curve, and a log loss value of $38$%. These results were validated on the Statlog data using tenfold cross-validation, and the validation results were all consistent with those obtained from the Cleveland dataset.
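The evaluation protocol described in the abstract (repeated random 70/30 partitions, scored by accuracy, sensitivity, specificity, precision, area under the ROC curve, and log loss) can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the code used for the paper: the function names `split_indices` and `classification_metrics` are hypothetical, the 0.5 decision threshold and the probability clipping constant are assumptions, and no classifier or dataset is included.

```python
import math
import random


def split_indices(n_samples, train_fraction, rng):
    """Randomly partition range(n_samples) into train and test index lists."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_train = round(n_samples * train_fraction)
    return indices[:n_train], indices[n_train:]


def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def classification_metrics(y_true, y_prob, threshold=0.5):
    """Score binary labels (0/1) against predicted probabilities of class 1."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        predicted = 1 if p >= threshold else 0  # assumed 0.5 cut-off by default
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1

    # Log loss: clip probabilities (assumed eps) so log(0) never occurs.
    eps = 1e-15
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)

    # AUC as the fraction of (positive, negative) pairs ranked correctly;
    # tied pairs count as half.
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)

    return {
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "precision": _ratio(tp, tp + fp),
        "auc": _ratio(wins, len(pos) * len(neg)),
        "log_loss": _ratio(-total, len(y_true)),
    }


# 200 random partitions of 297 samples into 208 training and 89 test indices.
rng = random.Random(0)  # arbitrary seed, chosen only for reproducibility
splits = [split_indices(297, 0.70, rng) for _ in range(200)]
```

In a full analysis, each of the 200 partitions would be used to fit a classifier on the training indices and to score its predicted probabilities on the test indices with `classification_metrics`, and the metrics would then be averaged across replications.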