In [1]:
# This R environment comes with many helpful analytics packages installed
# It is defined by the kaggle/rstats Docker image: https://github.com/kaggle/docker-rstats
# For example, here's a helpful package to load

library(tidyverse) # metapackage of all tidyverse packages

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

list.files(path = "../input")

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [2]:
dataset <- read_csv("../input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv")

Print the column names for future reference.

In [3]:
colnames(dataset)

Check the datatype of the entries. This is to make sure we don't have any strings or categorical data stored as words.

In [7]:
library(dplyr)
dataset[1,] %>%
   select(where(~ all(varhandle::check.numeric(.)))) 
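The same check can be sketched with base R only, by asking for the class of every column. The data frame `toy` and its values are hypothetical stand-ins for the real dataset.

```r
# Toy data frame standing in for the real dataset (hypothetical values)
toy <- data.frame(age = c(75, 55, 65),
                  sex = c(1, 0, 1),
                  note = c("a", "b", "c"),
                  stringsAsFactors = FALSE)

# Class of every column, as a named character vector
col_classes <- sapply(toy, class)
print(col_classes)

# Names of any columns that are not numeric
non_numeric <- names(toy)[!sapply(toy, is.numeric)]
print(non_numeric)
```

An empty `non_numeric` would mean every column can go straight into `cov`, `svd` and `scale`.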

Check the covariance matrix to see whether any columns are useless. It doesn't look like it.

In [8]:
cov(dataset)
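Covariance depends on the units of each column, so large entries may only reflect large scales. As a rough illustration on made-up data (the columns `x`, `y`, `z` are hypothetical), the correlation matrix from `cor` gives a scale-free view of the same relationships:

```r
set.seed(1)
toy <- data.frame(x = rnorm(50))
toy$y <- 1000 * toy$x + rnorm(50)   # same signal as x, on a much larger scale
toy$z <- rnorm(50)                  # unrelated noise

cov_mat <- cov(toy)   # entries grow with the scale of the columns
cor_mat <- cor(toy)   # entries are always between -1 and 1
print(round(cov_mat, 2))
print(round(cor_mat, 2))
```

A pair of columns with correlation close to 1 or -1 is a candidate for redundancy, whatever the size of their covariance.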

The singular values (`d`) returned by `svd` are also a good metric for spotting columns that could be eliminated. It looks like everything is pretty strongly correlated.

In [9]:
svd_dataset <- svd(dataset)
svd_dataset$d
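To illustrate how singular values can flag a redundant column, here is a small sketch on a made-up matrix `m` (hypothetical data) whose third column is an exact linear combination of the first two. Singular values also depend on column scale, so in practice the comparison is usually more meaningful on scaled data.

```r
set.seed(1)
m <- cbind(a = rnorm(20), b = rnorm(20))
m <- cbind(m, c = m[, "a"] + 2 * m[, "b"])  # c is a linear combination of a and b

d <- svd(m)$d
print(d)

# The smallest singular value is numerically zero relative to the largest
print(d[3] / d[1] < 1e-10)
```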

Split the data into a training set (80%) and a test set (20%), using DEATH_EVENT as the label.

In [11]:
library(caTools)
set.seed(123)
split = sample.split(dataset$DEATH_EVENT, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
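The idea of a split that preserves the class ratio can be sketched in base R. This is not the `caTools` implementation, only an illustration on a hypothetical label vector `y`; it samples 80% of the rows within each class and then compares the class proportions on both sides.

```r
set.seed(123)
# Hypothetical labels: 30 events out of 100 rows
y <- rep(c(0, 1), times = c(70, 30))

# Draw 80% of the row indices within each class
train_idx <- unlist(lapply(split(seq_along(y), y), function(idx) {
  idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
}))
is_train <- seq_along(y) %in% train_idx

# Class proportions should match between the two subsets
print(prop.table(table(y[is_train])))
print(prop.table(table(y[!is_train])))
```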

Scale the features so that the weights neither blow up nor vanish during training.

In [12]:
training_set[,-c(13)] = scale(training_set[,-c(13)])
test_set[,-c(13)] = scale(test_set[,-c(13)])
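Note that the cell above scales the test set with its own means and standard deviations. A common alternative is to reuse the training set's parameters, so that both sets live on exactly the same scale. A minimal sketch on hypothetical matrices `train_x` and `test_x`, relying on the `scaled:center` and `scaled:scale` attributes that `scale` attaches to its result:

```r
# Hypothetical feature matrices standing in for the real train/test features
train_x <- matrix(c(1, 2, 3, 4, 10, 20, 30, 40), ncol = 2)
test_x  <- matrix(c(2, 3, 15, 25), ncol = 2)

train_scaled <- scale(train_x)

# Reuse the training means and standard deviations on the test features
test_scaled <- scale(test_x,
                     center = attr(train_scaled, "scaled:center"),
                     scale  = attr(train_scaled, "scaled:scale"))
print(test_scaled)
```

Applied to this notebook, the scaled training features would have to be stored before `training_set` is overwritten. The accuracies reported further down were obtained with the original scaling and would likely change.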

Verify that the training set looks as expected after scaling.

In [14]:
training_set[1:5,]

Train an ANN model to predict DEATH_EVENT.

In [42]:
library(h2o)
h2o.init(nthreads = -1)
model = h2o.deeplearning(y = 'DEATH_EVENT',
                         training_frame = as.h2o(training_set),
                         activation = 'Rectifier',
                         hidden = c(5,5),
                         epochs = 100,
                         train_samples_per_iteration = -2)


Prediction time. The model output is thresholded at 0.5 to get a class label.

In [29]:
y_pred = h2o.predict(model, newdata = as.h2o(test_set[-c(13)]))
y_pred = (y_pred > 0.5)
y_pred = as.vector(y_pred)

Verify that the predictions and the true labels have the same length.

In [30]:
print(length(y_pred))
print(length(t(test_set[,13])))

Print the confusion matrix.
The results are not too good.

In [31]:
cm = table(t(test_set[, 13]), y_pred)
print(cm)
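The counts in a confusion matrix can be reduced to summary metrics by hand. The sketch below uses made-up labels and predictions (not this notebook's results) and treats class 1 as the positive class, which is an assumption of the example.

```r
# Hypothetical labels and predictions (not the notebook's results)
truth <- c(0, 0, 0, 0, 1, 1, 1, 0, 1, 0)
pred  <- c(0, 0, 1, 0, 1, 0, 1, 0, 1, 0)

# Rows are the true classes, columns the predicted classes
cm <- table(truth = factor(truth, levels = c(0, 1)),
            pred  = factor(pred,  levels = c(0, 1)))
print(cm)

accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["1", "1"] / sum(cm["1", ])  # true positive rate for class 1
specificity <- cm["0", "0"] / sum(cm["0", ])  # true negative rate for class 0
print(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity))
```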

To get additional metrics we use caret.
76% accuracy in prediction is not very good.

In [34]:
library(caret)
# caret expects the predictions first (data), then the true labels (reference)
confusionMatrix(data = as.factor(y_pred), reference = as.factor(t(test_set[, 13])))

Let's look at the training set results. Accuracy looks like a healthy 86%, so the model does not appear to be overfitting.

In [43]:
y_pred = h2o.predict(model, newdata = as.h2o(training_set[-c(13)]))
y_pred = (y_pred > 0.5)
y_pred = as.vector(y_pred)
# Predictions first (data), then the true labels (reference)
confusionMatrix(data = as.factor(y_pred), reference = as.factor(t(training_set[, 13])))

For completeness, shut down the H2O cluster to free memory and avoid leaving unnecessary processes running.

In [32]:
h2o.shutdown()

Let's try an SVM.

In [36]:

#install.packages('e1071')
library(e1071)

model = svm(formula = DEATH_EVENT ~ .,
            data = training_set,
            type = 'C-classification',
            kernel = 'linear')
summary(model)

Make predictions on the test set with the SVM model.

In [39]:
svm_pred <- predict(model, newdata = test_set[,-c(13)])

Print the confusion matrix of the predictions. It looks like the SVM performed better than the ANN.

In [40]:
# Predictions first (data), then the true labels (reference)
confusionMatrix(data = as.factor(svm_pred), reference = as.factor(t(test_set[, 13])))

Let's look at training set performance. Accuracy looks like a healthy 85%. Overfitting doesn't look likely.

In [41]:
svm_train_pred <- predict(model, newdata = training_set[,-c(13)])
# Predictions first (data), then the true labels (reference)
confusionMatrix(data = as.factor(svm_train_pred), reference = as.factor(t(training_set[, 13])))