# INTRODUCTION

In this kernel we'll build four classification models to predict the HeartDisease outcome class:

* Logistic Regression
* K-Nearest Neighbors
* Decision Tree
* Random Forest


I'll be using the tidymodels package with default model parameters.

# DATA LOADING AND PREPROCESSING

In this step, we'll:
1. Load the required libraries
2. Load the data
3. Display first few rows of the data
4. Check for data types
5. Check for missing values
6. Descriptive statistics for numeric values
7. Count unique values for all categorical variables

In [None]:
# Load required libraries
library(readr)
library(dplyr)
library(ggplot2)
library(GGally)
library(scales)
library(tidyverse)
library(gridExtra)
library(tidymodels)
library(rsample)
library(ggcorrplot)
library(recipes)
library(parsnip)
library(tune)
library(yardstick)
library(psych)
library(caret)

In [None]:
# Load the dataset
heart_data <- read_csv("/kaggle/input/heart-failure-prediction/heart.csv")
head(heart_data)

In [None]:
# Check data types
heart_data %>% glimpse()

In [None]:
#Check for missing values
heart_data %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

In [None]:
# Descriptive statistics for numeric values
heart_data %>%
  select(where(is.numeric)) %>%
  describe()

In [None]:
# Count unique values for categorical variables
heart_data %>%
  summarise(across(where(is.character), n_distinct))

In this first step, we loaded the required R libraries and the dataset. The dataset has 918 rows and 12 variables: 5 categorical and 7 numeric. There are no missing values, so no imputation is needed. The first few rows and the descriptive statistics give an initial view of how the numeric variables are distributed, and the unique-value counts show how many categories each categorical variable has.
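The missing-value and unique-value checks can also be sketched in base R. The snippet below is only an illustration on a small made-up data frame (`toy` and its values are assumptions, not rows from the dataset). Unlike `n_distinct()`, which counts `NA` as a value by default, this sketch drops missing values before counting.

```r
# Toy data frame standing in for heart_data (values are made up)
toy <- data.frame(
  Age = c(40, 49, NA, 54),
  Sex = c("M", "F", "M", "M"),
  ChestPainType = c("ATA", "NAP", "ATA", NA),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Number of distinct non-missing values in each character column
is_chr <- vapply(toy, is.character, logical(1))
n_unique <- vapply(toy[is_chr],
                   function(x) length(unique(na.omit(x))),
                   integer(1))
print(n_unique)
```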

# EXPLORATORY DATA ANALYSIS

In this step, we'll do:
1. Distribution of numeric values
2. Counts of categorical values
3. Check for outliers (with box plots)
4. Correlation plot
5. Distribution of the target variable

In [None]:
# Create distribution plots for numeric variables
options(repr.plot.width=12, repr.plot.height=7)

heart_data %>%
  select(where(is.numeric)) %>%
  select(-HeartDisease) %>%
  pivot_longer(cols = everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value, fill = variable)) +
  facet_wrap(~ variable, scales = "free") +
  geom_histogram(bins = 30, alpha = 0.7) +
  labs(title = "Distribution Plots for Numeric Variables") +
  scale_fill_viridis_d() +
  theme_minimal() +
  guides(fill = "none")

In [None]:
# Create plots for categorical variables
heart_data %>%
  select(where(is.character)) %>%
  pivot_longer(cols = everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value, fill = variable)) +
  facet_wrap(~ variable, scales = "free") +
  geom_bar() +
  labs(title = "Distribution Plots for Categorical Variables") +
  scale_fill_viridis_d() +
  theme_minimal() +
  guides(fill = "none")

In [None]:
# Create box plots for numeric variables by the 'HeartDisease' category
heart_data %>%
  mutate(HeartDisease = factor(HeartDisease)) %>%
  select(HeartDisease, where(is.numeric)) %>%
  pivot_longer(cols = where(is.numeric), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = HeartDisease, y = value, fill = HeartDisease)) +
  geom_boxplot() +
  facet_wrap(~ variable, scales = "free") +
  labs(x = "Heart Disease", y = "Value", title = "Box Plots for Numeric Variables by Heart Disease") +
  scale_fill_viridis_d() +
  theme_minimal()
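Points drawn beyond the whiskers in box plots like these follow the usual 1.5 × IQR convention. As a rough sketch, the same rule can be applied directly to a numeric vector. The helper `flag_outliers()` and the `chol` values below are hypothetical and only serve to illustrate the rule.

```r
# Flag values that fall more than k * IQR outside the first or third quartile
flag_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  x < q[1] - k * iqr | x > q[2] + k * iqr
}

# Made-up cholesterol-like values; 0 and 603 are deliberately extreme
chol <- c(0, 180, 195, 210, 223, 237, 250, 268, 289, 603)
outlier_flags <- flag_outliers(chol)

# Values that would be drawn as individual points beyond the whiskers
print(chol[outlier_flags])
```

Whether flagged values are data-entry errors or genuine clinical extremes still has to be judged variable by variable.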

In [None]:
# Find correlated variables
heart_data %>%
  select(where(is.numeric)) %>%
  cor()

# Correlation plot
heart_data %>%
  select(where(is.numeric)) %>%
  ggcorr(label = TRUE, label_size = 3)
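To check programmatically whether any pair of numeric variables is highly correlated, the correlation matrix can be filtered by an absolute threshold. The sketch below uses made-up data (`toy_num`) and a threshold of 0.8 as assumptions. It illustrates the idea only and is not how any particular package implements its filter.

```r
# Made-up numeric data: b is nearly a copy of a, c is unrelated noise
set.seed(1)
a <- rnorm(200)
toy_num <- data.frame(a = a, b = a + rnorm(200, sd = 0.1), c = rnorm(200))

cor_mat <- cor(toy_num)

# List each variable pair once (upper triangle) whose |r| exceeds the threshold
threshold <- 0.8
high <- which(abs(cor_mat) > threshold & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r = cor_mat[high]
)
print(high_pairs)
```

An empty result would mean that no pair exceeds the threshold.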

In [None]:
# Create distribution of HeartDisease
options(repr.plot.width=7, repr.plot.height=7)

ggplot(heart_data, aes(x = factor(HeartDisease), fill = factor(HeartDisease))) +
  geom_bar() +
  labs(x = "HeartDisease", y = "Count", fill = "HeartDisease", title = "Distribution of HeartDisease") +
  scale_fill_viridis_d() +
  theme_minimal()

**Distribution of Numeric Values:** The histograms indicate that 'Age' and 'MaxHR' are approximately normally distributed, while the other numeric variables are skewed.

**Counts of Categorical Values:** The bar charts for categorical data reveal that certain categories dominate, like the 'ASY' (asymptomatic) type of chest pain and a higher prevalence of male patients. These imbalances may influence the model's performance and should be considered in analysis and model training.

**Outlier Detection:** The box plots show that most of the numerical variables have outliers. These outliers may represent atypical cases or errors in data entry. However, they could also indicate true extremes that are clinically significant.

**Correlation Plot:** The correlation heatmap suggests some variables have a moderate positive or negative correlation with the occurrence of heart disease. For instance, 'Oldpeak' shows a positive correlation with the presence of heart disease, whereas 'MaxHR' appears to have a negative correlation. No pair of variables is highly correlated, so none need to be removed in the feature engineering step.

**Distribution of the Target Variable ('HeartDisease'):** The final bar chart shows a slight imbalance in the target variable, with somewhat more patients having heart disease than not. The class distribution should therefore be kept in mind when training predictive models so that they are not biased towards the majority class.

By interpreting these insights, we can make informed decisions on data preprocessing, such as outlier treatment and feature selection, which will ultimately contribute to building a more accurate heart failure prediction model.

# DATA PREPARATION

In this step we'll:
1. Convert the outcome variable to a factor type
2. Split the data into training and test sets
3. Normalize the data
4. Create dummy variables

In [None]:
# Convert outcome variable to factor and check the order
heart_data$HeartDisease <- as.factor(heart_data$HeartDisease)
levels(heart_data[['HeartDisease']])

In [None]:
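# Split data into training and test sets, stratified by the outcome.
# No seed is set, so the split and the metrics below vary between runs;
# call set.seed() first for a reproducible split.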
heart_split <- initial_split(heart_data,
                            prop = 0.7,
                            strata = HeartDisease)

heart_training <- heart_split %>% training()
heart_test <- heart_split %>% testing()

nrow(heart_training)
nrow(heart_test)
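The idea behind a stratified split is to sample the training fraction separately within each outcome class, so both parts keep the same class shares. The base R sketch below shows this on made-up data. The data frame `toy`, the seed and the 60/40 class mix are assumptions, and this is not a reproduction of how `initial_split()` is implemented.

```r
set.seed(42)  # arbitrary seed so the sketch is repeatable

# Made-up outcome with 60 positives and 40 negatives
toy <- data.frame(id = 1:100, y = factor(rep(c(1, 0), times = c(60, 40))))

# Sample 70% of the row indices within each outcome class
train_idx <- unlist(lapply(split(seq_len(nrow(toy)), toy$y), function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}))

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Class shares in the training and test parts
print(prop.table(table(toy_train$y)))
print(prop.table(table(toy_test$y)))
```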

In [None]:
# Remove highly correlated numeric predictors, normalize the numeric predictors,
# and create dummy variables for the nominal predictors
heart_recipe <- recipe(HeartDisease ~ .,
                      data = heart_training) %>%
    step_corr(all_numeric_predictors(), threshold = 0.8) %>%
    step_normalize(all_numeric_predictors()) %>%
    step_dummy(all_nominal_predictors())

heart_recipe

In [None]:
# Apply transformation recipe to training and test sets
heart_recipe_prep <- heart_recipe %>%
    prep(training = heart_training)

heart_training_prep <- heart_recipe_prep %>%
    bake(new_data = NULL)

heart_test_prep <- heart_recipe_prep %>%
    bake(new_data = heart_test)

head(heart_training_prep)

Now the data is ready for modeling.
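The reason for preparing the recipe on the training set and then applying it to both sets is that the centering and scaling statistics should come from the training data only. A minimal base R sketch of that idea follows, together with a simple dummy encoding. All values and names (`train_x`, `test_x`, `pain`) are made up for illustration.

```r
# Made-up training and test values for one numeric predictor
train_x <- c(120, 130, 140, 150, 160)
test_x  <- c(125, 170)

# Estimate the centre and scale on the training data only
mu    <- mean(train_x)
sigma <- sd(train_x)

# Apply the same training statistics to both sets
train_z <- (train_x - mu) / sigma
test_z  <- (test_x - mu) / sigma
print(round(train_z, 3))
print(round(test_z, 3))

# Dummy encoding of a made-up nominal predictor: one 0/1 column per
# non-reference level (the first level, "ASY", is the reference)
pain <- factor(c("ASY", "ATA", "NAP", "ASY"))
dummies <- model.matrix(~ pain)[, -1, drop = FALSE]
print(dummies)
```

The test values are scaled with the training mean and standard deviation, so no information from the test set leaks into preprocessing.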

# MODELS TRAINING AND RESULTS

Now, we'll create 4 predictive models with default parameters.

**Logistic Regression Model**

In [None]:
# Logistic regression model setup
glm_model <- logistic_reg() %>%
    set_engine('glm') %>%
    set_mode('classification')

glm_fit <- glm_model %>%
    fit(HeartDisease ~.,
       data = heart_training_prep)

glm_fit

In [None]:
# Logistic regression model predictions on test data
glm_class_pred <- predict(glm_fit,
                         new_data = heart_test_prep,
                         type = 'class')

glm_prob_pred <- predict(glm_fit,
                         new_data = heart_test_prep,
                         type = 'prob')

glm_results <- heart_test %>%
    select(HeartDisease) %>%
    bind_cols(glm_class_pred, glm_prob_pred)

# Performance metrics for logistic regression model
conf_mat(glm_results,
        truth = HeartDisease,
        estimate = .pred_class)

glm_metric_spec <-
    metric_set(accuracy, sens, spec, roc_auc)

glm_metrics <- glm_metric_spec(glm_results,
                              truth = HeartDisease,
                              estimate = .pred_class, .pred_0)
glm_metrics

# Logistic regression model ROC Curve
glm_results %>%
    roc_curve(truth = HeartDisease, .pred_0) %>%
    autoplot()

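The Logistic Regression Model showed the following on the test set. Note that yardstick treats the first factor level (0, no heart disease) as the event by default, so sensitivity here measures how well class 0 is identified: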
* accuracy of 0.8478261
* sensitivity of 0.8536585
* specificity of 0.8431373
* roc_auc of 0.9207716
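Accuracy, sensitivity and specificity all follow from the four cells of the confusion matrix. The sketch below works through the arithmetic with counts chosen to be consistent with the metrics listed above. The counts are an assumption and are not copied from the `conf_mat()` output. The event is taken to be the first factor level, 0, which is yardstick's default.

```r
# Illustrative counts consistent with the reported metrics (an assumption)
tp <- 105  # truth 0, predicted 0
fn <- 18   # truth 0, predicted 1
tn <- 129  # truth 1, predicted 1
fp <- 24   # truth 1, predicted 0

accuracy    <- (tp + tn) / (tp + fn + tn + fp)  # share of all correct predictions
sensitivity <- tp / (tp + fn)                   # share of event cases found
specificity <- tn / (tn + fp)                   # share of non-event cases found

print(round(c(accuracy = accuracy,
              sensitivity = sensitivity,
              specificity = specificity), 7))
```

If heart disease (level 1) should be the event instead, the roles of sensitivity and specificity swap.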

**K-Nearest Neighbors Model**

In [None]:
# K-nearest neighbors model
knn_model <- nearest_neighbor() %>%
    set_engine('kknn') %>%
    set_mode('classification')

knn_fit <- knn_model %>%
    fit(HeartDisease~.,
       data = heart_training_prep)

knn_fit

In [None]:
# K-nearest neighbors model predictions on test data
knn_class_pred <- predict(knn_fit,
                         new_data = heart_test_prep,
                         type = 'class')

knn_prob_pred <- predict(knn_fit,
                         new_data = heart_test_prep,
                         type = 'prob')

knn_results <- heart_test %>%
    select(HeartDisease) %>%
    bind_cols(knn_class_pred, knn_prob_pred)

# Performance metrics for K-nearest neighbors model
conf_mat(knn_results,
        truth = HeartDisease,
        estimate = .pred_class)

knn_metric_spec <-
    metric_set(accuracy, sens, spec, roc_auc)

knn_metrics <- knn_metric_spec(knn_results,
                              truth = HeartDisease,
                              estimate = .pred_class, .pred_0)
knn_metrics

# K-nearest neighbors model ROC Curve
knn_results %>%
    roc_curve(truth = HeartDisease, .pred_0) %>%
    autoplot()

The K-Nearest Neighbors Model showed:
* accuracy of 0.8550725
* sensitivity of 0.8536585
* specificity of 0.8562092
* roc_auc of 0.9038737

**Decision Tree Model**

In [None]:
# Decision tree model
dt_model <- decision_tree() %>%
    set_engine('rpart') %>%
    set_mode('classification')

dt_fit <- dt_model %>%
    fit(HeartDisease~.,
       data = heart_training_prep)

dt_fit

In [None]:
# Decision tree model predictions on test data
dt_class_pred <- predict(dt_fit,
                         new_data = heart_test_prep,
                         type = 'class')

dt_prob_pred <- predict(dt_fit,
                         new_data = heart_test_prep,
                         type = 'prob')

dt_results <- heart_test %>%
    select(HeartDisease) %>%
    bind_cols(dt_class_pred, dt_prob_pred)


# Performance metrics for decision tree model
conf_mat(dt_results,
        truth = HeartDisease,
        estimate = .pred_class)

dt_metric_spec <-
    metric_set(accuracy, sens, spec, roc_auc)

dt_metrics <- dt_metric_spec(dt_results,
                              truth = HeartDisease,
                              estimate = .pred_class, .pred_0)
dt_metrics

# Decision tree model ROC Curve
dt_results %>%
    roc_curve(truth = HeartDisease, .pred_0) %>%
    autoplot()

The Decision Tree Model showed:

* accuracy of 0.8514493
* sensitivity of 0.8211382
* specificity of 0.8758170
* roc_auc of 0.8815824

**Random Forest Model**

In [None]:
# Random forest model
rf_model <- rand_forest() %>%
    set_engine('ranger') %>%
    set_mode('classification')

rf_fit <- rf_model %>%
    fit(HeartDisease~.,
       data = heart_training_prep)

rf_fit

In [None]:
# Random forest model predictions on test data
rf_class_pred <- predict(rf_fit,
                         new_data = heart_test_prep,
                         type = 'class')

rf_prob_pred <- predict(rf_fit,
                         new_data = heart_test_prep,
                         type = 'prob')

rf_results <- heart_test %>%
    select(HeartDisease) %>%
    bind_cols(rf_class_pred, rf_prob_pred)

# Performance metrics for random forest model
conf_mat(rf_results,
        truth = HeartDisease,
        estimate = .pred_class)

rf_metric_spec <-
    metric_set(accuracy, sens, spec, roc_auc)

rf_metrics <- rf_metric_spec(rf_results,
                              truth = HeartDisease,
                              estimate = .pred_class, .pred_0)
rf_metrics

# Random forest model ROC Curve
rf_results %>%
    roc_curve(truth = HeartDisease, .pred_0) %>%
    autoplot()

The Random Forest Model showed:

* accuracy of 0.8659420
* sensitivity of 0.8373984
* specificity of 0.8888889
* roc_auc of 0.9267761
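The roc_auc values reported for the four models can be read as the probability that a randomly chosen event case receives a higher predicted event probability than a randomly chosen non-event case. A small sketch of that interpretation follows. The `score` and `truth` vectors are made up, and ties are counted as one half.

```r
# Made-up predicted event probabilities and true labels (1 = event)
score <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
truth <- c(1,   1,   0,   1,   0,    1,   0,   0)

pos <- score[truth == 1]
neg <- score[truth == 0]

# Compare every event score with every non-event score
cmp <- outer(pos, neg, "-")

# Share of pairs where the event case is ranked higher (ties count as 0.5)
auc <- mean((cmp > 0) + 0.5 * (cmp == 0))
print(auc)
```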

# MODELS PERFORMANCE COMPARISON

In [None]:
# Model results comparison
combined_metrics <- bind_cols(
    select(glm_metrics, .metric),
    select(glm_metrics, .estimate) %>% rename(glm_results = .estimate),
    rename(select(knn_metrics, .estimate), knn_results = .estimate),
    rename(select(dt_metrics, .estimate), dt_results = .estimate),
    rename(select(rf_metrics, .estimate), rf_results = .estimate)
)

combined_metrics

In [None]:
# Reshape data from wide to long format
combined_metrics_long <- pivot_longer(combined_metrics,
                                      cols = c(glm_results, knn_results, dt_results, rf_results),
                                      names_to = "model", values_to = "value")

options(repr.plot.width=12, repr.plot.height=10)

# Create faceted bar plots
ggplot(combined_metrics_long, aes(x = model, y = value, fill = model)) +
    geom_bar(stat = "identity", position = "dodge") +
    facet_wrap(~.metric, scales = "free_y") +
    labs(x = "Model", y = "Value", fill = "Model") +
    scale_fill_viridis_d() +
    theme_minimal() +
    guides(fill = "none")

# CONCLUSION

The goal of this kernel was to determine which models perform best on the 'Heart Failure' dataset.

Overall, the Random Forest model showed the best results in terms of accuracy, specificity and ROC AUC.

The Logistic Regression and KNN models tied for the highest sensitivity.
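The comparison can be double-checked with a few lines of base R that pick, for each metric, every model reaching the maximum. The numbers are the ones reported in the model sections above. The data frame `metrics` and the short model labels are just an illustrative layout.

```r
# Metrics as reported in the model sections above
metrics <- data.frame(
  model       = c("glm", "knn", "dt", "rf"),
  accuracy    = c(0.8478261, 0.8550725, 0.8514493, 0.8659420),
  sensitivity = c(0.8536585, 0.8536585, 0.8211382, 0.8373984),
  specificity = c(0.8431373, 0.8562092, 0.8758170, 0.8888889),
  roc_auc     = c(0.9207716, 0.9038737, 0.8815824, 0.9267761)
)

# For each metric, list every model that reaches the maximum (ties included)
best <- lapply(metrics[-1], function(v) metrics$model[v == max(v)])
print(best)
```

Because the split is random, a different run may reorder models whose metrics are this close.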
