# **STROKE PREDICTION MODEL**

### **Introduction:**

According to the World Health Organization (WHO), stroke is the second leading cause of death worldwide and is responsible for approximately 11% of total deaths. It is therefore important to know which risk factors help predict a stroke. We use this dataset to answer a classification question: is a patient likely to have a stroke, given input parameters such as age, hypertension, heart disease, and average glucose level? In the dataset, "healthcare-dataset-stroke-data.csv", each row is the record of one patient and each column represents a particular variable.

### **Literature Review**

1. Hypertension and stroke

Hypertension is the single most important risk factor for all types of stroke: ischemic stroke, intracerebral hemorrhage, and aneurysmal subarachnoid hemorrhage. Epidemiologic studies over the past 30 years have demonstrated a dramatic reduction in the incidence and mortality of all stroke types with good control of hypertension, and it appears that all effective antihypertensive agents have similar efficacy in their ability to reduce stroke risk (Dubow & Fink, 2011).

2. Age and stroke

Patients with stroke recurrence (SR) were biologically older than those without SR, and biological age (B-age) was independently associated with a high risk of developing SR (Soriano-Tárraga et al., 2021).

3. Reasons for choosing the predictors (based on the graphs we made)

Age: The age histogram suggests that age plays a major role in whether a person is likely to have a stroke. People above the age of 60 are much more likely to have a stroke than people below 60.

Hypertension: The histogram shows that the proportion of people who have had a stroke is much higher among those who already have hypertension.

Heart disease: The histogram shows that people who already have heart disease are much more likely to have a stroke than those who do not.

We did not choose marital status, work type, or residence type, as we believe that these variables have no connection with having a stroke.


### **Preliminary exploratory data analysis:**

In [6]:
library(tidyverse)
library(testthat)
library(digest)
library(repr)
library(tidymodels)
library(GGally)
library(ISLR)
options(repr.matrix.max.rows = 6)

In [7]:
set.seed(1234)

url <- "https://raw.githubusercontent.com/abhinavkansal08/DSCI_100_Project/main/healthcare-dataset-stroke-data.csv"

stroke_data <- read_csv(url)

head(stroke_data)

Parsed with column specification:
cols(
  id = col_double(),
  gender = col_character(),
  age = col_double(),
  hypertension = col_double(),
  heart_disease = col_double(),
  ever_married = col_character(),
  work_type = col_character(),
  Residence_type = col_character(),
  avg_glucose_level = col_double(),
  bmi = col_character(),
  smoking_status = col_character(),
  stroke = col_double()
)

“1 parsing failure.
 row col   expected    actual                                                                                                         file
1904  -- 12 columns 8 columns 'https://raw.githubusercontent.com/abhinavkansal08/DSCI_100_Project/main/healthcare-dataset-stroke-data.csv'
”


id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<dbl>,<chr>,<chr>,<dbl>
9046,Male,67,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
51676,Female,61,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
31112,Male,80,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
60182,Female,49,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
1665,Female,79,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1
56669,Male,81,0,0,Yes,Private,Urban,186.21,29.0,formerly smoked,1


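As we can see from above, the data is already in a tidy format: each row is one patient and each column is one variable. Two things need attention before modelling, however. First, `bmi` was parsed as character because its missing values are coded as the string "N/A". Second, one row (row 1904) has only 8 of the 12 columns, which is the parsing failure reported above.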
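
The `bmi` column is read as character because missing values are written as "N/A". The following is a minimal sketch of one way to handle this at read time. It uses a small made-up sample rather than the real file, and base R's `read.csv` so that it runs on its own; the toy values are assumptions for illustration only.

```r
# Toy rows mimicking part of the stroke data (values are illustrative).
toy_csv <- "id,bmi,stroke
9046,36.6,1
51676,N/A,1
31112,32.5,1"

# Treat the string "N/A" as missing at read time so bmi is parsed as numeric.
toy <- read.csv(text = toy_csv, na.strings = c("", "NA", "N/A"))

str(toy$bmi)                  # numeric vector with one NA
sum(is.na(toy$bmi))           # number of missing bmi values
mean(toy$bmi, na.rm = TRUE)   # mean over the non-missing values
```

With `read_csv`, the `na` argument plays the same role as `na.strings` does here.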

In [9]:
stroke_data <- stroke_data %>%
  mutate(stroke = as_factor(stroke))
glimpse(stroke_data)

Rows: 1,904
Columns: 12
$ id                <dbl> 9046, 51676, 31112, 60182, 1665, 56669, 53882, 1043…
$ gender            <chr> "Male", "Female", "Male", "Female", "Female", "Male…
$ age               <dbl> 67, 61, 80, 49, 79, 81, 74, 69, 59, 78, 81, 61, 54,…
$ hypertension      <dbl> 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, …
$ heart_disease     <dbl> 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, …
$ ever_married      <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "N…
$ work_type         <chr> "Private", "Self-employed", "Private", "Private", "…
$ Residence_type    <chr> "Urban", "Rural", "Rural", "Urban", "Rural", "Urban…
$ avg_glucose_level <dbl> 228.69, 202.21, 105.92, 171.23, 174.12, 186.21, 70.…
$ bmi               <chr> "36.6", "N/A", "32.5", "34.4", "24", "29", "27.4"

In [10]:
stroke_data %>%
  pull(stroke) %>%
  levels()

In [11]:
# Drop rows with a missing outcome, such as the incomplete row flagged by the parsing failure.
stroke_data <- stroke_data %>%
  filter(!is.na(stroke))

In [12]:
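print("Number of observations and percentage in stroke")
num_obs <- nrow(stroke_data)
stroke_data %>%
  group_by(stroke) %>%
  summarize(
    count = n(),
    percentage = n() / num_obs * 100)

print("Means of the predictor variables used in our analysis")
stroke_data %>%
  summarize(across(age:heart_disease, ~ mean(.x, na.rm = TRUE)))

# bmi was read as character ("N/A" marks missing values), so its mean
# comes back as NA with a warning until the column is converted to numeric.
stroke_data %>%
  summarize(across(avg_glucose_level:bmi, ~ mean(.x, na.rm = TRUE)))

print("Missing data")
# Counts true NA values only; the "N/A" strings in bmi are not counted here.
stroke_data %>%
  summarize(across(everything(), ~ sum(is.na(.x))))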

[1] "Number of observations and percentage in stroke"


`summarise()` ungrouping output (override with `.groups` argument)



stroke,count,percentage
<fct>,<int>,<dbl>
0,1654,86.9154
1,249,13.0846


[1] "Means of the predictor variables used in our analysis"


age,hypertension,heart_disease
<dbl>,<dbl>,<dbl>
45.62642,0.113505,0.06673673


“argument is not numeric or logical: returning NA”


avg_glucose_level,bmi
<dbl>,<dbl>
109.3888,


[1] "Missing data"


id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
0,0,0,0,0,0,0,0,0,0,0,0


In [13]:
stroke_split <- initial_split(stroke_data, prop = 0.75, strata = stroke)
stroke_training <- training(stroke_split)
stroke_testing <- testing(stroke_split)

In [14]:
glimpse(stroke_training)

Rows: 1,428
Columns: 12
$ id                <dbl> 31112, 60182, 1665, 56669, 53882, 10434, 27419, 604…
$ gender            <chr> "Male", "Female", "Female", "Male", "Male", "Female…
$ age               <dbl> 80, 49, 79, 81, 74, 69, 59, 78, 81, 61, 78, 79, 50,…
$ hypertension      <dbl> 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, …
$ heart_disease     <dbl> 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, …
$ ever_married      <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "Ye…
$ work_type         <chr> "Private", "Private", "Self-employed", "Private", "…
$ Residence_type    <chr> "Rural", "Urban", "Rural", "Urban", "Rural", "Urban…
$ avg_glucose_level <dbl> 105.92, 171.23, 174.12, 186.21, 70.09, 94.39, 76.15…
$ bmi               <chr> "32.5", "34.4", "24", "29", "27.4", "22.8", "N/A"

In [15]:
glimpse(stroke_testing)

Rows: 475
Columns: 12
$ id                <dbl> 9046, 51676, 12175, 34120, 25226, 68794, 4219, 5482…
$ gender            <chr> "Male", "Female", "Female", "Male", "Male", "Female…
$ age               <dbl> 67, 61, 54, 75, 57, 79, 71, 69, 60, 81, 76, 77, 78,…
$ hypertension      <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, …
$ heart_disease     <dbl> 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ ever_married      <chr> "Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "Ye…
$ work_type         <chr> "Private", "Self-employed", "Private", "Private", "…
$ Residence_type    <chr> "Urban", "Rural", "Urban", "Urban", "Urban", "Urban…
$ avg_glucose_level <dbl> 228.69, 202.21, 104.51, 221.29, 217.08, 228.70, 102…
$ bmi               <chr> "36.6", "N/A", "27.3", "25.8", "N/A", "26.6", "27.2

In [16]:
stroke_proportions <- stroke_training %>%
                      group_by(stroke) %>%
                      summarize(n = n()) %>%
                      mutate(percent = 100*n/nrow(stroke_training))

stroke_proportions

`summarise()` ungrouping output (override with `.groups` argument)



stroke,n,percent
<fct>,<int>,<dbl>
0,1241,86.90476
1,187,13.09524
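
Because the classes are imbalanced, it may help to keep a simple reference point in mind when reading accuracy figures. The sketch below, which uses only the training counts from the table above, computes the accuracy of a hypothetical classifier that always predicts the majority class ("no stroke"). The name `baseline_accuracy` is ours and not part of the analysis.

```r
# Class counts in the training set, copied from the table above.
counts <- c(no_stroke = 1241, stroke = 187)

# Accuracy of a classifier that always predicts the majority class.
baseline_accuracy <- max(counts) / sum(counts)

# Express as a percentage for comparison with the model's accuracy.
round(100 * baseline_accuracy, 2)
```

A model's accuracy is informative mainly to the extent that it exceeds this baseline.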


In [17]:
stroke_recipe <- recipe(stroke ~ age + hypertension + heart_disease + avg_glucose_level, 
                        data = stroke_training) %>%
  step_scale(all_predictors()) %>%
  step_center(all_predictors())

In [18]:
knn_spec <- nearest_neighbor(weight_func = "rectangular", 
                             neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("classification")

In [None]:
stroke_vfold <- vfold_cv(stroke_training, v = 10, strata = stroke)
k_vals <- tibble(neighbors = seq(from = 1, to = 100, by = 5))

knn_results <- workflow() %>%
  add_recipe(stroke_recipe) %>%
  add_model(knn_spec) %>%
  tune_grid(resamples = stroke_vfold, grid = k_vals) %>%
  collect_metrics() 

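# Number of neighbors with the highest cross-validated accuracy
# (may contain more than one value if there is a tie).
best_k <- knn_results %>%
  filter(.metric == "accuracy") %>%
  filter(mean == max(mean)) %>%
  pull(neighbors)

best_k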

In [None]:
accurate <- knn_results %>%
  filter(.metric == "accuracy")

accuracy_vs_k <- ggplot(accurate, aes(x = neighbors, y = mean)) +
  geom_point() +
  geom_line() +
  labs(x = "Neighbors", y = "Accuracy Estimate") + 
  theme(text = element_text(size = 12))

accuracy_vs_k

In [None]:
stroke_best_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 51) %>%
                            set_engine("kknn") %>%
                            set_mode("classification")

stroke_best_fit <- workflow() %>%
                        add_recipe(stroke_recipe) %>%
                        add_model(stroke_best_spec) %>%
                        fit(data = stroke_training)

stroke_summary <- stroke_best_fit %>% 
                       predict(stroke_testing) %>%
                       bind_cols(stroke_testing) %>%
                       metrics(truth = stroke, estimate = .pred_class)

stroke_summary

In [None]:
print("Visualizations used to compare the distributions of each of the predictor variables used in our analysis.")

# bmi is stored as character with "N/A" for missing values; convert it to numeric for plotting.
stroke_plot_data <- stroke_data %>%
  mutate(bmi = as.numeric(na_if(bmi, "N/A"))) %>%
  filter(!is.na(bmi))

# The axes show the raw (unstandardized) values.
stroke_plot_1 <- stroke_plot_data %>%
  ggplot(aes(x = bmi, y = avg_glucose_level, color = stroke)) +
  ggtitle("Scatter plot of bmi versus average glucose level colored by stroke") +
  geom_point(alpha = 0.6) +
  labs(x = "bmi",
       y = "avg glucose level",
       color = "stroke") +
  scale_color_manual(labels = c("no stroke", "stroke"),
                     values = c("red", "blue")) +
  theme(text = element_text(size = 12))

# Age in 5-year bins, overlaid by stroke outcome.
stroke_plot_2 <- ggplot(stroke_data, aes(x = age, fill = stroke)) +
  ggtitle("Histogram of age of stroke data filled by stroke") +
  geom_histogram(alpha = 0.5, position = "identity", binwidth = 5)

stroke_plot_3 <- ggplot(stroke_data, aes(x = hypertension, fill = stroke)) +
  ggtitle("Histogram of hypertension of stroke data filled by stroke") +
  geom_histogram(alpha = 0.5, position = "identity")

stroke_plot_4 <- ggplot(stroke_data, aes(x = heart_disease, fill = stroke)) +
  ggtitle("Histogram of heart disease of stroke data filled by stroke") +
  geom_histogram(alpha = 0.5, position = "identity")

stroke_plot_1
stroke_plot_2
stroke_plot_3
stroke_plot_4

### **Methods:**

**Explain how you will conduct your data analysis and which variables/columns you will use**

We'll conduct our data analysis using classification, i.e., using one or more variables to predict the value of a categorical variable of interest (stroke, in our case). Specifically, we'll use a K-nearest neighbors classifier and choose the number of neighbors by 10-fold cross-validation on the training set. We'll answer our predictive question using four predictor variables in our dataset: age, hypertension, heart disease, and average glucose level. BMI appears only in the exploratory scatterplot and is not part of the model. We chose these four predictors because we consider them major risk factors for stroke, and thus the most helpful in predicting it.
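
To make the idea of nearest-neighbor classification concrete, here is a minimal base-R sketch of how a single prediction is formed: standardize the predictors, measure Euclidean distance to every training row, and take a majority vote among the k closest rows. The toy data, the variable names, and the choice of k = 3 are assumptions for illustration. The analysis itself uses the tidymodels workflow shown earlier, not this hand-written version.

```r
# Toy training data (illustrative values, not taken from the stroke dataset).
train <- data.frame(
  age               = c(25, 30, 45, 67, 72, 80),
  avg_glucose_level = c(85, 90, 100, 210, 190, 230),
  stroke            = factor(c(0, 0, 0, 1, 1, 1))
)
new_patient <- data.frame(age = 70, avg_glucose_level = 200)
predictors <- c("age", "avg_glucose_level")

# Standardize using the training means and standard deviations only.
centers <- colMeans(train[predictors])
scales  <- apply(train[predictors], 2, sd)
train_z <- scale(train[predictors], center = centers, scale = scales)
new_z   <- scale(new_patient[predictors], center = centers, scale = scales)

# Euclidean distance from the new patient to every training row.
distances <- sqrt(rowSums(sweep(train_z, 2, as.numeric(new_z))^2))

# Majority vote among the k nearest neighbors.
k <- 3
nearest <- order(distances)[1:k]
votes <- table(train$stroke[nearest])
prediction <- names(votes)[which.max(votes)]
prediction
```

Standardizing first matters because age and glucose level are on different scales; without it, the variable with the larger spread would dominate the distance.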

**Describe at least one way that you will visualize the results**

We'll visualize the results using scatterplots and histograms. We'll use a scatterplot for BMI and average glucose level colored by stroke, and histograms for age, hypertension, and heart disease filled by stroke.
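
For the two binary predictors, the groups differ greatly in size, so comparing proportions within each group can be easier to read than comparing raw counts. The following is a small sketch of that idea in base R, using made-up toy data. The data frame `toy` and the object `stroke_share` are hypothetical names, not part of the analysis.

```r
# Toy data (illustrative values only): hypertension flag and stroke outcome.
toy <- data.frame(
  hypertension = c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1),
  stroke       = c(0, 0, 0, 0, 0, 1, 0, 1, 1, 0)
)

# Rows: stroke outcome; columns: hypertension group.
# margin = 2 makes each column sum to 1, so groups of different size are comparable.
stroke_share <- prop.table(
  table(stroke = toy$stroke, hypertension = toy$hypertension),
  margin = 2
)
stroke_share

# Stacked bar chart of the within-group proportions.
barplot(stroke_share,
        legend.text = c("no stroke", "stroke"),
        xlab = "hypertension",
        ylab = "proportion of patients")
```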

### **Expected outcomes and significance:**

**What do you expect to find?**

We expect to be able to predict how likely a patient is to have a stroke from parameters in our dataset such as age, hypertension, heart disease, BMI, and average glucose level. Moreover, we expect to find a relationship between the predictor variables and the variable of interest.

**What impact could such findings have?**

Such findings could help identify which parameters are most strongly related to having a stroke, and so remind people at high risk to be careful and to take precautionary measures to reduce the possibility of a stroke.

**What future questions could this lead to?**

This could lead to future questions such as which variable is the most important in determining the possibility of a stroke, and how the different risk factors compare with one another.

### **References:**

Dubow, J., Fink, M.E. Impact of Hypertension on Stroke. Curr Atheroscler Rep 13, 298–305 (2011). https://doi.org/10.1007/s11883-011-0187-y

Soriano-Tárraga, C., Lazcano, U., Jiménez-Conde, J. et al. Biological age is a novel biomarker to predict stroke recurrence. J Neurol 268, 285–292 (2021). https://doi.org/10.1007/s00415-020-10148-3