## Heart Attack Predictive Analysis: Group 44

### Introduction

Heart disease is a prevalent and life-threatening condition that affects millions of people worldwide. Early diagnosis and risk assessment are crucial for providing timely medical intervention and reducing the morbidity and mortality associated with this condition. In our data science project, we use a [dataset](https://archive.ics.uci.edu/ml/datasets/Heart+Disease) containing hospital data from Cleveland, Hungary, Switzerland, and VA Long Beach. We will work specifically with the Cleveland data, as it contains the most data from heart disease patients. The dataset contains 14 health-related variables, including the diagnosis of heart disease (num), which is the categorical variable we want to predict. This gives us an opportunity to use classification modelling to explore how these variables relate to heart disease. Hence, we propose to answer the question: Can we use the medical laboratory test data available to us to predict whether a patient has heart disease?



### Preliminary exploratory data analysis
- Demonstrate that the dataset can be read from the web into R 
- Clean and wrangle data into a tidy format
- Using only training data, summarize the data in at least one table (this is exploratory data analysis). 
    - An example of a useful table could be one that reports the number of observations in each class, the means of the predictor variables you plan to use in your analysis and how many rows have missing data.
        - (9 chest pain "cp", 10 resting blood pressure "bp", 12 serum cholesterol "chol", 16 fasting blood sugar "fbs", 19 resting ecg "restecg")
        - no missing values in the above columns
        - (58 diagnosis of heart disease "num")
- Using only training data, visualize the data with at least one plot relevant to the analysis you plan to do (this is exploratory data analysis). 
    - An example of a useful visualization could be one that compares the distributions of each of the predictor variables you plan to use in your analysis.

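The note above that the selected columns have no missing values can be checked directly. The following is a minimal sketch, not part of our analysis: `toy_text` holds made-up rows (not real patient records) in the same comma-separated layout, and it assumes missing entries are coded as `?`, which is why `na = "?"` is passed to `read_csv`.

```r
library(tidyverse)

# Hypothetical rows for illustration only; these are NOT real patient records.
toy_text <- "cp,trestbps,chol,fbs,restecg,num
1,145,233,1,2,0
4,160,?,0,2,2
3,130,250,0,0,0
4,120,229,?,2,1"

# Treat "?" as a missing value while reading (assumed missing-value code)
toy_data <- read_csv(I(toy_text), na = "?", show_col_types = FALSE)

# Number of missing values in each column
missing_per_column <- toy_data |>
    summarize(across(everything(), \(x) sum(is.na(x))))

# Number of rows with at least one missing value
rows_with_missing <- toy_data |>
    filter(if_any(everything(), is.na)) |>
    nrow()

missing_per_column
rows_with_missing
```

Applying the same two summaries to the training data would give the "rows with missing data" figure for the summary table.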

In [None]:
library(repr)
library(tidyverse)
library(tidymodels)
library(RColorBrewer)

In [None]:
# creating column names for our data, as the file does not contain any
our_col_names <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")

# reading in our data and storing as an object
heart_data <- read_csv("data/processed.cleveland.data", col_names = our_col_names) |>
    select(cp, trestbps, chol, fbs, restecg, num)
heart_data

In [None]:
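# set a seed so the random split is reproducible (the seed value itself is arbitrary)
set.seed(2023)

# prepare data for splitting
split_obj <- initial_split(heart_data, prop = 0.8, strata = num)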

# split and extract data
train_data <- training(split_obj)
test_data <- testing(split_obj)


# calculate summary statistics
summary_diagnosis <- train_data |>
    group_by(num) |>
    summarize(count = n()) |>
    pivot_wider(names_from = num,
                values_from = count)

# summary_diagnosis

summary_means <- train_data |>
    select(trestbps, chol) |>
    summarize_all(mean, na.rm = TRUE)

# summary_means

summary_table <- cbind(summary_means, summary_diagnosis) |>
    rename("resting blood pressure (mean)" = trestbps,
          "serum cholesterol (mean)" = chol,
          "no disease" = `0`,
          "level 1 disease" = `1`,
          "level 2 disease" = `2`,
          "level 3 disease" = `3`,
          "level 4 disease" = `4`)
           
summary_table

In [None]:
# scatter plot of the two numerical predictors, coloured by diagnosis
visualization1 <- train_data |>
    ggplot(aes(x = trestbps, y = chol, colour = as_factor(num))) +
    geom_point() +
    scale_color_discrete(labels = c("No Disease", "Level 1 disease", "Level 2 disease", "Level 3 disease", "Level 4 disease")) +
    labs(x = "Resting Blood Pressure (mm Hg)", y = "Serum Cholesterol (mg/dl)", colour = "Heart Disease Diagnosis")

visualization1

# bar charts for the categorical predictors:
# cp (chest pain type), fbs (fasting blood sugar), restecg (resting ECG)

cat_cp <- train_data |>
    ggplot(aes(x = as_factor(cp), fill = as_factor(num))) +
    geom_bar(stat = "count") +
    scale_fill_discrete(labels=c("No Disease", "Level 1 disease", "Level 2 disease", "Level 3 disease", "Level 4 disease")) + 
    labs(x = "Chest Pain", y = "Count", fill = "Heart Disease Diagnosis")

cat_cp

cat_fbs <- train_data |>
    ggplot(aes(x = as_factor(fbs), fill = as_factor(num))) +
    geom_bar(stat = "count") +
    scale_fill_discrete(labels=c("No Disease", "Level 1 disease", "Level 2 disease", "Level 3 disease", "Level 4 disease")) + 
    labs(x = "Fasting Blood Sugar > 120 mg/dl", y = "Count", fill = "Heart Disease Diagnosis")

cat_fbs

cat_restecg <- train_data |>
    ggplot(aes(x = as_factor(restecg), fill = as_factor(num))) +
    geom_bar(stat = "count") +
    scale_fill_discrete(labels=c("No Disease", "Level 1 disease", "Level 2 disease", "Level 3 disease", "Level 4 disease")) + 
    labs(x = "Resting Electrocardiogram Results", y = "Count", fill = "Heart Disease Diagnosis")

cat_restecg

### Methods

To conduct our data analysis, we will consider the following predictor variables: cp (chest pain type: typical angina, atypical angina, non-anginal pain, asymptomatic), trestbps (resting blood pressure), restecg (resting ECG), fbs (fasting blood sugar), and chol (serum cholesterol). Aside from chest pain type, each of these predictor variables is the result of a medical laboratory test commonly requested by doctors, while chest pain type brings the patient's own experience into the diagnosis. The response variable will be num (diagnosis of heart disease). Using these variables, we first explore the relationships between the predictor variables and the diagnosis of heart disease by visualizing them in one scatter plot and three bar charts, as shown above. Once we have a sense of these relationships, we will standardize the predictors and use the K-nearest neighbors algorithm to predict the categorical class, which in this case is the diagnosis of heart disease. We will first tune the classifier (that is, choose the number of neighbors K) and then assess how well it performs in terms of accuracy, precision, and recall.

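The tuning step described above could look roughly like the sketch below. It is an illustration under stated assumptions rather than our final pipeline: `toy_train` is randomly generated stand-in data (not the Cleveland records), only the two numerical predictors are used, and the seed, the 5-fold cross-validation, and the grid of K values are arbitrary choices. It also assumes the `kknn` package is installed, since that is the engine `nearest_neighbor()` uses here.

```r
library(tidyverse)
library(tidymodels)

set.seed(2023)  # arbitrary seed, for reproducibility only

# Synthetic stand-in for the training data; values are random, not real patients.
toy_train <- tibble(
    trestbps = rnorm(120, mean = 130, sd = 15),
    chol = rnorm(120, mean = 245, sd = 40),
    num = factor(sample(0:4, 120, replace = TRUE))  # response must be a factor
)

# Standardize the predictors so neither dominates the distance calculation
knn_recipe <- recipe(num ~ trestbps + chol, data = toy_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

# K-nearest neighbors classifier with the number of neighbors left to tune
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

# 5-fold cross-validation on the training data, stratified by diagnosis
folds <- vfold_cv(toy_train, v = 5, strata = num)

# Candidate values of K (hypothetical grid)
k_grid <- tibble(neighbors = seq(from = 1, to = 15, by = 2))

# Estimate cross-validated accuracy for each candidate K
knn_results <- workflow() |>
    add_recipe(knn_recipe) |>
    add_model(knn_spec) |>
    tune_grid(resamples = folds, grid = k_grid, metrics = metric_set(accuracy)) |>
    collect_metrics()

# K with the highest mean cross-validated accuracy
best_k <- knn_results |>
    slice_max(mean, n = 1, with_ties = FALSE) |>
    pull(neighbors)

best_k
```

Because the stand-in data are random, the accuracy values here carry no meaning; the point is only the shape of the workflow. With the real training data, `best_k` would then be used to fit the final classifier before evaluating it on the test set.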
### Expected outcomes and significance

We expect the predictor variables we have chosen to predict a patient's heart disease diagnosis accurately. As mentioned earlier, most of these predictor variables are results from medical laboratory tests, which are objective and conducted systematically. Chest pain type, on the other hand, captures the subjective experience of heart disease patients. Considering both the objective and the subjective evidence should make for a more well-rounded diagnosis. These findings could have significant implications for early risk assessment and timely medical intervention. By improving the accuracy of heart disease classification, our findings can help healthcare professionals make informed decisions and improve patient care outcomes. Finally, our findings may motivate the development of other predictive models for heart disease diagnosis, potentially leading to more accurate and personalized risk assessments that take into account a broader range of health and lifestyle factors.
