# Can Data Diagnose Heart Disease Better Than Doctors?

By: Edward Zou, Hui Lin Shan, Reimi Shishido, and Emma Lo (Group 45)

## Introduction:

*Provide some relevant background information on the topic so that someone unfamiliar with it will be prepared to understand the rest of your proposal*

- Heart disease refers to several cardiovascular conditions that can fatally affect the structure and function of the heart, often because narrowed major arteries restrict blood flow. It is the second leading cause of death in Canada. Diagnosing heart disease is a rigorous and time-consuming process, and misdiagnosis remains a concern: about 16% to 68% of cases are reported as misdiagnosed, depending on the setting. A simpler and more reliable method of diagnosis would therefore benefit both patients and the doctors attempting to diagnose them. Our group’s chosen dataset addresses the problem of identifying whether an individual potentially has heart disease by comparing their attributes to those of patients who have already been diagnosed. We want to do so with more accuracy than the reported cases (i.e. less than 16% error, or a correct prediction more than 84% of the time).


*Clearly state the question you will try to answer with your project*

- Can we use a heart disease dataset to create a model that can reliably and correctly predict whether a new patient has heart disease more than 84% of the time? 
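
To make the 84% target concrete, the sketch below shows one way accuracy could be computed and compared against it. The `truth` and `prediction` vectors are made-up values for ten hypothetical test patients, not results from our data, and the variable names are our own choice for illustration.

```r
# Hypothetical true labels and model predictions for ten test patients
# (made-up values, for illustration only).
truth <- factor(c("Yes", "No", "No", "Yes", "Yes", "No", "No", "Yes", "No", "No"),
                levels = c("No", "Yes"))
prediction <- factor(c("Yes", "No", "No", "Yes", "No", "No", "No", "Yes", "No", "Yes"),
                     levels = c("No", "Yes"))

# Accuracy is the share of patients whose predicted class matches the true class.
accuracy <- mean(prediction == truth)

# The target from the research question: correct more than 84% of the time.
target <- 0.84
meets_target <- accuracy > target

accuracy
meets_target
```

In this toy case 8 of the 10 predictions match, so the accuracy is 0.8 and the target is not met.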


*Identify and describe the dataset that will be used to answer the question*

- We will be using the heart disease dataset provided on Canvas (the processed Cleveland version) from the UCI Machine Learning Repository. This dataset consists of 14 attributes (taken from a larger collection of 76 attributes) that aid in predicting whether or not an individual has heart disease. Four datasets from four different institutions (Cleveland, Hungary, Switzerland, and VA Long Beach) were provided, but we chose the Cleveland data because the website states that it is the only one that has been used by ML researchers. We take this to mean that it may be the best dataset, with the most helpful observations, and so will help make our model more effective. We chose to use the processed dataset because after comparing the columns that were removed from 


## Preliminary exploratory data analysis:
*Demonstrate that the dataset can be read from the web into R*

- The dataset is stored in a comma-separated .data file with no header row. We can read it with read.table, or read it directly with read_csv using col_names = FALSE, without first converting it to a .csv file.
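
As a minimal sketch of the read.table route, the block below parses a few lines of text laid out like the .data file. The three rows in `raw_text` are illustrative rows in the file's comma-separated, header-less layout and should not be taken as verbatim copies of the data. Treating "?" as the missing-value marker is an assumption based on two columns being read as character in our import.

```r
# Illustrative rows in the same comma-separated, header-less layout as
# processed.cleveland.data; "?" is assumed to mark a missing value.
raw_text <- "63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0
67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2
52.0,1.0,3.0,138.0,223.0,0.0,0.0,169.0,0.0,0.0,1.0,?,3.0,0"

# sep = "," splits on commas, header = FALSE keeps the first row as data,
# and na.strings = "?" turns the "?" marker into NA so the column stays numeric.
heart <- read.table(text = raw_text, sep = ",", header = FALSE, na.strings = "?")

dim(heart)
sum(is.na(heart$V12))
```

For the real file, `text = raw_text` would be replaced by the file path or URL. read_csv has a comparable `na` argument.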

*Clean and wrangle your data into a tidy format*

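- Keep only the columns we need, recode the diagnosis into Yes/No, and split the data by sex so that male and female patients can be examined separately.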

*Using only training data, summarize the data in at least one table (this is exploratory data analysis). An example of a useful table could be one that reports the number of observations in each class, the means of the predictor variables you plan to use in your analysis and how many rows have missing data*

*Using only training data, visualize the data with at least one plot relevant to the analysis you plan to do (this is exploratory data analysis). An example of a useful visualization could be one that compares the distributions of each of the predictor variables you plan to use in your analysis*


In [1]:
library(tidyverse)
library(testthat)
library(digest)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)


In [2]:
# Read the header-less, comma-separated file and give the 14 columns descriptive names
cleveland <- read_csv(url("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"), col_names = FALSE) %>%
    rename(age = X1, sex = X2, chest_pain = X3, resting_blood_pressure = X4,
           cholesterol = X5, fast_blood_sugar = X6, resting_electrocardiographic_results = X7,
           maximum_heart_rate_achieved = X8, exercise_induced_angina = X9,
           ST_depression_induced_by_exercise_relative_to_rest = X10,
           the_slope_of_the_peak_exercise_ST_segment = X11,
           number_of_major_vessels_colored_by_fluoroscopy = X12, thalassemia = X13,
           diagnosis_of_heart_disease = X14) %>%
    # Keep only the variables used in this analysis
    select(sex, cholesterol, resting_blood_pressure, diagnosis_of_heart_disease) %>%
    # Collapse the 0-4 diagnosis score into two classes: 0 becomes "No", 1-4 become "Yes"
    mutate(diagnosis_of_heart_disease = as_factor(diagnosis_of_heart_disease),
           diagnosis_of_heart_disease = recode(diagnosis_of_heart_disease,
                                               "0" = "No",
                                               "1" = "Yes",
                                               "2" = "Yes",
                                               "3" = "Yes",
                                               "4" = "Yes"))
cleveland

Rows: 303 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): X12, X13
dbl (12): X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X14

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


| sex (dbl) | cholesterol (dbl) | resting_blood_pressure (dbl) | diagnosis_of_heart_disease (fct) |
|---|---|---|---|
| 1 | 233 | 145 | No |
| 1 | 286 | 160 | Yes |
| 1 | 229 | 120 | Yes |
| ⋮ | ⋮ | ⋮ | ⋮ |
| 1 | 131 | 130 | Yes |
| 0 | 236 | 130 | Yes |
| 1 | 175 | 138 | No |


In [3]:
male_data <- cleveland %>%
        filter(sex == 1)

female_data <- cleveland %>%
        filter(sex == 0)

male_count <- male_data %>%
        summarize(n = n())

female_count <- female_data %>%
        summarize(n = n())

cholesterol_average_male <- male_data %>%
        summarize(cholesterol = mean(cholesterol))

cholesterol_average_female <- female_data %>%
        summarize(cholesterol = mean(cholesterol))


resting_blood_pressure_average_male <- male_data %>%
        summarize(resting_blood_pressure = mean(resting_blood_pressure))

resting_blood_pressure_average_female <- female_data %>%
        summarize(resting_blood_pressure = mean(resting_blood_pressure))

# data.frame() keeps each one-column summary's own column name and makes duplicates
# unique (cholesterol, cholesterol.1, ...), so rename them to show which sex each average is for
average_table <- data.frame(X1 = cholesterol_average_male,
                            X2 = cholesterol_average_female,
                            X3 = resting_blood_pressure_average_male,
                            X4 = resting_blood_pressure_average_female) %>%
        rename(male_cholesterol_average = cholesterol, female_cholesterol_average = cholesterol.1,
               male_resting_blood_pressure_average = resting_blood_pressure,
               female_resting_blood_pressure_average = resting_blood_pressure.1)

#male_data
#female_data
#male_count
#female_count
#cholesterol_average_male
#cholesterol_average_female
#resting_blood_pressure_average_male
#resting_blood_pressure_average_female
#average_table

Write here about what the cholesterol average and resting blood pressure average mean.

In [4]:
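# Set a seed so the random train/test split is reproducible (the value itself is arbitrary)
set.seed(2022)

# 70/30 split for each sex, stratified so both sets keep a similar share of each diagnosis
male_split <- initial_split(male_data, prop = 0.70, strata = diagnosis_of_heart_disease)
male_train <- training(male_split)
male_test <- testing(male_split)

female_split <- initial_split(female_data, prop = 0.70, strata = diagnosis_of_heart_disease)
female_train <- training(female_split)
female_test <- testing(female_split)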

options(repr.plot.width = 20, repr.plot.height = 15)

male_train_plot <- ggplot(data = male_train, aes(x = cholesterol, y = resting_blood_pressure)) +
    geom_point(aes(colour = diagnosis_of_heart_disease)) +
    labs(colour = "Does this patient have heart disease?") +
    ggtitle("Male cholesterol level vs resting blood pressure") +
    xlab("Cholesterol level (mg/dl)") + 
    ylab("Resting blood pressure (mm Hg)") +
    theme(text = element_text(size = 20))

female_train_plot <- ggplot(data = female_train, aes(x = cholesterol, y = resting_blood_pressure)) +
    geom_point(aes(colour = diagnosis_of_heart_disease)) +
    labs(colour = "Does this patient have heart disease?") +
    ggtitle("Female cholesterol level vs resting blood pressure") +
    xlab("Cholesterol level (mg/dl)") + 
    ylab("Resting blood pressure (mm Hg)") +
    theme(text = element_text(size = 20))

#male_train_plot
#female_train_plot
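
The template asks for a summary table built from training data only, whereas the averages above are computed before the split. A minimal sketch of such a table follows, assuming a grouped summary is acceptable. `toy_train` is a made-up stand-in for `male_train` and `female_train` combined, and the summary column names are our own choice.

```r
library(dplyr)
library(tibble)

# Made-up stand-in for the training data (values are for illustration only).
toy_train <- tibble(
  sex = c(1, 1, 1, 0, 0, 0),
  cholesterol = c(230, 250, NA, 200, 260, 240),
  resting_blood_pressure = c(140, 130, 120, 110, 150, 130),
  diagnosis_of_heart_disease = factor(c("No", "Yes", "Yes", "No", "Yes", "Yes"))
)

# One row per sex and diagnosis: class count, predictor means, and rows with missing values.
train_summary <- toy_train %>%
  group_by(sex, diagnosis_of_heart_disease) %>%
  summarize(
    n = n(),
    mean_cholesterol = mean(cholesterol, na.rm = TRUE),
    mean_resting_blood_pressure = mean(resting_blood_pressure, na.rm = TRUE),
    rows_with_missing = sum(is.na(cholesterol) | is.na(resting_blood_pressure)),
    .groups = "drop"
  )

train_summary
```

Grouping by both sex and diagnosis reports the number of observations in each class and the predictor means in a single table, which covers the items the template lists.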

## Methods:
*Explain how you will conduct your data analysis and which variables/columns you will use. Note - you do not need to use all variables/columns that exist in the raw data set. In fact, that's often not a good idea. For each variable think: is this a useful variable for prediction?*

- High cholesterol and high blood pressure are among the main contributing factors of heart disease, so we will use cholesterol and resting_blood_pressure as our predictors. We limit ourselves to two predictors so that the data and the results can be shown clearly in a two-dimensional plot.

*Describe at least one way that you will visualize the results*

- We will use the ggplot function to draw a scatter plot of the data points, coloured by diagnosis, with a legend explaining what each colour represents.
- We can use a histogram of each predictor for patients with heart disease, so that a new patient’s values can be compared against that distribution.
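
A minimal sketch of the histogram idea is shown below, assuming overlaid histograms coloured by diagnosis are an acceptable form. `toy_train` holds made-up cholesterol values for illustration, and the bin width of 20 is an arbitrary choice.

```r
library(ggplot2)

# Made-up cholesterol values standing in for the training data (illustration only).
toy_train <- data.frame(
  cholesterol = c(190, 205, 215, 230, 240, 250, 255, 270, 285, 300),
  diagnosis_of_heart_disease = factor(c("No", "No", "No", "No", "Yes",
                                        "No", "Yes", "Yes", "Yes", "Yes"))
)

# Overlaid histograms, one colour per diagnosis, so the two distributions can be compared.
cholesterol_hist <- ggplot(toy_train,
                           aes(x = cholesterol, fill = diagnosis_of_heart_disease)) +
  geom_histogram(binwidth = 20, alpha = 0.6, position = "identity") +
  labs(x = "Cholesterol level (mg/dl)",
       y = "Number of patients",
       fill = "Does this patient have heart disease?") +
  theme(text = element_text(size = 20))

cholesterol_hist
```

The same pattern would apply to resting_blood_pressure by swapping the `x` aesthetic and the axis label.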


## Expected outcomes and significance:

*What do you expect to find?*
- Whether a new patient has heart disease.
- Which characteristics patients with heart disease tend to have.

*What impact could such findings have?*
- The model could help people, whether or not they already suspect they have heart disease, estimate how likely they are to have it.
- We can also determine a “safe zone” for each variable and see whether any of a patient’s values put them at risk (e.g. a cholesterol level close to that of a typical heart disease patient).

*What future questions could this lead to?*
- What other illnesses can we predict with just statistics?
- Could the need for diagnosis by a doctor eventually become redundant, as computers and statistics can do a better and more accurate job?


## References:

https://www.sciencedirect.com/science/article/pii/S1071916421002049

https://www.cdc.gov/chronicdisease/resources/publications/factsheets/heart-disease-stroke.htm