# Data analysis on heart disease directory

In [28]:
_Group 10: Samantha Lam, Sohayl Allibhai, Amanda Yang, Bolun Xie_



## Introduction

Though treatment has advanced tremendously, the prevalence of heart disease has continued to rise in lower-income communities, contributing to the pervasiveness of cardiovascular disease as a leading cause of premature death (Bowry et al. 1151). This trend is also reflected in Canada, where the 11.3% increase in adults with cardiovascular disease occurred mostly in lower-income areas, despite the overall decline of heart disease in the general population in 2016 (Dai et al. 2).
    
Therefore, there is a push for more efficient methods of diagnosing individuals at high risk of heart disease. We aim to determine the severity of heart disease in individual patients by analyzing a set of attributes taken from a dataset from the Cleveland Clinic Foundation (Detrano et al. 305). The raw dataset was refined by David W. Aha to create the processed dataset that we use (UCI Machine Learning Repository: Heart Disease Data Set).
    

This dataset contains 14 attributes, of which we isolated six for our analysis, namely:
* age in years
* sex
* resting blood pressure (trestbps) 
* chest pain type (cp)
* number of major blood vessels coloured by fluoroscopy (ca) 
* the diagnosis of heart disease (num). 
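
Several of these attributes are stored as numeric codes rather than measurements. The following is a minimal sketch of attaching readable labels, assuming the codings used later in this notebook (sex: 0 = female, 1 = male; cp: 1 = typical angina through 4 = asymptomatic). The `toy_patients` table and its values are hypothetical and are not taken from the dataset.

```r
library(dplyr)

# Hypothetical rows for illustration only; not real patients.
toy_patients <- tibble(sex = c(1, 0, 1), cp = c(4, 2, 1))

# Convert the numeric codes into labelled factors.
labelled <- toy_patients %>%
    mutate(
        sex = factor(sex, levels = c(0, 1), labels = c("female", "male")),
        cp = factor(cp, levels = 1:4,
                    labels = c("typical angina", "atypical angina",
                               "non-anginal pain", "asymptomatic"))
    )

labelled
```

Labelled factors keep the codes from being mistaken for quantities in summaries and plots.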


In [None]:
#Please run this cell to load the libraries needed for data analysis 
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)

In [None]:
#Please run this cell to download the dataset and read it into a dataframe.

my_url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
download.file(my_url, "cleveland.csv")
cleveland_sel <- read_csv("cleveland.csv", col_names = FALSE)
cleveland_sel <- rename(cleveland_sel, "age" = X1,
                     "sex" = X2,
                     "cp" = X3,
                     "trestbps" = X4,
                     "chol" = X5,
                     "fbs" = X6,
                     "restecg" = X7,
                     "thalach" = X8,
                     "exang" = X9,
                     "oldpeak" = X10,
                     "slope" = X11,
                     "ca" = X12,
                     "thal" = X13,
                     "num" = X14)

In [None]:
#Here we select our chosen attributes

cleveland_df_sel <- select(cleveland_sel, age, sex, cp, trestbps, ca, num)
cleveland_df_sel

In [None]:
#Please run this cell to filter out rows with missing data 

cleveland_df_sel_f <- filter(cleveland_df_sel, ca != "?") 
cleveland_df_sel_f
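
An alternative to filtering on the literal string is to tell `read_csv` that `?` marks a missing value, so that `ca` is parsed as a number rather than as text. The following is a minimal sketch on a hypothetical three-column excerpt (the real file has 14 columns), assuming readr 2.0 or later for `I()` literal input.

```r
library(readr)
library(tidyr)

# Hypothetical excerpt with columns age, ca, num; the last row has a missing ca.
toy_text <- "63,0.0,0\n67,3.0,2\n52,?,0\n"

# na = "?" turns the placeholder into NA, so ca is read as a double.
toy <- read_csv(I(toy_text), col_names = c("age", "ca", "num"),
                na = "?", show_col_types = FALSE)

# Drop the rows where ca is missing.
toy_clean <- drop_na(toy, ca)
toy_clean
```

With this approach `ca` can be used directly in numeric summaries without a separate conversion step.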


In [29]:
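#Run this cell to split the data into training (75%) and testing (25%) sets, stratified by chest pain type (cp)

patients_split <- initial_split(cleveland_df_sel_f, prop = 0.75, strata = cp)
patients_train <- training(patients_split)
patients_test <- testing(patients_split)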

nrow(cleveland_df_sel_f)
nrow(patients_train)
nrow(patients_test)


patients_test
patients_train


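| age | sex | cp | trestbps | ca | num |
|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` | `<dbl>` |
| 63 | 1 | 4 | 130 | 1.0 | 2 |
| 57 | 1 | 4 | 140 | 0.0 | 0 |
| 56 | 1 | 3 | 130 | 1.0 | 2 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 41 | 1 | 2 | 120 | 0.0 | 0 |
| 59 | 1 | 4 | 164 | 2.0 | 3 |
| 68 | 1 | 4 | 144 | 2.0 | 2 |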


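| age | sex | cp | trestbps | ca | num |
|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` | `<dbl>` |
| 63 | 1 | 1 | 145 | 0.0 | 0 |
| 67 | 1 | 4 | 160 | 3.0 | 2 |
| 67 | 1 | 4 | 120 | 2.0 | 1 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 45 | 1 | 1 | 110 | 0.0 | 1 |
| 57 | 1 | 4 | 130 | 1.0 | 3 |
| 57 | 0 | 2 | 130 | 1.0 | 1 |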
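
The split above is random, so the rows in each set change on every run. The following is a minimal sketch of a reproducible split, assuming a fixed seed and stratification on the target variable instead of `cp`. Both are illustrative choices rather than the notebook's method, and `toy_patients` is hypothetical data.

```r
library(rsample)
library(dplyr)

# Fix the random seed so the same rows land in each set on every run.
set.seed(2022)  # arbitrary value, chosen for illustration

# Hypothetical toy data: 200 patients with a two-level diagnosis.
toy_patients <- tibble(
    age = sample(29:77, 200, replace = TRUE),
    num = factor(rep(c("0", "1"), times = c(120, 80)))
)

# Stratifying on num keeps the class balance similar in both sets.
toy_split <- initial_split(toy_patients, prop = 0.75, strata = num)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Compare the class proportions in the two sets.
count(toy_train, num) %>% mutate(prop = n / sum(n))
count(toy_test, num) %>% mutate(prop = n / sum(n))
```

Comparing the two proportion tables is a quick check that neither set is missing a diagnosis level.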


In [None]:
#Run this cell to obtain the total number of patients 

num_Patients <- cleveland_df_sel_f %>% summarize(n_patients = n())
num_Patients

In [None]:
#Run this cell to obtain the number of male and female patients respectively. 

#In this dataset, sex is coded as 0 = female and 1 = male
n_each_gender <- cleveland_df_sel_f %>%
    mutate(Sex = if_else(sex == 0, "female", "male")) %>%
    group_by(Sex) %>%
    summarize(n = n())

n_each_gender 


In [None]:
#Run this cell to obtain a visualisation of the number of male and female patients

options(repr.plot.width = 5, repr.plot.height = 7)

sex_bar<- ggplot(n_each_gender, aes(x = Sex, y = n)) + 
    geom_bar(stat = "identity") +
    xlab("Sex") +
    ylab("Number of Patients")+
    theme(text = element_text(size = 20))+ 
    ggtitle("Sex versus Number of Patients")

sex_bar
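
The bar chart above plots counts. If proportions are wanted instead, each count can be divided by the total. The following is a minimal sketch, where `toy_counts` holds hypothetical counts rather than the real totals from the dataset.

```r
library(dplyr)
library(ggplot2)

# Hypothetical counts for illustration; not the real totals.
toy_counts <- tibble(Sex = c("female", "male"), n = c(1, 3))

# Divide each count by the overall total to get a proportion.
toy_props <- toy_counts %>% mutate(prop = n / sum(n))

prop_bar <- ggplot(toy_props, aes(x = Sex, y = prop)) +
    geom_col() +
    labs(x = "Sex", y = "Proportion of Patients")

prop_bar
```

`geom_col()` draws the bar heights directly from the `prop` column, which is equivalent to `geom_bar(stat = "identity")`.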

In [None]:
#Run this cell to find the average age of both female and male patients respectively

#In this dataset, sex is coded as 0 = female and 1 = male
mean_age <- cleveland_df_sel_f %>%
    mutate(sex = if_else(sex == 0, "female", "male")) %>%
    group_by(sex) %>%
    summarize(avg_age = mean(age))

mean_age


In [None]:
#Run this cell to find the exact distribution of patients with each type of chest pain 

#cp is coded as 1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic
n_chest_pain <- cleveland_df_sel_f %>%
    mutate(chest_pain = factor(cp, levels = 1:4,
                               labels = c("typical_angina", "atypical_angina",
                                          "Non_anginal_pain", "Asymptomatic"))) %>%
    group_by(chest_pain) %>%
    summarize(n = n())


n_chest_pain


In [None]:
#Run this cell to find the minimum and maximum resting blood pressure in each age group. 
#Young adult (18-35 years old)
#Middle adult (36-55 years old)
#Old adult (>55 years old)
#Classifications obtained from (Marateb and Goudarzi 216) 


young_adult_bps <- cleveland_df_sel_f %>%
    select(age, trestbps) %>%
    filter(between(age, 18, 35))

middle_adult_bps <- cleveland_df_sel_f %>%
    select(age, trestbps) %>%
    filter(between(age, 36, 55))

#Use age > 55 so that 55-year-olds are counted only in the middle group
old_adult_bps <- cleveland_df_sel_f %>%
    select(age, trestbps) %>%
    filter(age > 55)

max_young_trestbps <- young_adult_bps %>% 
    arrange(desc(trestbps))%>% 
    head(n=1)

min_young_trestbps <- young_adult_bps %>% 
    arrange(desc(trestbps)) %>%
    tail(n=1)

max_middle_trestbps <- middle_adult_bps %>% 
    arrange(desc(trestbps)) %>%
    head(n=1)

min_middle_trestbps <- middle_adult_bps %>% 
    arrange(desc(trestbps)) %>%
    tail(n=1)

max_old_trestbps <- old_adult_bps %>% 
    arrange(desc(trestbps)) %>%
    head(n=1)

min_old_trestbps <- old_adult_bps %>% 
    arrange(desc(trestbps)) %>%
    tail(n=1)


max_young_trestbps
min_young_trestbps

max_middle_trestbps
min_middle_trestbps

max_old_trestbps
min_old_trestbps
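
The same minimum and maximum values can also be obtained in one pass by binning age with `cut()` and summarizing per group. The following is a minimal sketch using the age bands stated above; `toy_bp` and its values are hypothetical.

```r
library(dplyr)

# Hypothetical ages and resting blood pressures; not real patients.
toy_bp <- tibble(
    age = c(29, 35, 36, 55, 56, 77),
    trestbps = c(130, 120, 140, 110, 160, 125)
)

# cut() uses right-closed intervals by default: (17,35], (35,55], (55,Inf].
bp_range_by_group <- toy_bp %>%
    mutate(age_group = cut(age, breaks = c(17, 35, 55, Inf),
                           labels = c("young", "middle", "old"))) %>%
    group_by(age_group) %>%
    summarize(min_trestbps = min(trestbps),
              max_trestbps = max(trestbps))

bp_range_by_group
```

Because every age falls into exactly one interval, no patient is counted in two groups.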


In [None]:
#Run this cell to obtain mean resting blood pressure in relation to ca

mean_rbp <- cleveland_df_sel_f %>% group_by(ca) %>% summarize(avg_rbp = mean(trestbps))
mean_rbp

In [None]:
#Run this cell to obtain visualisation of Age versus Resting blood pressure

options(repr.plot.width = 12, repr.plot.height = 7)

age_vs_rbps_plot <- ggplot(cleveland_df_sel_f, aes(x = age, y = trestbps, color = as_factor(num))) +
                    geom_point(alpha = 0.5) +
                    labs(x = "Age (years)", y = "Resting blood pressure (mmHg)", color = "Diagnosis of Heart Disease") +
                    theme(text = element_text(size = 20)) + 
                    ggtitle("Age versus Resting blood pressure") +
                    scale_color_manual(labels = c("0", "1","2","3","4"), values = c("red","blue","yellow","green","black"))
age_vs_rbps_plot

## Methods:
We aim to find the relationship between the five chosen variables and the severity of heart disease, and to discover which variable has the most influence on it. 
* We picked five variables in total: **age, sex, resting blood pressure *(trestbps)*, chest pain type *(cp)*, and the number of major blood vessels coloured by fluoroscopy *(ca)***. We selected the **diagnosis and severity of heart disease *(num)*** as our target variable, which we will compare to each of our five variables.

* We plan to utilize a bar plot to show the distribution of sex, as this is a categorical variable that we want to use to make comparisons. Scatter plots will be used to display the relationships between the remainder of our variables due to their quantitative nature.
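
The source does not say which classifier will be used. Purely as an illustration of what a classification workflow could look like in tidymodels, the sketch below fits a K-nearest neighbours model. The choice of K-nearest neighbours, the value of `neighbors`, the two predictors and the simulated `toy` data are all assumptions, and the `kknn` package must be installed for the engine to run.

```r
library(tidymodels)

set.seed(1)  # arbitrary seed so the simulated data are reproducible

# Hypothetical simulated data: two diagnosis classes with different
# typical ages and resting blood pressures. Not the Cleveland data.
toy <- tibble(
    age = c(rnorm(40, 45, 5), rnorm(40, 62, 5)),
    trestbps = c(rnorm(40, 120, 8), rnorm(40, 145, 8)),
    num = factor(rep(c("0", "1"), each = 40))
)

# Standardize the predictors so neither dominates the distance calculation.
toy_recipe <- recipe(num ~ age + trestbps, data = toy) %>%
    step_scale(all_predictors()) %>%
    step_center(all_predictors())

# K-nearest neighbours specification (assumed model, K = 5 chosen arbitrarily).
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) %>%
    set_engine("kknn") %>%
    set_mode("classification")

knn_fit <- workflow() %>%
    add_recipe(toy_recipe) %>%
    add_model(knn_spec) %>%
    fit(data = toy)

# Predicted class for each row, returned in the .pred_class column.
toy_predictions <- predict(knn_fit, toy)
toy_predictions
```

In a real analysis the model would be fitted on the training set only and evaluated on the held-out testing set.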


## Expected outcomes and significance
Age has consistently been shown to be positively correlated with the risk of developing heart disease (Roth et al. 2985), and the average lifestyle men lead typically involves more high-risk factors for heart disease than that of women (Dai et al. 6). As such, older men are more likely to develop cardiovascular disease, and we would expect to see this reflected in our results. 

Through this project, we hope to create a classification model that can contribute to the ongoing effort to help healthcare workers efficiently target individuals at risk of developing heart disease or quickly identify the severity of the disease for patients.  

It is noteworthy that the dataset we used to train our model was formed from data collected from May 1981 to 1984 (Detrano et al. 305), and the implications of the age of the data have not been explored. As such, it would be interesting to repeat the project with more recent data and look for any differences in the results. Furthermore, the financial status of the patients was not noted, so the proportion of low-income individuals in the test group is unknown. Considering the goal of our project, it would be worth exploring whether focusing more explicitly on low-income individuals using the same model would produce different results. 


## Bibliography 

Bowry, Ashna D. K., et al. ‘The Burden of Cardiovascular Disease in Low- and Middle-Income Countries: Epidemiology and Management’. Canadian Journal of Cardiology, vol. 31, no. 9, Sept. 2015, pp. 1151–59. DOI.org (Crossref), https://doi.org/10.1016/j.cjca.2015.06.028.

Dai, Haijiang, et al. ‘Regional and Socioeconomic Disparities in Cardiovascular Disease in Canada during 2005–2016: Evidence from Repeated Nationwide Cross-Sectional Surveys’. BMJ Global Health, vol. 6, no. 11, Nov. 2021, p. e006809. DOI.org (Crossref), https://doi.org/10.1136/bmjgh-2021-006809.

Detrano, Robert, et al. ‘International Application of a New Probability Algorithm for the Diagnosis of Coronary Artery Disease’. The American Journal of Cardiology, vol. 64, no. 5, Aug. 1989, pp. 304–10. ScienceDirect, https://doi.org/10.1016/0002-9149(89)90524-9.

UCI Machine Learning Repository: Heart Disease Data Set. https://archive.ics.uci.edu/ml/datasets/Heart+Disease. Accessed 3 Mar. 2022.

Roth, Gregory A., et al. ‘Global Burden of Cardiovascular Diseases and Risk Factors, 1990–2019’. Journal of the American College of Cardiology, vol. 76, no. 25, Dec. 2020, pp. 2982–3021. DOI.org (Crossref), https://doi.org/10.1016/j.jacc.2020.11.010.

Marateb, Hamid Reza, and Sobhan Goudarzi. ‘A Noninvasive Method for Coronary Artery Diseases Diagnosis Using a Clinically-Interpretable Fuzzy Rule-Based System’. Journal of Research in Medical Sciences : The Official Journal of Isfahan University of Medical Sciences, vol. 20, no. 3, Mar. 2015, pp. 214–23.

