**Project title**: What is your heart rate telling you?

**Name:** Amy Yang

**E-mail:** yangy.ustc@gmail.com

**GitHub username**: amysheep

**Link to prior writing**: https://goo.gl/FqLnYf

**Project description**: Heart disease has been the leading cause of death worldwide over the last decade. In the United States alone, about one person dies of heart disease every minute. Researchers have been using several data mining techniques to help health care professionals diagnose heart disease. In this project, we will examine the relationship between the maximum heart rate one can achieve during exercise and the likelihood of developing heart disease, using multiple logistic regression to account for potential confounding effects from age and sex. <img src="datadict.png" height="400" width="400">
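
In symbols, the planned model can be sketched as follows. The notation is an assumption made for illustration and does not appear in the original proposal: $p$ is the probability of heart disease, and the $\beta$ coefficients are the quantities to be estimated.

$$
\log\frac{p}{1-p} = \beta_0 + \beta_1\,(\text{maximum heart rate}) + \beta_2\,(\text{age}) + \beta_3\,(\text{sex})
$$

Under this sketch, $\beta_1$ describes the association between maximum heart rate and the log-odds of heart disease after adjusting for age and sex.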

**Dataset(s) used**: We use the Cleveland heart disease dataset from the UCI Machine Learning Repository, available at http://archive.ics.uci.edu/ml/datasets/Heart+Disease. The dataset has 303 records and 13 attributes, plus the diagnosis variable 'class'. The data dictionary is included here.

**Assumed student knowledge**: tidyverse, logistic regression, introductory probability and statistics

# Maximum heart rate during exercise and heart disease

Millions of people develop some form of heart disease every year, and heart disease is the biggest killer of both men and women in the United States and around the world. Statistical analysis has identified many risk factors associated with heart disease, such as age, blood pressure, total cholesterol, diabetes, hypertension, family history of heart disease, obesity, and lack of physical exercise. In this notebook, we're going to run statistical tests and regression models on the Cleveland heart disease dataset to assess one particular factor: the maximum heart rate one can achieve during exercise, and how it is associated with the likelihood of heart disease. <img src="run31.png" height="300" width="300">

## 1. Heart disease and potential risk factors

Let's start by loading the tidyverse package for data cleaning and reading the data file 'Cleveland_hd.csv' into our notebook.

In [3]:
# Load in the tidyverse package
library(tidyverse)

# Read the dataset Cleveland_hd.csv into hd_data
hd_data <- read.csv("Cleveland_hd.csv")

# Take a look at the first 5 rows of hd_data
head(hd_data, 5)


age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,class
63,1,1,145,233,1,2,150,0,2.3,3,0,6,0
67,1,4,160,286,0,2,108,1,1.5,2,3,3,2
67,1,4,120,229,0,2,129,1,2.6,2,2,7,1
37,1,3,130,250,0,0,187,0,3.5,3,0,3,0
41,0,2,130,204,0,2,172,0,1.4,1,0,3,0
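
As a quick sanity check on how R types these columns, the sketch below re-reads the five rows printed above from an inline string instead of the file. The names `csv_text` and `hd_head` are hypothetical and used only for this illustration; `read.csv(text = ...)` is standard base R.

```r
# The five rows printed by head(hd_data, 5), pasted as inline CSV text
csv_text <- "age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,class
63,1,1,145,233,1,2,150,0,2.3,3,0,6,0
67,1,4,160,286,0,2,108,1,1.5,2,3,3,2
67,1,4,120,229,0,2,129,1,2.6,2,2,7,1
37,1,3,130,250,0,0,187,0,3.5,3,0,3,0
41,0,2,130,204,0,2,172,0,1.4,1,0,3,0
"

# Parse the inline text exactly as read.csv would parse a file
hd_head <- read.csv(text = csv_text)

dim(hd_head)             # 5 rows, 14 columns
sapply(hd_head, class)   # every column is "integer" or "numeric"
```

In this excerpt every column, including coded ones such as sex and cp, comes back as a number, so categorical variables need an explicit conversion before they are used in tests or models.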


## 2. Converting diagnosis class into outcome variable

The outcome variable 'class' has more than two levels. According to the codebook, any nonzero value can be coded as an 'event'. Let's create a new variable called 'hd' to represent a binary 1/0 outcome.

There are a few other categorical/discrete variables in the dataset. Let's also convert sex, cp and fbs to the 'factor' type for the next step of the analysis; otherwise R will treat them as continuous by default.

In [8]:
# Use the 'mutate' function from dplyr to recode our data

hd_data <- hd_data %>% mutate(hd = ifelse(class > 0, 1, 0))

# Check the newly created variable by looking at the crosstab with 'class'

table(hd_data$class, hd_data$hd)

# Recode sex, cp and fbs using the mutate function and save as hd_data_cleaned

hd_data_cleaned <- hd_data %>% mutate(sex = factor(sex), cp = factor(cp), fbs = factor(fbs))


   
      0   1
  0 164   0
  1   0  55
  2   0  36
  3   0  35
  4   0  13
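
The recoding can also be checked by hand on the five rows shown in section 1. This is a minimal sketch, assuming dplyr is installed; `demo` and `demo_cleaned` are hypothetical names, and only the columns involved in the recoding are kept.

```r
library(dplyr)

# Five rows from the head() output in section 1 (selected columns only)
demo <- data.frame(
  age   = c(63, 67, 67, 37, 41),
  sex   = c(1, 1, 1, 1, 0),
  cp    = c(1, 4, 4, 3, 2),
  fbs   = c(1, 0, 0, 0, 0),
  class = c(0, 2, 1, 0, 0)
)

# Same recoding as in the notebook cell, done in a single mutate call
demo_cleaned <- demo %>%
  mutate(
    hd  = ifelse(class > 0, 1, 0),  # any nonzero class counts as an event
    sex = factor(sex),
    cp  = factor(cp),
    fbs = factor(fbs)
  )

# Inspect the resulting column types
str(demo_cleaned)
```

For these rows, the class values 0, 2, 1, 0, 0 map to hd values 0, 1, 1, 0, 0, and sex, cp and fbs become factors while age stays numeric.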

## 3. Which clinical variables are associated with heart disease?

Now, let's use statistical tests to see which variables are related to heart disease. We can explore the association between each variable in the dataset and the outcome. Depending on the type of the data (i.e. continuous or categorical), we use a t-test or a chi-squared test to calculate the p-value.

In [12]:
# Does sex have an effect? Sex is a binary variable, so the appropriate test is Chi-squared test
chisq.test(hd_data_cleaned$sex, hd_data_cleaned$hd)

# Does age have an effect? Age is continuous, so we use t-test here
t.test(hd_data_cleaned$age ~ hd_data_cleaned$hd)

# What about thalach: maximum heart rate one can achieve during exercise?
t.test(hd_data_cleaned$thalach ~ hd_data_cleaned$hd)


	Pearson's Chi-squared test with Yates' continuity correction

data:  hd_data_cleaned$sex and hd_data_cleaned$hd
X-squared = 22.043, df = 1, p-value = 2.667e-06



	Welch Two Sample t-test

data:  hd_data_cleaned$age by hd_data_cleaned$hd
t = -4.0303, df = 300.93, p-value = 7.061e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -6.013385 -2.067682
sample estimates:
mean in group 0 mean in group 1 
       52.58537        56.62590 



	Welch Two Sample t-test

data:  hd_data_cleaned$thalach by hd_data_cleaned$hd
t = 7.8579, df = 272.27, p-value = 9.106e-14
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 14.32900 23.90912
sample estimates:
mean in group 0 mean in group 1 
        158.378         139.259 
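
The rule used above (a t-test for a continuous variable, a chi-squared test for a categorical one) can be sketched as a small helper that returns only the p-value. `p_value_for` is a hypothetical function, not part of the original notebook, and the `toy` data frame is synthetic and exists only to make the block runnable; it is not the Cleveland data and its results say nothing about heart disease.

```r
# Hypothetical helper: choose the test from the variable's type
# and return only the p-value of the resulting test object.
p_value_for <- function(x, outcome) {
  if (is.numeric(x)) {
    t.test(x ~ outcome)$p.value       # Welch two-sample t-test
  } else {
    chisq.test(x, outcome)$p.value    # Pearson chi-squared test
  }
}

# Synthetic toy data, for illustration only (NOT the Cleveland data)
set.seed(1)
toy <- data.frame(
  hd      = rep(c(0, 1), each = 50),
  thalach = c(rnorm(50, mean = 160, sd = 20), rnorm(50, mean = 140, sd = 20)),
  sex     = factor(rep(c(0, 1, 1, 1), times = 25))
)

# Continuous variable: dispatched to t.test
p_thalach <- p_value_for(toy$thalach, toy$hd)

# Factor variable: dispatched to chisq.test
p_sex <- p_value_for(toy$sex, toy$hd)

print(c(thalach = p_thalach, sex = p_sex))
```

With the real data, the same helper could be called as `p_value_for(hd_data_cleaned$age, hd_data_cleaned$hd)`, provided categorical variables have already been converted to factors as in section 2.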


*To be continued...*

---