# Heart Disease Research Part I 

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import binom_test
from scipy.stats import ttest_1samp

In this project, you’ll investigate data from a sample of patients who were evaluated for heart disease at the Cleveland Clinic Foundation. The data was downloaded from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Heart+Disease) and then cleaned for analysis.

The principal investigators responsible for data collection were:

1.    Hungarian Institute of Cardiology, Budapest: Andras Janosi, M.D.
2.    University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
3.    University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
4.    V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.


In [2]:
# load data
heart = pd.read_csv('heart_disease.csv')
yes_hd = heart[heart.heart_disease == 'presence']
no_hd = heart[heart.heart_disease == 'absence']
heart.head()

Unnamed: 0,age,sex,trestbps,chol,cp,exang,fbs,thalach,heart_disease
0,63.0,male,145.0,233.0,typical angina,0.0,1.0,150.0,absence
1,67.0,male,160.0,286.0,asymptomatic,1.0,0.0,108.0,presence
2,67.0,male,120.0,229.0,asymptomatic,1.0,0.0,129.0,presence
3,37.0,male,130.0,250.0,non-anginal pain,0.0,0.0,187.0,absence
4,41.0,female,130.0,204.0,atypical angina,0.0,0.0,172.0,absence


## Cholesterol Analysis
1. The full dataset has been loaded for you as `heart`, then split into two subsets:

    -    `yes_hd`, which contains data for patients with heart disease
    -    `no_hd`, which contains data for patients without heart disease

   For this project, we’ll investigate the following variables:

    -    `chol`: serum cholesterol in mg/dl
    -    `fbs`: An indicator for whether fasting blood sugar is greater than 120 mg/dl (`1` = true; `0` = false)

   To start, we’ll investigate cholesterol levels for patients with heart disease. Use the dataset `yes_hd` to save cholesterol levels for patients with heart disease as a variable named `chol_hd`.

In [3]:
yes_hd.head()

Unnamed: 0,age,sex,trestbps,chol,cp,exang,fbs,thalach,heart_disease
1,67.0,male,160.0,286.0,asymptomatic,1.0,0.0,108.0,presence
2,67.0,male,120.0,229.0,asymptomatic,1.0,0.0,129.0,presence
6,62.0,female,140.0,268.0,asymptomatic,0.0,0.0,160.0,presence
8,63.0,male,130.0,254.0,asymptomatic,0.0,0.0,147.0,presence
9,53.0,male,140.0,203.0,asymptomatic,1.0,1.0,155.0,presence


In [4]:
no_hd.head()

Unnamed: 0,age,sex,trestbps,chol,cp,exang,fbs,thalach,heart_disease
0,63.0,male,145.0,233.0,typical angina,0.0,1.0,150.0,absence
3,37.0,male,130.0,250.0,non-anginal pain,0.0,0.0,187.0,absence
4,41.0,female,130.0,204.0,atypical angina,0.0,0.0,172.0,absence
5,56.0,male,120.0,236.0,atypical angina,0.0,0.0,178.0,absence
7,57.0,female,120.0,354.0,asymptomatic,1.0,0.0,163.0,absence


In [5]:
chol_hd = yes_hd.chol

In [6]:
chol_hd.head()

1    286.0
2    229.0
6    268.0
8    254.0
9    203.0
Name: chol, dtype: float64

2. In general, total cholesterol over 240 mg/dl is considered “high” (and therefore unhealthy). Calculate the mean cholesterol level for patients who were diagnosed with heart disease and print it out. Is it higher than 240 mg/dl?

In [8]:
chol_mean = np.mean(chol_hd)

if chol_mean > 240:
    print('Mean cholesterol for patients with heart disease is higher than 240 mg/dl at ' + str(round(chol_mean, 2)) + ' mg/dl' + '\n')
else:
    print('Mean cholesterol for patients with heart disease is lower than 240 mg/dl at ' + str(round(chol_mean, 2)) + ' mg/dl')

Mean cholesterol for patients with heart disease is higher than 240 mg/dl at 251.47 mg/dl



3. Do people with heart disease have high cholesterol levels (greater than or equal to 240 mg/dl) on average? Import the function from `scipy.stats` that you can use to test the following null and alternative hypotheses:

    -    Null: People with heart disease have an average cholesterol level equal to 240 mg/dl
    -    Alternative: People with heart disease have an average cholesterol level that is **greater** than 240 mg/dl

   Note: Unfortunately, the `scipy.stats` function we’ve been using does not (at the time of writing) have an alternative parameter to change the alternative hypothesis for this test. Therefore, you’ll have to run a two-sided test. However, since you calculated earlier that the average cholesterol level for heart disease patients is greater than 240 mg/dl, you can calculate the p-value for the one-sided test indicated above simply by dividing the two-sided p-value in half.

   `ttest_1samp` was imported in the first cell.

4. Run the hypothesis test indicated in task 3 and print out the p-value. Can you conclude that heart disease patients have an average cholesterol level significantly greater than 240 mg/dl? Use a significance threshold of 0.05.

   `ttest_1samp` has two inputs: 
    - the sample of values (in this case, the cholesterol levels for patients with heart disease) 
    - the null value (in this case, 240). 

   It has two outputs: the t-statistic and a p-value.

   When you divide the p-value by two (in order to run the one-sided test), you should get a p-value of 0.0035. This is less than 0.05, suggesting that heart disease patients have an average cholesterol level significantly higher than 240 mg/dl.

In [9]:
tstat, pval = ttest_1samp(chol_hd, 240)

if pval/2 > 0.05:
    print('The p-value of ' + str(round(pval/2,4)) + ' is not significant.')
else:
    print('The p-value of ' + str(round(pval/2,4)) + ' is significant.')

The p-value of 0.0035 is significant.
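
Newer SciPy releases (1.6 and later) accept an `alternative` argument in `ttest_1samp`, so the one-sided test can be requested directly instead of halving the two-sided p-value. The sketch below uses synthetic data as a stand-in for `chol_hd` (the values, seed, and sample size are assumptions, not the patient data) and checks that the two approaches agree.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Synthetic stand-in for chol_hd; NOT the real patient data (assumption for illustration).
rng = np.random.default_rng(0)
sample = rng.normal(loc=250, scale=50, size=100)

# Two-sided test, as in the cell above.
tstat, p_two_sided = ttest_1samp(sample, 240)

# One-sided test requested directly (needs SciPy >= 1.6).
_, p_greater = ttest_1samp(sample, 240, alternative='greater')

# Halving is only valid when the t-statistic points in the direction
# of the alternative; otherwise the one-sided p-value is 1 - p/2.
p_halved = p_two_sided / 2 if tstat > 0 else 1 - p_two_sided / 2

print('direct one-sided p-value:', p_greater)
print('halved two-sided p-value:', p_halved)
```

If the sample mean had been below 240, dividing by two would have given the wrong answer, which is why the direct `alternative='greater'` call is the safer habit.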


5. Repeat steps 1-4 in order to run the same hypothesis test, but for patients in the sample who were not diagnosed with heart disease. Do patients without heart disease have average cholesterol levels significantly above 240 mg/dl?


In [11]:
chol_hd_n = no_hd.chol

In [12]:
chol_mean_n = np.mean(chol_hd_n)

if chol_mean_n > 240:
    print('Mean cholesterol for patients without heart disease is higher than 240 mg/dl at ' + str(round(chol_mean_n,2)) + ' mg/dl' + '\n')
else:
    print('Mean cholesterol for patients without heart disease is lower than 240 mg/dl at ' + str(round(chol_mean_n,2)) + ' mg/dl')

Mean cholesterol for patients without heart disease is higher than 240 mg/dl at 242.64 mg/dl



In [14]:
tstat_n, pval_n = ttest_1samp(chol_hd_n, 240)

if pval_n/2 > 0.05:
    print('The p-value of ' + str(round(pval_n/2,4)) + ' is not significant.')
else:
    print('The p-value of ' + str(round(pval_n/2,4)) + ' is significant.')

The p-value of 0.264 is not significant.
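
Both group means sit above 240 mg/dl, yet only one result is significant. The test weighs the gap between the sample mean and 240 against the standard error of the mean, so a small gap can easily be explained by sampling noise. As a rough sketch of what `ttest_1samp` computes, the snippet below rebuilds the t-statistic and two-sided p-value by hand on synthetic data (the values, seed, and sample size are assumptions, not the patient data).

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for a cholesterol sample; NOT the real data (assumption).
rng = np.random.default_rng(1)
sample = rng.normal(loc=243, scale=50, size=100)
mu0 = 240  # null value from the hypothesis

n = len(sample)
# Standard error of the mean, using the sample standard deviation (ddof=1).
std_err = sample.std(ddof=1) / np.sqrt(n)

# t-statistic: distance of the sample mean from the null value, in standard errors.
t_manual = (sample.mean() - mu0) / std_err

# Two-sided p-value from the t distribution with n - 1 degrees of freedom.
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# Same quantities from SciPy, for comparison.
t_scipy, p_scipy = stats.ttest_1samp(sample, mu0)

print('manual:', t_manual, p_manual)
print('scipy: ', t_scipy, p_scipy)
```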


## Fasting Blood Sugar Analysis
6. Let’s now return to the full dataset (saved as `heart`). How many patients are there in this dataset? Save the number of patients as `num_patients` and print it out.

In [15]:
num_patients = len(heart)
print('Number of patients: ' + str(num_patients))

Number of patients: 303


7. Remember that the `fbs` column of this dataset indicates whether or not a patient’s fasting blood sugar was greater than 120 mg/dl (`1` means that their fasting blood sugar was greater than 120 mg/dl; `0` means it was less than or equal to 120 mg/dl).

   Calculate the number of patients with fasting blood sugar greater than 120. Save this number as `num_highfbs_patients` and print it out.

In [16]:
num_highfbs_patients = np.sum(heart.fbs)
print('The number of patients with high blood sugar is ' + str(num_highfbs_patients) + '\n')

The number of patients with high blood sugar is 45.0
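
Because `fbs` is stored as floats, summing it returns `45.0` rather than an integer. One possible alternative is to count with a boolean mask, which yields an integer count and also gives the proportion directly. The small Series below is a hypothetical miniature of the `fbs` column, used only for illustration.

```python
import pandas as pd

# Hypothetical miniature of the fbs column (1.0 = above 120 mg/dl, 0.0 = not).
fbs = pd.Series([0.0, 1.0, 0.0, 0.0, 1.0])

# Boolean mask: True where fasting blood sugar is flagged as high.
is_high = fbs == 1

# Summing booleans counts the True values; int() gives a plain Python integer.
num_high = int(is_high.sum())

# The mean of a boolean mask is the proportion of True values.
share_high = is_high.mean()

print(num_high, share_high)
```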



8. Sometimes, part of an analysis will involve comparing a sample to known population values to see if the sample appears to be representative of the general population.

   By some estimates, about 8% of the U.S. population had diabetes (diagnosed or undiagnosed) in 1988 when this data was collected. While there are multiple tests that contribute to a diabetes diagnosis, fasting blood sugar levels greater than 120 mg/dl can be indicative of diabetes (or at least, pre-diabetes). If this sample were representative of the population, approximately how many people would you expect to have diabetes? Calculate and print out this number.

   Is this value similar to the number of patients with a fasting blood sugar above 120 mg/dl, or different?

In [19]:
# observed percentage of patients with fasting blood sugar > 120 mg/dl
percent_highfbs = round(((num_highfbs_patients/num_patients)*100), 1)
print('Observed percent of patients with high fasting blood sugar: ' + str(percent_highfbs) + '%' + '\n')
# expected number of patients with diabetes if the sample matched the 8% population rate
expected_diab = round((0.08*num_patients), 1)
print('Expected number of patients with diabetes at an 8% rate: ' + str(expected_diab) + '\n')

Observed percent of patients with high fasting blood sugar: 14.9%

Expected number of patients with diabetes at an 8% rate: 24.2
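
To get a feel for how far 45 is from the roughly 24 patients expected at an 8% rate, the gap can be compared with the spread of a binomial count. This is an informal sketch, not the hypothesis test itself; it uses only the figures printed above (303 patients, 45 with high fasting blood sugar) and the 8% rate from the text.

```python
import math

num_patients = 303          # from the output above
num_highfbs_patients = 45   # from the output above
rate = 0.08                 # population estimate quoted in the text

# Expected count under the 8% rate: n * p.
expected = rate * num_patients

# Standard deviation of a binomial count: sqrt(n * p * (1 - p)).
sd = math.sqrt(num_patients * rate * (1 - rate))

# How many standard deviations the observed count lies above the expectation.
z = (num_highfbs_patients - expected) / sd

print('expected count:', round(expected, 2))
print('standard deviation:', round(sd, 2))
print('observed count is', round(z, 2), 'standard deviations above expected')
```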



9. Does this sample come from a population in which the rate of fbs > 120 mg/dl is equal to 8%? Import the function from `scipy.stats` that you can use to test the following null and alternative hypotheses:

    -    Null: This sample was drawn from a population where 8% of people have fasting blood sugar > 120 mg/dl
    -    Alternative: This sample was drawn from a population where more than 8% of people have fasting blood sugar > 120 mg/dl

   `binom_test` was imported in the first cell.

10. Run the hypothesis test indicated in task 9 and print out the p-value. Using a significance threshold of 0.05, can you conclude that this sample was drawn from a population where the rate of fasting blood sugar > 120 mg/dl is significantly greater than 8%?

In [20]:
# binom_test takes the number of observed successes, the number of trials, and the success probability under the null.
p_value_1sided = binom_test(45, n=303, p=0.08, alternative='greater')

if p_value_1sided > 0.05:
    print(str(p_value_1sided) + ' is greater than 0.05. We cannot conclude that this sample comes from a population where the rate of fasting blood sugar > 120 mg/dl is greater than 8%. \n')
else:
    print(str(p_value_1sided) + ' is less than 0.05. This sample appears to come from a population where the rate of fasting blood sugar > 120 mg/dl is significantly greater than 8%. \n')

4.689471951448875e-05 is less than 0.05. This sample appears to come from a population where the rate of fasting blood sugar > 120 mg/dl is significantly greater than 8%. 
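
`binom_test` was deprecated in SciPy 1.10 and removed in 1.12, so this cell may fail on a recent installation. A minimal sketch of the same test with its replacement, `binomtest` (available since SciPy 1.7), is shown below. It returns a result object rather than a bare number, so the p-value is read from the `.pvalue` attribute. The cross-check against the binomial tail probability is an addition for illustration.

```python
from scipy.stats import binom, binomtest

# Same inputs as the cell above: 45 successes out of 303 trials, null rate 8%.
result = binomtest(45, n=303, p=0.08, alternative='greater')
p_value = result.pvalue

# Cross-check: the one-sided p-value is P(X >= 45) for X ~ Binomial(303, 0.08),
# which equals the survival function evaluated at 44.
p_tail = binom.sf(44, 303, 0.08)

print('binomtest p-value:', p_value)
print('binomial tail probability:', p_tail)
```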

