In [5]:
import pandas as pd
import numpy as np
import seaborn as sns

heart = pd.read_csv('heart.csv')
yes_hd = heart[heart.heart_disease == 'presence']
no_hd = heart[heart.heart_disease == 'absence']
print(yes_hd.head())

    age     sex  trestbps   chol            cp  exang  fbs  thalach  \
1  67.0    male     160.0  286.0  asymptomatic    1.0  0.0    108.0   
2  67.0    male     120.0  229.0  asymptomatic    1.0  0.0    129.0   
6  62.0  female     140.0  268.0  asymptomatic    0.0  0.0    160.0   
8  63.0    male     130.0  254.0  asymptomatic    0.0  0.0    147.0   
9  53.0    male     140.0  203.0  asymptomatic    1.0  1.0    155.0   

  heart_disease  
1      presence  
2      presence  
6      presence  
8      presence  
9      presence  


The full dataset has been loaded for you as heart, then split into two subsets:

yes_hd, which contains data for patients with heart disease
no_hd, which contains data for patients without heart disease

For this project, we’ll investigate the following variables:

chol: serum cholesterol in mg/dl
fbs: an indicator for whether fasting blood sugar is greater than 120 mg/dl (1 = true; 0 = false)

To start, we’ll investigate cholesterol levels for patients with heart disease. Use the dataset yes_hd to save cholesterol levels for patients with heart disease as a variable named chol_hd, then calculate the mean cholesterol level and print it.

In [8]:
chol_hd = yes_hd.chol
mean_chol_hd = np.mean(chol_hd)
print(mean_chol_hd)

251.4748201438849


Do people with heart disease have high cholesterol levels (greater than or equal to 240 mg/dl) on average? Import the function from scipy.stats that you can use to test the following null and alternative hypotheses:

Null: People with heart disease have an average cholesterol level equal to 240 mg/dl
Alternative: People with heart disease have an average cholesterol level that is greater than 240 mg/dl
Note: When this exercise was written, the scipy.stats function we’ve been using did not have an alternative parameter to change the alternative hypothesis, so the intended approach was to run a two-sided test and divide the p-value in half. Halving is valid here because the average cholesterol level you calculated earlier for heart disease patients is greater than 240 mg/dl. SciPy 1.6.0 and later accept alternative='greater', which returns the one-sided p-value directly.
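
The halving rule in the note can be checked numerically. The sketch below uses a small made-up sample (hypothetical values, not the heart dataset) whose mean is above the null value, computes the t statistic by hand as (sample mean - null mean) / standard error, and compares the upper-tail p-value with half of the two-sided p-value from ttest_1samp. The names sample, null_mean, t_manual, p_one_sided, and p_halved are illustrative assumptions.

```python
import numpy as np
from scipy import stats

# Hypothetical cholesterol readings (mg/dl), made up for illustration only.
sample = np.array([251.0, 263.0, 238.0, 270.0, 245.0, 259.0, 242.0, 266.0])
null_mean = 240.0

# t statistic by hand: (sample mean - null mean) / standard error of the mean.
n = len(sample)
std_err = sample.std(ddof=1) / np.sqrt(n)
t_manual = (sample.mean() - null_mean) / std_err

# One-sided p-value: upper-tail area of the t distribution with n - 1 degrees of freedom.
p_one_sided = stats.t.sf(t_manual, df=n - 1)

# Default two-sided test from scipy, then halved as the note describes.
t_scipy, p_two_sided = stats.ttest_1samp(sample, null_mean)
p_halved = p_two_sided / 2

print(t_manual, t_scipy)      # the two t statistics should agree
print(p_one_sided, p_halved)  # the two p-values should agree
```

The two p-values agree only because the sample mean lies on the side named by the alternative hypothesis. If the sample mean were below 240 mg/dl, the one-sided p-value for "greater" would be 1 minus half the two-sided p-value instead.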

In [9]:
from scipy.stats import ttest_1samp

Run the hypothesis test indicated in task 3 and print out the p-value. Can you conclude that heart disease patients have an average cholesterol level significantly greater than 240 mg/dl? Use a significance threshold of 0.05.

In [13]:
# One-sided test: alternative='greater' returns the one-sided p-value directly
# (requires SciPy 1.6.0 or later), so the two-sided p-value need not be halved
tstat, pval = ttest_1samp(chol_hd, 240, alternative='greater')
print(pval)

0.0035411033905155707


Repeat steps 1-4 in order to run the same hypothesis test, but for patients in the sample who were not diagnosed with heart disease. Do patients without heart disease have average cholesterol levels significantly above 240 mg/dl?

In [14]:
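# Cholesterol levels for patients without heart disease
chol_no = no_hd.chol
mean_chol_no = np.mean(chol_no)
print(mean_chol_no)

# Same one-sided test against 240 mg/dl as for the heart disease group
tstat, pval = ttest_1samp(chol_no, 240, alternative='greater')
print(pval)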

242.640243902439
0.26397120232220506


Let’s now return to the full dataset (saved as heart). How many patients are there in this dataset? Save the number of patients as num_patients and print it out.

In [15]:
num_patients = len(heart)
print(num_patients)

303


Remember that the fbs column of this dataset indicates whether or not a patient’s fasting blood sugar was greater than 120 mg/dl (1 means that their fasting blood sugar was greater than 120 mg/dl; 0 means it was less than or equal to 120 mg/dl).

Calculate the number of patients with fasting blood sugar greater than 120. Save this number as num_highfbs_patients and print it out.

In [17]:
num_highfbs_patients = np.sum(heart.fbs == 1)
print(num_highfbs_patients)

45


Sometimes, part of an analysis will involve comparing a sample to known population values to see if the sample appears to be representative of the general population.

By some estimates, about 8% of the U.S. population had diabetes (diagnosed or undiagnosed) in 1988 when this data was collected. While there are multiple tests that contribute to a diabetes diagnosis, fasting blood sugar levels greater than 120 mg/dl can be indicative of diabetes (or at least, pre-diabetes). If this sample were representative of the population, approximately how many people would you expect to have diabetes? Calculate and print out this number.

Is this value similar to the number of patients with a fasting blood sugar above 120 mg/dl, or different?

In [18]:
print('The number of people in this dataset that we estimate have diabetes is ' + str(num_patients*0.08))

The number of people in this dataset that we estimate have diabetes is 24.240000000000002


Does this sample come from a population in which the rate of fbs > 120 mg/dl is equal to 8%? Import the function from scipy.stats that you can use to test the following null and alternative hypotheses:

Null: This sample was drawn from a population where 8% of people have fasting blood sugar > 120 mg/dl
Alternative: This sample was drawn from a population where more than 8% of people have fasting blood sugar > 120 mg/dl
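
To make the one-sided p-value concrete: under the null hypothesis, the number of patients with fbs > 120 mg/dl follows a Binomial(303, 0.08) distribution, and the p-value is the probability of seeing 45 or more such patients. The sketch below computes that tail probability directly with scipy.stats.binom, using the counts printed earlier in this notebook. The variable names are illustrative assumptions, and this is one way to look at the calculation, not necessarily how the testing function implements it.

```python
from scipy.stats import binom

n_patients = 303   # total patients, from the notebook output above
n_high_fbs = 45    # patients with fbs > 120 mg/dl, from the notebook output above
null_rate = 0.08   # population rate under the null hypothesis

# P(X >= 45) equals P(X > 44), so pass k - 1 to the survival function.
p_tail = binom.sf(n_high_fbs - 1, n_patients, null_rate)

# Cross-check: sum the probability mass over every outcome from 45 up to 303.
p_sum = sum(binom.pmf(k, n_patients, null_rate)
            for k in range(n_high_fbs, n_patients + 1))

print(p_tail, p_sum)
```

Passing n_high_fbs - 1 matters: the survival function gives P(X > k), so using 45 directly would leave out the observed count itself.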

In [19]:
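from scipy.stats import binomtest

In [21]:
# binom_test was deprecated in SciPy 1.10 and removed in 1.12;
# binomtest (SciPy 1.7.0 or later) returns a result object with a pvalue attribute
pval = binomtest(num_highfbs_patients, num_patients, p=0.08, alternative='greater').pvalue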
print(pval)
if pval >= 0.05:
    print("The null hypothesis cannot be rejected.")
else:
    print("The null hypothesis is rejected in favor of the alternative.")

4.689471951448875e-05
The null hypothesis is rejected in favor of the alternative.
