In [1]:
# import libraries
import pandas as pd
import numpy as np
from scipy.stats import ttest_1samp, binomtest

# load data
heart = pd.read_csv('heart_data.csv')
yes_hd = heart[heart.heart_disease == 'presence']
no_hd = heart[heart.heart_disease == 'absence']

In [2]:
yes_hd.head()

Unnamed: 0,age,sex,trestbps,chol,cp,exang,fbs,thalach,heart_disease
1,67.0,male,160.0,286.0,asymptomatic,1.0,0.0,108.0,presence
2,67.0,male,120.0,229.0,asymptomatic,1.0,0.0,129.0,presence
6,62.0,female,140.0,268.0,asymptomatic,0.0,0.0,160.0,presence
8,63.0,male,130.0,254.0,asymptomatic,0.0,0.0,147.0,presence
9,53.0,male,140.0,203.0,asymptomatic,1.0,1.0,155.0,presence


In [3]:
chol_hd = yes_hd.chol
np.mean(chol_hd)

251.4748201438849

## Do people with heart disease have high cholesterol levels (greater than or equal to 240 mg/dl) on average?
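The hypotheses are: Null: the average cholesterol level of heart disease patients is 240 mg/dl. Alternative: the average cholesterol level of heart disease patients is greater than 240 mg/dl.

By default, the ttest_1samp function from scipy.stats runs a two-sided test. (SciPy 1.6.0 and later also accept an alternative parameter for one-sided tests; older versions do not.) The cell below runs the two-sided test. Since you calculated earlier that the average cholesterol level for heart disease patients is greater than 240 mg/dl, the sample mean lies on the side of the alternative hypothesis, so you can obtain the p-value for the one-sided test indicated above by dividing the two-sided p-value in half.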

In [4]:
tstat, pval = ttest_1samp(chol_hd, 240)
pval / 2


0.0035411033905155703

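The p-value (about 0.0035) is less than 0.05, so we reject the null hypothesis that the average cholesterol level of patients with heart disease is 240 mg/dl, in favor of the alternative that it is higher.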
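
As a sanity check on the halving step, the one-sided p-value can also be requested directly. The sketch below assumes SciPy 1.6.0 or later, where ttest_1samp accepts alternative='greater'. It uses synthetic values (chol_sample is a made-up stand-in, not the heart_data.csv column), so the printed numbers will not match the results above. It also shows that halving is only valid when the t-statistic is positive; otherwise the one-sided p-value is one minus half the two-sided value.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Synthetic stand-in for a cholesterol column (assumption: NOT the real data)
rng = np.random.default_rng(0)
chol_sample = rng.normal(loc=250, scale=50, size=100)

# Two-sided test against a hypothesized mean of 240 mg/dl
tstat, pval_two_sided = ttest_1samp(chol_sample, 240)

# Convert to a one-sided ("greater") p-value by hand.
# Halving applies only when the sample mean is above 240 (tstat > 0).
if tstat > 0:
    pval_manual = pval_two_sided / 2
else:
    pval_manual = 1 - pval_two_sided / 2

# Ask SciPy for the one-sided test directly (requires SciPy >= 1.6.0)
tstat_one, pval_direct = ttest_1samp(chol_sample, 240, alternative='greater')

print(pval_manual, pval_direct)
```

The two printed values should agree up to floating-point error, which is why dividing by two in the cell above gives the same answer as a one-sided test would.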
## Repeat steps 1-4 in order to run the same hypothesis test, but for patients in the sample who were not diagnosed with heart disease

In [5]:
chol_hd_no = no_hd.chol
np.mean(chol_hd_no)

242.640243902439

In [6]:
tstat, pval = ttest_1samp(chol_hd_no, 240)
pval / 2


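0.26397120232220506

The p-value is greater than 0.05, so we fail to reject the null hypothesis: there is no significant evidence that the average cholesterol level of patients without heart disease is greater than 240 mg/dl.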

# Fasting Blood Sugar Analysis

In [7]:
num_patients = len(heart)
num_patients

303

In [8]:
heart.head()

Unnamed: 0,age,sex,trestbps,chol,cp,exang,fbs,thalach,heart_disease
0,63.0,male,145.0,233.0,typical angina,0.0,1.0,150.0,absence
1,67.0,male,160.0,286.0,asymptomatic,1.0,0.0,108.0,presence
2,67.0,male,120.0,229.0,asymptomatic,1.0,0.0,129.0,presence
3,37.0,male,130.0,250.0,non-anginal pain,0.0,0.0,187.0,absence
4,41.0,female,130.0,204.0,atypical angina,0.0,0.0,172.0,absence


In [9]:
num_high_fbs = len(heart[heart.fbs == 1.0])
num_high_fbs

45

By some estimates, about 8% of the U.S. population had diabetes (diagnosed or undiagnosed) in 1988 when this data was collected. While there are multiple tests that contribute to a diabetes diagnosis, fasting blood sugar levels greater than 120 mg/dl can be indicative of diabetes (or at least, pre-diabetes). If this sample were representative of the population, approximately how many people would you expect to have diabetes? Calculate and print out this number.

Is this value similar to, or different from, the number of patients with a fasting blood sugar above 120 mg/dl?

In [10]:
num_patients * 0.08

24.240000000000002
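
To make the comparison concrete, the expected count can be set next to the observed count and the two rates. This is a small sketch that hard-codes the numbers reported in the cells above (303 patients, 45 with fbs > 120 mg/dl); the names null_rate, expected_count and observed_rate are introduced here for illustration only.

```python
# Values copied from the outputs above
num_patients = 303   # total patients in the sample
num_high_fbs = 45    # patients with fasting blood sugar > 120 mg/dl
null_rate = 0.08     # assumed population rate of diabetes

# Count we would expect if the sample matched the population rate
expected_count = num_patients * null_rate

# Rate actually observed in the sample
observed_rate = num_high_fbs / num_patients

print(f"expected count: {expected_count:.2f}, observed count: {num_high_fbs}")
print(f"assumed rate: {null_rate:.3f}, observed rate: {observed_rate:.3f}")
```

The observed count (45) is well above the expected count (about 24), which motivates the formal test that follows.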

Does this sample come from a population in which the rate of fbs > 120 mg/dl is equal to 8%? Use the binomtest function (imported from scipy.stats in the first cell) to test the following null and alternative hypotheses:

Null: This sample was drawn from a population where 8% of people have fasting blood sugar > 120 mg/dl
Alternative: This sample was drawn from a population where more than 8% of people have fasting blood sugar > 120 mg/dl

In [11]:
# binomtest returns a result object, not a bare p-value
result = binomtest(num_high_fbs, num_patients, p=0.08, alternative='greater')
result

BinomTestResult(k=45, n=303, alternative='greater', statistic=0.1485148514851485, pvalue=4.6894719514488777e-05)
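
The object returned by binomtest carries the p-value as an attribute, and that p-value is the upper-tail probability of a binomial distribution. The sketch below hard-codes the counts from above (45 of 303) and assumes a SciPy version that provides binomtest (1.7.0 or later). It reads the p-value from the result, recomputes it with binom.sf as a cross-check, and asks for a confidence interval on the proportion; because the test is one-sided ('greater'), that interval is one-sided as well.

```python
from scipy.stats import binom, binomtest

k, n, p0 = 45, 303, 0.08  # counts and null rate taken from the cells above

result = binomtest(k, n, p=p0, alternative='greater')

# The p-value lives on the result object
pvalue = result.pvalue

# Cross-check: P(X >= k) for X ~ Binomial(n, p0) equals sf(k - 1)
tail_prob = binom.sf(k - 1, n, p0)

# Confidence interval for the true proportion (one-sided for 'greater')
ci = result.proportion_ci(confidence_level=0.95)

print(pvalue, tail_prob)
print(ci.low, ci.high)
```

The p-value of about 4.7e-05 is far below 0.05, so we reject the null hypothesis: the sample appears to come from a population where more than 8% of people have fasting blood sugar above 120 mg/dl.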