In [None]:
# Initialize OK
from client.api.notebook import Notebook
ok = Notebook('lab07.ok')

# Lab 7: The Framingham Heart Study

As usual, run the cell below to prepare the lab and the automatic tests.

In [1]:
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

from client.api.notebook import Notebook
ok = Notebook('lab07.ok')
_ = ok.submit()

The [Framingham Heart Study](https://en.wikipedia.org/wiki/Framingham_Heart_Study) is an observational study of cardiovascular health. Cardiovascular disease is one of the leading causes of death in the United States, as you will see in tomorrow’s homework. The initial study followed over 5,000 volunteers for several decades, and follow-up studies even looked at their descendants. In this lab, we'll investigate some of its key findings about diet, cholesterol, and heart disease.

Run the cell below to examine data for almost 4,000 subjects from the first wave of the study, collected in 1956.

In [25]:
framingham = Table.read_table('framingham.csv')
framingham

Each row contains data from one subject. The first seven columns describe the subject at the time of their initial medical exam at the start of the study. The last column, `ANYCHD`, tells us whether the subject developed some form of heart disease at any point after the start of the study.

You may have noticed that the table contains fewer rows than subjects in the original study: this is because we are excluding subjects who already had heart disease as well as subjects with missing data.

### Section 1: Diabetes and the population

Before we begin our investigation into cholesterol, we'll first look at some limitations of this dataset. In particular, we will investigate ways in which this is or isn't a representative sample of the population by examining the number of subjects with diabetes.

[According to the CDC](https://www.cdc.gov/diabetes/statistics/slides/long_term_trends.pdf), the prevalence of diagnosed diabetes (i.e., the proportion of the population who have it) in the U.S. around this time was 0.0093. We are going to conduct a hypothesis test with the following null and alternative hypotheses:

**Null Hypothesis**: The probability that a participant within the Framingham Study has diabetes is equal to the prevalence of diagnosed diabetes within the population (i.e., any difference between the study prevalence and population prevalence is due to random chance in the selection of the participants).

**Alternative Hypothesis**: The probability that a participant within the Framingham Study has diabetes is different from the prevalence of diagnosed diabetes within the population.

We are going to use the absolute distance between the observed prevalence and the true population prevalence as our test statistic. The column `DIABETES` in the `framingham` table contains a `1` for subjects with diabetes and a `0` for those without.
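
As a rough illustration of this kind of statistic, here is a minimal sketch on made-up data. The array `toy_indicator` and the value `toy_population_prevalence` are hypothetical and are not taken from the study or from the CDC figure; the sketch uses plain NumPy rather than the `framingham` table.

```python
import numpy as np

# Hypothetical toy indicator column: 1 = has the condition, 0 = does not.
toy_indicator = np.array([0, 0, 1, 0, 0, 0, 0, 1, 0, 0])

# Assumed reference prevalence for this toy example only.
toy_population_prevalence = 0.1

# The mean of a 0/1 column is the proportion of 1s.
toy_sample_prevalence = np.mean(toy_indicator)

# Test statistic: absolute distance between the two prevalences.
toy_distance = abs(toy_sample_prevalence - toy_population_prevalence)
```

Because the column holds only 0s and 1s, taking its mean is the same as computing the proportion of subjects with the condition.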

**Question 1**: What is the observed value of the statistic in the data from the Framingham Study?

<!--
BEGIN QUESTION
name: q2_1_1
manual: false
-->

In [26]:
observed_diabetes_distance = ...
observed_diabetes_distance

In [None]:
ok.grade("q2_1_1");

**Question 2**: The array `diabetes_proportions` contains the proportions of the population without and with diabetes. Define the function `diabetes_statistic`, which should return exactly one simulated statistic under the null hypothesis. Remember that our statistic is the absolute distance between the observed prevalence and the true population prevalence.
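
To make "one simulated statistic under the null" concrete, here is a sketch on toy numbers. It uses NumPy's multinomial sampler as a stand-in; the `datascience` library also offers `sample_proportions(sample_size, probabilities)`, which plays a similar role. The function name `simulate_one_distance`, the sample size of 500, and the proportions in `toy_null` are all assumptions made for illustration.

```python
import numpy as np

def simulate_one_distance(sample_size, null_proportions, rng):
    """Return one simulated |sample prevalence - null prevalence|."""
    # Draw category counts for one sample under the null,
    # then convert the counts to proportions.
    counts = rng.multinomial(sample_size, null_proportions)
    drawn_proportions = counts / sample_size
    # Index 1 holds the "with condition" category,
    # matching the order (without, with).
    return abs(drawn_proportions[1] - null_proportions[1])

rng = np.random.default_rng(0)        # seeded so the sketch is repeatable
toy_null = np.array([0.9, 0.1])       # hypothetical null proportions
one_stat = simulate_one_distance(500, toy_null, rng)
```

Each call draws a fresh sample, so calling the function repeatedly builds up an empirical distribution of the statistic under the null.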

<!--
BEGIN QUESTION
name: q2_1_2
manual: false
-->

In [29]:
diabetes_proportions = make_array(.9907, .0093)

def diabetes_statistic():
    sample_prevalence = ...
    simulated_stat = ...
    ...

**Question 3**: Complete the following code to simulate 5000 values of the statistic under the null hypothesis.

<!--
BEGIN QUESTION
name: q2_1_3
manual: false
-->

In [30]:
diabetes_simulated_stats = make_array()

for i in np.arange(5000): 
    simulated_stat = ...
    diabetes_simulated_stats = ...
    
diabetes_simulated_stats

In [None]:
ok.grade("q2_1_3");

**Question 4**: Run the following cell to generate a histogram of the simulated values of your statistic, along with the observed value.

*Check the result of this cell with a staff member and explain what this histogram displays.*

<!--
BEGIN QUESTION
name: q2_1_4
manual: true
-->
<!-- EXPORT TO PDF -->

In [34]:
Table().with_column('Simulated distance to true prevalence', diabetes_simulated_stats).hist()
plots.scatter(observed_diabetes_distance, 0, color='red', s=30);

**Question 5**: Based on the results of the test and the empirical distribution of the test statistic under the null, should you reject the null hypothesis?

<!--
BEGIN QUESTION
name: q2_1_5
manual: true
-->
<!-- EXPORT TO PDF -->

*Write your answer here, replacing this text.*

**Question 6**: Why might there be a difference between the population and the sample from the Framingham Study? Assign the name `framingham_diabetes_possibilities` to an array of the following explanations that **are possible and consistent** with the observed data and hypothesis test results.

1. Diabetes was under-diagnosed in the population (i.e., there were a lot of people in the population who had diabetes but weren't diagnosed). By contrast, the Framingham participants were less likely to go undiagnosed because they had regular medical examinations as part of the study.
2. The relatively wealthy population in Framingham ate a luxurious diet high in sugar (high-sugar diets are a known cause of diabetes).
3. The Framingham Study subjects were older on average than the general population, and therefore more likely to have diabetes.

<!--
BEGIN QUESTION
name: q2_1_6
manual: false
-->

In [35]:
framingham_diabetes_possibilities = ...
framingham_diabetes_possibilities

In [None]:
ok.grade("q2_1_6");

In real-world studies, getting a truly representative random sample of the population is often incredibly difficult. Even just to accurately represent all Americans, a truly random sample would need to examine people across geographical, socioeconomic, community, and class lines (just to name a few). For a study like this, scientists would also need to make sure the medical exams were standardized and consistent across the different people being examined. In other words, there's a tradeoff between taking a more representative random sample and the cost of collecting all the data from the sample.

The Framingham study collected high-quality medical data from its subjects, even if the subjects may not be a perfect representation of the population of all Americans. This is a common issue that data scientists face: while the available data aren't perfect, they're the best we have. The Framingham study is generally considered the best in its class, so we'll continue working with it while keeping its limitations in mind.

(For more on representation in medical study samples, you can read these recent articles from [NPR](https://www.npr.org/sections/health-shots/2015/12/16/459666750/clinical-trials-still-dont-reflect-the-diversity-of-america) and [Scientific American](https://www.scientificamerican.com/article/clinical-trials-have-far-too-little-racial-and-ethnic-diversity/)).

### Section 2: Cholesterol and Heart Disease

In the remainder of this lab, we are going to examine one of the main findings of the Framingham study: an association between serum cholesterol (i.e., how much cholesterol is in someone's blood) and whether or not that person develops heart disease. [These sections](https://www.inferentialthinking.com/chapters/12/Comparing_Two_Samples.html) of the textbook should be helpful for the rest of the lab if you run into difficulties.

We'll use the following null and alternative hypotheses:

**Null Hypothesis:** In the population, the cholesterol levels of people who get heart disease come from the same distribution as the cholesterol levels of people who do not.

**Alternative Hypothesis:** The cholesterol levels of people in the population who get heart disease are higher, on average, than the cholesterol levels of people who do not.



**Question 1:** From the provided null and alternative hypotheses, does it seem reasonable to use A/B testing to determine which hypothesis is more consistent with the data? Assign the variable `ab_reasonable` to `True` if it seems reasonable and `False` otherwise.

<!--
BEGIN QUESTION
name: q2_2_1
manual: false
-->

In [38]:
ab_reasonable = ...
ab_reasonable

In [None]:
ok.grade("q2_2_1");

To test our hypothesis, we’ll generally want to use a test statistic where larger values of the test statistic will point towards the alternative. In this case, since the alternative states that the cholesterol levels of people who get heart disease are **higher** on average than those who do not, we should use a statistic that maintains the directionality implied in the alternative. Let’s use the **difference between the average cholesterol levels of those who get heart disease and those who don’t.**
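
A directional statistic of this kind can be sketched on toy data as follows. The helper `difference_of_means` and the arrays `toy_values` and `toy_labels` are hypothetical; the sketch uses NumPy arrays instead of a `datascience` table, so it shows the idea rather than the approach the lab asks for.

```python
import numpy as np

def difference_of_means(values, labels):
    """Mean of values where labels == 1 minus mean where labels == 0."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    # Subtract in this order so that larger values favor the alternative
    # ("group 1 is higher on average").
    return values[labels == 1].mean() - values[labels == 0].mean()

toy_values = [200, 260, 240, 210, 190, 250]   # made-up measurements
toy_labels = [0, 1, 1, 0, 0, 1]               # made-up group labels
toy_diff = difference_of_means(toy_values, toy_labels)
```

Keeping the sign (rather than taking an absolute value) is what preserves the direction stated in the alternative hypothesis.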

**Question 2**: Write a function that returns the test statistic described above. It should take a table with two columns, `TOTCHOL` and `ANYCHD`. 

*Hint: You will need to use the `group` table method to calculate your test statistic.*

<!--
BEGIN QUESTION
name: q2_2_2
manual: false
-->

In [41]:
def compute_framingham_test_statistic(tbl):
    ...

In [None]:
ok.grade("q2_2_2");

**Question 3**: Use the function you defined above to compute the observed test statistic, and assign it to the name `framingham_observed_statistic`.

<!--
BEGIN QUESTION
name: q2_2_3
manual: false
-->

In [44]:
framingham_observed_statistic = ...
framingham_observed_statistic

In [None]:
ok.grade("q2_2_3");

Now that we have defined hypotheses and a test statistic, we are ready to conduct a hypothesis test. We'll start by defining a function to simulate the test statistic under the null hypothesis, and then use that function 1000 times to understand the distribution under the null hypothesis.

**Question 4**: Write a function to simulate the test statistic under the null hypothesis. 

The `simulate_framingham_null` function should simulate the null hypothesis once (not 1000 times) and return the value of the test statistic for that simulated sample.
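
Under a null hypothesis where both groups come from the same distribution, the group labels carry no information, so one common way to simulate it is to shuffle the labels and recompute the statistic. Here is a sketch of a single shuffle on toy data. The name `one_permutation_statistic` and the toy arrays are hypothetical, and NumPy's `permutation` is used here in place of table operations.

```python
import numpy as np

def one_permutation_statistic(values, labels, rng):
    """Shuffle the labels once and return the difference of group means."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    # Shuffling the labels breaks any real link between label and value
    # while keeping the group sizes unchanged.
    shuffled_labels = rng.permutation(labels)
    return values[shuffled_labels == 1].mean() - values[shuffled_labels == 0].mean()

rng = np.random.default_rng(1)                # seeded so the sketch is repeatable
toy_values = [200, 260, 240, 210, 190, 250]   # made-up measurements
toy_labels = [0, 1, 1, 0, 0, 1]               # made-up group labels
toy_null_stat = one_permutation_statistic(toy_values, toy_labels, rng)
```

The function performs exactly one shuffle per call; repeating the call many times is a separate step.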

<!--
BEGIN QUESTION
name: q2_2_4
manual: false
-->

In [50]:
def simulate_framingham_null():
    shuffled_column = ...
    original_with_shuffled_col = ...
    simulated_test_stat = ...
    return simulated_test_stat 

In [None]:
ok.grade("q2_2_4");

In [54]:
# Run your function once to make sure that it works.
simulate_framingham_null()

**Question 5**: Fill in the blanks below to complete the simulation for the hypothesis test. Your simulation should compute 1000 values of the test statistic under the null hypothesis and store the result in the array `framingham_simulated_stats`.

*Hint*: You should use the function you wrote above in Question 4.

*Note*: Running your code might take a few minutes! We encourage you to check your `simulate_framingham_null()` code to make sure it works correctly before running this cell.

<!--
BEGIN QUESTION
name: q2_2_5
manual: false
-->

In [55]:
framingham_simulated_stats = make_array()

for i in np.arange(1000):
    ...

**Question 6:** The following cell will plot the histogram of the simulated test statistics, as well as a point for the observed test statistic. As before, check the result of this cell with a staff member and explain what this histogram displays.

<!--
BEGIN QUESTION
name: q2_2_6
-->

In [56]:
Table().with_column('Simulated statistics', framingham_simulated_stats).hist()
plots.scatter(framingham_observed_statistic, 0, color='red', s=30);

**Question 7**: Compute the p-value for this hypothesis test, and assign it to the name `framingham_p_value`.

*Hint*: One of the key findings of the Framingham study was a strong association between cholesterol levels and heart disease. If your p-value doesn't match up with this finding, you may want to take another look at your test statistic and/or your simulation.
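
For a one-sided test where larger statistics favor the alternative, an empirical p-value is the proportion of simulated statistics that are at least as large as the observed one. Here is a sketch with made-up numbers; `empirical_p_value`, `toy_simulated`, and the observed value of 1.5 are hypothetical.

```python
import numpy as np

def empirical_p_value(simulated_stats, observed_stat):
    """Proportion of simulated statistics >= the observed statistic."""
    simulated_stats = np.asarray(simulated_stats)
    # Count simulated values at least as extreme as the observation,
    # in the direction of the alternative.
    return np.count_nonzero(simulated_stats >= observed_stat) / len(simulated_stats)

toy_simulated = np.array([-2.0, -1.0, 0.0, 0.5, 1.0, 1.5, 2.0, 3.0])
toy_p = empirical_p_value(toy_simulated, 1.5)   # 3 of the 8 values are >= 1.5
```

If the observed statistic lies far to the right of every simulated value, this proportion is 0, which means the simulation never produced a statistic that extreme.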

<!--
BEGIN QUESTION
name: q2_2_7
manual: false
-->

In [57]:
framingham_p_value = ...
framingham_p_value

In [None]:
ok.grade("q2_2_7");

**Question 8**: Despite the Framingham Heart Study's well-deserved reputation as a well-conducted and rigorous study, it has some major limitations. Discuss with a neighbor and give one specific reason why it can't be used to say that high cholesterol *causes* heart disease.

<!--
BEGIN QUESTION
name: q2_2_8
-->

*Write your answer here, replacing this text.*

Similar studies from the 1950s found positive associations between diets high in saturated fat, high cholesterol, and incidence of heart disease. In 1962, the U.S. Surgeon General said:

*"Although there is evidence that diet and dietary habits may be implicated in the development of coronary heart disease and may be significant in its prevention or control, at present our only research evidence is associative and not conclusive."*

Congratulations, you're done with lab 7! Be sure to

- **run all the tests** (the next cell has a shortcut for that),
- **Save and Checkpoint** from the File menu,
- **run the last cell to submit your work**,
- and ask one of the staff members to check you off.

In [59]:
# For your convenience, you can run this cell to run all the tests at once!
import os
_ = [ok.grade(q[:-3]) for q in os.listdir("tests") if q.startswith('q')]

In [60]:
_ = ok.submit()

### Further reading

If you're interested in learning more, you can check out this article:

* [Origin story of the Framingham Heart Study](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1449227/)