<h1> Lecture 38 

Data Science 8, Spring 2021 </h1>

<h3>
<b>
<ul>
<li>Case Study: Minnesota Coronary Experiment (1968-1973)</li><br>
    <li>Ivan Frantz</li><br>
</ul>
</b>
</h3>

In [None]:
from datascience import *
import numpy as np
import warnings
warnings.filterwarnings("ignore")

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
plots.rcParams["patch.force_edgecolor"] = True

#The following allows porting images into a Markdown window
#Syntax: ![title](image_name.png)
from IPython.display import Image

## Broste Thesis Data ##

In [None]:
summary = Table(['Age', 'Condition', 'Total', 'Deaths', 'CHD Deaths']).with_rows([
    ['0-34',  'Diet',    1367, 3, 0],
    ['35-44', 'Diet',    728, 3, 0],
    ['45-54', 'Diet',    767, 14, 4],
    ['55-64', 'Diet',    870, 35, 7],
    ['65+',   'Diet',    953, 190, 42],
    ['0-34',  'Control', 1337, 7, 1],
    ['35-44', 'Control', 731, 4, 1],
    ['45-54', 'Control', 816, 16, 4],
    ['55-64', 'Control', 896, 33, 12],
    ['65+',   'Control', 958, 162, 34],   
])
summary.drop('CHD Deaths')

<h4>Reconstruct Ivan Frantz's Full Table</h4>

<h5>Example (first row of the summary table):<br> 
    <ul>
        <li>Create 1,367 rows</li><br>
        <li>Assign age 0-34 and condition Diet</li><br>
        <li>Tag 3 as Died.</li>
    </ul>
</h5>

In [None]:
subjects = Table(['Age', 'Condition', 'Participated', 'Died'])
for row in summary.rows:
    i = np.arange(0, row.item('Total'))
    t = Table().with_columns('Died', i < row.item('Deaths'))
    t.append_column('Age', row.item('Age'))
    t.append_column('Condition', row.item('Condition'))
    t.append_column('Participated', True)
    subjects.append(t)
subjects

<h4>Rearrange the Table to Show by Age Group and Condition (Control/Diet)</h4>

In [None]:
subjects.group(['Age', 'Condition'], sum)

<h4>Pay attention to the 65+ age group in the table above.</h4>

<h4>Could this have been by random chance? Or was it due to diet?</h4>

<h3>Hypothesis Test</h3>
<h4>
    <ul><li>Null Hypothesis $H_0$: There is no difference in mortality between the two diets; that is, any observed difference is due to random chance.</li><br><br>
    <li>Alternative Hypothesis $H_1$: Diet matters. There is a real difference in mortality between the two diets.</li>
    </ul>
    
To test the difference between two proportions, we use A/B Testing.<br>

<br>The alternative hypothesis determines whether we use a <u>simple difference</u> in mortality rates or an <u>absolute difference</u>.  Which one do we use here?
</h4>
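To make the choice concrete, here is a minimal sketch of the two candidate statistics in plain Python, using the counts from the summary table. The function names `simple_difference` and `absolute_difference` are illustrative and are not part of the notebook's code.

```python
# Counts per age group, copied from the summary table (youngest to oldest)
diet_totals, diet_deaths = [1367, 728, 767, 870, 953], [3, 3, 14, 35, 190]
control_totals, control_deaths = [1337, 731, 816, 896, 958], [7, 4, 16, 33, 162]

diet_rate = sum(diet_deaths) / sum(diet_totals)
control_rate = sum(control_deaths) / sum(control_totals)

def simple_difference(rate_a, rate_b):
    """Signed difference: keeps the direction of the effect."""
    return rate_a - rate_b

def absolute_difference(rate_a, rate_b):
    """Absolute difference: ignores which group has the higher rate."""
    return abs(rate_a - rate_b)

# Swapping the groups flips the sign of the simple difference...
print('Diet - Control:', simple_difference(diet_rate, control_rate))
print('Control - Diet:', simple_difference(control_rate, diet_rate))
# ...but leaves the absolute difference unchanged
print('Absolute:', absolute_difference(diet_rate, control_rate))
```

A simple (signed) difference fits an alternative that names a direction, such as "the diet lowers mortality". An absolute difference fits an alternative that only says the two rates differ.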

<h4>How the table looks if we group by Condition only</h4>

In [None]:
subjects.group('Condition', sum)

<h4>The Diet group's mortality rate is about 10% higher in relative terms (roughly 5.2% vs. 4.7%). But is this significant, or is it due to randomness?</h4>
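The size of the gap depends on how it is measured. The sketch below uses the aggregate counts from the summary table (245 deaths among 4,685 Diet participants, 222 among 4,738 Control participants) to contrast the difference in percentage points with the relative difference. The variable names are illustrative only.

```python
# Aggregate counts, summed over all age groups in the summary table
diet_deaths, diet_total = 245, 4685
control_deaths, control_total = 222, 4738

diet_rate = diet_deaths / diet_total
control_rate = control_deaths / control_total

# Difference in percentage points (this is what rate_difference measures)
point_difference = diet_rate - control_rate

# Relative difference: how much larger the Diet rate is, as a fraction of the Control rate
relative_difference = point_difference / control_rate

print('Diet rate:   ', round(diet_rate, 4))
print('Control rate:', round(control_rate, 4))
print('Difference in rates:', round(point_difference, 4))
print('Relative difference:', round(relative_difference, 3))
```

The two rates differ by about half a percentage point, which corresponds to a relative difference somewhat above 10%.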

<h4>Drop the 'Age' column before grouping, so the meaningless 'Age sum' column disappears</h4>

In [None]:
subjects.drop('Age').group('Condition', sum)

In [None]:
# hazard_rate gives the proportion of people who died
def hazard_rate(counts): #counts is a row of the table
    return counts.item('Died sum') / counts.item('Participated sum')

def rate_difference(t):  #t is the table of ALL individuals
    #the following line creates the aggregate table like the one above
    counts = t.drop('Age').group('Condition', sum) 
    return abs(hazard_rate(counts.row(1)) - hazard_rate(counts.row(0)))

Aggregate Rate Difference Across All Age Groups

In [None]:
rate_difference(subjects)

Rate Difference for Ages 0-34

In [None]:
rate_difference(subjects.where('Age', '0-34'))

Rate Difference for Ages 35-44

In [None]:
rate_difference(subjects.where('Age', '35-44'))

Rate Difference for Ages 45-54

In [None]:
rate_difference(subjects.where('Age', '45-54'))

Rate Difference for Ages 55-64

In [None]:
rate_difference(subjects.where('Age', '55-64'))

Rate Difference for Ages 65+

In [None]:
rate_difference(subjects.where('Age', '65+'))

<h4>To perform our A/B test, we must shuffle either the Died or Condition column.</h4>
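The following is a small sketch of what one shuffle does, written in plain Python with made-up toy data (8 hypothetical subjects, not study data). It shuffles the outcomes while leaving the condition labels in place. The helper `rate_by_condition` is an illustrative name, not a function from the `datascience` library.

```python
import random

# Hypothetical toy data: 4 Diet subjects followed by 4 Control subjects
conditions = ['Diet'] * 4 + ['Control'] * 4
died = [True, True, False, False, True, False, False, False]

def rate_by_condition(conditions, died):
    """Return a dict mapping each condition to its proportion of deaths."""
    rates = {}
    for label in sorted(set(conditions)):
        outcomes = [d for c, d in zip(conditions, died) if c == label]
        rates[label] = sum(outcomes) / len(outcomes)
    return rates

rng = random.Random(0)  # seeded only so the sketch is reproducible

# Shuffle a copy of the outcomes; the condition labels stay where they are
shuffled_died = died[:]
rng.shuffle(shuffled_died)

print('Original rates:', rate_by_condition(conditions, died))
print('Shuffled rates:', rate_by_condition(conditions, shuffled_died))
print('Total deaths before and after:', sum(died), sum(shuffled_died))
```

Shuffling keeps the group sizes and the total number of deaths fixed. It only breaks the pairing between condition and outcome, which is what the null hypothesis asserts. Shuffling the Condition labels instead would break the same pairing.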

In [None]:
def test(t):
    #The line below calculates the rate difference across ALL age groups
    observed = rate_difference(t) #this is the observed test statistic
    #this many different shufflings
    repetitions = 200  

    stats = make_array()
    for i in np.arange(repetitions):
        simulated_results = t.select('Died').sample(with_replacement=False).column('Died')
        simulated_outcomes = t.with_column('Died', simulated_results)
        simulated_stat = rate_difference(simulated_outcomes)
        stats = np.append(stats, simulated_stat)

    # Find the empirical P-value:
    # the proportion of simulated rate differences that are
    # at least as large as the observed rate difference.
    p = np.count_nonzero(stats >= observed) / repetitions
    print('Observed absolute difference in hazard rates:', np.round(observed,6))
    print('P-value:', p)

In [None]:
test(subjects)

<h4>Question: To reject the Null (and therefore claim that diet mattered in a statistically significant way), do we expect a small $p$-value (e.g., $p\leq 0.05$) or something larger?</h4>
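As a rough illustration of how the decision is made, the sketch below computes an empirical p-value from a short list of simulated statistics and compares it with the conventional 5% cutoff. All numbers are invented for illustration, and `empirical_p_value` is a hypothetical helper, not part of the notebook.

```python
def empirical_p_value(simulated_stats, observed):
    """Proportion of simulated statistics at least as large as the observed one."""
    count = sum(1 for s in simulated_stats if s >= observed)
    return count / len(simulated_stats)

# Hypothetical simulated absolute differences (made up, not from the study)
simulated = [0.001, 0.004, 0.002, 0.006, 0.003,
             0.0005, 0.007, 0.002, 0.001, 0.005]
observed = 0.0055  # hypothetical observed statistic

p = empirical_p_value(simulated, observed)
cutoff = 0.05  # conventional 5% cutoff

print('P-value:', p)
print('Reject the null at the 5% level:', p <= cutoff)
```

An empirical p-value can only be a multiple of 1/repetitions, so with 200 shuffles the smallest nonzero value it can take is 0.005.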

SLIDE: Conventions About Inconsistency

In [None]:
subjects.group('Age').column('Age')

<h4>$p$-values by age group</h4>

In [None]:
for age in subjects.group('Age').column('Age'):
    print('Ages', age)
    test(subjects.where('Age', age))