In [None]:
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

## Review: Comparing Two Samples

Here are two functions we wrote in Lecture 19 for A/B testing. Let's review what we did.

We wanted to simulate many values of our test statistic under the assumptions of the null hypothesis: there is no association between the numerical variable (baby birth weight) and the grouping variable (maternal smoking). This gives us a context for understanding the **observed** difference of means in the actual data.
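
As a library-free illustration of this idea, here is a minimal sketch of a permutation simulation in plain Python. The helper names (`difference_of_means_plain`, `one_shuffled_difference`) and the toy data are hypothetical and are not the real `baby.csv` values; the Table-based functions from lecture follow in the steps below.

```python
import random
from statistics import mean

def difference_of_means_plain(values, labels):
    """Mean of the values labeled True minus mean of the values labeled False."""
    group_true = [v for v, lab in zip(values, labels) if lab]
    group_false = [v for v, lab in zip(values, labels) if not lab]
    return mean(group_true) - mean(group_false)

def one_shuffled_difference(values, labels, rng):
    """Shuffle a copy of the labels, then recompute the statistic."""
    shuffled = list(labels)
    rng.shuffle(shuffled)  # group sizes stay the same; only the pairing changes
    return difference_of_means_plain(values, shuffled)

# Hypothetical toy data (NOT the real baby.csv values)
weights = [120, 113, 128, 108, 136, 138, 132, 120, 143, 140]
smoker = [False, False, True, True, False, False, False, True, False, False]

rng = random.Random(0)  # seeded so the sketch is repeatable
observed = difference_of_means_plain(weights, smoker)
simulated = [one_shuffled_difference(weights, smoker, rng) for _ in range(1000)]
```

Under the null hypothesis the labels carry no information about the values, so each shuffle produces a statistic that is as plausible as the observed one. The list `simulated` is the context against which `observed` is judged.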

Step 1: Write a function to calculate one value of the test statistic, `mean(smokers) - mean(non-smokers)`

In [None]:
def difference_of_means(table, numeric_label, group_label):
    """
    Takes: a Table, column label of numerical variable,
    column label of group-label variable (must be just two groups)
    
    Returns: Difference of means of the two groups
    """
    
    #table with the two relevant columns
    reduced = table.select(numeric_label, group_label)  
    
    # table containing group means for the numerical variable
    means_table = reduced.group(group_label, np.average)
    
    # array of group means
    means = means_table.column(1)
    
    return means.item(1) - means.item(0)

Step 2: Write a function which creates the shuffled data table and returns one simulated value of the test statistic. This lets us carry out the "permutation test" that underlies A/B testing.

In [None]:
def one_simulated_difference(table, numeric_label, group_label):
    """
    Takes: name of table, column label of numerical variable,
    column label of group-label variable (with 2 levels)
    
    Returns: Difference of means of the two groups after shuffling labels
    """
    
    # array of shuffled labels
    shuffled_labels = table.sample(
        with_replacement = False).column(group_label)
    
    # table of numerical variable and shuffled labels
    shuffled_table = table.select(numeric_label).with_column(
        'Shuffled Label', shuffled_labels)
    
    return difference_of_means(
        shuffled_table, numeric_label, 'Shuffled Label')   

In [None]:
births = Table.read_table('baby.csv')

In [None]:
births.group('Maternal Smoker', np.average).select('Maternal Smoker', 'Birth Weight average')

Even though our result (`P-value is: 0.0`) showed extremely convincing evidence in support of the alternative hypothesis (Group B, the smokers, tends to have lower-birthweight babies), we could NOT conclude from these data that smoking **causes** lower birthweight. The reason is that these data are observational: no controlled experiment was conducted, which would have required randomly assigning some women to smoke during pregnancy and others not to. Such an experiment would be problematic and highly unethical.

**Back to slides...**

# Randomized Controlled Experiment

Today's example relates to botox treatments and their use to reduce pain in various parts of the body.

In [None]:
# We have a single grouping column and a single numerical variable
# Group indicates "control" versus "treatment"
# Result is 0 (did not reduce pain) or 1 (did reduce pain) for each subject in the study
botox = Table.read_table('bta.csv')
print("We have", botox.num_rows, "subjects in this study")
botox.show()

What table method should we use to summarize the experiment results in a 2 x 2 grid, showing (for each group) how many subjects had a certain value for the numerical variable?
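
To make the target shape concrete, here is a small sketch that counts subjects by group and result in plain Python. The `subjects` list is made-up data, not the contents of `bta.csv`, and it uses `collections.Counter` rather than a Table method.

```python
from collections import Counter

# Hypothetical (group, result) pairs, NOT the real bta.csv data
subjects = [
    ("Control", 0), ("Control", 0), ("Control", 1),
    ("Treatment", 1), ("Treatment", 1), ("Treatment", 0),
    ("Treatment", 1),
]

counts = Counter(subjects)  # maps (group, result) -> number of subjects

# Print one row per group and one column per result value
print("Group      0  1")
for group in ("Control", "Treatment"):
    print(f"{group:<10} {counts[(group, 0)]}  {counts[(group, 1)]}")
```

Each cell of the grid answers one question: within this group, how many subjects had this result?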

In [None]:
# using group
botox.group('Result')

In [None]:
# using pivot
botox.pivot('Result', 'Group')

Does it seem like the botox treatment is generally more effective than the placebo, which was given to the control group? How can we tell whether the difference between the two groups is statistically significant?

As a first step toward carrying out a hypothesis test with these data, let's show the `Result average` for each group in a 2-row table:

In [None]:
botox.group('Group', np.average)

**Back to slides...**

# Testing the Hypotheses

  - Null: Treatment has no effect
  - Alternative: In the population, more of the potential treatment scores are 1 (pain improves) than the potential control scores. In short, the treatment is generally more effective than a placebo.
  - Test statistic: mean(treatment_group) - mean(control_group)
  - What values of the test statistic will support the alternative hypothesis?
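
Because `Result` is coded as 0 or 1, the mean of each group is simply the proportion of subjects whose pain improved, so the test statistic is a difference of proportions. A minimal sketch with made-up outcomes (the lists `control` and `treatment` are hypothetical, not the study data):

```python
from statistics import mean

# Hypothetical 0/1 outcomes (1 = pain reduced); NOT the real study data
control = [0, 0, 1, 0, 0]
treatment = [1, 1, 0, 1, 0]

# The mean of a 0/1 list is the proportion of 1s
prop_control = mean(control)      # 1 of 5 improved
prop_treatment = mean(treatment)  # 3 of 5 improved

# Treatment proportion minus control proportion
test_statistic = prop_treatment - prop_control
```

In this toy example the statistic is about 0.4, meaning the treatment group's improvement rate is 40 percentage points higher than the control group's.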

In [None]:
# Use our handy difference_of_means function from Lecture 19 in the current context

# Record, by defining a function, the process for computing the difference of means
# ONLY USE THIS ON A TWO-COLUMN TABLE
def difference_of_means(table, numer_label, group_label):
    """
    Parameters: name of table, column label of numerical variable,
                column label of group-label variable
    Returns:    Difference of means for the two groups 
                (2nd group mean minus 1st group mean)
    """
    #table with just the two relevant columns
    reduced = table.select(group_label, numer_label)
    
    # table containing the two group means 
    means_table = reduced.group(group_label, np.mean)
    
    # array holding the two means
    means = means_table.column(1)
    
    # return the computed test statistic, 2nd mean minus 1st mean
    return means.item(1) - means.item(0)

observed_difference = difference_of_means(botox, 'Result', 'Group')
observed_difference

In [None]:
# And our handy one_simulated_difference function
# As usual, we begin by defining a function to simulate one value of the test statistic
# WE WILL USE A TWO-COLUMN TABLE
def one_simulated_difference(table, numer_label, group_label):
    """Takes: name of table, column label of numerical variable, column label of group-label variable
    Returns: Difference of means of the two groups after shuffling labels"""

    # array of shuffled labels
    shuffled_labels = ...
    
    # table of numerical variable (second column) and shuffled labels (first column)
    shuffled_table = (Table().with_columns(
        group_label, shuffled_labels,             # labels are shuffled
        numer_label, table.column(numer_label)))  # numerical values stay unshuffled
    
    stat = ...
    return stat
    

In [None]:
# Notice how it jumps around when we run it repeatedly
one_simulated_difference(botox, 'Result', 'Group')

In [None]:
repetitions = 10000
simulated_diffs = make_array()  # initialize with an empty array

# Make an array of 10000 simulated values of our test statistic
for i in range(repetitions):
    ...
    ...

print(len(simulated_diffs))
simulated_diffs

In [None]:
# Make a visualization
# Histogram shows the simulated distribution of the test statistic if the null hypothesis is true
# Red dot shows the observed value of the test statistic from the actual data (controlled experiment)
col_name = 'Difference of Means'
Table().with_column(col_name, simulated_diffs).hist(col_name)
plots.scatter(observed_difference, -.01, color='red', s=40);

What is your instinct? Is this observed value for the statistic significant? Or could it be explained away as "random variation"?

The p-value is a popular way to quantify the level of significance. Remember, SMALLER p-values are (LESS? MORE?) significant in terms of supporting the alternative hypothesis.

Also, we mentioned on Friday that the "context" of the data should determine an appropriate p-value cutoff BEFORE we actually calculate the p-value. Here we have a medical context, and that often leads to more stringent standards. 

Let's decide we will only favor the alternative hypothesis if the p-value is HIGHLY significant (less than 1%).
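
As a rough sketch of how an empirical p-value is computed from simulated statistics, the example below uses made-up numbers and a hypothetical helper, `empirical_p_value`. It assumes the statistic is treatment mean minus control mean, so that values at least as large as the observed one count as evidence for the alternative.

```python
def empirical_p_value(simulated_stats, observed_stat):
    """Proportion of simulated statistics at least as large as the observed one."""
    at_least_as_large = sum(1 for s in simulated_stats if s >= observed_stat)
    return at_least_as_large / len(simulated_stats)

# Hypothetical simulated statistics and observed value, for illustration only
simulated = [-0.3, -0.1, 0.0, 0.1, 0.1, 0.2, -0.2, 0.0, 0.4, 0.5]
observed = 0.4

p_value_sketch = empirical_p_value(simulated, observed)
cutoff = 0.01  # the 1% cutoff chosen above
favors_alternative = p_value_sketch < cutoff
```

With only ten made-up values this toy p-value is far above the cutoff; the real calculation uses the full array of simulated differences and the observed difference from the study.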

In [None]:
# p-value calculation
p_value = ...
p_value

Conclusion: We have highly significant evidence (p ~ .0009) that this botox treatment is generally associated with decreased pain for people in the population from which the subjects in this study were drawn. 

In addition, because the data come from a randomized controlled experiment, we conclude that botox treatments actually CAUSE lower pain levels. 