<div style="width: 38.5%;">
    <p><strong>City College of San Francisco</strong></p>
    <hr>
    <p>MATH 108 - Foundations of Data Science</p>
</div>

# Lecture 23: A/B Testing

Associated Textbook Sections: [12.0, 12.1](https://inferentialthinking.com/chapters/12/Comparing_Two_Samples.html)

## Outline

* [A/B Testing](#A/B-Testing)
* [Digital Experiments](#Digital-Experiments)
* [Hypothesis Testing Review](#Hypothesis-Testing-Review)

## Set Up the Notebook

In [None]:
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

---

## A/B Testing

### Comparing Two Samples

* Compare values of sampled individuals in Group A with values of sampled individuals in Group B.
* Question: Do the two sets of values come from the same underlying distribution?
* Answering this question by performing a statistical test is called A/B testing.

### Demo: Smoking and Birth Weight

Load the data in `baby.csv`. The data set contains information on 1174 pregnancies and was part of a larger study of all the births from 1960 to 1967 among women in the Kaiser Foundation Health Plan in Oakland, California.

In [None]:
births = Table.read_table("./data/baby.csv")
births

Explore the relationship between the values of `'Maternal Smoker'` and `'Birth Weight'`.

In [None]:
smoking_and_birthweight = births.select('Maternal Smoker', 'Birth Weight')
smoking_and_birthweight.show(2)

In [None]:
smoking_and_birthweight.group('Maternal Smoker')

In [None]:
smoking_and_birthweight.group('Maternal Smoker', np.mean)

In [None]:
smoking_and_birthweight.group('Maternal Smoker', np.median)

In [None]:
smoking_and_birthweight.hist('Birth Weight', group='Maternal Smoker')

### The Groups and the Question

* Sample of mothers of newborns. Compare:
    * (A) Birth weights of babies of mothers who didn't smoke during pregnancy
    * (B) Birth weights of babies of mothers who smoked during pregnancy
* Question: Could the differences we are observing be due to chance alone?


### Hypotheses

* Null Hypothesis: In the population, the distributions of the birth weights of the babies in the two groups are the same. (They are different in the sample just due to chance.)
* Alternative Hypothesis: In the population, the babies of the mothers who smoked during pregnancy weigh less, on average, than the babies of the non-smokers.


### Test Statistic

* Group A: non-smokers
* Group B: smokers
* Statistic: Difference between average weights: `group_B_mean - group_A_mean`  
* Negative values of this statistic favor the alternative
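
As a minimal sketch of this statistic, independent of the `datascience` library, the snippet below computes `group_B_mean - group_A_mean` for a handful of made-up birth weights. The values and the helper name `difference_of_group_means` are hypothetical and are not taken from `baby.csv`.

```python
from statistics import mean

def difference_of_group_means(values, labels):
    """Return mean of Group B values minus mean of Group A values.

    Hypothetical helper: True marks Group B (smokers),
    False marks Group A (non-smokers).
    """
    group_b = [v for v, is_b in zip(values, labels) if is_b]
    group_a = [v for v, is_b in zip(values, labels) if not is_b]
    return mean(group_b) - mean(group_a)

# Made-up birth weights (ounces) and smoking labels, for illustration only
toy_weights = [120, 113, 128, 108, 136, 115]
toy_smoker = [False, False, True, True, False, True]

toy_difference = difference_of_group_means(toy_weights, toy_smoker)
print(toy_difference)
```

With this toy data the smokers' group has the lower mean, so the statistic comes out negative, which is the direction that favors the alternative.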


Determine the observed difference between the average birth weights of babies whose mothers did and did not smoke during pregnancy.

In [None]:
means_table = smoking_and_birthweight.group('Maternal Smoker', np.average)
means_table

In [None]:
means = means_table.column(1)
observed_difference = ...
observed_difference

Create a function that takes a table, the column label of a numerical variable, and the column label of a group-label variable, and returns the difference between the means of the two groups.

In [None]:
def difference_of_means(table, value_label, group_label):
    """Takes: name of table, column value_label of numerical variable,
    column group_label of group-label variable
    Returns: Difference of means of the two group-labels"""
    
    # table with the two relevant columns
    reduced = ...
    
    # table containing group means
    means_table = ...
    
    # array of group means
    means = ...
    
    return ...

In [None]:
difference_of_means(births, ..., ...)

### The Data

<img src="img/lec19_the_data.png" width=80%>

### Shuffling Labels Under the Null

<img src="img/lec19_shuffling_labels.png" width=80%>

### Shuffling Rows

#### The `sample` Table Method

* `tbl.sample(n)`: Table of `n` rows picked randomly with replacement
* `tbl.sample()`: Table with same number of rows as original `tbl`, picked randomly with replacement
* `tbl.sample(n, with_replacement = False)`: Table of `n` rows picked randomly without replacement
* `tbl.sample(with_replacement = False)`: All rows of `tbl`, in random order
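
The difference between sampling with and without replacement can be seen without a table. The sketch below uses Python's standard `random` module as a stand-in for `tbl.sample`. It is an analogy on a made-up list, not a description of how the `datascience` library is implemented.

```python
import random

random.seed(0)  # fixed seed so the illustration is repeatable
rows = ['w', 'x', 'y', 'z']  # made-up stand-in for the rows of a table

# Analogous to tbl.sample(): same length, drawn with replacement,
# so some rows may repeat and others may be missing
with_replacement = random.choices(rows, k=len(rows))

# Analogous to tbl.sample(with_replacement=False): every row appears
# exactly once, in random order (a random permutation)
without_replacement = random.sample(rows, k=len(rows))

print(with_replacement)
print(without_replacement)
```

Only the draw without replacement is guaranteed to contain every original row exactly once, which is why it is the one used for shuffling.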

#### Random Permutation (Shuffling)

Demonstrate how to perform a random permutation using the `sample` method.

In [None]:
letters = Table().with_column('Letter', make_array('a', 'b', 'c', 'd', 'e'))
letters

In [None]:
# most likely not a permutation
...

In [None]:
# a random permutation
...

In [None]:
shuffled_letters = ...
...

#### Simulating Under the Null

* If the null is true, all rearrangements of labels are equally likely
* Plan:
    1. Shuffle all group labels
    1. Assign each shuffled label to a birth weight
    1. Find the difference between the averages of the two shuffled groups
    1. Repeat
* This process is generally called a permutation test.
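
The plan above can be sketched in plain Python on made-up data. The names `difference_of_group_means` and `one_shuffled_difference`, the toy weights, and the choice of 1000 repetitions are all assumptions made for illustration. The notebook's own version works on the `births` table instead.

```python
import random
from statistics import mean

def difference_of_group_means(values, labels):
    """Mean of values labeled True minus mean of values labeled False."""
    group_b = [v for v, is_b in zip(values, labels) if is_b]
    group_a = [v for v, is_b in zip(values, labels) if not is_b]
    return mean(group_b) - mean(group_a)

def one_shuffled_difference(values, labels, rng):
    """Shuffle the labels, pair them with the values, and recompute the statistic."""
    shuffled_labels = rng.sample(labels, k=len(labels))  # random permutation
    return difference_of_group_means(values, shuffled_labels)

rng = random.Random(23)  # seeded generator so the sketch is repeatable

# Made-up birth weights (ounces) and smoking labels, for illustration only
toy_weights = [120, 113, 128, 108, 136, 115, 125, 110]
toy_smoker = [False, False, True, True, False, True, False, True]

# Repeat the shuffle many times to approximate the null distribution
simulated = [one_shuffled_difference(toy_weights, toy_smoker, rng)
             for _ in range(1000)]
print(len(simulated), min(simulated), max(simulated))
```

Because shuffling only rearranges the labels, each simulated group keeps its original size, and the simulated differences are centered near zero.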


### Simulation Under Null Hypothesis

Perform a random permutation on the table containing a Boolean-valued column representing whether or not the mother smoked during pregnancy and an integer-valued column of their baby's birth weight.

In [None]:
smoking_and_birthweight.show(3)

In [None]:
shuffled_labels = ...

In [None]:
original_and_shuffled = smoking_and_birthweight.with_column(
    'Shuffled Label', shuffled_labels
)

original_and_shuffled

Calculate the difference of the birth weight means for the two smoking groups based on the shuffled data and the original data.

In [None]:
difference_of_means(original_and_shuffled, ..., ...)

In [None]:
difference_of_means(original_and_shuffled, ..., ...)

### Permutation Test

Perform a permutation test using 2500 simulations to determine how likely it is to see a difference in mean birth weights at least as extreme as the observed one, assuming that smoking during pregnancy has no effect on birth weight. *This might take a few minutes to run.*

In [None]:
def one_simulated_difference(table, value_label, group_label):
    """Takes: name of table, column value_label of numerical variable,
    column group_label of group-label variable
    Returns: Difference of means of the two groups after shuffling group_labels"""
    
    # array of shuffled labels
    shuffled_labels = ...
    
    # table of numerical variable and shuffled labels
    shuffled_table = ...
    
    return ...

In [None]:
one_simulated_difference(births, ..., ...)

In [None]:
differences = ...

for ...
    new_difference = ...
    differences = ...

In [None]:
Table().with_column('Difference Between Group Means', differences).hist()
print('Observed Difference:', observed_difference)
plt.scatter(observed_difference, 0, color='red', s=60, zorder=3)
plt.title('Prediction Under the Null Hypothesis');

Calculate the p-value.

In [None]:
p_value = ...
p_value
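
Since negative values of the statistic favor the alternative, the p-value is the proportion of simulated differences that are at or below the observed one. Here is a minimal sketch on made-up numbers. The helper name `left_tail_p_value` and the values are hypothetical and unrelated to the birth weight data.

```python
def left_tail_p_value(simulated_differences, observed):
    """Proportion of simulated differences at or below the observed one."""
    at_or_below = sum(1 for d in simulated_differences if d <= observed)
    return at_or_below / len(simulated_differences)

# Made-up simulated differences and observed value, for illustration only
toy_simulated = [-3.0, 1.5, 0.0, -0.5, 2.0, -1.0, 0.5, 1.0, -2.0, 3.0]
toy_observed = -2.0

toy_p_value = left_tail_p_value(toy_simulated, toy_observed)
print(toy_p_value)
```

Two of the ten toy values are at or below the observed one, so the toy p-value is 0.2.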

### Conclusion

* With a p-value of approximately 0%, the data are inconsistent with the null hypothesis, so we reject it in favor of the alternative.
* That is, the data support the conclusion that, in the population, the babies of the mothers who smoked during pregnancy weigh less, on average, than the babies of the non-smokers.

---

## Digital Experiments

* A/B tests are used in digital experiments. 
* Since they are typically easy to implement, multiple tests are often run over a period of time.
* They can provide a methodical way to measure whether some new feature is having a statistically significant impact.
    * Advertising revenue
    * Click rate
    * etc.


<center>
    <img src="./img/A-B_testing_example.png" alt="A comparison of two versions of a website design showing how the design might impact the click rate." width=60%>
</center>

_Image Source: [Wikipedia - A/B Testing](https://en.wikipedia.org/wiki/A/B_testing)_

### ASOS.com

[ASOS](https://www.asos.com/us/) is a fashion brand that has publicly shared [datasets from its digital experiments](https://github.com/liuchbryan/oce-dataset). For each experiment, you can see how the measurements for the control and treatment groups change over time.

* _This example provides an optional look at real experimental data from a corporation's test results._
* _The company has anonymized the experiments and the measurements._
* _Understanding the context of a data source is important for interpreting the results of a statistical test._

In [None]:
asos = Table.read_table('./data/asos_digital_experiments_dataset.csv')
asos

Here is an explanation of what the labels represent in this dataset.

| Field Name | Description | Data Type | Format/ Example | Null value allowed? |
| --- | --- | --- | --- | --- |
| `experiment_id` | Anonymised ID for the A/B test | string | “036afc” | No |
| `variant_id` | The ID of the treatment group. The summary statistics for the corresponding control group are included in Columns 5-7, and hence there are no dedicated rows for the control groups in this dataset. <br /><br /> Note: The variants are not necessarily numbered consecutively. | integer | 2 | No |
| `metric_id` | The ID of the organisational metric. (See the list of metrics below.) | integer | (1 \| 2 \| 3 \| 4) | No |
| `time_since_start` | Number of days since the start of the experiment. | double | 12.5 | No |
| `count_c` | Number of samples in the control group. (The numbers are stored as doubles, but they are clearly integers.) | double | 123456.0 | No |
| `mean_c` | The sample mean of responses across the control group. | double | 4.361 | Yes |
| `variance_c` | The sample variance of responses across the control group. | double | 72.354 | Yes |
| `count_t` | Number of samples in the treatment group. | double | 123572.0 | Yes |
| `mean_t` | The sample mean of responses across the treatment group. | double | 4.345 | Yes |
| `variance_t` | The sample variance of responses across the treatment group. | double | 73.591 | Yes |

#### List of metrics

The dataset features four metrics, numbered 1, 2, 3, and 4. Metric 1 accepts binary responses, metrics 2 and 3 accept count-based responses, and metric 4 accepts non-negative real number responses. The responses for metrics 2, 3, and 4 demonstrate various degrees of right skewness.

Visualize the trend of the observed difference in means for the experiment labeled `036afc` for metric 4. Notice the gap in the plot, which likely reflects a pause in the experiment while the team reviewed the results and made modifications.

In [None]:
exp_id = '036afc'
metric_id = 4
reduced = asos.where('experiment_id', exp_id).where('metric_id', metric_id)
times = reduced.column('time_since_start')
diffs = reduced.column('mean_c') - reduced.column('mean_t')
means_table = Table().with_columns(
    'Time Since Start', times,
    'Diff in Means', diffs
)
means_table.plot('Time Since Start')

In [None]:
means_table.hist('Diff in Means')

In [None]:
means_table_pre_40 = means_table.where('Time Since Start', are.below(40))
means_table_post_40 = means_table.where('Time Since Start', are.above_or_equal_to(40))

In [None]:
means_table_pre_40.hist('Diff in Means')
plt.hist(means_table_post_40.column('Diff in Means'), alpha=0.5)
plt.show()

---

## Hypothesis Testing Review

### Some Hypothesis Testing Situations

* 1 Sample: One Category (e.g. percent of flowers that are purple)
    * Test Statistic: `empirical_percent`, `abs(empirical_percent - null_percent)`
    * How to Simulate: `sample_proportions(n, null_dist)`
* 1 Sample: Multiple Categories (e.g. ethnicity distribution of jury panel)
    * Test Statistic: `tvd(empirical_dist, null_dist)`
    * How to Simulate: `sample_proportions(n, null_dist)`
* 1 Sample: Numerical Data (e.g. scores in a lab section)
    * Test Statistic: `empirical_mean`, `abs(empirical_mean - null_mean)`
    * How to Simulate: `population_data.sample(n, with_replacement=False)`
* 2 Samples: Numerical Data (e.g. birth weights of smokers vs. non-smokers)
    * Test Statistic: `group_a_mean - group_b_mean`, `group_b_mean - group_a_mean`, `abs(group_a_mean - group_b_mean)`
    * How to Simulate: `empirical_data.sample(with_replacement=False)`
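
To make one of these rows concrete, here is a plain-Python sketch of the multiple-categories case: compute the total variation distance (TVD), simulate samples under the null, and take the proportion of simulated TVDs at least as large as the observed one. The distributions are made up, and `simulate_proportions` is a hypothetical stand-in for `sample_proportions(n, null_dist)` built on the standard `random` module.

```python
import random

def tvd(dist_1, dist_2):
    """Total variation distance: half the sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(dist_1, dist_2)) / 2

def simulate_proportions(n, dist, rng):
    """Draw n individuals from a categorical distribution; return category proportions."""
    draws = rng.choices(range(len(dist)), weights=dist, k=n)
    return [draws.count(i) / n for i in range(len(dist))]

rng = random.Random(12)  # seeded generator so the sketch is repeatable

null_dist = [0.5, 0.3, 0.2]       # made-up null distribution
empirical_dist = [0.4, 0.3, 0.3]  # made-up observed proportions
sample_size = 100                 # made-up sample size

observed_tvd = tvd(empirical_dist, null_dist)

# Simulate the statistic under the null, then compare with the observed value
simulated_tvds = [tvd(simulate_proportions(sample_size, null_dist, rng), null_dist)
                  for _ in range(500)]
p_value = sum(1 for t in simulated_tvds if t >= observed_tvd) / len(simulated_tvds)
print(observed_tvd, p_value)
```

Large values of the TVD favor the alternative here, so the p-value counts the right tail rather than the left.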


---

<footer>
    <p>Adapted from UC Berkeley DATA 8 course materials.</p>
    <p>This content is offered under a <a href="https://creativecommons.org/licenses/by-nc-sa/4.0/">CC Attribution Non-Commercial Share Alike</a> license.</p>
</footer>