<div style="width: 38.5%;">
    <p><strong>City College of San Francisco</strong></p>
    <hr>
    <p>MATH 108 - Foundations of Data Science</p>
</div>

# Lecture 20: Causality

Associated Textbook Sections: [12.2](https://inferentialthinking.com/chapters/12/2/Causality.html)

<h2>Set Up the Notebook</h2>

In [None]:
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

## Causality

### Randomized Controlled Experiment

* Sample A: control group
* Sample B: treatment group
* If participants are assigned to the treatment and control groups at random, the two groups are likely to be similar apart from the treatment, so you can draw causal conclusions.
* Any difference in outcomes between the two groups is then due either to chance or to the treatment.
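
The random assignment described above can be sketched in plain Python. This is only an illustration: the participant IDs, the group sizes, and the helper name `randomly_assign` are hypothetical and are not part of the lecture's data or of the `datascience` library.

```python
import random

def randomly_assign(participants, treatment_size, seed=None):
    """Shuffle the participants and split them into (treatment, control)."""
    rng = random.Random(seed)
    shuffled = list(participants)  # copy so the input is left unchanged
    rng.shuffle(shuffled)
    return shuffled[:treatment_size], shuffled[treatment_size:]

# Hypothetical participant IDs 1 through 10
participants = list(range(1, 11))
treatment, control = randomly_assign(participants, treatment_size=5, seed=0)

print("Treatment:", sorted(treatment))
print("Control:  ", sorted(control))
```

Because chance alone decides who lands in which group, no characteristic of the participants can systematically line up with the treatment.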


### Demo: Randomized Controlled Experiment

Load the data in `bta.csv`, which comes from [an experiment to assess whether Botox can be used to reduce pain](https://journals.lww.com/clinicalpain/Abstract/2006/01000/The_Effect_of_Small_Doses_of_Botulinum_Toxin_A_on.11.aspx) in patients with Neck-Shoulder Myofascial Pain Syndrome. The control group received a saline solution, while the treatment group received botulinum toxin type A. A result of 1 means that there was a reduction in pain, and a result of 0 means that there was no reduction in pain.

In [None]:
botox = Table.read_table('data/bta.csv')
botox.show()

Compare the results for the two groups.

In [None]:
...

In [None]:
...

### Before the Randomization

* In the population there is one imaginary ticket for each of the 31 participants in the experiment.
* Each participant’s ticket looks like this:

<img src="img/lec20_before_randomization.png" width=70%>

### The Data

16 randomly picked tickets show:

<img src="img/lec20_16_tickets.png" width=50%>

The remaining 15 tickets show:

<img src="img/lec20_15_remaining.png" width=50%>

### The Hypotheses

* Null: In the population, the distribution of all potential control scores is the same as the distribution of all potential treatment scores. (The treatment has no effect.)
* Alternative: In the population, more of the potential treatment scores are 1 (pain improves) than the potential control scores.

### Demo: Testing the Hypothesis

Compute the observed difference in results for the treatment and control groups using the `difference_of_means` function.

In [None]:
def difference_of_means(table, numeric_label, group_label):
    """
    Takes: name of table, column label of numerical variable,
    column label of group-label variable
    
    Returns: Difference of means of the two groups
    """
    
    # table with the two relevant columns
    reduced = table.select(numeric_label, group_label)
    
    # table containing group means
    means_table = reduced.group(group_label, np.average)
    
    # array of group means
    means = means_table.column(1)
    
    # second group's mean minus first group's mean, in means_table row order
    return means.item(1) - means.item(0)

In [None]:
observed_diff = ...
observed_diff

Use `one_simulated_difference` to simulate the difference under the null hypothesis, then estimate the chance of observing a difference at least as large as the one above if Botox had no effect on pain reduction.

In [None]:
def one_simulated_difference(table, numeric_label, group_label):
    """
    Takes: name of table, column label of numerical variable,
    column label of group-label variable
    
    Returns: Difference of means of the two groups after shuffling labels
    """
    
    # array of shuffled labels
    shuffled_labels = table.sample(
        with_replacement = False).column(group_label)
    
    # table of numerical variable and shuffled labels
    shuffled_table = table.select(numeric_label).with_column(
        'Shuffled Label', shuffled_labels)
    
    return difference_of_means(
        shuffled_table, numeric_label, 'Shuffled Label')   

In [None]:
...

In [None]:
simulated_diffs = make_array()

for i in np.arange(10000):
    sim_diff = ...
    simulated_diffs = np.append(simulated_diffs, sim_diff)

In [None]:
col_name = 'Differences between group means'
Table().with_column(col_name, simulated_diffs).hist(col_name)
plots.scatter(observed_diff, 0, color="red", s=60, zorder=3);

In [None]:
p_value = ...
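
As a rough, self-contained illustration of the shuffling procedure in this demo, the sketch below runs a permutation test in plain Python rather than with the `datascience` Table API. The data are made up (they are not the contents of `bta.csv`), and the helper names `difference_of_means_plain` and `permutation_p_value` are hypothetical. It assumes, as in the alternative hypothesis above, that large positive values of "treatment mean minus control mean" favor the alternative.

```python
import random

def difference_of_means_plain(outcomes, labels, first, second):
    """Mean outcome of group `second` minus mean outcome of group `first`."""
    a = [o for o, g in zip(outcomes, labels) if g == first]
    b = [o for o, g in zip(outcomes, labels) if g == second]
    return sum(b) / len(b) - sum(a) / len(a)

def permutation_p_value(outcomes, labels, first, second,
                        repetitions=5000, seed=0):
    """Return (observed difference, empirical P-value) from shuffled labels."""
    rng = random.Random(seed)
    observed = difference_of_means_plain(outcomes, labels, first, second)
    shuffled = list(labels)
    count = 0
    for _ in range(repetitions):
        rng.shuffle(shuffled)  # reassign labels at random, without replacement
        simulated = difference_of_means_plain(outcomes, shuffled, first, second)
        if simulated >= observed:  # at least as far towards the alternative
            count += 1
    return observed, count / repetitions

# Made-up data: 1 = pain reduced, 0 = no reduction
labels = ['Control'] * 8 + ['Treatment'] * 8
outcomes = [0, 0, 0, 0, 0, 0, 1, 0] + [1, 1, 1, 0, 1, 1, 1, 0]

observed, p = permutation_p_value(outcomes, labels, 'Control', 'Treatment')
print("Observed difference:", observed)
print("Empirical P-value:", p)
```

Shuffling the labels while keeping the outcomes fixed mimics what the null hypothesis says: each participant's result would have been the same regardless of the group they were assigned to.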

### Random Assignment & Shuffling

<img src="img/lec20_random_assignment_and_shuffling.png" width=80%>

## An Error Probability

### Can the Conclusion be Wrong?

Yes.

| | Null is True | Alternative is True | 
| --- | ----------- | ---|
| Test favors the Alternative | ❌ (Type I Error)| ✅ |
| Test favors the Null | ✅ | ❌ (Type II Error) |


* Type I Error - False Positive
* Type II Error - False Negative

### An Error Probability

* The cutoff for the P-value is the Type I error probability.
* If your cutoff is 5% and the null hypothesis happens to be true, then there is about a 5% chance that your test will reject the null hypothesis. (Type I Error: 5%)
* The chance of making a Type II error is the complement of the statistical power of the test (that is, 1 minus the power). You are not expected to know about the power of a test or the chance of making a Type II error in this class.
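
The 5% claim above can be checked with a small simulation. The sketch below is an assumption-laden illustration, not part of the lecture's demo: the test statistic (the mean of 20 standard normal draws) and the helper names `null_statistic` and `p_value` are hypothetical stand-ins. Every simulated experiment is generated with the null hypothesis true, so each rejection is a Type I error.

```python
import bisect
import random

rng = random.Random(0)

def null_statistic(n=20):
    """Mean of n standard normal draws: a stand-in statistic under the null."""
    return sum(rng.gauss(0, 1) for _ in range(n)) / n

# Empirical distribution of the statistic when the null is true
null_stats = sorted(null_statistic() for _ in range(5000))

def p_value(observed):
    """Proportion of simulated null statistics at least as large as observed."""
    at_least = len(null_stats) - bisect.bisect_left(null_stats, observed)
    return at_least / len(null_stats)

cutoff = 0.05
experiments = 2000

# Each experiment is drawn under the null, so any rejection is a false positive
rejections = sum(p_value(null_statistic()) <= cutoff for _ in range(experiments))
type_1_rate = rejections / experiments

print("Proportion of true nulls rejected:", type_1_rate)
```

The printed proportion should land near the cutoff, which is the sense in which the cutoff is the Type I error probability.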

### P-value cutoff vs P-value

* P-value cutoff
    * Does not depend on observed data or simulation
    * Decide on it before seeing the results
    * Conventional values are 5% and 1%
    * Probability that the test rejects the null hypothesis when the null is true (Type I error)
* P-value
    * Depends on the observed data and simulation
    * Probability, under the null hypothesis, that the test statistic is equal to the observed value or further in the direction of the alternative


<footer>
    <hr>
    <p>Adapted from UC Berkeley DATA 8 course materials.</p>
    <p>This content is offered under a <a href="https://creativecommons.org/licenses/by-nc-sa/4.0/">CC Attribution Non-Commercial Share Alike</a> license.</p>
</footer>