<div style="width: 38.5%;">
    <p><strong>City College of San Francisco</strong></p>
    <hr>
    <p>MATH 108 - Foundations of Data Science</p>
</div>

# Lecture 22: Decisions and Uncertainty

Associated Textbook Sections: [11.3, 11.4](https://inferentialthinking.com/chapters/11/3/Decisions_and_Uncertainty.html)

## Outline

* [Decisions and Uncertainty](#Decisions-and-Uncertainty)
* [Review: Terminology](#Review:-Terminology)
* [A Low Midterm Average](#A-Low-Midterm-Average)
* [Statistical Significance](#Statistical-Significance)
* [How We’ve Tested Thus Far](#How-We’ve-Tested-Thus-Far)

## Set Up the Notebook

In [None]:
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

---

## Decisions and Uncertainty

### Incomplete Information

* We are trying to choose between two views of the world, based on data in a sample.
* It is not always clear whether the data are consistent with one view or the other.
* Random samples can turn out quite extreme. It is unlikely, but possible.

---

## Review: Terminology

### Testing Hypotheses

* A test chooses between two views of how data were generated
* The views are called hypotheses
* The test picks the hypothesis that is better supported by the observed data


### Null and Alternative

The method only works if we can simulate data under one of the hypotheses.
* Null hypothesis:
    * A well defined chance model about how the data were generated
    * We can simulate data under the assumptions of this model – “under the null hypothesis”
* Alternative hypothesis: A different view about the origin of the data


### Test Statistic

* The statistic that we choose to simulate, to decide between the two hypotheses
* Questions before choosing the statistic:
    * What values of the statistic will make us lean towards the null hypothesis?
    * What values will make us lean towards the alternative? Preferably, the answer should be just "high". Try to avoid "both high and low".


### Prediction Under the Null Hypothesis

* Simulate the test statistic under the null hypothesis; draw the histogram of the simulated values
* This displays the **empirical distribution of the statistic under the null hypothesis**
* It is a prediction about the statistic, made by the null hypothesis 
    * It shows all the likely values of the statistic
    * Also how likely they are (**if the null hypothesis is true**)
* The probabilities are approximate, because we can't generate all the possible random samples
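
The idea of an empirical distribution under the null can be sketched with plain NumPy. This is only an illustration: the null proportion (75%), the sample size, and the number of repetitions are hypothetical values chosen for the sketch, not numbers from this lecture's data.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

null_proportion = 0.75  # hypothetical chance model (assumption)
sample_size = 100       # hypothetical sample size (assumption)
repetitions = 10_000    # number of simulated samples (assumption)

# Each entry is one simulated value of the statistic under the null:
# the percent of "successes" in one sample drawn from the null model.
simulated_percents = (
    100 * rng.binomial(sample_size, null_proportion, repetitions) / sample_size
)

# Empirical distribution: the share of simulations that landed on each value.
values, counts = np.unique(simulated_percents, return_counts=True)
empirical_probabilities = counts / repetitions
```

The probabilities are approximate because only 10,000 of the possible samples were generated; more repetitions would bring them closer to the true chances under the null.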


### Conclusion of the Test

* Resolve choice between null and alternative hypotheses
    * Compare the **observed test statistic** and its empirical distribution under the null hypothesis
    * If the observed value is **not consistent** with the distribution, then the test favors the alternative ("the data are more consistent with the alternative")
* Whether a value is consistent with a distribution:
    * A visualization may be sufficient
    * If not, there are conventions about "consistency"

---

## A Low Midterm Average

### The Set Up

* Large(-ish) Data Science class divided into 12 discussion sections
* After the midterm, students in Section 3 notice that the average score in their section is lower than in others

### The Instructor's Defense

* Section 3 Instructor's position (Null Hypothesis): If we had picked my section at random from the whole class, we could have got an average like this one.
* Alternative Hypothesis: No, the average score is too low. Randomness is not the only reason for the low scores.


### Demo: A Low Midterm Average

Load the `scores_by_section.csv` data and find the average Midterm score for each section. Store the Section 3 average Midterm score in `observed_average`.

In [None]:
scores = Table.read_table('data/scores_by_section.csv')
scores

In [None]:
scores.group('Section')

In [None]:
scores.group('Section', np.average).show()

In [None]:
observed_average = ...
observed_average

Randomly sample 27 students (the size of Section 3) from the population without replacement, and compute the sample's average Midterm score.

In [None]:
random_sample = ...
random_sample

In [None]:
...
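
As a rough sketch of what sampling without replacement means, here is the same step in plain NumPy rather than the `datascience` `Table` used in the demo. The student IDs and scores are hypothetical stand-ins (a class of 300 with scores from 0 to 25), not the contents of the CSV.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical class roster and midterm scores (assumptions, not the real data).
student_ids = np.arange(300)
midterm_scores = rng.integers(0, 26, size=300)

# Draw 27 distinct students: replace=False means nobody is picked twice.
sampled_ids = rng.choice(student_ids, size=27, replace=False)

# The statistic for this one sample: its average midterm score.
sample_average = midterm_scores[sampled_ids].mean()
```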

Simulate one value of the test statistic under the hypothesis that the section is like a random sample from the class.

In [None]:
def random_sample_midterm_avg():
    random_sample = ...
    return ...

In [None]:
random_sample_midterm_avg()

Simulate 50,000 values of the test statistic, then compare the distribution of the simulated values with the observed statistic.

In [None]:
sample_averages = ...

for ...
    random_sample_average = ...
    sample_averages = ...

In [None]:
averages_tbl = Table().with_column('Random Sample Average', sample_averages)
averages_tbl.hist(bins = 20)
plt.scatter(observed_average, 0, color = 'red', s=60, zorder=3);

In [None]:
...
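
One possible shape for the whole simulation, written in plain NumPy so it runs without the course data. The scores are hypothetical stand-ins and the helper name `random_sample_average` is made up for this sketch; it is not a completion of the demo cells above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical class scores (assumption); the real ones come from the CSV.
class_scores = rng.integers(0, 26, size=300)
section_size = 27     # size of Section 3, as stated in the demo
repetitions = 50_000  # number of simulated values, as in the demo

def random_sample_average(scores, n):
    """Average of n scores drawn at random without replacement."""
    return rng.choice(scores, size=n, replace=False).mean()

# Collect one simulated statistic per repetition.
sample_averages = np.empty(repetitions)
for i in range(repetitions):
    sample_averages[i] = random_sample_average(class_scores, section_size)

# The simulated averages cluster around the class average.
center = sample_averages.mean()
spread = sample_averages.std()
```

Filling a preallocated array is used here instead of appending in the loop; both approaches give the same collection of simulated values.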

---

## Statistical Significance

### Tail Areas

<img src="img/tail_areas.png" alt="comparison of distributions" width=80%>

### Conventions About Inconsistency

* "Inconsistent with the null": The test statistic is in the tail of the empirical distribution under the null hypothesis
* "In the tail," first convention:
    * The area in the tail is less than 5%
    * The result is "statistically significant"
* "In the tail," second convention:
    * The area in the tail is less than 1%
    * The result is "highly statistically significant"
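
The two conventions can be written as a small helper. The function name and the label for areas of 5% or more are choices made for this sketch; the lecture only names the two "significant" cases.

```python
def describe_significance(tail_area):
    """Label a tail area (a proportion between 0 and 1) using the 5% and 1% conventions."""
    if tail_area < 0.01:
        return "highly statistically significant"
    if tail_area < 0.05:
        return "statistically significant"
    # Label for this case is an assumption of the sketch.
    return "not statistically significant"

labels = [describe_significance(area) for area in (0.004, 0.03, 0.2)]
```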


### Demo: Conventions About Inconsistency

In [None]:
0.05 * 50_000

Since 5% of 50,000 is 2,500, find the value at index 2500 in the table of averages sorted in ascending order.

In [None]:
five_percent_point = averages_tbl.sort(0).column(0).item(2500)
five_percent_point
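
The same lookup can be sketched with a NumPy array. The array below is a hypothetical stand-in for the simulated averages (normal draws with a made-up center and spread), so the resulting number will not match the demo. Note that index 2500 is zero-based, so it picks out the 2,501st smallest value.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical stand-in for the 50,000 simulated averages (assumption).
simulated_averages = rng.normal(loc=15.0, scale=1.4, size=50_000)

# Sort ascending and take the value at index 2500 (5% of 50,000).
sorted_averages = np.sort(simulated_averages)
five_percent_point = sorted_averages[2500]

# Check: about 5% of the simulated values are at or below that point.
share_at_or_below = (
    np.count_nonzero(simulated_averages <= five_percent_point)
    / len(simulated_averages)
)
```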

Visualize the distribution of sample averages along with a vertical line marking the 5% point.

In [None]:
averages_tbl.hist(bins = 20)
plt.plot([five_percent_point, five_percent_point], [0, 0.35], color='gold', lw=2)
plt.title('Area to the left of the gold line: 5%');

### The P-Value as an Area

In [None]:
averages_tbl.hist(bins = np.arange(10, 21, 0.5), right_end=five_percent_point)
plt.plot([five_percent_point, five_percent_point], [0, 0.35], color='gold', lw=2)
plt.scatter(observed_average, 0, color = 'red', s=60, zorder=3);

* Empirical distribution of the test statistic under the null hypothesis
* The red dot is the observed statistic.
* The P-value is represented by the shaded region of the histogram determined by the definition of the alternative hypothesis.

### Definition of the P-value

* The P-value is the chance, 
    * under the null hypothesis, 
    * that the test statistic 
    * is equal to the value that was observed in the data
    * or is even further in the direction of the alternative.
* Also known as the observed significance level
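
A minimal sketch of this definition as an empirical proportion, assuming the alternative says the statistic is too low, so "further in the direction of the alternative" means "at or below the observed value". The simulated statistics and the observed value are hypothetical stand-ins, not the demo's results.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical simulated statistics under the null (assumption).
simulated_statistics = rng.normal(loc=15.0, scale=1.4, size=50_000)

# Hypothetical observed statistic (assumption).
hypothetical_observed = 12.5

# Empirical P-value: share of simulated values equal to the observed value
# or further in the direction of the alternative (here, lower).
p_value = (
    np.count_nonzero(simulated_statistics <= hypothetical_observed)
    / len(simulated_statistics)
)
```

If the alternative instead pointed toward large values, the comparison would flip to `>=`.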


Calculate the P-value using the distribution of simulated statistics.

In [None]:
averages_tbl

In [None]:
observed_average

In [None]:
p_value = ...
p_value

---

## How We’ve Tested Thus Far

### Hypothesis Testing Review

* One Category (*ex: percent of flowers that are purple*)
    * Test Statistic (1): `empirical_percentage`
    * Test Statistic (2): `abs(empirical_percentage - null_percentage)`
    * How to Simulate: `sample_proportions(n, null_dist)`
* Multiple Categories (*ex: ethnicity distribution of jury panel*)
    * Test Statistic: `tvd(empirical_dist, null_dist)`
    * How to Simulate: `sample_proportions(n, null_dist)`
* Numerical Data (*ex: scores in a lab section*)
    * Test Statistic: `empirical_mean`
    * How to Simulate: `population_data.sample(n, with_replacement=False)`
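
For the multiple-categories row, here is a rough NumPy sketch of the statistic and the simulation. `simulate_proportions` is a hypothetical stand-in for `sample_proportions` built on `rng.multinomial`, and the three-category null distribution and sample size are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

def total_variation_distance(dist_a, dist_b):
    """Half the sum of absolute differences between two distributions."""
    return np.abs(np.asarray(dist_a) - np.asarray(dist_b)).sum() / 2

def simulate_proportions(n, null_dist):
    """Category proportions in a random sample of size n drawn from null_dist."""
    return rng.multinomial(n, null_dist) / n

# Hypothetical null distribution over three categories (assumption).
null_dist = np.array([0.5, 0.3, 0.2])

# Simulate the TVD statistic under the null, 5,000 times.
simulated_tvds = np.array([
    total_variation_distance(simulate_proportions(200, null_dist), null_dist)
    for _ in range(5_000)
])
```

Large values of the TVD point toward the alternative, which matches the earlier advice to prefer a statistic where only "high" values count against the null.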

---

<footer>
    <p>Adapted from UC Berkeley DATA 8 course materials.</p>
    <p>This content is offered under a <a href="https://creativecommons.org/licenses/by-nc-sa/4.0/">CC Attribution Non-Commercial Share Alike</a> license.</p>
</footer>