<div style="width: 38.5%;">
    <p><strong>City College of San Francisco</strong></p>
    <hr>
    <p>MATH 108 - Foundations of Data Science</p>
</div>

# Lecture 22: Hypothesis Testing Examples

Associated Textbook Sections: [12.3](https://inferentialthinking.com/chapters/12/3/Deflategate.html)

---

## Outline

* [Benford's Law](#Benford's-Law)
* [Reaction Time](#Demo:-Reaction-Time)
* [Zodiac Signs](#Demo:-Zodiac-Signs)

---

## Set Up the Notebook

In [None]:
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

---

## Benford's Law

> [Benford's law](https://en.wikipedia.org/wiki/Benford%27s_law), also known as the Newcomb-Benford law, the law of anomalous numbers, or the first-digit law, is an observation that in many real-life sets of numerical data, the leading digit is likely to be small. Benford's law tends to apply most accurately to data that span several orders of magnitude. As a rule of thumb, the more orders of magnitude that the data evenly covers, the more accurately Benford's law applies.

---

Observe the distribution of the first digits of numbers according to Benford's model.

In [None]:
digits = np.arange(1, 10)
benford_model = np.log10(1 + 1/digits)
benford = Table().with_columns(
    'First digit', digits,
    'Benford model prob', benford_model)
benford

In [None]:
benford.barh('First digit')
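
As a quick sanity check on the model, the sketch below uses plain NumPy (no `datascience` table) to confirm that the nine Benford probabilities form a valid distribution and that they decrease as the digit grows. The names `total` and `is_decreasing` are introduced here only for illustration.

```python
import numpy as np

digits = np.arange(1, 10)
benford_model = np.log10(1 + 1 / digits)

# The terms telescope: log10(2/1) + log10(3/2) + ... + log10(10/9) = log10(10),
# so the probabilities should add up to 1 (up to floating-point rounding).
total = benford_model.sum()
print(total)

# Each probability is smaller than the one before it, so small leading
# digits are the most likely under the model.
is_decreasing = bool(np.all(np.diff(benford_model) < 0))
print(is_decreasing)
```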

---

### Get First Digit

Use bracket notation to get the first character of a string.

_Bracket notation is used here only as an example. It is a common way in Python to access an item within a collection, similar to how we use `.item(0)` with arrays._

In [None]:
a_string = 'data science'
a_string[0]

Try the same thing with an integer.

In [None]:
# Uncomment this to see a TypeError
an_integer = 1234
#an_integer[0]
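
If you would rather see the error without stopping the notebook, one option is to catch it. This is a minimal sketch; the names `message` and `first_character` are made up for the illustration.

```python
an_integer = 1234

# Indexing an int raises a TypeError because integers are not collections.
try:
    an_integer[0]
except TypeError as error:
    message = str(error)
print(message)

# Converting to a string first makes bracket notation work.
first_character = str(an_integer)[0]
print(first_character)
```

Note that `first_character` is still a string, not a number.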

Explore the `first_digit` function.

In [None]:
def first_digit(num):
    """Returns the first digit of the interger num."""
    return int(str(num)[0])

In [None]:
first_digit(32)
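
A few more calls can help when exploring the function. The sketch below repeats the definition so it runs on its own; the `populations` values are hypothetical and are not taken from `counties.csv`.

```python
def first_digit(num):
    """Returns the first digit of the integer num."""
    return int(str(num)[0])

# Hypothetical population sizes, used only to exercise the function.
populations = [815, 54321, 9, 1200345]
leading = [first_digit(p) for p in populations]
print(leading)

# A negative number starts with '-', which cannot be converted to an int.
try:
    first_digit(-32)
    negative_fails = False
except ValueError:
    negative_fails = True
print(negative_fails)
```

Population sizes are never negative, so the simple version is enough for this demo.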

---

### County Data

Load the `counties.csv` data. This data contains county population sizes from the 2010 Census.

In [None]:
counties = Table.read_table('counties.csv')
counties = counties.where('SUMLEV', 50).select(5,6,9)\
                                       .relabeled(0,'State')\
                                       .relabeled(1,'County')\
                                       .relabeled(2,'Population')

counties

In [None]:
counties.where('County', 'San Francisco County')

---

### Demo: Benford's Law

* Apply `first_digit` to add a column to the `counties` table that shows the first digit of the population sizes.
* Visually compare the distribution of first digits from the `counties` data and Benford's proportions.

In [None]:
first_digits = ...
counties = counties.with_column('First digit', first_digits)
counties.show(3)

In [None]:
num_counties = ...
by_digit = ...
proportions = ...
by_digit = by_digit.with_columns(
    'Proportion', proportions,
    'Benford proportion', benford_model
)
...
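
One possible way to turn a list of numbers into first-digit proportions, sketched with NumPy only. The `populations` array is hypothetical, and this is not necessarily how the demo cells above are meant to be completed with the `datascience` table methods.

```python
import numpy as np

# Hypothetical population sizes (not the real county data).
populations = np.array([1200, 15000, 230, 1800, 47000, 9100, 310, 125000, 2600, 19])

# First digit of each value, via the string trick from first_digit.
first_digits = np.array([int(str(p)[0]) for p in populations])

# np.bincount counts how often each integer 0..9 appears;
# dropping index 0 leaves the counts for digits 1 through 9.
counts = np.bincount(first_digits, minlength=10)[1:]
proportions = counts / len(populations)
print(proportions)
```

Using `minlength=10` keeps a slot for every digit, so the result lines up with the nine Benford probabilities even when some digit never appears.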

---

Test whether or not the distribution of proportions in `counties` is consistent with Benford's model.

Null hypothesis: ...

Alternative hypothesis: ...

Test statistic: ...

Fill in the ... with "Bigger" or "Smaller":

... values of the test statistic favor the alternative

---

Calculate the observed TVD and create a distribution of simulated TVDs under the null hypothesis.

In [None]:
def tvd(arr1, arr2):
    ...

In [None]:
observed_tvd = ...
observed_tvd
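
For reference, the total variation distance between two distributions over the same categories is half the sum of the absolute differences. A minimal NumPy sketch follows; the function name `total_variation_distance` and the uniform comparison distribution are chosen here for illustration.

```python
import numpy as np

def total_variation_distance(dist1, dist2):
    """Half the sum of absolute differences between two distributions."""
    return np.sum(np.abs(dist1 - dist2)) / 2

benford_model = np.log10(1 + 1 / np.arange(1, 10))
uniform = np.ones(9) / 9   # every digit equally likely, for comparison

tvd_same = total_variation_distance(benford_model, benford_model)
tvd_uniform = total_variation_distance(uniform, benford_model)
print(tvd_same, tvd_uniform)
```

The distance is 0 when the distributions match and grows toward 1 as they separate.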

In [None]:
...

In [None]:
simulated_frequencies = sample_proportions(num_counties, benford_model)
...

In [None]:
def simulate_county_first_digits():
    simulated_frequencies = ...
    ...

In [None]:
simulated_tvds = make_array()
reps = 10_000

for __ in np.arange(reps):
    simulated_tvds = ...

In [None]:
Table().with_column('Simulated TVD', simulated_tvds).hist(0)
plt.scatter(observed_tvd, 0, color='red', s=60, zorder=3)
plt.show()

In [None]:
benfords_p_value = ...
benfords_p_value
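
The whole simulation can be sketched without the `datascience` library by using `rng.multinomial(n, probs) / n`, which plays the same role as `sample_proportions(n, probs)`. Everything about the "observed" data below is an assumption: `num_obs` is a made-up sample size, and the observed proportions are themselves one random draw from the model, standing in for real data.

```python
import numpy as np

rng = np.random.default_rng(0)
benford_model = np.log10(1 + 1 / np.arange(1, 10))

def tvd(dist1, dist2):
    return np.sum(np.abs(dist1 - dist2)) / 2

num_obs = 3000   # hypothetical number of observations

# Stand-in for real observed proportions (assumption: drawn from the model).
observed_proportions = rng.multinomial(num_obs, benford_model) / num_obs
observed_tvd = tvd(observed_proportions, benford_model)

# Simulate the TVD many times assuming the null (Benford's model) is true.
reps = 2000
simulated_tvds = np.array([
    tvd(rng.multinomial(num_obs, benford_model) / num_obs, benford_model)
    for _ in range(reps)
])

# Larger TVDs favor the alternative, so count simulations at least as large.
p_value = np.mean(simulated_tvds >= observed_tvd)
print(p_value)
```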

---

## Survey Observations

* _The following examples use a real data set (Class Survey) along with the hypothesis testing procedure to come to conclusions._
* _Keep in mind that the data is not from a random sample._
* _The conclusions we reach should not be taken too seriously._

---

### Demo: Reaction Time

* Load our class survey data from `survey.csv`.
* Explore the relationship between a person's reaction time and whether or not they have at least a Bachelor's degree.
* Remove the rows associated with missing or extreme data values for the relevant columns.

In [None]:
survey = Table.read_table('survey.csv')
survey.show(3)

In [None]:
def is_at_least_bachelors(degree):
    ...

ed_reaction = (survey.select('reaction_time_ms', 'ed_level')
                     .where('reaction_time_ms', are.above_or_equal_to(0))
                     .where('ed_level', are.not_equal_to('nan')))
                           

at_least_bachelors = ...
ed_reaction = ed_reaction.with_column('at_least_BS', at_least_bachelors)

ed_reaction = ed_reaction.drop('ed_level')
ed_reaction

In [None]:
ed_reaction.group('at_least_BS')

In [None]:
ed_reaction.hist('reaction_time_ms', group='at_least_BS')

In [None]:
ed_reaction.sort('reaction_time_ms', True)

In [None]:
ed_reaction.sort('reaction_time_ms')

In [None]:
ed_reaction = ed_reaction.where('reaction_time_ms', are.between(100, 600))
ed_reaction.sort('reaction_time_ms', True)
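
The same kind of filter can be written with a NumPy boolean mask. This sketch assumes that `are.between(100, 600)` keeps values that are at least 100 and strictly less than 600; check the `datascience` documentation if the endpoints matter. The `reaction_times` values are hypothetical.

```python
import numpy as np

# Hypothetical reaction times in milliseconds, including implausible entries.
reaction_times = np.array([-5, 45, 100, 250, 380, 600, 12000])

# Keep values in the half-open interval [100, 600).
keep = (reaction_times >= 100) & (reaction_times < 600)
plausible = reaction_times[keep]
print(plausible)
```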

In [None]:
ed_reaction.hist('reaction_time_ms', group='at_least_BS')

In [None]:
ed_reaction.group('at_least_BS', np.average)

---

Test whether or not there is a significant difference between the average reaction times of those who have at least a Bachelor's degree and those who do not.

In [None]:
def compute_test_statistic(tbl):
    grouped = ...
    avgs = ...
    ...

In [None]:
obs_test_stat = ...
obs_test_stat

In [None]:
random_labels = ed_reaction.sample(with_replacement=False).column('at_least_BS')
random_labels

In [None]:
def simulate_under_null():
    random_labels = ...
    relabeled_tbl = ...
    ...


In [None]:
simulated_diffs = make_array()

for __ in np.arange(1000):
    null_stat = ...
    simulated_diffs = np.append(simulated_diffs, null_stat)

In [None]:
Table().with_column('Simulated difference', simulated_diffs).hist(0)
plt.scatter(obs_test_stat, 0, color='red', s=60, zorder=3)
plt.show()

In [None]:
ed_react_p_value = np.mean(simulated_diffs >= obs_test_stat)
ed_react_p_value
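
A permutation test of this kind can be sketched in NumPy alone. The data below is hypothetical, and the test statistic is assumed to be the absolute difference between the two group averages, which is consistent with computing the p-value using `>=`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical reaction times (ms) and degree labels, not the class survey.
reaction_ms = np.array([250, 310, 280, 400, 350, 290, 330, 270, 360, 300], dtype=float)
has_degree = np.array([True, False, True, False, False, True, False, True, False, True])

def abs_diff_of_means(values, labels):
    """Absolute difference between the averages of the two groups."""
    return abs(values[labels].mean() - values[~labels].mean())

observed = abs_diff_of_means(reaction_ms, has_degree)

# Under the null, the labels are unrelated to reaction time, so shuffling
# the labels produces data that is just as likely as what we observed.
reps = 2000
simulated = np.array([
    abs_diff_of_means(reaction_ms, rng.permutation(has_degree))
    for _ in range(reps)
])

p_value = np.mean(simulated >= observed)
print(observed, p_value)
```

Shuffling the labels without replacement keeps the group sizes fixed, just as `sample(with_replacement=False)` does for the table.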

---

### Demo: Zodiac Signs

---

Load the distribution of Zodiac signs in the United States (as found in 2018 from [statisticbrain.com](https://www.statisticbrain.com/zodiac-sign-statistic)).

In [None]:
zodiac_distribution = Table.read_table('zodiac_distribution.csv')
zodiac_distribution.show()

In [None]:
zodiac_distribution.barh('Zodiac Sign')

---

* Get the birthdays from the survey data.
* Remove rows with missing birth dates.
* Convert the birthdays to Zodiac signs.
* Compare the distribution of Zodiac signs in MATH 108 with the country's distribution.

In [None]:
birthdays = survey.select('bday').where('bday', are.not_equal_to('nan'))

In [None]:
sample_size = birthdays.num_rows
sample_size

In [None]:
type(birthdays.column(0).item(0))

In [None]:
def get_zodiac_sign(birthday):
    """
    Given a birthday in the format "Month/Day", 
    returns a string representing the corresponding zodiac sign.
    """

    month, day = birthday.split('/')
    month = int(month)
    day = int(day)
    
    if month == 12:
        # a more compact way of writing an if statement
        astro_sign = 'Sagittarius' if (day < 22) else 'Capricorn' 
    elif month == 1:
        astro_sign = 'Capricorn' if (day < 20) else 'Aquarius'
    elif month == 2:
        astro_sign = 'Aquarius' if (day < 19) else 'Pisces'
    elif month == 3:
        astro_sign = 'Pisces' if (day < 21) else 'Aries'
    elif month == 4:
        astro_sign = 'Aries' if (day < 20) else 'Taurus'
    elif month == 5:
        astro_sign = 'Taurus' if (day < 21) else 'Gemini'
    elif month == 6:
        astro_sign = 'Gemini' if (day < 21) else 'Cancer'
    elif month == 7:
        astro_sign = 'Cancer' if (day < 23) else 'Leo'
    elif month == 8:
        astro_sign = 'Leo' if (day < 23) else 'Virgo'
    elif month == 9:
        astro_sign = 'Virgo' if (day < 23) else 'Libra'
    elif month == 10:
        astro_sign = 'Libra' if (day < 23) else 'Scorpio'
    elif month == 11:
        astro_sign = 'Scorpio' if (day < 22) else 'Sagittarius'
    else:
        astro_sign = 'Invalid Date'
    return astro_sign
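
The long `if`/`elif` chain can also be written as a lookup table. The sketch below is an alternative with the same cutoff days as `get_zodiac_sign`; the names `SIGN_CUTOFFS` and `zodiac_sign_lookup` are hypothetical and not part of the course materials.

```python
# month -> (cutoff day, sign before the cutoff, sign on or after the cutoff)
SIGN_CUTOFFS = {
    1: (20, 'Capricorn', 'Aquarius'),
    2: (19, 'Aquarius', 'Pisces'),
    3: (21, 'Pisces', 'Aries'),
    4: (20, 'Aries', 'Taurus'),
    5: (21, 'Taurus', 'Gemini'),
    6: (21, 'Gemini', 'Cancer'),
    7: (23, 'Cancer', 'Leo'),
    8: (23, 'Leo', 'Virgo'),
    9: (23, 'Virgo', 'Libra'),
    10: (23, 'Libra', 'Scorpio'),
    11: (22, 'Scorpio', 'Sagittarius'),
    12: (22, 'Sagittarius', 'Capricorn'),
}

def zodiac_sign_lookup(birthday):
    """Given a birthday as "Month/Day", return the zodiac sign as a string."""
    month, day = (int(part) for part in birthday.split('/'))
    if month not in SIGN_CUTOFFS:
        return 'Invalid Date'
    cutoff, before, after = SIGN_CUTOFFS[month]
    return before if day < cutoff else after

print(zodiac_sign_lookup('3/21'), zodiac_sign_lookup('12/21'))
```

Like the original, this version checks the month but does not validate the day.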

In [None]:
def prop_function(count):
    """Return the count provided as a proportion of the sample size."""
    return count / sample_size

birthdays = birthdays.with_column(
    'Zodiac Sign',
    birthdays.apply(get_zodiac_sign, 'bday')
)

birthdays_by_sign = birthdays.group('Zodiac Sign')
birthdays_by_sign = birthdays_by_sign.with_column(
    'Proportion in MATH 108',
    birthdays_by_sign.apply(prop_function, 'count')
).drop('count').sort('Zodiac Sign')

zodiac = birthdays_by_sign.join('Zodiac Sign', zodiac_distribution)
zodiac.show()

In [None]:
zodiac.barh('Zodiac Sign')

---

Run a test to see if there is a significant difference between the distribution of Zodiac signs in the class and the nation.

In [None]:
prop_in_108 = zodiac.column('Proportion in MATH 108')
prop_in_US = zodiac.column('Proportion of US Population')
observed_tvd_zodiac = ...
observed_tvd_zodiac

In [None]:
def simulate_zodiac_proportions():
    simulated_proportions = ...
    ...

simulated_tvds_zodiac = make_array()
reps = 10_000

for __ in np.arange(reps):
    simulated_tvds_zodiac = np.append(simulated_tvds_zodiac, simulate_zodiac_proportions())
    
Table().with_column('Simulated TVD', simulated_tvds_zodiac).hist(0)
plt.scatter(observed_tvd_zodiac, 0, color='red', s=60, zorder=3)
plt.show()

zodiac_p_value = ...
zodiac_p_value

---

## Test Reflection

* The p-value for the last test was approximately 68%. 😱
* This means that our test results are NOT statistically significant. 🫣
* The course Zodiac distribution looks wildly different from the population's distribution.
* How can this be? 
    * Sample size has a pretty big impact. 
    * Try changing the sample size in `sample_proportions` to 450, instead of the actual sample size of 45.
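
To see the effect of sample size without rerunning the survey cells, the sketch below simulates TVDs under a null distribution for samples of 45 and 450. The population is assumed to be uniform over 12 signs purely for illustration; it is not the distribution in `zodiac_distribution.csv`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption: 12 equally likely signs, standing in for the real distribution.
population = np.ones(12) / 12

def tvd(dist1, dist2):
    return np.sum(np.abs(dist1 - dist2)) / 2

def simulated_tvds(sample_size, reps=2000):
    """TVDs between random samples of the given size and the population."""
    return np.array([
        tvd(rng.multinomial(sample_size, population) / sample_size, population)
        for _ in range(reps)
    ])

mean_tvd_45 = simulated_tvds(45).mean()
mean_tvd_450 = simulated_tvds(450).mean()
print(mean_tvd_45, mean_tvd_450)
```

With only 45 people, random samples from the population already land far from it, so a large observed TVD is not surprising. With 450, chance alone produces much smaller TVDs, and the same observed distance would look unusual.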

---

### Hypothesis Test Concerns

The outcome of a hypothesis test can be affected by:
* The hypotheses you investigate: 
    * How do you define your null distribution?
* The test statistic you choose: 
    * How do you measure a difference between samples?
* The empirical distribution of the statistic under the null:
    * How many times do you simulate under the null distribution?
* The data you collected:
    * Did you happen to collect a sample that is similar to the population?
* The truth:
    * If the alternative hypothesis is true, how extreme is the difference?

---

### Hypothesis Test Effects

* Number of simulations: 
    * As large as possible: the empirical distribution approaches the true distribution
    * No new data needs to be collected (yay!)
* Number of observations: 
    * A larger sample will lead you to reject the null more reliably if the alternative is in fact true.
* Difference from the null: 
    * If truth is similar to the null hypothesis, then even a large sample may not provide enough evidence to reject the null.
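
The effect of the number of observations can be sketched with a simple coin example. Everything here is an assumption made for illustration: the null says the chance of heads is 0.5, the truth is 0.6, the test statistic is the distance between the sample proportion and 0.5, and the null is rejected when the statistic is above the 95th percentile of its simulated null distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def rejection_rate(sample_size, true_p=0.6, null_p=0.5, reps=5000):
    """Fraction of samples drawn under the truth that reject the null."""
    # Simulated statistics when the null is true.
    null_stats = np.abs(rng.binomial(sample_size, null_p, reps) / sample_size - null_p)
    cutoff = np.percentile(null_stats, 95)
    # Statistics from samples drawn when the alternative is true.
    alt_stats = np.abs(rng.binomial(sample_size, true_p, reps) / sample_size - null_p)
    return np.mean(alt_stats > cutoff)

rate_small = rejection_rate(20)
rate_large = rejection_rate(500)
print(rate_small, rate_large)
```

Setting `true_p` closer to 0.5 shows the last bullet: when the truth is similar to the null, even the larger sample rejects less often.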


---

<footer>
    <p>Adapted from UC Berkeley DATA 8 course materials.</p>
    <p>This content is offered under a <a href="https://creativecommons.org/licenses/by-nc-sa/4.0/">CC Attribution Non-Commercial Share Alike</a> license.</p>
</footer>