In [None]:
# Run cells by clicking on them and hitting CTRL + ENTER on your keyboard
from IPython.display import YouTubeVideo
from datascience import *
import numpy as np
%matplotlib inline

# Module 3.2 Part 2: Comparing Distributions

In this module, we'll extend the hypothesis testing procedure seen in the previous lecture guide to compare distributions
with more than two categories.

4 videos make up this notebook, for a total run time of 37:15.

1. [Introduction to Comparing Distributions](#section1) *1 video, total runtime 6:03*
2. [Assessing Models](#section2) *2 videos, total runtime 28:25*
3. [Summary](#section3) *1 video, total runtime 2:47*
4. [Check for Understanding](#section4)

Textbook readings: [Chapter 11.2: Multiple Categories](https://www.inferentialthinking.com/chapters/11/2/Multiple_Categories.html)

<a id='section1'></a>
## 1. Introduction to Comparing Distributions

In this lecture video, you'll learn how to compare distributions involving more than two categories. An example on
racial and ethnic disparities in Alameda County jury pools is introduced. We'll return to this example in subsequent
videos.

In [None]:
YouTubeVideo('-FppaIdE0sY')

<a id='section2'></a>
## 2. Assessing Models

Lecture 17.2 introduces the total variation distance, a statistic that measures the difference between two
categorical distributions. It is then used to assess whether there is
evidence of racial discrimination in Alameda County's jury pools.
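
For reference while watching, the total variation distance is usually written as half the sum of the absolute differences between the two distributions. In the formula below, the symbols $p_i$ and $q_i$ are notation chosen for this note (they may not match the slides) and stand for the proportion of category $i$ in each distribution, with $k$ categories in total:

$$
\mathrm{TVD}(p, q) = \frac{1}{2} \sum_{i=1}^{k} \left| p_i - q_i \right|
$$

The distance is 0 when the two distributions are identical and grows as they diverge.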

In [None]:
YouTubeVideo('4fBN_xShzes')

In [None]:
YouTubeVideo('LVpwt9Bi3c8')

<a id='section3'></a>
## 3. Summary

The next video recaps what you've learned in this lecture guide.

In [None]:
YouTubeVideo('g7dyCeCZY4o')

<a id='section4'></a>
## 4. Check for Understanding

The COVID-19 pandemic in the United States is disproportionately affecting racial and ethnic minorities,
especially African-Americans and Latinos ([source](https://www.npr.org/sections/health-shots/2020/05/30/865413079/what-do-coronavirus-racial-disparities-look-like-state-by-state)).

A table indicating the proportion of cases by race and ethnicity in California ([as of June 29th](https://www.cdph.ca.gov/Programs/CID/DCDC/Pages/COVID-19/Race-Ethnicity.aspx))
is provided in the cell below. In the next few questions, we'll determine whether there is evidence supporting the claim that
COVID-19 is hitting California's racial and ethnic minorities the hardest.

In [None]:
ca_covid_cases = Table().with_columns(
  'Race/Ethnicity', make_array('Latino', 'White', 'Asian', 'African American',
                               'Multi-Race', 'American Indian or Alaskan Native',
                               'Native Hawaiian or Pacific Islander', 'Other'),
  'Proportion Cases', make_array(0.557, 0.168, 0.066, 0.044, 0.007, 0.002, 0.006, 0.150),
  'Proportion Population', make_array(0.389, 0.366, 0.154, 0.060, 0.022, 0.005, 0.003, 0.000) 
)

ca_covid_cases

**A. Complete the** `total_variation_distance` **function in the cell below.**

In [None]:
def total_variation_distance(dist_1, dist_2):
    ...

<details>
    <summary>Solution</summary>
    
    def total_variation_distance(dist_1, dist_2):
        tvd = sum(np.abs(dist_1 - dist_2)) / 2
        return tvd
</details>
<br>

**B. If we assumed that COVID-19 did not disproportionately affect racial and ethnic minorities, what values would
we expect in the "Proportion Cases" column of the** `ca_covid_cases` **table?**

<details>
    <summary>Solution</summary>
    If COVID-19 did not disproportionately affect racial and ethnic minorities, then each group's share of cases
    would match its share of California's population. We would therefore expect the "Proportion Cases" column
    to be almost identical to the "Proportion Population" column in the <code>ca_covid_cases</code> table.
</details>
<br>

**C. Compute the total variation distance between the proportion of COVID-19 cases by race and ethnicity and each race
and ethnicity's share of the Californian population. Do these distributions look similar to you?**

In [None]:
...

<details>
    <summary>Solution</summary>
    <b>Code</b>: 
    
    total_variation_distance(
        ca_covid_cases.column('Proportion Cases'),
        ca_covid_cases.column('Proportion Population')
    )

<b>Interpretation</b>: <br>
     Although interpreting the total variation distance is difficult, COVID-19 clearly seems
    to be disproportionately affecting Latinos.
 </details>
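
One way to make the distance from C easier to interpret is to look at each group's signed gap between its share of cases and its share of the population. The sketch below is only an illustration: it uses plain NumPy arrays copied from the `ca_covid_cases` table rather than the `datascience` table itself, and the names `groups`, `cases`, `population`, and `gaps` are introduced here for convenience.

```python
import numpy as np

# Values copied from the ca_covid_cases table defined earlier in this notebook
groups = ['Latino', 'White', 'Asian', 'African American', 'Multi-Race',
          'American Indian or Alaskan Native',
          'Native Hawaiian or Pacific Islander', 'Other']
cases = np.array([0.557, 0.168, 0.066, 0.044, 0.007, 0.002, 0.006, 0.150])
population = np.array([0.389, 0.366, 0.154, 0.060, 0.022, 0.005, 0.003, 0.000])

# Signed gap: positive means a larger share of cases than of the population
gaps = cases - population

# The total variation distance is half the sum of the absolute gaps
tvd = np.abs(gaps).sum() / 2

for group, gap in zip(groups, gaps):
    print(f'{group:<40}{gap:+.3f}')
print(f'Total variation distance: {tvd:.4f}')
```

Groups with a large positive gap are over-represented among cases relative to their population share, and groups with a large negative gap are under-represented.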


**D. Could this difference between distributions, as measured by the total variation distance, be due to chance alone?
Use the procedure introduced in lecture video 17.3 to answer this question. Assume that the proportions of COVID-19 cases
were estimated from a sample of 10,000 individuals.**

In [None]:
# create array to store statistics
tvds = ...

# generate 10,000 statistics
repetitions = 10000
for i in np.arange(repetitions):
    sample_distribution = ...
    tvd = ...
    tvds = ...
    
# view the distribution of the distances    
Table().with_column('Total Variation Distance', tvds).hist()

<details>
    <summary>Solution</summary>
    <b>Code</b>: 
    
    # create array to store statistics
    tvds = make_array()

    # generate 10,000 statistics
    repetitions = 10000
    for i in np.arange(repetitions):
        sample_distribution = sample_proportions(10000, ca_covid_cases.column('Proportion Population'))
        tvd = total_variation_distance(sample_distribution, ca_covid_cases.column('Proportion Population'))
        tvds = np.append(tvds, tvd)

    # view the distribution of the distances
    Table().with_column('Total Variation Distance', tvds).hist()
    
<b>Interpretation</b>: <br>
    Based on the distribution of total variation distances generated under the assumption that the proportion of
    COVID-19 cases by race and ethnicity is similar to that of the population breakdown in California, the total
    variation distance of 0.3205 computed in C seems unlikely to be a result of chance alone. There is evidence to
    suggest that certain races and ethnicities are disproportionately affected by COVID-19.
</details>
<br>
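
To put a number on "unlikely", one option is to compute an empirical p-value: the fraction of simulated distances that are at least as large as the observed one. The sketch below is a NumPy-only variant of the simulation in D, not the notebook's official solution. It assumes NumPy's `Generator.multinomial` in place of `sample_proportions`, rescales the rounded population shares (which sum to 0.999) so they form a valid distribution, and uses a fixed seed purely so the result is repeatable. The names `null_distribution`, `simulated_tvds`, and `p_value` are introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so this sketch is repeatable

# Values copied from the ca_covid_cases table defined earlier in this notebook
cases = np.array([0.557, 0.168, 0.066, 0.044, 0.007, 0.002, 0.006, 0.150])
population = np.array([0.389, 0.366, 0.154, 0.060, 0.022, 0.005, 0.003, 0.000])

# The rounded population shares sum to 0.999, so rescale them before sampling
null_distribution = population / population.sum()

# Observed statistic, computed from the table values as in part C
observed_tvd = np.abs(cases - population).sum() / 2

sample_size = 10000   # assumed number of individuals, as stated in part D
repetitions = 10000   # number of simulated samples

# Each row holds the category counts of one sample drawn under the null
counts = rng.multinomial(sample_size, null_distribution, size=repetitions)

# One total variation distance per simulated sample
simulated_tvds = np.abs(counts / sample_size - null_distribution).sum(axis=1) / 2

# Fraction of simulated distances at least as large as the observed one
p_value = np.mean(simulated_tvds >= observed_tvd)
print(p_value)
```

A p-value at or very near zero would agree with the interpretation in the solution to D: distances as large as the observed one essentially never arise from sampling variation alone.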