In [None]:
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

## Lecture 17 ##

## Alameda County Jury Panels ##

In [None]:
# These data are from 2010, Alameda County, California
# The 'Panels' values are based on the actual jury panel of 1423 eligible jurors
jury = Table().with_columns(
    'Ethnicity', make_array('Asian', 'Black', 'Latino', 'White', 'Other'),
    'Eligible', make_array(0.15, 0.18, 0.12, 0.54, 0.01),
    'Panels', make_array(0.26, 0.08, 0.08, 0.54, 0.04)
)

jury

We see that the actual jury panel proportions differ from the proportions in the eligible population. Our job is to assess whether this difference could be due to random variation alone, or whether it is extreme enough to show that the panel selection process is biased in some way.

Here's a visualization of the actual panel proportions along with the eligible population proportions:

In [None]:
jury.barh('Ethnicity')

In [None]:
# Under the model (the null hypothesis that jury panel selection is not biased),
# this is the distribution from which jury panels are drawn at random
model = make_array(0.15, 0.18, 0.12, 0.54, 0.01)

### Simulate data based on the model

We use `sample_proportions` to simulate the results of random sampling for a categorical variable such as 'Ethnicity':
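
As a rough illustration of what one such random draw looks like, here is a NumPy-only sketch. It assumes that drawing category counts with a multinomial and dividing by the sample size is an acceptable stand-in for `sample_proportions`; the names `rng`, `demo_model`, `counts`, and `demo_proportions` are made up for this example and are not part of the lecture code.

```python
import numpy as np

# Hypothetical names for illustration only
rng = np.random.default_rng(17)  # seeded so the draw is repeatable
demo_model = np.array([0.15, 0.18, 0.12, 0.54, 0.01])  # 'Eligible' proportions
sample_size = 1423

# Draw a count for each category, then convert counts to proportions
counts = rng.multinomial(sample_size, demo_model)
demo_proportions = counts / sample_size

print(demo_proportions)        # one proportion per ethnicity category
print(demo_proportions.sum())  # proportions add up to 1 (up to rounding)
```

Each run with a different seed gives slightly different proportions, which is exactly the random variation the model allows for.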

In [None]:
# Let's simulate a random draw of 1423 jurors from this distribution
def simulated():
    ''' Returns the proportions from a simulated sample of 1423 jurors
        using the model distribution
    '''
    sample_size = 1423  # to match the real-world jury panel data
    return sample_proportions(sample_size, model)

simulated()

In [None]:
# The actual observed distribution (Panels) looks quite different
# from the simulation -- try running this several times to confirm!
jury_with_simulated = jury.with_column('Simulated', simulated())
jury_with_simulated

In [None]:
jury_with_simulated.barh('Ethnicity')

**Question**: When we create simulated proportions many times, based on the model, do we find that the real-world sample is consistent with the model? What statistic will we use to make this assessment?

A statistic is a single number. If we want to describe how far the 'Panels' proportions are from the 'Eligible' proportions, for example, what arithmetic should we use?

  - Plan A: To compare a column of values `(a, b, c, d, e)` with another column of values `(a2, b2, c2, d2, e2)`, we could sum up the pairwise differences: `(a-a2) + (b-b2) + (c-c2) + (d-d2) + (e-e2)`. Why is this NOT a good way to make a statistic for our specific context (Alameda jury panel selection)?

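  - Plan B: Total Variation Distance (TVD). Take the absolute value of each difference before adding, so that negative and positive differences don't cancel each other out, then divide the sum by 2 (every surplus in one category is matched by a deficit in the others, so the raw sum counts the gap twice).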
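
To see the two plans side by side, here is a small sketch applied to the 'Eligible' and 'Panels' proportions from the table above. It uses plain NumPy arrays rather than the `jury` table, the variable names are invented for this example, and Plan B is computed with the usual TVD convention of halving the summed absolute differences.

```python
import numpy as np

# Proportions copied from the jury table (hypothetical variable names)
eligible = np.array([0.15, 0.18, 0.12, 0.54, 0.01])
panels = np.array([0.26, 0.08, 0.08, 0.54, 0.04])

diffs = panels - eligible

# Plan A: both columns sum to 1, so the signed differences cancel out
plan_a = np.sum(diffs)

# Plan B: absolute differences cannot cancel; halving the sum gives the TVD
plan_b = np.sum(np.abs(diffs)) / 2

print(np.round(diffs, 2))
print(plan_a)
print(plan_b)
```

Plan A comes out at essentially zero (up to floating-point rounding) no matter how different the two columns are, while Plan B gives about 0.14 for these data.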

In [None]:
# Example showing why Plan A does not work: the signed differences cancel,
# so the total is zero (up to floating-point rounding)
vals1 = make_array(.3, .4, .1, .1, .1)  # sum is 1
vals2 = make_array(.2, .3, 0, .1, .4)   # sum is 1
total_deviation = np.sum(vals1 - vals2)
total_deviation

**Back to Slides...**

## Total Variation Distance

In [None]:
def tvd(dist1, dist2):
    '''
    Takes the proportions `dist1` and `dist2` for two categorical distributions 
        of the same length
    Returns the Total Variation Distance between the distributions
    '''
    # Halve the summed absolute differences so that each gap is counted once
    return np.sum(np.abs(dist1 - dist2)) / 2

In [None]:
# The TVD of our observed data (Panels column) from their expected values,
# assuming the model is true
obsvd_tvd = tvd(jury.column('Panels'), model)
obsvd_tvd

Is the observed TVD "large"? 

Does it indicate a biased jury panel selection process, or could it plausibly be the result of random variation?

In [None]:
# Here's the TVD for one simulated jury pool of size 1423
tvd(simulated(), model)

In [None]:
# To truly understand the observed tvd in the context of the model, we need to
# make many simulated distributions and collect up the resulting TVD values

# One simulated test statistic:
def simulated_tvd():
    '''
    Returns the TVD for one simulated jury pool of size 1423
    '''
    return tvd(simulated(), model)

# An empty array to collect up many simulated values:
tvds = make_array()

# A for loop to run the simulation 10000 times:
num_simulations = 10000
for i in np.arange(num_simulations):
    tvds = np.append(tvds, simulated_tvd())

print(len(tvds))
tvds

In [None]:
# Create a histogram for the simulated TVD values
# We see that the TVD numbers tend to be between 0.00 and 0.10
axis_label = 'Simulated TVDs (if model is true)'
mybins = np.arange(0.0, 0.15, .005)

tbl = Table().with_column(axis_label, tvds)
tbl.hist(bins = mybins)
print('Observed TVD: ' + str(obsvd_tvd))

We see that the observed TVD of 0.14 is far outside the range of TVDs we would expect if the model were correct. We conclude that the jury pool formation process is not like random sampling from the eligible population and does not tend to result in a representative jury pool.
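
One way to put a number on "far outside" is to count how often a simulated TVD is at least as large as the observed one. The sketch below is a self-contained NumPy version of the simulation, not the lecture's `datascience` code: it assumes a multinomial draw as a stand-in for `sample_proportions`, uses 2000 repetitions to keep it quick, and all names in it are invented for this example.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the result is repeatable
eligible = np.array([0.15, 0.18, 0.12, 0.54, 0.01])  # the model
panels = np.array([0.26, 0.08, 0.08, 0.54, 0.04])    # the observed panel
sample_size = 1423
repetitions = 2000

def tvd_sketch(dist1, dist2):
    # Half the sum of absolute differences between two distributions
    return np.sum(np.abs(dist1 - dist2)) / 2

observed = tvd_sketch(panels, eligible)

# Each row of simulated_props holds one simulated panel's proportions
simulated_props = rng.multinomial(sample_size, eligible, size=repetitions) / sample_size
simulated_tvds = np.sum(np.abs(simulated_props - eligible), axis=1) / 2

# Fraction of simulated TVDs at least as large as the observed TVD
fraction_as_extreme = np.mean(simulated_tvds >= observed)
print(observed, simulated_tvds.max(), fraction_as_extreme)
```

A fraction at or near zero means that panels as lopsided as the observed one essentially never arise under the model, which is the numerical counterpart of what the histogram shows.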

**Back to Slides...**