# Lab 4: Testing hypotheses

Welcome to lab 4!  This lab covers the use of computer simulation, in combination with data, to answer questions about the world.  You'll answer two questions:
1. Was a panel of jurors selected at random from the population of eligible jurors?
2. Were murder rates in US states from 1960-2003 equally likely to go up or down each year?

We'll also learn some basic ways to work with data in tables.

# 0. Preliminaries
First, run the cell below to prepare the lab and the automatic tests.

In [1]:
# Run this cell, but please don't change it.

# These lines import the NumPy and datascience modules.
import numpy as np
# This way of importing the datascience module lets you write "Table" instead
# of "datascience.Table".  The "*" means "import everything in the module."
from datascience import *

# These lines set up visualizations.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)

# This line loads some special functions for this lab.
from lab_utils import *

# These lines load the tests.
from client.api.assignment import load_assignment 
lab04 = load_assignment('longlab04.ok')

# 1. Jury selection: Swain v. Alabama
In lecture, we tested the hypothesis that a jury panel in Alameda County was a random sample from the population of Alameda.  First we'll go through a similar example.

##### Background
Swain v. Alabama was a US Supreme Court case decided in 1965.  A black man, Swain, was accused of raping a white woman, and had been convicted by an all-white jury.  The jury was selected from a panel that contained only 8 black people and 92 whites.  The panel was supposed to be a random sample from the eligible population of Talladega County, which was 26% black.  Swain's lawyers argued that this reflected a bias in the jury panel selection process.  A five-justice majority of the Supreme Court disagreed, concluding that,

> "The overall percentage disparity has been small, and reflects no studied attempt to include or exclude a specified number of Negroes."

Is 8 enough smaller than 26 to indicate bias?  This may seem like a judgement call.  Apparently five Supreme Court justices thought so (interpreting their motives charitably)!

But in fact we can contradict the assertion in the Supreme Court opinion with basic computer simulation.  (The scarcity of computers at the time was no excuse.  Any statistician could have done this calculation by hand in 1965.)

Here are the data:

In [22]:
jury = Table.read_table("jury.csv")
jury.barh('Race', [1, 3])
jury

Imagine drawing a random sample from the population of Talladega.  Each person sampled has a 26% chance of being black.  We can simulate drawing 1 person in this way by calling this function:

In [23]:
one_random_sample = random_sample_counts(jury, "Proportion eligible", 1)
one_random_sample

This table is the same as `jury`, but the last column in this table contains the numbers of Black and Other people in a random sample of 1 person.  The race of the person was decided randomly: they had a 26% chance of being black.

If we repeated this many times, about 26% of the time we'd get a black person.  That's the law of averages.  Let's try that out.

In [30]:
# Simulates many samples of sample_size people from the eligible
# population.  Prints out some information about the samples we
# simulated.
def simulate_samples(sample_size):
    # We can define a function just for use inside another function!
    # This function simulates a sample of people from the eligible
    # population and returns the number of black people in that
    # sample.
    def simulated_num_black():
        return random_sample_counts(jury, "Proportion eligible", sample_size).column("Count in Random Sample").item(0)

    num_simulations = 10000
    many_simulations = repeat(simulated_num_black, num_simulations)
    simulations_table = Table().with_column("number of black people in a sample of " + str(sample_size), many_simulations)
    if sample_size <= 5:
        bins = np.arange(-.5, sample_size + .5, .01)
    else:
        bins = np.linspace(0, sample_size+1, 100)
    simulations_table.hist(bins=bins)

simulate_samples(1)

In each sample, there was only 1 person, so there could be only 0 or 1 black people.  The histogram shows that around 26% of the time there was 1 black person and 74% of the time there were no black people.
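
The same law-of-averages check can be sketched with NumPy alone.  This is a minimal illustration, not the lab's `random_sample_counts` helper; it assumes only that each sampled person is black with chance 0.26, independently of everyone else.

```python
import numpy as np

# Assumption: each sampled person is black with probability 0.26,
# independently of everyone else.
rng = np.random.default_rng(0)
num_simulations = 10000

# One simulated "sample of 1" per entry: True means the person was black.
draws = rng.random(num_simulations) < 0.26

# The mean of a True/False array is the proportion of True values.
proportion_black = np.mean(draws)
print(proportion_black)
```

The printed proportion should land close to 0.26, matching the taller and shorter bars in the histogram.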

Now let's try this with a sample of 100 people.  This is how the jury panel in Swain's case was *supposed* to be selected.

In [31]:
panel_size = 100
simulate_samples(panel_size)

This histogram shows the number of black people we expect to see in a sample of 100 from the eligible population.  Sometimes it's as low as 10 or as high as 45.  It's almost never as low as 8.  It might help to draw 8 on the histogram:

In [33]:
# Plot a vertical red line at spot.  You don't need to
# read this function unless you're interested.
def draw_red_line(spot, height, description):
    plt.plot([spot, spot], [0, height], color="red", label=description)
    # Draw a legend off to the right to explain the red line.
    plt.legend(loc="center left", bbox_to_anchor=[.8, .5])

simulate_samples(panel_size)
draw_red_line(8, .1, "number of black people in\nSwain's jury panel")

We've demonstrated by simulation that it's extremely unlikely to see a jury panel with only 8 black people, if the panel is really randomly selected from an eligible population with 26% black people.
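
The by-hand calculation mentioned earlier can also be sketched directly.  Assuming the 100 panelists are drawn independently, each with a 26% chance of being black (a binomial model, which is our simplification), the chance of 8 or fewer black panelists is a sum of 9 binomial terms:

```python
from math import comb

panel_size = 100
p_black = 0.26
observed = 8

# P(X <= 8) for X ~ Binomial(100, 0.26): add up the chance of
# exactly k black panelists for k = 0, 1, ..., 8.
chance_8_or_fewer = sum(
    comb(panel_size, k) * p_black**k * (1 - p_black)**(panel_size - k)
    for k in range(observed + 1)
)
print(chance_8_or_fewer)
```

Under that model the result is a tiny fraction of a percent, which agrees with what the histogram shows.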

The Supreme Court opinion was factually incorrect; there is strong evidence of some kind of bias.

# 2. Understanding violent crime
Does instituting a death penalty for murder have a deterrent effect on murder?  Social scientists have been bringing data to bear on this question for at least 40 years, but the answer is still controversial.  This is in part because the evidence is surprisingly complex.  The final project in the Spring 2016 course had students investigate the evidence.

We'll look at a small part of that project, using simulation to answer a preliminary question about the overall trend in murder rates in the US.

The data for this section come from a [paper](http://cjlf.org/deathpenalty/DezRubShepDeterFinal.pdf) by three researchers, Dezhbakhsh, Rubin, and Shepherd.  The dataset contains per-capita rates of various violent crimes for every year 1960-2003 (44 years) in every US state.  (Actually, the rates are per 100,000 people.)  The researchers compiled their data from the FBI's Uniform Crime Reports.

The dataset is in a file in your account called `crime_rates.csv`.  Run the next cell to load it.

In [13]:
murder_rates = Table.read_table('crime_rates.csv').select(['State', 'Year', 'Population', 'Murder Rate'])
murder_rates.set_format(2, NumberFormatter)

## 2.1. A basic question
Here is one basic question about this dataset:

> Across each 2-year period in each state, did murder rates increase more often than they decreased?

Let's define the *net number of increases* as the number of times the murder rate increased year-over-year in a state, minus the number of times it decreased.  So we can rephrase our question as: Is the net number of increases positive?

This is a question about the data at hand, involving no randomness or unknowns.  We can just compute the answer with Python.

First, we define a function that takes an array of murder rates for a single state, in order by year, and produces the number of net increases for that state.

In [15]:
# rates should be an array of murder rates for a single state,
# in order by year.  This function computes the number of
# increases over each 2-year period minus the number of
# decreases over each 2-year period.
def two_year_increases(rates):
    # Don't worry about how this is computed; it uses some
    # programming ideas we haven't covered.
    two_year_diffs = rates[1:] - rates[:-1]
    return np.count_nonzero(two_year_diffs > 0) - np.count_nonzero(two_year_diffs < 0)
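
If you are curious how the slicing works, here is a small worked example.  The rates are made up for a hypothetical state; the point is only to show what `rates[1:] - rates[:-1]` produces.

```python
import numpy as np

# Made-up murder rates for one hypothetical state over 5 years.
rates = np.array([8.1, 9.5, 9.5, 7.0, 7.4])

# rates[1:] drops the first year and rates[:-1] drops the last, so
# subtracting them pairs each year with the year before it.
diffs = rates[1:] - rates[:-1]

num_up = np.count_nonzero(diffs > 0)    # 8.1 -> 9.5 and 7.0 -> 7.4
num_down = np.count_nonzero(diffs < 0)  # 9.5 -> 7.0
net_increases = num_up - num_down
print(net_increases)
```

Note that the year where the rate stayed at 9.5 counts as neither an increase nor a decrease.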

Now we want to do this for each state.  More precisely, here is a recipe for computing the total number of net increases in US states over this time period:
1. take the murder rates for each state,
2. put them in an array (50 arrays total),
3. call `two_year_increases` on each array (50 calls to `two_year_increases`), and
4. sum up all those to get the total net increases in the US over the period.

We could do this manually, using a method called `where` that you'll learn about soon:

In [None]:
alabama_increases = two_year_increases(murder_rates.where("State", are.equal_to("Alabama")).column("Murder Rate"))
alaska_increases = two_year_increases(murder_rates.where("State", are.equal_to("Alaska")).column("Murder Rate"))
...
manual_total_net_increases = alabama_increases + alaska_increases + ...

But that would involve writing 50 lines of nearly-identical code!  In computer programming, it's rarely correct to do that kind of repetitive work.

<img src="https://qph.is.quoracdn.net/main-qimg-3289068633a342369f8e9319609f5762?convert_to_webp=true"/>

Instead, we can have Python do the work for us, using a method called `group`.  Here's how:

In [40]:
# First, make a table with just the State and Murder Rate columns,
# for simplicity.
state_and_murder_rate = murder_rates.select(["State", "Murder Rate"])

# Group together the 44 years for each state, then compute the
# number of net increases for each state.  You'll learn how
# to use the group method later.
net_increase_count_by_state = state_and_murder_rate.group("State", two_year_increases)

### 2.1.1. Interlude: `group`
`group` categorizes rows in a table according to one column (like the State), and then does something with the data in each category.  Let's see some simpler examples.

One thing `group` can do with each category is count the number of things in the category.  So we could use `group` to see how many years are covered in our `murder_rates` table for each state:

In [42]:
murder_rates.group("State")

For each row in the table you see, the `count` column is the number of rows in `murder_rates` with that `State`.  You can see that we have 44 rows for each state.  That's one for each year (1960 to 2003).
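
As a rough mental model (not how `group` is actually implemented), counting by category is like tallying a list of labels.  The labels below are made up:

```python
from collections import Counter

# Made-up State column with one entry per (state, year) row.
states = ["Alabama", "Alabama", "Alaska", "Alaska", "Alaska"]

# Counter tallies how many times each distinct value appears.
count_by_state = Counter(states)
print(count_by_state)
```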

**Question 2.1.1.1.** Suppose we want to know how many states are in our dataset for each year.  Make an appropriate table using `group`.

In [None]:
num_states_by_year = ...
num_states_by_year

In [None]:
_ = lab04.grade("q2111")

For our hypothesis test, we want to compute the net number of increases in murder rate for each state instead of its count.  It's also possible to do that kind of thing with `group`.  For example, we can compute the highest one-year murder rate in each state over the period:

In [None]:
# Pare down our table to just the State and Murder Rate columns.
state_and_murder_rate = murder_rates.select(["State", "Murder Rate"])

highest_murder_rate = state_and_murder_rate.group("State", max)
highest_murder_rate

`group` puts each state's murder rates in its own array, and then calls `max` on each one to produce the maximum murder rate for that state.  Pictorially, if the original table looked like this:

|State|Murder Rate|
|-|-|
|Alabama|8.1|
|Alabama|9.5|
|Alaska|7.5|
|Alaska|8.6|
|Alaska|8.2|

Then the grouped table is computed like this:

|State|Murder Rate max|
|-|-|
|Alabama|max([8.1, 9.5])|
|Alaska|max([7.5, 8.6, 8.2])|

...which would result in `state_and_murder_rate.group("State", max)` looking like this:

|State|Murder Rate max|
|-|-|
|Alabama|9.5|
|Alaska|8.6|
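
The same picture can be written out in plain Python.  This is only a sketch of the idea, using the toy rows from the tables above; the real `group` method does this work for you.

```python
states = ["Alabama", "Alabama", "Alaska", "Alaska", "Alaska"]
rates = [8.1, 9.5, 7.5, 8.6, 8.2]

# Sort each rate into a list for its state.
rates_by_state = {}
for state, rate in zip(states, rates):
    rates_by_state.setdefault(state, []).append(rate)

# Apply the function (here max) to each state's list.
max_by_state = {state: max(group) for state, group in rates_by_state.items()}
print(max_by_state)  # {'Alabama': 9.5, 'Alaska': 8.6}
```

Swapping `max` for any other function that turns a list into a single number gives a different summary of each group.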

**Question 2.1.1.2.** Use the `murder_rates` table to compute the total US population in each year.  The total US population in a year is just the sum of the populations of the 50 states in that year.  Run the tests for a hint.

In [44]:
# For your convenience, here's a table with only the Year and Population columns:
year_and_population = murder_rates.select(["Year", "Population"])

# Fill in this line.  Use something like:
#   year_and_population.group(...).
total_pop_by_year = ...

# We can use your table to make a plot of US population over time.
# You don't need to edit this part.
total_pop_by_year.plot(0, 1)
total_pop_by_year

In [None]:
_ = lab04.grade("q2112")

So this is what we did to compute the net increases by state:

    net_increase_count_by_state = state_and_murder_rate.group("State", two_year_increases)

### 2.1.2. Back to murder rates
Now that we have the net increases for each state, we can just add them up.  The function `np.sum` adds up the elements in an array.

In [45]:
# Add together the net increases for all the states to get
# a single number.
total_net_increases = np.sum(net_increase_count_by_state.column(1))

print('Total increases minus total decreases, across all states and years:', total_net_increases)

Okay, so the murder rate increased more often than it decreased.  But does this really say anything interesting about the underlying process according to which murder rates change?

Maybe this result is actually compatible with a simple story:

> Murder rates *randomly* go up or down each year in each state, like the flip of a coin.

Intuitively, we're unlikely to see a lot more increases than decreases if this story is true.  But it's hard to know whether 36 qualifies as "a lot."  We'll use simulation to find out.

## 2.2. Simulation
Translating into technical language, the simple story is our *null hypothesis*.  We need to simulate what would happen if that hypothesis were true, and see if we'd be surprised to see 36 net increases.  If our data would be surprising under the null hypothesis, that would be evidence against the null hypothesis.

The number of net increases is called our *test statistic*.  It's a simple one-number summary of our data that we'll use to check whether our data are likely under the null hypothesis.

The null hypothesis says that changes are generated randomly.  So in that story, in each state, each year the murder rate goes up with chance 1/2 and down with chance 1/2.  There are 50 states and 43 2-year periods, so there are $50 \times 43$ ($2150$) separate chances for a change.

Let's simulate them and count the number of net increases!

In [47]:
def simulate_net_increases():
    # First we make a table that gives the chances of each outcome:
    # 1/2 for an increase, and 1/2 for a decrease.
    changes_distribution = Table().with_column("change",      ["increase", "decrease"])\
                                  .with_column("probability", [1/2,        1/2])
    
    # Now we sample 2150 times from that distribution, using the
    # built-in method sample_from_distribution.  This results in
    # a table that tells us how many times (out of the 2150 simulated
    # periods) we saw an increase, and how many times we saw a
    # decrease.
    num_periods = 50*43
    simulated_changes = changes_distribution.sample_from_distribution("probability", num_periods)
    
    # You can uncomment the next line (remove the # sign) and run
    # this cell to see what the simulated_changes table looks like.
    #simulated_changes.show()
    
    # We extract the number of increases and the number of decreases
    # and subtract one from the other.
    num_increases_in_simulation = simulated_changes.column("probability sample").item(0)
    num_decreases_in_simulation = simulated_changes.column("probability sample").item(1)
    return num_increases_in_simulation - num_decreases_in_simulation

# This is a little magic to make sure that you see the same results
# we did, just for pedagogical purposes.
np.random.seed(1234567)

# Simulate once:
one_simulation_result = simulate_net_increases()

print("In one simulation, there were", one_simulation_result, "net increases (versus", total_net_increases, "in the real data).")

In the simulation, we happened to see about as many net increases (30) as we did in the real data (36).  This suggests the real data are consistent with the simple story, where change was random.

But maybe we just got lucky.  Instead of simulating once, we should simulate many times, and see how the simulations *typically* come out.  Then we can reject our null hypothesis if the number of net increases in the real murder rate data looks really unusual.

In [19]:
num_simulations = 5000

# An array of 5000 net increases, each from a separate simulation
# where the changes in murder rates were random.
simulated_net_increases = repeat(simulate_net_increases, num_simulations)

# Here we've written a function to make a histogram of the
# simulated net increases.
increases_bins=np.arange(-200, 200, 10)
def display_test(null_statistics, actual_statistic):
    # Generate a histogram of the simulated net increases.
    Table().with_column("null test statistics", null_statistics).hist(bins=increases_bins)
    # This function was defined earlier in the lab.  It draws
    # a vertical red line at a spot.
    draw_red_line(actual_statistic, .01, "what we actually observed")

# Call the function we defined to actually draw the plot.
display_test(simulated_net_increases, total_net_increases)

The plot shows that it wasn't a fluke: 36 net increases in murder rates over 2150 periods is quite consistent with the simple random story.  We shouldn't reject our null hypothesis.
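
A back-of-the-envelope check points the same way.  Under the null hypothesis the number of increases is like the number of heads in 2150 fair coin flips, and net increases = 2 × increases − 2150.  So the net increases have mean 0 and standard deviation $2\sqrt{2150 \times \frac{1}{2} \times \frac{1}{2}} = \sqrt{2150} \approx 46.4$, and 36 is less than one standard deviation from 0.  Here is a NumPy-only sketch (it does not use the lab's `simulate_net_increases`) that compares that figure with a simulation:

```python
import numpy as np

num_periods = 50 * 43

# Net increases = increases - decreases = 2 * increases - num_periods,
# where increases ~ Binomial(num_periods, 1/2) under the null.
theoretical_sd = 2 * np.sqrt(num_periods * 0.5 * 0.5)

# Simulate 5000 sets of 2150 coin flips and measure the spread.
rng = np.random.default_rng(0)
increases = rng.binomial(num_periods, 0.5, size=5000)
simulated_net = 2 * increases - num_periods
print(theoretical_sd, np.std(simulated_net))
```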

If you prefer to be more quantitative, we can compute a P-value.  That's the proportion of simulated results that are at least as "extreme" as 36.  Graphically, it's the proportion of stuff in the magenta regions:

In [21]:
# Overlays a histogram for the values in null_statistics
# that are larger in magnitude than actual_statistic.
# You don't need to read this unless you want to; it's
# a little complicated.
def display_extreme_region(null_statistics, actual_statistic):
    # Identify the simulated net increases that are "extreme".
    extreme_null_statistics = Table().with_column("null stats", null_statistics).where(0, are.not_between_or_equal_to(-abs(actual_statistic), abs(actual_statistic))).column(0)
    # Make a magenta-colored histogram of just those extreme values.
    # The tricky bit here is rescaling the height of the histogram
    # to match up with the height of our blue histogram.
    bin_width = increases_bins.item(1) - increases_bins.item(0)
    plt.hist(extreme_null_statistics, bins=increases_bins, weights=[1/(len(null_statistics)*bin_width)]*len(extreme_null_statistics), color="magenta", label="random results more extreme\nthan what we observed")
    plt.legend(loc="center left", bbox_to_anchor=[.8, .5])

# We plot the actual histogram again...
display_test(simulated_net_increases, total_net_increases)
# ...and then this call overlays the extreme region on it.
display_extreme_region(simulated_net_increases, total_net_increases)

To actually compute this proportion, we can put our simulated net increases in a table and use a method called `where` to find how many are bigger in magnitude than the observed net increases.

In [81]:
# First we take our array of simulated net increases and make them
# a column of a table with only one column.  That way we can use
# the useful table functions to filter them.
simulation_table = Table().with_column("simulated net increases", simulated_net_increases)

# Now we find the rows where the simulated net increases weren't
# between -36 and 36.
more_extreme = simulation_table.where("simulated net increases", are.not_between_or_equal_to(-abs(total_net_increases), abs(total_net_increases)))

# The P-value is the number of times the simulated increases
# weren't between -36 and 36, divided by the total number of
# simulations we ran.
p_value = more_extreme.num_rows / simulation_table.num_rows
p_value

Once you get the hang of this, you won't need to visualize the sampling distribution and observed test statistic as we did.  Then you'd only need to write the code in this last cell to compute a P-value.
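
For comparison, here is a sketch of the same calculation using only NumPy arrays.  It runs its own simulation with a fixed seed (an assumption made so the example stands alone), so its result will be close to, but not exactly, the `p_value` computed above.

```python
import numpy as np

observed = 36          # net increases in the real data
num_periods = 50 * 43

# Simulate net increases under the null: fair coin flips.
rng = np.random.default_rng(0)
increases = rng.binomial(num_periods, 0.5, size=5000)
simulated_net = 2 * increases - num_periods

# True where a simulated result is larger in magnitude than 36;
# the mean of a True/False array is the proportion of True values.
p_value_sketch = np.mean(np.abs(simulated_net) > abs(observed))
print(p_value_sketch)
```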

### 2.2.1. Interlude: `where`
To find the simulations where the net increases were extreme, we used `where`.  This kind of operation is also called "filtering."  Like `group`, it's an important part of the data analysis toolbox.  Let's spend some time to understand it.

##### Filtering by state
Suppose we want to see only the murder rate data from California.  We can use `where` to do that.  Here's how:

In [9]:
murder_rates.where("State", are.equal_to("California"))

`where` takes 2 arguments:
1. The name of the column that contains the data you want to filter by.  In this case, we want the rows where the state has a certain name, so we use the name of the State column.
2. A "predicate" that tells it whether to accept or reject values in that column.  Predicates are a bit magical, but they're simple to use if you don't think about how they work.  They're all created by calling functions in a module called `are`.  In this case, we want rows where the State column's value is `"California"`.

It returns a table containing only the subset of the rows in the original table where the predicate matches the value in the named column.

It's helpful to translate code like this into English in your head.  In this case, think:

> "the rows in murder rates where the State is California"
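
If you have seen NumPy arrays before, filtering is the same idea as indexing with an array of True/False values.  The columns below are made up, and this is an analogy rather than a description of how `where` works internally:

```python
import numpy as np

# Made-up columns standing in for a few rows of a table.
state = np.array(["Alaska", "California", "California", "Texas"])
rate = np.array([8.6, 3.9, 4.7, 8.7])

# The comparison produces one True/False per row...
is_california = state == "California"
# ...and indexing with it keeps only the rows marked True.
california_rates = rate[is_california]
print(is_california, california_rates)
```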

**Question 2.2.1.1.** Make a table of the data from the year 1960.

In [None]:
murder_rates_1960 = ...
murder_rates_1960

In [None]:
_ = lab04.grade("q2211")

##### Filtering multiple years with `where`
Suppose we want only the data from the years 1971 to 1973.  We can use `where` again, with a different predicate:

In [None]:
murder_rates.where("Year", are.between_or_equal_to(1971, 1973))

If you type

    are.

in a code cell and hit Tab, you'll see a full list of the available predicates.  They have pretty straightforward names.
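
One way to picture a predicate is as a function that answers yes or no for a single value.  The sketch below uses hypothetical stand-ins written for this illustration (they are not the `datascience` implementation) and made-up rows:

```python
# Hypothetical stand-ins, not the datascience implementation.
def between_or_equal_to(low, high):
    # Returns a predicate: a function that says yes or no to one value.
    def predicate(value):
        return low <= value <= high
    return predicate

def where(rows, column, predicate):
    # Keep the rows (dicts here) whose value in `column` passes.
    return [row for row in rows if predicate(row[column])]

# Made-up rows for illustration.
rows = [{"Year": 1970, "Murder Rate": 7.9},
        {"Year": 1971, "Murder Rate": 8.6},
        {"Year": 1973, "Murder Rate": 9.4},
        {"Year": 1974, "Murder Rate": 9.8}]
kept = where(rows, "Year", between_or_equal_to(1971, 1973))
print([row["Year"] for row in kept])  # [1971, 1973]
```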

**Question 2.2.1.2.** Make a table like `murder_rates`, but containing only the rows for state-years when murder rates were above 16.5 per 100,000 per year.

In [None]:
high_murder_rates = ...
high_murder_rates

In [None]:
_ = lab04.grade("q2212")

So to compute the P-value in our test, we used `where` to select the simulation results that were bigger in magnitude than 36, the actual number of net increases:

    simulation_table.where(
        "simulated net increases",
        are.not_between_or_equal_to(-abs(total_net_increases), abs(total_net_increases)))

## 2.3. Interpretation
There is a long tradition of declaring that we can reject a null hypothesis with "statistical significance" if our P-value is less than 0.05, and with "high statistical significance" if it is less than 0.01.  These thresholds are historical accidents.

It's better to think of the P-value as measuring how consistent the data are with the null hypothesis.  Low values indicate that the data go against the null hypothesis and in favor of some alternative model of the world.

If you must use a threshold and come to a yes-or-no conclusion, come up with your own threshold by considering how much evidence you'd need to reject the null hypothesis.

In this case, the null hypothesis seems somewhat implausible to begin with.  Why would changes in murder rates behave in such a simple way?  So we might be comfortable "rejecting" it if we found a P-value of, say, 0.02.  But, in fact, the data are quite consistent with the null hypothesis.

**Question 2.3.1.** Here are some potential conclusions we could draw from this mathematical exercise.  Which ones are valid?  Discuss with a neighbor.

1. The fact that there were 36 more increases than decreases in murder rates is not strong evidence against the hypothesis that murder rates were equally likely to go up or down.
2. The changes in murder rates over this period were probably random.
3. There is no discernible pattern in the changes in murder rates over this period.
4. There were as many increases as decreases in the murder rate over this period.
5. There were more increases than decreases in the murder rate over this period.

*Write your answer here, replacing this text.*