<div class="alert alert-block alert-danger">

# Chi Square Test for Independence (COMPLETE)
    
</div>

Made in collaboration with [Skew the Script](https://skewthescript.org/) and [CourseKata](https://coursekata.org/).

<img src="https://i.postimg.cc/ty1GkxB8/Skew-the-Script-Logo.png" title="Skew the script logo" width=200 align = left>

<img src="https://i.postimg.cc/tXcF0nzD/Course-Kata-logo.png" title="CourseKata logo" width=200 align = right>

In [None]:
# Load the CourseKata library
suppressPackageStartupMessages({
    library(coursekata)
})

# This will get the data from a csv file and save it into a data frame called `StopFrisk_full`
# There were a lot of stop and frisks in 2011 so this may take a minute.
StopFrisk_full <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vS1OjQ3WIWin5SUI4l_1nM4fagFKcY2rdCnqgT6hhuXAeOcIhk4YFLmEOaB1VGhk13fpx4b2u5CBUKw/pub?output=csv", header = TRUE)
StopFrisk_full$Force3_unordered <- recode(StopFrisk_full$Highest_Force, "none" = "No Force", "baton" = "Higher Force", 
"drawweap" = "Higher Force", "ground" = "Higher Force", "hands" = "Hands", 
"hcuff" = "Higher Force", "peppersp" = "Higher Force", "pointweap" = "Higher Force",
"wall" = "Higher Force")
StopFrisk_full$Force3 <- factor(StopFrisk_full$Force3_unordered, ordered = TRUE, levels = c("No Force", "Hands", "Higher Force"))

# This will take a random sample of 5000 from the data (to help speed up code processing during analyses)
set.seed(1)
StopFrisk <- sample(StopFrisk_full, 5000)

<div class="alert alert-block alert-danger">

**NOTE**: 

The effects discussed here are not seen with smaller sample sizes (e.g., n = 1000, n = 2000, or n = 3000). This may be because the expected count for the White/Higher Force group is sometimes less than 5, which would violate an assumption of the chi-square test (we are not sure). Should we make a note about this for the teachers or students?

The Skew the Script lesson uses n = 3164, but we do not get similar patterns until about n = 5000 here.

</div>

### 1.0 - Introduction

Today we will consider some data from NYPD's Stop and Frisk program. This was a controversial program in New York City that allowed police to stop people on the street and search them for weapons or contraband. We can download detailed data on the program from the [NYPD website](https://www1.nyc.gov/site/nypd/stats/reports-analysis/stopfrisk.page). The program peaked in 2011, when 685,724 stops were recorded, so we will consider data from that year.

<img src="https://i.postimg.cc/NQTXq6Zb/Chi-Sq-Indep-stop-frisk.png" title="police brutality protest" width=500>

#### Today we will consider the question: 

Was there racial discrimination during stop and frisk?

These are some of the variables you will find in the `StopFrisk` data frame (a random sample of n = 5000 from the full data frame):

- `Precinct`: precinct of stop (from 1 to 123). Use this [precinct map](https://www1.nyc.gov/site/nypd/bureaus/patrol/find-your-precinct.page) to see stop locations.
- `Gender`: F = Female, M = Male, Z = Unknown (NYPD data did not include non-binary identities)
- `Age`: Age in years
- `Race`: B = Black, H = Hispanic, W = White (data was filtered to these three groups, as they were the most commonly stopped groups)
- `Crime_Suspected`: Crime suspected when stopped (NYPD data did not provide key to these values)
- `Arrest_Made`: Was the stopped person actually arrested? Y = Yes, N = No
- `Highest_Force`: highest force level used on the suspect by a police officer: none = no force used, hands = hands used, wall = suspect against wall, hcuff = handcuffs, drawweap = officer weapon drawn, ground = suspect on ground, pointweap = weapon pointed at suspect, baton = baton, peppersp = pepper spray
- `Force3`: a simplification of `Highest_Force` by separating force levels into 3 groups: no force, hands used, and higher force


**Data Cleaning Notes:**

The data were filtered to include only the 3 most common racial groups stopped: Black, Hispanic, and White. (Race values of "P" and "Q" in the original data were Black-Hispanic and White-Hispanic, respectively; both were changed to "H" in this dataset.) All cases with "other" force levels were dropped from the dataset.

**1.1:** Take a look at the data frame. What are some specific questions we can answer with this data frame?

In [None]:
head(StopFrisk)

<div class="alert alert-block alert-warning">

**Sample Responses**

There are quite a few questions we could answer, but these are a few examples:

- Which precincts had the most stops?
- Are there more men than women being stopped?
- What is the average age of the people being stopped?
- Which race(s) gets stopped the most?
- Was there usually an arrest made?

</div>

**1.2:** If another person was stopped and frisked, how would we add that information into the data frame? (That is, would it be another row? Another column? Into one of the variables? If so, which variable?) 

<div class="alert alert-block alert-warning">

**Sample Response**

The cases in this data frame are stops, so it would be added as a new row.
    
</div>

**1.3:** Let's take a look at the distribution of `Race` and `Force3` in a two-way table. What do you notice?

In [None]:
tally(Force3 ~ Race, data = StopFrisk, margins = TRUE)

<div class="alert alert-block alert-warning">

**Sample Responses**

- Just looking at the totals, we can see more interactions (regardless of Force3) among Black and Hispanic individuals.
- Black and Hispanic individuals have more No Force interactions than White individuals.
- Black and Hispanic individuals have more Hands and Higher Force interactions than White individuals.
    
</div>

### 2.0 - Putting numbers in context

**Imagine a television commentator says:**

“Police had ‘no force’ interactions with only about 400 white suspects. Meanwhile, a much higher number of black suspects—about 2,100—didn’t experience force. Clearly, black people experience ‘force-free’ interactions with police more frequently than white people.” (*note: your counts may be slightly different based on random sampling variation each time you run this notebook*)

**2.1:** Is this a good take? 

<div class="alert alert-block alert-warning">

**Sample Response**

Well, maybe in terms of *raw counts* this is true, but it ignores the rates of force used *proportionally*.
    
</div>

#### Conditional Distributions

What we are really interested in are the conditional distributions. For example:

- What percent of people received force, given they’re Black?
- What percent of people received force, given they’re White?

**2.2:** Find the conditional distribution of force level within *each* racial group.

In [None]:
# Find the proportion of Force3 by Race

# Complete Version
tally(Force3 ~ Race, data = StopFrisk, format = "proportion")

**2.3:** Use the information in our conditional distribution table to reevaluate the claim made by the TV commentator.

<div class="alert alert-block alert-warning">

**Sample Response**

While the commentator's numbers are technically correct, this is not a good analysis of the dataset. The proportion of White people who received no force was actually higher than the proportion among Black people. In other words, White people had no-force interactions at a higher rate than Black people; it’s just that fewer White people were stopped and searched in the first place. Generally, focusing on raw numbers rather than rates or proportions in a two-way table can lead to misleading conclusions.
  
</div>
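
The contrast between raw counts and conditional proportions can be sketched in base R. This is a minimal illustration, assuming the example n = 5,000 counts that appear in the teacher reference calculation later in this notebook (your own sample will differ); the object names `observed` and `cond_props` are made up for this sketch.

```r
# Example counts (rows = force level, columns = race); illustrative only
observed <- matrix(
  c(2129, 1368, 406,
     428,  282,  46,
     181,  134,  26),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    Force3 = c("No Force", "Hands", "Higher Force"),
    Race   = c("B", "H", "W")
  )
)

# Raw counts: far more Black than White suspects had "No Force" stops
observed["No Force", ]

# Conditional distribution of force *within* each race (each column sums to 1)
cond_props <- prop.table(observed, margin = 2)
round(cond_props, 3)
```

In these example counts the "No Force" proportion is higher for White suspects than for Black suspects, even though the raw "No Force" count is much larger for Black suspects.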

### 3.0 - $\chi^2$: Comparing what we expect to what we observe

Since we are investigating whether there is convincing evidence of an association between race of suspect and level of force used, we will want to know whether our observed counts of force by race differ significantly from the expected counts. Rather than doing a bunch of comparisons (and increasing our chance of Type I Error), we can conduct a single **chi-square ($\chi^2$) test for independence**.

**3.1:** What would the empty model assume about the association between the race of the person and the level of force used? What would the empty model predict?

<div class="alert alert-block alert-warning">

**Sample Response**

The empty model would assume there is no association between the race of the person and the level of force used; it would predict that each level of force is used at the same rate across all racial groups.
  
</div>

**3.2:** What would the model that includes Race suggest?

<div class="alert alert-block alert-warning">

**Sample Response**

The alternative model suggests there *is* an association between the race of the person and level of force used; force is used at different rates between racial groups.
  
</div>

To calculate our $\chi^2$ statistic, we start by assuming the empty model is true and the data generating process (DGP) is a world where each level of force is used at the same rate across racial groups. We can estimate the *expected* counts of suspects receiving the various levels of force under this assumption using the following approach:

<img src="https://i.postimg.cc/W21gfTDj/Chi-Sq-Indep-Expected.png" title="Expected Count" width=500>

Imagine the following example of a random sample of n = 5,000:

<img src="https://i.postimg.cc/JRRB35VD/Chi-Sq-Indep-Sample-Obs-Data.png" title="Sample Observed Data" width=800>

Below are the *expected* counts for each group:

<img src="https://i.postimg.cc/09J2w6rM/Chi-Sq-Indep-Sample-Exp-Counts.png" title="Expected Counts of Sample Observed Data" width=800> 

**3.3:** In your own words, explain how the expected counts were calculated and why they represent a world under the null hypothesis assumption.

<div class="alert alert-block alert-warning">

**Sample Response**

*First, we use the marginal distribution for force to get the overall force rates across all the racial groups (e.g., 3903/5000 ≈ 78.1% overall rate of “no force”). Then, we apply those force rates to the sample size of each racial group to get the expected number who’d receive that force level (e.g., we’d expect 0.7806 × 2738 ≈ 2137.3 Black suspects to receive “no force”). These counts represent the null hypothesis because every racial group is assigned the same force rates.*
  
</div>
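
The expected-count recipe can also be sketched in base R. This is a hedged illustration using the example n = 5,000 counts from the teacher reference calculation in this section; the names `observed`, `force_rates`, `race_totals`, and `expected` are chosen for this sketch only.

```r
# Example observed counts (rows = force level, columns = race); illustrative only
observed <- matrix(
  c(2129, 1368, 406,
     428,  282,  46,
     181,  134,  26),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    Force3 = c("No Force", "Hands", "Higher Force"),
    Race   = c("B", "H", "W")
  )
)

n <- sum(observed)

# Step 1: overall (marginal) rate of each force level, ignoring race
force_rates <- rowSums(observed) / n

# Step 2: apply those same rates to the number of stops in each racial group
race_totals <- colSums(observed)
expected <- outer(force_rates, race_totals)

round(expected, 1)
```

Because every column uses the same `force_rates`, the expected table describes a world in which force level does not depend on race.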

Once we have the expected counts, we can use the ***same chi-square ($\chi^2$) equation*** we used for our Goodness of Fit test!

<img src="https://i.postimg.cc/9VPsCQmm/Chi-Sq-Fit-Chi-Sq-Formula.png" title="Chi Square Equation" width = 500/>

<img src="https://i.postimg.cc/BsHbtV6r/Chi-Sq-Indep-Chi-Sq-Table-and-Calc.png" title="Chi Square Calculation Table" width = 800/>
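
As a rough sketch, the same $\chi^2$ calculation can be vectorized in base R so that each cell's contribution is visible. It assumes the example n = 5,000 counts used in the teacher reference calculation; the names `observed`, `expected`, `components`, and `chi_sq` are illustrative.

```r
# Example observed counts (rows = force level, columns = race); illustrative only
observed <- matrix(
  c(2129, 1368, 406,
     428,  282,  46,
     181,  134,  26),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    Force3 = c("No Force", "Hands", "Higher Force"),
    Race   = c("B", "H", "W")
  )
)

# Expected counts under independence: (row total * column total) / n
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Contribution of each cell: (observed - expected)^2 / expected
components <- (observed - expected)^2 / expected
round(components, 2)

# The chi-square statistic is the sum of all cell contributions
chi_sq <- sum(components)
chi_sq
```

The individual contributions may differ slightly in the second decimal place from a hand calculation that rounds the expected counts first. In these example counts, the largest contribution comes from the White/Hands cell.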


In [None]:
# For Reference (for teachers)
# Full R calculation
((2129 - 2137.3)^2 / 2137.3) + ((1368 - 1392.6)^2 / 1392.6) + ((406 - 373.1)^2 / 373.1) + ((428 - 413.9)^2 / 413.9) + ((282 - 269.7)^2 / 269.7)+ ((46 - 72.3)^2 / 72.3)+ ((181- 186.7)^2 / 186.7)+ ((134- 121.7)^2 / 121.7)+ ((26 - 32.6)^2 / 32.6)


# Simplified calculation
0.03 + 0.43 + 2.90 + 0.48 + 0.56 + 9.56 + 0.17 + 1.24 + 1.33


**3.4:** Take a shortcut and use the `chisq.test` function below to calculate the chi-square value for your sample. It will take our table of observed counts and compute the expected counts for us! Then it automatically calculates the $\chi^2$ value, as well as the degrees of freedom ($df$) and the p-value.

In [None]:
# Save our tally() table into an object
# (leave out margins: a totals row would be treated as an extra category and inflate the df)
table1 <- tally(Force3 ~ Race, data = StopFrisk)

# Put the object as the argument for `chisq.test()`
chisq.test(table1)
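
The pieces of the `chisq.test()` output can be pulled out and checked by hand. The sketch below is an illustration on the example n = 5,000 counts from the teacher reference calculation (not on `table1`), built from cell counts only; the names `observed`, `result`, and `df_by_hand` are assumptions of this sketch.

```r
# Example observed counts (rows = force level, columns = race); illustrative only
observed <- matrix(
  c(2129, 1368, 406,
     428,  282,  46,
     181,  134,  26),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    Force3 = c("No Force", "Hands", "Higher Force"),
    Race   = c("B", "H", "W")
  )
)

result <- chisq.test(observed)

result$statistic   # chi-square value
result$parameter   # degrees of freedom
result$p.value     # p-value

# Degrees of freedom by hand: (rows - 1) * (columns - 1)
df_by_hand <- (nrow(observed) - 1) * (ncol(observed) - 1)

# p-value by hand: upper-tail area of the chi-square distribution
pchisq(result$statistic, df = df_by_hand, lower.tail = FALSE)

# Common rule of thumb: every expected count should be at least 5
all(result$expected >= 5)
```

Checking `result$expected` is one way to see whether a smaller sample leaves some group (for example, White/Higher Force) with an expected count below 5.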

### 4.0 - Discussion and Conclusions

**4.1:** What do these results suggest? Does there appear to be a random association between race and use of force? Explain your reasoning.

<div class="alert alert-block alert-warning">

**Sample Response**

These results suggest that a sample like ours would be very unlikely (p < .001) if it came from the null DGP. The association between race and use of force does not appear to be due to random chance alone. This is because the observed counts of force for each racial group are very different from the counts expected under the empty model. Thus, the analysis has resulted in a high chi-squared value (24.98) and a low p-value.
  
</div>

**4.2:** Are the differences in force rates across racial groups *practically* important?

To answer this question, let’s home in on one comparison: the percent of Hispanic people who receive hands-level force vs. the percent of White people who receive hands-level force (using the example sample data from section 3):

<img src="https://i.postimg.cc/81wBwZBZ/Chi-Sq-Indep-Hvs-W-Hands.png" title="Chi Square H vs W Hands Table" width = 800/>


At first glance: It’s only a 6 percentage point difference – that doesn’t seem like a lot! 

But: Using hands-level force against White people was somewhat unlikely (about 10% or less). If we use their rate as a baseline, we can ask: what’s the *relative* likelihood of experiencing hands-level force for Hispanic people?

Use R to answer this question below. Interpret the result.

In [None]:
# Likelihood of Hispanic people experiencing hands-level force *relative* to White people
(16 - 10) / 10

<div class="alert alert-block alert-warning">

**Sample Response**

Hispanic suspects were 60% more likely than White suspects to have hands-level force used on them.

*Students will have differing viewpoints on whether the percentage point difference (6 percentage points) or the percent difference (60%) is more relevant here. It’s important for students to see both calculations and contextualize them.*
  
</div>
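
Both comparisons can be computed side by side in R. The sketch below assumes the rounded 16% and 10% figures come from the example n = 5,000 counts in the teacher reference calculation (282 of 1,784 Hispanic stops and 46 of 478 White stops involved hands-level force); the variable names are illustrative.

```r
# Example counts of hands-level force and total stops (illustrative only)
hands_H <- 282
total_H <- 1784
hands_W <- 46
total_W <- 478

# Conditional rates of hands-level force within each group
rate_H <- hands_H / total_H
rate_W <- hands_W / total_W

# Absolute difference, in percentage points
point_diff <- rate_H - rate_W

# Relative difference, using the White rate as the baseline
relative_diff <- (rate_H - rate_W) / rate_W

round(100 * c(rate_H = rate_H, rate_W = rate_W,
              point_diff = point_diff, relative_diff = relative_diff), 1)
```

With unrounded rates the relative difference comes out a little above the 60% obtained from the rounded percentages, which is a useful reminder that rounding early can shift a ratio.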

**4.3:** Use R to calculate this likelihood using your sample. Is it a similar likelihood for your sample?

In [None]:
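# Likelihood of Hispanic people experiencing hands-level force *relative* to White people
# Student template: (Hispanic % - White %) / White %

# Complete Version: students should fill in the values based on the sample data they have 
# There will be some slight sampling variation each time the data is loaded
# but they should get similar proportions
# (assumes Race is coded "H" and "W" as described in the variable list)
props <- tally(Force3 ~ Race, data = StopFrisk, format = "proportion")
(props["Hands", "H"] - props["Hands", "W"]) / props["Hands", "W"]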

**4.4:** Does our data provide enough evidence to prove that New York police are racially biased (in terms of
use of force)? Why or why not?

<div class="alert alert-block alert-warning">

***Discussion notes:***

Another way to phrase this question: Were these differences in rates of force between racial groups
caused by police bias? Some students may argue that these differences can only be explained by police
racial bias. Other students may argue that there are separate underlying variables, such as the
individuals’ cooperation, why they were stopped, etc., that could explain the differences.

Although racial bias may be the most likely explanation, this dataset is consistent with multiple possible
explanations that involve the underlying variables mentioned above (or others). So, we do not have
enough information to draw causal conclusions about police officer bias.

To further the discussion, you can share with students this prominent [study from a Harvard economist](https://scholar.harvard.edu/fryer/publications/empirical-analysis-racial-differences-police-use-force),
who analyzed the same data. It found that, even when controlling for the potential underlying variables
mentioned above, racial differences in force levels remained. This result suggests police bias, but still
does not quite prove it (even though the controls seem pretty comprehensive, there may still be other
“uncontrolled” variables).

One further note:

Our dataset shows that persons of color are stopped at disproportionately high rates. If stops were done
truly randomly, many more white people would be stopped. This doesn’t necessarily pertain to use of
force, but it’s an important point to stamp when talking about the sample sizes.
  
</div>