# Logistic regression of jury rejections

In this notebook we'll be looking for evidence of racial bias in the jury selection process. We'll be reproducing [How did Curtis Flowers end up with a nearly all-white jury?](https://features.apmreports.org/in-the-dark/curtis-flowers-trial-six-jury-selection/).

You can find the documentation for the columns [right here](https://github.com/APM-Reports/jury-data).

## Research

Read the [documentation of what each column means](https://github.com/APM-Reports/jury-data). Make notes about what columns you're interested in and which ones you'll want to control for. You _could_ add them all, but that's just... so much! But maybe you should anyway? Maybe some are more important than others?

Will you know which ones are more important than others before you do the regression? **In the BuzzFeed story, we realized that you could just use steering to predict whether a plane is a surveillance plane. Can we do something like that here?**

> The column we're going to predict is `struck_by_state`. It records whether a potential juror was struck (removed from the jury pool) by the prosecution (the government).

You can also read [the methodology paper](https://features.apmreports.org/files/peremptory_strike_methodology.pdf) that Will Craft put together. Do you want to reproduce exactly what he did?

## Imports and setup

No need to copy and paste anything: it's all set up for us already!

In [None]:
import statsmodels.formula.api as smf
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 200)
pd.set_option('display.width', 1000)
pd.set_option('display.float_format', '{:.5f}'.format)

## Read in the data

We'll start by reading in a **pre-cleaned version of the dataset**. We've already joined the potential jurors, the trial information, and the judge information. We've also added the `struck_by_state` column for those removed from the jury pool by the state, and converted true and false values into ones and zeroes.

In [None]:
df = pd.read_csv("jury-cleaned.csv")
df.head(2)
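
Before modeling, it can help to eyeball the raw strike rate for each group. The sketch below uses a tiny made-up table (the rows are invented for illustration and are not taken from `jury-cleaned.csv`; the `'Black'` label is also an assumption about how the `race` column is coded). Because `struck_by_state` is stored as ones and zeroes, its mean within each group is the share of potential jurors who were struck.

```python
import pandas as pd

# Made-up rows for illustration only; NOT real values from jury-cleaned.csv
toy = pd.DataFrame({
    "race": ["Black", "Black", "Black", "White", "White", "White", "White", "White"],
    "struck_by_state": [1, 1, 0, 0, 0, 1, 0, 0],
})

# The mean of a 0/1 column is the share of rows equal to 1
strike_rate = toy.groupby("race")["struck_by_state"].mean()
print(strike_rate)

# Counts of struck (1) vs. not struck (0) for each group
counts = pd.crosstab(toy["race"], toy["struck_by_state"])
print(counts)
```

On the real data you would swap `toy` for `df`. A raw gap like this doesn't control for anything else about the jurors, which is what the regression is for.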

## Perform some regressions

Run a few of them. Which columns do you include? Which do you leave out? Think about things like:

* The p-value of the overall regression
* The p-value of each specific column
* Whether readers would want to hear you say "controlling for ..." about these variables
* What Will Craft used

> **Data cleaning note:** Some values are True/False and others are 1/0. If you have a reason to standardize this, you might use something like `df = df.replace({ True: 1, False: 0 })` or `df = df.replace({ 1: True, 0: False })`. You might not need to, though: it depends on what you're looking at!
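
If you'd rather convert only the columns that are actually boolean, one alternative to `replace` is to pick them out by dtype and cast them. This is a sketch on a made-up frame (the column names are borrowed from this notebook, the values are invented), not a required cleaning step.

```python
import pandas as pd

# Made-up frame mixing a True/False column with a 1/0 column
toy = pd.DataFrame({
    "know_def": [True, False, True],
    "medical": [1, 0, 0],
})

# Find the boolean columns, then cast them to integers: True -> 1, False -> 0
bool_cols = toy.select_dtypes(include="bool").columns
toy = toy.astype({col: int for col in bool_cols})

print(toy.dtypes)
```

Columns that were already 1/0 are left untouched, so everything ends up on the same numeric scale.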

In [None]:
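# Predict struck_by_state from race (White is the reference category)
# plus a handful of juror attributes we want to control for
model = smf.logit("""
    struck_by_state ~
        C(race, Treatment('White'))
        + fam_law_enforcement
        + medical
        + caretaker
        + know_def
""", data=df)

# Fit the model and show the coefficient table
results = model.fit()
results.summary()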
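
The summary table is handy for reading, but the two kinds of p-values mentioned earlier can also be pulled out of a fitted result directly. The sketch below fits a logit on synthetic data (everything here is generated for illustration; `x1`, `x2`, `y`, and `toy_results` are hypothetical names) and reads the overall likelihood-ratio p-value from `llr_pvalue` and the per-term p-values from `pvalues`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data for illustration only; none of this comes from jury-cleaned.csv
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "x1": rng.integers(0, 2, size=n),  # binary predictor that truly matters
    "x2": rng.integers(0, 2, size=n),  # binary predictor with no real effect
})

# Build the outcome so x1 raises the log-odds by 1.5 and x2 does nothing
log_odds = -1.0 + 1.5 * toy["x1"]
prob = 1 / (1 + np.exp(-log_odds))
toy["y"] = rng.binomial(1, prob)

toy_results = smf.logit("y ~ x1 + x2", data=toy).fit(disp=False)

# p-value of the overall regression (likelihood-ratio test vs. an intercept-only model)
print("overall p-value:", toy_results.llr_pvalue)

# p-value of each term, indexed by term name
print(toy_results.pvalues)
```

With the notebook's own model, the same attributes should be available on `results`.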

In [None]:
# Logit coefficients are log-odds; exponentiate them to get odds ratios
coefs = pd.DataFrame({
    'coef': results.params.values,
    'odds ratio': np.exp(results.params.values),
    'name': results.params.index
})
coefs
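
Logit coefficients are on the log-odds scale, which is why the cell above runs them through `np.exp`. Here is a small worked sketch of how to read the result, using a hypothetical coefficient and a hypothetical baseline probability (both numbers are made up, not results from this dataset).

```python
import math

# Hypothetical coefficient, chosen only for illustration
coef = 0.7

# Exponentiating a log-odds coefficient gives an odds ratio
odds_ratio = math.exp(coef)
print(f"odds ratio: {odds_ratio:.2f}")

# An odds ratio above 1 means higher odds than the reference group,
# below 1 means lower odds, and a coefficient of 0 gives exactly 1
no_effect = math.exp(0)

# Odds are not probabilities: convert with p = odds / (1 + odds)
baseline_prob = 0.20  # hypothetical baseline probability
baseline_odds = baseline_prob / (1 - baseline_prob)
new_odds = baseline_odds * odds_ratio
new_prob = new_odds / (1 + new_odds)
print(f"probability moves from {baseline_prob:.2f} to {new_prob:.2f}")
```

The takeaway is that an odds ratio multiplies the odds, not the probability, so "twice the odds" is not the same claim as "twice as likely".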
