# Logistic regression of jury rejections

In this notebook we'll be looking for evidence of racial bias in the jury selection process. We'll be reproducing [How did Curtis Flowers end up with a nearly all-white jury?](https://features.apmreports.org/in-the-dark/curtis-flowers-trial-six-jury-selection/).

You can find the documentation for the columns [right here](https://github.com/APM-Reports/jury-data).

## Research

Read the [documentation of what each column means](https://github.com/APM-Reports/jury-data). Make notes about what columns you're interested in and which ones you'll want to control for. You _could_ add them all, but that's just... so much! But maybe you should anyway? Maybe some are more important than others?

Will you know which ones are more important than others before you do the regression? **In the BuzzFeed story, we realized that you could just use steering to predict whether a plane is a surveillance plane. Can we do something like that here?**

> The column we're going to predict is `struck_by_state`. It's whether the juror was struck (removed from the jury pool) by the prosecuting lawyer (the government).

You can also read [the methodology paper](https://features.apmreports.org/files/peremptory_strike_methodology.pdf) that Will Craft put together. Do you want to reproduce exactly what he did?

## Imports and setup

No need to copy and paste, it's all set up for us already!

```python
import statsmodels.formula.api as smf
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 200)
pd.set_option('display.width', 1000)
pd.set_option('display.float_format', '{:.5f}'.format)
```

## Read in the data

We'll start by reading in a **pre-cleaned version of the dataset**. We've already joined the potential jurors, the trial information, and the judge information. We've also added the `struck_by_state` column for those removed from the jury pool by the state, and converted true and false values into ones and zeroes.

```python
df = pd.read_csv("data/jury-cleaned.csv")
df.head(2)
```

Unnamed: 0,id_x,juror_id,juror_id__trial__id,no_responses,married,children,religious,education,leans_state,leans_defense,leans_ambi,moral_hardship,job_hardship,caretaker,communication,medical,employed,social,prior_jury,crime_victim,fam_crime_victim,accused,fam_accused,eyewitness,fam_eyewitness,military,law_enforcement,fam_law_enforcement,premature_verdict,premature_guilt,premature_innocence,def_race,vic_race,def_gender,vic_gender,def_social,vic_social,def_age,vic_age,def_sexpref,vic_sexpref,def_incarcerated,vic_incarcerated,beliefs,other_biases,innocence,take_stand,arrest_is_guilt,cant_decide,cant_affirm,cant_decide_evidence,cant_follow,know_def,know_vic,know_wit,know_attny,civil_plantiff,civil_def,civil_witness,witness_defense,witness_state,prior_info,death_hesitation,no_death,no_life,no_cops,yes_cops,legally_disqualified,witness_ambi,notes,id_y,trial,trial__id,race,gender,race_source,gender_source,struck_by,strike_eligibility,id,defendant_name,cause_number,state_strikes,defense_strikes,county,defendant_race,second_defendant_race,third_defendant_race,fourth_defendant_race,more_than_four_defendants,judge,prosecutor_1,prosecutor_2,prosecutor_3,prosecutors_more_than_three,def_attny_1,def_attny_2,def_attny_3,def_attnys_more_than_three,offense_code_1,offense_title_1,offense_code_2,offense_title_2,offense_code_3,offense_title_3,offense_code_4,offense_title_4,offense_code_5,offense_title_5,offense_code_6,offense_title_6,more_than_six,verdict,case_appealed,batson_claim_by_defense,batson_claim_by_state,voir_dire_present,struck_by_state
0,1521,107.0,3.0,0,unknown,unknown,unknown,unknown,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,,107,2004-0257--Sparky Watson,3,White,Male,Jury strike sheet,Jury strike sheet,Struck by the defense,Both State and Defense,3,Sparky Watson,2004-0257,1,1,Grenada,Black,,,,0,"C. Morgan, III",Susan Denley,Ryan Berry,,0,M. Kevin Horan,Elizabeth Davis,,0,41-29-139(a)(1)(b)(3),sale of marihuana (less than 30 grams),41-29-139(a)(1)(b)(1),sale of cocaine,,,,,,,,,0,Guilty on at least one offense,1,0,0,1,0
1,1524,108.0,3.0,0,unknown,unknown,unknown,unknown,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,,108,2004-0257--Sparky Watson,3,Black,Female,Jury strike sheet,Jury strike sheet,Struck by the state,State,3,Sparky Watson,2004-0257,1,1,Grenada,Black,,,,0,"C. Morgan, III",Susan Denley,Ryan Berry,,0,M. Kevin Horan,Elizabeth Davis,,0,41-29-139(a)(1)(b)(3),sale of marihuana (less than 30 grams),41-29-139(a)(1)(b)(1),sale of cocaine,,,,,,,,,0,Guilty on at least one offense,1,0,0,1,1


```python
# df['second_defendant_race'].value_counts()
df['race']
```

```
0         White
1         Black
2         Black
3         Black
4         White
         ...   
2290      White
2291      White
2292      Black
2293      White
2294    Unknown
Name: race, Length: 2295, dtype: object
```
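
Before running any regression, it can help to look at the raw strike rates by race. Here's a minimal sketch of that using `pd.crosstab`. The `toy` dataframe below is made up for illustration (it only borrows the `race` and `struck_by_state` column names), so the numbers it prints say nothing about the real data.

```python
import pandas as pd

# Made-up rows for illustration only. They reuse two column names from the
# dataset but are NOT real jury data.
toy = pd.DataFrame({
    "race": ["Black", "Black", "Black", "White", "White", "White", "White", "Unknown"],
    "struck_by_state": [1, 1, 0, 0, 0, 1, 0, 0],
})

# Share of each racial group that was (1) or was not (0) struck by the state.
# normalize="index" makes every row sum to 1.
strike_rates = pd.crosstab(toy["race"], toy["struck_by_state"], normalize="index")
print(strike_rates)
```

To try this on the real data, swap `toy` for `df`. A raw gap here is not the same as a regression result, since it doesn't control for anything else.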

## Feature engineering

We spent a lot of time feature engineering in the past - **creating new features based on our dataset.** The best thing about statsmodels is **you don't have to build new columns for this**, you can just write the feature directly in the formula when you're describing the model.

For example, if you wanted to say "let's talk about the education level of the juror" you can do...

```python
+ C(education)
```

...but if you wanted to specifically say, *did the juror attend college?*, then you can do...

```python
+ (df.education == 'attended college')
```

...and you can even compare columns, like: *in this case, is the gender of the victim the same as the defendant?*

```python
+ (df.def_gender == df.vic_gender)
```

So it's up to you whether you want to create new 0/1 columns for something like "did the juror attend college" and "is the gender of the victim the same as the defendant" or whether you'd like to just use statsmodels formulas.
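
If you go the new-column route, a rough sketch might look like the one below. Everything in `toy` is made up: the `'attended college'` label is borrowed from the example above and may not match the labels actually used in the dataset, the 0/1 values just mirror the style seen in the preview, and the new column names `attended_college` and `same_gender` are hypothetical.

```python
import pandas as pd

# Made-up rows for illustration only, NOT real jury data.
toy = pd.DataFrame({
    "education": ["attended college", "unknown", "high school", "attended college"],
    "def_gender": [0, 0, 1, 1],
    "vic_gender": [0, 1, 1, 0],
})

# A comparison gives True/False for every row; astype(int) turns that into 1/0.
toy["attended_college"] = (toy["education"] == "attended college").astype(int)
toy["same_gender"] = (toy["def_gender"] == toy["vic_gender"]).astype(int)

print(toy[["attended_college", "same_gender"]])
```

The upside of real columns is that you can inspect them with `value_counts()` before they go into a model. The upside of the formula version is less clutter in your dataframe.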

### Perform some regressions

Run a few of them. Which columns do you include? Which do you not? Think about things like:

* The p-value of the overall regression
* The p-value of each specific column
* Whether people might be interested in you saying "controlling for ..." for these variables
* What Will Craft used
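
If it helps to see where those p-values live, here's a minimal sketch of fitting a logistic regression with `smf.logit` and pulling them out of the result. The data is synthetic (random, with arbitrary strike probabilities I picked so the example runs), and the formula is just one possible choice, not Will Craft's model.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data, generated only so the example runs. It is NOT the jury
# data and its numbers mean nothing.
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "race": rng.choice(["Black", "White"], size=n),
    "gender": rng.choice(["Female", "Male"], size=n),
})

# Assumed strike probabilities, chosen arbitrarily for the demo.
p = np.where(toy["race"] == "Black", 0.6, 0.2)
toy["struck_by_state"] = rng.binomial(1, p)

model = smf.logit("struck_by_state ~ C(race) + C(gender)", data=toy)
result = model.fit(disp=0)  # disp=0 silences the optimizer printout

print(result.llr_pvalue)  # p-value of the overall regression
print(result.pvalues)     # p-value of each term
print(result.params)      # coefficients, in log-odds
```

`result.summary()` shows the same information in one table. With the real data you'd pass `data=df` and your own formula.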

> **Data cleaning note:** Some values are True/False and others are 1/0. If you have a reason to standardize this, you might use something like `df = df.replace({ True: 1, False: 0 })` or `df = df.replace({ 1: True, 0: False })`. Keep in mind that `replace` is applied to every column in the dataframe. You might not need to do this at all, it depends on what you're looking at!
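
If you'd rather only touch the columns that are actually True/False, one alternative is to pick them out by dtype and convert just those. This is a sketch on made-up data: `some_flag` is a hypothetical column name, and the values are invented.

```python
import pandas as pd

# Made-up rows for illustration only. `some_flag` is a hypothetical column.
toy = pd.DataFrame({
    "some_flag": [True, False, True],        # boolean style
    "struck_by_state": [1, 0, 1],            # already 1/0
    "county": ["Grenada", "Grenada", "Grenada"],
})

# Find the boolean columns and convert each one to 1/0, leaving the rest alone.
bool_cols = toy.select_dtypes(include="bool").columns
for col in bool_cols:
    toy[col] = toy[col].astype(int)

print(toy.dtypes)
```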

## Explaining your regression

Using normal human being words, explain your regression.
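
One thing that can make the words easier: logistic regression coefficients are in log-odds, and exponentiating a coefficient turns it into an odds ratio. The sketch below uses a hypothetical coefficient of 1.5 that I made up for the arithmetic. It is not a result from this dataset.

```python
import numpy as np

# Hypothetical coefficient for illustration only, NOT a result from this dataset.
coef = 1.5

# exp(log-odds coefficient) = odds ratio
odds_ratio = np.exp(coef)
print(f"odds ratio: {odds_ratio:.2f}")  # prints "odds ratio: 4.48"
```

In words, that would read something like "controlling for the other variables in the model, the odds of being struck were about 4.5 times as high." If you have a fitted statsmodels result, `np.exp(result.params)` does this for every coefficient at once (here `result` is just a placeholder name for whatever you called your fitted model).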