# Logistic regression of jury rejections

In this notebook we'll be looking for evidence of racial bias in the jury selection process. We'll be reproducing [How did Curtis Flowers end up with a nearly all-white jury?](https://features.apmreports.org/in-the-dark/curtis-flowers-trial-six-jury-selection/).

You can find the documentation for the columns [right here](https://github.com/APM-Reports/jury-data).

## Research

Read the [documentation of what each column means](https://github.com/APM-Reports/jury-data). Make notes about what columns you're interested in and which ones you'll want to control for. You _could_ add them all, but that's just... so much! But maybe you should anyway? Maybe some are more important than others?

Will you know which ones are more important than others before you do the regression? **In the BuzzFeed story, we realized that you could just use steering to predict whether a plane is a surveillance plane. Can we do something like that here?**

> The column we're going to predict is `struck_by_state`: whether the juror was struck (removed from the jury pool) by the prosecution, which represents the state.

You can also read [the methodology paper](https://features.apmreports.org/files/peremptory_strike_methodology.pdf) that Will Craft put together. Do you want to reproduce exactly what he did?

## Imports and setup

No need to copy and paste: it's all set up for us already!

In [1]:
import statsmodels.formula.api as smf
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 200)
pd.set_option('display.width', 1000)
pd.set_option('display.float_format', '{:.5f}'.format)

## Read in the data

We'll start by reading in a **pre-cleaned version of the dataset**. We've already joined the potential jurors, the trial information, and the judge information. We've also added the `struck_by_state` column for those removed from the jury pool by the state, and converted true and false values into ones and zeroes.

In [2]:
df = pd.read_csv("jury-cleaned.csv")
df.head(2)

Unnamed: 0,id_x,juror_id,juror_id__trial__id,no_responses,married,children,religious,education,leans_state,leans_defense,leans_ambi,moral_hardship,job_hardship,caretaker,communication,medical,employed,social,prior_jury,crime_victim,fam_crime_victim,accused,fam_accused,eyewitness,fam_eyewitness,military,law_enforcement,fam_law_enforcement,premature_verdict,premature_guilt,premature_innocence,def_race,vic_race,def_gender,vic_gender,def_social,vic_social,def_age,vic_age,def_sexpref,vic_sexpref,def_incarcerated,vic_incarcerated,beliefs,other_biases,innocence,take_stand,arrest_is_guilt,cant_decide,cant_affirm,cant_decide_evidence,cant_follow,know_def,know_vic,know_wit,know_attny,civil_plantiff,civil_def,civil_witness,witness_defense,witness_state,prior_info,death_hesitation,no_death,no_life,no_cops,yes_cops,legally_disqualified,witness_ambi,notes,id_y,trial,trial__id,race,gender,race_source,gender_source,struck_by,strike_eligibility,id,defendant_name,cause_number,state_strikes,defense_strikes,county,defendant_race,second_defendant_race,third_defendant_race,fourth_defendant_race,more_than_four_defendants,judge,prosecutor_1,prosecutor_2,prosecutor_3,prosecutors_more_than_three,def_attny_1,def_attny_2,def_attny_3,def_attnys_more_than_three,offense_code_1,offense_title_1,offense_code_2,offense_title_2,offense_code_3,offense_title_3,offense_code_4,offense_title_4,offense_code_5,offense_title_5,offense_code_6,offense_title_6,more_than_six,verdict,case_appealed,batson_claim_by_defense,batson_claim_by_state,voir_dire_present,struck_by_state
0,1521,107.0,3.0,0,unknown,unknown,unknown,unknown,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,,107,2004-0257--Sparky Watson,3,White,Male,Jury strike sheet,Jury strike sheet,Struck by the defense,Both State and Defense,3,Sparky Watson,2004-0257,1,1,Grenada,Black,,,,0,"C. Morgan, III",Susan Denley,Ryan Berry,,0,M. Kevin Horan,Elizabeth Davis,,0,41-29-139(a)(1)(b)(3),sale of marihuana (less than 30 grams),41-29-139(a)(1)(b)(1),sale of cocaine,,,,,,,,,0,Guilty on at least one offense,1,0,0,1,0
1,1524,108.0,3.0,0,unknown,unknown,unknown,unknown,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,,108,2004-0257--Sparky Watson,3,Black,Female,Jury strike sheet,Jury strike sheet,Struck by the state,State,3,Sparky Watson,2004-0257,1,1,Grenada,Black,,,,0,"C. Morgan, III",Susan Denley,Ryan Berry,,0,M. Kevin Horan,Elizabeth Davis,,0,41-29-139(a)(1)(b)(3),sale of marihuana (less than 30 grams),41-29-139(a)(1)(b)(1),sale of cocaine,,,,,,,,,0,Guilty on at least one offense,1,0,0,1,1


## Perform some regressions

Run a few of them. Which columns do you include? Which do you not? Think about things like:

* The p-value of the overall regression
* The p-value of each specific column
* Whether readers would want to hear you say "controlling for ..." about these variables
* What Will Craft used
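
The first two items on this list can be read straight off a fitted statsmodels result. Here's a minimal sketch on made-up data (the `struck`, `group`, and `noise` columns are invented for illustration and are not from `jury-cleaned.csv`): `llr_pvalue` is the p-value of the overall regression, and `pvalues` holds one p-value per term.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data, invented for this example only
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "group": rng.integers(0, 2, size=n),  # a 0/1 predictor that matters
    "noise": rng.normal(size=n),          # a predictor unrelated to the outcome
})

# True log-odds of being struck: -2 for group 0, 0 for group 1
prob = 1 / (1 + np.exp(-(-2 + 2 * toy["group"])))
toy["struck"] = (rng.random(n) < prob).astype(int)

toy_results = smf.logit("struck ~ group + noise", data=toy).fit(disp=False)

overall_p = toy_results.llr_pvalue  # p-value of the overall regression
column_p = toy_results.pvalues      # p-value of each term, indexed by name
print(overall_p)
print(column_p)
```

On the real data you would swap in `df` and your own formula. The same two attributes are what `results.summary()` reports as `LLR p-value` and `P>|z|`.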

> **Data cleaning note:** Some values are True/False and others are 1/0. If you have a reason to standardize this, you might use something like `df = df.replace({ True: 1, False: 0 })` or `df = df.replace({ 1: True, 0: False })`. Note that you might not need to, it depends on what you're looking at!
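
If you do decide to standardize, here's one hedged alternative to `replace`, shown on a toy frame (the rows are invented, not taken from `jury-cleaned.csv`): convert only the boolean columns with `astype(int)`, which leaves text columns and existing 1/0 columns alone.

```python
import pandas as pd

# Toy frame, invented for illustration
toy = pd.DataFrame({
    "struck_by_state": [True, False, True],  # boolean column
    "know_def": [0, 1, 0],                   # already 1/0
    "race": ["Black", "White", "Black"],     # text, should stay untouched
})

# Find the boolean columns and cast just those to integers
bool_cols = toy.select_dtypes(include="bool").columns
toy = toy.astype({col: int for col in bool_cols})

print(toy.dtypes)
```

Targeting columns by dtype avoids accidentally rewriting a column where 1 and 0 are real counts rather than yes/no flags.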

In [3]:
model = smf.logit("""
    struck_by_state ~
        C(race, Treatment('White'))
        + fam_law_enforcement
        + medical
        + caretaker
        + know_def
""", data=df)

results = model.fit()
results.summary()

Optimization terminated successfully.
         Current function value: 0.448728
         Iterations 7


| | | | |
|---|---|---|---|
| Dep. Variable: | struck_by_state | No. Observations: | 2295 |
| Model: | Logit | Df Residuals: | 2288 |
| Method: | MLE | Df Model: | 6 |
| Date: | Mon, 09 Aug 2021 | Pseudo R-squ.: | 0.2015 |
| Time: | 10:50:16 | Log-Likelihood: | -1029.8 |
| converged: | True | LL-Null: | -1289.7 |
| Covariance Type: | nonrobust | LLR p-value: | 4.546e-109 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -2.0644 | 0.087 | -23.734 | 0.000 | -2.235 | -1.894 |
| `C(race, Treatment('White'))[T.Black]` | 2.0965 | 0.111 | 18.858 | 0.000 | 1.879 | 2.314 |
| `C(race, Treatment('White'))[T.Unknown]` | 0.3597 | 0.774 | 0.465 | 0.642 | -1.157 | 1.876 |
| fam_law_enforcement | -0.3881 | 0.153 | -2.534 | 0.011 | -0.688 | -0.088 |
| medical | 3.2940 | 1.096 | 3.006 | 0.003 | 1.146 | 5.442 |
| caretaker | 0.2659 | 1.282 | 0.207 | 0.836 | -2.247 | 2.778 |
| know_def | 1.3765 | 0.210 | 6.560 | 0.000 | 0.965 | 1.788 |


In [4]:
coefs = pd.DataFrame({
    'coef': results.params.values,
    'odds ratio': np.exp(results.params.values),
    'name': results.params.index
})
coefs


| | coef | odds ratio | name |
|---|---|---|---|
| 0 | -2.06445 | 0.12689 | Intercept |
| 1 | 2.09649 | 8.13753 | `C(race, Treatment('White'))[T.Black]` |
| 2 | 0.35970 | 1.43290 | `C(race, Treatment('White'))[T.Unknown]` |
| 3 | -0.38811 | 0.67833 | fam_law_enforcement |
| 4 | 3.29396 | 26.94936 | medical |
| 5 | 0.26586 | 1.30455 | caretaker |
| 6 | 1.37648 | 3.96094 | know_def |
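
To make these numbers concrete, here's a small worked example that uses only values copied from the outputs above. It exponentiates the coefficient for Black jurors (and its 95% interval) to get an odds ratio, then converts log-odds into predicted probabilities. The `logistic` helper is written for this example, and the probabilities assume a juror with every other variable in this model set to 0.

```python
import math

# Values copied from the regression output above
intercept = -2.06445
coef_black = 2.09649
ci_low, ci_high = 1.879, 2.314  # 95% interval for the Black coefficient

def logistic(log_odds):
    """Convert log-odds to a probability."""
    return 1 / (1 + math.exp(-log_odds))

# Exponentiating a coefficient turns it into an odds ratio
odds_ratio = math.exp(coef_black)
odds_ratio_ci = (math.exp(ci_low), math.exp(ci_high))

# Predicted probability of a state strike when every other variable is 0
p_white = logistic(intercept)               # White is the reference category
p_black = logistic(intercept + coef_black)

print(round(odds_ratio, 2), [round(x, 2) for x in odds_ratio_ci])
print(round(p_white, 3), round(p_black, 3))
```

Under that all-zeros assumption, the model puts the chance of a state strike at roughly 11% for a White juror and roughly 51% for a Black juror. The odds ratio of about 8 compares odds, not probabilities, so it's worth reporting both.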


In [None]:
# binary logistic regression: the outcome is one of two things
# 1/0, true/false, spy plane/not spy plane
# sentiment analysis: 0 to 1, negative to positive (but what about 0.4? 0.7?)

# SOMETIMES we want to predict numbers instead
# you can't do that with logistic regression
# for that you're going to use LINEAR REGRESSION
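
As a hedged sketch of that last note: when the outcome is a number rather than a yes/no, statsmodels' `smf.ols` takes the same kind of formula as `smf.logit`. The data here is synthetic, and the `x` and `y` columns are invented for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: y is roughly 2 + 3x plus some random noise
rng = np.random.default_rng(1)
toy = pd.DataFrame({"x": rng.uniform(0, 10, size=200)})
toy["y"] = 2 + 3 * toy["x"] + rng.normal(scale=1.0, size=200)

# Same formula syntax as smf.logit, but the outcome is a number
linear_results = smf.ols("y ~ x", data=toy).fit()
slope = linear_results.params["x"]
print(linear_results.params)
```

Unlike logistic coefficients, the slope here needs no exponentiating: it's the expected change in `y` for a one-unit change in `x`.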