# Logistic Regression


Adapted from Jonathan Stray's risk-ratios repo
https://github.com/jstray/risk-ratios

In [13]:
import pandas as pd
import numpy as np
pd.set_option("display.max_colwidth", None)
pd.set_option("display.max_columns", None)

Consider this recent [Miami Herald story](https://www.miamiherald.com/sports/nfl/article258302943.html) about the hiring of Black head coaches in the NFL:

> The chances of landing an NFL head coaching position were three times better for white candidates compared to their non-white counterparts — even after including the most recent hires and controlling for age, number of opportunities, previous coaching position and years of experience in the league.

How do we get to this conclusion? Let's look at the data, which can be downloaded from the Herald [here](https://docs.google.com/spreadsheets/d/1lVPgIu7OKg40trVMnVlzg5EvnAByxXJCWQqGL1qDBis/edit#gid=0). For this exercise we'll use a slightly reformatted version of the data, with one row per candidate per year (candidates often apply for multiple jobs in the same year).

In [2]:
url = "https://raw.githubusercontent.com/jstray/risk-ratios/main/FINAL_coaches_by_year.csv"
df = pd.read_csv(url)

df

Unnamed: 0,Unit_of_Analysis,Coach_ID,Name,Age,Hired,Year,Number_of_Interviews_That_Year,Previous_Job,Previous_Job_Coded,NFL_Playing_Experience,NFL_Coaching_Experience,Total_NFL_Experience,Black,White,Minority,OC,DC,HC
0,Aaron Glenn 2021,1,Aaron Glenn,48,0,2021,1,Other NFL Job,5,15,8,23,1,0,1,0,0,0
1,Aaron Glenn 2022,1,Aaron Glenn,49,0,2022,2,Defensive Coordinator,4,15,9,24,1,0,1,0,1,0
2,Adam Gase 2015,2,Adam Gase,36,0,2015,5,Offensive Coordinator,3,0,12,12,0,1,0,1,0,0
3,Adam Gase 2016,2,Adam Gase,37,1,2016,4,Offensive Coordinator,3,0,13,13,0,1,0,1,0,0
4,Adam Gase 2019,2,Adam Gase,40,1,2019,2,Head Coach Previous Season,1,0,16,16,0,1,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
203,Vic Fangio 2018,119,Vic Fangio,59,0,2018,1,Defensive Coordinator,4,0,31,31,0,1,0,0,1,0
204,Vic Fangio 2019,119,Vic Fangio,60,1,2019,1,Defensive Coordinator,4,0,32,32,0,1,0,0,1,0
205,Vic Fangio 2022,119,Vic Fangio,63,0,2022,1,Head Coach Previous Season,1,0,35,35,0,1,0,0,0,1
206,Winston Moss 2018,120,Winston Moss,52,0,2018,1,Other NFL Job,5,11,12,23,1,0,1,0,0,0


# Exercise 1:  Exploratory pivot tables

In [3]:
piv = pd.crosstab(df.Hired, df.Black, margins=True)
display(piv)

pct_hired = pd.crosstab(df.Hired, df.Black, normalize='index')
display(pct_hired.round(2))

Hired \ Black,0,1,All
0,93,59,152
1,48,8,56
All,141,67,208


Hired \ Black,0,1
0,0.61,0.39
1,0.86,0.14



*  What does the pivot table above tell us?

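👉 _Black candidates are underrepresented among those hired. Because the second table is normalized by row, it shows the makeup of each outcome group: Black candidates account for 39 percent of the candidate-years that did not end in a hire but only 14 percent of those that did. From the counts in the first table, 8 of 67 Black candidate-years (about 12 percent) ended in a hire, compared with 48 of 141 (about 34 percent) for non-Black candidates._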
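
Because `normalize='index'` makes each row sum to 1, the percentages above describe the racial makeup of the hired and not-hired groups. Hire rates within each group come from normalizing by column instead. The sketch below rebuilds the 2×2 table from the counts printed above rather than from `df`; the `counts` frame is a stand-in for illustration only.

```python
import pandas as pd

# Counts copied from the crosstab output above (rows: Hired, columns: Black).
counts = pd.DataFrame(
    [[93, 59], [48, 8]],
    index=pd.Index([0, 1], name="Hired"),
    columns=pd.Index([0, 1], name="Black"),
)

# Divide each column by its own total so that every column sums to 1.
# Row Hired == 1 then holds the hire rate for each group.
hire_rate = counts / counts.sum(axis=0)
print(hire_rate.round(2))
```

On the full dataset the same table should come from `pd.crosstab(df.Hired, df.Black, normalize='columns')`.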



### Now you try!
Create a few more pivot tables to help explain the probability of getting hired. For each pivot table you make, write a sentence explaining your takeaway. You can copy and paste the code from above and switch around the "index" and "columns" fields as needed.

_note: Pivot tables help you quickly see the relationships between categorical variables. If you decide to investigate continuous variables, a histogram faceted by "Hired" may be more helpful than a pivot table._

In [14]:
piv = pd.crosstab(df.Hired, df.NFL_Coaching_Experience, margins=True)
display(piv)

pct_hired = pd.crosstab(df.Hired, df.NFL_Coaching_Experience, normalize='index')
display(pct_hired.round(2))

Hired \ NFL_Coaching_Experience,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,31,32,35,All
0,5,2,3,4,2,2,4,4,5,10,6,6,12,7,11,5,9,7,8,7,6,3,5,2,4,1,2,4,2,2,1,0,1,152
1,2,1,0,1,2,0,1,2,2,2,2,2,7,6,2,3,2,2,4,1,2,4,0,1,0,1,2,1,0,0,0,1,0,56
All,7,3,3,5,4,2,5,6,7,12,8,8,19,13,13,8,11,9,12,8,8,7,5,3,4,2,4,5,2,2,1,1,1,208


Hired \ NFL_Coaching_Experience,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,31,32,35
0,0.03,0.01,0.02,0.03,0.01,0.01,0.03,0.03,0.03,0.07,0.04,0.04,0.08,0.05,0.07,0.03,0.06,0.05,0.05,0.05,0.04,0.02,0.03,0.01,0.03,0.01,0.01,0.03,0.01,0.01,0.01,0.0,0.01
1,0.04,0.02,0.0,0.02,0.04,0.0,0.02,0.04,0.04,0.04,0.04,0.04,0.12,0.11,0.04,0.05,0.04,0.04,0.07,0.02,0.04,0.07,0.0,0.02,0.0,0.02,0.04,0.02,0.0,0.0,0.0,0.02,0.0


* What does the pivot table above tell us?
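_Candidates with 12 or 13 years of NFL coaching experience made up the largest share of those hired: 13 of 56 hires, or about 23 percent. With 33 distinct values and only a handful of candidates in most columns, the table is too sparse to show a clear trend._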
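
A crosstab with one column per year of experience is hard to read. One option is to group the years into bands with `pd.cut` before tabulating. The sketch below uses only the nine rows visible in the preview of `df` above, so the counts mean nothing on their own; the cut points and band labels are arbitrary choices for illustration.

```python
import pandas as pd

# The nine candidate-year rows visible in the preview of df above.
preview = pd.DataFrame({
    "NFL_Coaching_Experience": [8, 9, 12, 13, 16, 31, 32, 35, 12],
    "Hired": [0, 0, 0, 1, 1, 0, 1, 0, 0],
})

# Group years of experience into three bands. right=False makes each
# bin include its left edge, so 10 falls in "10-19" rather than "0-9".
bands = pd.cut(
    preview["NFL_Coaching_Experience"],
    bins=[0, 10, 20, 40],
    right=False,
    labels=["0-9", "10-19", "20+"],
).astype(str)  # plain strings keep the crosstab simple
bands.name = "Experience_Band"

piv = pd.crosstab(preview["Hired"], bands)
print(piv)
```

Applied to the full `df`, the same binning gives a three-column table that can be normalized the same way as the others.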



In [8]:
piv = pd.crosstab(df.Hired, df.Number_of_Interviews_That_Year, margins=True)
display(piv)

pct_hired = pd.crosstab(df.Hired, df.Number_of_Interviews_That_Year, normalize='index')
display(pct_hired.round(2))

Hired \ Number_of_Interviews_That_Year,1,2,3,4,5,6,All
0,101,28,12,7,3,1,152
1,27,9,13,4,1,2,56
All,128,37,25,11,4,3,208


Hired \ Number_of_Interviews_That_Year,1,2,3,4,5,6
0,0.66,0.18,0.08,0.05,0.02,0.01
1,0.48,0.16,0.23,0.07,0.02,0.04


* What does the pivot table above tell us?
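_Nearly half of the hires (48 percent) went to candidates who had only one interview that year, but candidates with several interviews were hired at higher rates: 13 of the 25 candidate-years with three interviews ended in a hire, compared with 27 of the 128 with one interview._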



In [9]:
piv = pd.crosstab(df.Hired, df.HC, margins=True)
display(piv)

pct_hired = pd.crosstab(df.Hired, df.HC, normalize='index')
display(pct_hired.round(2))

Hired \ HC,0,1,All
0,137,15,152
1,50,6,56
All,187,21,208


Hired \ HC,0,1
0,0.9,0.1
1,0.89,0.11


In [10]:
piv = pd.crosstab(df.Hired, df.DC, margins=True)
display(piv)

pct_hired = pd.crosstab(df.Hired, df.DC, normalize='index')
display(pct_hired.round(2))

Hired \ DC,0,1,All
0,112,40,152
1,42,14,56
All,154,54,208


Hired \ DC,0,1
0,0.74,0.26
1,0.75,0.25


In [11]:
piv = pd.crosstab(df.Hired, df.OC, margins=True)
display(piv)

pct_hired = pd.crosstab(df.Hired, df.OC, normalize='index')
display(pct_hired.round(2))

Hired \ OC,0,1,All
0,113,39,152
1,35,21,56
All,148,60,208


Hired \ OC,0,1
0,0.74,0.26
1,0.62,0.38


* What do the pivot tables above tell us?
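_Prior experience as a head coach (`HC`) did not seem to play a big role in whether a candidate was hired: those candidates were 11 percent of hires and 10 percent of non-hires. The same holds for defensive coordinators (25 percent vs. 26 percent). Offensive coordinators, however, made up 38 percent of hires but only 26 percent of non-hires._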



# Exercise 2: Odds Ratio

Calculate the odds ratio of landing an NFL head coaching position for white candidates compared to their non-white counterparts.


In [23]:
# hint: if you're not sure, split this up into two parts, numerator and denominator
#       google the definition of odds ratio to remind yourself how it works if you have forgotten

odds_ratio_of_coaching_job_if_white = (.8/.6)/(.2/.4)
odds_ratio_of_coaching_job_if_white


2.666666666666667

In [24]:
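# odds of hire for white candidates (45 hired, 91 not) over odds for non-white candidates (11 hired, 61 not)
(45/91)/(11/61)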

2.7422577422577423
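
The arithmetic above can be wrapped in a small helper so the numerator and denominator stay explicit. This is a minimal sketch: `odds_ratio` is a hypothetical name, and the 45/91/11/61 counts are taken from the cell above on the assumption that they come from a `Hired` by `White` crosstab.

```python
def odds_ratio(hired_a, not_hired_a, hired_b, not_hired_b):
    """Odds of being hired in group A divided by the odds in group B."""
    odds_a = hired_a / not_hired_a
    odds_b = hired_b / not_hired_b
    return odds_a / odds_b

# White vs. non-white, using the counts from the cell above.
or_white = odds_ratio(45, 91, 11, 61)

# Non-Black vs. Black, using the counts from the first pivot table.
or_non_black = odds_ratio(48, 93, 8, 59)

print(round(or_white, 2), round(or_non_black, 2))
```

The two comparisons differ because candidates who are neither white nor Black fall on opposite sides of the split in each one.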



**Bonus** (optional) 🤖

Calculate the risk ratio. How did the two compare?


In [None]:
# risk_ratio_of_coaching_job_if_white =  
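
One way to approach the bonus, sketched under the same assumption about the counts (45 white hires, 91 white non-hires, 11 non-white hires, 61 non-white non-hires). The risk ratio divides probabilities rather than odds, so each denominator is the group total. `risk_ratio` is a hypothetical helper name.

```python
def risk_ratio(hired_a, not_hired_a, hired_b, not_hired_b):
    """Probability of being hired in group A divided by the probability in group B."""
    risk_a = hired_a / (hired_a + not_hired_a)
    risk_b = hired_b / (hired_b + not_hired_b)
    return risk_a / risk_b

rr_white = risk_ratio(45, 91, 11, 61)
print(round(rr_white, 2))
```

The risk ratio comes out smaller than the odds ratio. That is expected when the outcome is not rare (here 56 of 208 candidate-years ended in a hire), because the odds ratio then sits further from 1 than the risk ratio.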

# Exercise 3: Logistic Regression

Try to calculate the odds ratio of landing an NFL head coaching job while controlling for other factors, using a logistic regression. For Python syntax, you can reference this guide at investigate.ai: https://investigate.ai/regression/logistic-regression-quickstart/

In [None]:
import statsmodels.formula.api as smf

# Here is a basic logistic regression to get you started. 
# You'll need to replace Y and X. Remember, you can add more variables like this
# Y ~ X1 + X2 + X3

model = smf.logit("Y ~ X", data=df)
results = model.fit()
results.summary()


Hmmm...that coefficient is on the logit (log-odds) scale. How do we make sense of that?

You can exponentiate the coefficient to convert it into an odds ratio:
https://investigate.ai/regression/logistic-regression-quickstart/#Converting-coefficient-to-odds-ratio
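
To see why exponentiating works, consider the simplest case. With a single 0/1 predictor, the fitted intercept is the log-odds for the group coded 0 and the coefficient is the log of the crude odds ratio. The sketch below works this out by hand from the Exercise 2 counts rather than calling `statsmodels`; the variable names are illustrative.

```python
import math

# Counts reused from the odds-ratio exercise: white 45 hired / 91 not,
# non-white 11 hired / 61 not.
odds_white = 45 / 91
odds_non_white = 11 / 61

# What a logit model with White as its only predictor would estimate.
intercept = math.log(odds_non_white)                # log-odds of hire when White == 0
coef_white = math.log(odds_white / odds_non_white)  # change in log-odds when White == 1

# Converting back to more readable quantities.
odds_ratio = math.exp(coef_white)                   # coefficient -> odds ratio
baseline_prob = 1 / (1 + math.exp(-intercept))      # intercept -> probability

print(round(coef_white, 3), round(odds_ratio, 3), round(baseline_prob, 3))
```

Once other variables are added to the formula, the coefficient no longer matches the crude odds ratio. It becomes an odds ratio that holds the other variables fixed, which is what "controlling for" means here.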


In [None]:
coefs = pd.DataFrame({
    'coef': results.params.values,
    'odds ratio': np.exp(results.params.values),
    'pvalue': results.pvalues,
    'name': results.params.index
})

coefs

In [None]:
df['predicted_proba'] = results.predict()
# note that logistic regression gives you predicted probabilities
# but there is no residual in the linear-regression sense here, as the outcome is binary...
df.sort_values(by='predicted_proba', ascending=False)

# Discussion

What can we conclude based on our analysis so far?


What can't we conclude based on our analysis so far?


What questions does this analysis leave us with?
- reporting questions
- quant questions

# Evaluating the claim(s)

What do you think of this claim?

> The chances of landing an NFL head coaching position were three times better for white candidates compared to their non-white counterparts — even after including the most recent hires and controlling for age, number of opportunities, previous coaching position and years of experience in the league.


In [None]:
👉

Were you able to reproduce the analysis in this chart? What do you think of the claims the Miami Herald is making here?

![](miami-herald-chart.png)

In [None]:
👉