# Logistic Regression


Adapted from Jonathan Stray's risk-ratios repo
https://github.com/jstray/risk-ratios

In [1]:
import pandas as pd
import numpy as np

Consider this recent [Miami Herald story](https://www.miamiherald.com/sports/nfl/article258302943.html) about the hiring of Black head coaches in the NFL:

> The chances of landing an NFL head coaching position were three times better for white candidates compared to their non-white counterparts — even after including the most recent hires and controlling for age, number of opportunities, previous coaching position and years of experience in the league.

How do we get to this conclusion? Let's look at the data, which can be downloaded from the Herald [here](https://docs.google.com/spreadsheets/d/1lVPgIu7OKg40trVMnVlzg5EvnAByxXJCWQqGL1qDBis/edit#gid=0). For this exercise we'll use a slightly reformatted version of the data, with one row per candidate per year (candidates often apply for multiple jobs in the same year).

In [2]:
url = "https://raw.githubusercontent.com/jstray/risk-ratios/main/FINAL_coaches_by_year.csv"
df = pd.read_csv(url)

df

Unnamed: 0,Unit_of_Analysis,Coach_ID,Name,Age,Hired,Year,Number_of_Interviews_That_Year,Previous_Job,Previous_Job_Coded,NFL_Playing_Experience,NFL_Coaching_Experience,Total_NFL_Experience,Black,White,Minority,OC,DC,HC
0,Aaron Glenn 2021,1,Aaron Glenn,48,0,2021,1,Other NFL Job,5,15,8,23,1,0,1,0,0,0
1,Aaron Glenn 2022,1,Aaron Glenn,49,0,2022,2,Defensive Coordinator,4,15,9,24,1,0,1,0,1,0
2,Adam Gase 2015,2,Adam Gase,36,0,2015,5,Offensive Coordinator,3,0,12,12,0,1,0,1,0,0
3,Adam Gase 2016,2,Adam Gase,37,1,2016,4,Offensive Coordinator,3,0,13,13,0,1,0,1,0,0
4,Adam Gase 2019,2,Adam Gase,40,1,2019,2,Head Coach Previous Season,1,0,16,16,0,1,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
203,Vic Fangio 2018,119,Vic Fangio,59,0,2018,1,Defensive Coordinator,4,0,31,31,0,1,0,0,1,0
204,Vic Fangio 2019,119,Vic Fangio,60,1,2019,1,Defensive Coordinator,4,0,32,32,0,1,0,0,1,0
205,Vic Fangio 2022,119,Vic Fangio,63,0,2022,1,Head Coach Previous Season,1,0,35,35,0,1,0,0,0,1
206,Winston Moss 2018,120,Winston Moss,52,0,2018,1,Other NFL Job,5,11,12,23,1,0,1,0,0,0


# Exercise 1:  Exploratory pivot tables

In [3]:
piv = df.pivot_table(index='Hired', columns='White', aggfunc='count', values='Coach_ID')
display(piv)

# pct_hired = piv.apply(lambda x: x/sum(x)).round(2)
# display(pct_hired)

Hired,White=0,White=1
0,61,91
1,11,45


*  What does the pivot table above tell us?

👉 _(your answer here)_



### Now you try!
Create a few more pivot tables to help explain the probability of getting hired. For each pivot table you make, write a sentence explaining your takeaway. You can copy and paste the code from above and switch around the "index" and "columns" fields as needed.

_note: Pivot tables help you quickly see the relationships between categorical variables. If you decide to investigate continuous variables, a histogram faceted by "Hired" may be more helpful than a pivot table._
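
As a rough sketch of both ideas in the note above, the cell below builds a tiny made-up table (the `toy` DataFrame and its values are hypothetical; only the column names mirror the real data), turns counts into within-group shares, and summarizes a continuous variable by outcome. Treat it as one possible starting point rather than the intended answer.

```python
import pandas as pd

# Hypothetical toy data: these rows are invented for illustration only
toy = pd.DataFrame({
    "Hired": [0, 0, 1, 1, 0, 1, 0, 0],
    "White": [0, 1, 1, 1, 0, 0, 1, 1],
    "Age":   [48, 36, 37, 40, 52, 45, 59, 60],
})

# Share of each group (column) that was or was not hired;
# normalize="columns" makes every column sum to 1
pct_hired = pd.crosstab(toy["Hired"], toy["White"], normalize="columns").round(2)
print(pct_hired)

# For a continuous variable, compare its distribution across outcomes
age_by_hired = toy.groupby("Hired")["Age"].describe()
print(age_by_hired)

# A histogram split by outcome is another option (requires matplotlib):
# toy["Age"].hist(by=toy["Hired"])
```

Swapping `toy` for `df` should apply the same steps to the coaching data, assuming the column names shown in the table above.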

# Exercise 2: Odds Ratio

Calculate the odds ratio of landing an NFL head coaching position for white candidates compared to their non-white counterparts.


In [None]:
odds_ratio_of_coaching_job_if_white = ...  # replace ... with your calculation



**Bonus** (optional) 🤖

Calculate the risk ratio. How do the two compare?


In [None]:
# risk_ratio_of_coaching_job_if_white = 
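
If the difference between the two measures is fuzzy, here is a minimal sketch on made-up counts (the numbers and the helper names `odds_ratio` and `risk_ratio` are hypothetical, not taken from the coaching data). The odds compare hired to not hired, while the risk compares hired to everyone in the group.

```python
def odds_ratio(hired_a, not_hired_a, hired_b, not_hired_b):
    """Odds of being hired in group A divided by the odds in group B."""
    odds_a = hired_a / not_hired_a
    odds_b = hired_b / not_hired_b
    return odds_a / odds_b


def risk_ratio(hired_a, not_hired_a, hired_b, not_hired_b):
    """Probability of being hired in group A divided by the probability in group B."""
    risk_a = hired_a / (hired_a + not_hired_a)
    risk_b = hired_b / (hired_b + not_hired_b)
    return risk_a / risk_b


# Hypothetical counts: group A has 30 hired and 70 not, group B has 10 hired and 90 not
example_or = odds_ratio(30, 70, 10, 90)
example_rr = risk_ratio(30, 70, 10, 90)
print(f"odds ratio: {example_or:.2f}, risk ratio: {example_rr:.2f}")
```

In this toy case the odds ratio comes out larger than the risk ratio, which is typical when the outcome is not rare.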

# Exercise 3: Logistic Regression

Now calculate the odds ratio of landing an NFL head coaching job again, this time controlling for other factors with a logistic regression. For Python syntax, you can reference this guide at investigate.ai: https://investigate.ai/regression/logistic-regression-quickstart/

In [None]:
import statsmodels.formula.api as smf

# Here is a basic logistic regression to get you started.
# You'll need to replace Y and X with column names from df.
# Remember, you can add more variables like this:
# Y ~ X1 + X2 + X3

model = smf.logit("Y ~ X", data=df)
results = model.fit()
results.summary()
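
To see what a multi-variable formula looks like without giving away the exercise, here is a sketch on simulated data. Everything in it is an assumption made for illustration: the `sim` DataFrame, its column names, and the "true" coefficients used to generate the outcome are invented and have nothing to do with the coaching data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated (hypothetical) data with a binary predictor and a continuous one
rng = np.random.default_rng(0)
n = 2000
sim = pd.DataFrame({
    "group": rng.integers(0, 2, size=n),
    "experience": rng.normal(15, 5, size=n),
})

# Generate the outcome from known log-odds so we can check what the model recovers
log_odds = -2.0 + 1.0 * sim["group"] + 0.05 * (sim["experience"] - 15)
prob = 1 / (1 + np.exp(-log_odds.to_numpy()))
sim["hired"] = rng.binomial(1, prob)

# Fit a logistic regression with two predictors; disp=False silences the optimizer output
sim_results = smf.logit("hired ~ group + experience", data=sim).fit(disp=False)

# Exponentiated coefficients are odds ratios, each holding the other predictor fixed
sim_odds_ratios = np.exp(sim_results.params)
print(sim_odds_ratios)
```

The odds ratio for `group` should land somewhere near exp(1.0), about 2.7, since that is the value built into the simulation.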


Hmmm... that coefficient is on the log-odds (logit) scale. How do we make sense of that?

You can exponentiate a coefficient to convert it into an odds ratio:
https://investigate.ai/regression/logistic-regression-quickstart/#Converting-coefficient-to-odds-ratio
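
An odds ratio multiplies odds, not probabilities, which is easy to lose track of. The sketch below uses a made-up coefficient and a made-up baseline probability (both are assumptions chosen only for illustration) to show how an odds ratio of about 3 changes a probability.

```python
import numpy as np

# Hypothetical values, not estimates from the coaching data
coef = np.log(3)             # a log-odds coefficient that corresponds to an odds ratio of 3
baseline_probability = 0.15  # assumed chance of being hired in the comparison group

odds_ratio_from_coef = np.exp(coef)

# Convert probability to odds, apply the odds ratio, then convert back
baseline_odds = baseline_probability / (1 - baseline_probability)
new_odds = baseline_odds * odds_ratio_from_coef
new_probability = new_odds / (1 + new_odds)

print(f"odds ratio: {odds_ratio_from_coef:.2f}")
print(f"probability goes from {baseline_probability:.2f} to {new_probability:.2f}")
```

With these numbers, tripling the odds roughly doubles the probability rather than tripling it, so it is worth being careful about how "three times better chances" is phrased.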


In [None]:
coefs = pd.DataFrame({
    'coef': results.params.values,
    'odds ratio': np.exp(results.params.values),
    'pvalue': results.pvalues,
    'name': results.params.index
})

coefs

In [None]:
df['predicted_proba'] = results.predict()
# note that logistic regression gives you predicted probabilities, not predicted outcomes.
# Because the actual outcome is binary (0 or 1), residuals are much harder to interpret
# here than they are in a linear regression.
df.sort_values(by='predicted_proba', ascending=False)

# Discussion

What can we conclude based on our analysis so far?


What can't we conclude based on our analysis so far?


What questions does this analysis leave us with?
- reporting questions
- quant questions