# Introduction
In this notebook, we compare effect estimates from regression with and without conditioning on (controlling for) confounders.

Let's load in the dataset we created in the last notebook.

In [1]:
import pandas as pd
import statsmodels.formula.api as smf
url = '../data/sample_MGMT.csv'
sample = pd.read_csv(url)
sample.head()

Unnamed: 0,management,foundfam_owned,country,industry,comp_strength
0,3.0,0,United States,measurement,low
1,4.444445,0,United States,chemicals,high
2,2.666667,0,United States,fab_metal,high
3,4.388889,1,United States,electronic,high
4,4.833333,0,United States,machinery,high


# Regression without confounders
To perform regression without confounders, we simply fit a regression model which models the expected value of our outcome variable (`management`) as a linear function of our causal variable (`foundfam_owned`).

We can do that easily using Ordinary Least Squares (OLS) regression.

In [2]:
formula1 = "management ~ foundfam_owned"
ols1 = smf.ols(formula=formula1, data=sample).fit()
ols1.summary()

Dep. Variable:,management,R-squared:,0.081
Model:,OLS,Adj. R-squared:,0.081
Method:,Least Squares,F-statistic:,874.7
Date:,"Thu, 01 Dec 2022",Prob (F-statistic):,2.45e-184
Time:,17:10:25,Log-Likelihood:,-9404.4
No. Observations:,9953,AIC:,18810.0
Df Residuals:,9951,BIC:,18830.0
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,3.0500,0.008,371.995,0.000,3.034,3.066
foundfam_owned,-0.3738,0.013,-29.576,0.000,-0.399,-0.349

Omnibus:,7.97,Durbin-Watson:,1.788
Prob(Omnibus):,0.019,Jarque-Bera (JB):,7.152
Skew:,0.011,Prob(JB):,0.028
Kurtosis:,2.87,Cond. No.,2.47


**Question**: In the summary printed above, we are looking for the coefficient of `foundfam_owned`. Can you see it? What is its value and what does it mean?
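
As a hint for reading a coefficient on a 0/1 variable, the sketch below uses synthetic data (not the management dataset) and hypothetical variable names. It illustrates a general property of OLS: with an intercept and a single binary regressor, the slope equals the difference between the two group means, and the intercept equals the mean of the group coded 0.

```python
import numpy as np

# Synthetic data for illustration only; the values are assumptions,
# not taken from the management dataset.
rng = np.random.default_rng(0)
owned = rng.integers(0, 2, size=500)                       # 0/1 indicator
score = 3.0 - 0.4 * owned + rng.normal(0, 0.6, size=500)   # outcome

# OLS with an intercept and one binary regressor, solved by least squares.
X = np.column_stack([np.ones_like(owned, dtype=float), owned])
intercept, slope = np.linalg.lstsq(X, score, rcond=None)[0]

# Difference in group means, computed directly.
mean_diff = score[owned == 1].mean() - score[owned == 0].mean()

print(f"slope: {slope:.4f}, difference in means: {mean_diff:.4f}")
```

The two printed numbers agree up to floating point error, which is why the coefficient in a model like `management ~ foundfam_owned` can be read as a raw difference in average outcomes between the two groups.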

# Regression with confounders
To perform regression with confounders, we fit a regression model which models the expected value of our outcome variable (`management`) as a linear function of our causal variable (`foundfam_owned`) and our confounders `country`, `industry`, and `comp_strength`.

In [3]:
formula2 = "management ~ foundfam_owned + country + industry + comp_strength"
ols2 = smf.ols(formula=formula2, data=sample).fit()
ols2.summary()

Dep. Variable:,management,R-squared:,0.185
Model:,OLS,Adj. R-squared:,0.181
Method:,Least Squares,F-statistic:,49.95
Date:,"Thu, 01 Dec 2022",Prob (F-statistic):,0.0
Time:,17:10:25,Log-Likelihood:,-8805.4
No. Observations:,9952,AIC:,17700.0
Df Residuals:,9906,BIC:,18030.0
Df Model:,45,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,2.6362,0.044,60.052,0.000,2.550,2.722
country[T.Australia],0.2665,0.040,6.612,0.000,0.188,0.346
country[T.Brazil],0.0280,0.036,0.786,0.432,-0.042,0.098
country[T.Canada],0.4400,0.041,10.690,0.000,0.359,0.521
country[T.Chile],0.0530,0.041,1.290,0.197,-0.028,0.134
country[T.China],-0.0218,0.039,-0.562,0.574,-0.098,0.054
country[T.France],0.2454,0.040,6.197,0.000,0.168,0.323
country[T.Germany],0.4662,0.041,11.390,0.000,0.386,0.546
country[T.Great Britain],0.2526,0.036,7.115,0.000,0.183,0.322

Omnibus:,0.952,Durbin-Watson:,1.951
Prob(Omnibus):,0.621,Jarque-Bera (JB):,0.984
Skew:,-0.017,Prob(JB):,0.612
Kurtosis:,2.965,Cond. No.,32.0


**Question**: What is the coefficient of `foundfam_owned` when we include confounders? Is it larger or smaller than when we performed regression without confounders?

**Question**: Is our estimate of the effect getting smaller or larger as we include confounders? What does this tell us?
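
To build intuition for why an estimate can move when confounders are added, here is a minimal simulation on synthetic data. The variable names, the effect size (`true_effect`), and the probabilities are all assumptions chosen for illustration. A binary confounder `z` makes treatment less likely and raises the outcome, so leaving it out pushes the naive estimate away from the true effect.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
true_effect = -0.2  # assumed effect of treatment on the outcome

# Binary confounder: lowers the chance of treatment and raises the outcome.
z = rng.integers(0, 2, size=n).astype(float)
p_treat = np.where(z == 1, 0.2, 0.7)
d = (rng.random(n) < p_treat).astype(float)
y = 3.0 + true_effect * d + 0.5 * z + rng.normal(0, 0.5, size=n)

def ols_coefs(columns, outcome):
    """Least-squares coefficients for an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(outcome))] + columns)
    return np.linalg.lstsq(X, outcome, rcond=None)[0]

naive = ols_coefs([d], y)[1]        # confounder omitted
adjusted = ols_coefs([d, z], y)[1]  # confounder included

print(f"true: {true_effect:.3f}, naive: {naive:.3f}, adjusted: {adjusted:.3f}")
```

In this setup the naive estimate is more negative than the true effect, while the adjusted estimate lands close to it. The simulation only shows the mechanism; whether the same holds in the real data depends on having measured the right confounders.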

# Propensity Scoring
The change in our estimate of the effect of `foundfam_owned` on `management` when performing regression with and without confounders suggests that our causal variable does depend on our confounders `country`, `industry`, and `comp_strength`.

We could model this relationship using logistic regression and then use the model for [propensity scoring](https://en.wikipedia.org/wiki/Propensity_score_matching).

To do this, run the code below.

In [4]:
sample_x = sample[['foundfam_owned','country','industry','comp_strength']]
formula_pscore1 = 'foundfam_owned ~ country + industry + comp_strength'
log_reg_model = smf.logit(formula=formula_pscore1, data=sample_x)
log_reg = log_reg_model.fit()
log_reg.summary()

Optimization terminated successfully.
         Current function value: 0.598148
         Iterations 6


Dep. Variable:,foundfam_owned,No. Observations:,9952.0
Model:,Logit,Df Residuals:,9907.0
Method:,MLE,Df Model:,44.0
Date:,"Thu, 01 Dec 2022",Pseudo R-squ.:,0.1211
Time:,17:10:25,Log-Likelihood:,-5952.8
converged:,True,LL-Null:,-6772.9
Covariance Type:,nonrobust,LLR p-value:,0.0

,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,1.2180,0.160,7.595,0.000,0.904,1.532
country[T.Australia],-1.3658,0.150,-9.123,0.000,-1.659,-1.072
country[T.Brazil],0.3006,0.126,2.393,0.017,0.054,0.547
country[T.Canada],-1.0008,0.147,-6.821,0.000,-1.288,-0.713
country[T.Chile],-0.4186,0.143,-2.932,0.003,-0.698,-0.139
country[T.China],-1.6126,0.147,-11.001,0.000,-1.900,-1.325
country[T.France],-1.0052,0.142,-7.078,0.000,-1.284,-0.727
country[T.Germany],-0.6961,0.144,-4.828,0.000,-0.979,-0.414
country[T.Great Britain],-1.2035,0.127,-9.450,0.000,-1.453,-0.954
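
Logit coefficients are on the log-odds scale, measured relative to the omitted baseline category (the reference level that does not appear in the table). A small sketch of converting two of the coefficients shown above into odds ratios; the dictionary is just a hand-copied excerpt for illustration:

```python
import math

# Coefficients copied by hand from the logit summary above (log-odds,
# relative to the omitted baseline country).
log_odds = {"Brazil": 0.3006, "Germany": -0.6961}

odds_ratios = {country: math.exp(coef) for country, coef in log_odds.items()}

for country, ratio in odds_ratios.items():
    print(f"{country}: odds ratio vs. baseline = {ratio:.2f}")
```

An odds ratio above 1 means founder/family ownership is more likely than in the baseline country, holding the other variables fixed; below 1 means less likely.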


**Question**: In which industry are founder/family-owned firms least likely?

**Question**: In which country are founder/family-owned firms most likely?
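
Once a logistic model of the treatment is fitted, the propensity score is its predicted probability of treatment for each row. The notebook links to propensity score matching; the sketch below instead shows inverse probability weighting, a different and simpler way to use the same scores. It runs on synthetic data with hypothetical column names (`region`, `treated`, `outcome`), not on the management dataset.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data for illustration only (all names and values are assumptions).
rng = np.random.default_rng(2)
n = 5000
toy = pd.DataFrame({"region": rng.choice(["a", "b", "c"], size=n)})
p_true = toy["region"].map({"a": 0.2, "b": 0.5, "c": 0.8})
toy["treated"] = (rng.random(n) < p_true).astype(int)
region_shift = toy["region"].map({"a": 0.6, "b": 0.3, "c": 0.0})
toy["outcome"] = (3.0 - 0.2 * toy["treated"] + region_shift
                  + rng.normal(0, 0.5, size=n))

# Propensity score: predicted probability of treatment given the confounder.
ps_model = smf.logit("treated ~ region", data=toy).fit(disp=0)
toy["pscore"] = ps_model.predict(toy)

# Inverse probability weights: 1/p for treated rows, 1/(1-p) for untreated.
toy["ipw"] = np.where(toy["treated"] == 1,
                      1 / toy["pscore"],
                      1 / (1 - toy["pscore"]))

naive = smf.ols("outcome ~ treated", data=toy).fit().params["treated"]
weighted = smf.wls("outcome ~ treated", data=toy,
                   weights=toy["ipw"]).fit().params["treated"]

print(f"naive: {naive:.3f}, IPW-weighted: {weighted:.3f}")
```

The simulated effect is -0.2. The weighted estimate should sit near that value, while the naive one is pulled away by the confounder. Note that the standard errors reported by a weighted regression like this do not account for the estimated weights, so only the point estimate is compared here.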

# Exercise
Return to the original dataset and explore which other variables you could include in a regression model estimating the effect of `foundfam_owned` on `management`. Explore which variables reduce/increase the estimate of the effect and discuss whether you believe they are common causes of `foundfam_owned` and `management` or not.
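
One possible way to organise this exploration is to fit several specifications in a loop and compare the coefficient of interest side by side. The sketch below uses synthetic data and hypothetical column names (`firm_size`, `noise`); for the exercise you would swap in the real dataset and your own candidate variables.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data for illustration only (all names and values are assumptions).
rng = np.random.default_rng(3)
n = 4000
toy = pd.DataFrame({
    "firm_size": rng.choice(["small", "large"], size=n),  # a common cause
    "noise": rng.normal(size=n),                          # unrelated covariate
})
p_treat = np.where(toy["firm_size"] == "small", 0.7, 0.3)
toy["treated"] = (rng.random(n) < p_treat).astype(int)
toy["outcome"] = (3.0 - 0.2 * toy["treated"]
                  + np.where(toy["firm_size"] == "large", 0.5, 0.0)
                  + rng.normal(0, 0.5, size=n))

specs = [
    "outcome ~ treated",
    "outcome ~ treated + noise",
    "outcome ~ treated + firm_size",
]

estimates = {}
for formula in specs:
    fit = smf.ols(formula, data=toy).fit()
    lower, upper = fit.conf_int().loc["treated"]
    estimates[formula] = fit.params["treated"]
    print(f"{formula:32s} coef={fit.params['treated']:.3f} "
          f"95% CI=[{lower:.3f}, {upper:.3f}]")
```

Here adding the unrelated covariate barely moves the estimate, while adding the common cause shifts it noticeably. A shift alone does not prove a variable is a common cause, so the discussion part of the exercise still matters.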

**Note**: a full walkthrough of an analysis of the dataset used in these notebooks can be found in the [case study](https://gabors-data-analysis.com/casestudies/#ch21a-founderfamily-ownership-and-quality-of-management) in Chapter 21 of [Data analysis for business, economics, and policy](https://bris.on.worldcat.org/oclc/1250272914).

In [5]:
# (SOLUTION)
