# Introduction
In this notebook, we compare effect estimates from regression with and without controlling for (that is, conditioning on) confounders.

Let's load in the dataset we created in the last notebook.

In [None]:
import pandas as pd
import statsmodels.formula.api as smf
url = 'https://github.com/data-analytics-in-business/gabor-management-case-study/raw/main/data/sample_MGMT.csv'
sample = pd.read_csv(url)
sample.head()

# Regression without confounders
To perform regression without confounders, we simply fit a regression model that describes the expected value of our outcome variable (`management`) as a linear function of our causal variable (`foundfam_owned`).

We can do that easily using Ordinary Least Squares (OLS) regression.

In [None]:
formula1 = "management ~ foundfam_owned"
ols1 = smf.ols(formula=formula1, data=sample).fit()
ols1.summary()

**Question**: In the summary printed out, we are looking for the coefficient of `foundfam_owned`. Can you see it? What is its value, and what does it mean?
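
The coefficient can also be read from the fitted model directly rather than from the summary table. The sketch below shows one way to do this on a small synthetic dataset. The columns `treated` and `outcome` and all the numbers are made up for illustration and are not the management data. It also checks a useful fact for interpretation: with a single binary regressor, the OLS coefficient equals the difference in mean outcome between the two groups.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (NOT the management dataset): a 0/1 "treated"
# flag and an outcome that is lower on average for treated units.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"treated": rng.integers(0, 2, size=500)})
toy["outcome"] = 3.0 - 0.4 * toy["treated"] + rng.normal(0, 0.5, size=500)

fit = smf.ols("outcome ~ treated", data=toy).fit()

# Point estimate and 95% confidence interval for the coefficient of interest.
coef = fit.params["treated"]
ci_low, ci_high = fit.conf_int().loc["treated"]

# With one binary regressor, the slope is the raw difference in group means.
group_means = toy.groupby("treated")["outcome"].mean()
diff_in_means = group_means.loc[1] - group_means.loc[0]

print(f"coefficient: {coef:.3f} (95% CI {ci_low:.3f} to {ci_high:.3f})")
print(f"difference in group means: {diff_in_means:.3f}")
```

Applied to the notebook's model, the same attributes would be `ols1.params` and `ols1.conf_int()`, assuming `foundfam_owned` is stored as a numeric 0/1 column so that the parameter carries the column's name.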

# Regression with confounders
To perform regression with confounders, we fit a regression model that describes the expected value of our outcome variable (`management`) as a linear function of our causal variable (`foundfam_owned`) and our confounders `country`, `industry`, and `comp_strength`.

In [None]:
formula2 = "management ~ foundfam_owned + country + industry + comp_strength"
ols2 = smf.ols(formula=formula2, data=sample).fit()
ols2.summary()

**Question**: What is the coefficient of `foundfam_owned` when we include confounders? Is it larger or smaller than when we performed regression without confounders?

**Question**: Is our estimate of the effect getting smaller or larger as we include confounders? What does this tell us?
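
To see why adding a confounder can move the estimate, it can help to simulate data where the true effect is known. The sketch below is a toy simulation, not the management data. The variable `z`, the value `TRUE_EFFECT`, and all other numbers are assumptions chosen for illustration. Here `z` makes treatment more likely and lowers the outcome, so the regression that omits `z` attributes part of its influence to the treatment.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 5000
TRUE_EFFECT = -0.2  # effect built into the simulation (illustrative value)

# A single confounder z raises the chance of treatment and lowers the outcome.
z = rng.normal(size=n)
p_treat = 1 / (1 + np.exp(-1.5 * z))
treated = rng.binomial(1, p_treat)
outcome = 3.0 + TRUE_EFFECT * treated - 0.5 * z + rng.normal(0, 0.5, size=n)
sim = pd.DataFrame({"outcome": outcome, "treated": treated, "z": z})

# Same two model shapes as in the notebook: without and with the confounder.
naive = smf.ols("outcome ~ treated", data=sim).fit().params["treated"]
adjusted = smf.ols("outcome ~ treated + z", data=sim).fit().params["treated"]

print(f"true effect:        {TRUE_EFFECT:.3f}")
print(f"without confounder: {naive:.3f}")
print(f"with confounder:    {adjusted:.3f}")
```

In this simulation the adjusted estimate lands close to the built-in effect because `z` is the only confounder and it is observed. With real data we cannot check this, since unobserved confounders may remain.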

# Exercise
Experiment with different regression models to identify which of the confounders `country`, `industry`, or `comp_strength` leads to the biggest change in our estimate of the effect of `foundfam_owned` on `management`.

In [None]:
# (SOLUTION)
formula2 = "management ~ foundfam_owned + country"
#formula2 = "management ~ foundfam_owned + industry"
#formula2 = "management ~ foundfam_owned + comp_strength"
ols2 = smf.ols(formula=formula2, data=sample).fit()
ols2.summary()
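
Instead of editing the formula by hand, the comparison can be automated. The sketch below defines a hypothetical helper, `coefficient_shifts` (not part of statsmodels), that adds each candidate confounder on its own and reports how far the treatment coefficient moves from the unadjusted estimate. It is demonstrated on synthetic data with made-up columns `strong` and `weak`, and it assumes the treatment is a numeric 0/1 column so that its parameter name matches the column name.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def coefficient_shifts(data, outcome, treatment, confounders):
    """Change in the treatment coefficient when each confounder is added alone.

    Hypothetical helper for illustration; results are ordered by absolute
    size of the change, largest first.
    """
    base = smf.ols(f"{outcome} ~ {treatment}", data=data).fit().params[treatment]
    shifts = {}
    for name in confounders:
        fit = smf.ols(f"{outcome} ~ {treatment} + {name}", data=data).fit()
        shifts[name] = fit.params[treatment] - base
    result = pd.Series(shifts)
    order = result.abs().sort_values(ascending=False).index
    return result.reindex(order)

# Synthetic data: "strong" drives both treatment and outcome,
# "weak" affects the outcome only.
rng = np.random.default_rng(2)
n = 4000
strong = rng.normal(size=n)
weak = rng.normal(size=n)
treated = rng.binomial(1, 1 / (1 + np.exp(-1.5 * strong)))
y = 3.0 - 0.2 * treated - 0.5 * strong + 0.5 * weak + rng.normal(0, 0.5, size=n)
toy = pd.DataFrame({"y": y, "treated": treated, "strong": strong, "weak": weak})

shifts = coefficient_shifts(toy, "y", "treated", ["strong", "weak"])
print(shifts)
```

On the notebook's data, a call along the lines of `coefficient_shifts(sample, "management", "foundfam_owned", ["country", "industry", "comp_strength"])` should give the same comparison as the three formulas above, under the 0/1 assumption stated earlier.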

# Propensity Scoring
The change in our estimate of the effect of `foundfam_owned` on `management` between the regressions with and without confounders suggests that our causal variable is, indeed, related to our confounders `country`, `industry`, and `comp_strength`.

We could model this relationship using logistic regression and then use the model for [propensity scoring](https://en.wikipedia.org/wiki/Propensity_score_matching).

To do this, run the code below.

In [None]:
sample_x = sample[['foundfam_owned','country','industry','comp_strength']]
formula_pscore1 = 'foundfam_owned ~ country + industry + comp_strength'
log_reg_model = smf.logit(formula=formula_pscore1, data=sample_x)
log_reg = log_reg_model.fit()
log_reg.summary()

**Question**: In which industry are founder/family-owned firms least likely?

**Question**: In which country are founder/family-owned firms most likely?
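
Once a logistic regression like this is fitted, the propensity score for each row is simply the predicted probability of treatment. The sketch below shows this step on synthetic data. The columns `region`, `size_score`, and `treated`, and all coefficients, are made up for illustration and are not the management data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data with one categorical and one numeric covariate.
rng = np.random.default_rng(3)
n = 2000
toy = pd.DataFrame({
    "region": rng.choice(["north", "south", "west"], size=n),
    "size_score": rng.normal(size=n),
})

# Treatment is more likely in the "south" group and for larger size_score.
linpred = -0.5 + 1.0 * (toy["region"] == "south") + 0.8 * toy["size_score"]
p_treat = (1 / (1 + np.exp(-linpred))).to_numpy()
toy["treated"] = rng.binomial(1, p_treat)

# Fit the treatment model; disp=0 silences the optimizer's progress output.
logit_fit = smf.logit("treated ~ region + size_score", data=toy).fit(disp=0)

# The propensity score is the predicted probability of being treated.
toy["pscore"] = logit_fit.predict(toy)
mean_by_group = toy.groupby("treated")["pscore"].mean()
print(mean_by_group)
```

When reading the summary of a model with categorical covariates, note that each category coefficient is a log-odds difference relative to the omitted reference level, which by default in the formula interface is the first level in sorted order.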

# Exercise
Return to the original dataset and explore which other variables you could include in a regression model estimating the effect of `foundfam_owned` on `management`. Explore which variables reduce/increase the estimate of the effect and discuss whether you believe they are common causes of `foundfam_owned` and `management` or not.

**Note**: A full walkthrough of an analysis of the dataset used in these notebooks can be found in the [case study](https://gabors-data-analysis.com/casestudies/#ch21a-founderfamily-ownership-and-quality-of-management) in Chapter 21 of [Data analysis for business, economics, and policy](https://bris.on.worldcat.org/oclc/1250272914).

In [None]:
# (SOLUTION)
