<center> <h1>Workshop: IV</h1> </center> 
<center> <h2>Application: IV estimation of demand function</h2> </center> 

In [8]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import stats
from statsmodels.api import add_constant
# Import IV2SLS from the linearmodels package
from linearmodels.iv import IV2SLS


### 0. Setup
This is an example of demand estimation from Stock and Watson’s textbook. We want to estimate the demand for cigarettes using a dataset containing sales, prices and a number of other variables measured in different U.S. states. 
You can download the dataset from Canvas.
If you are using R, the dataset is contained in the AER package so you don’t need to download it separately.


**Description:** A data frame covering 48 U.S. states in 2 periods (1985 and 1995), with the 9 variables listed below.

**state:**
Factor indicating state.

**year:**
Factor indicating year.

**cpi:**
Consumer price index.

**population:**
State population.

**packs:**
Number of packs per capita.

**income:**
State personal income (total, nominal).

**tax:**
Average state, federal and average local excise taxes for fiscal year.

**price:**
Average price during fiscal year, including sales tax.

**taxs:**
Average excise taxes for fiscal year, including sales tax.




In [2]:
cigs_all = pd.read_csv('cigarette.csv')
cigs_all.head()

Unnamed: 0,state,year,cpi,population,packs,income,tax,price,taxs
0,AL,1985,1.076,3973000.0,116.486282,46014968,32.500004,102.181671,33.348335
1,AR,1985,1.076,2327000.0,128.534592,26210736,37.0,101.474998,37.0
2,AZ,1985,1.076,3184000.0,104.522614,43956936,31.0,108.578751,36.170418
3,CA,1985,1.076,26444000.0,100.363037,447102816,26.0,107.837341,32.104
4,CO,1985,1.076,3209000.0,112.963539,49466672,31.0,94.266663,31.0


In [3]:
# Following S&W, we use just 1995 data 
cigs = cigs_all[cigs_all.year==1995].copy()

### Create new variables

# deflate prices and income by CPI to get real values

cigs["rprice"] = cigs.price / cigs.cpi
cigs["rincome"] = (cigs.income/cigs.population)/cigs.cpi 

# log values are used in the regressions (note that you could create these on the fly within the regression commands)

cigs["lprice"]= np.log(cigs.rprice)
cigs["lquant"] = np.log(cigs.packs)
cigs["lincome"] = np.log(cigs.rincome)

# tdiff = the real tax on cigarettes arising just from general sales tax, which we will use as an instrument 
cigs["tdiff"]= (cigs.taxs - cigs.tax)/cigs.cpi


In [4]:
cigs.head()

Unnamed: 0,state,year,cpi,population,packs,income,tax,price,taxs,rprice,rincome,lprice,lquant,lincome,tdiff
48,AL,1995,1.524,4262731.0,101.085434,83903280,40.500004,158.371338,41.904671,103.918206,12.915347,4.643604,4.615966,2.558416,0.921697
49,AR,1995,1.524,2480121.0,111.042969,45995496,55.5,175.542511,63.859169,115.18538,12.169073,4.746543,4.709917,2.498898,5.485019
50,AZ,1995,1.524,4306908.0,71.95417,88870496,65.333328,198.607498,74.790825,130.319887,13.539638,4.869992,4.276029,2.605622,6.205707
51,CA,1995,1.524,31493524.0,56.859306,771470144,61.0,210.504669,74.771332,138.12643,16.073591,4.928169,4.04058,2.777178,9.036307
52,CO,1995,1.524,3738061.0,82.582924,92946544,44.0,167.350006,44.0,109.80972,16.315557,4.698749,4.413803,2.792119,0.0
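
As a quick sanity check on the variable construction, the derived columns can be recomputed by hand for a single row. The sketch below uses the raw values from the AL 1995 row printed above. The variable names are only illustrative, and the results should agree with the table up to rounding.

```python
import math

# Raw values copied from the AL 1995 row shown above
cpi, population, packs = 1.524, 4262731.0, 101.085434
income, tax, price, taxs = 83903280, 40.500004, 158.371338, 41.904671

rprice = price / cpi                    # real price
rincome = (income / population) / cpi   # real income per capita
tdiff = (taxs - tax) / cpi              # real tax arising from the general sales tax

al_row = {
    "rprice": rprice,
    "rincome": rincome,
    "lprice": math.log(rprice),
    "lquant": math.log(packs),
    "lincome": math.log(rincome),
    "tdiff": tdiff,
}
for name, value in al_row.items():
    print(f"{name}: {value:.4f}")
```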


### 1. Naive Regression

**Q:** First run a regression of (log) packs of cigarettes on (log) price and comment on the estimate of the price elasticity you obtained.

In [9]:
# OLS regression
model = smf.ols('lquant ~ lprice', data=cigs).fit()
print(model.summary())


                            OLS Regression Results                            
Dep. Variable:                 lquant   R-squared:                       0.406
Model:                            OLS   Adj. R-squared:                  0.393
Method:                 Least Squares   F-statistic:                     31.41
Date:                Fri, 23 Feb 2024   Prob (F-statistic):           1.13e-06
Time:                        11:16:01   Log-Likelihood:                 12.724
No. Observations:                  48   AIC:                            -21.45
Df Residuals:                      46   BIC:                            -17.71
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     10.3389      1.035      9.986      0.0

**A:** I obtained a price elasticity of -1.2 for cigarettes.
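
The slope can be read as an elasticity because both variables enter in logs. In the notation of the regression above (the symbols $\beta_0$, $\beta_1$, $u_i$ are introduced here only for illustration):

$$
\text{lquant}_i = \beta_0 + \beta_1\,\text{lprice}_i + u_i,
\qquad
\beta_1 = \frac{\partial \ln Q}{\partial \ln P} \approx \frac{\%\Delta Q}{\%\Delta P}
$$

With $\hat\beta_1 \approx -1.2$, a 1% higher real price is associated with roughly 1.2% fewer packs per capita. This is an association only: price and quantity are determined jointly, so the OLS slope need not equal the demand elasticity.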

### 2. First- and second-stage regressions for IV estimation

**Q:** Explain why sales tax (tdiff) could be a valid instrument.

**A:**

**Q:** Using the sales tax (tdiff) as an instrument, start by running the first-stage regression for IV estimation. Comment on the results. Also generate the predicted values that you need for the second stage regression.

In [10]:
# First stage: regress the endogenous regressor (lprice) on the instrument (tdiff)
first_stage = smf.ols('lprice ~ tdiff', data=cigs).fit()
print(first_stage.summary())

# Predicted values of lprice, needed for the manual second-stage regression
cigs["lprice_hat"] = first_stage.fittedvalues

                            OLS Regression Results                            
Dep. Variable:                 lprice   R-squared:                       0.471
Model:                            OLS   Adj. R-squared:                  0.459
Method:                 Least Squares   F-statistic:                     40.96
Date:                Fri, 23 Feb 2024   Prob (F-statistic):           7.27e-08
Time:                        11:16:23   Log-Likelihood:                 46.435
No. Observations:                  48   AIC:                            -88.87
Df Residuals:                      46   BIC:                            -85.13
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      4.6165      0.029    158.601      0.0

**A:** 
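
To make the first-stage mechanics concrete, here is a minimal sketch on simulated data (not the cigarette data; `z` and `x` are hypothetical stand-ins for `tdiff` and `lprice`). It computes the fitted values and the first-stage F-statistic by hand. With a single instrument the F-statistic is the squared t-statistic on the instrument, and a common rule of thumb treats F above 10 as evidence that the instrument is not weak. The first-stage summary above reports F = 40.96.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 48  # same size as the 1995 cross-section, but the data below are simulated

z = rng.normal(size=n)             # hypothetical instrument
x = 0.5 * z + rng.normal(size=n)   # hypothetical endogenous regressor

Z = np.column_stack([np.ones(n), z])              # constant + instrument
pi_hat, *_ = np.linalg.lstsq(Z, x, rcond=None)    # first-stage coefficients
x_hat = Z @ pi_hat                                # fitted values for a manual second stage
resid = x - x_hat

# Homoskedastic standard error of the slope, then t and F
sigma2 = resid @ resid / (n - Z.shape[1])
se_slope = np.sqrt(sigma2 * np.linalg.inv(Z.T @ Z)[1, 1])
t_stat = pi_hat[1] / se_slope
f_stat = t_stat ** 2   # one instrument: F equals t squared

print(f"first-stage slope: {pi_hat[1]:.3f}, F: {f_stat:.1f}")
```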

**Q:** Now run the second-stage regression for IV estimation. Compare your estimate of the demand elasticity to the one you obtained from the naive regression in part 1.

In [15]:
# Build the exogenous regressors from a fresh selection instead of mutating cigs:
# re-running add_constant(cigs, has_constant="add") appends another 'const' column on every run,
# which is the likely source of the "duplicate column names" ValueError from IV2SLS.
exog = add_constant(cigs[['lincome']], has_constant="add")

# 2SLS with lincome as a control (as in part 3.2): lprice is endogenous, tdiff is the instrument
iv = IV2SLS(dependent=cigs.lquant, exog=exog, endog=cigs.lprice, instruments=cigs.tdiff).fit()
print(iv)

**A:** 
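
The two-step procedure can be sketched on simulated data where the true slope is known. This is an illustration under assumed parameters, not the cigarette data. The regressor `x` is built to be correlated with the error `u`, so OLS is biased, while the second-stage slope recovers the true value more closely. With one instrument the second-stage slope equals the ratio cov(z, y) / cov(z, x).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

z = rng.normal(size=n)                        # instrument: independent of u
u = rng.normal(size=n)                        # structural error
x = 0.8 * z + 0.6 * u + rng.normal(size=n)    # endogenous regressor (depends on u)
y = 1.0 - 1.0 * x + u                         # true slope is -1

def ols(X, y):
    """Least-squares coefficients for y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

const = np.ones(n)
Z = np.column_stack([const, z])

naive = ols(np.column_stack([const, x]), y)        # biased OLS
x_hat = Z @ ols(Z, x)                              # first stage: fitted x
second = ols(np.column_stack([const, x_hat]), y)   # second stage: y on fitted x

# IV ratio estimator; identical to the second-stage slope with one instrument
iv_ratio = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

print(f"OLS slope: {naive[1]:.3f}")
print(f"second-stage slope: {second[1]:.3f}, IV ratio: {iv_ratio:.3f}")
```

The standard errors printed by a manual second-stage OLS are not valid, because they are computed from residuals based on the fitted values rather than the actual regressor.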

### 3. TSLS estimation

#### 3.1 Simple TSLS
**Q:** Now run the TSLS estimation using the IV2SLS command from "linearmodels" and compare with the results in part 2.

**A:** 
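
For the comparison with part 2, it can help to see what a one-shot TSLS routine computes. The function `tsls` below is a hypothetical helper written for this sketch, not the `linearmodels` implementation. It uses the textbook formula with homoskedastic standard errors and an n - k degrees-of-freedom correction, which is an assumption and may differ from a package's defaults. The simulated example shows that the coefficients match the manual two-step procedure while the standard errors do not.

```python
import numpy as np

def tsls(y, X, Z):
    """Textbook 2SLS. X: regressors incl. constant; Z: instruments incl. constant
    and any exogenous regressors. Returns coefficients, standard errors, fitted X."""
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]    # first stage, column by column
    beta = np.linalg.solve(X_hat.T @ X, X_hat.T @ y)
    resid = y - X @ beta                                # residuals use the actual X
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X_hat.T @ X_hat)))
    return beta, se, X_hat

# Simulated data with a known slope of -1 (illustrative only)
rng = np.random.default_rng(3)
n = 500
z = rng.normal(size=n)
u = rng.normal(size=n)
x = 0.8 * z + 0.6 * u + rng.normal(size=n)
y = 1.0 - 1.0 * x + u

const = np.ones(n)
X = np.column_stack([const, x])
Z = np.column_stack([const, z])

beta, se, X_hat = tsls(y, X, Z)

# Manual second stage: same coefficients, but residuals are built from X_hat
beta_manual = np.linalg.lstsq(X_hat, y, rcond=None)[0]
resid_manual = y - X_hat @ beta_manual
sigma2_manual = resid_manual @ resid_manual / (n - X.shape[1])
se_manual = np.sqrt(np.diag(sigma2_manual * np.linalg.inv(X_hat.T @ X_hat)))

print("TSLS coefficients:  ", np.round(beta, 3), "SE:", np.round(se, 3))
print("manual second stage:", np.round(beta_manual, 3), "SE:", np.round(se_manual, 3))
```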

#### 3.2  More Sources of Bias

**Q:**  Let's revisit the exogeneity assumption. Do you think income ("lincome") is an omitted variable that could be affecting the validity of our instrument? Why? 

**A:**

**Q:** Now include log income (the variable lincome you generated in the beginning) in the regression as a control variable. Comment on the results.

**A:** 
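
When a control is added, it enters both stages: it is a regressor in the demand equation and it is also included in the first stage alongside the instrument. Written out with the variable names used in this notebook (the coefficient symbols are introduced here for illustration):

$$
\begin{aligned}
\text{First stage:}\quad & \text{lprice}_i = \pi_0 + \pi_1\,\text{tdiff}_i + \pi_2\,\text{lincome}_i + v_i \\
\text{Second stage:}\quad & \text{lquant}_i = \beta_0 + \beta_1\,\widehat{\text{lprice}}_i + \beta_2\,\text{lincome}_i + u_i
\end{aligned}
$$

Here $\beta_1$ is the price elasticity and $\beta_2$ the income elasticity. Instrument exogeneity is now required only conditional on income.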

#### 3.3 More Instruments

**Q:** Re-run the model in 3.2, adding a second instrument alongside tdiff: the real excise tax on cigarettes, tax/cpi. Comment on the estimates of both the price elasticity and the income elasticity of cigarette demand.

**A:** 

#### 3.4 Instrument Validity
**Q:** Assess the validity of the instruments using:

- iv3.first_stage
- iv3.sargan


(In R, add diagnostics=TRUE to the summary command for the regression in part 3.3.)

**A:**
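
To see what an overidentification test measures, the sketch below computes a textbook Sargan statistic by hand on simulated data with two valid instruments, one endogenous regressor and one exogenous control. All names and parameters are illustrative. The 2SLS residuals are regressed on all instruments and J = n · R² is compared with a chi-squared distribution with (number of instruments minus number of endogenous regressors) degrees of freedom, which is 1 here. Whether a package's reported statistic uses exactly this form is an assumption.

```python
import math
import numpy as np

rng = np.random.default_rng(2)
n = 500

z1, z2, u = rng.normal(size=(3, n))   # two valid instruments and the structural error
w = rng.normal(size=n)                # exogenous control
x = 0.7 * z1 + 0.5 * z2 + 0.3 * w + 0.6 * u + rng.normal(size=n)   # endogenous regressor
y = 1.0 - 1.0 * x + 0.5 * w + u

const = np.ones(n)
X = np.column_stack([const, w, x])        # regressors
Z = np.column_stack([const, w, z1, z2])   # exogenous regressors + excluded instruments

# 2SLS coefficients and residuals (residuals use the actual X)
X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
beta = np.linalg.solve(X_hat.T @ X, X_hat.T @ y)
resid = y - X @ beta

# Sargan statistic: n * R^2 from regressing the residuals on all instruments
fitted = Z @ np.linalg.lstsq(Z, resid, rcond=None)[0]
ss_res = (resid - fitted) @ (resid - fitted)
ss_tot = (resid - resid.mean()) @ (resid - resid.mean())
j_stat = max(n * (1 - ss_res / ss_tot), 0.0)

df = Z.shape[1] - X.shape[1]                  # overidentifying restrictions: 1
p_value = math.erfc(math.sqrt(j_stat / 2))    # chi-squared(1) upper tail; valid only for df = 1

print(f"J = {j_stat:.3f}, df = {df}, p-value = {p_value:.3f}")
```

A small p-value would indicate that at least one instrument is correlated with the error. A large p-value does not prove validity, and the test is unavailable when the model is exactly identified, as in part 3.2.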

**Bonus Question:** You are thinking about using the "distance of the state from cigarette manufacturing plants" as an instrument. Do you think this would be a weak or strong instrument? Why?

**A:**