<center> <h1>Workshop: Panel Data</h1> </center> 
<center> <h2>Application: Lobbying Revenue</h2> </center> 

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import stats
from linearmodels import PanelOLS

### 0. Setting
The questions below are based on the academic paper “Revolving Door Lobbyists” (American Economic Review, 2012), which uses a panel data approach to study how lobbyists’ revenue is affected by the number of past connections they have to senators. One important piece of information is that lobbyists in this dataset can lose connections over time but do not gain new ones.


**Q:** What is the possible endogeneity problem here, and how can panel data help overcome it?

**A:** Lobbyists with connections may differ from lobbyists without connections in ways that also directly affect their revenue, so a simple comparison confounds the effect of connections with these differences. Lobbyist quality, for example, is an omitted variable that would bias the estimate. Panel data helps because, to the extent that quality does not vary over time, it is absorbed by a lobbyist fixed effect.
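
The argument can be written out as a short derivation. The notation ($y_{it}$ for log revenue, $x_{it}$ for the number of connections, $\alpha_i$ for time-invariant lobbyist quality) is introduced here for illustration and is not taken from the paper. Subtracting each lobbyist's time average removes $\alpha_i$, so the slope is identified from within-lobbyist changes only:

```latex
\begin{aligned}
y_{it} &= \beta x_{it} + \alpha_i + \varepsilon_{it} \\
\bar{y}_i &= \beta \bar{x}_i + \alpha_i + \bar{\varepsilon}_i \\
y_{it} - \bar{y}_i &= \beta\,(x_{it} - \bar{x}_i) + (\varepsilon_{it} - \bar{\varepsilon}_i)
\end{aligned}
```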

### 1. Look at the data

**Q:** Add code to the chunk below that loads the dataset lobby_data.csv and summarizes the variables. Make sure you understand what each variable refers to. Note that the panel is unbalanced (this should be clear from eyeballing the data frame), but we assume that the data are missing at random.

In [2]:
lobby_data = pd.read_csv('lobby_data.csv')
lobby_data.head()

| | lobbyist_id | semester_id | log_revenue | democrat_dummy | senate_connection_num |
|---|---|---|---|---|---|
| 0 | 1 | 13 | 12.367341 | 0 | 0 |
| 1 | 1 | 14 | 11.0021 | 0 | 0 |
| 2 | 1 | 15 | 14.311274 | 0 | 0 |
| 3 | 1 | 16 | 14.136506 | 0 | 0 |
| 4 | 1 | 17 | 14.188266 | 0 | 0 |


In [3]:
lobby_data.describe()

| | lobbyist_id | semester_id | log_revenue | democrat_dummy | senate_connection_num |
|---|---|---|---|---|---|
| count | 10418.0 | 10418.0 | 10418.0 | 10418.0 | 10418.0 |
| mean | 879.716836 | 14.041659 | 12.899352 | 0.451718 | 0.458053 |
| std | 546.371045 | 5.681964 | 1.356998 | 0.497687 | 0.553222 |
| min | 1.0 | 1.0 | 4.094345 | 0.0 | 0.0 |
| 25% | 398.0 | 10.0 | 11.982929 | 0.0 | 0.0 |
| 50% | 862.0 | 15.0 | 12.94801 | 0.0 | 0.0 |
| 75% | 1349.0 | 19.0 | 13.864301 | 1.0 | 1.0 |
| max | 1828.0 | 22.0 | 16.439455 | 1.0 | 2.0 |


### 2. Regressions


#### 2.1 Standard regression
**Q:** Add code that runs a standard regression of the (log) revenue on the number of senator connections. Interpret the slope coefficient (Hint: what is the interpretation when the Y variable is a log transformation?)

In [4]:
reg_no_controls  = smf.ols(formula = 'log_revenue ~ senate_connection_num', data = lobby_data)
result = reg_no_controls.fit()
print(result.summary())

                            OLS Regression Results                            
Dep. Variable:            log_revenue   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                 -0.000
Method:                 Least Squares   F-statistic:                   0.02072
Date:                Sun, 04 Feb 2024   Prob (F-statistic):              0.886
Time:                        12:45:33   Log-Likelihood:                -17962.
No. Observations:               10418   AIC:                         3.593e+04
Df Residuals:                   10416   BIC:                         3.594e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
Intercept                12.89

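**A:** An additional senator connection is associated with about 0.3% higher revenue (when Y is log-transformed, the slope coefficient multiplied by 100 is approximately the percentage change in Y for a one-unit change in X). The estimate is not statistically significant: the p-value of the F-test is 0.886.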
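
The log interpretation follows from a short calculation, sketched here with generic symbols ($\beta_0$, $\beta_1$, $u$) that are assumptions for illustration. Adding one connection raises log revenue by $\beta_1$, which corresponds to multiplying revenue by $e^{\beta_1}$:

```latex
\begin{aligned}
\log(\text{revenue}) &= \beta_0 + \beta_1\,\text{connections} + u \\
\frac{\text{revenue}_{\text{new}}}{\text{revenue}_{\text{old}}} &= e^{\beta_1}
  \quad \text{(one extra connection, holding } u \text{ fixed)} \\
\%\Delta\,\text{revenue} &= 100\,(e^{\beta_1} - 1) \approx 100\,\beta_1
  \quad \text{for small } \beta_1
\end{aligned}
```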

#### 2.2 Panel regressions

Run a panel regression with lobbyist fixed effects only, and a two-way fixed effects regression that also includes time fixed effects (semesters in this data).

Also do the following:

- For these two regressions, implement the "entity-demeaned" OLS algorithm. Do you obtain the same coefficient?
- From "linearmodels" import "PanelOLS", and implement both regressions using PanelOLS. You may want to check [this documentation](https://bashtage.github.io/linearmodels/panel/panel/linearmodels.panel.model.PanelOLS.html#linearmodels.panel.model.PanelOLS)
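
As a starting point for the "entity-demeaned" algorithm, here is a minimal sketch on a synthetic toy panel (not the lobbying data). The column names `unit`, `x`, `y` and all parameter values are made up for illustration. It subtracts each unit's mean from the outcome and the regressor and then computes the OLS slope on the demeaned variables:

```python
import numpy as np
import pandas as pd

# Synthetic toy panel (NOT the lobbying data): 50 units observed for 6 periods.
rng = np.random.default_rng(0)
n_units, n_periods, true_beta = 50, 6, 0.5
unit = np.repeat(np.arange(n_units), n_periods)
alpha = rng.normal(size=n_units)                  # unit fixed effects
x = alpha[unit] + rng.normal(size=unit.size)      # regressor correlated with alpha
y = true_beta * x + 2.0 * alpha[unit] + 0.1 * rng.normal(size=unit.size)
toy = pd.DataFrame({"unit": unit, "x": x, "y": y})

# Pooled OLS slope: biased here, because alpha is omitted and correlated with x.
xc = toy["x"] - toy["x"].mean()
yc = toy["y"] - toy["y"].mean()
beta_pooled = float((xc * yc).sum() / (xc ** 2).sum())

# Entity-demeaned ("within") slope: subtract each unit's own mean first.
x_w = toy["x"] - toy.groupby("unit")["x"].transform("mean")
y_w = toy["y"] - toy.groupby("unit")["y"].transform("mean")
beta_within = float((x_w * y_w).sum() / (x_w ** 2).sum())

print("pooled:", round(beta_pooled, 3), "within:", round(beta_within, 3))
```

The within slope should land close to the true value of 0.5, while the pooled slope is pushed upwards by the omitted unit effect. For the two-way regression on an unbalanced panel, subtracting lobbyist means and semester means in one pass is not exact; one option is to demean the outcome, the regressor and a set of semester dummies by lobbyist, and then run OLS on the demeaned variables.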

In [5]:
lobby_data = lobby_data.set_index(["lobbyist_id", "semester_id"])

In [13]:
exog_vars = ["senate_connection_num"]
exog = sm.add_constant(lobby_data[exog_vars])
mod = PanelOLS(lobby_data.log_revenue, exog, entity_effects=True, time_effects=False)
reg_lobbyist_FEs = mod.fit(cov_type='clustered', cluster_entity=True, cluster_time=False)
#reg_lobbyist_FEs = mod.fit()  # alternative: unclustered standard errors, to compare with the clustered ones; they should be similar to those from a dummy-variable regression
print(reg_lobbyist_FEs)

                          PanelOLS Estimation Summary                           
Dep. Variable:            log_revenue   R-squared:                     1.015e-05
Estimator:                   PanelOLS   R-squared (Between):             -0.0165
No. Observations:               10418   R-squared (Within):            1.015e-05
Date:                Fri, Feb 17 2023   R-squared (Overall):          -4.747e-05
Time:                        17:59:50   Log-likelihood                -1.154e+04
Cov. Estimator:             Clustered                                           
                                        F-statistic:                      0.0944
Entities:                        1113   P-value                           0.7586
Avg Obs:                       9.3603   Distribution:                  F(1,9304)
Min Obs:                       3.0000                                           
Max Obs:                       22.000   F-statistic (robust):             0.0227
                            

In [7]:
exog_vars = ["senate_connection_num"]
exog = sm.add_constant(lobby_data[exog_vars])
mod = PanelOLS(lobby_data.log_revenue, exog, entity_effects=True, time_effects=True)
reg_twoway = mod.fit(cov_type='clustered', cluster_entity=True, cluster_time=True)
#reg_twoway = mod.fit()  # alternative: unclustered standard errors, to compare with the clustered ones; they should be similar to those from a dummy-variable regression
print(reg_twoway)

                          PanelOLS Estimation Summary                           
Dep. Variable:            log_revenue   R-squared:                        0.0023
Estimator:                   PanelOLS   R-squared (Between):             -0.0205
No. Observations:               10418   R-squared (Within):              -0.0025
Date:                Sun, Feb 04 2024   R-squared (Overall):             -0.0068
Time:                        12:46:27   Log-likelihood                -1.103e+04
Cov. Estimator:             Clustered                                           
                                        F-statistic:                      21.712
Entities:                        1113   P-value                           0.0000
Avg Obs:                       9.3603   Distribution:                  F(1,9283)
Min Obs:                       3.0000                                           
Max Obs:                       22.000   F-statistic (robust):             4.4868
                            

**Q:** Compare the senate connection coefficient across the three regression models you have estimated (standard regression, one-way fixed effects and two-way fixed effects) and explain the direction of its change.

**A:** The coefficient decreases (becomes negative) when including lobbyist fixed effects. This could be because lobbyists with connections are generally better lobbyists (independent of their connections) and hence a regression without lobbyist fixed effects will overstate the effect of connections.

When we also include time fixed effects we obtain a larger estimate (becomes positive). This could be due to the fact that time is an omitted variable: since revenue increases over time (lobbyists gain experience) and in this data connections decrease over time (lobbyists lose connections over time but do not gain additional ones), omitting time fixed effects leads to an understatement of the effect.
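
The omitted-time-effect argument can be illustrated with a small simulation. This is a sketch on synthetic data under stated assumptions, not a replication: the panel is balanced (so simple double-demeaning is exact), every unit starts with one connection and can only lose it, revenue trends upward by 0.1 per period, and the true effect is set to 0.2.

```python
import numpy as np
import pandas as pd

# Synthetic balanced panel; all names and parameter values are hypothetical.
rng = np.random.default_rng(1)
n_units, n_periods, true_beta = 200, 10, 0.2
unit = np.repeat(np.arange(n_units), n_periods)
period = np.tile(np.arange(n_periods), n_units)

# Each unit holds one connection until a random period, then loses it for good.
loss_period = rng.integers(1, n_periods + 1, size=n_units)
x = (period < loss_period[unit]).astype(float)
alpha = rng.normal(size=n_units)                       # unit fixed effects
y = true_beta * x + alpha[unit] + 0.1 * period + 0.1 * rng.normal(size=unit.size)
sim = pd.DataFrame({"unit": unit, "period": period, "x": x, "y": y})

def slope(xr, yr):
    """OLS slope of yr on xr for variables that are already demeaned."""
    return float((xr * yr).sum() / (xr ** 2).sum())

def unit_demean(col):
    return sim[col] - sim.groupby("unit")[col].transform("mean")

def double_demean(col):
    # Exact two-way within transformation for a balanced panel.
    return (sim[col]
            - sim.groupby("unit")[col].transform("mean")
            - sim.groupby("period")[col].transform("mean")
            + sim[col].mean())

beta_oneway = slope(unit_demean("x"), unit_demean("y"))
beta_twoway = slope(double_demean("x"), double_demean("y"))
print("unit FE only:", round(beta_oneway, 3), "two-way FE:", round(beta_twoway, 3))
```

With unit fixed effects only, the upward time trend is attributed to the (falling) number of connections, so the estimate is pushed below the true value and can even turn negative. Adding time effects removes the common trend and the estimate moves back towards 0.2.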


**Only for R:**
The chunk below uses the plm command to run a panel regression with lobbyist fixed effects only, and a two-way fixed effects regression that also includes time fixed effects (semesters in this data). Take note of the syntax, which you will need later: for the first regression, we use plm(y ~ x, data=frame_name, index=c("FE_var_name"), model="within") where FE_var_name denotes the variable that identifies a particular lobbyist in the data.


In [None]:
#-------------This code is only for R----------------
library(plm)
# lobbyist FEs
reg_lobbyist_FEs <- plm(log_revenue ~ senate_connection_num, data=lobby_data, index=c("lobbyist_id"), model="within")
summary(reg_lobbyist_FEs)

# lobbyist and time FEs
reg_twoway <- plm(log_revenue ~ senate_connection_num, data=lobby_data, index=c("lobbyist_id","semester_id"), model="within", effect="twoways")
summary(reg_twoway)



#### 2.3 Residual variation in connections variable

**Q:** Explain why it is possible to estimate the senate connection coefficient even when controlling for lobbyist- and time-fixed effects. What is the source of variation in the senate connection variable that remains after controlling for the fixed effects?

**A:** Senate connections vary at both lobbyist and semester-level and therefore are not fully explained by either lobbyist or semester dummies. In particular, it must be that some lobbyists lose connections during the time period we analyze, otherwise the variable senate_connection_num would be constant over time and thus perfectly explained by the unit fixed effects. The fact that the senate connection variable varies both over time and across units means that its coefficient can be estimated. Intuitively, this coefficient captures what happens to revenue when a lobbyist loses a connection, while controlling for a common time trend in revenue across lobbyists.
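
One way to check where the identifying variation comes from is to count how many lobbyists change their number of connections at least once. The sketch below uses a tiny hand-made table with the same column names as the dataset; the values are invented for illustration.

```python
import pandas as pd

# Hand-made toy table (values are invented, column names follow the dataset).
toy = pd.DataFrame({
    "lobbyist_id":           [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "semester_id":           [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "senate_connection_num": [2, 1, 1, 0, 0, 0, 1, 1, 0],
})

# Number of distinct connection counts observed for each lobbyist.
n_values = toy.groupby("lobbyist_id")["senate_connection_num"].nunique()

# Only lobbyists with more than one distinct value contribute within-variation.
changers = n_values[n_values > 1].index.tolist()
print("lobbyists whose connections change:", changers)
```

In this toy table, lobbyist 2 never changes and therefore contributes nothing to the fixed effects estimate of the connection coefficient.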

#### 2.4 Political affiliation of the lobbyist

**Q:** Add code that runs a two-way fixed effects regression that also controls for whether the lobbyist is a democrat or not. Note that the democrat_dummy variable denotes the political leaning of the lobbyist regardless of whether they are currently connected to a senator. Compare the results to the previous two-way fixed effects regression that didn’t include this control variable.

In [64]:
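# lobbyist_id and semester_id were moved to the index above, so restore them as columns for the formula
reg_twoway_democrat = smf.ols(formula = 'log_revenue ~ senate_connection_num + democrat_dummy + C(lobbyist_id) + C(semester_id)', data = lobby_data.reset_index()).fit()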

print("senate_connection_num:", reg_twoway_democrat.params["senate_connection_num"])
print("std err: ", reg_twoway_democrat.bse["senate_connection_num"])
print("p-value: ",reg_twoway_democrat.pvalues["senate_connection_num"])

senate_connection_num: 0.2051249210964543
std err:  0.04402173137296956
p-value:  3.211900765484212e-06


In [55]:
print(reg_twoway_democrat.summary())

                            OLS Regression Results                            
Dep. Variable:            log_revenue   R-squared:                       0.736
Model:                            OLS   Adj. R-squared:                  0.704
Method:                 Least Squares   F-statistic:                     22.81
Date:                Sun, 30 Jan 2022   Prob (F-statistic):               0.00
Time:                        11:30:06   Log-Likelihood:                -11027.
No. Observations:               10418   AIC:                         2.432e+04
Df Residuals:                    9283   BIC:                         3.255e+04
Df Model:                        1134                                         
Covariance Type:            nonrobust                                         
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
Intercept                 13

**A:** The democrat dummy does not vary over time, so it is fully explained by the lobbyist fixed effects. You obtain the same regression results as before, but Python still reports a coefficient estimate for the democrat dummy (the summary output flags a very small eigenvalue, which is a symptom of multicollinearity). R would simply drop the variable; unfortunately, it does so without reporting an error message.

In general, it is important to understand your data so that you know when this problem can arise. If you disregard it, you might draw wrong conclusions about the estimated coefficients (e.g., the democrat dummy coefficient has no meaning in the regression above).
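
A quick diagnostic, sketched here on an invented toy table with the dataset's column names, is to apply the within transformation to the candidate regressors and look at what is left. A regressor that is constant within each lobbyist becomes a column of zeros, so the demeaned regressor matrix loses rank:

```python
import numpy as np
import pandas as pd

# Hand-made toy table (values are invented, column names follow the dataset).
toy = pd.DataFrame({
    "lobbyist_id":           [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "senate_connection_num": [2, 1, 1, 0, 0, 0, 1, 1, 0],
    "democrat_dummy":        [1, 1, 1, 0, 0, 0, 1, 1, 1],
})
cols = ["senate_connection_num", "democrat_dummy"]

# Within transformation: subtract each lobbyist's own mean from every regressor.
within = toy[cols] - toy.groupby("lobbyist_id")[cols].transform("mean")

# Rank of the demeaned regressor matrix and the list of fully absorbed columns.
rank = np.linalg.matrix_rank(within.to_numpy())
absorbed = [c for c in cols if np.allclose(within[c], 0.0)]
print("rank:", rank, "of", len(cols), "| absorbed by fixed effects:", absorbed)
```

A rank below the number of regressors signals that at least one coefficient cannot be separately estimated once lobbyist fixed effects are included.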


### 3. Effect heterogeneity


**Q:** Add code that includes in the two-way fixed effects regression an interaction term, in order to analyze whether a senate connection is more or less valuable depending on whether the lobbyist is a democrat or a republican. Interpret the coefficients.

In [19]:
lobby_data["senate_connection_num:democrat_dummy"] =  lobby_data["senate_connection_num"]*lobby_data["democrat_dummy"]

exog_vars = ["senate_connection_num","senate_connection_num:democrat_dummy"]
exog = sm.add_constant(lobby_data[exog_vars])
mod = PanelOLS(lobby_data.log_revenue, exog, entity_effects=True, time_effects=True)
reg_twoway_democrat_interact = mod.fit(cov_type='clustered', cluster_entity=True, cluster_time=True)
#reg_twoway_democrat_interact = mod.fit()  # alternative: unclustered standard errors, to compare with the clustered ones; they should be similar to those from a dummy-variable regression
print(reg_twoway_democrat_interact)

                          PanelOLS Estimation Summary                           
Dep. Variable:            log_revenue   R-squared:                        0.0027
Estimator:                   PanelOLS   R-squared (Between):             -0.0210
No. Observations:               10418   R-squared (Within):              -0.0029
Date:                Fri, Feb 17 2023   R-squared (Overall):             -0.0068
Time:                        18:16:25   Log-likelihood                -1.103e+04
Cov. Estimator:             Clustered                                           
                                        F-statistic:                      12.487
Entities:                        1113   P-value                           0.0000
Avg Obs:                       9.3603   Distribution:                  F(2,9282)
Min Obs:                       3.0000                                           
Max Obs:                       22.000   F-statistic (robust):             2.3789
                            

**A:** The interaction coefficient shows that connections are less valuable for Democrats relative to Republicans. However, the effect is only significant at the 10% level (but not at the 5% level).
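
To read the coefficients, it can help to write the model out. The symbols below ($C_{it}$ for the number of connections, $D_i$ for the democrat dummy, $\alpha_i$ and $\gamma_t$ for the lobbyist and semester effects) are notation assumed for illustration. The main effect of $D_i$ does not appear separately because it is absorbed by $\alpha_i$:

```latex
\begin{aligned}
\log(\text{revenue}_{it}) &= \beta_1 C_{it} + \beta_2\,(C_{it} \times D_i) + \alpha_i + \gamma_t + \varepsilon_{it} \\
\text{effect of one connection, Republicans } (D_i = 0) &: \ \beta_1 \\
\text{effect of one connection, Democrats } (D_i = 1) &: \ \beta_1 + \beta_2
\end{aligned}
```

A negative $\beta_2$ therefore means that a connection is worth less to a Democrat than to a Republican.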

### 4. Conclusion


**Q:** What do you conclude from these results about the usefulness of senate connections for lobbyists? What is your best estimate and would you give it a causal interpretation?

**A:** The two-way fixed effects regression suggests that one more senator connection increases lobbyists’ revenue by roughly 20% (the coefficient is about 0.205 in logs and is statistically significant at conventional levels). This estimate controls for all omitted variables that vary only over time or only across lobbyists, but it does not control for possible omitted variables that affect both the number of connections and revenue and that vary both across lobbyists and over time. One example is the popularity or influence of the senator, which varies over time and across lobbyists and is potentially correlated with both connections and revenue. One should thus always be careful about giving panel regression estimates a causal interpretation, particularly if you can think of additional omitted variables that you are not controlling for in your specific application.
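
Since a coefficient of about 0.205 is not very small, the "multiply by 100" reading is only approximate. The short sketch below converts the coefficient reported in the dummy-variable regression into the exact implied percentage change; the helper name `log_coef_to_percent` is made up for this example.

```python
import math

def log_coef_to_percent(beta):
    """Exact percentage change in Y implied by a coefficient beta in a log(Y) regression."""
    return 100.0 * (math.exp(beta) - 1.0)

# Coefficient on senate_connection_num from the two-way dummy-variable regression.
beta_hat = 0.2051249210964543

approx_pct = 100.0 * beta_hat               # quick approximation
exact_pct = log_coef_to_percent(beta_hat)   # exact conversion
print(f"approximate: {approx_pct:.1f}%  exact: {exact_pct:.1f}%")
```

The exact figure is a little above the approximation, which is the usual pattern for positive coefficients of this size.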