In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats as stats
from scipy.stats import norm
import math
import sklearn as sk
from sklearn.linear_model import LogisticRegression as LR
import statsmodels.api as sm

%matplotlib inline

sns.set(style="dark")
plt.style.use("ggplot")

# Research Question 2

What is the causal impact of highway and street spending on the seasonally adjusted real GDP?

## Methods

- **Describe which variables correspond to treatment and outcome.**

    - The treatment variable is "State and Local Government Construction Spending - Highway and Street." This variable represents the total spending on highways and streets in the United States.

    - The outcome variable is "Real Gross Domestic Product - Seasonally Adjusted." This is the real GDP of the United States, adjusted for seasonal/cyclical changes.
    
    - The instrumental variable is "Highway Fatalities." This represents the total number of people killed in highway-related incidents per month. We argue that this variable has a causal effect on the treatment variable because lawmakers would increase spending on highways if they saw a tangible reason to make roads safer.

- **Describe which variables (if any) are confounders. If the unconfoundedness assumption holds, make a convincing argument for why.**

**Identified confounders:**

     1. Unemployment Rate - Seasonally Adjusted
    
Unemployment Rate (UR) is correlated with the outcome variable (Real GDP) since a decrease in GDP is reflected in a decrease in the rate of employment and vice versa. UR is also correlated with the treatment variable since if the government is spending more on highway and street construction, there are more jobs available and UR is lower. UR is not significantly correlated with highway fatalities since changes in the unemployment rate do not make people better or worse drivers.

    2. State and Local Government Construction Spending - Transportation
    
Transportation spending is correlated with the outcome variable (Real GDP) since GDP encompasses all government spending (among other factors). Transportation spending is also correlated with the treatment variable–as highway spending increases, so will transportation spending as transportation includes highway transportation infrastructure. Transportation is not correlated with highway fatalities since it’s too broad - any effects of highway and transportation will be crowded out by the other factors included in transportation.

    3. State and Local Government Construction Spending - Infrastructure
    
Infrastructure spending is correlated with the outcome variable (Real GDP) since GDP encompasses all government spending (among other factors). Infrastructure spending is also correlated with the treatment variable–as highway spending increases, so will infrastructure spending as infrastructure includes highway transportation infrastructure. Infrastructure is not correlated with highway fatalities since it’s too broad - any effects of highway and transportation will be crowded out by the other factors included in infrastructure.

    4. Labor Force Participation Rate - Seasonally Adjusted
    
Labor Force Participation (LFP) is correlated with the outcome variable (Real GDP) since LFP is related to unemployment, which is correlated with GDP: a decrease in GDP is reflected in a decrease in the rate of employment and vice versa. LFP is also correlated with the treatment variable since if the government is spending more on highway and street construction, there are more jobs available and the LFP rate is higher. The LFP rate is not correlated with highway fatalities since changes in labor force participation do not make people better or worse drivers.

    5. State and Local Government Construction Spending - Bridge
    
Bridge spending is correlated with the outcome variable (Real GDP) since GDP encompasses all government spending (among other factors). Bridge spending is also correlated with the treatment variable – as highway spending increases, so will bridge spending as bridges are a necessity for highway transportation infrastructure. Bridge spending is not correlated with highway fatalities since bridges are not considered highways.

    6. Is Recession
    
Recession periods are correlated with the outcome variable (Real GDP) since recessions usually negatively impact real GDP by reducing the functions of the economy. Recessions are also correlated with the treatment variable because the government may respond by increasing investments into highways and streets to stimulate the economy. Recessions are not correlated with highway fatalities since recessions don't make drivers better/worse at driving.

The unconfoundedness assumption likely does not hold because there are many possible confounders between highway spending and real GDP. For example, technology advancements such as electric vehicles may change the extent to which highway spending is necessary, and advancements in this field would likely have an effect on real GDP.

- **What methods will you use to adjust for confounders?**

We will use two-stage least squares (2SLS) regression to adjust for any other confounders. Since we don't have access to randomized data pertaining to our question, we need to use an instrumental variable to estimate the causal relationship.
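
The idea behind the instrumental-variable approach can be sketched on synthetic data. The example below is only an illustration and does not use our dataset: the helper names (`cov`, `simulate`, `ols_slope`, `iv_slope`), the coefficients, and the hidden confounder `u` are all assumptions made up for the sketch. It shows that when an unobserved confounder drives both treatment and outcome, the naive regression slope is biased, while the ratio of covariances with a valid instrument (the single-instrument special case of 2SLS) stays close to the true effect.

```python
import random


def cov(a, b):
    """Sample covariance of two equal-length lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / (len(a) - 1)


def simulate(n=20000, true_effect=2.0, seed=0):
    """Generate (instrument, treatment, outcome) with a hidden confounder."""
    rng = random.Random(seed)
    z, x, y = [], [], []
    for _ in range(n):
        u = rng.gauss(0, 1)                  # unobserved confounder
        zi = rng.gauss(0, 1)                 # instrument, independent of u
        xi = zi + u + rng.gauss(0, 1)        # treatment depends on z and u
        yi = true_effect * xi + 3.0 * u + rng.gauss(0, 1)  # outcome
        z.append(zi)
        x.append(xi)
        y.append(yi)
    return z, x, y


def ols_slope(x, y):
    """Naive regression slope of y on x (ignores the confounder)."""
    return cov(x, y) / cov(x, x)


def iv_slope(z, x, y):
    """IV estimate with one instrument: cov(z, y) / cov(z, x)."""
    return cov(z, y) / cov(z, x)


z, x, y = simulate()
naive = ols_slope(x, y)
iv = iv_slope(z, x, y)
print(f"true effect: 2.0, naive OLS: {naive:.2f}, IV: {iv:.2f}")
```

In this toy setup the naive slope is pulled away from the true effect of 2.0 because `u` raises both the treatment and the outcome, whereas the instrument only moves the outcome through the treatment.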

- **Are there any colliders in the dataset? If so, what are they?**

There may be some reverse causality running from the treatment and outcome variables to the confounding variables. In that case, each confounding variable could be caused by both the treatment and the outcome, which would make the confounders colliders as well. However, **we argue that there is no reverse causality and therefore no colliders** because real GDP is just a metric that reflects the state of the economy; the value of real GDP is caused by many factors but does not itself cause those factors to change.

## Importing and Cleaning Data

In [109]:
df = pd.read_csv('cleaned_transport_data.csv')
df['Date'] = pd.to_datetime(df['Date'])
df

Unnamed: 0.1,Unnamed: 0,Date,Unemployment Rate - Seasonally Adjusted,State and Local Government Construction Spending - Transportation,State and Local Government Construction Spending - Infrastructure,Labor Force Participation Rate - Seasonally Adjusted,State and Local Government Construction Spending - Bridge,State and Local Government Construction Spending - Highway and Street,Real Gross Domestic Product - Seasonally Adjusted,Highway Fatalities,Is Recession,Recession Period,President,Political Party
0,0,2005-01-01,0.053,1.125000e+09,85000000.0,0.658,7.260000e+08,2.929000e+09,1.476785e+13,3305.0,not recession,pre-recession,Bush,Republican
1,1,2005-02-01,0.054,1.111000e+09,79000000.0,0.659,8.600000e+08,3.120000e+09,1.476785e+13,3042.0,not recession,pre-recession,Bush,Republican
2,2,2005-03-01,0.052,1.153000e+09,93000000.0,0.659,8.840000e+08,3.583000e+09,1.476785e+13,3334.0,not recession,pre-recession,Bush,Republican
3,3,2005-04-01,0.052,1.262000e+09,75000000.0,0.661,1.106000e+09,4.320000e+09,1.483971e+13,3686.0,not recession,pre-recession,Bush,Republican
4,4,2005-05-01,0.051,1.273000e+09,94000000.0,0.661,1.282000e+09,5.557000e+09,1.483971e+13,3874.0,not recession,pre-recession,Bush,Republican
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
187,187,2020-08-01,0.084,4.083000e+09,117000000.0,0.617,2.022000e+09,1.106300e+10,1.856077e+13,3802.0,not recession,post recession,Trump,Republican
188,188,2020-09-01,0.079,3.818000e+09,101000000.0,0.614,2.050000e+09,1.061500e+10,1.856077e+13,3724.0,not recession,post recession,Trump,Republican
189,189,2020-10-01,0.069,3.718000e+09,115000000.0,0.616,2.107000e+09,1.037500e+10,1.876778e+13,3793.0,not recession,post recession,Trump,Republican
190,190,2020-11-01,0.067,3.419000e+09,108000000.0,0.615,1.676000e+09,8.164000e+09,1.876778e+13,3445.0,not recession,post recession,Trump,Republican


In [110]:
'''
Converting "Is Recession" into binary numbers
'''

def is_recession_number(text):
    if text == 'recession':
        return 1
    else:
        return 0

df['Is Recession'] = df['Is Recession'].apply(is_recession_number)
df

Unnamed: 0.1,Unnamed: 0,Date,Unemployment Rate - Seasonally Adjusted,State and Local Government Construction Spending - Transportation,State and Local Government Construction Spending - Infrastructure,Labor Force Participation Rate - Seasonally Adjusted,State and Local Government Construction Spending - Bridge,State and Local Government Construction Spending - Highway and Street,Real Gross Domestic Product - Seasonally Adjusted,Highway Fatalities,Is Recession,Recession Period,President,Political Party
0,0,2005-01-01,0.053,1.125000e+09,85000000.0,0.658,7.260000e+08,2.929000e+09,1.476785e+13,3305.0,0,pre-recession,Bush,Republican
1,1,2005-02-01,0.054,1.111000e+09,79000000.0,0.659,8.600000e+08,3.120000e+09,1.476785e+13,3042.0,0,pre-recession,Bush,Republican
2,2,2005-03-01,0.052,1.153000e+09,93000000.0,0.659,8.840000e+08,3.583000e+09,1.476785e+13,3334.0,0,pre-recession,Bush,Republican
3,3,2005-04-01,0.052,1.262000e+09,75000000.0,0.661,1.106000e+09,4.320000e+09,1.483971e+13,3686.0,0,pre-recession,Bush,Republican
4,4,2005-05-01,0.051,1.273000e+09,94000000.0,0.661,1.282000e+09,5.557000e+09,1.483971e+13,3874.0,0,pre-recession,Bush,Republican
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
187,187,2020-08-01,0.084,4.083000e+09,117000000.0,0.617,2.022000e+09,1.106300e+10,1.856077e+13,3802.0,0,post recession,Trump,Republican
188,188,2020-09-01,0.079,3.818000e+09,101000000.0,0.614,2.050000e+09,1.061500e+10,1.856077e+13,3724.0,0,post recession,Trump,Republican
189,189,2020-10-01,0.069,3.718000e+09,115000000.0,0.616,2.107000e+09,1.037500e+10,1.876778e+13,3793.0,0,post recession,Trump,Republican
190,190,2020-11-01,0.067,3.419000e+09,108000000.0,0.615,1.676000e+09,8.164000e+09,1.876778e+13,3445.0,0,post recession,Trump,Republican


## Performing 2SLS

In [111]:
'''
The first stage of 2SLS. Fitting the treatment variable using linear regression
using the confounding and instrument variables
'''

exog_2sls = sm.add_constant(df[['Unemployment Rate - Seasonally Adjusted', 
                                'State and Local Government Construction Spending - Transportation', 
                                'State and Local Government Construction Spending - Infrastructure',
                                'Labor Force Participation Rate - Seasonally Adjusted',
                                'State and Local Government Construction Spending - Bridge',
                                'Is Recession',
                                'Highway Fatalities']])
z_2sls = df['State and Local Government Construction Spending - Highway and Street']

model_zhat = sm.OLS(z_2sls, exog_2sls)
results_zhat = model_zhat.fit(cov_type='HC1')
results_zhat.summary()



0,1,2,3
Dep. Variable:,State and Local Government Construction Spending - Highway and Street,R-squared:,0.953
Model:,OLS,Adj. R-squared:,0.951
Method:,Least Squares,F-statistic:,133.8
Date:,"Fri, 02 Dec 2022",Prob (F-statistic):,2.42e-53
Time:,18:31:56,Log-Likelihood:,-4113.6
No. Observations:,192,AIC:,8243.0
Df Residuals:,184,BIC:,8269.0
Df Model:,7,,
Covariance Type:,HC1,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,-3.294e+10,3.41e+09,-9.658,0.000,-3.96e+10,-2.63e+10
Unemployment Rate - Seasonally Adjusted,1.551e+10,1.82e+09,8.542,0.000,1.2e+10,1.91e+10
State and Local Government Construction Spending - Transportation,2.1160,0.112,18.858,0.000,1.896,2.336
State and Local Government Construction Spending - Infrastructure,2.7206,1.202,2.263,0.024,0.364,5.077
Labor Force Participation Rate - Seasonally Adjusted,3.851e+10,5.29e+09,7.281,0.000,2.81e+10,4.89e+10
State and Local Government Construction Spending - Bridge,1.9564,0.075,25.967,0.000,1.809,2.104
Is Recession,-2.897e+08,1.44e+08,-2.013,0.044,-5.72e+08,-7.64e+06
Highway Fatalities,1.585e+06,1.12e+05,14.142,0.000,1.37e+06,1.8e+06

0,1,2,3
Omnibus:,0.327,Durbin-Watson:,1.228
Prob(Omnibus):,0.849,Jarque-Bera (JB):,0.305
Skew:,0.095,Prob(JB):,0.859
Kurtosis:,2.96,Cond. No.,521000000000.0
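
A common rule of thumb treats an instrument as weak when the first-stage F-statistic for the excluded instrument is below roughly 10. With a single instrument, that statistic is the square of the instrument's z-statistic, so it can be read off the summary above. The short sketch below only redoes that arithmetic with the reported z-value for "Highway Fatalities"; the variable names and the threshold constant are our own, and the check says nothing about whether the exclusion restriction holds or about the serial correlation suggested by the Durbin-Watson value.

```python
# z-statistic reported for "Highway Fatalities" in the first-stage summary
instrument_z = 14.142

# With one excluded instrument, the Wald F-statistic equals the squared z-statistic
first_stage_f = instrument_z ** 2

# Conventional rule-of-thumb threshold for a weak instrument (an assumption, not a test)
RULE_OF_THUMB = 10.0
is_strong = first_stage_f > RULE_OF_THUMB

print(f"first-stage F = {first_stage_f:.1f}, passes rule of thumb: {is_strong}")
```

The resulting value of about 200 is well above the threshold, which suggests the instrument is at least strongly correlated with the treatment.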


In [112]:
'''
The second stage of 2SLS. Fitting real GDP using the fitted treatment variable
and the confounding variables
'''
df['highway and street hat'] = results_zhat.fittedvalues

y_2sls = df['Real Gross Domestic Product - Seasonally Adjusted']
X_2sls = sm.add_constant(df[['highway and street hat', 
                             'Unemployment Rate - Seasonally Adjusted', 
                             'State and Local Government Construction Spending - Transportation',
                             'State and Local Government Construction Spending - Infrastructure',
                             'Labor Force Participation Rate - Seasonally Adjusted',
                             'State and Local Government Construction Spending - Bridge',
                             'Is Recession']])
model_yhat = sm.OLS(y_2sls, X_2sls)
results_yhat = model_yhat.fit(cov_type='HC1')
results_yhat.summary()



0,1,2,3
Dep. Variable:,Real Gross Domestic Product - Seasonally Adjusted,R-squared:,0.963
Model:,OLS,Adj. R-squared:,0.961
Method:,Least Squares,F-statistic:,371.8
Date:,"Fri, 02 Dec 2022",Prob (F-statistic):,7.96e-78
Time:,18:31:56,Log-Likelihood:,-5313.4
No. Observations:,192,AIC:,10640.0
Df Residuals:,184,BIC:,10670.0
Df Model:,7,,
Covariance Type:,HC1,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,3.628e+13,2.51e+12,14.429,0.000,3.14e+13,4.12e+13
highway and street hat,-274.1425,43.402,-6.316,0.000,-359.210,-189.075
Unemployment Rate - Seasonally Adjusted,-2.536e+13,8.19e+11,-30.967,0.000,-2.7e+13,-2.38e+13
State and Local Government Construction Spending - Transportation,1662.6528,104.648,15.888,0.000,1457.546,1867.759
State and Local Government Construction Spending - Infrastructure,233.8701,825.533,0.283,0.777,-1384.144,1851.884
Labor Force Participation Rate - Seasonally Adjusted,-3.215e+13,3.68e+12,-8.746,0.000,-3.94e+13,-2.49e+13
State and Local Government Construction Spending - Bridge,348.9821,117.493,2.970,0.003,118.701,579.263
Is Recession,-1.898e+10,5.9e+10,-0.322,0.748,-1.35e+11,9.67e+10

0,1,2,3
Omnibus:,1.791,Durbin-Watson:,0.636
Prob(Omnibus):,0.408,Jarque-Bera (JB):,1.822
Skew:,-0.183,Prob(JB):,0.402
Kurtosis:,2.693,Cond. No.,1890000000000.0
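
One caveat about a second stage run by hand as plain OLS: the coefficients match 2SLS, but the reported standard errors generally do not, because OLS computes residuals with the fitted treatment instead of the observed one. The sketch below illustrates the difference on synthetic data. It is not a re-analysis of our dataset: the function name `two_stage_least_squares`, the simulated coefficients, and the hidden confounder `u` are assumptions for illustration, and it uses the classical (homoskedastic) variance formula rather than the HC1 option used in the cells above to keep the code short.

```python
import numpy as np


def two_stage_least_squares(y, treatment, controls, instrument):
    """Return 2SLS coefficients plus two versions of their standard errors.

    Coefficient order: constant, controls..., treatment.
    """
    n = len(y)
    const = np.ones(n)
    Z = np.column_stack([const, controls, instrument])  # first-stage regressors
    X = np.column_stack([const, controls, treatment])   # structural regressors

    # Stage 1: project the treatment onto the instrument and controls.
    gamma, *_ = np.linalg.lstsq(Z, treatment, rcond=None)
    X_hat = np.column_stack([const, controls, Z @ gamma])

    # Stage 2: regress the outcome on the fitted treatment and controls.
    beta, *_ = np.linalg.lstsq(X_hat, y, rcond=None)

    dof = n - X.shape[1]
    xtx_inv = np.linalg.inv(X_hat.T @ X_hat)
    resid_2sls = y - X @ beta        # residuals using the observed treatment
    resid_manual = y - X_hat @ beta  # residuals a hand-run second stage uses
    se_2sls = np.sqrt(np.diag(xtx_inv) * (resid_2sls @ resid_2sls) / dof)
    se_manual = np.sqrt(np.diag(xtx_inv) * (resid_manual @ resid_manual) / dof)
    return beta, se_2sls, se_manual


# Synthetic data with a hidden confounder u; the true treatment effect is 1.5.
rng = np.random.default_rng(0)
n = 5000
u = rng.normal(size=n)
control = rng.normal(size=n)
instrument = rng.normal(size=n)
treatment = 0.8 * instrument + 0.5 * control + u + rng.normal(size=n)
outcome = 1.5 * treatment + 0.7 * control + 2.0 * u + rng.normal(size=n)

beta, se_2sls, se_manual = two_stage_least_squares(
    outcome, treatment, control.reshape(-1, 1), instrument
)
print(f"treatment effect:        {beta[-1]:.3f}")
print(f"2SLS standard error:     {se_2sls[-1]:.4f}")
print(f"manual-stage std. error: {se_manual[-1]:.4f}")
```

The two standard errors differ even though the point estimate is identical, so confidence intervals taken from a hand-run second stage should be read with caution.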


In [113]:
'''
Naive OLS results for comparison with the results from 2SLS
'''
y_naive = df['Real Gross Domestic Product - Seasonally Adjusted']
x_naive = sm.add_constant(df[['State and Local Government Construction Spending - Highway and Street', 
                             'Unemployment Rate - Seasonally Adjusted', 
                             'State and Local Government Construction Spending - Transportation',
                             'State and Local Government Construction Spending - Infrastructure',
                             'Labor Force Participation Rate - Seasonally Adjusted',
                             'State and Local Government Construction Spending - Bridge',
                             'Is Recession']])
model_naive = sm.OLS(y_naive, x_naive)
results_naive = model_naive.fit(cov_type='HC1')
results_naive.summary()



0,1,2,3
Dep. Variable:,Real Gross Domestic Product - Seasonally Adjusted,R-squared:,0.962
Model:,OLS,Adj. R-squared:,0.961
Method:,Least Squares,F-statistic:,414.6
Date:,"Fri, 02 Dec 2022",Prob (F-statistic):,1.38e-81
Time:,18:31:57,Log-Likelihood:,-5314.8
No. Observations:,192,AIC:,10650.0
Df Residuals:,184,BIC:,10670.0
Df Model:,7,,
Covariance Type:,HC1,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,4.094e+13,1.98e+12,20.648,0.000,3.7e+13,4.48e+13
State and Local Government Construction Spending - Highway and Street,-187.8517,29.701,-6.325,0.000,-246.065,-129.638
Unemployment Rate - Seasonally Adjusted,-2.589e+13,8.42e+11,-30.750,0.000,-2.75e+13,-2.42e+13
State and Local Government Construction Spending - Transportation,1442.2584,81.793,17.633,0.000,1281.947,1602.570
State and Local Government Construction Spending - Infrastructure,-35.3758,779.590,-0.045,0.964,-1563.344,1492.593
Labor Force Participation Rate - Seasonally Adjusted,-3.883e+13,2.91e+12,-13.356,0.000,-4.45e+13,-3.31e+13
State and Local Government Construction Spending - Bridge,146.5097,83.574,1.753,0.080,-17.292,310.311
Is Recession,5.347e+10,5.29e+10,1.011,0.312,-5.01e+10,1.57e+11

0,1,2,3
Omnibus:,0.776,Durbin-Watson:,0.614
Prob(Omnibus):,0.679,Jarque-Bera (JB):,0.89
Skew:,-0.094,Prob(JB):,0.641
Kurtosis:,2.725,Cond. No.,1540000000000.0


## Results

- **Summarize and interpret your results, providing a clear statement about causality (or a lack thereof) including any assumptions necessary.**
    - Our 2SLS results indicate that **every additional dollar spent on highways and streets *causes* real GDP to fall by approximately 274 dollars**. This differs from the naive OLS regression, which says that a one-dollar increase in highway and street spending is associated with a **188 dollar** decline in real GDP.
    - Our assumptions are that inflation during this period is negligible, plus the usual assumptions of IV regression: the instrumental variable is correlated with the treatment variable and affects the outcome only through it, the confounders are correlated only with the treatment and outcome variables (not with the instrument), and the relationships among our variables are linear.
    
    
- **Where possible, discuss the uncertainty in your estimate and/or the evidence against the hypotheses you are investigating.**
    - Our initial expectation was that additional investments into highways and streets would increase real GDP since better roads would improve the means by which the economy functions, such as by reducing commute times for workers. However, the results describe the opposite. 
    - We believe our estimate is negative instead of positive because **the effects of investments like highway and street spending are delayed** - the benefits of investments are not immediately realized. Therefore, the government can spend a lot on investments, but it would take a few years before the benefits from those investments are reflected in the real GDP.
    - Of course, there's always the possibility of finding more confounders and better instrument variables to help remove the biases in our coefficient. In this case, our estimate may not be completely unbiased.

## Discussion

- **Elaborate on the limitations of your methods.**

One limitation is that our 2SLS model assumes a **linear relationship** between our variables. It's possible that there are nonlinearities in our data that affect how well our coefficient describes the causal effect. Nonparametric models that do not assume a linear relationship, such as neural networks, may be better specified.

Another limitation is that we need to identify and include confounding variables and instrumental variables, which may be intractable when the treatment and outcome variables are influenced by many different factors. The causal effect of government spending on real GDP is generally a tough topic to study, even for researchers specialized in the field.

- **What additional data would be useful for answering this causal question, and why?**

Each row in our dataframe is a month, but some of our data is only updated quarterly. If we could get more granular data, with features such as real GDP and labor force participation rate updated monthly rather than quarterly, we could get better results.
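
Until monthly figures are available, one alternative we did not try is to aggregate the monthly rows to quarters so that every row carries a distinct GDP value. The sketch below shows the mechanics on a small made-up frame; the column names `spending` and `gdp` and all numbers are illustrative placeholders, not values from our dataset.

```python
import pandas as pd

# Toy monthly frame: spending is a monthly flow, GDP is a quarterly value
# repeated across the three months of each quarter (all numbers made up).
monthly = pd.DataFrame({
    "Date": pd.date_range("2005-01-01", periods=6, freq="MS"),
    "spending": [2.9, 3.1, 3.6, 4.3, 5.6, 6.0],
    "gdp": [14.77, 14.77, 14.77, 14.84, 14.84, 14.84],
})

# Label each month with its calendar quarter, then aggregate:
# sum the monthly flow, keep the (repeated) quarterly GDP value once.
quarter = monthly["Date"].dt.to_period("Q")
quarterly = monthly.groupby(quarter).agg(
    spending=("spending", "sum"),
    gdp=("gdp", "first"),
)
print(quarterly)
```

This trades sample size for rows that are not artificially repeated, which may matter for the standard errors.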

Additionally, if we can get data that is somehow **adjusted for the lagging effects of investments**, then that would be extremely helpful in determining the true causal effect of our study. We aren't sure how lagged-effects-adjusted data could be created, but that would certainly correct a lot of our uncertainties in our causal estimate.
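
One possible way to approximate such an adjustment with the data we already have, which we did not run in this analysis, is to add lagged copies of the treatment as extra regressors (a distributed-lag specification). The sketch below shows how the lagged columns could be built with `Series.shift`; the helper `add_lags`, the toy column name `spending`, and the chosen lags are hypothetical.

```python
import pandas as pd


def add_lags(frame, column, lags):
    """Return a copy of `frame` with lagged versions of `column` appended.

    Rows that lack a full lag history (or have any missing value) are dropped.
    """
    out = frame.copy()
    for k in lags:
        # shift(k) moves values down k rows, so row t holds the value from t - k
        out[f"{column} (lag {k})"] = out[column].shift(k)
    return out.dropna()


# Toy example with made-up numbers, one row per period
toy = pd.DataFrame({"spending": [1.0, 2.0, 3.0, 4.0, 5.0]})
lagged = add_lags(toy, "spending", lags=[1, 2])
print(lagged)
```

With monthly rows, lags of 12, 24, or 36 would correspond to spending one to three years earlier, at the cost of losing the earliest rows of the sample.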

Lastly, if we can get highway and street data split by the type of project the money was spent on, then that could help separate the projects that contribute little to GDP from those that contribute more. For example, highway and street projects that fix relatively minuscule issues, such as repaving roads, may not improve traffic as much as bigger projects, such as building new highways that connect two points that used to be isolated. We could use that data to see how relatively more productive projects affect real GDP. This would be more interesting and could help motivate new sweeping roads legislation.

- **How confident are you that there’s a causal relationship between your chosen treatment and outcome? Why?**

We're very confident that **there is a causal relationship** between our treatment and outcome variables. Intuitively, investments in transportation have historically been extremely important to a country's economy. For example, without investments in the modern highway system, households might not be able to use their cars as effectively and would instead need to rely on public transportation, thereby reducing the extent to which money can circulate in the economy and lowering real GDP.

In terms of the results from our causal inference analysis, our estimated causal coefficient using 2SLS is even further from 0 than the coefficient suggested by a simple OLS regression. Furthermore, our 95% confidence interval still does not contain 0, meaning that **our result is still statistically significantly different from 0 despite the fact that our variance has increased** (due to the bias-variance tradeoff; we removed bias from our estimate in exchange for a greater standard error). This suggests that there is a causal effect between our treatment and outcome variables.