# Microeconometrics Project by Fabian Balensiefer - 'Are Credit Markets Still Local? Evidence from Bank Branch Closings'

## Quick Reference
The scope of this project is to replicate the study **"Are Credit Markets Still Local? Evidence from Bank Branch Closings"** by **Hoai-Luu Q. Nguyen**, published in *American Economic Journal: Applied Economics*, Vol. 11, No. 1, January 2019. <br> <br>

Data and Stata files are provided by the American Economic Association:<br>
<href>https://www.aeaweb.org/articles?id=10.1257/app.20170543</href><br>



**Hypothesis:** Does the distance to bank branches affect credit allocation?<br>

**Identification Issue:** Openings and closings of bank branches are not randomly assigned<br>

**Idea:** Using the impact of post-merger branch closings to measure the effect on lending <br>
          => *Key assumption:* merger decision is exogenous to local economic conditions (census tract)
          
**Data:**

        * census tract -> macro- and household data on tract level
        * Summary of Deposits -> bank branch data
        * Report of Changes -> merger and branch closing 
        * HMDA and CRA -> lending data
        
**Method:** 

        1. IV – “exposure to post-merger consolidation” as instrument for closings
        2. DD – to compare lending in exposed and control (census) tracts in the same county 

*Why does the author use two methods? - to allow for heterogeneity across tracts within a county (DD)*

## Brief summary of Nguyen (2019)

The paper “Are Credit Markets Still Local? Evidence from Bank Branch Closings” by Nguyen (2019) establishes a novel approach for estimating the causal impact of bank branch closings on credit supply at the branch level. Motivated by the question of whether technological progress has changed access to credit, she is interested in estimating the local average treatment effect of bank branch closings. While other research focuses on aggregate, system-wide shocks, Nguyen (2019) concentrates on a local approach. Thus, she is able to control for unobserved heterogeneity across different regions. 

Nguyen (2019) combines national data from four different sources. First, macro and household data at the tract level are provided by the Census Bureau. Data on bank branches are published in the Summary of Deposits, while mergers and bank branch closings are recorded in the Report of Changes (both by the FDIC). Lending data for private lending/mortgages (HMDA) and commercial lending (CRA) come from the FFIEC. Finally, some additional macroeconomic data from the National Establishment Time-Series (NETS) by Walls and Associates are used. Thus, most of the data used are provided by official US institutions. Data are merged at the bank and tract level by using GIS software to map geographical locations. Tracts are defined by the Census Bureau as regions containing approximately 4,000 inhabitants, although they differ in size. The final sample consists of tracts based on exposure to large bank mergers in the period between 1999 and 2012. 

The hypothesis of the paper, that distance to bank branches affects credit allocation, is analyzed in a quasi-experimental research design. A difference-in-differences framework makes it possible to control for time-varying trends across tracts within the same county. Instrumenting bank branch closings with bank mergers addresses the endogeneity of bank branch closings. The identification strategy is discussed in more detail in the following section. 

The results of Nguyen (2019) support the hypothesis that distance to bank branches still affects access to credit. The findings suggest that “closings lead to a persistent decline in local small business lending” (Nguyen 2019). The effect on private lending seems to be temporary (since this temporary decline is insignificant, we cannot draw any conclusions from it).  
Nguyen (2019) concludes that “distance matters not only because it improves accessibility, but also because it reduces the costs of transmitting information”.  This is in line with theoretical findings from Akerlof (1970) and Stiglitz and Weiss (1981), which find that credit markets are subject to informational asymmetries. Surprisingly, after major improvements in information technologies in the past decades these findings still hold.

## Causal Graph and Identification Strategy

Causal graphs break down complex relationships into simple, transparent and easy-to-interpret visualizations. I create a causal graph to highlight the paper's framework, in particular to illustrate the author's identification strategy and to discuss potential identification issues.

![](graphs/causal_graph.png)

    * D - treatment variable "bank branch closings"
    * Y - dependent variable "lending activity"
    * E - general economic controls (county level)
    * X - local economic controls (tract level)
    * M - instrument "merger activity"
    * U - unobserved drivers of lending activity and bank branch  closings
    
The causal graph above pins down the relationships between banks and lending. In detail, the effect of interest in this paper is between bank branch closings (D) and lending activity (Y). There are multiple backdoor paths, confounding variables and reverse causality issues that need to be solved to show causality.

First, consider our variable of interest, lending activity. Since credit is an equilibrium concept, it is difficult to disentangle whether a change in lending activity is driven by a change in credit demand or in credit supply. To solve this issue, we need exogenous variation that only affects the supply side. The author needs to identify a shock that only affects the banks' lending supply. Second, our treatment variable, bank branch closings, has an issue of reverse causality with lending activity. One can argue that lower demand for credit affects a bank's decision to close a branch in a certain location. What we want to test, however, is whether the closing of a branch in a certain location affects credit availability. This issue can be solved by instrumenting bank branch closings with exposure to a merger (M). The author argues that the decision to close a branch after a merger is driven more by merger activity than by local demand for lending. Furthermore, merger-induced branch closings can be seen as an exogenous shock to local credit supply. Therefore, the framework makes it possible to disentangle credit demand and supply. By controlling for local (X) and general (E) economic conditions, the remaining two backdoor paths are blocked. Finally, there are other unobserved characteristics which could affect both lending activity and bank branch closings at the same time. The author addresses this issue by controlling for individual fixed effects at the tract (local) level. Thus, the exogeneity and relevance conditions should be fulfilled.

A naive approach to measure the effect of bank branch closings on lending activity would be:

$$ y_{it}=\alpha_i + \gamma_t + \lambda X_{it} + \beta_C \text{Close}_{it} + \epsilon_{it} $$

As explained above, we have an issue with reverse causality between $y_{it}$ and $\text{Close}_{it}$. Nguyen (2019) uses exposure to post-merger consolidation as an instrument for bank branch closings. She argues that large institutions (as in the dataset) often have overlapping networks in the same regions, so branches in these regions are at greater risk of a post-merger closing. The first stage of the instrumental variable (IV) approach has the following structure:

$$ \text{Close}_{it} = \kappa_{i} + \phi_{t} + \rho X_{it} + \beta_E \text{Expose}_{it} + \omega_{it} $$

Here the crucial assumption is that the decision to merge is not affected by local economic conditions. If the instrument, exposure to post-merger consolidation, is exogenous, treatment is as good as randomly assigned. To further address the exogeneity concern, Nguyen (2019) focuses on mergers between large banks (both buyer and target held at least $10 billion in premerger assets). The idea behind focusing on large banks is that these institutions merge for reasons other than local economic conditions. 
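
The logic of the instrument can be illustrated with a small simulation. This is a minimal sketch under assumed parameters: the data-generating process, the variable names (`u`, `expose`, `close`) and all coefficients are hypothetical and are not taken from Nguyen (2019). An unobserved demand shock lowers the probability of a closing and raises lending, so the naive OLS estimate is biased, while the ratio of reduced form to first stage (the Wald/IV estimate) stays close to the assumed true effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Hypothetical data-generating process (illustration only)
u = rng.normal(size=n)                                 # unobserved local credit demand
expose = rng.binomial(1, 0.3, size=n).astype(float)    # instrument: exposure to a merger
# Closings react to exposure (relevance) and to weak local demand (endogeneity)
close = (0.8 * expose - 0.5 * u + rng.normal(size=n) > 0.5).astype(float)
true_beta = -2.0
y = true_beta * close + 1.5 * u + rng.normal(size=n)   # lending activity


def ols(dep, regressor):
    """Return (intercept, slope) of a simple OLS regression."""
    X = np.column_stack([np.ones(len(regressor)), regressor])
    return np.linalg.lstsq(X, dep, rcond=None)[0]


beta_ols = ols(y, close)[1]            # naive estimate, contaminated by u
first_stage = ols(close, expose)[1]    # effect of exposure on closings
reduced_form = ols(y, expose)[1]       # effect of exposure on lending
beta_iv = reduced_form / first_stage   # Wald / IV estimate

print(f"OLS: {beta_ols:.2f}  IV: {beta_iv:.2f}  true: {true_beta:.2f}")
```

In this setup the naive estimate overstates the decline in lending because closings are more likely where demand is weak; the direction of the bias in real data depends on the actual confounding structure.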

Since tracts and counties differ in various characteristics, a concern about heterogeneity across tracts arises. The instrumental variable approach captures time-invariant tract characteristics and year-specific shocks by including individual and time fixed effects. But these fixed effects cannot control for unobserved time-varying tract characteristics. Therefore, the author extends the analysis with a difference-in-differences (DD) approach. Both panel data methods make it possible to account for heterogeneity. The idea is to compare treated and non-treated tracts within a county, while controlling for tract and county-by-year fixed effects. General economic conditions are captured by the county-by-year fixed effects, so that this source of heterogeneity no longer affects the results. According to Wooldridge (2015), the difference-in-differences framework hinges on the parallel trends assumption to evaluate the local average treatment effect (LATE). In this particular framework, exposed and control tracts should evolve in the same way in the absence of a merger (table 3 compares the sample groups).

So far we have discussed the internal validity of this natural experiment; we now turn to the external validity. Nguyen (2019) raises the question “Is the local average treatment effect (LATE) identified from merger-induced closings informative for understanding the impact of branch closings more generally?”. This question concerns the validity of the instrument rather than the comparison of merger sample tracts with other branched tracts. As Nguyen (2019) points out, the LATE is interpreted as the effect of treatment on compliers. 

$$ \text{LATE} = E[Y(1)-Y(0) \mid D(1)=1, D(0)=0] $$

Since it is not possible to identify compliers directly, she applies the procedure by Angrist and Pischke (2009). This procedure describes how to construct complier characteristics based on fractions of always- ($\pi_A$) and never-takers ($\pi_N$). 

$$ \pi_C = 1-\pi_A-\pi_N  $$

The results of the representativeness of compliers are presented in table 5.
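
To make the complier calculation concrete, the following sketch simulates a binary instrument and a binary treatment under monotonicity. All numbers, the type shares and the binary characteristic `x` are hypothetical. It estimates $\pi_A$ as the treated share among non-exposed units, $\pi_N$ as the untreated share among exposed units, and $\pi_C$ as the remainder. The relative likelihood that a complier has the characteristic is computed as the first stage within the subgroup divided by the overall first stage, which is my reading of the Angrist and Pischke (2009) procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical population shares: 10% always-takers, 50% never-takers, 40% compliers
types = rng.choice(["always", "never", "complier"], size=n, p=[0.1, 0.5, 0.4])
z = rng.binomial(1, 0.5, size=n)                       # instrument (exposure)
d = np.where(types == "always", 1,
             np.where(types == "never", 0, z))         # treatment (closing)
# Binary characteristic that is more common among compliers (assumed)
x = rng.binomial(1, np.where(types == "complier", 0.7, 0.4))

pi_a = d[z == 0].mean()        # always-takers: treated without exposure
pi_n = 1 - d[z == 1].mean()    # never-takers: untreated despite exposure
pi_c = 1 - pi_a - pi_n         # compliers


def first_stage(mask):
    """Difference in treatment rates between exposed and non-exposed units."""
    return d[mask & (z == 1)].mean() - d[mask & (z == 0)].mean()


# Relative likelihood that a complier has x = 1, compared with the full sample
ratio = first_stage(x == 1) / first_stage(np.ones(n, dtype=bool))

print(f"pi_A={pi_a:.3f}  pi_N={pi_n:.3f}  pi_C={pi_c:.3f}  ratio={ratio:.2f}")
```

A ratio above one means the characteristic is over-represented among compliers, a ratio below one that it is under-represented.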

The difference in differences framework is presented as the following model, with a year-by-year estimation of treatment effect:

$$y_{icmt} = \alpha_i + (\gamma_t \times \sigma_c) + X_i \beta_t + \sum_{\tau} \delta_{\tau} (D_{mt}^{\tau} \times \text{Expose}_{icm}) + \epsilon_{icmt}$$

Here, $y_{icmt}$ is the outcome of tract $i$ in county $c$, which experienced merger $m$, observed in year $t$. $D_{mt}^{\tau}$ is a dummy variable that equals one if year $t$ is $\tau$ years after merger $m$ is approved. The reduced model is independent of $\tau$, thus:

$$y_{icmt} = \alpha_i + (\gamma_t \times \sigma_c) + X_i \beta_t + \delta_{\text{POST}} (\text{POST}_{mt} \times \text{Closure}_{icm}) + \epsilon_{icmt}$$

This functional form is typical for a difference-in-differences approach. It is less flexible but easy to interpret. The dummy variables $\text{POST}$ and $\text{Closure}$ are interacted, so the interaction term equals one if a tract experienced closings after a merger. The results are presented in tables 6 and 7.
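
The pooled specification can be illustrated on simulated data. The sketch below is a simplified stand-in, not the paper's model: it uses a balanced panel with a single event year, tract and year effects only (no county-by-year effects or controls), and hypothetical parameter values. In this special case the two-way fixed effects coefficient on the interaction term equals the simple 2x2 difference in differences of group means.

```python
import numpy as np

rng = np.random.default_rng(2)
n_tracts, n_years, event_year = 400, 10, 5
delta_post = -1.0                              # hypothetical true effect

alpha = rng.normal(size=n_tracts)              # tract fixed effects
gamma = np.linspace(0.0, 2.0, n_years)         # common year effects (parallel trends)
exposed = rng.random(n_tracts) < 0.5           # treated tracts
post = np.arange(n_years) >= event_year        # post-merger years
treat = np.outer(exposed, post).astype(float)  # POST x exposure interaction

y = (alpha[:, None] + gamma[None, :] + delta_post * treat
     + rng.normal(scale=0.5, size=(n_tracts, n_years)))

# 2x2 difference in differences of group means
did = ((y[exposed][:, post].mean() - y[exposed][:, ~post].mean())
       - (y[~exposed][:, post].mean() - y[~exposed][:, ~post].mean()))


def within(a):
    """Two-way within transformation for a balanced tract-by-year array."""
    return a - a.mean(axis=1, keepdims=True) - a.mean(axis=0, keepdims=True) + a.mean()


# Two-way fixed effects estimate of the interaction coefficient
t_w, y_w = within(treat), within(y)
delta_hat = (t_w * y_w).sum() / (t_w ** 2).sum()

print(f"2x2 DiD: {did:.3f}  TWFE: {delta_hat:.3f}  true: {delta_post:.1f}")
```

With staggered events, unbalanced panels or additional fixed effects, the two numbers no longer coincide exactly, which is why the replication relies on a regression package.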

Coming back to the discussion of the external validity of the identification framework, Nguyen (2019) compares tracts from the merger sample with the average branched tracts in the US. Since the difference-in-differences framework identifies only the effect of merger-induced closings, it is questionable whether the effect is representative. According to Nguyen (2019), merger sample tracts are in general wealthier and have larger banking markets. Therefore, the LATE is likely underestimated. The simulation study later in this paper shows why the reduced form underestimates the true effect. 

In [1]:
%reset -f
%clear
%load_ext autoreload

# preface loading packages required for Python Data Science
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
from patsy import dmatrices
import matplotlib.pyplot as plt
from scipy import stats
import warnings
from auxiliary import *
from linearmodels import PanelOLS
from linearmodels.iv import IV2SLS
#from statsmodels.iolib.summary2 import summary_col
warnings.filterwarnings('ignore')

%matplotlib inline




## Replication of Summary Statistics

Table 1 shows the bank mergers Nguyen (2019) uses for her analysis. I am able to fully replicate this table, which consists of 13 bank mergers. As described before, these are mergers between large banks, which supports the exogeneity of the instrument.

In [2]:
# Table 1: Merger Sample
print('Table 1: Merger Sample\n')
print(tab1().to_string(index=False))

Table 1: Merger Sample

                                   Buyer                                             Target  Year approved
 Manufacturers and Traders Trust Company                                      Allfirst Bank           2003
   Bank of America, National Association                                Fleet National Bank           2004
                      National City Bank                                 The Provident Bank           2004
                            Regions Bank          Union Planters Bank, National Association           2004
                     JPMorgan Chase Bank                     Bank One, National Association           2004
                         North Fork Bank                                    GreenPoint Bank           2004
                           SunTrust Bank                          National Bank of Commerce           2004
     Wachovia Bank, National Association                                    SouthTrust Bank           2004
             

Table 2 summarizes buyer and target characteristics prior to the treatment. Again, the focus is on large banks, and I am able to fully replicate table 2. 

In [3]:
# Table 2: Merger Summary Statistics
print('Table 2: Merger Summary Statistics\n')
print(tab2().to_string(index=False))

Table 2: Merger Summary Statistics

               Variable   Median       Min         Max
           Total assets 81954710  25963401  1252402412
               Branches      696       254        5569
    States of operation        8         1          31
 Countries of operation      182        18         692
           Total assets 25955711  10426963   245783000
               Branches      277        28        1482
    States of operation        6         1          13
 Countries of operation       54         7         202


Table 3 compares the sample groups and provides summary statistics. Concentrating on column 3, exposed (treatment) tracts are similar to other tracts in the sample, while column 5 indicates some significant differences between exposed and control tracts. Nguyen (2019) argues that exposed tracts are wealthier and have larger banking markets than the average US tract, which results in an underestimation of the effect. My replication differs in two respects. First, the Stata file does not include the measures ‘Establishment growth’ and ‘Employment growth’. Second, due to differences between the computation in Stata and the scipy package, some of the p-values differ. One reason could be that I compute the p-values directly from my generated data frame, while Nguyen (2019) estimates them with a fixed effects regression. Nevertheless, the qualitative result still holds: there is a difference between exposed and control tracts.
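
Why a plain two-sample test and a fixed effects regression can give different p-values is sketched below on simulated data. The setup is hypothetical (20 counties, a covariate that differs across counties but not between exposed and control tracts within a county) and is not meant to reproduce table 3. A pooled two-sample t-statistic is identical to the OLS t-statistic on an exposure dummy, but it changes once county dummies are added.

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_counties = 2_000, 20
county = rng.integers(0, n_counties, size=n)

# Hypothetical setup: exposure probability and covariate level both vary by county
exposed = rng.binomial(1, np.linspace(0.1, 0.9, n_counties)[county]).astype(float)
x = np.linspace(-2.0, 2.0, n_counties)[county] + rng.normal(size=n)


def ols_t(dep, X, k):
    """t-statistic of coefficient k with homoskedastic standard errors."""
    beta = np.linalg.lstsq(X, dep, rcond=None)[0]
    resid = dep - X @ beta
    dof = len(dep) - np.linalg.matrix_rank(X)
    cov = resid @ resid / dof * np.linalg.pinv(X.T @ X)
    return beta[k] / np.sqrt(cov[k, k])


# Pooled two-sample t-statistic computed directly from the two groups
a, b = x[exposed == 1], x[exposed == 0]
sp2 = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2)
t_pooled = (a.mean() - b.mean()) / np.sqrt(sp2 * (1 / len(a) + 1 / len(b)))

# Same comparison as a regression on a constant and the exposure dummy
t_ols = ols_t(x, np.column_stack([np.ones(n), exposed]), 1)

# Regression with county dummies (county fixed effects)
county_dummies = (county[:, None] == np.arange(n_counties)).astype(float)
t_fe = ols_t(x, np.column_stack([exposed, county_dummies]), 0)

print(f"pooled t: {t_pooled:.2f}  OLS t: {t_ols:.2f}  with county FE: {t_fe:.2f}")
```

Clustered standard errors, as used in the regressions later in this notebook, would change the test statistic further.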

In [4]:
# Table 3: Summary Statistics for Exposed and Control Tracts
# call tab3() once and unpack its first three return values
df_t, std, index = tab3()[:3]
var=pd.DataFrame(columns=['list'], index=index)
var['list']=['Population density','Population','Median Income','Fraction minority','Fraction college educated','Fraction mortgage','Percent MSA median income','Total branches','Branch growth','SBL originations','Mortgage originations']
print('Table 3: Summary Statistics for Exposed and Control Tracts\n')
print('{:<25s}{:>15s}{:>15s}{:>15s}{:>15s}{:>15s}\n'.format('Variable','Exposed','All other','p-value','Control','p-value'))
for i in index:
    print('{:<25s}{:>15.2f}{:>15.2f}{:>15.3f}{:>15.2f}{:>15.3f}'.format(var.list[i], df_t.Exposed[i], df_t.Allother[i], df_t.pvalue01[i], df_t.Control[i], df_t.pvalue02[i]))
    print('{:<25s}{:>15.2f}{:>15.2f}{:>15s}{:>15.2f}{:>15s}'.format(' ',std.a[i], std.b[i], ' ',std.c[i], ' '))
print('{:<25s}{:>15.0f}{:>15.0f}{:>15.0s}{:>15.0f}{:>15.2s}'.format('Obs',df_t.Exposed['Obs'], df_t.Allother['Obs'], ' ', df_t.Control['Obs'], ' '))

Table 3: Summary Statistics for Exposed and Control Tracts

Variable                         Exposed      All other        p-value        Control        p-value

Population density               2575.41        7206.31          0.000        6105.75          0.000
                                 7925.48       14576.32                      13868.57               
Population                       5761.40        4571.78          0.000        5387.57          0.013
                                 3229.73        2365.53                       2714.44               
Median Income                   44223.77       45451.95          0.304       52171.48          0.000
                                20288.25       23290.21                      24045.71               
Fraction minority                   0.21           0.39          0.000           0.24          0.039
                                    0.23           0.34                          0.24               
Fraction college educated     

Table 4 presents the summary statistics of the merger sample. All variables are fixed to the year 2001, prior to any mergers. Column 1 contains all tracts with bank branch information, while column 2 contains tracts which experienced closings. Finally, column 3 presents the treatment sample containing tracts which experienced a bank merger. I am able to completely replicate table 4.

In [5]:
# Table 4: Representativeness of the Merger Sample
# call tab4() once and unpack its first three return values
df_t, index, std = tab4()[:3]
var=pd.DataFrame(columns=['list'], index=index)
var['list']=['Population density','Population','Median Income','Fraction minority','Fraction college educated','Fraction mortgage','Total branches','Branch growth','SBL originations','Mortgage originations','Percent MSA median income']
print('Table 4: Representativeness of the Merger Sample\n')
print('{:<25s}{:>20s}{:>25s}{:>20s}\n'.format('Variable','All branched tracts','Tracts with closings','Merger sample'))
for i in index:
     print('{:<25s}{:>20.2f}{:>25.2f}{:>20.2f}'.format(var.list[i], df_t.All[i], df_t.Closings[i], df_t.Merger[i]))
     print('{:<25s}{:>20.2f}{:>25.2f}{:>20.2f}'.format(' ',std.a[i], std.b[i], std.c[i]))
print('{:<25s}{:>20.0f}{:>25.0f}{:>20.0f}'.format('Obs',df_t.All['Obs'], df_t.Closings['Obs'], df_t.Merger['Obs'])) 

Table 4: Representativeness of the Merger Sample

Variable                  All branched tracts     Tracts with closings       Merger sample

Population density                    4032.40                  3615.26             6166.24
                                     10052.64                  7348.34            14319.17
Population                            4687.90                  4941.80             5401.22
                                      2193.43                  2430.65             2702.60
Median Income                        44829.35                 45248.80            51699.99
                                     20213.36                 20685.61            23886.71
Fraction minority                        0.20                     0.20                0.23
                                         0.24                     0.22                0.24
Fraction college educated                0.25                     0.27                0.34
                                       

Table 5 shows the complier characteristics using the methodology by Angrist and Pischke (2009). Column 1 contains the proportion of compliers above the sample median, while column 2 presents the complier to sample ratios. According to Nguyen (2019), compliers tend to be fairly representative of the median tract sample. Again, table 5 can fully be replicated using the data provided.

In [6]:
# Table 5: Complier Characteristics
pd.options.display.float_format = '{:.3f}'.format
df_t5=tab5()
index=df_t5.index
var=pd.DataFrame(columns=['list'], index=index)
var['list']=['Population density','Population','Median Income','Fraction minority','Fraction college educated','Fraction mortgage','Percent MSA median income','Total branches','Branch growth','SBL originations','Mortgage originations']
print('Table 5: Complier Characteristics\n')
#print(tab5().to_string(index=False))
print('{:<25s}{:>20s}{:>40s}\n'.format('Variable','Proportion of compliers above the sample median (percent)','Ratio: Compliers to sample'))
for i in index:
    print('{:<25s}{:>35.0f}{:>55.2f}'.format(var.list[i], df_t5.ecomp[i], df_t5.ratio[i]))

Table 5: Complier Characteristics

Variable                 Proportion of compliers above the sample median (percent)              Ratio: Compliers to sample

Population density                                        18                                                   0.37
Population                                                58                                                   1.15
Median Income                                             29                                                   0.58
Fraction minority                                         60                                                   1.21
Fraction college educated                                 47                                                   0.94
Fraction mortgage                                         39                                                   0.78
Percent MSA median income                                 41                                                   0.83
Total branches               

Overall, I am able to replicate the summary statistics using the provided data and Python. Only the calculation of the p-values in table 3 differs. As mentioned above, this can be caused by a different computation/implementation. 

## Replication of the Main Results

In [3]:
# call fig2() once and unpack estimates and standard errors
mean, std = fig2()[:2]
ind=range(-7,9)
plt.axhline(y=0.0, color='lightgrey', linestyle='--')
plt.axvline(x=0.0, color='lightgrey', linestyle='--')
plt.axvline(x=-4.0, color='lightgrey', linestyle='-')
plt.axvline(x=6.0, color='lightgrey', linestyle='-')
plt.errorbar(ind, mean, xerr=0.5, yerr=2*std, linestyle='')
plt.title('Number of branch closings')
plt.show()

TypeError: __init__() got an unexpected keyword argument 'drop_absorbed'

In [None]:
# Table 6: First-Stage and Reduced-Form estimates 
# load and prepare data
# call tab6() once and unpack data, regressor names and index
dftest, exog, index = tab6()[:3]
dftest.to_csv('df_table6.csv')

# estimate column 1 
mod = PanelOLS(dftest.num_closings, dftest[exog], entity_effects=True, time_effects=True, drop_absorbed=True)
reg1=mod.fit(cov_type='clustered', clusters=dftest.clustID)

# estimate column 2 
mod = PanelOLS(dftest.totalbranches, dftest[exog], entity_effects=True, time_effects=True, drop_absorbed=True)
reg2=mod.fit(cov_type='clustered', clusters=dftest.clustID)

# estimate column 3 
mod = PanelOLS(dftest.NumSBL_Rev1, dftest[exog], entity_effects=True, time_effects=True, drop_absorbed=True)
reg3=mod.fit(cov_type='clustered', clusters=dftest.clustID)

# estimate column 4
mod = PanelOLS(dftest.total_origin, dftest[exog], entity_effects=True, time_effects=True, drop_absorbed=True)
reg4=mod.fit(cov_type='clustered', clusters=dftest.clustID)

# compute baseline mean
mean1=np.nanmean(dftest.num_closings)
mean2=np.nanmean(dftest.totalbranches)
mean3=np.nanmean(dftest.NumSBL_Rev1)
mean4=np.nanmean(dftest.total_origin)

regressors=index
delta=pd.DataFrame(columns=['delta'], index=index)
delta['delta']=['<-1','0','1','2','3','4','5','6','>6']
print('Table 6: First-Stage and Reduced-Form Estimates \n')
print('   Number of closings:       Total branches:       SBL originations:     Mortgage originations:\n\n')
for i in index:
    #print(delta.loc[i, 'delta'])
    print('{:<5s}{:>15.4f}{:>20.4f}{:>20.4f}{:>20.4f}'.format(delta.loc[i, 'delta'],reg1.params[i], reg2.params[i], reg3.params[i], reg4.params[i]))
    print('{:<5s}{:>15.4f}{:>20.4f}{:>20.4f}{:>20.4f}'.format(' ',reg1.std_errors[i], reg2.std_errors[i], reg3.std_errors[i], reg4.std_errors[i]))
print('{:<5s}{:>15.4f}{:>20.4f}{:>20.4f}{:>20.4f}'.format('Mean',mean1, mean2, mean3, mean4))
print('{:<5s}{:>15.0f}{:>20.0f}{:>20.0f}{:>20.0f}'.format('Obs',reg1.nobs, reg2.nobs, reg3.nobs, reg4.nobs))

### Table 6
The first two columns of table 6 present the estimates from the first-stage regressions, while the last two columns show the estimates from the reduced-form model. For transparency, the event dummy varies over different time windows indicated by the subscript ($\delta_x$). The idea of this presentation is to show a general trend. For example, before the merger and from approximately six years after it, the estimates for the number of closings are negative. In the year of the merger and the few years after it, the dummy estimates are significantly positive, indicating that merger-induced consolidation takes place. Therefore, one cannot argue that the effect of the event dummy is driven by randomness. 
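
How such event-time dummies can be built is sketched below. The years and the approval year are hypothetical, and treating $\tau=-1$ as the omitted reference period is an assumption inferred from the window labels used in the table code, not something stated in the text.

```python
import numpy as np

# Hypothetical observation years for one tract and a hypothetical approval year
year = np.array([2001, 2003, 2004, 2005, 2012])
approved = 2004
tau = year - approved                      # event time: [-3, -1, 0, 1, 8]

labels = ['<-1', '0', '1', '2', '3', '4', '5', '6', '>6']


def window(t):
    """Map event time to its window label; tau = -1 is the assumed reference."""
    if t < -1:
        return '<-1'
    if t == -1:
        return None                        # omitted category, all dummies zero
    if t > 6:
        return '>6'
    return str(int(t))


# One row per observation, one column per event-time window
D = np.array([[float(window(t) == lab) for lab in labels] for t in tau])
print(D)
```

Each row has at most one non-zero entry, and the row for the reference period is all zeros, so the estimated coefficients are measured relative to that period.
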
Due to several implementation issues, the results differ from those presented by Nguyen (2019). One major issue is that a package for high-dimensional fixed effects regressions is currently not available in the Python environment. The corresponding Stata package is called ‘reghdfe’. Daniel M. Sullivan, the author of the Python package ‘econtools’, is aware of this problem and plans to implement it in the future (<href>http://www.danielmsullivan.com/pages/tutorial_stata_to_python.html</href>).  I tried to implement the fixed effects as a dummy variable regression, which was computationally extremely expensive. As a final solution, I first reindex the data by individual ID and group-year ID. In a second step, I use the ‘linearmodels’ package, which allows for two-way fixed effects. 
The handling of absorbed observations in particular differs between Stata and Python. I exported the dataset from Stata and loaded it into Python to test whether the data differ (I also checked the data by eyeballing). Both datasets create a multicollinearity issue in Python, so the error ‘dependent variable matrix does not have full column rank’ appears. The ‘reghdfe’ package in Stata handles missing values and omitted variables differently. In particular, most regression packages use the Cholesky factorization to calculate the regression estimates. In contrast, ‘reghdfe’ relies on the method of alternating projections (MAP). The general idea is to partial out the fixed effects by projecting each variable alternately onto the spaces associated with each set of fixed effects (that is, demeaning within each group in turn) until convergence, so that the full dummy matrix never has to be built. Therefore, the computation does not produce any errors in Stata (for details see <href>http://scorreia.com/research/hdfe.pdf</href>). As a workaround, I filled missing values with zeros to check whether this resolves the collinearity issue. After this step, the estimation runs and produces estimates.
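
The alternating-projections idea can be sketched in a few lines of NumPy. This is an illustration on simulated data with hypothetical dimensions and names (`tract`, `cell`), not the ‘reghdfe’ implementation: each variable is demeaned by tract and by county-by-year cell in turn until nothing changes, and the slope is then estimated on the residualized variables. A dummy variable regression is run for comparison.

```python
import numpy as np

rng = np.random.default_rng(4)
n, n_tracts, n_cells = 3_000, 150, 40
tract = rng.integers(0, n_tracts, size=n)   # tract id (unbalanced panel)
cell = rng.integers(0, n_cells, size=n)     # county-by-year cell id
tract_fe = rng.normal(size=n_tracts)
cell_fe = rng.normal(size=n_cells)
x = rng.normal(size=n) + 0.5 * tract_fe[tract]   # regressor correlated with tract effect
beta_true = 1.5                                  # hypothetical true slope
y = beta_true * x + tract_fe[tract] + cell_fe[cell] + rng.normal(size=n)


def demean_by(v, ids, n_ids):
    """Subtract group means: one projection step for one set of fixed effects."""
    sums = np.bincount(ids, weights=v, minlength=n_ids)
    counts = np.bincount(ids, minlength=n_ids)
    return v - (sums / np.maximum(counts, 1))[ids]


def absorb(v, fixed_effects, tol=1e-10, max_iter=10_000):
    """Alternate the demeaning steps until the variable stops changing."""
    v = v.astype(float).copy()
    for _ in range(max_iter):
        old = v.copy()
        for ids, n_ids in fixed_effects:
            v = demean_by(v, ids, n_ids)
        if np.max(np.abs(v - old)) < tol:
            break
    return v


fes = [(tract, n_tracts), (cell, n_cells)]
x_t, y_t = absorb(x, fes), absorb(y, fes)
beta_map = (x_t @ y_t) / (x_t @ x_t)

# Comparison: explicit dummy variable regression (least squares handles the
# redundant dummy column; the slope on x is still uniquely determined)
D = np.column_stack([x,
                     tract[:, None] == np.arange(n_tracts),
                     cell[:, None] == np.arange(n_cells)]).astype(float)
beta_dummy = np.linalg.lstsq(D, y, rcond=None)[0][0]

print(f"MAP: {beta_map:.4f}  dummies: {beta_dummy:.4f}  true: {beta_true}")
```

The dummy matrix here has 191 columns; with thousands of tracts it becomes very large, which matches the computational cost described above, while the demeaning steps only need group sums.
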
The resulting estimates differ from those presented in Nguyen (2019), but the interpretation and qualitative results still hold. For example, the signs of the point estimates in the first two columns are identical to the author's findings, indicating that branch consolidation takes place in the years after a merger. Because missing values are filled with zeros, the dataset used for estimation differs slightly from the author's. I suspect this is why the estimates in columns 3 and 4 differ. To show that the results published in the paper correspond to the Stata estimates, I additionally computed them using the Stata ‘reghdfe’ command.

In [None]:
# Figure 3: Exposure to consolidation and local branch levels
# call fig3() once and unpack estimates and standard errors
mean, std = fig3()[:2]
ind=range(-7,9)
plt.axhline(y=0.0, color='lightgrey', linestyle='--')
plt.axvline(x=0.0, color='lightgrey', linestyle='--')
plt.axvline(x=-4.0, color='lightgrey', linestyle='-')
plt.axvline(x=6.0, color='lightgrey', linestyle='-')
plt.errorbar(ind, mean, xerr=0.5, yerr=2*std, linestyle='')
plt.title('Total branches')
plt.show()  

In [5]:
# Figure 4: Exposure to consolidation and the volume of new lending
# call fig4() once and unpack estimates and standard errors for both outcomes
mean1, std1, mean2, std2 = fig4()[:4]
ind=range(-7,9)
plt.figure()
plt.subplot(1,2,1)
plt.axhline(y=0.0, color='lightgrey', linestyle='--')
plt.axvline(x=0.0, color='lightgrey', linestyle='--')
plt.axvline(x=-4.0, color='lightgrey', linestyle='-')
plt.axvline(x=6.0, color='lightgrey', linestyle='-')
plt.errorbar(ind, mean1, xerr=0.5, yerr=2*std1, linestyle='')
plt.title('New Small Business loans')
plt.subplot(1,2,2)
plt.axhline(y=0.0, color='lightgrey', linestyle='--')
plt.axvline(x=0.0, color='lightgrey', linestyle='--')
plt.axvline(x=-4.0, color='lightgrey', linestyle='-')
plt.axvline(x=6.0, color='lightgrey', linestyle='-')
plt.errorbar(ind, mean2, xerr=0.5, yerr=2*std2, linestyle='')
plt.title('New Mortgages')
plt.show() 

TypeError: __init__() got an unexpected keyword argument 'drop_absorbed'

In [None]:
# Figure 5: The effect of subsequent bank entry on local credit supply
plt.figure()

#mean=fig3()[0]
#mean1=fig4()[0]
#mean2=fig4()[2]
#ind=range(-7,9)

## Small Business Lending
plt.subplot(1,2,1)
plt.axhline(y=0.0, color='lightgrey', linestyle='--')
plt.axvline(x=0.0, color='lightgrey', linestyle='--')
plt.axvline(x=-4.0, color='lightgrey', linestyle='-')
plt.axvline(x=6.0, color='lightgrey', linestyle='-')
loans = plt.scatter(ind, mean1)
branches = plt.scatter(ind, mean)
# pass the scatter handles (not the data arrays) to the legend
plt.legend((loans, branches), ('loans', 'totalbranches'),loc='upper center', bbox_to_anchor=(0.5, -0.10))
plt.title('New Small Business loans')
#plt.show() 

## Mortgages
plt.subplot(1,2,2)
plt.axhline(y=0.0, color='lightgrey', linestyle='--')
plt.axvline(x=0.0, color='lightgrey', linestyle='--')
plt.axvline(x=-4.0, color='lightgrey', linestyle='-')
plt.axvline(x=6.0, color='lightgrey', linestyle='-')
loans = plt.scatter(ind, mean2)
branches = plt.scatter(ind, mean)
# pass the scatter handles (not the data arrays) to the legend
plt.legend((loans, branches), ('loans', 'totalbranches'),loc='upper center', bbox_to_anchor=(0.5, -0.10))
plt.title('New Mortgages')
plt.show()

### Figures 2, 3, 4 and 5
As already discussed, the first-stage estimates roughly correspond to those published in Nguyen (2019). Therefore, the development of the estimates over time in figures 2 and 3 can be replicated. Note that the scaling of the estimates differs from the author's, since the ‘eclplot’ command in Stata computes the estimates differently. I compute the first-stage regression estimates and use two times the corresponding standard error to construct the confidence intervals.
The reduced-form results presented in figure 4, on the other hand, differ from those presented in the paper. This is not surprising, since the estimates in columns 3 and 4 of table 6 already differ. Figure 5 uses the data from the previous figures. Hence, the development of ‘total branches’ is in line with Nguyen (2019), while the development of the reduced-form estimates differs again. 
Panel A of figure 5 still provides the same intuition as discussed in the paper. After the merger occurs, the number of bank branches declines, as does commercial lending. Private lending is only affected for a short period. 
In general, the results should be interpreted with caution. Since the confidence bands in figures 2, 3 and 4 often cross the zero line, we cannot infer a significant effect from them. 
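
The band construction used in the figures can be written out explicitly. The estimates and standard errors below are made-up numbers for illustration only; the sketch builds the interval as estimate plus or minus two standard errors and flags the event years whose band excludes zero.

```python
import numpy as np

# Hypothetical event-study estimates and standard errors (illustration only)
event_time = np.arange(-2, 3)
est = np.array([0.01, -0.02, 0.15, 0.30, 0.05])
se = np.array([0.03, 0.04, 0.05, 0.08, 0.06])

# Band of two standard errors around each estimate
lower, upper = est - 2 * se, est + 2 * se

# An estimate is treated as significant only if its band excludes zero
significant = (lower > 0) | (upper < 0)

for t, lo, hi, sig in zip(event_time, lower, upper, significant):
    print(f"tau={t:+d}: [{lo:+.2f}, {hi:+.2f}]  excludes zero: {sig}")
```

A factor of two is a common approximation to the 1.96 critical value of a 95 percent interval under normality.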

In [12]:
# Table 7: IV-Estimates of the effect of closings on local credit supply
# load and prepare data (call tab7() once and unpack)
df, controllist = tab7()[:2]
#df.to_csv('df_table7.csv')
# compute baseline mean
mean1=np.nanmean(df.num_closings)
mean2=np.nanmean(df.totalbranches)
mean3=np.nanmean(df.NumSBL_Rev1)
mean4=np.nanmean(df.total_origin)

In [13]:
## OLS
exog='POST_close'
#exog=controllist
#exog.append('POST_close')
#exog = sm.add_constant(df[exog])
mod = PanelOLS(df.NumSBL_Rev1, df[exog], entity_effects=True, time_effects=True, drop_absorbed=True)
regA1 = mod.fit(cov_type='clustered', clusters=df.clustID)
#print('Panel A. OLS: %2.4f (%2.4f)'  %(regA1.params['POST_close'],regA1.std_errors['POST_close']))
mod = PanelOLS(df.AmtSBL_Rev1, df[exog], entity_effects=True, time_effects=True, drop_absorbed=True)
regA2 = mod.fit(cov_type='clustered', clusters=df.clustID)
#print('Panel A. OLS: %2.4f (%2.4f)'  %(regA2.params['POST_close'],regA2.std_errors['POST_close']))
mod = PanelOLS(df.total_origin, df[exog], entity_effects=True, time_effects=True, drop_absorbed=True)
regA3 = mod.fit(cov_type='clustered', clusters=df.clustID)
#print('Panel A. OLS: %2.4f (%2.4f)'  %(regA3.params['POST_close'],regA3.std_errors['POST_close']))
mod = PanelOLS(df.loan_amount, df[exog], entity_effects=True, time_effects=True, drop_absorbed=True)
regA4 = mod.fit(cov_type='clustered', clusters=df.clustID)
#print('Panel A. OLS: %2.4f (%2.4f)'  %(regA4.params['POST_close'],regA4.std_errors['POST_close']))

In [14]:
## Reduced-form 
exog='POST_expose'
#exog=controllist
#exog.append('POST_expose')
mod = PanelOLS(df.NumSBL_Rev1, df[exog], entity_effects=True, time_effects=True, drop_absorbed=True)
regB1 = mod.fit(cov_type='clustered', clusters=df.clustID)
#print('Panel B. RF: %2.4f (%2.4f)'  %(regB1.params['POST_expose'],regB1.std_errors['POST_expose']))
mod = PanelOLS(df.AmtSBL_Rev1, df[exog], entity_effects=True, time_effects=True, drop_absorbed=True)
regB2 = mod.fit(cov_type='clustered', clusters=df.clustID)
#print('Panel B. RF: %2.4f (%2.4f)'  %(regB2.params['POST_expose'],regB2.std_errors['POST_expose']))
mod = PanelOLS(df.total_origin, df[exog], entity_effects=True, time_effects=True, drop_absorbed=True)
regB3 = mod.fit(cov_type='clustered', clusters=df.clustID)
#print('Panel B. RF: %2.4f (%2.4f)'  %(regB3.params['POST_expose'],regB3.std_errors['POST_expose']))
mod = PanelOLS(df.loan_amount, df[exog], entity_effects=True, time_effects=True, drop_absorbed=True)
regB4 = mod.fit(cov_type='clustered', clusters=df.clustID)
#print('Panel B. RF: %2.4f (%2.4f)'  %(regB4.params['POST_expose'],regB4.std_errors['POST_expose']))

In [15]:
## IV (2SLS without fixed effects and without controls)
#exog=controllist
#contrl="+".join(exog)
#mod = IV2SLS.from_formula('NumSBL_Rev1 ~ [POST_close ~ POST_expose]+'+contrl, df)
mod = IV2SLS.from_formula('NumSBL_Rev1 ~ [POST_close ~ POST_expose]', df)
regC1 = mod.fit()
#print('Panel C. IV: %2.4f (%2.4f)'  %(regC1.params['POST_close'],regC1.std_errors['POST_close']))
mod = IV2SLS.from_formula('AmtSBL_Rev1 ~ [POST_close ~ POST_expose]', df)
regC2 = mod.fit()
#print('Panel C. IV: %2.4f (%2.4f)'  %(regC2.params['POST_close'],regC2.std_errors['POST_close']))
mod = IV2SLS.from_formula('total_origin ~ [POST_close ~ POST_expose]', df)
regC3 = mod.fit()
#print('Panel C. IV: %2.4f (%2.4f)'  %(regC3.params['POST_close'],regC3.std_errors['POST_close']))
mod = IV2SLS.from_formula('loan_amount ~ [POST_close ~ POST_expose]', df)
regC4 = mod.fit()
#print('Panel C. IV: %2.4f (%2.4f)'  %(regC4.params['POST_close'],regC4.std_errors['POST_close']))

In [16]:
print('Table7: IV Estimates of the Effect of Closings on Local Credit Supply\n')

print('Panel A. OLS: %2.4f  %2.4f  %2.4f     %2.4f'  %(regA1.params['POST_close'],regA2.params['POST_close'],regA3.params['POST_close'],regA4.params['POST_close']))
print('              (%2.4f)  (%2.4f)  (%2.4f)  (%2.4f)\n'  %(regA1.std_errors['POST_close'],regA2.std_errors['POST_close'],regA3.std_errors['POST_close'],regA4.std_errors['POST_close']))

print('Panel B.  RF: %2.4f  %2.4f  %2.4f     %2.4f'  %(regB1.params['POST_expose'],regB2.params['POST_expose'],regB3.params['POST_expose'],regB4.params['POST_expose']))
print('              (%2.4f)  (%2.4f) (%2.4f)   (%2.4f)\n'  %(regB1.std_errors['POST_expose'],regB2.std_errors['POST_expose'],regB3.std_errors['POST_expose'],regB4.std_errors['POST_expose']))

print('Panel C.  IV: %2.4f  %2.4f   %2.4f   %2.4f'  %(regC1.params['POST_close'],regC2.params['POST_close'],regC3.params['POST_close'],regC4.params['POST_close']))
print('              (%2.4f)  (%2.4f)  (%2.4f)  (%2.4f)\n'  %(regC1.std_errors['POST_close'],regC2.std_errors['POST_close'],regC3.std_errors['POST_close'],regC4.std_errors['POST_close']))

print('Six years cum: %2.4s    %2.4s    %2.4s    %2.4s \n' %('', '', '', ''))
print('Mean:          %2.4f    %2.4f    %2.4f    %2.4f \n' %(mean1, mean2, mean3, mean4))
print('Obs:           %2.0f     %2.0f     %2.0f      %2.0f \n' %(regB1.nobs, regB2.nobs, regB3.nobs, regB4.nobs))

Table7: IV Estimates of the Effect of Closings on Local Credit Supply

Panel A. OLS: -3.4933  -192.1115  -0.4530     95.0195
              (0.8237)  (55.2547)  (3.9744)  (636.9604)

Panel B.  RF: -4.2428  -301.8764  -0.4534     133.9591
              (1.0348)  (85.5112) (3.5049)   (577.4813)

Panel C.  IV: 152.5802  7017.1862   340.2441   57576.8465
              (3.6203)  (185.0283)  (9.1445)  (1673.3647)

Six years cum:                      

Mean:          0.1181    4.0116    70.3317    196.4293 

Obs:           45864     43723     47210      47198 



### Table 7

Reproducing table 7 raises the same issue as the estimation for table 6. To test whether the results hold, I drop the control variables $X_{it}$. The resulting model looks as follows:

$$y_{icmt} = \alpha_i + (\gamma_t \times \sigma_c) +\delta_{\text{POST}} (\text{POST}_{mt} \times \text{Closure}_{icm}) + \epsilon_{icmt}$$

Note that this model differs from the one the author uses, so the resulting estimates again differ. The signs of the estimates in the first two rows match those Nguyen (2019) presents in table 7, except in the last column, ‘dollar volume of mortgages’. This is not a contradiction, since both the author's and my estimates are insignificant there. The third row presents the instrumental variable estimates. The ‘linearmodels’ package provides 2SLS regressions, but without fixed effects. Again, the estimated model differs from Nguyen (2019):

$$ \text{POST}_{mt} \times \text{Closure}_{icm}=\beta_0 + \delta\,(\text{POST}_{mt} \times \text{Expose}_{icm})+\epsilon_{icmt}\\
y_{icmt}=\beta\,(\widehat{\text{POST}_{mt} \times \text{Closure}_{icm}})+u_{icmt} $$

The estimated effects are significant and positive, in contrast to the author's. Since this specification excludes all control variables and fixed effects, the difference is not surprising: the estimate may now pick up some of the variation that the control variables or fixed effects captured before. As explained for table 6, I additionally computed the results in Stata using the ‘reghdfe’ command.
Again, I am able to replicate the regression results within Stata.
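
One possible workaround for the missing fixed effects in the 2SLS step is to absorb them by demeaning before running the IV regression. The sketch below is not the specification used in this notebook or in the paper. It assumes a balanced panel with unit and period effects (not county-by-year effects), uses simulated data with a true effect of 1.0, and the helper names `within_two_way` and `slope` are hypothetical.

```python
import numpy as np
import pandas as pd

def within_two_way(df, cols, entity, time):
    """Two-way within transformation; exact only for a balanced panel."""
    x = df[cols]
    entity_mean = x.groupby(df[entity]).transform("mean")
    time_mean = x.groupby(df[time]).transform("mean")
    return x - entity_mean - time_mean + x.mean()

def slope(y, x, z):
    """Just-identified IV slope (z'y)/(z'x); passing z = x gives OLS."""
    return float(z @ y) / float(z @ x)

# Hypothetical balanced panel: 500 units, 10 periods, true effect 1.0
rng = np.random.default_rng(0)
n, t = 500, 10
data = pd.DataFrame({"unit": np.repeat(np.arange(n), t),
                     "period": np.tile(np.arange(t), n)})
alpha = rng.normal(size=n)[data["unit"].to_numpy()]      # unit fixed effects
gamma = rng.normal(size=t)[data["period"].to_numpy()]    # period fixed effects
u = rng.normal(size=n * t)                               # shared error -> endogeneity
data["z"] = rng.binomial(1, 0.5, size=n * t).astype(float)  # instrument
data["d"] = 0.5 * data["z"] + alpha + u                  # endogenous regressor
data["y"] = alpha + gamma + 1.0 * data["d"] + u          # outcome

# Absorb both sets of fixed effects, then estimate on the demeaned data
w = within_two_way(data, ["y", "d", "z"], "unit", "period")
y, d, z = (w[c].to_numpy() for c in ["y", "d", "z"])
beta_ols = slope(y, d, d)
beta_iv = slope(y, d, z)
print(f"OLS: {beta_ols:.3f}  IV: {beta_iv:.3f}")
```

In this setup the OLS slope is pushed away from 1.0 by the shared error, while the IV slope should land close to it. Standard errors are deliberately left out, since they would need a degrees-of-freedom correction for the absorbed effects.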

### Conclusion – replication of main results

Overall, replicating the results in the Python environment is difficult given the packages currently available. I am able to run the regressions using the Stata commands the author reports. Using Python, I often faced issues with missing values, collinearity and the implementation of fixed effects models. Surprisingly, the Stata ‘reghdfe’ package seems to be quite robust to such problems. In parallel to Python, I checked the regression models using Julia, R and other Stata regression commands. None of these produced the desired regression output. To check whether the identification framework correctly identifies the effect of interest, I construct a simulation study in the following section.

## Independent contribution - Simulation Study

The general idea of this section is to test the identification framework with a stylized simulated dataset. 

### Simulated Data

To test the identification strategy later, we first need to replicate the data structure. The aim is to reproduce the assumptions behind the author's identification rather than the underlying sample. A major requirement is therefore to build in an endogeneity problem for bank branch closings. This is done by using the same standard normal error term to construct bank branch closings ($D$) and credit supply ($Y$).

To apply the difference in differences method and allow for fixed effects, we need a panel data structure. First, I create an empty data frame with 400 individuals (tracts) observed over the period from 1999 to 2013. Then, I assign 60 percent of the individuals to the treatment sample, indicated by the exposure variable ($Exp$). The exact merger year is assigned randomly to each individual as a draw from a discrete uniform distribution. The merger dummy variable ($M$) equals one in the years after a merger has taken place.

The tract control variables in the sample used for the analysis do not vary much over time. Thus, I ran into multicollinearity between the tract controls and the tract fixed effects. To avoid this issue, I create a time-varying control variable with a random mean and standard deviation for each individual (tract). Furthermore, I assign the 400 individuals to four different groups and create normally distributed random group fixed effects. To account for a general time trend, I define increasing time fixed effects (again normally distributed, with $mean=\frac{year}{2000}$ and $std=0.3$). These effects enter the data as group-by-year fixed effects; specifically, I multiply the group fixed effects by the year fixed effects.

Finally, the bank branch closings variable ($D$) and credit supply ($Y$) are computed as described in the formula below. The true parameters of interest, which are important to keep in mind for the simulation study, are $\beta=0.99$ and $\delta=0.5$.
These parameters are used in the data generating process, and we are later interested in how well each approach recovers them.

$$
D_{itcm} = 0.5 * M_{mt} + 0.2*X_{it} +\epsilon_{itcm} \\
Y_{itcm} = \alpha_i + (\gamma_t \times \sigma_c) + 0.99 * D_{it} + 0.3*X_{it} + \epsilon_{itcm}
$$
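
The data generating process described above could be sketched roughly as follows. This is not the `panel_sample()` function used in this notebook, whose source is not shown here. The function name `simulate_panel`, the distributions of the tract fixed effects and of the control's mean and standard deviation, and the choice to set $M=0$ for unexposed tracts are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def simulate_panel(n_tracts=400, start=1999, end=2013, n_groups=4,
                   share_exposed=0.6, beta=0.99, delta=0.5, seed=123):
    """Hypothetical sketch of the simulated panel (not the author's code)."""
    rng = np.random.default_rng(seed)
    years = np.arange(start, end + 1)
    n_years = len(years)
    iid = np.repeat(np.arange(n_tracts), n_years)   # tract id per row
    t = np.tile(years, n_tracts)                    # year per row

    # 60 percent of tracts are exposed to a merger
    n_exposed = int(round(share_exposed * n_tracts))
    exposed = (np.arange(n_tracts) < n_exposed).astype(int)
    # Merger year: discrete uniform draw per tract
    merger_year = rng.integers(start, end, size=n_tracts)
    Exp = exposed[iid]
    # Assumption: M switches on after the merger and only in exposed tracts
    M = ((t > merger_year[iid]) & (Exp == 1)).astype(int)

    # Time-varying control with tract-specific mean and std (assumed ranges)
    x_mean = rng.uniform(0, 10, size=n_tracts)
    x_std = rng.uniform(0.5, 3, size=n_tracts)
    X = rng.normal(x_mean[iid], x_std[iid])

    # Group-by-year effects: group effect times trending year effect
    group = np.arange(n_tracts) % n_groups
    group_fe = rng.normal(size=n_groups)
    year_fe = rng.normal(years / 2000, 0.3)
    gt_fe = group_fe[group[iid]] * year_fe[t - start]
    alpha = rng.normal(size=n_tracts)[iid]          # tract effects (assumed N(0,1))

    # The same error enters D and Y, which creates the endogeneity
    eps = rng.normal(size=n_tracts * n_years)
    D = delta * M + 0.2 * X + eps
    Y = alpha + gt_fe + beta * D + 0.3 * X + eps
    return pd.DataFrame({"iID": iid, "t": t, "groupID": group[iid],
                         "Exp": Exp, "M": M, "DD": M * Exp,
                         "X": X, "D": D, "Y": Y})

sample = simulate_panel()
print(sample.head())
```

Because the same `eps` enters both `D` and `Y`, any estimator that treats `D` as exogenous should be biased in this sample.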

After reading the generated dataset, I define the difference in differences treatment dummy by multiplying the merger dummy ($M$) with the exposure dummy ($Exp$). Next, we need to set the index in order to apply the fixed effects later using the linearmodels package. The data structure is presented in the table below.

In [3]:
np.random.seed(123)
df=panel_sample()
#df.to_csv('df_sample.csv')

df['DD']=df.M*df.Exp
df['indivID']=df['iID'].copy()
df['gtID']=df['group_timeID'].copy()
df.set_index(['indivID', 'group_timeID'], inplace=True)
df.head()

| indivID | group_timeID | Y | D | M | Exp | X | t | iID | groupID | DD | gtID |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.0 | 0 | -1.75614 | 0.191607 | 0 | 1 | 3.67211 | 1999.0 | 0.0 | 1 | 0 | 0 |
| 0.0 | 1 | 2.30991 | 2.18673 | 0 | 1 | 8.44031 | 2000.0 | 0.0 | 1 | 0 | 1 |
| 0.0 | 2 | -2.81534 | 0.477901 | 0 | 1 | 1.68206 | 2001.0 | 0.0 | 1 | 0 | 2 |
| 0.0 | 3 | -1.34243 | 0.913258 | 1 | 1 | 5.83203 | 2002.0 | 0.0 | 1 | 1 | 3 |
| 0.0 | 4 | 0.263266 | 1.48044 | 1 | 1 | 6.34872 | 2003.0 | 0.0 | 1 | 1 | 4 |


### The naive OLS approach

As motivated above, bank branch closings ($D$) are endogenous to credit supply ($Y$). Therefore, a naive OLS regression should yield biased estimates.

$$
Y_{itc} = \alpha_i + (\gamma_t \times \sigma_c) + \beta* D_{it} + \epsilon_{itc}
$$

In [4]:
mod = PanelOLS(df.Y, df['D'], entity_effects=True, time_effects=True)
res = mod.fit(cov_type='clustered', clusters=df.groupID)
print('The naive approach, which should be biased \n %2.4f \n (%2.4f)' %(res.params['D'],res.std_errors['D']))

The naive approach, which should be biased 
 2.1313 
 (0.0050)
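
The direction of this bias follows from the data generating process. As a sketch, assume that $M$, $X$ and $\epsilon$ are mutually uncorrelated once the fixed effects are partialled out (an assumption about the simulated sample, with all variances taken after that step). The regression of $Y$ on $D$ alone then converges to

$$
\text{plim}\,\hat{\beta}_{OLS} = 0.99 + \frac{\text{Cov}(D,\,0.3X+\epsilon)}{\text{Var}(D)} = 0.99 + \frac{0.06\,\text{Var}(X)+\text{Var}(\epsilon)}{0.25\,\text{Var}(M)+0.04\,\text{Var}(X)+\text{Var}(\epsilon)}
$$

Both terms in the numerator are positive: the first comes from omitting $X$, the second from the shared error term. The naive estimate is therefore biased upward, which is consistent with the value of 2.1313 reported above.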


### The naive IV approach

To solve the endogeneity problem, we apply an instrumental variable approach, instrumenting bank branch closings ($D$) with the exogenous instrument post-merger consolidation ($M$). The predicted values from the first-stage regression do not contain the error term that causes the endogeneity. In the second stage, we regress credit supply ($Y$) on the predicted values to identify the effect of interest. Since we do not control for time-varying tract characteristics, the estimate should be biased. As the output shows, the approach correctly identifies the first-stage relationship ($\delta=0.5$), but the second-stage estimate is biased downward relative to $\beta=0.99$ because the time-varying tract characteristics are omitted.

$$
D_{itm}=\alpha_i + \gamma_t + \delta* M_{mt}+\epsilon_{it}\\
Y_{itcm}=\alpha_i + (\gamma_t \times \sigma_c) + \beta*\hat{D}_{itm}+\epsilon_{ictm}
$$

In [5]:
df['indivID']=df['iID'].copy()
df['year']=df['t'].copy()
df.set_index(['indivID', 'year'], inplace=True)
mod1 = PanelOLS(df.D, df.M,entity_effects=True, time_effects=True)
res1 = mod1.fit(cov_type='clustered', clusters=df.groupID)
df['predicted']=res1.predict()
df['indivID']=df['iID'].copy()
df['group_timeID']=df['gtID'].copy()
df.set_index(['indivID', 'group_timeID'], inplace=True)
mod2 = PanelOLS(df.Y, df.predicted, entity_effects=True, time_effects=True)
res2 = mod2.fit(cov_type='clustered', clusters=df.groupID)
print('The naive IV approach without tract controls \n First stage: \n %2.4f \n (%2.4f) \n Second stage: \n %2.4f \n (%2.4f)' %(res1.params['M'],res1.std_errors['M'],res2.params['predicted'],res2.std_errors['predicted']))

The naive IV approach without tract controls 
 First stage: 
 0.4983 
 (0.0244) 
 Second stage: 
 0.8246 
 (0.1433)


### The reduced form difference in differences approach

This approach measures the effect of merger-induced bank branch closings. Hence, it is likely to underestimate the true effect of bank branch closings. Again, we do not control for time-varying tract characteristics. As the result suggests, there are two drivers of the underestimation. First, we condition on merger-induced closings, which represent only a subset of all closings. Second, we fail to control for time-varying individual characteristics ($X_{it}$).

$$
Y_{itcm}=\alpha_i + (\gamma_t \times \sigma_c) + \delta*( M_{mt}\times Exp_{icm})+\epsilon_{ictm}
$$

In [6]:
mod = PanelOLS(df.Y, df['DD'], entity_effects=True, time_effects=True)
res = mod.fit(cov_type='clustered', clusters=df.groupID)
print('The reduced form (DD) with Exposure to merger as instrument \n %2.4f \n (%2.4f)' %(res.params['DD'],res.std_errors['DD']))

The reduced form (DD) with Exposure to merger as instrument 
 0.4109 
 (0.0714)


### The reduced form difference in differences approach with controls
This approach is used by Nguyen (2019) to estimate the effect of merger-induced bank branch closings on local credit supply. The author is aware that she is likely to underestimate the true effect by conditioning on a subset of closings. The result comes close to the true effect of merger-induced closings ($\delta=0.5$).


$$
Y_{itcm}=\alpha_i + (\gamma_t \times \sigma_c) + \delta*( M_{mt}\times Exp_{icm})+\phi X_{it}+\epsilon_{ictm}
$$

In [11]:
mod = PanelOLS(df.Y, df[['DD','X']], entity_effects=True, time_effects=True)
res = mod.fit(cov_type='clustered', clusters=df.groupID)
print('The reduced form (DD) with Exposure to merger as instrument and tract controls \n %2.4f \n (%2.4f)\n' %(res.params['DD'],res.std_errors['DD']))
print('Note: The DD framework identifies only merger induced effects of branch closings on credit supply. Thus, the true LATE (0.99)\nis underestimated as in the indentification section discussed.\nThe true effect is 0.5 thus the authors framework should yield reliable results  \n')

The reduced form (DD) with Exposure to merger as instrument and tract controls 
 0.4897 
 (0.0394)

Note: The DD framework identifies only merger induced effects of branch closings on credit supply. Thus, the true LATE (0.99)
is underestimated as in the indentification section discussed.
The true effect is 0.5 thus the authors framework should yield reliable results  



### IV approach with controls

Nguyen (2019) finally estimates the effect of interest using an instrumental variable framework with tract controls as well as individual and group-by-year fixed effects. This approach comes close to the true effect of bank branch closings ($0.99$).

$$
D_{it}=\alpha_i + \gamma_t + \phi X_{it} + \delta* {DD}_{it}\\
Y_{itcm}=\alpha_i + (\gamma_t \times \sigma_c) + \beta*\hat{D}_{it} + \phi X_{it} +\epsilon_{ictm}
$$

In [8]:
df['indivID']=df['iID'].copy()
df['year']=df['t'].copy()
df.set_index(['indivID', 'year'], inplace=True)
mod1 = PanelOLS(df.D, df[['DD','X']],entity_effects=True, time_effects=True)
res1 = mod1.fit(cov_type='clustered', clusters=df.groupID)
df['predicted']=res1.predict()
df['indivID']=df['iID'].copy()
df['group_timeID']=df['gtID'].copy()
df.set_index(['indivID', 'group_timeID'], inplace=True)
mod2 = PanelOLS(df.Y, df[['predicted','X']], entity_effects=True, time_effects=True)
res2 = mod2.fit(cov_type='clustered', clusters=df.groupID)
print('The authors IV approach, which includes tract controls \n First stage: \n %2.4f \n (%2.4f) \n Second stage: \n %2.4f \n (%2.4f)' %(res1.params['DD'],res1.std_errors['DD'],res2.params['predicted'],res2.std_errors['predicted']))

The authors IV approach, which includes tract controls 
 First stage: 
 0.5066 
 (0.0178) 
 Second stage: 
 0.9667 
 (0.0777)
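
As a plausibility check, with a single instrument the IV coefficient equals the reduced-form coefficient divided by the first-stage coefficient, provided all equations use the same controls and fixed effects. The sketch below applies this to the rounded estimates printed in this notebook. Since the first stage here is indexed by year rather than by the group-by-year identifier, the identity is only expected to hold approximately.

```python
# Rounded estimates copied from the outputs printed in this notebook
reduced_form = 0.4897   # DD coefficient, reduced form with tract controls
first_stage = 0.5066    # DD coefficient, first stage with tract controls
second_stage = 0.9667   # coefficient on the predicted closings

# Indirect least squares (Wald) ratio
wald = reduced_form / first_stage
print(f"{wald:.3f}")
```

The ratio agrees with the reported second-stage coefficient up to rounding. Note that the standard errors from a manually run second stage treat the fitted values as ordinary data, so they are generally not the correct 2SLS standard errors.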


### Conclusion of the Simulation Study

Simulation studies help to deepen the understanding of the underlying data structure and assumptions. With the data generating process known, one can test different estimation setups and identify the best approach. This simulation study shows the strengths and weaknesses of the different approaches applied in Nguyen (2019). As she discusses, the reduced form systematically underestimates the true effect of interest, while the instrumental variable approach yields reliable results. Thus, the identification strategy seems to be appropriate.


## References

*Akerlof, G. A. (1970). The market for "lemons": Quality uncertainty and the market mechanism. The Quarterly Journal of Economics, 84(3), 488-500.*

*Angrist, J. D., & Pischke, J. S. (2008). Mostly harmless econometrics: An empiricist's companion. Princeton university press.*

*Frölich, M., & Sperlich, S. (2019). Impact evaluation. Cambridge University Press.*

*Nguyen, H. L. Q. (2019). Are credit markets still local? Evidence from bank branch closings. American Economic Journal: Applied Economics, 11(1), 1-32.*

*Stiglitz, J. E., & Weiss, A. (1981). Credit rationing in markets with imperfect information. The American economic review, 71(3), 393-410.*

*Wooldridge, J. M. (2015). Introductory econometrics: A modern approach. Nelson Education.*