# Sparse network asymptotics for logistic regression under possible misspecification: empirical illustration
_September 2023_    
Bryan S. Graham, UC - Berkeley, bgraham@econ.berkeley.edu

This iPython Jupyter notebook reproduces the short empirical illustration in the revised version of my paper "Sparse network asymptotics for logistic regression under possible misspecification." This illustration uses information on firm-bank relationships drawn from the Dealscan dataset. The exact extract used in the paper corresponds to syndicated loan deals made in the first six months of 2003, as collected by Jiawei Chen and Kejun Song (2013) for their paper "Two-sided matching in the loan market" in the _International Journal of Industrial Organization_. I am grateful to Jiawei for kindly sharing his estimation sample with me.

The scripts below were written for Python 3.6. The Anaconda distribution of Python, available at https://www.continuum.io/downloads, comes bundled with all the scientific computing packages used in this notebook (except for the _geopy_ package which users can install with conda). The notebook additionally uses the _ipt_ and _netrics_ modules that I have created. They are available on my GitHub page (https://github.com/bryangraham).

Please feel free to use and modify the material in this notebook for your own research purposes. All I ask is that you cite both the underlying research and this codebase (see below for a suggested citation). If you find any errors in what follows I would be happy to hear about them. While I am not able to provide meaningful support for potential users of this code, I am willing to answer questions when/if I have the bandwidth to do so.

**References:**

Chen, Jiawei and Kejun Song. (2013). "Two-sided matching in the loan market," _International Journal of Industrial Organization_ 31 (2): 145 - 152.   

Graham, Bryan S. (2020). "Sparse network asymptotics for logistic regression under possible misspecification," _CEMMAP Working Paper CWP51/20_ (Revised 2024).   

**Suggested code citation:**  

Graham, Bryan S. (2022). "Sparse network asymptotics for logistic regression under possible misspecification: empirical illustration Python Jupyter notebook," (Version 1.0) [Computer program]. Available at http://bryangraham.github.io/econometrics/ (Accessed 20 September 2022).

This first block of code loads the main modules used below. The _geopy_ module is used to compute the distance between firm and bank headquarters from latitude and longitude coordinates.

In [1]:
# Direct Python to plot all figures inline (i.e., not in a separate window)
%matplotlib inline

# Load libraries
import time
import pickle
import numpy as np
import scipy as sp
import scipy.stats  # make sure sp.stats (used in the Monte Carlo below) is loaded
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
import geopy.distance

# Directory where the raw bank-firm loan data file is located
data =  '/Users/bgraham/Dropbox/Sites/software/netrics/NotForRepo/matching_dataset/'
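
As a rough illustration of the kind of calculation _geopy_ performs, the sketch below computes a great-circle (haversine) distance in thousands of kilometers. It is only an approximation: it treats the Earth as a sphere with an assumed mean radius of 6,371 km, whereas `geopy.distance.geodesic` uses an ellipsoidal model, so the two will differ slightly. The function name and the example coordinates are hypothetical and are not taken from the dataset.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius (spherical approximation)

def haversine_thousand_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in thousands of km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a)) / 1000

# Illustrative coordinates only (roughly New York and San Francisco)
d = haversine_thousand_km(40.71, -74.01, 37.77, -122.42)
log_d = math.log(d)  # the regressor used below is the log of this kind of distance
print(d, log_d)
```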

Next, load the netrics package. The Python 2.7 version of this package is registered on PyPI, with a GitHub repository at https://github.com/bryangraham/netrics_py27. For an informal introduction to the package see this [blog post](https://bryangraham.github.io/econometrics/networks/2016/09/15/netrics-module.html). The blog post also includes links to additional resources. The Python 3.6 version of the package is still under development, but currently available code can be found on GitHub at https://github.com/bryangraham/netrics. Once a basic level of functionality and reliability is in place I will register it on PyPI. The netrics package uses some functionality from the ipt package, so the latter is loaded as well. You can read more about the ipt package at this [blog post](https://bryangraham.github.io/econometrics/causal/inference/2016/05/15/IPT-module.html).

In [2]:
# Append location of netrics and ipt modules base directory to system path (you can get these off GitHub)
# NOTE: only required if permanent install not made (see comments above)
import sys
sys.path.append('/Users/bgraham/Dropbox/Sites/software/ipt/')
sys.path.append('/Users/bgraham/Dropbox/Sites/software/netrics/')

# Load ipt and netrics modules
import ipt as ipt
import netrics as netrics

I only use the last six months of data in the Chen and Song (2013) dataset. This corresponds to a sample of syndicated loan deals made during the first six months of 2003.

In [3]:
# Read in last period of bank-to-firm loan data (first six months of 2003)
bank_firm = pd.read_excel(data + '200306.xls', sheet_name='0306', header = 0)

Extract the index sets of firms and banks in the sample. These index sets are used to construct the estimation sample below.

In [6]:
firms = set(bank_firm.firmid.unique())
banks = set(bank_firm.bankid.unique())

The results reported in the paper use firms' total assets (in billions of dollars), banks' total assets (also in billions of dollars), and firm and bank locations. Using these data I construct a dyadic-level dataset with a record for each possible bank-firm pair. Each record includes the log of the bank's total assets, the log of the firm's total assets, the interaction of these two variables, and the log of the distance between the bank and firm headquarters in thousands of kilometers. This next block of code constructs this dataset.

In [7]:
data = []

# Sets to store indices of firms and banks with missing data
firms_inr = set()
banks_inr = set()

# Construct NxM dyadic dataset
for i in firms:
    for j in banks:
        
        # Indicator for when firm i borrowed from bank j 
        deal_ij = len(bank_firm.loc[(bank_firm['firmid'] == i) & (bank_firm['bankid'] == j)])
        
        # Total firm assets in billions of dollars
        firm_size_i = bank_firm.loc[(bank_firm['firmid'] == i)].iloc[0]['opacity3']/1000
                
        # Firm location
        firm_lat_i = bank_firm.loc[(bank_firm['firmid'] == i)].iloc[0]['latfirm3']
        firm_lon_i = bank_firm.loc[(bank_firm['firmid'] == i)].iloc[0]['longfirm3']
        
        # Total bank assets in billions of dollars
        bank_size_j = bank_firm.loc[(bank_firm['bankid'] == j)].iloc[0]['size']/1000000
                
        # Bank location
        bank_lat_j = bank_firm.loc[(bank_firm['bankid'] == j)].iloc[0]['latbankapproximate']
        bank_lon_j = bank_firm.loc[(bank_firm['bankid'] == j)].iloc[0]['longbank']
        
        # Distance between bank and firm in thousands of kilometers
        try:
            distance_ij = geopy.distance.geodesic((firm_lat_i, firm_lon_i), (bank_lat_j,bank_lon_j)).km/1000    
        except Exception:
            # Distance is set to missing when coordinates are unavailable or invalid
            distance_ij = np.nan
                
        data.append(pd.DataFrame({'deal' : deal_ij, \
                                  'firm_size' : np.log(firm_size_i), 'bank_size' : np.log(bank_size_j), \
                                  'bank_size_X_firm_size' : np.log(firm_size_i)*np.log(bank_size_j), \
                                  'distance' : np.log(distance_ij)}, index=[(i, j)]))

        # Check for item non-response
        if (np.isnan(firm_size_i) | np.isnan(firm_lat_i) | np.isnan(firm_lon_i)):
            firms_inr.add(i)
        if (np.isnan(bank_size_j) | np.isnan(bank_lat_j) | np.isnan(bank_lon_j)):
            banks_inr.add(j)    
        
Z = pd.concat(data)        
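
The loop above filters `bank_firm` several times for every firm-bank pair. As a hedged alternative, the sketch below builds the same kind of dyadic dataset on toy data by broadcasting firm-level and bank-level covariates over the product of the two index sets. The tables `firm_tbl`, `bank_tbl`, the list `loans`, and all values are hypothetical stand-ins, not the Dealscan extract, and distance is omitted for brevity.

```python
import numpy as np
import pandas as pd

# Toy firm- and bank-level tables (hypothetical values, not from Dealscan)
firm_tbl = pd.DataFrame({"firm_size": [2.0, 0.5, 10.0]}, index=pd.Index([1, 2, 3], name="i"))
bank_tbl = pd.DataFrame({"bank_size": [100.0, 20.0]}, index=pd.Index([1, 2], name="j"))

# Observed loans as (firm, bank) pairs
loans = [(1, 1), (2, 1), (3, 2)]

# One record for every possible firm-bank pair
dyads = pd.MultiIndex.from_product([firm_tbl.index, bank_tbl.index], names=["i", "j"])
Z_toy = pd.DataFrame(index=dyads)

# Broadcast logged firm- and bank-level covariates to the dyad level
Z_toy["firm_size"] = np.log(firm_tbl["firm_size"]).reindex(dyads.get_level_values("i")).to_numpy()
Z_toy["bank_size"] = np.log(bank_tbl["bank_size"]).reindex(dyads.get_level_values("j")).to_numpy()
Z_toy["bank_size_X_firm_size"] = Z_toy["firm_size"] * Z_toy["bank_size"]

# Link indicator: 1 if the pair appears among the observed loans, 0 otherwise
Z_toy["deal"] = dyads.isin(loans).astype(int)
print(Z_toy)
```

Because the result is indexed by firm first and bank second, it has the same layout as the multi-indexed dataframe used in the rest of the notebook.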

Next I add a multi-index to the dataframe; this is required by the _netrics_ module's _bilogit_ command. The first few rows of the dataframe are displayed below. The _i_ index is for firms, the _j_ index for banks.

In [9]:
# Set up multi-index with firm (i) as the level zero index and bank (j) as the level one index 
index = pd.MultiIndex.from_tuples(Z.index, names=["i", "j"])
Z.set_index(index, inplace=True)
Z[0:5]

i,j,deal,firm_size,bank_size,bank_size_X_firm_size,distance
1,1,0,3.426573,1.590567,5.450193,-0.669361
1,2,0,3.426573,6.486161,22.225301,-0.531825
1,3,0,3.426573,4.569543,15.657871,-0.675604
1,4,0,3.426573,3.325036,11.393478,1.292629
1,5,0,3.426573,1.775735,6.084685,-0.670627


Next I drop any firms and banks with missing regressor data.

In [10]:
# Drop firms with missing values for any regressors
for i in firms_inr:
    Z.drop(i, level=0, axis=0, inplace=True)

# Drop banks with missing values for any regressors
for j in banks_inr:
    Z.drop(j, level=1, axis=0, inplace=True)

print("Estimation sample")    
print("{:>2,.0f}".format(len(firms_inr)) + " out of "+ "{:>2,.0f}".format(len(firms)) + " firms dropped from estimation sample.") 
print("{:>2,.0f}".format(len(banks_inr)) + " out of "+ "{:>2,.0f}".format(len(banks)) + " banks dropped from estimation sample.")    

Estimation sample
37 out of 388 firms dropped from estimation sample.
 0 out of 39 banks dropped from estimation sample.


The final sample of complete cases includes $N=351$ firms and $M=39$ banks. To avoid having firm and bank indices in the dataframe index with no records (which the _bilogit_ command can't handle), I reindex the dataframe.

In [11]:
# Reset the multi-index to account for dropped banks and firms
Z['dyad'] = Z.index.values
index = pd.MultiIndex.from_tuples(Z.index, names=["i", "j"])
Z.set_index(index, inplace=True)
Z.drop('dyad', axis=1, inplace=True)
Z[0:5]

i,j,deal,firm_size,bank_size,bank_size_X_firm_size,distance
1,1,0,3.426573,1.590567,5.450193,-0.669361
1,2,0,3.426573,6.486161,22.225301,-0.531825
1,3,0,3.426573,4.569543,15.657871,-0.675604
1,4,0,3.426573,3.325036,11.393478,1.292629
1,5,0,3.426573,1.775735,6.084685,-0.670627
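
A pandas `MultiIndex` can keep labels in its `levels` even after every row carrying them has been dropped. The toy sketch below shows one possible way to clear such stale labels with `remove_unused_levels()`. It is an illustration on made-up data and is not necessarily what _bilogit_ requires or what the cell above relies on.

```python
import pandas as pd

# Toy dyadic frame: 3 firms x 2 banks (hypothetical data)
idx = pd.MultiIndex.from_product([[1, 2, 3], [1, 2]], names=["i", "j"])
Z_toy = pd.DataFrame({"deal": [0, 1, 0, 0, 1, 0]}, index=idx)

# Drop every dyad involving firm 2
Z_toy = Z_toy.drop(2, level="i")

# The level labels may still list the dropped firm at this point
levels_before = list(Z_toy.index.levels[0])

# Rebuild the index so that only labels actually in use remain
Z_toy.index = Z_toy.index.remove_unused_levels()
levels_after = list(Z_toy.index.levels[0])

print(levels_before, levels_after)
```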


The final dataset includes $NM=351 \times 39=13,689$ dyads. Only 351 of the 13,689 possible loans are present in the dataset. In this sample each firm borrows just once, but some banks lend to multiple firms. In Chen and Song's full dataset firms sometimes borrow from more than one bank in a given six-month interval, but this occurs only rarely (and not at all in the small extract used here). Some basic summary statistics for my estimation sample are included below.

In [13]:
Z.describe()

statistic,deal,firm_size,bank_size,bank_size_X_firm_size,distance
count,13689.0,13689.0,13689.0,13689.0,13689.0
mean,0.025641,-0.034867,2.517779,-0.087787,0.32687
std,0.158068,1.986805,2.695108,7.328194,1.205655
min,0.0,-6.115668,-4.728808,-39.722889,-16.014247
25%,0.0,-1.288772,1.39582,-3.809633,-0.08344
50%,0.0,-0.100405,3.218876,-0.125462,0.514116
75%,0.0,1.285084,4.168214,3.495297,1.136673
max,1.0,4.776229,6.495266,31.022873,2.103295
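
The mean and standard deviation of `deal` in the summary statistics above follow directly from the sample dimensions. The short check below recomputes them from the counts reported in this notebook (351 firms, 39 banks, 351 loans). The variable names are mine, and the standard deviation uses the same `n - 1` divisor that pandas applies by default.

```python
import math

N_firms, M_banks, n_loans = 351, 39, 351

n_dyads = N_firms * M_banks          # number of possible firm-bank pairs
density = n_loans / n_dyads          # share of dyads with a loan (mean of `deal`)

# Sample standard deviation of a binary variable with mean `density`
sd = math.sqrt(density * (1 - density) * n_dyads / (n_dyads - 1))

print(n_dyads, round(density, 6), round(sd, 6))
```

Since each firm borrows exactly once, the density equals one over the number of banks.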


Finally I fit the logistic regression model using the _bilogit_ command. Basic syntax for the command is given below. This is not commercial-grade software, but it should meet many researchers' needs (perhaps after modification). While I am not able to provide direct support for this code, I very much welcome comments, suggestions and corrections. Feel free to modify and/or improve the code for your own purposes.

In [14]:
help(netrics.bilogit)

Help on function bilogit in module netrics.bilogit:

bilogit(Y, R, nocons=False, silent=False, cov='DR_bc')
    AUTHOR: Bryan S. Graham, UC - Berkeley, bgraham@econ.berkeley.edu, September 2022
            (revised September 2023)
    PYTHON 3.6
    
    This function computes bipartite logit regression estimates 
        
    N = number of "customers"
    M = number of "products"
    n = NM, dumber of dyads 
    
    
    INPUTS:
    -------
    Y              :  n-length Pandas series with outcome for each
                      dyad as elements  
    R              :  n x K Pandas dataframe / regressor matrix
    nocons         :  if True do NOT append a constant to the regressor
                      matrix (default is to append a constant)
    silent         :  if True suppress optimization and estimation output
    cov            :  covariance matrix estimator
                      'jack-knife', 'dense', 'sparse' are allowable choices (see below)
    
    
    The three variance-c

The logit output shown below appears in **Table 3** in Section 4 of the paper.

In [15]:
# Get outcome variable and drop from regressor matrix
Y = Z['deal'].copy(deep=True)
Z.drop('deal', axis=1, inplace=True)
Z['constant'] = 1

[theta_BL, vcov_theta_BL] = netrics.bilogit(Y, Z, nocons=True, silent=False, cov='dense')
[theta_BL, vcov_theta_BL] = netrics.bilogit(Y, Z, nocons=True, silent=False, cov='sparse')

Fisher-Scoring Derivative check (2-norm): 0.00086652
Value of -logL = 6041.447147,  2-norm of score = 4989.097462
Value of -logL = 2876.176500,  2-norm of score = 2499.560927
Value of -logL = 1932.568338,  2-norm of score = 921.749430
Value of -logL = 1695.945578,  2-norm of score = 449.079220
Value of -logL = 1493.703937,  2-norm of score = 113.048469
Value of -logL = 1420.946949,  2-norm of score = 142.691991
Value of -logL = 1420.613298,  2-norm of score = 56.857676
Value of -logL = 1409.164145,  2-norm of score = 34.901422
Value of -logL = 1399.768941,  2-norm of score = 27.141524
Value of -logL = 1399.747610,  2-norm of score = 4.797378
Value of -logL = 1399.569893,  2-norm of score = 3.103135
Value of -logL = 1399.435284,  2-norm of score = 0.502672
Value of -logL = 1399.435276,  2-norm of score = 0.068365
Value of -logL = 1399.435261,  2-norm of score = 0.060953
Value of -logL = 1399.435234,  2-norm of score = 0.044987
Value of -logL = 1399.435204,  2-norm of score = 0.007024
Va

The estimation results are consistent with Chen and Song's (2013) findings of positive assortative matching by size among banks and firms and also with the importance of spatial proximity. See their paper for further discussion of the economic and policy implications of these findings. Note that, consistent with the sparse asymptotic approximation developed in the paper, the "dense" standard errors are smaller than the "sparse" ones, particularly for the dyadic regressors (like the log-distance between firm _i_ and bank _j_). This suggests that the extra variance components captured in the sparse approximation are meaningfully large in this particular sample. The estimated standard error for the distance variable is $0.042/0.026 \approx 1.6$ times larger under the assumption of "sparseness" than under "denseness".
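
To make the comparison concrete, the sketch below forms 95 percent Wald intervals for the log-distance coefficient under each standard error. It hard-codes the rounded numbers reported in this notebook (point estimate -0.16629, standard errors 0.026 and 0.042) rather than calling _bilogit_, so the results are only as precise as that rounding.

```python
theta_hat = -0.16629                      # log-distance coefficient reported in this notebook
se = {"dense": 0.026, "sparse": 0.042}    # rounded standard errors quoted in the text
z = 1.96                                  # normal critical value for a 95% interval

# Wald interval: estimate +/- z * standard error
ci = {name: (theta_hat - z * s, theta_hat + z * s) for name, s in se.items()}
ratio = se["sparse"] / se["dense"]

for name, (lo, hi) in ci.items():
    print(name, round(lo, 4), round(hi, 4))
print("sparse/dense SE ratio:", round(ratio, 3))
```

Both intervals lie below zero, but the sparse interval is roughly 1.6 times as wide as the dense one.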

The next block of code calculates and prints the degree sequence for banks (the number of loans made by each bank), as well as the inter-quartile range of the sequence. This is useful for calibrating the Monte Carlo data generating process (DGP) below. In particular, the $\rho$ parameter in the Monte Carlo DGP described in Section 4 of the paper was chosen to match the inter-quartile range of the degree sequence.

In [16]:
bank_degree = Y.groupby(level=1).sum()
print(bank_degree) 
print(bank_degree.quantile(0.75)-bank_degree.quantile(0.25)) 

j
1       1
2     155
3      10
4       2
5       1
6      13
7       1
8       8
9       1
10      6
11      1
12      3
13      3
14      1
15      1
16      3
17      2
18     10
19      1
20     12
21      3
22      4
23      8
24     25
25     11
26      2
27      2
28      5
29      1
30      3
31     14
32      2
33     11
34     15
35      1
36      6
37      1
38      1
39      1
Name: deal, dtype: int64
8.0
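
The inter-quartile range can be reproduced directly from the printed degree sequence. The sketch below copies the 39 bank degrees shown above into a NumPy array and uses NumPy's default linear interpolation for quantiles, which matches the pandas default.

```python
import numpy as np

# Bank degree sequence copied from the output above (banks 1 through 39)
deg = np.array([1, 155, 10, 2, 1, 13, 1, 8, 1, 6, 1, 3, 3, 1, 1, 3, 2, 10, 1, 12,
                3, 4, 8, 25, 11, 2, 2, 5, 1, 3, 14, 2, 11, 15, 1, 6, 1, 1, 1])

total_loans = deg.sum()                               # should equal the number of loans
q25, q75 = np.quantile(deg, [0.25, 0.75])
iqr = q75 - q25

print(total_loans, q25, q75, iqr)
```

The degrees sum to 351, one loan per firm, and a single bank (bank 2) accounts for 155 of them, which is why a spread measure such as the inter-quartile range is more informative here than the mean.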


This next block of code executes the empirically calibrated Monte Carlo experiment also reported in Section 4 of the paper. In the empirical Monte Carlo the conditional link probability is assumed to take the logit form (i.e., "correct specification" is maintained). The values for the intercept and slope coefficients in this probability are simply those estimated above. The key to calibrating the simulation DGP to the dataset is to construct a latent dyad-level utility shifter that is logistically distributed marginally but that also exhibits the dyadic dependence structure in the paper. I do this by exploiting the reproductive property of the gamma distribution (a sum of independent gamma random variables with a common scale parameter is itself gamma distributed). The key cross-dyad dependence parameter, $\rho$ in the paper, is chosen to match the inter-quartile range of the bank degree-sequence.
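
The construction can be summarized as follows. The notation $G_{ij}$ for the gamma sum and $F$ for its distribution function is mine, introduced only for this summary; shapes and the common rate $\beta$ follow the code. Let $A_i \sim \text{Gamma}(\alpha_a, \beta)$, $B_j \sim \text{Gamma}(\alpha_b, \beta)$ and $V_{ij} \sim \text{Gamma}(\alpha_v, \beta)$ be mutually independent. Then

$$G_{ij} = A_i + B_j + V_{ij} \sim \text{Gamma}(\alpha_a + \alpha_b + \alpha_v, \beta),$$

$$U^*_{ij} = F(G_{ij}) \sim \text{Uniform}(0,1), \qquad U_{ij} = \ln\left(\frac{U^*_{ij}}{1-U^*_{ij}}\right) \sim \text{Logistic}(0,1).$$

Two dyads that share firm $i$ share the component $A_i$, so their gamma sums have covariance $\text{Var}(A_i) = \alpha_a/\beta^2$ and correlation

$$\text{Corr}(G_{ij}, G_{ik}) = \frac{\alpha_a}{\alpha_a + \alpha_b + \alpha_v}, \qquad j \neq k,$$

with the analogous expression (replacing $\alpha_a$ by $\alpha_b$ in the numerator) for two dyads sharing a bank. With $\alpha_a = \alpha_b = 1/2$, the three values $\alpha_v = 34, 19, 4$ used below give correlations of $1/70$, $1/40$ and $1/10$ respectively. These are correlations of the latent gamma sums before the transformation to the logistic scale; I do not assume they coincide exactly with the $\rho$ reported in the paper.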

In [34]:
# Values for alpha_v (shape parameter of the dyad-specific gamma component V), one per design
designs = [34, 19, 4]
NumDesigns = len(designs)

# Number of Monte Carlo simulations
S = 5000            

#----------------------------------------------------#
#- CORE FEATURES OF MONTE CARLO DESIGNS             -#
#----------------------------------------------------#
    
n = 390                 # Sample size, same as in empirical illustration
N = 351                 # Number of firms
M = 39                  # Number of banks

# Initialize matrices for storage of Monte Carlo results
theta_hat = np.zeros((S,NumDesigns))
coverage  = np.zeros((S,NumDesigns*2))
se_hat    = np.zeros((S,NumDesigns*2))

#----------------------------------------------------#
#- BEGIN MONTE CARLO SIMULATIONS.                   -#
#----------------------------------------------------#

for b in range(0, NumDesigns):
    
    # Set random seed and start bth Monte Carlo experiments
    # (same seed is used for each design)
    np.random.seed(seed=361)
    
    print("Monte Carlo design " + '%.0f' % (b+1) + " of " + '%.0f' % NumDesigns )

    #----------------------------------------------#
    #- MONTE CARLO SIMULATIONS FOR CURRENT DESIGN -#
    #----------------------------------------------#

    for s in range(0,S):
        start = time.time()
    
        #-------------------------------------#
        #- STEP 1 : SIMULATE BIPARTITE GRAPH -#
        #-------------------------------------#
    
        alpha_a = 1/2
        alpha_b = 1/2
        beta = 1
    
        # Simulate firm ("consumer") and bank ("product") heterogeneity/effects
        A = np.random.gamma(alpha_a, 1/beta, N)
        B = np.random.gamma(alpha_b, 1/beta, M)
        
        iota_M   = np.ones((M,))
        i        = np.kron(np.arange(0,N), iota_M)
        A_n      = np.kron(A, iota_M) 
    
        iota_N   = np.ones((N,))
        j        = np.kron(iota_N, np.arange(0,M))
        B_n      = np.kron(iota_N, B) 
    
        
    
        V        = np.random.gamma(designs[b], 1/beta, (N*M,))
        U_star   = A_n + B_n + V
        U_star   = sp.stats.gamma.cdf(U_star, alpha_a + alpha_b + designs[b], 0, 1/beta)  
        U        = np.log(U_star/(1-U_star))                            
    
        Y_s      = 1*(Z @ theta_BL >= U.reshape(-1,1))
    
        #---------------------------------------------#
        #- Uncomment this code and iteratively play  -#
        #- with the DGP to choose the "designs"      -#
        #- parameters to match the degree sequence's -#
        #- inter-quartile range as described in      -#
        #- Section 4 of the paper.                   -# 
        #---------------------------------------------#
        #bank_degree_s = Y_s.groupby(level=1).sum()
        #print(bank_degree_s)
        #print(bank_degree_s.quantile(0.75)-bank_degree_s.quantile(0.25)) 
        
        #----------------------------------------------------------#
        #- STEP 2 : COMPUTE PSEUDO COMPOSITE LIKELIHOOD ESTIMATES -#
        #----------------------------------------------------------#
    
        #--------------------------------------------------------------------------#
        #- Estimation is with the bilogit() command included in the netrics module-#
        #--------------------------------------------------------------------------#
        
        # (i) Dense network standard errors
        # ---------------------------------
        
        [theta_s, vcov_theta_s]= netrics.bilogit(Y_s, Z, nocons=True, silent=True, cov='dense')
       
        # Save pseudo composite MLE of log-distance coefficient
        theta_hat[s,b] = theta_s[3,0]
    
        # See if true log-distance coefficient is inside dense Wald-based confidence interval
        coverage[s,b*2]  = (theta_BL[3,0]<=theta_s[3,0] + 1.96*np.sqrt(vcov_theta_s[3,3]))*\
                           (theta_BL[3,0]>=theta_s[3,0] - 1.96*np.sqrt(vcov_theta_s[3,3]))
           
        # Estimated standard error
        se_hat[s,b*2]    = np.sqrt(vcov_theta_s[3,3])
    
        # (ii) Bias correction / Sparse network standard errors
        # -----------------------------------------------------
        
        [theta_s, vcov_theta_s]= netrics.bilogit(Y_s, Z, nocons=True, silent=True, cov='sparse')
  
        # See if true log-distance coefficient is inside sparse Wald-based confidence interval
        coverage[s,b*2+1]  = (theta_BL[3,0]<=theta_s[3,0] + 1.96*np.sqrt(vcov_theta_s[3,3]))*\
                             (theta_BL[3,0]>=theta_s[3,0] - 1.96*np.sqrt(vcov_theta_s[3,3]))
               
        # Estimated standard error
        se_hat[s,b*2+1] = np.sqrt(vcov_theta_s[3,3])
    
        end = time.time()
        if (s+1) % 500 == 0:
            print("Time required f/ MC rep  " + str(s+1) + " of " + str(S) + ": " + str(end-start))      


Monte Carlo design 1 of 3
Time required f/ MC rep  500 of 5000: 0.21581125259399414
Time required f/ MC rep  1000 of 5000: 0.10178732872009277
Time required f/ MC rep  1500 of 5000: 0.08482098579406738
Time required f/ MC rep  2000 of 5000: 0.06363415718078613
Time required f/ MC rep  2500 of 5000: 0.06827473640441895
Time required f/ MC rep  3000 of 5000: 0.12692975997924805
Time required f/ MC rep  3500 of 5000: 0.10239005088806152
Time required f/ MC rep  4000 of 5000: 0.07051801681518555
Time required f/ MC rep  4500 of 5000: 0.0822610855102539
Time required f/ MC rep  5000 of 5000: 0.10680794715881348
Monte Carlo design 2 of 3
Time required f/ MC rep  500 of 5000: 0.08307695388793945
Time required f/ MC rep  1000 of 5000: 0.06691408157348633
Time required f/ MC rep  1500 of 5000: 0.06562685966491699
Time required f/ MC rep  2000 of 5000: 0.06409978866577148
Time required f/ MC rep  2500 of 5000: 0.06987190246582031
Time required f/ MC rep  3000 of 5000: 0.06362605094909668
Time re

The next block of code prints out the Monte Carlo summary results reported in **Table 4** of Section 4 of the paper.

In [35]:
from scipy.stats import norm

np.set_printoptions(suppress=True, precision=5)

print("theta_0 : " + '%.5f' % theta_BL[3,0])
print(np.mean(theta_hat-theta_BL[3,0],axis=0))
print(np.median(theta_hat-theta_BL[3,0],axis=0))
print(np.std(theta_hat,axis=0))
print((np.quantile(theta_hat, q=0.95, axis=0)-np.quantile(theta_hat, q=0.05, axis=0))/(norm.ppf(0.95)-norm.ppf(0.05)))  

theta_0 : -0.16629
[0.00217 0.00074 0.00151]
[ 0.0006  -0.00091 -0.00053]
[0.03568 0.03547 0.04003]
[0.0355  0.03479 0.04009]


In [37]:
print(np.mean(coverage,axis=0).reshape((NumDesigns,2)).T)

[[0.5256 0.5266 0.5052]
 [0.928  0.9306 0.9044]]


In [36]:
print(np.mean(se_hat,axis=0).reshape((NumDesigns,2)).T)

[[0.0143  0.01432 0.01482]
 [0.03326 0.03325 0.0345 ]]
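
One way to read the three outputs above together is to compare the average estimated standard error with the Monte Carlo standard deviation of the point estimates. The sketch below does this using only the rounded numbers printed above. The first row of `mean_se` is the dense estimator and the second the sparse one, following the column layout used when `se_hat` was filled.

```python
import numpy as np

# Monte Carlo standard deviation of the log-distance estimate, one entry per design
mc_std = np.array([0.03568, 0.03547, 0.04003])

# Mean estimated standard errors: row 0 = dense, row 1 = sparse
mean_se = np.array([[0.0143, 0.01432, 0.01482],
                    [0.03326, 0.03325, 0.0345]])

# Ratio near 1 means the standard error tracks the actual sampling variability
ratio = mean_se / mc_std
print(np.round(ratio, 3))
```

In all three designs the dense standard errors are well under half of the Monte Carlo standard deviation, while the sparse ones are much closer to it though still somewhat below, which lines up with the coverage rates printed above.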


In [38]:
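# Cross check: if U_star is Uniform(0,1) its mean is 1/2 and its variance is 1/12,
# so both quantities below should be close to 1 (only the last one is displayed)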
U_star.mean()*2
U_star.var()*12

1.0180989336990334
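
The same cross check can be run outside the Monte Carlo loop. The standalone sketch below redraws the latent shifter once for the first design (`alpha_v = 34`) and checks the uniform moments as well as the variance of the implied logistic variable, which should be close to $\pi^2/3$. It uses NumPy's `default_rng` generator and `np.repeat`/`np.tile` in place of the Kronecker products, so the draws will not match those in the loop above. Only the marginal distribution is being checked here.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(361)      # arbitrary seed for this illustration
N, M = 351, 39                        # firms, banks
alpha_a, alpha_b, alpha_v, beta = 0.5, 0.5, 34, 1.0

# Firm, bank and dyad-specific gamma components with a common scale 1/beta
A = rng.gamma(alpha_a, 1 / beta, N)
B = rng.gamma(alpha_b, 1 / beta, M)
V = rng.gamma(alpha_v, 1 / beta, N * M)

# Dyads ordered firm-major: (firm 0, bank 0), (firm 0, bank 1), ...
G = np.repeat(A, M) + np.tile(B, N) + V

# Probability integral transform, then logit transform
U_star = stats.gamma.cdf(G, alpha_a + alpha_b + alpha_v, loc=0, scale=1 / beta)
U = np.log(U_star / (1 - U_star))

check_mean = 2 * U_star.mean()                   # close to 1 if uniform
check_var = 12 * U_star.var()                    # close to 1 if uniform
check_logistic = U.var() / (np.pi ** 2 / 3)      # close to 1 if standard logistic
print(check_mean, check_var, check_logistic)
```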