# Notebook 5: Exogenous Network Formation and Tetrad Logit
#### Econometric Methods for Social Spillovers and Networks
#### University of St. Gallen, October 7th to October 11th, 2024
##### _Bryan S. Graham, UC - Berkeley, bgraham@econ.berkeley.edu_
This is the fifth in a series of IPython Jupyter notebooks designed to accompany a series of instructional lectures given at the University of St. Gallen from October 7th to October 11th, 2024. The scripts below were written for Python 3.6. The Anaconda distribution of Python, available at https://www.continuum.io/downloads, comes bundled with most of the scientific computing packages used in these notebooks.
<br>
<br>
For more information about the course please visit my webpage at http://bryangraham.github.io/econometrics/.
<br>
<br>
In this notebook I illustrate how to fit dyadic network formation models using the two estimation procedures studied in Graham (2017, _Econometrica_). This paper extends dyadic regression analysis to accommodate unobserved agent-specific effects.
<br>
<br>
#### References    
<br>
Graham, Bryan S. (2017). "An econometric model of network formation with degree heterogeneity," _Econometrica_ 85 (4): 1033-1063.

I begin by importing several key packages. The _numpy_ and _scipy_ libraries include a core set of scientific computing tools. The _pandas_ package is useful for data organization and analysis, while _matplotlib_ is Python's basic plotting add-on. Finally, the _networkx_ package includes functionality for analyzing and visualizing network data.

In [3]:
# Direct Python to plot all figures inline (i.e., not in a separate window)
%matplotlib inline

# Main scientific computing modules
import numpy as np
import scipy as sp
import pandas as pd

# Import matplotlib
import matplotlib.pyplot as plt

# networkx module for the analysis of network data
import networkx as nx

Load the **netrics** package. The Python 2.7 version of this package is registered on [PyPi](https://pypi.python.org/pypi/netrics/), with a GitHub repository at https://github.com/bryangraham/netrics_py27. For an informal introduction to the package see [this](http://bryangraham.github.io/econometrics/networks/2016/09/15/netrics-module.html) blog post. The blog post also includes links to additional resources. The Python 3.6 version of the package is still under development, but currently available code can be found on GitHub at https://github.com/bryangraham/netrics. Once a basic level of functionality and reliability is in place I will register it on PyPi.

In [5]:
# Append location of ipt module base directory to system path
# NOTE: only required if permanent install not made (see comments above)
import sys
sys.path.append('C:/Users/bgrah/Dropbox/Sites/software/ipt/')
sys.path.append('C:/Users/bgrah/Dropbox/Sites/software/netrics/')

# Load ipt and netrics modules
import ipt as ipt
import netrics as netrics

The following code snippet should be edited to point to wherever you have saved the instructional datasets for the course.

In [8]:
# Directory where datasets are located
data =  'C:/Users/bgrah/Dropbox/Teaching/Short_Courses/St_Gallen/2024/Data/'

A basic dyadic dataset for Nyakatoke was constructed in Notebook 1. I begin by loading this dataset into a pandas dataframe.

In [12]:
# Read in estimation sample created in the notebook for Lecture 1
es = pd.read_csv(data+"Created/Nyakatoke_Estimation_Sample.csv")
es = es.set_index(['hh1', 'hh2'], drop = False)           # Set dataframe multi-index
del es['Unnamed: 0']                                      # Delete first column which is an unneeded single index

# Print out the first few rows of the dyadic data
es.head()

| hh1 (index) | hh2 (index) | hh1 | hh2 | links | kinship | distance | clan1 | clan2 | wealth1 | wealth2 | religion1 | religion2 | primary1 | primary2 | head_age1 | head_age2 | head_sex1 | head_sex2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 1 | 2 | no link | other blood relation | 91.2 | 21.0 | 6.0 | 17.657388 | 6.639588 | Catholic | Catholic | 0 | 1 | 75.0 | 30.0 | 1.0 | 1.0 |
| 1 | 3 | 1 | 3 | no link | no blood relation | 69.6 | 21.0 | 21.0 | 17.657388 | 1.538235 | Catholic | Catholic | 0 | 1 | 75.0 | 48.0 | 1.0 | 0.0 |
| 1 | 4 | 1 | 4 | unilateral link | no blood relation | 199.2 | 21.0 | 23.0 | 17.657388 | 2.743059 | Catholic | Catholic | 0 | 1 | 75.0 | 80.0 | 1.0 | 1.0 |
| 1 | 5 | 1 | 5 | no link | no blood relation | 252.0 | 21.0 | 23.0 | 17.657388 | 1.976098 | Catholic | Catholic | 0 | 1 | 75.0 | 35.0 | 1.0 | 1.0 |
| 1 | 6 | 1 | 6 | no link | no blood relation | 213.6 | 21.0 | 23.0 | 17.657388 | 1.394941 | Catholic | Catholic | 0 | 1 | 75.0 | 28.0 | 1.0 | 1.0 |


Below we fit a dyadic link formation model using the two estimators introduced by Graham (2017, _Econometrica_). The link formation model studied in that paper is


<div> $$\Pr\left(\left.D_{ij}=1\right|\mathbf{X},\mathbf{A}\right) = \frac{\exp\left(\sum_{k=1}^{K}W_{k,ij}\beta_{k}+A_{i}+A_{j}\right)}   {1+\exp\left(\sum_{k=1}^{K}W_{k,ij}\beta_{k}+A_{i}+A_{j}\right)}$$ </div>

Here $\mathbf{X}$ denotes all the household-specific observed covariates in the network, $W_{k,ij} = W_{k,ji}$ are dyad-specific covariates constructed from $\mathbf{X}$, and the $A_i$ for $i=1,...,N$ are household-specific _degree heterogeneity_ parameters. These are treated as so-called "fixed effects".
<br>
<br>
Graham (2017) introduces two procedures: a "tetrad logit" procedure, which forms a criterion function, based on conditioning arguments, that is invariant to the fixed effects, and a "joint fixed effects" procedure, which estimates the individual effects along with the common parameters.
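
The conditioning idea can be illustrated on a single tetrad. The sketch below assumes the tetrad statistic takes the form $S_{ij,kl} = D_{ij}D_{kl}(1-D_{ik})(1-D_{jl}) - (1-D_{ij})(1-D_{kl})D_{ik}D_{jl}$ with regressor $\tilde{W}_{ij,kl} = W_{ij} + W_{kl} - W_{ik} - W_{jl}$, and checks numerically that the $A_i$ cancel from the corresponding difference of link indices. The function names are illustrative and are not part of _netrics_.

```python
import numpy as np

def tetrad_stat(D, i, j, k, l):
    """S for the tetrad (i, j, k, l): +1 or -1 if identifying, 0 otherwise."""
    return (D[i, j] * D[k, l] * (1 - D[i, k]) * (1 - D[j, l])
            - (1 - D[i, j]) * (1 - D[k, l]) * D[i, k] * D[j, l])

def tetrad_regressor(W, i, j, k, l):
    """W_ij + W_kl - W_ik - W_jl for each of the K covariates."""
    return np.array([Wk[i, j] + Wk[k, l] - Wk[i, k] - Wk[j, l] for Wk in W])

rng = np.random.default_rng(1)
N = 4
Wk = rng.normal(size=(N, N))
Wk = Wk + Wk.T                                  # one symmetric toy covariate
W_toy, beta = [Wk], np.array([0.7])
A = rng.normal(size=N)                          # arbitrary fixed effects

# Full link index W_ij * beta + A_i + A_j for every pair
index = beta[0] * Wk + A[:, None] + A[None, :]
i, j, k, l = 0, 1, 2, 3

# The same difference computed with and without the fixed effects
lhs = index[i, j] + index[k, l] - index[i, k] - index[j, l]
rhs = tetrad_regressor(W_toy, i, j, k, l) @ beta

# A toy network where ij and kl are linked but ik and jl are not
D_toy = np.zeros((N, N), dtype=int)
D_toy[0, 1] = D_toy[1, 0] = D_toy[2, 3] = D_toy[3, 2] = 1
s = tetrad_stat(D_toy, i, j, k, l)
print(np.isclose(lhs, rhs), s)
```

Because `lhs` and `rhs` agree for any choice of `A`, a criterion built only from tetrads with $S_{ij,kl} \in \{-1, 1\}$ does not depend on the degree heterogeneity parameters.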
<br>
<br>
In the _netrics_ package these two procedures are operationalized by the *tetrad\_logit()* and *dyad\_jfe\_logit()* functions. These two functions require the user to input data in a very particular way. The outcome is $D$, the $N \times N$ adjacency matrix. The regressors are included in a length $K$ list where each element is an $N \times N $ numpy 2d array $W_{k}$ with $(i,j)$ element equal to the dyadic covariate $W_{k,ij}$. It takes a bit of work to get our Nyakatoke dataset into this form. This is what the next few snippets of code do.
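
To make the required input layout concrete, here is a toy example that simulates a small network from the model above and stores it in the same shape: an $N \times N$ symmetric array for $D$ and a length $K$ list of $N \times N$ symmetric arrays for $W$. The sizes, parameter values and the helper name `symmetric_from_lower` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 6, 2                                    # toy sizes (hypothetical)
n = N * (N - 1) // 2                           # number of dyads
low = np.tril_indices(N, -1)                   # positions below the diagonal

def symmetric_from_lower(values, N):
    """Place one value per dyad below the diagonal and mirror it above."""
    M = np.zeros((N, N))
    M[np.tril_indices(N, -1)] = values
    return M + M.T

# K-list of N x N dyadic covariate matrices
W_toy = [symmetric_from_lower(rng.normal(size=n), N) for _ in range(K)]
beta = np.array([0.5, -1.0])                   # hypothetical common parameters
A = rng.normal(scale=0.5, size=N)              # hypothetical degree heterogeneity

# Link probabilities: logistic function of W_ij'beta + A_i + A_j
index = sum(b * Wk for b, Wk in zip(beta, W_toy)) + A[:, None] + A[None, :]
P = 1.0 / (1.0 + np.exp(-index))

# Draw each dyad once, then symmetrize to get an undirected adjacency matrix
D_toy = np.zeros((N, N), dtype=int)
D_toy[low] = rng.uniform(size=n) < P[low]
D_toy = D_toy + D_toy.T
```

`D_toy` and `W_toy` have the layout described above, which is the layout the Nyakatoke data need to be brought into.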

In [15]:
hh_set = set(es['hh1'].unique()) | set(es['hh2'].unique()) # Set of all households
N = len(hh_set)                                            # Number of households
n = N * (N - 1) //2                                        # Number of dyads

In [17]:
# Get multi-indices for lower triangle of N x N matrix
ij_LowTri = np.tril_indices(N, -1)

In [19]:
# Form adjacency matrix as a dense 2d numpy array
D = np.zeros((N,N), dtype = bool)
D[ij_LowTri] = (es['links'] != 'no link')
D = (D + D.T)*1
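
The positional assignment above pairs the r-th row of `es` with the r-th entry of `np.tril_indices(N, -1)`, which runs through the lower triangle row by row: (1,0), (2,0), (2,1), (3,0), and so on. It is therefore only correct when the rows of `es` are stored in that same dyad order. A minimal sketch of an alternative that looks up each household's matrix position from its identifier, and so does not depend on row order, is given below. The helper `dyads_to_symmetric` and the toy data are hypothetical.

```python
import numpy as np

# Order in which np.tril_indices enumerates the lower triangle (N = 4)
tril_order = [(int(i), int(j)) for i, j in zip(*np.tril_indices(4, -1))]
print(tril_order)

def dyads_to_symmetric(hh1, hh2, values, dtype=float):
    """Build a symmetric matrix from a dyad list using household identifiers."""
    hh1, hh2, values = np.asarray(hh1), np.asarray(hh2), np.asarray(values)
    ids = np.unique(np.concatenate([hh1, hh2]))     # sorted household ids
    pos = {h: p for p, h in enumerate(ids)}         # id -> row/column position
    i = np.array([pos[h] for h in hh1])
    j = np.array([pos[h] for h in hh2])
    M = np.zeros((len(ids), len(ids)), dtype=dtype)
    M[i, j] = values                                # fill (i, j) ...
    M[j, i] = values                                # ... and its mirror image
    return M, ids

# Toy dyad list laid out like `es`: sorted by hh1, then hh2, with gaps in the ids
hh1 = [1, 1, 1, 2, 2, 4]
hh2 = [2, 4, 7, 4, 7, 7]
linked = [0, 1, 0, 0, 1, 0]
D_toy, ids = dyads_to_symmetric(hh1, hh2, linked, dtype=int)
print(ids)
print(D_toy)
```

On the course data one could call, for example, `dyads_to_symmetric(es['hh1'], es['hh2'], es['links'] != 'no link', dtype=int)` and compare the result with `D`. If the two disagree, the row order of `es` does not match the index order assumed by the positional assignment.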

In [21]:
# Form four categorical bins for wealth (same as used in plot of Network from Lecture 1)
wealth_bkts = [1000000, 6, 3, 1.5, 0]
wealth_bin_labels = ['+600K','(600K,300K]','(300K,150K]','(150K, 0K]']
es['wealth_cat1'] = np.digitize(es.wealth1, wealth_bkts)
es['wealth_cat2'] = np.digitize(es.wealth2, wealth_bkts)
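
With a decreasing list of bin edges, `np.digitize` returns the index `i` such that `bins[i-1] > x >= bins[i]`, so category 1 is the wealthiest bin and category 4 the poorest. The short check below uses wealth values taken from the rows printed earlier, plus one made-up value (4.0) to populate the second bin. The bin labels suggest that wealth is recorded in units of 100K, but that reading is an assumption.

```python
import numpy as np

wealth_bkts = [1000000, 6, 3, 1.5, 0]

# Values from the first rows of the dataset, plus 4.0 as a hypothetical example
sample_wealth = np.array([17.657388, 6.639588, 4.0, 2.743059, 1.538235, 1.394941])
cats = np.digitize(sample_wealth, wealth_bkts)
print(cats)

# Two households are in adjacent classes when their categories differ by one
adjacent = np.abs(cats[0] - cats[2]) == 1
print(adjacent)
```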

Construct $K$ list of $N \times N$ dyad-specific covariate matrices.

In [24]:
cov_names = ['Distance','Other blood relation', 'Cousin, etc.', 'Sibling, child, etc.', 'Age difference', \
             'Same religion', 'Same clan', 'Same education level', 'Adjacent wealth class', \
             'Non-adjacent wealth class']

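# Initialize K-list of dyadic covariate matrices
# (the list comprehension gives each covariate its own array; multiplying a one-element
#  list would instead create K references to a single shared array)
W = [np.zeros((N,N), dtype = float) for _ in cov_names]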

# Distance between households
W[0][ij_LowTri] = es['distance']
W[0] = W[0] + W[0].T

# Whether two households have some other blood relation
W[1][ij_LowTri] = (es['kinship']=='other blood relation')*1
W[1] = W[1] + W[1].T

# Whether two households have a nephew etc. level blood relation
W[2][ij_LowTri] = (es['kinship']=='Nephew/niece/uncle/aunt,cousin,grandparent/grandchild')*1
W[2] = W[2] + W[2].T    

# Whether two households have a child etc. level blood relation
W[3][ij_LowTri] = (es['kinship']=='sibling, child, parent')*1
W[3] = W[3] + W[3].T  

# Absolute difference in age between two household heads
W[4][ij_LowTri] = np.abs(es['head_age1']-es['head_age2'])
W[4] = W[4] + W[4].T

# Whether two households are of the same religion
W[5][ij_LowTri] = (es['religion1'] == es['religion2'])*1
W[5] = W[5] + W[5].T

# Whether two households belong to the same clan
W[6][ij_LowTri] = (es['clan1'] == es['clan2'])*1
W[6] = W[6] + W[6].T

# Whether two households are of the same education level
W[7][ij_LowTri] = (es['primary1'] == es['primary2'])*1
W[7] = W[7] + W[7].T

# Whether two households are in adjacent wealth classes
W[8][ij_LowTri] = (np.abs(es['wealth_cat1'] - es['wealth_cat2']) == 1)*1
W[8] = W[8] + W[8].T

# Whether two households are in non-adjacent wealth classes
W[9][ij_LowTri] = (np.abs(es['wealth_cat1'] - es['wealth_cat2']) > 1)*1
W[9] = W[9] + W[9].T

Save the adjacency matrix and the covariate matrices to an uncompressed file (this serves as a test dataset for the *tetrad\_logit()* and *dyad\_jfe\_logit()* functions). This file can be downloaded from GitHub at https://github.com/bryangraham/netrics/blob/master/Notebooks/Nyakatoke_Example.npz.

In [27]:
np.savez(data+"Created/Nyakatoke_Example", D = D, Distance = W[0], OtherBlood = W[1], Cousin = W[2], \
                                           Sibling = W[3], AgeDiff = W[4], SameReligion = W[5], \
                                           SameClan = W[6], SameEdLvl = W[7], AdjacentWealthClass = W[8], \
                                           NonAdjacentWealthClass = W[9])
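
For reference, a file written this way can be read back with `np.load`. The sketch below round-trips a small made-up archive in a temporary directory rather than the Nyakatoke file, so the array contents and the path are placeholders.

```python
import os
import tempfile
import numpy as np

# Small stand-in arrays (not the Nyakatoke data)
M = np.arange(9, dtype=float).reshape(3, 3)
path = os.path.join(tempfile.mkdtemp(), "toy_example")

# np.savez appends the ".npz" extension when it is missing
np.savez(path, D=(M > 4) * 1, Distance=M)

# Read the arrays back by the keyword names used when saving
with np.load(path + ".npz") as archive:
    saved_names = list(archive.files)
    D_back = archive["D"]
    W_back = [archive[name] for name in saved_names if name != "D"]

print(saved_names)
```

When rebuilding the covariate list from the real file it is safer to list the names explicitly, in the same order as `cov_names`, than to rely on the order in which the archive stores them.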

## Tetrad Logit
With $W$ appropriately constructed we can now fit a link formation model using *tetrad\_logit()*. We can get a sense of how the function works by viewing its help header.

In [30]:
help(netrics.tetrad_logit)

Help on function tetrad_logit in module netrics.tetrad_logit:

tetrad_logit(D, W, dtcon=None, silent=False, W_names=None)
    AUTHOR: Bryan S. Graham, bgraham@econ.berkeley.edu, June 2016
            Revised/updated for Python 3.6 October, 2018

    This function computes the Tetrad Logit estimator introduced in Graham (2017, Econometrica)
    -- "An Econometric Model of Link Formation with Degree Heterogeneity". The implementation
    is as described in the paper. Notation attempts to follow that used in the paper.

    INPUTS
    ------
    D                 : N x N undirected adjacency matrix
    W                 : List with elements consisting of N x N 2d numpy arrays of dyad-specific
                        covariates such that W[k][i,j] gives the k-th covariate for dyad ij
    dtcon             : Dyad and tetrad concordance (dtcon) List with elements [tetrad_to_dyads_indices,
                        dyad_to_tetrads_dict]. If dtcon=None, then construct it using generate_tetrad_in

Finally we call *tetrad\_logit()*. Depending on the speed and memory of your computer, this next bit of code may take a few minutes to complete.

In [33]:
[beta_TL, vcov_beta_TL, tetrad_frac_TL, success] = \
    netrics.tetrad_logit(D, W, dtcon=None, silent=False, W_names=cov_names)

Fisher-Scoring Derivative check (2-norm): 83.97825823
Value of -logL = 69158.185519,  2-norm of score = 5050101.000941
Value of -logL = 67593.616893,  2-norm of score = 798126.605197
Value of -logL = 67538.081576,  2-norm of score = 227764.211361
Value of -logL = 67033.641902,  2-norm of score = 180510.666363
Value of -logL = 66303.083394,  2-norm of score = 98018.170491
Value of -logL = 66110.547910,  2-norm of score = 45862.868826
Value of -logL = 59498.546153,  2-norm of score = 383793.425762
Value of -logL = 59486.204464,  2-norm of score = 14890.282068
Value of -logL = 57556.926539,  2-norm of score = 48275.681501
Value of -logL = 57556.713127,  2-norm of score = 11335.705003
Value of -logL = 55482.844360,  2-norm of score = 56718.092363
Value of -logL = 55482.552123,  2-norm of score = 8825.014286
Value of -logL = 52713.320868,  2-norm of score = 88086.568318
Value of -logL = 52712.611482,  2-norm of score = 6841.115100
Value of -logL = 51002.399516,  2-norm of score = 43311.4584

  return terminate(2, msg)



-------------------------------------------------------------------------------------------
- TETRAD LOGIT ESTIMATION RESULTS                                                         -
-------------------------------------------------------------------------------------------

Number of agents,           N :             115
Number of dyads,            n :           6,555
Number of tetrads             :       6,913,340
Number identifying tetrads    :         102,151
Fraction identifying tetrads  :        0.014776

-------------------------------------------------------------------------------------------

Independent variable       Coef.    ( Std. Err.)     (0.95 Confid. Interval )
-------------------------------------------------------------------------------------------
Distance                  -0.002098 (  0.000321)     ( -0.002728 , -0.001468)
Other blood relation       1.263222 (  0.295226)     (  0.684589 ,  1.841855)
Cousin, etc.               2.045045 (  0.373156)     (  1.3136

The program spits out some useful information. In a network with $N = 115$ agents there are a total of $6,913,340$ different tetrads! Of these only $102,151$, or about 1.5 percent, actually contribute to the *tetrad\_logit()* criterion function. The effectiveness of the procedure in the context of sparse networks is one of its theoretical and practical attractions.
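
The counts in the output header can be reproduced directly: the number of dyads is the number of ways to choose two agents out of $N$, and the reported number of tetrads equals the number of ways to choose four. The number of identifying tetrads is copied from the output above rather than recomputed.

```python
from math import comb

N = 115
n_dyads = comb(N, 2)          # pairs of agents
n_tetrads = comb(N, 4)        # groups of four agents
n_identifying = 102151        # copied from the estimation output above
frac = n_identifying / n_tetrads

print(n_dyads, n_tetrads, round(frac, 6))
```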
<br>
<br>
The results suggest that ties are more frequent between households which are physically close, related by blood and where the household heads are close in age. There is little evidence for sorting by religion, clan or education. There is some evidence for heterophily in terms of wealth (perhaps consistent with some models of "mutual support"), but it is rather weak.
<br>
<br>
The *tetrad\_logit()* estimation procedure is computationally intensive. On a modern desktop machine it would be difficult to use the procedure on a network larger than a few hundred agents (fortunately there are many possible ways to apply the method at scale via various approximations and parallelizations, but these extensions are not yet part of the _netrics_ package).
<br>
<br>
Part of the computational intensity is up front. Specifically, the function computes a detailed dictionary which maps dyads-to-tetrads and vice-versa. This bookkeeping overhead makes later calculations much quicker. A user who anticipates fitting several different models to the same adjacency matrix can save considerable time by computing this concordance first (once and for all) and then passing it to *tetrad\_logit()* via the _dtcon_ parameter. This can be done by calling the *generate\_tetrad\_indices()* function.
<br>
<br>
To illustrate I create the concordance and then fit a model with only the distance and kinship variables included.

In [35]:
N = np.shape(D)[0]
concordance = netrics.generate_tetrad_indices(N, full_set=True)
[beta_TL, vcov_beta_TL, tetrad_frac_TL, success] = \
    netrics.tetrad_logit(D, W[0:4], dtcon=concordance, silent=False, W_names=cov_names[0:4])

Fisher-Scoring Derivative check (2-norm): 83.97815266
Value of -logL = 69163.219760,  2-norm of score = 5045814.882188
Value of -logL = 67604.839042,  2-norm of score = 772959.875424
Value of -logL = 67557.021217,  2-norm of score = 30472.353371
Value of -logL = 60570.400356,  2-norm of score = 312602.526238
Value of -logL = 60562.289810,  2-norm of score = 9904.763502
Value of -logL = 55914.238422,  2-norm of score = 204713.530247
Value of -logL = 55910.626413,  2-norm of score = 6252.561686
Value of -logL = 53745.636791,  2-norm of score = 62366.323280
Value of -logL = 53745.286498,  2-norm of score = 4450.442313
Value of -logL = 52853.898960,  2-norm of score = 13360.325111
Value of -logL = 52853.881208,  2-norm of score = 3931.549249
Value of -logL = 52132.404516,  2-norm of score = 9742.309192
Value of -logL = 52132.394609,  2-norm of score = 3371.576694
Value of -logL = 51720.031399,  2-norm of score = 4052.700791
Value of -logL = 50991.842897,  2-norm of score = 17599.721405
Val

  return terminate(2, msg)



-------------------------------------------------------------------------------------------
- TETRAD LOGIT ESTIMATION RESULTS                                                         -
-------------------------------------------------------------------------------------------

Number of agents,           N :             115
Number of dyads,            n :           6,555
Number of tetrads             :       6,913,340
Number identifying tetrads    :         102,151
Fraction identifying tetrads  :        0.014776

-------------------------------------------------------------------------------------------

Independent variable       Coef.    ( Std. Err.)     (0.95 Confid. Interval )
-------------------------------------------------------------------------------------------
Distance                  -0.002067 (  0.000328)     ( -0.002710 , -0.001424)
Other blood relation       1.307596 (  0.293372)     (  0.732597 ,  1.882595)
Cousin, etc.               2.209613 (  0.393581)     (  1.4382

## Joint Fixed Effects Logit
Graham (2017) also introduces a joint fixed effects estimator for link formation. This procedure estimates the coefficients on $W_{k,ij}$ for $k=1,...,K$ as well as the individual-specific degree heterogeneity parameters $A_{i}$ for $i=1,...,N$. The *dyad\_jfe\_logit()* function implements this estimator. By default it reports the iterated bias-corrected estimates described in the paper (but the uncorrected estimates are also returned by the function).
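
One way to think about the uncorrected joint estimator is as an ordinary logit of $D_{ij}$ on the dyadic covariates and a set of $N$ agent dummies, where the row for dyad $ij$ has ones in columns $i$ and $j$. The sketch below builds such a design for a toy network and evaluates the logit log-likelihood. It is an illustration of the model structure only, with hypothetical function names, and is not the _netrics_ implementation (which also handles the bias correction).

```python
import numpy as np

def jfe_design(W):
    """Stack dyadic covariates and agent dummies, one row per dyad i > j."""
    N = W[0].shape[0]
    i, j = np.tril_indices(N, -1)
    X = np.column_stack([Wk[i, j] for Wk in W])     # n x K covariates
    T = np.zeros((len(i), N))                       # n x N agent dummies
    rows = np.arange(len(i))
    T[rows, i] = 1.0
    T[rows, j] = 1.0
    return X, T, (i, j)

def neg_loglik(theta, D, X, T, ij):
    """Negative logit log-likelihood; theta stacks beta (K) and A (N)."""
    K = X.shape[1]
    index = X @ theta[:K] + T @ theta[K:]           # W_ij'beta + A_i + A_j
    d = D[ij]
    return -np.sum(d * index - np.logaddexp(0.0, index))

# Toy inputs (hypothetical): one covariate, five agents
rng = np.random.default_rng(2)
N = 5
Wk = rng.normal(size=(N, N))
Wk = Wk + Wk.T
D_toy = np.triu((rng.uniform(size=(N, N)) < 0.4).astype(int), 1)
D_toy = D_toy + D_toy.T

X, T, ij = jfe_design([Wk])
value_at_zero = neg_loglik(np.zeros(1 + N), D_toy, X, T, ij)
print(X.shape, T.shape, value_at_zero)
```

With $N$ incidental parameters and only $N(N-1)/2$ dyads, the $A_i$ of agents with few links are hard to pin down, which is the practical difficulty discussed further below.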

In [39]:
[beta_JFE, beta_JFE_BC, vcov_beta_JFE, A_JFE, success] = \
    netrics.dyad_jfe_logit(D, W, T=None, silent=False, W_names=cov_names, beta_sv=None)

-------------------------------------------------------------------------------------------
- COMPUTE JOINT FIXED EFFECT MLEs                                                         -
-------------------------------------------------------------------------------------------
Value of c_logl = 1573.908738,  2-norm of c_score = 62611.044224
Value of c_logl = 1547.376177,  2-norm of c_score = 48351.975346
Value of c_logl = 1529.569133,  2-norm of c_score = 37055.111351
Value of c_logl = 1517.970298,  2-norm of c_score = 28245.611857
Value of c_logl = 1510.616124,  2-norm of c_score = 21451.154557
Value of c_logl = 1506.059542,  2-norm of c_score = 16250.785299
Value of c_logl = 1503.289015,  2-norm of c_score = 12291.724887
Value of c_logl = 1501.628916,  2-norm of c_score = 9289.087258
Value of c_logl = 1500.644547,  2-norm of c_score = 7018.123943
Value of c_logl = 1500.064460,  2-norm of c_score = 5304.215796
Value of c_logl = 1499.723062,  2-norm of c_score = 4013.098329
Value of c_lo

  return terminate(2, msg)



-------------------------------------------------------------------------------------------
- BIAS CORRECTED JOINT FIXED EFFECTS LOGIT ESTIMATION RESULTS                             -
-------------------------------------------------------------------------------------------

Number of agents,           N :             115
Number of dyads,            n :           6,555

-------------------------------------------------------------------------------------------

Independent variable       Coef.    ( Std. Err.)     (0.95 Confid. Interval )
-------------------------------------------------------------------------------------------
Distance                  -0.002001 (  0.000184)     ( -0.002361 , -0.001640)
Other blood relation       1.362710 (  0.189174)     (  0.991936 ,  1.733485)
Cousin, etc.               1.856894 (  0.244763)     (  1.377167 ,  2.336620)
Sibling, child, etc.       2.875526 (  0.273332)     (  2.339806 ,  3.411246)
Age difference            -0.004631 (  0.003642)  

Value of c_logl = 1363.742108,  2-norm of c_score = 0.085660
Value of c_logl = 1363.742101,  2-norm of c_score = 0.397025
Value of c_logl = 1363.742101,  2-norm of c_score = 0.295160
Value of c_logl = 1363.742101,  2-norm of c_score = 0.219695
Value of c_logl = 1363.742101,  2-norm of c_score = 0.163828
Value of c_logl = 1363.742101,  2-norm of c_score = 0.122529
Value of c_logl = 1363.742101,  2-norm of c_score = 0.092091
Value of c_logl = 1363.742101,  2-norm of c_score = 0.196307
Value of c_logl = 1363.742101,  2-norm of c_score = 0.146177
Value of c_logl = 1363.742101,  2-norm of c_score = 0.109087
Value of c_logl = 1363.742101,  2-norm of c_score = 0.081699
Value of c_logl = 1363.742094,  2-norm of c_score = 0.180137
Value of c_logl = 1363.742094,  2-norm of c_score = 0.134141
Value of c_logl = 1363.742094,  2-norm of c_score = 0.100105
Value of c_logl = 1363.742094,  2-norm of c_score = 0.074967
Value of c_logl = 1363.742094,  2-norm of c_score = 0.168251
Value of c_logl = 1363.7

Value of c_logl = 1363.742084,  2-norm of c_score = 0.013948
Value of c_logl = 1363.742084,  2-norm of c_score = 0.015695
Value of c_logl = 1363.742084,  2-norm of c_score = 0.018902
Value of c_logl = 1363.742084,  2-norm of c_score = 0.014042
Value of c_logl = 1363.742084,  2-norm of c_score = 0.046866
Value of c_logl = 1363.742084,  2-norm of c_score = 0.034800
Value of c_logl = 1363.742084,  2-norm of c_score = 0.025855
Value of c_logl = 1363.742084,  2-norm of c_score = 0.019227
Value of c_logl = 1363.742084,  2-norm of c_score = 0.022026
Value of c_logl = 1363.742084,  2-norm of c_score = 0.016378
Value of c_logl = 1363.742084,  2-norm of c_score = 0.018776
Value of c_logl = 1363.742084,  2-norm of c_score = 0.013961
Value of c_logl = 1363.742084,  2-norm of c_score = 0.015933
Value of c_logl = 1363.742084,  2-norm of c_score = 0.011847
Value of c_logl = 1363.742084,  2-norm of c_score = 0.013415
Value of c_logl = 1363.742084,  2-norm of c_score = 0.016128
Value of c_logl = 1363.7

It takes some time for *dyad\_jfe\_logit()* to converge. This reflects the sparsity of the Nyakatoke network and the consequent difficulty of estimating the household degree heterogeneity effects.
<br>
<br>
Relative to the *tetrad\_logit()* results, the joint estimator suggests sorting by religion and clan. In Monte Carlo experiments I have found that the joint estimator -- particularly the bias correction step -- can work very poorly in networks like the Nyakatoke one (i.e., in networks that are sparse such that many agents have only a few links). In such settings the $A_{i}$ for $i=1,...,N$ may be _very_ imprecisely estimated and this can affect the quality of the common parameter estimates as well.
<br>
<br>
For reference we can look at the joint coefficient estimates prior to bias correction using the *print\_coef()* utility included in the **netrics** package. These point estimates are closer to the *tetrad\_logit()* ones than their bias-corrected counterparts. This suggests that, in this particular example, bias correction may be doing more harm than good.
<br>
<br>
Monte Carlo evidence suggests that in denser networks, the size distortion caused by bias in the limit distribution of the joint fixed effects estimator is very real and, furthermore, that bias correction can be effective in such contexts. So not too much should be made of this example.

In [42]:
netrics.print_coef(beta_JFE, vcov_beta_JFE, var_names=cov_names)


Independent variable       Coef.    ( Std. Err.)     (0.95 Confid. Interval )
-------------------------------------------------------------------------------------------
Distance                  -0.002531 (  0.000184)     ( -0.002891 , -0.002170)
Other blood relation       1.305248 (  0.189174)     (  0.934474 ,  1.676022)
Cousin, etc.               1.924260 (  0.244763)     (  1.444534 ,  2.403987)
Sibling, child, etc.       2.880070 (  0.273332)     (  2.344350 ,  3.415790)
Age difference            -0.013243 (  0.003642)     ( -0.020382 , -0.006103)
Same religion              0.375952 (  0.083760)     (  0.211786 ,  0.540119)
Same clan                  0.259152 (  0.138685)     ( -0.012665 ,  0.530969)
Same education level       0.073594 (  0.117280)     ( -0.156270 ,  0.303459)
Adjacent wealth class     -0.175217 (  0.100513)     ( -0.372218 ,  0.021785)
Non-adjacent wealth class -0.282326 (  0.113675)     ( -0.505124 , -0.059528)
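
The intervals in the table appear to be standard Wald intervals, that is, the coefficient plus or minus the 0.975 normal quantile times the standard error. This is inferred from the printed numbers rather than from the _netrics_ source. The sketch below reproduces two rows, using the rounded values printed above, so agreement is only up to rounding.

```python
from statistics import NormalDist

def wald_interval(coef, se, level=0.95):
    """Return (lower, upper, z-statistic) for a normal-approximation interval."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return coef - z * se, coef + z * se, coef / se

# Values copied from the "Same religion" and "Same clan" rows above
lo_rel, hi_rel, z_rel = wald_interval(0.375952, 0.083760)
lo_clan, hi_clan, z_clan = wald_interval(0.259152, 0.138685)

print(round(lo_rel, 6), round(hi_rel, 6), round(z_rel, 2))
print(round(lo_clan, 6), round(hi_clan, 6), round(z_clan, 2))
```

In these uncorrected estimates the religion interval excludes zero, while the clan interval narrowly includes it.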

------------------------------------------------

In [44]:
# This imports an attractive notebook style from Github
from IPython.display import HTML
from urllib.request import urlopen
html = urlopen('http://bit.ly/1Bf5Hft')
HTML(html.read().decode('utf-8'))