# Notebook 5: Estimating models of exogenous network formation
#### Econometric Methods for Networks
#### _Singapore Management University (SMU) May 29th to June 1st, 2017_
##### _Bryan S. Graham, UC - Berkeley, bgraham@econ.berkeley.edu_
<br>
<br>
This is the fifth in a series of IPython Jupyter notebooks designed to accompany a series of instructional lectures given at the Singapore Management University (SMU) from May 29th to June 1st, 2017. The scripts below were written for Python 2.7.12. The Anaconda distribution of Python, available at https://www.continuum.io/downloads, comes bundled with most of the scientific computing packages used in these notebooks.
<br>
<br>
For more information about the course please visit my webpage at http://bryangraham.github.io/econometrics/. This notebook also provides a basic illustration of the recently introduced _netrics_ Python package for the econometric analysis of network data. 
<br>
<br>
I begin by importing several key packages. The _numpy_ and _scipy_ libraries include a core set of scientific computing tools. The _pandas_ package is useful for data organization and analysis, while _matplotlib_ is Python's basic plotting add-on. Finally, the _networkx_ package includes functionality for analyzing and visualizing network data.

In [1]:
# Direct Python to plot all figures inline (i.e., not in a separate window)
%matplotlib inline

# Main scientific computing modules
import numpy as np
import scipy as sp
import pandas as pd

# Import matplotlib
import matplotlib.pyplot as plt

# networkx module for the analysis of network data
import networkx as nx

# Division of two integers in Python 2.7 does not return a floating point result. The default is to round down 
# to the nearest integer. The following piece of code changes the default.
from __future__ import division

Load the **netrics** package. This package is registered on [PyPI](https://pypi.python.org/pypi/netrics/), with a GitHub repository at https://github.com/bryangraham/netrics. For an informal introduction to the package see [this](http://bryangraham.github.io/econometrics/networks/2016/09/15/netrics-module.html) blog post. The blog post also includes links to additional resources.

In [3]:
# Append location of netrics module base directory to system path
# NOTE: only required if permanent install not made (see discussion in blog post referenced above)
import sys
sys.path.append('/Users/bgraham/Dropbox/Sites/software/netrics/')

# Load netrics module
import netrics as netrics


The following code snippet should be edited to point to wherever you have saved the instructional datasets for the course.

In [3]:
# Directory where datasets are located
data =  '/Users/bgraham/Dropbox/Teaching/Short_Courses/St_Gallen/Data/'

A basic dyadic dataset for Nyakatoke was constructed in the Lecture 1 notebook. I begin by loading this dataset into a pandas dataframe.

In [4]:
# Read in estimation sample created in the notebook for Lecture 1
es = pd.read_csv(data+"Created/Nyakatoke_Estimation_Sample.csv")
es = es.set_index(['hh1', 'hh2'], drop = False)           # Set dataframe multi-index
del es['Unnamed: 0']                                      # Delete first column which is an unneeded single index

# Print out the first few rows of the dyadic data
es.head()

| hh1 (index) | hh2 (index) | hh1 | hh2 | links | kinship | distance | clan1 | clan2 | wealth1 | wealth2 | religion1 | religion2 | education1 | education2 | head_age1 | head_age2 | head_sex1 | head_sex2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 1 | 2 | no link | other blood relation | 91.199997 | 21.0 | 6.0 | 17.657388 | 6.639588 | Catholic | Catholic | started primary | finished primary | 75.0 | 30.0 | male | male |
| 1 | 3 | 1 | 3 | no link | no blood relation | 69.599998 | 21.0 | 21.0 | 17.657388 | 1.538235 | Catholic | Catholic | started primary | secondary | 75.0 | 48.0 | male | female |
| 1 | 4 | 1 | 4 | unilateral link | no blood relation | 199.199997 | 21.0 | 23.0 | 17.657388 | 2.743059 | Catholic | Catholic | started primary | finished primary | 75.0 | 80.0 | male | male |
| 1 | 5 | 1 | 5 | no link | no blood relation | 252.0 | 21.0 | 23.0 | 17.657388 | 1.976098 | Catholic | Catholic | started primary | finished primary | 75.0 | 35.0 | male | male |
| 1 | 6 | 1 | 6 | no link | no blood relation | 213.600006 | 21.0 | 23.0 | 17.657388 | 1.394941 | Catholic | Catholic | started primary | finished primary | 75.0 | 28.0 | male | male |


Below we fit a dyadic link formation model using the two estimators introduced by Graham (2014, _NBER_). The link formation model studied in that paper is


<div> $$\Pr\left(\left.D_{ij}=1\right|\mathbf{X},\mathbf{A}\right) = \frac{\exp\left(\sum_{k=1}^{K}W_{k,ij}\beta_{k}+A_{i}+A_{j}\right)}   {1+\exp\left(\sum_{k=1}^{K}W_{k,ij}\beta_{k}+A_{i}+A_{j}\right)}$$ </div>

Here $\mathbf{X}$ denotes all the household-specific observed covariates in the network, $W_{k,ij} = W_{k,ji}$ are dyad-specific covariates constructed from $\mathbf{X}$, and the $A_i$ for $i=1,...,N$ are household-specific _degree heterogeneity_ parameters. These are treated as so-called "fixed effects".
<br>
<br>
Graham (2014) introduces two estimators. The "tetrad logit" procedure uses conditioning arguments to form a criterion function which is invariant to the fixed effects. The "joint fixed effects" procedure instead estimates the individual effects along with the common parameters.
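
The conditioning argument can be illustrated numerically. The sketch below is an illustration only, not the *netrics* implementation: the function names and all numerical values are hypothetical. For four households $i, j, k, l$ it compares the configuration in which $ij$ and $kl$ are linked but $ik$ and $jl$ are not with its mirror image, under the logit model above. Each $A$ enters the numerator and the denominator once, so the odds depend only on the covariate indices.

```python
import math

def link_prob(w_beta, a_i, a_j):
    """Logit link probability for one dyad; w_beta is sum_k W_k,ij * beta_k."""
    return 1.0 / (1.0 + math.exp(-(w_beta + a_i + a_j)))

def tetrad_odds(wb, A):
    """Odds of {ij, kl linked; ik, jl not linked} versus the mirror image.

    wb maps dyad labels to covariate indices; A = (A_i, A_j, A_k, A_l).
    """
    a_i, a_j, a_k, a_l = A
    p_ij = link_prob(wb['ij'], a_i, a_j)
    p_kl = link_prob(wb['kl'], a_k, a_l)
    p_ik = link_prob(wb['ik'], a_i, a_k)
    p_jl = link_prob(wb['jl'], a_j, a_l)
    num = p_ij * p_kl * (1 - p_ik) * (1 - p_jl)
    den = (1 - p_ij) * (1 - p_kl) * p_ik * p_jl
    return num / den

# Hypothetical covariate indices and two very different sets of fixed effects
wb = {'ij': 0.4, 'kl': -0.2, 'ik': 0.1, 'jl': -0.5}
odds_a = tetrad_odds(wb, (-2.0, -1.0, 0.5, 1.5))
odds_b = tetrad_odds(wb, (3.0, -4.0, 0.0, 2.0))

# The fixed effects cancel: both odds equal exp(W_ij + W_kl - W_ik - W_jl)
expected = math.exp(wb['ij'] + wb['kl'] - wb['ik'] - wb['jl'])
print(odds_a, odds_b, expected)
```

The three printed numbers agree up to floating point error, whatever values are chosen for the four fixed effects.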
<br>
<br>
In the _netrics_ package these two procedures are operationalized by the *tetrad\_logit()* and *dyad\_jfe\_logit()* functions. These two functions require the user to input data in a very particular way. The outcome is $D$, the $N \times N$ adjacency matrix. The regressors are included in a length $K$ list where each element is an $N \times N $ numpy 2d array $W_{k}$ with $(i,j)$ element equal to the dyadic covariate $W_{k,ij}$. It takes a bit of work to get our Nyakatoke dataset into this form. This is what the next few snippets of code do.

In [5]:
hh_set = set(es['hh1'].unique()) | set(es['hh2'].unique()) # Set of all households
N = len(hh_set)                                            # Number of households
n = N * (N - 1) //2                                        # Number of dyads

In [6]:
# Get multi-indices for lower triangle of N x N matrix
ij_LowTri = np.tril_indices(N, -1)

In [7]:
# Form adjacency matrix as a dense 2d numpy array
D = np.zeros((N,N), dtype = bool)    # builtin bool; the np.bool alias is removed in recent numpy releases
D[ij_LowTri] = (es['links'] != 'no link')
D = (D + D.T)*1
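
The assignment above is positional: `np.tril_indices(N, -1)` enumerates the lower triangle row by row, i.e. (1,0), (2,0), (2,1), (3,0), ..., and the k-th row of `es` is written to the k-th of those positions. It is worth confirming that the rows of `es` are stored in exactly that order. The sketch below shows an alternative fill that does not depend on row order, since each dyad is placed using its own household identifiers. The toy data and the names `ids`, `pos` and `D_toy` are hypothetical.

```python
import numpy as np

# Toy dyad list (hypothetical data); household IDs need not be contiguous
hh1   = [1, 1, 1, 3, 3, 4]
hh2   = [3, 4, 7, 4, 7, 7]
links = [0, 1, 0, 0, 1, 1]

# Map each household ID to a row/column position 0, ..., N-1
ids = sorted(set(hh1) | set(hh2))
pos = {h: p for p, h in enumerate(ids)}
N_toy = len(ids)

rows = np.array([pos[h] for h in hh1])
cols = np.array([pos[h] for h in hh2])

# Fill entry (i, j) for each dyad, then mirror it to keep the matrix symmetric
D_toy = np.zeros((N_toy, N_toy), dtype=int)
D_toy[rows, cols] = links
D_toy[cols, rows] = links
print(D_toy)
```

The same `rows` and `cols` arrays could be reused to fill each covariate matrix, so that the adjacency matrix and the regressors are guaranteed to share one ordering of households.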

In [8]:
# Form four categorical bins for wealth (same as used in plot of Network from Lecture 1)
wealth_bkts = [1000000, 6, 3, 1.5, 0]
wealth_bin_labels = ['+600K','(600K,300K]','(300K,150K]','(150K, 0K]']
es['wealth_cat1'] = np.digitize(es.wealth1, wealth_bkts)
es['wealth_cat2'] = np.digitize(es.wealth2, wealth_bkts)
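
Because the bin edges in `wealth_bkts` are listed in decreasing order, `np.digitize` returns category 1 for the wealthiest households and category 4 for the poorest. A small check, using the wealth values that appear in the first rows of the dataset printed earlier:

```python
import numpy as np

wealth_bkts = [1000000, 6, 3, 1.5, 0]

# Wealth values taken from the first few rows of the estimation sample
wealth = np.array([17.657388, 6.639588, 2.743059, 1.538235, 1.394941])

# With decreasing edges, category c satisfies wealth_bkts[c-1] > x >= wealth_bkts[c]
wealth_cat = np.digitize(wealth, wealth_bkts)
print(wealth_cat)   # [1 1 3 3 4]
```

Two households are then in "adjacent" wealth classes when their categories differ by exactly one, and in "non-adjacent" classes when they differ by more than one.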

Construct the length-$K$ list of $N \times N$ dyad-specific covariate matrices.

In [9]:
cov_names = ['Distance','Other blood relation', 'Cousin, etc.', 'Sibling, child, etc.', 'Age difference', \
             'Same religion', 'Same clan', 'Same education level', 'Adjacent wealth class', \
             'Non-adjacent wealth class']

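# Initialize K-list of dyadic covariate matrices. The list comprehension creates K separate
# arrays; [array]*K would instead repeat K references to one and the same array.
W = [np.zeros((N,N), dtype = float) for _ in cov_names]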

# Distance between households
W[0][ij_LowTri] = es['distance']
W[0] = W[0] + W[0].T

# Whether two households have an "other blood relation" tie
W[1][ij_LowTri] = (es['kinship']=='other blood relation')*1
W[1] = W[1] + W[1].T

# Whether two households have a nephew etc. level blood relation
W[2][ij_LowTri] = (es['kinship']=='Nephew/niece/uncle/aunt,cousin,grandparent/grandchild')*1
W[2] = W[2] + W[2].T    

# Whether two households have a child etc. level blood relation
W[3][ij_LowTri] = (es['kinship']=='sibling, child, parent')*1
W[3] = W[3] + W[3].T  

# Absolute difference in age between two household heads
W[4][ij_LowTri] = np.abs(es['head_age1']-es['head_age2'])
W[4] = W[4] + W[4].T

# Whether two households are of the same religion
W[5][ij_LowTri] = (es['religion1'] == es['religion2'])*1
W[5] = W[5] + W[5].T

# Whether two households belong to the same clan
W[6][ij_LowTri] = (es['clan1'] == es['clan2'])*1
W[6] = W[6] + W[6].T

# Whether two households are of the same education level
W[7][ij_LowTri] = (es['education1'] == es['education2'])*1
W[7] = W[7] + W[7].T

# Whether two households are in adjacent wealth classes
W[8][ij_LowTri] = (np.abs(es['wealth_cat1'] - es['wealth_cat2']) == 1)*1
W[8] = W[8] + W[8].T

# Whether two households are in non-adjacent (distant) wealth classes
W[9][ij_LowTri] = (np.abs(es['wealth_cat1'] - es['wealth_cat2']) > 1)*1
W[9] = W[9] + W[9].T
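
Since the estimation functions expect symmetric $N \times N$ inputs, a quick sanity check before estimation can catch construction mistakes. The helper below is a hypothetical convenience function written for this illustration; it is not part of _netrics_.

```python
import numpy as np

def check_dyadic_inputs(D, W):
    """Return True if D and every W[k] are symmetric N x N arrays
    and D has a zero diagonal (hypothetical helper, not part of netrics)."""
    N = D.shape[0]
    W_ok = all(Wk.shape == (N, N) and np.allclose(Wk, Wk.T) for Wk in W)
    D_ok = D.shape == (N, N) and np.array_equal(D, D.T) and not np.diag(D).any()
    return W_ok and D_ok

# Toy inputs (hypothetical values)
D_toy = np.array([[0, 1, 0],
                  [1, 0, 1],
                  [0, 1, 0]])
W_toy = [np.array([[0.0, 2.0, 5.0],
                   [2.0, 0.0, 3.0],
                   [5.0, 3.0, 0.0]])]

ok = check_dyadic_inputs(D_toy, W_toy)
bad = check_dyadic_inputs(D_toy, [np.triu(W_toy[0])])   # upper triangle only
print(ok, bad)   # True False
```

In the notebook the analogous call would be `check_dyadic_inputs(D, W)`.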

Save the adjacency matrix and the covariate matrices to an uncompressed file (this serves as a test dataset for the *tetrad\_logit()* and *dyad\_jfe\_logit()* functions). This file can be downloaded from GitHub at https://github.com/bryangraham/netrics/blob/master/Notebooks/Nyakatoke_Example.npz.

In [10]:
np.savez(data+"Created/Nyakatoke_Example", D = D, Distance = W[0], OtherBlood = W[1], Cousin = W[2], \
                                           Sibling = W[3], AgeDiff = W[4], SameReligion = W[5], \
                                           SameClan = W[6], SameEdLvl = W[7], AdjacentWealthClass = W[8], \
                                           NonAdjacentWealthClass = W[9])
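
An archive written by `np.savez` can be read back with `np.load`, with each array retrieved by the keyword name used when saving. The round trip below is a minimal sketch on toy arrays; the file name and the array values are hypothetical.

```python
import os
import tempfile
import numpy as np

# Toy arrays standing in for the adjacency and distance matrices
D_toy = np.array([[0, 1], [1, 0]])
W_toy = np.array([[0.0, 2.5], [2.5, 0.0]])

# np.savez appends the ".npz" extension to the file name
path = os.path.join(tempfile.mkdtemp(), "Toy_Example")
np.savez(path, D=D_toy, Distance=W_toy)

# Arrays are looked up by the keyword names used in np.savez
with np.load(path + ".npz") as archive:
    names = sorted(archive.files)
    D_back = archive["D"]
    W_back = [archive["Distance"]]

print(names)
```

For the Nyakatoke file, the list `W` would be rebuilt by reading the ten covariate arrays in the same order as `cov_names`.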

## Tetrad Logit
With $W$ appropriately constructed we can now fit a link formation model using *tetrad\_logit()*. We can get a sense of how the function works by viewing its help header.

In [11]:
help(netrics.tetrad_logit)

Help on function tetrad_logit in module netrics.tetrad_logit:

tetrad_logit(D, W, dtcon=None, silent=False, W_names=None)
    AUTHOR: Bryan S. Graham, bgraham@econ.berkeley.edu, June 2016
    
    This function computes the Tetrad Logit estimator introduced in Graham (2014, NBER No. 20341) -- "An Econometric
    Model of Link Formation with Degree Heterogeneity". The implementation is as described in the paper. Notation
    attempts to follow that used in the paper.
    
    INPUTS
    ------
    D                 : N x N undirected adjacency matrix
    W                 : List with elements consisting of N x N 2d numpy arrays of dyad-specific 
                        covariates such that W[k][i,j] gives the k-th covariate for dyad ij
    dtcon             : Dyad and tetrad concordance (dtcon) List with elements [tetrad_to_dyads_indices, 
                        dyad_to_tetrads_dict]. If dtcon=None, then construct it using generate_tetrad_indices() 
                        function. Se

Finally we call *tetrad\_logit()*. Depending on the speed and memory of your computer, this next bit of code may take a few minutes to complete.

In [12]:
[beta_TL, vcov_beta_TL, tetrad_frac_TL, success] = \
    netrics.tetrad_logit(D, W, dtcon=None, silent=False, W_names=cov_names)

Fisher-Scoring Derivative check (2-norm): 1.05880494
Value of -logL = 69158.188469,  2-norm of score = 5050098.146665
Value of -logL = 67593.624111,  2-norm of score = 798109.125689
Value of -logL = 67538.094518,  2-norm of score = 227678.840354
Value of -logL = 67033.773228,  2-norm of score = 180386.454584
Value of -logL = 66303.957241,  2-norm of score = 97861.543728
Value of -logL = 66113.734802,  2-norm of score = 45086.126082
Value of -logL = 59607.729590,  2-norm of score = 387085.401838
Value of -logL = 59595.163177,  2-norm of score = 14576.033822
Value of -logL = 57635.344150,  2-norm of score = 50080.210276
Value of -logL = 57635.115803,  2-norm of score = 10900.994775
Value of -logL = 55435.049721,  2-norm of score = 64548.389938
Value of -logL = 55434.672762,  2-norm of score = 8406.106366
Value of -logL = 54462.006475,  2-norm of score = 14238.250795
Value of -logL = 52751.424449,  2-norm of score = 55380.340383
Value of -logL = 52751.136088,  2-norm of score = 5782.19725

The program spits out some useful information. In a network with $N = 115$ agents there are a total of $6{,}913{,}340$ different tetrads! Of these only $102{,}151$, or about 1.5 percent, actually contribute to the *tetrad\_logit()* criterion function. The effectiveness of the procedure in the context of sparse networks is one of its theoretical and practical attractions.
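
The tetrad count is simply the number of ways of choosing four agents out of $N$, which is easy to verify directly. The function name below is hypothetical; the count of contributing tetrads is the figure quoted in the text.

```python
def n_choose_4(n):
    """Number of distinct sets of four agents among n."""
    return n * (n - 1) * (n - 2) * (n - 3) // 24

total_tetrads = n_choose_4(115)
contributing = 102151                        # figure reported in the text
share = 100.0 * contributing / total_tetrads

print(total_tetrads)      # 6913340
print(round(share, 2))    # 1.48
```

Because the count grows with the fourth power of $N$, doubling the number of agents multiplies the number of tetrads by roughly sixteen.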
<br>
<br>
The results suggest that ties are more frequent between households which are physically close, related by blood and where the household heads are close in age. There is little evidence for sorting by religion, clan or education. There is some evidence for heterophily in terms of wealth (perhaps consistent with some models of "mutual support"), but it is rather weak.
<br>
<br>
The *tetrad\_logit()* estimation procedure is computationally intensive. On a modern desktop machine it would be difficult to use the procedure on a network larger than a few hundred agents. Fortunately there are many possible ways to apply the method at scale via various approximations and parallelizations, but these extensions are not yet part of the _netrics_ package.
<br>
<br>
Part of the computational intensity is up front. Specifically, the function computes a detailed dictionary which maps dyads-to-tetrads and vice-versa. This bookkeeping overhead makes later calculations much quicker. A user who anticipates fitting several different models to the same adjacency matrix can save considerable time by computing this concordance first (once and for all) and then passing it to *tetrad\_logit()* via the _dtcon_ parameter. This can be done by calling the *generate\_tetrad\_indices()* function.
<br>
<br>
To illustrate I create the concordance and then fit a model with only the distance and kinship variables included.

In [13]:
N = np.shape(D)[0]
concordance = netrics.generate_tetrad_indices(N, full_set=True)
[beta_TL, vcov_beta_TL, tetrad_frac_TL, success] = \
    netrics.tetrad_logit(D, W[0:4], dtcon=concordance, silent=False, W_names=cov_names[0:4])

Fisher-Scoring Derivative check (2-norm): 1.02942103
Value of -logL = 69163.219831,  2-norm of score = 5045814.866051
Value of -logL = 67604.839126,  2-norm of score = 772959.869305
Value of -logL = 67557.021302,  2-norm of score = 30472.352890
Value of -logL = 60570.400551,  2-norm of score = 312602.514426
Value of -logL = 60562.290006,  2-norm of score = 9904.763394
Value of -logL = 55914.238906,  2-norm of score = 204713.498450
Value of -logL = 55910.626898,  2-norm of score = 6252.561744
Value of -logL = 53745.637844,  2-norm of score = 62366.284082
Value of -logL = 53745.287551,  2-norm of score = 4450.443068
Value of -logL = 52853.901035,  2-norm of score = 13360.292597
Value of -logL = 52853.883284,  2-norm of score = 3931.551890
Value of -logL = 52132.409478,  2-norm of score = 9742.220437
Value of -logL = 52132.399571,  2-norm of score = 3371.585318
Value of -logL = 51720.042469,  2-norm of score = 4052.606975
Value of -logL = 50991.862745,  2-norm of score = 17599.149968
Valu

## Joint Fixed Effects Logit
Graham (2014) also introduces a joint fixed effects estimator for link formation. This procedure estimates the coefficients on $W_{k,ij}$ for $k=1,...,K$ as well as the individual-specific degree heterogeneity parameters $A_{i}$ for $i=1,...,N$. The *dyad\_jfe\_logit()* function implements this estimator. By default it reports the iterated bias-corrected estimates described in the paper (but the uncorrected estimates are also returned by the function).
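
To make the object being maximized concrete, the sketch below evaluates the negative log-likelihood of the logit model with fixed effects for given values of $\beta$ and $A$. It is a plain restatement of the model under the input conventions described earlier, not the algorithm used inside *dyad\_jfe\_logit()*; the function name and toy inputs are hypothetical.

```python
import numpy as np

def jfe_neg_loglik(beta, A, D, W):
    """Negative log-likelihood of the dyadic logit with fixed effects.

    beta : length-K coefficient vector
    A    : length-N vector of degree heterogeneity parameters
    D    : N x N symmetric 0/1 adjacency matrix
    W    : list of K symmetric N x N covariate matrices
    """
    A = np.asarray(A, dtype=float)
    N = D.shape[0]
    # Index W_ij'beta + A_i + A_j for every pair (i, j)
    index = sum(b * Wk for b, Wk in zip(beta, W)) + A[:, None] + A[None, :]
    i, j = np.triu_indices(N, 1)              # count each dyad once
    idx, d = index[i, j], D[i, j]
    # Logit log-likelihood contribution: d * index - log(1 + exp(index))
    return -np.sum(d * idx - np.logaddexp(0.0, idx))

# Toy check (hypothetical values): with beta = 0 and A = 0 every link
# probability is 1/2, so each of the 3 dyads contributes log(2)
D_toy = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
W_toy = [np.zeros((3, 3))]
value = jfe_neg_loglik([0.0], np.zeros(3), D_toy, W_toy)
print(value, 3 * np.log(2))
```

The joint estimator chooses $\beta$ and all $N$ elements of $A$ to minimize this function, which is why sparse networks, where many households have few links, are hard for it.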

In [14]:
[beta_JFE, beta_JFE_BC, vcov_beta_JFE, A_JFE, success] = \
    netrics.dyad_jfe_logit(D, W, T=None, silent=False, W_names=cov_names, beta_sv=None)

-------------------------------------------------------------------------------------------
- COMPUTE JOINT FIXED EFFECT MLEs                                                         -
-------------------------------------------------------------------------------------------
Value of c_logl = 1573.908746,  2-norm of c_score = 62611.046459
Value of c_logl = 1547.376191,  2-norm of c_score = 48351.980722
Value of c_logl = 1529.569153,  2-norm of c_score = 37055.119814
Value of c_logl = 1517.970321,  2-norm of c_score = 28245.623208
Value of c_logl = 1510.616150,  2-norm of c_score = 21451.168506
Value of c_logl = 1506.059569,  2-norm of c_score = 16250.801500
Value of c_logl = 1503.289044,  2-norm of c_score = 12291.742962
Value of c_logl = 1501.628945,  2-norm of c_score = 9289.106809
Value of c_logl = 1500.644577,  2-norm of c_score = 7018.144552
Value of c_logl = 1500.064492,  2-norm of c_score = 5304.237016
Value of c_logl = 1499.723095,  2-norm of c_score = 4013.119659
Value of c_lo

Value of c_logl = 1403.364898,  2-norm of c_score = 701.799618
Value of c_logl = 1403.358569,  2-norm of c_score = 525.226523
Value of c_logl = 1403.354941,  2-norm of c_score = 394.351074
Value of c_logl = 1403.352799,  2-norm of c_score = 297.647854
Value of c_logl = 1403.332252,  2-norm of c_score = 381.059587
Value of c_logl = 1403.330270,  2-norm of c_score = 287.329013
Value of c_logl = 1403.311340,  2-norm of c_score = 370.918453
Value of c_logl = 1403.309483,  2-norm of c_score = 279.351951
Value of c_logl = 1403.308351,  2-norm of c_score = 211.958731
Value of c_logl = 1403.267490,  2-norm of c_score = 433.557738
Value of c_logl = 1403.265070,  2-norm of c_score = 324.677151
Value of c_logl = 1403.263673,  2-norm of c_score = 244.012063
Value of c_logl = 1403.262839,  2-norm of c_score = 184.480102
Value of c_logl = 1403.233461,  2-norm of c_score = 390.826856
Value of c_logl = 1403.231516,  2-norm of c_score = 292.192413
Value of c_logl = 1403.230408,  2-norm of c_score = 219

Value of c_logl = 1364.910830,  2-norm of c_score = 17.844346
Value of c_logl = 1364.910538,  2-norm of c_score = 34.808475
Value of c_logl = 1364.910522,  2-norm of c_score = 25.989251
Value of c_logl = 1364.910513,  2-norm of c_score = 19.470268
Value of c_logl = 1364.910189,  2-norm of c_score = 39.891750
Value of c_logl = 1364.910169,  2-norm of c_score = 29.694750
Value of c_logl = 1364.910158,  2-norm of c_score = 22.141108
Value of c_logl = 1364.910151,  2-norm of c_score = 16.555197
Value of c_logl = 1364.910148,  2-norm of c_score = 12.439619
Value of c_logl = 1364.338427,  2-norm of c_score = 121.214121
Value of c_logl = 1364.338245,  2-norm of c_score = 90.087699
Value of c_logl = 1364.338145,  2-norm of c_score = 66.999484
Value of c_logl = 1364.338089,  2-norm of c_score = 49.878438
Value of c_logl = 1364.338058,  2-norm of c_score = 37.189077
Value of c_logl = 1364.338040,  2-norm of c_score = 27.793946
Value of c_logl = 1364.338030,  2-norm of c_score = 20.852291
Value o

Value of c_logl = 1363.776352,  2-norm of c_score = 1.939775
Value of c_logl = 1363.776352,  2-norm of c_score = 1.454599
Value of c_logl = 1363.776352,  2-norm of c_score = 1.097949
Value of c_logl = 1363.776351,  2-norm of c_score = 2.143239
Value of c_logl = 1363.776350,  2-norm of c_score = 1.600539
Value of c_logl = 1363.776350,  2-norm of c_score = 1.199766
Value of c_logl = 1363.776349,  2-norm of c_score = 2.516224
Value of c_logl = 1363.776349,  2-norm of c_score = 1.872583
Value of c_logl = 1363.776349,  2-norm of c_score = 1.396016
Value of c_logl = 1363.776349,  2-norm of c_score = 1.043754
Value of c_logl = 1363.776349,  2-norm of c_score = 0.784316
Value of c_logl = 1363.774135,  2-norm of c_score = 5.188363
Value of c_logl = 1363.774135,  2-norm of c_score = 3.855379
Value of c_logl = 1363.774135,  2-norm of c_score = 2.867042
Value of c_logl = 1363.774135,  2-norm of c_score = 2.134509
Value of c_logl = 1363.774135,  2-norm of c_score = 1.591945
Value of c_logl = 1363.7

Value of c_logl = 1363.772435,  2-norm of c_score = 0.112648
Value of c_logl = 1363.772435,  2-norm of c_score = 0.083926
Value of c_logl = 1363.772428,  2-norm of c_score = 0.299561
Value of c_logl = 1363.772428,  2-norm of c_score = 0.222581
Value of c_logl = 1363.772428,  2-norm of c_score = 0.165507
Value of c_logl = 1363.772428,  2-norm of c_score = 0.123210
Value of c_logl = 1363.772428,  2-norm of c_score = 0.091887
Value of c_logl = 1363.772428,  2-norm of c_score = 0.068726
Value of c_logl = 1363.772424,  2-norm of c_score = 0.176212
Value of c_logl = 1363.772424,  2-norm of c_score = 0.131116
Value of c_logl = 1363.772424,  2-norm of c_score = 0.097704
Value of c_logl = 1363.772424,  2-norm of c_score = 0.072975
Value of c_logl = 1363.772424,  2-norm of c_score = 0.100159
Value of c_logl = 1363.772424,  2-norm of c_score = 0.074692
Value of c_logl = 1363.772424,  2-norm of c_score = 0.104334
Value of c_logl = 1363.772424,  2-norm of c_score = 0.077687
Value of c_logl = 1363.7

It takes some time for *dyad\_jfe\_logit()* to converge. This reflects the sparsity of the Nyakatoke network and the consequent difficulty of estimating the household degree heterogeneity effects.
<br>
<br>
Relative to the *tetrad\_logit()* results, the joint estimator suggests sorting by religion and clan. In Monte Carlo experiments I have found that the joint estimator -- particularly the bias correction step -- can work very poorly in networks like the Nyakatoke one (i.e., in networks that are sparse such that many agents have only a few links). In such settings the $A_{i}$ for $i=1,...,N$ may be _very_ imprecisely estimated and this can affect the quality of the common parameter estimates as well.
<br>
<br>
For reference we can look at the joint coefficient estimates prior to bias correction using the *print\_coef()* utility included in the **netrics** package. These point estimates are closer to the *tetrad\_logit()* ones than their bias-corrected counterparts. This suggests that, in this particular example, bias correction may be doing more harm than good.
<br>
<br>
Monte Carlo evidence suggests that in denser networks, the size distortion caused by bias in the limit distribution of the joint fixed effects estimator is very real and, furthermore, that bias correction can be effective in such contexts. So not too much should be made of this example.

In [15]:
netrics.print_coef(beta_JFE, vcov_beta_JFE, var_names=cov_names)


Independent variable       Coef.    ( Std. Err.) 
-------------------------------------------------------------------------------------------
Distance                  -0.002531 (  0.000190)
Other blood relation       1.308501 (  0.190788)
Cousin, etc.               1.925634 (  0.245518)
Sibling, child, etc.       2.893077 (  0.270260)
Age difference            -0.013740 (  0.003730)
Same religion              0.375629 (  0.086042)
Same clan                  0.264691 (  0.141581)
Same education level      -0.062930 (  0.110033)
Adjacent wealth class     -0.176241 (  0.103700)
Non-adjacent wealth class -0.298087 (  0.116544)

-------------------------------------------------------------------------------------------
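
As a rough reading aid, the ratio of each point estimate to its standard error can be computed from the table above. The numbers are copied from that output; the variable names are hypothetical. Given the concerns about the joint estimator in sparse networks discussed earlier, these ratios should be treated as descriptive rather than as reliable test statistics.

```python
# Point estimates and standard errors copied from the table above
results = [
    ('Distance',                  -0.002531, 0.000190),
    ('Other blood relation',       1.308501, 0.190788),
    ('Cousin, etc.',               1.925634, 0.245518),
    ('Sibling, child, etc.',       2.893077, 0.270260),
    ('Age difference',            -0.013740, 0.003730),
    ('Same religion',              0.375629, 0.086042),
    ('Same clan',                  0.264691, 0.141581),
    ('Same education level',      -0.062930, 0.110033),
    ('Adjacent wealth class',     -0.176241, 0.103700),
    ('Non-adjacent wealth class', -0.298087, 0.116544),
]

# Ratio of coefficient to standard error for each regressor
z_stats = {name: coef / se for name, coef, se in results}

for name, coef, se in results:
    print('%-26s %8.2f' % (name, z_stats[name]))
```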


In [16]:
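# This imports an attractive notebook style from GitHub
from IPython.core.display import HTML
try:
    import urllib2                        # Python 2.7
except ImportError:
    import urllib.request as urllib2      # Python 3 fallback
HTML(urllib2.urlopen('http://bit.ly/1Bf5Hft').read().decode('utf-8'))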