In [1]:
from IPython.display import HTML

def css_styling():
    # Load the notebook's custom stylesheet and close the file afterwards
    with open("../data/www/styles/custom.css", "r") as f:
        styles = f.read()
    return HTML(styles)
css_styling()

# The simultaneous benefit and drawback of networks

A basic regression framework assumes that each data point is independent of the others: there is an error term, but it must be randomly distributed for the modeled relationship to be valid.

Network analysis provides a framework for explicitly modeling the relationships between data points, in order to understand the impact of these unevenly distributed connections.

However, this emphasis makes traditional statistical tests largely inapplicable since independence assumptions do not hold. 

# Null models

We tackle this difficulty not by discarding statistical testing, but by using null models to assess the significance of individual or system-level attributes.

Null models require that we fix some attributes of the system and randomize the rest. By bootstrapping hundreds to thousands of synthetic datasets, we can then check whether the observed quantity differs significantly from the distribution of synthetic values.
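The recipe above can be sketched on a toy, non-network problem. This is only an illustration under stated assumptions: the data are made up, the function names `mean_difference` and `null_distribution` are hypothetical, and the randomization used here is a simple shuffle of group labels (the values are the fixed attribute, the labels are the randomized one).

```python
import random
import statistics

def mean_difference(values, labels):
    """Difference in mean value between group 1 and group 0."""
    group1 = [v for v, g in zip(values, labels) if g == 1]
    group0 = [v for v, g in zip(values, labels) if g == 0]
    return statistics.mean(group1) - statistics.mean(group0)

def null_distribution(values, labels, n_synthetic=1000, seed=0):
    """Keep the values fixed, shuffle the labels, and recompute the statistic."""
    rng = random.Random(seed)
    shuffled = list(labels)
    null_values = []
    for _ in range(n_synthetic):
        rng.shuffle(shuffled)
        null_values.append(mean_difference(values, shuffled))
    return null_values

# Hypothetical toy data, for illustration only
values = [2.1, 2.5, 1.9, 2.3, 3.8, 4.1, 3.6, 4.0]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

observed = mean_difference(values, labels)
null_values = null_distribution(values, labels)

# Fraction of synthetic datasets at least as extreme as the observation
fraction_as_extreme = sum(v >= observed for v in null_values) / len(null_values)
print(observed, fraction_as_extreme)
```

If only a small fraction of the synthetic datasets reach the observed value, the observation is unlikely under the null model. The same logic carries over to networks once the shuffle is replaced by a network randomization.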

# Classical example

A classical example of this kind of random sampling is estimating the value of pi. The definition of pi is

$\pi =\frac{C}{d}$

where $C$ is the circumference of a circle and $d$ is its diameter. It follows that a circle of radius $r$ has area $\pi r^2$, so a quarter of the unit circle has area $\pi / 4$. If we draw points uniformly at random in the unit square, the fraction that falls inside the quarter circle is therefore an estimate of $\pi / 4$.

In [2]:
import math
import numpy as np

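def estimate_pi(n_attempts=10000):
    """ Estimate pi / 4 as the fraction of random points inside the unit quarter circle """
    count_successes = 0
    for _ in range(n_attempts):
        # Draw a point uniformly at random in the unit square
        x_rand = np.random.random()
        y_rand = np.random.random()
        dist = math.sqrt(x_rand**2 + y_rand**2)
        # Points within distance 1 of the origin lie inside the quarter circle
        if dist <= 1.:
            count_successes += 1
    return float(count_successes) / n_attempts


# Exact value of pi / 4, followed by the estimate from only 10 attempts
print( math.pi / 4. )
print( estimate_pi(10) )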

0.7853981633974483
0.8


Close, but not close enough. However, as we increase the number of attempts, the estimate converges to the actual value of $\pi / 4$ (multiplying it by 4 gives the estimate of pi).

In [3]:
for nattempts in [10, 100, 1000, 10000, 100000, 1000000]:
    print(nattempts, estimate_pi(n_attempts = nattempts))

10 0.8
100 0.78
1000 0.788
10000 0.7835
100000 0.78569
1000000 0.785179
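
The convergence in the output above is slow. As a rough sketch, assuming each attempt is an independent hit-or-miss draw with success probability $\pi / 4$, the standard error of the estimated fraction is $\sqrt{p(1-p)/n}$. The helper name `expected_standard_error` is hypothetical and used only for illustration.

```python
import math

def expected_standard_error(n_attempts, p=math.pi / 4.0):
    """Standard error of a fraction of n_attempts independent hits with probability p."""
    return math.sqrt(p * (1.0 - p) / n_attempts)

for n_attempts in [10, 100, 1000, 10000, 100000, 1000000]:
    print(n_attempts, expected_standard_error(n_attempts))
```

Under this assumption, the error shrinks by a factor of 10 for every 100-fold increase in the number of attempts, which is why each additional digit of precision is expensive.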


# Applications to networks

The first application that we will look at is whether a network is more modular than would be expected by chance. Initially, it was assumed that the modularity of real graphs was related to the evolution of the network (i.e. selection by agents). However, it was later shown that randomly generated graphs (with no built-in structure) can also be modular, which raises the question of how modular a real network really is (and how much that matters).

For simplicity, we will treat the network as unweighted.

In [6]:
#Don't execute, will give an error
import os
import networkx as nx
from subprocess import call

# Run netcarto on the observed network (output goes to uw_got.mod)
call(["netcarto", "-f", '../data/got/got.edges', '-o', '../data/got/uw_got.mod', '-c 0.900'])
#Run through the swaps
got = nx.read_edgelist('../data/got/got.edges', data=(('weight', int), ))
for i in range(100):
    # One directory per randomized network
    fpath = '../data/got/{0}'.format(i)
    os.makedirs(fpath, exist_ok=True)
    # Degree-preserving randomization: swap edge endpoints on a copy of the network
    temp_got = got.copy()
    nx.double_edge_swap(temp_got, nswap = temp_got.number_of_nodes(), max_tries = 1000)
    netpath = os.path.join(fpath, 'got.edges')
    nx.write_edgelist(temp_got, netpath)
    # Run netcarto on the randomized network
    outpath = os.path.join(fpath, 'got.mod')
    call(["netcarto", "-f", netpath, '-o', outpath, '-c 0.900'])
    

In [None]:
#Exercise
#Write a function for modularity


Imperfect, but good enough for our purposes. In real research, the random values would form a smoother Gaussian distribution (which requires more iterations), but the shape is already clear at this point.

Given this, we can calculate the z-score of the observed value relative to the random values to determine statistical significance.
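
Written out, with $Q_{obs}$ denoting the observed modularity and $\mu_{rand}$ and $\sigma_{rand}$ denoting the mean and standard deviation of the modularity values from the randomized networks (these symbols are introduced here for illustration only), the z-score is

$z = \frac{Q_{obs} - \mu_{rand}}{\sigma_{rand}}$

Assuming the random values are approximately Gaussian, $|z| > 1.96$ corresponds to a two-sided significance level of 0.05.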

In [54]:
import numpy as np
# obs: modularity of the observed network
# randmods: modularity values of the randomized networks
z = (obs - np.mean(randmods)) / np.std(randmods)
print('Z-score: ', z)

Z-score:  17.9031048538


We can now see that the observed network is significantly more modular than would be expected by chance.

It is important to remember that in this network we fixed the degree of each node. So this modularity is not due to a change in the connectivity profile of any one node (the degree distribution is exactly the same); instead, it is because of **who connects to whom**.
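
To make the fixed-degree property concrete, here is a minimal pure-Python sketch of a degree-preserving double edge swap. It is written for illustration only and is not the implementation behind `nx.double_edge_swap`; the function names and the toy edge list are hypothetical.

```python
import random
from collections import Counter

def degree_counts(edges):
    """Count how many edges touch each node."""
    counts = Counter()
    for u, v in edges:
        counts[u] += 1
        counts[v] += 1
    return counts

def double_edge_swap(edges, n_swaps, seed=0, max_tries=10000):
    """Rewire (a, b), (c, d) -> (a, d), (c, b), keeping every node's degree fixed."""
    rng = random.Random(seed)
    edge_list = [tuple(e) for e in edges]
    edge_set = {frozenset(e) for e in edge_list}
    done = 0
    tries = 0
    while done < n_swaps and tries < max_tries:
        tries += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        # Edges are undirected, so pick the orientation of the second edge at random
        if rng.random() < 0.5:
            c, d = d, c
        # Skip swaps that would create a self-loop
        if len({a, b, c, d}) < 4:
            continue
        # Skip swaps that would duplicate an existing edge
        if frozenset((a, d)) in edge_set or frozenset((c, b)) in edge_set:
            continue
        edge_set -= {frozenset((a, b)), frozenset((c, d))}
        edge_set |= {frozenset((a, d)), frozenset((c, b))}
        edge_list[i], edge_list[j] = (a, d), (c, b)
        done += 1
    return edge_list

# Hypothetical toy network: two triangles joined by a single edge
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6), (4, 6)]
randomized = double_edge_swap(edges, n_swaps=10)
print(degree_counts(edges) == degree_counts(randomized))
```

Each accepted swap removes one edge from each of the four nodes involved and gives one back, so the degree comparison holds no matter which swaps are drawn. Only the identity of each node's neighbors changes.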

# Further applications

Bootstrapping can be applied to anything so long as an appropriate null model can be constructed. Writing the bootstrapping code is significantly **easier** than mentally constructing the appropriate null model to isolate your quantity of interest. 



# Exercise

You want to determine if a node is connected more to its in-module neighbors than to out-module neighbors. What is the appropriate null model?

In [55]:
#Exercise


# Bootstrap and processes

We can also bootstrap randomized networks to test whether a process occurs faster or slower because of the observed network structure. Take the code we wrote for an SI infection earlier and use it with the got network.

Then bootstrap the network and run the code again. Is the infection significantly quicker or slower than expected because of the observed structure of the network?

In [57]:
#Exercise
