In [None]:
# Run this cell, but please don't change it.

# These lines import the Numpy and Datascience modules.
import numpy as np
from datascience import *

# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)

from IPython.core.display import HTML
HTML("""
<style>
.imagesource {
    font-size: xx-small;
}
</style>
""")

import os
import networkx as nx
from networkx.algorithms import bipartite

from IPython.core.display import HTML
def css_styling():
    styles = open("custom_style.css", "r").read()
    return HTML(styles)
css_styling()

# These lines load the tests.
from client.api.assignment import load_assignment
hw04 = load_assignment('hw04.ok')

### L&S 88-4: Social networks

# Homework 04

## Part I: Small worlds

In this homework assignment, we're going to explore the concept of *small worlds*. Small worlds have long been studied by social network researchers, and they have also been discussed in popular culture. The rough idea is that social networks can typically be expected to have two characteristics:

* a high level of clustering
* a short average path length

A high level of clustering is consistent with the idea of triadic closure. A short average path length is supposed to capture situations we often seem to encounter in our day-to-day lives: e.g., two strangers find that they have an unexpected acquaintance in common and exclaim "it's a small world!" (see the Milgram article below).
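
To make the two quantities concrete, here is a minimal sketch on two toy graphs. The graphs are made up for illustration and are not Add Health networks. In a complete graph every pair of a node's neighbors is connected, so clustering is at its maximum and every path has length 1. In a chain, no node's neighbors are connected to each other, and paths are longer.

```python
import networkx as nx

# Two toy graphs on five nodes (illustrative only, not Add Health data).
complete = nx.complete_graph(5)   # everyone is connected to everyone
chain = nx.path_graph(5)          # 0 - 1 - 2 - 3 - 4

for name, g in [("complete", complete), ("chain", chain)]:
    cc = nx.average_clustering(g)
    apl = nx.average_shortest_path_length(g)
    print(name, "clustering:", cc, "average path length:", apl)
```

Neither toy graph is a small world: the chain has no clustering at all, and the complete graph has both properties only because every possible tie is present. The small world idea is that real networks combine high clustering with short paths even though most possible ties are absent.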

In the first part of this homework, we're going to try to assess how well these two small world predictions hold up empirically. We're going to focus on the Add Health networks. We should bear in mind that the small world theory is really about very large networks, so we will be evaluating it in an unusual situation: networks of moderate size taken from children who all live in the same community.

If you want to learn more about small-world networks, check out this article describing an early empirical study by Milgram:

* [Milgram 1967](http://measure.igpp.ucla.edu/GK12-SEE-LA/Lesson_Files_09/Tina_Wey/TW_social_networks_Milgram_1967_small_world_problem.pdf)

More recently, researchers have studied mathematical models that can produce networks with small-world properties. Here are a couple of examples:

* [Watts & Strogatz 1998](http://www.nature.com/nature/journal/v393/n6684/abs/393440a0.html)
* [Watts 1999](http://www.jstor.org/stable/10.1086/210318?seq=1#page_scan_tab_contents)

We'll use the code that we used in the labs to read the Add Health networks in.

In [None]:
def read_add_health_network(network_id):
    """
    network_id : integer from 1 to 84
    
    read in the Add Health network corresponding to the given id number and
    return it as an undirected networkx object
    """

    # this file was downloaded from
    # http://moreno.ss.uci.edu/data.html#adhealth
    edge_file = os.path.join("add-health", "comm" + str(network_id) + ".dat")
    with open(edge_file, 'r') as f:
        edge_lines = f.readlines()
        
    network = nx.parse_edgelist(edge_lines, nodetype=int, data=[('activity_level', float)])
    
    # note that we call the to_undirected method to ensure we get an undirected network
    return(network.to_undirected())

number_add_health_networks = 84
add_health_networks = [read_add_health_network(x) for x in range(1,number_add_health_networks+1)]

And here's a helper function that will calculate the average degree of a network for you.

In [None]:
def average_degree(net):
    return(2 * net.number_of_edges() / net.number_of_nodes())
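
The factor of 2 in this helper comes from the fact that every edge contributes to the degree of two nodes, so the degrees sum to twice the number of edges. A quick check on a toy graph (invented for this illustration) is sketched below; the helper is repeated so that the snippet runs on its own.

```python
import networkx as nx

def average_degree(net):
    # Each edge is counted once for each of its two endpoints.
    return 2 * net.number_of_edges() / net.number_of_nodes()

# Toy graph: a triangle (0, 1, 2) plus one extra node attached to node 2.
toy = nx.Graph([(0, 1), (1, 2), (2, 0), (2, 3)])

# Average the individual degrees directly, for comparison with the helper.
degrees = list(dict(toy.degree()).values())
mean_degree = sum(degrees) / len(degrees)

print(mean_degree, average_degree(toy))
```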

### [2 pt] Empirical distribution in the Add Health networks

First, we'll look at the empirical distribution of clustering and average path length in the Add Health networks.

**Question** Write a loop that goes through each of the 84 Add Health networks and calculates the clustering coefficient and the number of nodes in the network. (Please use the average clustering coefficient, implemented by the `average_clustering` function from the networkx package.) Store the results in a Table called `add_health_clustering` using columns called `num_nodes` and `avg_clustering_coef`.

In [None]:
clustering = ...
num_nodes = ...

for ... in ...:
    clustering = ...
    num_nodes = ...
    
add_health_clustering = Table().with_columns(['num_nodes', ...,
                                              'avg_clustering_coef', ...])
add_health_clustering

In [None]:
_ = hw04.grade('q1')

**Question** Plot a histogram showing the distribution of clustering coefficients across the 84 Add Health networks.

In [None]:
...

**Question** Make a scatter plot that compares the number of nodes in each network (x axis) to the clustering coefficient (y axis). Does it look like the clustering coefficient changes as the number of nodes does?

In [None]:
...

[Answer here]

### [2 pt] Average path length of biggest component

Remember that it really only makes sense to think about the average path length between two nodes that are in the same component. (Nodes in different components have no path between them.) Since some of the Add Health networks have more than one component, we'll start by picking out only the largest component in each network.

In [None]:
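def get_biggest_component(network):
    # connected_component_subgraphs was removed in networkx 2.4, so we find
    # the largest set of connected nodes and take the subgraph it induces
    biggest_nodes = max(nx.connected_components(network), key=len)
    return(network.subgraph(biggest_nodes).copy())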

add_health_biggest_components = [get_biggest_component(g) for g in add_health_networks]
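
As a small illustration of why the largest component matters, the sketch below builds a toy graph with two components (made up for this example, not an Add Health network). Asking networkx for the average shortest path length of the whole graph raises an error, while the largest component gives a well-defined answer. It uses only `nx.connected_components` and `Graph.subgraph`.

```python
import networkx as nx

# Toy graph with two components: a 4-node chain and a separate 2-node pair.
toy = nx.Graph([(0, 1), (1, 2), (2, 3), (10, 11)])

try:
    nx.average_shortest_path_length(toy)
except nx.NetworkXError as err:
    # There is no path between, e.g., node 0 and node 10.
    print("whole graph:", err)

# Node set of the largest component, then the subgraph it induces.
biggest_nodes = max(nx.connected_components(toy), key=len)
biggest = toy.subgraph(biggest_nodes).copy()
biggest_apl = nx.average_shortest_path_length(biggest)
print("largest component:", sorted(biggest.nodes()), biggest_apl)
```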

**Question** Write a loop that goes through the largest component of each of the 84 Add Health networks and calculates the average shortest path length, the number of nodes, and the average degree of that component. Store the results in a Table called `add_health_sp` using columns called `num_nodes`, `avg_shortest_path`, and `avg_degree`.

In [None]:
avg_shortest_path = ...
num_nodes = ...

avg_degree = ...

for ... in ...:
    avg_shortest_path = ...
    num_nodes = ...
    avg_degree = ...

In [None]:
add_health_sp = Table().with_columns(['num_nodes', ...,
                                      'avg_shortest_path', ...,
                                      'avg_degree', ...])

In [None]:
_ = hw04.grade('q2_b')

**Question** Plot a histogram showing the distribution of average shortest path lengths across the 84 Add Health networks' largest components.

In [None]:
...

**Question** Make a scatter plot that compares the number of nodes in each largest component (x axis) to the average shortest path (y axis). Does it look like the average shortest path changes as the number of nodes does?

In [None]:
...

[Answer here]

### P-values

In the introduction to this section, you read that the small world theory suggests that a social network should have a large clustering coefficient and a small average path length. But what do large and small mean? In other words, what should we compare these networks to?

We'll use Erdos-Renyi random networks as a null model. Specifically, we're going to

* pick one specific Add Health network to test
* generate ER networks that 'match' that specific Add Health network
* compare the clustering coefficient / average path lengths of the ER networks to the ones we observe in the Add Health network
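
The comparison in the last step is usually summarized as an empirical p-value: the share of simulated statistics that are at least as extreme as the observed one, in the direction of the alternative hypothesis. The sketch below shows that calculation with invented numbers. The function name `empirical_p_value`, the simulated values, and the observed value are all hypothetical and are not Add Health results.

```python
import random

random.seed(0)

# Hypothetical stand-ins: 100 statistics simulated under a null model and one
# observed value. These numbers are invented for illustration.
simulated = [random.gauss(3.0, 0.1) for _ in range(100)]
observed = 3.25

def empirical_p_value(simulated, observed, alternative="greater"):
    """Share of simulated statistics at least as extreme as the observed one."""
    if alternative == "greater":
        extreme = [s >= observed for s in simulated]
    else:  # alternative == "less"
        extreme = [s <= observed for s in simulated]
    return sum(extreme) / len(simulated)

p_greater = empirical_p_value(simulated, observed, "greater")
p_less = empirical_p_value(simulated, observed, "less")
print(p_greater, p_less)
```

The direction matters: the same observed value can give a small p-value under one alternative and a large one under the other, so the alternative hypothesis has to be fixed before looking at the result.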


Let's pick out one particular Add Health network to focus on for this part.

In [None]:
# the specific Add Health network we'll look at
ahn = add_health_networks[17]

In [None]:
def er_by_degree(n, avg_degree):
    
    return(nx.erdos_renyi_graph(n=n, p=avg_degree / (n-1)))

def rand_er_network(network):
    """
    Return an Erdos-Renyi random network with the same number of nodes and
    the same expected average degree as the network passed in
    """
    network_n = network.number_of_nodes()
    network_dbar = average_degree(network)
    return(er_by_degree(network_n, network_dbar))
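
The choice `p = avg_degree / (n-1)` follows from the fact that, in an ER network, each node has `n - 1` possible neighbors and is tied to each one independently with probability `p`, so its expected degree is `p * (n - 1)`. The match is therefore on average rather than exact: any single ER network will have an average degree somewhat above or below the target. The sketch below checks this with arbitrary values of `n` and the target degree, chosen only for illustration.

```python
import networkx as nx

n, target_degree = 50, 4.0      # arbitrary illustrative values
p = target_degree / (n - 1)     # each node has n - 1 possible neighbors

sim_degrees = []
for seed in range(200):
    g = nx.erdos_renyi_graph(n=n, p=p, seed=seed)
    sim_degrees.append(2 * g.number_of_edges() / g.number_of_nodes())

# Individual networks vary, but the mean should be close to the target.
mean_sim_degree = sum(sim_degrees) / len(sim_degrees)
print(min(sim_degrees), mean_sim_degree, max(sim_degrees))
```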

### [3 pt] Developing a simulation from a null model

**Question** Write a function which, given a network, returns its average shortest path length. If the network has more than one component, your function should return the average path length in the biggest component.

In [None]:
def avg_path_length(...):
    if ... :
        net = ...
    
    return(...)

**Question** Write a simulation that generates 100 Erdos-Renyi random networks that match the Add Health network `ahn`. (By 'match', we mean that the ER network should have the same average degree and number of nodes as the Add Health network `ahn`.) For each generated ER network, calculate the average clustering and use the function you wrote above to calculate the average path length. Store the results in a table called `er_res`.

In [None]:
observed_apl = ...
observed_cc = ...

er_cc = ...
er_apl = ...

for _ in ...:
    
    er_net = ...
    er_cc = ...
    er_apl = ...
    
er_res = Table().with_columns('cc', ...,
                              'apl', ...)


In [None]:
_ = hw04.grade('q3_a')

**Question** Now print out the observed average path length in the Add Health network `ahn` and plot a histogram of the average path lengths in the ER networks you just simulated. Look at where the observed Add Health network's statistic would fall in the ER distribution.

In [None]:
...

In [None]:
...

### [3 pt] Calculating P values

**Question** Now use your results to calculate a $p$ value for the hypothesis that the Add Health network's average path length was generated by the ER model; the alternative hypothesis should be that the Add Health network's average path length is larger than it would be in the ER model.

In [None]:
emp_p_value_apl = ...
emp_p_value_apl

In [None]:
_ = hw04.grade('q4')

**Question** Now print out the observed average clustering in the Add Health network `ahn` and plot a histogram of the average clusterings in the ER networks you just simulated. Look at where the observed Add Health network's statistic would fall in the ER distribution.

In [None]:
...

In [None]:
...

**Question** Now use your results to calculate a $p$ value for the hypothesis that the Add Health network's average clustering was generated by the ER model; the alternative hypothesis should be that the Add Health network's average clustering is less than it would be in the ER model.

In [None]:
emp_p_value_cc = ...
emp_p_value_cc

In [None]:
_ = hw04.grade('q5')

**Question** What do these two $p$ values lead you to conclude about the agreement between the ER model and the small world hypothesis (at least, using information from the Add Health network)?

[answer here]

### Hand in the homework on Bcourses

Please hand this homework in on Bcourses by **midnight on Tuesday, Nov. 15th.**<BR>
**Late homework will not be accepted**, so please be sure to hand in as much as you have finished by the deadline. Good luck!