In [None]:
!pip install --upgrade networkx

In [None]:
from IPython.core.display import HTML
from datascience import *

import matplotlib
matplotlib.use('Agg')
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import os
plt.style.use('fivethirtyeight')

import networkx as nx

In [None]:
# Load the ok notebook (used for grading and submission)
from client.api.notebook import Notebook
lab2_2 = Notebook('lab2_2.ok')
_ = lab2_2.auth(inline=True, force=True)

# Lab 2 (Part 2/2)

We are continuing our study of complete network data from the [Add Health project](http://www.cpc.unc.edu/projects/addhealth).

In part 1, we started using the `networkx` package. We

* learned how to read in a complete-network dataset
* learned how to take a subgraph from a complete network
* learned how to plot a network a few different ways
* ... but we also discovered that plotting networks is only moderately useful for understanding them

We're going to continue working with complete network data. First, we're going to learn about a few different ways to quantitatively describe various aspects of network structure. Then we're going to actually compute those metrics for all of the Add Health friendship networks. This will give us a chance to practice writing functions and using iteration.

## Quantifying network structure

There are many different ways of quantifying network structure. We're going to start by discussing different ways of measuring *network connectivity*. Roughly speaking, a network has a high level of connectivity when any node can reach another node by following a small number of network edges. In the case of the Add Health student friendship networks, a highly connected network could arise when students are friends with many of their fellow students. A poorly connected network, on the other hand, could arise when students are segregated into distinct groups that don't interact much with one another.

<img src="example_network.png" style="width: 60%;">

Some of the **basic metrics** of this network are as follows:



* number of nodes -  8
* number of edges -  5
* average degree -  (1 + 1 + 3 + 1 + 1 + 1 + 1 + 1) / 8 = 10/8
* number of connected components -  3
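
As a cross-check on the hand counts above, here is a minimal sketch that derives the same four numbers from the edge list using only the Python standard library. The helper name `count_components` is made up for this illustration; it is not part of `networkx`. Note that a node with no friends at all would not appear in an edge list, so this approach only sees nodes that have at least one edge.

```python
from collections import defaultdict

# Edge list read off the example figure.
edges = [(1, 3), (2, 3), (3, 4), (5, 6), (7, 8)]

# Build an undirected adjacency map: node -> set of neighbours.
adjacency = defaultdict(set)
for u, v in edges:
    adjacency[u].add(v)
    adjacency[v].add(u)

num_nodes = len(adjacency)
num_edges = len(edges)

# Each edge adds one to the degree of both endpoints,
# so the degrees always sum to 2 * num_edges.
avg_degree = sum(len(nbrs) for nbrs in adjacency.values()) / num_nodes


def count_components(adjacency):
    """Count connected components with a depth-first search (illustrative helper)."""
    seen = set()
    components = 0
    for start in adjacency:
        if start in seen:
            continue
        components += 1  # found a node that no earlier search reached
        stack = [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(adjacency[node] - seen)
    return components


num_components = count_components(adjacency)
print(num_nodes, num_edges, avg_degree, num_components)
```

The printed values should match the list above: 8 nodes, 5 edges, an average degree of 10/8 = 1.25, and 3 components.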



And the **shortest distance** between each pair of nodes in the largest component is as follows:



|             |  node 1 | node 2 |  node 3 |  node 4 |
|   :----:    |  :---:  |  :---: |  :---:  |  :---:  |
|   node 1    |    -    |    2   |    1    |    2    |
|   node 2    |    2    |    -   |    1    |    2    |
|   node 3    |    1    |    1   |    -    |    1    |
|   node 4    |    2    |    2   |    1    |    -    |

Furthermore, some additional **hand-calculated metrics** of the **largest component** of the graph are as follows.



* average path length - 9 / 6
* diameter - 2
* radius - 1
* fraction of nodes in periphery - 3 / 4
* fraction of nodes in center - 1 / 4
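
The hand calculations above all follow from the shortest-distance table. The sketch below rebuilds that table with a breadth-first search and then derives each metric from it. It uses only the standard library, and `bfs_distances` is a name invented for this illustration.

```python
from collections import deque

# Largest component of the example network, as an adjacency map.
largest = {1: {3}, 2: {3}, 3: {1, 2, 4}, 4: {3}}


def bfs_distances(adjacency, source):
    """Shortest path length (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adjacency[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


all_dist = {node: bfs_distances(largest, node) for node in largest}

# Eccentricity of a node: the distance to the node farthest from it.
ecc = {node: max(d.values()) for node, d in all_dist.items()}

diameter = max(ecc.values())  # largest eccentricity
radius = min(ecc.values())    # smallest eccentricity
periphery = sorted(n for n, e in ecc.items() if e == diameter)
center = sorted(n for n, e in ecc.items() if e == radius)

# Average over ordered pairs of distinct nodes. Every unordered pair is
# counted twice in both the total and the pair count, so this equals 9 / 6.
n = len(largest)
total = sum(d for dists in all_dist.values() for d in dists.values())
avg_path_length = total / (n * (n - 1))

print(avg_path_length, diameter, radius, periphery, center)
```

Node 3 is the only node within one step of everyone else, so it alone forms the center; the other three nodes each need two steps to reach one another and make up the periphery.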

### Calculating network metrics with `networkx` (with your neighbor)

Now we are going to use the `networkx` package to check the calculations we made by hand.

In [None]:
ex_network = nx.Graph([(1,3), (2,3), (3,4), (5,6), (7,8)])

**Question** Check that your network is correct by drawing it and comparing it to the image above.

In [None]:
nx.draw(ex_network, with_labels=True)

The next few questions ask you to use the following functions to check your calculations:

* `number_of_nodes()`
* `number_of_edges()`
* `number_connected_components()`

**Practice** Check the number of nodes

In [None]:
ex_network.number_of_nodes()

**Practice** Check the number of edges

In [None]:
ex_network.number_of_edges()

**Question** What is the average degree of this network?

In [None]:
q4 = ...

In [None]:
_ = lab2_2.grade('q4')

**Practice** Check the number of connected components

In [None]:
nx.number_connected_components(ex_network)

Several of the metrics we discussed only make sense when the entire network is one connected component. We will take the largest connected component of the example network (as we did when we made the calculations by hand above).

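**Question** Take the largest connected component of the example network. (To do this, look at the help file for the `connected_components` function; the example in the help file shows how to find the largest component, which you can then pass to the graph's `subgraph` method.)

In [None]:
nx.connected_components?

In [None]:
# connected_component_subgraphs was removed in networkx 2.4,
# so find the largest set of connected nodes and build the subgraph from it
largest_cc = max(nx.connected_components(ex_network), key=len)
ex_network_lc = ex_network.subgraph(largest_cc).copy()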

**Question** Check that this worked correctly by drawing `ex_network_lc`

In [None]:
nx.draw(ex_network_lc, with_labels=True)

The next few practices ask you to use the following functions to check your calculations:

* `average_shortest_path_length()`
* `radius()`
* `diameter()`
* `periphery()`
* `center()`

**Practice** Check the average shortest path length

In [None]:
nx.average_shortest_path_length(ex_network_lc)

**Practice** Check the radius

In [None]:
nx.radius(ex_network_lc)

**Practice** Check the diameter

In [None]:
nx.diameter(ex_network_lc)

**Practice** Check the fraction of nodes in the periphery

In [None]:
len(nx.periphery(ex_network_lc)) / ex_network_lc.number_of_nodes()

**Practice** Check the fraction of nodes in the center

In [None]:
len(nx.center(ex_network_lc)) / ex_network_lc.number_of_nodes()

### Opening up a school network

Recall that the Add Health study sampled schools in many different communities. In part 1, we looked at the network from one of those communities. Now, we're going to look at *all* of the communities. By looking at many different friendship networks, we can hope to better understand the structure of student friendship networks, since we will be able to use evidence from many different networks, instead of from a single example. At the same time, we will try to better understand the different metrics of network structure and how they relate to each other.

In part 1, we had to go through a couple of steps to read in a file and open up a single network. These steps would make a great function, since we will need to repeat them for each of the 84 files.

Take a look at this function, which you will use in a moment:

In [None]:
def read_add_health_network(network_id):
    """
    network_id : integer from 1 to 84
    
    read in the Add Health network corresponding to the given id number and
    return it as an undirected networkx object
    """

    # this file was downloaded from
    # http://moreno.ss.uci.edu/data.html#adhealth
    edge_file = os.path.join("data", "comm" + str(network_id) + ".dat")
    with open(edge_file, 'r') as f:
        edge_lines = f.readlines()
        
    network = nx.parse_edgelist(edge_lines, nodetype=int, data=[('activity_level', float)])
    
    # note that we call the to_undirected method to ensure we get an undirected network
    return network.to_undirected()
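
To make the parsing step less of a black box, here is a rough standard-library sketch of what reading such an edge list involves. It assumes each line holds a source id, a target id, and a numeric activity level separated by whitespace; that layout is inferred from the `nodetype` and `data` arguments above, and the sample lines are hypothetical rather than taken from the real files. The function name `parse_undirected` is also invented for this illustration.

```python
# Hypothetical sample lines; the real files' contents are not shown in this lab.
sample_lines = [
    "1 2 1.0",
    "2 1 1.0",
    "2 3 2.0",
]


def parse_undirected(lines):
    """Return {frozenset({u, v}): activity_level} for an undirected network."""
    edges = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines in this sketch
        u, v, level = int(parts[0]), int(parts[1]), float(parts[2])
        # A frozenset ignores direction, so "1 2" and "2 1" are the same edge;
        # a later line overwrites the value stored by an earlier duplicate.
        edges[frozenset((u, v))] = level
    return edges


edges = parse_undirected(sample_lines)
print(len(edges))
```

The point of the sketch is the direction-collapsing step: a nomination from student 1 to student 2 and one from 2 to 1 end up as a single undirected friendship tie.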

Now let's use this function to actually read in all 84 of the Add Health school networks:

*This takes a few seconds.*

In [None]:
number_add_health_networks = 84
add_health_networks = [read_add_health_network(x) for x in range(1,number_add_health_networks+1)]

Done! Look at the contents of `add_health_networks` to better understand what it is.

### Calculating network statistics for all of the Add Health communities

**Practice** Let's start by making a dataset that has the number of nodes in each of the 84 Add Health community networks.

In [None]:
num_nodes = make_array()

for g in add_health_networks:
    num_nodes = np.append(num_nodes, g.number_of_nodes())

add_health_firsttry = Table().with_columns([
     'id', np.arange(1, number_add_health_networks+1),
     'num_nodes', num_nodes,
    ])

add_health_firsttry
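
The cell above grows an array inside a `for` loop. An alternative way to organize the same work, sketched below under simplifying assumptions, is to put the per-network calculation in a function and collect the results with a list comprehension. The sketch uses plain edge lists as toy stand-ins for the networks, and `describe` is a hypothetical helper, not part of `networkx` or `datascience`.

```python
def describe(edges):
    """Summarise one network given as a list of (u, v) edges (illustrative helper)."""
    nodes = {node for edge in edges for node in edge}
    return {
        "num_nodes": len(nodes),
        "num_edges": len(edges),
    }


# Toy stand-ins for the list of networks: a single edge, a path, and a triangle.
toy_networks = [
    [(1, 2)],
    [(1, 2), (2, 3)],
    [(1, 2), (2, 3), (1, 3)],
]

# One summary per network, in the same order as the input list.
rows = [describe(edges) for edges in toy_networks]

# Pull each metric out into its own list, ready to become a table column.
num_nodes = [row["num_nodes"] for row in rows]
num_edges = [row["num_edges"] for row in rows]
print(num_nodes, num_edges)
```

Keeping the calculation in one function means a new metric only has to be added in one place, and the function can be tested on a small network before it is run on all 84.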

**Question** Now, following the pattern above, make a more complete dataset called `add_health` which has columns

* `id`
* `num_nodes`
* `num_edges`
* `avg_degree`
* `num_components`

In [None]:
# Create empty arrays for these variables, we will later make them into columns of the table
num_nodes = ...
num_edges = ...
...


for g in add_health_networks:
    num_nodes = np.append(num_nodes, g.number_of_nodes())
    num_edges = ...
    avg_degree = np.append(avg_degree, ...)
    num_components = np.append(num_components, nx.number_connected_components(g))

add_health = Table().with_columns([
     'id', np.arange(1, number_add_health_networks+1),
     'num_nodes', num_nodes,
     'num_edges', num_edges,
     'avg_degree', avg_degree,
     'num_components', num_components,
    ])

In [None]:
q5 = add_health.num_rows

In [None]:
_ = lab2_2.grade('q5')

Let's take a look at the dataset that we just created:

In [None]:
add_health

**Question** Make a histogram that shows the distribution of each column (except for `id`).

In [None]:
add_health.hist('num_nodes', bins=np.arange(0,3000,100))

In [None]:
add_health.hist('num_edges', bins=np.arange(0,12000,500))

In [None]:
add_health.hist('avg_degree', bins=np.arange(0,10,1))

In [None]:
add_health.hist('num_components', bins=np.arange(0,10,1))

### Relationship between metrics of network structure

Remember that the goal of these different metrics is to try to find a way to summarize the structure of a network. It turns out that this is too hard a task to have a single solution: the best way to summarize or describe a network can depend a lot on what you are interested in understanding about the network. For example, one type of summary might tell you which networks are at high or low risk of quickly spreading an infectious disease, and a different type of network metric might tell you how hierarchical or egalitarian the relationships between network members are.

It would be very helpful to understand how these different metrics are related to each other. For example, if two metrics always increase or decrease together, that might tell us that they are capturing the same underlying aspect of network structure. On the other hand, if two metrics are totally unrelated to one another, then that might tell us that each one captures an independent aspect of network structure.
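
One common way to put a number on "increase or decrease together" is the Pearson correlation coefficient, which is close to 1 or -1 when two metrics move in lockstep and close to 0 when they are unrelated. The sketch below is only an illustration of the idea: it uses path graphs (n nodes joined in a line by n - 1 edges) rather than Add Health data, and `pearson_r` is a hand-written helper rather than a library call.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Path graphs: n nodes in a line are joined by exactly n - 1 edges.
num_nodes = [2, 3, 4, 5, 6]
num_edges = [n - 1 for n in num_nodes]

# Every edge adds one to the degree of two nodes, so average degree = 2 * edges / nodes.
avg_degree = [2 * m / n for n, m in zip(num_nodes, num_edges)]

r_nodes_edges = pearson_r(num_nodes, num_edges)
r_nodes_degree = pearson_r(num_nodes, avg_degree)
print(r_nodes_edges, r_nodes_degree)
```

In this toy family, nodes and edges are perfectly correlated because one determines the other. Average degree also rises with size but levels off below 2, so its correlation with the number of nodes is positive but less than 1. Real friendship networks need not behave like either case, which is why we look at the data.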

One way to investigate this topic would be to use math to try to derive results that relate the different network metrics to each other. That's a great thing to do (and there has been a lot of work on this topic). But since we're learning how to analyze data, we're going to take a different approach: we're going to use our empirical dataset to see how these metrics behave in a set of real-world friendship networks.

**Practice** Make a scatter plot for each of the four pairs of metrics in the previous question. For each scatter plot, briefly comment on whether it supports your prediction. (We're not doing any formal tests here, so this evidence will only be suggestive.)

In [None]:
add_health.scatter('num_nodes', 'num_edges')

In [None]:
add_health.scatter('num_nodes', 'avg_degree')

In [None]:
add_health.scatter('num_nodes', 'num_components')

In [None]:
add_health.scatter('num_components', 'avg_degree')

### Rerun the tests and submit your lab

In [None]:
import os
print("Running all tests...")
_ = [lab2_2.grade(q[:-3]) for q in os.listdir("tests2") if q.startswith('q')]
print("Finished running all tests.")

In order to submit your assignment, run the next cell.

You can submit as many times as you want (up to the deadline: Tuesday, Feb 19th, 9pm).

In [None]:
_ = lab2_2.submit()