In [1]:
from IPython.core.display import HTML
from datascience import *

import matplotlib
matplotlib.use('Agg')
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import os
plt.style.use('fivethirtyeight')

import networkx as nx

In [2]:
def css_styling():
    # use a context manager so the file is closed after reading
    with open('../notebook_styles.css', 'r') as f:
        styles = f.read()
    return HTML(styles)
css_styling()

# Lab 04

**Question** Write your name here

<div class='response'>
[Answer here]
</div>

**Question** Write your partner's name here

<div class='response'>
[Answer here]
</div>

**Question** Write your partner's favorite type of music here

<div class='response'>
[Answer here]
</div>

In this lab, we're going to continue our study of complete network data from the [Add Health project](http://www.cpc.unc.edu/projects/addhealth).

Last week, we started using the `networkx` package. We

* learned how to read in a complete-network dataset
* learned how to take a subgraph from a complete network
* learned how to plot a network a few different ways
* ... but we also discovered that plotting networks is only moderately useful for understanding them

This week, we're going to continue working with complete network data. First, we're going to learn about a few different ways to quantitatively describe various aspects of network structure. Then we're going to actually compute those metrics for all of the Add Health friendship networks. This will give us a chance to practice writing functions and using iteration, two topics that you have studied recently in Data 8.



## Quantifying network structure

There are many different ways of quantifying network structure. We're going to start by discussing different ways of measuring *network connectivity*. Roughly speaking, a network has a high level of connectivity when any node can reach any other node by following a small number of network edges. In the case of the Add Health student friendship networks, a highly connected network could arise when students are friends with many of their fellow students. A poorly connected network, on the other hand, could arise when students are segregated into distinct groups that don't interact much with one another.

**Discussion question (with your neighbor)** Look at the example network below and try to come up with as many ways as possible of quantifying the connectivity of the network.

<div class='response'>
[Answer here]
</div>

<img src="example_network.png" style="width: 60%;">

**Question** Calculate the following network metrics by hand for the example network above:


* <div class='response'>number of nodes - [answer here]</div>
* <div class='response'>number of edges - [answer here]</div>
* <div class='response'>average degree - [answer here]</div>
* <div class='response'>number of connected components - [answer here]</div>



**Question** Fill in the table below with the path lengths between pairs of nodes in the largest component of the example network above; the table entry i,j should contain the length of the shortest path between node i and node j:



|             |  node 1 | node 2 |  node 3 |  node 4 |
|   :----:    |  :---:  |  :---: |  :---:  |  :---:  |
|   node 1    |    -    |    ?   |    ?    |    ?    |
|   node 2    |    ?    |    -   |    ?    |    ?    |
|   node 3    |    ?    |    ?   |    -    |    ?    |
|   node 4    |    ?    |    ?   |    ?    |    -    |


**Question** Calculate the following network metrics by hand for the largest component of the example network above:

* <div class='response'>average path length - [answer here]</div>
* <div class='response'>diameter - [answer here]</div>
* <div class='response'>radius - [answer here]</div>
* <div class='response'>fraction of nodes in periphery - [answer here]</div>
* <div class='response'>fraction of nodes in core - [answer here]</div>
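
The definitions behind these metrics can be sketched in a few lines of plain Python. The example below uses a hypothetical four-node path network (not the example network in the image), and the names `toy_adjacency` and `bfs_distances` are illustrative choices. It assumes that "core" refers to what `networkx` calls the center, that is, the nodes whose eccentricity equals the radius.

```python
from collections import deque

# Hypothetical toy network: a path a - b - c - d (NOT the lab's example network).
toy_adjacency = {
    'a': ['b'],
    'b': ['a', 'c'],
    'c': ['b', 'd'],
    'd': ['c'],
}

def bfs_distances(adjacency, source):
    """Return shortest-path lengths (in edges) from source to every reachable node."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances

all_distances = {node: bfs_distances(toy_adjacency, node) for node in toy_adjacency}

# Eccentricity of a node: the distance to the node farthest away from it.
eccentricity = {node: max(d.values()) for node, d in all_distances.items()}

diameter = max(eccentricity.values())   # largest eccentricity
radius = min(eccentricity.values())     # smallest eccentricity
periphery = [n for n, e in eccentricity.items() if e == diameter]
center = [n for n, e in eccentricity.items() if e == radius]

# Average path length: mean distance over all ordered pairs of distinct nodes.
n = len(toy_adjacency)
total_distance = sum(dist for d in all_distances.values() for dist in d.values())
average_path_length = total_distance / (n * (n - 1))

print(diameter, radius, periphery, center, average_path_length)
```

The fraction of nodes in the periphery (or center) is then the length of the corresponding list divided by the number of nodes.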

### Calculating network metrics with `networkx` (with your neighbor)

Now we are going to use the `networkx` package to check the calculations we made by hand.

First, we'll make a `networkx` object that has the example network.  
  
  
**Question** Fill in the code below to create a `networkx` object that represents the example network; you can do this by filling in an edge list.

In [118]:
ex_network = nx.Graph([...]) # example edgelist for a triangle: [(1,2), (1,3), (2,3)]

**Question** Check that your network is correct by drawing it and comparing it to the image above.

In [None]:
nx.draw(..., with_labels=True)

The next few questions ask you to use the following functions to check your calculations:

* `number_of_nodes()`
* `number_of_edges()`
* `number_connected_components()`
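
As a rough illustration of how these calls fit together, here is a sketch on a hypothetical toy network (a square plus a separate pair of nodes, not the example network above). Note that `number_of_nodes()` and `number_of_edges()` are methods of the graph object, while `number_connected_components()` is called as a function from the `nx` namespace. There is no single function for average degree in the list, so the sketch computes it from the other quantities.

```python
import networkx as nx

# Hypothetical toy network (NOT the lab's example network):
# a square 1-2-3-4-1 plus a separate pair 5-6.
toy = nx.Graph([(1, 2), (2, 3), (3, 4), (4, 1), (5, 6)])

n_nodes = toy.number_of_nodes()
n_edges = toy.number_of_edges()
n_components = nx.number_connected_components(toy)

# Each edge adds 1 to the degree of two nodes, so the degrees sum to 2 * edges.
avg_degree = 2 * n_edges / n_nodes

# Equivalent check: average the per-node degrees reported by networkx.
avg_degree_check = sum(dict(toy.degree()).values()) / n_nodes

print(n_nodes, n_edges, n_components, avg_degree)
```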

**Question** Check the number of nodes

In [None]:
...

**Question** Check the number of edges

In [None]:
...

**Question** Check the average degree

In [None]:
...

**Question** Check the number of connected components

In [None]:
...

Several of the metrics we discussed only make sense when the entire network is one connected component. We will take the largest connected component of the example network (as we did when we made the calculations by hand above).

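**Question** Take the largest connected component of the example network. (To do this, look at the help file for the `connected_component_subgraphs` function; the example in the help file shows how to do this. If your version of `networkx` is 2.4 or newer, that function no longer exists; use `connected_components` together with the graph's `subgraph` method instead.)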

In [9]:
ex_network_lc = max(...)
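
One way to take the largest component that should not depend on which helper functions a given `networkx` release provides is sketched below on a hypothetical toy network (not the example network). The variable names are illustrative.

```python
import networkx as nx

# Hypothetical toy network: a square 1-2-3-4-1 plus a separate pair 5-6.
toy = nx.Graph([(1, 2), (2, 3), (3, 4), (4, 1), (5, 6)])

# connected_components yields one set of nodes per component;
# max(..., key=len) picks the set with the most nodes.
largest_nodes = max(nx.connected_components(toy), key=len)

# Build the subgraph on those nodes; copy() makes it independent of `toy`.
toy_lc = toy.subgraph(largest_nodes).copy()

print(sorted(toy_lc.nodes()), toy_lc.number_of_edges())
```

The `key=len` argument matters: without it, `max` would try to compare the components directly rather than by size.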

**Question** Check that this worked correctly by drawing `ex_network_lc`

In [None]:
nx.draw(..., with_labels=True)

The next few questions ask you to use the following functions to check your calculations:

* `average_shortest_path_length()`
* `radius()`
* `diameter()`
* `periphery()`
* `center()`
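
A minimal sketch of these functions on a hypothetical star-shaped toy network (one middle node connected to three others) is shown below. All five are called from the `nx` namespace with the graph as the argument, and they require a connected network. `periphery()` and `center()` return lists of nodes, so turning them into fractions takes one extra step.

```python
import networkx as nx

# Hypothetical toy network: node 0 in the middle, connected to nodes 1, 2, and 3.
star = nx.star_graph(3)

avg_path = nx.average_shortest_path_length(star)
radius = nx.radius(star)
diameter = nx.diameter(star)

# periphery() and center() return lists of nodes, so a fraction needs a count.
frac_periphery = len(nx.periphery(star)) / star.number_of_nodes()
frac_center = len(nx.center(star)) / star.number_of_nodes()

print(avg_path, radius, diameter, frac_periphery, frac_center)
```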

**Question** Check the average shortest path length

In [None]:
...

**Question** Check the radius

In [None]:
...

**Question** Check the diameter

In [None]:
...

**Question** Check the fraction of nodes in the periphery

In [None]:
...

**Question** Check the fraction of nodes in the center

In [None]:
...

### Opening up a school network

Recall that the Add Health study sampled schools in many different communities. In the last lab, we looked at the network from one of those communities. In today's lab, we're going to look at *all* of the communities. By looking at many different friendship networks, we can hope to better understand the structure of student friendship networks, since we will be able to use evidence from many different networks, instead of from a single example. At the same time, we will try to better understand the different metrics of network structure and how they relate to each other.

Last week, we had to go through a couple of steps to read a file in and open up a single network. These steps would make a great function, since we will need to repeat them for each of the 84 different files.

Take a look at this function, which you will use in a moment:

In [19]:
def read_add_health_network(network_id):
    """
    network_id : integer from 1 to 84
    
    read in the Add Health network corresponding to the given id number and
    return it as an undirected networkx object
    """

    # this file was downloaded from
    # http://moreno.ss.uci.edu/data.html#adhealth
    edge_file = os.path.join("..", "data", "add-health", "comm" + str(network_id) + ".dat")
    with open(edge_file, 'r') as f:
        edge_lines = f.readlines()
        
    network = nx.parse_edgelist(edge_lines, nodetype=int, data=[('activity_level', float)])
    
    # note that we call the to_undirected method to ensure we get an undirected network
    return network.to_undirected()
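
To see what the `parse_edgelist` call in this function does, here is a small sketch that feeds it a few lines directly instead of reading a file. The lines are made up for illustration and are not taken from the Add Health files; they only assume the "node node value" layout that the function's `data` argument implies.

```python
import networkx as nx

# Hypothetical lines in a "node node value" layout; made up for illustration.
example_lines = ["1 2 1", "2 3 2", "3 1 1"]

example_network = nx.parse_edgelist(
    example_lines,
    nodetype=int,                        # convert node labels from str to int
    data=[('activity_level', float)],    # name and type of the third column
)

# The third column ends up as an attribute on each edge.
edge_attributes = example_network[1][2]

print(example_network.number_of_nodes(), edge_attributes)
```

When no `create_using` argument is given, `parse_edgelist` builds an undirected `nx.Graph`, so the `to_undirected()` call in the function above acts as a safeguard.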

Now let's use this function to actually read in all 84 of the Add Health school networks:

In [20]:
number_add_health_networks = 84
add_health_networks = [read_add_health_network(x) for x in range(1,number_add_health_networks+1)]

Done! Look at the contents of `add_health_networks` to better understand what it is.

### Calculating network statistics for all of the Add Health communities

**Question** Let's start by making a dataset that has the number of nodes in each of the 84 Add Health community networks. Fill in the missing code below:

In [141]:
num_nodes = make_array()

for g in ...:
    num_nodes = np.append(..., ...)

add_health_firsttry = Table().with_columns([
     'id', np.arange(1, number_add_health_networks+1),
     'num_nodes', num_nodes
    ])

add_health_firsttry
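
The accumulate-in-a-loop pattern used in this cell can be sketched on unrelated toy data. The example below assumes that `np.array([])` behaves like the empty array returned by `make_array()`, and the list `groups` is a hypothetical stand-in for the list of networks.

```python
import numpy as np

# Hypothetical inputs: a few short lists standing in for the list of networks.
groups = [[1, 2, 3], [4, 5], [6]]

group_sizes = np.array([])
for group in groups:
    # np.append returns a NEW array; the result must be assigned back,
    # otherwise group_sizes would stay empty.
    group_sizes = np.append(group_sizes, len(group))

print(group_sizes)
```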

**Question** Now, following the pattern above, make a more complete dataset called `add_health` which has columns

* `id`
* `num_nodes`
* `num_edges`
* `avg_degree`
* `num_components`

In [None]:
...

for ... in ...:
    ...
    ...
    
add_health = ...

Let's take a look at the dataset that we just created:

In [28]:
add_health

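| id | num_nodes | num_edges | avg_degree | num_components |
| :---: | :---: | :---: | :---: | :---: |
| 1 | 69 | 220 | 6.37681 | 1 |
| 2 | 105 | 349 | 6.64762 | 2 |
| 3 | 32 | 91 | 5.6875 | 1 |
| 4 | 281 | 1136 | 8.08541 | 1 |
| 5 | 157 | 730 | 9.29936 | 1 |
| 6 | 108 | 378 | 7.0 | 1 |
| 7 | 441 | 1700 | 7.70975 | 3 |
| 8 | 204 | 809 | 7.93137 | 1 |
| 9 | 248 | 1004 | 8.09677 | 1 |
| 10 | 678 | 2795 | 8.24484 | 1 |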
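
As a quick sanity check on a dataset like this, the `avg_degree` values in the rows shown above are consistent with the identity average degree = 2 × (number of edges) / (number of nodes), since every edge contributes to the degree of two nodes. The sketch below checks the first three rows in plain Python; the tolerance is an assumption that allows for rounding in the displayed values.

```python
# (num_nodes, num_edges, avg_degree) copied from the first three rows shown above.
rows = [
    (69, 220, 6.37681),
    (105, 349, 6.64762),
    (32, 91, 5.6875),
]

# Recompute the average degree from the node and edge counts.
computed = [2 * num_edges / num_nodes for num_nodes, num_edges, _ in rows]

# The displayed values are rounded, so compare with a small tolerance.
matches = [abs(c - reported) < 1e-4 for c, (_, _, reported) in zip(computed, rows)]

print(computed, matches)
```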


**Question** Make a histogram that shows the distribution of each column (except for `id`).

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

**Question** What are these distributions, exactly? For example, the histogram for `num_components` shows that almost half of the distribution has 1 component. What does that mean about the friendship networks of adolescents in the Add Health study?

<div class='response'>
[answer here]
</div>

**Question** Go back and look at the [description of how the data were collected](http://moreno.ss.uci.edu/data.html#adhealth) again. Now look again at the histogram of the distribution of average degrees. Can you think of anything about this data collection method that might affect the average degree distribution you see?

<div class='response'>
[answer here]
</div>

### Relationship between metrics of network structure

Remember that the goal of these different metrics is to try to find a way to summarize the structure of a network. It turns out that this is too hard a task to have a single solution: the best way to summarize or describe a network can depend a lot on what you are interested in understanding about the network. For example, one type of summary might tell you which networks are at high or low risk of quickly spreading an infectious disease, and a different type of network metric might tell you how hierarchical or egalitarian relationships between network members are.

It would be very helpful to understand how these different metrics are related to each other. For example, if two metrics always increase or decrease together, that might tell us that they are capturing the same underlying aspect of network structure. On the other hand, if two metrics are totally unrelated to one another, then that might tell us that each one captures an independent aspect of network structure.
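
One possible numerical summary of "increasing or decreasing together" is the correlation coefficient, computed as the mean of the products of two variables in standard units. The sketch below applies it to the `num_nodes` and `num_edges` values of the ten networks displayed earlier in this lab; the helper name `standard_units` is an illustrative choice. Ten networks are only a small slice of the data, and the coefficient only captures linear association, so this is a complement to looking at a plot rather than a replacement for it.

```python
import numpy as np

# num_nodes and num_edges for the ten networks displayed earlier in this lab.
num_nodes = np.array([69, 105, 32, 281, 157, 108, 441, 204, 248, 678])
num_edges = np.array([220, 349, 91, 1136, 730, 378, 1700, 809, 1004, 2795])

def standard_units(values):
    """Convert an array to standard units (mean 0, SD 1)."""
    return (values - np.mean(values)) / np.std(values)

# Correlation coefficient: mean of the products of the variables in standard units.
r = np.mean(standard_units(num_nodes) * standard_units(num_edges))

print(r)
```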

One way to investigate this topic would be to use math to try to derive results that relate the different network metrics to each other. That's a great thing to do (and there has been a lot of work on this topic). But since we're learning how to analyze data, we're going to take a different approach: we're going to use our empirical dataset to see how these metrics behave in a set of real-world friendship networks.

**Question** Think about the definitions of the four metrics we have looked at so far. Do you expect them to be related to one another? Make a prediction for each of the following pairs:

* <div class='response'>number of nodes and number of edges - [answer here]</div>
* <div class='response'>number of nodes and average degree - [answer here]</div>
* <div class='response'>number of nodes and number of components - [answer here]</div>
* <div class='response'>average degree and number of components - [answer here]</div>

A prediction might be one of the following possibilities: (i) no relationship; (ii) directly related (when one increases, the other one tends to increase too); (iii) inversely related (when one increases, the other one tends to decrease); (iv) something else.

Of course, it is OK if your prediction turns out not to be correct.

**Question** Make a scatterplot that investigates the relationship between each of the four pairs of metrics in the previous question. For each scatterplot, briefly comment on whether it suggests that your prediction is correct or not. (We're not doing any formal tests here, so this evidence will only be suggestive.)

In [None]:
...

<div class='response'>[Answer here]</div>

In [None]:
...

<div class='response'>[Answer here]</div>

In [None]:
...

<div class='response'>[Answer here]</div>

In [None]:
...

<div class='response'>[Answer here]</div>

## Submit the lab

You're almost done! Now please create a pdf version of your completed lab by **either**:

* printing your notebook to a pdf file
* going to the Jupyter 'File' menu, choosing 'Download as' and then 'PDF via LaTeX (.pdf)'. 

Please save the resulting .pdf on your computer and then **submit the .pdf on bcourses**.

**The lab must be submitted by the end of the day on Monday, Oct. 2. Late labs will not be accepted.**