In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab_complete_network_data.ipynb")

In [None]:
!pip install --upgrade networkx

In [None]:
!pip install scipy==1.8.0

In [None]:
from IPython.core.display import HTML
from datascience import *

import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import numpy as np
import os
import networkx as nx
plt.style.use('fivethirtyeight')

%matplotlib inline

# Lab 1: Working with complete network data

Today, we will work with complete network data, meaning data on an entire population of individuals and all of the connections between them. I'll refer to this as the *complete network perspective*. The complete network perspective can be very useful, since it enables us to study how the actual structure of the network can affect important outcomes like the spread of a disease, information about job openings, and even social status.

For almost any social network we study, we are interested in understanding the structure of the complete network. Unfortunately, it is typically extremely difficult to obtain complete network data. Most studies that have done so have put a tremendous amount of time, effort and resources into data collection.  Any time we analyze data---including complete network data---we have to bear in mind the strengths and limitations of the way that the data were collected.

### The Add Health Study

For our first complete network dataset, we're going to be looking at data from a study called [Add Health](http://www.cpc.unc.edu/projects/addhealth). Here is a description of the study, taken from the front page of the [Add Health website](http://www.cpc.unc.edu/projects/addhealth):

<blockquote>
The National Longitudinal Study of Adolescent to Adult Health (Add Health) is a longitudinal study of a nationally representative sample of adolescents in grades 7-12 in the United States during the 1994-95 school year. The Add Health cohort has been followed into young adulthood with four in-home interviews, the most recent in 2008, when the sample was aged 24-32. Add Health is re-interviewing cohort members in a Wave V follow-up from 2016-2018 to collect social, environmental, behavioral, and biological data with which to track the emergence of chronic disease as the cohort moves through their fourth decade of life.
<BR><BR>
Add Health combines longitudinal survey data on respondents’ social, economic, psychological and physical well-being with contextual data on the family, neighborhood, community, school, friendships, peer groups, and romantic relationships, providing unique opportunities to study how social environments and behaviors in adolescence are linked to health and achievement outcomes in young adulthood. The fourth wave of interviews expanded the collection of biological data in Add Health to understand the social, behavioral, and biological linkages in health trajectories as the Add Health cohort ages through adulthood, and the fifth wave of data collection continues this biological data expansion.
</blockquote>

Here are some terms from that description that might not be familiar to you:

* `longitudinal` - longitudinal studies follow people over time, instead of just interviewing people at one point in time
* `cohort` - a group of people that is followed over time; for Add Health, the cohort is the group of people who were interviewed as 7-12th graders in 1994-95.
* `nationally representative` - the participants in the study were chosen in a principled way that enables researchers to make inferences about the US population from the small number of people they interview

Add Health interviewed adolescents in many different schools that were randomly sampled from all over the US. We're going to work with the data from the friendship network of students in just one of those schools today.

Now that we have some background, we'll need to talk a bit about some of the technical details that go into working with complete network data.

# Representing a network in a computer

<!-- BEGIN QUESTION -->

## Question 1:
How can we turn a network -- which we have been thinking of as an abstract concept -- into something concrete that we can analyze and manipulate using a computer? What information do we need to be able to describe the network? What are the advantages and disadvantages of the different solutions you can think of?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

### Loading a complete network

We will explore friendships from an American school; the data can be found at [http://moreno.ss.uci.edu/data.html#adhealth](http://moreno.ss.uci.edu/data.html#adhealth).

The data are in [UCINET DL format](https://gephi.org/users/supported-graph-formats/ucinet-dl-format/). [UCINET](https://sites.google.com/site/ucinetsoftware/home) is a tool that can be used to perform all sorts of network analysis. We won't be using UCINET in this class, but you might come across it in future classes. For now, we're interested in some of the data that are included with UCINET. These datasets contain edge lists which the Python package `networkx` can read in.

The code below opens up the data file, but just reads it as lines of text (instead of interpreting the lines of text as a description of a network).<BR>

Run the code chunk below, and take a look at its output.

In [None]:
os.getcwd()

In [None]:
# this file was downloaded from
# http://moreno.ss.uci.edu/data.html#adhealth
edge_file = os.path.join("data", "comm1.dat")
with open(edge_file, 'r') as f:
    edge_lines = f.readlines()

In [None]:
edge_lines

Note that it looks like there are 4 extraneous lines at the top of the file before the edge list starts. Fortunately, the `networkx` package is smart enough to skip these four lines. 

Each line of the dataset has three numbers on it. The first two numbers are the nodes at the two endpoints of an edge, while the third number indicates the weight of that edge.
*[Hint: In order to understand how the data are formatted, read the "Description" section of the [website](http://moreno.ss.uci.edu/data.html#adhealth) where the data can be downloaded.]*



In order to convert the edgelist contained in the datafile into a `networkx` object, we use the `nx.parse_edgelist` function:

In [None]:
g = nx.parse_edgelist(edge_lines, nodetype=int, data=[('activity_level', float)])

`g` is now a network object, and we will be working with `g` from now on.

You can see some more information about how this function works by looking at the help file:

In [None]:
nx.parse_edgelist?

In addition to the lines from the datafile, we also passed a couple of other arguments to the `parse_edgelist` function. They are:

* `nodetype` - specifies the Python type used to represent each node. We have integer ids, so we use `int` here
* `data` - describes any extra information about each edge that is contained in the edge list; in our case, there is a floating point value that describes the amount of interaction an edge represents. See the [data description](http://moreno.ss.uci.edu/data.html#adhealth) and the `parse_edgelist` help file for more information
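
To see what these two arguments do, here is a minimal sketch on a made-up three-line edge list. The contents of `toy_lines` and the name `toy` are invented for illustration; they are not taken from the Add Health file.

```python
import networkx as nx

# Hypothetical edge list: each line is "node node weight" (not real data)
toy_lines = ["1 2 1.0", "2 3 2.0", "1 3 6.0"]

# nodetype=int converts the node labels to integers;
# data=[...] names and converts the third column on each line
toy = nx.parse_edgelist(toy_lines, nodetype=int, data=[('activity_level', float)])

# Each edge carries its third column as an attribute called 'activity_level'
for u, v, attrs in toy.edges(data=True):
    print(u, v, attrs)
```

Without `nodetype=int`, the node ids would be kept as strings such as `'1'` rather than the integer `1`.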

Now that we have read our edgelist into a `networkx` object, we can start to investigate this network.

For example, we can list the edges in the network:

In [None]:
g.edges()

In [None]:
g.edges(data=True)

## Question 2:
How many edges are in the network? How many nodes are there?<BR>
[*Hint: there are many ways to answer this question. For example, to count the number of nodes, you may find it helpful to look at the `nodes()` method.*]

In [None]:
q2_edges = ...

print('num edges: ', q2_edges)

In [None]:
q2_nodes = ...

print('num nodes: ', q2_nodes)

In [None]:
grader.check("q2")

# Drawing the graph

In [None]:
nx.draw_networkx(g, with_labels=True)

In order to illustrate a few important concepts, it will be helpful to first investigate a small subset of the network that we just read in.

Remember that a network can be represented mathematically as a `graph`. (Note: this is a different concept from a plot, or a graphical display of data, which can also be called a graph.) A subset of a graph is called a `subgraph`.

The `subgraph` function enables us to create a subgraph from a specific set of node ids:

In [None]:
g_subgraph = nx.subgraph(g, [3, 6, 29, 34, 40]) # pass in the network and a list of nodes

We can get a drawing of this subgraph using the `draw_networkx` function:

In [None]:
nx.draw_networkx(g_subgraph, with_labels=True)

We will discuss drawing networks in more detail below.
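
To make the idea of a subgraph concrete, here is a small sketch on a made-up network (the graph `toy` and its edges are hypothetical, not the school data). It shows that a subgraph keeps only the edges whose two endpoints are both in the chosen set of nodes.

```python
import networkx as nx

# Hypothetical toy network with four nodes and five edges
toy = nx.Graph([(1, 2), (2, 3), (3, 4), (4, 1), (1, 3)])

# Keep only nodes 1, 2 and 3; every edge touching node 4 is dropped
toy_sub = nx.subgraph(toy, [1, 2, 3])

print(sorted(toy_sub.nodes()))
print(toy_sub.number_of_edges())
```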

### Other representations of the network

We discussed how there are different ways to represent a network in a computer. The edge list is very practical because many real networks are quite *sparse*, meaning that they have relatively few edges. The edge list is a particularly convenient way of storing a description of a network in a file (or in memory) when the network is large and sparse.

Another way to store a network is as an *adjacency matrix*. The *adjacency matrix* is not too practical for large networks because the amount of memory it requires increases quickly with the number of nodes in the network. However, the adjacency matrix turns out to be convenient to work with mathematically, so many formal results rely upon it.
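
A rough way to see why the adjacency matrix becomes impractical is to count how many numbers each representation has to store. The sketch below assumes, purely for illustration, that every node has about 5 friends; the helper `storage_sizes` is a hypothetical name, not part of `networkx`.

```python
def storage_sizes(num_nodes, avg_degree):
    """Return (entries in an adjacency matrix, rows in an edge list)."""
    matrix_entries = num_nodes * num_nodes
    # each edge is shared by two nodes, so divide the total degree by 2
    edge_list_rows = num_nodes * avg_degree // 2
    return matrix_entries, edge_list_rows

# assumed average degree of 5, chosen only for illustration
for n in [10, 100, 1000, 10000]:
    print(n, storage_sizes(n, 5))
```

The matrix size grows with the square of the number of nodes, while the edge list grows only in proportion to the number of edges.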

The adjacency matrix is a matrix -- i.e., it is an array of numbers, like a table. It is square, meaning that it has the same number of rows and columns. These rows and columns are ordered so that each node id corresponds to one row and one column.

Each entry in an adjacency matrix can be located by its coordinates: (row number, column number).  If an entry is 0, it means that there is no edge between the vertices corresponding to the row and column. If an entry is 1, then there is an edge between the vertices corresponding to the row and column. (For those who are curious, [Wikipedia](https://en.wikipedia.org/wiki/Adjacency_matrix) has a discussion of adjacency matrices.)
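
The rule above can be sketched by hand with `numpy`, without using `networkx` at all. The node ids and edges below are made up for illustration, and the dictionary `index` is just one possible way to map node ids to row and column numbers.

```python
import numpy as np

# Hypothetical toy network (not the school data)
nodes = [1, 2, 3, 4]
edges = [(1, 3), (2, 3), (3, 4)]

# Map each node id to a row/column position
index = {node: position for position, node in enumerate(nodes)}

# Start with all zeros (no edges), then mark each edge with a 1
A = np.zeros((len(nodes), len(nodes)), dtype=int)
for u, v in edges:
    A[index[u], index[v]] = 1
    A[index[v], index[u]] = 1  # undirected: the edge goes both ways

print(A)
```

Because the network is undirected, the resulting matrix is symmetric: the entry at (row, column) always equals the entry at (column, row).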

<BR>

Of course, `networkx` will display the edge list and the adjacency matrix representations of the network for you. Let's call the `edges` method and the `nx.adjacency_matrix` function on the subgraph `g_subgraph`.

In [None]:
g_subgraph.edges()

In [None]:
nx.adjacency_matrix(g_subgraph)

In order to actually show the contents of the adjacency matrix (since we know, in this case, that it's not too big), we can use the `todense` method:

In [None]:
nx.adjacency_matrix(g_subgraph).todense()

There's the matrix -- but, it's a bit hard to interpret it without knowing which node id corresponds to which row and column. If you look at the [help file](https://networkx.github.io/documentation/networkx-1.9/reference/generated/networkx.linalg.graphmatrix.adjacency_matrix.html) for `adjacency_matrix`, you will see that it says that, by default, it orders the rows/columns according to the results of `nodes()`. So we can interpret the matrix above by calling:

In [None]:
g_subgraph.nodes()

Check that the matrix you get makes sense by comparing it to the plot above.
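
If you would rather choose the row and column order yourself than rely on the order of `nodes()`, one option is to pass an explicit `nodelist`. The sketch below uses `nx.to_numpy_array` on a made-up graph; the graph `toy` and the choice to sort the ids are assumptions made for illustration.

```python
import networkx as nx

# Hypothetical toy network (not the school data)
toy = nx.Graph([(4, 3), (3, 1), (2, 3)])

# Fix the row/column order explicitly: row 0 is node 1, row 1 is node 2, ...
order = sorted(toy.nodes())
A = nx.to_numpy_array(toy, nodelist=order, dtype=int)

print(order)
print(A)
```

With the order fixed this way, the entry in row `i` and column `j` refers to the pair of nodes `order[i]` and `order[j]`.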

### Plotting a network

As we saw above, `networkx` will help us draw a network using the `draw_networkx` function:

In [None]:
nx.draw_networkx(g_subgraph)

Now look at the help files for the `networkx` package and try to find at least three other ways to draw the network.<BR> 
[*Hint: try typing `nx.draw` and then push Tab; you should see a list with possible completions pop up.*]

Each time you make a new plot, read the help file and try to explain how this plot is made. (If you don't understand the help file, that is OK -- some of it goes beyond what we have discussed so far. Just do your best.)

In [None]:
nx.draw_circular(g_subgraph)

In [None]:
nx.draw_spectral(g_subgraph)

In [None]:
nx.draw_kamada_kawai(g_subgraph)

<!-- BEGIN QUESTION -->

## Question 3:
What do you learn from these graphs about plotting networks?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

### The complete network

Remember that, so far, we have mostly been looking at a subgraph of the complete school network. This was helpful because it is easier to illustrate some network concepts with smaller networks. Now we will turn to the entire network.

Let's use the `draw_networkx` function to draw the complete school network again.

In [None]:
nx.draw_networkx(g)

**Practice** Now use the three other drawing methods that you discovered above to produce different plots of this network.

In [None]:
nx.draw_random(g)

In [None]:
nx.draw_spring(g)

In [None]:
nx.draw_shell(g)

<!-- BEGIN QUESTION -->

## Question 4:
As its name suggests, `draw_random` draws the network with a random layout. You will get a different plot each time you call `draw_random`. Write a simple loop that will call `draw_random` 10 different times, producing 10 different random drawings of this network.<BR>
*[HINT: after each plot you draw in your loop, tell matplotlib that you want to start a new plot (rather than adding to the existing one) by calling `plt.figure()`]*

In [None]:
for i in range(10):
    nx.draw_random(..., with_labels=True)
    ...

<!-- END QUESTION -->

### Degree plots

In lecture, we brainstormed a few of the quantitative ways that we could try to summarize the structure of a network. One of these ways was to look at the *degrees* in the network. The degree of a node in the network is the total number of other nodes that it is connected to.

Now we will investigate the degrees in this school friendship network.

Think for a second about what the collection of degrees in a network tells us: it tells us, for each node, how connected that node is.  Often, we are interested in understanding how much connectedness varies from node to node. For example, in some networks, it can be the case that every node has exactly the same degree; in other networks, there can be huge differences in the degrees of different nodes.
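
As a small illustration of what a degree is, the sketch below counts degrees by hand from a made-up edge list and compares the result with what `networkx` reports. The edges and variable names are hypothetical and are not part of the school data.

```python
from collections import Counter
import networkx as nx

# Hypothetical toy edge list (not the school data)
edges = [(1, 2), (1, 3), (1, 4), (2, 3)]

# Every edge adds one to the degree of each of its two endpoints
degree_by_hand = Counter()
for u, v in edges:
    degree_by_hand[u] += 1
    degree_by_hand[v] += 1

# networkx reports the same numbers as (node, degree) pairs
degree_from_nx = dict(nx.Graph(edges).degree())

print(dict(degree_by_hand))
print(degree_from_nx)
```

In this toy network the degrees differ from node to node, which is the kind of variation a degree histogram is meant to show.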

## Question 5:
Create a Table called `g_degrees` that has two columns:
* `id` - has the id of each node in the network
* `degree` - has the degree of each node in the network

*[Hint: in recent versions of `networkx`, the `degree` method returns a view of (node, degree) pairs rather than a plain dictionary. You can turn it into a list of pairs with `list(...)`, or into a dictionary with `dict(...)` and then get its entries using the `values` method. It will take a little exploring to figure out exactly how; feel free to work this out together with your neighbor.]*
<BR>

**Degree Table for g_subgraph**

In [None]:
# the function .degree returns the degrees of each node in the network you are working with
g_subgraph.degree(g_subgraph.nodes())

In [None]:
# you can use the list function to convert it into the list format that is easier to work with
list(g_subgraph.degree(g_subgraph.nodes()))

In [None]:
# then we want to create a table listing all the nodes and their degree levels
g_degrees = Table().with_columns([
    'id', g_subgraph.nodes(),
    'degree', [y for (x,y) in list(g_subgraph.degree(g_subgraph.nodes()))] # use all the y values in the pairs (x,y) in the list
])

    
g_degrees

**Now create a degrees table for the complete graph (network `g`) rather than the subgraph:**

In [None]:
g_degrees = Table().with_columns([
    'id', ...,
    'degree', [y for (x,y) in ...]
])

print(g_degrees)


In [None]:
grader.check("q5")

Now you can make a histogram of the degrees of the nodes in the network:

In [None]:
g_degrees.hist('degree', bins=np.arange(0, 16, 1))

## To recap our process so far...

So far, we have:

* started using the `networkx` package;
* learned how to read in a complete-network dataset;
* learned how to take a subgraph from a complete network;
* learned how to plot a network a few different ways;
* ... but we also discovered that plotting networks is only moderately useful for understanding them.

Let's continue using complete network data from the [Add Health project](http://www.cpc.unc.edu/projects/addhealth).

First, we're going to cover a few different ways to quantitatively describe various aspects of network structure. Then we're going to actually compute those metrics for all of the Add Health friendship networks. This will give us a chance to practice writing functions and using iteration.

## Quantifying network structure

There are many different ways to quantify network structure. We're going to start by discussing different ways to measure *network connectivity*. 

Roughly speaking, a network has a high level of connectivity when any node can reach another node by following a small number of network edges. In the case of the Add Health student friendship networks, a highly connected network could arise when students are friends with many of their fellow students. A poorly connected network, on the other hand, could arise when students are segregated into distinct groups that don't interact much with one another.

<img src="example_network.png" style="width: 60%;">

Some of the **basic metrics** of this network are as follows:



* number of nodes: 8
* number of edges: 5
* average degree: (1 + 1 + 3 + 1 + 1 + 1 + 1 + 1) / 8 = 10 / 8
* number of connected components: 3
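
The average degree can also be obtained without listing every node, because each edge adds one to the degree of exactly two nodes. Writing $N$ for the number of nodes, $E$ for the number of edges, and $k_i$ for the degree of node $i$ (notation introduced here only for this illustration), the calculation for the example network is:

$$
\bar{k} = \frac{1}{N}\sum_{i=1}^{N} k_i = \frac{2E}{N} = \frac{2 \times 5}{8} = \frac{10}{8} = 1.25
$$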



And the **shortest distance** between each pair of nodes in the largest component of the graph is as follows:



|             |  node 1 | node 2 |  node 3 |  node 4 |
|   :----:    |  :---:  |  :---: |  :---:  |  :---:  |
|   node 1    |    -    |    2   |    1    |    2    |
|   node 2    |    2    |    -   |    1    |    2    |
|   node 3    |    1    |    1   |    -    |    1    |
|   node 4    |    2    |    2   |    1    |    -    |
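
The distances in this table can be reproduced with a breadth-first search, which visits nodes in order of how many edges away they are from a starting node. The sketch below is one simple way to write it; the dictionary `neighbors` encodes the largest component by hand, and the function name `bfs_distances` is hypothetical.

```python
from collections import deque

# Largest component of the example network, written as an adjacency list
neighbors = {1: [3], 2: [3], 3: [1, 2, 4], 4: [3]}

def bfs_distances(start):
    """Return {node: number of edges on a shortest path from start}."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nb in neighbors[node]:
            if nb not in dist:           # first visit gives the shortest distance
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist

for node in sorted(neighbors):
    print(node, bfs_distances(node))
```

Each printed dictionary corresponds to one row of the table, with the node's distance to itself recorded as 0 instead of a dash.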

### Calculating network metrics with `networkx` (please do this with a partner)

Now we are going to use the `networkx` package to check the calculations we made by hand.

In [None]:
ex_network = nx.Graph([(1,3), (2,3), (3,4), (5,6), (7,8)])

**Practice:** Check that your network is correct by drawing it and comparing it to the image above.

In [None]:
nx.draw_networkx(ex_network, with_labels=True)

The next few questions ask you to use the following functions to check your calculations:

* `number_of_nodes()`
* `number_of_edges()`
* `number_connected_components()`

Check the number of nodes

In [None]:
ex_network.number_of_nodes()

Check the number of edges

In [None]:
ex_network.number_of_edges()

## Question 6: 

What is the average degree of this network?

In [None]:
q6 = ...

In [None]:
grader.check("q6")

Check the number of connected components

In [None]:
nx.number_connected_components(ex_network)

Several of the metrics we discussed only make sense when the entire network is one connected component. We will take the largest connected component of the example network (as we did when we made the calculations by hand above).

Take the largest connected component of the example network. (To do this, look at the help file for the `connected_components` function; the example in the help file shows how to do this)

In [None]:
ex_network_lc = max((ex_network.subgraph(c) for c in nx.connected_components(ex_network)), key=len)

Check that this worked correctly by drawing `ex_network_lc`

In [None]:
nx.draw_networkx(ex_network_lc, with_labels=True)

### Opening up a school network

Recall that the Add Health study sampled schools in many different communities. In part 1, we looked at the network from one of those communities. Now, we're going to look at *all* of the communities. By looking at many different friendship networks, we can hope to better understand the structure of student friendship networks, since we will be able to use evidence from many different networks, instead of from a single example. At the same time, we will try to better understand the different metrics of network structure and how they relate to each other.

In part 1, we had to go through a couple of steps to read a file in and open up a single network. These steps would make a great function, since we will need to repeat them for each of the 84 files we want to open.

Take a look at this function, which you will use in a moment:

In [None]:
def read_add_health_network(network_id):
    """
    network_id : integer from 1 to 84
    
    read in the Add Health network corresponding to the given id number and
    return it as an undirected networkx object
    """

    # this file was downloaded from
    # http://moreno.ss.uci.edu/data.html#adhealth
    edge_file = os.path.join("data", "comm" + str(network_id) + ".dat")
    with open(edge_file, 'r') as f:
        edge_lines = f.readlines()
        
    network = nx.parse_edgelist(edge_lines, nodetype=int, data=[('activity_level', float)])
    
    # note that we call the to_undirected method to ensure we get an undirected network
    return network.to_undirected()

Now let's use this function to actually read in all 84 of the Add Health school networks.

*(This might take a couple of seconds)*

In [None]:
number_add_health_networks = 84
add_health_networks = [read_add_health_network(x) for x in range(1,number_add_health_networks+1)]

Done! Look at the contents of `add_health_networks` to better understand what it is.

### Calculating network statistics for all of the Add Health communities

Let's start by making a dataset that has the number of nodes in each of the 84 Add Health community networks.

In [None]:
num_nodes = make_array()

for g in add_health_networks:
    num_nodes = np.append(num_nodes, g.number_of_nodes())

add_health_firsttry = Table().with_columns([
     'id', np.arange(1, number_add_health_networks+1),
     'num_nodes', num_nodes,
    ])

add_health_firsttry

## Question 7:

Now, following the pattern above, make a more complete dataset called `add_health` which has columns

* `id`
* `num_nodes`
* `num_edges`
* `avg_degree`
* `num_components`

In [None]:
num_nodes = ...
num_edges = ...
...


for g in add_health_networks:
    num_nodes = np.append(num_nodes, g.number_of_nodes())
    num_edges = ...
    avg_degree = np.append(avg_degree, ...)
    num_components = np.append(num_components, nx.number_connected_components(g))

add_health = Table().with_columns([
     'id', np.arange(1, number_add_health_networks+1),
     'num_nodes', num_nodes,
     'num_edges', num_edges,
     'avg_degree', avg_degree,
     'num_components', num_components,
    ])
    
q7 = add_health.num_rows

In [None]:
grader.check("q7")

Let's take a look at the dataset that we just created:

In [None]:
add_health

<!-- BEGIN QUESTION -->

## Question 8: 

Make four histograms that show the distribution of each column (except for `id`).

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

<!-- END QUESTION -->

### Relationship between metrics of network structure

Remember that the goal of these different metrics is to try to find a way to summarize the structure of a network.  It turns out that this is too hard a task to have a single solution: the best way to summarize or describe a network can depend a lot on what you are interested in understanding about the network. For example, one type of summary might tell you about what networks are at high or low risk of quickly spreading an infectious disease and a different type of network metric might tell you about how hierarchical or egalitarian relationships between network members are.

It would be very helpful to understand how these different metrics are related to each other. For example, if two metrics always increase or decrease together, that might tell us that they are capturing the same underlying aspect of network structure. On the other hand, if two metrics are totally unrelated to one another, then that might tell us that each one captures an independent aspect of network structure.

One way to investigate this topic would be to use math to try to derive results that relate the different network metrics to each other. That's a great thing to do (and there has been a lot of work on this topic). But since we're learning how to analyze data, we're going to take a different approach: we're going to use our empirical dataset to see how these metrics behave in a set of real-world friendship networks.
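
A numerical companion to looking at plots is the correlation coefficient, which is close to 1 or -1 when two quantities move together and closer to 0 when they do not. This lab relies on scatterplots rather than correlations, so the sketch below is only an aside. The arrays are made-up values for five hypothetical networks, not Add Health results.

```python
import numpy as np

# Made-up values for five hypothetical networks (not Add Health results)
metric_a = np.array([10, 20, 30, 40, 50])
metric_b = 2 * metric_a + 1             # rises in lockstep with metric_a
metric_c = np.array([3, 1, 4, 1, 5])    # not constructed from metric_a

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the correlation
r_ab = np.corrcoef(metric_a, metric_b)[0, 1]
r_ac = np.corrcoef(metric_a, metric_c)[0, 1]

print(r_ab, r_ac)
```

Two metrics with a correlation near 1, like `metric_a` and `metric_b` here, may be capturing the same underlying aspect of network structure.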

<!-- BEGIN QUESTION -->

## Question 9:

Make scatterplots that investigate the relationship between each pair of the four metrics in the previous question (for a total of six scatterplots). For each scatterplot, briefly comment on whether it suggests that your prediction is correct or not. (We're not doing any formal tests here, so this evidence will only be suggestive.)

_Type your answer here, replacing this text._

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

<!-- END QUESTION -->

# Submitting your work

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

Please upload the .zip file to Gradescope by Tuesday at 2pm.

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(run_tests=True)