In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hwk07_friendship_paradox.ipynb")

In [None]:
!pip install --upgrade networkx

In [None]:
from IPython.core.display import HTML
from datascience import *
import matplotlib
matplotlib.use('Agg')
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import os
plt.style.use('fivethirtyeight')
import networkx as nx

# Homework 7: The friendship paradox

## Why your friends (probably) have more friends than you do

Please read this short article and answer a couple of questions about it:
[Friends you can count on](https://opinionator.blogs.nytimes.com/2012/09/17/friends-you-can-count-on/) by Steve Strogatz  

If you are really curious, you can also optionally look at the original paper that discussed the friendship paradox (and which was the inspiration for this homework's title):
[Why your friends have more friends than you do](http://www.journals.uchicago.edu/doi/abs/10.1086/229693) by Scott Feld [OPTIONAL BACKGROUND]

<!-- BEGIN QUESTION -->

# Question 1
According to Strogatz, why do people experience airplanes, restaurants, parks, and beaches to be more crowded than averages would suggest?  
*[Please answer in one or two sentences]*


_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

# Question 2 

Which two groups did Christakis and Fowler monitor to see who got the flu first? Which group ended up actually getting the flu first?  
*[Please answer in one or two sentences]*

_Type your answer here, replacing this text._

<!-- END QUESTION -->

## Testing the friendship paradox

In this homework, we're going to see if the friendship paradox seems to hold for the networks from the Add Health study. We'll start by loading the Add Health networks into memory, as we did in a past lab. The function below will help.

#### Read in Add Health networks

In [None]:
def read_add_health_network(network_id):
    """
    network_id : integer from 1 to 84
    
    read in the Add Health network corresponding to the given id number and
    return it as an undirected networkx object
    """

    # this file was downloaded from
    # http://moreno.ss.uci.edu/data.html#adhealth
    edge_file = os.path.join(".", "data", "comm" + str(network_id) + ".dat")
    with open(edge_file, 'r') as f:
        edge_lines = f.readlines()
        
    network = nx.parse_edgelist(edge_lines, nodetype=int, data=[('activity_level', float)])
    
    # note that we call the to_undirected method to ensure we get an undirected network
    return(network.to_undirected())
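
To see what `nx.parse_edgelist` produces from lines in this format, here is a minimal sketch using three invented edge lines. The node ids and activity levels in `toy_lines` are made up for illustration and are not taken from the Add Health files.

```python
import networkx as nx

# Three invented edge lines in "node node activity_level" form
toy_lines = ["1 2 1.0", "2 3 2.0", "1 3 1.0"]

# Same call pattern as in read_add_health_network, applied to the toy lines
toy_network = nx.parse_edgelist(toy_lines, nodetype=int, data=[('activity_level', float)])

print(sorted(toy_network.nodes()))          # [1, 2, 3]
print(toy_network[1][2]['activity_level'])  # 1.0
```

Each line becomes one edge, and the third field is stored on the edge under the name `activity_level`.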

Now let's use `read_add_health_network` to actually read in all 84 of the Add Health school networks:

In [None]:
number_add_health_networks = 84
add_health_networks = [read_add_health_network(x) for x in range(1,number_add_health_networks+1)]

## Examining the friendship paradox in one network

We'll start by focusing on one specific network from the Add Health dataset. We'll develop some code using this one network. Later on, we'll generalize our results to all of the networks.

In [None]:
one_network = add_health_networks[0]

# **Question 3** 

Make a table that has two columns: one with the id of each node, and another with the degree of each node.


In [None]:
degree_data = Table().with_columns([
    'id', ...,
    'degree', ...
    ])

degree_data

In [None]:
grader.check("q3")

Now let's work on figuring out how to get the average degree of the neighbors of a single node.

In [None]:
one_node = list(one_network.nodes())[0]
one_node

This bit of code will show `one_node` and some of the nodes around it:

* `one_node` itself has id 1
* the *neighbors* of `one_node` have ids 36, 37, and 52
* the remaining nodes are the neighbors of `one_node`'s neighbors (they help you see the degree of each of `one_node`'s neighbors)

NOTE: this code uses some features of the `networkx` library that we aren't going to talk about in this class. So you don't have to understand exactly how it works (though that would be a good challenge if you want one).

In [None]:
nx.draw(one_network.subgraph(list(nx.single_source_shortest_path_length(one_network, one_node, cutoff=2).keys())),
        with_labels=True)
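
If you want to unpack the cell above, here is a small sketch on a made-up path network (not Add Health data). `nx.single_source_shortest_path_length` with `cutoff=2` finds every node within two steps of the starting node, and `subgraph` keeps only those nodes and the edges among them.

```python
import networkx as nx

# A made-up path network: 0 - 1 - 2 - 3 - 4 - 5
toy_path = nx.path_graph(6)

# Distances from node 0 to every node at most 2 steps away
within_two_steps = dict(nx.single_source_shortest_path_length(toy_path, 0, cutoff=2))
print(within_two_steps)

# Keep only those nodes (and the edges among them)
toy_subgraph = toy_path.subgraph(list(within_two_steps.keys()))
print(sorted(toy_subgraph.nodes()), toy_subgraph.number_of_edges())
```

Nodes 3, 4, and 5 are more than two steps from node 0, so they are left out of the subgraph.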

The `neighbors()` method returns an iterator over the nodes that are adjacent to `one_node`. The cell below collects them into a list so that you can see them:

In [None]:
[y for y in one_network.neighbors(one_node)]

Also, you can get the degree of a specific node using the `degree` method:

In [None]:
one_network.degree(one_node)

As we can see from the drawing above, `one_node` has degree 3.

You can use these facts to help answer the next question.
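
As a quick check of how these two calls behave, here is a sketch on a tiny invented network (the node ids below are hypothetical and unrelated to `one_network`).

```python
import networkx as nx

# Invented network: node 1 is connected to 2, 3, and 4; node 2 is also connected to 3
toy_network = nx.Graph()
toy_network.add_edges_from([(1, 2), (1, 3), (1, 4), (2, 3)])

# neighbors() gives an iterator, so collect it (here with sorted()) to see the ids
toy_neighbors = sorted(toy_network.neighbors(1))
print(toy_neighbors)  # [2, 3, 4]

# degree() of a single node gives the number of edges attached to it
print(toy_network.degree(1), toy_network.degree(4))  # 3 1
```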

# **Question 4** 

The code below should calculate the average degree of `one_node`'s neighbors. Fill in the missing parts.



In [None]:
one_node_neighbors = one_network.neighbors(one_node)

# Average degree of one_node's neighbors = (total degree of all of one_node's neighbors) / (number of neighbors)

nbr_degree_total = 0 # set up to count the total degree of the neighbors
num_nbrs = 0  # set up to count the number of neighbors

for nbr in ...: # for each neighbor in all of the neighbors of one_node
    nbr_degree_total = nbr_degree_total + ...
    num_nbrs = num_nbrs + 1

result = ... / ...

print("average degree of neighbors is ", result)

In [None]:
grader.check("q4")

Now let's generalize the code you just wrote by turning it into a function. This will allow you to easily calculate the average degree of the neighbors of any node you want.

# **Question 5** 

Fill in the code below to create a function that, given any network `g` and node `node`, will return the average degree of the node's neighbors.


In [None]:
def get_average_degree_of_neighbors(g, node):
    """Given a network and a node, compute the average degree of the node's neighbors.
    
    Parameters
    ----------
    g : networkx Graph object
        The network that node is a member of
    node : networkx node (actually just an integer)
        The node
    
    Returns
    -------
    float
        The average degree of the neighbors of node
    
    """
    
    ## get the nodes that are the neighbors of node
    node_neighbors = ...
    
    nbr_degree_total = 0
    num_nbrs = 0
    
    ## get the degrees of each of those nodes
    for nbr in node_neighbors:
        nbr_degree_total = nbr_degree_total + ...
        num_nbrs = num_nbrs + 1
        
    ## calculate the average
    avg_nbr_degree = ... / ...    
    
    ## return it
    return(avg_nbr_degree)

In [None]:
grader.check("q5")

# **Question 6** 

Now use the function you wrote to calculate the average of the neighbors' degrees for every node in `one_network`.


In [None]:
avg_friends_degree = make_array()

for node in ...:
    avg_friends_degree = np.append(avg_friends_degree, ...)

nbr_avg_degrees = Table().with_columns([
    'id', ...,
    'avg_friends_degree', ...
])

nbr_avg_degrees

In [None]:
grader.check("q6")

# **Question 7** 

Now you have created a table `degree_data` which has (node id, node degree) and a second table `nbr_avg_degrees` that has (node id, average of friends' degrees). Join these two tables together so that you have a Table with (node id, node degree, average of friends' degrees); call the resulting Table `friend_data`.

Please refer to the [documentation for `Table.join`](http://data8.org/datascience/_autosummary/datascience.tables.Table.join.html?highlight=join#datascience.tables.Table.join) to learn about the `join` method in the `datascience` module.


In [None]:
friend_data = ....join('id', ...)

friend_data

In [None]:
grader.check("q7")

<!-- BEGIN QUESTION -->

# **Question 8** 

What does the friendship paradox predict about the values in the `degree` and `avg_friends_degree` columns of the `friend_data` Table that you just made? Does it say that (i) on average, they should be about the same; (ii) on average, `degree` should be bigger than `avg_friends_degree`; or (iii) on average, `avg_friends_degree` should be bigger than `degree`?


_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

# **Question 9** 

Now make a scatter plot that shows the relationship between the degree of each node (x axis) and the average degree of the node's friends (y axis).


In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

# **Question 10** 

Does the plot you just made seem consistent with what would be predicted from the friendship paradox?


_Type your answer here, replacing this text._

<!-- END QUESTION -->

## The friendship paradox across all of the Add Health networks

#### Write a function that calculates the fraction of nodes whose degree is less than the average of their neighbors.

The plot you just made investigated the friendship paradox by looking at every single node in one network. Now we are going to try to look at all of the different networks in the Add Health study.

One way to do so would be to look at every single node across all of the networks in the Add Health study. However, we're going to try a different approach: we're going to develop a metric that can be calculated once for every network, and then we'll compare that metric across the different networks in the Add Health study.

The metric we'll look at is the fraction of nodes in the network whose degree is smaller than the average of their friends' degrees. Intuitively, when this metric is high, many nodes in the network experience the friendship paradox (because they have fewer friends than their friends have on average).
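
To make the metric concrete, here is a sketch on a made-up star-shaped friendship network written as a plain dictionary (the names are invented, and the sketch deliberately avoids `networkx`, so it is not the homework solution). One person, `hub`, is friends with three others, who are friends only with `hub`.

```python
# Invented friendship network: each key maps to that person's list of friends
toy_friends = {
    'hub': ['a', 'b', 'c'],
    'a': ['hub'],
    'b': ['hub'],
    'c': ['hub'],
}

# Degree of each person = number of friends
toy_degree = {person: len(friends) for person, friends in toy_friends.items()}

# Average degree of each person's friends
toy_avg_friends_degree = {
    person: sum(toy_degree[f] for f in friends) / len(friends)
    for person, friends in toy_friends.items()
}

# Fraction of people whose degree is smaller than their friends' average degree
num_smaller = sum(toy_degree[p] < toy_avg_friends_degree[p] for p in toy_friends)
toy_metric = num_smaller / len(toy_friends)
print(toy_metric)  # 0.75
```

Here three of the four people have fewer friends than their friends do on average, so the metric is 0.75.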

# **Question 11** 

The function below should take a network and calculate the fraction of the nodes in the network whose degree is smaller than the average of their friends' degrees. Fill in the missing parts. (This function should make use of the work you did above.)


In [None]:
def frac_degree_lt_neighbors(g):
    degree = make_array()
    avg_friends_degree = make_array()
    
    for node in g.nodes():
        degree = np.append(degree, ...)
        avg_friends_degree = np.append(avg_friends_degree, ...)

    # calculate the fraction of nodes whose degree is smaller than the average of
    # their friends' degrees and return it
    # (HINT: you should fill in a boolean expression here to help calculate the fraction)
    return(np.mean(...))    
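
The hint relies on the fact that `np.mean` treats `True` as 1 and `False` as 0, so the mean of a boolean array is the fraction of `True` entries. A small sketch with invented numbers (unrelated to any network):

```python
import numpy as np

# Invented values for illustration
first = np.array([1, 5, 2, 8])
second = np.array([3, 4, 6, 7])

# Elementwise comparison gives a boolean array: [True, False, True, False]
is_smaller = first < second

# The mean of a boolean array is the fraction of True entries
fraction_smaller = np.mean(is_smaller)
print(fraction_smaller)  # 0.5
```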

In [None]:
grader.check("q11")

Here is a function that uses `get_average_degree_of_neighbors`, which you wrote above, to calculate the network-wide average of the neighbors' average degree. The function:

1. For each node in the network, calculates the average number of friends that the node's friends have
2. Takes the average of that quantity over all of the nodes in the network

The result is a single number that describes the network as a whole.

We'll use this function below.

In [None]:
def get_avg_nbr_degree(g):
    avg_nbr_degree_total = 0 # set up to count the average degree of neighbors of all the nodes of a given network g
    num_nodes = 0 # set up to count the number of all the nodes of a given network g
    
    for node in g.nodes():
        avg_nbr_degree_total = avg_nbr_degree_total + get_average_degree_of_neighbors(g, node)
        num_nodes = num_nodes + 1
        
    return(avg_nbr_degree_total / num_nodes)

To see an example of the function in action, try this out:

In [None]:
get_avg_nbr_degree(one_network)

This means that the average node in the network `one_network` has friends whose average degree is about 7.7.

#### Apply the function to calculate the average degree and average friends' degree for all Add Health networks.

# **Question 12** 

Now go through and, for each Add Health network, calculate (i) the average degree of the network; (ii) the average of each node's neighbors' degrees (from `get_avg_nbr_degree` above); (iii) the fraction of nodes whose degree is smaller than the average of their neighbors' degrees (from `frac_degree_lt_neighbors`, which you wrote in Question 11).


In [None]:
avg_degree = make_array()
avg_neighbor_degree = make_array()
frac_smaller_than_neighbors = make_array()

for g in add_health_networks:
    avg_degree = np.append(avg_degree, ...)
    avg_neighbor_degree = np.append(avg_neighbor_degree, ...)
    frac_smaller_than_neighbors = np.append(frac_smaller_than_neighbors, ...)

add_health_msmts = Table().with_columns([
     'id', np.arange(1, number_add_health_networks+1),
     'avg_degree', avg_degree,
     'avg_neighbor_degree', avg_neighbor_degree,
     'frac_lt_neighbors', frac_smaller_than_neighbors
    ])

In [None]:
grader.check("q12")

<!-- BEGIN QUESTION -->

# **Question 13** 


Make a scatterplot that compares the average degree (x axis) and the average neighbor degree (y axis) across all of the Add Health networks.


In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

# **Question 14** 

Does the scatterplot you just made seem to be consistent with the friendship paradox?


_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

# **Question 15** 

Make a histogram that shows, across all of the Add Health networks, the distribution of the fraction of nodes whose degree is smaller than their neighbors' average degree.


In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

# **Question 16** 

Does the histogram you just made seem to be consistent with what you would expect from the friendship paradox? 

_Type your answer here, replacing this text._

<!-- END QUESTION -->



---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

Please upload the .zip file to Gradescope by 11:59pm on Thursday 11/30.

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(run_tests=True)