In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hwk_clusteringcoef.ipynb")

In [None]:
from IPython.core.display import HTML
from datascience import *
import itertools
import matplotlib

import matplotlib.pyplot as plt
import numpy as np
import os
plt.style.use('fivethirtyeight')

import networkx as nx
%matplotlib inline

#np.random.seed(99)

# Homework 2 - Calculating the clustering coefficient

In this assignment, we are going to calculate the average clustering coefficient of a network based on the concepts we discussed in the lecture. We will start with a small, toy network and then we will move on to analyze a complete network dataset from the [Add Health project](http://www.cpc.unc.edu/projects/addhealth).

## Practice questions: calculating average clustering coefficient by hand

**When you have attempted these two practice questions, you can scroll down to the bottom of the notebook to see the answers (and check your work).  You do not need to submit answers to these practice questions.**


Consider the network created by the following code.

In [None]:
ex_network = nx.Graph([(1,3), (2,3), (1,2), (3,4), (5,6), (7,8)])
ex_network.add_node(9)
nx.draw_circular(ex_network, with_labels=True, font_color='white')


**Practice question 1)** For each node in the above graph, calculate the following things:
* Degree of the node, 
* Number of pairs of neighbors of the node,
* Number of the pairs of neighbors that are directly connected with each other

Write down your answer as a table with the following four columns:
1) NodeId, 2) Degree of the node, 3) Number of pairs of neighbors, 4) Number of the pairs of the neighbors that are directly connected

Hint: Remember that the number of pairs of neighbors of a node refers to the number of possible connections between that node's neighbors. For example, node 3 is connected to nodes 1, 2, and 4, so there are three possible connections: (1,2), (1,4), and (2,4).
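
The count in the hint follows the "choose 2" rule: a node with degree k has k(k-1)/2 pairs of neighbors. A minimal sketch using only the standard library (the name `neighbors_of_3` is illustrative and not part of the assignment):

```python
import itertools
import math

# Neighbors of node 3 in the example network, as stated in the hint.
neighbors_of_3 = [1, 2, 4]

# Enumerate every unordered pair of neighbors.
pairs = list(itertools.combinations(neighbors_of_3, 2))
print(pairs)

# The number of pairs equals "k choose 2", i.e. k * (k - 1) / 2.
k = len(neighbors_of_3)
n_pairs = math.comb(k, 2)
print(n_pairs)
```

A node with degree 0 or 1 has no pairs of neighbors at all, which is why the count is 0 for such nodes.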

*Replace this markdown cell with your answer*

**Practice question 2)** Recall that the **clustering coefficient** of a node is the proportion of the pairs of its neighbors that are connected to each other; in other words, it quantifies the extent to which a node's friends are friends with one another. (You can check the lecture slides and lecture demo for a review.)
For each node in the graph above, calculate the clustering coefficient. Present your answers in a table with two columns: NodeId, Clustering Coefficient
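
In symbols, one common way to write this definition is shown below. The notation $k_i$ and $e_i$ is introduced here for illustration and may differ from the lecture slides:

$$ C_i = \frac{e_i}{\binom{k_i}{2}} = \frac{2\, e_i}{k_i\,(k_i - 1)} $$

Here $k_i$ is the degree of node $i$ and $e_i$ is the number of pairs of its neighbors that are directly connected. When $k_i < 2$ there are no pairs of neighbors; the code later in this notebook sets $C_i = 0$ in that case.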

*Replace this markdown cell with your answer*

## Question 1:
Calculate the average clustering coefficient for the whole graph used in the practice questions above and assign it to q1.

In [None]:
q1 = ...

In [None]:
grader.check("q1")

### Calculating the average clustering coefficient using the `networkx` package

The `networkx` library provides a function, `average_clustering`, that calculates the average clustering coefficient of a graph. Use `average_clustering` to calculate the average clustering coefficient of the graph above.

The value returned by this function should be the same as the answer you calculated by hand for q1. (There may be a very minor difference if you rounded your answer.)
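
As a quick sanity check of what `average_clustering` reports, here is a minimal sketch on two standard graphs that are not part of the assignment: a complete graph, where every pair of neighbors is connected, and a path graph, where none are.

```python
import networkx as nx

# In a complete graph every pair of a node's neighbors is connected,
# so every node has clustering coefficient 1.
complete = nx.complete_graph(4)
print(nx.average_clustering(complete))

# In a path graph no two neighbors of a node are connected,
# so every node has clustering coefficient 0.
path = nx.path_graph(4)
print(nx.average_clustering(path))

# nx.clustering returns the per-node coefficients as a dictionary.
print(nx.clustering(complete))
```

The per-node dictionary from `nx.clustering` can be handy for comparing against a table worked out by hand.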

In [None]:
nx.average_clustering(ex_network)

## Creating a function to calculate the clustering coefficient from scratch

In this part of this homework, we are going to write a function to calculate the average clustering coefficient of a network from scratch.

We'll start by spelling out an algorithm for calculating the clustering coefficient:

**Algorithm** for calculating the clustering coefficient of ONE node is as follows.

1. Get all the neighboring nodes of node x
2. Get all the possible pairs of the neighboring nodes of x
3. Count how many of the pairs created in Step 2 are directly connected to each other
4. Divide the number of directly connected pairs (Step 3) by the total number of pairs (Step 2)
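
The four steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the graph `toy_graph` is hypothetical (it is not the assignment's network), it is stored as a dictionary of neighbor sets rather than as a `networkx` graph, and the function name `clustering_of` is made up for this example.

```python
import itertools

# Hypothetical toy graph (not the assignment's network), stored as an
# adjacency dictionary: each node maps to the set of its neighbors.
toy_graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c"},
}

def clustering_of(adjacency, node):
    # Step 1: get the neighbors of the node.
    neighbors = sorted(adjacency[node])
    # Step 2: list every possible pair of neighbors.
    pairs = list(itertools.combinations(neighbors, 2))
    # Step 3: count the pairs that are directly connected.
    connected = sum(1 for u, v in pairs if v in adjacency[u])
    # Step 4: divide by the total number of pairs (0 if there are none).
    return connected / len(pairs) if pairs else 0.0

print(clustering_of(toy_graph, "a"))
print(clustering_of(toy_graph, "b"))
```

Node `a` has three neighbors and therefore three pairs, of which two (`b`-`c` and `c`-`d`) are connected; node `b` has a single pair, and it is connected.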

**Aside on iterators**: Many of the methods in the networkx library return an iterator. A detailed dive into the concept of iterators is beyond the scope of this course (and homework). But for our purposes, there are two useful things to remember about iterators:

1. Iterators can be used in for loops just like a list or any other container:
<code> for x in iterator </code>
2. Iterators can be easily converted into lists by using a list comprehension:
<code> [x for x in iterator] </code>
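
A minimal sketch of both points, using a generator expression as a stand-in for an iterator returned by a library (the variable names are illustrative). It also shows one caveat worth knowing: an iterator can be consumed only once, which is a good reason to convert it into a list when the values are needed more than once.

```python
# A generator expression is one simple way to get an iterator.
squares = (n * n for n in range(4))

# 1. Iterators work directly in a for loop.
total = 0
for value in squares:
    total += value
print(total)

# An iterator can be consumed only once: it is now empty.
leftover = list(squares)
print(leftover)

# 2. Converting a fresh iterator into a list keeps the values around.
squares_list = [x for x in (n * n for n in range(4))]
print(squares_list)
```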

**Step 1**: To get all the neighboring nodes of a node `x`, we can use the `neighbors` method that `networkx` provides on a graph object:

In [None]:
def get_neighbors(graph_instance, node_id): # two inputs: the network as a graph object, and the id of the node
    ''' Get all the neighbors of node_id in graph_instance as a list'''
    neighbors_iter = graph_instance.neighbors(node_id) # Use the .neighbors method of the graph object, which returns an iterator over the neighbors
    neighbors_list = [neighbor for neighbor in neighbors_iter] # Convert the iterator into a list (easier to work with)
    return neighbors_list

Now we can get the neighbors of node 3:

In [None]:
neighbor_n3 = get_neighbors(ex_network, 3)
neighbor_n3

**Step 2**: Now that we have the list of neighbors for a given node, the next step is to convert it into a list of all possible pairs of neighbors. We are going to use the `combinations` function of the built-in `itertools` library to do this.

Note that the `itertools.combinations` function returns an iterator. To keep things simple, we are going to convert this iterator into a list (see above).

In [None]:
# example usage of itertools.combinations
# you get an iterator over all the possible combinations of 2 elements from the list [1,2,3]
[x for x in itertools.combinations([1,2,3],2)]

In [None]:
# now let's define a function that returns all the possible pairs of neighbors of one node.
def get_neighbors_pairs(neighbors_list): # this function should apply to the returned neighbor list from the function get_neighbors
    lst = [x for x in itertools.combinations(neighbors_list,2)]
    return lst

In [None]:
# get all the possible pairs among nodes 1, 2, and 4, which are node 3's neighbors
allpairs_n3 = get_neighbors_pairs(neighbor_n3)
allpairs_n3 

**Step 3:** In the next step, we are required to count the number of pairs that are directly connected with each other. We are going to use the `neighbors` function from the `graph` class.
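
As an aside, `networkx` graphs also provide a `has_edge` method, which answers the same "are these two nodes directly connected?" question. The sketch below uses a small hypothetical graph (a square, not the assignment's network) to show that the two checks agree:

```python
import networkx as nx

# Hypothetical graph: a square 1-2-3-4-1 with no diagonals.
square = nx.Graph([(1, 2), (2, 3), (3, 4), (4, 1)])

# Checking membership in the neighbors of a node ...
via_neighbors = 1 in square.neighbors(2)
# ... gives the same answer as asking whether the edge exists.
via_has_edge = square.has_edge(1, 2)
print(via_neighbors, via_has_edge)

# Nodes 1 and 3 sit on opposite corners, so both checks are False.
opposite_via_neighbors = 3 in square.neighbors(1)
opposite_via_has_edge = square.has_edge(1, 3)
print(opposite_via_neighbors, opposite_via_has_edge)
```

Either check works for an undirected graph; this notebook uses the `neighbors` version.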

In [None]:
def count_connected_neighbors(graph_instance, neighbors_pairs):
    count = 0 # start the count at 0, and add 1 for every connected pair of neighbors
    for x, y in neighbors_pairs:
        if x in graph_instance.neighbors(y):
            # If two nodes x and y are directly connected in graph_instance, then x is among the neighbors of y and vice versa.
            count += 1
    return count

In [None]:
connected_count_n3=count_connected_neighbors(ex_network,allpairs_n3)
connected_count_n3

## Question 2:

**Step 4:** As the last step in the algorithm, we are going to calculate the clustering coefficient of node 3.

In [None]:
def get_clustering_coeff(allpairs, connected_neighbors_count): 
    # 2 inputs: the list of all possible pairs, and the count of connected pairs of neighbors
    n_allpairs=len(allpairs)
    if n_allpairs!=0:
        cc=float(connected_neighbors_count)/n_allpairs
    else:
        cc=0
    return cc

_Type your answer here, replacing this text._

In [None]:
q2 = ...
q2

In [None]:
grader.check("q2")

## Question 3:
Now that we have all the functions needed to compute the clustering coefficient, we are going to write a single function that calls each of these steps internally. Complete the following code.

In [None]:
def get_cc_node(graph_instance, node_id):
    '''return the clustering coefficient of the node_id in the graph_instance'''
    neighbors = ... # Hint: Use one of the functions defined in the previous questions
    
    pairs = ... # Hint: Use one of the functions defined in the previous questions
    
    connected_count = ... # Hint: Use one of the functions defined in the previous questions
    
    cc = ... # Hint: Use one of the functions defined in the previous questions
    
    return cc
    
q3 = get_cc_node(ex_network, 3)

In [None]:
grader.check("q3")

## Question 4:

Now that we have a function to calculate the clustering coefficient of a single node in a graph, our next step is to calculate the average clustering coefficient of all the nodes in a graph.

This can be done by:

1. Calculating the clustering coefficient of every node in the graph and saving the results in an array
2. Calculating the mean of the array 
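
The accumulate-then-average pattern can be sketched on its own, independent of any graph. The values below are hypothetical and only illustrate the mechanics; `np.array([])` stands in here for an empty array.

```python
import numpy as np

# Hypothetical per-node values, used only to illustrate the pattern.
example_values = [1.0, 0.5, 0.0]

# Start from an empty array and append one value at a time.
values_array = np.array([])
for value in example_values:
    # np.append returns a new array, so the result must be reassigned.
    values_array = np.append(values_array, value)

print(values_array)
mean_value = np.mean(values_array)
print(mean_value)
```

Forgetting to reassign the result of `np.append` is a common slip: the original array is left unchanged.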

Complete the following function to calculate the average clustering coefficient of all the nodes in a graph.

In [None]:
def get_average_cc(graph_instance):
    
    cc_array = make_array() # begin with an empty array and then append new results to it
    
    for node in ...: # look at all nodes
        cc = ...
        cc_array = np.append(cc_array,cc)
    
    return np.mean(cc_array)
    

q4 = get_average_cc(ex_network)
q4

In [None]:
grader.check("q4")

Note that the average clustering coefficient computed by your function (q4) should be equal to the average clustering coefficient you calculated by hand in q1.

In [None]:
round(q1,3) == round(q4,3)

## Clustering coefficients in real world Add Health networks

As the next step in this homework, we are going to calculate the average clustering coefficient for real world networks from the Add Health study.
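
The helper defined next relies on `nx.parse_edgelist` to turn lines of text into a graph. As a minimal sketch of that parsing step, the example below assumes a hypothetical three-column layout (two node ids followed by a numeric value, separated by spaces); the lines are made up for illustration and are not taken from the Add Health files.

```python
import networkx as nx

# Hypothetical edge lines: "source target value", separated by spaces.
example_lines = ["1 2 1.0", "2 3 2.0", "1 3 1.0"]

# nodetype=int converts node labels to integers; data names the extra
# column and gives the type used to parse it.
example_net = nx.parse_edgelist(
    example_lines, nodetype=int, data=[("activity_level", float)]
)

print(sorted(example_net.nodes()))
print(example_net.number_of_edges())
print(example_net[2][3]["activity_level"])
```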

To start, this function will be helpful: it reads the data in for a single Add Health network.

In [None]:
def read_add_health_network(network_id):
    """
    network_id : integer from 1 to 84
    
    read in the Add Health network corresponding to the given id number and
    return it as an undirected networkx object
    """

    # this file was downloaded from
    # http://moreno.ss.uci.edu/data.html#adhealth
    edge_file = os.path.join("data", "comm" + str(network_id) + ".dat")
    with open(edge_file, 'r') as f:
        edge_lines = f.readlines()
        
    network = nx.parse_edgelist(edge_lines, nodetype=int, data=[('activity_level', float)])
    
    # note that we call the to_undirected method to ensure we get an undirected network
    return(network.to_undirected())

Now let's use this function to actually read in all 84 of the Add Health school networks:

In [None]:
number_add_health_networks = 84
add_health_networks = [read_add_health_network(x) for x in range(1,number_add_health_networks+1)]
# Running this cell will take a few seconds
# now add_health_networks is an object containing 84 networks

## Question 5:

To warm up, let's calculate the average clustering coefficient for **the first network** in the Add Health study.

In [None]:

g = add_health_networks... # assign the first network in add_health_networks to 'g', use the index correctly

cc_nx = nx.average_clustering(...) # we use the average clustering coefficient function to do the calculation

print('Average clustering coefficient calculated by the networkx library', cc_nx)

cc_custom = get_average_cc(...) # we use the customized function to do the calculation

print('Average clustering coefficient calculated by our custom function', cc_custom)


In [None]:
grader.check("q5")

# Calculating clustering coefficient for all of the Add Health networks

Now we'll calculate the clustering coefficient for all of the Add Health networks.

## Question 6: 
Let's start by making a dataset that has the average clustering coefficient in each of the 84 Add Health community networks. Fill in the missing code below:

In [None]:
cc_ah = make_array()

for g in ...:
    cc_ah = np.append(..., ...) # np.append will append the new results to the end of original array
                                # np.append(the original array, the new element), check np.append? for more info
                                # we want to make an array of all the clustering coefficients for all the 84 networks

add_health_df = Table().with_columns([  # here we want to make a table for these coefficients
     'id', np.arange(1, number_add_health_networks+1), # the first column has the ids of the networks
     'clustering_coeff', ...                           # the second column has the clustering coefficient
    ])
add_health_df

In [None]:
grader.check("q6")

<!-- BEGIN QUESTION -->

## Question 7:

We might wonder how much the clustering coefficient changes from community to community. Make a histogram that shows the distribution of the clustering coefficient for the Add Health communities (don't include `id`).

In [None]:
...

<!-- END QUESTION -->

### Answers to practice questions

Answer to practice question 1:

```
 Columns: NodeId, Degree of the node, Number of pairs of neighbors, Number of the pairs of the neighbors that are directly connected
 
 1,2,1,1
 2,2,1,1
 3,3,3,1
 4,1,0,0
 5,1,0,0
 6,1,0,0
 7,1,0,0
 8,1,0,0
 9,0,0,0
```

Answer to practice question 2:

```
 Columns: NodeId, Clustering Coefficient
 1,1
 2,1
 3,0.333
 4,0
 5,0
 6,0
 7,0
 8,0
 9,0
```

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

Please upload the .zip file to Gradescope.

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(run_tests=True)