In [1]:
from IPython.display import HTML
from datascience import *

import matplotlib
matplotlib.use('Agg')
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import os
import pickle
import pandas as pd

plt.style.use('fivethirtyeight')

import networkx as nx

def css_styling():
    styles = open('../notebook_styles.css', 'r').read()
    return HTML(styles)
css_styling()

In [None]:
# These lines load the tests.
from client.api.notebook import Notebook 
hwk_homophily = Notebook('hwk_homophily_moc.ok')
_ = hwk_homophily.auth(inline=True)

### L&S 88-4: Social networks

# Homework 05

## Patterns of connection between US Legislators

In this homework, we'll be investigating patterns of connections on Twitter among Members of Congress (MOC). This dataset comes from the official Twitter accounts of members of Congress in the fall of 2016. We've made a few simplifications here:

* On Twitter, following is a *directed relation*: person A can follow person B without person B following person A. Here, we've taken these directed relationships and turned them into an undirected network.
* Almost every Senator and Representative is in this dataset, but a few are missing; we'll ignore these missing people here.
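
As a rough illustration of the first simplification, here is a minimal sketch on a made-up three-account network. It assumes a tie is kept whenever either account follows the other; the dataset description does not say which rule was actually used, and the account names are hypothetical.

```python
import networkx as nx

# Toy directed "follows" network (hypothetical accounts, not from the dataset)
follows = nx.DiGraph()
follows.add_edges_from([
    ('A', 'B'),  # A follows B
    ('B', 'A'),  # B follows A back
    ('A', 'C'),  # A follows C; C does not follow back
])

# Assumption: a follow in either direction becomes one undirected tie
ties = follows.to_undirected()

print("directed edges:  ", follows.number_of_edges())
print("undirected edges:", ties.number_of_edges())
print("C tied to A?     ", ties.has_edge('C', 'A'))
```

Note that the mutual follow between A and B collapses into a single undirected edge, so the undirected network can have fewer edges than the directed one.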

Let's start by loading the dataset:

In [None]:
official_congress_twitter = pickle.load(open('../data/congress-twitter/us_congress_2016_twitter.pickle', 'rb'))

## Exploratory analysis of the dataset

Like the network that we studied in Lab 07, the nodes in the `official_congress_twitter` network have attributes. These attributes include:

* `official_full` - the MOC's full name
* `gender` - the MOC's gender
* `party` - the MOC's political party
* `state` - the MOC's state
* `type` - either `sen` for Senator or `rep` for Representative
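
To see how attributes like these are stored on nodes, here is a small sketch on a toy network. The two nodes and their attribute values are invented for illustration; only the attribute names come from the list above.

```python
import networkx as nx

# Toy network with two made-up nodes carrying the same attribute names
toy = nx.Graph()
toy.add_node(0, official_full='Pat Example', gender='F',
             party='Democrat', state='CA', type='sen')
toy.add_node(1, official_full='Sam Sample', gender='M',
             party='Republican', state='TX', type='rep')
toy.add_edge(0, 1)

# Each node's attributes are stored as a dictionary
for node_id, attrs in toy.nodes(data=True):
    print(node_id, attrs['official_full'], attrs['state'])

# Pull out a single attribute for every node: {node_id: value}
party_by_node = nx.get_node_attributes(toy, 'party')
print(party_by_node)
```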

**Question** Fill in the missing parts of this helper function, which extracts node ids and attributes from a given network.

In [None]:
def attribute_to_table(g, att):
    """
    Given a network `g` and the name of an attribute `att`,
    return a table that has a column with node ID and a column with the
    node's attribute value
    """
    node_ids = ...
    
    att_vals = list(...)
    
    result = Table().with_columns('node_id', node_ids,
                                  att, att_vals)
    
    return(result)

## example usage
attribute_to_table(official_congress_twitter, 'gender')

In [None]:
_ = hwk_homophily.grade('q_attribute_to_table')

**Question** According to this dataset, what fraction of Congress are women?

In [None]:
...

frac_women = ...

print("female proportion of members of Congress: ", frac_women)

In [None]:
_ = hwk_homophily.grade('q_frac_women')

**Question** According to this dataset, what fraction of Congress is Republican?

In [None]:
...

frac_republican = ...

print("Republican proportion of members of Congress: ", frac_republican)

In [None]:
_ = hwk_homophily.grade('q_frac_republican')

**Question** What is the average degree of the network?

In [None]:
moc_avg_degree = ...

print('The average degree is: ', moc_avg_degree)

In [None]:
_ = hwk_homophily.grade('q_avg_degree')

## Investigating homophily by gender and by party

Now we'll turn to a substantive question about patterns of connection in this network: does the network seem to have more homophily by party or by gender?

**Question** What would it mean if the patterns of connection in this network were homophilous by party?

<div class='response'>
[answer here]
</div>

**Question** Would you predict that connections by party will be (1) homophilous, (2) random, or (3) heterophilous (opposites attract)? Why?

<div class='response'>
[answer here]
</div>

**Question** Would you predict that connections by gender will be (1) homophilous, (2) random, or (3) heterophilous (opposites attract)? Why?

<div class='response'>
[answer here]
</div>

**Question** We talked about the assortativity coefficient as one way of quantifying the amount of homophily in a network. What is implied about homophily when the assortativity coefficient is (1) negative, (2) zero, and (3) positive?

<div class='response'>
1. [answer here]   
2. [answer here]   
3. [answer here]
</div>

## Connections by gender and by party

Now we'll make a couple of plots that show the MOC Twitter network with nodes colored by gender (first plot) and by party (second plot).  

You don't have to write any code here, but you have to answer some questions.

#### Plot of network by gender

This first plot will draw the MOC Twitter network with nodes colored according to their gender: red for females and blue for males.

In [None]:
pos = nx.spring_layout(official_congress_twitter)

node_color_scheme = {'M' : 'blue', 'F' : 'red'}
node_colors = [node_color_scheme.get(official_congress_twitter.node[node]['gender'], 'white') 
               for node in official_congress_twitter.nodes()]

plt.figure(figsize=(6,6))
nx.draw_networkx_nodes(official_congress_twitter, 
                       pos,
                       node_color=node_colors, 
                       scale=10, node_size=10)

nx.draw_networkx_edges(official_congress_twitter, 
                       pos,
                       alpha=.01)
plt.title('Members of Congress by gender', loc='left')

**Question** How useful is this plot for understanding homophily by gender in this network?

<div class='response'>
[answer here]
</div>

#### Plot of network by party

This next plot will draw the MOC Twitter network with nodes colored according to party: blue for Democrats, red for Republicans, and grey for Independents.

In [None]:
pos = nx.spring_layout(official_congress_twitter)

node_color_scheme = {'Democrat' : 'blue', 'Republican' : 'red', 'Independent' : 'grey'}
node_colors = [node_color_scheme.get(official_congress_twitter.node[node]['party'], 'white') 
               for node in official_congress_twitter.nodes()]

plt.figure(figsize=(6,6))
nx.draw_networkx_nodes(official_congress_twitter, 
                       pos,
                       node_color=node_colors, 
                       scale=10, node_size=10)

nx.draw_networkx_edges(official_congress_twitter, 
                       pos,
                       alpha=.01)
plt.title('Members of Congress by party', loc='left')

**Question** How useful is this plot for understanding homophily by party in this network?

<div class='response'>
[answer here]
</div>

### Quantifying homophily using the assortativity coefficient

Now we will move beyond visualization and start to quantify the patterns of connections in the MOC Twitter dataset.
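
Before working with the real data, it may help to see the calculation on a toy network. The sketch below builds two tiny made-up networks with a hypothetical `group` attribute, one where ties only connect nodes in the same group and one where ties only connect nodes in different groups, and computes the attribute assortativity coefficient for each with `networkx`.

```python
import networkx as nx

def toy_network(edges, groups):
    """Build a small network whose nodes carry a 'group' attribute."""
    g = nx.Graph()
    for node, group in groups.items():
        g.add_node(node, group=group)
    g.add_edges_from(edges)
    return g

# Hypothetical group labels for four nodes
groups = {0: 'x', 1: 'x', 2: 'y', 3: 'y'}

# Ties only within groups vs. ties only between groups
within_only = toy_network([(0, 1), (2, 3)], groups)
between_only = toy_network([(0, 2), (0, 3), (1, 2), (1, 3)], groups)

r_within = nx.attribute_assortativity_coefficient(within_only, 'group')
r_between = nx.attribute_assortativity_coefficient(between_only, 'group')

print("within-group ties only: ", r_within)
print("between-group ties only:", r_between)
```

These two networks are deliberately extreme; real networks usually fall somewhere in between.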

**Question** Calculate the assortativity coefficient for gender in this network

In [None]:
observed_r_gender = ...

print("Observed assortativity coefficient for gender: ", observed_r_gender)

In [None]:
_ = hwk_homophily.grade('q_obsvd_r_gender')

**Question** Calculate the assortativity coefficient for party in this network

In [None]:
observed_r_party = ...

print("Observed assortativity coefficient for party: ", observed_r_party)

In [None]:
_ = hwk_homophily.grade('q_obsvd_r_party')

**Question** How would you interpret these results? In other words, what do these two results suggest about homophily in the MOC Twitter network?

<div class='response'>
[answer here]
</div>

## Testing a hypothesis

Calculating the assortativity coefficient is a useful first step in our investigation of homophily in the MOC Twitter network. However, as we discussed in Lab 07, it is difficult to interpret the assortativity coefficient alone. After all, if the connections in the MOC Twitter network were formed completely at random, with no homophily at all, then we could see values of the assortativity coefficient that were different from 0 just by chance.

Thus, to get more persuasive evidence for or against the theory that there is homophily in who follows whom among members of Congress, we will adopt the approach we used in Lab 07: we will develop a null model that describes a world with no homophily, and then we will see whether or not the observed assortativity coefficient seems to be typical of the assortativity coefficients produced by the null model. We'll summarize our results with a p value.
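
To make the logic of the p value concrete, here is a minimal sketch with made-up numbers. The array `null_stats` stands in for statistics simulated under a null model and `observed_stat` for the value seen in real data; both are invented for illustration. The empirical p value is the fraction of simulated values that sit at least as far from the center of the null distribution as the observed value does.

```python
import numpy as np

# Made-up statistics from ten hypothetical null-model simulations
null_stats = np.array([-0.04, -0.02, -0.01, 0.00, 0.00,
                       0.01, 0.02, 0.03, 0.05, -0.04])

# Made-up observed statistic
observed_stat = 0.035

# Center of the null distribution (its mean)
null_center = np.mean(null_stats)

# Two-sided empirical p value: share of simulated values at least
# as far from the center as the observed value
p_value = np.mean(np.abs(null_stats - null_center) >=
                  np.abs(observed_stat - null_center))

print("empirical p value:", p_value)
```

With only ten simulated values the p value can only take a handful of distinct values; using more simulations gives a finer-grained estimate.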


We'll start by looking at a few functions which are provided to you.

These first two functions should be familiar from Lab 07.

In [None]:
def er_by_degree(n, avg_degree):
    return(nx.erdos_renyi_graph(n=n, p=avg_degree / (n-1)))

def rand_er_network(network):

    network_n = network.number_of_nodes()
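    # each edge adds 1 to the degree of two nodes, so the average
    # degree of an undirected network is 2 * (edges) / (nodes)
    network_dbar = 2 * network.number_of_edges() / network_n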
    return(er_by_degree(network_n, network_dbar))

The next function is similar to a function we saw in Lab 07. However, this function is more general: given an observed network, it generates a 'matching' ER (Erdős–Rényi) random network. The ER network will match the observed one because

* it has the same number of nodes
* in expectation, it has the same average degree
* it has the same distribution of attributes as the observed network
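
The phrase "in expectation" in the second point can be illustrated with a short sketch. The values of `n` and `avg_degree` below are arbitrary choices for illustration, not taken from the MOC network. In an ER network with `n` nodes, each of the `n * (n - 1) / 2` possible edges appears with probability `p`, so choosing `p = avg_degree / (n - 1)` gives an expected average degree of `avg_degree`; any single random draw will be close to this but not exactly equal.

```python
import networkx as nx

# Arbitrary illustrative values (not from the MOC dataset)
n = 200
avg_degree = 6

# Edge probability that gives the target average degree in expectation
p = avg_degree / (n - 1)

# Expected number of edges: p times the number of possible pairs
expected_edges = p * n * (n - 1) / 2

# One random draw; the seed is fixed only to make the sketch repeatable
g = nx.erdos_renyi_graph(n=n, p=p, seed=1)

# Average degree of an undirected network is 2 * edges / nodes
realized_avg_degree = 2 * g.number_of_edges() / n

print("expected number of edges:", expected_edges)
print("realized number of edges:", g.number_of_edges())
print("realized average degree: ", realized_avg_degree)
```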


In [None]:
def matching_rand_er_network_with_attributes(network):
    """
    Return an ER random network that matches the network passed in as an argument (`network`)
    in terms of 
    (i) the number of nodes; 
    (ii) the expected average degree; and
    (iii) the distribution of node attributes
    """
       
    # get the values of the attributes in the observed network
    attribute_values = [x[1] for x in network.nodes(data=True)]
    
    # generate matching ER random network
    g = rand_er_network(network)
    
    # give the nodes in the random network the same attributes as
    # in the observed network
    # enumerate yields (position, node) pairs
    for idx, node in enumerate(g.nodes()):
        g.node[node].update(attribute_values[idx])
    
    # return the result
    return(g)

**Question** Use this function to generate one random network that matches the `official_congress_twitter` network.

In [None]:
example_er = ...

Let's just double-check that the ER network has the same distribution of attributes as the original network:

In [None]:
attribute_to_table(official_congress_twitter, 'gender').group('gender')

In [None]:
attribute_to_table(example_er, 'gender').group('gender')

In [None]:
_ = hwk_homophily.grade('q_matching_er_random_net')

Looks good!

**Question** Fill in the missing parts of the code below to simulate 250 random ER networks that match the observed MOC Twitter network. For each randomly generated ER network, calculate the assortativity coefficient for gender and for party.   
*[NOTE: it may take ~2 minutes to run this part]*

In [None]:
er_r_gender = make_array()
er_r_party = make_array()

for _ in range(250):
    
    er_net = ...
    
    er_r_gender = np.append(er_r_gender, ...)
    er_r_party = np.append(er_r_party, ...)
    
er_r = Table().with_columns('er_r_gender', er_r_gender,
                            'er_r_party', er_r_party)

er_r

In [None]:
_ = hwk_homophily.grade('q_250_er_nets')

**Question** Make a histogram that shows the distribution of the assortativity coefficients for gender in the ER networks. Plot the observed gender assortativity coefficient from the MOC Twitter network on the x axis with a red dot (as we did in Lab 07).

In [None]:
...

plt.scatter(..., ..., ..., ...)

# NOTE: you'll need to run this code below to
# make sure the axis labels are correct 
# (there seems to be a bug in the defaults, making this necessary)
ax = plt.gca()          # get current axes
ax.set_xscale('linear') # relabel with linear scale

**Question** In order to estimate the expected value of the assortativity coefficient for gender under the null model, calculate the mean of the gender assortativity coefficients across the ER networks.

In [None]:
er_gender_expected_assortativity = ...

er_gender_expected_assortativity

In [None]:
_ = hwk_homophily.grade('q_expected_r_gender')

**Question** Summarize the finding above by calculating a P value. Your null hypothesis should be that the observed gender assortativity coefficient comes from the null model; the alternative is that it does not.

In [None]:
emp_p_value_gender = np.mean(np.abs(... - ...) >= 
                             np.abs(... - ...))
emp_p_value_gender

In [None]:
_ = hwk_homophily.grade('q_emp_p_gender')

**Question** Make a histogram that shows the distribution of the assortativity coefficients for party in the ER networks. Plot the observed party assortativity coefficient from the MOC Twitter network on the x axis with a red dot (as we did in Lab 07).

In [None]:
...

plt.scatter(..., ..., ..., ...)

# NOTE: you'll need to run this code below to
# make sure the axis labels are correct 
# (there seems to be a bug in the defaults, making this necessary)
ax = plt.gca()          # get current axes
ax.set_xscale('linear') # relabel with linear scale

In [None]:
## SOLUTION

er_r.hist('er_r_party')

plt.scatter(observed_r_party, 0, color='red', s=30);

# NOTE: you'll need to run this code below to
# make sure the axis labels are correct 
# (there seems to be a bug in the defaults, making this necessary)
ax = plt.gca()          # get current axes
ax.set_xscale('linear') # relabel with linear scale

**Question** In order to estimate the expected value of the assortativity coefficient for party under the null model, calculate the mean of the party assortativity coefficients across the ER networks.

In [None]:
er_party_expected_assortativity = ...

er_party_expected_assortativity

In [None]:
_ = hwk_homophily.grade('q_expected_r_party')

**Question** Summarize the finding above by calculating a P value. Your null hypothesis should be that the observed party assortativity coefficient comes from the null model; the alternative is that it does not.

In [None]:
emp_p_value_party = np.mean(np.abs(... - ...) >= 
                             np.abs(... - ...))
emp_p_value_party

In [None]:
_ = hwk_homophily.grade('q_emp_p_party')

## Tests

In [None]:
# this cell runs all the tests at once!
print("Running all tests...")
_ = [hwk_homophily.grade(q[:-3]) for q in os.listdir("tests") if q.startswith('q')]
print("Finished running all tests.")

# Please hand this homework in using two different methods

Both submissions must be completed by **midnight on Wednesday, November 8th.**<BR>
**Late homework will not be accepted**, so please be sure to hand in as much as you have finished by the deadline. Good luck!

**FIRST, please run the following cell to submit using `okpy`**

In [None]:
_ = hwk_homophily.submit()

**SECOND** Please hand this homework in as a `.pdf` file on bCourses.