In [None]:
from datascience import *

import matplotlib
matplotlib.use('Agg')
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
plt.style.use('fivethirtyeight')


import os
import networkx as nx
from networkx.algorithms import bipartite
import pandas as pd

import zipfile
import requests
import io

## Affiliation and bipartite graphs

This dataset comes from Adam Bonica's extensive [Database on Ideology, Money in Politics and Elections (DIME)](https://data.stanford.edu/dime) project (and also the related [DIME-plus](https://data.stanford.edu/dime-plus) project).

There is a lot of interesting stuff in this project, and much of it has the structure of an affiliation network. Today, we'll focus on bill co-sponsorship - that is, records of bills in the legislature, along with information about which legislator sponsored each one.

If you find these data interesting, here are some other articles that you might want to check out:

* [Fowler (2006), Connecting the Congress: A Study of Cosponsorship Networks](https://www.cambridge.org/core/journals/political-analysis/article/connecting-the-congress-a-study-of-cosponsorship-networks/B42907E13C3D1F12BBC7618C8E0EECED)
* [Fowler (2006b), Legislative cosponsorship networks in the US House and Senate](https://www.sciencedirect.com/science/article/pii/S0378873305000730)

First, let's get data on candidates.

We'll get the candidate data file from the [Stanford Library Archive](https://exhibits.stanford.edu/data/catalog/nc588sy1714).

In [None]:
#cand_url = "https://stacks.stanford.edu/file/druid:nc588sy1714/dime_recipients_1979_2014_v2.csv.zip"
cand_url = "https://stacks.stanford.edu/file/druid:nc588sy1714/dime_recipients_all_1979_2014_v2.csv.zip"

In [None]:
recipient_db = pd.read_csv(cand_url)

In [None]:
recipient_db.head()

In [None]:
recipient_db['fecyear']

In [None]:
recipient_db.columns

Now open up the bill cosponsorship data

In [None]:
bill_url = "https://stacks.stanford.edu/file/druid:gf077df0685/bills_db.csv.zip"

In [None]:
#dfa = pd.read_csv(os.path.join(data_dir, "bills_db.csv"))

r = requests.get(bill_url)
# stop early with a clear error if the download failed
r.raise_for_status()

# this opens up the zipfile at the given URL and reads
# the data contained in the 'bills_db.csv' file
with zipfile.ZipFile(io.BytesIO(r.content)) as z:
    # open the csv file in the dataset
    with z.open("bills_db.csv") as f:
        # read the dataset
        dfa = pd.read_csv(f)
       

dfa.shape
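
The in-memory zip pattern used above can be tried without any download. The sketch below builds a tiny archive in memory as a stand-in for the downloaded bytes; the file name `toy.csv` and its contents are made up for illustration.

```python
import io
import zipfile
import pandas as pd

# build a small zip archive in memory (a stand-in for the downloaded bytes);
# the file name and contents are made up for this example
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as z:
    z.writestr('toy.csv', 'bill.id,congno\nb1,113\nb2,113\n')
zip_bytes = buf.getvalue()

# same pattern as above: wrap the bytes, open the named member, read it
with zipfile.ZipFile(io.BytesIO(zip_bytes)) as z:
    print(z.namelist())  # list the files inside the archive
    with z.open('toy.csv') as f:
        toy_df = pd.read_csv(f)

print(toy_df.shape)
```

Calling `namelist()` first is a handy way to find out which file names an unfamiliar archive contains.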

In [None]:
dfa[100:200].head()

Grab bills from the 113th Congress, and drop bills with missing sponsor info

In [None]:
df_113congress = dfa[dfa['congno'] == 113].dropna(subset=['sponsors']).copy()
df_113congress

Judging by the bill dates, activity in the 113th Congress ran from January 2013 to December 2014 (the term formally ended on January 3, 2015).

In [None]:
df_113congress['date'].min()

In [None]:
df_113congress['date'].max()

In [None]:
df_113congress['cosponsors']

In [None]:
def split_ids(x):
    # missing values become an empty list; otherwise, split the
    # pipe-separated string of IDs into a list of IDs
    if pd.isna(x):
        return []
    return str(x).split('|')

In [None]:
df_113congress['spons_cospons_list'] = df_113congress['sponsors'].apply(split_ids) + df_113congress['cosponsors'].apply(split_ids)

In [None]:
df_113congress.head()

Let's build a dictionary with the attributes of each bill, keyed by bill ID (we can use this later).
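
As a small illustration of what `to_dict(orient='index')` produces, here is a sketch on a toy table. The bill IDs and column values are invented for the example and are not taken from the DIME data.

```python
import pandas as pd

# toy bill table; the IDs and values here are made up for illustration
toy_bills = pd.DataFrame({
    'bill.id': ['b1', 'b2'],
    'congno': [113, 113],
    'sponsors': ['cand1', 'cand2'],
})

# one outer key per bill ID, each mapping to a dict of that bill's columns
toy_bill_info = toy_bills.set_index('bill.id').to_dict(orient='index')
print(toy_bill_info['b1'])
```

This nested shape, with node ID on the outside and attribute names on the inside, is the one that `nx.set_node_attributes` accepts.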

In [None]:
bill_info_dict = df_113congress.set_index('bill.id').to_dict(orient='index')

In [None]:
bill_info_dict

`explode` is a `pandas` method that takes a DataFrame with a column whose entries are lists (like our column with the IDs of bill sponsors and cosponsors) and turns it into a longer DataFrame with one row for each entry in each list.
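
Here is a minimal sketch of `explode` on a toy table, to show the reshaping before we apply it to the real data. The bill and legislator IDs are made up.

```python
import pandas as pd

# toy data: each row holds a bill and a list of legislator IDs (made up)
toy = pd.DataFrame({
    'bill.id': ['b1', 'b2'],
    'spons_cospons_list': [['cand1', 'cand2'], ['cand3']],
})

# explode gives one row per list entry, repeating the other columns
toy_long = toy.explode('spons_cospons_list')
print(toy_long)
```

Note that the exploded rows keep the index of the row they came from, so index values repeat in the longer table.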

In [None]:
df_113congress = df_113congress.explode('spons_cospons_list')

In [None]:
df_113congress.head()

In [None]:
# pair each legislator ID with the bill it is attached to
bill_edges = list(zip(df_113congress['spons_cospons_list'], df_113congress['bill.id']))

In [None]:
bill_edges

In [None]:
len(bill_edges)

In [None]:
all_legislators = list(set(x[0] for x in bill_edges))
all_bills = list(set(x[1] for x in bill_edges))

In [None]:
recipient_db

In [None]:
cand_info = recipient_db[recipient_db['bonica.rid'].isin(all_legislators)]
cand_info = cand_info[cand_info['cycle'] <= 2012].copy()
cand_info.shape

Let's check that we have matches for all of the legislators (we do): the number of unique candidate IDs should equal the number of legislators.

In [None]:
len(cand_info['bonica.rid'].unique())

In [None]:
len(all_legislators)

In [None]:
cand_info.sort_values('bonica.rid')

In [None]:
cand_info[cand_info['bonica.rid'] == 'cand994'].name

It looks like the same candidate can show up more than once in the 2012 cycle if there are special circumstances. For example, candidate 994 appears to be [Dean Heller](https://en.wikipedia.org/wiki/2012_United_States_Senate_election_in_Nevada), who was appointed to a Senate seat in 2011 and then ran for the seat in 2012.

For our purposes, we'll just pick one of the rows when there is more than one. Specifically, we'll first sort the dataset so that the `cycle` values go from 2012 down; then, we'll use `drop_duplicates` to keep only the first row for each value of the candidate ID. The upshot is that we will get a 2012 entry for each candidate, unless there isn't one, in which case we'll get the most recent entry before 2012.
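
The sort-then-deduplicate step can be sketched on a toy table first. The candidate IDs and cycles below are made up.

```python
import pandas as pd

# toy candidate records (made-up IDs and cycles)
toy_cands = pd.DataFrame({
    'bonica.rid': ['candA', 'candA', 'candB'],
    'cycle': [2010, 2012, 2008],
})

# sort newest-first, then keep the first row seen for each candidate ID
toy_latest = (toy_cands
              .sort_values('cycle', ascending=False)
              .drop_duplicates(subset='bonica.rid'))
print(toy_latest)
```

Here `candA` keeps its 2012 row and `candB` keeps its only row, from 2008. When a candidate has two rows in the same cycle, this approach does not control which of them is kept.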

In [None]:
cand_info = cand_info.sort_values('cycle', ascending=False).drop_duplicates(subset='bonica.rid')
cand_info

In [None]:
cand_info_dict = cand_info.set_index('bonica.rid').to_dict(orient='index')

In [None]:
cand_info_dict

Finally, let's take our (relatively) clean data and use it to make an affiliation network

In [None]:
bill_net = nx.Graph()
bill_net.add_nodes_from(all_legislators, bipartite=0)
bill_net.add_nodes_from(all_bills, bipartite=1)
bill_net.add_edges_from(bill_edges)

In [None]:
bill_net

In [None]:
nx.draw_networkx(bill_net)

Add candidate info and bill info as node attributes

In [None]:
nx.set_node_attributes(bill_net, values=cand_info_dict)

In [None]:
nx.set_node_attributes(bill_net, values=bill_info_dict)

In [None]:
bill_pos = dict()

longer_max = max(len(all_legislators), len(all_bills))
leg_scale = longer_max / len(all_legislators)
bill_scale = longer_max / len(all_bills)


for idx, node_id in enumerate(all_legislators):
    bill_pos[node_id] = (1, idx*leg_scale)
    
for idx, node_id in enumerate(all_bills):
    bill_pos[node_id] = (2, idx*bill_scale)

# this size is very tall, which will be helpful
# because there are quite a few bills
plt.figure(figsize=(10,100))
nx.draw(bill_net, pos=bill_pos, with_labels=True)

Let's double-check that this network is bipartite...

In [None]:
nx.is_bipartite(bill_net)

Project the affiliation network onto the set of legislators.

Q: In this new network, when is there an edge between two legislators?

In [None]:
legislator_network_weighted = bipartite.weighted_projected_graph(bill_net, all_legislators)
nx.set_node_attributes(legislator_network_weighted, values=cand_info_dict)

In [None]:
#bill_network_weighted = bipartite.weighted_projected_graph(bill_net, all_bills)

In [None]:
nx.draw_spring(legislator_network_weighted)

In [None]:
for edge in sorted(legislator_network_weighted.edges(data=True)):
    print(edge)

So, there's a weight on each edge. The weight is the number of bills that both legislators in the pair sponsored or cosponsored.
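
To make the meaning of the weights concrete, here is a small sketch of a weighted projection on a toy affiliation network. The legislator and bill names are made up.

```python
import networkx as nx
from networkx.algorithms import bipartite

# toy affiliation network: legislators L1-L3 and bills B1-B3 (all made up)
toy_net = nx.Graph()
toy_legislators = ['L1', 'L2', 'L3']
toy_net.add_nodes_from(toy_legislators, bipartite=0)
toy_net.add_nodes_from(['B1', 'B2', 'B3'], bipartite=1)
toy_net.add_edges_from([
    ('L1', 'B1'), ('L2', 'B1'),   # L1 and L2 share bill B1
    ('L1', 'B2'), ('L2', 'B2'),   # ... and bill B2
    ('L2', 'B3'), ('L3', 'B3'),   # L2 and L3 share bill B3
])

# project onto the legislators; weights count shared bills
toy_proj = bipartite.weighted_projected_graph(toy_net, toy_legislators)
for u, v, w in sorted(toy_proj.edges(data='weight')):
    print(u, v, w)
```

In this toy network, `L1` and `L2` get an edge of weight 2, `L2` and `L3` get an edge of weight 1, and `L1` and `L3` have no edge because they share no bills.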

Let's look at the distribution of edge weights - i.e., the distribution of number of bills co-sponsored by pairs of legislators

In [None]:
edge_weights = [e[2]['weight'] for e in legislator_network_weighted.edges(data=True)]
Table().with_column('edge_weight', edge_weights).hist()

We can see that legislators cosponsor lots of bills with each other.

To simplify the analysis, we'll pick a value as a cutoff for 'meaningful' collaboration. We'll keep edges whose weight is at or above some threshold and drop the rest. This function will help us do that...

In [None]:
def edge_threshold(net, threshold):
    
    # we'll make a copy of this network so that we don't
    # change the original
    new_net = net.copy()
    
    for e in list(new_net.edges()):
        if new_net[e[0]][e[1]]['weight'] < threshold:
            new_net.remove_edge(e[0], e[1])
    
    return new_net
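
To see what thresholding does on a small case, the sketch below applies the same rule (drop edges whose weight is below the threshold) to a toy weighted graph with made-up weights. It restates the rule inline instead of calling `edge_threshold`, so that it runs on its own.

```python
import networkx as nx

# toy weighted graph (made-up weights)
toy_w = nx.Graph()
toy_w.add_weighted_edges_from([('L1', 'L2', 10), ('L2', 'L3', 8), ('L1', 'L3', 3)])

threshold = 8
toy_thresh = toy_w.copy()  # work on a copy so the original is unchanged

# drop every edge whose weight is strictly below the threshold
weak = [(u, v) for u, v, w in toy_thresh.edges(data='weight') if w < threshold]
toy_thresh.remove_edges_from(weak)

print(sorted(toy_thresh.edges(data='weight')))
```

An edge whose weight equals the threshold is kept, and nodes stay in the graph even if they lose all of their edges.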

Let's use a cutoff of 8 cosponsored bills...

In [None]:
legislator_network_thresh = edge_threshold(legislator_network_weighted, threshold = 8)

In [None]:
nx.draw_networkx(legislator_network_thresh)

Kind of messy and hard to interpret.

Let's get the giant component of this network and try to visualize it
(bearing in mind the limitations of these visualizations)
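
As a quick sketch of how the giant component can be extracted, here is the idea on a toy graph with two components. The node names are made up.

```python
import networkx as nx

# toy graph with two components: a triangle and a separate pair
toy_g = nx.Graph()
toy_g.add_edges_from([('a', 'b'), ('b', 'c'), ('a', 'c'), ('x', 'y')])

# connected_components yields one set of nodes per component;
# max(..., key=len) picks the component with the most nodes
largest_nodes = max(nx.connected_components(toy_g), key=len)

# subgraph gives a view; copy() turns it into an independent graph
toy_gc = toy_g.subgraph(largest_nodes).copy()
print(sorted(toy_gc.nodes()))
```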

In [None]:
lnt_gc = legislator_network_thresh.subgraph(max(nx.connected_components(legislator_network_thresh), key=len)).copy()

In [None]:
nx.draw_networkx(lnt_gc, 
                 nx.spring_layout(lnt_gc))                          

We could spend more time trying to improve this visualization - but, ultimately, these graph visualizations are pretty limited.

An alternative is to try to examine quantitative summaries of different characteristics of the network. Of course, which summaries we decide to examine depends on the substantive question we are trying to investigate.

In this network, it seems reasonable to be interested in whether or not similar legislators collaborate with each other - i.e., whether or not there is homophily in this legislator collaboration network.

Let's remind ourselves of the different traits we have measured for each legislator by looking at a specific example:

In [None]:
lnt_gc.nodes(data=True)['cand1116']

We see lots of traits!

The *assortativity coefficient* is a popular way to quantify the extent to which nodes with the same trait are more likely to be connected together. (You'll see more about the assortativity coefficient in future assignments.) 

For now, it's useful to know that an assortativity coefficient close to 1 means lots of homophily, close to 0 means no homophily, and a negative value means inverse homophily (connected nodes tend to differ on the trait).
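
To build some intuition for the two extremes, here is a sketch with two toy graphs whose nodes carry a made-up `party` label: one where every edge connects members of the same party, and one where every edge crosses parties.

```python
import networkx as nx

# made-up party labels for four toy nodes
parties = {'a': 'D', 'b': 'D', 'c': 'R', 'd': 'R'}

# fully assortative toy graph: every edge joins two nodes of the same party
same = nx.Graph()
same.add_edges_from([('a', 'b'), ('c', 'd')])
nx.set_node_attributes(same, parties, name='party')

# fully disassortative toy graph: every edge joins nodes of different parties
cross = nx.Graph()
cross.add_edges_from([('a', 'c'), ('b', 'd')])
nx.set_node_attributes(cross, parties, name='party')

r_same = nx.attribute_assortativity_coefficient(same, 'party')
r_cross = nx.attribute_assortativity_coefficient(cross, 'party')
print(r_same, r_cross)
```

The first graph sits at the top of the scale (a coefficient of 1) and the second at the bottom (a coefficient of -1); real networks usually fall somewhere in between.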

With that in mind, let's calculate assortativity coefficients for a few different traits:

In [None]:
nx.attribute_assortativity_coefficient(lnt_gc, 'party')

In [None]:
nx.attribute_assortativity_coefficient(lnt_gc, 'state')

In [None]:
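# 'recipient.cfscore' is a continuous score, so we use the numeric version
# of the coefficient; the categorical version would treat every distinct
# score as its own group
nx.numeric_assortativity_coefficient(lnt_gc, 'recipient.cfscore')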

In [None]:
nx.attribute_assortativity_coefficient(lnt_gc, 'cand.gender')