### Assignment 3: Graph Visualization

#### Summer 2021
**Authors:** GOAT Team (Estaban Aramayo, Ethan Haley, Claire Meyer, and Tyler Frankenburg) 

This assignment examines a CSV of donor and recipient data from OpenSecrets, which tracks political donations.

This particular dataset records total donations made during the 2020 election cycle by individuals, companies, and PACs/Super PACs to the 148 members of Congress who objected to certification of the 2020 Electoral College results in January 2021.

This data is available [here](https://docs.google.com/spreadsheets/d/1PPjz-U1LueQYHaVCU8iCYf3O4lc-OYN7uOf3OknhYxo/edit#gid=1325242852). 


In [1]:
import networkx as nx
import pandas
import matplotlib.pyplot as plt
from pyvis.network import Network

First, we import the CSV and run a couple of quick checks on the shape and form of the data.

In [2]:
df = pandas.read_csv('donor_members.csv')
df.head()

| | PAC | CID | CRPName | Distid | Total | Unnamed: 5 | Unnamed: 6 |
|---|---|---|---|---|---|---|---|
| 0 | American Medical Assn | N00025219 | Burgess, Michael | TX26 | $20,000 | NaN | NaN |
| 1 | American Medical Assn | N00028152 | McCarthy, Kevin | CA23 | $20,000 | NaN | Direct contributions data covers the 2020 elec... |
| 2 | American Dental Assn | N00005736 | Babin, Brian | TX36 | $20,000 | NaN | NaN |
| 3 | American Dental Assn | N00025219 | Burgess, Michael | TX26 | $20,000 | NaN | NaN |
| 4 | American Dental Assn | N00035346 | Carter, Buddy | GA01 | $17,500 | NaN | NaN |


In [3]:
df.shape

(2686, 7)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2686 entries, 0 to 2685
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   PAC         2686 non-null   object 
 1   CID         2686 non-null   object 
 2   CRPName     2686 non-null   object 
 3   Distid      2686 non-null   object 
 4   Total       2686 non-null   object 
 5   Unnamed: 5  0 non-null      float64
 6   Unnamed: 6  1 non-null      object 
dtypes: float64(1), object(6)
memory usage: 147.0+ KB


Next, we convert the `Total` (donation) column from strings such as `$20,000` to integers.

In [5]:
df.Total = [int(''.join(c for c in donation if str.isnumeric(c))) for donation in df.Total]

In [6]:
print(f'Donations range from {min(df.Total)} to {max(df.Total)} dollars.')

Donations range from 10000 to 30000 dollars.
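
The same conversion can also be written with pandas string methods. The sketch below is only an illustration: `sample` is a small made-up frame, not the real data, and it assumes every value consists of digits plus `$` and `,` formatting.

```python
import pandas as pd

# Hypothetical sample mimicking the format of the `Total` column
sample = pd.DataFrame({"Total": ["$20,000", "$17,500", "$10,000"]})

# Strip every non-digit character, then cast the remaining digits to int
sample["Total"] = (
    sample["Total"].str.replace(r"[^0-9]", "", regex=True).astype(int)
)

print(sample["Total"].tolist())
```

A vectorized version like this fails loudly on an empty or missing cell, which can be useful as a data check.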


Then we use the `from_pandas_edgelist` function (named `from_pandas_dataframe` before NetworkX 2.0) to create a NetworkX graph from the dataframe. [Source](https://networkx.org/documentation/stable/reference/generated/networkx.convert_matrix.from_pandas_edgelist.html).

In [7]:
# from_pandas_dataframe was renamed from_pandas_edgelist in NetworkX 2.0
test_graph = nx.from_pandas_edgelist(df, source="PAC", target="CRPName", edge_attr="Total")

In [8]:
print(nx.info(test_graph))

Name: 
Type: Graph
Number of nodes: 712
Number of edges: 2675
Average degree:   7.5140
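
The dataframe has 2686 rows but the graph has only 2675 edges. One possible explanation is that an undirected `nx.Graph` keeps a single edge per node pair, so repeated (PAC, member) rows collapse into one. The sketch below illustrates that behavior on a made-up edge list (all names and amounts are hypothetical), using `from_pandas_edgelist`, the name this loader carries in NetworkX 2.0 and later.

```python
import pandas as pd
import networkx as nx

# Hypothetical edge list; the last row repeats the (PAC A, Member X) pair
toy = pd.DataFrame({
    "PAC": ["PAC A", "PAC A", "PAC B", "PAC A"],
    "CRPName": ["Member X", "Member Y", "Member X", "Member X"],
    "Total": [10000, 15000, 20000, 5000],
})

toy_graph = nx.from_pandas_edgelist(
    toy, source="PAC", target="CRPName", edge_attr="Total"
)

# Four rows, but only three distinct pairs survive as edges
print(toy_graph.number_of_nodes(), toy_graph.number_of_edges())

# The repeated pair keeps the attribute from its last row
print(toy_graph["PAC A"]["Member X"]["Total"])
```

If repeated donations should be summed rather than overwritten, grouping the dataframe by both columns before building the graph would be one way to do it.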


For this assignment we want to explore diameter, which is only defined for a connected graph. First, let's check whether this graph is connected, using the `is_connected` function. [Source](https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.components.is_connected.html#networkx.algorithms.components.is_connected).

In [9]:
print(nx.is_connected(test_graph))

False


This graph is not connected. We can, however, look for connected subgraphs and focus our measurements there. The `connected_components` function yields the node set of each connected component, and `subgraph` turns each set into a graph. [Source](https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.components.connected_components.html).

In [10]:
# connected_component_subgraphs was removed in NetworkX 2.4; build the subgraphs from node sets, largest first
node_sets = sorted(nx.connected_components(test_graph), key=len, reverse=True)
graphs = [test_graph.subgraph(c).copy() for c in node_sets]
print("There are", len(graphs), "connected subgraphs in this graph.")

There are 2 connected subgraphs in this graph.


Let's compare the size of these subgraphs by the number of nodes.

In [11]:
print("The first subgraph has", len(graphs[0].nodes()), "nodes.")
print("The second subgraph has", len(graphs[1].nodes()), "nodes.")

The first subgraph has 710 nodes.
The second subgraph has 2 nodes.
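
As a small illustration of the component logic, the sketch below builds a made-up graph with two disconnected pieces and picks the largest one by size instead of by list position. The node labels are hypothetical.

```python
import networkx as nx

# Hypothetical graph: a three-node chain plus a separate two-node pair
toy = nx.Graph()
toy.add_edges_from([("a", "b"), ("b", "c"), ("x", "y")])

# Sort the components' node sets from largest to smallest
components = sorted(nx.connected_components(toy), key=len, reverse=True)
sizes = [len(c) for c in components]

# Copy the largest component into its own graph
largest = toy.subgraph(components[0]).copy()

print(nx.is_connected(toy))      # the whole graph is not connected
print(sizes)                     # sizes of the components, largest first
print(nx.is_connected(largest))  # the extracted component is connected
```

Sorting by size makes the choice of the largest component independent of the order in which components happen to be returned.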


Let's select the larger of the two, and explore further.



In [12]:
subgraph_test = graphs[0]

# color politicians violet and donors light green
# (node-color loop adapted from https://stackoverflow.com/a/59473049)
politicians = set(df["CRPName"])
colors = ["violet" if node in politicians else "lightgreen" for node in subgraph_test]

In [13]:
sgt_net = Network(height='750px', width='100%', bgcolor='#222222', font_color='white', heading="Donors to Electoral College Objectors, 2020 Cycle", notebook=False)

# set the physics layout of the network
sgt_net.barnes_hut()

sources = df['PAC']
targets = df['CRPName']

edge_data = zip(sources, targets)

for src, dst in edge_data:
    # note: PACs are violet and politicians light green here,
    # the reverse of the matplotlib plots in this notebook
    sgt_net.add_node(src, src, title=src, color='violet')
    sgt_net.add_node(dst, dst, title=dst, color='lightgreen')
    sgt_net.add_edge(src, dst)

neighbor_map = sgt_net.get_adj_list()

# add neighbor data to node hover data
for node in sgt_net.nodes:
    node['title'] += ' Neighbors:<br>' + '<br>'.join(neighbor_map[node['id']])
    node['value'] = len(neighbor_map[node['id']])

# sgt_net.show_buttons()  # to use, comment out the `set_options` call below

# the options JSON string can be pasted into set_options() from the dynamic editor after running the html with show_buttons() above
sgt_net.set_options("""
var options = {
  "nodes": {
    "borderWidth": 0
  },
  "edges": {
    "color": {
      "color": "rgba(25,148,150,1)",
      "highlight": "rgba(210,254,255,1)",
      "hover": "rgba(29,225,229,1)",
      "inherit": false
    },
    "smooth": false
  },
  "physics": {
    "barnesHut": {
      "gravitationalConstant": -80000,
      "springLength": 150,
      "springConstant": 0.001
    },
    "maxVelocity": 20,
    "minVelocity": 0.75
  }
}""")

#sgt_net.show('political_donations.html') #writes the local html file, and launches in browser


In [14]:
from IPython.display import IFrame

IFrame('https://curdferguson.github.io/gory-graph/', width=1000, height=1000)  # demo hosted from the GitHub repo curdferguson/gory-graph, branch TFpyvis

In [None]:
plt.figure(figsize = (30, 30))
ax = plt.subplot()

nx.draw_networkx(subgraph_test, ax=ax, node_color=colors)

plt.figtext(.5,.9,'Network Analysis - Donors to 2020 Electoral College Objectors', fontsize=50, ha='center')
plt.show()

We can use the built-in `diameter` function to determine the diameter of this subgraph.

In [None]:
diameter_test = nx.diameter(subgraph_test)

In [None]:
print("The diameter is: ", diameter_test)
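
For reference, the diameter is the largest eccentricity in the graph, that is, the length of the longest shortest path between any two nodes. The sketch below shows this on a five-node path graph and confirms that `nx.diameter` raises an error on a disconnected graph. Both graphs are toy examples, not the donation data.

```python
import networkx as nx

# Path graph 0 - 1 - 2 - 3 - 4: the two end nodes are four steps apart
path = nx.path_graph(5)
ecc = nx.eccentricity(path)        # farthest distance from each node
path_diameter = nx.diameter(path)  # the maximum eccentricity

print(ecc)
print(path_diameter)

# Two separate edges: no path exists between the pieces
disconnected = nx.Graph([(0, 1), (2, 3)])
try:
    nx.diameter(disconnected)
    raised = False
except nx.NetworkXError:
    raised = True

print(raised)
```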

We can also look at the top nodes by several centrality measures: degree, closeness, and betweenness. We start by borrowing the `sorted_map` function from [the textbook's repo](https://www.oreilly.com/library/view/social-network-analysis/9781449311377/), then apply NetworkX's built-in centrality functions.

In [None]:
def sorted_map(dd: dict) -> list:
    """
    Sorts dict items by value (descending), breaking ties by key (ascending)
    
    :param dd: dictionary with numeric values
    :return: list of (key, value) tuples ordered by value
    """
    return sorted(dd.items(), key=lambda x: (-x[1], x[0]))
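
Before ranking the real network, it may help to see what the three measures return on a tiny graph. The sketch below uses a star graph (one hub, four leaves) as a stand-in, and a local `rank` helper that is assumed to mirror the sorting above.

```python
import networkx as nx

# Star graph: node 0 is the hub, nodes 1-4 are leaves
star = nx.star_graph(4)

degree = nx.degree_centrality(star)            # degree / (n - 1)
closeness = nx.closeness_centrality(star)      # (n - 1) / sum of distances
betweenness = nx.betweenness_centrality(star)  # share of shortest paths through a node


def rank(scores):
    """Sort (node, score) pairs by score descending, then by node."""
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


for name, scores in [("degree", degree),
                     ("closeness", closeness),
                     ("betweenness", betweenness)]:
    print(name, rank(scores)[:2])
```

The hub tops all three rankings here. On the donation graph the measures can disagree, which is why the notebook compares them.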

In [None]:
d = nx.degree_centrality(subgraph_test)
ds = sorted_map(d)
ds[:10]

In [None]:
c = nx.closeness_centrality(subgraph_test)
cs = sorted_map(c)
cs[:10]

In [None]:
b = nx.betweenness_centrality(subgraph_test)
bs = sorted_map(b)
bs[:10]

Several of the same names appear near the top of all three rankings.
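
One way to make that overlap explicit is a set intersection over the top entries of each ranking. The sketch below uses made-up rankings (`ds_toy`, `cs_toy`, `bs_toy` and the node names are hypothetical) in the same list-of-tuples shape that `sorted_map` returns.

```python
def top_names(ranking, k):
    """Return the names of the first k (name, score) pairs."""
    return {name for name, _ in ranking[:k]}


# Hypothetical rankings, each already sorted by score descending
ds_toy = [("Node A", 0.9), ("Node B", 0.5), ("Node C", 0.1)]
cs_toy = [("Node A", 0.8), ("Node C", 0.6), ("Node B", 0.2)]
bs_toy = [("Node B", 0.7), ("Node A", 0.4), ("Node C", 0.0)]

# Names that appear in the top 2 of all three rankings
shared = top_names(ds_toy, 2) & top_names(cs_toy, 2) & top_names(bs_toy, 2)
print(shared)
```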

#### Let's look at the more central nodes in the graph

In [None]:
# while we're at it, let's make it bipartite 
from networkx.algorithms import bipartite

# separates pols from PACs
pols = set(df.CRPName)
pacs = set(df.PAC)

bip = nx.Graph()
bip.add_nodes_from(pols, bipartite=0)
bip.add_nodes_from(pacs, bipartite=1)

bip.add_weighted_edges_from(zip(df.CRPName, df.PAC, df.Total), weight='donation')
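
A bipartite construction like this assumes that no name appears on both sides, since a shared name would be merged into a single node. The sketch below, on a made-up graph with hypothetical names and amounts, shows how the structure and the `donation` weights might be checked.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Hypothetical two-sided graph: members on side 0, PACs on side 1
toy = nx.Graph()
toy.add_nodes_from(["Member X", "Member Y"], bipartite=0)
toy.add_nodes_from(["PAC A", "PAC B"], bipartite=1)
toy.add_weighted_edges_from(
    [("Member X", "PAC A", 10000),
     ("Member X", "PAC B", 20000),
     ("Member Y", "PAC B", 15000)],
    weight="donation",
)

# Confirm that every edge crosses between the two sides
is_bip = bipartite.is_bipartite(toy)

# Recover one side from the node attribute
members = {n for n, side in toy.nodes(data="bipartite") if side == 0}

# Weighted degree sums the donation amounts on a node's edges
total_x = toy.degree("Member X", weight="donation")

print(is_bip, members, total_x)
```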

#### If we only consider nodes with more connections, what's a good cutoff to remove the others?

In [None]:
plt.hist([len(bip[n]) for n in bip if len(bip[n]) < 30], bins=30)
plt.xlabel('Donations per pol/PAC')
plt.ylabel('Number of pols/PACs');

It looks like removing nodes with 3 or fewer connections would remove over half of them.
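
A cutoff like this can also be checked numerically rather than read off the histogram. The sketch below defines a hypothetical helper, `share_at_or_below`, and applies it to a toy star graph. It is an illustration, not a result for the donation data.

```python
import networkx as nx


def share_at_or_below(graph, cutoff):
    """Fraction of nodes whose degree is at most `cutoff`."""
    degrees = [deg for _, deg in graph.degree()]
    return sum(deg <= cutoff for deg in degrees) / len(degrees)


# Toy star graph: one hub of degree 5 and five leaves of degree 1
toy = nx.star_graph(5)
share = share_at_or_below(toy, 3)
print(round(share, 3))
```

Calling the helper on `bip` with a cutoff of 3 would give the exact share of nodes that the filter removes.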

In [None]:
# Resulting graph still too crowded.  How about keeping only nodes with more than 10 connections:
bigs = [n for n in bip if len(bip[n]) > 10]
bigs = bip.subgraph(bigs)        
len(bigs.nodes())

In [None]:
colors = ['lightgreen' if n in pacs else 'violet' for n in bigs]
len(colors)

In [None]:
plt.figure(figsize = (30, 30))
ax = plt.subplot()

nx.draw_networkx(bigs, ax=ax, node_color=colors)

plt.figtext(.5,.9,'Network Analysis - Key Donors to 2020 Electoral College Objectors', fontsize=40, ha='center')
plt.show()

#### Look at only the pols and PACs who appear in the top 20 of at least one centrality measure

In [None]:
# names in the top 20 of any of the three centrality rankings
top_names = {name for ranking in (ds, bs, cs) for name, _ in ranking[:20]}
centrals = [n for n in bip if n in top_names]
centrals = bip.subgraph(centrals)
len(centrals)

In [None]:
colors = ['lightgreen' if n in pacs else 'violet' for n in centrals]
len(colors)

In [None]:
plt.figure(figsize = (30, 30))
ax = plt.subplot()

nx.draw_networkx(centrals, ax=ax, node_color=colors)

plt.figtext(.5,.9,'Network Analysis - Key Donors and 2020 Electoral College Objectors', fontsize=40, ha='center')
plt.show()

Project the bipartite graph onto the politician side, so that two politicians are linked when they share at least one donor.

In [None]:
lpart = bipartite.weighted_projected_graph(bip, pols)
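
In a weighted projection, each edge weight counts the neighbors two nodes share (here, common donors). It does not sum the dollar amounts. The sketch below shows this on a made-up bipartite graph with hypothetical names.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Hypothetical donations: X and Y share two PACs, Z shares only PAC B
toy = nx.Graph()
members = ["Member X", "Member Y", "Member Z"]
toy.add_nodes_from(members, bipartite=0)
toy.add_nodes_from(["PAC A", "PAC B"], bipartite=1)
toy.add_edges_from([
    ("Member X", "PAC A"), ("Member Y", "PAC A"),
    ("Member X", "PAC B"), ("Member Y", "PAC B"),
    ("Member Z", "PAC B"),
])

# Project onto the members: weight = number of shared PACs
proj = bipartite.weighted_projected_graph(toy, members)

for u, v, w in sorted(proj.edges(data="weight")):
    print(u, v, w)
```

If dollar-weighted ties were wanted instead, `bipartite.generic_weighted_projected_graph` with a custom weight function would be one option to explore.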

In [None]:
colors = ['lightgreen' if n in pacs else 'violet' for n in lpart]
plt.figure(figsize = (30, 30))
ax = plt.subplot()

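# draw_networkx does not take a seed; fix the layout with spring_layout instead
pos = nx.spring_layout(lpart, seed=620)
nx.draw_networkx(lpart, pos=pos, ax=ax, node_color=colors)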

plt.figtext(.5,.9,'Network Analysis - Bipartite Weighted Projection', fontsize=40, ha='center')
plt.show()

In [None]:
bipartite.weighted_projected_graph?

In [None]:
nx.layout?

In [None]:
#nx.write_graphml(lpart, "lpart_test.graphml")

----------(below is the original notebook)-------------

In [None]:
#nx.write_graphml(subgraph_test, "subgraph_test.graphml")

In [None]:
IFrame('https://ebhtra.github.io/gory-graph/network/', width=1000, height=1000)