In [24]:
import urllib.request
import pylab
import igraph
import matplotlib
import pandas as pd
import numpy
import graphistry
graphistry.register(key='a16918d5aaa30201ed0bbba1fc70a7e561b7740ca4713119165dd77011f68f29855fdced90924fa70aa4966625dfafad')
%matplotlib inline

In [19]:
%qtconsole --colors=Linux

In [2]:
raw = urllib.request.urlretrieve("http://graphics.wsj.com/hillary-clinton-email-documents/api/search.php?subject=&text=&to=&from=&start=&end=&sort=docDate&order=desc&docid=&limit=27159&offset=0",
                   "emails.json")

In [3]:
import json

# Use a context manager so the file is closed even if parsing fails.
with open("emails.json", "r") as fp:
    emails = json.load(fp)
data = emails['rows']
print("number of rows: ", len(data))

number of rows:  27159


In [4]:
def acceptable_field(f):
    return f.strip() != "" and f.find(";") == -1
data = [x for x in data if acceptable_field(x['from']) and acceptable_field(x['to'])]
print("number of rows (after cleaning): ", len(data))

number of rows (after cleaning):  26076
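
To make the cleaning rule concrete, here is a small sketch of which rows survive the filter. The sample rows are hypothetical and only illustrate the three cases the rule rejects (a multi-recipient field containing ";", a whitespace-only field, and an empty field); they are not taken from the real dataset.

```python
def acceptable_field(f):
    # Same rule as the cell above: non-empty and no ';' separator.
    return f.strip() != "" and f.find(";") == -1

# Hypothetical rows for illustration only.
sample_rows = [
    {"from": "a", "to": "hub"},        # kept
    {"from": "a", "to": "hub; b"},     # dropped: more than one recipient
    {"from": "   ", "to": "hub"},      # dropped: blank sender
    {"from": "b", "to": ""},           # dropped: empty recipient
]

kept = [r for r in sample_rows
        if acceptable_field(r["from"]) and acceptable_field(r["to"])]
print(kept)
```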


In [18]:
names = []
index = {}  # name -> vertex id, so lookups avoid repeated names.index() scans
edgelist = []
for row in data:
    for name in (row['from'], row['to']):
        if name not in index:
            index[name] = len(names)
            names.append(name)
    edgelist.append((index[row['from']], index[row['to']]))

G = igraph.Graph(n=len(names), edges=edgelist, vertex_attrs={ 'name' : names }, directed=True)
graphistry.bind(source='from', destination='to').plot(G)



Use this link if graph does not display: <a href="https://labs.graphistry.com/graph/graph.html?dataset=PyGraphistry/Z9D82S6BOX&amp;type=vgraph&

(If it still isn't filtered, use the filter button (third from the bottom) on the side panel and write "point:degree >= 70")

I had a lot of issues when trying to view igraph plots, which prompted me to search for other Python modules for viewing node-and-edge graphs.  I stumbled upon this GitHub repository (https://github.com/graphistry/pygraphistry) for a module called Graphistry.  The team that designed it is from the University of California, Berkeley.  The repository had lots of demonstrations of how to use the module's features effectively, and I thought it would be a great way to implement the graphical part of this analysis.

Graphistry is a very cool tool, because it allows the user to work interactively with the plot without changing the code each time.  Here are some of the things I noticed about Clinton's email network:

With the histogram tool in the control panel, I filtered the nodes by degree.  When the filter was set to show only points with degree 24.5k and up, the only point left was Hillary Clinton.  She was the only point on the filtered graph, yet it still had four edges, so I found that Hillary is just like me and must occasionally send herself emails as reminders.  I can also see that she could have responded directly to fewer than half of the emails she received, as her degree_in is 18,040 and her degree_out is 7,361.
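
The same three numbers (in-degree, out-degree, and self-addressed emails) can also be counted directly from a list of sender and recipient pairs. This is only a sketch: the pairs and the names "hub", "a", "b", and "c" are hypothetical stand-ins, not the real data.

```python
from collections import Counter

# Hypothetical (sender, recipient) pairs for illustration only.
edges = [
    ("hub", "a"),
    ("a", "hub"),
    ("b", "hub"),
    ("hub", "hub"),   # an email someone sent to themselves
    ("c", "hub"),
]

out_degree = Counter(sender for sender, _ in edges)
in_degree = Counter(recipient for _, recipient in edges)
self_loops = Counter(s for s, r in edges if s == r)

person = "hub"
print(person, "in:", in_degree[person],
      "out:", out_degree[person],
      "self:", self_loops[person])
```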

I filtered the graph to only include those with degree 70 and higher.  I chose 70 based on the calculation below, which found that the average degree of Clinton's email contacts is about 69.2, which I rounded up to 70.  This left 30 nodes in the graph, implying that Clinton had a small but very close circle of confidants among her colleagues.

The biggest drawback of Graphistry is that it appears to have been designed to work primarily with comma-separated values (CSV) files and Python's pandas module, and this data is in JSON format.  To work around this in my following steps, I converted the file to CSV using this website: http://konklone.io/json/
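
The conversion could probably also be done locally instead of through a website. The sketch below is not what I actually ran: it assumes each row is a dict with 'from' and 'to' keys (as in the cells above), uses hypothetical sample rows, and writes to an in-memory buffer rather than to a file. The helper name rows_to_csv is made up for this example.

```python
import csv
import io

# Hypothetical rows standing in for emails['rows'].
rows = [
    {"from": "a", "to": "hub", "subject": "x"},
    {"from": "hub", "to": "b", "subject": "y"},
]

def rows_to_csv(rows, fields=("from", "to")):
    """Return CSV text with one line per row, keeping only `fields`."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(fields),
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

csv_text = rows_to_csv(rows)
print(csv_text)
```

To produce a file on disk, the returned text could be written out with an ordinary open(..., "w") call.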

In [15]:
print("number of edges (before simplification): ", len(G.es))
G.es['weight'] = [1]*len(G.es)
G.simplify(combine_edges=sum)
print("number of edges (after simplification): ", len(G.es))
graphistry.bind(source='from', destination='to')

number of edges (before simplification):  622
number of edges (after simplification):  622


{'bindings': {'destination': 'to',
  'edge_color': None,
  'edge_label': None,
  'edge_title': None,
  'edge_weight': None,
  'edges': None,
  'node': None,
  'nodes': None,
  'point_color': None,
  'point_label': None,
  'point_size': None,
  'point_title': None,
  'source': 'from'},
 'settings': {'height': 500, 'url_params': {'info': 'true'}}}
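
For intuition, this is roughly what simplifying with summed weights does: parallel edges between the same ordered pair collapse into one edge whose weight is the number of originals. The sketch below uses plain Python on a hypothetical edge list rather than igraph itself, and it drops self-loops on the assumption that simplify() removes them by default.

```python
from collections import Counter

# Hypothetical directed edges as (source, target) vertex ids.
edges = [(0, 1), (0, 1), (1, 0), (2, 1), (0, 1), (1, 1)]

# Count duplicates per ordered pair; (0, 1) and (1, 0) stay distinct
# because the graph is directed. The self-loop (1, 1) is skipped.
weights = Counter((s, t) for s, t in edges if s != t)

print("edges before:", len(edges))
print("edges after: ", len(weights))
for (s, t), w in sorted(weights.items()):
    print(s, "->", t, "weight", w)
```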

I wanted to filter the graph based on the average degree of the email contacts in the graph, as I showed above.  This is the calculation I did, which is modified from the Reddit comments lab, to find the average degree across the email contacts.

In [17]:
from math import sqrt
# Geometric mean of each vertex's in-degree and out-degree.
clintonDegree = [sqrt(i * o) for i, o in zip(G.indegree(), G.outdegree())]
averageDegree = sum(clintonDegree) / float(len(clintonDegree))
print("The average degree of connectedness between the email contacts is: ", averageDegree)

The average degree of connectedness between the email contacts is:  69.2276933059538


Thus, on average, the people in Clinton's email network sent and received about 70 emails each.  More precisely, 69.2 is the mean, over all contacts, of the geometric mean of each contact's in-degree and out-degree.  This number appears to be skewed fairly highly by her top email contacts, considering only 30 people are above the average degree.  This makes sense, as any important person will have many close colleagues and friends, as well as distant acquaintances.
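
Here is a small worked version of that calculation without igraph, to show how a single busy node pulls the average up. The edge list and names are hypothetical; only the formula (the square root of in-degree times out-degree, averaged over all nodes) comes from the cell above.

```python
from collections import Counter
from math import sqrt

# Hypothetical (sender, recipient) pairs for illustration only.
edges = [("a", "hub"), ("hub", "a"), ("b", "hub"), ("hub", "b"), ("c", "hub")]

in_degree = Counter(r for _, r in edges)
out_degree = Counter(s for s, _ in edges)
nodes = sorted(set(in_degree) | set(out_degree))

# Geometric mean of in-degree and out-degree per node.
score = {n: sqrt(in_degree[n] * out_degree[n]) for n in nodes}
average = sum(score.values()) / len(score)

# Nodes that would survive a "degree above average" filter.
above = [n for n in nodes if score[n] > average]
print("average:", average)
print("above average:", above)
```

Note that a node that only sends or only receives gets a score of zero under this formula, which is one reason so few nodes end up above the average.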

In [26]:
csv_raw = pd.read_csv("result.csv")

In the following cell, I compute the PageRank scores and communities for the CSV file of Clinton's emails.  Graphistry has a neat feature which allows the user to size each node based on its PageRank score, and color each node based on its community.  A community structure is similar to a clustering of data, where each community is a collection of nodes that are more closely connected to one another than to the rest of the graph.  (https://en.wikipedia.org/wiki/Community_structure)
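
To give a feel for what a PageRank score measures, here is a minimal power-iteration sketch. It is not igraph's implementation, and the damping factor of 0.85, the iteration count, and the tiny edge list are all assumptions chosen for illustration. A node scores highly when many nodes, or other highly ranked nodes, point to it.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Plain power-iteration PageRank over (source, target) pairs."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for s, t in edges:
        out_links[s].append(t)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is spread evenly.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new_rank = {n: base for n in nodes}
        # Each node passes its rank in equal shares along its out-edges.
        for s in nodes:
            for t in out_links[s]:
                new_rank[t] += damping * rank[s] / len(out_links[s])
        rank = new_rank
    return rank

# Hypothetical star-shaped network: everyone writes to "hub".
edges = [("a", "hub"), ("b", "hub"), ("c", "hub"), ("hub", "a")]
scores = pagerank(edges)
for name in sorted(scores, key=scores.get, reverse=True):
    print(name, round(scores[name], 3))
```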

In [27]:
plotter = graphistry.bind(source = "from", destination = "to")
ig = plotter.pandas2igraph(csv_raw)
ig.vs['pagerank'] = ig.pagerank()
ig.vs['community'] = ig.community_infomap().membership

plotter.bind(point_color='community', point_size='pagerank').plot(ig)



Use this link if the graph does not display inline: <a href="https://labs.graphistry.com/graph/graph.html?dataset=PyGraphistry/9N8Q7W6A6R&amp;type=vgraph&

I left the graph unfiltered at first to get a view of the entire email network population.  The colors of the communities really show here, and, as one would expect, the largest one is Hillary's close circle of colleagues, which is colored in light blue.  The only node that is clearly sized larger than the others is Hillary Clinton, in the center of the graph.  So, the PageRank algorithm only found a big difference between her node and every other node.  This feature of the module could be put to better use in a different situation, where the nodes are not all directly connected to one person in particular.  Interestingly, there is another community in dark blue which also includes Hillary.  This could reflect a difference between colleague networks and family and friends networks.  Besides these communities, there are a few smaller communities, probably reflecting separate projects that she is taking part in that do not directly pertain to her work as Secretary of State.

In [28]:
plotter.bind(point_color='community', point_size='pagerank').plot(ig)

Use this link if the graph does not display inline:
    <a href="https://labs.graphistry.com/graph/graph.html?dataset=PyGraphistry/BZDOV7SRPR&amp;type=vgraph&

(If it still isn't filtered, use the filter button (third from the bottom) on the side panel and write "point:degree >= 70")

Then, I shrank the graph to only include those with degree 70 and above to take a closer look at the two main communities that Clinton is a part of.  Dennis Ross and William Burns are the two main members of the dark blue community, besides Clinton herself.  Dennis Ross is an American diplomat who advised Hillary while she was Secretary of State about the Persian Gulf and Southwest Asian regions (https://en.wikipedia.org/wiki/Dennis_Ross).  William Burns was the Deputy Secretary of State between 2011 and 2014 (https://en.wikipedia.org/wiki/William_Joseph_Burns).  So, my theory that the dark blue community reflected family and friends rather than colleagues may have been incorrect, since both of its main members are colleagues.  That is not too surprising, considering it was a work email.