# Analysis of Clinton's emails

Let us first import the pandas library to handle tables of data in a convenient way. Other Python modules are used and will be loaded when needed:
* re, collections, itertools, json
* networkx (pip install networkx)
* community (pip install python-louvain, for the community detection)

In [1]:
import pandas as pd

## Read the file

We first load the file. We only need the text of the emails, so only ```Emails.csv``` is necessary. You should get it from the Kaggle repository.

In [2]:
df = pd.read_csv('Emails.csv')

In [15]:
len(df)

7945

## Preprocessing: filter the text, search for proper nouns

As usual in data analysis, the preprocessing steps are crucial and take a large part of the analysis code and of the analysis time.

We want to find the important words in the texts and get rid of uninformative ones, such as articles. We could use a Natural Language Processing toolbox (there are several in Python), but I want to keep this example simple. So we will select the proper nouns in the texts, which are easy to find because they begin with a capital letter.
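As a minimal sketch of this selection rule, the snippet below tokenizes a made-up sentence (it is not taken from the dataset) and keeps the words for which `str.istitle()` is true. Note that this test accepts sentence-initial words such as "The" and rejects all-caps acronyms and mixed-case names.

```python
import re

# Made-up sentence, used only to illustrate the rule
sample = "The meeting with Alice is in Paris. The USA and McCain were mentioned."

# Split into alphanumeric tokens, dropping punctuation
tokens = re.findall(r'\w+', sample)

# Keep tokens made of one uppercase letter followed by lowercase letters
capitalized = [word for word in tokens if word.istitle()]
print(capitalized)
```

"USA" and "McCain" are missed by this simple test, which is a limitation of the capital-letter heuristic rather than of the regex.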

In [3]:
import re
capitalized_word_list =[]
filtered_text_list = []
for row in df.itertuples():
    # The texts are filtered with a regex, keeping only alphanumeric characters (and underscores)
    filtered_text = re.findall(r'\w+', str(row.ExtractedBodyText), re.UNICODE)
    filtered_text_list.append(filtered_text)
    capitalized_in_single_text = [word for word in filtered_text if word.istitle()]
    capitalized_word_list.extend(capitalized_in_single_text)
# For each email, we keep only the email Id, the text and the date
dataframe_f = df[['Id','ExtractedBodyText','MetadataDateSent']].copy()
# We add the filtered text (one list of tokens per email)
dataframe_f['filtered_text'] = filtered_text_list

We now have a list of capitalized words appearing in the emails, ```capitalized_word_list```, and a table ```dataframe_f``` containing the text and some info about the emails.

Unfortunately, not all words beginning with a capital letter are proper nouns: the first word of each sentence is capitalized as well. In the next step we get rid of the words that appear frequently both with and without a capital letter, which is a sign that they are not proper nouns.

### Filter the capitalized words and keep only the proper nouns

Let us first turn the list of capitalized words into a dataframe of words and their respective number of occurrences in the corpus.

In [4]:
from collections import Counter
Word_dic = Counter(capitalized_word_list)
Word_df = pd.DataFrame(list(Word_dic.items()),columns=['word','occur'])
sorted_words = Word_df.sort_values('occur',ascending=False).reset_index(drop=True)

We reduce the number of words by keeping only those that appear more than a threshold number of times.

In [5]:
print('Number of capitalized words',len(sorted_words))
threshold = 20
sorted_words = sorted_words[sorted_words.occur > threshold]
print('Number of capitalized words appearing more than {} times: {}'.format(threshold,len(sorted_words)))

Number of capitalized words 10074
Number of capitalized words appearing more than 20 times: 811


For each capitalized word, let us record the number of times it appears with no capital letter in the corpus.

In [6]:
%%time
lowc_word_dic = {}
for word in sorted_words.word:
    word_lc = word.lower()
    count = 0
    # for each text, search the word in lower case and count the nb of times it appears
    for row in dataframe_f.itertuples():
        wordlist = row.filtered_text
        if len(set(wordlist)&set([word_lc]))>0:
            word_indices = [i for i, x in enumerate(wordlist) if x == word_lc]
            count += len(word_indices)
    lowc_word_dic[word] = count
# Create a new dataframe with the words and their occurrence in lower case
Wordlc_df = pd.DataFrame(list(lowc_word_dic.items()),columns=['word','lc_occur'])
# Merge the data from capitalized / lower case into a single dataframe
df_1 = pd.merge(Word_df, Wordlc_df, on='word', how='outer')

CPU times: user 25.9 s, sys: 40 ms, total: 26 s
Wall time: 26 s
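
The nested loop above rescans every email for every word. A possible alternative, sketched here on toy data with hypothetical names (`toy_texts`, `capitalized_words`), counts all tokens once with a `Counter` and then looks up each lower-case form. It should give the same counts, but it has not been timed on the full corpus.

```python
from collections import Counter

# Toy stand-in for dataframe_f.filtered_text: one list of tokens per email
toy_texts = [
    ["The", "state", "of", "Libya"],
    ["State", "Department", "and", "the", "state"],
]
# Toy stand-in for sorted_words.word
capitalized_words = ["State", "Libya", "The"]

# Single pass over the corpus: count every token exactly as written
token_counts = Counter(token for tokens in toy_texts for token in tokens)

# A Counter returns 0 for missing keys, so absent lower-case forms count as 0
lowercase_counts = {word: token_counts[word.lower()] for word in capitalized_words}
print(lowercase_counts)
```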


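Keep only the proper nouns that appear more often capitalized than in lower case *and* appear in lower case fewer than 100 times. Words below the occurrence threshold have no lower-case count (`NaN` after the outer merge), so the first comparison drops them as well.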

In [8]:
print('Number of words with a capital: {}'.format(len(df_1)))
df_2 = df_1[df_1.occur>df_1.lc_occur].sort_values('occur',ascending=False)
print('Number of words appearing more in capital: {}'.format(len(df_2)))
df_3 = df_2[df_2.lc_occur<100].sort_values('occur',ascending=False)
print('Number of words appearing less than 100 times without capital: {}'.format(len(df_3)))

Number of words with a capital: 10074
Number of words appearing more in capital: 544
Number of words appearing less than 100 times without capital: 530
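
To make the two filters concrete, here is a small sketch on a toy table. The words and counts are invented for illustration and are not results from the corpus. The `NaN` row mimics a word that was below the occurrence threshold and therefore has no lower-case count after the outer merge.

```python
import numpy as np
import pandas as pd

# Invented counts, for illustration only
toy = pd.DataFrame({
    'word':     ['Obama', 'The', 'May', 'State', 'Rare'],
    'occur':    [300, 5000, 400, 900, 3],
    'lc_occur': [0, 20000, 350, 150, np.nan],
})

# Filter 1: more frequent capitalized than in lower case.
# Any comparison with NaN is False, so 'Rare' is dropped here.
step1 = toy[toy.occur > toy.lc_occur]

# Filter 2: fewer than 100 lower-case occurrences
step2 = step1[step1.lc_occur < 100]
print(step2)
```

In this toy table, "May" and "State" pass the first filter but are removed by the second one, because they are also common in lower case.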


## Create the graph

The nodes of the graph are the words obtained in the previous processing step.

Two words are linked by an edge if they appear together in at least one email. The weight of the edge is the number of emails in which the two words appear together.

In [9]:
import networkx as nx
import itertools
G = nx.Graph()
for wordlist in dataframe_f.filtered_text:
    wordset = set(wordlist)&set(df_3.word.tolist())
    if len(wordset)>0:
        couples = itertools.combinations(wordset, 2)
        for edge in couples:
            if G.has_edge(edge[0],edge[1]):
                # just increase the weight by one
                G[edge[0]][edge[1]]['weight'] += 1
            else:
                # new edge with weight=1
                G.add_edge(edge[0], edge[1], weight=1)
        #G.add_edges_from(couples)
        #[G.add_edge(couple) for couple in couples]
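
The weighting rule can be checked on a tiny example without networkx. The sketch below uses made-up token lists and a hypothetical `proper_nouns` set, and stores the weights in a `Counter` keyed by word pair instead of a graph.

```python
from collections import Counter
from itertools import combinations

# Made-up emails, already tokenized
toy_emails = [
    ["Obama", "Libya", "Benghazi", "met"],
    ["Obama", "Libya"],
    ["Benghazi"],
]
proper_nouns = {"Obama", "Libya", "Benghazi"}

edge_weights = Counter()
for tokens in toy_emails:
    # Each word counts once per email; sorting gives every pair a fixed order
    present = sorted(set(tokens) & proper_nouns)
    for pair in combinations(present, 2):
        edge_weights[pair] += 1
print(edge_weights)
```

An email containing a single proper noun contributes no edge, and a pair that co-occurs in two emails gets a weight of 2.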

## Community detection

To get a nicer and more informative visualization we can run a community detection algorithm on the graph.

In [10]:
import community
#first compute the best partition
clusterDic = community.best_partition(G)
# networkx >= 2.0 argument order: (graph, values, attribute name)
nx.set_node_attributes(G, clusterDic, 'cluster')

## Prepare for visualization

In [11]:
# Edge info
print('Nb of edges: {}'.format(G.size()))
n1,n2,weights = zip(*G.edges(data='weight'))
import numpy as np
print('mean edge weight: '+str(np.mean(weights))+', max edge weight: '+str(np.max(weights)))

Nb of edges: 68768
mean edge weight: 3.69035016287, max edge weight: 239


The graph is too densely connected to be visualized as is. We are going to reduce the number of connections, but we do not want to simply delete the weakest ones, otherwise many disconnected nodes could be created. Instead, we remove the edges whose weight is low compared to the geometric mean of the weighted degrees of the two nodes they connect. In other words, we keep the strongest connections of each node. This way, the number of connections of overly connected nodes (hubs) is reduced, while nodes with only a few weak connections keep them.

In [12]:
# Remove some edges
print('Initial nb of edges: {}'.format(G.size()))
# Iterate over a copy of the edge list, since edges are removed inside the loop
for u,v,a in list(G.edges(data=True)):
    mean_node_degree = np.sqrt(G.degree(u, weight='weight')*G.degree(v, weight='weight'))
    if a['weight']<0.035*mean_node_degree:
        G.remove_edge(u,v)
print('Final nb of edges: {}'.format(G.size()))

Initial nb of edges: 68768
Final nb of edges: 1715
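
The pruning rule keeps an edge (u, v) of weight w when w >= alpha * sqrt(d_u * d_v), where d_u and d_v are the weighted degrees of its end nodes. The sketch below applies it to a toy edge list. It is a simplification under two assumptions: the degrees are computed once before pruning (the loop above recomputes them as edges disappear), and `alpha = 0.2` is chosen for the toy graph only (the notebook uses 0.035).

```python
import math
from collections import defaultdict

# Toy weighted edges: A and B are strongly linked, C is weakly linked to both,
# and D hangs on C by a single weak edge
toy_edges = {("A", "B"): 10, ("A", "C"): 1, ("B", "C"): 1, ("C", "D"): 1}
alpha = 0.2  # hypothetical value for this toy graph

# Weighted degree: sum of the weights of the edges attached to each node
weighted_degree = defaultdict(int)
for (u, v), w in toy_edges.items():
    weighted_degree[u] += w
    weighted_degree[v] += w

# Keep an edge only if it is strong relative to the degrees of its end nodes
kept = {
    (u, v): w
    for (u, v), w in toy_edges.items()
    if w >= alpha * math.sqrt(weighted_degree[u] * weighted_degree[v])
}
print(sorted(kept))
```

The weak edges attached to the hubs A and B are removed, while the equally weak edge C-D survives because both of its end nodes have a low degree.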


### Write the graph to a file

In [13]:
# Compute the weighted degree of each node, used in the visualization (node size)
degreeDic = dict(G.degree(weight='weight'))
# networkx >= 2.0 argument order: (graph, values, attribute name)
nx.set_node_attributes(G, degreeDic, 'degree')

In [14]:
# Write the graph to a json file
import json
from networkx.readwrite import json_graph
datag = json_graph.node_link_data(G)
# Older networkx releases store link endpoints as indices into datag['nodes'];
# newer ones store the node ids directly. Normalize to node ids in both cases.
def node_id(ref):
    return datag['nodes'][ref]['id'] if isinstance(ref, int) else ref
datag['links'] = [
        {
            'source': node_id(link['source']),
            'target': node_id(link['target']),
            'weight': link['weight']
        }
        for link in datag['links']]
s = json.dumps(datag)
with open("docs/HCgraph2.json", "w") as f:
    f.write(s)
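
For reference, the sketch below builds by hand a tiny dictionary with the node-link layout that the code above aims for (node attributes `id`, `cluster`, `degree`; link attributes `source`, `target`, `weight`), then checks that every link endpoint is a known node id. The values are made up, and the real file contains a few extra top-level keys written by `node_link_data`.

```python
import json

# Hand-written toy graph in node-link form (values are made up)
toy_graph = {
    "nodes": [
        {"id": "A", "cluster": 0, "degree": 10},
        {"id": "B", "cluster": 0, "degree": 10},
    ],
    "links": [{"source": "A", "target": "B", "weight": 10}],
}

# Round trip through JSON, as a visualization tool would read it
loaded = json.loads(json.dumps(toy_graph))

# Sanity check: links must refer to node ids, not to list positions
node_ids = {node["id"] for node in loaded["nodes"]}
dangling = [
    link for link in loaded["links"]
    if link["source"] not in node_ids or link["target"] not in node_ids
]
print(len(dangling))
```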