In [1]:
! wget https://storage.googleapis.com/reinfer-datasets/enron_mail_20150507.tar.gz

--2019-06-02 10:24:54--  https://storage.googleapis.com/reinfer-datasets/enron_mail_20150507.tar.gz
Resolving storage.googleapis.com (storage.googleapis.com)... 216.58.210.48, 2a00:1450:4009:807::2010
Connecting to storage.googleapis.com (storage.googleapis.com)|216.58.210.48|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 443254787 (423M) [application/x-tar]
Saving to: ‘enron_mail_20150507.tar.gz’


2019-06-02 10:24:58 (93.0 MB/s) - ‘enron_mail_20150507.tar.gz’ saved [443254787/443254787]



In [21]:
!tar -xf enron_mail_20150507.tar.gz

## 1. Wrangle the data to understand the relationships between senders and recipients

In [96]:
import re
import glob
from collections import defaultdict, Counter
from mailbox import mboxMessage

# Build a data structure that maps each sender to {recipient: n} (and, in reverse,
# each recipient to {sender: n}). The defaultdict-Counter combo does this efficiently.
# I experimented a little with using the 'mbox' class to do everything at once but
# couldn't make it work. If I have time I'll go back to it.

# Certain emails don't have 'To' or 'From' headers, usually ones that are going to a group list.
# As we don't know who is on these lists, we ignore this class of message for now. This
# isn't ideal, as it is conceivable that big influencers are much more likely to
# email these lists.

# Currently we neglect cc's and bcc's for purely computational reasons;
# there is no other reason not to include them.

# Data structures:
# Effectively a graph. In practice one would build the graph with a standard library and run HITS on that. Here we
# log the incoming and outgoing edges independently for ease of use in the custom HITS implementation.
outgoing = defaultdict(Counter) # outgoing connection counts
incoming = defaultdict(Counter) # incoming connection counts


previously_seen = set() # keeps a hash of the payloads of previously seen messages to avoid double counting

for f in glob.glob("maildir/**", recursive=True):
    try:
        with open(f) as mbox_file:
            msg = mboxMessage(mbox_file)
                
            payload = msg.get_payload()
            if  msg["From"] is not None and msg["To"] is not None and payload is not None:
                payload_hash = hash(payload)
                if payload_hash not in previously_seen:
                    fr = msg["From"]
                    to = re.sub(r'[ \n\t]', '', msg["To"]).split(",") # strip spaces, newlines and tabs, then split on commas
                    outgoing[fr].update([person for person in to if person != fr]) # important to remove self edges as these are a big source of noise
                    for person in to:
                        if person != fr:
                            incoming[person].update([fr]) 
                    previously_seen.add(payload_hash)
            
    except (IsADirectoryError, UnicodeDecodeError):
        # skip directories and files that cannot be decoded as text
        pass

    

In [43]:
print("{} unique senders".format(len(outgoing)))

19567 unique senders
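
To make the shape of the two edge structures concrete, here is a small sketch on made-up messages. The addresses, the `toy_` names and the `record` helper are hypothetical and are not part of the analysis; they only mirror what the loop above does for each deduplicated message.

```python
from collections import Counter, defaultdict

toy_outgoing = defaultdict(Counter)  # sender -> Counter of recipients
toy_incoming = defaultdict(Counter)  # recipient -> Counter of senders

def record(sender, recipients):
    """Hypothetical helper: log one message, skipping self edges."""
    for person in recipients:
        if person != sender:
            toy_outgoing[sender][person] += 1
            toy_incoming[person][sender] += 1

# Made-up messages, for illustration only.
record("a@example.com", ["b@example.com", "c@example.com"])
record("a@example.com", ["b@example.com", "a@example.com"])  # the self edge is dropped
record("b@example.com", ["c@example.com"])

print(dict(toy_outgoing["a@example.com"]))
print(dict(toy_incoming["c@example.com"]))
```

Each edge is stored twice, once per direction, so that the hub update can walk outgoing edges and the authority update can walk incoming edges without searching the whole graph.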


## 2. Implementation of Hubs and Authorities

This is a vanilla implementation of Hubs and Authorities (HITS), written without regard for performance. I'm not sure how long it will take to run, so if it's slow I'll optimize it.
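
For reference, the update rules that the code in this section follows can be written as below. The notation is introduced here and is not from the original notebook: $a_p$ and $h_p$ are the authority and hub scores of person $p$, $w_{qp}$ is the number of emails from $q$ to $p$, and $f$ is the weight transformation (the identity by default).

$$
a_p \leftarrow \sum_{q \to p} f(w_{qp})\, h_q, \qquad a_p \leftarrow \frac{a_p}{\sqrt{\sum_r a_r^2}}
$$

$$
h_p \leftarrow \sum_{p \to q} f(w_{pq})\, a_q, \qquad h_p \leftarrow \frac{h_p}{\sqrt{\sum_r h_r^2}}
$$

Each iteration applies the authority update first and then the hub update, so the hub scores are computed from the freshly normalised authority scores.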

In [101]:
class Score:
    __slots__ = 'hub', 'auth'
    def __init__(self, hub=1, auth=1):
        self.hub = hub
        self.auth = auth  # todo: check the initialization of slots

ha_scores = {email:Score() for email in list(outgoing.keys()) + list(incoming.keys())}  # email: Score(hub, auth)

def auth_update(scores, incoming_edges, weight_transformation):
    norm = 0
    for person in scores.keys():
        scores[person].auth = 0
        for connection, weight in incoming_edges[person].items():
            
            scores[person].auth += weight_transformation(weight)*scores[connection].hub
        norm += scores[person].auth**2
            
    norm = norm**0.5
    for person in scores.keys():
        scores[person].auth /= norm
        
def hub_update(scores, outgoing_edges, weight_transformation):
    norm = 0
    for person in scores.keys():
        scores[person].hub = 0
        for connection, weight in outgoing_edges[person].items():
            scores[person].hub += weight_transformation(weight)*scores[connection].auth
        norm += scores[person].hub**2
            
    norm = norm**0.5
    for person in scores.keys():
        scores[person].hub /= norm
        
        
def hits(scores,
         outgoing_edges,
         incoming_edges,
         weight_transformation=lambda weight:weight,
         max_iter=100):
    """
    Update HITS hubs and authorities values for nodes.
    
    :param scores : The hubs and authorities scores for each node
                    in the graph. Hubs score calculated on outgoing
                    connections and authorities score calculated
                    from incoming connections.
    
    :param incoming_edges: Incoming edges and associated counts 
    
    :param outgoing_edges: Out edges and associated counts
    
    :param weight_transformation: Function applied to the raw
                     counts to derive the weights
    
    :param max_iter: Number of iterations the algorithm runs for.
                     Note, we currently don't check for convergence
                     and an improvement to this algorithm could be
                     to perform such an action.       
    """
    for iteration in range(max_iter):
        print ("Running iteration",iteration,end="\r")
        auth_update(scores, incoming_edges, weight_transformation)
        hub_update(scores, outgoing_edges, weight_transformation)
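

The docstring above notes that a convergence check would be an improvement. A minimal sketch of one way to add it is shown here, assuming plain dictionaries of scores instead of the `Score` class. The function name, the `tol` parameter and the L1 stopping rule are assumptions made for illustration, not part of the implementation above, and the toy graph is made up.

```python
from collections import Counter

def hits_until_converged(outgoing_edges, incoming_edges, tol=1e-8, max_iter=100):
    """Sketch: HITS on dict-of-Counter edges, stopping once scores settle.

    Returns (hub, auth, iterations_run).
    """
    # Collect every node that appears as a source or a target.
    nodes = set(outgoing_edges) | set(incoming_edges)
    for edges in (outgoing_edges, incoming_edges):
        for counter in edges.values():
            nodes.update(counter)
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}

    def normalise(values):
        norm = sum(v * v for v in values.values()) ** 0.5
        # Guard against an all-zero vector (e.g. a graph with no edges).
        return {n: v / norm for n, v in values.items()} if norm else values

    iterations_run = 0
    for iterations_run in range(1, max_iter + 1):
        new_auth = normalise({
            n: sum(w * hub[src] for src, w in incoming_edges.get(n, {}).items())
            for n in nodes
        })
        new_hub = normalise({
            n: sum(w * new_auth[dst] for dst, w in outgoing_edges.get(n, {}).items())
            for n in nodes
        })
        # Total absolute change across both score vectors.
        change = sum(abs(new_auth[n] - auth[n]) + abs(new_hub[n] - hub[n]) for n in nodes)
        hub, auth = new_hub, new_auth
        if change < tol:
            break
    return hub, auth, iterations_run

# Made-up toy graph: a -> b (2 emails), a -> c (1), b -> c (1).
toy_out = {"a": Counter({"b": 2, "c": 1}), "b": Counter({"c": 1})}
toy_in = {"b": Counter({"a": 2}), "c": Counter({"a": 1, "b": 1})}
toy_hub, toy_auth, n_iter = hits_until_converged(toy_out, toy_in)
print("stopped after", n_iter, "iterations")
```

Stopping on the total change in scores avoids running all `max_iter` iterations when the ranking has already stabilised; the right tolerance for the full graph would need to be checked empirically.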



## 3. Run HITS and find the influential people in the organisation

In [102]:
hits(ha_scores, outgoing, incoming)
authority_scores = sorted([(person, score.auth) for person, score in ha_scores.items()], key=lambda x:x[1])
hub_scores = sorted([(person, score.hub) for person, score in ha_scores.items()], key=lambda x:x[1])

Running iteration 99

In [103]:
authority_scores[-10:]

[('mpalmer@enron.com', 0.13138844701950742),
 ('alan.comnes@enron.com', 0.14488565586942323),
 ('skean@enron.com', 0.1520301828018625),
 ('sandra.mccubbin@enron.com', 0.16583307696183436),
 ('harry.kingerski@enron.com', 0.1828764769951093),
 ('karen.denne@enron.com', 0.22208054837876512),
 ('james.steffes@enron.com', 0.2546958855676058),
 ('paul.kaufman@enron.com', 0.28180420772830467),
 ('susan.mara@enron.com', 0.28202716121445387),
 ('richard.shapiro@enron.com', 0.29244369621562544)]

In [99]:
hub_scores[-10:]

[('karen.denne@enron.com', 0.041451846701707165),
 ('sgovenar@govadv.com', 0.04875516915071102),
 ('d..steffes@enron.com', 0.04894466242855275),
 ('alan.comnes@enron.com', 0.06781710065529507),
 ('miyung.buster@enron.com', 0.06918860433321587),
 ('james.steffes@enron.com', 0.07136241435638094),
 ('mary.hain@enron.com', 0.07209525253439127),
 ('ginger.dernehl@enron.com', 0.1631010334561144),
 ('susan.mara@enron.com', 0.3035096098153337),
 ('jeff.dasovich@enron.com', 0.9174549003982896)]

## Digging into these scores - the importance of the edge weights

Looking at these people, they aren't necessarily influential in the company, but may perform tasks that require them to be cc'd on (or to send) a lot of emails. The algorithm currently appears to be placing too much emphasis on the number of emails sent. Consequently, we experiment with a callback passed into the HITS algorithm that calculates the weights from the email counts. Due to time constraints, the only callback currently tested is the extreme case that binarizes the weights: 1 if an edge is present and 0 if not. Below are the results calculated for this case.

In [105]:
hits(ha_scores, outgoing, incoming, weight_transformation=lambda c: (bool(c)))
authority_scores = sorted([(person, score.auth) for person, score in ha_scores.items()], key=lambda x:x[1])
hub_scores = sorted([(person, score.hub) for person, score in ha_scores.items()], key=lambda x:x[1])

Running iteration 99

In [106]:
authority_scores[-10:]

[('mark.haedicke@enron.com', 0.07488758142809376),
 ('tana.jones@enron.com', 0.07552716732968225),
 ('mark.taylor@enron.com', 0.07895266309179885),
 ('tim.belden@enron.com', 0.08083804773983622),
 ('steven.kean@enron.com', 0.0816167001256113),
 ('elizabeth.sager@enron.com', 0.08242413411150346),
 ('sally.beck@enron.com', 0.08918405915572611),
 ('greg.whalley@enron.com', 0.0933039061254153),
 ('john.lavorato@enron.com', 0.10744838163419328),
 ('louise.kitchen@enron.com', 0.12264309889716965)]

In [107]:
hub_scores[-10:]

[('maxine.levingston@enron.com', 0.10223140028217378),
 ('daniel.muschar@enron.com', 0.10303974818686064),
 ('technology.enron@enron.com', 0.12042261529136816),
 ('nicki.daw@enron.com', 0.12087279895445646),
 ('billy.lemmons@enron.com', 0.12609532448743777),
 ('david.oxley@enron.com', 0.12858231679410545),
 ('outlook.team@enron.com', 0.14392841207707993),
 ('kenneth.lay@enron.com', 0.14907545981705006),
 ('sally.beck@enron.com', 0.1587743105635194),
 ('david.forster@enron.com', 0.18663705607163994)]

Certainly, these people seem to be more important to the company's operations. Among the top authorities, for example, John Lavorato, former head of trading operations, ranks second. Greg Whalley was a former president and clearly another big player. Qualitatively, the big hubs don't seem to be nearly as 'authoritative' (defined in the conventional sense), which makes sense. These seem to be people, or group email addresses, that were involved in large amounts of administrative communication: necessary functions for the company, but certainly not ranking as high in the chain of command.
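
Raw counts and binary weights are the two extremes of the weighting callback; intermediate damping functions sit between them. As a sketch only, here are some candidates that could be passed as `weight_transformation`. The `transformations` dictionary and the `sqrt` and `log1p` options are hypothetical suggestions and were not run as part of this analysis.

```python
import math

# Candidate callbacks; only "raw" and "binary" correspond to runs in this notebook.
transformations = {
    "raw": lambda count: count,                  # default used in the first run
    "binary": lambda count: float(bool(count)),  # numerically the same as the second run
    "sqrt": lambda count: math.sqrt(count),      # mild damping (untested here)
    "log1p": lambda count: math.log1p(count),    # stronger damping, log(1 + count) (untested here)
}

# Show how each option scales a few example edge counts.
for name, transform in transformations.items():
    print(name, [round(transform(c), 3) for c in (1, 10, 100)])
```

Any of these could be tried with a call such as `hits(ha_scores, outgoing, incoming, weight_transformation=transformations["log1p"])`; whether they give more sensible rankings than the binary case is an open question.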

## 4. What did the important people talk about differentially?

A naive approach would be to perform an enrichment analysis between the word frequencies of the top-ranked and bottom-ranked hubs and authorities. Alternatively, framing this as a classification problem and trying to discriminate between these two groups could prove fruitful. Some classical machine learning algorithms allow for easy interpretation of feature importance (e.g. weights in a logistic regression or Gini impurity importance in random forests). For more complex models, mean decrease accuracy