# Case Study 5: Data Science in Email Data

**Required Readings:** 
* [Enron Emails](https://www.kaggle.com/wcukierski/enron-email-dataset) 
* Please download the Enron Email dataset from [here](https://www.cs.cmu.edu/~./enron/).
* [TED Talks](https://www.ted.com/talks) for examples of 10-minute talks.


**NOTE**
* Please save the notebook frequently when working in Jupyter Notebook; otherwise your changes may be lost.

*----------------------

# Problem: pick a data science problem that you plan to solve using Email Data
* The problem should be important and interesting, with potential impact in some area.
* The problem should be solvable using the data and data science solutions.

Please briefly describe in the following cell: What problem are you trying to solve? Why is this problem important and interesting?

Because Enron was at the center of a scandal, we focus on the top 5 executives most closely tied to it. We gather the emails they sent, extract the topics and body text most relevant to the scandal, and analyze them for sentiment and signs of suspicious activity, looking for discussion of terms such as "Bankruptcy", "Fraud", and "Shutdown".
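
One way to make this concrete is a simple keyword screen that marks an email as relevant when its text mentions one of the scandal terms. The sketch below is only an illustration of that idea, not the final solution: `SCANDAL_TERMS` and `scandal_terms_in` are hypothetical names, the term list contains just the three examples named above, and the sample sentence is made up.

```python
import re

# Hypothetical keyword list, taken from the examples in the problem statement
SCANDAL_TERMS = ['bankruptcy', 'fraud', 'shutdown']

def scandal_terms_in(text, terms=SCANDAL_TERMS):
    """Return the sorted list of scandal terms that appear as whole words in text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(term for term in terms if term in words)

# Made-up sample text for illustration
sample = "We need to discuss the Bankruptcy filing before any fraud rumors spread."
print(scandal_terms_in(sample))
```

Matching whole lowercase words keeps the screen case-insensitive and avoids hits inside longer words; a real analysis would likely need a broader term list.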









# Data Exploration: Exploring the Email Dataset

**Plot the email communication graph/network**
* each node is an email account
* the weight of an edge between two accounts depends on how many emails have been sent between them.
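
The weighting idea can be sketched without any graph library: count the messages per (sender, receiver) pair, then divide each count by the largest one so that every weight falls in (0, 1]. This is an illustrative sketch only; `build_weighted_edges` and the sample addresses are made up for the example, and treating each direction as a separate edge is an assumption.

```python
from collections import Counter

def build_weighted_edges(messages):
    """Turn (sender, receiver) pairs into {(sender, receiver): weight} with weights in (0, 1]."""
    counts = Counter(messages)
    if not counts:
        return {}
    max_count = max(counts.values())
    # float() keeps the division from truncating on Python 2
    return {pair: float(n) / max_count for pair, n in counts.items()}

# Made-up sample traffic
messages = [
    ('a@enron.com', 'b@enron.com'),
    ('a@enron.com', 'b@enron.com'),
    ('a@enron.com', 'c@enron.com'),
    ('b@enron.com', 'a@enron.com'),
]
edges = build_weighted_edges(messages)
for pair in sorted(edges):
    print(pair, edges[pair])
```

The resulting dictionary can be fed to a graph library edge by edge, using the weight to scale line width or layout strength.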

In [None]:
import os, sys, csv, datetime, json
import numpy as np
from IPython.display import clear_output

csv.field_size_limit(sys.maxint)

# Local path to data in project folder
data_dir = 'data/'

# Get each data file path
data_files = [os.path.abspath(os.path.join(data_dir, file)) for file in os.listdir(data_dir)]

# Store data
comms_frequencies = dict()
max_count = -sys.maxint - 1
counter = 0

# Read files
for data_file in data_files:
    with open(data_file, 'r') as f:
        reader = csv.reader(f)
        next(reader) # Skip header
        
        # Read each row
        for row in reader:
            message_dict = dict()
            message_parts = row[1].split('\n') # Split message by newline

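            # Parse line by line and store by message part key. The slice drops the whole
            # 'Name:' prefix, and setdefault keeps the first match (the real header) so
            # quoted headers inside forwarded bodies don't overwrite it.
            for line in message_parts:
                if line.startswith('From:'):
                    message_dict.setdefault('From', line[len('From:'):])
                elif line.startswith('To:'):
                    message_dict.setdefault('To', line[len('To:'):])
                elif line.startswith('X-To:'):
                    message_dict.setdefault('X-To', line[len('X-To:'):])
                elif line.startswith('X-From:'):
                    message_dict.setdefault('X-From', line[len('X-From:'):])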
            
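            # Skip malformed records that have no From header
            if 'From' not in message_dict:
                continue
            
            # Keep the full address so senders and receivers share one node-name format
            send_email = message_dict['From'].strip()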
            
            # Track only senders at Enron.com to save time/space
            try:
                send_domain = message_dict['From'].split('@')[1].strip()
                if not send_domain == 'enron.com':
                    continue
            except IndexError:
                # Different email format, continue anyway
                pass
            
            recv_emails = None
            
            # Get all receivers of email in either field of email
            if message_dict.has_key('To'):
                recv_emails = [email.strip() for email in message_dict['To'].split(',')]
            elif message_dict.has_key('X-To'):
                recv_emails = [email.strip() for email in message_dict['X-To'].split(',')]
            else:
                # No receivers found, skip
                continue
            
            # Add counts to global dictionary
            for recv_email in recv_emails:
                recv_counts = comms_frequencies.setdefault(send_email, dict())
                recv_counts[recv_email] = recv_counts.get(recv_email, 0) + 1
                
                # Store maximum number of emails between people for weighting.
                # Checked for every receiver, not just the last one in the list.
                max_count = max(max_count, recv_counts[recv_email])

            # Progress output
            counter += 1            
            if counter % 10000 == 0:
                clear_output()
                print 'Processed {} records.'.format(counter)

clear_output()
print 'Finished {} records. Max weight was {}.'.format(counter, max_count)
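
The loop above matches header lines with `startswith`, which misses recipients on folded (continued) `To:` lines. As an alternative sketch, the standard library's `email` package can parse the headers instead. This assumes each CSV row holds a raw RFC 822-style message; `sender_and_receivers` is a hypothetical helper and the sample message is made up.

```python
from email import message_from_string
from email.utils import getaddresses

def sender_and_receivers(raw_message):
    """Return (sender_address, [receiver_addresses]) parsed from the message headers."""
    msg = message_from_string(raw_message)
    senders = getaddresses(msg.get_all('From', []))
    receivers = getaddresses(msg.get_all('To', []))
    sender = senders[0][1].lower() if senders else None
    return sender, [addr.lower() for _, addr in receivers if addr]

# Made-up message with a folded To header and a quoted From line in the body
raw = (
    "Message-ID: <1.JavaMail.evans@thyme>\n"
    "From: phillip.allen@enron.com\n"
    "To: tim.belden@enron.com,\n"
    "\tjohn.lavorato@enron.com\n"
    "Subject: test\n"
    "\n"
    "From: someone.else@example.com (quoted text in the body)\n"
)
print(sender_and_receivers(raw))
```

Because the parser stops reading headers at the first blank line, the quoted `From:` line in the body is not mistaken for the sender.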

In [None]:
import cPickle as pickle

# Write out data so we don't have to run it again
output_data = {
    'comms_frequencies': comms_frequencies
}

output_file = 'comms_frequencies'

with open(output_file, 'wb') as f:
    pickle.dump(output_data, f)
    print 'Wrote data.'

In [None]:
# Read data in again from file
output_file = 'comms_frequencies'

# Data structure
comms_frequencies = None

with open(output_file, 'rb') as f: # Pickle files must be read in binary mode
    output_data = pickle.load(f)
    comms_frequencies = output_data['comms_frequencies']
    print 'Loaded data.'

In [None]:
import matplotlib.pyplot as plt
import networkx as nx

plt.figure(figsize=(200,200))
G=nx.Graph()

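# Largest pair count, recomputed here so this cell also works after loading the data from file
max_count = max(n for recvs in comms_frequencies.values() for n in recvs.values())

# Add only the first 100 senders and all their connected nodes because it's unreadable after that and takes
# too long to make a graph
count = 0
for send in comms_frequencies.keys():
    for recv in comms_frequencies[send].keys():
        # Convert to float before dividing; integer division would round almost every weight to 0
        G.add_edge(send, recv, weight=float(comms_frequencies[send][recv]) / max_count)
    
    count += 1
    
    if count >= 100:
        break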

pos=nx.spring_layout(G)
nx.draw_networkx_nodes(G,pos,node_size=2000)
nx.draw_networkx_edges(G,pos,width=2)

# labels
nx.draw_networkx_labels(G,pos,font_size=6,font_family='sans-serif')

plt.axis('off')
plt.savefig("weighted_graph.png") # Save graph as image to view better
plt.show()

# The Solution: implement a data science solution to the problem you are trying to solve.

Briefly describe the idea of your solution to the problem in the following cell:

Write code to implement the solution in Python:

In [None]:
# These are the top executives:

# Kenneth Lay:
# kenneth.lay@enron.com
# Founder, Chairman, and CEO

# Jeffrey Skilling:
# jeff.skilling@enron.com
# Former President and COO

# Andrew Fastow:
# andrew.fastow@enron.com
# Former Chief Financial Officer

# Rebecca Mark:
# rebecca.mark@enron.com
# Former Vice Chairman; Chairman and CEO of Enron International

# Stephen F. Cooper:
# Interim CEO and CRO

In [None]:
# This is initial setup to running our solution

import os, sys, csv, json, re
from datetime import datetime # Only the class is imported, so the name is not shadowed
import numpy as np
from IPython.display import clear_output

csv.field_size_limit(sys.maxint)

# Local path to data in project folder
data_dir = 'data/'

# Get each data file path
data_files = [os.path.abspath(os.path.join(data_dir, file)) for file in os.listdir(data_dir)]

# Store data
max_count = -sys.maxint - 1
counter = 0

exec_emails = ['kenneth.lay@enron.com', 'jeff.skilling@enron.com', 'andrew.fastow@enron.com', 'rebecca.mark@enron.com']
email_dict = dict()

for email in exec_emails:
    email_dict[email] = []

print 'Setup complete.'

In [None]:
# Run this cell only if there is new data, otherwise load from file below

# Used regex to isolate field, see reference here:
# https://rforwork.info/2013/11/03/a-rather-nosy-topic-model-analysis-of-the-enron-email-corpus/
x_filename_pat = re.compile("X-FileName:.+\n")
to_pat = re.compile("To:.+\n+")
xto_pat = re.compile("X-To:.+\n")
cc_pat = re.compile("Cc:.+\n")
bcc_pat = re.compile("Bcc:.+\n")
xcc_pat = re.compile("X-cc:.+\n")
xbcc_pat = re.compile("X-bcc:.+\n")
mimver_pat = re.compile("Mime-Version:.+\n")
ctype_pat = re.compile("Content-Type:.+\n")
ctype_enc_pat = re.compile("Content-Transfer-Encoding:.+\n")
xfolder_pat = re.compile("X-Folder:.+\n")
xorigin_pat = re.compile("X-Origin:.+\n")
from_pat = re.compile("From:.+\n")
mess_id_pat = re.compile("Message-ID:.+\n")
xfrom_pat = re.compile("X-From:.+\n")
date_pat = re.compile("Date:.+\n")
subject_pat = re.compile("Subject:.+\n")
    
# Read files
for data_file in data_files:
    with open(data_file, 'r') as f:
        reader = csv.reader(f)
        next(reader) # Skip header
        
        # Read each row
        for row in reader:
            email_new = x_filename_pat.sub('', row[1])
            email_new = xfrom_pat.sub('', email_new)
            email_new = xto_pat.sub('', email_new)
            email_new = mimver_pat.sub('', email_new)
            email_new = ctype_pat.sub('', email_new)
            email_new = ctype_enc_pat.sub('', email_new)
            email_new = xfolder_pat.sub('', email_new)
            email_new = xorigin_pat.sub('', email_new)
            email_new = mess_id_pat.sub('', email_new)
            email_new = xcc_pat.sub('', email_new)
            email_new = xbcc_pat.sub('', email_new)
            email_new = cc_pat.sub('', email_new)
            email_new = bcc_pat.sub('', email_new)
            email_new = to_pat.sub('', email_new)
            
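            # Skip records without a From header instead of raising IndexError
            from_matches = from_pat.findall(email_new)
            if not from_matches:
                continue
            from_field = from_matches[0].replace("From:", "").strip()
            email_new = from_pat.sub('', email_new)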
            
            # If sender is one of top executives, store it
            if from_field in exec_emails:
                email_dict[from_field].append(email_new)
            
            # Progress output
            counter += 1
#             if counter > 0:
#                 break
            if counter % 10000 == 0:
                clear_output()
                print 'Processed {} records.'.format(counter)

clear_output()
print 'Finished {} records.'.format(counter)
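
The patterns above are applied to the whole message, so a body line that happens to contain text such as `Content-Type:` is stripped as well. A possible refinement, sketched below, is to split the message at the first blank line and clean only the header block. `DROP_HEADER_PAT` and `strip_headers` are hypothetical names, the header list is a subset chosen for illustration, and the sample message is made up.

```python
import re

# Header names to drop; a subset of the patterns above, chosen for illustration
DROP_HEADER_PAT = re.compile(
    r"^(X-FileName|X-Folder|X-Origin|Mime-Version|Content-Type):.*\n?",
    re.MULTILINE,
)

def strip_headers(raw_message):
    """Remove the listed header lines from the header block only, leaving the body untouched."""
    headers, sep, body = raw_message.partition('\n\n')
    # Append a newline so the last header line also ends in '\n'
    cleaned = DROP_HEADER_PAT.sub('', headers + '\n')
    # Restore the blank separator line only if the message had a body
    return cleaned + '\n' + body if sep else cleaned

# Made-up message whose body mentions a header name
raw = (
    "From: a@enron.com\n"
    "X-Folder: inbox\n"
    "Subject: hi\n"
    "\n"
    "Content-Type: mentioned in body\n"
)
print(strip_headers(raw))
```

Anchoring each pattern at the start of a line with `re.MULTILINE` also prevents a pattern like `To:` from matching in the middle of another header.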

In [None]:
import cPickle as pickle

# Write out data so we don't have to run it again
output_data = {
    'email_data': email_dict
}

output_file = 'email_data'

with open(output_file, 'wb') as f:
    pickle.dump(output_data, f)
    print 'Wrote data.'

In [None]:
# Read data in again from file
output_file = 'email_data'

# Data structure
email_dict = None

with open(output_file, 'rb') as f: # Pickle files must be read in binary mode
    output_data = pickle.load(f)
    email_dict = output_data['email_data']
    print 'Loaded data.'

In [None]:
import re

# Compile once, outside the loops. Matches the part of a token up to and including @enron.com.
enron_addr_pat = re.compile(r"^(.*?)@enron\.com")

exec_word_freq = dict()

for email in email_dict.keys():
    if len(email_dict[email]) > 0:
        if email not in exec_word_freq:
            exec_word_freq[email] = dict()
        
        for email_text in email_dict[email]:
            words = re.split(r"\s+", email_text)
            
            for word in words:
                word = enron_addr_pat.sub('', word)
                if not word:
                    continue # Skip empty tokens left by the split or the address removal
                exec_word_freq[email][word] = exec_word_freq[email].get(word, 0) + 1

print 'Finished word frequencies.'

The format of the word dict is as follows:
```
{
    email: {
        'word': 1,
        'otherword': 20
    }
}
```
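
Given a dictionary in this shape, the most frequent words per executive can be pulled out with `collections.Counter`. The sketch below is a minimal illustration: `top_words` is a hypothetical helper, the stop-word list is a tiny made-up sample rather than a standard list, and the counts are invented.

```python
from collections import Counter

# Small illustrative stop-word list (an assumption, not a standard list)
STOP_WORDS = {'the', 'to', 'and', 'a', 'of', 'in'}

def top_words(word_counts, n=3, stop_words=STOP_WORDS):
    """Return the n most frequent (word, count) pairs, merging case and skipping stop words."""
    filtered = Counter()
    for word, count in word_counts.items():
        key = word.lower()
        if key and key not in stop_words:
            filtered[key] += count
    return filtered.most_common(n)

# Made-up counts in the same shape as exec_word_freq
exec_word_freq = {
    'kenneth.lay@enron.com': {'the': 50, 'Meeting': 4, 'meeting': 3, 'board': 5, 'to': 30},
}
for email, counts in exec_word_freq.items():
    print(email, top_words(counts))
```

Lowercasing before counting merges variants such as "Meeting" and "meeting", which would otherwise split the frequency of a single word.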

Please delete this cell when complete

# Results: summarize and visualize the results discovered from the analysis

Please use figures, tables, or videos to communicate the results with the audience.


In [None]:
# Your code starts here
#   Please add comments or text cells in between to explain the general idea of each block of the code.
#   Please feel free to add more cells below this cell if necessary








*-----------------
# Done

All set! 

**What do you need to submit?**

* **Notebook File**: Save this Jupyter notebook, and find the notebook file in your folder (for example, "filename.ipynb"). This is the file you need to submit. Please make sure all the plotted tables and figures are in the notebook. If you used "jupyter notebook --pylab=inline" to open the notebook, all the figures and tables should have shown up in the notebook.

* **PPT Slides**: Please prepare PPT slides (for a 10-minute talk) to present the case study. Each team presents its case study in class for 10 minutes.

Please compress all the files into a single zip file.


**How to submit:**

        Please submit through Canvas, in the Assignment "Case Study 5".
        
**Note: Each team only needs to submit one submission in Canvas**


# Peer-Review Grading Template:

**Total Points: (100 points)** Please don't worry about the absolute scores; we will rescale the final grading according to the performance of all teams in the class.

Please add an "**X**" mark in front of your rating: 

For example:

*2: bad*
          
**X** *3: good*
    
*4: perfect*


    ---------------------------------
    The Problem: 
    ---------------------------------
    
    1. (10 points) How well did the team describe the problem they are trying to solve using the data? 
       0: not clear
       2: I can barely understand the problem
       4: okay, can be improved
       6: good, but can be improved
       8: very good
       10: crystal clear
    
    2. (10 points) Do you think the problem is important or has a potential impact?
        0: not important at all
        2: not sure if it is important
        4: seems important, but not clear
        6: interesting problem
        8: an important problem, which I want to know the answer myself
       10: very important, I would be happy to invest money in a project like this.
    
    ----------------------------------
    Data Collection and Processing:
    ----------------------------------
    
    3. (10 points) Do you think the data collected/processed are relevant and sufficient for solving the above problem? 
       0: not clear
       2: I can barely understand what data they are trying to collect/process
       4: I can barely understand why the data is relevant to the problem
       6: the data are relevant to the problem, but better data can be collected
       8: the data collected are relevant and at a proper scale
      10: the data are properly collected and they are sufficient

    -----------------------------------
    Data Exploration:
    -----------------------------------
    4. How well did the team solve the following task:
    
    (1) plot email communication graph/network (10 points):
       0: missing answer
       4: okay, but with major problems
       7: good, but with minor problems
      10: perfect
    

    -----------------------------------
    The Solution
    -----------------------------------
    5.  How well did the team describe the solution they used to solve the problem? (10 points)
       0: not clear
       2: I can barely understand
       4: okay, can be improved
       6: good, but can be improved
       8: very good
       10: crystal clear
       
    6. How well does the solution solve the problem? (10 points)
       0: not relevant
       2: barely relevant to the problem
       4: okay solution, but there is an easier solution.
       6: good, but can be improved
       8: very good, but solution is simple/old
       10: innovative and technically sound
       
    7. How well did the team implement the solution in Python? (10 points)
       0: the code is not relevant to the solution proposed
       2: the code is barely understandable, but not relevant
       4: okay, the code is clear but incorrect
       6: good, the code is correct, but with major errors
       8: very good, the code is correct, but with minor errors
      10: perfect 
   
    -----------------------------------
    The Results
    -----------------------------------
     8.  How well did the team present the results they found in the data? (10 points)
       0: not clear
       2: I can barely understand
       4: okay, can be improved
       6: good, but can be improved
       8: very good
      10: crystal clear
       
     9.  What do you think of the results they found in the data?  (5 points)
       0: not clear
       1: likely to be wrong
       2: okay, maybe wrong
       3: good, but can be improved
       4: make sense, but not interesting
       5: make sense and very interesting
     
    -----------------------------------
    The Presentation
    -----------------------------------
    10. How well do all the different parts (data, problem, solution, result) fit together as a coherent story?  
       0: they are irrelevant
       1: I can barely understand how they are related to each other
       2: okay, the problem is good, but the solution doesn't match well, or the problem is not solvable.
       3: good, but the results don't make much sense in the context
       4: very good fit, but not exciting (the storyline can be improved/polished)
       5: a perfect story
      
    11. Did the presenter make good use of the 10 minutes for presentation?  
       0: the team didn't present
       1: bad, barely finished a small part of the talk
       2: okay, barely finished most parts of the talk.
       3: good, finished all parts of the talk, but some part is rushed
       4: very good, but the allocation of time on different parts can be improved.
       5: perfect timing and good use of time      

    12. How would you rate the presentation (overall quality)?  
       0: the team didn't present
       1: bad
       2: okay
       3: good
       4: very good
       5: perfect


    -----------------------------------
    Overall: 
    -----------------------------------
    13. How many points out of the 100 do you give to this project in total?  Please don't worry about the absolute scores, we will rescale the final grading according to the performance of all teams in the class.
    Total score:
    
    14. What are the strengths of this project? Briefly, list up to 3 strengths.
       1: 
       2:
       3:
    
    15. What are the weaknesses of this project? Briefly, list up to 3 weaknesses.
       1:
       2:
       3:
    
    16. Detailed comments and suggestions. What suggestions do you have to improve the quality of this project further?
    
    
    

    ---------------------------------
    Your Vote: 
    ---------------------------------
    1. [Overall Quality] Between the two submissions that you are reviewing, which team do you think deserves the better score?  (5 bonus points)
        0: I vote the other team is better than this team
        5: I vote this team is better than the other team 
        
    2. [Presentation] Among all the teams in the presentation, which team do you think deserves the best presentation award for this case study?  
        1: Team 1
        2: Team 2
        3: Team 3
        4: Team 4
        5: Team 5
        6: Team 6
        7: Team 7
        8: Team 8
        9: Team 9


