# IND5003 Enron Project
## Contents of this Jupyter Notebook
### Dataset
Dataset provided by the professor: https://www.cs.cmu.edu/~./enron/
- Unstructured dataset containing raw text in the form of emails

**Make sure that the dataset folder `maildir` is in the same directory as your project on your own system; otherwise the notebook will not run.**
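
One way to make this requirement explicit is to resolve the dataset folder relative to the project directory and fail early when it is missing. This is only a minimal sketch: the helper name `find_maildir` and the wording of its error message are our own assumptions, not part of the original notebook.

```python
from pathlib import Path


def find_maildir(project_dir: Path, name: str = "maildir") -> Path:
    """Return the dataset folder inside project_dir, or raise if it is absent."""
    candidate = project_dir / name
    if not candidate.is_dir():
        raise FileNotFoundError(
            f"Expected the Enron dataset at {candidate}; "
            "place the extracted folder next to this notebook."
        )
    return candidate


# Possible use from the notebook's working directory (hypothetical):
# maildir_path = find_maildir(Path.cwd())
```

Because the path is built with `pathlib`, the same call should work on macOS, Linux and Windows without editing a hard-coded string.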

## Approach to this Project

### Research Questions
These research questions give our analysis of the Enron Email Corpus a narrative and storyline.

We take a broad-to-narrow approach to the narrative that we want to build.

The research questions that we hope to answer in this project are:
1. Can we identify topical shifts before, leading up to, and after the exposure of the fraud?
2. Can we detect sentiment shifts and anomalies in communication patterns leading up to, during, and after the exposure of the fraud?
3. Which key individuals were involved in the key discussions, and are there anomalies in their communication patterns?

#### Timeline

We intend to split our analysis into 3 defined timeframes:
    1. Pre-Crisis
    2. During Crisis
    3. Post-Crisis

[TODO: decide the dates that separate the timeframes. One option is to simply follow the timeline of events. The assumptions behind the chosen dates must be stated in this part.]
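
Once the dates are agreed, the split itself is straightforward. The sketch below is only an illustration: the two cut-off dates are placeholder assumptions (the Q3 2001 loss announcement and the Chapter 11 filing), not dates the team has chosen, and `label_period` is a hypothetical helper name.

```python
from datetime import date

# ASSUMPTION: placeholder cut-offs for illustration, still to be decided.
CRISIS_START = date(2001, 10, 16)  # Enron announces its Q3 2001 loss
CRISIS_END = date(2001, 12, 2)     # Enron files for Chapter 11 bankruptcy


def label_period(sent_on: date) -> str:
    """Assign an email date to one of the three analysis timeframes."""
    if sent_on < CRISIS_START:
        return "Pre-Crisis"
    if sent_on <= CRISIS_END:
        return "During Crisis"
    return "Post-Crisis"


print(label_period(date(2000, 11, 16)))  # Pre-Crisis
print(label_period(date(2001, 11, 1)))   # During Crisis
print(label_period(date(2002, 1, 15)))   # Post-Crisis
```

Keeping the cut-offs as named constants means the stated assumptions live in one place and can be changed without touching the labelling logic.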

These research questions will be answered using a variety of NLP and unsupervised learning techniques.

[TODO: decide which techniques to use.]


### Overall Steps to Tackle this Project
1. Data Extraction
2. Data Cleaning & Preprocessing
3. Sender Frequency Analysis
4. Visualisation
    - Word Cloud & Bar Charts for the top 20% of senders
    - Network Graph
5. Unsupervised Techniques
    - LDA Topic Modeling - Find out key topics from the top 20% of senders
    - Temporal Analysis - Segment the emails by quarters. Look at the way communication changes over time.
    - Hierarchical Clustering - Group senders based on communication patterns
    - Anomaly Detection - Look for outliers and abnormal communication patterns.
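
Several of the steps above depend on the "top 20% of senders". The sketch below shows one possible reading of that phrase: rank the distinct senders by the number of emails sent and keep the busiest fifth. The function name `top_senders` and the example addresses are hypothetical.

```python
import math
from collections import Counter


def top_senders(senders, fraction=0.2):
    """Return the busiest `fraction` of distinct senders as (sender, count) pairs."""
    counts = Counter(senders)
    keep = max(1, math.ceil(len(counts) * fraction))
    return counts.most_common(keep)


# Hypothetical sender column, for illustration only.
example = (
    ["a@example.com"] * 5
    + ["b@example.com"] * 3
    + ["c@example.com"] * 2
    + ["d@example.com", "e@example.com"]
)
print(top_senders(example))  # [('a@example.com', 5)]
```

An alternative reading would keep the senders who together account for 20% of all emails; whichever definition is used should be stated as an assumption.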

## Section 1: Data Extraction
- Extract the emails from the unstructured raw folder


In [1]:
# Import the relevant libraries required for Section 1
import os # Required for directory traversal
import pandas as pd
import email
from email import policy
from email.parser import BytesParser
from collections import defaultdict
from itertools import islice


In [2]:
# Set maildir_path to the location of the dataset on your own system
# ! Note that maildir should be in the same directory as your project; the path format changes if you are using Windows
maildir_path = '/Users/Dylan/Documents/IND5003/Projects/maildir'

In [3]:
# Create a list of all the directories in the maildir for sanity check
maildir_list = os.listdir(maildir_path)
print(maildir_list)

['arnold-j', 'phanis-s', 'lavorato-j', 'stclair-c', 'townsend-j', 'forney-j', 'symes-k', 'reitmeyer-j', 'hyatt-k', 'steffes-j', 'kaminski-v', 'wolfe-j', 'mcconnell-m', 'skilling-j', 'zipper-a', 'shively-h', 'donoho-l', 'sanchez-m', 'delainey-d', 'germany-c', 'whalley-l', 'buy-r', 'harris-s', 'tholt-j', 'cash-m', 'sanders-r', '.DS_Store', 'staab-t', 'semperger-c', 'mccarty-d', 'mclaughlin-e', 'ring-a', 'stokley-c', 'hain-m', 'weldon-c', 'ring-r', 'farmer-d', 'sager-e', 'zufferli-j', 'ybarbo-p', 'watson-k', 'dasovich-j', 'arora-h', 'slinger-r', 'martin-t', 'storey-g', 'ruscitti-k', 'shankman-j', 'schwieger-j', 'perlingiere-d', 'saibi-e', 'griffith-j', 'meyers-a', 'grigsby-m', 'taylor-m', 'rapp-b', 'causholli-m', 'derrick-j', 'bass-e', 'south-s', 'salisbury-h', 'beck-s', 'tycholiz-b', 'shackleton-s', 'kitchen-l', 'merriss-s', 'blair-l', 'quenet-j', 'lokey-t', 'williams-j', 'panus-s', 'gang-l', 'hendrickson-s', 'schoolcraft-d', 'mann-k', 'kuykendall-t', 'allen-p', 'giron-d', 'lewis-a', 'jo
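
The listing above also contains the macOS system file `.DS_Store`. If only the per-user mailbox folders are wanted, a filter along the following lines could be used. This is a sketch; `list_mailboxes` is a hypothetical helper, not part of the original notebook.

```python
import os


def list_mailboxes(maildir_path):
    """Return the sorted names of the mailbox folders inside maildir_path."""
    return sorted(
        entry.name
        for entry in os.scandir(maildir_path)
        if entry.is_dir() and not entry.name.startswith(".")
    )
```

Sorting the names also makes the sanity check reproducible, since `os.listdir` returns entries in arbitrary order.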

In [4]:
# def parse_email(file_path):
#     try:
#         with open(file_path, 'rb') as f:
#             msg = BytesParser(policy=policy.default).parse(f)
        
#         # Extract fields from the email
#         email_from = msg['From']
#         email_to = msg['To']
#         email_date = msg['Date']
#         email_subject = msg['Subject']
#         email_body = msg.get_body(preferencelist=('plain')).get_content() if msg.get_body(preferencelist=('plain')) else ''
        
#         return [email_from, email_to, email_date, email_subject, email_body]
#     except Exception as e:
#         print(f"Error parsing file {file_path}: {e}")
#         return None

# def batch_iterator(iterator, batch_size):
#     """Yield batches of specified size from an iterator."""
#     while True:
#         batch = list(islice(iterator, batch_size))
#         if not batch:
#             break
#         yield batch

# def load_emails(maildir_path, batch_size=10, max_emails=50):
#     email_data = []
#     file_paths = []

#     # Walk through the directory to collect file paths
#     for root, dirs, files in os.walk(maildir_path):
#         for file in files:
#             if file == '.DS_Store' or file.startswith('.'):
#                 continue  # Skip system files and hidden files
#             file_paths.append(os.path.join(root, file))
#             if len(file_paths) >= max_emails:
#                 break
#         if len(file_paths) >= max_emails:
#             break

#     # Process emails in batches
#     for batch in batch_iterator(iter(file_paths), batch_size):
#         batch_data = []
#         for file_path in batch:
#             result = parse_email(file_path)
#             if result is not None:
#                 batch_data.append(result)
        
#         # Append batch data to the main list
#         email_data.extend(batch_data)

#     # Create a DataFrame from the extracted data
#     df = pd.DataFrame(email_data, columns=['From', 'To', 'Date', 'Subject', 'Body'])
#     return df

# # Load and parse emails
# emails_df = load_emails(maildir_path, batch_size=10, max_emails=50)

# # Display the DataFrame
# print(emails_df.head())

### Loading the Data into a Pandas DF

In [9]:

# ! This is a very large dataset (over 500,000 emails) and this cell will take a long time to run
# ! Do not run it unless you need to rebuild the DataFrame; it may crash your computer
def parse_email(file_path):
    """Parse one raw email file into [From, To, Date, Subject, Body]; return None on failure."""
    try:
        with open(file_path, 'rb') as f:
            msg = BytesParser(policy=policy.default).parse(f)

        # Extract the header fields from the email
        email_from = msg['From']
        email_to = msg['To']
        email_date = msg['Date']
        email_subject = msg['Subject']

        # Keep only the plain-text body; preferencelist expects a tuple
        body_part = msg.get_body(preferencelist=('plain',))
        email_body = body_part.get_content() if body_part is not None else ''

        return [email_from, email_to, email_date, email_subject, email_body]
    except Exception as e:
        # Report the file and skip it so that one bad email does not stop the run
        print(f"Error parsing file {file_path}: {e}")
        return None

def batch_iterator(iterator, batch_size):
    """Yield batches of specified size from an iterator."""
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        yield batch

def load_emails(maildir_path, batch_size=1000):
    """Walk maildir_path, parse every email in batches, and return a DataFrame."""
    email_data = []
    file_paths = []

    # Walk through the directory to collect file paths
    for root, dirs, files in os.walk(maildir_path):
        for file in files:
            if file.startswith('.'):
                continue  # Skip hidden and system files such as .DS_Store
            file_paths.append(os.path.join(root, file))

    # Process emails in batches, keeping only the files that parsed successfully
    for batch in batch_iterator(iter(file_paths), batch_size):
        email_data.extend(
            result
            for result in map(parse_email, batch)
            if result is not None
        )

    # Create a DataFrame from the extracted data
    df = pd.DataFrame(email_data, columns=['From', 'To', 'Date', 'Subject', 'Body'])
    return df

# Load and parse emails
emails_df = load_emails(maildir_path, batch_size=1000)

# Display the DataFrame
print(emails_df.head())

Error parsing file /Users/Dylan/Documents/IND5003/Projects/maildir/kitchen-l/sent_items/24.: 'ValueTerminal' object does not support item assignment
Error parsing file /Users/Dylan/Documents/IND5003/Projects/maildir/kitchen-l/_americas/netco_eol/83.: 'ValueTerminal' object does not support item assignment
Error parsing file /Users/Dylan/Documents/IND5003/Projects/maildir/kitchen-l/_americas/netco_eol/82.: 'ValueTerminal' object does not support item assignment
Error parsing file /Users/Dylan/Documents/IND5003/Projects/maildir/kitchen-l/_americas/esvl/87.: 'ValueTerminal' object does not support item assignment
Error parsing file /Users/Dylan/Documents/IND5003/Projects/maildir/kitchen-l/_americas/netco_restart/3.: 'ValueTerminal' object does not support item assignment
                        From  \
0            msagel@home.com   
1    slafontaine@globalp.com   
2    iceoperations@intcx.com   
3  jeff.youngflesh@enron.com   
4  caroline.abramo@enron.com   
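
The `'ValueTerminal' object does not support item assignment` errors above appear to come from the structured header parsing that `policy.default` performs. One possible fallback for the affected files is to re-parse them with `policy.compat32`, which returns header values as plain strings. The sketch below is an assumption-laden illustration, not the notebook's method: `parse_email_lenient` is a hypothetical helper, the sample message is invented, and unlike `get_body` it takes the first `text/plain` part without checking for attachments.

```python
from email import policy
from email.parser import BytesParser


def parse_email_lenient(raw):
    """Parse raw message bytes with policy.compat32 into [From, To, Date, Subject, Body]."""
    msg = BytesParser(policy=policy.compat32).parsebytes(raw)

    # walk() yields the message itself first, then any nested parts.
    part = next(
        (p for p in msg.walk() if p.get_content_type() == "text/plain"),
        None,
    )

    body = ""
    if part is not None:
        payload = part.get_payload(decode=True)  # bytes, or None for containers
        if payload is not None:
            charset = part.get_content_charset() or "utf-8"
            try:
                body = payload.decode(charset, errors="replace")
            except LookupError:  # unknown charset label
                body = payload.decode("latin-1")

    def header(name):
        value = msg[name]
        return None if value is None else str(value)

    return [header("From"), header("To"), header("Date"), header("Subject"), body]


# Invented sample message, for illustration only.
sample = (
    b"From: sender@example.com\r\n"
    b"To: recipient@example.com\r\n"
    b"Date: Thu, 16 Nov 2000 09:30:00 -0800\r\n"
    b"Subject: Status\r\n"
    b"\r\n"
    b"Hello there.\r\n"
)
print(parse_email_lenient(sample)[:4])
```

In `parse_email`, such a helper could be called from the `except` branch with the file's bytes, so that only the emails that fail under `policy.default` take the lenient path.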

                            

In [10]:
# Convert emails_df to a CSV file
# Save the DataFrame as a CSV file in the specified directory
# (the file name matches the one that is loaded in the next cell)

#emails_df.to_csv('/Users/Dylan/Documents/IND5003/Projects/enron_emails_uncleaned.csv', index=False)


In [2]:
# Load the CSV file back into a DataFrame

# PLEASE CHANGE THE PATH TO THE DIRECTORY ON YOUR OWN SYSTEM

enron_uncleaned_emails = pd.read_csv('/Users/Dylan/Documents/IND5003/Projects/enron_emails_uncleaned.csv')
print(enron_uncleaned_emails.head())

                        From  \
0            msagel@home.com   
1    slafontaine@globalp.com   
2    iceoperations@intcx.com   
3  jeff.youngflesh@enron.com   
4  caroline.abramo@enron.com   

                                                  To  \
0                                  jarnold@enron.com   
1                              john.arnold@enron.com   
2  icehelpdesk@intcx.com, internalmarketing@intcx...   
3  anthony.gilmore@enron.com, colleen.koenig@enro...   
4                             mike.grigsby@enron.com   

                              Date  \
0  Thu, 16 Nov 2000 09:30:00 -0800   
1  Fri, 08 Dec 2000 05:05:00 -0800   
2  Tue, 15 May 2001 09:43:00 -0700   
3  Mon, 27 Nov 2000 01:49:00 -0800   
4  Tue, 12 Dec 2000 09:33:00 -0800   

                                             Subject  \
0                                             Status   
1                                 re:summer inverses   
2                      The WTI Bullet swap contracts   
3  Invitation: EB

In [4]:
# Find "kitchen" in the "From" column

# This confirms that emails sent by Louise Kitchen are still present despite the earlier parsing errors caused by the encoding of some emails
kitchen_emails = enron_uncleaned_emails[enron_uncleaned_emails['From'].str.contains('kitchen', case=False, na=False)]
print(kitchen_emails)

                            From  \
617     louise.kitchen@enron.com   
874     louise.kitchen@enron.com   
910     louise.kitchen@enron.com   
926     louise.kitchen@enron.com   
5072    louise.kitchen@enron.com   
...                          ...   
503415  louise.kitchen@enron.com   
503507  louise.kitchen@enron.com   
509732  louise.kitchen@enron.com   
509790  louise.kitchen@enron.com   
509829  louise.kitchen@enron.com   

                                                       To  \
617                                 john.arnold@enron.com   
874     tim.belden@enron.com, f..calger@enron.com, m.....   
910     wes.colwell@enron.com, georgeanne.hodges@enron...   
926                                    c..bland@enron.com   
5072                              john.lavorato@enron.com   
...                                                   ...   
503415  rob.milnthorp@enron.com, f..calger@enron.com, ...   
503507  rob.milnthorp@enron.com, f..calger@enron.com, ...   
509732  k..allen@e

## Section 2: Data Preprocessing
### Start with the Cleaning
* Check for any nulls
* Fill the missing values with empty strings
* Remove the duplicates
* Format the dates
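
For the date-formatting step, the `Date` values follow the standard email date format (for example `Thu, 16 Nov 2000 09:30:00 -0800`), so the standard library can parse them. The sketch below is one possible approach: converting everything to UTC is our own assumption, and `to_utc` and `quarter_label` are hypothetical helper names.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def to_utc(date_header: str) -> datetime:
    """Parse an email Date header and convert it to UTC."""
    return parsedate_to_datetime(date_header).astimezone(timezone.utc)


def quarter_label(dt: datetime) -> str:
    """Return a label such as '2000Q4' for the quarter that contains dt."""
    return f"{dt.year}Q{(dt.month - 1) // 3 + 1}"


sent = to_utc("Thu, 16 Nov 2000 09:30:00 -0800")
print(sent.isoformat())     # 2000-11-16T17:30:00+00:00
print(quarter_label(sent))  # 2000Q4
```

Applied to the `Date` column (for instance with `Series.apply`), this would give comparable timestamps for the temporal analysis; headers that cannot be parsed raise an exception and would need to be handled separately.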



In [5]:
# Check for Nulls in Each Column
missing_values = enron_uncleaned_emails.isnull().sum()
missing_values_df = pd.DataFrame({'Column': missing_values.index, 'Missing Values': missing_values.values})
print(missing_values_df)

    Column  Missing Values
0     From               0
1       To           21847
2     Date               0
3  Subject           19187
4     Body               0


In [6]:
# Description of the DataFrame
enron_uncleaned_emails.describe()

                      From                    To                             Date  Subject                                               Body
count               517396                495549                           517396   498209                                             517396
unique               20326                 58556                           224119   159286                                             249020
top     kay.mann@enron.com  pete.davis@enron.com  Wed, 27 Jun 2001 16:02:00 -0700      RE:  As you know, Enron Net Works (ENW) and Enron G...
freq                 16735                  9155                             1118     6477                                                112


In [7]:
# Fill the missing values with empty strings
enron_cleaned_emails = enron_uncleaned_emails.fillna('')

In [8]:
# Post-cleaning check
missing_values_check = enron_cleaned_emails.isnull().sum()
missing_values_df_check = pd.DataFrame({'Column': missing_values_check.index, 'Missing Values': missing_values_check.values})
print(missing_values_df_check)

    Column  Missing Values
0     From               0
1       To               0
2     Date               0
3  Subject               0
4     Body               0


In [12]:
# Describe the cleaned DataFrame
enron_cleaned_emails.describe()

                      From      To                             Date  Subject                                               Body
count               517396  517396                           517396   517396                                             517396
unique               20326   58557                           224119   159287                                             249020
top     kay.mann@enron.com          Wed, 27 Jun 2001 16:02:00 -0700           As you know, Enron Net Works (ENW) and Enron G...
freq                 16735   21847                             1118    19187                                                112


* The output above shows that only 249020 of the 517396 emails have a unique body.
    * This means that ~51.9% of the emails in the uncleaned DataFrame are redundant copies of a body that already appears elsewhere.
    * Removing these duplicates ensures that the subsequent analytical metrics (when performing LDA, TF-IDF, Word2Vec) are not inflated.
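
The percentage quoted above can be checked directly from the two counts reported by `describe()`:

```python
total_emails = 517396   # rows in the DataFrame
unique_bodies = 249020  # distinct values in the 'Body' column

redundant = total_emails - unique_bodies
share = redundant / total_emails

print(redundant)       # 268376
print(f"{share:.1%}")  # 51.9%
```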

In [10]:
# Remove duplicate emails based on the 'Body' column, keeping only the first occurrence
enron_cleaned_emails_body_unique = enron_cleaned_emails.drop_duplicates(subset=['Body'], keep='first')

# Describe the DataFrame after removing duplicates
enron_cleaned_emails_body_unique.describe()

                           From      To                             Date  Subject                                               Body
count                    249020  249020                           249020   249020                                             249020
unique                    20129   58069                           219084   158462                                             249020
top     jeff.dasovich@enron.com          Wed, 27 Jun 2001 16:02:00 -0700           John:\n?\nI'm not really sure what happened be...
freq                       5486    9024                             1118     8577                                                  1


In [11]:
# The output of the code block above shows that Jeff Dasovich sent the most emails
# Explore the number of emails he sent
# Filter emails where 'From' is 'jeff.dasovich@enron.com'
jeff_emails = enron_cleaned_emails_body_unique[enron_cleaned_emails_body_unique['From'] == 'jeff.dasovich@enron.com']

# Count the number of emails he sent
jeff_emails_count = jeff_emails.shape[0]

# Display the count
print(f"Jeff Dasovich sent {jeff_emails_count} emails.")

# Display the first few rows of Jeff's emails
jeff_emails.head()


Jeff Dasovich sent 5486 emails.


                          From  \
27219  jeff.dasovich@enron.com
27277  jeff.dasovich@enron.com
27332  jeff.dasovich@enron.com
27334  jeff.dasovich@enron.com
27369  jeff.dasovich@enron.com

                                                      To  \
27219                               d..steffes@enron.com
27277  ginger.dernehl@enron.com, d..steffes@enron.com...
27332         d..steffes@enron.com, susan.mara@enron.com
27334         d..steffes@enron.com, susan.mara@enron.com
27369                              c..williams@enron.com

                                  Date  \
27219  Tue, 23 Oct 2001 14:25:50 -0700
27277  Wed, 17 Oct 2001 14:43:57 -0700
27332  Tue, 23 Oct 2001 17:02:14 -0700
27334  Mon, 22 Oct 2001 11:52:32 -0700
27369  Fri, 26 Oct 2001 11:28:06 -0700

                                                 Subject  \
27219                                                RE:
27277          RE: Golf - November Direct Report Meeting
27332                           FW: Stipulation Comments
27334  FW: Angelides Oct. 19th Letter to L. Lynch Urg...
27369                    RE: Edison meet and confer call

                                                    Body
27219  thanks.\n\n -----Original Message-----\nFrom: ...
27277  Yes.\n\n -----Original Message-----\nFrom: \tD...
27332  I asked Mike Day to make sure that we were kep...
27334   FYI. Here's a note I sent to the large custom...
27369    Let's try this. It's all inter-related. PG&E...
