# Word Cloud Prototypes | Team Enron Vizards

### Course: W209 - Section 5
### Author: Akiko Iwamizu

---------------------------------------

## About The Data

The Enron email dataset, available on [Kaggle](https://www.kaggle.com/datasets/wcukierski/enron-email-dataset), contains approximately 500,000 emails generated by employees of the Enron Corporation. The Federal Energy Regulatory Commission obtained it during its investigation of Enron's collapse.

We performed basic data preparation and manipulation before beginning EDA. These steps are not included in this notebook but can be found in our team repo on [GitHub](https://github.com/akikoiwamizu/enron-vizards).

We begin the exploratory data analysis by importing the required libraries and the cleaned dataset.

In [1]:
import altair as alt
import nltk
import pandas as pd
import os
import re
import numpy as np

from ast import literal_eval
from matplotlib import pyplot as plt
from ipywidgets import widgets, interact, interactive, fixed
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer as stemmer
from nltk.stem import WordNetLemmatizer as lemmatizer
from nltk import pos_tag

from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

pd.set_option("display.max_columns", None)
pd.set_option("mode.chained_assignment", None)

alt.data_transformers.disable_max_rows()

nltk.download("punkt") # necessary for tokenization
nltk.download("wordnet") # necessary for lemmatization
nltk.download("stopwords") # necessary for removal of stop words
nltk.download("averaged_perceptron_tagger") # necessary for POS tagging
nltk.download("maxent_ne_chunker") # necessary for entity extraction
nltk.download("words") # word list required by the entity chunker

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/akikoiwamizu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/akikoiwamizu/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/akikoiwamizu/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/akikoiwamizu/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /Users/akikoiwamizu/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     /Users/akikoiwamizu/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

In [2]:
print(os.listdir("../../data"))

['Enron_data_cleansing_colab.ipynb', 'key_players_clean.csv', '.ipynb_checkpoints']


In [11]:
df = pd.read_csv("../../data/emails_clean.csv", 
                 index_col=False,
                 converters=
                 {
                     "X-From": literal_eval,
                     "X-To": literal_eval,
                     "X-cc": literal_eval,
                     "X-bcc": literal_eval
                 }
)
df.shape


(517092, 22)
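
The `converters` argument is what turns the stringified lists in the `X-*` columns back into Python lists. A minimal sketch of that behavior on an in-memory CSV, assuming the lists were saved as their string representation (the two rows below are made up for illustration):

```python
import io
from ast import literal_eval

import pandas as pd

# Hypothetical two-row CSV; list columns are stored as their string repr.
raw = io.StringIO('''X-From,X-To
"['Phillip K Allen']","['Tim Belden']"
"['Phillip K Allen']","[]"
''')

demo = pd.read_csv(raw, converters={"X-From": literal_eval, "X-To": literal_eval})

# Each cell is now a real list rather than a string.
print(type(demo.loc[0, "X-From"]), demo.loc[1, "X-To"])
```

Without the converters, each cell would remain a plain string such as `"['Tim Belden']"`, and list operations like indexing or concatenation would act on characters instead of names.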

After loading in the cleaned dataset, let's take a look at the initial dataframe.

In [12]:
df.head()

Unnamed: 0.1,Unnamed: 0,Date,From,To,Subject,X-From,X-To,X-cc,X-bcc,X-Folder,X-Origin,Body,Cc,Bcc,Time,Attendees,Re,day,month,year,time,tzdiff
0,0,"Mon, 14 May 2001 16:39:00 -0700 (PDT)",phillip.allen@enron.com,tim.belden@enron.com,,[Phillip K Allen],[Tim Belden],[],[],"\Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Se...",Allen-P,Here is our forecast\n\n,,,,,,14,5,2001,16:39:00,-700
1,1,"Fri, 4 May 2001 13:51:00 -0700 (PDT)",phillip.allen@enron.com,john.lavorato@enron.com,Re:,[Phillip K Allen],[John J Lavorato],[],[],"\Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Se...",Allen-P,Traveling to have a business meeting takes the...,,,,,,4,5,2001,13:51:00,-700
2,2,"Wed, 18 Oct 2000 03:00:00 -0700 (PDT)",phillip.allen@enron.com,leah.arsdall@enron.com,Re: test,[Phillip K Allen],[Leah Van Arsdall],[],[],\Phillip_Allen_Dec2000\Notes Folders\'sent mail,Allen-P,test successful. way to go!!!,,,,,,18,10,2000,03:00:00,-700
3,3,"Mon, 23 Oct 2000 06:13:00 -0700 (PDT)",phillip.allen@enron.com,randall.gay@enron.com,,[Phillip K Allen],[Randall L Gay],[],[],\Phillip_Allen_Dec2000\Notes Folders\'sent mail,Allen-P,"Randy,\n\n Can you send me a schedule of the s...",,,,,,23,10,2000,06:13:00,-700
4,4,"Thu, 31 Aug 2000 05:07:00 -0700 (PDT)",phillip.allen@enron.com,greg.piper@enron.com,Re: Hello,[Phillip K Allen],[Greg Piper],[],[],\Phillip_Allen_Dec2000\Notes Folders\'sent mail,Allen-P,Let's shoot for Tuesday at 11:45.,,,,,,31,8,2000,05:07:00,-700


### Adding useful date dimensions

In [6]:
# Create date/time fields that we will need for this viz
df["hour"] = pd.to_datetime(df["time"]).dt.hour.values
df["date_clean"] = pd.to_datetime(df[["year", "month", "day"]])
df["weekday"] = df["date_clean"].dt.day_name()
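
As a small illustration of these derived fields, the sketch below rebuilds them on two hypothetical rows that mirror the `year`, `month`, `day`, and `time` columns. The explicit `format` argument is an assumption added here for clarity; the cell above relies on pandas inferring it.

```python
import pandas as pd

# Hypothetical rows mirroring the `day`, `month`, `year`, and `time` columns.
demo = pd.DataFrame({
    "year": [2001, 2000],
    "month": [5, 10],
    "day": [14, 18],
    "time": ["16:39:00", "03:00:00"],
})

# to_datetime assembles a date from columns named year/month/day.
demo["date_clean"] = pd.to_datetime(demo[["year", "month", "day"]])
demo["weekday"] = demo["date_clean"].dt.day_name()
# An explicit format avoids ambiguity when parsing time-only strings.
demo["hour"] = pd.to_datetime(demo["time"], format="%H:%M:%S").dt.hour

print(demo[["date_clean", "weekday", "hour"]])
```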

### Standardize names

In [7]:
df_clean = df.copy()

# Email authors
authors = pd.Series([item for row in df_clean["X-From"] for item in row if item])
authors = authors.apply(lambda x: x.split("@")[0].replace("\"", "").replace("'", "").replace(".", " ").replace("_", " ").strip().title())
authors = pd.DataFrame({"name": authors.value_counts().index,
                        "sent": authors.value_counts()}).reset_index(drop=True).sort_values(by="sent", ascending=False)

# Email recipients
all_recipients = df_clean["X-To"] + df_clean["X-cc"] + df_clean["X-bcc"]
recipients = pd.Series([item for row in all_recipients for item in row if item])
recipients = recipients.apply(lambda x: x.split("@")[0].replace("\"", "").replace("'", "").replace(".", " ").replace("_", " ").strip().title())
recipients = pd.DataFrame({"name": recipients.value_counts().index,
                           "received": recipients.value_counts()}).reset_index(drop=True).sort_values(by="received", ascending=False)


In [8]:
# Standardize all name variations for the key players in the scandal.
# Sets are used so that `name in ...` tests the names themselves
# (the `in` operator on a Series checks its index labels instead).
andrew_fastow = set(authors.name[authors.name.apply(lambda x: "Andrew S Fastow" in x)])
jeff_skilling = set(authors.name[authors.name.apply(lambda x: "Jeff Skilling" in x)])
ken_lay = set(authors.name[authors.name.apply(lambda x: "Ken Lay" in x)])
sherron_watkins = set(authors.name[authors.name.apply(lambda x: "Sherron Watkins" in x)])
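
The same chain of `split`/`replace`/`strip`/`title` calls is applied to authors and recipients in this notebook. One possible refactor, sketched here with a hypothetical helper name `normalize_name` (not part of the original notebook), keeps the steps in one place:

```python
def normalize_name(raw: str) -> str:
    """Strip the domain, quotes, dots, and underscores, then title-case."""
    local = raw.split("@")[0]
    for ch in ('"', "'"):
        local = local.replace(ch, "")
    for ch in (".", "_"):
        local = local.replace(ch, " ")
    return local.strip().title()


print(normalize_name("Enron Global Technology@ENRON"))  # Enron Global Technology
print(normalize_name("phillip.allen@enron.com"))        # Phillip Allen
```

With such a helper, each cleaning step could be written as `series.apply(normalize_name)` instead of repeating the lambda.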


In [9]:
# Clean dataset with correct names
def clean_names(name_list):
    for i, name in enumerate(name_list):
        if name in andrew_fastow:
            name_list[i] = "Andrew Fastow"
        elif name in jeff_skilling:
            name_list[i] = "Jeff Skilling"
        elif name in ken_lay:
            name_list[i] = "Kenneth Lay"
        elif name in sherron_watkins:
            name_list[i] = "Sherron Watkins"
    return name_list
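
One pandas behavior is worth keeping in mind for membership checks like the ones in `clean_names`: the `in` operator on a Series tests its index labels, not its values. A short sketch with made-up names:

```python
import pandas as pd

names = pd.Series(["Ken Lay", "Ken Lay@Enron"])

print("Ken Lay" in names)       # False: `in` looks at the index (0, 1)
print(0 in names)               # True: 0 is an index label
print("Ken Lay" in set(names))  # True: a set holds the values themselves
```

Converting the lookup Series to a set (or testing against `.values`) makes the check compare names, and set lookups are also faster when applied across many rows.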
    

In [10]:
df_clean["X-From"] = df_clean["X-From"].apply(clean_names)
df_clean["X-To"] = df_clean["X-To"].apply(clean_names)
df_clean["X-cc"] = df_clean["X-cc"].apply(clean_names)
df_clean["X-bcc"] = df_clean["X-bcc"].apply(clean_names)
df_clean.head()

Unnamed: 0.1,Unnamed: 0,Date,From,To,Subject,X-From,X-To,X-cc,X-bcc,X-Folder,X-Origin,Body,Cc,Bcc,Time,Attendees,Re,day,month,year,time,tzdiff,hour,date_clean,weekday
0,0,"Mon, 14 May 2001 16:39:00 -0700 (PDT)",phillip.allen@enron.com,tim.belden@enron.com,,[Phillip K Allen],[Tim Belden],[],[],"\Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Se...",Allen-P,Here is our forecast\n\n,,,,,,14,5,2001,16:39:00,-700,16,2001-05-14,Monday
1,1,"Fri, 4 May 2001 13:51:00 -0700 (PDT)",phillip.allen@enron.com,john.lavorato@enron.com,Re:,[Phillip K Allen],[John J Lavorato],[],[],"\Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Se...",Allen-P,Traveling to have a business meeting takes the...,,,,,,4,5,2001,13:51:00,-700,13,2001-05-04,Friday
2,2,"Wed, 18 Oct 2000 03:00:00 -0700 (PDT)",phillip.allen@enron.com,leah.arsdall@enron.com,Re: test,[Phillip K Allen],[Leah Van Arsdall],[],[],\Phillip_Allen_Dec2000\Notes Folders\'sent mail,Allen-P,test successful. way to go!!!,,,,,,18,10,2000,03:00:00,-700,3,2000-10-18,Wednesday
3,3,"Mon, 23 Oct 2000 06:13:00 -0700 (PDT)",phillip.allen@enron.com,randall.gay@enron.com,,[Phillip K Allen],[Randall L Gay],[],[],\Phillip_Allen_Dec2000\Notes Folders\'sent mail,Allen-P,"Randy,\n\n Can you send me a schedule of the s...",,,,,,23,10,2000,06:13:00,-700,6,2000-10-23,Monday
4,4,"Thu, 31 Aug 2000 05:07:00 -0700 (PDT)",phillip.allen@enron.com,greg.piper@enron.com,Re: Hello,[Phillip K Allen],[Greg Piper],[],[],\Phillip_Allen_Dec2000\Notes Folders\'sent mail,Allen-P,Let's shoot for Tuesday at 11:45.,,,,,,31,8,2000,05:07:00,-700,5,2000-08-31,Thursday


### Clean up email content

In [20]:
# Imports for text cleaning; the spaCy model is loaded below
import re
import string
import nltk
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
import spacy

nlp = spacy.load('en_core_web_sm')

def clean_string(text, stem="None"):

    final_string = ""

    # Make lower
    text = text.lower()

    # Remove line breaks
    text = re.sub(r'\n', ' ', text)

    # Remove punctuation
    translator = str.maketrans('', '', string.punctuation)
    text = text.translate(translator)

    # Remove stop words
    text = text.split()
    useless_words = set(nltk.corpus.stopwords.words("english"))
    useless_words.update(["im", "cc", "subject", "re", "na", "original", "message", "image", "email", "address"])

    text_filtered = [word for word in text if word not in useless_words]

    # Remove numbers
    text_filtered = [re.sub(r'\w*\d\w*', '', w) for w in text_filtered]

    # Stem or Lemmatize
    if stem == 'Stem':
        stemmer = PorterStemmer() 
        text_stemmed = [stemmer.stem(y) for y in text_filtered]
    elif stem == 'Lem':
        lem = WordNetLemmatizer()
        text_stemmed = [lem.lemmatize(y) for y in text_filtered]
    elif stem == 'Spacy':
        text_filtered = nlp(' '.join(text_filtered))
        text_stemmed = [y.lemma_ for y in text_filtered]
    else:
        text_stemmed = text_filtered

    final_string = ' '.join(text_stemmed)

    return final_string
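
To make the punctuation and number steps of `clean_string` concrete, here is a standard-library-only sketch applied to one of the email bodies shown earlier. It deliberately skips stop-word removal and lemmatization, so it is not a replacement for the function above.

```python
import re
import string

sample = "Let's shoot for Tuesday at 11:45."

# Lowercase, then delete every ASCII punctuation character.
no_punct = sample.lower().translate(str.maketrans("", "", string.punctuation))
print(no_punct)  # lets shoot for tuesday at 1145

# Blank out any token that contains a digit.
tokens = [re.sub(r"\w*\d\w*", "", w) for w in no_punct.split()]
print(tokens)    # ['lets', 'shoot', 'for', 'tuesday', 'at', '']
```

Note that the number step leaves an empty string in place of the removed token. After `' '.join(...)` this shows up as a doubled or trailing space, which matches the extra spaces visible in some `Body_Clean` values.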

In [21]:
df_clean["Body_Clean"] = df_clean["Body"].apply(lambda x: clean_string(x, stem='Lem'))

In [22]:
df_clean.head()

Unnamed: 0.1,Unnamed: 0,Date,From,To,Subject,X-From,X-To,X-cc,X-bcc,X-Folder,X-Origin,Body,Cc,Bcc,Time,Attendees,Re,day,month,year,time,tzdiff,hour,date_clean,weekday,Body_Clean
0,0,"Mon, 14 May 2001 16:39:00 -0700 (PDT)",phillip.allen@enron.com,tim.belden@enron.com,,[Phillip K Allen],[Tim Belden],[],[],"\Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Se...",Allen-P,Here is our forecast\n\n,,,,,,14,5,2001,16:39:00,-700,16,2001-05-14,Monday,forecast
1,1,"Fri, 4 May 2001 13:51:00 -0700 (PDT)",phillip.allen@enron.com,john.lavorato@enron.com,Re:,[Phillip K Allen],[John J Lavorato],[],[],"\Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Se...",Allen-P,Traveling to have a business meeting takes the...,,,,,,4,5,2001,13:51:00,-700,13,2001-05-04,Friday,traveling business meeting take fun trip espec...
2,2,"Wed, 18 Oct 2000 03:00:00 -0700 (PDT)",phillip.allen@enron.com,leah.arsdall@enron.com,Re: test,[Phillip K Allen],[Leah Van Arsdall],[],[],\Phillip_Allen_Dec2000\Notes Folders\'sent mail,Allen-P,test successful. way to go!!!,,,,,,18,10,2000,03:00:00,-700,3,2000-10-18,Wednesday,test successful way go
3,3,"Mon, 23 Oct 2000 06:13:00 -0700 (PDT)",phillip.allen@enron.com,randall.gay@enron.com,,[Phillip K Allen],[Randall L Gay],[],[],\Phillip_Allen_Dec2000\Notes Folders\'sent mail,Allen-P,"Randy,\n\n Can you send me a schedule of the s...",,,,,,23,10,2000,06:13:00,-700,6,2000-10-23,Monday,randy send schedule salary level everyone sche...
4,4,"Thu, 31 Aug 2000 05:07:00 -0700 (PDT)",phillip.allen@enron.com,greg.piper@enron.com,Re: Hello,[Phillip K Allen],[Greg Piper],[],[],\Phillip_Allen_Dec2000\Notes Folders\'sent mail,Allen-P,Let's shoot for Tuesday at 11:45.,,,,,,31,8,2000,05:07:00,-700,5,2000-08-31,Thursday,let shoot tuesday


### Create data subsets for before and during the scandal period (from 2001-09-30)

In [23]:
# Create subsets of the email dataset before and after the scandal period
after_scandal_period = df_clean[df_clean["date_clean"] >= "2001-09-30"]
before_scandal_period = df_clean[df_clean["date_clean"] < "2001-09-30"]

In [24]:
len(after_scandal_period)

112931

In [25]:
len(before_scandal_period)

404161
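
The two counts add up to the full dataset (112,931 + 404,161 = 517,092), so the cutoff partitions the rows cleanly. A minimal sketch of the same split on three hypothetical dates, using a single boolean mask and its negation:

```python
import pandas as pd

demo = pd.DataFrame(
    {"date_clean": pd.to_datetime(["2001-09-29", "2001-09-30", "2002-01-11"])}
)

cutoff = pd.Timestamp("2001-09-30")
is_after = demo["date_clean"] >= cutoff

# The cutoff date itself falls in the "after" subset.
after, before = demo[is_after], demo[~is_after]
print(len(before), len(after))  # 1 2
```

One design difference to be aware of: with a negated mask, rows with a missing date would land in `before`, whereas the two explicit comparisons used above exclude them from both subsets.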

### Find the most common words in the email messages during the scandal period

In [30]:
# Email authors + their messages
email_messages_after = after_scandal_period[["date_clean", "From", "X-From", "Body", "Body_Clean"]].reset_index(drop=True)

# Clean up "From" field
email_messages_after["Name"] = email_messages_after["X-From"].str[0]
email_messages_after["Name"] = email_messages_after["Name"].apply(lambda x: x.split("@")[0].replace("\"", "").replace("'", "").replace(".", " ").replace("_", " ").strip().title())

# Add flag to indicate scandal period
email_messages_after["Scandal_Period"] = True
email_messages_after.head()


Unnamed: 0,date_clean,From,X-From,Body,Body_Clean,Name,Scandal_Period
0,2001-12-30,pallen70@hotmail.com,[phillip allen],"\n \n>From: ""Greg Thorse"" \n>To: \n>CC: ""Phill...",greg thorse phillip allen bishop corner date s...,Phillip Allen,True
1,2001-12-27,arsystem@mailman.enron.com,[ARSystem],This request has been pending your approval fo...,request pending approval day please click re...,Arsystem,True
2,2001-10-18,webmaster@earnings.com,[Earnings.com],"If you cannot read this email, please click he...",cannot read please click earningscom news ear...,Earnings Com,True
3,2001-10-18,discount@open2win.oi3.net,[Hotel Discounts],"\n[IMAGE] \t\t[IMAGE] \t\n\t\tDear PHILLIP, ...",dear phillip congratulation get hotel discoun...,Hotel Discounts,True
4,2001-10-18,no.address@enron.com,[Enron Global Technology@ENRON],"\nEarlier this week, Enron Global Technology a...",earlier week enron global technology announced...,Enron Global Technology,True


In [31]:
# Email authors + their messages
email_messages_before = before_scandal_period[["date_clean", "From", "X-From", "Body", "Body_Clean"]].reset_index(drop=True)

# Clean up "From" field
email_messages_before["Name"] = email_messages_before["X-From"].str[0]
email_messages_before["Name"] = email_messages_before["Name"].apply(lambda x: x.split("@")[0].replace("\"", "").replace("'", "").replace(".", " ").replace("_", " ").strip().title())

# Flag these rows as outside the scandal period
email_messages_before["Scandal_Period"] = False
email_messages_before.head()


Unnamed: 0,date_clean,From,X-From,Body,Body_Clean,Name,Scandal_Period
0,2001-05-14,phillip.allen@enron.com,[Phillip K Allen],Here is our forecast\n\n,forecast,Phillip K Allen,False
1,2001-05-04,phillip.allen@enron.com,[Phillip K Allen],Traveling to have a business meeting takes the...,traveling business meeting take fun trip espec...,Phillip K Allen,False
2,2000-10-18,phillip.allen@enron.com,[Phillip K Allen],test successful. way to go!!!,test successful way go,Phillip K Allen,False
3,2000-10-23,phillip.allen@enron.com,[Phillip K Allen],"Randy,\n\n Can you send me a schedule of the s...",randy send schedule salary level everyone sche...,Phillip K Allen,False
4,2000-08-31,phillip.allen@enron.com,[Phillip K Allen],Let's shoot for Tuesday at 11:45.,let shoot tuesday,Phillip K Allen,False
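
Once the bodies are cleaned, word frequencies can be tallied directly from the space-separated text. The sketch below uses `collections.Counter` on a few hypothetical strings standing in for the `Body_Clean` column; it is one straightforward way to count, not necessarily the approach used for the final visualization.

```python
from collections import Counter

# Hypothetical cleaned bodies standing in for the `Body_Clean` column.
bodies = ["test successful way go", "let shoot tuesday", "test schedule tuesday test"]

# Split each body on whitespace and count every word across all bodies.
counts = Counter(word for body in bodies for word in body.split())
print(counts.most_common(2))  # [('test', 3), ('tuesday', 2)]
```

A word-to-count mapping of this shape is also the kind of input that `WordCloud.generate_from_frequencies` accepts, which makes it a convenient intermediate for the word clouds.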


In [32]:
# Export snapshot of data + compress
compression_opts = dict(method='zip', archive_name='email_messages_before.csv')  
email_messages_before.to_csv('email_messages_before.zip', index=False, compression=compression_opts)  

In [33]:
# Export snapshot of data + compress
compression_opts = dict(method='zip', archive_name='email_messages_after.csv')  
email_messages_after.to_csv('email_messages_after.zip', index=False, compression=compression_opts)  
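
As a quick check on this export pattern, the sketch below writes a one-row hypothetical frame to a zip archive in a temporary directory and reads it back. The file names are placeholders chosen for the example.

```python
import os
import tempfile

import pandas as pd

demo = pd.DataFrame({"Name": ["Phillip K Allen"], "Scandal_Period": [False]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.zip")
    # archive_name sets the name of the CSV stored inside the zip file.
    demo.to_csv(path, index=False,
                compression=dict(method="zip", archive_name="demo.csv"))
    # On read, compression is inferred from the .zip extension.
    restored = pd.read_csv(path)

print(restored)
```

List-valued columns such as `X-From` would come back as strings after this round trip, so they would need `literal_eval` converters again on load.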

In [34]:
result = pd.concat([email_messages_before, email_messages_after])
result = result[["date_clean", "Name", "Body_Clean", "Scandal_Period"]].reset_index(drop=True)
result.drop_duplicates()  # returns a new DataFrame; without assignment, `result` keeps every row
result.head()

Unnamed: 0,date_clean,Name,Body_Clean,Scandal_Period
0,2001-05-14,Phillip K Allen,forecast,False
1,2001-05-04,Phillip K Allen,traveling business meeting take fun trip espec...,False
2,2000-10-18,Phillip K Allen,test successful way go,False
3,2000-10-23,Phillip K Allen,randy send schedule salary level everyone sche...,False
4,2000-08-31,Phillip K Allen,let shoot tuesday,False


In [35]:
len(result)

517092
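
This length equals the sum of the two subsets (404,161 + 112,931), which is consistent with how `drop_duplicates` works: it returns a new DataFrame and leaves the original untouched unless the result is assigned. A small sketch on made-up rows:

```python
import pandas as pd

demo = pd.DataFrame({"Name": ["A", "A", "B"], "Body_Clean": ["x", "x", "y"]})

demo.drop_duplicates()  # returns a new frame; `demo` itself is unchanged
print(len(demo))        # 3

# Assigning the result is what actually removes the repeated row.
deduped = demo.drop_duplicates().reset_index(drop=True)
print(len(deduped))     # 2
```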

In [36]:
# Export snapshot of data + compress
compression_opts = dict(method='zip', archive_name='result.csv')  
result.to_csv('result.zip', index=False, compression=compression_opts)  

### Create a subset of the email data to include only the key players of the scandal

In [37]:
key_names = ["Andrew Fastow", "Jeff Skilling", "Kenneth Lay", "Sherron Watkins"]

In [38]:
def get_key_player_name(row):
    # `row` is a single DataFrame row (used with apply(axis=1)).
    # Checks run in priority order, so the first matching player wins.
    if row["Name_From"] == "Jeff Skilling" or row["Name_To"] == "Jeff Skilling":
        return "Jeff Skilling"
    elif row["Name_From"] == "Kenneth Lay" or row["Name_To"] == "Kenneth Lay":
        return "Kenneth Lay"
    elif row["Name_From"] == "Andrew Fastow" or row["Name_To"] == "Andrew Fastow":
        return "Andrew Fastow"
    elif row["Name_From"] == "Sherron Watkins" or row["Name_To"] == "Sherron Watkins":
        return "Sherron Watkins"
    else:
        return "Other"
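
For larger frames, the same first-match-wins labeling can be expressed without a row-wise `apply`. The following is a sketch of one vectorized alternative using `np.select`, on three hypothetical rows; the `priority` list is an assumption that mirrors the order of the checks in the function above.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "Name_From": ["Jeff Skilling", "Phillip K Allen", "Kenneth Lay"],
    "Name_To": ["Kenneth Lay", "Kenneth Lay", "Tim Belden"],
})

# Order matters: np.select returns the choice for the first true condition.
priority = ["Jeff Skilling", "Kenneth Lay", "Andrew Fastow", "Sherron Watkins"]
conditions = [(demo["Name_From"] == n) | (demo["Name_To"] == n) for n in priority]

demo["Key_Player"] = np.select(conditions, priority, default="Other")
print(demo["Key_Player"].tolist())  # ['Jeff Skilling', 'Kenneth Lay', 'Kenneth Lay']
```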

In [46]:
# Email authors + their messages
key_players = df_clean[["date_clean", "From", "X-From", "X-To", "Body", "Body_Clean"]].reset_index(drop=True)

# Clean up the sender and first-recipient names
key_players["Name_From"] = key_players["X-From"].str[0]
key_players["Name_To"] = key_players["X-To"].str[0]
key_players["Name_From"] = key_players["Name_From"].apply(lambda x: x.split("@")[0].replace("\"", "").replace("'", "").replace(".", " ").replace("_", " ").strip().title())
key_players["Name_To"] = key_players["Name_To"].apply(lambda x: x.split("@")[0].replace("\"", "").replace("'", "").replace(".", " ").replace("_", " ").strip().title())
key_players.head()


Unnamed: 0,date_clean,From,X-From,X-To,Body,Body_Clean,Name_From,Name_To
0,2001-05-14,phillip.allen@enron.com,[Phillip K Allen],[Tim Belden],Here is our forecast\n\n,forecast,Phillip K Allen,Tim Belden
1,2001-05-04,phillip.allen@enron.com,[Phillip K Allen],[John J Lavorato],Traveling to have a business meeting takes the...,traveling business meeting take fun trip espec...,Phillip K Allen,John J Lavorato
2,2000-10-18,phillip.allen@enron.com,[Phillip K Allen],[Leah Van Arsdall],test successful. way to go!!!,test successful way go,Phillip K Allen,Leah Van Arsdall
3,2000-10-23,phillip.allen@enron.com,[Phillip K Allen],[Randall L Gay],"Randy,\n\n Can you send me a schedule of the s...",randy send schedule salary level everyone sche...,Phillip K Allen,Randall L Gay
4,2000-08-31,phillip.allen@enron.com,[Phillip K Allen],[Greg Piper],Let's shoot for Tuesday at 11:45.,let shoot tuesday,Phillip K Allen,Greg Piper


In [47]:
# Create a subset of emails sent by, or first addressed to, key players (narrowed here to Skilling and Lay)
key_names = ["Jeff Skilling", "Kenneth Lay"]
key_players = key_players.loc[key_players["Name_From"].isin(key_names) | 
                              key_players["Name_To"].isin(key_names)].reset_index(drop=True)
key_players["Key_Player"] = key_players.apply(get_key_player_name, axis=1)
key_players["Scandal_Period"] = key_players["date_clean"] >= "2001-09-30"
key_players = key_players[["date_clean", "Key_Player", "Body_Clean", "Scandal_Period"]]

key_players_clean = key_players.drop_duplicates(keep=False).reset_index(drop=True)
key_players_clean.head()

Unnamed: 0,date_clean,Key_Player,Body_Clean,Scandal_Period
0,2001-07-14,Kenneth Lay,dear dr lay name iris mack former bos vince ka...,False
1,2002-01-11,Kenneth Lay,please advised copy congressional subpoena iss...,True
2,2002-01-09,Kenneth Lay,respond january later pm enron corp circula...,True
3,2001-12-17,Kenneth Lay,following schedule event board director meetin...,True
4,2001-12-11,Kenneth Lay,agenda meeting board director enron corp pm c...,True


In [48]:
len(key_players_clean)

1246
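
The `keep=False` argument used for `key_players_clean` affects this count: it drops every copy of a duplicated row, rather than keeping one representative. A minimal sketch of the difference on made-up data:

```python
import pandas as pd

demo = pd.DataFrame({"Body_Clean": ["a", "a", "b"]})

print(len(demo.drop_duplicates()))            # 2: one copy of "a" is kept
print(len(demo.drop_duplicates(keep=False)))  # 1: both copies of "a" are dropped
```

If the goal were to keep one copy of each repeated message, the default `keep="first"` would be the option to use instead.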

In [50]:
key_players_clean.head(10)

Unnamed: 0,date_clean,Key_Player,Body_Clean,Scandal_Period
0,2001-07-14,Kenneth Lay,dear dr lay name iris mack former bos vince ka...,False
1,2002-01-11,Kenneth Lay,please advised copy congressional subpoena iss...,True
2,2002-01-09,Kenneth Lay,respond january later pm enron corp circula...,True
3,2001-12-17,Kenneth Lay,following schedule event board director meetin...,True
4,2001-12-11,Kenneth Lay,agenda meeting board director enron corp pm c...,True
5,2001-11-19,Kenneth Lay,regard bjrn hagelmann director rac market risk...,True
6,2001-08-16,Kenneth Lay,attached enron broadband notification memo aug...,False
7,2001-08-17,Kenneth Lay,attached enron america gastrading violation me...,False
8,2001-08-20,Kenneth Lay,attached enron america gastrading violation me...,False
9,2001-08-22,Kenneth Lay,attached net open position maturity gap risk v...,False


In [51]:
# Export snapshot of data + compress
compression_opts = dict(method='zip', archive_name='key_players_clean.csv')  
key_players_clean.to_csv('key_players_clean.zip', index=True, compression=compression_opts)  