# Word Cloud Prototypes | Team Enron Vizards

### Course: W209 - Section 5
### Author: Akiko Iwamizu

---------------------------------------

## About The Data

The Enron email dataset, available on [Kaggle](https://www.kaggle.com/datasets/wcukierski/enron-email-dataset), contains approximately 500,000 emails generated by employees of the Enron Corporation. It was obtained by the Federal Energy Regulatory Commission during its investigation of Enron's collapse.

We performed basic data preparation and manipulation before beginning EDA. These steps are not included in this notebook but can be found in our team repo on [GitHub](https://github.com/akikoiwamizu/enron-vizards).

We begin the exploratory data analysis by importing the required libraries and the cleaned dataset.

In [None]:
import altair as alt
import nltk
import pandas as pd
import os
import re

from ast import literal_eval
from matplotlib import pyplot as plt
from ipywidgets import widgets, interact, interactive, fixed
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer as stemmer
from nltk.stem import WordNetLemmatizer as lemmatizer
from nltk import pos_tag

from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

pd.set_option("display.max_columns", None)
pd.set_option("mode.chained_assignment", None)

alt.data_transformers.disable_max_rows()

nltk.download("punkt") # necessary for tokenization
nltk.download("wordnet") # necessary for lemmatization
nltk.download("stopwords") # necessary for removal of stop words
nltk.download("averaged_perceptron_tagger") # necessary for POS tagging
nltk.download("maxent_ne_chunker") # necessary for entity extraction
nltk.download("words") # word list used by the entity chunker

In [None]:
print(os.listdir("../../data"))

In [None]:
df = pd.read_csv("../../data/emails_clean.csv", 
                 converters=
                 {
                     "X-From": literal_eval,
                     "X-To": literal_eval,
                     "X-cc": literal_eval,
                     "X-bcc": literal_eval
                 }
)
df.shape

After loading in the cleaned dataset, let's take a look at the initial dataframe.

In [None]:
df.head()

### Adding useful date dimensions

In [None]:
# Create date/time fields that we will need for this viz
df["hour"] = pd.to_datetime(df["time"]).dt.hour.values
df["date_clean"] = pd.to_datetime(df[["year", "month", "day"]])
df["weekday"] = pd.to_datetime(df["date_clean"]).dt.day_name()
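
As a quick sanity check of the date fields, the same derivation can be run on a toy frame. This is only a sketch: the two rows and the `%H:%M:%S` time format are assumptions for illustration, not values taken from the dataset.

```python
import pandas as pd

# Toy rows standing in for the cleaned dataset; the column names mirror the ones used above
toy = pd.DataFrame({
    "year": [2001, 2001],
    "month": [10, 12],
    "day": [16, 2],
    "time": ["09:15:00", "17:40:00"],  # assumed format, for illustration only
})

# Hour of day from the time string
toy["hour"] = pd.to_datetime(toy["time"], format="%H:%M:%S").dt.hour
# Full date assembled from the year/month/day columns
toy["date_clean"] = pd.to_datetime(toy[["year", "month", "day"]])
# Weekday name derived from the assembled date
toy["weekday"] = toy["date_clean"].dt.day_name()

print(toy[["date_clean", "hour", "weekday"]])
```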

### Standardize names

In [None]:
df_clean = df.copy()

# Email authors
authors = pd.Series([item for row in df_clean["X-From"] for item in row if item])
authors = authors.apply(lambda x: x.split("@")[0].replace("\"", "").replace("'", "").replace(".", " ").replace("_", " ").strip().title())
authors = pd.DataFrame({"name": authors.value_counts().index,
                        "sent": authors.value_counts()}).reset_index(drop=True).sort_values(by="sent", ascending=False)

# Email recipients
all_recipients = df_clean["X-To"] + df_clean["X-cc"] + df_clean["X-bcc"]
recipients = pd.Series([item for row in all_recipients for item in row if item])
recipients = recipients.apply(lambda x: x.split("@")[0].replace("\"", "").replace("'", "").replace(".", " ").replace("_", " ").strip().title())
recipients = pd.DataFrame({"name": recipients.value_counts().index,
                           "received": recipients.value_counts()}).reset_index(drop=True).sort_values(by="received", ascending=False)
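
The chained `replace` calls above are easier to inspect when pulled out into a small function. The sketch below is a hypothetical helper (`normalize_name` is not part of the notebook) that applies the same steps to two made-up inputs.

```python
def normalize_name(raw):
    """Turn an address-like string into a title-cased display name."""
    # Keep only the part before the "@", if there is one
    local_part = raw.split("@")[0]
    # Drop quote characters
    for ch in ('"', "'"):
        local_part = local_part.replace(ch, "")
    # Treat dots and underscores as word separators
    for ch in (".", "_"):
        local_part = local_part.replace(ch, " ")
    return local_part.strip().title()


print(normalize_name("jeff.skilling@enron.com"))  # Jeff Skilling
print(normalize_name("'ken_lay' "))               # Ken Lay
```

A single helper like this could replace the repeated lambda wherever names are cleaned, which keeps the rule in one place.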


In [None]:
# Standardize all name variations for the key players in the scandal
# Stored as sets so that `name in ...` tests the name values (on a Series, `in` checks the index labels)
andrew_fastow = set(authors.name[authors.name.apply(lambda x: "Andrew S Fastow" in x)])
arthur_andersen = set(authors.name[authors.name.apply(lambda x: "Arthur Andersen" in x)])
jeff_skilling = set(authors.name[authors.name.apply(lambda x: "Jeff Skilling" in x)])
ken_lay = set(authors.name[authors.name.apply(lambda x: "Ken Lay" in x)])
sherron_watkins = set(authors.name[authors.name.apply(lambda x: "Sherron Watkins" in x)])


In [None]:
# Clean dataset with correct names
def clean_names(name_list):
    for i, name in enumerate(name_list):
        if name in andrew_fastow:
            name_list[i] = "Andrew Fastow"
        elif name in arthur_andersen:
            name_list[i] = "Arthur Andersen"
        elif name in jeff_skilling:
            name_list[i] = "Jeff Skilling"
        elif name in ken_lay:
            name_list[i] = "Kenneth Lay"
        elif name in sherron_watkins:
            name_list[i] = "Sherron Watkins"
    # Return only after every name in the list has been checked
    return name_list
    

In [None]:
df_clean["X-From"] = df_clean["X-From"].apply(lambda x: clean_names(x))
df_clean["X-To"] = df_clean["X-To"].apply(lambda x: clean_names(x)) 
df_clean["X-cc"] = df_clean["X-cc"].apply(lambda x: clean_names(x)) 
df_clean["X-bcc"] = df_clean["X-bcc"].apply(lambda x: clean_names(x))
df_clean.head()
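
The name standardization above depends on membership tests against filtered name collections. One pandas detail matters here: on a Series, the `in` operator looks at the index labels rather than the values. The toy names below are made up for illustration.

```python
import pandas as pd

names = pd.Series(["Ken Lay", "Jeff Skilling", "Ken Lay Office"])
ken_lay = names[names.str.contains("Ken Lay")]  # keeps index labels 0 and 2

# `in` on a Series checks the index labels, so a name string is not found
print("Ken Lay" in ken_lay)       # False
print(0 in ken_lay)               # True

# Converting to a set (or using .values) tests the actual names
print("Ken Lay" in set(ken_lay))  # True
```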

### Create a data subset for only the scandal period (2001-)

In [None]:
# Create a subset of the email dataset to only the scandal period
# Defining the scandal period as the date when Bethany McLean published the article "Is Enron Overpriced?"
# in Fortune magazine. She writes that investors are generally clueless as to how Enron earns its reported profits.
scandal_period = df_clean[df_clean["date_clean"] >= "2001-09-30"]
scandal_period.head()
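
The date filter can be checked on a few hand-made rows. In this sketch the three dates and bodies are invented; only the cutoff value is taken from the cell above.

```python
import pandas as pd

toy = pd.DataFrame({
    "date_clean": pd.to_datetime(["2001-03-05", "2001-09-30", "2001-12-02"]),
    "Body": ["before", "on the cutoff", "after"],
})

# Same cutoff as above; >= keeps the cutoff day itself
cutoff = pd.Timestamp("2001-09-30")
subset = toy[toy["date_clean"] >= cutoff]

print(subset["Body"].tolist())  # ['on the cutoff', 'after']
```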

In [None]:
len(scandal_period)

### Find the most common words in the email messages during the scandal period

In [None]:
# Email authors + their messages
email_messages = scandal_period[["date_clean", "From", "X-From", "Body"]].reset_index(drop=True)

# Clean up "From" field
email_messages["Name"] = email_messages["X-From"].str[0]
email_messages["Name"] = email_messages["Name"].apply(lambda x: x.split("@")[0].replace("\"", "").replace("'", "").replace(".", " ").replace("_", " ").strip().title())

email_messages.head()


In [None]:
# Imports for text cleaning, plus the small English spaCy model
import re
import string
import nltk
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
import spacy

nlp = spacy.load('en_core_web_sm')

def clean_string(text, stem="None"):

    final_string = ""

    # Make lower
    text = text.lower()

    # Remove line breaks
    text = re.sub(r'\n', ' ', text)

    # Remove punctuation
    translator = str.maketrans('', '', string.punctuation)
    text = text.translate(translator)

    # Remove stop words (a set keeps each membership test fast)
    text = text.split()
    useless_words = set(nltk.corpus.stopwords.words("english"))
    useless_words.update(["im", "cc", "subject", "re", "na", "original", "message", "image", "email", "address"])

    text_filtered = [word for word in text if word not in useless_words]

    # Remove numbers and any token containing a digit, then drop the empty strings left behind
    text_filtered = [re.sub(r'\w*\d\w*', '', w) for w in text_filtered]
    text_filtered = [w for w in text_filtered if w]

    # Stem or Lemmatize
    if stem == 'Stem':
        stemmer = PorterStemmer() 
        text_stemmed = [stemmer.stem(y) for y in text_filtered]
    elif stem == 'Lem':
        lem = WordNetLemmatizer()
        text_stemmed = [lem.lemmatize(y) for y in text_filtered]
    elif stem == 'Spacy':
        text_filtered = nlp(' '.join(text_filtered))
        text_stemmed = [y.lemma_ for y in text_filtered]
    else:
        text_stemmed = text_filtered

    final_string = ' '.join(text_stemmed)

    return final_string

In [None]:
# Next apply the clean_string function to the text
email_messages["Body_Clean"] = email_messages["Body"].apply(lambda x: clean_string(x, stem='Lem'))
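
To see what the cleaning steps do to a single message, here is a simplified stand-in that needs no NLTK downloads. It is a sketch, not the notebook's function: `clean_string_demo` and `DEMO_STOP_WORDS` are hypothetical names, the stop-word list is deliberately tiny, and no stemming or lemmatization is applied.

```python
import re
import string

# Hypothetical, tiny stop-word list; the notebook uses NLTK's English list plus email-specific terms
DEMO_STOP_WORDS = {"the", "to", "is", "a", "re", "cc", "subject", "please"}


def clean_string_demo(text):
    # Lowercase and collapse line breaks and other whitespace into single spaces
    text = re.sub(r"\s+", " ", text.lower())
    # Strip ASCII punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Drop stop words
    tokens = [t for t in text.split() if t not in DEMO_STOP_WORDS]
    # Drop any token that contains a digit
    tokens = [t for t in tokens if not re.search(r"\d", t)]
    return " ".join(tokens)


print(clean_string_demo("Subject: RE: Q3 earnings\nPlease send the 2001 report to Ken."))
# earnings send report ken
```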

In [None]:
email_messages.head()

In [None]:
# Export snapshot of data + compress
compression_opts = dict(method='zip', archive_name='scandal_period2.csv')  
email_messages.to_csv('scandal_period2.zip', index=False, compression=compression_opts)  
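
A compressed export like the one above can be read back directly with `pd.read_csv`, which infers zip compression from the file extension. The round trip below is a sketch on a one-row toy frame written to a temporary directory; the file names are placeholders.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"Name": ["Kenneth Lay"], "Body_Clean": ["meeting monday"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "snapshot.zip")
    # Write a single CSV named snapshot.csv inside the zip archive
    toy.to_csv(path, index=False,
               compression=dict(method="zip", archive_name="snapshot.csv"))
    # Compression is inferred from the ".zip" extension
    restored = pd.read_csv(path)

print(restored)
```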

In [None]:
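%matplotlib inline

# Combine all cleaned scandal-period messages into a single string for the word cloud
text = " ".join(email_messages["Body_Clean"])

# Remove stopwords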
stopwords = set(STOPWORDS)
stopwords.update(["please", "hi", "im", "ill", "cc", "subject", "re", "na", "original", "message", "image", "email", "address"])

wordcloud = WordCloud(stopwords=stopwords,
                      background_color="white", 
                      max_words=100,
                      width=1024, 
                      height=500,
                      colormap="magma"                      
                     ).generate(text)

plt.figure(figsize=(15,15))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.margins(x=0, y=0)
plt.savefig("cloud_all.png")  # save before plt.show(), which can leave an empty figure behind
plt.show()

### Create a subset of the email data to include only the key players of the scandal

In [None]:
# Email authors + their messages
key_players = df_clean[["date_clean", "From", "X-From", "X-To", "Body"]].reset_index(drop=True)

# Clean up "From" field
key_players["Name_From"] = key_players["X-From"].str[0]
key_players["Name_To"] = key_players["X-To"].str[0]
key_players["Name_From"] = key_players["Name_From"].apply(lambda x: x.split("@")[0].replace("\"", "").replace("'", "").replace(".", " ").replace("_", " ").strip().title())
key_players["Name_To"] = key_players["Name_To"].apply(lambda x: x.split("@")[0].replace("\"", "").replace("'", "").replace(".", " ").replace("_", " ").strip().title())
key_players.head()


In [None]:
# Create a subset of email messages from key players in the scandal
key_names = ["Andrew Fastow", "Jeff Skilling", "Kenneth Lay", "Sherron Watkins"]
key_players = key_players[["date_clean", "Name_From", "Name_To", "Body"]]
key_players = key_players.loc[key_players["Name_From"].isin(key_names) | key_players["Name_To"].isin(key_names)].reset_index(drop=True)
key_players = key_players.drop_duplicates().reset_index(drop=True)  # drop_duplicates returns a new frame
len(key_players)


In [None]:
# Next apply the clean_string function to the text
key_players["Body_Clean"] = key_players["Body"].apply(lambda x: clean_string(x, stem='Lem'))

key_players = key_players[["date_clean", "Name_From", "Name_To", "Body_Clean"]]
key_players = key_players.drop_duplicates().reset_index(drop=True)  # drop_duplicates returns a new frame
len(key_players)

In [None]:
key_players.head(-100)

In [None]:
test = key_players["Body_Clean"].str.split(expand=True).stack().value_counts().reset_index()
test.columns = ["Word", "Count"]
test.head()
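
The same word tally can be cross-checked without pandas by using `collections.Counter`. The three message bodies below are invented for illustration.

```python
from collections import Counter

bodies = ["meeting monday pm", "investigation meeting", "monday meeting"]

# Split every body on whitespace and count each word across all messages
counts = Counter(word for body in bodies for word in body.split())

print(counts.most_common(2))  # [('meeting', 3), ('monday', 2)]
```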

In [None]:
# Export snapshot of data + compress
compression_opts = dict(method='zip', archive_name='key_players.csv')  
key_players.to_csv('key_players.zip', index=True, compression=compression_opts)  

In [None]:
%matplotlib inline
text = " ".join(item for item in key_players["Body_Clean"])

stopwords = set(STOPWORDS)
stopwords.update(["jeffskillinenroncom", "u", "skillingcorpenronenron", "please", "hi", "im", "ill", "cc", "subject", "re", "na", "original", "message", "image", "email", "address"])

wordcloud = WordCloud(stopwords=stopwords,
                      background_color="white", 
                      max_words=100,
                      width=1024, 
                      height=500,
                      colormap="magma"                      
                     ).generate(text)

plt.figure(figsize=(15,15))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.margins(x=0, y=0)
plt.savefig("cloud_key_players.png")  # save before plt.show(), which can leave an empty figure behind
plt.show()

To improve the visualization, let's allow users to interactively select the key players in the scandal and see how their word clouds change.

In [None]:
%matplotlib notebook
def clean_emails(Employee, frame, col):
    # Keep only the selected employee's emails
    newframe = frame[frame["Name"] == Employee].astype(str)
    # Strip digits, question marks, and other non-word characters, then lowercase
    newframe[col] = (newframe[col]
                     .str.replace(r"\d+", "", regex=True)
                     .str.replace("?", "", regex=False)  # literal "?"; as a regex it is invalid
                     .str.replace(r"\W", " ", regex=True)
                     .str.lower())

    text = " ".join(newframe[col])
    # Remove stop words; re.escape keeps any special characters literal
    pattern = re.compile(r'\b(' + r'|'.join(re.escape(w) for w in STOPWORDS) + r')\b\s*')
    cleantext = pattern.sub("", text)

    text_tokenized = cleantext.split(" ")
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word) for word in text_tokenized if word != ""]
    cloudtext = " ".join(words)

    return cloudtext

def make_clouds(Employee, frame, col, Maximum, title):
    cloudtext = clean_emails(Employee, frame, col)
    wordcloud = WordCloud(max_font_size=40, max_words=Maximum, background_color="white", collocations=False).generate(cloudtext)

    plt.style.use("tableau-colorblind10")
    plt.figure(figsize=(10,10))
    plt.title(title, fontsize=18, fontweight="bold")
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")

# Assumption: email_messages_kp is the scandal-period subset limited to emails sent by the key players
email_messages_kp = email_messages[email_messages["Name"].isin(key_names)]

cloud = interact(make_clouds,
         Employee=key_names,
         Maximum=[100, 75, 50, 25],
         col=fixed("Body"),
         title=fixed("\nMost Common Words in Emails by the Enron Scandal Key Players\n"),
         frame=fixed(email_messages_kp[["Name","Body"]])
        )


#### Conclusion

These views suggest that the most common words in the key players' emails vary considerably from one employee to the next. Across all key players, frequent terms include the key players' own names and scheduling words such as "Monday" and "pm". Other common words include "enronxgate" and "investigation", along with references to internal initiatives such as the Petrobras gas initiative.
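
One way to back up the per-employee comparison with numbers is to list the top words for each sender. The sketch below uses a hypothetical `top_words` helper on made-up rows; the column names follow the `key_players` frame, but the data is illustrative only.

```python
from collections import Counter

import pandas as pd

toy = pd.DataFrame({
    "Name_From": ["Kenneth Lay", "Kenneth Lay", "Jeff Skilling"],
    "Body_Clean": ["meeting monday", "meeting employee", "investigation call"],
})


def top_words(frame, n=2):
    """Return the n most frequent words per sender as a dict of lists."""
    result = {}
    for name, bodies in frame.groupby("Name_From")["Body_Clean"]:
        # Join all of this sender's messages and count the words
        counts = Counter(" ".join(bodies).split())
        result[name] = [word for word, _ in counts.most_common(n)]
    return result


print(top_words(toy))
```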

---------------------------------------