# Final Project Exploratory Visualization
### by Daniel Monteiro
In our W209 final project, we use the Enron email dataset to analyze what the emails could have told us about the scandal, which unfolded from October 2001 to December 2001 and shook the financial markets. The main question I focus on is "Who was heavily involved in the scandal?". The visualizations below try to answer this question.

In [289]:
!pip install nx_altair

import pandas as pd
import numpy as np
import altair as alt
import json
import datetime
import matplotlib.pyplot as plt
import operator

import nx_altair as nxa
import networkx as nx

from ast import literal_eval
alt.data_transformers.disable_max_rows()


# Lists need to be recognized
df = pd.read_csv('C:\\Users\\dmont\\Downloads\\emails_clean.csv',converters={"X-From": literal_eval,
                                                                             "X-To": literal_eval,
                                                                             "X-cc": literal_eval,
                                                                             "X-bcc": literal_eval})
# Transform object into datetime
df.Date = pd.to_datetime(df.Date)
df["date"] = df.Date.apply(lambda x: str(x.date()))
df.date = pd.to_datetime(df.date)
df["day"] = df.Date.apply(lambda x: x.day)
# Adjust data
# Filter for relevant period (assign the result; an in-place reset on a filtered copy is discarded)
df = df[(df.date > '2001-06-01') & (df.date < '2002-02-01')].reset_index(drop=True)

# Who were the people heavily involved in the scandal?

## First chart: Weekly number of emails

# Create smaller df with aggregated data
dates = pd.DataFrame({"date" : df["date"].value_counts().index, "count" : df["date"].value_counts()}).reset_index(drop=True).dropna()
# Relevant period
dates.date = dates.date.apply(lambda x: str(x.date())) 
dates = dates[(dates.date > '2001-07-31') & (dates.date < '2002-02-01')].reset_index(drop=True)
# Add info about scandal
dates["Period"] = dates.date.apply(lambda x: "Scandal" if (x > '2001-09-20') and (x < '2001-12-01') else "Pre-/post-scandal")

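# ISO week number (Timestamp.week is deprecated in recent pandas)
dates["week"] = pd.to_datetime(dates.date).dt.isocalendar().week.astype(int)
# Total emails per week, repeated on every row of that week
dates["count_week"] = dates.groupby("week")["count"].transform("sum")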
dates.to_csv('bar_chart_email_count.csv', index=False)

In [583]:
# Import data from smaller csv
dates = pd.read_csv('bar_chart_email_count.csv')

# Plot improved chart
alt.Chart(dates, title="Weekly email communication 2 months before and after the scandal took place").mark_bar().encode(
    x = alt.X('date:T', axis = alt.Axis(title="Date", format = ("%b %Y"), tickCount = 12)),
    y = alt.Y('count_week:Q', title="Number of emails"),
    color=alt.Color('Period', scale=alt.Scale(scheme='dark2')),
    tooltip=[alt.Tooltip('week', title = "Week"),
             alt.Tooltip('count_week', title = 'Number of emails')]
).properties(width=900, height=600).configure_axis(
    labelFontSize=16,
    titleFontSize=16).configure_range(
    category={'scheme': 'viridis'}
).configure_title(fontSize=20).configure_legend(
titleFontSize=16,
labelFontSize=16
) 

On September 20, 2001, an article appeared in the Dallas edition of The Wall Street Journal about the mark-to-market accounting practices used in the energy industry, noting that outsiders had no way of knowing which practices individual companies applied. A short-seller who read the article then checked Enron's latest 10-K report and found that the numbers did not add up and that insiders were selling stock in large amounts.

Looking at the number of emails sent in the 2 months before and after the scandal, we can clearly see that email communication started to increase after the article was published. Around 2,000 emails per week were sent in August and until mid-September. But as soon as the article was released and rumor had it that Enron might be involved in an accounting scandal, email communication rose sharply and peaked in late October 2001, when the company announced that restatements to its financial statements for the past couple of years were necessary and that earnings would be reduced by $613 million. After the announcement, email communication dropped again to a still high level and then peaked a second time in late November, when Enron declared bankruptcy.
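The weekly totals behind this reading can be cross-checked with a simple group-by on the ISO week. The following is a minimal sketch on made-up daily counts; the numbers and the names `daily` and `weekly` are hypothetical and are not taken from the Enron dataset.

```python
import pandas as pd

# Hypothetical daily email counts, for illustration only (not Enron data)
daily = pd.DataFrame({
    "date": pd.to_datetime(["2001-09-17", "2001-09-18",
                            "2001-09-24", "2001-09-25", "2001-09-26"]),
    "count": [300, 350, 500, 650, 700],
})

# ISO week number of each day
daily["week"] = daily["date"].dt.isocalendar().week.astype(int)

# Weekly total, repeated on every row that belongs to the week
daily["count_week"] = daily.groupby("week")["count"].transform("sum")

# One row per week, plus the relative change versus the previous week
weekly = daily.drop_duplicates("week")[["week", "count_week"]].reset_index(drop=True)
weekly["pct_change"] = weekly["count_week"].pct_change()
print(weekly)
```

A week-over-week change like `pct_change` is one way to quantify how sharply the volume rose, rather than judging it from the bar heights alone.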

## Second chart: Network

# Clean dataset with correct names
def clean_names(name_list):
    unique_names = list(set(name_list))
    kenneth_lay = [x for x in unique_names if "Ken Lay" in x]
    jeff_skilling = [x for x in unique_names if "Jeff Skilling" in x]
    andrew_fastow = ["Andrew S Fastow"]
    
    for i, name in enumerate(name_list):
        if name in kenneth_lay:
            name_list[i] = 'Kenneth Lay'
        elif name in jeff_skilling:
            name_list[i] = 'Jeffrey Skilling'
        elif name in andrew_fastow:
            name_list[i] = 'Andrew Fastow'
        elif name_list[i].count('@') == 0:
            name_list[i] = name_list[i].replace(".","").replace('"','')
    return name_list

# Reduce dataframe to scandal period
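# Copy the filtered frame so that column assignments do not hit a view of df
df_scandal = df[(df.date > '2001-09-20') & (df.date < '2001-12-01')].reset_index(drop=True).copy()

# Create dataframe for network
# Keep the first non-empty sender per email so that rows stay aligned with the recipients
df_scandal["X-From"] = df_scandal["X-From"].apply(lambda row: next((item for item in row if item), None))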
all_recipients = df_scandal["X-To"] + df_scandal["X-cc"] + df_scandal["X-bcc"]

to_lst = []
from_lst = []

for i, row in enumerate(all_recipients):
    for item in row:
        from_lst.append(df_scandal["X-From"].iloc[i])
        to_lst.append(item)
        
from_to = pd.DataFrame({"from" : from_lst, "to" : to_lst})
from_to.replace("", np.nan, inplace=True)
from_to.dropna(inplace=True)
from_to.reset_index(drop=True, inplace=True)
from_to.loc[:,"from"] = clean_names(from_to.loc[:,"from"].reset_index(drop=True))
from_to.loc[:,"to"] = clean_names(from_to.loc[:,"to"].reset_index(drop=True))
from_to["from"] = from_to["from"].astype(str)
from_to["to"] = from_to["to"].astype(str)

# Count emails from author to recipient
count_from_to = from_to.groupby(["from","to"])["to"].agg(['count']).reset_index()
count_from_to.columns = ["from","to","email_count"]
# Remove emails to oneself
count_from_to = count_from_to[count_from_to["from"] != count_from_to.to]
# Join dataframes to filter out low email count
from_to_short = pd.merge(from_to, count_from_to, on=['from','to'], how='left')
from_to_short.dropna(inplace=True)
from_to_short = from_to_short[from_to_short.email_count > 4]
from_to_short.drop(columns="email_count", inplace=True)
from_to_short.reset_index(drop=True, inplace=True)
len(from_to_short)

# Generate NX Graph
G = nx.Graph()
G = nx.from_pandas_edgelist(from_to_short, 'from', 'to')

# Compute positions for viz.
pos = nx.spring_layout(G, seed=4)

# Name the nodes
for i in sorted(G.nodes()):
    G.nodes[i]["name"] = i
    if i == "Kenneth Lay":
        G.nodes[i]["ken"] = "Yes"
    else:
        G.nodes[i]["ken"] = "No"

# Store each node's degree (number of distinct email partners) as a node attribute
degrees = dict(G.degree(G.nodes()))
nx.set_node_attributes(G, degrees, 'degree')

# Export graph data into gml file
nx.write_gml(G, "enron_network.gml")

In [554]:
G = nx.read_gml("enron_network.gml")

# Coloring Ken
col_ken = [name == "Kenneth Lay" for name in G.nodes]

# Draw the network
network = nxa.draw_networkx(G, pos=pos,
                        node_size='degree:Q',
                        node_tooltip = "name"
                       )

# Get the edge and node layers
edges = network.layer[0]
nodes = network.layer[1]

# Highlight Kenneth Lay by fill color and size the nodes by degree
nodes = nodes.encode(
    fill = alt.Fill('ken:N', scale=alt.Scale(scheme='dark2'), legend=None),
    size = alt.Size('degree:Q', title=["Number of connections"])
)

(edges+nodes).properties(
    height=800, width=1000, title={
      "text": ["Email communication during scandal"], 
      "subtitle": ["Zoom-in for more details. Hover over dots for tooltips."],
      "fontSize":20,
      "subtitleFontSize": 16}
).interactive()

We can see that Kenneth Lay is at the center of the email communication. We can also identify key people such as Jeff Dasovich, John Arnold, Michelle Cash and Rick Buy, to whom employees reported and who then communicated with Kenneth Lay. However, the other two culprits in the case, Jeffrey Skilling and Andrew Fastow, do not appear in this network of the most influential email communicators. We cannot say exactly why Skilling and Fastow did not communicate much via email, but since both were involved in the fraud, we can assume that they did not want confidential communication about it to be recorded in emails. Outside the main network, there are two smaller networks with no connection to the main one. Apparently, not all departments were connected to each other, so communication islands existed.
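The two observations above (a central hub and disconnected "islands") can also be read off the graph numerically instead of visually. The sketch below uses a tiny made-up edge list; the node names `A` to `F` and the variable `G_toy` are placeholders and do not come from the Enron data.

```python
import networkx as nx

# Hypothetical edge list, for illustration only (placeholder names)
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("E", "F")]
G_toy = nx.Graph(edges)

# Degree = number of distinct email partners of each node
degree = dict(G_toy.degree())

# Degree centrality normalizes the degree by (number of nodes - 1)
centrality = nx.degree_centrality(G_toy)

# The node with the most connections plays the role of the hub
hub = max(degree, key=degree.get)

# Connected components, largest first; every component after the first is an "island"
components = sorted(nx.connected_components(G_toy), key=len, reverse=True)

print("hub:", hub, "centrality:", round(centrality[hub], 2))
print("components:", components)
```

Applied to the real graph `G`, the same calls would list the best-connected people and the sizes of the separate communication islands.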

### Are the key culprits involved in the Enron scandal among the Top 20 people who sent and received the emails when the scandal started?
The key culprits are Kenneth Lay (founder, Chairman and CEO), Jeffrey Skilling (former President and COO) and Andrew Fastow (former CFO) (Source: Wikipedia). Since they knew about the fraud, the number of emails they sent and received might show evidence of their involvement.

# Email authors ranked by number of emails sent
authors = pd.DataFrame({"Name" : df_scandal["X-From"].value_counts().index,
                            "Sent" : df_scandal["X-From"].value_counts()}).reset_index(drop=True).sort_values(by="Sent", ascending=False)

# Email recipients (To, cc and bcc combined) ranked by number of emails received
all_recipients = df_scandal["X-To"] + df_scandal["X-cc"] + df_scandal["X-bcc"]
recipients = pd.Series([item for row in all_recipients for item in row if item])
recipients = pd.DataFrame({"Name" : recipients.value_counts().index,
                            "Received" : recipients.value_counts()}).reset_index(drop=True).sort_values(by="Received", ascending=False)

# Merge both dataframes
email_people = pd.merge(authors, recipients, how='outer', on="Name").fillna(0)
email_people["Total"] = email_people.Sent + email_people.Received
email_people = email_people.sort_values(by="Total", ascending=False)[:20].reset_index(drop=True)
email_people["is_culprit"] = email_people.Name.isin(["Kenneth Lay", "Jeffrey Skilling", "Andrew Fastow"])

# Add rank
email_people = email_people.reset_index()
email_people["Rank"] = email_people.index +1
email_people.Name = email_people["Rank"].astype(str) + " - " + email_people.Name

# email_people already holds only the Top 20, so it is used directly for the chart
culprit = email_people.drop(columns="index")
# Save the chart data so that the plotting cell below can reload it
culprit.to_csv("culprit.csv", index=False)

In [582]:
# Import data
culprit = pd.read_csv("culprit.csv")
# Draw chart
alt.Chart(culprit, title="Top 20 email accounts by total emails sent/received during scandal").mark_bar().encode(
    y=alt.Y("Name:N", sort=alt.EncodingSortField(field="Rank", order='ascending'),
            axis=alt.Axis(labelLimit=250), title=None),
    x=alt.X('Total:Q', title="Total emails sent/received"),
    color=alt.Color('is_culprit', title="Is culprit?")
).properties(width=800, height=600).configure_axis(
    labelFontSize=16,
    titleFontSize=16).configure_range(
    category={'scheme': 'dark2'}
).configure_title(fontSize=20).configure_legend(
titleFontSize=16,
labelFontSize=16
)

It is not surprising to see that CEO Kenneth Lay was among the Top 20 email authors and recipients. However, we again find neither Jeffrey Skilling nor Andrew Fastow on the list, and they do not appear even among the Top 100. This again seems noteworthy given that both were heavily involved in the fraud. Apart from this, we also see two company mailing lists at the top, which most likely reflect the company-wide communication from management to employees and the replies to those emails. Louise Kitchen and Richard Shapiro were the most communicative people in the company. However, we did not notice them in the network chart, which might suggest that although they sent and received many emails, they did not have the range and connections of others such as John Arnold and Jeff Dasovich.
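Statements such as "not even among the Top 100" can be verified by looking up a person's rank in the full, untruncated ranking. The following is a minimal sketch on made-up totals; the names, the numbers and the helper `rank_of` are hypothetical.

```python
import pandas as pd

# Hypothetical totals per account, for illustration only (not Enron data)
totals = pd.DataFrame({
    "Name": ["Person A", "Person B", "Person C", "Person D"],
    "Total": [120, 480, 75, 300],
})

# Rank 1 = most emails sent and received; ties are broken by row order
totals["Rank"] = totals["Total"].rank(method="first", ascending=False).astype(int)

def rank_of(name):
    """Return the rank of `name`, or None if the name is not in the table."""
    match = totals.loc[totals["Name"] == name, "Rank"]
    return int(match.iloc[0]) if not match.empty else None

print(rank_of("Person B"))
print(rank_of("Person C"))
print(rank_of("Someone Else"))
```

Run against the full `email_people` table before it is cut to 20 rows, a lookup like this would give the exact position of each of the three culprits.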

# Import count_from_to
count_from_to = pd.read_csv('count_from_to.csv')

# Keep only sender-recipient pairs with more than 10 emails
count_from_to_5 = count_from_to[count_from_to.email_count > 10]

alt.Chart(count_from_to_5).mark_circle(
    stroke='black'
).encode(
    x='to:N',
    y='from:N',
    size='email_count:Q',
    tooltip='email_count:Q',
    fillOpacity=alt.FillOpacity(
        'email_count:Q',
        #scale=alt.Scale(domain=[fill_threshold, fill_threshold + 0.01],range=[0 ,1])
    )
).interactive()