<a href="https://colab.research.google.com/github/franzis17/EnronEmailAnalysis/blob/main/Business_Report.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <b>Introduction</b>

The Enron Corporation was an American energy, commodities, and services company. It became notorious for corporate fraud and corruption after its fraudulent accounting practices were exposed in a financial scandal.

## Aim of the Report Analysis

This report aims to identify potential topics of interest relating to Enron's fraudulent activities by analysing three things: the volume of email over time, the top 10 senders and receivers of email, and the keywords used in email subjects.

# <b>Importing Packages and Connecting to the Database</b>

IMPORTANT: The cells in this section must be run before any of the analyses.


## Initialise all packages needed in the program

Packages used:
* calendar --> for showing the month name in Analysis 1 instead of the month number
* matplotlib --> for visualisation, i.e. plotting the data as charts
* pandas --> for efficiently organising the data (e.g. grouping the messages by month)
* re --> for regular expressions
* sqlite3 --> for extracting data from the database so that the 3 analyses can be performed on it
* wordcloud --> for generating the WordCloud in Analysis 3



In [None]:
import calendar
import matplotlib.pyplot as plt
import pandas as pd
import re  # regular expression
import sqlite3
from wordcloud import WordCloud

# Download the stopwords module (stopwords.py)
from google.colab import output
!curl -Ol https://raw.githubusercontent.com/michael-borck/isys2001-worksheets/main/stopwords.py
output.clear()
print("Required packages installed")

## Connect to the database

In [None]:
# Connect to the database using 'sqlite3'
conn = sqlite3.connect('/content/drive/MyDrive/a-quick_uploads/enron.db')

# Create a cursor to step through the rows of the database and run queries against it
cur = conn.cursor()
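
`sqlite3.connect` creates a new, empty database file when the given file does not exist, so a wrong path can go unnoticed until a query fails. The following is a minimal sketch, not part of the original notebook, of a check that lists the tables in the opened database. The helper name `list_tables` and the in-memory demo tables are assumptions made for illustration only.

```python
import sqlite3

def list_tables(conn):
    """Return the names of all tables in the connected SQLite database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]

# Demonstration on an in-memory database with hypothetical stand-in tables
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE message (mid INTEGER, sender TEXT)")
demo_conn.execute("CREATE TABLE recipientinfo (rid INTEGER, mid INTEGER)")
demo_tables = list_tables(demo_conn)
print(demo_tables)
demo_conn.close()
```

Calling `list_tables(conn)` on the real connection should show the "message" and "recipientinfo" tables used in the analyses; an empty list would suggest the wrong file was opened.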

# <u>**Analysis 1.**</u> Email Traffic Over Time

Aim: Analyze the volume of emails sent over time by counting the number of messages that employees sent per month.

## **Problem:**

* Must count the total number of messages sent by all employees in each month.

## **Inputs:**

* List of messages

## **Outputs:**

* A line chart that shows the number of emails per month (x = month, y = number of messages).

## **Algorithm:**
  1. Create a dataframe from an SQL statement that obtains all messages from the database table "message".
  2. Using the "message" dataframe, group the messages by the month in which they were sent and count the messages in each month.
  3. Using the monthly counts, create a line chart that shows the total number of messages per month.

* **Important**: Must close the connection to the SQLite Database when done with the database.

## **Python Implementation:**

In [None]:
'''
Note:
> Lines 12 to 29 were generated by ChatGPT. More info on reference list [ChatGPT-1]
'''

# Query the database for all messages and load them into a dataframe
sql='''
SELECT * FROM message
'''
messages_df = pd.read_sql(sql, conn)

# Add a 'month' column so that the messages can be grouped by month
messages_df['date'] = pd.to_datetime(messages_df['date'])
messages_df['month'] = messages_df['date'].dt.month

# Apply month names instead of the month numbers
messages_df['month'] = messages_df['month'].apply(lambda x: calendar.month_name[x])
# Set month as a categorical column with desired order (Order the month chronologically)
month_order = [calendar.month_name[i] for i in range(1, 13)]
messages_df['month'] = pd.Categorical(messages_df['month'], categories=month_order, ordered=True)

monthly_counts = messages_df.groupby('month').size()
monthly_counts.plot(kind='line', figsize=(10, 6))

# Plot the total amount of messages sent per month
plt.xlabel('Month')
plt.ylabel('Number of messages')
plt.title('Total Number of Messages per Month in the year 2000')
plt.show()
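
The chart title refers to the year 2000, while the cell above groups by calendar month over every date in the "message" table. If the table spans more than one year, the counts would need to be restricted to a single year first. The following is a minimal sketch of such a filter, under the assumption that the dates parse with `pd.to_datetime`; the helper name `monthly_counts_for_year` and the demo data are hypothetical.

```python
import calendar
import pandas as pd

def monthly_counts_for_year(df, year, date_col="date"):
    """Count rows per calendar month for a single year."""
    dates = pd.to_datetime(df[date_col])
    in_year = dates[dates.dt.year == year]
    # Count per month number, then make sure all 12 months appear in order
    counts = in_year.dt.month.value_counts().reindex(range(1, 13), fill_value=0)
    counts.index = [calendar.month_name[m] for m in counts.index]
    return counts

# Hypothetical stand-in for messages_df
demo_df = pd.DataFrame(
    {"date": ["2000-01-05", "2000-01-20", "2000-03-01", "2001-01-09"]}
)
demo_counts = monthly_counts_for_year(demo_df, 2000)
print(demo_counts)
```

The returned Series can be plotted in the same way as `monthly_counts`, for example with `demo_counts.plot(kind='line')`.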

# <u>**Analysis 2.**</u> Top Senders and Receivers
Aim: Identify the most frequent email senders and recipients by aggregating (collecting) the data in the 'message' and 'recipientinfo' tables.

## **Problems:**
* Must find the most frequent email senders
* Must find the most frequent email receivers

## **Inputs:**
* Dataset of all senders from the "message" database table
* Dataset of all receivers from the "recipientinfo" database table

## **Outputs:**
* A bar chart of the top 10 most frequent email senders
* A bar chart of the top 10 most frequent email receivers

## **Algorithm:**

Finding the most frequent email <b>senders</b>:
1. Get all the messages from the database
2. Count the number of emails sent by each "sender" and store the counts in a data set sorted from the highest to the lowest number of emails sent
3. Using the sorted data set, plot a bar chart with: x = number of emails and y = employee name

Finding the most frequent email <b>receivers</b>:
1. Get all the recipientinfo data from the database
2. Count the number of emails received by each "recipient" and store the counts in a data set sorted from the highest to the lowest number of emails received
3. Using the sorted data set, plot a bar chart with: x = number of emails and y = employee name
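
The two procedures above can be sketched as a single helper that lets SQLite do the counting and sorting. This is an illustrative sketch only, not the report's implementation: the helper name `top_counts` is hypothetical, and the column names `sender` (in "message") and `rvalue` (in "recipientinfo") are assumptions about the schema that should be checked against the real database. The demo runs on a small in-memory database with made-up addresses.

```python
import sqlite3
import pandas as pd

def top_counts(conn, table, column, n=10):
    """Return the n most frequent values of `column` in `table`, highest first."""
    # Table and column names are interpolated directly, so pass trusted names only.
    sql = f"""
        SELECT {column} AS address, COUNT(*) AS emails
        FROM {table}
        GROUP BY {column}
        ORDER BY emails DESC, address ASC
        LIMIT ?
    """
    return pd.read_sql(sql, conn, params=(n,))

# Hypothetical stand-in data; the real tables live in enron.db
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE message (mid INTEGER, sender TEXT)")
demo_conn.execute("CREATE TABLE recipientinfo (rid INTEGER, mid INTEGER, rvalue TEXT)")
demo_conn.executemany(
    "INSERT INTO message VALUES (?, ?)",
    [(1, "a@example.com"), (2, "a@example.com"), (3, "b@example.com")],
)
demo_conn.executemany(
    "INSERT INTO recipientinfo VALUES (?, ?, ?)",
    [(1, 1, "b@example.com"), (2, 2, "c@example.com"), (3, 3, "c@example.com")],
)
top_senders = top_counts(demo_conn, "message", "sender")
top_receivers = top_counts(demo_conn, "recipientinfo", "rvalue")
demo_conn.close()
print(top_senders)
print(top_receivers)
```

Each result could then be drawn as a horizontal bar chart, for example with `top_senders.plot(kind='barh', x='address', y='emails')`, which puts the number of emails on the x-axis and the names on the y-axis as described in step 3.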

## **Python Implementation:**

## <b>Discussion</b>



# <u>**Analysis 3.**</u> Subject Keyword Analysis
Aim: Extract keywords from email subjects in the 'Message' table and analyze the frequency of words used to understand common topics of discussion.

## **Problem:**

* Must understand the common topics of discussion in the 'Message' table to see whether they relate to any potential fraudulent activities.

## **Inputs:**

* List of words from the subject line of all email messages (database column message.subject).

## **Outputs:**

* WordCloud that shows the most common words used in the subject lines of the emails. (The size of each word depends on its frequency, i.e. larger words are used more frequently and smaller words less frequently.)

## **Algorithm:**
  1. Extract all the data from the 'Message' table.
  2. Clean the 'subject' contents of the message dataset so that they contain plain words only (words must not contain any symbols/numbers).
  3. Load the list of stop words called "ENGLISH_STOP_WORDS" from stopwords.py, which is a list of words that do not provide much meaning to the analysis.
  4. Create a WordCloud using the cleaned list of subject words.
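
Steps 2 and 3 above can be sketched as follows. This is a minimal illustration, not the tested code from the Development Notebook: the function names and `DEMO_STOP_WORDS` are hypothetical, and the small stop-word set stands in for the real "ENGLISH_STOP_WORDS" list. Symbols and digits are deleted rather than replaced with spaces, so a word such as "year-end" becomes "yearend".

```python
import re
from collections import Counter

# Tiny hypothetical stand-in for ENGLISH_STOP_WORDS from stopwords.py
DEMO_STOP_WORDS = {"re", "fw", "the", "for", "and", "of"}

def clean_subject(subject):
    """Lower-case a subject line, strip symbols and digits, and split into words."""
    letters_only = re.sub(r"[^a-z\s]", "", subject.lower())
    return letters_only.split()

def subject_word_counts(subjects, stop_words):
    """Count how often each non-stop word appears across all subject lines."""
    counts = Counter()
    for subject in subjects:
        if not subject:  # skip missing or empty subjects
            continue
        counts.update(w for w in clean_subject(subject) if w not in stop_words)
    return counts

# Made-up subject lines for demonstration
demo_subjects = [
    "RE: Meeting for 2001 budget",
    "FW: Budget review",
    None,
    "Meeting agenda!",
]
demo_counts = subject_word_counts(demo_subjects, DEMO_STOP_WORDS)
print(demo_counts.most_common(3))
```

For step 4, a frequency mapping like `demo_counts` could be passed to the `generate_from_frequencies` method of a `WordCloud` object from the wordcloud package.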

## **Python Implementation:**

Before running: Analysis 1 must be run first to create the dataframe of messages, "messages_df".

In [None]:
# Must test the Code from Development Notebook first before putting it here

## **Discussion:**

Analysis 3 examines the subject lines of all emails for keywords that could shed light on Enron's fraudulent activities. Keywords of interest: Ken Lay, (add more keywords here).

Ken Lay is an important keyword in relation to Enron's unethical business practices: as the CEO and founder of Enron, he was responsible for the company's conduct, including its improper accounting activities and the resulting financial scandal.


Below are some keywords that may point to topics of interest:

* "Codesite":
  * Based on an analysis of the message bodies of all emails whose subject contains the keyword "Codesite", it looks like the company had some form of system that checks for any variances to 

## Close the connection to the database
**Note:** Run this cell only when all work with the database is finished.

In [None]:
conn.close()
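
If an analysis cell raises an error, the closing cell above may never be run. A minimal sketch of one alternative, assuming the work can be done inside a single block, is `contextlib.closing`, which closes the connection when the block exits. (Using a `sqlite3` connection directly in a `with` statement only commits or rolls back the transaction; it does not close the connection.) The in-memory table below is a hypothetical stand-in.

```python
import sqlite3
from contextlib import closing

# closing() calls demo_conn.close() when the block exits, even after an error
with closing(sqlite3.connect(":memory:")) as demo_conn:
    demo_conn.execute("CREATE TABLE message (mid INTEGER)")
    demo_conn.execute("INSERT INTO message VALUES (1)")
    (demo_rows,) = demo_conn.execute("SELECT COUNT(*) FROM message").fetchone()

# The connection is closed here, so any further use raises ProgrammingError
try:
    demo_conn.execute("SELECT 1")
    closed = False
except sqlite3.ProgrammingError:
    closed = True

print(demo_rows, closed)
```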

# Reference List

* [1] https://pandas.pydata.org/docs/user_guide/10min.html#plotting
* [2] Clean function obtained from the WK-9 tutorial; used to remove symbols and numbers from words.

**ChatGPT**
* [1]
  * Purpose: To plot the frequency of messages per month to a line chart
  * Prompt: how can I use pandas package in python to plot the frequency of messages per month in a year, given the messages data are in a dataframe.