# Multi-table Datasets - ENRON Archive

## 1. Data import

Connect to the file 'assets/datasets/enron.db' using one of these methods:

- sqlite3 python package
- pandas.read_sql
- SQLite Manager Firefox extension

Take a look at the database and query the master table. How many Tables are there in the db?

> Answer:
There are 3 tables:
- MessageBase
- RecipientBase
- EmployeeBase

In [1]:
import sqlite3
conn = sqlite3.connect('enron.db')

Query the `sqlite_master` table to retrieve the schema of the `EmployeeBase` table.

1. What fields are there?
1. What's the type of each of them?

In [12]:
data = c.execute('SELECT * FROM sqlite_master;')
data.fetchall()

[(u'view',
  u'Employee',
  u'Employee',
  0,
  u"CREATE VIEW Employee AS\nSELECT\n    eid,\n    name,\n    longdepartment,\n    title,\n\n    gender,\n    CASE gender\n        WHEN 'Female' THEN 1\n        ELSE 0\n    END AS genF,\n\n    seniority,\n    CASE seniority\n        WHEN 'Junior' THEN 1\n        ELSE 0\n    END AS senJ,\n\n    department,\n    CASE department\n        WHEN 'Legal' THEN 1\n        ELSE 0\n    END AS depL,\n    CASE department\n        WHEN 'Trading' THEN 1\n        ELSE 0\n    END AS depT\nFROM\n    EmployeeBase"),
 (u'view',
  u'EmployeeWithVars',
  u'EmployeeWithVars',
  0,
  u'CREATE VIEW EmployeeWithVars AS\nSELECT\n    eid,\n    1 AS intercept,\n    genF,\n    senJ,\n    depL,\n    depT,\n    genF * senJ AS genF_senJ,\n    genF * depL AS genF_depL,\n    genF * depT AS genF_depT,\n    senJ * depL AS senJ_depL,\n    senJ * depT AS senJ_depT,\n    genF * senJ * depL AS genF_senJ_depL,\n    genF * senJ * depT AS genF_senJ_depT\nFROM\n    Employee'),
 (u'tab

1. Print the first 5 rows of EmployeeBase table
1. Print the first 5 rows of MessageBase table
1. Print the first 5 rows of RecipientBase table

**Hint**  use `SELECT` and `LIMIT`.

In [30]:
import pandas as pd
pd.read_sql('SELECT * FROM EmployeeBase LIMIT 5;', con=conn)

Unnamed: 0,eid,name,department,longdepartment,title,gender,seniority
0,1,John Arnold,Forestry,ENA Gas Financial,VP Trading,Male,Senior
1,2,Harry Arora,Forestry,ENA East Power,VP Trading,Male,Senior
2,3,Robert Badeer,Forestry,ENA West Power,Mgr Trading,Male,Junior
3,4,Susan Bailey,Legal,ENA Legal,Specialist Legal,Female,Junior
4,5,Eric Bass,Forestry,ENA Gas Texas,Trader,Male,Junior


In [32]:
import pandas as pd
pd.read_sql('SELECT * FROM MessageBase LIMIT 5;', con=conn)

Unnamed: 0,mid,filename,unix_time,subject,from_eid
0,1,taylor-m/sent/11,910930020,Cd$ CME letter,138
1,2,taylor-m/sent/17,911459940,Indemnification,138
2,3,taylor-m/sent/18,911463840,Re: Indemnification,138
3,4,taylor-m/sent/23,911874180,"Re: Coral Energy, L.P.",138
4,5,taylor-m/sent/27,912396120,Bankruptcy Code revisions,138


In [31]:
import pandas as pd
pd.read_sql('SELECT * FROM RecipientBase LIMIT 5;', con=conn)

Unnamed: 0,mid,rno,to_eid
0,1,1,59
1,2,1,15
2,3,1,15
3,4,1,109
4,4,2,49


Import each of the 3 tables to a Pandas Dataframes

In [36]:
employeedf = pd.DataFrame(pd.read_sql('SELECT * FROM EmployeeBase;', con=conn))
messagedf = pd.DataFrame(pd.read_sql('SELECT * FROM MessageBase;', con=conn))
recipientdf = pd.DataFrame(pd.read_sql('SELECT * FROM RecipientBase;', con=conn))

## 2. Data Exploration

Use the 3 dataframes to answer the following questions:

1. How many employees are there in the company?
- How many messages are there in the database?
- Convert the timestamp column in the messages. When was the oldest message sent? And the newest?
- Some messages are sent to more than one recipient. Group the messages by message_id and count the number of recepients. Then look at the distribution of recepient numbers.
    - How many messages have only one recepient?
    - How many messages have >= 5 recepients?
    - What's the highest number of recepients?
    - Who sent the message with the highest number of recepients?
- Plot the distribution of recepient numbers using Bokeh.

In [37]:
len(employeedf)

156

In [38]:
len(messagedf)

21635

In [44]:
messagedf['date'] = pd.to_datetime(messagedf["unix_time"], unit='s')
messagedf.dtypes

mid                   int64
filename             object
unix_time             int64
subject              object
from_eid              int64
date         datetime64[ns]
dtype: object

In [45]:
print min(messagedf.date)
print max(messagedf.date)

1998-11-13 04:07:00
2002-06-21 13:37:34


In [85]:
rec_by_message = pd.pivot_table(recipientdf, 
                                index=['mid'], 
                                values=["to_eid"], 
                                aggfunc=len
                               )
print "Messages that have only one recipient: ", len(rec_by_message[rec_by_message.to_eid == 1])
print "Messages that have 5 or more recipients: ", len(rec_by_message[rec_by_message.to_eid >= 5])
print "Highest number of recipients: ", max(rec_by_message.to_eid)

Messages that have only one recipient:  14985
Messages that have 5 or more recipients:  1380
Highest number of recipients:  57


In [93]:
messageid = rec_by_message.loc[rec_by_message.to_eid == max(rec_by_message.to_eid)].index.values[0]

pd.read_sql('SELECT from_eid FROM MessageBase WHERE mid=' + str(messageid), con=conn)
pd.read_sql('SELECT name FROM EmployeeBase WHERE eid=67', con=conn)

Unnamed: 0,name
0,John J. Lavorato


In [82]:
messageid = rec_by_message.loc[rec_by_message.to_eid == max(rec_by_message.to_eid)].index.values[1]
pd.read_sql('SELECT from_eid FROM MessageBase WHERE mid=' + str(messageid), con=conn)
pd.read_sql('SELECT name FROM EmployeeBase WHERE eid=67', con=conn)

Unnamed: 0,name
0,John J. Lavorato


In [100]:
from bokeh.charts import Histogram, show, output_notebook

hist = Histogram(rec_by_message, values='to_eid', color='green',
                 title="Distribution of Recipient Numbers")

output_notebook()
show(hist)

Rescale to investigate the tail of the curve

## 3. Data Merging

Use the pandas merge function to combine the information in the 3 dataframes to answer the following questions:

1. Are there more Men or Women employees?
- How is gender distributed across departments?
- Who is sending more emails? Men or Women?
- What's the average number of emails sent by each gender?
- Are there more Juniors or Seniors?
- Who is sending more emails? Juniors or Seniors?
- Which department is sending more emails? How does that relate with the number of employees in the department?
- Who are the top 3 senders of emails? (people who sent out the most emails)

Answer the following questions regarding received messages:

- Who is receiving more emails? Men or Women?
- Who is receiving more emails? Juniors or Seniors?
- Which department is receiving more emails? How does that relate with the number of employees in the department?
- Who are the top 5 receivers of emails? (people who received the most emails)

Which employees sent the most 'mass' emails?

Keep exploring the dataset, which other questions would you ask?

Work in pairs. Give each other a challenge and try to solve it.