# Multi-table Datasets - ENRON Archive

## 1. Data import

Connect to the file 'assets/datasets/enron.db' using one of these methods:

- sqlite3 python package
- pandas.read_sql
- SQLite Manager Firefox extension

Take a look at the database and query the `sqlite_master` table. How many tables are there in the database?

In [58]:
import sqlite3
import pandas as pd
conn = sqlite3.connect('data/enron.db') 
c = conn.cursor()

c.execute("SELECT name FROM sqlite_master WHERE type='table';").fetchall()
# three tables

[(u'MessageBase',), (u'RecipientBase',), (u'EmployeeBase',)]
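
The lookup above relies on `sqlite_master`, the catalogue SQLite keeps of every table and view. Below is a minimal sketch of the same idea, written for Python 3 (an assumption, since the notebook itself uses Python 2) and run against a throwaway in-memory database so it does not need `enron.db`. The names `DemoTable` and `DemoView` are made up for illustration.

```python
import sqlite3

# In-memory database standing in for enron.db (hypothetical contents)
conn_demo = sqlite3.connect(':memory:')
cur = conn_demo.cursor()
cur.execute("CREATE TABLE DemoTable (eid INTEGER, name TEXT)")
cur.execute("CREATE VIEW DemoView AS SELECT eid FROM DemoTable")

# sqlite_master holds one row per table, view, index and trigger;
# filtering on type separates tables from views
tables = [row[0] for row in cur.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
views = [row[0] for row in cur.execute(
    "SELECT name FROM sqlite_master WHERE type='view' ORDER BY name")]

print(len(tables), tables)
print(len(views), views)
```

Filtering on `type='view'` in the same way would list the `Employee` and `EmployeeWithVars` views that show up in the full `sqlite_master` dump.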

Query the `sqlite_master` table to retrieve the schema of the `EmployeeBase` table.

1. What fields are there?
1. What's the type of each of them?

In [59]:
fetch = c.execute("SELECT * from sqlite_master;").fetchall()
for row in fetch:
    print row
    print

(u'view', u'Employee', u'Employee', 0, u"CREATE VIEW Employee AS\nSELECT\n    eid,\n    name,\n    longdepartment,\n    title,\n\n    gender,\n    CASE gender\n        WHEN 'Female' THEN 1\n        ELSE 0\n    END AS genF,\n\n    seniority,\n    CASE seniority\n        WHEN 'Junior' THEN 1\n        ELSE 0\n    END AS senJ,\n\n    department,\n    CASE department\n        WHEN 'Legal' THEN 1\n        ELSE 0\n    END AS depL,\n    CASE department\n        WHEN 'Trading' THEN 1\n        ELSE 0\n    END AS depT\nFROM\n    EmployeeBase")

(u'view', u'EmployeeWithVars', u'EmployeeWithVars', 0, u'CREATE VIEW EmployeeWithVars AS\nSELECT\n    eid,\n    1 AS intercept,\n    genF,\n    senJ,\n    depL,\n    depT,\n    genF * senJ AS genF_senJ,\n    genF * depL AS genF_depL,\n    genF * depT AS genF_depT,\n    senJ * depL AS senJ_depL,\n    senJ * depT AS senJ_depT,\n    genF * senJ * depL AS genF_senJ_depL,\n    genF * senJ * depT AS genF_senJ_depT\nFROM\n    Employee')

(u'table', u'MessageBase'

In [60]:
fields = c.execute("SELECT sql from sqlite_master WHERE type='table' and name='EmployeeBase';").fetchall()
print ''.join(fields[0])
print
fields = c.execute("SELECT sql from sqlite_master WHERE type='table' and name='MessageBase';").fetchall()
print ''.join(fields[0])
print
fields = c.execute("SELECT sql from sqlite_master WHERE type='table' and name='RecipientBase';").fetchall()
print ''.join(fields[0])

CREATE TABLE EmployeeBase (
                  [eid] INTEGER,
  [name] TEXT,
  [department] TEXT,
  [longdepartment] TEXT,
  [title] TEXT,
  [gender] TEXT,
  [seniority] TEXT
                  
                  )

CREATE TABLE MessageBase (
    mid INTEGER,
    filename TEXT,
    unix_time INTEGER,
    subject TEXT,
    from_eid INTEGER,
    
    PRIMARY KEY(mid ASC),
    FOREIGN KEY(from_eid) REFERENCES Employee(eid)
)

CREATE TABLE RecipientBase (
    mid INTEGER,
    rno INTEGER,
    to_eid INTEGER,
    
    PRIMARY KEY(mid ASC, rno ASC)
    FOREIGN KEY(mid) REFERENCES Message(mid)
    FOREIGN KEY(to_eid) REFERENCES Employee(eid)
)
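
Reading the `CREATE TABLE` text answers both questions, but the field names and types can also be pulled out programmatically. The sketch below uses SQLite's `PRAGMA table_info` on an empty in-memory copy of the `EmployeeBase` layout. It is written for Python 3, and the connection name `conn_demo` is an assumption made for this example.

```python
import sqlite3

conn_demo = sqlite3.connect(':memory:')
# Same column layout as the EmployeeBase schema printed above, with no rows
conn_demo.execute("""
    CREATE TABLE EmployeeBase (
        eid INTEGER, name TEXT, department TEXT, longdepartment TEXT,
        title TEXT, gender TEXT, seniority TEXT
    )""")

# Each PRAGMA table_info row is (cid, name, type, notnull, dflt_value, pk)
columns = [(row[1], row[2])
           for row in conn_demo.execute("PRAGMA table_info(EmployeeBase)")]
for name, col_type in columns:
    print(name, col_type)
```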


1. Print the first 5 rows of EmployeeBase table
1. Print the first 5 rows of MessageBase table
1. Print the first 5 rows of RecipientBase table

In [61]:
results = c.execute("SELECT * FROM EmployeeBase LIMIT 5;").fetchall()
for row in results:
    print row

(1, u'John Arnold', u'Forestry', u'ENA Gas Financial', u'VP Trading', u'Male', u'Senior')
(2, u'Harry Arora', u'Forestry', u'ENA East Power', u'VP Trading', u'Male', u'Senior')
(3, u'Robert Badeer', u'Forestry', u'ENA West Power', u'Mgr Trading', u'Male', u'Junior')
(4, u'Susan Bailey', u'Legal', u'ENA Legal', u'Specialist Legal', u'Female', u'Junior')
(5, u'Eric Bass', u'Forestry', u'ENA Gas Texas', u'Trader', u'Male', u'Junior')


In [62]:
# alternate: from pandas.io import sql
pd.read_sql('SELECT * FROM EmployeeBase LIMIT 5', conn)

Unnamed: 0,eid,name,department,longdepartment,title,gender,seniority
0,1,John Arnold,Forestry,ENA Gas Financial,VP Trading,Male,Senior
1,2,Harry Arora,Forestry,ENA East Power,VP Trading,Male,Senior
2,3,Robert Badeer,Forestry,ENA West Power,Mgr Trading,Male,Junior
3,4,Susan Bailey,Legal,ENA Legal,Specialist Legal,Female,Junior
4,5,Eric Bass,Forestry,ENA Gas Texas,Trader,Male,Junior


In [63]:
results = c.execute("SELECT * FROM MessageBase LIMIT 5;").fetchall()
for row in results:
    print row

(1, u'taylor-m/sent/11', 910930020, u'Cd$ CME letter', 138)
(2, u'taylor-m/sent/17', 911459940, u'Indemnification', 138)
(3, u'taylor-m/sent/18', 911463840, u'Re: Indemnification', 138)
(4, u'taylor-m/sent/23', 911874180, u'Re: Coral Energy, L.P.', 138)
(5, u'taylor-m/sent/27', 912396120, u'Bankruptcy Code revisions', 138)


In [64]:
results = c.execute("SELECT * FROM RecipientBase LIMIT 10;").fetchall()
for row in results:
    print row
    
# The first field is message id, the second is recipient number, and the third is the id of the recipient.
# mid, rno, to_eid

(1, 1, 59)
(2, 1, 15)
(3, 1, 15)
(4, 1, 109)
(4, 2, 49)
(4, 3, 120)
(4, 4, 59)
(5, 1, 45)
(5, 2, 53)
(6, 1, 113)


Import each of the 3 tables into a pandas DataFrame.

In [65]:
employees = pd.read_sql("SELECT * FROM EmployeeBase;", conn)
recipients = pd.read_sql("SELECT * FROM RecipientBase;", conn)
messages = pd.read_sql("SELECT * FROM MessageBase;", conn)
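
With more tables, the three calls above could be replaced by a loop over the names found in `sqlite_master`. A minimal sketch under stated assumptions: Python 3, a small in-memory database with made-up rows, and a dictionary `frames` (a name introduced here) that maps each table name to its DataFrame.

```python
import sqlite3
import pandas as pd

# Throwaway in-memory database with two tiny tables (made-up rows)
conn_demo = sqlite3.connect(':memory:')
conn_demo.execute("CREATE TABLE EmployeeBase (eid INTEGER, name TEXT)")
conn_demo.execute("CREATE TABLE MessageBase (mid INTEGER, from_eid INTEGER)")
conn_demo.executemany("INSERT INTO EmployeeBase VALUES (?, ?)",
                      [(1, 'A'), (2, 'B')])
conn_demo.executemany("INSERT INTO MessageBase VALUES (?, ?)",
                      [(10, 1), (11, 1), (12, 2)])

# Discover the table names, then load each one into its own DataFrame
names = [row[0] for row in conn_demo.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
frames = {name: pd.read_sql('SELECT * FROM "{}"'.format(name), conn_demo)
          for name in names}

for name, frame in frames.items():
    print(name, frame.shape)
```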

In [66]:
recipients.head(10)

Unnamed: 0,mid,rno,to_eid
0,1,1,59
1,2,1,15
2,3,1,15
3,4,1,109
4,4,2,49
5,4,3,120
6,4,4,59
7,5,1,45
8,5,2,53
9,6,1,113


In [67]:
recipients.mid.value_counts()

12116    57
12151    57
12140    55
14404    52
16035    49
16431    24
8116     22
15577    21
15148    21
21103    20
21047    19
19671    18
10087    17
6495     17
19028    16
21628    16
12584    16
14194    16
18305    16
11492    16
14547    16
11516    16
14457    16
15584    15
13770    15
15121    15
15219    15
8970     15
15116    15
1355     15
         ..
2930      1
17243     1
21337     1
9047      1
11094     1
13141     1
15188     1
2898      1
4945      1
6992      1
17211     1
19258     1
21305     1
11062     1
17147     1
13109     1
15156     1
2866      1
4913      1
17179     1
19226     1
21273     1
11030     1
13077     1
15124     1
787       1
2834      1
4881      1
6928      1
2049      1
Name: mid, dtype: int64

## 2. Data Exploration

Use the 3 dataframes to answer the following questions:

1. How many employees are there in the company?
1. How many messages are there in the database?
1. Convert the timestamp column in the messages. When was the oldest message sent? And the newest?
1. Some messages are sent to more than one recipient. Group the recipients by message id (`mid`) and count the number of recipients per message. Then look at the distribution of recipient counts.
    - How many messages have only one recipient?
    - How many messages have >= 5 recipients?
    - What's the highest number of recipients?
    - Who sent the message with the highest number of recipients?
1. Plot the distribution of recipient counts using Bokeh.

In [68]:
len(employees)

156

In [69]:
len(messages)

21635

In [70]:
datetimes = messages['unix_time'].apply(pd.datetime.fromtimestamp)
print "first msg was sent on:", min(datetimes)
print "last msg was sent on:", max(datetimes)

first msg was sent on: 1998-11-12 23:07:00
last msg was sent on: 2002-06-21 09:37:34
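
`fromtimestamp` interprets each value in the local timezone of the machine running the notebook, so the printed times depend on where the code runs. As a hedged alternative, the sketch below converts a whole column at once with `pd.to_datetime(..., unit='s')`, which treats the integers as seconds since the epoch in UTC. The five values are the `unix_time` entries from the `MessageBase` sample rows shown earlier. The code is written for Python 3.

```python
import pandas as pd

# unix_time values copied from the first five MessageBase rows
unix_time = pd.Series([910930020, 911459940, 911463840, 911874180, 912396120])

# unit='s' tells pandas the integers are seconds since 1970-01-01 UTC
datetimes = pd.to_datetime(unix_time, unit='s')

oldest = datetimes.min()
newest = datetimes.max()
print("oldest:", oldest)
print("newest:", newest)
```

For the first sample row this gives 04:07 UTC on 13 November 1998, five hours later than the 23:07 printed above. If that row is indeed the oldest message, the notebook was probably run on a machine set to UTC-5.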


In [71]:
from collections import Counter

In [72]:
#counts = recipients.groupby('mid')['to_eid'].count().value_counts()
counts = Counter(recipients.groupby('mid')['to_eid'].count())
counts

Counter({1: 14985,
         2: 2962,
         3: 1435,
         4: 873,
         5: 711,
         6: 180,
         7: 176,
         8: 61,
         9: 24,
         10: 29,
         11: 47,
         12: 33,
         13: 57,
         14: 11,
         15: 28,
         16: 9,
         17: 2,
         18: 1,
         19: 1,
         20: 1,
         21: 2,
         22: 1,
         24: 1,
         49: 1,
         52: 1,
         55: 1,
         57: 2})
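
The sub-questions about the distribution can be read off this counter directly. A minimal sketch in Python 3 that copies the counter printed above and derives the answers; the variable names are introduced here for illustration.

```python
from collections import Counter

# Distribution copied from the output above: recipients per message -> messages
counts = Counter({1: 14985, 2: 2962, 3: 1435, 4: 873, 5: 711, 6: 180, 7: 176,
                  8: 61, 9: 24, 10: 29, 11: 47, 12: 33, 13: 57, 14: 11,
                  15: 28, 16: 9, 17: 2, 18: 1, 19: 1, 20: 1, 21: 2, 22: 1,
                  24: 1, 49: 1, 52: 1, 55: 1, 57: 2})

total_messages = sum(counts.values())    # sanity check against len(messages)
single_recipient = counts[1]             # messages with exactly one recipient
five_or_more = sum(n for k, n in counts.items() if k >= 5)
max_recipients = max(counts)             # largest recipient count observed

print(total_messages, single_recipient, five_or_more, max_recipients)
```

The total matches `len(messages)` (21635), which is a quick check that every message has at least one row in `RecipientBase`.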

In [73]:
from bokeh.plotting import figure,show,output_notebook
output_notebook()

In [74]:
x = [i[0] for i in counts.most_common()]
y = [i[1] for i in counts.most_common()]
left_border = [val-0.5 for val in x]
right_border = [val+0.5 for val in x]


p= figure(title="Message Recipients",tools='',x_axis_label='# of recipients',y_axis_label='Counts')
p.quad(top=y,left=left_border,right=right_border,bottom=0,line_color='black')
show(p)

Drop the five most common recipient counts and replot to investigate the tail of the distribution.

In [75]:
x = [i[0] for i in counts.most_common()[5:]]  # chop off the first 5
y = [i[1] for i in counts.most_common()[5:]]  # chop off the first 5
left_border = [val-0.5 for val in x]
right_border = [val+0.5 for val in x]

p= figure(title="Message Recipients",tools='',x_axis_label='# of recipients',y_axis_label='Counts')
p.quad(top=y,left=left_border,right=right_border,bottom=0,line_color='black')
show(p)

In [76]:
right_border

[6.5,
 7.5,
 8.5,
 13.5,
 11.5,
 12.5,
 10.5,
 15.5,
 9.5,
 14.5,
 16.5,
 17.5,
 21.5,
 57.5,
 18.5,
 19.5,
 20.5,
 22.5,
 24.5,
 49.5,
 52.5,
 55.5]
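
The borders above are not in ascending order because `Counter.most_common()` sorts by frequency, not by recipient number, so slicing `[5:]` drops the five most frequent values rather than the five smallest ones. A sketch of an alternative that selects the tail by recipient number instead, using a toy counter with made-up values and a hypothetical `threshold` variable, in Python 3:

```python
from collections import Counter

# Toy distribution (hypothetical numbers): recipients per message -> messages
counts = Counter({1: 50, 2: 20, 3: 10, 13: 7, 8: 4, 57: 2})

by_frequency = [k for k, _ in counts.most_common()]  # ordered by count, descending
by_recipients = sorted(counts)                       # ordered by recipient number

# Keep only the tail: messages with more than `threshold` recipients
threshold = 3
x = [k for k in by_recipients if k > threshold]
y = [counts[k] for k in x]
left_border = [k - 0.5 for k in x]
right_border = [k + 0.5 for k in x]

print(by_frequency)
print(by_recipients)
print(right_border)
```

The resulting `x`, `y`, `left_border` and `right_border` lists can be passed to `p.quad` exactly as in the cells above.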

## 3. Data Merging

Use the pandas merge function to combine the information in the 3 dataframes to answer the following questions:

1. Are there more men or women among the employees?
1. How is gender distributed across departments?
1. Who is sending more emails? Men or women?
1. What's the average number of emails sent by each gender?
1. Are there more Juniors or Seniors?
1. Who is sending more emails? Juniors or Seniors?
1. Which department is sending more emails? How does that relate to the number of employees in the department?
1. Who are the top 5 senders of emails? (people who sent out the most emails)

In [77]:
employees.gender.value_counts()

Male      113
Female     43
Name: gender, dtype: int64

In [78]:
pd.read_sql('select gender, count(eid) from EmployeeBase group by gender', conn)

Unnamed: 0,gender,count(eid)
0,Female,43
1,Male,113


There are more men than women: 113 vs. 43.

In [79]:
employees.gender.value_counts() / employees.gender.count()

Male      0.724359
Female    0.275641
Name: gender, dtype: float64

In [80]:
# How is gender distributed across departments?
employees.groupby('department')['gender'].value_counts() / employees.groupby('department')['gender'].count()

department  gender
Forestry    Male      0.833333
            Female    0.166667
Legal       Female    0.520000
            Male      0.480000
Other       Male      0.718310
            Female    0.281690
dtype: float64

    Forestry 83% Male
    Legal    48% Male
    Other    72% Male
    Company  72% Male
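
The same per-department shares can be laid out as a table with `pd.crosstab`. A minimal sketch in Python 3 on a made-up `employees_demo` frame (the rows are hypothetical and only mimic the two relevant columns):

```python
import pandas as pd

# Made-up employees; only the two columns needed for the breakdown
employees_demo = pd.DataFrame({
    'department': ['Legal', 'Legal', 'Other', 'Other', 'Other', 'Other'],
    'gender':     ['Female', 'Male', 'Male', 'Male', 'Male', 'Female'],
})

# normalize='index' divides each row by its total, giving shares per department
shares = pd.crosstab(employees_demo['department'], employees_demo['gender'],
                     normalize='index')
print(shares)
```

Each row of `shares` sums to 1, so departments of different sizes can be compared side by side.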

In [81]:
# Who is sending more emails? Men or Women?
df = pd.merge(employees, messages, left_on='eid', right_on='from_eid')
df.gender.value_counts() / df.gender.count()

Male      0.593529
Female    0.406471
Name: gender, dtype: float64

In [82]:
# cross-check the pandas result with an equivalent SQL query
pd.read_sql('''SELECT count(e.gender) Gender_Count 
            FROM EmployeeBase AS e
            JOIN MessageBase m
            ON e.eid = m.from_eid 
            group by gender ''', conn)

Unnamed: 0,Gender_Count
0,8794
1,12841
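
The query above returns the two counts without saying which gender each row belongs to. Comparing with the pandas shares (8794 / 21635 is about 0.406) suggests the first row is Female. A sketch of a variant that selects the grouping column as well, run in Python 3 on an in-memory database with made-up rows; `msgs_sent` is an alias chosen for this example.

```python
import sqlite3

conn_demo = sqlite3.connect(':memory:')
conn_demo.execute("CREATE TABLE EmployeeBase (eid INTEGER, gender TEXT)")
conn_demo.execute("CREATE TABLE MessageBase (mid INTEGER, from_eid INTEGER)")
conn_demo.executemany("INSERT INTO EmployeeBase VALUES (?, ?)",
                      [(1, 'Male'), (2, 'Female')])
conn_demo.executemany("INSERT INTO MessageBase VALUES (?, ?)",
                      [(10, 1), (11, 2), (12, 2)])

# Selecting the grouping column labels each count with its gender
rows = conn_demo.execute("""
    SELECT e.gender, COUNT(*) AS msgs_sent
    FROM EmployeeBase AS e
    JOIN MessageBase AS m ON e.eid = m.from_eid
    GROUP BY e.gender
    ORDER BY e.gender
""").fetchall()
print(rows)
```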


In [83]:
# What's the average number of emails sent by each gender?
df.gender.value_counts() / employees.gender.value_counts()

Male      113.637168
Female    204.511628
Name: gender, dtype: float64

On average, women sent almost twice as many messages as men (about 205 vs. 114 per employee).
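
The per-gender average is the number of messages sent by each gender divided by the number of employees of that gender. The inner merge drops employees who never sent a message, so the headcount has to come from the employee table rather than from the merged frame. A minimal sketch in Python 3 with made-up data (the `*_demo` frames are hypothetical):

```python
import pandas as pd

# Employee 3 never sends a message, so the inner merge drops that row
employees_demo = pd.DataFrame({'eid': [1, 2, 3],
                               'gender': ['Male', 'Female', 'Male']})
messages_demo = pd.DataFrame({'mid': [10, 11, 12, 13],
                              'from_eid': [1, 2, 2, 2]})

df_demo = pd.merge(employees_demo, messages_demo,
                   left_on='eid', right_on='from_eid')

sent = df_demo['gender'].value_counts()              # messages sent per gender
headcount = employees_demo['gender'].value_counts()  # all employees per gender

# Division aligns on the gender labels, not on row order
avg_sent = sent / headcount
print(avg_sent)
```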

In [84]:
employees.seniority.value_counts()

Junior    82
Senior    74
Name: seniority, dtype: int64

In [85]:
df.seniority.value_counts()

Senior    12439
Junior     9196
Name: seniority, dtype: int64

In [86]:
df.seniority.value_counts() / employees.seniority.value_counts()

Junior    112.146341
Senior    168.094595
Name: seniority, dtype: float64


Senior employees send more messages, both in absolute terms and on average.

In [88]:
# Which department is sending more emails? How does that relate with the number of employees in the department?
df.department.value_counts()

Legal       10396
Other        6852
Forestry     4387
Name: department, dtype: int64

In [89]:
df.department.value_counts() / employees.department.value_counts()

Forestry     73.116667
Legal       415.840000
Other        96.507042
Name: department, dtype: float64

Legal sends far more messages than the other departments, both in total and per employee (about 416 per employee, versus 97 for Other and 73 for Forestry).

In [90]:
# Who are the top 5 senders of emails? (people who sent out the most emails)
top5senders = df.eid.value_counts().head().reset_index()
top5senders.columns = ['eid', 'msgs_sent']
top5senders

Unnamed: 0,eid,msgs_sent
0,20,1597
1,59,1379
2,120,1142
3,131,859
4,138,658


In [91]:
pd.merge(employees, top5senders, on='eid')

Unnamed: 0,eid,name,department,longdepartment,title,gender,seniority,msgs_sent
0,20,Jeff Dasovich,Legal,Regulatory and Government Affairs,Director,Male,Senior,1597
1,59,Tana Jones,Legal,ENA Legal,Specialist Legal,Female,Junior,1379
2,120,Sara Shackleton,Legal,ENA Legal,Gen Cnsl Asst,Female,Junior,1142
3,131,James D. Steffes,Legal,Regulatory and Government Affairs,VP of Government Affairs,Male,Senior,859
4,138,Mark E. Taylor,Legal,ENA Legal,VP & Gen Cnsl,Male,Senior,658


## 3.b (Optional) More merging

Answer the following questions regarding received messages:

- Who is receiving more emails? Men or Women?
- Who is receiving more emails? Juniors or Seniors?
- Which department is receiving more emails? How does that relate with the number of employees in the department?
- Who are the top 5 receivers of emails? (people who received the most emails)

In [92]:
# Who is receiving more emails? Men or Women?

In [93]:
df1 = pd.merge(df, recipients, on='mid')
df2 = pd.merge(df1, employees, left_on='to_eid', right_on='eid')
df2.head()

Unnamed: 0,eid_x,name_x,department_x,longdepartment_x,title_x,gender_x,seniority_x,mid,filename,unix_time,...,from_eid,rno,to_eid,eid_y,name_y,department_y,longdepartment_y,title_y,gender_y,seniority_y
0,1,John Arnold,Forestry,ENA Gas Financial,VP Trading,Male,Senior,1611,arnold-j/sent/379,954317280,...,1,1,42,42,John Griffith,Forestry,ENA Gas Financial,Mgr Trading,Male,Junior
1,1,John Arnold,Forestry,ENA Gas Financial,VP Trading,Male,Senior,4828,arnold-j/sent/151,970463160,...,1,3,42,42,John Griffith,Forestry,ENA Gas Financial,Mgr Trading,Male,Junior
2,1,John Arnold,Forestry,ENA Gas Financial,VP Trading,Male,Senior,5026,arnold-j/sent/132,971078940,...,1,1,42,42,John Griffith,Forestry,ENA Gas Financial,Mgr Trading,Male,Junior
3,1,John Arnold,Forestry,ENA Gas Financial,VP Trading,Male,Senior,7579,arnold-j/sent/774,978509400,...,1,1,42,42,John Griffith,Forestry,ENA Gas Financial,Mgr Trading,Male,Junior
4,1,John Arnold,Forestry,ENA Gas Financial,VP Trading,Male,Senior,10581,arnold-j/sent/526,986536620,...,1,1,42,42,John Griffith,Forestry,ENA Gas Financial,Mgr Trading,Male,Junior
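
In the merged frame, `_x` columns describe the sender and `_y` columns the recipient, because both sides come from the same `employees` table. A hedged sketch of the same chain of merges with explicit `suffixes`, which makes the column names self-describing. It is written for Python 3, and the frames and names (`Ann`, `Bob`) are made up.

```python
import pandas as pd

employees_demo = pd.DataFrame({'eid': [1, 2], 'name': ['Ann', 'Bob'],
                               'gender': ['Female', 'Male']})
messages_demo = pd.DataFrame({'mid': [10, 11], 'from_eid': [1, 2]})
recipients_demo = pd.DataFrame({'mid': [10, 10, 11], 'rno': [1, 2, 1],
                                'to_eid': [2, 1, 1]})

# Sender side: attach employee attributes to each message
sent = pd.merge(employees_demo, messages_demo,
                left_on='eid', right_on='from_eid')
# One row per (message, recipient) pair
sent_to = pd.merge(sent, recipients_demo, on='mid')
# Recipient side: overlapping column names get explicit suffixes
full = pd.merge(sent_to, employees_demo, left_on='to_eid', right_on='eid',
                suffixes=('_sender', '_recipient'))

print(full[['mid', 'name_sender', 'name_recipient']])
```

With this naming, a received-message breakdown reads as `full['gender_recipient'].value_counts()` instead of relying on the `_y` convention.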


In [94]:
df2.gender_y.value_counts() / df2.gender_y.count()

Male      0.665547
Female    0.334453
Name: gender_y, dtype: float64

In [95]:
df2.gender_y.value_counts() / employees.gender.value_counts()

Male      226.097345
Female    298.581395
dtype: float64

In [96]:
# Who is receiving more emails? Juniors or Seniors?
df2.seniority_y.value_counts() / df2.seniority_y.count()

Senior    0.623476
Junior    0.376524
Name: seniority_y, dtype: float64

In [97]:
df2.seniority_y.value_counts() / employees.seniority.value_counts()

Junior    176.268293
Senior    323.432432
dtype: float64

In [98]:
# Which department is receiving more emails? How does that relate with the number of employees in the department?
df2.department_y.value_counts()

Legal       16311
Other       13653
Forestry     8424
Name: department_y, dtype: int64

In [99]:
df2.department_y.value_counts() / employees.department.value_counts()

Forestry    140.400000
Legal       652.440000
Other       192.295775
dtype: float64

In [100]:
# Who are the top 5 receivers of emails? (people who received the most emails)
top5receivers = df2.to_eid.value_counts().head().reset_index()
top5receivers.columns = ['eid', 'msgs_received']
top5receivers

Unnamed: 0,eid,msgs_received
0,131,1797
1,122,1730
2,138,1477
3,61,1290
4,120,1173


In [101]:
pd.merge(employees, top5receivers, on='eid').sort_values('msgs_received', ascending=False)

Unnamed: 0,eid,name,department,longdepartment,title,gender,seniority,msgs_received
3,131,James D. Steffes,Legal,Regulatory and Government Affairs,VP of Government Affairs,Male,Senior,1797
2,122,Richard Shapiro,Legal,Regulatory and Government Affairs,VP of Regulatory Affairs,Male,Senior,1730
4,138,Mark E. Taylor,Legal,ENA Legal,VP & Gen Cnsl,Male,Senior,1477
0,61,Steven J. Kean,Other,Enron,VP & Chief of Staff,Male,Senior,1290
1,120,Sara Shackleton,Legal,ENA Legal,Gen Cnsl Asst,Female,Junior,1173


Which employees sent the most 'mass' emails?

In [102]:
top10massmails = df2.groupby('mid')['rno'].max().sort_values(ascending=False).head(10).reset_index()
massmails = pd.merge(top10massmails, messages)
massmails

Unnamed: 0,mid,rno,filename,unix_time,subject,from_eid
0,12151,57,baughman-d/ect_admin/22,990546780,,67
1,12116,57,baughman-d/all_documents/398,990510780,,67
2,12140,55,lavorato-j/sent_items/18,990528836,,67
3,14404,52,lay-k/sent_items/10,998565865,Associate/Analyst Program,68
4,16035,49,beck-s/sent_items/368,1002290637,Enron Center South (ECS) Move Back-up Plan,7
5,16431,24,sanchez-m/sent_items/99,1002893728,Park City Bound,112
6,8116,22,wolfe-j/all_documents/32,980396100,7th Annual Party,112
7,15148,21,baughman-d/enron_power/miso/19,1000822727,RTO/regulatory update,117
8,15577,21,kitchen-l/sent_items/990,1001571865,FW: Fantastic Friday/Super Saturday Interviewers,65
9,21103,20,mckay-j/ubswenergy_com/3,1013408919,book names,63


In [103]:
pd.merge(massmails, employees, left_on='from_eid', right_on='eid')[['name', 'title', 'mid', 'subject', 'rno']]

Unnamed: 0,name,title,mid,subject,rno
0,John J. Lavorato,ENA President & CEO,12151,,57
1,John J. Lavorato,ENA President & CEO,12116,,57
2,John J. Lavorato,ENA President & CEO,12140,,55
3,Kenneth Lay,President & CEO,14404,Associate/Analyst Program,52
4,Sally Beck,VP,16035,Enron Center South (ECS) Move Back-up Plan,49
5,Monique Sanchez,Associate,16431,Park City Bound,24
6,Monique Sanchez,Associate,8116,7th Annual Party,22
7,Susan Scott,Cnsl,15148,RTO/regulatory update,21
8,Louise Kitchen,COO,15577,FW: Fantastic Friday/Super Saturday Interviewers,21
9,Kam Keiser,Mgr Trading,21103,book names,20


Keep exploring the dataset. Which other questions would you ask?

Work in pairs. Give each other a challenge and try to solve it.