# Get Data

This notebook:

- downloads the raw data.
- extracts it.
- shuffles it.
- splits it into train, test and holdout sets (80/10/10), each with a "small" sample that is easier to start working with.
- saves the sets as CSV files.
- can be used to eyeball the data a little.

In [1]:
import zipfile
import pandas as pd

Download the data from Kaggle into the data folder, then split and sample it.

In [2]:
# params
dataset_name = "wcukierski/enron-email-dataset"
n_small = 10000

In [3]:
# download using the Kaggle CLI (needs Kaggle API credentials configured)
!kaggle datasets download -d {dataset_name} -p ./data

enron-email-dataset.zip: Skipping, found more recently modified local copy (use --force to force download)


In [4]:
# unzip the data using python
with zipfile.ZipFile("./data/enron-email-dataset.zip", 'r') as zip_ref:
    zip_ref.extractall("./data/")
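Before extracting a large archive, it can help to check what it contains. The following is a minimal sketch using only the standard library; `list_archive` and the throwaway `demo.zip` are hypothetical names introduced for illustration, and the demo archive stands in for the real download so the snippet runs on its own.

```python
import tempfile
import zipfile
from pathlib import Path


def list_archive(zip_path):
    """Return (name, uncompressed size in bytes) for each archive member."""
    with zipfile.ZipFile(zip_path, "r") as zf:
        return [(info.filename, info.file_size) for info in zf.infolist()]


# Demo on a throwaway archive so the sketch runs without the real download.
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / "demo.zip"
    with zipfile.ZipFile(demo_zip, "w") as zf:
        zf.writestr("emails.csv", "file,message\n")
    members = list_archive(demo_zip)
    print(members)
```

Pointing `list_archive` at `./data/enron-email-dataset.zip` would show the member names and sizes before anything is written to disk.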

In [5]:
# read data
df = pd.read_csv("./data/emails.csv")
print(df.shape)

(517401, 2)
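With roughly half a million rows, loading the full file just to check its layout can be slow. A hedged sketch of a cheaper first look, using the `nrows` argument of `pd.read_csv`: the in-memory `csv_text` below is a made-up stand-in for `./data/emails.csv`, and the `expected` column set is taken from the `file` and `message` columns used later in this notebook.

```python
import io

import pandas as pd

# Made-up stand-in for ./data/emails.csv with the same two columns.
csv_text = "file,message\na/1.,first\nb/2.,second\nc/3.,third\n"

# Read only the first two data rows instead of the whole file.
preview = pd.read_csv(io.StringIO(csv_text), nrows=2)

# Check that the columns the rest of the notebook relies on are present.
expected = {"file", "message"}
missing = expected - set(preview.columns)

print(preview.shape)
print(f"missing columns: {sorted(missing)}")
```

For the real data, replacing `io.StringIO(csv_text)` with the CSV path should give the same kind of preview.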


In [6]:
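# shuffle data; reset_index gives a fresh unique 0..n-1 index,
# which the index-based drop() calls below rely on
df = df.sample(frac=1).reset_index(drop=True)

# split into train, test and holdout (80/10/10)
df_train = df.sample(frac=0.8)
df_train_small = df_train.sample(n_small)
# everything not in train (20%) ...
df_test = df.drop(df_train.index)
# ... is halved: one half becomes holdout, the rest stays as test
df_holdout = df_test.sample(frac=0.5)
df_holdout_small = df_holdout.sample(n_small)
df_test = df_test.drop(df_holdout.index)
df_test_small = df_test.sample(n_small)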

# print shapes
print(f"df_train: {df_train.shape}")
print(f"df_train_small: {df_train_small.shape}")
print(f"df_test: {df_test.shape}")
print(f"df_test_small: {df_test_small.shape}")
print(f"df_holdout: {df_holdout.shape}")
print(f"df_holdout_small: {df_holdout_small.shape}")

# save data
df_train.to_csv("./data/emails_train.csv", index=False)
df_train_small.to_csv("./data/emails_train_small.csv", index=False)
df_test.to_csv("./data/emails_test.csv", index=False)
df_test_small.to_csv("./data/emails_test_small.csv", index=False)
df_holdout.to_csv("./data/emails_holdout.csv", index=False)
df_holdout_small.to_csv("./data/emails_holdout_small.csv", index=False)

df_train: (413921, 2)
df_train_small: (10000, 2)
df_test: (51740, 2)
df_test_small: (10000, 2)
df_holdout: (51740, 2)
df_holdout_small: (10000, 2)
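The split above is not seeded, so each run produces different sets. One possible way to make it repeatable is sketched below. This is an alternative, not the notebook's method: it shuffles once with a fixed `random_state` and slices by position instead of sampling and dropping. The helper `split_frame`, the `seed` value and the toy frame are all assumptions introduced for illustration.

```python
import pandas as pd


def split_frame(df, train_frac=0.8, seed=42):
    """Shuffle df once with a fixed seed, then slice it into
    disjoint train/test/holdout frames (the rest is halved)."""
    shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    n_train = round(len(shuffled) * train_frac)
    n_test = (len(shuffled) - n_train) // 2
    train = shuffled.iloc[:n_train]
    test = shuffled.iloc[n_train:n_train + n_test]
    holdout = shuffled.iloc[n_train + n_test:]
    return train, test, holdout


# Toy stand-in for emails.csv with the same two columns.
toy = pd.DataFrame({
    "file": [f"f{i}" for i in range(100)],
    "message": [f"m{i}" for i in range(100)],
})
train, test, holdout = split_frame(toy)

# The three parts should not share rows and should cover every row.
assert not set(train["file"]) & set(test["file"])
assert not set(train["file"]) & set(holdout["file"])
assert not set(test["file"]) & set(holdout["file"])
assert len(train) + len(test) + len(holdout) == len(toy)

print(len(train), len(test), len(holdout))  # 80 10 10
```

Calling `split_frame` again with the same seed returns the same three frames, which makes downstream results easier to compare across runs.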


In [7]:
# sample some data to eyeball it
df_sample = df_train_small.sample(5)
for _, row in df_sample.iterrows():
    print("="*100)
    print(f"file: {row['file']}")
    print(f"message:\n{row['message']}")
    print("="*100)

file: guzman-m/discussion_threads/796.
message:
Message-ID: <17904092.1075840657699.JavaMail.evans@thyme>
Date: Thu, 22 Feb 2001 09:34:00 -0800 (PST)
From: bill.iii@enron.com
To: portland.shift@enron.com
Subject: Group Meeting on Tuesday Feb 27th
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Bill Williams III
X-To: Portland Shift
X-cc: 
X-bcc: 
X-Folder: \mark guzman 6-28-02\Notes Folders\Discussion threads
X-Origin: GUZMAN-M
X-FileName: mark guzman 6-28-02.nsf

We will be having a group meeting at 4 pm on Feb 27th. The meeting will be to 
discuss strategies, identify issues, and set direction for this spring.
The meeting should last till 6...or 6:30.  Parking will be available in the 
garage below.

Pizza will be provided.

Bill
file: whalley-l/discussion_threads/735.
message:
Message-ID: <14927272.1075857999084.JavaMail.evans@thyme>
Date: Fri, 8 Dec 2000 02:20:00 -0800 (PST)
From: alhamd.alkhayat@enron.com
To: greg.whalley@enron.