# Summary

The following is the step taken to clean the datasets
- Removing duplicates based on 'raw_mail' column
- Change the "None" to an actuall None value to all column
- Remove email address that is not in the same format
- Fill empty/none email address with bfill and ffill
- Fill empty/none subject with bffill and ffill
- Update the 'date' so it is in one format
- Add malicious column

# import libraries

Load the libraries that are needed

In [1]:
import pandas as pd
from datetime import datetime
from dateutil import parser
import re

Load the datasets

In [10]:
fraudDataframe = pd.read_json('datasets/raw/fradulent_emails.json', orient='index')
phishingDataframe = pd.read_json('datasets/raw/phishing-chorpus.json', orient='index')
enronDataframe = pd.read_csv('datasets/raw/enron-emails.csv')

print("Total fraud emails:", len(fraudDataframe))
print("Total phishing emails:", len(phishingDataframe))
print("Total enron emails:", len(enronDataframe))


Total fraud emails: 3978
Total phishing emails: 4196
Total enron emails: 517401


Remove duplicates (if any)

In [11]:
fraudDataframe = fraudDataframe.drop_duplicates(subset="raw_mail")
phishingDataframe = phishingDataframe.drop_duplicates(subset="raw_mail")
enronDataframe = enronDataframe.drop_duplicates(subset="raw_mail")

print("Total fraud emails:", len(fraudDataframe))
print("Total phishing emails:", len(phishingDataframe))
print("Total enron emails:", len(enronDataframe))

Total fraud emails: 3939
Total phishing emails: 4190
Total enron emails: 517401


From a quick glance from all the 3 datasets, there are multiple inconsistensy that can be found in the format of the values. 

- from and to columns contains not only the emails
- datetime isn't in one format

In [13]:
fraudDataframe.sample(5)

Unnamed: 0,raw_mail,subject,from,to,status,date,body
3357,Return-Path: <mrschristinaholden4@tiscali.de>\...,"DEAR BELOVED,",MRS CHRISTINA HOLDEN <mrschristinaholden4@tisc...,undisclosed-recipients: ;,O,"Wed, 13 Dec 2006 02:59:10 +0100 (CET)","DEAR BELOVED, \n \nI am Mrs. Christina Holden...."
1921,Return-Path: <felixeze123@jubii.dk>\nX-Sieve: ...,URGENT REPLY NEEDED,"""felixeze123"" <felixeze123@jubii.dk>",felixeze123@jubii.dk,RO,"Mon, 31 Oct 2005 10:18:38 +0000 (GMT)","<html><head><style type=""text/css"">body{font:1..."
2565,Return-Path: <akamichel1_ci@yahoo.fr>\nX-Sieve...,Very Very Urgent,=?iso-8859-1?q?Aka=20MICHEL?= <akamichel1_ci@y...,aiweb06@cs.umbc.edu,RO,"Wed, 24 May 2006 10:02:27 +0000 (GMT)",J'ai une nouvelle adresse mailVous pouvez main...
2929,Return-Path: <g_mai55555@yahoo.co.uk>\nX-Sieve...,Mail From Mrs Angelo,,R@M,RO,,"Hello Sir/Madam, firstly I will like to introd..."
809,Message-Id: <200404302333.i3UNWpSK023538@aquad...,,"""Mr Bruno Williams"" <brunowiliams04@voila.fr>",webmaster@aclweb.org,RO,"Sat, 1 May 2004 01:33:05 +0200",Bank of Africa \nAvenue Jean-Paul II \n08 BP 0...


In [15]:
phishingDataframe.sample(5)

Unnamed: 0,raw_mail,subject,from,to,status,date,body
2162,Return-Path: <support@paypal.com>\nX-Original-...,IMPORTANT: Update your PayPal records,"""Paypal Inc."" <acc@paypal.com>",user@example.com,O,"Sat, 30 Sep 2006 18:31:21 +0500","<!DOCTYPE HTML PUBLIC ""-//W3C//DTD HTML 4.01 T..."
186,Return-Path: <nobody@guzel.sozler.com>\nX-Orig...,Account Review Team,PayPal <paypal@email.paypal.com>,username@domain.com,O,"Tue, 04 Oct 2005 12:14:40 +0300","\r\n<html>\r\n<head>\r\n<style type=""text/css""..."
2480,Return-Path: <root@ns.waiphra.com>\nX-Original...,IMPORTANT:Security Issues [Account Flagged],PayPal <service@email-paypal.com>,user@example.com,RO,"Mon, 22 Jan 2007 01:09:06 +0700","\r\n<html>\r\n<img src=""http://link.p0.com/1x1..."
2145,Return-Path: <Support@paypal.com>\nX-Original-...,Account Review. PayPal Team identified some un...,"""PayPal""<Support@paypal.com>",undisclosed-recipients:;,RO,"Wed, 27 Sep 2006 10:17:59 -0700","<html>\r\n\r\n<head>\r\n <style type=""text/cs..."
1910,Return-Path: <Update@paypal.com>\nX-Original-T...,Account Review.PayPal Team identified some unu...,"""PayPal Review"" <Update@paypal.com>","adam@example.com, dros@example.com, fern@examp...",,"Tue, 18 Apr 2006 11:08:06 -0300","<html>\r\n\r\n<head>\r\n<meta http-equiv=3D""Co..."


In [16]:
enronDataframe.sample(5)

Unnamed: 0,raw_mail,subject,from,to,status,date,body
155652,Message-ID: <3511365.1075859830818.JavaMail.ev...,eThink About It: 5/21/01,enron.announcements@enron.com,all.worldwide@enron.com,,"Sun, 20 May 2001 12:38:00 -0700 (PDT)",Got a question about the Building Guy? Ask Be...
128708,Message-ID: <6142735.1075853726105.JavaMail.ev...,Re: July Volume Request for C&I customers behi...,kdestep@columbiaenergygroup.com,chris.germany@enron.com,,"Thu, 6 Jul 2000 03:01:00 -0700 (PDT)",Please note that the 280 dth/day is going to 2...
69961,Message-ID: <207781.1075851651526.JavaMail.eva...,CA Gas-related,jeff.dasovich@enron.com,jeff.dasovich@enron.com,,"Tue, 2 Oct 2001 11:11:27 -0700 (PDT)",PG&E Gas Accord--$10K\nPUC proceeding to consi...
43325,Message-ID: <5963663.1075857865136.JavaMail.ev...,Southern California Edison Company,rhonda.denton@enron.com,"tim.belden@enron.com, dana.davis@enron.com, ge...",,"Fri, 17 Nov 2000 02:13:00 -0800 (PST)",We have received the executed Master Power Pur...
141842,Message-ID: <32045577.1075855218724.JavaMail.e...,Gas Indices,feedback@intcx.com,gasindex@list.intcx.com,,"Thu, 27 Dec 2001 10:00:07 -0800 (PST)",\n\n ...


In [20]:
print(fraudDataframe.isna().sum(), '\n') # contains none but isnt register as one
print(phishingDataframe.isna().sum(), '\n') # contains none but isnt register as one
print(enronDataframe.isna().sum()) # contains null values

raw_mail    0
subject     0
from        0
to          0
status      0
date        0
body        0
dtype: int64 

raw_mail    0
subject     0
from        0
to          0
status      0
date        0
body        0
dtype: int64 

raw_mail         0
subject      19187
from             0
to           21847
status      517401
date             0
body             0
dtype: int64


As it show above, only the enron datasets contains a null values while in fact all 3 datasets does contains a "null" values

If we check the values for a "None" in a string format we will infact found that the rest of the datasets does in fact contains a Null value 

In [21]:
print((fraudDataframe == "None").sum())
print((phishingDataframe == "None").sum())
print((enronDataframe == "None").sum())

raw_mail      0
subject      17
from        365
to          948
status        0
date        534
body          0
dtype: int64
raw_mail     0
subject     49
from         4
to           9
status       5
date         3
body         0
dtype: int64
raw_mail    0
subject     0
from        0
to          0
status      0
date        0
body        0
dtype: int64


Update the datasets to change the "None" values to an actual None, by applying a function to each column

In [22]:
def updateToNone(val):
    if val == "None":
        return None
    else:
        return val

fraudDataframe['subject'] = fraudDataframe['subject'].apply(updateToNone)
fraudDataframe['to'] = fraudDataframe['to'].apply(updateToNone)
fraudDataframe['from'] = fraudDataframe['to'].apply(updateToNone)
fraudDataframe['status'] = fraudDataframe['status'].apply(updateToNone)
fraudDataframe['date'] = fraudDataframe['date'].apply(updateToNone)

print((fraudDataframe == "None").sum())
print(fraudDataframe.isna().sum()) 

raw_mail    0
subject     0
from        0
to          0
status      0
date        0
body        0
dtype: int64
raw_mail      0
subject      17
from        948
to          948
status        0
date        534
body          0
dtype: int64


In [23]:
phishingDataframe['subject'] = phishingDataframe['subject'].apply(updateToNone)
phishingDataframe['to'] = phishingDataframe['to'].apply(updateToNone)
phishingDataframe['from'] = phishingDataframe['to'].apply(updateToNone)
phishingDataframe['status'] = phishingDataframe['status'].apply(updateToNone)
phishingDataframe['date'] = phishingDataframe['date'].apply(updateToNone)

print((fraudDataframe == "None").sum())
print(fraudDataframe.isna().sum()) 

raw_mail    0
subject     0
from        0
to          0
status      0
date        0
body        0
dtype: int64
raw_mail      0
subject      17
from        948
to          948
status        0
date        534
body          0
dtype: int64


To fix the inconsistensy format of the email in columns "from" and "to" , we will used regex to extract the valid emails first then, fill the empty values with valid values from the datasets

the pattern for capturing a valid email is "[a-zA-Z0-9-_.]*@a-zA-Z0-9-]*(\\.[a-zA-Z]*)*"

this pattern will look for email that have a valid username + @ + domain

Applying the regex pattern will be done to each "From" and "To" column for each datasets using

In [24]:
pattern = "([a-zA-Z0-9-_.]*@[a-zA-Z0-9-]*(\.[a-zA-Z]*)*)"

In [27]:
p_count = phishingDataframe[phishingDataframe['from'].str.contains(
    pattern) == False]['from'].count()
f_count = fraudDataframe[fraudDataframe['from'].str.contains(
    pattern, regex=True) == False]['from'].count()
e_count = enronDataframe[enronDataframe['from'].str.contains(
    pattern, regex=True) == False]['to'].count()

print("Phishing missing from email:", p_count)
print("fraud missing from email:", f_count)
print("enron missing from email:", e_count)

  p_count = phishingDataframe[phishingDataframe['from'].str.contains(
  f_count = fraudDataframe[fraudDataframe['from'].str.contains(
  e_count = enronDataframe[enronDataframe['from'].str.contains(


Phishing missing from email 1220
fraud missing from email 545
enron missing from email 1


In [None]:
p_count = phishingDataframe[phishingDataframe['from'].str.contains(
    '[a-zA-Z0-9-_.]*@a-zA-Z0-9-]*(\.[a-zA-Z]*)*', regex=True) == False]['from'].count()
f_count = fraudDataframe[fraudDataframe['from'].str.contains(
    '([a-zA-Z0-9-_.])*@([a-zA-Z0-9-])*(\.[a-zA-Z]*)*', regex=True) == False]['from'].count()
e_count = enronDataframe[enronDataframe['from'].str.contains(
    '([a-zA-Z0-9-_.])*@([a-zA-Z0-9-])*(\.[a-zA-Z]*)*', regex=True) == False]['to'].count()

print("Phishing missing from email", p_count)
print("fraud missing from email", f_count)
print("enron missing from email", e_count)


the 2 cells above shows the number of time the pattern match for each datasets

In [31]:
parsedFrom = fraudDataframe['from'].str.extract(pattern)
parsedTo = fraudDataframe['to'].str.extract(pattern)

fraudDataframe['parsed_from'] = parsedFrom[0]
fraudDataframe['parsed_to'] = parsedTo[0]

In [15]:
parsedFrom = phishingDataframe['from'].str.extract(pattern)
parsedTo = phishingDataframe['to'].str.extract(pattern)

phishingDataframe['parsed_from'] = parsedFrom[0]
phishingDataframe['parsed_to'] = parsedTo[0]

In [16]:
parsedFrom = enronDataframe['from'].str.extract(pattern)
parsedTo = enronDataframe['to'].str.extract(pattern)

enronDataframe['parsed_from'] = parsedFrom[0]
enronDataframe['parsed_to'] = parsedTo[0]

Bellow is the kind of fields that wasnt register as an email

In [17]:
print(fraudDataframe[fraudDataframe['parsed_from'].isna()]['from'].unique())
print(fraudDataframe[fraudDataframe['parsed_to'].isna()]['to'].unique())
print(phishingDataframe[phishingDataframe['parsed_from'].isna()]['from'].unique())
print(phishingDataframe[phishingDataframe['parsed_to'].isna()]['to'].unique())

[None 'undisclosed-recipients: ;' 'undisclosed-recipients:;' ''
 'undisclosed recipients: ;' 'N/A <>, N/A <>' 'N/A <>']
[None 'undisclosed-recipients: ;' 'undisclosed-recipients:;' ''
 'undisclosed recipients: ;' 'N/A <>, N/A <>' 'N/A <>']
['undisclosed-recipients: ;' '[removed]' None 'undisclosed-recipients:;'
 'unlisted-recipients:; (no To-header on input)'
 '<Undisclosed-Recipient:;>' '=?euc-kr?B?u+e2+7nnu/W6rsbtwfawocG3?=' '']
['undisclosed-recipients: ;' '[removed]' None 'undisclosed-recipients:;'
 'unlisted-recipients:; (no To-header on input)'
 '<Undisclosed-Recipient:;>' '=?euc-kr?B?u+e2+7nnu/W6rsbtwfawocG3?=' '']


In [18]:
phishingDataframe['parsed_from'] = phishingDataframe['parsed_from'].ffill().bfill()
fraudDataframe['parsed_from'] = fraudDataframe['parsed_from'].ffill().bfill()
enronDataframe['parsed_from'] = enronDataframe['parsed_from'].ffill().bfill()

In [19]:
phishingDataframe['parsed_to'] = phishingDataframe['parsed_to'].ffill().bfill()
fraudDataframe['parsed_to'] = fraudDataframe['parsed_to'].ffill().bfill()
enronDataframe['parsed_to'] = enronDataframe['parsed_to'].ffill().bfill()

all the datasets subject column contains a null value, we will fill this value using existing fields in the datasets

In [20]:
phishingDataframe['subject'] = phishingDataframe.subject.ffill().bfill()
fraudDataframe['subject'] = fraudDataframe.subject.ffill().bfill()
enronDataframe['subject'] = enronDataframe.subject.ffill().bfill()

In [21]:
print('Number of row that have empty subject for phishingDataframe:', phishingDataframe.subject.isnull().sum())
print('Number of row that have empty subject for fraudDataframe:', fraudDataframe.subject.isnull().sum())
print('Number of row that have empty subject for enronDataframe:', enronDataframe.subject.isnull().sum())

Number of row that have empty subject for phishingDataframe: 0
Number of row that have empty subject for fraudDataframe: 0
Number of row that have empty subject for enronDataframe: 0


In [32]:
enronDataframe['parsed_date'] = enronDataframe.date.apply(lambda date: parser.parse(date).isoformat())

Fill empty fields so no null exist by doing backward and forward fill

In [33]:
fraudDataframe.date = fraudDataframe.date.ffill().bfill()
phishingDataframe.date = phishingDataframe.date.ffill().bfill()

In [34]:
diff = phishingDataframe.shape[0] - phishingDataframe.date.str.contains('[A-Za-z]{0,3}, \d* [A-Za-z]{0,3} \d{4}').sum()
print("Total date row that are not in format for phishingDataframe:", diff)
diff = fraudDataframe.shape[0] - fraudDataframe.date.str.contains('[A-Za-z]{0,3}, \d* [A-Za-z]{0,3} \d{4}').sum()
print("Total date row that are not in format for phishingDataframe:", diff)

Total date row that are not in format for phishingDataframe: 444
Total date row that are not in format for phishingDataframe: 151


In [35]:
def parseDate(date):
    try:
        return parser.parse(date).isoformat()
    except Exception as e:
        return None

In [36]:
phishingDataframe['parsed_date'] = phishingDataframe.date.str.replace('\.', ':', regex=True)
phishingDataframe['parsed_date'] = phishingDataframe['parsed_date'].apply(parseDate)



Manual cleaning for cases that are to few to automate

In [37]:
phishingDataframe.loc[821].parsed_date = parser.parse("Fri, 09 Jun 2006 08:23:29 +0500 (EST)").isoformat()
phishingDataframe.loc[892].parsed_date = parser.parse("Fri, 23 Jun 2006 13:25:46 -0100 (EST)").isoformat()
phishingDataframe.loc[896].parsed_date = parser.parse("Fri, 23 Jun 2006 21:36:05 +0800").isoformat()
phishingDataframe.loc[1066].parsed_date = parser.parse("Wed, 26 Jul 2006 09:48:28 -0800").isoformat()
phishingDataframe.loc[1067].parsed_date = parser.parse("Wed, 26 Jul 2006 12:50:48 -0600").isoformat()
phishingDataframe.loc[1072].parsed_date = parser.parse("Thu, 27 Jul 2006 03:06:10 -0800").isoformat()
phishingDataframe.loc[1074].parsed_date = parser.parse("Wed, 26 Jul 2006 15:24:52 -0500").isoformat()
phishingDataframe.loc[1075].parsed_date = parser.parse("Wed, 26 Jul 2006 15:43:42 -0500").isoformat()
phishingDataframe.loc[1076].parsed_date = parser.parse("Wed, 26 Jul 2006 19:03:49 -0300").isoformat()
phishingDataframe.loc[1077].parsed_date = parser.parse("Wed, 26 Jul 2006 19:35:02 -0300").isoformat()
phishingDataframe.loc[1095].parsed_date = parser.parse("31.07.2006").isoformat()
phishingDataframe.loc[1173].parsed_date = parser.parse("Thu, 3 Aug 2006 00:13:00 -0530").isoformat()
phishingDataframe.loc[2421].parsed_date = parser.parse("Tue, 09 Jan 2007 14:00:44 +0430").isoformat()
phishingDataframe.loc[3540].parsed_date = parser.parse("Sun, 10 Sep 2006 14:00:47 +0000").isoformat()
phishingDataframe.loc[3643].parsed_date = parser.parse("Fri, 09 Mar 2007 18:11:57 +0530").isoformat()
phishingDataframe.loc[3896].parsed_date = parser.parse("07.08.2006").isoformat()
phishingDataframe.loc[3963].parsed_date = parser.parse("Mon, 24 Feb 2003 17:32:08 +0000").isoformat()
phishingDataframe.loc[4117].parsed_date = parser.parse("Sun, 10 Sep 2006 12:08:54 -0300").isoformat()

In [38]:
fraudDataframe['parsed_date'] = fraudDataframe.date.str.replace('\.', ':', regex=True)
fraudDataframe['parsed_date'] = fraudDataframe['parsed_date'].apply(parseDate)



In [39]:
def myfunc(row):
    if row.parsed_date == None:
        try:
            row.parsed_date = parser.parse(
                re.search("([A-Za-z]{1,3}, \d{0,2} [A-Za-z]* \d{2,4} \d{2}:\d{2}:\d{2} ((\+|\-)?\d{4})?)", 
                          row.date).group(1)).isoformat()
            return row
        except Exception as e:
            return row
    else:
        return row

fraudDataframe = fraudDataframe.apply(myfunc, axis=1)

Manual Updates for cases that are to few to automate

In [40]:
fraudDataframe.loc[542].parsed_date = parser.parse("Sun, 09 nov 2003 21:18:28").isoformat()
fraudDataframe.loc[1236].parsed_date = parser.parse("Tue, 09 nov 2004 15:38:35 -0300").isoformat()

Add prediction label that will be used in model training

In [41]:
fraudDataframe['malicious'] = True
phishingDataframe['malicious'] = True
enronDataframe['malicious'] = False

In [42]:
fraudDataframe.to_csv(path_or_buf='datasets/clean/fraud-emails.csv', index=False)
phishingDataframe.to_csv(path_or_buf='datasets/clean/phishing-emails.csv', index=False)
enronDataframe.to_csv(path_or_buf='datasets/clean/enron-emails.csv', index=False)