# Problem Identification
Our project is concerned with classifying phishing vs non-phishing emails accurately. Given the nature of our datasets and task, we will be conducting Supervised Learning (Classification).

# Importing Necessary Libraries

In [1]:
import pandas as pd
import numpy as np
import os
import re

url_pattern = r'(https?://\S+|www\.\S+)'


# Data Cleaning

#### **Datasets Used**:

- **Figshare: Seven Phishing Email Datasets**  
  Link: [https://figshare.com/articles/dataset/Seven_Phishing_Email_Datasets/25432108](https://figshare.com/articles/dataset/Seven_Phishing_Email_Datasets/25432108)  
  Subsets: Assassin, Ling, Enron, TREC05, TREC06, TREC07, CEAS08  

- **Kaggle: Phishing Email Dataset**  
  Link: [https://www.kaggle.com/datasets/naserabdullahalam/phishing-email-dataset](https://www.kaggle.com/datasets/naserabdullahalam/phishing-email-dataset)  
  Subsets: Nazario, Nigerian_Fraud  

- **Kaggle: Phishing Email Data by Type**  
  Link: [https://www.kaggle.com/datasets/charlottehall/phishing-email-data-by-type](https://www.kaggle.com/datasets/charlottehall/phishing-email-data-by-type)  
  Subset: phishing_data_by_type  

- **Kaggle: Human & LLM Generated Emails**  
  Link: [https://www.kaggle.com/datasets/francescogreco97/human-llm-generated-phishing-legitimate-emails](https://www.kaggle.com/datasets/francescogreco97/human-llm-generated-phishing-legitimate-emails)  
  Subsets: legit, phishing  

- **Kaggle: Phishing Persuasion Dataset**  
  Link: [https://www.kaggle.com/datasets/ahmadtijjani/phishing-urgency-authority-persuasion](https://www.kaggle.com/datasets/ahmadtijjani/phishing-urgency-authority-persuasion)  
  Subset: phishing_dataset_with_category  


#### Load the Datasets into Dataframes:

In [2]:
assassin_df = pd.read_csv('../datasets/Assassin.csv', low_memory=False)
ceas_df = pd.read_csv('../datasets/CEAS_08.csv', low_memory=False)
enron_df = pd.read_csv('../datasets/Enron.csv', low_memory=False)
ling_df = pd.read_csv('../datasets/Ling.csv', low_memory=False)
trec05_df = pd.read_csv('../datasets/TREC_05.csv', engine='python', on_bad_lines='skip')
trec06_df = pd.read_csv('../datasets/TREC_06.csv', engine='python', on_bad_lines='skip')
trec07_df = pd.read_csv('../datasets/TREC_07.csv', engine='python', on_bad_lines='skip')
nazario_df = pd.read_csv('../datasets/Nazario.csv')
nigerian_df = pd.read_csv('../datasets/Nigerian_Fraud.csv')
phishingtype_df = pd.read_csv('../datasets/phishing_data_by_type.csv')
legit_df = pd.read_csv('../datasets/legit.csv')
phishing_df = pd.read_csv('../datasets/phishing.csv', engine='python', on_bad_lines='skip')
phishingcategory_df = pd.read_csv('../datasets/phishing_dataset_with_category.csv')

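We pass engine='python' and on_bad_lines='skip' for the files that may contain malformed lines. The Python parsing engine is slower but more feature-complete than the default C engine, and on_bad_lines='skip' drops rows with too many fields instead of raising an error.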
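
One caveat: `on_bad_lines='skip'` discards rows silently, so it is hard to tell how much data was lost. As a minimal sketch (assuming pandas 1.4 or later, where `on_bad_lines` also accepts a callable when `engine='python'`), the hypothetical helper `read_csv_collecting_bad_lines` below records the skipped rows instead of dropping them unseen. The toy CSV is made up for illustration.

```python
import io
import pandas as pd

def read_csv_collecting_bad_lines(source):
    """Read a CSV with the Python engine and keep a record of skipped rows.

    Hypothetical helper, not part of the original notebook.
    """
    bad_lines = []

    def remember_and_skip(fields):
        # pandas passes the offending row as a list of strings;
        # returning None tells pandas to skip that row.
        bad_lines.append(fields)
        return None

    df = pd.read_csv(source, engine='python', on_bad_lines=remember_and_skip)
    return df, bad_lines


# Toy input: the third line has one field too many.
sample = io.StringIO(
    "subject,body,label\n"
    "hi,hello,0\n"
    "oops,extra,field,1\n"
    "bye,ciao,1\n"
)
sample_df, skipped = read_csv_collecting_bad_lines(sample)
print(len(sample_df), "rows kept;", len(skipped), "rows skipped")
```

A row only counts as bad when it has too many fields. Rows whose fields are merely shifted into the wrong columns still load, so the label columns need validating either way.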

### Dataset 1: Assassin.csv

First things first, we will be observing what the Dataset looks like to get a feel for what we are working with.

In [3]:
assassin_df.info()
assassin_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5809 entries, 0 to 5808
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   sender    5809 non-null   object
 1   receiver  5599 non-null   object
 2   date      5809 non-null   object
 3   subject   5793 non-null   object
 4   body      5808 non-null   object
 5   label     5809 non-null   int64 
 6   urls      5809 non-null   int64 
dtypes: int64(2), object(5)
memory usage: 317.8+ KB


Unnamed: 0,sender,receiver,date,subject,body,label,urls
0,Robert Elz <kre@munnari.OZ.AU>,Chris Garrigues <cwg-dated-1030377287.06fa6d@D...,"Thu, 22 Aug 2002 18:26:25 +0700",Re: New Sequences Window,"Date: Wed, 21 Aug 2002 10:54:46 -0500 ...",0,1
1,Steve Burt <Steve_Burt@cursor-system.com>,"""'zzzzteana@yahoogroups.com'"" <zzzzteana@yahoo...","Thu, 22 Aug 2002 12:46:18 +0100",[zzzzteana] RE: Alexander,"Martin A posted:\nTassos Papadopoulos, the Gre...",0,1
2,"""Tim Chapman"" <timc@2ubh.com>",zzzzteana <zzzzteana@yahoogroups.com>,"Thu, 22 Aug 2002 13:52:38 +0100",[zzzzteana] Moscow bomber,Man Threatens Explosion In Moscow \n\nThursday...,0,1
3,Monty Solomon <monty@roscom.com>,undisclosed-recipient: ;,"Thu, 22 Aug 2002 09:15:25 -0400",[IRR] Klez: The Virus That Won't Die,Klez: The Virus That Won't Die\n \nAlready the...,0,1
4,Stewart Smith <Stewart.Smith@ee.ed.ac.uk>,zzzzteana@yahoogroups.com,"Thu, 22 Aug 2002 14:38:22 +0100",Re: [zzzzteana] Nothing like mama used to make,"> in adding cream to spaghetti carbonara, whi...",0,1


From this we understand that this dataset has 5809 rows of data and 7 columns.

We will most probably drop the sender and receiver columns later on. They identify individual mailboxes rather than describe the content of an email, and we do not want the model to learn specific addresses.

There are 16 rows without a subject and 1 row without a body. Emails can legitimately lack either, so keeping these rows should do no harm.


In [4]:
print("Number of Duplicate Rows: ",assassin_df.duplicated().sum(),'\n')
print(assassin_df[assassin_df.duplicated('body')])

Number of Duplicate Rows:  0 

Empty DataFrame
Columns: [sender, receiver, date, subject, body, label, urls]
Index: []


There are no fully duplicated rows, which is a plus. Checking for duplicates on the body alone is the stricter test, since the same email could appear with different metadata, and it also finds none.
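
Since the same checks recur for every dataset, they could be bundled into one function. The sketch below is a hypothetical helper (`quality_report` is not part of the original notebook) shown on made-up data; it counts missing values per column, fully duplicated rows, and duplicated non-null bodies.

```python
import pandas as pd

def quality_report(df, text_col='body'):
    """Summarise missing values and duplicates for one dataset (hypothetical helper)."""
    return {
        'rows': len(df),
        'missing_per_column': df.isna().sum().to_dict(),
        'duplicate_rows': int(df.duplicated().sum()),
        # Compare only non-null texts so missing bodies are not counted as duplicates.
        'duplicate_texts': int(df[text_col].dropna().duplicated().sum()),
    }


toy = pd.DataFrame({
    'subject': ['a', None, 'c', 'c'],
    'body': ['x', 'y', 'y', None],
    'label': [0, 1, 1, 0],
})
report = quality_report(toy)
print(report)
```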

In [5]:
assassin_df.nunique()

sender      2523
receiver    1598
date        5557
subject     4187
body        5808
label          2
urls           2
dtype: int64

From this we can see that 'label' and 'urls' each take only 2 distinct values. Together with their int64 dtype, this is consistent with the 0/1 encoding we want, so there is no need to normalize the labels. Interestingly, there are fewer unique subjects than unique bodies, indicating that the same subject line was reused across different emails.
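
A unique count of 2 does not by itself prove that the two values are 0 and 1. A minimal sketch of an explicit check follows; `is_binary_column` is a hypothetical helper and the series are made-up examples.

```python
import pandas as pd

def is_binary_column(series):
    """True when every non-null value equals 0 or 1 (hypothetical helper)."""
    values = set(series.dropna().unique())
    return values <= {0, 1}


print(is_binary_column(pd.Series([0, 1, 1])))       # integer flags
print(is_binary_column(pd.Series(['0', '1'])))      # strings are not accepted
print(is_binary_column(pd.Series([0, 1, 2])))       # a third value is rejected
```

String values such as '0' and '1' deliberately fail the check, which flags columns that still need a type conversion.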

In [6]:
assassin_df.rename(columns={'label': 'isPhishing'}, inplace=True)

Here, we have renamed the 'label' column to 'isPhishing' for clarity and consistency across the datasets.

In [7]:
assassin_df['date'] = pd.to_datetime(assassin_df['date'].str.extract(r'(\w{3}, \d{1,2} \w{3} \d{4})')[0], format='%a, %d %b %Y',
    errors='coerce'
)

assassin_df.info()
assassin_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5809 entries, 0 to 5808
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   sender      5809 non-null   object        
 1   receiver    5599 non-null   object        
 2   date        5355 non-null   datetime64[ns]
 3   subject     5793 non-null   object        
 4   body        5808 non-null   object        
 5   isPhishing  5809 non-null   int64         
 6   urls        5809 non-null   int64         
dtypes: datetime64[ns](1), int64(2), object(4)
memory usage: 317.8+ KB


Unnamed: 0,sender,receiver,date,subject,body,isPhishing,urls
0,Robert Elz <kre@munnari.OZ.AU>,Chris Garrigues <cwg-dated-1030377287.06fa6d@D...,2002-08-22,Re: New Sequences Window,"Date: Wed, 21 Aug 2002 10:54:46 -0500 ...",0,1
1,Steve Burt <Steve_Burt@cursor-system.com>,"""'zzzzteana@yahoogroups.com'"" <zzzzteana@yahoo...",2002-08-22,[zzzzteana] RE: Alexander,"Martin A posted:\nTassos Papadopoulos, the Gre...",0,1
2,"""Tim Chapman"" <timc@2ubh.com>",zzzzteana <zzzzteana@yahoogroups.com>,2002-08-22,[zzzzteana] Moscow bomber,Man Threatens Explosion In Moscow \n\nThursday...,0,1
3,Monty Solomon <monty@roscom.com>,undisclosed-recipient: ;,2002-08-22,[IRR] Klez: The Virus That Won't Die,Klez: The Virus That Won't Die\n \nAlready the...,0,1
4,Stewart Smith <Stewart.Smith@ee.ed.ac.uk>,zzzzteana@yahoogroups.com,2002-08-22,Re: [zzzzteana] Nothing like mama used to make,"> in adding cream to spaghetti carbonara, whi...",0,1


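We have converted the date strings to datetime objects for easier handling, which may help later if we split the dataset chronologically. Note that only 5355 of the 5809 dates are non-null after the conversion: the 454 strings that could not be extracted or parsed became NaT because of errors='coerce'.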
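
To see which raw strings end up as NaT, the original column has to be inspected before it is overwritten. The sketch below reuses the pattern and format from the cell above on made-up values; `parse_dates_with_report` is a hypothetical helper, and English month and weekday names are assumed.

```python
import pandas as pd

DATE_PATTERN = r'(\w{3}, \d{1,2} \w{3} \d{4})'  # same pattern as in the cell above

def parse_dates_with_report(raw):
    """Return (parsed dates, raw strings that could not be parsed)."""
    extracted = raw.str.extract(DATE_PATTERN)[0]
    parsed = pd.to_datetime(extracted, format='%a, %d %b %Y', errors='coerce')
    # Rows that had a raw value but still ended up as NaT.
    failed = raw[parsed.isna() & raw.notna()]
    return parsed, failed


toy_dates = pd.Series([
    'Thu, 22 Aug 2002 18:26:25 +0700',
    '22 Aug 2002 12:46:18 +0100',      # no weekday, so the pattern does not match
    None,
])
parsed, failed = parse_dates_with_report(toy_dates)
print(failed.tolist())
```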

In [8]:
assassin_df.drop(columns=['sender','receiver'], inplace=True)
assassin_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5809 entries, 0 to 5808
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   date        5355 non-null   datetime64[ns]
 1   subject     5793 non-null   object        
 2   body        5808 non-null   object        
 3   isPhishing  5809 non-null   int64         
 4   urls        5809 non-null   int64         
dtypes: datetime64[ns](1), int64(2), object(2)
memory usage: 227.0+ KB


### Dataset 2: Ling.csv

In [9]:
ling_df.info()
ling_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2859 entries, 0 to 2858
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   subject  2797 non-null   object
 1   body     2859 non-null   object
 2   label    2859 non-null   int64 
dtypes: int64(1), object(2)
memory usage: 67.1+ KB


Unnamed: 0,subject,body,label
0,job posting - apple-iss research center,content - length : 3386 apple-iss research cen...,0
1,,"lang classification grimes , joseph e . and ba...",0
2,query : letter frequencies for text identifica...,i am posting this inquiry for sergei atamas ( ...,0
3,risk,a colleague and i are researching the differin...,0
4,request book information,earlier this morning i was on the phone with a...,0


In [10]:
ling_df.rename(columns={'label': 'isPhishing'}, inplace=True) 
ling_nan = ling_df[ling_df['isPhishing'].isna()]
ling_df = ling_df.dropna(subset=['isPhishing'])

In [11]:
ling_df

Unnamed: 0,subject,body,isPhishing
0,job posting - apple-iss research center,content - length : 3386 apple-iss research cen...,0
1,,"lang classification grimes , joseph e . and ba...",0
2,query : letter frequencies for text identifica...,i am posting this inquiry for sergei atamas ( ...,0
3,risk,a colleague and i are researching the differin...,0
4,request book information,earlier this morning i was on the phone with a...,0
...,...,...,...
2854,win $ 300usd and a cruise !,"raquel 's casino , inc . is awarding a cruise ...",1
2855,you have been asked to join kiddin,"the list owner of : "" kiddin "" has invited you...",1
2856,anglicization of composers ' names,"judging from the return post , i must have sou...",0
2857,"re : 6 . 797 , comparative method : n - ary co...",gotcha ! there are two separate fallacies in t...,0


In [12]:
# Flag each email with 1 if its body contains a URL, else 0.
ling_df['urls'] = ling_df['body'].apply(lambda x: 1 if re.search(url_pattern, str(x)) else 0)
ling_df

Unnamed: 0,subject,body,isPhishing,urls
0,job posting - apple-iss research center,content - length : 3386 apple-iss research cen...,0,0
1,,"lang classification grimes , joseph e . and ba...",0,0
2,query : letter frequencies for text identifica...,i am posting this inquiry for sergei atamas ( ...,0,0
3,risk,a colleague and i are researching the differin...,0,0
4,request book information,earlier this morning i was on the phone with a...,0,0
...,...,...,...,...
2854,win $ 300usd and a cruise !,"raquel 's casino , inc . is awarding a cruise ...",1,0
2855,you have been asked to join kiddin,"the list owner of : "" kiddin "" has invited you...",1,0
2856,anglicization of composers ' names,"judging from the return post , i must have sou...",0,0
2857,"re : 6 . 797 , comparative method : n - ary co...",gotcha ! there are two separate fallacies in t...,0,0
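
The Ling bodies shown above are tokenized with spaces around punctuation (for example `content - length : 3386`), so a URL would presumably appear as `http : / / ...`, which the contiguous `url_pattern` cannot match. The sketch below illustrates this on a made-up string with a hypothetical whitespace-tolerant pattern. Whether Ling actually contains such URLs is an assumption that would need checking against the data.

```python
import re

url_pattern = r'(https?://\S+|www\.\S+)'   # pattern defined at the top of the notebook
# Hypothetical variant that tolerates spaces around the URL's punctuation.
spaced_url_pattern = r'(https?\s*:\s*/\s*/\s*\S+|www\s*\.\s*\S+)'

def has_url(text, pattern):
    """Return 1 if the pattern matches anywhere in the text, else 0."""
    return 1 if re.search(pattern, str(text)) else 0


tokenized = 'please visit http : / / www . example . com for details'
print(has_url(tokenized, url_pattern), has_url(tokenized, spaced_url_pattern))
```

The looser pattern also matches ordinary URLs, but it may produce more false positives on text that merely mentions `www .`, so it is a trade-off rather than a drop-in replacement.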


### Dataset 3: Enron.csv

In [13]:
enron_df.info()
enron_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29767 entries, 0 to 29766
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   subject  29569 non-null  object
 1   body     29767 non-null  object
 2   label    29767 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 697.8+ KB


Unnamed: 0,subject,body,label
0,"hpl nom for may 25 , 2001",( see attached file : hplno 525 . xls )\n- hpl...,0
1,re : nom / actual vols for 24 th,- - - - - - - - - - - - - - - - - - - - - - fo...,0
2,"enron actuals for march 30 - april 1 , 201","estimated actuals\nmarch 30 , 2001\nno flow\nm...",0
3,"hpl nom for may 30 , 2001",( see attached file : hplno 530 . xls )\n- hpl...,0
4,"hpl nom for june 1 , 2001",( see attached file : hplno 601 . xls )\n- hpl...,0


In [14]:
enron_df.rename(columns={'label': 'isPhishing'}, inplace=True)

In [15]:
print(enron_df.nunique())
enron_df['isPhishing'].unique()

subject       23570
body          29767
isPhishing        2
dtype: int64


array([0, 1])

As we can see, the labels take the desired values.

In [16]:
# Flag each email with 1 if its body contains a URL, else 0.
enron_df['urls'] = enron_df['body'].apply(lambda x: 1 if re.search(url_pattern, str(x)) else 0)
enron_df

Unnamed: 0,subject,body,isPhishing,urls
0,"hpl nom for may 25 , 2001",( see attached file : hplno 525 . xls )\n- hpl...,0,0
1,re : nom / actual vols for 24 th,- - - - - - - - - - - - - - - - - - - - - - fo...,0,0
2,"enron actuals for march 30 - april 1 , 201","estimated actuals\nmarch 30 , 2001\nno flow\nm...",0,0
3,"hpl nom for may 30 , 2001",( see attached file : hplno 530 . xls )\n- hpl...,0,0
4,"hpl nom for june 1 , 2001",( see attached file : hplno 601 . xls )\n- hpl...,0,0
...,...,...,...,...
29762,confidence is back,"hello ,\nmy boyfriend began having problems wi...",1,0
29763,important information,love - potion for your darling is all you want...,1,0
29764,vys - make itnger,you have feelings of guilt and embarrassment ...,1,0
29765,the best thing come in large parcels,spur - m formula\nincrease sperm production 50...,1,0


### Dataset 4: TREC_05.csv

In [17]:
trec05_df.info()
trec05_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59015 entries, 0 to 59014
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   sender    58999 non-null  object
 1   receiver  55089 non-null  object
 2   date      54807 non-null  object
 3   subject   54341 non-null  object
 4   body      55502 non-null  object
 5   label     55279 non-null  object
 6   urls      55230 non-null  object
dtypes: object(7)
memory usage: 3.2+ MB


Unnamed: 0,sender,receiver,date,subject,body,label,urls
0,"""Hu, Sylvia"" <Sylvia.Hu@ENRON.com>","""Acevedo, Felecia"" <Felecia.Acevedo@ENRON.com>...","Fri, 29 Jun 2001 08:36:09 -0500","FW: June 29 -- BNA, Inc. Daily Labor Report",User ID: enrondlr PW: bnaweb22 -----O...,0,1
1,"""Webb, Jay"" <Jay.Webb@ENRON.com>","""Lambie, Chris"" <Chris.Lambie@ENRON.com>","Fri, 29 Jun 2001 09:37:04 -0500",NGX failover plan.,"\nHi Chris, \n\nTonight we are rolling out a ...",0,0
2,"""Symms, Mark"" <Mark.Symms@ENRON.com>","""Thomas, Paul D."" <Paul.D.Thomas@ENRON.com>","Fri, 29 Jun 2001 08:39:30 -0500",RE: Intranet Site,Rika r these new?\n\n -----Original Message---...,0,1
3,"""Thorne, Judy"" <Judy.Thorne@ENRON.com>","""Grass, John"" <John.Grass@ENRON.com>, ""Nemec, ...","Fri, 29 Jun 2001 10:35:17 -0500",FW: ENA Upstream Company information,"John/Gerald, We are currently trading under GT...",0,0
4,"""Williams, Jason R (Credit)"" <Jason.R.Williams...","""Nemec, Gerald"" <Gerald.Nemec@ENRON.com>, ""Dic...","Fri, 29 Jun 2001 10:40:02 -0500",New Master Physical,Gerald and Stacy -\n\nAttached is a worksheet ...,0,0


Interestingly, we find that all of the datatypes here are object. This isn't what we want as ideally we want the 'label' and 'urls' columns to be of type int64. 

Additionally, we drop the sender and receiver columns, as we did for the Assassin dataset, since they identify mailboxes rather than describe email content.

In [18]:
trec05_df.drop(columns=["sender","receiver"], inplace=True)
trec05_df.rename(columns={'label': 'isPhishing'}, inplace=True)
trec05_df.head()

Unnamed: 0,date,subject,body,isPhishing,urls
0,"Fri, 29 Jun 2001 08:36:09 -0500","FW: June 29 -- BNA, Inc. Daily Labor Report",User ID: enrondlr PW: bnaweb22 -----O...,0,1
1,"Fri, 29 Jun 2001 09:37:04 -0500",NGX failover plan.,"\nHi Chris, \n\nTonight we are rolling out a ...",0,0
2,"Fri, 29 Jun 2001 08:39:30 -0500",RE: Intranet Site,Rika r these new?\n\n -----Original Message---...,0,1
3,"Fri, 29 Jun 2001 10:35:17 -0500",FW: ENA Upstream Company information,"John/Gerald, We are currently trading under GT...",0,0
4,"Fri, 29 Jun 2001 10:40:02 -0500",New Master Physical,Gerald and Stacy -\n\nAttached is a worksheet ...,0,0


In [19]:
trec05_df.nunique()

date          53965
subject       42419
body          55473
isPhishing       68
urls             20
dtype: int64

We wanted to check which values each column takes, in particular 'isPhishing' and 'urls'. Both have far more than two unique values (68 and 20), so they need to be standardized. Let's look at what each column actually contains.

In [20]:
trec05_df['isPhishing'].unique()

array(['0', '1', None,
       ' and former Energy Department Secretary James Schlesinger. ',
       ' and 40', ' according to Trione & Gordon. ',
       " 9p stronger at 565p after Saudi Arabia suggested the mood was shifting towards a production cut of 1.5m barrels at next week's Opec meeting. News that Russian oil companies are considering a cut also helped. ",
       " pointing to the index's 115-point advance over the week. ",
       ' resigned in August. ',
       ' which could result in additional layoffs from the computer maker. ',
       " while domestic lenders' loans are not. ",
       ' will retain those positions. ', '000 miles of pipe.',
       " Enron will potentially need to provide more than $500 million to pay off the notes. Fitch will closely monitor Enron's cash position through the merger period. ",
       ' +1-415-894-9376',
       " Williams and El Paso Corp.--have seen their fortunes dim as California's electricity meltdown has slowed the pace of deregulation nat

As we can see, this file is poorly constructed: fragments of email text have spilled into the label column. We will remove the null entries and every row whose label does not fit the 0/1 scheme, as these would corrupt our training.
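
The same clean-up is needed for more than one TREC file, so it could be written once. The sketch below is a hypothetical helper (`keep_binary_rows` is not part of the original notebook) run on made-up data. Unlike the cell that follows, it validates both flag columns at once and strips surrounding whitespace before comparing.

```python
import pandas as pd

def keep_binary_rows(df, columns=('isPhishing', 'urls')):
    """Keep rows whose flag columns are '0' or '1' (ignoring surrounding
    whitespace) and cast those columns to int.

    Hypothetical helper; it returns a copy, so later column assignments
    do not act on a slice of the original frame.
    """
    mask = pd.Series(True, index=df.index)
    for col in columns:
        mask &= df[col].astype(str).str.strip().isin(['0', '1'])
    cleaned = df[mask].copy()
    for col in columns:
        cleaned[col] = cleaned[col].astype(str).str.strip().astype(int)
    return cleaned


toy = pd.DataFrame({
    'body': ['a', 'b', 'c', 'd'],
    'isPhishing': ['0', '1', ' and 40', None],
    'urls': ['1', '0', '0', '1'],
})
cleaned = keep_binary_rows(toy)
print(cleaned)
```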

In [21]:
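# Keep only rows with a valid label; .copy() avoids a SettingWithCopyWarning on later assignments.
trec05_df = trec05_df[trec05_df['isPhishing'].isin(['0', '1'])].copy()
trec05_df['isPhishing'] = trec05_df['isPhishing'].astype(int)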

In [22]:
trec05_df.head()

Unnamed: 0,date,subject,body,isPhishing,urls
0,"Fri, 29 Jun 2001 08:36:09 -0500","FW: June 29 -- BNA, Inc. Daily Labor Report",User ID: enrondlr PW: bnaweb22 -----O...,0,1
1,"Fri, 29 Jun 2001 09:37:04 -0500",NGX failover plan.,"\nHi Chris, \n\nTonight we are rolling out a ...",0,0
2,"Fri, 29 Jun 2001 08:39:30 -0500",RE: Intranet Site,Rika r these new?\n\n -----Original Message---...,0,1
3,"Fri, 29 Jun 2001 10:35:17 -0500",FW: ENA Upstream Company information,"John/Gerald, We are currently trading under GT...",0,0
4,"Fri, 29 Jun 2001 10:40:02 -0500",New Master Physical,Gerald and Stacy -\n\nAttached is a worksheet ...,0,0


In [23]:
trec05_df['urls'].nunique()
trec05_df['urls'].unique()

array(['1', '0'], dtype=object)

It seems that cleaning the 'isPhishing' column has fixed the 'urls' column as well, so all that remains is to fix the datatype.

In [24]:
trec05_df['urls'] = trec05_df['urls'].astype(int)
trec05_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 55210 entries, 0 to 59014
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   date        53879 non-null  object
 1   subject     53892 non-null  object
 2   body        55209 non-null  object
 3   isPhishing  55210 non-null  int64 
 4   urls        55210 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 2.5+ MB


In [25]:
trec05_df['date'] = pd.to_datetime(
    trec05_df['date'], 
    errors='coerce')



In [26]:
trec05_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 55210 entries, 0 to 59014
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   date        53785 non-null  object
 1   subject     53892 non-null  object
 2   body        55209 non-null  object
 3   isPhishing  55210 non-null  int64 
 4   urls        55210 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 2.5+ MB
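
The `info()` output above still lists 'date' as `object` after the conversion. A likely cause (an assumption, not verified against this data) is that the strings carry different UTC offsets, in which case `pd.to_datetime` cannot build a single timezone-aware datetime column. One possible remedy is `utc=True`, which converts every timestamp to UTC. A minimal sketch on made-up values:

```python
import pandas as pd

raw_dates = pd.Series([
    'Fri, 29 Jun 2001 08:36:09 -0500',
    'Sun, 08 Apr 2007 21:00:48 +0300',
    'not a date',
])

# utc=True normalises all offsets to UTC, so the result is one datetime column;
# errors='coerce' turns unparseable strings into NaT.
utc_dates = pd.to_datetime(raw_dates, errors='coerce', utc=True)
print(utc_dates.dtype)
print(int(utc_dates.isna().sum()), "value(s) could not be parsed")
```

Converting to UTC discards the sender's local offset but keeps the instants comparable, which is what a chronological split needs.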


### Dataset 5: TREC_06.csv

In [27]:
trec06_df.info()
trec06_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16439 entries, 0 to 16438
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   sender    16173 non-null  object 
 1   receiver  15904 non-null  object 
 2   date      15947 non-null  object 
 3   subject   16064 non-null  object 
 4   body      16397 non-null  object 
 5   label     16382 non-null  float64
 6   urls      16382 non-null  float64
dtypes: float64(2), object(5)
memory usage: 899.1+ KB


Unnamed: 0,sender,receiver,date,subject,body,label,urls
0,jhpb@sarto.budd-lake.nj.us,,"Tue, 28 Jul 1992 03:13:55 +0000",new Catholic mailing list now up and running,The mailing list I queried about a few weeks a...,0.0,0.0
1,Stella Lowry <rookcuduq@yahoo.com>,Brian <bernice@groucho.cs.psu.edu>,"Sat, 03 Apr 1993 10:34:36 -0500",re[12]:,\n ...,1.0,1.0
2,Walter <trwmpca@downtowncumberland.com>,arline@groucho.cs.psu.edu,"Tue, 06 Apr 1993 20:33:13 -0600",Take a moment to explore this.,Academic Qualifications available from prestig...,1.0,0.0
3,Scott Schwartz <schwartz@groucho.cs.psu.edu>,9fans <plan9-fans@cs.psu.edu>,"Fri, 09 Apr 1993 14:29:43 -0400",Greetings,Greetings all. This is to verify your subscri...,0.0,0.0
4,Mr Jailyn Koepke <kiflsbizc@attheworld.com>,melvin@groucho.cs.psu.edu,"Fri, 09 Apr 1993 21:31:58 -0800",LOANS @ 3.17% (27 term),try chauncey may conferred the luscious not co...,1.0,0.0


In [28]:
trec06_df.drop(columns=["receiver", "sender"], inplace=True)
trec06_df.rename(columns={'label': 'isPhishing'}, inplace=True)

In [29]:
trec06_df.head()

Unnamed: 0,date,subject,body,isPhishing,urls
0,"Tue, 28 Jul 1992 03:13:55 +0000",new Catholic mailing list now up and running,The mailing list I queried about a few weeks a...,0.0,0.0
1,"Sat, 03 Apr 1993 10:34:36 -0500",re[12]:,\n ...,1.0,1.0
2,"Tue, 06 Apr 1993 20:33:13 -0600",Take a moment to explore this.,Academic Qualifications available from prestig...,1.0,0.0
3,"Fri, 09 Apr 1993 14:29:43 -0400",Greetings,Greetings all. This is to verify your subscri...,0.0,0.0
4,"Fri, 09 Apr 1993 21:31:58 -0800",LOANS @ 3.17% (27 term),try chauncey may conferred the luscious not co...,1.0,0.0


In [30]:
trec06_df.nunique()

date          15907
subject       11575
body          16394
isPhishing        2
urls              2
dtype: int64

In [31]:
# Drop unlabelled rows, then cast the float flags (0.0/1.0) to integers.
trec06_df = trec06_df.dropna(subset=['isPhishing']).copy()
trec06_df[['urls', 'isPhishing']] = trec06_df[['urls', 'isPhishing']].astype('int64')

In [32]:
trec06_df['date'] = pd.to_datetime(
    trec06_df['date'], 
    errors='coerce')
trec06_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 16382 entries, 0 to 16438
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   date        15847 non-null  object
 1   subject     16048 non-null  object
 2   body        16381 non-null  object
 3   isPhishing  16382 non-null  int64 
 4   urls        16382 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 767.9+ KB




### Dataset 6: TREC_07.csv

In [33]:
trec07_df.info()
trec07_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68221 entries, 0 to 68220
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   sender    68221 non-null  object
 1   receiver  59812 non-null  object
 2   date      55253 non-null  object
 3   subject   53495 non-null  object
 4   body      53766 non-null  object
 5   label     53753 non-null  object
 6   urls      53747 non-null  object
dtypes: object(7)
memory usage: 3.6+ MB


Unnamed: 0,sender,receiver,date,subject,body,label,urls
0,Tomas Jacobs <RickyAmes@aol.com>,the00@speedy.uwaterloo.ca,"Sun, 08 Apr 2007 21:00:48 +0300","Generic Cialis, branded quality@",\n\n\n\n\n\n\nDo you feel the pressure to perf...,1,0
1,Yan Morin <yan.morin@savoirfairelinux.com>,debian-mirrors@lists.debian.org,"Sun, 08 Apr 2007 12:52:30 -0400",Typo in /debian/README,"Hi, i've just updated from the gulus and I che...",0,1
2,Sheila Crenshaw <7stocknews@tractionmarketing....,the00@plg.uwaterloo.ca,"Sun, 08 Apr 2007 17:12:19 +0000",authentic viagra,Mega authenticV I A G R A $ DISCOUNT priceC...,1,1
3,Stormy Dempsey <vqucsmdfgvsg@ruraltek.com>,opt4@speedy.uwaterloo.ca,"Sun, 08 Apr 2007 17:15:47 -0100",Nice talking with ya,"\nHey Billy, \n\nit was really fun going out t...",1,1
4,"""Christi T. Jernigan"" <dcube@totalink.net>",ktwarwic@speedy.uwaterloo.ca,"Sun, 08 Apr 2007 19:19:07 +0200",or trembling; stomach cramps; trouble in sleep...,"\nsystem"" of the home. It will have the capab...",1,0


In [34]:
trec07_df.nunique()

sender      41124
receiver     7979
date        53311
subject     29278
body        53757
label           6
urls            3
dtype: int64

In [35]:
print(trec07_df['label'].unique())
print(trec07_df['urls'].unique())

['1' '0' None '  ' 'const  ' 'uint16)' 'uint']
['0' '1' None nan 'const char*));']


In [36]:
trec07_df.drop(columns=["sender","receiver"], inplace=True)
trec07_df.rename(columns={'label': 'isPhishing'}, inplace=True) 

In [37]:
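# Keep only rows with a valid label; .copy() avoids a SettingWithCopyWarning on later assignments.
trec07_df = trec07_df[trec07_df['isPhishing'].isin(['0', '1'])].copy()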

In [38]:
trec07_df['date'] = pd.to_datetime(
    trec07_df['date'], 
    errors='coerce')
trec07_df.info()
trec07_df.head()

<class 'pandas.core.frame.DataFrame'>
Index: 53745 entries, 0 to 68220
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   date        53519 non-null  object
 1   subject     53296 non-null  object
 2   body        53745 non-null  object
 3   isPhishing  53745 non-null  object
 4   urls        53745 non-null  object
dtypes: object(5)
memory usage: 2.5+ MB




Unnamed: 0,date,subject,body,isPhishing,urls
0,2007-04-08 21:00:48+03:00,"Generic Cialis, branded quality@",\n\n\n\n\n\n\nDo you feel the pressure to perf...,1,0
1,2007-04-08 12:52:30-04:00,Typo in /debian/README,"Hi, i've just updated from the gulus and I che...",0,1
2,2007-04-08 17:12:19+00:00,authentic viagra,Mega authenticV I A G R A $ DISCOUNT priceC...,1,1
3,2007-04-08 17:15:47-01:00,Nice talking with ya,"\nHey Billy, \n\nit was really fun going out t...",1,1
4,2007-04-08 19:19:07+02:00,or trembling; stomach cramps; trouble in sleep...,"\nsystem"" of the home. It will have the capab...",1,0


In [39]:
trec07_df['isPhishing'] = trec07_df['isPhishing'].astype(int)
trec07_df['urls'] = trec07_df['urls'].astype(int)
trec07_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 53745 entries, 0 to 68220
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   date        53519 non-null  object
 1   subject     53296 non-null  object
 2   body        53745 non-null  object
 3   isPhishing  53745 non-null  int64 
 4   urls        53745 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 2.5+ MB


### Dataset 7: CEAS_08.csv

In [40]:
ceas_df.info()
ceas_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39154 entries, 0 to 39153
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   sender    39154 non-null  object
 1   receiver  38692 non-null  object
 2   date      39154 non-null  object
 3   subject   39126 non-null  object
 4   body      39154 non-null  object
 5   label     39154 non-null  int64 
 6   urls      39154 non-null  int64 
dtypes: int64(2), object(5)
memory usage: 2.1+ MB


Unnamed: 0,sender,receiver,date,subject,body,label,urls
0,Young Esposito <Young@iworld.de>,user4@gvc.ceas-challenge.cc,"Tue, 05 Aug 2008 16:31:02 -0700",Never agree to be a loser,"Buck up, your troubles caused by small dimensi...",1,1
1,Mok <ipline's1983@icable.ph>,user2.2@gvc.ceas-challenge.cc,"Tue, 05 Aug 2008 18:31:03 -0500",Befriend Jenna Jameson,\nUpgrade your sex and pleasures with these te...,1,1
2,Daily Top 10 <Karmandeep-opengevl@universalnet...,user2.9@gvc.ceas-challenge.cc,"Tue, 05 Aug 2008 20:28:00 -1200",CNN.com Daily Top 10,>+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+...,1,1
3,Michael Parker <ivqrnai@pobox.com>,SpamAssassin Dev <xrh@spamassassin.apache.org>,"Tue, 05 Aug 2008 17:31:20 -0600",Re: svn commit: r619753 - in /spamassassin/tru...,Would anyone object to removing .so from this ...,0,1
4,Gretchen Suggs <externalsep1@loanofficertool.com>,user2.2@gvc.ceas-challenge.cc,"Tue, 05 Aug 2008 19:31:21 -0400",SpecialPricesPharmMoreinfo,\nWelcomeFastShippingCustomerSupport\nhttp://7...,1,1


In [41]:
ceas_df.drop(columns=["sender","receiver"], inplace=True)
ceas_df.rename(columns={'label': 'isPhishing'}, inplace=True) 

In [42]:
ceas_df

Unnamed: 0,date,subject,body,isPhishing,urls
0,"Tue, 05 Aug 2008 16:31:02 -0700",Never agree to be a loser,"Buck up, your troubles caused by small dimensi...",1,1
1,"Tue, 05 Aug 2008 18:31:03 -0500",Befriend Jenna Jameson,\nUpgrade your sex and pleasures with these te...,1,1
2,"Tue, 05 Aug 2008 20:28:00 -1200",CNN.com Daily Top 10,>+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+...,1,1
3,"Tue, 05 Aug 2008 17:31:20 -0600",Re: svn commit: r619753 - in /spamassassin/tru...,Would anyone object to removing .so from this ...,0,1
4,"Tue, 05 Aug 2008 19:31:21 -0400",SpecialPricesPharmMoreinfo,\nWelcomeFastShippingCustomerSupport\nhttp://7...,1,1
...,...,...,...,...,...
39149,"Fri, 08 Aug 2008 10:34:50 -0400",CNN Alerts: My Custom Alert,\n\nCNN Alerts: My Custom Alert\n\n\n\n\n\n\n ...,1,0
39150,"Fri, 08 Aug 2008 10:35:11 -0400",CNN Alerts: My Custom Alert,\n\nCNN Alerts: My Custom Alert\n\n\n\n\n\n\n ...,1,0
39151,"Fri, 08 Aug 2008 22:00:43 +0800",Slideshow viewer,Hello there ! \nGreat work on the slide show v...,0,0
39152,"Fri, 08 Aug 2008 09:00:46 -0500",Note on 2-digit years,"\nMail from sender , coming from intuit.com\ns...",0,0


In [43]:
ceas_df['date'] = pd.to_datetime(ceas_df['date'], errors='coerce')



In [44]:
ceas_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39154 entries, 0 to 39153
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   date        39139 non-null  object
 1   subject     39126 non-null  object
 2   body        39154 non-null  object
 3   isPhishing  39154 non-null  int64 
 4   urls        39154 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 1.5+ MB


In [45]:
ceas_df.head()

Unnamed: 0,date,subject,body,isPhishing,urls
0,2008-08-05 16:31:02-07:00,Never agree to be a loser,"Buck up, your troubles caused by small dimensi...",1,1
1,2008-08-05 18:31:03-05:00,Befriend Jenna Jameson,\nUpgrade your sex and pleasures with these te...,1,1
2,2008-08-05 20:28:00-12:00,CNN.com Daily Top 10,>+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+...,1,1
3,2008-08-05 17:31:20-06:00,Re: svn commit: r619753 - in /spamassassin/tru...,Would anyone object to removing .so from this ...,0,1
4,2008-08-05 19:31:21-04:00,SpecialPricesPharmMoreinfo,\nWelcomeFastShippingCustomerSupport\nhttp://7...,1,1


### Dataset 8: Nazario.csv

In [46]:
nazario_df.info()
nazario_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1565 entries, 0 to 1564
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   sender    1565 non-null   object
 1   receiver  1469 non-null   object
 2   date      1564 non-null   object
 3   subject   1561 non-null   object
 4   body      1565 non-null   object
 5   urls      1565 non-null   int64 
 6   label     1565 non-null   int64 
dtypes: int64(2), object(5)
memory usage: 85.7+ KB


Unnamed: 0,sender,receiver,date,subject,body,urls,label
0,Mail System Internal Data <MAILER-DAEMON@monke...,,28 Sep 2017 09:57:25 -0400,DON'T DELETE THIS MESSAGE -- FOLDER INTERNAL DATA,This text is part of the internal format of yo...,1,1
1,cPanel <service@cpanel.com>,jose@monkey.org,"Fri, 30 Oct 2015 00:00:48 -0500",Verify Your Account,Business with \t\t\t\t\t\t\t\tcPanel & WHM \t...,1,1
2,Microsoft Outlook <recepcao@unimedceara.com.br>,,"Fri, 30 Oct 2015 06:21:59 -0300 (BRT)",Helpdesk Mailbox Alert!!!,Your two incoming mails were placed on pending...,1,1
3,Ann Garcia <AnGarcia@mcoe.org>,"""info@maaaaa.org"" <info@maaaaa.org>","Fri, 30 Oct 2015 14:54:33 +0000",IT-Service Help Desk,Password will expire in 3 days. Click Here To ...,0,1
4,"""USAA"" <usaaacctupdate@sccu4u.com>",Recipients <usaaacctupdate@sccu4u.com>,"Fri, 30 Oct 2015 14:02:33 -0500",Final USAA Reminder - Update Your Account Now,"To ensure delivery to your inbox, please add U...",1,1


This dataset has 1565 rows and 7 columns. Most values are non-null, which is what we want: only `receiver` (1469 non-null), `date` (1564) and `subject` (1561) have missing entries.

In [47]:
print("Number of Duplicate Rows: ",nazario_df.duplicated().sum(),'\n')
print(nazario_df[nazario_df.duplicated('body')])

Number of Duplicate Rows:  0 

Empty DataFrame
Columns: [sender, receiver, date, subject, body, urls, label]
Index: []


Every row appears to be unique: we checked for duplicates both across entire rows and within our main column of interest (`body`), and found none.

In [48]:
nazario_df.nunique()

sender      1438
receiver     356
date        1564
subject     1419
body        1565
urls           2
label          1
dtype: int64

Interestingly, `label` has only one unique value, which means that every email in this dataset is classified as phishing.
Subject lines are sometimes repeated (1419 unique values across 1565 rows), but the email content (`body`) is different in every row.

In [49]:
nazario_df.rename(columns={'label': 'isPhishing'}, inplace=True)

In [50]:
nazario_df['date'] = pd.to_datetime(nazario_df['date'].str.extract(r'(\w{3}, \d{1,2} \w{3} \d{4})')[0], format='%a, %d %b %Y')

In [51]:
nazario_df.info()
nazario_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1565 entries, 0 to 1564
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   sender      1565 non-null   object        
 1   receiver    1469 non-null   object        
 2   date        1140 non-null   datetime64[ns]
 3   subject     1561 non-null   object        
 4   body        1565 non-null   object        
 5   urls        1565 non-null   int64         
 6   isPhishing  1565 non-null   int64         
dtypes: datetime64[ns](1), int64(2), object(4)
memory usage: 85.7+ KB


Unnamed: 0,sender,receiver,date,subject,body,urls,isPhishing
0,Mail System Internal Data <MAILER-DAEMON@monke...,,NaT,DON'T DELETE THIS MESSAGE -- FOLDER INTERNAL DATA,This text is part of the internal format of yo...,1,1
1,cPanel <service@cpanel.com>,jose@monkey.org,2015-10-30,Verify Your Account,Business with \t\t\t\t\t\t\t\tcPanel & WHM \t...,1,1
2,Microsoft Outlook <recepcao@unimedceara.com.br>,,2015-10-30,Helpdesk Mailbox Alert!!!,Your two incoming mails were placed on pending...,1,1
3,Ann Garcia <AnGarcia@mcoe.org>,"""info@maaaaa.org"" <info@maaaaa.org>",2015-10-30,IT-Service Help Desk,Password will expire in 3 days. Click Here To ...,0,1
4,"""USAA"" <usaaacctupdate@sccu4u.com>",Recipients <usaaacctupdate@sccu4u.com>,2015-10-30,Final USAA Reminder - Update Your Account Now,"To ensure delivery to your inbox, please add U...",1,1
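The extraction pattern in cell 50 only matches dates that start with a weekday abbreviation (for example `Fri, 30 Oct 2015`), so a value such as `28 Sep 2017 09:57:25 -0400` in row 0 becomes `NaT`. This likely accounts for at least part of the drop from 1564 to 1140 non-null dates. One possible alternative, sketched below, is to parse the raw header string with the standard library's `email.utils.parsedate_to_datetime`. The helper name `parse_email_date` is ours and is not part of the notebook.

```python
from email.utils import parsedate_to_datetime

def parse_email_date(value):
    """Parse an RFC 2822-style date string; return None if it cannot be parsed."""
    if not isinstance(value, str):
        return None  # covers NaN / missing values
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Depending on the Python version, unparseable input raises either one
        return None

print(parse_email_date("28 Sep 2017 09:57:25 -0400"))
print(parse_email_date("Fri, 30 Oct 2015 06:21:59 -0300 (BRT)"))
print(parse_email_date("not a date"))
```

Applied to the raw date strings (before cell 50), this could look like `nazario_df['date'].map(parse_email_date)`. Unlike the regex approach, it keeps the time of day and the timezone offset.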


In [52]:
nazario_df.drop(columns=['sender','receiver'], inplace=True)
nazario_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1565 entries, 0 to 1564
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   date        1140 non-null   datetime64[ns]
 1   subject     1561 non-null   object        
 2   body        1565 non-null   object        
 3   urls        1565 non-null   int64         
 4   isPhishing  1565 non-null   int64         
dtypes: datetime64[ns](1), int64(2), object(2)
memory usage: 61.3+ KB


### Dataset 9: Nigerian_Fraud.csv

In [53]:
nigerian_df.info()
nigerian_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3332 entries, 0 to 3331
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   sender    3001 non-null   object
 1   receiver  2008 non-null   object
 2   date      2850 non-null   object
 3   subject   3293 non-null   object
 4   body      3332 non-null   object
 5   urls      3332 non-null   int64 
 6   label     3332 non-null   int64 
dtypes: int64(2), object(5)
memory usage: 182.3+ KB


Unnamed: 0,sender,receiver,date,subject,body,urls,label
0,MR. JAMES NGOLA. <james_ngola2002@maktoob.com>,webmaster@aclweb.org,"Thu, 31 Oct 2002 02:38:20 +0000",URGENT BUSINESS ASSISTANCE AND PARTNERSHIP,FROM:MR. JAMES NGOLA.\nCONFIDENTIAL TEL: 233-2...,0,1
1,Mr. Ben Suleman <bensul2004nng@spinfinder.com>,R@M,"Thu, 31 Oct 2002 05:10:00 -0000",URGENT ASSISTANCE /RELATIONSHIP (P),"Dear Friend,\n\nI am Mr. Ben Suleman a custom ...",0,1
2,PRINCE OBONG ELEME <obong_715@epatra.com>,webmaster@aclweb.org,"Thu, 31 Oct 2002 22:17:55 +0100",GOOD DAY TO YOU,FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL...,0,1
3,PRINCE OBONG ELEME <obong_715@epatra.com>,webmaster@aclweb.org,"Thu, 31 Oct 2002 22:44:20 -0000",GOOD DAY TO YOU,FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL...,0,1
4,Maryam Abacha <m_abacha03@www.com>,R@M,"Fri, 01 Nov 2002 01:45:04 +0100",I Need Your Assistance.,"Dear sir, \n \nIt is with a heart full of hope...",0,1


In [54]:
nigerian_df.nunique()

sender      2876
receiver     865
date        2843
subject     2551
body        3332
urls           2
label          1
dtype: int64

In [55]:
nigerian_df.drop(columns=['sender','receiver'], inplace=True)
nigerian_df.rename(columns={'label':'isPhishing'}, inplace=True)

In [56]:
nigerian_df

Unnamed: 0,date,subject,body,urls,isPhishing
0,"Thu, 31 Oct 2002 02:38:20 +0000",URGENT BUSINESS ASSISTANCE AND PARTNERSHIP,FROM:MR. JAMES NGOLA.\nCONFIDENTIAL TEL: 233-2...,0,1
1,"Thu, 31 Oct 2002 05:10:00 -0000",URGENT ASSISTANCE /RELATIONSHIP (P),"Dear Friend,\n\nI am Mr. Ben Suleman a custom ...",0,1
2,"Thu, 31 Oct 2002 22:17:55 +0100",GOOD DAY TO YOU,FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL...,0,1
3,"Thu, 31 Oct 2002 22:44:20 -0000",GOOD DAY TO YOU,FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL...,0,1
4,"Fri, 01 Nov 2002 01:45:04 +0100",I Need Your Assistance.,"Dear sir, \n \nIt is with a heart full of hope...",0,1
...,...,...,...,...,...
3327,,CONTACT GLOBAL MAX SHIPING COMPANY,"Atten: My Dear ,\n \nI have Paid the fee for y...",0,1
3328,"Mon, 17 Sep 2007 22:28:11 +0000",TREAT AS URGENT.,\nFrom: Mr Ali Sherif. African Development Ban...,1,1
3329,"Tue, 18 Sep 2007 10:54:53 +0000",From Dr Usman Ibrahim / Mr Wahid Yoffe property.,\nFROM DR USMAN IBRAHIM DANKO.AUDITING AND ACC...,1,1
3330,"Wed, 19 Sep 2007 00:52:16 +0100",My Beloved In Christ.,"\nBeloved in the Lord Jesus Christ, PLEASE END...",1,1


In [57]:
nigerian_df['date'] = pd.to_datetime(nigerian_df['date'], utc=True)

In [58]:
nigerian_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3332 entries, 0 to 3331
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype              
---  ------      --------------  -----              
 0   date        2850 non-null   datetime64[ns, UTC]
 1   subject     3293 non-null   object             
 2   body        3332 non-null   object             
 3   urls        3332 non-null   int64              
 4   isPhishing  3332 non-null   int64              
dtypes: datetime64[ns, UTC](1), int64(2), object(2)
memory usage: 130.3+ KB


In [59]:
nigerian_df

Unnamed: 0,date,subject,body,urls,isPhishing
0,2002-10-31 02:38:20+00:00,URGENT BUSINESS ASSISTANCE AND PARTNERSHIP,FROM:MR. JAMES NGOLA.\nCONFIDENTIAL TEL: 233-2...,0,1
1,2002-10-31 05:10:00+00:00,URGENT ASSISTANCE /RELATIONSHIP (P),"Dear Friend,\n\nI am Mr. Ben Suleman a custom ...",0,1
2,2002-10-31 21:17:55+00:00,GOOD DAY TO YOU,FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL...,0,1
3,2002-10-31 22:44:20+00:00,GOOD DAY TO YOU,FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL...,0,1
4,2002-11-01 00:45:04+00:00,I Need Your Assistance.,"Dear sir, \n \nIt is with a heart full of hope...",0,1
...,...,...,...,...,...
3327,NaT,CONTACT GLOBAL MAX SHIPING COMPANY,"Atten: My Dear ,\n \nI have Paid the fee for y...",0,1
3328,2007-09-17 22:28:11+00:00,TREAT AS URGENT.,\nFrom: Mr Ali Sherif. African Development Ban...,1,1
3329,2007-09-18 10:54:53+00:00,From Dr Usman Ibrahim / Mr Wahid Yoffe property.,\nFROM DR USMAN IBRAHIM DANKO.AUDITING AND ACC...,1,1
3330,2007-09-18 23:52:16+00:00,My Beloved In Christ.,"\nBeloved in the Lord Jesus Christ, PLEASE END...",1,1


### Dataset 10: phishing_data_by_type.csv

In [60]:
phishingtype_df.info()
phishingtype_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159 entries, 0 to 158
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Subject  157 non-null    object
 1   Text     159 non-null    object
 2   Type     159 non-null    object
dtypes: object(3)
memory usage: 3.9+ KB


Unnamed: 0,Subject,Text,Type
0,URGENT BUSINESS ASSISTANCE AND PARTNERSHIP,URGENT BUSINESS ASSISTANCE AND PARTNERSHIP.\n\...,Fraud
1,URGENT ASSISTANCE /RELATIONSHIP (P),"Dear Friend,\n\nI am Mr. Ben Suleman a custom ...",Fraud
2,GOOD DAY TO YOU,FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL...,Fraud
3,from Mrs.Johnson,Goodday Dear\n\n\nI know this mail will come t...,Fraud
4,Co-Operation,FROM MR. GODWIN AKWESI\nTEL: +233 208216645\nF...,Fraud


In [61]:
phishingtype_df.rename(columns={'Subject':'subject','Text':'body','Type':'isPhishing'},inplace=True)
phishingtype_df

Unnamed: 0,subject,body,isPhishing
0,URGENT BUSINESS ASSISTANCE AND PARTNERSHIP,URGENT BUSINESS ASSISTANCE AND PARTNERSHIP.\n\...,Fraud
1,URGENT ASSISTANCE /RELATIONSHIP (P),"Dear Friend,\n\nI am Mr. Ben Suleman a custom ...",Fraud
2,GOOD DAY TO YOU,FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL...,Fraud
3,from Mrs.Johnson,Goodday Dear\n\n\nI know this mail will come t...,Fraud
4,Co-Operation,FROM MR. GODWIN AKWESI\nTEL: +233 208216645\nF...,Fraud
...,...,...,...
154,These Bags Just Arrived For Spring,Bags so perfect—you'll never want to be withou...,Commercial Spam
155,POTUS Comes to Broadway this April! Get Ticket...,INAUGURAL BROADWAY PERFORMANCE APRIL 14\r\nA N...,Commercial Spam
156,Let’s talk about Bridgerton!,GET THE BEST OF EVERYTHING IN THE APP\n\nSTARB...,Commercial Spam
157,MONDAY MIX: All eyes on Ukraine,Hi!\n \nSpring forward with our newest noPac c...,Commercial Spam


In [62]:
phishingtype_df.nunique()

subject       157
body          159
isPhishing      4
dtype: int64

In [63]:
phishingtype_df['isPhishing'].unique()

array(['Fraud', 'Phishing', 'False Positives ', 'Commercial Spam'],
      dtype=object)

One important class label we found is "False Positives" (note that the raw value carries a trailing space). We hope it can help the models differentiate between true positives and false positives. We therefore map "Fraud" and "Phishing" to 1, and "False Positives" and "Commercial Spam" to 0.

In [64]:
phishingtype_df['isPhishing'] = phishingtype_df['isPhishing'].replace({'Fraud':1,'Phishing':1,'False Positives ':0,'Commercial Spam':0})



In [65]:
phishingtype_df['isPhishing'].unique()

array([1, 0])
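The raw label `'False Positives '` carries a trailing space, so the mapping above only works if that space is reproduced exactly. A slightly more defensive variant, sketched here on a toy frame rather than the real data, strips whitespace first and uses `Series.map`, so that any label missing from the mapping shows up as `NaN` instead of passing through unchanged. The names `LABEL_MAP` and `toy_df` are illustrative.

```python
import pandas as pd

# Illustrative mapping: 1 = phishing/fraud, 0 = everything else
LABEL_MAP = {'Fraud': 1, 'Phishing': 1, 'False Positives': 0, 'Commercial Spam': 0}

# Toy stand-in for phishingtype_df (not the real data)
toy_df = pd.DataFrame(
    {'isPhishing': ['Fraud', 'Phishing', 'False Positives ', 'Commercial Spam']}
)

# Strip stray whitespace, then map; unknown labels would become NaN
mapped = toy_df['isPhishing'].str.strip().map(LABEL_MAP)

# Fail loudly if any label was not covered by the mapping
assert mapped.notna().all(), "unmapped label found"
toy_df['isPhishing'] = mapped.astype(int)
print(toy_df['isPhishing'].tolist())
```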

In [66]:
phishingtype_df.info()
phishingtype_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159 entries, 0 to 158
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   subject     157 non-null    object
 1   body        159 non-null    object
 2   isPhishing  159 non-null    int64 
dtypes: int64(1), object(2)
memory usage: 3.9+ KB


Unnamed: 0,subject,body,isPhishing
0,URGENT BUSINESS ASSISTANCE AND PARTNERSHIP,URGENT BUSINESS ASSISTANCE AND PARTNERSHIP.\n\...,1
1,URGENT ASSISTANCE /RELATIONSHIP (P),"Dear Friend,\n\nI am Mr. Ben Suleman a custom ...",1
2,GOOD DAY TO YOU,FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL...,1
3,from Mrs.Johnson,Goodday Dear\n\n\nI know this mail will come t...,1
4,Co-Operation,FROM MR. GODWIN AKWESI\nTEL: +233 208216645\nF...,1


The mapping was applied successfully: `isPhishing` is now an integer column containing only 1 and 0.

In [67]:
# Create a 'urls' column: 1 if the body contains a URL (per url_pattern), else 0
phishingtype_df['urls'] = phishingtype_df['body'].apply(lambda x: 1 if re.search(url_pattern, str(x)) else 0)
phishingtype_df

Unnamed: 0,subject,body,isPhishing,urls
0,URGENT BUSINESS ASSISTANCE AND PARTNERSHIP,URGENT BUSINESS ASSISTANCE AND PARTNERSHIP.\n\...,1,0
1,URGENT ASSISTANCE /RELATIONSHIP (P),"Dear Friend,\n\nI am Mr. Ben Suleman a custom ...",1,0
2,GOOD DAY TO YOU,FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL...,1,0
3,from Mrs.Johnson,Goodday Dear\n\n\nI know this mail will come t...,1,0
4,Co-Operation,FROM MR. GODWIN AKWESI\nTEL: +233 208216645\nF...,1,1
...,...,...,...,...
154,These Bags Just Arrived For Spring,Bags so perfect—you'll never want to be withou...,0,0
155,POTUS Comes to Broadway this April! Get Ticket...,INAUGURAL BROADWAY PERFORMANCE APRIL 14\r\nA N...,0,0
156,Let’s talk about Bridgerton!,GET THE BEST OF EVERYTHING IN THE APP\n\nSTARB...,0,0
157,MONDAY MIX: All eyes on Ukraine,Hi!\n \nSpring forward with our newest noPac c...,0,0


### Dataset 11: legit.csv

In [68]:
legit_df.info()
legit_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    1000 non-null   object
 1   label   1000 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 15.8+ KB


Unnamed: 0,text,label
0,"Dear Michael, I hope this message finds you we...",1
1,"Dear Jennifer, We hope you're doing well. We'r...",1
2,"Dear Robert, Your attention is urgently requir...",1
3,"Dear Emily, We're writing to remind you of the...",1
4,"Dear William, We need your immediate attention...",1


In [69]:
print("Number of Duplicate Rows: ",legit_df.duplicated().sum(),'\n')
print(legit_df[legit_df.duplicated('text')])

Number of Duplicate Rows:  2 

                                                  text  label
387  Dear Michael, I hope this message finds you we...      1
388  Dear Sarah, I trust this email finds you well....      1


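At first glance this looks odd, since the two rows shown clearly differ from each other. However, `duplicated('text')` flags a row when its text repeats an earlier row, not when it matches the other flagged row. Each of these is therefore presumably a copy of a row further up (row 387 begins the same way as row 0, for instance).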
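By default, `DataFrame.duplicated` marks only the later copies, so the earlier rows that 387 and 388 repeat are not listed in the output. Passing `keep=False` lists every member of each duplicate group, which makes the pairing visible. A small sketch on made-up data (the frame `toy` is purely illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    'text': ['email A', 'email B', 'email C', 'email A', 'email B'],
    'label': [1, 1, 1, 1, 1],
})

# Default keep='first': only the later copies (rows 3 and 4) are flagged
later_copies = toy[toy.duplicated('text')]

# keep=False: every row that belongs to a duplicate group is flagged
all_copies = toy[toy.duplicated('text', keep=False)].sort_values('text', kind='stable')

print(later_copies.index.tolist())
print(all_copies.index.tolist())
```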

In [70]:
legit_df.nunique()

text     998
label      1
dtype: int64

`label` has a single value, as expected, and `text` is almost entirely unique: 998 distinct values out of 1000 rows, consistent with the 2 duplicates found above.

In [71]:
# Rename columns to match the schema used for the other datasets
legit_df.rename(columns={'label': 'isPhishing','text': 'body'}, inplace=True)
legit_df.info()
legit_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   body        1000 non-null   object
 1   isPhishing  1000 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 15.8+ KB


Unnamed: 0,body,isPhishing
0,"Dear Michael, I hope this message finds you we...",1
1,"Dear Jennifer, We hope you're doing well. We'r...",1
2,"Dear Robert, Your attention is urgently requir...",1
3,"Dear Emily, We're writing to remind you of the...",1
4,"Dear William, We need your immediate attention...",1


In [72]:
# Create a 'urls' column: 1 if the body contains a URL (per url_pattern), else 0
legit_df['urls'] = legit_df['body'].apply(lambda x: 1 if re.search(url_pattern, str(x)) else 0)
legit_df

Unnamed: 0,body,isPhishing,urls
0,"Dear Michael, I hope this message finds you we...",1,1
1,"Dear Jennifer, We hope you're doing well. We'r...",1,1
2,"Dear Robert, Your attention is urgently requir...",1,1
3,"Dear Emily, We're writing to remind you of the...",1,1
4,"Dear William, We need your immediate attention...",1,1
...,...,...,...
995,"Dear Ms. Julia Scott, Are you fascinated by th...",1,1
996,"Dear Mr. Jonathan Taylor, Are you ready to emb...",1,1
997,"Dear Ms. Samantha Clark, Are you captivated by...",1,1
998,"Dear Mr. Benjamin Davis, Are you passionate ab...",1,1


### Dataset 12: phishing.csv

In [73]:
phishing_df.info()
phishing_df.head()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 595 entries, ('Dear User', ' We have received reports indicating that your account has been flagged for suspicious activities. To ensure the safety of your account and prevent any unauthorized access', ' we require you to confirm your account details immediately. By clicking on the link below', ' you will be directed to a secure page where you can update your account information. Failure to comply within the specified timeframe may result in a temporary suspension of your account. chaseonline-login.com Thank you for your cooperation in this matter. Should you have any concerns or questions', ' please contact our support team urgently. Sincerely') to ('Hi Sarah', ' I hope this email finds you well. We are reaching out to individuals who are passionate about making a difference in war-torn communities. Your support can help us provide aid and resources to those who have been affected the most. To contribute and learn more about our organi

Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,text,label
Dear User,We have received reports indicating that your account has been flagged for suspicious activities. To ensure the safety of your account and prevent any unauthorized access,we require you to confirm your account details immediately. By clicking on the link below,you will be directed to a secure page where you can update your account information. Failure to comply within the specified timeframe may result in a temporary suspension of your account. chaseonline-login.com Thank you for your cooperation in this matter. Should you have any concerns or questions,please contact our support team urgently. Sincerely,The Account Security Team,1.0
Dear Michael,I hope this email finds you well. I am Laura Thompson,the Head of Security at our esteemed financial institution. We have recently detected some suspicious activities on your account and it is crucial that we address this issue promptly to protect your funds. To ensure the security of your account,we have implemented an enhanced verification process. Please click on the following link to verify your account details: https://account-verification-link.com Failure to comply within 48 hours may result in temporary suspension or permanent closure of your account. Your immediate attention is required. For any inquiries or concerns,please feel free to reach out to our dedicated customer support team at support@financialinstitution.com. Thank you for your cooperation. Best regards,Laura Thompson Head of Security Financial Ins...,1.0
Dear Sarah,I am Daniel Anderson,the Security Manager at our esteemed financial institution. We have recently detected some unusual activities on your account and it is vital that we address this matter immediately to safeguard your funds. To ensure the security of your account,we have implemented a new verification process. Please click on the following link to verify your account details: https://secure-verification-link.com Failure to comply within 48 hours may lead to temporary suspension or permanent closure of your account. Your immediate attention is required. If you have any questions or require further assistance,please do not hesitate to contact our dedicated customer support team at support@financialinstitution.com. Thank you for your cooperation in this matter. Sincerely,Daniel Anderson Security Manager Financial In...,1.0
Dear Olivia,We hope this email finds you well. Our system has detected suspicious activities on your social media account. To ensure the security of your account,we kindly request you to verify your account information. To proceed with the verification process,please click on the following link: https://account-verification-link.com Please be aware that failure to verify your account within 24 hours may result in a temporary suspension. Your immediate attention to this matter is highly appreciated. Best regards,Benjamin Nelson Social Media Support Team,1,
Dear John,I hope this email finds you well. I am reaching out to you as a representative of the Social Media Security Team. We have noticed some suspicious activities on your social media account and need your immediate attention to ensure the security of your personal information. To resolve this issue,we kindly request that you verify your account by clicking on the following link: https://socialmedia-verification.com/verify?id=12345678 Please note that failure to verify your account within 24 hours might result in permanent suspension. We appreciate your prompt action in this matter. Stay secure,Emily Johnson Social Media Security Team,1,,


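As we can see, the CSV file is poorly formatted: the email text contains unquoted commas, so pandas splits each email across several columns. We will need to parse it ourselves into the same layout we have been using for the other datasets.

Below, we read the file line by line so that each line becomes one row of the dataframe: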

In [74]:
# Open the file and read lines
with open('../datasets/phishing.csv', 'r', encoding='utf-8') as file:
    lines = file.readlines()

# Remove newline characters and extra spaces
lines = [line.strip() for line in lines if line.strip()]

# Create a DataFrame with one column: 'raw_text'
phishing_df = pd.DataFrame(lines, columns=['raw_text'])

In [75]:
phishing_df

Unnamed: 0,raw_text
0,"text,label"
1,"Dear User, We have received reports indicating..."
2,"Dear Sarah Thompson, I hope this email finds y..."
3,"Dear Michael, I hope this email finds you well..."
4,"Dear Sarah, I am Daniel Anderson, the Security..."
...,...
996,"Hi Olivia, I hope this email finds you in good..."
997,"Dear Michael, I hope this email finds you well..."
998,"Hi Jessica, I hope this email finds you in goo..."
999,"Dear Benjamin, I hope this email finds you wel..."


We need to do a couple of things.

First, we remove the first row, which is the header line (`text,label`) that was read in as data.

In [76]:
phishing_df = phishing_df.iloc[1:].reset_index(drop=True)
phishing_df

Unnamed: 0,raw_text
0,"Dear User, We have received reports indicating..."
1,"Dear Sarah Thompson, I hope this email finds y..."
2,"Dear Michael, I hope this email finds you well..."
3,"Dear Sarah, I am Daniel Anderson, the Security..."
4,"Dear John, I hope this email finds you well. A..."
...,...
995,"Hi Olivia, I hope this email finds you in good..."
996,"Dear Michael, I hope this email finds you well..."
997,"Hi Jessica, I hope this email finds you in goo..."
998,"Dear Benjamin, I hope this email finds you wel..."


Next, each row has a ",1" (the label) appended to the end. We will remove it and add the target column separately.

In [77]:
print(phishing_df.loc[0, 'raw_text'])

Dear User, We have received reports indicating that your account has been flagged for suspicious activities. To ensure the safety of your account and prevent any unauthorized access, we require you to confirm your account details immediately. By clicking on the link below, you will be directed to a secure page where you can update your account information. Failure to comply within the specified timeframe may result in a temporary suspension of your account. chaseonline-login.com Thank you for your cooperation in this matter. Should you have any concerns or questions, please contact our support team urgently. Sincerely, The Account Security Team,1


In [78]:
phishing_df['raw_text'] = phishing_df['raw_text'].str.replace(r',1$', '', regex=True)

In [79]:
print(phishing_df.loc[0, 'raw_text'])

Dear User, We have received reports indicating that your account has been flagged for suspicious activities. To ensure the safety of your account and prevent any unauthorized access, we require you to confirm your account details immediately. By clicking on the link below, you will be directed to a secure page where you can update your account information. Failure to comply within the specified timeframe may result in a temporary suspension of your account. chaseonline-login.com Thank you for your cooperation in this matter. Should you have any concerns or questions, please contact our support team urgently. Sincerely, The Account Security Team


As we can see above, the ",1" has been removed from the rows. Now we rename the text column to `body` and add the `isPhishing` target column.

In [80]:
phishing_df = phishing_df.rename(columns={'raw_text': 'body'})
phishing_df['isPhishing'] = 1
phishing_df

Unnamed: 0,body,isPhishing
0,"Dear User, We have received reports indicating...",1
1,"Dear Sarah Thompson, I hope this email finds y...",1
2,"Dear Michael, I hope this email finds you well...",1
3,"Dear Sarah, I am Daniel Anderson, the Security...",1
4,"Dear John, I hope this email finds you well. A...",1
...,...,...
995,"Hi Olivia, I hope this email finds you in good...",1
996,"Dear Michael, I hope this email finds you well...",1
997,"Hi Jessica, I hope this email finds you in goo...",1
998,"Dear Benjamin, I hope this email finds you wel...",1
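Stripping `,1` and then assigning `isPhishing = 1` assumes that every line ends with exactly that label. A variant that reads the label from the file instead is to split each line on its last comma. The sketch below uses plain Python. `split_text_and_label` is a hypothetical helper, and the sample lines are shortened stand-ins for real rows.

```python
def split_text_and_label(line):
    """Split 'some, email, text,1' into ('some, email, text', 1)."""
    text, sep, label = line.rpartition(',')
    if not sep or not label.strip().isdigit():
        raise ValueError(f"no trailing label found in line: {line[:40]!r}")
    return text.strip(), int(label)

sample_lines = [
    "Dear User, We have received reports. Sincerely, The Account Security Team,1",
    "Hi Sarah, I hope this email finds you well.,1",
]

pairs = [split_text_and_label(line) for line in sample_lines]
for body, label in pairs:
    print(label, body)
```

The resulting pairs could then be turned into a frame with `pd.DataFrame(pairs, columns=['body', 'isPhishing'])`. A line without a numeric trailing label raises an error instead of being silently labelled as phishing.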


In [81]:
phishing_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   body        1000 non-null   object
 1   isPhishing  1000 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 15.8+ KB


In [82]:
# Create a 'urls' column: 1 if the body contains a URL (per url_pattern), else 0
phishing_df['urls'] = phishing_df['body'].apply(lambda x: 1 if re.search(url_pattern, str(x)) else 0)
phishing_df

Unnamed: 0,body,isPhishing,urls
0,"Dear User, We have received reports indicating...",1,0
1,"Dear Sarah Thompson, I hope this email finds y...",1,1
2,"Dear Michael, I hope this email finds you well...",1,1
3,"Dear Sarah, I am Daniel Anderson, the Security...",1,1
4,"Dear John, I hope this email finds you well. A...",1,1
...,...,...,...
995,"Hi Olivia, I hope this email finds you in good...",1,1
996,"Dear Michael, I hope this email finds you well...",1,1
997,"Hi Jessica, I hope this email finds you in goo...",1,1
998,"Dear Benjamin, I hope this email finds you wel...",1,1
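Row 0 is flagged `urls = 0` even though its body mentions `chaseonline-login.com`, because `url_pattern` only matches text starting with `http://`, `https://` or `www.`. If bare domains should also count as links, one option is a second pattern along the lines below. This is only a sketch: the short TLD list and the helper `has_url` are assumptions for illustration, and a real check would need a fuller TLD list.

```python
import re

# Same idea as url_pattern in the notebook: explicit scheme or leading "www."
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+', re.IGNORECASE)

# Hypothetical extension: bare domains ending in a few illustrative TLDs.
# The lookbehind rejects matches preceded by '@' (or by part of a longer token),
# so email addresses such as support@example.com are not counted as links.
BARE_DOMAIN = re.compile(
    r'(?<![@\w.-])(?:[a-z0-9-]+\.)+(?:com|net|org)\b', re.IGNORECASE
)

def has_url(text):
    """Return 1 if the text contains a URL or a bare domain, else 0."""
    text = str(text)
    return int(bool(URL_PATTERN.search(text) or BARE_DOMAIN.search(text)))

print(has_url("suspension of your account. chaseonline-login.com Thank you"))
print(has_url("contact support@financialinstitution.com today"))
print(has_url("Please verify at https://account-verification-link.com now"))
```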


### Dataset 13: phishing_dataset_with_category.csv

In [83]:
phishingcategory_df.info()
phishingcategory_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   text      1000 non-null   object
 1   category  1000 non-null   object
 2   label     1000 non-null   object
dtypes: object(3)
memory usage: 23.6+ KB


Unnamed: 0,text,category,label
0,Warning: Unusual login attempt detected on you...,urgency,phishing
1,Urgent! Your Google has been compromised. Clic...,urgency,phishing
2,This is an official notice from Amazon. Your a...,authority,phishing
3,"As per HMRC regulations, you must update your ...",authority,phishing
4,Immediate action required: Your Spotify subscr...,urgency,phishing


In [84]:
phishingcategory_df.rename(columns={'text':'body','label':'isPhishing'}, inplace=True)
phishingcategory_df

Unnamed: 0,body,category,isPhishing
0,Warning: Unusual login attempt detected on you...,urgency,phishing
1,Urgent! Your Google has been compromised. Clic...,urgency,phishing
2,This is an official notice from Amazon. Your a...,authority,phishing
3,"As per HMRC regulations, you must update your ...",authority,phishing
4,Immediate action required: Your Spotify subscr...,urgency,phishing
...,...,...,...
995,Attention: Your recent activity violates our G...,authority,phishing
996,You’ve been selected for a premium Netflix tri...,persuasion,phishing
997,This is an official notice from Apple. Your ac...,authority,phishing
998,Claim your limited-time special cashback deal ...,persuasion,phishing


In [85]:
phishingcategory_df.nunique()

body          79
category       3
isPhishing     1
dtype: int64
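Only 79 of the 1000 `body` values are distinct, so most rows in this dataset are repeats of a small set of messages. Before merging, it can help to see how heavily each message is repeated. A minimal sketch on toy data (the frame `toy` and its contents are invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({'body': ['msg A', 'msg B', 'msg A', 'msg A', 'msg C', 'msg B']})

n_rows = len(toy)
n_unique = toy['body'].nunique()
print(f"{n_unique} distinct bodies in {n_rows} rows "
      f"({n_rows - n_unique} rows are repeats)")

# How often each distinct body occurs, most frequent first
counts = toy['body'].value_counts()
print(counts.head())
```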

In [86]:
phishingcategory_df.drop(columns=['category'], inplace=True)
phishingcategory_df['isPhishing'] = phishingcategory_df['isPhishing'].replace('phishing', 1)
phishingcategory_df['isPhishing'] = phishingcategory_df['isPhishing'].astype(int)



In [87]:
phishingcategory_df

Unnamed: 0,body,isPhishing
0,Warning: Unusual login attempt detected on you...,1
1,Urgent! Your Google has been compromised. Clic...,1
2,This is an official notice from Amazon. Your a...,1
3,"As per HMRC regulations, you must update your ...",1
4,Immediate action required: Your Spotify subscr...,1
...,...,...
995,Attention: Your recent activity violates our G...,1
996,You’ve been selected for a premium Netflix tri...,1
997,This is an official notice from Apple. Your ac...,1
998,Claim your limited-time special cashback deal ...,1


In [88]:
# Create a 'urls' column: 1 if the body contains a URL (per url_pattern), else 0
phishingcategory_df['urls'] = phishingcategory_df['body'].apply(lambda x: 1 if re.search(url_pattern, str(x)) else 0)
phishingcategory_df

Unnamed: 0,body,isPhishing,urls
0,Warning: Unusual login attempt detected on you...,1,0
1,Urgent! Your Google has been compromised. Clic...,1,0
2,This is an official notice from Amazon. Your a...,1,0
3,"As per HMRC regulations, you must update your ...",1,0
4,Immediate action required: Your Spotify subscr...,1,0
...,...,...,...
995,Attention: Your recent activity violates our G...,1,0
996,You’ve been selected for a premium Netflix tri...,1,0
997,This is an official notice from Apple. Your ac...,1,0
998,Claim your limited-time special cashback deal ...,1,0


In [89]:
phishingcategory_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   body        1000 non-null   object
 1   isPhishing  1000 non-null   int64 
 2   urls        1000 non-null   int64 
dtypes: int64(2), object(1)
memory usage: 23.6+ KB


### Final Grouping/Merging:

Now we combine all the datasets into a single baseline dataset that we can continue from.

In [90]:
final_df = pd.concat([assassin_df, ling_df, enron_df, 
                     trec05_df, trec06_df, trec07_df, 
                     ceas_df, nazario_df, nigerian_df,
                     phishingtype_df, legit_df, phishing_df,
                     phishingcategory_df], 
                     ignore_index=True, sort=False)
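After `pd.concat`, the rows no longer record which dataset they came from, which makes it hard to audit labels per source. For example, the `legit` frame above carries `isPhishing = 1` on every row, which may be worth checking against that dataset's documentation. One way to keep the origin, sketched here on toy frames, is to add a `source` column before concatenating. The `source` column, the dataset names and the toy frames are assumptions, not part of the notebook.

```python
import pandas as pd

# Toy stand-ins for the cleaned per-dataset frames
frames = {
    'dataset_a': pd.DataFrame({'body': ['a', 'b'], 'isPhishing': [1, 1], 'urls': [1, 0]}),
    'dataset_b': pd.DataFrame({'body': ['c', 'd', 'e'], 'isPhishing': [0, 0, 1], 'urls': [0, 1, 1]}),
}

# Tag each frame with its origin, then concatenate as before
merged = pd.concat(
    [df.assign(source=name) for name, df in frames.items()],
    ignore_index=True, sort=False,
)

# Label distribution per source: a quick sanity check before modelling
label_counts = pd.crosstab(merged['source'], merged['isPhishing'])
print(label_counts)
```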

In [91]:
final_df.info()
final_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 210982 entries, 0 to 210981
Data columns (total 5 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   date        171635 non-null  object
 1   subject     205532 non-null  object
 2   body        210979 non-null  object
 3   isPhishing  210982 non-null  int64 
 4   urls        210982 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 8.0+ MB


Unnamed: 0,date,subject,body,isPhishing,urls
0,2002-08-22 00:00:00,Re: New Sequences Window,"Date: Wed, 21 Aug 2002 10:54:46 -0500 ...",0,1
1,2002-08-22 00:00:00,[zzzzteana] RE: Alexander,"Martin A posted:\nTassos Papadopoulos, the Gre...",0,1
2,2002-08-22 00:00:00,[zzzzteana] Moscow bomber,Man Threatens Explosion In Moscow \n\nThursday...,0,1
3,2002-08-22 00:00:00,[IRR] Klez: The Virus That Won't Die,Klez: The Virus That Won't Die\n \nAlready the...,0,1
4,2002-08-22 00:00:00,Re: [zzzzteana] Nothing like mama used to make,"> in adding cream to spaghetti carbonara, whi...",0,1
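In the merged frame, `date` has dtype `object`, presumably because the individual datasets contribute a mix of timezone-aware and timezone-naive values. If a single datetime dtype is wanted later, one option is to convert everything to UTC. A hedged sketch on a toy column follows. Naive values are treated as UTC here, which is an assumption that may not hold for every source.

```python
import pandas as pd

# Toy column mixing tz-aware, tz-naive and missing values
dates = pd.Series(
    [pd.Timestamp('2008-08-05 16:31:02-07:00'), pd.Timestamp('2002-08-22'), None],
    dtype=object,
)

# utc=True converts aware values to UTC and localizes naive ones as UTC
dates_utc = pd.to_datetime(dates, utc=True, errors='coerce')
print(dates_utc)
```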


# Data Preparation

In [92]:
final_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 210982 entries, 0 to 210981
Data columns (total 5 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   date        171635 non-null  object
 1   subject     205532 non-null  object
 2   body        210979 non-null  object
 3   isPhishing  210982 non-null  int64 
 4   urls        210982 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 8.0+ MB


In [93]:
final_df = final_df.drop_duplicates()
final_df = final_df.reset_index(drop=True)
final_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 210059 entries, 0 to 210058
Data columns (total 5 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   date        171635 non-null  object
 1   subject     205532 non-null  object
 2   body        210056 non-null  object
 3   isPhishing  210059 non-null  int64 
 4   urls        210059 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 8.0+ MB


As it turns out, there were 923 duplicate rows (210982 - 210059), and we have now removed them.

In [94]:
final_df.to_csv('../datasets/final_dataset.csv', index=False, encoding='utf-8', sep=',')

In [95]:
mismatch_df = final_df[
    ((final_df['isPhishing'] == 1) & (final_df['urls'] == 0)) |
    ((final_df['isPhishing'] == 0) & (final_df['urls'] == 1))
]
mismatch_df

Unnamed: 0,date,subject,body,isPhishing,urls
0,2002-08-22 00:00:00,Re: New Sequences Window,"Date: Wed, 21 Aug 2002 10:54:46 -0500 ...",0,1
1,2002-08-22 00:00:00,[zzzzteana] RE: Alexander,"Martin A posted:\nTassos Papadopoulos, the Gre...",0,1
2,2002-08-22 00:00:00,[zzzzteana] Moscow bomber,Man Threatens Explosion In Moscow \n\nThursday...,0,1
3,2002-08-22 00:00:00,[IRR] Klez: The Virus That Won't Die,Klez: The Virus That Won't Die\n \nAlready the...,0,1
4,2002-08-22 00:00:00,Re: [zzzzteana] Nothing like mama used to make,"> in adding cream to spaghetti carbonara, whi...",0,1
...,...,...,...,...,...
210054,,,You’ve been selected for a premium Spotify tri...,1,0
210055,,,Official communication: Your banking will be s...,1,0
210056,,,Claim your limited-time special cashback deal ...,1,0
210057,,,"Dear user, we noticed irregular transactions i...",1,0
