**Overview and Introduction**

This project intends to highlight how the latest machine learning techniques can be used to train a model that will detect phishing emails.

Phishing has been described as "the most prevalent method of cybercrime that convinces people to provide sensitive information" <cite id="ry8t5"><a href="#zotero%7C12693434%2FNB8NU34F">(Salloum et al., 2021)</a></cite>. Applications of social engineering, targeted research, and technological improvements have allowed for sophisticated phishing techniques to develop alongside improving attempts at their detection. There are many ways in which phishing attempts can be made, whether through SMS, phone calls, websites or emails. These attempts all tend to share the fact that they imitate a third party in an attempt to confuse the victim and convince them to share sensitive information.

The use of phishing emails has been on the rise, especially during/after the COVID-19 pandemic, with the Anti-Phishing Working Group reporting the detection of 128,926 phishing emails during Q3 of 2020 in comparison with 44,497 and 44,008 detected in Q2 and Q1 of 2020, respectively.

There has been extensive research into the application of artificial intelligence and machine learning techniques to improve phishing email detection: <cite id="8ekri"><a href="#zotero%7C12693434%2FFTMZJ2VS">(Vazhayil and Nb, 2018)</a></cite>, <cite id="zyyyr"><a href="#zotero%7C12693434%2FTK6AGCGG">(Egozi and Verma, 2018)</a></cite>, <cite id="nrz4h"><a href="#zotero%7C12693434%2FN3Q6U5PG">(Alhogail and Alsabih, 2021)</a></cite>, <cite id="n583h"><a href="#zotero%7C12693434%2FUH76NDRX">(Divakaran and Oest, 2022)</a></cite>.

Machine learning is described as "a scientific field that focuses on the design of computer models and algorithms that can perform specific tasks, often involving pattern recognition, without the need to be explicitly programmed" <cite id="9thzt"><a href="#zotero%7C12693434%2FF57WLZ2M">(Raschka, Patterson and Nolet, 2020)</a></cite>. Within the last decade, the general-purpose Python language has seen tremendous growth in popularity within the scientific computing community. Many of the most widely used machine learning and deep learning libraries are now Python-based.

Open-source, community-contributed libraries such as [Pandas](https://svn.scipy.org/about.html), [NumPy](https://numpy.org/), [Scikit-learn](https://scikit-learn.org/stable/), and [PyTorch](https://pytorch.org/features/) provide a relatively easy to learn ecosystem for the machine learning community to develop within.

The task of phishing email detection can be described as a Natural Language Processing problem of text classification. By learning from pre-defined classes of phishing or legitimate emails, models can be trained on previously collected, labelled emails (supervised learning, using either classical or deep learning methods) and then used to classify new, previously unseen email samples.


This project will walk through the required steps and different methods of creating a phishing email detector over the course of six Colab notebooks:

1. Data Extraction
2. [Classical Machine Learning Classification Methods (Scikit-learn)](https://colab.research.google.com/drive/1uAGq5z3_AOcMHMNHxmCWT9Hj8RNZQZRR?usp=sharing)
3. [Deep Learning with PyTorch](https://colab.research.google.com/drive/10Q7_gMq9MxxY9vpBwHDO_uokGEVNCoKw?usp=sharing)
4. [Deep Learning with HuggingFace Transformers](https://colab.research.google.com/drive/1zmokI_UWsUMJ-cumXlaxaEXHWX6_Vlrs?usp=sharing)
5. [A look into Federated Learning with Flower](https://colab.research.google.com/drive/1-CcQWBGSY6U1FHZ7j_m4_z9D8DwW43eL?usp=sharing)
6. [Finished phishing detector hosted via Gradio](https://colab.research.google.com/drive/13arA1h6_yBiPJkxHCIrmiRo5IT8YjaBO?usp=sharing)


**Email Datasets**

The email datasets used in this phishing-email detection project come from two sources. The first is the email corpus of <cite id="j7jah"><a href="#zotero%7C12693434%2F4V7AWI4D">(Aassal et al., 2018)</a></cite>, which was collated as part of the International Workshop on Security and Privacy Analytics (IWSPA) Anti-Phishing Shared Task, collecting emails from sources such as the "Wikileaks archives, Democratic National Committee, Hacking Team, Sony emails, the Enron Dataset and SpamAssassin". The IWSPA email corpus provides samples of both phishing and legitimate emails, some of which have full headers and some of which are email-body-only.

Aassal et al. note that legitimate emails were relatively easy to find in comparison to correctly identified and labelled phishing emails. In addition to the IWSPA corpus, this project uses the publicly available phishing-email corpus made available by Jose Nazario. These phishing emails, provided with full email headers, are a popular dataset throughout recent phishing email detection literature:
<cite id="0n95c"><a href="#zotero%7C12693434%2FVG4LN3SZ">(Toolan and Carthy, 2010)</a></cite>, <cite id="nfgjs"><a href="#zotero%7C12693434%2F77DQ22QF">(Bountakas, Koutroumpouchos and Xenakis, 2021)</a></cite>, <cite id="mjbzp"><a href="#zotero%7C12693434%2F44AQ27V5">(Thapa et al., 2021)</a></cite>, <cite id="qcl3j"><a href="#zotero%7C12693434%2FGQKQQHUE">(Dewis and Viana, 2022)</a></cite>

The IWSPA emails were provided in .txt file format for each individual email, while the Nazario corpus was in the form of .mbox. The original datasets were extracted to .csv files and stored in an AWS S3 bucket for ease of access.
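The conversion itself is not shown in this notebook. As a rough illustration only, an .mbox archive could be flattened into a one-column CSV with the standard library along the following lines. The helper name `mbox_to_csv` and the index-plus-single-column layout are assumptions modelled on the tables shown further down, and the original extraction code may have differed.

```python
import csv
import mailbox


def mbox_to_csv(mbox_path, csv_path):
    """Write every message in an .mbox file to a CSV, one raw email string per row.

    Hypothetical helper: it mirrors the index + single-column layout used in
    this notebook, not necessarily the original extraction code.
    """
    box = mailbox.mbox(mbox_path)
    count = 0
    try:
        with open(csv_path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(["", "0"])  # index column and a single data column
            for index, message in enumerate(box):
                # as_string() flattens headers and body back into raw text
                writer.writerow([index, message.as_string()])
                count += 1
    finally:
        box.close()
    return count
```

A CSV written this way can be read back with `pd.read_csv(csv_path, index_col=0)`, which is the call used for the hosted datasets below.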

In [None]:
# set dataset Paths
LEGIT_HEADER_IWSPA_PATH= "https://anti-phish.s3.eu-west-1.amazonaws.com/iwspa_legit_header.csv"
PHISHING_HEADER_IWSPA_PATH = "https://anti-phish.s3.eu-west-1.amazonaws.com/iwspa_phish_header.csv"

PHISHING_HEADER_NAZARIO_PATH = "https://anti-phish.s3.eu-west-1.amazonaws.com/nazario_phish_header.csv"

LEGIT_NO_HEADER_IWSPA_PATH = "https://anti-phish.s3.eu-west-1.amazonaws.com/iwspa_legit_no_header.csv"
PHISHING_NO_HEADER_IWSPA_PATH = "https://anti-phish.s3.eu-west-1.amazonaws.com/iwspa_phish_no_header.csv"

The subset of labeled emails without a header will be used as a way to compare the different Machine Learning techniques by evaluating each model's accuracy in identifying phishing emails that have no header, after being trained on emails with full headers.

Below is a glimpse into each dataset, showing 4,080 legitimate samples and 10,305 phishing samples:

In [None]:
import pandas as pd

iwspa_phishing = pd.read_csv(PHISHING_HEADER_IWSPA_PATH, index_col=0)
iwspa_legit = pd.read_csv(LEGIT_HEADER_IWSPA_PATH, index_col=0)
nazario_phishing = pd.read_csv(PHISHING_HEADER_NAZARIO_PATH, index_col=0)

In [None]:
iwspa_legit

Unnamed: 0,0
0,"Status: RO\nFrom: ""Lynton, Michael"" <MAILER-DA..."
1,"Status: RO\nFrom: ""Mosko, Steve"" <MAILER-DAEMO..."
2,Received: from domain.com (146.215.230.105) by...
3,Received: from domain (192.168.10.251) by doma...
4,Received: from domain ([fe80::f85f:3b98:e405:6...
...,...
4076,"From: ""User"" <user@domain>\nTo: ""User""\n\t<use..."
4077,"From: ""User"" <user@domain>\nTo: Daniel Strauss..."
4078,Received: from domain (192.168.10.251) by doma...
4079,Received: from USSDIXMSG20.spe.organization.co...


In [None]:
iwspa_phishing

Unnamed: 0,0
0,Return-Path: <user@domain>\nX-Original-To: use...
1,Return-Path: <user@domain>\nX-Original-To: use...
2,Return-Path: <user@domain>\nX-Original-To: use...
3,Return-Path: <user@domain>\nX-Original-To: use...
4,Return-Path: <user@domain>\nX-Original-To: use...
...,...
498,Return-Path: <user@domain>\nX-Original-To: use...
499,Return-Path: <user@domain>\nX-Original-To: use...
500,Return-Path: <user@domain>\nX-Original-To: use...
501,Return-Path: <user@domain>\nX-Original-To: use...


In [None]:
nazario_phishing

Unnamed: 0,0
0,Return-Path: acmim@up.edu\r\nDelivered-To: jos...
1,Return-Path: no-ramericanexpress@usxchange.net...
2,Return-Path: david231@wsu.edu\r\nDelivered-To:...
3,Return-Path: exports@delightdesigns.co.in\r\nD...
4,Return-Path: michael911@wsu.edu\r\nDelivered-T...
...,...
9799,Return-Path: admin@ordnungswelt.net\r\nDeliver...
9800,Return-Path: MAILER-DAEMON\r\nDelivered-To: jo...
9801,Return-Path: reply@c.constantcontact.com\r\nDe...
9802,Return-Path: amit@aonenonwoven.com\r\nDelivere...


<cite id="a4zqo"><a href="#zotero%7C12693434%2FVG4LN3SZ">(Toolan and Carthy, 2010)</a></cite> have identified over 40 important features of email messages that can be used to identify phishing emails.

For example, <i>body_richness</i> is defined as the ratio of the number of words to the number of characters in the email body:

$$body\_richness = \frac{body\_noWords}{body\_noCharacters}$$
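As a small worked example of this ratio, the sketch below counts words and word characters with the same `\w`-based regular expressions that the helper functions in this notebook use. The function name `richness` and the sample sentence are illustrative only.

```python
import re


def richness(text):
    """Ratio of the word count to the word-character count of a string."""
    words = re.findall(r"\w+", text)
    chars = re.findall(r"\w", text)
    # guard against empty text to avoid dividing by zero
    return len(words) / len(chars) if chars else 0.0


sample = "Please confirm your account"
print(round(richness(sample), 4))  # 4 words / 24 word characters
```

Shorter average word length gives a higher richness value, since each word contributes fewer characters to the denominator.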


Functions for header-based, subject-based, and body-based features are defined below and will be used to extract features from the dataset of phishing and legitimate emails.

In [None]:
import sys
import os
import re
import email
import email.policy
import mailbox
from email.parser import BytesParser

from bs4 import BeautifulSoup # used to parse HTML content

# declare functions used in preprocessing data and feature extraction

# get subject string from an email.message object
# return a single space if the email has no Subject header
def get_subject(message):
    if message['Subject'] is None:
        return ' '
    else:
        return message['Subject']

# get body string from an email.message object
# if email has multiple parts, concatenate text from all parts
def get_body(message):
    if message.is_multipart():
        contents = []
        for part in message.walk():
            if part.is_multipart() or part.get_content_disposition()=='attachment':
                continue
            contents.append(str(get_body(part)))
        content = '\n\n'.join(contents)
    else:
        content = message.get_payload()
    return content

# use regex to find all urls in an email.message object's body
def get_urls(message):
    return re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', get_body(message))

# use regex to find all email addresses in a string,
# return None if there are none, the address string if there is exactly one, otherwise a list of addresses
def get_email_from_string(string):
    email_address = re.findall(r'[a-zA-Z0-9+._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+', string)
    if len(email_address) == 0:
        email_address = None
    elif len(email_address) == 1:
        email_address = email_address[0]
    return email_address

# count characters in a string by using regex to list all unicode word characters
def count_chars(string):
    return len(re.findall(r'\w', string))

# use regex to list all grouped sequences of unicode word characters
def get_words(string):
    return re.findall(r'\w+', string)

# count words in a string by getting the length of an array of the words
def count_words(string):
    return len(get_words(string))

# count distinct words in a string by getting the length of a set of the words
def count_distinct_words(string):
    return len(set(get_words(string)))

# use regex to get the count of each functional word found in a string
def count_functional_words(string):
    functional_word_counts = {}

    for word in functional_words:
        word_count = len(re.findall(word, string))
        functional_word_counts[word] = word_count
    return functional_word_counts

# extract features from every .txt file in a directory
def process_txt_files(path, phishing):
    data = []

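    # process files in numeric order of their file names, e.g. 1.txt, 2.txt, ...
    txt_files = [file for file in os.listdir(path) if file.endswith('.txt')]
    for file in sorted(txt_files, key = lambda x: int(x.split(".")[0])):
        # use a context manager so each file handle is closed after parsing
        with open(os.path.join(path, file)) as handle:
            message = email.message_from_file(handle, policy=email.policy.SMTPUTF8)
        features = process_message(message, phishing)
        data.append(features)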

    return data

# extract features from every message found in an .mbox file
def process_mbox(path, phishing):
    data = []

    mbox = mailbox.mbox(path, factory=BytesParser(policy=email.policy.SMTPUTF8).parse)

    for message in mbox:
        features = process_message(message, phishing)
        data.append(features)

    return data

# count the total number of functional words found in a string
def total_functional_words(string):
    return sum(count_functional_words(string).values())

# make a list of words commonly found in phishing emails
functional_words = ["access", "account","agree", "alert", "bank", "credit",
                "click", "confirm", "identity", "inconvenience", "information",
                "limited", "log", "password", "recently", "security"]
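Because `count_functional_words` passes each word to `re.findall` as a bare pattern, matching is case-sensitive and also succeeds inside longer words. The sketch below illustrates that behaviour next to a whole-word, case-insensitive variant. `count_whole_word` is a hypothetical alternative for comparison and is not used anywhere else in this notebook.

```python
import re

text = "Login to your Account. Your account log shows a blog post."


def count_substring(word, string):
    # same matching rule as count_functional_words: case-sensitive, any position
    return len(re.findall(word, string))


def count_whole_word(word, string):
    # hypothetical variant: whole words only, ignoring case
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, string, flags=re.IGNORECASE))


for word in ("log", "account"):
    print(word, count_substring(word, text), count_whole_word(word, text))
```

Which rule is preferable depends on the feature definition being reproduced. The counts reported in the tables below come from the substring version.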

In [None]:
# declare functions used to extract header and subject features

# concatenate the values of all header field: data pairs into a single string
def header_to_string(message):
    header_string = ""
    for key in message.keys():
        try:
            header_string = header_string + str(key) + ": " + message[key] + "\n"
        except:
            continue
    return header_string

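# measure the in-memory size of the header string in bytes
# (sys.getsizeof includes Python's object overhead, so the value is larger than the encoded header text)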
def get_header_size(message):
    return sys.getsizeof(header_to_string(message))

# extract the email address that sent an email
def get_sender_email(message):
    if message['Sender']:
        sender_email = get_email_from_string(message['Sender'])
    else:
        sender_email = get_email_from_string(message['From'])
    return sender_email

# count the number of To header fields in the email (a single field may list several addresses)
def count_to(message):
    try:
        to_count = len(message.get_all('To'))
    except:
        to_count = 0
    return to_count

# count the number of checkpoints the email passed through to reach recipient
def count_received(message):
    try:
        received_count = len(message.get_all('Received'))
    except:
        received_count = 0
    return received_count

# count the number of Cc (carbon copy) header fields; a single field may list several addresses
def count_cc(message):
    cc_fields = message.get_all('Cc')
    return 0 if cc_fields is None else len(cc_fields)

# count the number of Bcc (blind carbon copy) header fields; a single field may list several addresses
def count_bcc(message):
    bcc_fields = message.get_all('Bcc')
    return 0 if bcc_fields is None else len(bcc_fields)

# compare the email address domain in the Message-ID and Sender fields
def same_messageID_senderID(message):
    try:
        messageID_domain = get_email_from_string(message['Message-ID']).split('@')[1]
        senderID_domain = get_sender_email(message).split('@')[1]
        return messageID_domain == senderID_domain
    except:
        return True

# compare the email addresses found in the Return-Path and Sender fields
def same_return_sender(message):
    try:
        return_email = get_email_from_string(message['Return-Path'])
        sender_email = get_sender_email(message)
        return return_email == sender_email
    except:
        return True

# count words found in the Subject field
def count_words_subject(message):
    try:
        return count_words(message['Subject'].encode('ascii', 'ignore').decode())
    except:
        return 0

# count distinct words found in the Subject field
def count_distinct_words_subject(message):
    try:
        return count_distinct_words(message['Subject'].encode('ascii', 'ignore').decode())
    except:
        return 0

# count characters found in the Subject field
def count_chars_subject(message):
    try:
        return count_chars(message['Subject'].encode('ascii', 'ignore').decode())
    except:
        return 0

# calculate the richness of the Subject field based on the counted characters and words
def get_subject_richness(message):
    try:
        return count_words_subject(message) / count_chars_subject(message)
    except:
        return 0

# count the number of functional words found in the Subject field
def count_functional_words_subject(message):
    try:
        return total_functional_words(message['Subject'])
    except:
        return 0

# check whether email is a reply
def get_is_reply(message):
    try:
        return re.match(r"^re:", message['Subject'].lower()) is not None
    except:
        return False

# check if email was forwarded
def get_is_forward(message):
    try:
        return re.match(r"^fwd:", message['Subject'].lower()) is not None
    except:
        return False
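`Message.get_all` returns one entry per header field, so `count_to`, `count_cc` and `count_bcc` above count header lines rather than individual recipients. If a per-address count is wanted instead, one possible approach uses `email.utils.getaddresses` from the standard library. The helper name `count_addresses` and the sample message are hypothetical.

```python
import email
from email.utils import getaddresses

raw = (
    "From: alice@example.com\n"
    "To: bob@example.com, carol@example.com\n"
    "Subject: hello\n"
    "\n"
    "body\n"
)
message = email.message_from_string(raw)


def count_addresses(message, field):
    # get_all returns None when the header is absent, so fall back to an empty list
    values = message.get_all(field) or []
    # getaddresses splits each header value into (display name, address) pairs
    return len([addr for _, addr in getaddresses(values) if addr])


print(len(message.get_all("To")))      # number of To header fields
print(count_addresses(message, "To"))  # number of recipient addresses
print(count_addresses(message, "Cc"))  # header absent
```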

In [None]:
# declare functions used to extract body features

# list the different content types of the email's subparts
# e.g. 'text/html', 'text/html', 'image/gif'
def get_content_type_list(message):
    content_types = []
    for part in message.walk():
        if part.is_multipart():
            continue
        content_types.append(part.get_content_type())
    return content_types

# list the different content dispositions of the email's subparts
# e.g. None, 'inline', 'attachment'
def get_content_disposition_list(message):
    content_dispositions = []
    for part in message.walk():
        if part.is_multipart():
            continue
        content_dispositions.append(part.get_content_disposition())
    return content_dispositions

# count number of attachments found in content disposition list
def count_attachments(message):
    attachment_count = 0
    for disposition in get_content_disposition_list(message):
        if disposition == 'attachment':
            attachment_count+=1
    return attachment_count

# check whether body text contains any html
def body_has_html(message):
    return bool(BeautifulSoup(get_body(message), "html.parser").find())

# check whether body text contains html forms
def body_has_forms(message):
    return bool(BeautifulSoup(get_body(message), "html.parser").find("form"))

# count words found in the email body
def count_words_body(message):
    return count_words(get_body(message))

# count distinct words found in the email body
def count_distinct_words_body(message):
    return count_distinct_words(get_body(message))

# count characters found in the email body
def count_chars_body(message):
    return count_chars(get_body(message))

# calculate the richness of the email body based on the counted characters and words
def get_body_richness(message):
    try:
        return count_words_body(message) / count_chars_body(message)
    except:
        return 0

# count the number of functional words found in the email body
def count_functional_words_body(message):
    try:
        return total_functional_words(get_body(message))
    except:
        return 0
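To make the content-type and content-disposition helpers more concrete, the sketch below builds a small multipart message with the standard library and walks its parts in the same way. The message contents and file name are made up for illustration.

```python
from email.message import EmailMessage

message = EmailMessage()
message["Subject"] = "Invoice"
message.set_content("Plain text body")
message.add_alternative("<p>HTML body</p>", subtype="html")
message.add_attachment(
    b"%PDF-", maintype="application", subtype="pdf", filename="invoice.pdf"
)

leaf_parts = []
for part in message.walk():
    if part.is_multipart():
        continue  # containers only hold other parts and carry no content themselves
    leaf_parts.append((part.get_content_type(), part.get_content_disposition()))

for content_type, disposition in leaf_parts:
    print(content_type, disposition)
```

Only the PDF part reports an `attachment` disposition, which is the value `count_attachments` looks for.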

**Extraction of Features**

In [None]:
# declare functions used to process email dataset files
import email

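# extract header and body features from a raw email string (or an already
# parsed email.message object) and store them in a dict
def process_message(string, phishing):
    # parse raw text; objects that are already parsed messages are used as-is
    if isinstance(string, str):
        message = email.message_from_string(string)
    else:
        message = string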

    email_features = {}

    email_features['phishing'] = phishing
    email_features['header'] = header_to_string(message)
    if body_has_html(message):
        email_features['text'] = get_subject(message) + " " + BeautifulSoup(get_body(message), "html.parser").get_text()
    else:
        email_features['text'] = get_subject(message) + " " + get_body(message)

    email_features['header-size'] = get_header_size(message)
    email_features['count-to'] = count_to(message)
    email_features['count-received'] = count_received(message)
    email_features['count-cc'] = count_cc(message)
    email_features['count-bcc'] = count_bcc(message)
    email_features['same-id-sender'] = same_messageID_senderID(message)
    email_features['same-return-sender'] = same_return_sender(message)
    email_features['subject-word-count'] = count_words_subject(message)
    email_features['subject-distinct-word-count'] = count_distinct_words_subject(message)
    email_features['subject-richness'] = get_subject_richness(message)
    email_features['subject-function-word-count'] = count_functional_words_subject(message)
    email_features['is-reply'] = get_is_reply(message)
    email_features['is-forward'] = get_is_forward(message)

    email_features['count-content-types'] = len(get_content_type_list(message))
    email_features['count-attachments'] = count_attachments(message)
    email_features['body-word-count'] = count_words_body(message)
    email_features['body-distint-word-count'] = count_distinct_words_body(message)
    email_features['body-richness'] = get_body_richness(message)
    email_features['body-function-word-count'] = count_functional_words_body(message)
    email_features['has-html'] = body_has_html(message)
    email_features['has-form'] = body_has_forms(message)

    return email_features


def process_dataframe(dataframe, phishing):
    data = []

    # each row holds one raw email string in its first (and only) column
    for _, row in dataframe.iterrows():
        features = process_message(row.iloc[0], phishing = phishing)
        data.append(features)
    return data

In [None]:
iwspa_legit_processed = process_dataframe(iwspa_legit, phishing = 0)
iwspa_legit_processed = pd.DataFrame(iwspa_legit_processed)
iwspa_legit_processed


Unnamed: 0,phishing,header,text,header-size,count-to,count-received,count-cc,count-bcc,same-id-sender,same-return-sender,...,is-reply,is-forward,count-content-types,count-attachments,body-word-count,body-distint-word-count,body-richness,body-function-word-count,has-html,has-form
0,0,"Status: RO\nFrom: ""Lynton, Michael"" <MAILER-DA...",Re: i may have a meeting around 3pm i have to ...,493,1,0,0,0,True,True,...,True,False,1,0,12,12,0.352941,0,False,False
1,0,"Status: RO\nFrom: ""Mosko, Steve"" <MAILER-DAEMO...",RE: Mosko's Calls a/o 5:55pm Tues 3/4 - 2 new ...,507,1,0,0,0,True,True,...,True,False,1,0,257,144,0.268828,0,True,False
2,0,Received: from domain.com (146.215.230.105) by...,[domain.com] 'Phillips' IS The Captain Now As ...,3829,1,8,0,0,True,True,...,False,False,1,0,70,58,0.230263,0,True,False
3,0,Received: from domain (192.168.10.251) by doma...,Trump: Leave It To Me To be automatically unsu...,1392,1,4,0,0,True,True,...,False,False,1,0,11,11,0.177419,0,False,False
4,0,Received: from domain ([fe80::f85f:3b98:e405:6...,EVENT: Trump and Clinton video remarks to Lati...,1019,1,1,0,0,True,True,...,False,False,1,0,27,26,0.204545,0,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4076,0,"From: ""User"" <user@domain>\nTo: ""User""\n\t<use...","RE: Draft Video Topper [SigDems]<<>>User, Comm...",598,1,0,0,0,True,True,...,True,False,1,0,170,124,0.205066,0,True,False
4077,0,"From: ""User"" <user@domain>\nTo: Daniel Strauss...",RE: FOR DEEP BACKGROUND Nina Turner\nJohnny Pa...,639,1,0,0,0,True,True,...,True,False,1,0,633,322,0.190605,0,True,False
4078,0,Received: from domain (192.168.10.251) by doma...,Points from Talkers Call To be automatically u...,1625,1,4,1,0,True,True,...,False,False,1,0,11,11,0.177419,0,False,False
4079,0,Received: from USSDIXMSG20.spe.organization.co...,A few IMAX people are running 5/10 minutes lat...,1084,1,1,1,0,True,True,...,False,False,1,0,99,80,0.200000,2,False,False


In [None]:
iwspa_phishing_processed = process_dataframe(iwspa_phishing, phishing = 1)
iwspa_phishing_processed = pd.DataFrame(iwspa_phishing_processed)
iwspa_phishing_processed

Unnamed: 0,phishing,header,text,header-size,count-to,count-received,count-cc,count-bcc,same-id-sender,same-return-sender,...,is-reply,is-forward,count-content-types,count-attachments,body-word-count,body-distint-word-count,body-richness,body-function-word-count,has-html,has-form
0,1,Return-Path: <user@domain>\nX-Original-To: use...,"PayPal Flagged Account Dear PayPal Member,\n\n...",1106,1,4,0,0,True,True,...,False,False,1,0,114,88,0.227545,7,False,False
1,1,Return-Path: <user@domain>\nX-Original-To: use...,Protect your personal information ! CONSUMER A...,1396,1,4,0,0,True,True,...,False,False,1,0,92,70,0.186235,8,True,False
2,1,Return-Path: <user@domain>\nX-Original-To: use...,Mesage from ebay member Question about Item --...,873,1,2,0,0,True,True,...,False,False,1,0,946,442,0.200721,7,True,False
3,1,Return-Path: <user@domain>\nX-Original-To: use...,"Re: tyvowVjiagra Hi,\n=20\nAmbjien\nCjialis fr...",930,1,2,0,0,True,True,...,True,False,1,0,40,34,0.279720,0,True,False
4,1,Return-Path: <user@domain>\nX-Original-To: use...,Dear FCU domain.com card holder... NCUA \n ...,849,1,3,0,0,True,True,...,False,False,1,0,137,98,0.195157,13,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
498,1,Return-Path: <user@domain>\nX-Original-To: use...,Congratulations!You're an eBay Silver PowerSel...,1643,1,6,0,0,True,True,...,False,False,1,0,235,145,0.199660,4,False,False
499,1,Return-Path: <user@domain>\nX-Original-To: use...,Regular Account Maintenance PayPal\n\n#yiv1429...,1205,1,4,0,0,True,True,...,False,False,1,0,780,278,0.175518,9,False,False
500,1,Return-Path: <user@domain>\nX-Original-To: use...,"RE:Update our records!! BODY, TD\n{font-family...",850,1,2,0,0,True,True,...,True,False,1,0,657,231,0.184966,29,True,False
501,1,Return-Path: <user@domain>\nX-Original-To: use...,Security Measures PayPal\n#obmessage .dummy {}...,1566,1,3,0,0,True,True,...,False,False,1,0,260,155,0.200927,25,True,False


In [None]:
nazario_phishing_processed = process_dataframe(nazario_phishing, phishing = 1)
nazario_phishing_processed = pd.DataFrame(nazario_phishing_processed)
nazario_phishing_processed


Unnamed: 0,phishing,header,text,header-size,count-to,count-received,count-cc,count-bcc,same-id-sender,same-return-sender,...,is-reply,is-forward,count-content-types,count-attachments,body-word-count,body-distint-word-count,body-richness,body-function-word-count,has-html,has-form
0,1,Return-Path: acmim@up.edu\nDelivered-To: jose@...,Important Security Message .\n\n.,2297,1,3,0,0,False,True,...,False,False,3,1,18,14,0.236842,0,True,False
1,1,Return-Path: no-ramericanexpress@usxchange.net...,NEW PDF MESSAGE FROM AMERICAN EXPRESS ONLINE F...,2086,1,1,0,0,True,False,...,False,False,2,1,18,14,0.200000,0,True,False
2,1,Return-Path: david231@wsu.edu\nDelivered-To: j...,Confirm Your recent Transactions =\r\n\r\n ...,2871,1,3,0,0,False,True,...,False,False,2,0,1662,365,0.212044,13,True,False
3,1,Return-Path: exports@delightdesigns.co.in\nDel...,Your Email ✉ jose@monkey.org Is Full Upgrade E...,6744,1,2,0,0,True,False,...,False,False,1,0,633,235,0.236724,0,True,False
4,1,Return-Path: michael911@wsu.edu\nDelivered-To:...,confirm This Transaction =\r\n\r\n \t =\r\n\...,2861,1,3,0,0,False,True,...,False,False,2,0,3252,499,0.209038,10,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9799,1,Return-Path: admin@ordnungswelt.net\nDelivered...,Account suspension notice 12/09/2020 03:36:33 ...,2686,1,2,0,0,True,True,...,False,False,2,0,3708,602,0.184579,3,True,False
9800,1,Return-Path: MAILER-DAEMON\nDelivered-To: jose...,Undeliverable: Delivery Status Notification (F...,2823,1,2,0,0,True,True,...,False,False,2,0,739,195,0.223668,8,True,False
9801,1,Return-Path: reply@c.constantcontact.com\nDeli...,Reminder: Notice for monkey.org \n\n\n\r\nDear...,2380,1,1,0,0,True,True,...,False,False,1,0,1445,202,0.208574,0,True,False
9802,1,Return-Path: amit@aonenonwoven.com\nDelivered-...,Notification jose@monkey.org \n\n\n \nDear jos...,2602,1,2,0,0,True,True,...,False,False,1,0,334,155,0.232106,1,True,False


In [None]:
# combine the extracted features from all three datasets,
# dropping the raw header and text columns
extracted_features = pd.concat(
    [
        iwspa_legit_processed.drop(['header', 'text'], axis=1),
        iwspa_phishing_processed.drop(['header', 'text'], axis=1),
        nazario_phishing_processed.drop(['header', 'text'], axis=1),
    ],
    ignore_index=True,
)

extracted_features

Unnamed: 0,phishing,header-size,count-to,count-received,count-cc,count-bcc,same-id-sender,same-return-sender,subject-word-count,subject-distinct-word-count,...,is-reply,is-forward,count-content-types,count-attachments,body-word-count,body-distint-word-count,body-richness,body-function-word-count,has-html,has-form
0,0,493,1,0,0,0,True,True,13,10,...,True,False,1,0,12,12,0.352941,0,False,False
1,0,507,1,0,0,0,True,True,17,17,...,True,False,1,0,257,144,0.268828,0,True,False
2,0,3829,1,8,0,0,True,True,24,23,...,False,False,1,0,70,58,0.230263,0,True,False
3,0,1392,1,4,0,0,True,True,5,5,...,False,False,1,0,11,11,0.177419,0,False,False
4,0,1019,1,1,0,0,True,True,14,14,...,False,False,1,0,27,26,0.204545,0,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14383,1,2686,1,2,0,0,True,True,10,10,...,False,False,2,0,3708,602,0.184579,3,True,False
14384,1,2823,1,2,0,0,True,True,5,5,...,False,False,2,0,739,195,0.223668,8,True,False
14385,1,2380,1,1,0,0,True,True,5,5,...,False,False,1,0,1445,202,0.208574,0,True,False
14386,1,2602,1,2,0,0,True,True,4,4,...,False,False,1,0,334,155,0.232106,1,True,False


In [None]:
# combine the text content extracted from the phishing and legit datasets
# for use with Deep Learning models
extracted_text = pd.concat(
    [
        iwspa_legit_processed[['phishing', 'header', 'text']],
        iwspa_phishing_processed[['phishing', 'header', 'text']],
        nazario_phishing_processed[['phishing', 'header', 'text']],
    ],
    ignore_index=True,
)

extracted_text

Unnamed: 0,phishing,header,text
0,0,"Status: RO\nFrom: ""Lynton, Michael"" <MAILER-DA...",Re: i may have a meeting around 3pm i have to ...
1,0,"Status: RO\nFrom: ""Mosko, Steve"" <MAILER-DAEMO...",RE: Mosko's Calls a/o 5:55pm Tues 3/4 - 2 new ...
2,0,Received: from domain.com (146.215.230.105) by...,[domain.com] 'Phillips' IS The Captain Now As ...
3,0,Received: from domain (192.168.10.251) by doma...,Trump: Leave It To Me To be automatically unsu...
4,0,Received: from domain ([fe80::f85f:3b98:e405:6...,EVENT: Trump and Clinton video remarks to Lati...
...,...,...,...
14383,1,Return-Path: admin@ordnungswelt.net\nDelivered...,Account suspension notice 12/09/2020 03:36:33 ...
14384,1,Return-Path: MAILER-DAEMON\nDelivered-To: jose...,Undeliverable: Delivery Status Notification (F...
14385,1,Return-Path: reply@c.constantcontact.com\nDeli...,Reminder: Notice for monkey.org \n\n\n\r\nDear...
14386,1,Return-Path: amit@aonenonwoven.com\nDelivered-...,Notification jose@monkey.org \n\n\n \nDear jos...


Further Natural Language Processing features have been identified by <cite id="kbz3c"><a href="#zotero%7C12693434%2FTK6AGCGG">(Egozi and Verma, 2018)</a></cite> as being useful for identifying phishing emails.

For example:

1.   Difference Measure: The difference measure is used to determine how likely an email is to be a phishing email by comparing the number of times a stem appears in phishing emails to the number of times the stem appears in ham (legitimate) emails.
2.   Phishing Ratio: The phishing ratio uses the ratio between phishing and total appearances of a stem to help determine the likelihood of an email being phishing.
3.   Unique Difference Measure: The unique difference measure is similar to the difference measure, but it does not consider multiple appearances of a stem in an email.
4.   Unique Phishing Ratio: While the unique phishing ratio is similar to the phishing ratio, it does not consider multiple appearances of the same stem in an email.
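A toy illustration of the four measures described above is sketched here. It is only an approximation under stated assumptions: lowercase whitespace tokens stand in for Porter stems, the difference is taken as a raw count difference, and the exact definitions and any normalisation in Egozi and Verma's paper may differ. All names in the block are hypothetical.

```python
from collections import Counter

# toy corpus; lowercase whitespace tokens stand in for Porter stems (assumption)
phishing_emails = ["verify verify your account now", "account suspended verify now"]
legit_emails = ["meeting notes for your review", "lunch now"]


def count_tokens(emails, unique=False):
    counts = Counter()
    for text in emails:
        tokens = text.lower().split()
        # unique=True counts a token at most once per email
        counts.update(set(tokens) if unique else tokens)
    return counts


phish_counts = count_tokens(phishing_emails)
legit_counts = count_tokens(legit_emails)
phish_counts_unique = count_tokens(phishing_emails, unique=True)


def phishing_ratio(token):
    # share of a token's appearances that occur in phishing emails
    total = phish_counts[token] + legit_counts[token]
    return phish_counts[token] / total if total else 0.0


def difference(token):
    # raw count difference; the published measure may be normalised differently
    return phish_counts[token] - legit_counts[token]


for token in ("verify", "now", "your"):
    print(token, phishing_ratio(token), difference(token))
```

The "unique" variants follow the same arithmetic but start from per-email sets, as `count_tokens(..., unique=True)` does here.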

The functions below are based on their research and are used to extract NLP features from the email bodies.

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
nltk.download("punkt") # download the Punkt tokenizer models used by word_tokenize (recent NLTK releases may need "punkt_tab" instead)

# identify all stems found in an email
# count the overall frequency of each stem,
# and their frequency in the phishing and legit subsets
# calculate the Phishing Ratio and Difference Measure of each email

# a list of stems and unique stems found in each email
email_stems = []

# number of times a stem has occurred in all phishing emails, stem: count
phishing_stem_counts = {}

# number of times a stem has uniquely occurred in all phishing emails, stem: count
phishing_stem_counts_unique = {}

# number of times a stem has occurred in all legit emails, stem: count
legit_stem_counts = {}

# number of times a stem has uniquely occurred in all legit emails, stem: count
legit_stem_counts_unique = {}

def stem_dataset(text_dataframe):

    # clear the results of any previous call so counts do not carry over between datasets
    email_stems.clear()
    phishing_stem_counts.clear()
    phishing_stem_counts_unique.clear()
    legit_stem_counts.clear()
    legit_stem_counts_unique.clear()

    # instantiate a stemmer used to convert text into a list of stems
    ps = PorterStemmer()

    # each row represents an email
    for i, row in text_dataframe.iterrows():

        # a dictionary to hold lists of an email's stems and unique stems
        extracted_stems = {}

        # create String of email text
        email_text = row['text']

        # convert the String into a list of lowercase alphanumeric tokens
        tokens = [token for token in word_tokenize(email_text.lower()) if token.isalnum()]

        # list all stems in an email, including duplicates
        stems = [ps.stem(token) for token in tokens]

        # store this email's stems in the dictionary
        extracted_stems['stems'] = stems

        # store this email's unique stems in the dictionary
        extracted_stems['unique-stems'] = set(stems)

        # append this email's stem dictionary extracted_stems to the list of overall stems
        email_stems.append(extracted_stems)

        # add each stem's frequency to the overall frequency counts

        for stem in set(stems):

            # if phishing add stem frequencies to the overall phishing counts
            if row['phishing'] == 1:

                # if stem already exists in the overall unique stem: count dict,
                # add 1 to its count
                if stem in phishing_stem_counts_unique:
                    phishing_stem_counts_unique[stem] += 1
                # if not already in overall dictionary add the stem: count key:value pair
                else:
                    phishing_stem_counts_unique[stem] = 1

                # if stem already exists in the overall stem: count dict,
                # add all occurrences of the stem in this email to the overall count
                if stem in phishing_stem_counts:
                    phishing_stem_counts[stem] += stems.count(stem)
                else:
                    phishing_stem_counts[stem] = stems.count(stem)


            # if target variable is legit add stem frequencies to the overall legit counts
            else:
                if stem in legit_stem_counts_unique:
                    legit_stem_counts_unique[stem] += 1
                else:
                    legit_stem_counts_unique[stem] = 1

                if stem in legit_stem_counts:
                    legit_stem_counts[stem] += stems.count(stem)
                else:
                    legit_stem_counts[stem] = stems.count(stem)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
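
The same tallies can be built with `collections.Counter`, which avoids calling `stems.count(stem)` once per distinct stem. The following is a minimal sketch, not the notebook's implementation: `count_stems` and the toy emails are hypothetical, and the input is assumed to be already tokenised and stemmed.

```python
from collections import Counter


def count_stems(labelled_stems):
    """Tally stems per class from (is_phishing, stems) pairs.

    Returns a dict keyed by (is_phishing, unique) holding one Counter each.
    """
    counts = {
        (label, unique): Counter()
        for label in (0, 1)
        for unique in (False, True)
    }
    for is_phishing, stems in labelled_stems:
        per_email = Counter(stems)  # occurrences within this email
        # add every occurrence of each stem
        counts[(is_phishing, False)].update(per_email)
        # add 1 per distinct stem, however often it occurs in this email
        counts[(is_phishing, True)].update(per_email.keys())
    return counts


# Hypothetical, already-stemmed emails: (label, stems)
emails = [
    (1, ["verifi", "account", "verifi"]),
    (1, ["account", "click"]),
    (0, ["meet", "account", "meet"]),
]
counts = count_stems(emails)
print(counts[(1, False)]["verifi"])  # total occurrences in phishing emails
print(counts[(1, True)]["verifi"])   # number of phishing emails containing the stem
```

Because a `Counter` returns 0 for a missing key, lookups for unseen stems need no membership check.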


In [None]:
# declare functions used to extract Natural Language Processing features

# count the number of times a stem appears overall in all emails,
# including duplicates in the same email
def tot_count(stem):
    total_count = 0

    if stem in phishing_stem_counts:
        total_count += phishing_stem_counts[stem]

    if stem in legit_stem_counts:
        total_count += legit_stem_counts[stem]

    return total_count

# count the number of times a stem appears overall in all emails,
# not including duplicates in the same email
def tot_count_unique(stem):
    total_count_unique = 0

    if stem in phishing_stem_counts_unique:
        total_count_unique += phishing_stem_counts_unique[stem]

    if stem in legit_stem_counts_unique:
        total_count_unique += legit_stem_counts_unique[stem]

    return total_count_unique

# compare number of times a stem appears in all phishing / total emails,
# including duplicates in the same email
def phishing_ratio(stem):
    tot = tot_count(stem)

    if tot == 0 or stem not in phishing_stem_counts:
        return 0
    else:
        return phishing_stem_counts[stem] / tot

# compare number of times a stem appears in phishing / total emails,
# not including duplicates in the same email
def phishing_ratio_unique(stem):
    tot_unique = tot_count_unique(stem)

    if tot_unique == 0 or stem not in phishing_stem_counts_unique:
        return 0
    else:
        return phishing_stem_counts_unique[stem] / tot_unique

# count how many more times a stem appears in all phishing than in legit emails
# including duplicates in the same email
def difference_measure(stem):
    # stems missing from a dictionary count as zero
    phishing_count = phishing_stem_counts.get(stem, 0)
    legit_count = legit_stem_counts.get(stem, 0)

    return max(phishing_count - legit_count, 0)

# count how many more phishing emails than legit emails a stem appears in,
# not including duplicates in the same email
def difference_measure_unique(stem):
    # stems missing from a dictionary count as zero
    phishing_count_unique = phishing_stem_counts_unique.get(stem, 0)
    legit_count_unique = legit_stem_counts_unique.get(stem, 0)

    # clip at zero, mirroring difference_measure
    return max(phishing_count_unique - legit_count_unique, 0)

In [None]:
def extract_NLP_features(extracted_features_dataframe):
    # calculate summed difference measure for every stem found in an email, including duplicates
    extracted_features_dataframe['difference-measure'] = [sum([difference_measure(stem) for stem in stem_dict['stems']]) for stem_dict in email_stems]

    # the summed difference measure for every stem found in an email, not including duplicates
    extracted_features_dataframe['difference-measure-unique'] = [sum([difference_measure_unique(stem) for stem in stem_dict['unique-stems']]) for stem_dict in email_stems]

    # the summed phishing ratio for every stem found in an email, including duplicates
    extracted_features_dataframe['phishing-ratio'] = [sum([phishing_ratio(stem) for stem in stem_dict['stems']]) for stem_dict in email_stems]

    # the summed phishing ratio for every stem found in an email, not including duplicates
    extracted_features_dataframe['phishing-ratio-unique'] = [sum([phishing_ratio_unique(stem) for stem in stem_dict['unique-stems']]) for stem_dict in email_stems]
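
To make the aggregation step concrete, the sketch below sums a per-stem score over a single email, once with duplicates and once over distinct stems only. The helper `email_feature`, the `toy_ratio` values and `toy_score` are hypothetical stand-ins for the functions above, not part of the notebook.

```python
def email_feature(stems, score, unique=False):
    """Sum a per-stem score over one email's stems.

    With unique=True each distinct stem contributes once.
    """
    items = set(stems) if unique else stems
    return sum(score(stem) for stem in items)


# Hypothetical per-stem phishing ratios, standing in for phishing_ratio(stem)
toy_ratio = {"verifi": 1.0, "account": 0.5, "meet": 0.25}


def toy_score(stem):
    """Look up a stem's toy ratio; unseen stems score zero."""
    return toy_ratio.get(stem, 0.0)


stems = ["verifi", "account", "verifi", "meet"]
with_duplicates = email_feature(stems, toy_score)                  # 1.0 + 0.5 + 1.0 + 0.25
without_duplicates = email_feature(stems, toy_score, unique=True)  # 1.0 + 0.5 + 0.25
print(with_duplicates, without_duplicates)
```

Because each feature is a sum rather than a mean, longer emails tend to receive larger values, which is worth keeping in mind when comparing rows.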

**Extraction of Natural Language Processing Features**

In [None]:
# isolate all stems and calculate their frequencies
stem_dataset(extracted_text)

In [None]:
# print an example of the stems and unique stems extracted from each email
print("No. of stems in Email 1000:",len(email_stems[1000]['stems']))
print("No. of unique stems in Email 1000:", len(email_stems[1000]['unique-stems']))

No. of stems in Email 1000: 2364
No. of unique stems in Email 1000: 693


In [None]:
# calculate the Natural Language Processing features from the frequencies
# add them to the extracted_features dataframe
extract_NLP_features(extracted_features)

In [None]:
# show dataframe of extracted features with 4 new columns for NLP features
extracted_features

Unnamed: 0,phishing,header-size,count-to,count-received,count-cc,count-bcc,same-id-sender,same-return-sender,subject-word-count,subject-distinct-word-count,...,body-word-count,body-distint-word-count,body-richness,body-function-word-count,has-html,has-form,difference-measure,difference-measure-unique,phishing-ratio,phishing-ratio-unique
0,0,493,1,0,0,0,True,True,13,10,...,12,12,0.352941,0,False,False,49635,47220,7.162959,7.070917
1,0,507,1,0,0,0,True,True,17,17,...,257,144,0.268828,0,True,False,180849,112657,39.625021,19.850423
2,0,3829,1,8,0,0,True,True,24,23,...,70,58,0.230263,0,True,False,341690,187225,35.636941,25.572848
3,0,1392,1,4,0,0,True,True,5,5,...,11,11,0.177419,0,False,False,77171,61707,8.020079,8.252500
4,0,1019,1,1,0,0,True,True,14,14,...,27,26,0.204545,0,True,False,56244,60999,8.682031,8.355112
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14383,1,2686,1,2,0,0,True,True,10,10,...,3708,602,0.184579,3,True,False,780510,258861,169.073067,143.043212
14384,1,2823,1,2,0,0,True,True,5,5,...,739,195,0.223668,8,True,False,977919,250883,194.024057,76.146350
14385,1,2380,1,1,0,0,True,True,5,5,...,1445,202,0.208574,0,True,False,267365,144481,45.174507,33.271741
14386,1,2602,1,2,0,0,True,True,4,4,...,334,155,0.232106,1,True,False,129544,35849,13.326531,10.917989


In [None]:
# write the features and text dataframes to CSV files for use elsewhere
extracted_text.to_csv('Datasets/Extracted Data/extracted_text.csv')
extracted_features.to_csv('Datasets/Extracted Data/extracted_features.csv')

**Extraction of the IWSPA No Header Datasets to compare how well the models generalise to data from a different source**

In [None]:
iwspa_legit_no_header = pd.read_csv(LEGIT_NO_HEADER_IWSPA_PATH, index_col=0)
iwspa_legit_no_header

Unnamed: 0,0
0,"\nThank you for your reply, the local NSM resp..."
1,\nDaniele Milan updated a event in the TENTATI...
2,\n<x-flowed>\nEd:\nI just waded through all th...
3,\nThank you!\nJordan C. Vaughn\nNational AALC ...
4,"\nDear Keith,\nThank you for the message of 5 ..."
...,...
5087,\nYou missed a call from OMALLEY JASON at (202...
5088,\nThank you Katie.\nI will be with David as we...
5089,"\n<x-flowed>\nDear Stefan,\nThe distinction he..."
5090,"\nDelete drew From: Kanner, Fayanne \nOUT CALL..."


In [None]:
iwspa_legit_no_header_processed = process_dataframe(iwspa_legit_no_header, phishing = 0)
iwspa_legit_no_header_processed = pd.DataFrame(iwspa_legit_no_header_processed)
iwspa_legit_no_header_processed



Unnamed: 0,phishing,header,text,header-size,count-to,count-received,count-cc,count-bcc,same-id-sender,same-return-sender,...,is-reply,is-forward,count-content-types,count-attachments,body-word-count,body-distint-word-count,body-richness,body-function-word-count,has-html,has-form
0,0,,"Thank you for your reply, the local NSM resp...",49,0,0,0,0,True,True,...,False,False,1,0,61,47,0.245968,0,False,False
1,0,,Daniele Milan updated a event in the TENTATI...,49,0,0,0,0,True,True,...,False,False,1,0,54,40,0.181818,0,False,False
2,0,,\nEd:\nI just waded through all the correspo...,49,0,0,0,0,True,True,...,False,False,1,0,185,124,0.227552,2,True,False
3,0,,Thank you!\nJordan C. Vaughn\nNational AALC ...,49,0,0,0,0,True,True,...,False,False,1,0,33,25,0.215686,0,True,False
4,0,,"Dear Keith,\nThank you for the message of 5 ...",49,0,0,0,0,True,True,...,False,False,1,0,126,86,0.217993,1,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5087,0,,You missed a call from OMALLEY JASON at (202...,49,0,0,0,0,True,True,...,False,False,1,0,18,15,0.243243,0,True,False
5088,0,,Thank you Katie.\nI will be with David as we...,49,0,0,0,0,True,True,...,False,False,1,0,10,10,0.285714,0,False,False
5089,0,,"\nDear Stefan,\nThe distinction here is that...",49,0,0,0,0,True,True,...,False,False,1,0,425,233,0.211759,0,True,False
5090,0,,"Delete drew From: Kanner, Fayanne \nOUT CALL...",49,0,0,0,0,True,True,...,False,False,1,0,19,19,0.250000,0,False,False


In [None]:
iwspa_phishing_no_header_processed = process_dataframe(iwspa_phishing_no_header, phishing = 1)
iwspa_phishing_no_header_processed = pd.DataFrame(iwspa_phishing_no_header_processed)
iwspa_phishing_no_header_processed



Unnamed: 0,phishing,header,text,header-size,count-to,count-received,count-cc,count-bcc,same-id-sender,same-return-sender,...,is-reply,is-forward,count-content-types,count-attachments,body-word-count,body-distint-word-count,body-richness,body-function-word-count,has-html,has-form
0,1,,"Dear Member,\nYou have 1 new Important secur...",49,0,0,0,0,True,True,...,False,False,1,0,21,20,0.161538,1,False,False
1,1,,APPLE SERVICE ANNOUNCEMENT\nDear user@domain...,49,0,0,0,0,True,True,...,False,False,1,0,88,66,0.212048,5,False,False
2,1,,We welcome you as you resume your 2015/2016 ...,49,0,0,0,0,True,True,...,False,False,1,0,127,90,0.205502,7,True,False
3,1,,"Hello,\nYou are qualified for a pay raise on...",49,0,0,0,0,True,True,...,False,False,1,0,88,61,0.192982,3,True,False
4,1,,"Dear [netID email address],\nYou have new me...",49,0,0,0,0,True,True,...,False,False,1,0,46,39,0.203540,1,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
624,1,"From: HelpDesk <user@domain>\nDate: Wed, Mar 1...",YOUR HTML DISABLED Thanks,159,1,0,0,0,True,True,...,False,False,1,0,1,1,0.166667,0,False,False
625,1,"Subject: I N F O\nDate: December 10, 2016\n",I N F O Just a reminder that there is a messag...,90,0,0,0,0,True,True,...,False,False,1,0,35,31,0.209581,1,False,False
626,1,,"Dear Student, \nYour student portal! was rec...",49,0,0,0,0,True,True,...,False,False,1,0,14,14,0.181818,1,False,False
627,1,,Dear Mailbox User\nDue to the strengthening ...,49,0,0,0,0,True,True,...,False,False,1,0,81,59,0.184932,3,True,False


In [None]:
# separate the features and texts for the IWSPA No Header Dataset
extracted_features_no_header = pd.concat([iwspa_legit_no_header_processed.drop(['header','text'],axis=1), iwspa_phishing_no_header_processed.drop(['header','text'],axis=1)] , ignore_index=True)
extracted_text_no_header = pd.concat([iwspa_legit_no_header_processed[['phishing','header', 'text']], iwspa_phishing_no_header_processed[['phishing', 'header', 'text']]], ignore_index=True)

In [None]:
# isolate all stems and calculate their frequencies
stem_dataset(extracted_text_no_header)

# calculate the Natural Language Processing features from the frequencies
# add them to the extracted_features dataframe
extract_NLP_features(extracted_features_no_header)

In [None]:
# show dataframe of extracted features with 4 new columns for NLP features, for the IWSPA No Header Dataset
extracted_features_no_header

Unnamed: 0,phishing,header-size,count-to,count-received,count-cc,count-bcc,same-id-sender,same-return-sender,subject-word-count,subject-distinct-word-count,...,body-word-count,body-distint-word-count,body-richness,body-function-word-count,has-html,has-form,difference-measure,difference-measure-unique,phishing-ratio,phishing-ratio-unique
0,0,49,0,0,0,0,True,True,0,0,...,61,47,0.245968,0,False,False,0,147656,2.576195,23.313646
1,0,49,0,0,0,0,True,True,0,0,...,54,40,0.181818,0,False,False,0,88758,1.815919,14.430322
2,0,49,0,0,0,0,True,True,0,0,...,185,124,0.227552,2,True,False,0,186438,5.259212,41.775256
3,0,49,0,0,0,0,True,True,0,0,...,33,25,0.215686,0,True,False,0,29726,0.427229,3.318847
4,0,49,0,0,0,0,True,True,0,0,...,126,86,0.217993,1,False,False,6,180009,6.045928,35.185558
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5716,1,159,1,0,0,0,True,True,3,3,...,1,1,0.166667,0,False,False,0,15657,1.121533,3.600680
5717,1,90,0,0,0,0,True,True,4,4,...,35,31,0.209581,1,False,False,115,135071,5.667323,23.479154
5718,1,49,0,0,0,0,True,True,0,0,...,14,14,0.181818,1,False,False,55,32658,3.616218,9.817490
5719,1,49,0,0,0,0,True,True,0,0,...,81,59,0.184932,3,True,False,1840,155928,22.926205,41.106934


In [None]:
# write the features and text dataframes to CSV files for use elsewhere
# extracted_text_no_header.to_csv('Datasets/Extracted Data/extracted_text_no_header.csv')
# extracted_features_no_header.to_csv('Datasets/Extracted Data/extracted_features_no_header.csv')

***Bibliography***

<!-- BIBLIOGRAPHY START -->
<div class="csl-bib-body">
  <div class="csl-entry"><i id="zotero|12693434/4V7AWI4D"></i>Aassal, A. E., Moraes, L., Baki, S. and Das, A. (2018) Anti-Phishing Pilot at ACM IWSPA 2018, p. 10.</div>
  <div class="csl-entry"><i id="zotero|12693434/N3Q6U5PG"></i>Alhogail, A. and Alsabih, A. (2021) Applying machine learning and natural language processing to detect phishing email, <i>Computers &#38; Security</i>, 110, p. 102414, [online] Available at: <a href="https://linkinghub.elsevier.com/retrieve/pii/S0167404821002388">https://linkinghub.elsevier.com/retrieve/pii/S0167404821002388</a> (Accessed 19 June 2022).</div>
  <div class="csl-entry"><i id="zotero|12693434/77DQ22QF"></i>Bountakas, P., Koutroumpouchos, K. and Xenakis, C. (2021) A Comparison of Natural Language Processing and Machine Learning Methods for Phishing Email Detection, In <i>The 16th International Conference on Availability, Reliability and Security</i>, Vienna Austria, ACM, pp. 1–12, [online] Available at: <a href="https://dl.acm.org/doi/10.1145/3465481.3469205">https://dl.acm.org/doi/10.1145/3465481.3469205</a> (Accessed 19 June 2022).</div>
  <div class="csl-entry"><i id="zotero|12693434/GQKQQHUE"></i>Dewis, M. and Viana, T. (2022) Phish Responder: A Hybrid Machine Learning Approach to Detect Phishing and Spam Emails, <i>Applied System Innovation</i>, 5(4), p. 73, [online] Available at: <a href="https://www.mdpi.com/2571-5577/5/4/73">https://www.mdpi.com/2571-5577/5/4/73</a> (Accessed 1 October 2022).</div>
  <div class="csl-entry"><i id="zotero|12693434/UH76NDRX"></i>Divakaran, D. M. and Oest, A. (2022) Phishing Detection Leveraging Machine Learning and Deep Learning: A Review, arXiv, [online] Available at: <a href="http://arxiv.org/abs/2205.07411">http://arxiv.org/abs/2205.07411</a> (Accessed 1 October 2022).</div>
  <div class="csl-entry"><i id="zotero|12693434/TK6AGCGG"></i>Egozi, G. and Verma, R. (2018) Phishing Email Detection Using Robust NLP Techniques, In <i>2018 IEEE International Conference on Data Mining Workshops (ICDMW)</i>, Singapore, Singapore, IEEE, pp. 7–12, [online] Available at: <a href="https://ieeexplore.ieee.org/document/8637476/">https://ieeexplore.ieee.org/document/8637476/</a> (Accessed 21 June 2022).</div>
  <div class="csl-entry"><i id="zotero|12693434/F57WLZ2M"></i>Raschka, S., Patterson, J. and Nolet, C. (2020) Machine Learning in Python: Main Developments and Technology Trends in Data Science, Machine Learning, and Artificial Intelligence, <i>Information</i>, 11(4), p. 193, [online] Available at: <a href="https://www.mdpi.com/2078-2489/11/4/193">https://www.mdpi.com/2078-2489/11/4/193</a> (Accessed 19 June 2022).</div>
  <div class="csl-entry"><i id="zotero|12693434/NB8NU34F"></i>Salloum, S., Gaber, T., Vadera, S. and Shaalan, K. (2021) Phishing Email Detection Using Natural Language Processing Techniques: A Literature Survey, <i>Procedia Computer Science</i>, 189, pp. 19–28, [online] Available at: <a href="https://linkinghub.elsevier.com/retrieve/pii/S1877050921011741">https://linkinghub.elsevier.com/retrieve/pii/S1877050921011741</a> (Accessed 19 June 2022).</div>
  <div class="csl-entry"><i id="zotero|12693434/44AQ27V5"></i>Thapa, C., Tang, J. W., Abuadbba, A., Gao, Y., Camtepe, S., Nepal, S., Almashor, M. and Zheng, Y. (2021) Evaluation of Federated Learning in Phishing Email Detection, arXiv, [online] Available at: <a href="http://arxiv.org/abs/2007.13300">http://arxiv.org/abs/2007.13300</a> (Accessed 19 June 2022).</div>
  <div class="csl-entry"><i id="zotero|12693434/VG4LN3SZ"></i>Toolan, F. and Carthy, J. (2010) Feature selection for Spam and Phishing detection, In <i>2010 eCrime Researchers Summit</i>, Dallas, USA, IEEE, pp. 1–12, [online] Available at: <a href="http://ieeexplore.ieee.org/document/5706696/">http://ieeexplore.ieee.org/document/5706696/</a> (Accessed 11 May 2023).</div>
  <div class="csl-entry"><i id="zotero|12693434/FTMZJ2VS"></i>Vazhayil, A. and Nb, H. (2018) PED-ML: Phishing Email Detection Using Classical Machine Learning Techniques, p. 9.</div>
</div>
<!-- BIBLIOGRAPHY END -->