# Phishing Detection Using BERT

## Authors
- Eli Fajardo (NVIDIA)
- Gorkem Batmaz (NVIDIA)
- Bartley Richardson, PhD (NVIDIA)

## Table of Contents
* Introduction
* Datasets used
* Reading the files
* Initialize/Load BERT model
* Training - CLAIR FRAUDULENT EMAILS DATASET
* Evaluation of CLAIR Test Set
* Training with SPAM_ASSASSIN dataset
* Evaluation of the SPAM_ASSASSIN Test Set
* Training with CLAIR+SPAM_ASSASSIN datasets
* Evaluation of the Test Set of CLAIR+SPAM_ASSASSIN Datasets
* Training with all three datasets (CLAIR+SPAM_ASSASSIN+ENRON)
* Evaluation of the Test Set of CLAIR+SPAM_ASSASSIN+ENRON Datasets
* References

## Introduction
Phishing is a technique that fraudsters use to obtain sensitive information from email users by impersonating legitimate institutions or people.
Various machine learning methods are used to detect and filter phishing and spam emails.
In this notebook, we show how to fine-tune a pre-trained BERT* language model with a classification layer, using the HuggingFace library, and analyse its performance on several datasets.

*BERT stands for Bidirectional Encoder Representations from Transformers. The paper can be found [here](https://arxiv.org/pdf/1810.04805.pdf).

This notebook will be updated with a much faster GPU tokenizer.
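
To make the idea of a "classification layer" concrete, here is a minimal pure-Python sketch of a linear layer followed by softmax, applied to a stand-in for the pooled sentence vector that BERT produces. The vector, the weights, and the label convention (0 = benign, 1 = phishing) are all hypothetical values chosen for illustration. This is not the code that CLX or HuggingFace runs.

```python
import math

def classify(pooled, weights, bias):
    """Apply a linear layer followed by softmax to a pooled sentence vector.

    pooled  : list of floats (stand-in for BERT's pooled output)
    weights : one weight row per class
    bias    : one bias value per class
    Returns (predicted_class_index, class_probabilities).
    """
    logits = [sum(w * x for w, x in zip(row, pooled)) + b
              for row, b in zip(weights, bias)]
    # Subtract the max logit for numerical stability before exponentiating.
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs.index(max(probs)), probs

# Hypothetical 3-dimensional "embedding" and hand-picked weights.
pooled = [0.5, -1.0, 2.0]
weights = [[0.1, 0.2, -0.3],   # class 0 (assumed: benign)
           [-0.2, 0.1, 0.4]]   # class 1 (assumed: phishing)
bias = [0.0, 0.0]

label, probs = classify(pooled, weights, bias)
print(label, probs)
```

During fine-tuning, the weights of this layer and of the underlying BERT encoder are adjusted together so that the predicted probabilities match the training labels.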

## Datasets used
* [CLAIR-Fraudulent E-mail Corpus](https://www.kaggle.com/rtatman/fraudulent-email-corpus)
* [SPAM_ASSASSIN Dataset](https://spamassassin.apache.org/old/publiccorpus/)
* [Enron Emails](https://www.cs.cmu.edu/~enron/)

### Required Libraries

In [1]:
import cudf
from cuml.preprocessing.model_selection import train_test_split
from clx.analytics.phishing_detector import PhishingDetector
import s3fs
from os import path



## Reading the files

In [2]:
CLAIR_TSV = "Phishing_Dataset_Clair_Collection.tsv"
SPAM_TSV = "spam_assassin_spam_200_20021010.tsv"
EASY_HAM_TSV = "spam_assassin_easyham_200_20021010.tsv"
HARD_HAM_TSV = "spam_assassin_hardham_200_20021010.tsv"
ENRON_TSV = "enron_10000.tsv"

S3_BASE_PATH = "rapidsai-data/cyber/clx"

In [3]:
# Clair dataset
if not path.exists(CLAIR_TSV):
    fs = s3fs.S3FileSystem(anon=True)
    fs.get(S3_BASE_PATH + "/" + CLAIR_TSV, CLAIR_TSV)
    
dfclair = cudf.read_csv(CLAIR_TSV, delimiter='\t', header=None, names=['label', 'email']).dropna()

In [4]:
# Phishing emails of the SPAM ASSASSIN dataset
if not path.exists(SPAM_TSV):
    fs = s3fs.S3FileSystem(anon=True)
    fs.get(S3_BASE_PATH + "/" + SPAM_TSV, SPAM_TSV)
 
dfspam = cudf.read_csv(SPAM_TSV, delimiter='\t', header=None, names=['label', 'email'])

In [5]:
# Benign emails of the SPAM ASSASSIN dataset
if not path.exists(EASY_HAM_TSV):
    fs = s3fs.S3FileSystem(anon=True)
    fs.get(S3_BASE_PATH + "/" + EASY_HAM_TSV, EASY_HAM_TSV)
    
dfeasyham = cudf.read_csv(EASY_HAM_TSV, delimiter='\t', header=None, names=['label', 'email'])

In [6]:
# Benign emails of the SPAM ASSASSIN dataset that are easily confused with phishing emails
if not path.exists(HARD_HAM_TSV):
    fs = s3fs.S3FileSystem(anon=True)
    fs.get(S3_BASE_PATH + "/" + HARD_HAM_TSV, HARD_HAM_TSV)

dfhardham = cudf.read_csv(HARD_HAM_TSV, delimiter='\t', header=None, names=['label', 'email'])

In [7]:
# Benign Enron emails
if not path.exists(ENRON_TSV):
    fs = s3fs.S3FileSystem(anon=True)
    fs.get(S3_BASE_PATH + "/" + ENRON_TSV, ENRON_TSV)

dfenron = cudf.read_csv(ENRON_TSV, delimiter='\t', header=None, names=['label', 'email'])
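
The five cells above repeat the same "download only if the file is missing" check. One way to factor it out is sketched below. The helper name `fetch_if_missing` and its `fetch` parameter are hypothetical and not part of CLX or s3fs. In this notebook the `fetch` argument would be `fs.get` from an `s3fs.S3FileSystem(anon=True)` instance; the demo uses a stand-in so that it runs without network access.

```python
import os
import tempfile

def fetch_if_missing(filename, fetch, base_path="rapidsai-data/cyber/clx"):
    """Call fetch(remote, local) unless `filename` already exists locally.

    Returns True if a download was triggered, False otherwise.
    """
    if os.path.exists(filename):
        return False
    fetch(base_path + "/" + filename, filename)
    return True

# Demo with a stand-in fetcher that only creates an empty local file.
downloaded = []

def fake_fetch(remote, local):
    downloaded.append(remote)
    with open(local, "w"):
        pass

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "example.tsv")
    first = fetch_if_missing(target, fake_fetch)    # file absent: fetch runs
    second = fetch_if_missing(target, fake_fetch)   # file present: skipped

print(first, second, len(downloaded))
```

Passing the fetch function in as an argument keeps the helper independent of S3, so a single filesystem object can be created once and reused for all five files.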

In [8]:
# The files contain the first 200 words of each email. The model uses only the first 128 words.
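
As a rough illustration of the truncation described in the comment above, the sketch below keeps only the first 128 whitespace-separated words of a text. The helper `first_n_words` is hypothetical. BERT tokenizers split text into subword tokens, so the real cutoff is likely measured in tokens rather than words; treat this as an approximation of the idea, not the model's preprocessing.

```python
def first_n_words(text, n=128):
    """Keep only the first n whitespace-separated words of text."""
    return " ".join(text.split()[:n])

# A synthetic 200-word "email": w0 w1 ... w199
sample = " ".join(f"w{i}" for i in range(200))
truncated = first_n_words(sample)

print(len(sample.split()), len(truncated.split()))
```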

## Initialize/Load BERT model

In [9]:
phish_detect = PhishingDetector()
phish_detect.init_model()

# init_model can also load a pre-trained model when given the path to a model directory

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

## Training - CLAIR FRAUDULENT EMAILS DATASET

Split the dataset into training and test sets.

In [10]:
X_train, X_test, y_train, y_test = train_test_split(dfclair, 'label', train_size=0.8)
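
For readers unfamiliar with what an 80/20 split does, here is a minimal pure-Python sketch of the idea. The function `split_rows` is hypothetical; it shuffles with a fixed seed and cuts at 80%. It makes no claim about how cuML's `train_test_split` shuffles or orders rows internally.

```python
import random

def split_rows(rows, train_size=0.8, seed=0):
    """Shuffle rows reproducibly and cut them into train and test portions."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_size)
    return shuffled[:cut], shuffled[cut:]

# Ten synthetic (label, email) pairs.
rows = [(i % 2, f"email {i}") for i in range(10)]
train, test = split_rows(rows)

print(len(train), len(test))
```

Every row ends up in exactly one of the two portions, which is what lets the test set serve as an unseen sample during evaluation.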

In [11]:
phish_detect.train_model(X_train, y_train, epochs=1)

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Train loss: 0.08132609387540168


Epoch: 100%|██████████| 1/1 [00:57<00:00, 57.57s/it]

Validation Accuracy: 0.9942708333333333





## Evaluation of CLAIR Test Set

In [12]:
phish_detect.evaluate_model(X_test, y_test)

0.9941324392288349
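
Assuming the value printed above is plain classification accuracy (the fraction of test emails whose predicted label matches the true label), the calculation can be sketched as follows. The function `accuracy` and the sample labels are illustrative only and are not taken from the datasets.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of an empty set")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Illustrative labels: four of the five predictions are correct.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
score = accuracy(y_true, y_pred)

print(score)
```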

## Training with SPAM_ASSASSIN dataset

Merge the three SPAM_ASSASSIN subsets (hard ham, easy ham, and spam) into one dataset.

In [13]:
df_assassin = cudf.concat([dfhardham,dfeasyham,dfspam], ignore_index=True)

Split the dataset into training and test sets.

In [14]:
X_train, X_test, y_train, y_test = train_test_split(df_assassin, 'label', train_size=0.8)

In [15]:
phish_detect.train_model(X_train, y_train, epochs=1)

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Train loss: 0.48003239410393167


Epoch: 100%|██████████| 1/1 [00:16<00:00, 16.03s/it]

Validation Accuracy: 0.8211505190311419





## Evaluation of the SPAM_ASSASSIN Test Set

In [16]:
phish_detect.evaluate_model(X_test, y_test)

0.8577912254160364

## Training with CLAIR+SPAM_ASSASSIN datasets

Merge the two datasets and split the result into training and test sets.

In [17]:
df_total = cudf.concat([dfhardham,dfeasyham,dfspam,dfclair],ignore_index=True)

In [18]:
X_train, X_test, y_train, y_test = train_test_split(df_total, 'label', train_size=0.8)

In [19]:
phish_detect.train_model(X_train, y_train, epochs=1)

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Train loss: 0.04817966085660164


Epoch: 100%|██████████| 1/1 [01:13<00:00, 73.52s/it]

Validation Accuracy: 0.9967532467532467





## Evaluation of the Test Set of CLAIR+SPAM_ASSASSIN Datasets

In [20]:
phish_detect.evaluate_model(X_test, y_test)

0.994418910045962

## Training with all three datasets (CLAIR+SPAM_ASSASSIN+ENRON)

Merge all the datasets and split the result into training and test sets. The emails are tokenized as part of training.

In [21]:
df_total = cudf.concat([dfhardham,dfeasyham,dfspam,dfclair,dfenron],ignore_index=True)

In [22]:
X_train, X_test, y_train, y_test = train_test_split(df_total, 'label', train_size=0.8)

In [23]:
phish_detect.train_model(X_train, y_train, epochs=1)

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Train loss: 0.01257487908028767


Epoch: 100%|██████████| 1/1 [02:02<00:00, 122.02s/it]

Validation Accuracy: 0.9968011811023622





## Evaluation of the Test Set of CLAIR+SPAM_ASSASSIN+ENRON Datasets

In [24]:
phish_detect.evaluate_model(X_test, y_test)

0.998414585810543

## References
* https://github.com/huggingface/transformers/tree/main/examples#
* https://www.depends-on-the-definition.com/named-entity-recognition-with-bert/
* https://github.com/ThilinaRajapakse/pytorch-transformers-classification
* https://mccormickml.com/2019/07/22/BERT-fine-tuning/