# Phishing Detection Using BERT: Training and Evaluation

## Table of Contents
* Introduction
* Required Libraries
* Dataset
* Downloading files
* Creating Train and Test sets
* Initialize/Load BERT model
* Training
* Evaluation of Test Set
* Export Model to ONNX
* Conclusion
* References

## Introduction
Phishing is used by malicious actors to obtain sensitive information from email users by pretending to be legitimate institutions or people. Traditional detection methods are rigid and reactive: they rely on keyword matching and previously seen malicious URLs to detect phishing emails. By using a language model to run inference on a whole email message, we build a more robust model that utilizes the entire context of an email and generalizes to previously unseen messages.
In this notebook, we show how to train a [BERT](https://arxiv.org/pdf/1810.04805.pdf) transformer language model and analyze its performance on an example dataset.

## Required Libraries

In [1]:
import cudf
from cudf.core.subword_tokenizer import SubwordTokenizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
import requests
import os.path
import torch
import numpy as np
import sys

from morpheus.utils.seed import manual_seed

from common.sequence_classifier import SequenceClassifier

## Dataset

Due to the limited public availability of labeled email datasets, this example uses the labeled SMS Spam Collection Data Set from the UCI Machine Learning Repository.
The spam messages in SMSSPAM contain deceptive information; some are intended to convince the recipient to give the sender money or to share information.

* [SMSSPAM](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection)

## Downloading files

In [2]:
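if not os.path.isfile("smsspamcollection.zip"):
    URL = "http://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip"
    response = requests.get(URL)
    response.raise_for_status()  # fail early instead of saving an error page
    # use a context manager so the file handle is closed after writing
    with open("smsspamcollection.zip", "wb") as f:
        f.write(response.content)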

In [3]:
!md5sum smsspamcollection.zip

ab53f9571d479ee677e7b283a06a661a  smsspamcollection.zip


To check that you have the same version of the archive, compare your checksum with the one we got when we ran the notebook: ab53f9571d479ee677e7b283a06a661a
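
On systems without the `md5sum` command, the same comparison can be made from Python. The following is a minimal sketch using only the standard library; the helper name `md5_of_file` and the constant `EXPECTED_MD5` are introduced here for illustration and are not part of the original notebook.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until f.read returns b""
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Checksum reported in this notebook for smsspamcollection.zip
EXPECTED_MD5 = "ab53f9571d479ee677e7b283a06a661a"

archive = Path("smsspamcollection.zip")
if archive.is_file():
    actual = md5_of_file(archive)
    if actual == EXPECTED_MD5:
        print("Checksum matches")
    else:
        print(f"Checksum mismatch: got {actual}")
```

A mismatch does not necessarily mean the download is corrupt; the hosted archive may simply have been updated since this notebook was run.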

In [4]:
if not os.path.isfile("SMSSpamCollection"):
    !unzip smsspamcollection.zip

In [5]:
df = cudf.read_csv("SMSSpamCollection", delimiter='\t', header=None, names=['spam/ham', 'message'])

In [6]:
# convert label to binary 0 = ham, 1 = spam
df["label"] = df["spam/ham"].str.match('spam').astype(int)

## Creating Train and Test sets

Split the dataset into training (80%) and test (20%) sets.

In [7]:
random_seed=42

In [8]:
X_train, X_test, y_train, y_test = train_test_split(df["message"], df["label"], train_size=0.8, random_state=random_seed)
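
The split above is purely random. When one class is much rarer than the other, as is common in spam corpora, a stratified split keeps the class ratio the same in both sets. The sketch below is an optional variation, not what this notebook does: it uses toy Python lists (hypothetical data) and the `stratify` argument of scikit-learn's `train_test_split`. Whether `stratify` accepts a cuDF series directly is not verified here, so the labels may need converting first.

```python
from sklearn.model_selection import train_test_split

# Toy data (hypothetical, not the SMS dataset): 20 spam (1) and 80 ham (0)
messages = [f"message {i}" for i in range(100)]
labels = [1] * 20 + [0] * 80

# stratify=labels preserves the spam/ham ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    messages, labels, train_size=0.8, random_state=42, stratify=labels
)

# Fraction of spam in each split
print(sum(y_tr) / len(y_tr), sum(y_te) / len(y_te))
```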

## Initialize/Load BERT model
Load the pre-trained bert-base-uncased model from [Hugging Face](https://huggingface.co/bert-base-uncased)

In [9]:
seq_classifier = SequenceClassifier("bert-base-uncased", hash_file="../../../morpheus/data/bert-base-uncased-hash.txt", num_labels=2)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

## Training

In [10]:
# set seeds for model reproducibility
manual_seed(random_seed)

In [11]:
seq_classifier.train_model(X_train, y_train, batch_size=32, epochs=2)

Epoch:  50%|█████     | 1/2 [00:33<00:33, 33.81s/it]

Train loss: 0.08998454025214804


Epoch: 100%|██████████| 2/2 [01:08<00:00, 34.42s/it]

Train loss: 0.015057251792833475





In [12]:
# save as pytorch model
torch.save(seq_classifier._model, "phishing-bert.pt")

## Evaluation of Test Set

Accuracy:

In [13]:
seq_classifier.evaluate_model(X_test, y_test)

0.99375

In [14]:
test_preds = seq_classifier.predict(X_test, batch_size=128)



F1 score:

In [15]:
tests = test_preds.to_numpy()
true_labels = y_test.to_numpy()
f1_score(true_labels, tests)

0.9767441860465117
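
Accuracy alone can look high on a dataset where most messages are ham, which is why the F1 score is reported as well. As a rough illustration of what `f1_score` computes for the positive (spam) class, the sketch below derives precision, recall, and F1 from raw counts in plain Python. The helper `precision_recall_f1` and the toy labels are hypothetical and do not reproduce the notebook's results.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels (hypothetical): 1 = spam, 0 = ham
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints "precision=0.75 recall=0.75 f1=0.75"
```

In this toy case 8 of 10 predictions are correct (accuracy 0.8), while the F1 score of 0.75 reflects the one missed spam message and the one false alarm.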

## Export Model to ONNX

In [16]:
tokenizer = SubwordTokenizer("../../../morpheus/data/bert-base-uncased-hash.txt", do_lower_case=True)

In [17]:
tokenizer_output = tokenizer(df["message"][0:3],
                                max_length=128,
                                max_num_rows=3,
                                truncation=True,
                                add_special_tokens=True,
                                return_tensors="pt")

sample_model_input = (tokenizer_output["input_ids"].type(torch.long), tokenizer_output["attention_mask"].type(torch.long))

In [18]:
torch.onnx.export(seq_classifier._model,              
                  sample_model_input,               
                  "model.onnx",                                      # where to save the model
                  export_params=True,                                # store the trained parameter weights inside the model file
                  opset_version=10,                                  # the ONNX opset version to target
                  do_constant_folding=True,                          # whether to execute constant folding for optimization
                  input_names = ['input_ids','attention_mask'],      # the model's input names
                  output_names = ['output'],                         # the model's output names
                  dynamic_axes={'input_ids' : {0 : 'batch_size'},    # variable length axes
                                'attention_mask': {0: 'batch_size'}, 
                                'output' : {0 : 'batch_size'}})
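
When the exported model is served, its `output` tensor still has to be turned into a prediction. The sketch below assumes that `output` holds raw logits of shape `(batch_size, 2)`, with index 1 corresponding to spam as in the label encoding used earlier; this is an assumption about the exported graph, not something verified in this notebook. The function name `logits_to_spam_probability` and the example logits are hypothetical.

```python
import numpy as np


def logits_to_spam_probability(logits):
    """Apply a numerically stable softmax and return P(spam) per row."""
    logits = np.asarray(logits, dtype=np.float64)
    # Subtract the row maximum so np.exp cannot overflow
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=1, keepdims=True)
    return probs[:, 1]  # assumed: column 1 is the spam class


# Hypothetical logits for three messages: [ham_logit, spam_logit]
example_logits = [[2.0, -1.0], [-0.5, 3.0], [0.0, 0.0]]
spam_prob = logits_to_spam_probability(example_logits)

# Threshold at 0.5; a different threshold trades precision for recall
predicted = (spam_prob >= 0.5).astype(int)
print(spam_prob, predicted)
```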

## Conclusion

Here we show that a BERT-based spam/phishing detector performs well at identifying spam messages in this dataset, with an F1 score above 0.95. This notebook is prepared as an example. We have seen equally strong performance on private datasets with benign and phishing emails, and we suggest users experiment with their own datasets as well.

## References
* SMS Dataset https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

* SMS Dataset http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/

* BERT model hosted on HuggingFace https://huggingface.co/bert-base-uncased

* Spam Detection Using BERT - Thaer Sahmoud, Dr. Mohammad Mikki (2022) https://arxiv.org/abs/2206.02443
