# Phishing Detection Using BERT

## Authors
- Gorkem Batmaz (NVIDIA)
- Bartley Richardson, PhD (NVIDIA)

## Table of Contents 
* Introduction
* List of datasets used
* Reading in the datasets
* Setting parameters that are common throughout the notebook
* Function definitions that are common throughout the notebook
* Load the tokenizer and the Model from Huggingface
* Training - CLAIR FRAUDULENT EMAILS dataset
* Evaluation of CLAIR Test Set
* Training with the SPAM_ASSASSIN dataset
* Evaluation of the SPAM_ASSASSIN Test Set
* Training with CLAIR+SPAM_ASSASSIN datasets
* Evaluation of the Test Set of CLAIR+SPAM_ASSASSIN Datasets
* Training with all three datasets (CLAIR+SPAM_ASSASSIN+ENRON)
* Evaluation of the Test Set of CLAIR+SPAM_ASSASSIN+ENRON Datasets
* References

## Introduction
Phishing is a technique fraudsters use to obtain sensitive information from email users by posing as legitimate institutions or people.
Various machine learning methods are used to detect and filter phishing and spam emails.
In this notebook, we show how to fine-tune a pre-trained *BERT language model with a classification layer, using the HuggingFace library, and analyse its performance on several datasets.
*BERT stands for Bidirectional Encoder Representations from Transformers. The paper can be found [here.](https://arxiv.org/pdf/1810.04805.pdf)
This notebook will be updated with a much faster GPU tokenizer.

## Datasets used
* [CLAIR-Fraudulent E-mail Corpus](https://www.kaggle.com/rtatman/fraudulent-email-corpus)
* [SPAM_ASSASSIN Dataset](https://spamassassin.apache.org/old/publiccorpus/)
* [Enron Emails](https://www.cs.cmu.edu/~enron/)

### Required Libraries

In [1]:
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from keras.preprocessing.sequence import pad_sequences
from transformers import BertTokenizer, AdamW , BertForSequenceClassification
from tqdm import tqdm, trange
import pandas as pd
import numpy as np

Using TensorFlow backend.


## Reading the files

In [2]:
dfclair = pd.read_csv("Phishing_Dataset_Clair-Collection.tsv", delimiter='\t', header=None, names=['label', 'email'])# Clair dataset

In [3]:
dfspam = pd.read_csv("200_20021010_spam_.tsv", delimiter='\t', header=None, names=['label', 'email'])# Phishing emails of the SPAM_ASSASSIN dataset

In [4]:
dfeasyham = pd.read_csv("200_1010_easy_ham_.tsv", delimiter='\t', header=None, names=['label', 'email'])# Benign emails of the SPAM_ASSASSIN dataset

In [5]:
dfhardham = pd.read_csv("200_1010_hard_ham_.tsv", delimiter='\t', header=None, names=['label', 'email'])# Benign emails of the SPAM_ASSASSIN dataset that are easily confused with phishing emails

In [6]:
dfenron=pd.read_csv("enron10000.tsv", delimiter='\t', header=None, names=['label', 'email'])#Benign Enron emails

In [7]:
# The files contain the first 200 words of each email. The model uses only the first 128 WordPiece tokens (max_len below), which can cover fewer than 128 words.

## Some of the hyperparameters 

In [8]:
learning_rate=3e-5 # 5e-5, 4e-5, 3e-5, and 2e-5 were tested in the original paper

In [9]:
max_len = 128 # the first max_len tokens of each email are used

In [10]:
batch_size = 32 #  32 is one of the recommended batch sizes by the paper

In [11]:
epoch=5 # 5 epochs give good enough results. Can be adjusted for different datasets

## Function definitions that are common in the notebook

`split` function to split the datasets into training and test sets

In [12]:
def split(df_data,df_label):
    X_train, X_test, y_train, y_test=train_test_split(df_data, df_label, test_size=0.20, random_state=2)
    return X_train, X_test, y_train, y_test

`create_input_id` function converts the tokenized emails to vocabulary ids, then truncates or pads each sequence to `max_len`.

In [13]:
def create_input_id(tokenized_emails,max_len):
    input_ids = [tokenizer.convert_tokens_to_ids(email) for email in tokenized_emails]
    input_ids = pad_sequences(input_ids, maxlen=max_len, dtype="long", truncating="post", padding="post")
    return input_ids
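
To make the effect of `truncating="post"` and `padding="post"` concrete, here is a minimal pure-Python sketch of the same behaviour. The helper `pad_or_truncate` and the id values are illustrative assumptions, not part of Keras or of this notebook's pipeline.

```python
def pad_or_truncate(ids, max_len, pad_id=0):
    """Keep the first max_len ids, then right-pad with pad_id up to max_len."""
    clipped = list(ids[:max_len])                          # like truncating="post": drop the tail
    return clipped + [pad_id] * (max_len - len(clipped))   # like padding="post": pad at the end

short_email = [101, 7592, 102]            # made-up ids for a 3-token email
long_email = [101, 1, 2, 3, 4, 5, 102]    # made-up ids for a 7-token email

padded = pad_or_truncate(short_email, max_len=5)
clipped = pad_or_truncate(long_email, max_len=5)
print(padded)   # [101, 7592, 102, 0, 0]
print(clipped)  # [101, 1, 2, 3, 4]
```

Note that in the second case the last id of the long sequence is cut off, so a sequence longer than `max_len` no longer ends with its closing marker.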
    

`flatten_accuracy` extracts the prediction and returns the accuracy

In [14]:
def flatten_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)
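
As a small worked example, the function can be applied to one batch of made-up logits. The values below are invented for illustration, and the column order (0 = benign, 1 = phishing) is an assumption about the label encoding.

```python
import numpy as np

def flatten_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()   # predicted class per email
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

# Made-up logits for a batch of four emails (assumed: column 0 = benign, column 1 = phishing).
logits = np.array([[2.1, -1.3], [-0.4, 1.7], [0.2, 0.9], [1.5, 0.3]])
labels = np.array([0, 1, 0, 0])

batch_accuracy = flatten_accuracy(logits, labels)
print(batch_accuracy)  # 0.75
```

The third email is predicted as class 1 while its label is 0, so three of the four predictions match.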

`train` function runs the training using `train_dataloader`,`validation_dataloader`

In [15]:
def train(train_dataloader,validation_dataloader,model,epochs=10):
    train_loss_set = []# Store loss and accuracy
   
    for _ in trange(epochs, desc="Epoch"):
        model.train()#enable training mode
        tr_loss = 0 # Tracking variables
        nb_tr_examples, nb_tr_steps = 0, 0
        for step, batch in enumerate(train_dataloader):
            batch = tuple(t.to(device) for t in batch)# Add batch to GPU
            b_input_ids, b_input_mask, b_labels = batch# Unpack the inputs from dataloader
            optimizer.zero_grad()# Clear out the gradients
            loss = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)[0]#forwardpass
            # print(type(train_loss_set))

            train_loss_set.append(loss.item())
            # print("loss.item",loss.item())

            loss.backward()
            optimizer.step()#update parameters
            tr_loss += loss.item()#get a numeric value
            nb_tr_examples += b_input_ids.size(0)
            nb_tr_steps += 1

        print("Train loss: {}".format(tr_loss / nb_tr_steps))

        model.eval()# Put model in evaluation mode to evaluate loss on the validation set

        eval_loss, eval_accuracy = 0, 0
        nb_eval_steps, nb_eval_examples = 0, 0

        for batch in validation_dataloader:
            batch = tuple(t.to(device) for t in batch)

            b_input_ids, b_input_mask, b_labels = batch

            with torch.no_grad():# Telling the model not to compute or store gradients, saving memory and speeding up validation
                logits = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)[0]# Forward pass, calculate logit predictions
            logits = logits.detach().cpu().numpy()
            label_ids = b_labels.to('cpu').numpy()
            temp_eval_accuracy = flatten_accuracy(logits, label_ids)

            eval_accuracy += temp_eval_accuracy
            nb_eval_steps += 1

        print("Validation Accuracy: {}".format(eval_accuracy / nb_eval_steps))
    return model
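
The validation accuracy printed by `train` is the mean of the per-batch accuracies. When the last batch is smaller than the others, this can differ slightly from the accuracy computed over all validation emails at once. A minimal sketch with made-up correctness flags and a hypothetical batch size of 4:

```python
import numpy as np

# Made-up correctness flags (1 = correct prediction) for 10 validation emails.
correct = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])
batch_size = 4  # hypothetical; yields batches of 4, 4 and 2 emails

batches = [correct[i:i + batch_size] for i in range(0, len(correct), batch_size)]
mean_of_batches = float(np.mean([b.mean() for b in batches]))  # what the loop reports
overall = float(correct.mean())                                # accuracy over all emails

print(mean_of_batches)  # about 0.667
print(overall)          # 0.8
```

The example is deliberately extreme. With large validation sets and few errors, the two numbers are usually very close.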

`evaluate_testset` function returns the predictions and the true labels

In [16]:
def evaluate_testset(test_dataloader):
    model.eval()

    tests , true_labels = [], []

    for batch in test_dataloader:

        batch = tuple(t.to(device) for t in batch)

        b_input_ids, b_input_mask, b_labels = batch

        with torch.no_grad():

            logits = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)[0]

        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()

        tests.append(logits)
        true_labels.append(label_ids)
    return tests,true_labels

`tokenize` function tokenizes the emails.

In [17]:
def tokenize(X_,y_):
    emails = X_.to_numpy()
    emails = ["[CLS] " + str(email) + " [SEP]" for email in emails]#add cls and sep so they are recognized by the tokenizer
    labels = y_.to_numpy()
    tokenized_emails = [tokenizer.tokenize(email) for email in emails]
    return tokenized_emails, labels
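
The BERT tokenizer splits words into WordPiece subwords, so the sequence length is counted in tokens rather than words. The sketch below illustrates greedy longest-match-first splitting with a tiny made-up vocabulary. The function `wordpiece` and `toy_vocab` are hypothetical stand-ins and not the HuggingFace implementation.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1                           # try a shorter substring
        if piece is None:
            return [unk]                       # no piece matched: whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"phish", "##ing", "email", "##s", "un", "##subscribe"}  # made-up vocabulary
sentence = "phishing emails unsubscribe"
tokens = [p for w in sentence.split() for p in wordpiece(w, toy_vocab)]
print(tokens)  # ['phish', '##ing', 'email', '##s', 'un', '##subscribe']
```

Three words become six tokens here, which is why 128 tokens can cover fewer than 128 words of an email.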

`create_att_mask` Creates a mask of 1s for each token followed by 0s for padding as a list

In [18]:
def create_att_mask(input_ids):
    attention_masks = []
    for a in input_ids:
        a_mask = [float(i>0) for i in a]
        attention_masks.append(a_mask)
    return attention_masks  
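
A short worked example of the mask on two made-up id sequences follows. The vectorised NumPy expression is shown as an assumed equivalent for array inputs, not as the notebook's own method.

```python
import numpy as np

def create_att_mask(input_ids):
    attention_masks = []
    for a in input_ids:
        a_mask = [float(i > 0) for i in a]   # 1.0 for real tokens, 0.0 for padding (id 0)
        attention_masks.append(a_mask)
    return attention_masks

# Made-up padded ids: the first email has three tokens, the second fills all five slots.
input_ids = np.array([[101, 7592, 102, 0, 0],
                      [101, 2023, 2003, 1037, 102]])

loop_masks = create_att_mask(input_ids)
vector_masks = (input_ids > 0).astype(float)  # same result without the Python loop

print(loop_masks[0])  # [1.0, 1.0, 1.0, 0.0, 0.0]
```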

`testset_loader` returns a data loader using `input_ids`, `attention_masks` and `labels` for the test set

In [19]:
def testset_loader(input_ids,attention_masks,labels):
    test_inputs = torch.tensor(input_ids)
    test_masks = torch.tensor(attention_masks)
    test_labels = torch.tensor(labels)
    test_data = TensorDataset(test_inputs, test_masks, test_labels)
    test_sampler = SequentialSampler(test_data)
    test_dataloader = DataLoader(test_data, sampler=test_sampler, batch_size=batch_size)
    return test_dataloader

`create_data_loaders` returns data loaders built from `input_ids`, `attention_masks` and `labels` for the train and validation sets

In [20]:
def create_data_loaders(train_inputs,validation_inputs,train_labels,validation_labels,train_masks,validation_masks):
    ## Convert all of our data into torch tensors, 
    train_inputs = torch.tensor(train_inputs)
    validation_inputs = torch.tensor(validation_inputs)
    train_labels = torch.tensor(train_labels)
    validation_labels = torch.tensor(validation_labels)
    train_masks = torch.tensor(train_masks)
    validation_masks = torch.tensor(validation_masks)
    ## Create an iterator of our data with torch DataLoader. more memory effective
    train_data = TensorDataset(train_inputs, train_masks, train_labels)
    train_sampler = RandomSampler(train_data)
    train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)
    validation_data = TensorDataset(validation_inputs, validation_masks, validation_labels)
    validation_sampler = SequentialSampler(validation_data)
    validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)
    return train_dataloader,validation_dataloader


## Load the tokenizer and the model from Huggingface

In [21]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.to(device)  # move the model to the GPU if one is available, otherwise keep it on the CPU

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

## Training - CLAIR FRAUDULENT EMAILS DATASET

Split the dataset into training and test sets

In [22]:
X_train, X_test, y_train, y_test=split(dfclair.email, dfclair.label)# returns the training and test sets as Series; the training part is split again later into train and validation sets, and the test part is used in the evaluation stage

Tokenize the emails

In [23]:
tokenized_emails,labels = tokenize(X_train,y_train)#pass the series into the 'tokenize' function and get tokenized emails as lists and labels as an array

In [24]:
%%capture
input_ids=create_input_id(tokenized_emails,max_len)## get an array by using the BERT tokenizer to convert the tokens to their index numbers in the BERT vocabulary, and then pad or crop them to the max length

In [25]:
attention_masks=create_att_mask(input_ids)## Create a mask of 1s for each token followed by 0s for padding as a list

In [26]:
train_inputs, validation_inputs, train_labels, validation_labels = split(input_ids, labels)#split the ids and the masks into train and val and get an array
train_masks, validation_masks, _, _ = split(attention_masks, input_ids)#split the ids and the masks into train and val and get a list
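
The two `split` calls keep inputs and masks aligned because `train_test_split` is called with the same `random_state` on sequences of the same length, which produces the same shuffle. The sketch below checks this on ten made-up rows. The data is invented so that each mask row can be matched to its input row.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def split(df_data, df_label):
    return train_test_split(df_data, df_label, test_size=0.20, random_state=2)

# Ten made-up rows; row k holds 100*k in the inputs and float(k) in the masks.
input_ids = np.arange(10).reshape(10, 1) * 100
attention_masks = [[float(i)] for i in range(10)]
labels = np.arange(10) % 2

train_inputs, validation_inputs, _, _ = split(input_ids, labels)
train_masks, validation_masks, _, _ = split(attention_masks, input_ids)

# Each training input row should sit next to the mask row that came from the same email.
aligned = all(inp[0] == mask[0] * 100 for inp, mask in zip(train_inputs, train_masks))
print(aligned)                                    # True
print(len(train_inputs), len(validation_inputs))  # 8 2
```

Changing `random_state` in only one of the two calls would break this alignment.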

Creating data loaders

In [27]:
train_dataloader,validation_dataloader=create_data_loaders(train_inputs,validation_inputs,train_labels,validation_labels,train_masks,validation_masks)#create data loaders

Optimizer parameters are set

In [28]:
param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'LayerNorm.weight']  # parameters excluded from weight decay
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
     'weight_decay': 0.01},  # AdamW reads the key 'weight_decay'
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
     'weight_decay': 0.0}
]

In [29]:
optimizer = AdamW(optimizer_grouped_parameters, learning_rate)

In [30]:
model=train(train_dataloader,validation_dataloader,model,epoch)

Epoch:   0%|          | 0/5 [00:00<?, ?it/s]

Train loss: 0.07040747153284378


Epoch:  20%|██        | 1/5 [01:03<04:12, 63.13s/it]

Validation Accuracy: 0.9963541666666667
Train loss: 0.020385592180269158


Epoch:  40%|████      | 2/5 [02:05<03:09, 63.02s/it]

Validation Accuracy: 0.9984375
Train loss: 0.003376401658608309


Epoch:  60%|██████    | 3/5 [03:08<02:05, 62.98s/it]

Validation Accuracy: 0.9979166666666667
Train loss: 0.0006898887001005271


Epoch:  80%|████████  | 4/5 [04:13<01:03, 63.56s/it]

Validation Accuracy: 0.9979166666666667
Train loss: 0.0005424856491387864


Epoch: 100%|██████████| 5/5 [05:16<00:00, 63.36s/it]

Validation Accuracy: 0.9979166666666667





In [31]:
model.save_pretrained('.')#model is saved in the current working directory

## Evaluation of CLAIR Test Set

First, tokenize the emails

In [32]:
tokenized_emails,labels = tokenize(X_test,y_test)

Create input ids

In [33]:
%%capture
input_ids=create_input_id(tokenized_emails,max_len)

Create attention masks

In [34]:
attention_masks=create_att_mask(input_ids) ## Create a mask of 1s for each token followed by 0s for padding as a list

Create a data loader for the test dataset

In [35]:
test_dataloader=testset_loader(input_ids,attention_masks,labels)

Get predictions using the model

In [36]:
tests, true_labels=evaluate_testset(test_dataloader)

Flatten the predictions and the labels

In [37]:
flat_tests = [item for sublist in tests for item in sublist]
flat_tests = np.argmax(flat_tests, axis=1).flatten()
flat_true_labels = [item for sublist in true_labels for item in sublist]
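
The flattening step turns the per-batch outputs of `evaluate_testset` into one prediction and one label per email. A minimal sketch with made-up logits for two batches, using `np.concatenate` as an assumed equivalent of the list comprehensions above:

```python
import numpy as np

# Made-up logits for two batches (sizes 3 and 2), shaped like the `tests` list.
tests = [np.array([[1.2, -0.7], [-0.3, 0.8], [0.9, 0.1]]),
         np.array([[-1.1, 2.0], [0.4, -0.2]])]
true_labels = [np.array([0, 1, 0]), np.array([1, 1])]

# Stack the batches row-wise, then take the index of the larger logit per email.
flat_tests = np.argmax(np.concatenate(tests, axis=0), axis=1)
flat_true_labels = np.concatenate(true_labels)

print(flat_tests.tolist())        # [0, 1, 0, 1, 0]
print(flat_true_labels.tolist())  # [0, 1, 0, 1, 1]
```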

In [38]:
f1_score(flat_true_labels, flat_tests)

0.998093422306959

In [39]:
accuracy_score(flat_true_labels, flat_tests)

0.9983235540653814

## Training with SPAM_ASSASSIN dataset

Merge the SPAM_ASSASSIN datasets

In [40]:
df_assassin = pd.concat([dfhardham,dfeasyham,dfspam] ,ignore_index=True)

Split the dataset into train and test

In [41]:
X_train, X_test, y_train, y_test=split(df_assassin.email, df_assassin.label)

Tokenize the emails and create input_ids and attention masks

In [42]:
tokenized_emails,labels = tokenize(X_train,y_train)

In [43]:
%%capture
input_ids=create_input_id(tokenized_emails,max_len)

In [44]:
attention_masks=create_att_mask(input_ids)

Split the training set into train and validation sets

In [45]:
train_inputs, validation_inputs, train_labels, validation_labels = split(input_ids, labels)
train_masks, validation_masks, _, _ = split(attention_masks, input_ids)

In [46]:
train_dataloader,validation_dataloader=create_data_loaders(train_inputs,validation_inputs,train_labels,validation_labels,train_masks,validation_masks)

Set the optimizer

In [47]:
param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'LayerNorm.weight']  # parameters excluded from weight decay
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
     'weight_decay': 0.01},  # AdamW reads the key 'weight_decay'
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
     'weight_decay': 0.0}
]

In [48]:
optimizer = AdamW(optimizer_grouped_parameters, learning_rate)

In [49]:
model=train(train_dataloader,validation_dataloader,model,epoch)

Epoch:   0%|          | 0/5 [00:00<?, ?it/s]

Train loss: 0.36991006248828134


Epoch:  20%|██        | 1/5 [00:17<01:09, 17.46s/it]

Validation Accuracy: 0.9613970588235294
Train loss: 0.11205905362625014


Epoch:  40%|████      | 2/5 [00:34<00:52, 17.47s/it]

Validation Accuracy: 0.9908088235294118
Train loss: 0.01727770713411949


Epoch:  60%|██████    | 3/5 [00:52<00:34, 17.49s/it]

Validation Accuracy: 0.9944852941176471
Train loss: 0.008141785591953632


Epoch:  80%|████████  | 4/5 [01:10<00:17, 17.51s/it]

Validation Accuracy: 0.9963235294117647
Train loss: 0.0036517112290091586


Epoch: 100%|██████████| 5/5 [01:27<00:00, 17.52s/it]

Validation Accuracy: 0.9963235294117647





## Evaluation of the SPAM_ASSASSIN Test Set

Tokenize the test set and create input ids, attention masks and the data loader

In [50]:
tokenized_emails,labels = tokenize(X_test,y_test)

In [51]:
%%capture
input_ids=create_input_id(tokenized_emails,max_len)

In [52]:
attention_masks=create_att_mask(input_ids)

In [53]:
test_dataloader=testset_loader(input_ids,attention_masks,labels)

Get predictions for the test set and flatten the results

In [54]:
tests, true_labels=evaluate_testset(test_dataloader)

In [55]:
flat_tests = [item for sublist in tests for item in sublist]
flat_tests = np.argmax(flat_tests, axis=1).flatten()
flat_true_labels = [item for sublist in true_labels for item in sublist]

In [56]:
f1_score(flat_true_labels, flat_tests)

0.9674418604651164

In [57]:
accuracy_score(flat_true_labels, flat_tests)

0.989409984871407
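
The F1 score here is lower than the accuracy. One possible reason is class imbalance: `f1_score` for the positive class does not count true negatives, so a few errors on a small positive class weigh more than they do in accuracy. The sketch below uses invented confusion counts, not the counts from this test set.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 for the positive class from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up confusion counts for an imbalanced test set of 1000 emails.
tp, fp, fn, tn = 90, 5, 5, 900
accuracy = (tp + tn) / (tp + fp + fn + tn)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)

print(round(accuracy, 3))  # 0.99
print(round(f1, 3))        # 0.947
```

With only 10 errors in 1000 emails the accuracy is 0.99, while the F1 score drops below 0.95 because all errors are measured against the 95 positive emails.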

## Training with CLAIR+SPAM_ASSASSIN datasets

Merge the two datasets and split as train and test sets

In [58]:
df_total = pd.concat([dfhardham,dfeasyham,dfspam,dfclair],ignore_index=True)

In [59]:
X_train, X_test, y_train, y_test=split(df_total.email, df_total.label)

Tokenize the emails and create input_ids and attention masks

In [60]:
tokenized_emails,labels = tokenize(X_train,y_train)

In [61]:
%%capture
input_ids=create_input_id(tokenized_emails,max_len)

In [62]:
attention_masks=create_att_mask(input_ids)

Split the training set into train and validation sets. Create the data loaders

In [63]:
train_inputs, validation_inputs, train_labels, validation_labels = split(input_ids, labels)
train_masks, validation_masks, _, _ = split(attention_masks, input_ids)

In [64]:
train_dataloader,validation_dataloader=create_data_loaders(train_inputs,validation_inputs,train_labels,validation_labels,train_masks,validation_masks)

Set the optimizer

In [65]:
param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'LayerNorm.weight']  # parameters excluded from weight decay
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
     'weight_decay': 0.01},  # AdamW reads the key 'weight_decay'
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
     'weight_decay': 0.0}
]

In [66]:
optimizer = AdamW(optimizer_grouped_parameters, learning_rate)

Start the training

In [67]:
model=train(train_dataloader,validation_dataloader,model,5)

Epoch:   0%|          | 0/5 [00:00<?, ?it/s]

Train loss: 0.011452606556790148


Epoch:  20%|██        | 1/5 [01:20<05:21, 80.34s/it]

Validation Accuracy: 0.9995941558441559
Train loss: 0.00519723205189755


Epoch:  40%|████      | 2/5 [02:40<04:01, 80.41s/it]

Validation Accuracy: 0.9995941558441559
Train loss: 0.0017975553126144987


Epoch:  60%|██████    | 3/5 [04:02<02:41, 80.62s/it]

Validation Accuracy: 0.9995941558441559
Train loss: 6.213206363666314e-05


Epoch:  80%|████████  | 4/5 [05:22<01:20, 80.61s/it]

Validation Accuracy: 0.9995941558441559
Train loss: 3.438446031202043e-05


Epoch: 100%|██████████| 5/5 [06:44<00:00, 80.95s/it]

Validation Accuracy: 0.9995941558441559





## Evaluation of the Test Set of CLAIR+SPAM_ASSASSIN Datasets

Tokenize the test set, create input ids and attention masks, and create the data loader

In [68]:
tokenized_emails,labels = tokenize(X_test,y_test)

In [69]:
%%capture
input_ids=create_input_id(tokenized_emails,max_len)

In [70]:
attention_masks=create_att_mask(input_ids)

In [71]:
test_dataloader=testset_loader(input_ids,attention_masks,labels)

Get the predictions using `evaluate_testset`

In [72]:
tests, true_labels=evaluate_testset(test_dataloader)

Flatten the results

In [73]:
flat_tests = [item for sublist in tests for item in sublist]
flat_tests = np.argmax(flat_tests, axis=1).flatten()
flat_true_labels = [item for sublist in true_labels for item in sublist]

In [74]:
f1_score(flat_true_labels, flat_tests)

0.9991386735572783

In [75]:
accuracy_score(flat_true_labels, flat_tests)

0.9993434011818779

## Training with all three datasets (CLAIR+SPAM_ASSASSIN+ENRON)

Merge all the datasets, split into training and test set and then tokenize the emails

In [76]:
df_total = pd.concat([dfhardham,dfeasyham,dfspam,dfclair,dfenron],ignore_index=True)

In [77]:
X_train, X_test, y_train, y_test=split(df_total.email, df_total.label)

In [78]:
tokenized_emails,labels = tokenize(X_train,y_train)

Create input ids and attention masks

In [79]:
%%capture
input_ids=create_input_id(tokenized_emails,max_len)

In [80]:
attention_masks=create_att_mask(input_ids)

In [81]:
train_inputs, validation_inputs, train_labels, validation_labels = split(input_ids, labels)
train_masks, validation_masks, _, _ = split(attention_masks, input_ids)

Create data loaders

In [82]:
train_dataloader,validation_dataloader=create_data_loaders(train_inputs,validation_inputs,train_labels,validation_labels,train_masks,validation_masks)

Set the optimizer parameters

In [83]:
param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'LayerNorm.weight']  # parameters excluded from weight decay
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
     'weight_decay': 0.01},  # AdamW reads the key 'weight_decay'
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
     'weight_decay': 0.0}
]

In [84]:
optimizer = AdamW(optimizer_grouped_parameters, learning_rate)

Run the training

In [85]:
model=train(train_dataloader,validation_dataloader,model,5)

Epoch:   0%|          | 0/5 [00:00<?, ?it/s]

Train loss: 0.00770341860168466


Epoch:  20%|██        | 1/5 [02:13<08:54, 133.58s/it]

Validation Accuracy: 0.999507874015748
Train loss: 0.005486953640552607


Epoch:  40%|████      | 2/5 [04:27<06:40, 133.63s/it]

Validation Accuracy: 0.9990157480314961
Train loss: 0.005276057476380674


Epoch:  60%|██████    | 3/5 [06:41<04:27, 133.84s/it]

Validation Accuracy: 0.9931102362204725
Train loss: 0.0009331031898981753


Epoch:  80%|████████  | 4/5 [08:55<02:13, 133.82s/it]

Validation Accuracy: 0.999507874015748
Train loss: 4.1015769602381626e-05


Epoch: 100%|██████████| 5/5 [11:09<00:00, 133.85s/it]

Validation Accuracy: 0.999507874015748





## Evaluation of the Test Set of CLAIR+SPAM_ASSASSIN+ENRON Datasets

Tokenize the test set, create input ids and attention masks

In [86]:
tokenized_emails,labels = tokenize(X_test,y_test)

In [87]:
%%capture
input_ids=create_input_id(tokenized_emails,max_len)

In [88]:
attention_masks=create_att_mask(input_ids)

Create the test data loader and evaluate the test set

In [89]:
test_dataloader=testset_loader(input_ids,attention_masks,labels)

In [90]:
tests, true_labels=evaluate_testset(test_dataloader)

Flatten the results

In [91]:
flat_tests = [item for sublist in tests for item in sublist]
flat_tests = np.argmax(flat_tests, axis=1).flatten()
flat_true_labels = [item for sublist in true_labels for item in sublist]

In [92]:
f1_score(flat_true_labels, flat_tests)

0.9991071428571429

In [93]:
accuracy_score(flat_true_labels, flat_tests)

0.9996036464526358

## References
* https://github.com/huggingface/transformers/tree/master/examples#
* https://www.depends-on-the-definition.com/named-entity-recognition-with-bert/
* https://github.com/ThilinaRajapakse/pytorch-transformers-classification
* https://mccormickml.com/2019/07/22/BERT-fine-tuning/