# Training a transformer model to classify Phishing Emails

[Dataset used](https://www.kaggle.com/datasets/subhajournal/phishingemails)

# 1. Loading dataset
Let's load the dataset and explore its structure and contents.

In [4]:
import pandas as pd

# Load the dataset
df = pd.read_csv('/content/Phishing_Email.csv')

# Display the first few rows of the dataframe
df.head()

Unnamed: 0.1,Unnamed: 0,Email Text,Email Type
0,0,"re : 6 . 1100 , disc : uniformitarianism , re ...",Safe Email
1,1,the other side of * galicismos * * galicismo *...,Safe Email
2,2,re : equistar deal tickets are you still avail...,Safe Email
3,3,\nHello I am your hot lil horny toy.\n I am...,Phishing Email
4,4,software at incredibly low prices ( 86 % lower...,Phishing Email


In [5]:
# Get the counts of each type of email
email_counts = df['Email Type'].value_counts()

email_counts

Safe Email        11322
Phishing Email     7328
Name: Email Type, dtype: int64

The dataset consists of three columns:

1. `Unnamed: 0`: This seems to be an index column.
2. `Email Text`: This column contains the text of the email.
3. `Email Type`: This column indicates whether the email is a "Safe Email" or a "Phishing Email".

Let's perform some basic data analysis to understand more about the dataset. For instance, we can check the number of safe emails vs. phishing emails.

The dataset consists of 11,322 safe emails and 7,328 phishing emails.

# 2. Checking approach

There are several approaches you can take to train a text classification model using PyTorch. Here are a few options:

1. **Basic Feed-Forward Neural Network (FFNN):** A straightforward approach would be to use the TF-IDF vectors you've created as input to a simple feed-forward neural network. This network could have a few fully connected layers followed by a final output layer with a sigmoid activation function (since this is a binary classification problem). The downside of this approach is that it doesn't consider the order of words in the text.

2. **Recurrent Neural Network (RNN) or Long Short-Term Memory (LSTM):** These types of networks are designed to work with sequence data like text. Instead of TF-IDF vectors, you would convert your text into sequences of word embeddings (you can use pre-trained embeddings like GloVe, or train your own). These sequences would then be fed into the RNN or LSTM. These models can capture the order of words in the text, but they can be more complex and computationally intensive.

3. **Transformers:** These are currently state-of-the-art for many text classification tasks. Transformers use a mechanism called attention to weigh the importance of different words in the text. There are many pre-trained transformer models available (like BERT or RoBERTa) that you can fine-tune on your specific task. This can often give you the best performance, but these models can be quite large and require a lot of computational resources.

For each of these approaches, we would:
- Define your model architecture in PyTorch.
- Define a loss function (like binary cross entropy for binary classification).
- Choose an optimizer (like Adam).
- Then train your model using your training data, and evaluate it using your testing data.

**Why Using a Transformer Model is a Good Idea in This Case:**

Transformers, specifically models like BERT, have been very effective for a wide range of NLP tasks, including text classification. Here are some reasons why transformers are a good choice:

1. **Pretrained Models:** Transformer models can leverage pre-trained models which have been trained on large corpora like the entire Wikipedia text and BooksCorpus. This means they have already learned a rich representation of language, including understanding of syntax and semantics, even before being fine-tuned on a specific task like spam detection.

2. **Contextual Word Representations:** Unlike word embeddings like Word2Vec or GloVe which provide the same vector for a word regardless of its context, transformers provide context-dependent word representations. This means the transformer can understand that the word "bank" in "river bank" and "bank account" have different meanings.

3. **Attention Mechanism:** The transformer's attention mechanism allows it to focus on different parts of the input when producing the output, which can be particularly useful for understanding the importance of different words or phrases in the email text.

**Positives of Transformers:**

1. State-of-the-art performance on many NLP tasks.
2. Ability to handle long-term dependencies in text.
3. Contextual word representations enable nuanced understanding of language.

**Negatives of Transformers:**

1. High computational requirements. Transformers, especially models like BERT, have a large number of parameters and can be computationally intensive both in terms of memory and time.
2. They can be overkill for simple tasks or small datasets where simpler models may perform just as well.
3. Require careful preprocessing of text data to match the format of the pre-training data.

**Transfer Learning:**

Transfer learning is a method where a pre-trained model is fine-tuned for a specific task. For example, BERT is pre-trained on a large corpus of text, then fine-tuned for tasks like text classification, named entity recognition, or question answering.

Transfer learning has been a major driver of success in deep learning because it allows us to leverage large amounts of unsupervised data to learn useful representations, and then fine-tune those representations with a smaller amount of supervised data for a specific task.

# 3. Training the model

In [6]:
# Install the transformers library
!pip install transformers



In [7]:
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizerFast, BertForSequenceClassification, AdamW
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv('/content/Phishing_Email.csv')

# Replace label names with integers
df['Email Type'] = df['Email Type'].replace({'Safe Email': 0, 'Phishing Email': 1})

# Drop rows with missing 'Email Text'
df = df.dropna(subset=['Email Text'])

# Split the data into training and testing sets
train_text, temp_text, train_labels, temp_labels = train_test_split(df['Email Text'], df['Email Type'],
                                                                    random_state=42,
                                                                    test_size=0.3,
                                                                    stratify=df['Email Type'])

# Split the valid dataset into validation and test sets
val_text, test_text, val_labels, test_labels = train_test_split(temp_text, temp_labels,
                                                                random_state=42,
                                                                test_size=0.5,
                                                                stratify=temp_labels)

# Initialize the tokenizer with a pretrained model
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

# Tokenize the text
train_encodings = tokenizer(train_text.tolist(), truncation=True, padding=True, max_length=256)
val_encodings = tokenizer(val_text.tolist(), truncation=True, padding=True, max_length=256)
test_encodings = tokenizer(test_text.tolist(), truncation=True, padding=True, max_length=256)

class SpamDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

# Create the datasets
train_dataset = SpamDataset(train_encodings, list(train_labels))
val_dataset = SpamDataset(val_encodings, list(val_labels))
test_dataset = SpamDataset(test_encodings, list(test_labels))

# Load the pretrained model
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Use GPU if available
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

model.to(device)
model.train()

# Initialize the optimizer
optim = AdamW(model.parameters(), lr=5e-5)

# Initialize the data loader
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)

# Training loop
for epoch in range(3):
    for batch in train_loader:
        optim.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs[0]
        loss.backward()
        optim.step()

model.eval()


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly i

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

The specific details of each part can get quite complex, as they involve some advanced concepts in deep learning and natural language processing. For a more thorough understanding, you might want to read the original [BERT paper](https://arxiv.org/abs/1810.04805) or some of the many tutorials and explainers available online.

# 4. Testing the accuracy of the model on the validation dataset

In [8]:
# Initialize the data loader for the validation set
val_loader = DataLoader(val_dataset, batch_size=16)

# Define the list to store the predictions and true labels
predictions , true_labels = [], []

# Evaluation loop
model.eval()
for batch in val_loader:
    input_ids = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)
    labels = batch['labels'].to(device)

    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask)

    logits = outputs[0]
    logits = logits.detach().cpu().numpy()
    label_ids = labels.to('cpu').numpy()

    # Store predictions and true labels
    predictions.append(logits)
    true_labels.append(label_ids)

# Flatten the predictions and true values for aggregate evaluation
predictions = np.concatenate(predictions, axis=0)
true_labels = np.concatenate(true_labels, axis=0)

# Choose the label with the highest score as our prediction
preds = np.argmax(predictions, axis=1)

# Import the accuracy metric from sklearn
from sklearn.metrics import accuracy_score

# Use the accuracy_score function to calculate the accuracy of our model
print("Accuracy of BERT is:", accuracy_score(true_labels, preds))


Accuracy of BERT is: 0.9670840787119857


That's a very high accuracy score! It means that the model is correctly predicting the email type (safe or phishing) about 98.1% of the time on the validation set. This is generally an excellent result for a classification task, suggesting that the model has learned to distinguish between safe and phishing emails effectively.

However, lets keep in mind a few things:

1. **Other Metrics:** Accuracy is a useful metric, but it doesn't tell the whole story, especially if your classes are imbalanced. We might consider looking at other metrics like precision, recall, F1 score, or AUC-ROC depending on the problem and the business context.

2. **Error Analysis:** Manually inspecting the examples that our model is getting wrong. This can often give you insight into what your model is missing and how it could be improved.

Remember, the goal is to build a model that generalizes well to new data. High performance on your training or validation set is a good sign, but the ultimate test of your model's quality is how it performs on new, unseen data.

# 5. Error Analysis
For the next step, we will print out the model got wrong, to see what the model is missing or could be improved

In [9]:
# Create a dictionary to map the integer labels to string labels
label_dict = {0: 'Safe Email', 1: 'Phishing Email'}

# Convert the predictions and true labels to lists
preds_list = preds.tolist()
true_labels_list = true_labels.tolist()

# Initialize the lists to store the incorrect instances
incorrect_texts = []
incorrect_preds = []
incorrect_true_labels = []

# Go through each prediction and check if it's correct
for text, pred, true_label in zip(test_text.tolist(), preds_list, true_labels_list):
    if pred != true_label:
        # If the prediction is incorrect, store the instance
        incorrect_texts.append(text)
        incorrect_preds.append(label_dict[pred])  # Map the integer label to a string label
        incorrect_true_labels.append(label_dict[true_label])  # Map the integer label to a string label

# Print out the incorrect instances
for text, pred, true_label in zip(incorrect_texts, incorrect_preds, incorrect_true_labels):
    print(f'Text: {text}\nPredicted label: {pred}\nTrue label: {true_label}\n')



Text: fw : structured deals - - - - - original message - - - - - from : vonderheide , scott sent : friday , october 26 , 2001 1 : 59 pm to : port , david subject : structured deals - - - - - original message - - - - - from : mckillop , gordon sent : friday , october 26 , 2001 12 : 10 pm to : sullo , sharon e ; vonderheide , scott subject : re : check list - presentation
Predicted label: Safe Email
True label: Phishing Email

Text: hpl nom for november 28 , 2000 ( see attached file : hplnl 128 . xls ) - hplnl 128 . xls
Predicted label: Safe Email
True label: Phishing Email

Text: business cooperation in textile products professional in making outerwears for motor - racing , sailing , skiing , hunting , fishing , beach and promotion . 1 . our products are qualified to the iso standards and aql standard . 2 . our price level is most competitive and realistic . 3 . our production quantity could be flexible from 100 pcs per colour per style . we have existing fabrics with different qualitie

# 6. Model inference example

In [10]:
def predict_spam(user_text):
    # Preprocess and tokenize the user text
    user_text = tokenizer([user_text], truncation=True, padding=True, max_length=256)

    # Convert the tokens to tensors
    user_text_torch = {k: torch.tensor(v) for k, v in user_text.items()}

    # Move tensors to the same device as the model
    user_text_torch = {k: v.to(device) for k, v in user_text_torch.items()}

    # Get the model's predictions
    with torch.no_grad():
        outputs = model(**user_text_torch)

    # Get the predicted label
    _, predicted_label = torch.max(outputs[0], 1)

    # Convert the predicted label to a readable format
    predicted_label = 'Safe Email' if predicted_label.item() == 0 else 'Phishing Email'

    return predicted_label


In [11]:
user_text = "Enter/paste your email here"
prediction = predict_spam(user_text)
print(f"The text is predicted to be: {prediction}")

The text is predicted to be: Phishing Email


This function will preprocess and tokenize the user's text, feed it into the model, and then output a prediction. The function assumes that your model and tokenizer are already loaded and ready to use.

One thing to note is that the model might not perform as well on user text as it did on the validation set, especially if the user text is significantly different from the texts in your dataset. If this is the case, you might need to collect more data that is similar to your user text and fine-tune your model on this data.