# Spam Classification using Encoder LLMs with Linear Probing [5 points]
In this part, we will use encoder Large Language Models (LLMs) for spam classification. We will leverage the rich features of pre-trained LLMs without fine-tuning them. Instead, we will freeze the LLM weights and train a lightweight classifier head (MLP) on top for spam classification.

**Dataset:** Enron Spam Dataset

**Expected Performance (Best Model):** {Accuracy: >85%, F1: >85%, Precision: >85%, Recall: >82%}

1. Load the Enron Spam dataset. Use the train/val/test splits and tokenize the text using your pre-trained LLM’s tokenizer. Use your best judgement for the relevant input fields.
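As a rough illustration of what fixed-length padding and truncation do to each tokenized email, here is a toy sketch in plain Python. The helper `pad_and_truncate` and the tiny `max_length` are hypothetical and only mimic the shape of a tokenizer's output; a real tokenizer also maps text to IDs and keeps the closing special token when it truncates.

```python
def pad_and_truncate(token_ids, max_length, pad_id=0):
    """Toy stand-in: clip to max_length, then right-pad with pad_id."""
    clipped = token_ids[:max_length]
    n_pad = max_length - len(clipped)
    return {
        "input_ids": clipped + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding the model should ignore
        "attention_mask": [1] * len(clipped) + [0] * n_pad,
    }


short = pad_and_truncate([101, 3531, 3602, 102], max_length=6)
long = pad_and_truncate(list(range(1, 11)), max_length=6)

print(short)
print(long)
```

Short inputs gain trailing zeros in both fields, while long inputs are simply cut off, which is why the attention mask matters for any pooling that looks beyond the first token.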

In [1]:
### ADD YOUR CODE HERE ###
# Load Enron Spam dataset (consider using Hugging Face Datasets or manual loading if necessary)
# Implement train/val/test splits
# Tokenize text data using the chosen LLM's tokenizer
from datasets import Dataset, load_dataset
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer

# Loading the Enron Spam dataset from Hugging Face
dataset = load_dataset("SetFit/enron_spam")

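# Convert the provided train/test splits to pandas
train_df = dataset['train'].to_pandas()
test_df = dataset['test'].to_pandas()

# Hold out 10% of the training data as a validation set
train_df, val_df = train_test_split(train_df, test_size=0.1, random_state=42)

# Convert back to Hugging Face Datasets (dropping the pandas index column)
train_dataset = Dataset.from_pandas(train_df, preserve_index=False)
val_dataset = Dataset.from_pandas(val_df, preserve_index=False)
test_dataset = Dataset.from_pandas(test_df, preserve_index=False)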




Generating train split:   0%|          | 0/31716 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/2000 [00:00<?, ? examples/s]

In [15]:
from transformers import AutoTokenizer

# Tokenizing
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')


def tokenize_function(examples):
    # Pad or truncate every email to the tokenizer's maximum length
    return tokenizer(examples['text'], padding='max_length', truncation=True)


# Tokenize the splits created in the previous cell
tokenized_train = train_dataset.map(tokenize_function, batched=True)
tokenized_val = val_dataset.map(tokenize_function, batched=True)
tokenized_test = test_dataset.map(tokenize_function, batched=True)


tokenized_train.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
tokenized_val.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
tokenized_test.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])


print(tokenized_train[0])


Map:   0%|          | 0/28544 [00:00<?, ? examples/s]

Map:   0%|          | 0/3172 [00:00<?, ? examples/s]

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

{'label': tensor(0), 'input_ids': tensor([  101,  3531,  3602,  2008,  2026, 10373,  4769,  2038,  2042,  2904,
         2000,  1024,  1046,  4819,  3051,  1030,  3891,  5880,  2015,  1012,
         4012,  3531, 10651,  2115,  4769,  2808,  1012,  4283,  1012,  8963,
         1012,  3622,  1024,  1009,  4008,  1006,  1014,  1007,  2322,  6356,
         2620,  2549,  5818, 27531,  7479,  1012,  3891,  5880,  2015,  1012,
         4012,  1011,  2012, 19646,  1012,  1044, 21246,   102,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,  

2. Model Setup – Probing:

   a. Load a pre-trained LLM (e.g., DistilBERT, BART-encoder) for sequence classification. Choose a lightweight encoder model that is amenable to your GPU size. Consider using DistilBERT, TinyBERT, MobileBERT, AlBERT, or others. **Specify the chosen LLM below.**

   **Chosen Encoder LLM:** DistilBERT (`distilbert-base-uncased`)

In [16]:
from transformers import AutoModelForSequenceClassification

# Loading the pre-trained LLM - DistilBERT 
model_name = 'distilbert-base-uncased'  
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)  

# Freezing all base model weights
for param in model.base_model.parameters():
    param.requires_grad = False  
print(model)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)


   b. Freeze all base model weights and attach a lightweight MLP (the classification head) that maps the model’s representations to binary labels. You may want to create a separate model class that defines these components and a forward function or use out of the box 🤗 classification wrappers.
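To make "lightweight" concrete, the sketch below counts trainable parameters for a 768 → 64 → 2 MLP head (768 is the DistilBERT hidden size). The helper `count_trainable` and the single frozen `nn.Linear` layer standing in for an encoder are hypothetical and used only for illustration.

```python
from torch import nn


def count_trainable(module: nn.Module) -> int:
    """Number of parameters that will receive gradient updates."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


hidden_size = 768  # DistilBERT hidden size
head = nn.Sequential(nn.Linear(hidden_size, 64), nn.ReLU(), nn.Linear(64, 2))

# Hypothetical stand-in for one frozen encoder layer
frozen_layer = nn.Linear(hidden_size, hidden_size)
for p in frozen_layer.parameters():
    p.requires_grad = False

head_params = count_trainable(head)            # 768*64 + 64 + 64*2 + 2
frozen_params = count_trainable(frozen_layer)  # nothing left to train

print(head_params, frozen_params)
```

Running the same helper on the full classifier is a quick way to confirm that only the head is being trained.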

In [17]:
### ADD YOUR CODE HERE ###
# Freeze all weights of the loaded LLM
# Define and attach a lightweight MLP classifier head

import torch
from torch import nn
from transformers import AutoModel

# Defining the custom model class with the classification head (MLP)
class SpamClassifier(nn.Module):
    def __init__(self, base_model_name='distilbert-base-uncased'):
        super(SpamClassifier, self).__init__()
        
        self.base_model = AutoModel.from_pretrained(base_model_name)
        
        for param in self.base_model.parameters():
            param.requires_grad = False  # Freezing

  
        self.classifier = nn.Sequential(
            nn.Linear(self.base_model.config.hidden_size, 64),  
            nn.ReLU(),
            nn.Linear(64, 2)  
        )
    
    def forward(self, input_ids, attention_mask):
  
        outputs = self.base_model(input_ids=input_ids, attention_mask=attention_mask)
        
        cls_output = outputs.last_hidden_state[:, 0, :]  
        

        logits = self.classifier(cls_output)
        
        return logits


spam_classifier = SpamClassifier(base_model_name='distilbert-base-uncased')

print(spam_classifier)


SpamClassifier(
  (base_model): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): L

   c. Use the [CLS] token if available or mean-pooled final hidden states from the LLM as input to your classifier head.
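When inputs are padded to a fixed length, a plain mean over the sequence dimension also averages the padding positions. The sketch below shows one common way to avoid that by weighting with the attention mask. The function name `masked_mean_pool` and the toy tensors are assumptions for illustration, not part of the assignment template.

```python
import torch


def masked_mean_pool(last_hidden_state, attention_mask):
    """Average hidden states over real tokens only (mask == 1)."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (B, T, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (B, H)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # avoid divide-by-zero
    return summed / counts


# One sequence, three positions, hidden size 2; the last position is padding.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])

cls_vector = hidden[:, 0, :]              # first-token representation
naive_mean = hidden.mean(dim=1)           # includes the padding position
masked_mean = masked_mean_pool(hidden, mask)

print(cls_vector, naive_mean, masked_mean)
```

The naive mean is pulled toward the padding vector, while the masked mean reflects only the two real tokens.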

In [18]:
### ADD YOUR CODE HERE ###
# Implement logic to extract [CLS] token or mean-pooled hidden states from the LLM's output

import torch
from torch import nn
from transformers import AutoModel

# Modifying the function to include an option for Mean pooled hidden states and  [CLS] token

class SpamClassifier(nn.Module):
    def __init__(self, base_model_name='distilbert-base-uncased', use_mean_pooling=False):
        super(SpamClassifier, self).__init__()
        
    
        self.base_model = AutoModel.from_pretrained(base_model_name)
        
      
        for param in self.base_model.parameters():
            param.requires_grad = False  

    
        self.classifier = nn.Sequential(
            nn.Linear(self.base_model.config.hidden_size, 64),  
            nn.ReLU(),
            nn.Linear(64, 2)  
        )
        
      
        self.use_mean_pooling = use_mean_pooling
    
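    def forward(self, input_ids, attention_mask):
        outputs = self.base_model(input_ids=input_ids, attention_mask=attention_mask)
        hidden = outputs.last_hidden_state

        if not self.use_mean_pooling:
            # Representation of the first ([CLS]) token
            pooled_output = hidden[:, 0, :]
        else:
            # Mean over real tokens only, so padding does not dilute the average
            mask = attention_mask.unsqueeze(-1).to(hidden.dtype)
            pooled_output = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)

        logits = self.classifier(pooled_output)
        return logits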


spam_classifier = SpamClassifier(base_model_name='distilbert-base-uncased', use_mean_pooling=False)


print(spam_classifier)


SpamClassifier(
  (base_model): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): L

3. Configure your training parameters (learning rate, batch size, epochs) and train the model using only the classifier head while the LLM remains frozen.
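One variation worth considering for a probing setup is to keep the frozen encoder in eval mode (so its dropout is off) and run it under `torch.no_grad()`, while only the head is in train mode and only its parameters are given to the optimizer. The sketch below demonstrates a single such step on a tiny hypothetical encoder; the layer sizes, random data, and learning rate are made up for the example and are not the notebook's settings.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-in for a frozen pre-trained encoder (not a real LLM)
encoder = nn.Sequential(nn.Linear(8, 8), nn.Dropout(p=0.1))
for p in encoder.parameters():
    p.requires_grad = False

head = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 2))
optimizer = torch.optim.Adam(head.parameters(), lr=1e-2)  # head parameters only
criterion = nn.CrossEntropyLoss()

x = torch.randn(16, 8)
y = torch.randint(0, 2, (16,))

encoder_before = [p.clone() for p in encoder.parameters()]
head_before = [p.clone() for p in head.parameters()]

head.train()
encoder.eval()                 # keep the frozen encoder deterministic
with torch.no_grad():          # no autograd graph for the frozen part
    features = encoder(x)

loss = criterion(head(features), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()

encoder_unchanged = all(
    torch.equal(a, b) for a, b in zip(encoder_before, encoder.parameters())
)
head_changed = any(
    not torch.equal(a, b) for a, b in zip(head_before, head.parameters())
)
print(encoder_unchanged, head_changed)
```

Skipping the autograd graph for the encoder also reduces memory use, since no activations need to be stored for a backward pass through it.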

In [19]:
### ADD YOUR CODE HERE ###
# Define training parameters (learning rate, batch size, epochs)
# Define loss function and optimizer (optimize only classifier head parameters)
# Implement training loop: forward pass, loss calculation, backward pass, optimizer step

import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from tqdm import tqdm 


learning_rate = 2e-5  
batch_size = 16       
epochs = 3            


criterion = torch.nn.CrossEntropyLoss()  
optimizer = Adam(spam_classifier.classifier.parameters(), lr=learning_rate)  


train_dataloader = DataLoader(tokenized_train, batch_size=batch_size, shuffle=True)
val_dataloader = DataLoader(tokenized_val, batch_size=batch_size)
test_dataloader = DataLoader(tokenized_test, batch_size=batch_size)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
spam_classifier.to(device)

for epoch in range(epochs):
    spam_classifier.train() 
    total_loss = 0
    correct_preds = 0
    total_preds = 0
    
    for batch in tqdm(train_dataloader, desc=f"Epoch {epoch + 1}/{epochs}"):
       
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)

       
        optimizer.zero_grad()

        
        outputs = spam_classifier(input_ids, attention_mask)
        loss = criterion(outputs, labels)

     
        loss.backward()

      
        optimizer.step()

      
        total_loss += loss.item()
        predictions = torch.argmax(outputs, dim=1)
        correct_preds += (predictions == labels).sum().item()
        total_preds += labels.size(0)
    
  
    avg_loss = total_loss / len(train_dataloader)
    accuracy = correct_preds / total_preds * 100

    print(f"Epoch {epoch + 1}/{epochs} - Loss: {avg_loss:.4f}, Accuracy: {accuracy:.2f}%")
    
    
    spam_classifier.eval() 
    val_loss = 0
    correct_preds = 0
    total_preds = 0
    
    with torch.no_grad():  
        for batch in val_dataloader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['label'].to(device)

            
            outputs = spam_classifier(input_ids, attention_mask)
            loss = criterion(outputs, labels)

         
            val_loss += loss.item()
            predictions = torch.argmax(outputs, dim=1)
            correct_preds += (predictions == labels).sum().item()
            total_preds += labels.size(0)
    
    avg_val_loss = val_loss / len(val_dataloader)
    val_accuracy = correct_preds / total_preds * 100

    print(f"Validation Loss: {avg_val_loss:.4f}, Validation Accuracy: {val_accuracy:.2f}%")


torch.save(spam_classifier.state_dict(), 'spam_classifier_model.pth')


Epoch 1/3: 100%|██████████| 1784/1784 [08:05<00:00,  3.68it/s]


Epoch 1/3 - Loss: 0.4293, Accuracy: 89.88%
Validation Loss: 0.2492, Validation Accuracy: 92.88%


Epoch 2/3: 100%|██████████| 1784/1784 [08:11<00:00,  3.63it/s]


Epoch 2/3 - Loss: 0.1963, Accuracy: 94.45%
Validation Loss: 0.1551, Validation Accuracy: 95.24%


Epoch 3/3: 100%|██████████| 1784/1784 [08:13<00:00,  3.61it/s]


Epoch 3/3 - Loss: 0.1450, Accuracy: 95.35%
Validation Loss: 0.1295, Validation Accuracy: 95.59%


4. Evaluation and Analysis:

   a. Evaluate the model on the test set using accuracy, precision, recall, and F1-score.

In [22]:
### ADD YOUR CODE HERE ###
# Evaluate the trained model on the test set
# Calculate and report accuracy, precision, recall, and F1-score

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from torch.utils.data import DataLoader
import torch
from tqdm import tqdm


test_dataloader = DataLoader(tokenized_test, batch_size=16)


spam_classifier.eval()


all_preds = []
all_labels = []


device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
spam_classifier.to(device)


with torch.no_grad():  
    for batch in tqdm(test_dataloader, desc="Evaluating"):
       
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)
        
        
        outputs = spam_classifier(input_ids=input_ids, attention_mask=attention_mask)
        
       
        logits = outputs  
        
        
        preds = torch.argmax(logits, dim=1)

      
        all_preds.extend(preds.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())


accuracy = accuracy_score(all_labels, all_preds)
precision = precision_score(all_labels, all_preds)
recall = recall_score(all_labels, all_preds)
f1 = f1_score(all_labels, all_preds)


print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")


Evaluating: 100%|██████████| 125/125 [00:35<00:00,  3.57it/s]

Accuracy: 0.9580
Precision: 0.9676
Recall: 0.9484
F1 Score: 0.9579
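
For reference, these four scores follow directly from the confusion counts for the positive class (label 1). The sketch below recomputes them in plain Python on a small made-up label list; the helper `binary_metrics` is hypothetical and the labels are illustrative, not taken from the test set.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 from raw confusion counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Illustrative labels: 4 spam (1) and 6 non-spam (0) emails
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Here one spam email is missed and one legitimate email is flagged, so precision and recall are both 3/4 while accuracy is 8/10.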





   b. Select **two** encoder LLMs, repeat steps 2-4 for the second LLM, and compare and discuss any performance trends between the two models. **Specify the second chosen LLM below and report performance comparison.**

   **Second Chosen Encoder LLM:** TinyBERT (`huawei-noah/TinyBERT_General_4L_312D`)

   Reason for choosing TinyBERT:

   **TinyBERT** is a good choice because it is a compact, efficient model optimized for fast inference and low resource usage, yet it still performs well on many tasks. Comparing it with a larger model like DistilBERT highlights the trade-off between speed and accuracy.


The cells below define the second model and repeat the steps performed above for DistilBERT.

In [25]:
# transformers.AdamW is deprecated in favor of the PyTorch implementation
from torch.optim import AdamW


In [24]:

model_name_tinybert = 'huawei-noah/TinyBERT_General_4L_312D'  
model_tinybert = AutoModelForSequenceClassification.from_pretrained(model_name_tinybert, num_labels=2) 



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at huawei-noah/TinyBERT_General_4L_312D and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [26]:
for param in model_tinybert.base_model.parameters():
    param.requires_grad = False  


class SpamClassifierTinyBERT(nn.Module):
    def __init__(self, base_model_name='huawei-noah/TinyBERT_General_4L_312D', use_mean_pooling=False):
        super(SpamClassifierTinyBERT, self).__init__()
        
        
        self.base_model = AutoModel.from_pretrained(base_model_name)
        
        
        for param in self.base_model.parameters():
            param.requires_grad = False 

        self.classifier = nn.Sequential(
            nn.Linear(self.base_model.config.hidden_size, 64), 
            nn.ReLU(),
            nn.Linear(64, 2)  
        )
        
        self.use_mean_pooling = use_mean_pooling
    
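    def forward(self, input_ids, attention_mask):
        outputs = self.base_model(input_ids=input_ids, attention_mask=attention_mask)
        hidden = outputs.last_hidden_state

        if not self.use_mean_pooling:
            # Representation of the first ([CLS]) token
            pooled_output = hidden[:, 0, :]
        else:
            # Mean over real tokens only, so padding does not dilute the average
            mask = attention_mask.unsqueeze(-1).to(hidden.dtype)
            pooled_output = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)

        logits = self.classifier(pooled_output)
        return logits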

TinyBERT's Model Architecture 

In [27]:

spam_classifier_tinybert = SpamClassifierTinyBERT(base_model_name='huawei-noah/TinyBERT_General_4L_312D', use_mean_pooling=False)
print(spam_classifier_tinybert)


SpamClassifierTinyBERT(
  (base_model): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 312, padding_idx=0)
      (position_embeddings): Embedding(512, 312)
      (token_type_embeddings): Embedding(2, 312)
      (LayerNorm): LayerNorm((312,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-3): 4 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=312, out_features=312, bias=True)
              (key): Linear(in_features=312, out_features=312, bias=True)
              (value): Linear(in_features=312, out_features=312, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=312, out_features=312, bias=True)
              (LayerNorm): LayerNorm((312,), eps=1e-12

In [28]:
### ADD YOUR CODE HERE ###
# Repeat steps 2-4 for the second chosen LLM
# Implement code for performance comparison and trend analysis



learning_rate = 1e-5
batch_size = 16
epochs = 3
# Optimize only the classifier head; the base model is frozen
optimizer_tinybert = AdamW(spam_classifier_tinybert.classifier.parameters(), lr=learning_rate)
loss_fn_tinybert = nn.CrossEntropyLoss()


spam_classifier_tinybert.to(device)

# Training the TinyBERT model 
for epoch in range(epochs):
    spam_classifier_tinybert.train()
    total_loss = 0
    total_preds = 0

    for batch in tqdm(train_dataloader, desc=f"Epoch {epoch + 1}/{epochs}"):
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)

      
        optimizer_tinybert.zero_grad()
        outputs = spam_classifier_tinybert(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_fn_tinybert(outputs, labels)

       
        loss.backward()
        optimizer_tinybert.step()

        total_loss += loss.item()
        preds = torch.argmax(outputs, dim=1)
        total_preds += torch.sum(preds == labels).item()

    avg_loss = total_loss / len(train_dataloader)
    accuracy = total_preds / len(train_dataloader.dataset)

    print(f"Epoch {epoch + 1}/{epochs} - Loss: {avg_loss:.4f} - Accuracy: {accuracy:.4f}")





Epoch 1/3: 100%|██████████| 1784/1784 [01:29<00:00, 19.94it/s]


Epoch 1/3 - Loss: 0.6746 - Accuracy: 0.6579


Epoch 2/3: 100%|██████████| 1784/1784 [01:29<00:00, 19.94it/s]


Epoch 2/3 - Loss: 0.6312 - Accuracy: 0.7435


Epoch 3/3: 100%|██████████| 1784/1784 [01:30<00:00, 19.82it/s]

Epoch 3/3 - Loss: 0.5844 - Accuracy: 0.7621





In [29]:
# Evaluate TinyBERT model on the test set
test_dataloader = DataLoader(tokenized_test, batch_size=batch_size)

spam_classifier_tinybert.eval()
all_preds_tinybert = []
all_labels_tinybert = []

with torch.no_grad():
    for batch in tqdm(test_dataloader, desc="Evaluating TinyBERT"):
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)

        outputs = spam_classifier_tinybert(input_ids=input_ids, attention_mask=attention_mask)
        logits = outputs
        preds = torch.argmax(logits, dim=1)

        all_preds_tinybert.extend(preds.cpu().numpy())
        all_labels_tinybert.extend(labels.cpu().numpy())

accuracy_tinybert = accuracy_score(all_labels_tinybert, all_preds_tinybert)
precision_tinybert = precision_score(all_labels_tinybert, all_preds_tinybert)
recall_tinybert = recall_score(all_labels_tinybert, all_preds_tinybert)
f1_tinybert = f1_score(all_labels_tinybert, all_preds_tinybert)

print(f"TinyBERT Evaluation Results:")
print(f"Accuracy: {accuracy_tinybert:.4f}")
print(f"Precision: {precision_tinybert:.4f}")
print(f"Recall: {recall_tinybert:.4f}")
print(f"F1 Score: {f1_tinybert:.4f}")

Evaluating TinyBERT: 100%|██████████| 125/125 [00:05<00:00, 21.65it/s]

TinyBERT Evaluation Results:
Accuracy: 0.7790
Precision: 0.8083
Recall: 0.7361
F1 Score: 0.7705





   **Performance Comparison and Trend Discussion:**

<span style='color:green'>### YOUR ANSWER ###</span>


Below is a comparison of the performance metrics of the two models evaluated: 


**TinyBERT Evaluation Results:**

Accuracy: 0.7790


Precision: 0.8083


Recall: 0.7361


F1 Score: 0.7705


**DistilBERT Evaluation Results:**

Accuracy: 0.9580


Precision: 0.9676


Recall: 0.9484


F1 Score: 0.9579

**Trend Analysis:**
TinyBERT is lighter and faster (about 21.65 it/s versus 3.57 it/s for DistilBERT during test evaluation), but DistilBERT clearly outperforms it on this task. TinyBERT's smaller architecture trades predictive performance for efficiency, which shows up as lower accuracy, precision, recall, and F1 score. Part of the gap may also come from the training setup: the TinyBERT head used a lower learning rate (1e-5 versus 2e-5), and its training loss was still decreasing after three epochs (0.6746 to 0.5844).


DistilBERT is the more powerful model here: its frozen representations capture richer features of the text, leading to significantly better results. It reaches 95.8% accuracy, well ahead of TinyBERT and comfortably above the target levels.
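
The size of the trade-off can be put in numbers with a few lines of arithmetic. The sketch below uses only the test metrics and the evaluation throughput (iterations per second) printed in the cells above; the dictionary and variable names are illustrative.

```python
# Test-set metrics reported above
distilbert = {"accuracy": 0.9580, "precision": 0.9676, "recall": 0.9484, "f1": 0.9579}
tinybert = {"accuracy": 0.7790, "precision": 0.8083, "recall": 0.7361, "f1": 0.7705}

# Absolute gap per metric (DistilBERT minus TinyBERT)
gaps = {name: round(distilbert[name] - tinybert[name], 4) for name in distilbert}

# Evaluation throughput from the progress bars (batches of 16 per second)
distilbert_its = 3.57
tinybert_its = 21.65
speedup = round(tinybert_its / distilbert_its, 2)

for name, gap in gaps.items():
    print(f"{name}: DistilBERT ahead by {gap * 100:.2f} points")
print(f"TinyBERT evaluates about {speedup}x faster")
```

Recall shows the largest gap, so the smaller model's main weakness in this run is missed spam rather than false alarms.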

   c. The best model is expected to attain {Accuracy: >85%, F1: >85%, Precision: >85%, Recall: >82%}. Report whether your best model achieves these metrics and discuss.


DistilBERT is the best model, and it exceeds every expected metric:

Accuracy: >85% (achieved 95.80%)

Precision: >85% (achieved 96.76%)

Recall: >82% (achieved 94.84%)

F1 Score: >85% (achieved 95.79%)

Therefore, the best model (DistilBERT) meets and surpasses all target performance levels, i.e., it is highly effective for this task.


- DistilBERT performs better largely because it is larger than TinyBERT (6 layers with hidden size 768 versus 4 layers with hidden size 312), so its frozen representations can encode more complex patterns in the text.

- DistilBERT is a distilled version of BERT that retains most of BERT's language understanding while using fewer resources. Its richer contextual representations support more accurate predictions.

- Furthermore, pre-training on a large text corpus lets the model reuse learned language patterns, which helps it generalize to downstream tasks such as spam classification. This combination of depth, efficiency, and pre-training explains its strong results across all the important metrics.

   **Performance vs. Expected Metrics Discussion:**

<span style='color:green'>### YOUR ANSWER ###</span>

**The expected performance metrics are:**

Accuracy: >85%

Precision: >85%

Recall: >82%

F1 Score: >85%
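
A small check like the one below makes the comparison against these thresholds explicit. It is only a sketch: the helper `meets_targets` is hypothetical, and the numbers are the test metrics reported earlier in this notebook.

```python
targets = {"accuracy": 0.85, "precision": 0.85, "recall": 0.82, "f1": 0.85}

results = {
    "DistilBERT": {"accuracy": 0.9580, "precision": 0.9676, "recall": 0.9484, "f1": 0.9579},
    "TinyBERT": {"accuracy": 0.7790, "precision": 0.8083, "recall": 0.7361, "f1": 0.7705},
}


def meets_targets(metrics, targets):
    """Map each metric name to True if it strictly exceeds its target."""
    return {name: metrics[name] > threshold for name, threshold in targets.items()}


report = {model: meets_targets(metrics, targets) for model, metrics in results.items()}

for model, checks in report.items():
    print(model, checks, "all met:", all(checks.values()))
```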

**TinyBERT Model:**

Accuracy: 77.90%

TinyBERT falls below the >85% accuracy target. It is a smaller model with fewer parameters than DistilBERT, so its frozen representations capture less of the complexity in the text, which makes it less accurate.

Precision: 80.83%

TinyBERT's precision is moderate but still short of the >85% target. Most of the emails it flags as positive are correct, but it still produces a noticeable number of false positives.

Recall: 73.61%

Recall is well below the >82% target. TinyBERT misses a substantial share of true positives, which matters in a spam detection task where we want to catch as many spam messages as possible.

F1 Score: 77.05%

The F1 score is also below the >85% target. Precision is higher than recall, so the shortfall comes mainly from missed positives, and overall performance remains lower than that of the larger model.

**DistilBERT Model:**

Accuracy: 95.80%

DistilBERT comfortably clears the >85% accuracy target, showing that it classifies the test emails reliably.

Precision: 96.76%

DistilBERT's precision is excellent and far above the >85% target: when it predicts the positive class, it is rarely wrong, so false positives are few.

Recall: 94.84%

Recall is also excellent, well above the >82% target. The model captures most of the true positive instances, which is highly relevant for spam filtering.

F1 Score: 95.79%

DistilBERT's F1 score is considerably higher than the 85% target, confirming a strong balance between precision and recall.

**Conclusion:**

Based on the evaluation metrics, DistilBERT performs best. It exceeds all the target values for accuracy, precision, recall, and F1 score, and it is the more appropriate choice for this task because it is larger and can capture more subtle patterns in the text.


TinyBERT performs worse on this task, but as a lightweight model it is an apt choice when efficiency matters more than peak performance. Its smaller size gives faster inference, which suits resource-constrained environments. In predictive performance, however, it does not match the larger DistilBERT model.

5. References. Include details on all the resources used to complete this part.

<span style='color:green'>### YOUR ANSWER ###</span>

https://medium.com/@azimkhan8018/email-spam-detection-with-machine-learning-a-comprehensive-guide-b65c6936678b

https://www.kaggle.com/datasets/balaka18/email-spam-classification-dataset-csv

https://www.kaggle.com/datasets/tapakah68/email-spam-classification/code

https://link.springer.com/article/10.1007/s10207-023-00756-1

https://hugobowne.github.io/hugo-blog/posts/fine-tuning-llms-gpt-2/

https://huggingface.co/datasets/SetFit/enron_spam

https://heartbeat.comet.ml/using-transfer-learning-and-pre-trained-language-models-to-classify-spam-549fc0f56c20

https://blog.madhukaraphatak.com/bert-email-spam-1

https://www.analyticsvidhya.com/blog/2020/07/transfer-learning-for-nlp-fine-tuning-bert-for-text-classification/

https://medium.com/@varun.tyagi83/introducing-the-spam-detection-model-with-pre-trained-llm-3eb1f8186ba1

