# Fine-Tuning BERT for Phishing URL Identification

[Reference Article: Fine-Tuning BERT for Text Classification by Shaw Talebi](https://towardsdatascience.com/fine-tuning-bert-for-text-classification-a01f89b179fc)

# Outline

1. Imports
2. Load training, test, and validation data
3. Load the pre-trained model and tokenizer
4. Load a binary classification head
5. Freeze all base model parameters
6. Unfreeze base model pooling layers
7. Tokenize data
8. Create data collator
9. Load evaluation metrics
10. Define hyperparameters
11. Create a trainer
12. Train the model
13. Validate the model

# 1. Imports

In [1]:
from datasets import DatasetDict, Dataset, load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)
import evaluate
import numpy as np


# 2. Load training, test, and validation data

In [2]:
dataset_dict = load_dataset("shawhin/phishing-site-classification")

# 3. Load the pre-trained model and tokenizer

In [3]:
model_path = "google-bert/bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_path)
id2label = {0: "Safe", 1: "Not Safe"}
label2id = {"Safe": 0, "Not Safe": 1}

# 4. Load a binary classification head

In [4]:
model = AutoModelForSequenceClassification.from_pretrained(model_path, num_labels=2, id2label=id2label, label2id=label2id)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# 5. Freeze all base model parameters

In [5]:
# Freeze every parameter of the BERT encoder; the new classification head stays trainable
for name, param in model.base_model.named_parameters():
    param.requires_grad = False

# 6. Unfreeze base model pooling layers

In [6]:
# Re-enable gradients for the pooler (the dense layer applied to the [CLS] token)
for name, param in model.base_model.named_parameters():
    if "pooler" in name:
        param.requires_grad = True

# 7. Tokenize data

In [7]:
def preprocess_data(examples):
    # Tokenize the URL text, truncating anything beyond the model's maximum input length
    return tokenizer(examples["text"], truncation=True)

tokenized_data = dataset_dict.map(preprocess_data, batched=True)

Map: 100%|██████████| 450/450 [00:00<00:00, 15753.06 examples/s]


# 8. Create data collator

In [8]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
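
`DataCollatorWithPadding` pads each batch to the length of its longest sequence instead of padding the whole dataset to one global length. The idea can be sketched in plain Python. Here `pad_batch` is a hypothetical helper, not the library's implementation, and the token IDs are made up for illustration (0 is the `[PAD]` ID in `bert-base-uncased`):

```python
def pad_batch(features, pad_token_id=0):
    """Pad every sequence in a batch to the length of the longest one."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        # Padding tokens are appended on the right and masked out with 0s
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
    return batch


# Two toy "tokenized URLs" of different lengths (illustrative IDs)
features = [
    {"input_ids": [101, 2023, 102], "attention_mask": [1, 1, 1]},
    {"input_ids": [101, 2023, 2003, 1037, 102], "attention_mask": [1, 1, 1, 1, 1]},
]

batch = pad_batch(features)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch keeps short URLs from being padded to the longest example in the whole dataset, which saves computation.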

# 9. Load evaluation metrics

In [9]:
accuracy = evaluate.load("accuracy")
auc_score = evaluate.load("roc_auc")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    
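    # Softmax over the logits (shifted by the row max for numerical stability)
    exp_logits = np.exp(predictions - predictions.max(axis=-1, keepdims=True))
    probabilities = exp_logits / exp_logits.sum(axis=-1, keepdims=True)
    
    # Probability of the positive class (label 1, "Not Safe")
    positive_class_probs = probabilities[:, 1]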
    
    auc = np.round(auc_score.compute(prediction_scores=positive_class_probs, references=labels)['roc_auc'], 3)

    predicted_classes = np.argmax(predictions, axis=1)
    acc = np.round(accuracy.compute(predictions=predicted_classes, references=labels)['accuracy'], 3)
    
    return {"Accuracy": acc, "AUC": auc}
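
Softmax is commonly computed after subtracting each row's maximum logit. This leaves the probabilities unchanged but avoids overflow in `np.exp`. A minimal sketch follows, in which `stable_softmax` is a hypothetical helper and the logits are made up for illustration:

```python
import numpy as np


def stable_softmax(logits):
    """Row-wise softmax that subtracts the row max before exponentiating."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Illustrative logits for three examples and two classes;
# the last row would overflow a naive exp(x) / sum(exp(x))
logits = np.array([[2.0, 0.5], [-1.0, 3.0], [1000.0, 1001.0]])
probs = stable_softmax(logits)

print(probs.sum(axis=-1))      # each row sums to 1
print(probs.argmax(axis=-1))   # same classes as logits.argmax(axis=-1)
```

Because softmax preserves the ordering of the logits, taking `argmax` over the raw logits gives the same predicted classes as taking it over the probabilities. Only the AUC computation needs the probabilities themselves.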
    

# 10. Define hyperparameters

In [10]:
lr = 2e-4
batch_size = 8
num_epochs = 10

training_args = TrainingArguments(
    output_dir="bert-phishing-classifier_teacher",
    learning_rate=lr,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    logging_strategy="epoch",
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

# 11. Create a trainer

In [11]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["test"],
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# 12. Train the model

In [12]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy | AUC |
|---|---|---|---|---|
| 1 | 0.4956 | 0.388122 | 0.816 | 0.909 |
| 2 | 0.4017 | 0.341812 | 0.84 | 0.929 |
| 3 | 0.3732 | 0.32256 | 0.86 | 0.939 |
| 4 | 0.3488 | 0.371792 | 0.849 | 0.942 |
| 5 | 0.3458 | 0.299914 | 0.873 | 0.945 |
| 6 | 0.3351 | 0.307074 | 0.86 | 0.946 |
| 7 | 0.324 | 0.29363 | 0.878 | 0.948 |
| 8 | 0.3199 | 0.300993 | 0.856 | 0.95 |
| 9 | 0.3319 | 0.289078 | 0.869 | 0.951 |
| 10 | 0.3221 | 0.294266 | 0.864 | 0.951 |


TrainOutput(global_step=2630, training_loss=0.3598194542946471, metrics={'train_runtime': 109.3245, 'train_samples_per_second': 192.089, 'train_steps_per_second': 24.057, 'total_flos': 706603239165360.0, 'train_loss': 0.3598194542946471, 'epoch': 10.0})
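
With `save_strategy="epoch"`, the `Trainer` writes one `checkpoint-<global_step>` directory per epoch, and `load_best_model_at_end=True` reloads the checkpoint with the best metric (validation loss, unless `metric_for_best_model` is set). The sketch below maps the logged epochs to checkpoint names using the numbers from the training log above. The variable names are illustrative, and it assumes every epoch ran the same number of steps and no checkpoints were pruned:

```python
# Values copied from the training log
global_step = 2630
num_epochs = 10
val_loss = [0.388122, 0.341812, 0.32256, 0.371792, 0.299914,
            0.307074, 0.29363, 0.300993, 0.289078, 0.294266]

# Optimizer steps per epoch, assuming every epoch has the same number of steps
steps_per_epoch = global_step // num_epochs

# Epoch (1-indexed) with the lowest validation loss
best_epoch = min(range(1, num_epochs + 1), key=lambda e: val_loss[e - 1])
best_checkpoint = f"checkpoint-{best_epoch * steps_per_epoch}"

print(steps_per_epoch)              # 263
print(best_epoch, best_checkpoint)  # 9 checkpoint-2367
```

Under these assumptions the lowest validation loss belongs to epoch 9, whereas `checkpoint-2630` is the checkpoint saved after the final epoch.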

# 13. Validate the model

In [13]:
predictions = trainer.predict(tokenized_data["validation"])

logits = predictions.predictions
labels = predictions.label_ids

metrics = compute_metrics((logits, labels))
print(metrics)

{'Accuracy': np.float64(0.884), 'AUC': np.float64(0.946)}


# Push the model to Hugging Face

1. Log in to Hugging Face

    ```
    huggingface-cli login
    ```
    - Enter your Hugging Face access token when prompted
2. Push the model to Hugging Face

In [5]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Final checkpoint written by the training run above (global step 2630, epoch 10)
model_path = "bert-phishing-classifier_teacher/checkpoint-2630"

# Load the model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Define the repository name
repo_name = "andrewcyeow/phishing_url_model"

# Push the model and tokenizer to Hugging Face
model.push_to_hub(repo_name)
tokenizer.push_to_hub(repo_name)

model.safetensors: 100%|██████████| 438M/438M [00:19<00:00, 22.8MB/s]   


CommitInfo(commit_url='https://huggingface.co/andrewcyeow/phishing_url_model/commit/33316f29b6a7e6437e36ea0da8b37a5b6326bfe4', commit_message='Upload tokenizer', commit_description='', oid='33316f29b6a7e6437e36ea0da8b37a5b6326bfe4', pr_url=None, repo_url=RepoUrl('https://huggingface.co/andrewcyeow/phishing_url_model', endpoint='https://huggingface.co', repo_type='model', repo_id='andrewcyeow/phishing_url_model'), pr_revision=None, pr_num=None)

# Further Resources

[Compressing Large Language Models (LLMs)](https://towardsdatascience.com/compressing-large-language-models-llms-9f406eea5b5e)

[KL Divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence)

[QLoRA — How to Fine-Tune an LLM on a Single GPU](https://towardsdatascience.com/qlora-how-to-fine-tune-an-llm-on-a-single-gpu-4e44d6b5be32)
