# Lightweight Fine-Tuning Project

TODO: In this cell, describe your choices for each of the following

* PEFT technique: LoRA
* Model: distilbert-base-uncased
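* Evaluation approach: Hugging Face `Trainer.evaluate()` with accuracy computed on a held-out 20% test split
* Fine-tuning dataset: ealvaradob/phishing-dataset (`combined_reduced` configuration, random sample of 15,000 examples)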

## Loading and Evaluating a Foundation Model

TODO: In the cells below, load your chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

In [1]:
!pip install -U scikit-learn -q
!pip install peft -q

In [2]:
import numpy as np
import torch
from datasets import load_dataset, Dataset
from sklearn.model_selection import train_test_split
import gc
from transformers import AutoModelForSequenceClassification, AutoTokenizer, \
    DataCollatorWithPadding, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model, TaskType

In [3]:
# verify the compute resource
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"training on: {device}")

training on: cuda


In [4]:
# REQUIRED FUNCTIONS

def get_train_test_sets(num_examples, seed=42, test_size=0.2):
    """
    Load the phishing dataset from Hugging Face and split a random sample into train and test sets.

    Args:
        num_examples: (int) number of examples to sample before splitting
        seed: (int) random seed that makes the sample and the split reproducible
        test_size: (float) fraction of the sample to hold back for evaluation

    Returns:
        tuple of (train, test) Dataset objects
    """
    
    dataset = load_dataset("ealvaradob/phishing-dataset", "combined_reduced", trust_remote_code=True).shuffle(seed=seed)
    display(dataset)
    
    # turn Dataset object into a pandas df in order to get a random sample and use sklearn train_test_split
    df = dataset['train'].to_pandas().sample(n=num_examples, random_state=seed)

    # Delete the original dataset to free up memory
    del dataset

    # Run the garbage collector
    gc.collect()

    # preview df
    display(df.head())
    print(f"num samples: {len(df)}")
    
    # split into train and test sets
    train, test = train_test_split(df, test_size=test_size, shuffle=True, random_state=seed)

    # convert back into Dataset objects
    train, test = Dataset.from_pandas(train, preserve_index=False), Dataset.from_pandas(test, preserve_index=False)
    
    return train, test


def tokenize_dataset(train, test, model_name="distilbert-base-uncased"):
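    """Tokenize the train and test splits; return both along with the tokenizer."""

    tokenizer = AutoTokenizer.from_pretrained(model_name)

    def preprocess_function(examples):
        """Tokenize a batch of examples, padding and truncating to the model's maximum length."""
        # No return_tensors here: Dataset.map stores the output as Arrow columns,
        # and the data collator builds the tensors at batch time
        return tokenizer(examples["text"], padding="max_length", truncation=True)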

    train = train.map(preprocess_function, batched=True)
    test = test.map(preprocess_function, batched=True)
    
    return train, test, tokenizer


def build_model(model_name="distilbert-base-uncased", requires_grad=False):
    """Load a binary sequence classifier; freeze the base model unless requires_grad=True."""
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=2,
        id2label={0: "benign", 1: "phishing"},
        label2id={"benign": 0, "phishing": 1},
    )

    # Freeze the base model by default; the classification head stays trainable
    for param in model.base_model.parameters():
        param.requires_grad = requires_grad

    return model


def compute_metrics(eval_pred):
    """Compute accuracy from the (logits, labels) pair supplied by the Trainer."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=1)
    return {"accuracy": (predictions == labels).mean()}


def build_trainer(model, train, test, tokenizer, dir_name, lr, batch_size, epochs):
    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir="./data/" + dir_name,
            # Set the learning rate
            learning_rate = lr,
            # Set the per device train batch size and eval batch size
            per_device_train_batch_size = batch_size,
            per_device_eval_batch_size = batch_size,
            # Evaluate and save the model after each epoch
            evaluation_strategy = 'epoch',
            save_strategy = 'epoch',
            num_train_epochs=epochs,
            weight_decay=0.01,
            load_best_model_at_end=True,
        ),
        train_dataset=train,
        eval_dataset=test,
        tokenizer=tokenizer,
        data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
        compute_metrics=compute_metrics,
    )
    return trainer


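def run_inference(index, dataset=None, model=None, tokenizer=None):
    """Print the predicted and actual label for one example; defaults to the lora_* globals."""
    # Resolve defaults at call time: the lora_* objects are created in later cells,
    # so using them as default arguments raises NameError when this cell runs
    dataset = lora_test if dataset is None else dataset
    model = lora_model if model is None else model
    tokenizer = lora_tokenizer if tokenizer is None else tokenizer
    label_map = {0: 'benign', 1: 'phishing'}

    sample = dataset[index]['text']
    label = dataset[index]['label']
    # Truncate long texts to the model's limit and place inputs on the model's device
    device = next(model.parameters()).device
    inputs = tokenizer(sample, truncation=True, return_tensors='pt').to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    predicted_label = outputs.logits.argmax(dim=1).item()
    print(f"sample_text: {sample}\npredicted: {label_map[predicted_label]}\nactual:{label_map[label]}")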

In [5]:
# load dataset and get train/test splits
train, test = get_train_test_sets(num_examples=15000)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 77677
    })
})

| | text | label |
|---|---|---|
| 77156 | `https://warriorplus.com/support/admin/password...` | 0 |
| 28465 | `mlssoccer.com/videos?id=21318` | 0 |
| 12465 | `Robert Harley writes:\n> Chuck Murcko wrote:> ...` | 0 |
| 39444 | `On the topic ofIt is time to refinance!Your cr...` | 1 |
| 14496 | `<!doctypehtml><html ng-app=app ng-strict-di><t...` | 1 |


num samples: 15000


In [6]:
# view an example
sample = train[2]['text']
print(f"sample text: {sample}")

# check the character length of the longest text; tokenization truncates to the model's limit anyway
max_length = max(len(example['text']) for example in train)
    
print(f"length of longest text string: {max_length}")

sample text: http://line2329.top/
length of longest text string: 120761


### Tokenize the Data

In [7]:
# tokenize the train and test splits
train, test, tokenizer = tokenize_dataset(train, test)

# Show first example of tokenized training set
print(train[0]['input_ids'])

[101, 1026, 999, 9986, 13874, 11039, 19968, 1028, 1026, 16129, 17576, 1027, 1000, 13958, 1024, 16770, 1024, 1013, 1013, 13958, 2361, 1012, 2033, 1013, 24978, 1001, 1000, 16101, 1027, 8318, 2099, 11374, 1027, 4372, 1028, 1026, 4957, 17850, 12879, 1027, 1013, 6991, 1013, 7661, 1013, 19804, 11261, 2102, 1035, 4323, 1013, 7045, 1013, 4871, 1013, 6207, 1011, 3543, 1011, 12696, 1012, 1052, 3070, 2128, 2140, 1027, 6207, 1011, 3543, 1011, 12696, 1028, 1026, 4957, 2004, 1027, 5896, 2892, 10050, 11528, 17850, 12879, 1027, 1013, 1013, 7045, 1012, 18106, 11927, 2213, 1012, 4012, 1013, 1021, 2050, 2620, 2050, 2683, 2620, 3207, 2692, 21619, 2509, 26337, 2098, 2692, 2629, 2497, 2683, 2620, 2850, 27531, 2487, 2094, 2575, 2497, 21926, 20842, 2575, 20961, 2581, 9468, 1013, 5871, 29521, 1011, 6021, 2683, 2581, 6679, 26224, 16048, 14141, 27531, 25746, 2487, 3401, 2063, 16086, 12521, 2629, 2094, 21057, 21057, 17465, 2050, 2475, 2546, 2475, 2094, 22394, 2629, 1012, 1046, 2015, 2128, 2140, 1027, 3653, 11066,
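
The tokenizer call above pads every example to the model's maximum length, while the trainer is also given a `DataCollatorWithPadding`. When inputs are already padded to a fixed length, the collator has nothing left to pad. A common alternative is dynamic padding, where each batch is padded only to its own longest sequence. The sketch below illustrates that idea in plain Python; `pad_batch` is a hypothetical helper written for this note, the token IDs are illustrative, and the pad ID of 0 is taken from `padding_idx=0` in the printed model architecture.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in the batch to the length of the longest one."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two already-tokenized examples of different lengths (illustrative IDs)
batch = [[101, 7592, 102], [101, 7592, 2088, 999, 102]]
padded = pad_batch(batch)

print(padded["input_ids"])       # [[101, 7592, 102, 0, 0], [101, 7592, 2088, 999, 102]]
print(padded["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

With short inputs such as bare URLs, padding per batch rather than to the full maximum length would likely reduce the amount of padding processed, although that was not measured in this notebook.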

### Load and setup the model

In [8]:
model = build_model()
print(model)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.weight', 'classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

In [9]:
# build a Hugging Face Trainer; here it is used only to evaluate the untuned baseline
trainer = build_trainer(model, train, test, tokenizer, dir_name='phishing_or_benign', 
                        lr=2e-5, batch_size=16, epochs=1)

In [10]:
trainer.evaluate()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'eval_loss': 0.6962440013885498,
 'eval_accuracy': 0.45,
 'eval_runtime': 46.8856,
 'eval_samples_per_second': 63.986,
 'eval_steps_per_second': 4.01}
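
The baseline loss is close to what a classifier with no information would produce. For two classes, predicting probability 0.5 for each gives a cross-entropy of ln 2. The short check below compares that value with the reported `eval_loss`; it is a sanity check on the numbers above, not part of the original notebook.

```python
import math

# Cross-entropy of a classifier that assigns probability 0.5 to each of 2 classes
chance_loss = -math.log(0.5)
reported_loss = 0.6962440013885498  # eval_loss from the baseline evaluation above

print(f"chance-level loss: {chance_loss:.4f}")                  # chance-level loss: 0.6931
print(f"gap to reported:   {reported_loss - chance_loss:.4f}")  # gap to reported:   0.0031
```

This is consistent with the warning that the classification head is newly initialized: before fine-tuning, the model's predictions carry essentially no signal about the labels.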

## Performing Parameter-Efficient Fine-Tuning

TODO: In the cells below, create a PEFT model from your loaded model, run a training loop, and save the PEFT model weights.

In [12]:
lora_train, lora_test = get_train_test_sets(num_examples=15000)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 77677
    })
})

| | text | label |
|---|---|---|
| 77156 | `https://warriorplus.com/support/admin/password...` | 0 |
| 28465 | `mlssoccer.com/videos?id=21318` | 0 |
| 12465 | `Robert Harley writes:\n> Chuck Murcko wrote:> ...` | 0 |
| 39444 | `On the topic ofIt is time to refinance!Your cr...` | 1 |
| 14496 | `<!doctypehtml><html ng-app=app ng-strict-di><t...` | 1 |


num samples: 15000


In [13]:
lora_train, lora_test, lora_tokenizer = tokenize_dataset(lora_train, lora_test)

In [14]:
lora_train

Dataset({
    features: ['text', 'label', 'input_ids', 'attention_mask'],
    num_rows: 12000
})

In [15]:
lora_model = build_model()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.weight', 'classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [20]:

# create a PEFT config
config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,
    lora_alpha=32,
    lora_dropout=0.10,
    bias="none",
    target_modules=["q_lin", "k_lin", "v_lin", "out_lin", "ffn.lin1", "ffn.lin2"],  # attention and feed-forward linear layers
)

# create a PEFT model
lora_model = get_peft_model(lora_model, config)
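
To make the config above more concrete, the following is a minimal NumPy sketch of what a LoRA-adapted linear layer computes, assuming the standard formulation: the frozen weight `W` is left untouched and a low-rank update `B @ A`, scaled by `lora_alpha / r`, is added to its output. The variable names are illustrative and this is not PEFT's implementation; dropout and bias are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 768, 768   # size of q_lin in the printed architecture
r, lora_alpha = 8, 32    # values from the LoraConfig above
scaling = lora_alpha / r

W = rng.normal(size=(d_out, d_in))          # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))  # trainable, small random init
B = np.zeros((d_out, r))                    # trainable, zero init


def lora_forward(x):
    """Base projection plus the scaled low-rank update."""
    return x @ W.T + scaling * (x @ A.T @ B.T)


x = rng.normal(size=(2, d_in))
# Because B starts at zero, the adapted layer initially matches the frozen layer
print(np.allclose(lora_forward(x), x @ W.T))  # True
print(scaling)                                # 4.0
```

Only `A` and `B` receive gradients during training, which is why the adapter adds so few trainable parameters relative to the base model.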

In [23]:
lora_model.print_trainable_parameters()

trainable params: 1,847,812 || all params: 68,210,692 || trainable%: 2.708977061836581
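
The reported count can be reproduced by hand. Each adapted linear layer gains `r * in_features + out_features * r` LoRA weights. The sketch below uses the dimensions from the printed architecture plus DistilBERT's feed-forward width of 3072, which the truncated printout does not show. The reported total matches the LoRA weights plus the classification head counted twice; that this reflects PEFT keeping a trainable copy of the head next to the original is an inference from the arithmetic, not something the output states.

```python
hidden, ffn_dim, num_layers, r = 768, 3072, 6, 8


def lora_params(in_features, out_features, rank):
    """LoRA adds A (rank x in) and B (out x rank) to each adapted linear layer."""
    return rank * in_features + out_features * rank


attention = 4 * lora_params(hidden, hidden, r)  # q_lin, k_lin, v_lin, out_lin
ffn = lora_params(hidden, ffn_dim, r) + lora_params(ffn_dim, hidden, r)  # lin1, lin2
lora_total = num_layers * (attention + ffn)

# pre_classifier (768 -> 768) and classifier (768 -> 2), weights plus biases
head = (hidden * hidden + hidden) + (hidden * 2 + 2)

print(lora_total)             # 663552
print(head)                   # 592130
print(lora_total + 2 * head)  # 1847812
```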


In [25]:
lora_trainer = build_trainer(lora_model, lora_train, lora_test, 
                             lora_tokenizer, dir_name='lora_phishing_or_benign', lr=2e-5, batch_size=16, epochs=2)

In [26]:
# Start training
lora_trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.2106 | 0.217393 | 0.924 |
| 2 | 0.1936 | 0.193034 | 0.93 |


Checkpoint destination directory ./data/lora_phishing_or_benign/checkpoint-750 already exists and is non-empty.Saving will proceed but saved results may be invalid.


TrainOutput(global_step=1500, training_loss=0.19672984313964845, metrics={'train_runtime': 1175.6971, 'train_samples_per_second': 20.413, 'train_steps_per_second': 1.276, 'total_flos': 3271796490240000.0, 'train_loss': 0.19672984313964845, 'epoch': 2.0})
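
The step counts in this output follow directly from the dataset and batch sizes. The arithmetic below assumes a single device and no gradient accumulation, which is what the `TrainingArguments` in `build_trainer` imply by default.

```python
import math

train_examples = 12000  # rows in lora_train
batch_size = 16         # per_device_train_batch_size
epochs = 2

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch)  # 750
print(total_steps)      # 1500
```

The first value matches the `checkpoint-750` directory saved at the end of epoch 1, and the second matches `global_step=1500`.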

In [27]:
lora_trainer.evaluate()

{'eval_loss': 0.19303371012210846,
 'eval_accuracy': 0.93,
 'eval_runtime': 57.3638,
 'eval_samples_per_second': 52.298,
 'eval_steps_per_second': 3.277,
 'epoch': 2.0}
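
Accuracy alone does not show whether the remaining errors are missed phishing examples or benign examples flagged by mistake. One possible extension, sketched below, reports precision, recall, and F1 with phishing (label 1) as the positive class. `compute_metrics_extended` is a hypothetical name and was not used for the results in this notebook; it accepts the same `(logits, labels)` pair as `compute_metrics`.

```python
import numpy as np


def compute_metrics_extended(eval_pred, positive_label=1):
    """Accuracy, precision, recall, and F1 for a binary classifier."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=1)
    labels = np.asarray(labels)

    tp = int(np.sum((preds == positive_label) & (labels == positive_label)))
    fp = int(np.sum((preds == positive_label) & (labels != positive_label)))
    fn = int(np.sum((preds != positive_label) & (labels == positive_label)))

    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

    return {
        "accuracy": float((preds == labels).mean()),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy logits for four examples: predictions are [0, 1, 1, 0]
toy_logits = np.array([[2.0, 0.1], [0.2, 1.5], [0.3, 0.9], [1.2, 0.4]])
toy_labels = np.array([0, 1, 0, 1])
metrics = compute_metrics_extended((toy_logits, toy_labels))
print(metrics)
```

In the toy batch there is one true positive, one false positive, and one false negative, so all four metrics come out to 0.5.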

In [28]:
# save trained adapter weights
lora_model.save_pretrained('distilbert-lora')

## Performing Inference with a PEFT Model

TODO: In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.

In [30]:
from peft import AutoPeftModelForSequenceClassification

In [31]:
# load saved lora model
lora_model = AutoPeftModelForSequenceClassification.from_pretrained('distilbert-lora')

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.weight', 'classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [50]:
# pick a sample from test set and run inference
run_inference(0)

sample_text: fldopaype.com
predicted: phishing
actual:phishing
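
For the comparison with the model prior to fine-tuning, the two `trainer.evaluate()` results reported above can be put side by side. The snippet below only restates those numbers on the 3,000-example test split; it does not rerun any evaluation.

```python
test_size = 3000     # rows in the test split
baseline_acc = 0.45  # eval_accuracy before fine-tuning
lora_acc = 0.93      # eval_accuracy after two epochs of LoRA training

baseline_errors = round((1 - baseline_acc) * test_size)
lora_errors = round((1 - lora_acc) * test_size)

print(f"accuracy gain: {lora_acc - baseline_acc:.2f}")              # accuracy gain: 0.48
print(f"misclassified: {baseline_errors} -> {lora_errors}")         # misclassified: 1650 -> 210
```

The evaluation loss tells the same story, falling from about 0.696 for the baseline to about 0.193 after LoRA fine-tuning.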
