# Llama 3 fine-tuning for smishing detection

Based on the tutorials ["Fine-Tuning LLaMA 2: A Step-by-Step Guide to Customizing the Large Language Model"](https://www.datacamp.com/tutorial/fine-tuning-llama-2) and ["Fine-Tuning Llama 3 and Using It Locally: A Step-by-Step Guide"](https://www.datacamp.com/tutorial/llama3-fine-tuning-locally).

In [1]:
%%capture
%pip install accelerate peft bitsandbytes transformers trl wandb
import os, torch, wandb
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import (
    LoraConfig,
    PeftModel,
    prepare_model_for_kbit_training,
    get_peft_model,
)
from trl import SFTTrainer

In [2]:
# Log in to Hugging Face and Weights & Biases

from huggingface_hub import login
from google.colab import userdata

hf_token = userdata.get('HuggingFace')

login(token = hf_token)

wb_token = userdata.get('wandb')

wandb.login(key=wb_token)
run = wandb.init(
    project='Smishing detection with fine-tuned Llama 3 8B',
    job_type="training",
    anonymous="allow"
)

from google.colab import drive
drive.mount('/content/drive')

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


[34m[1mwandb[0m: Currently logged in as: [33mdanielhenel[0m ([33mdanielhenel-research[0m). Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
# Model from Hugging Face hub
base_model = "NousResearch/Meta-Llama-3-8B-Instruct"
# Fine-tuned model
new_model = "/content/drive/MyDrive/1/models/smishing-detection-llama-3-8B-instruct"
# Dataset
dataset = load_dataset("text", data_files="/content/drive/MyDrive/1/data/llama3_train_data.txt", split="train")

Making the fine-tuning more memory-efficient by loading the base model with 4-bit (NF4) quantization

In [4]:
# Matrix multiplications run in float16 even though the weights are stored in 4 bits
compute_dtype = torch.float16
attn_implementation = "eager"

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=False,
)
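To make the benefit of the 4-bit setting concrete, the following is a rough back-of-the-envelope sketch of weight memory at 16 bits versus 4 bits per parameter. The parameter count of 8.0e9 is an approximation introduced here, and the estimate ignores quantization constants, activations, optimizer state, and any layers that stay in higher precision, so real usage will be higher.

```python
# Rough weight-memory estimate for an "8B" model.
# N_PARAMS is an approximation (assumption), not a value read from the checkpoint.
N_PARAMS = 8.0e9


def weight_gib(n_params: float, bits_per_param: int) -> float:
    """Memory needed to store n_params weights at the given bit width, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


fp16_gib = weight_gib(N_PARAMS, 16)  # half-precision weights
nf4_gib = weight_gib(N_PARAMS, 4)    # 4-bit weights, as with load_in_4bit=True

print(f"fp16 weights: ~{fp16_gib:.1f} GiB")
print(f"4-bit weights: ~{nf4_gib:.1f} GiB")
print(f"reduction factor: {fp16_gib / nf4_gib:.0f}x")
```

Under these assumptions the quantized weights take about a quarter of the half-precision footprint, which is what lets an 8B model fit on a single Colab GPU.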

Loading the Llama 3 model and tokenizer

In [5]:
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=quant_config,
    device_map={"": 0},
    attn_implementation=attn_implementation
)
model.config.use_cache = False
model.config.pretraining_tp = 1

tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Parameter-Efficient Fine-Tuning

In [6]:
peft_params = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=['up_proj', 'down_proj', 'gate_proj', 'k_proj', 'q_proj', 'v_proj', 'o_proj']
)
model = get_peft_model(model, peft_params)
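As a sanity check on what this LoRA configuration actually trains, the sketch below counts the adapter parameters for `r=16` over the seven targeted projections. The layer dimensions (hidden size 4096, MLP size 14336, 32 layers, key/value projection width 1024) are assumptions taken from the published Llama 3 8B configuration, not values read from the loaded model. Each adapted linear layer of shape `(d_in, d_out)` adds two matrices with `r * (d_in + d_out)` parameters in total.

```python
# LoRA trainable-parameter count for r=16 on all attention and MLP projections.
# Dimensions below are assumptions based on the published Llama 3 8B config.
HIDDEN = 4096         # hidden_size
INTERMEDIATE = 14336  # intermediate_size of the MLP
KV_DIM = 1024         # 8 key/value heads x 128 head dim (grouped-query attention)
NUM_LAYERS = 32
R = 16                # LoRA rank, as in LoraConfig(r=16)

# (input features, output features) of each targeted projection
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

# LoRA adds A (r x d_in) and B (d_out x r) per adapted layer
per_layer = sum(R * (d_in + d_out) for d_in, d_out in shapes.values())
total_trainable = per_layer * NUM_LAYERS
adapter_mb = total_trainable * 4 / 1e6  # assuming 4 bytes (float32) per weight

print(f"trainable LoRA parameters: {total_trainable:,}")  # 41,943,040
print(f"adapter size at float32: ~{adapter_mb:.0f} MB")
```

Under these assumptions the adapter holds roughly 42 million parameters, about 0.5% of the base model, and its float32 size is consistent with the 168M `adapter_model.safetensors` upload shown later in the notebook output.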

Training parameters

In [7]:
training_params = TrainingArguments(
    output_dir="./results",
    num_train_epochs=2,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=2,
    optim="paged_adamw_32bit",
    save_steps=25,
    logging_steps=25,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant",
    report_to="wandb"
)
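One detail of these arguments is worth illustrating: `warmup_ratio=0.03` is combined with `lr_scheduler_type="constant"`. The sketch below compares a plain constant schedule with a linear-warmup variant in pure Python. The formulas are my reading of how the `"constant"` and `"constant_with_warmup"` schedules behave, and the step count is taken from the training output of this run, so treat both as assumptions rather than the library's exact implementation.

```python
import math

BASE_LR = 2e-4       # learning_rate from TrainingArguments
TOTAL_STEPS = 4458   # global_step reported by this run
WARMUP_RATIO = 0.03  # warmup_ratio from TrainingArguments

# Number of warmup steps implied by the ratio (assumed to be rounded up)
warmup_steps = math.ceil(TOTAL_STEPS * WARMUP_RATIO)


def constant_lr(step: int) -> float:
    """Plain constant schedule: the warmup setting plays no role."""
    return BASE_LR


def constant_with_warmup_lr(step: int) -> float:
    """Linear ramp from 0 to BASE_LR over warmup_steps, then constant."""
    return BASE_LR * min(1.0, step / max(1, warmup_steps))


for step in (0, warmup_steps // 2, warmup_steps, TOTAL_STEPS):
    print(step, constant_lr(step), constant_with_warmup_lr(step))
```

With the plain constant schedule the learning rate is 2e-4 from the first step, which matches the flat `train/learning_rate` value of 0.0002 logged to Weights & Biases. If a warmup phase is wanted, `lr_scheduler_type="constant_with_warmup"` is the setting to try.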

Supervised fine-tuning

In [8]:
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_params,
    dataset_text_field="text",
    max_seq_length=None,
    tokenizer=tokenizer,
    args=training_params,
    packing=False,
)

trainer.train()


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.


Step,Training Loss
25,2.184
50,0.9078
75,1.5215
100,0.7348
125,1.6336
150,0.7379
175,1.4862
200,0.7769
225,1.746
250,0.8304




TrainOutput(global_step=4458, training_loss=0.9760110017803015, metrics={'train_runtime': 4088.2187, 'train_samples_per_second': 2.181, 'train_steps_per_second': 1.09, 'total_flos': 3.0086508851748864e+16, 'train_loss': 0.9760110017803015, 'epoch': 1.9995514689392242})
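The numbers in `TrainOutput` can be used to recover the approximate size of the training set, which the notebook never prints. The sketch below assumes a single GPU (as `device_map={"": 0}` suggests), so the effective batch size is `per_device_train_batch_size * gradient_accumulation_steps`. All inputs are copied from the configuration and output above; the derived dataset size is an estimate, not a value reported by the trainer.

```python
# Values copied from TrainingArguments and the TrainOutput above
per_device_train_batch_size = 1
gradient_accumulation_steps = 2
global_step = 4458
epoch = 1.9995514689392242
train_runtime = 4088.2187
train_samples_per_second = 2.181
num_train_epochs = 2

# One optimizer step consumes this many examples (single GPU assumed)
effective_batch = per_device_train_batch_size * gradient_accumulation_steps

# Optimizer steps per epoch, then examples per epoch
steps_per_epoch = global_step / epoch
approx_examples = round(steps_per_epoch * effective_batch)

# Independent cross-check from throughput (less precise, rounded in the log)
cross_check = train_samples_per_second * train_runtime / num_train_epochs

print(f"effective batch size: {effective_batch}")
print(f"estimated training examples: {approx_examples}")
print(f"throughput-based cross-check: {cross_check:.0f}")
```

Both estimates land at roughly 4,460 training examples, so each logged interval of 25 steps covers about 50 messages.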

In [9]:
wandb.finish()
model.config.use_cache = True


Run history:
train/epoch,▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇███
train/global_step,▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇███
train/grad_norm,▄▂▂▄▁▂▂▁▁▂▁▃▂▃▁▁▂▂▁▁▆▄▅▆▆▄▅▆▅▅▆▅▇▇▅▇███▆
train/learning_rate,▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
train/loss,▄▂▃▃▃▃▁▂▂▁▃▃▂▂▂▂▂▂▁▂▅▆▆▆▅▆▆▅█▅▆▅▄▆▅█▄▆▇▅

Run summary:
total_flos,3.0086508851748864e+16
train/epoch,1.99955
train/global_step,4458.0
train/grad_norm,2.69953
train/learning_rate,0.0002
train/loss,0.9789
train_loss,0.97601
train_runtime,4088.2187
train_samples_per_second,2.181
train_steps_per_second,1.09


Save the model and the tokenizer

In [10]:
trainer.model.save_pretrained(new_model)
trainer.tokenizer.save_pretrained(new_model)



('/content/drive/MyDrive/1/models/smishing-detection-llama-3-8B-instruct/tokenizer_config.json',
 '/content/drive/MyDrive/1/models/smishing-detection-llama-3-8B-instruct/special_tokens_map.json',
 '/content/drive/MyDrive/1/models/smishing-detection-llama-3-8B-instruct/tokenizer.json')

In [11]:
trainer.model.push_to_hub("danielhenel/smishing-detection-llama-3-8B-instruct")

adapter_model.safetensors:   0%|          | 0.00/168M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/danielhenel/smishing-detection-llama-3-8B-instruct/commit/04b08d8d4009cd2e596805dacc2ed49229964523', commit_message='Upload model', commit_description='', oid='04b08d8d4009cd2e596805dacc2ed49229964523', pr_url=None, pr_revision=None, pr_num=None)

In [12]:
tokenizer.push_to_hub("danielhenel/smishing-detection-llama-3-8B-instruct")

README.md:   0%|          | 0.00/5.17k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/danielhenel/smishing-detection-llama-3-8B-instruct/commit/eb1018658aaa838fc0272d0505f8ffbf2c8f3f12', commit_message='Upload tokenizer', commit_description='', oid='eb1018658aaa838fc0272d0505f8ffbf2c8f3f12', pr_url=None, pr_revision=None, pr_num=None)