# Phase 2: Model Training - Fine-Tuning with LoRA

In this phase of the project, we fine-tune the pretrained model `meta-llama/Llama-3.2-3B-Instruct` to perform Cyber Threat Intelligence (CTI) analysis, using the dataset prepared in Phase 1. We use **LoRA (Low-Rank Adaptation)** to adapt the base model to the task efficiently.

## Fine-Tuning with LoRA

1. **Model Loading**
   - The base model, "meta-llama/Llama-3.2-3B-Instruct", is loaded. This model is a large pretrained decoder-only model, which is well-suited for instruction-based tasks.

2. **LoRA Fine-Tuning**
   - LoRA is a parameter-efficient fine-tuning technique that lets us adapt large models to new tasks at a lower computational cost.
   - Instead of updating all the parameters of the base model, LoRA freezes the pretrained weights and adds a pair of small low-rank matrices alongside selected weight matrices. Only these added matrices are updated during training.
   - This approach lets the model learn task-specific patterns while keeping the original pretrained parameters intact, which makes fine-tuning more resource-efficient.

3. **Data Collator**
   - A **data collator** is used during training to batch variable-length inputs and to pad the tokenized data correctly.
   - Since the input sequences (threat reports and the corresponding responses) vary in length, the data collator pads each batch dynamically to the length of its longest sequence, which avoids excessive padding.
   - The input prompt tokens are masked in the labels only, using the special value `-100`. The loss function ignores positions with this label, so prompt tokens do not contribute to the loss and the model focuses on predicting the expected output.

4. **Training Process**
   - The model is trained on the preprocessed dataset, using the formatted input-output sequences, where the input prompt is integrated with the expected output (entities, relations, and diagnosis).
   - Because only the LoRA matrices receive gradient updates, the model learns to generate the expected output without updating the full set of pretrained parameters.
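
To make the low-rank idea in step 2 concrete, here is a minimal NumPy sketch of a single LoRA-adapted linear layer. It is an illustration under stated assumptions, not the PEFT implementation: the layer size, rank, and random values are hypothetical toy numbers, and the `lora_alpha / r` scaling and zero initialization of `B` follow the usual LoRA formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in = 8, 8     # toy layer size (hypothetical, for illustration only)
r, lora_alpha = 2, 2   # toy rank and scaling; this notebook uses r=32, lora_alpha=32

W = rng.normal(size=(d_out, d_in))           # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))   # trainable, small random init
B = np.zeros((d_out, r))                     # trainable, zero init


def lora_forward(x, W, A, B, scaling):
    """Frozen path plus the scaled low-rank update: x W^T + scaling * x A^T B^T."""
    return x @ W.T + scaling * (x @ A.T) @ B.T


x = rng.normal(size=(3, d_in))
scaling = lora_alpha / r

# With B = 0 the adapter contributes nothing, so the output equals the base layer.
assert np.allclose(lora_forward(x, W, A, B, scaling), x @ W.T)

# Once B has been updated (simulated here with random values),
# the layer behaves like one with effective weight W + scaling * B @ A.
B = rng.normal(size=(d_out, r))
W_eff = W + scaling * B @ A
assert np.allclose(lora_forward(x, W, A, B, scaling), x @ W_eff.T)

full_params = W.size            # parameters in the frozen weight
lora_params = A.size + B.size   # parameters that would actually be trained
print(full_params, lora_params)
```

In this toy layer the adapter is half the size of the weight matrix. For real layers the hidden size is much larger than the rank, so the trainable fraction becomes very small.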

---

In [None]:
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Trainer,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model

import torch

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

### Load the model

In [None]:
model_name = "meta-llama/Llama-3.2-3B-Instruct"
access_token = "YOUR HUGGING FACE ACCESS TOKEN"

# `token` replaces the deprecated `use_auth_token` argument in recent transformers releases
base_model = AutoModelForCausalLM.from_pretrained(model_name, token=access_token)
base_model.to(device)

### Define the tokenizer, to decode the output of the model

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name, token=access_token)
tokenizer.pad_token = tokenizer.eos_token  # Reuse the EOS token for padding so that batches can be padded

### Load the dataset

In [None]:
import pickle
with open('/content/drive/My Drive/Git_Portfolio/CTI/data/dataset_CTI_llama3_2-3B.pkl', 'rb') as file:
    dataset = pickle.load(file)

In [None]:
dataset

### Setup the PEFT configuration

In [None]:
# Set up LoRA configuration
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Wrap model with LoRA
peft_model = get_peft_model(base_model, lora_config)

In [None]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return (
        f"trainable model parameters: {trainable_model_params}\n"
        f"all model parameters: {all_model_params}\n"
        f"percentage of trainable model parameters: "
        f"{100 * trainable_model_params / all_model_params:.2f}%"
    )


In [None]:
print(print_number_of_trainable_model_parameters(peft_model))
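
As a sanity check on the count printed above, the expected number of trainable parameters can be estimated by hand. The sketch below assumes the following Llama-3.2-3B dimensions, which are not stated in this notebook and should be verified against `base_model.config`: hidden size 3072, 28 decoder layers, 8 key/value heads, head dimension 128. `lora_params` is a hypothetical helper written for this illustration.

```python
# Assumed Llama-3.2-3B dimensions (verify against base_model.config before relying on them).
hidden_size = 3072
num_layers = 28
num_kv_heads = 8
head_dim = 128
r = 32  # LoRA rank from lora_config


def lora_params(in_features, out_features, rank):
    """A has shape (rank, in_features) and B has shape (out_features, rank)."""
    return rank * (in_features + out_features)


q_proj = lora_params(hidden_size, hidden_size, r)              # 3072 -> 3072
v_proj = lora_params(hidden_size, num_kv_heads * head_dim, r)  # 3072 -> 1024
expected_trainable = num_layers * (q_proj + v_proj)

print(f"{expected_trainable:,}")  # 9,175,040 under these assumptions
```

If the printed count from the previous cell differs, the assumed dimensions or the set of `target_modules` are the first things to check.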

### Create the data collator, set the training parameters and create the trainer

In [None]:
# Define the data collator to handle padding dynamically. For the moment, the dataset is composed of lists of variable length.
# Labels are padded with -100 so that padded positions are ignored by the loss.
data_collator = DataCollatorForSeq2Seq(tokenizer, model=peft_model, label_pad_token_id=-100)
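
The padding behaviour described above can be sketched in plain Python. This is a simplified illustration, not the `DataCollatorForSeq2Seq` implementation: `pad_batch` is a hypothetical helper, the token ids are made up, right-side padding is assumed, and the labels are assumed to have the same length as the inputs.

```python
def pad_batch(features, pad_token_id, label_pad_token_id=-100):
    """Right-pad every field to the length of the longest sequence in the batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
        # Padding in the labels uses -100 so it is ignored by the loss.
        batch["labels"].append(f["labels"] + [label_pad_token_id] * n_pad)
    return batch


# Hypothetical examples: prompt tokens are already masked with -100 in the labels.
features = [
    {"input_ids": [5, 6, 7, 8, 9], "attention_mask": [1] * 5, "labels": [-100, -100, -100, 8, 9]},
    {"input_ids": [5, 6, 4], "attention_mask": [1] * 3, "labels": [-100, -100, 4]},
]

batch = pad_batch(features, pad_token_id=0)
for name, rows in batch.items():
    print(name, rows)
```

Only the shorter example is padded, and only up to the longest sequence in this batch rather than to a fixed global maximum.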

In [None]:
# Set up training arguments
training_args = TrainingArguments(
    output_dir="/content/drive/My Drive/Git_Portfolio/CTI/model_training",
    report_to="none",  # Disable all logging integrations (e.g. W&B)
    eval_strategy="steps",
    eval_steps=100,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    num_train_epochs=1,
    logging_steps=100,
)

In [None]:
# Get LoRA trainable parameters
lora_parameters = [p for p in peft_model.parameters() if p.requires_grad]

# Define optimizer
optimizer = torch.optim.AdamW(lora_parameters, lr=1e-3)

In [None]:
# Initialize the Trainer
trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    processing_class=tokenizer,
    data_collator=data_collator,
    optimizers=(optimizer, None)
)

In [None]:
# Start fine-tuning
trainer.train()
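
During training, positions labelled `-100` (masked prompt tokens and label padding) are skipped when the loss is averaged. The pure-Python sketch below illustrates that behaviour with made-up logits. `masked_cross_entropy` is a hypothetical helper, and for simplicity it omits the one-position shift between logits and labels that causal language models apply internally.

```python
import math


def masked_cross_entropy(logits, labels, ignore_index=-100):
    """Mean negative log-likelihood over positions whose label is not ignore_index."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # masked prompt token or padding: no contribution to the loss
        log_norm = math.log(sum(math.exp(v) for v in row))
        total += log_norm - row[label]
        count += 1
    return total / count


# Three positions over a toy vocabulary of size 3; the first position is masked.
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
labels_masked = [-100, 1, 1]
loss = masked_cross_entropy(logits, labels_masked)

# Changing the logits at the masked position leaves the loss unchanged.
changed = [[9.0, -4.0, 1.0]] + logits[1:]
assert masked_cross_entropy(changed, labels_masked) == loss
print(round(loss, 4))
```

This is why masking the prompt in the labels focuses the training signal on the expected output rather than on reproducing the prompt.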

In [None]:
model_path = "/content/drive/My Drive/Git_Portfolio/CTI/peft_model_CTI"

# For a PEFT-wrapped model, save_pretrained writes only the LoRA adapter weights and config
trainer.model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)
