# Slot 4: Fine-Tuning

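This is the notebook for the last slot of the KE-RAG tutorial. Here, we fine-tune a pretrained LLM with supervised fine-tuning (SFT) and LoRA adapters. Note that SFT is often the first stage of an RLHF (reinforcement learning from human feedback) pipeline, but this notebook does not train a reward model or run a reinforcement learning step.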

### Preparations
Start by installing the packages required for this task.

In [None]:
%pip install -r requirements.txt

Import the required packages and reduce the logging output.

In [None]:
import torch
from datasets import load_dataset
from transformers import ( 
    AutoTokenizer, 
    AutoModelForCausalLM,
    pipeline,
    logging,
    )
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig, DataCollatorForCompletionOnlyLM

# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

#### Load Model

We will be using [Llama 3.2 with 1 billion parameters](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct). **Beware that you need to request access beforehand.**

In [None]:
# Model to be used
model_name = "meta-llama/Llama-3.2-1B-Instruct" # https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct

# Load the Llama tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token  # reuse the EOS token for padding
tokenizer.padding_side = "right"

# save the terminator tokens for use in the pipe
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

# Load the entire model on GPU 0
device_map = {"": 0}
# Load the base model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    # quantization_config = bnb_config, # Optional quantization; `bnb_config` must be defined before this call
    device_map = device_map,
    torch_dtype = torch.bfloat16
)

model.config.use_cache = False  # disable the key/value cache, which is not needed during training
model.config.pretraining_tp = 1

### Showcase: Base model

Here we check the result of the base model on our query.
We use the text generation pipeline to create a SPARQL query on the DBLP knowledge graph.

In [None]:
# Run the text-generation pipeline with the base model
pipe = pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
)

messages = [
    {"role": "system", "content": "You are a helpful chatbot, that answers only with SPARQL queries."},
    {"role": "user", "content": "Show the Wikidata ID of the person Robert Schober. His entity ID is <https://dblp.org/pid/95/2265>."},
]

outputs = pipe(
    messages,
    max_new_tokens=256,
    eos_token_id = terminators,
)
print(outputs[0]['generated_text'][-1]["content"])
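
The indexing in the final `print` can look opaque. When the pipeline receives a list of chat messages, it returns the whole conversation with the model's reply appended as the last message. The sketch below mimics that structure with a hand-written stand-in; the assistant text is a placeholder, not real model output.

```python
# Hand-written stand-in for the pipeline's return value on chat input.
# The assistant content is a placeholder, not an actual model response.
mock_outputs = [
    {
        "generated_text": [
            {"role": "system", "content": "You are a helpful chatbot, that answers only with SPARQL queries."},
            {"role": "user", "content": "Show the Wikidata ID of the person Robert Schober."},
            {"role": "assistant", "content": "<model reply goes here>"},
        ]
    }
]

# mock_outputs[0]      -> result for the first (and only) input conversation
# ["generated_text"]   -> the full message list, including the new reply
# [-1]["content"]      -> text of the last message, i.e. the assistant's answer
reply = mock_outputs[0]["generated_text"][-1]["content"]
print(reply)
```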


### Showcase: Fine-tuning

First, we load the dataset to be used. For our example task we use [DBLP-QuAD](https://huggingface.co/datasets/awalesushil/DBLP-QuAD), a scholarly knowledge graph question answering dataset with 10,000 pairs of questions and SPARQL queries. As the name indicates, it targets the DBLP knowledge graph. Each example is then formatted as a single `### Question: ... ### Answer: ...` string for training.

In [None]:
dataset_name = "awalesushil/DBLP-QuAD" # https://huggingface.co/datasets/awalesushil/DBLP-QuAD

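# Load the training split
dataset = load_dataset(dataset_name, split="train")

def formatting_prompts_func(example):
    """Format each example in the batch as a '### Question: ... ### Answer: ...' string."""
    output_texts = []
    for i in range(len(example['question'])):
        text = f"### Question: {example['question'][i]}\n ### Answer: {example['query'][i]}"
        output_texts.append(text)
    return output_texts

# Optional: compute the loss only on the tokens after the response template.
# The collator takes effect only if it is passed to the trainer as `data_collator`.
response_template = " ### Answer:"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)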
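
To see what the trainer receives, the formatting function can be applied to a small hand-made batch. This is a minimal sketch: the two question/query pairs and their predicates are invented for illustration and are not taken from DBLP-QuAD, and it assumes that the `question` and `query` columns hold plain strings. The function body is repeated here so that the snippet runs on its own.

```python
def formatting_prompts_func(example):
    # Same logic as in the cell above: one training string per example.
    output_texts = []
    for i in range(len(example["question"])):
        text = f"### Question: {example['question'][i]}\n ### Answer: {example['query'][i]}"
        output_texts.append(text)
    return output_texts

# Invented toy batch in the column-oriented layout of a batched dataset.
toy_batch = {
    "question": ["Who wrote paper X?", "When was paper Y published?"],
    "query": [
        "SELECT ?a WHERE { <X> <authoredBy> ?a }",
        "SELECT ?y WHERE { <Y> <yearOfPublication> ?y }",
    ],
}

response_template = " ### Answer:"
texts = formatting_prompts_func(toy_batch)
for text in texts:
    # Conceptually, everything before the template is the prompt and
    # everything after it is the completion that a completion-only
    # collator would compute the loss on (the real collator works on tokens).
    prompt, completion = text.split(response_template)
    print(repr(prompt), "->", repr(completion))
```

If the dataset columns turn out to be nested rather than plain strings, the f-string would need to index into them accordingly; printing `dataset[0]` is a quick way to check.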

### Training

Since we use LoRA (low-rank adaptation), we need to configure a few additional parameters. The commented-out lines show the setup for QLoRA, which combines LoRA with 4-bit quantization of the base model. We do not use it for now.

In [None]:
# BitsAndBytes settings for QLoRA (uncomment to use)
# from transformers import BitsAndBytesConfig
# use_4bit = True
# bnb_4bit_compute_dtype = "float16"
# bnb_4bit_quant_type = "nf4"
# use_nested_quant = False

# Build the quantization config; pass it as `quantization_config` when loading the model
# compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

# bnb_config = BitsAndBytesConfig(
#     load_in_4bit = use_4bit,
#     bnb_4bit_quant_type = bnb_4bit_quant_type,
#     bnb_4bit_compute_dtype = compute_dtype,
#     bnb_4bit_use_double_quant = use_nested_quant,
# )

# Check whether the GPU supports bfloat16
# if compute_dtype == torch.float16 and use_4bit:
#     major, _ = torch.cuda.get_device_capability()
#     if major >= 8:
#         print("="*80)
#         print("Your GPU supports bfloat16; you can accelerate training with bf16=True")
#         print("="*80)


# Set up the LoRA config
peft_config = LoraConfig(
    r = 16, # rank of the low-rank update matrices
    lora_alpha = 32, # scaling parameter; the update is scaled by lora_alpha / r
    lora_dropout = 0.05, # dropout probability applied in the LoRA layers
    bias = "none",
    task_type = "CAUSAL_LM",
)
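
To get a feeling for what `r` and `lora_alpha` mean: LoRA freezes a weight matrix `W` of shape `d_out x d_in` and learns two small matrices `B` (`d_out x r`) and `A` (`r x d_in`), so that the effective weight becomes `W + (lora_alpha / r) * B @ A`. The sketch below counts the trainable parameters for one such matrix. The layer size of 2048 is an illustrative assumption and is not read from the model config.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    return r * d_in + d_out * r  # A is r x d_in, B is d_out x r

d_in = d_out = 2048      # assumed layer size, for illustration only
r, lora_alpha = 16, 32   # values from the LoraConfig above

full = d_in * d_out
lora = lora_param_count(d_in, d_out, r)
scaling = lora_alpha / r

print(f"full matrix: {full:,} parameters")
print(f"LoRA update: {lora:,} parameters ({lora / full:.2%} of the full matrix)")
print(f"scaling factor lora_alpha / r = {scaling}")
```

Under these assumptions, the adapter trains only a small fraction of the parameters of each adapted matrix, which is why LoRA fits on a single GPU more easily than full fine-tuning.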

Then we set all parameters necessary for fine-tuning.
For further information on supervised fine-tuning, please refer to [the documentation at huggingface.co](https://huggingface.co/docs/trl/en/sft_trainer).

In [None]:
# Set the training parameters
training_arguments = SFTConfig(
    output_dir = "./results",
    num_train_epochs = 5,
    per_device_train_batch_size = 4,
    optim = "adamw_torch", # for QLoRA use "paged_adamw_32bit"
    save_steps = 0,
    logging_steps = 50,
    learning_rate = 2e-4,
    max_grad_norm = 0.3,
    weight_decay = 0.001,
    lr_scheduler_type = "cosine",
    warmup_ratio = 0.03,
    group_by_length = True,
    # report_to = "tensorboard", # if you want reporting, you need to install it first
)


# Set up the SFT trainer
trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    peft_config = peft_config,
    # dataset_text_field = "text",
    max_seq_length = 1024,
    args = training_arguments,
    tokenizer = tokenizer,
    packing = False,
    formatting_func=formatting_prompts_func,
    # data_collator=collator,
)

# Start training
trainer.train()
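
The interplay of `num_train_epochs`, `per_device_train_batch_size`, `warmup_ratio` and the cosine scheduler can be made concrete with a little arithmetic. The sketch below assumes a single GPU, no gradient accumulation, and a training split of 7,000 examples (an assumption; check `len(dataset)`). It re-implements a linear warmup followed by cosine decay by hand instead of calling the library, so the exact values in a real run may differ slightly.

```python
import math

num_examples = 7000   # assumption: replace with len(dataset)
batch_size = 4        # per_device_train_batch_size
epochs = 5            # num_train_epochs
base_lr = 2e-4        # learning_rate
warmup_ratio = 0.03

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

def lr_at(step: int) -> float:
    """Linear warmup to base_lr, then cosine decay towards zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(f"{total_steps} optimizer steps, {warmup_steps} of them warmup")
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(f"step {step:>5}: lr = {lr_at(step):.2e}")
```

With `logging_steps = 50`, the training loss is reported every 50 of these optimizer steps.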

#### Saving the model and Testing
We first save the fine-tuned model and then run the text-generation pipeline once again, this time with the fine-tuned model. Note that the prompt stays the same.

In [None]:
new_model = f"{model_name}-finetuned-DBLP-QUAD2" # name of the fine-tuned model

# Save the trained model (with a PEFT config, this stores the LoRA adapter weights)
trainer.model.save_pretrained(f"results/finetuned models/{new_model}")

In [None]:
# Optionally reload the saved model from disk instead of reusing the in-memory one:
# model = AutoModelForCausalLM.from_pretrained(f'results/finetuned models/{new_model}', device_map=device_map)
# Run the text-generation pipeline with the fine-tuned model
pipe = pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
)

messages = [
    {"role": "system", "content": "You are a helpful chatbot, that answers only with SPARQL queries."},
    {"role": "user", "content": "Show the Wikidata ID of the person Robert Schober. His entity ID is <https://dblp.org/pid/95/2265>."},
]

outputs = pipe(
    messages,
    max_new_tokens=256,
    eos_token_id = terminators,
)
print(outputs[0]['generated_text'][-1]["content"])
