# LoRA and Alignment
The goal of this notebook is to finetune a given model for Instruction Finetuning using Lora and Alignment over the dataset [ultrafeedback_binarized](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized).

### Reading and Links
1. https://huggingface.co/docs/trl/main/en/index


In [5]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch
from datasets import load_dataset
from peft import LoraConfig, PeftModel
from trl import SFTTrainer, SFTConfig, DPOTrainer, DPOConfig
import os

In [6]:
HF_TOKEN = os.getenv("HF_TOKEN")

In [59]:
model_name = "Qwen/Qwen2.5-3B-Instruct"   # swap to a 7B instruct model if you prefer
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto"
)


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [36]:
ds = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_sft").select(range(1000)) # prompt/chosen/rejected
def tok_len(s): 
    return len(tokenizer(s, add_special_tokens=False)["input_ids"])

In [37]:
def get_last_assistant_content(msgs):
    # msgs: list of {"role": "...", "content": "..."}
    for m in reversed(msgs):
        if isinstance(m, dict) and m.get("role") == "assistant":
            return m.get("content", "")
    return ""

In [38]:
ds[0]

{'prompt': 'how can i develop a habit of drawing daily',
 'prompt_id': '086b3e24f29b8956a01059f79c56db35d118a06fb6b844b095737d042795cd43',
 'chosen': [{'content': 'how can i develop a habit of drawing daily',
   'role': 'user'},
  {'content': "Developing a daily habit of drawing can be challenging but with consistent practice and a few tips, it can become an enjoyable and rewarding part of your daily routine. Here are some strategies to help you develop the habit of drawing daily:\n\n1. Set a specific time: Allocate a specific time of the day to draw. It could be in the morning, afternoon, or evening. Make drawing a part of your daily routine.\n2. Set a specific duration: Determine the amount of time you want to spend on drawing each day. It can be as little as 10 minutes or as long as an hour. Be consistent with the duration to help build the habit.\n3. Start small and simple: Don't try to create a masterpiece every day, start with simple and easy-to-do sketches. Focus on improving yo

In [39]:
ds_f = ds.filter(
    lambda x: tok_len(x["prompt"]) <= 256
    and tok_len(get_last_assistant_content(x["chosen"])) <= 256
    and tok_len(get_last_assistant_content(x["rejected"])) <= 256
)

Filter:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [40]:
split = ds_f.train_test_split(test_size=100, seed=42)
train_raw, eval_raw = split["train"], split["test"]

In [41]:
train_raw

Dataset({
    features: ['prompt', 'prompt_id', 'chosen', 'rejected', 'messages', 'score_chosen', 'score_rejected'],
    num_rows: 254
})

In [42]:
SYSTEM = "You are a helpful university helpdesk assistant. Be concise and ask clarifying questions when needed."

def build_chat_prompt(user: str) -> str:
    msgs = [{"role": "system", "content": SYSTEM},
            {"role": "user", "content": user}]
    if hasattr(tokenizer, "apply_chat_template"):
        return tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
    return f"System: {SYSTEM}\nUser: {user}\nAssistant:"

def to_dpo_row(ex):
    return {
        "prompt": build_chat_prompt(ex["prompt"]),                # string
        "chosen": get_last_assistant_content(ex["chosen"]),       # string
        "rejected": get_last_assistant_content(ex["rejected"]),   # string
    }

train_dpo = train_raw.map(to_dpo_row, remove_columns=train_raw.column_names)
eval_dpo  = eval_raw.map(to_dpo_row,  remove_columns=eval_raw.column_names)

train_sft = train_dpo.map(lambda x: {"prompt": x["prompt"], "completion": x["chosen"]},
                          remove_columns=["chosen","rejected"])
eval_sft  = eval_dpo.map(lambda x: {"prompt": x["prompt"], "completion": x["chosen"]},
                         remove_columns=["chosen","rejected"])

Map:   0%|          | 0/254 [00:00<?, ? examples/s]

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

Map:   0%|          | 0/254 [00:00<?, ? examples/s]

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

In [43]:
train_dpo[0]

{'prompt': '<|im_start|>system\nYou are a helpful university helpdesk assistant. Be concise and ask clarifying questions when needed.<|im_end|>\n<|im_start|>user\nWrite a 10-line free verse poem about the contagious joys of laughter with a focus on the physical and emotional benefits it provides. Use descriptive language to evoke a vivid image of a joyful moment shared among friends, and include at least one metaphor or simile to express the power of laughter to uplift spirits and connect people.<|im_end|>\n<|im_start|>assistant\n',
 'chosen': "Laughter, a medicine for the soul,\nSpreading joy, like a wildfire's embrace,\nContagious and pure, it warms the heart,\nBringing people together, like a work of art.\n\nA shared laugh, a bonding moment,\nCreates a connection, like a river's flow,\nThrough the laughter, our spirits are lifted,\nLike a bird taking flight, our hearts are alifted.\n\nThe power of laughter, it heals and mends,\nLike a gentle breeze, it soothes and sends,\nA message 

In [44]:
peft_config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    )

In [45]:
sft_out = "sft_lora"

sft_cfg = SFTConfig(
    output_dir=sft_out,
    num_train_epochs=2,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    logging_steps=10,
    save_steps=200,
    eval_strategy="steps",
    eval_steps=200,
    save_total_limit=2,
    fp16=False,
    bf16=True,
    gradient_checkpointing=True,
    report_to=None,
)

sft_trainer = SFTTrainer(
    model=model,
    args=sft_cfg,
    train_dataset=train_sft,
    eval_dataset=eval_sft,
    processing_class=tokenizer,
    peft_config=peft_config,
)






Map:   0%|          | 0/254 [00:00<?, ? examples/s]

Converting train dataset to ChatML:   0%|          | 0/254 [00:00<?, ? examples/s]

Applying chat template to train dataset:   0%|          | 0/254 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/254 [00:00<?, ? examples/s]

Truncating train dataset:   0%|          | 0/254 [00:00<?, ? examples/s]

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

Converting eval dataset to ChatML:   0%|          | 0/100 [00:00<?, ? examples/s]

Applying chat template to eval dataset:   0%|          | 0/100 [00:00<?, ? examples/s]

Tokenizing eval dataset:   0%|          | 0/100 [00:00<?, ? examples/s]

Truncating eval dataset:   0%|          | 0/100 [00:00<?, ? examples/s]

In [46]:
sft_trainer.train()
sft_trainer.save_model(sft_out) 

Step,Training Loss,Validation Loss


In [52]:
def prepare_for_inference(m):
    m.eval()
    # Disable gradient checkpointing for generation
    if hasattr(m, "gradient_checkpointing_disable"):
        m.gradient_checkpointing_disable()
    # Re-enable cache for faster generation
    if hasattr(m, "config"):
        m.config.use_cache = True
    return m

def generate(m, user_text, max_new_tokens=160, temperature=0.7, BF16=torch.float16):
    m = prepare_for_inference(m)

    prompt = build_chat_prompt(user_text)
    inputs = tokenizer(prompt, return_tensors="pt").to(m.device)

    # Choose autocast dtype
    amp_dtype = torch.bfloat16 if (m.device.type == "cuda" and BF16) else torch.float16

    with torch.no_grad():
        if m.device.type == "cuda":
            with torch.autocast(device_type="cuda", dtype=amp_dtype):
                out = m.generate(
                    **inputs,
                    max_new_tokens=max_new_tokens,
                    do_sample=True,
                    temperature=temperature,
                )
        else:
            out = m.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=True,
                temperature=temperature,
            )

    return tokenizer.decode(out[0], skip_special_tokens=True)


In [53]:
print("=== BASE ===")
print(generate(model, "I forgot my Uni email password. What should I do?"))

=== BASE ===
system
You are a helpful university helpdesk assistant. Be concise and ask clarifying questions when needed.
user
I forgot my Uni email password. What should I do?
assistant
If you have forgotten your Uni email password, there are a few steps you can take to reset it:

1. Go to the University's online helpdesk or support page.
2. Look for a "Forgot Password" or "Reset Password" option on the login page or in the account settings section.
3. Follow the instructions provided by the website to reset your password. This may involve answering security questions, receiving a link via email, or using a mobile app.

If these options are not available, you can also contact the IT Helpdesk directly for assistance.


In [54]:
# After SFT
sft_eval = PeftModel.from_pretrained(model, sft_out)
sft_eval.eval()
print("\n=== AFTER SFT ===")
print(generate(sft_eval, "I forgot my Uni email password. What should I do?"))




=== AFTER SFT ===
system
You are a helpful university helpdesk assistant. Be concise and ask clarifying questions when needed.
user
I forgot my Uni email password. What should I do?
assistant
You can reset your Uni email password by following these steps:

1. Go to the uni's website and click on "Forgot Password?" or "Reset Password?"
2. Enter your University email address.
3. Follow the instructions to reset your password.

If you don't know your email address, contact the Student Services office for assistance.


# Reload the base model freshly

In [None]:

model2 = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto"
)

In [60]:
dpo_out = "dpo_lora"
policy_model = PeftModel.from_pretrained(model2, sft_out, is_trainable=True)

dpo_cfg = DPOConfig(
    output_dir=dpo_out,
    num_train_epochs=2,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    logging_steps=10,
    save_steps=200,
    eval_strategy="steps",
    eval_steps=200,
    save_total_limit=2,
    fp16=False,
    bf16=True,
    gradient_checkpointing=True,
    report_to=None,
    beta=0.1,
    max_prompt_length=256,
    max_completion_length=256,# Helps single-GPU memory: precompute ref logprobs so ref model needn't stay resident :contentReference[oaicite:9]{index=9}
    precompute_ref_log_probs=True,
    precompute_ref_batch_size=1,
)

dpo_trainer = DPOTrainer(
    model=policy_model,
    args=dpo_cfg,
    train_dataset=train_dpo,
    eval_dataset=eval_dpo,
    processing_class=tokenizer,
    # (prompt, chosen, rejected) format; explicit prompts recommended :contentReference[oaicite:10]{index=10}
)
dpo_trainer.train()
dpo_trainer.save_model(dpo_out)

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None, 'pad_token_id': 151643}.


Train dataset reference log probs:   0%|          | 0/254 [00:00<?, ?it/s]

Step,Training Loss,Validation Loss


In [61]:
# After DPO
dpo_eval = PeftModel.from_pretrained(model, sft_out)
dpo_eval.eval()
print("\n=== AFTER DPO ===")
print(generate(dpo_eval, "I forgot my Uni email password. What should I do?"))




=== AFTER DPO ===
system
You are a helpful university helpdesk assistant. Be concise and ask clarifying questions when needed.
user
I forgot my Uni email password. What should I do?
assistant
If you've forgotten your University of X email password, you can reset it by following these steps:

1. Go to the [University's Password Reset Page](https://www.universityofx.com/password-reset)
2. Enter your University of X email address in the provided field.
3. Click on "Reset Password".
4. A new password will be sent to the email address associated with your account.

If you don't have access to your email or you're not able to reset your password, you may need to contact the IT Helpdesk for assistance.
