<a href="https://colab.research.google.com/github/arielzamir/qwen2.5-finetuned-legal-assistant/blob/main/legal_assistant_qwen_finetuned.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Legal Assistant - Qwen2.5 Fine-Tuned with LoRA

This notebook demonstrates how to fine-tune the **Qwen2.5-1.5B-Instruct** model using **LoRA (Low-Rank Adaptation)** on a legal dataset.  
We use the Hugging Face ecosystem with `transformers`, `trl`, `peft`, and `datasets`, along with **Weights & Biases (wandb)** for experiment tracking.  


## Install Dependencies

We install all the necessary libraries:
- **bitsandbytes** → 8-bit optimizers and quantized loading (not used by the configuration below)  
- **transformers** → Hugging Face model APIs  
- **accelerate** → device placement (`device_map="auto"`) and mixed-precision training  
- **peft** → lightweight fine-tuning with LoRA  
- **trl** → supervised fine-tuning (SFT) utilities  
- **datasets** → loading and processing datasets

In [1]:
!pip -q install -U bitsandbytes transformers accelerate peft trl datasets


## Import Libraries

We import the core libraries:  
- `datasets` → load datasets easily from Hugging Face Hub  
- `transformers` → tokenizer + base model  
- `trl` → SFTTrainer for fine-tuning  
- `peft` → LoRA configs and model wrapping  
- `huggingface_hub` → authentication for pushing models  
- `wandb` → experiment tracking  
- `torch` → PyTorch backend  

In [2]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from trl import SFTConfig, SFTTrainer
from peft import LoraConfig, get_peft_model
from huggingface_hub import login
import wandb
import torch
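Because `-U` installs whatever releases are newest, and argument names in `transformers` and `trl` (for example `processing_class`, or the fields accepted by `SFTConfig`) have changed between releases, it can help to record the versions a run actually used. A small sketch using only the standard library's `importlib.metadata`; the package list simply mirrors the install cell:

```python
from importlib.metadata import PackageNotFoundError, version

# Packages installed by the pip cell above.
PACKAGES = ["bitsandbytes", "transformers", "accelerate", "peft", "trl", "datasets"]


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = version(name)
        except PackageNotFoundError:
            versions[name] = None
    return versions


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name:>13}: {ver or 'not installed'}")
```

Saving this output next to the checkpoints makes it easier to rerun the notebook with matching library versions later.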

## Authentication

Here we log into:
- **Hugging Face Hub** → for downloading models and pushing trained adapters  
- **Weights & Biases** → to track metrics, losses, and experiment runs  

In [3]:
login()
wandb.login()


[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:

 ··········


[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33marielzamir100[0m ([33marielzamir100-independent[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


True

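## Dataset Preparation  

We use the **CUAD (Contract Understanding Atticus Dataset)** legal QA dataset.  
Each question-answer pair is converted into a **chat format** with three roles:  
- `system` → defines the assistant's behavior (a generic `"You are a helpful assistant."` prompt)  
- `user` → the question (a legal contract query)  
- `assistant` → the first reference answer, stripped of surrounding whitespace, or an empty string when the question has no answer  

Only the question goes into the user turn; the contract text the question refers to is not included, so the model learns question → answer mappings without seeing the source document.  
`SFTTrainer` renders the resulting `messages` column with the tokenizer's chat template, so every example follows Qwen2.5's instruction-tuned format.  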

In [4]:
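def convert_to_chat(example):
    # `answers["text"]` is a list of reference answer spans (possibly empty).
    # Keep the first span; unanswerable questions get an empty assistant reply.
    ans = example.get("answers", {}).get("text", [])
    answer = ans[0].strip() if len(ans) > 0 else ""
    return {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": example["question"]},  # question only, no contract text
            {"role": "assistant", "content": answer},
        ]
    }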
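To see what `convert_to_chat` produces, here is a standalone run on two made-up records in the same layout (the function is repeated so the cell runs on its own). The questions and answer text are hypothetical, not taken from CUAD; the second record shows that a question without an answer becomes an empty assistant turn.

```python
def convert_to_chat(example):
    # Same logic as the cell above.
    ans = example.get("answers", {}).get("text", [])
    answer = ans[0].strip() if len(ans) > 0 else ""
    return {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": example["question"]},
            {"role": "assistant", "content": answer},
        ]
    }


# Hypothetical records for illustration only (not real CUAD rows).
answered = {
    "question": "Which law governs this agreement?",
    "answers": {"text": [" This Agreement is governed by the laws of Delaware. "]},
}
unanswered = {"question": "Is there a non-compete clause?", "answers": {"text": []}}

print(convert_to_chat(answered)["messages"][-1])
# {'role': 'assistant', 'content': 'This Agreement is governed by the laws of Delaware.'}
print(convert_to_chat(unanswered)["messages"][-1])
# {'role': 'assistant', 'content': ''}
```

Because unanswered questions are kept, part of the training signal teaches the model to emit an empty reply; filtering those rows out would be a separate design choice.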

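## Load the Dataset

We download `chenghao/cuad_qa` from the Hub and map every row through `convert_to_chat`; `remove_columns` drops the original columns, leaving only `messages`.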

In [5]:
dataset = load_dataset("chenghao/cuad_qa")
dataset = dataset.map(convert_to_chat, remove_columns=dataset["train"].column_names)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Generating train split: 11178 examples
Generating test split: 1244 examples
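`SFTTrainer` turns each `messages` list into a single string with the tokenizer's chat template. Qwen2.5 uses ChatML-style `<|im_start|>` / `<|im_end|>` markers; the function below is a hand-written approximation of that layout, for illustration only. The real template (available via `tokenizer.apply_chat_template(messages, tokenize=False)`) also handles tool calls and inserts a default system prompt when none is given.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Approximate the ChatML layout used by Qwen2.5 (illustration, not the real template)."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        # Inference prompts end with an open assistant turn for the model to complete.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


# Hypothetical conversation in the same shape as the converted dataset.
example = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Does this contract have a governing-law clause?"},
    {"role": "assistant", "content": "Yes."},
]
print(render_chatml(example))
```

During training the whole rendered string (including the assistant turn) is tokenized; at inference the prompt stops after the open assistant header.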

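## Load Base Model & Tokenizer  

We load the **Qwen2.5-1.5B-Instruct** model and tokenizer.  
- If the tokenizer has no `pad_token`, we fall back to the EOS token (the Qwen2.5 tokenizer ships with one, so this is only a safeguard).  
- `torch_dtype="auto"` loads the weights in the dtype recorded in the checkpoint's config (bfloat16 for Qwen2.5) instead of the float32 default.  
- `attn_implementation="sdpa"` uses PyTorch's scaled-dot-product attention kernels.  
- `device_map="auto"` lets `accelerate` decide GPU/CPU placement.  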

In [6]:
model_name = "Qwen/Qwen2.5-1.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype="auto",
    attn_implementation="sdpa",
)
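A rough worked estimate of why the loading dtype matters: weight memory is the parameter count times the bytes per parameter. The parameter count used here (about 1.54B) is an assumption taken from the Qwen2.5-1.5B-Instruct model card, and the estimate covers the weights only, not activations, gradients, or optimizer state.

```python
# Assumption: ~1.54B parameters, as listed on the Qwen2.5-1.5B-Instruct model card.
NUM_PARAMS = 1.54e9
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_memory_gb(num_params, dtype):
    """Memory for the weights alone, in GB (1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(NUM_PARAMS, dtype):.2f} GB")
```

Loading in bfloat16 roughly halves the footprint compared with float32 (about 3.1 GB versus 6.2 GB), which leaves room on a single Colab GPU for LoRA training.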


## Configure LoRA  

We apply **LoRA (Low-Rank Adaptation)** for parameter-efficient fine-tuning.  
Key parameters:  
- `r=16` → rank of the low-rank update matrices (controls adapter size)  
- `lora_alpha=32` → scaling factor; adapter updates are scaled by `lora_alpha / r`  
- `target_modules=[...]` → every attention projection (`q_proj`, `k_proj`, `v_proj`, `o_proj`) and MLP projection (`gate_proj`, `up_proj`, `down_proj`) gets an adapter  
- `lora_dropout=0.05` → dropout on the adapter input, for regularization  
- `bias="none"` → no bias terms are trained  
- `task_type="CAUSAL_LM"` → causal language modeling  

The base model weights stay **frozen**; only the small adapter matrices are trained.  
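In the standard LoRA formulation (PEFT's default, without rsLoRA), each adapted linear layer with frozen weight $W_0$ computes the following, where only $A$ and $B$ are trained and $B$ is initialized to zero so training starts from the unmodified base model:

$$
h = W_0 x + \frac{\alpha}{r}\, B A x, \qquad A \in \mathbb{R}^{r \times d_{\text{in}}}, \quad B \in \mathbb{R}^{d_{\text{out}} \times r}
$$

With `r=16` and `lora_alpha=32` the scale is $\alpha / r = 2$, and each adapted layer adds $r\,(d_{\text{in}} + d_{\text{out}})$ trainable parameters.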

In [7]:
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
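A worked count of the trainable parameters this configuration adds, using $r\,(d_{\text{in}} + d_{\text{out}})$ per adapted layer. The layer shapes are assumptions taken from Qwen2.5-1.5B's published config (hidden size 1536, MLP size 8960, 28 layers, 12 query heads and 2 key/value heads of dimension 128); if they are right, the total should match what PEFT's `model.print_trainable_parameters()` reports.

```python
# Assumed Qwen2.5-1.5B shapes (from its config.json).
HIDDEN, INTERMEDIATE, LAYERS = 1536, 8960, 28
KV_DIM = 2 * 128  # k_proj / v_proj output size under grouped-query attention
R = 16            # LoRA rank from lora_config

# (in_features, out_features) for each module in target_modules
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(d_in, d_out, r):
    # A is r x d_in and B is d_out x r; bias="none" adds nothing else.
    return r * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out, R) for d_in, d_out in SHAPES.values())
total = per_layer * LAYERS
print(f"per layer: {per_layer:,}  total: {total:,}")
# per layer: 659,456  total: 18,464,768
```

About 18.5M trainable parameters is on the order of 1% of the ~1.5B-parameter base model; the MLP projections account for most of it because of their large intermediate dimension.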

## Training Configuration

Here we define the **training arguments** for supervised fine-tuning:  

- `output_dir="./legal-assistant"` → where checkpoints are saved  
- `per_device_train_batch_size=1` → sequences per GPU per forward pass  
- `gradient_accumulation_steps=8` → accumulates gradients over 8 passes, for an effective batch of 8 sequences per optimizer step on one GPU  
- `packing=True` → packs multiple short samples into each training sequence for efficiency  
- `num_train_epochs=2` → number of full passes over the dataset  
- `learning_rate=1e-4` → peak learning rate, reached at the end of warmup  
- `lr_scheduler_type="cosine"` → cosine decay after warmup  
- `warmup_ratio=0.03` → the first 3% of steps ramp the learning rate up linearly  
- `logging_steps=10` → log metrics every 10 steps  
- `save_strategy="epoch"` → save a checkpoint at the end of every epoch  
- `fp16=True` → float16 mixed precision (faster, less memory)  
- `gradient_checkpointing=True` → recompute activations during the backward pass to reduce memory usage  
- `push_to_hub=True` → push the trained LoRA adapter to the Hugging Face Hub  
- `hub_model_id="ArielZamir23/legal-assistant-qwen2_5-1_5b-lora"` → repo name on the Hugging Face Hub  
- `hub_strategy="every_save"` → push every time a checkpoint is saved  
- `report_to="wandb"` → log training metrics to Weights & Biases  
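A quick check of what these settings imply. The effective batch is per-device batch × accumulation steps × number of GPUs (a single GPU is assumed here). Working backwards from the run's reported `global_step=220` over 2 epochs gives 110 optimizer steps and roughly 880 packed sequences per epoch; that figure is an inference from the logged numbers, and the exact count depends on how the dataloader rounds the last batch.

```python
# Values from the SFTConfig; num_gpus is an assumption (one Colab GPU).
per_device_train_batch_size = 1
gradient_accumulation_steps = 8
num_gpus = 1
num_train_epochs = 2

# Sequences consumed by one optimizer step.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus

# Working backwards from the reported global_step of this run.
reported_global_step = 220
steps_per_epoch = reported_global_step // num_train_epochs
approx_packed_sequences = steps_per_epoch * effective_batch

print(effective_batch, steps_per_epoch, approx_packed_sequences)  # 8 110 880
```

Roughly 880 packed sequences per epoch, compared with 11,178 raw examples, shows how much packing shortens each epoch for these short question-answer pairs.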

In [8]:
training_args = SFTConfig(
    output_dir="./legal-assistant",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    packing=True,
    num_train_epochs=2,
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    logging_steps=10,
    save_strategy="epoch",
    fp16=True,
    gradient_checkpointing=True,
    push_to_hub=True,
    hub_model_id="ArielZamir23/legal-assistant-qwen2_5-1_5b-lora",
    hub_strategy="every_save",
    report_to="wandb"
)
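The learning-rate schedule these settings produce can be sketched in a few lines. The function below reproduces the shape of the linear-warmup-plus-cosine schedule in `transformers` (`get_cosine_schedule_with_warmup` with its default half cycle), assuming `warmup_ratio` is converted to steps with `math.ceil`. With the 220 total steps this run ended up taking, warmup would last 7 steps.

```python
import math


def lr_at(step, total_steps, peak_lr=1e-4, warmup_ratio=0.03):
    """Linear warmup from 0 to peak_lr, then cosine decay to 0 at total_steps."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


TOTAL_STEPS = 220  # global_step reported by trainer.train() for this run
for s in (0, 3, 7, 110, 220):
    print(s, f"{lr_at(s, TOTAL_STEPS):.2e}")
```

The short warmup avoids large updates while Adam's moment estimates are still noisy, and the cosine tail lets the adapters settle at a small learning rate near the end of the second epoch.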

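## Initialize Weights & Biases (wandb)  

We initialize a new **wandb run** to track training metrics:  
- `project="legal-assistant"` → experiment project name  
- `name="qwen2.5-1.5b-lora-cuad"` → run name  

Because the trainer uses `report_to="wandb"`, it logs into this active run, which lets us monitor:  
- Training loss  
- Learning rate schedule  
- GPU utilization, memory, and runtime (wandb's system metrics)  

Checkpoints themselves go to the Hugging Face Hub; wandb only uploads them as artifacts if `WANDB_LOG_MODEL` is set.  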

In [9]:
wandb.init(project="legal-assistant", name="qwen2.5-1.5b-lora-cuad")
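As an alternative to calling `wandb.init()` by hand, the trainer's W&B integration can be configured through the environment: it reads the project name from `WANDB_PROJECT` and the run name from the `run_name` training argument. A minimal sketch; setting `WANDB_LOG_MODEL="checkpoint"` is optional and additionally uploads saved checkpoints as W&B artifacts.

```python
import os

# Read by transformers' WandbCallback when it starts a run
# (it falls back to the project name "huggingface" if unset).
os.environ["WANDB_PROJECT"] = "legal-assistant"

# Optional: also upload each saved checkpoint to W&B as an artifact.
os.environ["WANDB_LOG_MODEL"] = "checkpoint"

# The run name would then be passed through the training arguments, e.g.
# SFTConfig(..., report_to="wandb", run_name=RUN_NAME)
RUN_NAME = "qwen2.5-1.5b-lora-cuad"
```

These variables must be set before the trainer starts its first run; if a run is already active (as after `wandb.init()` above), the callback reuses it.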

## Start Training with SFTTrainer  

We create an `SFTTrainer` that will:  
- Wrap our LoRA model, with the tokenizer passed as `processing_class`  
- Train on the converted `cuad_qa` train split  
- Apply the training configuration defined earlier  

On construction it tokenizes and packs the dataset; `trainer.train()` then handles:  
- Forward & backward passes  
- Optimizer updates  
- Loss logging  
- Saving checkpoints  
- Pushing to the Hugging Face Hub  

In [10]:
trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,
    train_dataset=dataset["train"],
    args=training_args
)



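Packing is what turns 11,178 short chat examples into far fewer full-length training sequences. The sketch below shows the simplest variant, concatenating token ids and cutting the stream into fixed-size chunks. It is an illustration, not TRL's code: depending on the TRL version, `packing=True` may instead use a best-fit strategy that avoids splitting examples across sequences. The token ids are made up.

```python
def pack_sequences(tokenized_examples, max_length):
    """Concatenate token-id lists end to end and cut the stream into max_length chunks.

    Simplified "concatenate and chunk" packing: an example may be split across a
    chunk boundary, and the final partial chunk is kept.
    """
    stream = [tok for ids in tokenized_examples for tok in ids]
    return [stream[i:i + max_length] for i in range(0, len(stream), max_length)]


# Hypothetical token ids for three short examples.
examples = [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]]
print(pack_sequences(examples, max_length=4))
# [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10]]
```

Packing removes almost all padding, so each forward pass does useful work; the trade-off in this naive form is that one example can be cut in half or attend to its neighbor.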

In [11]:
trainer.train()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


| Step | Training Loss |
|-----:|--------------:|
| 10 | 1.8106 |
| 20 | 1.6377 |
| 30 | 1.3889 |
| 40 | 1.5028 |
| 50 | 1.4606 |
| 60 | 1.363 |
| 70 | 1.4516 |
| 80 | 1.4419 |
| 90 | 1.3912 |
| 100 | 1.3719 |


TrainOutput(global_step=220, training_loss=1.3896615115079012, metrics={'train_runtime': 2228.5418, 'train_samples_per_second': 0.786, 'train_steps_per_second': 0.099, 'total_flos': 1.4134539991062528e+16, 'train_loss': 1.3896615115079012})
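The numbers in `TrainOutput` can be cross-checked with a little arithmetic. The effective batch of 8 comes from the config above; `exp(train_loss)` is only a rough perplexity, since it exponentiates an average of per-step losses over whichever tokens contributed to the loss.

```python
import math

# Numbers copied from the TrainOutput above.
global_step = 220
train_runtime_s = 2228.5418
samples_per_second = 0.786
train_loss = 1.3896615115079012
effective_batch = 1 * 8  # per_device_train_batch_size * gradient_accumulation_steps

sequences_seen = global_step * effective_batch          # packed sequences processed
from_throughput = samples_per_second * train_runtime_s  # same quantity, via throughput
seconds_per_step = train_runtime_s / global_step        # wall-clock time per optimizer step
perplexity = math.exp(train_loss)                       # rough average training perplexity

print(sequences_seen, round(from_throughput), round(seconds_per_step, 1), round(perplexity, 2))
# 1760 1752 10.1 4.01
```

The two sequence counts agree to within rounding, which is consistent with about 880 packed sequences per epoch over 2 epochs at roughly 10 seconds per optimizer step.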

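## Inference Example (Quick Start)

Once the model is fine-tuned, we can use it for **legal question answering**.  
Below we load the pushed repo with `pipeline` from 🤗 Transformers (with `peft` installed, Transformers loads the base model and applies the LoRA adapter) and ask a **domain-specific question**:  

**Example Question:**  
👉 *"What is the termination clause in this contract?"*

No contract text is supplied, so the answer is generated from patterns learned during training rather than extracted from a document. The `[***]` in the output is likely a redaction marker copied from the style of the CUAD contracts.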

In [2]:
from transformers import pipeline

question = "What is the termination clause in this contract?"
generator = pipeline("text-generation", model="ArielZamir23/legal-assistant-qwen2_5-1_5b-lora")

output = generator([{"role": "user", "content": question}], max_new_tokens=256)
print(output[0]["generated_text"])

Device set to use cuda:0


[{'role': 'user', 'content': 'What is the termination clause in this contract?'}, {'role': 'assistant', 'content': 'This Agreement shall terminate automatically at any time upon expiration of 12 months from the date it was executed or if the Parties do not enter into an agreement to continue providing Services pursuant to this Agreement within [***] after such expiration (the "Initial Term").'}]
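The pipeline returns the whole conversation, so the answer has to be picked out of the message list. The example above also omits the system prompt used in training; when no system message is given, Qwen2.5's chat template inserts its own default one. A small sketch of two hypothetical helpers, `build_messages` and `assistant_reply`, that keep the prompt consistent with `convert_to_chat` and extract the reply:

```python
# Same system prompt as convert_to_chat used during training.
SYSTEM_PROMPT = "You are a helpful assistant."


def build_messages(question, system_prompt=SYSTEM_PROMPT):
    """Build a chat prompt in the same shape as the training examples (minus the answer)."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]


def assistant_reply(pipeline_output):
    """Return the last assistant message from a chat-style text-generation output."""
    conversation = pipeline_output[0]["generated_text"]
    return next(m["content"] for m in reversed(conversation) if m["role"] == "assistant")


# Usage with the notebook's generator (not run here):
# output = generator(build_messages(question), max_new_tokens=256)
# print(assistant_reply(output))
```

Matching the training-time system prompt keeps the inference input closer to the distribution the adapter was trained on.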
