## Qwen2.5-LoRA-Finetune-Baseline-Train
This notebook demonstrates how to fine-tune a large language model (Qwen2.5) with LoRA (Low-Rank Adaptation) for a classification task. The example uses the Jigsaw dataset to classify whether Reddit comments violate community rules.

### Note
The training code as configured here is for demonstration purposes only. To reproduce the weights used in the inference notebook, change the config as follows. The full training run was performed on a local A6000.

- `IS_DEBUG = True` → `False`
- `MODEL_NAME = "Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4"` → `"Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4"`
- `TRAIN_BS = 1` → `8`
- `GRAD_ACC_NUM = 8` → `1`
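Note that the last two changes leave the effective batch size unchanged at 8; only the split between per-step batch size and gradient accumulation differs. A minimal sketch of that arithmetic follows. The helper name `effective_batch_size` and the `n_devices` parameter are illustrative and not part of the notebook.

```python
def effective_batch_size(train_bs: int, grad_acc_num: int, n_devices: int = 1) -> int:
    """Number of samples that contribute to one optimizer update."""
    return train_bs * grad_acc_num * n_devices


# Demo config in this notebook vs. the config used for the full run
debug_cfg = effective_batch_size(train_bs=1, grad_acc_num=8)
full_cfg = effective_batch_size(train_bs=8, grad_acc_num=1)
print(debug_cfg, full_cfg)  # 8 8
```

The demo setting trades speed for memory: each forward/backward pass holds a single sequence, which is what lets the 7B model fit on a smaller GPU.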


### References
* https://www.kaggle.com/code/abdmental01/jigsaw-mpnet-base-v2-inference-cv-0-876
* https://www.kaggle.com/code/aerdem4/jigsaw-acrc-qwen7b-finetune-logits-processor-zoo
* https://www.guruguru.science/competitions/24/discussions/21027ff1-2074-4e21-a249-b2d4170bd516/


### 1. Setup and Imports

In [1]:
!pip install trl
!pip install optimum
!pip install auto-gptq
!pip install bitsandbytes
!pip install peft accelerate

Collecting trl
  Downloading trl-0.19.1-py3-none-any.whl.metadata (10 kB)
Collecting fsspec<=2025.3.0,>=2023.1.0 (from fsspec[http]<=2025.3.0,>=2023.1.0->datasets>=3.0.0->trl)
  Downloading fsspec-2025.3.0-py3-none-any.whl.metadata (11 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=2.0.0->accelerate>=1.4.0->trl)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=2.0.0->accelerate>=1.4.0->trl)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=2.0.0->accelerate>=1.4.0->trl)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=2.0.0->accelerate>=1.4.0->trl)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting
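The install commands above are unpinned, so the resolved versions depend on when the notebook is run (the log shows trl 0.19.1 for this run). Since later cells import helpers such as `DataCollatorForCompletionOnlyLM` whose availability may differ between releases, it can help to record what was actually installed. A small sketch using only the standard library; the helper name `installed_versions` is illustrative:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


PACKAGES = ["trl", "optimum", "auto-gptq", "bitsandbytes", "peft", "accelerate", "transformers"]
for name, ver in installed_versions(PACKAGES).items():
    print(f"{name}=={ver}" if ver else f"{name}: not installed")
```

The printed `name==version` lines can be pasted into a requirements file to pin the environment for later runs.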

In [2]:
import os
import pandas as pd
import torch
from sklearn.model_selection import KFold
from tqdm import tqdm
from torch.utils.data import Dataset
import wandb
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    Trainer,
    TrainingArguments,
)
from transformers.utils import is_torch_bf16_gpu_available
from peft import LoraConfig, TaskType, get_peft_model
from trl import DataCollatorForCompletionOnlyLM

2025-07-28 06:22:48.354124: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1753683768.543686      19 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1753683768.595477      19 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


### 2. Configuration Settings

In [None]:
# Main configuration parameters
WANDB = False  # Enable/disable Weights & Biases logging
MODEL_NAME = "Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4"  # Pre-trained model to fine-tune
IS_DEBUG = True  # Debug mode with small dataset
N_FOLDS = 5  # Number of cross-validation folds
EPOCH = 1  # Training epochs
LR = 1e-4  # Learning rate
TRAIN_BS = 1  # Training batch size per device (8 for the full run)
GRAD_ACC_NUM = 8  # Gradient accumulation steps (1 for the full run)
EVAL_BS = 8  # Evaluation batch size
FOLD = 0  # Current fold to train
SEED = 42  # Random seed for reproducibility
 
# Derive experiment name and paths
EXP_ID = "jigsaw-lora-finetune-baseline"
if IS_DEBUG:
    EXP_ID += "_debug"
EXP_NAME = EXP_ID + f"_fold{FOLD}"
COMPETITION_NAME = "jigsaw-kaggle"
OUTPUT_DIR = "./"  # On Kaggle, e.g. f"/kaggle/output/{EXP_NAME}/"
os.makedirs(OUTPUT_DIR, exist_ok=True)
MODEL_OUTPUT_PATH = os.path.join(OUTPUT_DIR, "trained_model")

### 3. Data Loading and Preprocessing

In [4]:
# Load the dataset
df = pd.read_csv("/kaggle/input/jigsaw-agile-community-rules/train.csv")
if IS_DEBUG:
    # Use a small subset for debugging
    df = df.sample(50, random_state=SEED).reset_index(drop=True)

# Initialize tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.padding_side = "left"  # Important for causal language models

# Define system prompt for the classification task
SYS_PROMPT = """
You are given a comment on reddit. Your task is to classify if it violates the given rule. Only respond Yes/No.
"""

prompts = []
for _, row in df.iterrows():
    text = f"""
r/{row.subreddit}
Rule: {row.rule}

1) {row.positive_example_1}
Violation: Yes

2) {row.negative_example_1}
Violation: No

3) {row.negative_example_2}
Violation: No

4) {row.positive_example_2}
Violation: Yes

5) {row.body}
"""
    
    # Format as a chat conversation using the model's template
    messages = [
        {"role": "system", "content": SYS_PROMPT},
        {"role": "user", "content": text}
    ]

    prompt = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=False,
    ) + "Answer:"
    prompts.append(prompt)

# Add the formatted prompts to the dataframe
df["text"] = prompts
df["label"] = df["rule_violation"].apply(lambda x: "Yes" if x == 1 else "No")

# Append the label to create completion-based training examples
df["text"] = df["text"] + df["label"]

# Tokenize the examples
def preprocess_row(row, tokenizer) -> dict:
    item = tokenizer(row["text"], add_special_tokens=False, truncation=False)
    return item

def preprocess_df(df, tokenizer) -> pd.DataFrame:
    items = []
    for _, row in df.iterrows():
        items.append(preprocess_row(row, tokenizer))
    df = pd.concat([
        df,
        pd.DataFrame(items)
    ], axis=1)
    return df

df = preprocess_df(df, tokenizer)
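The `"Answer:"` suffix matters because the collator configured in Section 4 uses it as the response template: tokens up to and including the template are excluded from the loss, so the model is trained only on the `Yes`/`No` label. The sketch below illustrates that masking idea on toy token ids. It is a simplified stand-in, not TRL's implementation; the helper names and the choice to match the last occurrence of the template are assumptions.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def find_last_sublist(sequence, pattern):
    """Return the start index of the last occurrence of pattern, or -1 if absent."""
    for start in range(len(sequence) - len(pattern), -1, -1):
        if sequence[start:start + len(pattern)] == pattern:
            return start
    return -1


def mask_prompt_tokens(input_ids, template_ids):
    """Copy input_ids into labels, ignoring everything up to and including the template."""
    start = find_last_sublist(input_ids, template_ids)
    if start == -1:
        # Template not found: no token in this example contributes to the loss
        return [IGNORE_INDEX] * len(input_ids)
    answer_start = start + len(template_ids)
    return [IGNORE_INDEX] * answer_start + input_ids[answer_start:]


# Toy ids: 7, 8 stand in for the tokens of "Answer:", 99 for "Yes"
input_ids = [1, 2, 3, 7, 8, 99]
labels = mask_prompt_tokens(input_ids, template_ids=[7, 8])
print(labels)  # [-100, -100, -100, -100, -100, 99]
```

Because `truncation=False` is used during tokenization, the label token at the end of each example is never cut off, at the cost of potentially long sequences.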


### 4. Dataset and Model Setup

In [5]:
# Create a PyTorch dataset class
class ClassifyDataset(Dataset):
    def __init__(
        self,
        df: pd.DataFrame,
    ):
        self.df = df

    def __len__(self) -> int:
        return len(self.df)

    def __getitem__(self, index) -> dict:
        row = self.df.iloc[index]

        inputs = {
            "input_ids": row["input_ids"],
        }
        return inputs

# Data collator for completion-only learning
data_collator = DataCollatorForCompletionOnlyLM("Answer:", tokenizer=tokenizer)

# Load the base model
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16,
    trust_remote_code=True,
    # device_map="auto",  # Automatically distribute model across available GPUs
)

# Configure LoRA parameters
lora_config = LoraConfig(
    r=16,  # Rank of the update matrices
    lora_alpha=16,  # Alpha parameter for LoRA scaling
    lora_dropout=0.05,  # Dropout probability for LoRA layers
    task_type=TaskType.CAUSAL_LM,
    bias='none',  # Don't train bias terms
    # Target the attention and MLP modules of the transformer
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ]
)

# Apply LoRA to the model
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # Show what percentage of parameters will be trained

# Initialize Weights & Biases for experiment tracking
if WANDB:
    wandb.login()
    wandb.init(project=COMPETITION_NAME, name=EXP_NAME)
    REPORT_TO = "wandb"
else:
    REPORT_TO = "none"


`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.



trainable params: 40,370,176 || all params: 1,130,569,216 || trainable%: 3.5708
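The trainable-parameter count reported above can be checked by hand: for each adapted linear layer, LoRA adds two matrices of shapes `r × in_features` and `out_features × r`. The sketch below assumes the published Qwen2.5-7B shapes (hidden size 3584, MLP intermediate size 18944, 28 layers, 4 key/value heads of dimension 128), which are not shown in this notebook. The much smaller "all params" figure is presumably because packed GPTQ int4 weights are counted in their packed form.

```python
# Assumed Qwen2.5-7B shapes (not printed anywhere in this notebook)
HIDDEN, INTERMEDIATE, N_LAYERS = 3584, 18944, 28
KV_DIM = 4 * 128  # 4 key/value heads x head dimension 128
RANK = 16  # r in LoraConfig

# (in_features, out_features) for each module listed in target_modules
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) to each adapted linear layer."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGET_SHAPES.values())
total = per_layer * N_LAYERS
print(f"{total:,}")  # 40,370,176
```

The result matches the `trainable params` value in the log, which suggests the adapters were attached to all seven projection types in every layer.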


### 5. Cross-Validation Split

In [6]:
# Split data into train and validation sets
kf = KFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)
for fold, (train_idx, val_idx) in enumerate(kf.split(df)):
    if fold == FOLD:
        df_train = df.iloc[train_idx].reset_index(drop=True)
        df_val = df.iloc[val_idx].reset_index(drop=True)
        break

# Save the split data
df_train.to_pickle(f"{OUTPUT_DIR}/train.pkl")
df_val.to_pickle(f"{OUTPUT_DIR}/val.pkl")
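To see how many rows end up in each split, the fold sizes can be computed without running the splitter. scikit-learn's `KFold` gives the first `n_samples % n_splits` folds one extra row. The sketch below applies that rule to the debug subset of 50 rows; the helper name `kfold_val_sizes` is illustrative.

```python
def kfold_val_sizes(n_samples: int, n_splits: int) -> list:
    """Validation-fold sizes: the first n_samples % n_splits folds get one extra row."""
    base, extra = divmod(n_samples, n_splits)
    return [base + 1 if i < extra else base for i in range(n_splits)]


n_rows, n_folds, fold = 50, 5, 0  # debug subset size, N_FOLDS, FOLD
val_size = kfold_val_sizes(n_rows, n_folds)[fold]
train_size = n_rows - val_size
print(train_size, val_size)  # 40 10
```

In debug mode this leaves only 40 training rows, so the run is a smoke test of the pipeline rather than a meaningful fit.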

### 6. Training Configuration and Execution

In [7]:
# Set up training arguments
training_args = TrainingArguments(
    output_dir=MODEL_OUTPUT_PATH,
    logging_steps=10,  # Log metrics every 10 steps
    logging_strategy="steps",
    eval_strategy="no",  # No evaluation during training
    save_strategy="steps",
    save_steps=0.1,  # Fraction of total training steps: save a checkpoint every 10%
    save_total_limit=10,  # Keep only the 10 most recent checkpoints
    num_train_epochs=EPOCH,
    optim="paged_adamw_8bit",  # 8-bit optimizer for memory efficiency
    lr_scheduler_type="linear",
    warmup_ratio=0.1,  # Warm up learning rate over 10% of steps
    learning_rate=LR,
    weight_decay=0.01,

    # Use BF16 if available, otherwise FP16
    bf16=is_torch_bf16_gpu_available(),
    fp16=not is_torch_bf16_gpu_available(),

    per_device_train_batch_size=TRAIN_BS,
    per_device_eval_batch_size=EVAL_BS,
    gradient_accumulation_steps=GRAD_ACC_NUM,
    gradient_checkpointing=True,  # Save memory with gradient checkpointing
    gradient_checkpointing_kwargs={"use_reentrant": False},
    group_by_length=False,
    report_to=REPORT_TO,
    seed=SEED,
    remove_unused_columns=False,  # Keep all columns in the dataset
)

# Initialize trainer
trainer = Trainer(
    model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=ClassifyDataset(df_train),
    eval_dataset=ClassifyDataset(df_val),
    data_collator=data_collator,
)

# Start training
trainer_output = trainer.train()

# Save the final model
trainer.save_model(MODEL_OUTPUT_PATH)

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


Step,Training Loss
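The ratio-style arguments (`save_steps=0.1`, `warmup_ratio=0.1`) are resolved against the total number of optimizer steps, so it is worth estimating that number. The sketch below is a rough single-device estimate under the assumption that batches divide evenly into accumulation groups; the exact rounding inside `Trainer` may differ between transformers versions, and the helper name `schedule_summary` is illustrative.

```python
import math


def schedule_summary(n_train, train_bs, grad_acc, epochs,
                     save_ratio, warmup_ratio, logging_steps):
    """Rough step counts for a single-device run (assumes even division)."""
    batches_per_epoch = math.ceil(n_train / train_bs)
    updates_per_epoch = max(batches_per_epoch // grad_acc, 1)
    total_steps = updates_per_epoch * epochs
    return {
        "total_steps": total_steps,
        "save_every": math.ceil(total_steps * save_ratio),
        "warmup_steps": math.ceil(total_steps * warmup_ratio),
        "logged_steps": total_steps // logging_steps,
    }


# Debug run: 40 training rows, TRAIN_BS=1, GRAD_ACC_NUM=8, EPOCH=1
summary = schedule_summary(n_train=40, train_bs=1, grad_acc=8, epochs=1,
                           save_ratio=0.1, warmup_ratio=0.1, logging_steps=10)
print(summary)
```

Under these assumptions the debug run has only 5 optimizer steps, fewer than `logging_steps=10`, which would be consistent with the empty loss table above. Lowering `logging_steps` in debug mode would make the loss visible.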


### Conclusion
This notebook demonstrates a complete workflow for fine-tuning Qwen2.5 using LoRA for a text classification task. The key components include:

1. Setting up the model with quantization (GPTQ-Int4)
2. Formatting the data as completion tasks
3. Using LoRA to efficiently fine-tune only a small subset of parameters
4. Training with mixed precision for memory efficiency

After training, the LoRA adapter is saved to `MODEL_OUTPUT_PATH` and can be loaded on top of the base model for inference on new data.
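The inference notebook is not shown here. One common way to turn a model trained on `Yes`/`No` completions into a score is to read the next-token logits for the two answer tokens and normalize over just those two. The sketch below shows only that final step on plain floats; how the logits are obtained, and the helper name `yes_probability`, are assumptions rather than the author's method.

```python
import math


def yes_probability(yes_logit: float, no_logit: float) -> float:
    """Softmax over the two candidate answer tokens, returning P(Yes)."""
    m = max(yes_logit, no_logit)  # subtract the max for numerical stability
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)


print(round(yes_probability(2.0, 0.0), 4))  # 0.8808
```

A continuous score like this is convenient when the evaluation metric is ranking-based, since it preserves more information than a hard `Yes`/`No` decision.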