# Qwen3 1.7B Training Notebook (4-bit BitsAndBytes + QLoRA Fine-tuning)

This notebook fine-tunes the Qwen3-1.7B base model with QLoRA: the weights are loaded in 4-bit NF4 via BitsAndBytes, LoRA adapters are trained on the training data only, and the resulting model is saved for later use.

**Key Changes from DoRA Version:**
- Uses `Qwen/Qwen3-1.7B` (base model) from local Kaggle input
- Uses BitsAndBytes 4-bit NF4 quantization for QLoRA
- Uses standard QLoRA (no DoRA)
- Trains only on training data (no test-time training)
- Saves the fine-tuned model for later loading

**Benefits of 4-bit + QLoRA:**
- **Lower VRAM**: ~8-10GB per GPU
- **Larger batches**: The smaller weight footprint leaves room for a bigger per-device batch size
- **Better compatibility**: Full PEFT ecosystem support

**LoRA settings (as set in `train.py`, from the TT-1 config):**
- `r=8` - Rank
- `lora_alpha=16` - Alpha
- `lora_dropout=0.045` - Dropout
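
As a rough back-of-the-envelope check on the VRAM figure above, the sketch below estimates weight storage only. It assumes about 1.7 billion parameters and ignores quantization constants, layers kept in higher precision, activations, and optimizer state, so the real footprint is higher. `weight_memory_gb` is a hypothetical helper, not part of any library.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 1.7e9  # assumed parameter count for Qwen3-1.7B

fp16_gb = weight_memory_gb(N_PARAMS, 16)
nf4_gb = weight_memory_gb(N_PARAMS, 4)

print(f"FP16 weights:  ~{fp16_gb:.2f} GB")
print(f"4-bit weights: ~{nf4_gb:.2f} GB")
```

The gap between this estimate and the ~8-10GB observed per GPU is mostly activations, gradients for the adapters, and optimizer state.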

In [None]:
# Install dependencies - BitsAndBytes + QLoRA setup
!uv pip install --system --no-index --find-links='/kaggle/input/jigsaw-packages2/whls/' 'trl==0.21.0' 'optimum==1.27.0' 'bitsandbytes==0.46.1' 'deepspeed==0.17.4' 'logits-processor-zoo==0.2.1' 'vllm==0.10.0'
!uv pip install --system --no-index --find-links='/kaggle/input/jigsaw-packages2/whls/' 'triton==3.2.0'
!uv pip install --system --no-index --find-links='/kaggle/input/jigsaw-packages2/whls/' 'clean-text'
# Install latest PEFT for QLoRA support (ensure v0.10.0+) - No auto-gptq needed!
!uv pip install --system --no-index -U --no-deps --find-links='/kaggle/input/jigsaw-packages2/whls/' 'peft' 'accelerate' 'datasets'

# Note: Removed auto-gptq dependency as we're using BitsAndBytes quantization
print("✅ Dependencies installed for 4-bit BitsAndBytes + QLoRA setup")
print("📁 Model will be loaded from local Kaggle input: /kaggle/working/qwen3-1.7b")

# 1. Train Qwen3 1.7B with 4-bit Quantization + QLoRA

In [None]:
%%writefile constants.py
# Using base Qwen3 1.7B model from Kaggle input (no internet needed)
# Model is pre-loaded in Kaggle environment
BASE_MODEL_PATH = "/kaggle/working/qwen3-1.7b"  # Local Kaggle path
print("✅ Using model from local Kaggle input (no internet required)")

LORA_PATH = "qwen3_1.7b_4bit_qlora_output/"  # 4-bit QLoRA output path
FINAL_MODEL_PATH = "qwen3_1.7b_4bit_finetuned/"  # Path for merged final model
DATA_PATH = "/kaggle/input/jigsaw-agile-community-rules/"

POSITIVE_ANSWER = "Yes"
NEGATIVE_ANSWER = "No"
COMPLETE_PHRASE = "Answer:"
BASE_PROMPT = '''You are a moderator of subreddit.  given a comment from reddit and a rule. Your task is to classify whether the comment violates the rule. Only respond Yes/No.'''
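
Because the model is loaded with `local_files_only=True`, a missing or incomplete model directory only fails once training starts. A minimal pre-flight check could look like the sketch below. `missing_model_files` is a hypothetical helper, and the file patterns it looks for (`config.json` plus `*.safetensors` or `*.bin` weights) are assumptions about a typical Hugging Face checkpoint layout.

```python
from pathlib import Path


def missing_model_files(model_dir: str) -> list[str]:
    """Return a list of problems found in a local Hugging Face model directory."""
    root = Path(model_dir)
    if not root.is_dir():
        return [f"directory not found: {model_dir}"]

    problems = []
    if not (root / "config.json").is_file():
        problems.append("config.json")
    # Weights are usually stored as *.safetensors or *.bin files (assumed layout).
    has_weights = any(root.glob("*.safetensors")) or any(root.glob("*.bin"))
    if not has_weights:
        problems.append("weight files (*.safetensors or *.bin)")
    return problems


# An empty list means the directory looks usable.
print(missing_model_files("/kaggle/working/qwen3-1.7b"))
```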

In [None]:
%%writefile utils.py
import pandas as pd
from datasets import Dataset
from constants import POSITIVE_ANSWER, NEGATIVE_ANSWER, COMPLETE_PHRASE, BASE_PROMPT
import random, numpy as np
random.seed(42)
np.random.seed(42)


def build_prompt(row):
    return f"""
{BASE_PROMPT}

Subreddit: r/{row["subreddit"]}
Rule: {row["rule"]}
Examples:
1) {row["positive_example"]}
{COMPLETE_PHRASE} {POSITIVE_ANSWER}

2) {row["negative_example"]}
{COMPLETE_PHRASE} {NEGATIVE_ANSWER}

---
Comment: {row["body"]}
{COMPLETE_PHRASE}"""


def get_dataframe_to_train(data_path, training_only=True):
    """Build the training DataFrame from train.csv.

    Examples drawn from test.csv are added only when training_only=False.
    """
    train_dataset = pd.read_csv(f"{data_path}/train.csv")
    
    flatten = []

    # ---------- Process training data ----------
    train_df = train_dataset[["body", "rule", "subreddit", "rule_violation",
                              "positive_example_1","positive_example_2",
                              "negative_example_1","negative_example_2"]].copy()

    # Randomly select positive_example and negative_example
    train_df["positive_example"] = np.where(
        np.random.rand(len(train_df)) < 0.5,
        train_df["positive_example_1"],
        train_df["positive_example_2"]
    )
    train_df["negative_example"] = np.where(
        np.random.rand(len(train_df)) < 0.5,
        train_df["negative_example_1"],
        train_df["negative_example_2"]
    )

    # Drop original candidate columns
    train_df.drop(columns=["positive_example_1","positive_example_2",
                           "negative_example_1","negative_example_2"], inplace=True)

    flatten.append(train_df)
    
    # Optionally add examples drawn from test.csv (skipped when training_only=True)
    if not training_only:
        test_dataset = pd.read_csv(f"{data_path}/test.csv").sample(frac=0.5, random_state=42).reset_index(drop=True)
        
        # ---------- Process test data ----------
        for violation_type in ["positive", "negative"]:
            for i in range(1, 3):
                sub_dataset = test_dataset[["rule","subreddit",
                                            "positive_example_1","positive_example_2",
                                            "negative_example_1","negative_example_2"]].copy()

                if violation_type == "positive":
                    body_col = f"positive_example_{i}"
                    other_positive_col = f"positive_example_{3-i}"
                    sub_dataset["body"] = sub_dataset[body_col]
                    sub_dataset["positive_example"] = sub_dataset[other_positive_col]
                    sub_dataset["negative_example"] = np.where(
                        np.random.rand(len(sub_dataset)) < 0.5,
                        sub_dataset["negative_example_1"],
                        sub_dataset["negative_example_2"]
                    )
                    sub_dataset["rule_violation"] = 1

                else:  # violation_type == "negative"
                    body_col = f"negative_example_{i}"
                    other_negative_col = f"negative_example_{3-i}"
                    sub_dataset["body"] = sub_dataset[body_col]
                    sub_dataset["negative_example"] = sub_dataset[other_negative_col]
                    sub_dataset["positive_example"] = np.where(
                        np.random.rand(len(sub_dataset)) < 0.5,
                        sub_dataset["positive_example_1"],
                        sub_dataset["positive_example_2"]
                    )
                    sub_dataset["rule_violation"] = 0

                sub_dataset.drop(columns=["positive_example_1","positive_example_2",
                                          "negative_example_1","negative_example_2"], inplace=True)

                flatten.append(sub_dataset)

    # Merge all DataFrames
    dataframe = pd.concat(flatten, axis=0)
    dataframe = dataframe.drop_duplicates(ignore_index=True)

    return dataframe


def build_dataset(dataframe):
    dataframe["prompt"] = dataframe.apply(build_prompt, axis=1)

    columns = ["prompt"]
    if "rule_violation" in dataframe:
        dataframe["completion"] = dataframe["rule_violation"].map(
            {
                1: POSITIVE_ANSWER,
                0: NEGATIVE_ANSWER,
            }
        )
        columns.append("completion")

    dataframe = dataframe[columns]
    dataset = Dataset.from_pandas(dataframe)
    dataset.to_pandas().to_csv("/kaggle/working/training_dataset.csv", index=False)
    return dataset
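
To see what a single training example looks like, here is a standalone sketch that mirrors the layout produced by `build_prompt`. The row values are made up for illustration, and `INSTRUCTION` is a shortened stand-in for `BASE_PROMPT`; `render_prompt` is a hypothetical name, not a function from `utils.py`.

```python
COMPLETE_PHRASE = "Answer:"
# Shortened stand-in for BASE_PROMPT (assumption, for illustration only).
INSTRUCTION = "Classify whether the comment violates the rule. Only respond Yes/No."


def render_prompt(row: dict) -> str:
    """Few-shot prompt: one violating example, one non-violating example, then the target comment."""
    return (
        f"{INSTRUCTION}\n\n"
        f"Subreddit: r/{row['subreddit']}\n"
        f"Rule: {row['rule']}\n"
        "Examples:\n"
        f"1) {row['positive_example']}\n{COMPLETE_PHRASE} Yes\n\n"
        f"2) {row['negative_example']}\n{COMPLETE_PHRASE} No\n\n"
        "---\n"
        f"Comment: {row['body']}\n{COMPLETE_PHRASE}"
    )


# Hypothetical row with the same columns the training DataFrame uses.
row = {
    "subreddit": "example",
    "rule": "No advertising.",
    "positive_example": "Buy cheap watches at my site!",
    "negative_example": "I liked this article, thanks for sharing.",
    "body": "Visit my store for 50% off everything!",
    "rule_violation": 1,
}

prompt = render_prompt(row)
completion = "Yes" if row["rule_violation"] == 1 else "No"
print(prompt)
print("completion ->", completion)
```

The prompt ends with `Answer:` so that the single-token completion (`Yes` or `No`) is the only text the model has to produce.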

In [None]:
%%writefile train.py
import torch

from trl import SFTTrainer, SFTConfig
from peft import LoraConfig, PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from transformers.utils import is_torch_bf16_gpu_available
from utils import build_dataset, get_dataframe_to_train
from constants import DATA_PATH, BASE_MODEL_PATH, LORA_PATH, FINAL_MODEL_PATH


def main():
    # Changed: Only use training data (training_only=True)
    dataframe = get_dataframe_to_train(DATA_PATH, training_only=True)
    train_dataset = build_dataset(dataframe)
    
    print(f"Training dataset size: {len(train_dataset)} samples")
    
    # BitsAndBytes 4-bit quantization config for QLoRA
    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,  # Enable 4-bit quantization
        bnb_4bit_compute_dtype=torch.float16,  # Compute in FP16
        bnb_4bit_use_double_quant=True,  # Also quantize the quantization constants to save a little more memory
        bnb_4bit_quant_type="nf4"  # Use NF4 quantization (standard for QLoRA)
    )
    print("✅ BitsAndBytes 4-bit quantization config created")
    
    # QLoRA configuration with settings from TT-1
    lora_config = LoraConfig(
        r=8,  # From TT-1 config
        lora_alpha=16,  # From TT-1 config
        lora_dropout=0.045,  # From TT-1 config
        bias="none",
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
        task_type="CAUSAL_LM",
        # use_dora is left at its default (False): standard QLoRA, no DoRA
    )
    print("✅ QLoRA config created with TT-1 settings")
    
    # Optimized training config for 4-bit + QLoRA
    training_args = SFTConfig(
        num_train_epochs=1,  # Keep same epochs
        
        # Larger per-device batch thanks to the lower memory use of 4-bit weights
        per_device_train_batch_size=6,
        gradient_accumulation_steps=3,  # Effective batch size = 6 * 3 * 2 GPUs = 36
        
        optim="paged_adamw_8bit",  # Keep 8-bit optimizer for memory efficiency
        learning_rate=1e-4,  # Keep same learning rate
        weight_decay=0.01,
        max_grad_norm=1.0,
        
        lr_scheduler_type="cosine",
        warmup_ratio=0.05,
        
        bf16=is_torch_bf16_gpu_available(),
        fp16=not is_torch_bf16_gpu_available(),
        dataloader_pin_memory=True,
        
        gradient_checkpointing=True,
        gradient_checkpointing_kwargs={"use_reentrant": False},
    
    save_strategy="epoch",  # Checkpoint once per epoch (save_steps would be ignored here)
        output_dir=LORA_PATH,
        logging_steps=50,
        report_to="none",
    
        completion_only_loss=True,
        packing=False,
        remove_unused_columns=False,
    )
    print("✅ Training config created with optimized batch sizes for 4-bit")
    
    # Load the model with BitsAndBytes quantization (local Kaggle input only)
    base_model = AutoModelForCausalLM.from_pretrained(
        BASE_MODEL_PATH,
        quantization_config=quantization_config,
        torch_dtype=torch.float16,
        device_map="auto",
        trust_remote_code=True,
        local_files_only=True,  # Use only local files (no internet)
    )
    print("✅ Base model loaded from local Kaggle input")
    
    # Create SFTTrainer with the already-loaded quantized model; peft_config attaches the LoRA adapters
    trainer = SFTTrainer(
        model=base_model,  # Pass the loaded model directly
        args=training_args,
        train_dataset=train_dataset,
        peft_config=lora_config,
    )
    
    print("🚀 Starting 4-bit BitsAndBytes + QLoRA training...")
    trainer.train()
    
    # Save the LoRA adapters
    trainer.save_model(LORA_PATH)
    print(f"✅ 4-bit QLoRA adapters saved to: {LORA_PATH}")
    
    # Merge and save the final model for easier loading
    print("🔄 Merging 4-bit QLoRA adapters with base model...")
    
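    # Reload the base model with the same quantization for merging (local only).
    # Merging into 4-bit weights involves rounding, so outputs can differ slightly from the adapter model.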
    base_model = AutoModelForCausalLM.from_pretrained(
        BASE_MODEL_PATH,
        quantization_config=quantization_config,
        torch_dtype=torch.float16,
        device_map="auto",
        trust_remote_code=True,
        local_files_only=True,  # Use only local files
    )
    print("✅ Base model loaded for merging from local Kaggle input")
    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_PATH, trust_remote_code=True, local_files_only=True)
    
    # Load and merge LoRA adapters
    peft_model = PeftModel.from_pretrained(base_model, LORA_PATH)
    merged_model = peft_model.merge_and_unload()
    
    # Save merged model
    merged_model.save_pretrained(FINAL_MODEL_PATH)
    tokenizer.save_pretrained(FINAL_MODEL_PATH)
    
    print(f"✅ Final 4-bit + QLoRA merged model saved to: {FINAL_MODEL_PATH}")
    print("🎉 4-bit BitsAndBytes + QLoRA training completed successfully!")


if __name__ == "__main__":
    main()
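
The training config uses `lr_scheduler_type="cosine"` with `warmup_ratio=0.05`. The sketch below shows the shape of such a schedule in plain Python: linear warmup followed by cosine decay to zero. It approximates the usual behaviour and is not the library's own code; `lr_at_step` is a hypothetical helper, rounding the warmup length up is an assumption, and `TOTAL_STEPS` is an illustrative value.

```python
import math


def lr_at_step(step: int, total_steps: int, base_lr: float = 1e-4,
               warmup_ratio: float = 0.05) -> float:
    """Linear warmup to base_lr, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)  # assumed rounding
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 100  # hypothetical; the real value depends on dataset and batch size
for step in (0, 5, 50, 100):
    print(step, f"{lr_at_step(step, TOTAL_STEPS):.2e}")
```

With a single epoch, the learning rate peaks early and spends most of the run decaying, so the first few hundred samples are seen at a reduced rate.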

In [None]:
%%writefile accelerate_config.yaml
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 3  # Matches gradient_accumulation_steps in train.py
  gradient_clipping: 1.0
  train_batch_size: 36  # Effective batch size: 6 * 3 * 2 GPUs = 36
  train_micro_batch_size_per_gpu: 6  # Matches per_device_train_batch_size in train.py
  
  zero_stage: 2
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  
  stage3_gather_16bit_weights_on_model_save: false
  stage3_max_live_parameters: 1e8
  stage3_max_reuse_distance: 1e8
  stage3_prefetch_bucket_size: 5e7
  stage3_param_persistence_threshold: 1e5
  
  zero_allow_untested_optimizer: true
  zero_force_ds_cpu_optimizer: false
  
  fp16:
    enabled: true
    loss_scale: 0
    initial_scale_power: 16
    loss_scale_window: 1000
    hysteresis: 2
    min_loss_scale: 1
  
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_config:
  dynamo_backend: INDUCTOR
  dynamo_use_fullgraph: false
  dynamo_use_dynamic: false
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
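
DeepSpeed expects `train_batch_size` to equal the per-GPU micro batch size times the gradient accumulation steps times the number of processes. The sketch below checks that relationship using the values set in `train.py` (6 and 3) and `num_processes: 2`. Both helper functions are hypothetical names introduced for illustration.

```python
def effective_batch_size(micro_batch_per_gpu: int, grad_accum_steps: int,
                         num_processes: int) -> int:
    """Samples consumed per optimizer step across all GPUs."""
    return micro_batch_per_gpu * grad_accum_steps * num_processes


def is_consistent(train_batch_size: int, micro_batch_per_gpu: int,
                  grad_accum_steps: int, num_processes: int) -> bool:
    """True when the DeepSpeed batch-size fields agree with each other."""
    expected = effective_batch_size(micro_batch_per_gpu, grad_accum_steps, num_processes)
    return train_batch_size == expected


print(effective_batch_size(6, 3, 2))  # 36
print(is_consistent(36, 6, 3, 2))     # True
print(is_consistent(32, 6, 3, 2))     # False
```

Running a check like this before launching avoids a mismatch between the trainer arguments and the DeepSpeed section of the Accelerate config.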

In [None]:
!accelerate launch --config_file accelerate_config.yaml train.py

In [None]:
# Check training output and model files
import os
print("4-bit QLoRA adapter files:")
!ls -la qwen3_1.7b_4bit_qlora_output/
print("\nFinal 4-bit merged model files:")
!ls -la qwen3_1.7b_4bit_finetuned/

In [None]:
# Create a compressed archive for easier upload to Kaggle datasets
!tar -czf qwen3_1.7b_4bit_qlora_model.tar.gz qwen3_1.7b_4bit_finetuned/
print("4-bit QLoRA model archived as: qwen3_1.7b_4bit_qlora_model.tar.gz")
print("Upload this file to Kaggle as a dataset for use in the inference notebook.")
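
If a pure-Python alternative to the `tar` command is preferred, the standard-library `tarfile` module can build the same kind of archive and list what went into it. This is a minimal sketch; `archive_dir` is a hypothetical helper, and the call at the bottom only runs if the merged model directory exists.

```python
import tarfile
from pathlib import Path


def archive_dir(src_dir: str, archive_path: str) -> list[str]:
    """Create a gzip-compressed tar of src_dir and return the archived member names."""
    src = Path(src_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps only the directory name, not the full path, inside the archive.
        tar.add(str(src), arcname=src.name)
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()


if Path("qwen3_1.7b_4bit_finetuned").is_dir():
    names = archive_dir("qwen3_1.7b_4bit_finetuned", "qwen3_1.7b_4bit_qlora_model.tar.gz")
    print(f"Archived {len(names)} entries")
```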

# ⚡ Speed Optimization Guide for 2x T4 GPUs (2 × 16GB VRAM) - 4-bit QLoRA Edition

## Current Settings Status: ✅ **EXCELLENT** for 2x T4 GPUs with 4-bit
- **Memory**: 1.7B model (4-bit) + QLoRA fits in ~8-10GB per GPU
- **Batch Size**: 6 per device × 3 accumulation steps × 2 GPUs = 36 effective batch size
- **DeepSpeed**: ZeRO Stage 2 with FP16 (T4 GPUs have no native BF16 support)
- **Speed**: The memory saved by 4-bit weights is spent on a larger per-device batch

## 🚀 Additional Speed Optimizations for 4-bit QLoRA:

### **Quick Wins (Even Better with 4-bit):**
1. **Larger Batch Size** (Cell 7 - `train.py`, already applied):
   ```python
   per_device_train_batch_size=6,  # Raised from 4; 4-bit leaves room for 6-8
   gradient_accumulation_steps=3,   # Lowered from 4; use 2-3 to keep the effective batch size similar
   ```
   
2. **Faster Optimizer** (Cell 7 - `train.py`):
   ```python
   optim="adamw_torch_fused",  # Fused AdamW: faster steps, but more optimizer memory than paged_adamw_8bit
   ```

3. **Lower QLoRA Rank** (Cell 7 - `train.py`, already applied):
   ```python
   r=8,              # Lower rank means fewer trainable parameters
   lora_alpha=16,     # Scaled with r so that lora_alpha / r stays at 2
   ```
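
To make the rank trade-off concrete: a LoRA adapter on a linear layer mapping `d_in` to `d_out` adds `r * (d_in + d_out)` trainable parameters, and in standard LoRA its update is scaled by `lora_alpha / r`. The sketch below compares two rank settings. The layer width `D` is an illustrative assumption, and both helpers are hypothetical names.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters in one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha: float, r: int) -> float:
    """Scaling factor applied to the low-rank update in standard LoRA."""
    return lora_alpha / r


D = 2048  # illustrative layer width (assumption)
for r, alpha in ((16, 32), (8, 16)):
    params = lora_param_count(D, D, r)
    print(f"r={r:2d} alpha={alpha:2d} params={params} scaling={lora_scaling(alpha, r)}")
```

Halving the rank halves the adapter parameters per layer, while halving `lora_alpha` along with it keeps the update scale unchanged.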

## 💡 **Why 4-bit BitsAndBytes + QLoRA is Superior:**
1. **Standard QLoRA**: Reliable and well-tested
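2. **Memory Efficient**: 4-bit NF4 stores the base weights in roughly a quarter of the FP16 footprint
3. **No Pre-quantized Checkpoint**: BitsAndBytes quantizes the weights at load time, so no separate GPTQ step or auto-gptq dependency is needed
4. **Flexible**: Quantization settings are changed in `BitsAndBytesConfig` without re-exporting the model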
5. **Future-Proof**: Better PEFT ecosystem support
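
To illustrate what 4-bit storage means, here is a simplified blockwise absmax quantizer with 16 evenly spaced levels. It only shows the general idea: NF4 uses a non-uniform codebook suited to normally distributed weights, and BitsAndBytes implements quantization in optimized GPU kernels, so this is not the library's algorithm. Function names are hypothetical.

```python
def quantize_block(values: list[float], n_levels: int = 16) -> tuple[list[int], float]:
    """Map a block of floats to integer codes in [0, n_levels - 1] plus one scale."""
    absmax = max(abs(v) for v in values) or 1.0  # fall back to 1.0 for an all-zero block
    codes = []
    for v in values:
        normalized = v / absmax  # now in [-1, 1]
        codes.append(round((normalized + 1.0) / 2.0 * (n_levels - 1)))
    return codes, absmax


def dequantize_block(codes: list[int], absmax: float, n_levels: int = 16) -> list[float]:
    """Invert quantize_block up to rounding error."""
    return [(c / (n_levels - 1) * 2.0 - 1.0) * absmax for c in codes]


block = [0.5, -1.2, 0.03, 0.9]
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)
print("codes:   ", codes)
print("restored:", [round(x, 3) for x in restored])
```

Each value costs 4 bits plus a share of the per-block scale, and the reconstruction error is bounded by half a quantization step.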