# Qwen3 1.7B Inference with Test-Time Training (GPTQ + LoRA)

This notebook runs inference with test-time training: a GPTQ-quantized Qwen3-1.7B model is adapted with LoRA on the labelled examples that ship with the test set, then used to generate predictions.

**Key Changes from BitsAndBytes Version:**
- Uses GPTQ quantized Qwen3-1.7B model (Int4/Int8)
- Uses auto-gptq for quantization handling
- Uses standard LoRA (no DoRA support with GPTQ)
- Performs test-time training on test data before predictions
- Compatible with pre-quantized models from Kaggle

**Benefits of GPTQ + LoRA:**
- **Pre-quantized**: The model is already quantized, so no quantization pass runs at load time
- **Stable**: Quantized weights are fixed ahead of time, so every run starts from the same weights
- **Memory Efficient**: Int4/Int8 quantization reduces VRAM usage
- **Test Drive**: Verifies setup with small sample before full inference
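
As a rough illustration of the memory claim, the sketch below estimates weight storage at different bit widths. The parameter count of 1.7 billion is an assumption taken from the model name, and the estimate ignores GPTQ's per-group scales and zero-points, any layers left unquantized (such as embeddings), activations, and optimizer state, so real usage will be higher.

```python
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate storage for the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


NUM_PARAMS = 1.7e9  # assumption: nominal size implied by "Qwen3-1.7B"

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gib(NUM_PARAMS, bits):.2f} GiB")
```

Going from 16-bit to 4-bit weights cuts weight storage by a factor of four under these assumptions.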

In [None]:
# Install dependencies - GPTQ + LoRA setup
!uv pip install --system --no-index --find-links='/kaggle/input/jigsaw-packages2/whls/' 'trl==0.21.0' 'optimum==1.27.0' 'auto-gptq==0.7.1' 'deepspeed==0.17.4' 'logits-processor-zoo==0.2.1' 'vllm==0.10.0'
!uv pip install --system --no-index --find-links='/kaggle/input/jigsaw-packages2/whls/' 'triton==3.2.0'
!uv pip install --system --no-index --find-links='/kaggle/input/jigsaw-packages2/whls/' 'clean-text'
# Install latest PEFT for LoRA support (no BitsAndBytes needed for GPTQ)
!uv pip install --system --no-index -U --no-deps --find-links='/kaggle/input/jigsaw-packages2/whls/' 'peft' 'accelerate' 'datasets'

print("✅ Dependencies installed for GPTQ + LoRA setup")
print("📁 Models will be loaded from GPTQ datasets on Kaggle")

# 1. Test Drive Training (Verification on Test Data Sample)

This section performs a quick test drive on the first 100 test examples to verify the setup works correctly. The fine-tuned model from this test is **not used** for final predictions.
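
The test-time training set is built from the labelled examples that accompany each test row, not from the unlabelled comment itself. The sketch below is a simplified, pandas-free illustration of what `get_testtime_dataframe` in `utils.py` does for a single row: each of the four example comments becomes a training target in turn, with the other example of the same class kept as the in-prompt demonstration. The `expand_row` helper and the sample row are hypothetical; the real function also drops duplicate rows.

```python
import random


def expand_row(row: dict, rng: random.Random) -> list:
    """Turn one test row into four labelled training examples."""
    examples = []
    for label, kind, other in ((1, "positive", "negative"), (0, "negative", "positive")):
        for i in (1, 2):
            examples.append({
                "rule": row["rule"],
                "subreddit": row["subreddit"],
                # Example i is the comment to classify
                "body": row[f"{kind}_example_{i}"],
                # The other example of the same class stays in the prompt
                f"{kind}_example": row[f"{kind}_example_{3 - i}"],
                # The opposite-class demonstration is picked at random
                f"{other}_example": row[f"{other}_example_{rng.choice((1, 2))}"],
                "rule_violation": label,
            })
    return examples


# Hypothetical row, for illustration only
row = {
    "rule": "No advertising",
    "subreddit": "example",
    "positive_example_1": "Buy my course now",
    "positive_example_2": "Cheap followers at my site",
    "negative_example_1": "I liked this article",
    "negative_example_2": "Does anyone have a source?",
}

samples = expand_row(row, random.Random(42))
print(len(samples), [s["rule_violation"] for s in samples])
```

With `sample_size=100`, this expansion yields at most 400 training samples before deduplication.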

In [None]:
%%writefile constants.py
# GPTQ model paths
BASE_MODEL_PATH = "/kaggle/input/qwen3-gptq/transformers/1.7b-gptq-int4/1"  # TODO: Update this path
PRETRAINED_MODEL_PATH = "/kaggle/input/qwen3-gptq-finetuned/qwen3_1.7b_gptq_finetuned/"  # TODO: Update this path

LORA_PATH = "qwen3_1.7b_gptq_lora_output/"  # GPTQ LoRA output path
TESTTIME_MODEL_PATH = "qwen3_1.7b_gptq_testtime/"  # Path for test-time fine-tuned model
DATA_PATH = "/kaggle/input/jigsaw-agile-community-rules/"

POSITIVE_ANSWER = "Yes"
NEGATIVE_ANSWER = "No"
COMPLETE_PHRASE = "Answer:"
BASE_PROMPT = '''You are a moderator of subreddit.  given a comment from reddit and a rule. Your task is to classify whether the comment violates the rule. Only respond Yes/No.'''

print("✅ Using GPTQ model paths from Kaggle inputs")

In [None]:
%%writefile utils.py
import pandas as pd
from datasets import Dataset
from constants import POSITIVE_ANSWER, NEGATIVE_ANSWER, COMPLETE_PHRASE, BASE_PROMPT
import random, numpy as np
random.seed(42)
np.random.seed(42)


def build_prompt(row):
    return f"""
{BASE_PROMPT}

Subreddit: r/{row["subreddit"]}
Rule: {row["rule"]}
Examples:
1) {row["positive_example"]}
{COMPLETE_PHRASE} Yes

2) {row["negative_example"]}
{COMPLETE_PHRASE} No

---
Comment: {row["body"]}
{COMPLETE_PHRASE}"""


def get_testtime_dataframe(data_path, sample_size=None):
    """Process test data for test-time training"""
    test_dataset = pd.read_csv(f"{data_path}/test.csv")
    
    if sample_size:
        test_dataset = test_dataset.head(sample_size)
    
    flatten = []
    
    # Process test data for test-time training
    for violation_type in ["positive", "negative"]:
        for i in range(1, 3):
            sub_dataset = test_dataset[["rule","subreddit",
                                        "positive_example_1","positive_example_2",
                                        "negative_example_1","negative_example_2"]].copy()

            if violation_type == "positive":
                body_col = f"positive_example_{i}"
                other_positive_col = f"positive_example_{3-i}"
                sub_dataset["body"] = sub_dataset[body_col]
                sub_dataset["positive_example"] = sub_dataset[other_positive_col]
                sub_dataset["negative_example"] = np.where(
                    np.random.rand(len(sub_dataset)) < 0.5,
                    sub_dataset["negative_example_1"],
                    sub_dataset["negative_example_2"]
                )
                sub_dataset["rule_violation"] = 1

            else:  # violation_type == "negative"
                body_col = f"negative_example_{i}"
                other_negative_col = f"negative_example_{3-i}"
                sub_dataset["body"] = sub_dataset[body_col]
                sub_dataset["negative_example"] = sub_dataset[other_negative_col]
                sub_dataset["positive_example"] = np.where(
                    np.random.rand(len(sub_dataset)) < 0.5,
                    sub_dataset["positive_example_1"],
                    sub_dataset["positive_example_2"]
                )
                sub_dataset["rule_violation"] = 0

            sub_dataset.drop(columns=["positive_example_1","positive_example_2",
                                      "negative_example_1","negative_example_2"], inplace=True)

            flatten.append(sub_dataset)

    # Merge all DataFrames
    dataframe = pd.concat(flatten, axis=0)
    dataframe = dataframe.drop_duplicates(ignore_index=True)

    return dataframe


def get_dataframe_to_train(data_path, training_only=True):
    """Modified: Only use training data when training_only=True"""
    train_dataset = pd.read_csv(f"{data_path}/train.csv")
    
    flatten = []

    # ---------- Process training data ----------
    train_df = train_dataset[["body", "rule", "subreddit", "rule_violation",
                              "positive_example_1","positive_example_2",
                              "negative_example_1","negative_example_2"]].copy()

    # Randomly select positive_example and negative_example
    train_df["positive_example"] = np.where(
        np.random.rand(len(train_df)) < 0.5,
        train_df["positive_example_1"],
        train_df["positive_example_2"]
    )
    train_df["negative_example"] = np.where(
        np.random.rand(len(train_df)) < 0.5,
        train_df["negative_example_1"],
        train_df["negative_example_2"]
    )

    # Drop original candidate columns
    train_df.drop(columns=["positive_example_1","positive_example_2",
                           "negative_example_1","negative_example_2"], inplace=True)

    flatten.append(train_df)
    
    # Changed: Skip test data processing when training_only=True
    if not training_only:
        test_dataset = pd.read_csv(f"{data_path}/test.csv").sample(frac=0.5, random_state=42).reset_index(drop=True)
        
        # ---------- Process test data ----------
        for violation_type in ["positive", "negative"]:
            for i in range(1, 3):
                sub_dataset = test_dataset[["rule","subreddit",
                                            "positive_example_1","positive_example_2",
                                            "negative_example_1","negative_example_2"]].copy()

                if violation_type == "positive":
                    body_col = f"positive_example_{i}"
                    other_positive_col = f"positive_example_{3-i}"
                    sub_dataset["body"] = sub_dataset[body_col]
                    sub_dataset["positive_example"] = sub_dataset[other_positive_col]
                    sub_dataset["negative_example"] = np.where(
                        np.random.rand(len(sub_dataset)) < 0.5,
                        sub_dataset["negative_example_1"],
                        sub_dataset["negative_example_2"]
                    )
                    sub_dataset["rule_violation"] = 1

                else:  # violation_type == "negative"
                    body_col = f"negative_example_{i}"
                    other_negative_col = f"negative_example_{3-i}"
                    sub_dataset["body"] = sub_dataset[body_col]
                    sub_dataset["negative_example"] = sub_dataset[other_negative_col]
                    sub_dataset["positive_example"] = np.where(
                        np.random.rand(len(sub_dataset)) < 0.5,
                        sub_dataset["positive_example_1"],
                        sub_dataset["positive_example_2"]
                    )
                    sub_dataset["rule_violation"] = 0

                sub_dataset.drop(columns=["positive_example_1","positive_example_2",
                                          "negative_example_1","negative_example_2"], inplace=True)

                flatten.append(sub_dataset)

    # Merge all DataFrames
    dataframe = pd.concat(flatten, axis=0)
    dataframe = dataframe.drop_duplicates(ignore_index=True)

    return dataframe


def build_dataset(dataframe):
    dataframe["prompt"] = dataframe.apply(build_prompt, axis=1)

    columns = ["prompt"]
    if "rule_violation" in dataframe:
        dataframe["completion"] = dataframe["rule_violation"].map(
            {
                1: POSITIVE_ANSWER,
                0: NEGATIVE_ANSWER,
            }
        )
        columns.append("completion")

    dataframe = dataframe[columns]
    dataset = Dataset.from_pandas(dataframe)
    return dataset

In [None]:
%%writefile test_drive.py
import pandas as pd
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig
from transformers.utils import is_torch_bf16_gpu_available
from utils import build_dataset, get_testtime_dataframe
from constants import DATA_PATH, BASE_MODEL_PATH


def test_drive_training():
    print("🚀 Starting test drive training on first 100 test examples...")
    
    # Load first 100 test examples for test-time training
    dataframe = get_testtime_dataframe(DATA_PATH, sample_size=100)
    train_dataset = build_dataset(dataframe)
    
    print(f"Test drive dataset size: {len(train_dataset)} samples")
    
    # LoRA configuration for test drive (same as TT-1)
    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.1,
        bias="none",
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
        task_type="CAUSAL_LM",
        # No use_dora=True - GPTQ doesn't support DoRA
    )
    print("✅ LoRA config created for test drive")
    
    # Load the GPTQ model (no quantization config needed - already quantized)
    base_model = AutoModelForCausalLM.from_pretrained(
        BASE_MODEL_PATH,
        torch_dtype=torch.float16,
        device_map="auto",
        trust_remote_code=True,
        local_files_only=True,
    )
    print("✅ GPTQ model loaded for test drive")
    
    # Short training config for test drive
    training_args = SFTConfig(
        num_train_epochs=1,  # Very short training
        
        per_device_train_batch_size=2,  # Smaller for GPTQ
        gradient_accumulation_steps=8,
        
        optim="paged_adamw_8bit",  # Provided by bitsandbytes, which must be installed for this optimizer
        learning_rate=1e-4,
        weight_decay=0.01,
        max_grad_norm=1.0,
        
        lr_scheduler_type="cosine",
        warmup_ratio=0.05,
        
        bf16=is_torch_bf16_gpu_available(),
        fp16=not is_torch_bf16_gpu_available(),
        dataloader_pin_memory=True,
        
        gradient_checkpointing=True,
        gradient_checkpointing_kwargs={"use_reentrant": False},
    
        save_strategy="no",  # Don't save test drive model
        output_dir="test_drive_output/",
        logging_steps=10,
        report_to="none",
    
        completion_only_loss=True,
        packing=False,
        remove_unused_columns=False,
    )
    print("✅ Test drive training config created")
    
    # Create trainer for test drive
    trainer = SFTTrainer(
        model=base_model,
        args=training_args,
        train_dataset=train_dataset,
        peft_config=lora_config,
    )
    
    # Run test drive training
    trainer.train()
    print("✅ Test drive training completed successfully!")
    print("📝 Setup verified - proceeding to test-time training...")
    
    # Clean up to free memory
    del trainer, base_model
    torch.cuda.empty_cache()
    
    return True


if __name__ == "__main__":
    test_drive_training()

In [None]:
# Run test drive training
!python test_drive.py

# 2. Test-Time Training on Full Test Data

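Now that the setup is verified, we run test-time training on the full test dataset, starting from our fine-tuned GPTQ model (or from the base GPTQ model if the fine-tuned checkpoint is not found), and then generate predictions.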

In [None]:
%%writefile inference.py
import pandas as pd
import torch
import os
from tqdm import tqdm

from trl import SFTTrainer, SFTConfig
from peft import LoraConfig, get_peft_model, PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.utils import is_torch_bf16_gpu_available
from utils import build_dataset, get_testtime_dataframe, build_prompt
from constants import (DATA_PATH, BASE_MODEL_PATH, PRETRAINED_MODEL_PATH, 
                      TESTTIME_MODEL_PATH, POSITIVE_ANSWER, NEGATIVE_ANSWER, COMPLETE_PHRASE)


def test_time_training():
    print("🚀 Starting test-time training on full test dataset...")
    
    # Load full test data for test-time training
    dataframe = get_testtime_dataframe(DATA_PATH)
    train_dataset = build_dataset(dataframe)
    
    print(f"Test-time training dataset size: {len(train_dataset)} samples")
    
    # LoRA configuration for test-time training (same as training)
    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.1,
        bias="none",
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
        task_type="CAUSAL_LM",
        # No use_dora=True - GPTQ doesn't support DoRA
    )
    print("✅ LoRA config created for test-time training")
    
    # Check if pretrained model exists, otherwise use base GPTQ model
    if os.path.exists(PRETRAINED_MODEL_PATH):
        print(f"📦 Loading fine-tuned GPTQ model from: {PRETRAINED_MODEL_PATH}")
        base_model = AutoModelForCausalLM.from_pretrained(
            PRETRAINED_MODEL_PATH,
            torch_dtype=torch.float16,
            device_map="auto",
            trust_remote_code=True,
            local_files_only=True,
        )
    else:
        print(f"📦 Fine-tuned model not found, using base GPTQ model from: {BASE_MODEL_PATH}")
        base_model = AutoModelForCausalLM.from_pretrained(
            BASE_MODEL_PATH,
            torch_dtype=torch.float16,
            device_map="auto",
            trust_remote_code=True,
            local_files_only=True,
        )
    
    print("✅ GPTQ model loaded for test-time training")
    
    # Test-time training configuration
    training_args = SFTConfig(
        num_train_epochs=1,  # Single epoch for test-time training
        
        # GPTQ batch sizes for test-time training
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        
        optim="paged_adamw_8bit",  # Provided by bitsandbytes, which must be installed for this optimizer
        learning_rate=2e-5,  # Lower learning rate for test-time training
        weight_decay=0.01,
        max_grad_norm=1.0,
        
        lr_scheduler_type="cosine",
        warmup_ratio=0.05,
        
        bf16=is_torch_bf16_gpu_available(),
        fp16=not is_torch_bf16_gpu_available(),
        dataloader_pin_memory=True,
        
        gradient_checkpointing=True,
        gradient_checkpointing_kwargs={"use_reentrant": False},
    
        save_strategy="epoch",
        output_dir=TESTTIME_MODEL_PATH,
        logging_steps=50,
        report_to="none",
    
        completion_only_loss=True,
        packing=False,
        remove_unused_columns=False,
    )
    print("✅ Test-time training config created")
    
    # Create trainer for test-time training
    trainer = SFTTrainer(
        model=base_model,
        args=training_args,
        train_dataset=train_dataset,
        peft_config=lora_config,
    )
    
    # Run test-time training
    trainer.train()
    
    # Save the test-time trained model
    trainer.save_model(TESTTIME_MODEL_PATH)
    print(f"✅ Test-time trained model saved to: {TESTTIME_MODEL_PATH}")
    
    # Reload the model and attach the test-time LoRA adapters for inference
    print("🔄 Loading test-time LoRA adapters...")

    # Use the fine-tuned GPTQ model if present, otherwise the base GPTQ model
    model_path = PRETRAINED_MODEL_PATH if os.path.exists(PRETRAINED_MODEL_PATH) else BASE_MODEL_PATH
    base_model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.float16,
        device_map="auto",
        trust_remote_code=True,
        local_files_only=True,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, local_files_only=True)

    # PEFT cannot merge LoRA weights into GPTQ-quantized layers, so the adapters
    # stay attached instead of calling merge_and_unload()
    peft_model = PeftModel.from_pretrained(base_model, TESTTIME_MODEL_PATH)
    peft_model.eval()

    # Save the tokenizer next to the adapters so the output directory is self-contained
    tokenizer.save_pretrained(TESTTIME_MODEL_PATH)

    print(f"✅ Test-time adapters loaded from: {TESTTIME_MODEL_PATH}")

    return peft_model, tokenizer


def generate_predictions(model, tokenizer):
    print("🔮 Generating predictions for test dataset...")
    
    # Load test data
    test_df = pd.read_csv(f"{DATA_PATH}/test.csv")
    
    predictions = []
    
    for idx, row in tqdm(test_df.iterrows(), total=len(test_df), desc="Generating predictions"):
        prompt = build_prompt(row)
        
        # Tokenize the prompt
        inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=2048)
        inputs = {k: v.to(model.device) for k, v in inputs.items()}
        
        # Generate prediction
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=5,  # We only need "Yes" or "No"
                do_sample=False,  # Greedy decoding, so no temperature is needed
                pad_token_id=tokenizer.eos_token_id,
                repetition_penalty=1.1,
            )
        
        # Extract the generated text
        generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        
        # Extract the answer after "Answer:"
        if COMPLETE_PHRASE in generated_text:
            answer_part = generated_text.split(COMPLETE_PHRASE)[-1].strip()
            if POSITIVE_ANSWER.lower() in answer_part.lower():
                prediction = 1
            elif NEGATIVE_ANSWER.lower() in answer_part.lower():
                prediction = 0
            else:
                # Default to negative if unclear
                prediction = 0
        else:
            # Default to negative if no answer found
            prediction = 0
        
        predictions.append(prediction)
    
    # Create submission DataFrame
    submission_df = pd.DataFrame({
        'id': test_df['id'],
        'rule_violation': predictions
    })
    
    submission_df.to_csv('/kaggle/working/submission.csv', index=False)
    print("✅ Submission saved to: /kaggle/working/submission.csv")
    
    # Show prediction distribution
    print(f"\nPrediction Distribution:")
    print(f"No Violation (0): {sum(p == 0 for p in predictions)} ({sum(p == 0 for p in predictions)/len(predictions)*100:.1f}%)")
    print(f"Violation (1): {sum(p == 1 for p in predictions)} ({sum(p == 1 for p in predictions)/len(predictions)*100:.1f}%)")
    
    return submission_df


def main():
    print("📝 Note: Using GPTQ model for test-time training and inference")
    
    # Step 1: Perform test-time training
    model, tokenizer = test_time_training()
    
    # Step 2: Generate predictions
    submission_df = generate_predictions(model, tokenizer)
    
    print("🎉 GPTQ test-time training and inference completed successfully!")
    return submission_df


if __name__ == "__main__":
    main()

In [None]:
%%writefile accelerate_config.yaml
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 8  # Higher accumulation for GPTQ
  gradient_clipping: 1.0
  train_batch_size: 32  # Same effective batch size: 2*8*2 = 32
  train_micro_batch_size_per_gpu: 2  # Lower for GPTQ memory usage
  
  zero_stage: 2
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  
  stage3_gather_16bit_weights_on_model_save: false
  stage3_max_live_parameters: 1e8
  stage3_max_reuse_distance: 1e8
  stage3_prefetch_bucket_size: 5e7
  stage3_param_persistence_threshold: 1e5
  
  zero_allow_untested_optimizer: true
  zero_force_ds_cpu_optimizer: false
  
  fp16:
    enabled: true
    loss_scale: 0
    initial_scale_power: 16
    loss_scale_window: 1000
    hysteresis: 2
    min_loss_scale: 1
  
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_config:
  dynamo_backend: INDUCTOR
  dynamo_use_fullgraph: false
  dynamo_use_dynamic: false
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

In [None]:
!accelerate launch --config_file accelerate_config.yaml inference.py

In [None]:
# Verify submission file
import pandas as pd
submission = pd.read_csv('/kaggle/working/submission.csv')
print(f"Submission shape: {submission.shape}")
print(f"Submission head:\n{submission.head()}")
print(f"Value counts:\n{submission['rule_violation'].value_counts()}")

# ⚡ Speed Optimization Guide for 2x T4 GPUs (2 × 16 GB VRAM) - GPTQ Edition

## Current Settings Status: ✅ **GOOD** for 2x T4 GPUs with GPTQ
- **Memory**: GPTQ Int4 model + LoRA fits in ~10-14GB per GPU
- **Batch Size**: 2 per device × 8 accumulation steps × 2 GPUs = 32 effective batch size
- **DeepSpeed**: ZeRO Stage 2 with FP16, which shards optimizer state and gradients across the two GPUs
- **Speed**: Stable with GPTQ pre-quantization
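
The effective batch size is the product of the per-device batch size, the gradient accumulation steps, and the number of GPUs, and DeepSpeed expects it to agree with `train_batch_size` in `accelerate_config.yaml`. A minimal sketch of that check follows; the helper name is illustrative and not part of any library.

```python
def effective_batch_size(per_device: int, accumulation_steps: int, num_gpus: int) -> int:
    """Samples contributing to one optimizer step across all GPUs."""
    return per_device * accumulation_steps * num_gpus


TRAIN_BATCH_SIZE = 32  # value of train_batch_size in accelerate_config.yaml

current = effective_batch_size(per_device=2, accumulation_steps=8, num_gpus=2)
alternative = effective_batch_size(per_device=4, accumulation_steps=4, num_gpus=2)

print(current == TRAIN_BATCH_SIZE, alternative == TRAIN_BATCH_SIZE)
```

Any change to the per-device batch size or accumulation steps in the training script should be mirrored in the YAML file so the three numbers stay consistent.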

## 🚀 Additional Speed Optimizations for GPTQ:

### **Quick Wins for Test-Time Training:**
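1. **Increase Batch Size Carefully** (Cell 9 - `inference.py`):
   ```python
   per_device_train_batch_size=4,  # Try if VRAM allows
   gradient_accumulation_steps=4,  # 4 × 4 × 2 GPUs keeps the effective batch size at 32
   ```
   Update `train_micro_batch_size_per_gpu` and `gradient_accumulation_steps` in `accelerate_config.yaml` to match.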
   
2. **Faster Optimizer** (Cell 9 - `inference.py`):
   ```python
   optim="adamw_torch_fused",  # If PyTorch 2.0+
   ```

3. **Reduce LoRA Rank** (Cell 9 - `inference.py`):
   ```python
   r=8,              # Can reduce from 16 to 8 for faster training
   lora_alpha=16,     # Adjust proportionally
   ```
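
To see what a lower rank changes, note that a LoRA adapter on a linear layer with `d_in` inputs and `d_out` outputs adds two matrices, `A` (`r × d_in`) and `B` (`d_out × r`), so it has `r * (d_in + d_out)` trainable parameters, and its update is scaled by `lora_alpha / r`. The sketch below uses a made-up layer shape rather than the real Qwen3-1.7B dimensions, and the helper names are illustrative.

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in the A (r x d_in) and B (d_out x r) matrices of one adapter."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha: float, r: int) -> float:
    """Factor applied to the low-rank update in standard LoRA."""
    return lora_alpha / r


D_IN, D_OUT = 2048, 2048  # hypothetical layer shape, for illustration only

for r, alpha in ((16, 32), (8, 16)):
    params = lora_trainable_params(D_IN, D_OUT, r)
    print(f"r={r:>2}  alpha={alpha:>2}  params={params}  scaling={lora_scaling(alpha, r)}")
```

Halving `r` together with `lora_alpha` keeps the scaling factor unchanged while halving the adapter's parameters and optimizer state. How much wall-clock time that saves is not measured here.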

### **Inference Speed Optimizations:**
1. **Batch Inference** (Cell 9 - `inference.py`):
   ```python
   # Process multiple samples at once
   batch_size = 4  # Can increase if VRAM allows
   ```

2. **Optimized Generation** (Cell 9 - `inference.py`):
   ```python
   max_new_tokens=3,     # Reduce from 5 to 3
   do_sample=False,      # Keep deterministic
   use_cache=True,       # Enable KV cache
   ```
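
Batch inference, as suggested in the first item above, amounts to grouping prompts and calling `generate` once per group. The sketch below shows only the grouping step, using a hypothetical `batched` helper. Wiring it into `generate_predictions` is left as an assumption; with a decoder-only model the tokenizer would typically also need `padding=True` and `padding_side="left"` so that generated tokens follow each prompt directly.

```python
from typing import Iterator, Sequence


def batched(items: Sequence, batch_size: int) -> Iterator[list]:
    """Yield consecutive chunks of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Placeholder prompts; in the notebook these would come from build_prompt
prompts = [f"prompt {i}" for i in range(10)]

for batch in batched(prompts, batch_size=4):
    # Each batch would be tokenized together and passed to model.generate once
    print(len(batch), batch[0])
```

The final batch may be smaller than `batch_size`, so any code that maps outputs back to rows should rely on the batch's actual length.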

## 💡 **GPTQ Performance Notes:**
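1. **Memory**: GPTQ weights are stored pre-quantized; VRAM use may be somewhat higher than with 4-bit BitsAndBytes, depending on group size and kernel
2. **Speed**: Generally stable; throughput depends on the GPTQ kernel in use and may be lower than with load-time (BitsAndBytes) quantization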
3. **Quality**: Consistent quantization quality, good for production
4. **Compatibility**: Works with standard LoRA configurations; DoRA is not supported with GPTQ