# Qwen3 1.7B Test-Time Inference Notebook (4-bit BitsAndBytes + QLoRA)

This notebook loads the previously fine-tuned Qwen3-1.7B model (4-bit), performs test-time training on the test data only, and then generates the submission file.

**Key Changes from DoRA Version:**
- Uses 4-bit BitsAndBytes quantized model (from 4-bit training notebook)
- Uses standard QLoRA (without DoRA) for test-time adaptation
- Optimized for Kaggle runtime constraints with lower memory usage
- Compatible with the 4-bit training notebook output

**Prerequisites:**
- Upload the trained 4-bit model from the 4-bit training notebook as a Kaggle dataset
- Update `MODEL_DATASET_PATH` to point to your uploaded 4-bit model dataset
- Use `qwen3_1.7b_4bit_qlora_model.tar.gz` from the training notebook

**Benefits of 4-bit Inference:**
- **Lower VRAM**: ~6-8GB per GPU
- **Faster loading**: 4-bit models load faster
- **Better compatibility**: Full QLoRA support for test-time adaptation
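
As a rough cross-check on these figures, the memory taken by the weights alone can be estimated from the parameter count and the number of bits per parameter. The sketch below is back-of-the-envelope arithmetic, not a measurement: the parameter count of 1.7e9 is assumed from the model name, the helper `weight_memory_gb` is hypothetical, and layers kept in higher precision, quantization constants, activations, and CUDA overhead are all ignored, so real VRAM usage will be noticeably higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 1.7e9  # assumption: taken from the "1.7B" in the model name

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
four_bit_gb = weight_memory_gb(NUM_PARAMS, 4)

print(f"fp16 weights:  ~{fp16_gb:.2f} GB")
print(f"4-bit weights: ~{four_bit_gb:.2f} GB")
print(f"Reduction:     ~{fp16_gb / four_bit_gb:.0f}x")
```

The gap between this weights-only estimate and the per-GPU figure quoted above is what training activations, optimizer state, and framework overhead would have to account for.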

In [None]:
# Install dependencies - BitsAndBytes + QLoRA setup (no auto-gptq needed)
!uv pip install --system --no-index --find-links='/kaggle/input/jigsaw-packages2/whls/' 'trl==0.21.0' 'optimum==1.27.0' 'bitsandbytes==0.46.1' 'deepspeed==0.17.4' 'logits-processor-zoo==0.2.1' 'vllm==0.10.0'
!uv pip install --system --no-index --find-links='/kaggle/input/jigsaw-packages2/whls/' 'triton==3.2.0'
!uv pip install --system --no-index --find-links='/kaggle/input/jigsaw-packages2/whls/' 'clean-text'
# Upgrade PEFT, accelerate, and datasets to the newest wheels available in the offline wheel directory
!uv pip install --system --no-index -U --no-deps --find-links='/kaggle/input/jigsaw-packages2/whls/' 'peft' 'accelerate' 'datasets'

print("✅ Dependencies installed for 4-bit BitsAndBytes + QLoRA inference")

# 1. Test Drive Training (Verification on Training Data)

This section runs a quick test drive on the first 100 training examples to verify that the setup works. The model fine-tuned during this test is **not used** afterwards; the actual test-time training loads the pre-trained model separately.
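
The test drive attaches LoRA adapters with `r=8` and `lora_alpha=16` (see the `LoraConfig` in `test_drive.py`). For a linear layer with input size `d_in` and output size `d_out`, a LoRA adapter adds two low-rank matrices with `r * (d_in + d_out)` trainable parameters in total, and their product is scaled by `lora_alpha / r`. The sketch below only illustrates that arithmetic; the layer shape is hypothetical and is not read from the Qwen3 configuration, and the helper names are made up for this example.

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


def lora_scaling(lora_alpha: int, r: int) -> float:
    """Factor applied to the low-rank update B @ A."""
    return lora_alpha / r


D_IN, D_OUT = 2048, 2048  # hypothetical layer shape, for illustration only
R, ALPHA = 8, 16          # values from the LoraConfig in this notebook

added = lora_trainable_params(D_IN, D_OUT, R)
frozen = D_IN * D_OUT

print(f"Frozen weights in the layer: {frozen}")
print(f"Trainable LoRA parameters:   {added} ({100 * added / frozen:.2f}% of the layer)")
print(f"Update scaling factor:       {lora_scaling(ALPHA, R)}")
```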

In [None]:
%%writefile constants.py
# Base model path for test drive training
BASE_MODEL_PATH = "/kaggle/working/qwen3-1.7b"  # Local path to the original base model; must already exist (loaded with local_files_only=True)

# Update this path to your uploaded 4-bit model dataset on Kaggle
MODEL_DATASET_PATH = "/kaggle/input/qwen3-1-7b-4bit-qlora-model"  # TODO: Update this path
PRETRAINED_MODEL_PATH = MODEL_DATASET_PATH + "/qwen3_1.7b_4bit_finetuned/"  # Extracted 4-bit model path

# Test-time training paths
TESTTIME_LORA_PATH = "testtime_4bit_qlora_output/"
DATA_PATH = "/kaggle/input/jigsaw-agile-community-rules/"

POSITIVE_ANSWER = "Yes"
NEGATIVE_ANSWER = "No"
COMPLETE_PHRASE = "Answer:"
BASE_PROMPT = '''You are given a comment from reddit and a rule. Your task is to classify whether the comment violates the rule. Only respond Yes/No.'''
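
Because both scripts load models with `local_files_only=True`, a wrong path only surfaces once model loading starts. A small pre-flight check, sketched below, can report missing paths earlier. The helper `find_missing_paths` is hypothetical, and the paths are copied from `constants.py` as written, including the `MODEL_DATASET_PATH` value that still carries a TODO.

```python
from pathlib import Path


def find_missing_paths(paths: dict[str, str]) -> list[str]:
    """Return the names of all entries whose path does not exist on disk."""
    return [name for name, path in paths.items() if not Path(path).exists()]


# Values copied from constants.py; adjust them together with that file.
MODEL_DATASET_PATH = "/kaggle/input/qwen3-1-7b-4bit-qlora-model"
paths_to_check = {
    "BASE_MODEL_PATH": "/kaggle/working/qwen3-1.7b",
    "PRETRAINED_MODEL_PATH": MODEL_DATASET_PATH + "/qwen3_1.7b_4bit_finetuned/",
    "DATA_PATH": "/kaggle/input/jigsaw-agile-community-rules/",
}

missing = find_missing_paths(paths_to_check)
print("Missing paths:", missing if missing else "none")
```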

In [None]:
%%writefile utils.py
import pandas as pd
from datasets import Dataset
from constants import POSITIVE_ANSWER, NEGATIVE_ANSWER, COMPLETE_PHRASE, BASE_PROMPT
import random, numpy as np
random.seed(42)
np.random.seed(42)


def build_prompt(row):
    return f"""
{BASE_PROMPT}

Subreddit: r/{row["subreddit"]}
Rule: {row["rule"]}
Examples:
1) {row["positive_example"]}
{COMPLETE_PHRASE} Yes

2) {row["negative_example"]}
{COMPLETE_PHRASE} No

---
Comment: {row["body"]}
{COMPLETE_PHRASE}"""


def get_training_dataframe(data_path, sample_size=None):
    """Process training data for test drive"""
    train_dataset = pd.read_csv(f"{data_path}/train.csv")
    
    if sample_size:
        train_dataset = train_dataset.head(sample_size)
    
    # Process training data
    train_df = train_dataset[["body", "rule", "subreddit", "rule_violation",
                              "positive_example_1","positive_example_2",
                              "negative_example_1","negative_example_2"]].copy()

    # Randomly select positive_example and negative_example
    train_df["positive_example"] = np.where(
        np.random.rand(len(train_df)) < 0.5,
        train_df["positive_example_1"],
        train_df["positive_example_2"]
    )
    train_df["negative_example"] = np.where(
        np.random.rand(len(train_df)) < 0.5,
        train_df["negative_example_1"],
        train_df["negative_example_2"]
    )

    # Drop original candidate columns
    train_df.drop(columns=["positive_example_1","positive_example_2",
                           "negative_example_1","negative_example_2"], inplace=True)

    return train_df


def get_testtime_dataframe(data_path):
    """Build labeled test-time training rows from the examples in a random 50% sample of test.csv"""
    test_dataset = pd.read_csv(f"{data_path}/test.csv").sample(frac=0.5, random_state=42).reset_index(drop=True)
    
    flatten = []
    
    # ---------- Process test data only ----------
    for violation_type in ["positive", "negative"]:
        for i in range(1, 3):
            sub_dataset = test_dataset[["rule","subreddit",
                                        "positive_example_1","positive_example_2",
                                        "negative_example_1","negative_example_2"]].copy()

            if violation_type == "positive":
                body_col = f"positive_example_{i}"
                other_positive_col = f"positive_example_{3-i}"
                sub_dataset["body"] = sub_dataset[body_col]
                sub_dataset["positive_example"] = sub_dataset[other_positive_col]
                sub_dataset["negative_example"] = np.where(
                    np.random.rand(len(sub_dataset)) < 0.5,
                    sub_dataset["negative_example_1"],
                    sub_dataset["negative_example_2"]
                )
                sub_dataset["rule_violation"] = 1

            else:  # violation_type == "negative"
                body_col = f"negative_example_{i}"
                other_negative_col = f"negative_example_{3-i}"
                sub_dataset["body"] = sub_dataset[body_col]
                sub_dataset["negative_example"] = sub_dataset[other_negative_col]
                sub_dataset["positive_example"] = np.where(
                    np.random.rand(len(sub_dataset)) < 0.5,
                    sub_dataset["positive_example_1"],
                    sub_dataset["positive_example_2"]
                )
                sub_dataset["rule_violation"] = 0

            sub_dataset.drop(columns=["positive_example_1","positive_example_2",
                                      "negative_example_1","negative_example_2"], inplace=True)

            flatten.append(sub_dataset)

    # Merge all DataFrames
    dataframe = pd.concat(flatten, axis=0)
    dataframe = dataframe.drop_duplicates(ignore_index=True)

    return dataframe


def build_dataset(dataframe):
    dataframe["prompt"] = dataframe.apply(build_prompt, axis=1)

    columns = ["prompt"]
    if "rule_violation" in dataframe:
        dataframe["completion"] = dataframe["rule_violation"].map(
            {
                1: POSITIVE_ANSWER,
                0: NEGATIVE_ANSWER,
            }
        )
        columns.append("completion")

    dataframe = dataframe[columns]
    dataset = Dataset.from_pandas(dataframe)
    dataset.to_pandas().to_csv("/kaggle/working/testtime_dataset.csv", index=False)
    return dataset
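
To make the pairing logic in `get_testtime_dataframe` concrete, the sketch below applies the same idea to a single made-up row, using plain dictionaries instead of pandas. Each of the four examples becomes a labeled `body`; for a positive body, the other positive example is kept as the in-prompt demonstration (`3 - i` maps 1 to 2 and 2 to 1), so a body never appears as its own demonstration. The row contents and the helper name `expand_row` are hypothetical, and the random choice of the opposite-class example is simplified to always take the first one.

```python
def expand_row(row: dict) -> list[dict]:
    """Turn one test row into four labeled training examples."""
    examples = []
    for kind, label in (("positive", 1), ("negative", 0)):
        other_kind = "negative" if kind == "positive" else "positive"
        for i in (1, 2):
            examples.append({
                "rule": row["rule"],
                "subreddit": row["subreddit"],
                "body": row[f"{kind}_example_{i}"],
                # Same-class demonstration: the example that is NOT the body
                f"{kind}_example": row[f"{kind}_example_{3 - i}"],
                # Opposite-class demonstration: simplified, always the first one
                f"{other_kind}_example": row[f"{other_kind}_example_1"],
                "rule_violation": label,
            })
    return examples


toy_row = {  # made-up data for illustration
    "rule": "No advertising",
    "subreddit": "example",
    "positive_example_1": "buy my product A",
    "positive_example_2": "buy my product B",
    "negative_example_1": "nice photo",
    "negative_example_2": "thanks for sharing",
}

for ex in expand_row(toy_row):
    print(ex["rule_violation"], "|", ex["body"], "|",
          ex["positive_example"], "|", ex["negative_example"])
```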

In [None]:
%%writefile test_drive.py
import pandas as pd
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig
from transformers.utils import is_torch_bf16_gpu_available
from utils import build_dataset, get_training_dataframe
from constants import DATA_PATH, BASE_MODEL_PATH


def test_drive_training():
    print("🚀 Starting test drive training on first 100 training examples...")
    
    # Load first 100 training examples
    dataframe = get_training_dataframe(DATA_PATH, sample_size=100)
    train_dataset = build_dataset(dataframe)
    
    print(f"Test drive dataset size: {len(train_dataset)} samples")
    
    # BitsAndBytes 4-bit quantization config
    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4"
    )
    print("✅ BitsAndBytes 4-bit quantization config created")
    
    # QLoRA configuration for test drive
    lora_config = LoraConfig(
        r=8,
        lora_alpha=16,
        lora_dropout=0.045,
        bias="none",
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
        task_type="CAUSAL_LM",
    )
    print("✅ QLoRA config created for test drive")
    
    # Load the base model
    base_model = AutoModelForCausalLM.from_pretrained(
        BASE_MODEL_PATH,
        quantization_config=quantization_config,
        torch_dtype=torch.float16,
        device_map="auto",
        trust_remote_code=True,
        local_files_only=True,
    )
    print("✅ Base model loaded for test drive")
    
    # Short training config for test drive
    training_args = SFTConfig(
        num_train_epochs=1,  # Very short training
        
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        
        optim="paged_adamw_8bit",
        learning_rate=1e-4,
        weight_decay=0.01,
        max_grad_norm=1.0,
        
        lr_scheduler_type="cosine",
        warmup_ratio=0.05,
        
        bf16=is_torch_bf16_gpu_available(),
        fp16=not is_torch_bf16_gpu_available(),
        dataloader_pin_memory=True,
        
        gradient_checkpointing=True,
        gradient_checkpointing_kwargs={"use_reentrant": False},
    
        save_strategy="no",  # Don't save test drive model
        output_dir="test_drive_output/",
        logging_steps=10,
        report_to="none",
    
        completion_only_loss=True,
        packing=False,
        remove_unused_columns=False,
    )
    print("✅ Test drive training config created")
    
    # Create trainer for test drive
    trainer = SFTTrainer(
        model=base_model,
        args=training_args,
        train_dataset=train_dataset,
        peft_config=lora_config,
    )
    
    # Run test drive training
    trainer.train()
    print("✅ Test drive training completed successfully!")
    print("📝 Setup verified - proceeding to main test-time training...")
    
    # Clean up to free memory
    del trainer, base_model
    torch.cuda.empty_cache()
    
    return True


if __name__ == "__main__":
    test_drive_training()
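
With `per_device_train_batch_size=4` and `gradient_accumulation_steps=4`, each optimizer step sees several samples at once, so the 100-example test drive finishes in only a handful of steps. The sketch below works through that arithmetic. It assumes a single training process (`num_devices=1`) and rounds the last partial batch up; the exact count depends on how the trainer handles the final accumulation window, and the helper `training_steps` is hypothetical.

```python
import math


def training_steps(num_samples: int, per_device_batch_size: int,
                   grad_accum_steps: int, num_devices: int = 1,
                   epochs: int = 1) -> tuple[int, int]:
    """Return (effective batch size, approximate number of optimizer steps)."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps = math.ceil(num_samples / effective_batch) * epochs
    return effective_batch, steps


# Values from the test drive configuration above
effective_batch, steps = training_steps(
    num_samples=100, per_device_batch_size=4, grad_accum_steps=4
)
print(f"Effective batch size:        {effective_batch}")
print(f"Approximate optimizer steps: {steps}")
```

Since `logging_steps=10` in the test drive, a run this short may finish without printing any intermediate loss values.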

In [None]:
# Run test drive training
!python test_drive.py

# 2. Main Test-Time Training and Inference

Now that the setup is verified, we proceed with the actual test-time training using the pre-trained model and test data.
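
The training configuration in `inference.py` uses `lr_scheduler_type="cosine"` with `warmup_ratio=0.03` and a peak learning rate of `1e-4`. The sketch below shows the general shape of such a schedule: a linear ramp up to the peak, followed by a cosine decay towards zero. It is a standalone reimplementation for illustration, not the code the trainer runs; the total step count and the resulting number of warmup steps are made-up values.

```python
import math


def lr_at_step(step: int, total_steps: int, warmup_steps: int, peak_lr: float) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 200   # hypothetical; the real value depends on the dataset size
WARMUP_STEPS = 6    # 3% of the hypothetical 200 steps
PEAK_LR = 1e-4      # learning_rate from the test-time training config

for step in (0, 3, 6, 100, 200):
    lr = lr_at_step(step, TOTAL_STEPS, WARMUP_STEPS, PEAK_LR)
    print(f"step {step:>3}: lr = {lr:.2e}")
```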

In [None]:
%%writefile inference.py
import pandas as pd
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel, LoraConfig
from trl import SFTTrainer, SFTConfig
from tqdm.auto import tqdm
from transformers.utils import is_torch_bf16_gpu_available
from utils import build_dataset, get_testtime_dataframe
from constants import DATA_PATH, PRETRAINED_MODEL_PATH, TESTTIME_LORA_PATH


def main():
    print("📝 Note: Using original pre-trained model (not the test drive fine-tuned model)")
    
    # Load test data for test-time training
    dataframe = get_testtime_dataframe(DATA_PATH)
    test_dataset = build_dataset(dataframe)
    
    print(f"Test-time training dataset size: {len(test_dataset)} samples")
    
    # BitsAndBytes 4-bit quantization config (same as training)
    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4"
    )
    print("✅ BitsAndBytes 4-bit quantization config created")
    
    # QLoRA config for test-time adaptation (same settings as TT-1)
    testtime_lora_config = LoraConfig(
        r=8,  # Reduced from 16 for speed
        lora_alpha=16,  # From TT-1 config
        lora_dropout=0.045,  # From TT-1 config
        bias="none",
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
        task_type="CAUSAL_LM",
        # Removed use_dora=True for standard QLoRA
    )
    print("✅ QLoRA config created for test-time adaptation")
    
    # Test-time training config (shorter training)
    training_args = SFTConfig(
        num_train_epochs=1,  # Only 1 epoch for test-time adaptation
        
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        
        optim="paged_adamw_8bit",
        learning_rate=1e-4,  # Slightly lower for test-time
        weight_decay=0.01,
        max_grad_norm=1.0,
        
        lr_scheduler_type="cosine",
        warmup_ratio=0.03,
        
        bf16=is_torch_bf16_gpu_available(),
        fp16=not is_torch_bf16_gpu_available(),
        dataloader_pin_memory=True,
        
        gradient_checkpointing=True,
        gradient_checkpointing_kwargs={"use_reentrant": False},
    
        save_strategy="no",  # Don't save intermediate checkpoints
        output_dir=TESTTIME_LORA_PATH,
        logging_steps=50,
        report_to="none",
    
        completion_only_loss=True,
        packing=False,
        remove_unused_columns=False,
    )
    print("✅ Test-time training config created")
    
    # Load the pre-trained 4-bit model (original model, not test drive result)
    base_model = AutoModelForCausalLM.from_pretrained(
        PRETRAINED_MODEL_PATH,
        quantization_config=quantization_config,
        torch_dtype=torch.float16,
        device_map="auto",
        trust_remote_code=True,
        local_files_only=True,
    )
    print("✅ Pre-trained 4-bit model loaded (original model)")
    
    tokenizer = AutoTokenizer.from_pretrained(PRETRAINED_MODEL_PATH, trust_remote_code=True, local_files_only=True)
    
    # Test-time training with QLoRA
    trainer = SFTTrainer(
        model=base_model,
        args=training_args,
        train_dataset=test_dataset,
        peft_config=testtime_lora_config,
    )
    
    print("🚀 Starting test-time training with 4-bit QLoRA...")
    trainer.train()
    
    # Keep the model in memory for inference (don't save test-time adapters)
    print("✅ Test-time training completed - model ready for inference")
    
    return trainer.model, tokenizer


def generate_predictions(model, tokenizer, test_df):
    """Generate predictions for the test set"""
    # test.csv only provides positive_example_1/2 and negative_example_1/2,
    # so use the first of each as the in-prompt examples.
    test_df = test_df.copy()
    if "positive_example" not in test_df:
        test_df["positive_example"] = test_df["positive_example_1"]
    if "negative_example" not in test_df:
        test_df["negative_example"] = test_df["negative_example_1"]

    # The model is still in training mode after trainer.train(); disable dropout
    model.eval()
    predictions = []
    
    for idx, row in tqdm(test_df.iterrows(), total=len(test_df), desc="Generating predictions"):
        prompt = f"""
You are given a comment from reddit and a rule. Your task is to classify whether the comment violates the rule. Only respond Yes/No.

Subreddit: r/{row["subreddit"]}
Rule: {row["rule"]}
Examples:
1) {row["positive_example"]}
Answer: Yes

2) {row["negative_example"]}
Answer: No

---
Comment: {row["body"]}
Answer:"""
        
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=10,
                do_sample=False,  # Greedy decoding; a temperature setting would have no effect
                pad_token_id=tokenizer.eos_token_id,
            )
        
        response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True).strip()
        
        # Map the response to a label: 1 if it contains "yes", otherwise 0
        # (an explicit "no" and an unclear response both default to 0)
        predictions.append(1 if "yes" in response.lower() else 0)
    
    return predictions


if __name__ == "__main__":
    # Run test-time training
    model, tokenizer = main()
    
    # Load test data for inference
    test_df = pd.read_csv(f"{DATA_PATH}/test.csv")
    
    # Generate predictions
    predictions = generate_predictions(model, tokenizer, test_df)
    
    # Create submission; the column names below must match the format the competition expects
    submission = pd.DataFrame({
        "id": test_df["id"],
        "prediction": predictions
    })
    
    submission.to_csv("/kaggle/working/submission.csv", index=False)
    print("✅ Submission file created: /kaggle/working/submission.csv")
    print("🎉 4-bit BitsAndBytes + QLoRA inference completed successfully!")
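
Mapping the generated text to a label by substring matching can misfire, for example on a word like "yesterday", which contains "yes". One possible refinement, shown here as a sketch rather than as the notebook's method, matches whole words with a regular expression and uses the first one found. The helper name `parse_answer` is hypothetical, and defaulting to 0 on unclear output mirrors the fallback used above.

```python
import re

# Whole-word match for "yes" or "no", case-insensitive
_ANSWER_RE = re.compile(r"\b(yes|no)\b", flags=re.IGNORECASE)


def parse_answer(response: str, default: int = 0) -> int:
    """Return 1 for a whole-word "yes", 0 for "no", else the default."""
    match = _ANSWER_RE.search(response)
    if match is None:
        return default
    return 1 if match.group(1).lower() == "yes" else 0


for text in ["Yes", " no.", "yesterday", "I do not know"]:
    print(repr(text), "->", parse_answer(text))
```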

In [None]:
# Run the inference script
!python inference.py

In [None]:
# Check submission file
import pandas as pd
submission = pd.read_csv("/kaggle/working/submission.csv")
print(f"Submission shape: {submission.shape}")
print("First 5 predictions:")
print(submission.head())
print("\nPrediction distribution:")
print(submission["prediction"].value_counts())

# ⚡ Performance Notes for 4-bit QLoRA Inference

## Memory Usage:
- **4-bit Model**: ~6-8GB VRAM per GPU
- **Test-time Training**: Additional ~2GB for QLoRA adapters
- **Total**: ~8-10GB per GPU (fits within the 16 GB of a T4 GPU)

## Speed Optimizations:
- **4-bit Quantization**: Faster inference than 8-bit
- **QLoRA**: Efficient parameter updates
- **Batch Processing**: Can be added for faster prediction generation
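
The batch processing mentioned above could start from a simple chunking helper like the one sketched here, which groups prompts into fixed-size batches. It deliberately stops short of calling the model: passing a whole batch to the tokenizer would additionally require padding (left padding is the usual choice for decoder-only generation), which is an assumption noted here rather than shown. The helper name `chunked` and the batch size are hypothetical.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


prompts = [f"prompt {i}" for i in range(10)]  # placeholder prompts
for batch in chunked(prompts, batch_size=4):
    # In the notebook, each batch would be tokenized with padding and
    # passed to model.generate() in a single call.
    print(len(batch), "prompts, starting with:", batch[0])
```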

## Compatibility:
- **BitsAndBytes**: Full support for 4-bit operations
- **QLoRA**: Standard and reliable
- **Kaggle**: Optimized for offline execution

## Expected Performance (estimates):
- **Accuracy**: Similar to DoRA with potentially better stability
- **Speed**: 20-30% faster than GPTQ
- **Memory**: 40-50% less VRAM usage