# SFT Training Test: Qwen3-4B-Thinking-2507

Tests Supervised Fine-Tuning with Unsloth's optimized SFTTrainer on Qwen3-4B-Thinking-2507.

**Key features tested:**
- FastLanguageModel loading with 4-bit quantization
- LoRA adapter configuration
- SFTTrainer with synthetic dataset including `<think>` content
- Training the model to produce self-questioning reasoning
- Post-training inference verification

**Thinking Style:** Self-questioning internal dialogue
- "What is the user asking here?"
- "Let me think about the key concepts..."
- "How should I structure this explanation?"

**Important:** This notebook includes a kernel shutdown cell at the end to release all GPU memory.

In [None]:
# Environment Setup
import os
from dotenv import load_dotenv
load_dotenv()

# CRITICAL: Import unsloth FIRST for proper TRL patching
import unsloth
from unsloth import FastLanguageModel, is_bf16_supported

import torch

# Environment summary
gpu = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU"
print(f"Environment: unsloth {unsloth.__version__}, PyTorch {torch.__version__}, {gpu}")
print(f"HF_TOKEN loaded: {'Yes' if os.environ.get('HF_TOKEN') else 'No'}")

In [None]:
# Load Qwen3-4B-Thinking-2507 with 4-bit quantization
MODEL_NAME = "unsloth/Qwen3-4B-Thinking-2507-unsloth-bnb-4bit"
print(f"\nLoading {MODEL_NAME.split('/')[-1]}...")

model, tokenizer = FastLanguageModel.from_pretrained(
    MODEL_NAME,
    max_seq_length=1024,  # Increased for thinking content
    load_in_4bit=True,
    dtype=None,  # Auto-detect
)
print(f"Model loaded: {type(model).__name__}")
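As a back-of-the-envelope illustration of why 4-bit loading matters here, the sketch below estimates weight memory from a nominal parameter count. The 4e9 figure is an assumption taken from the "4B" in the model name, and `weight_memory_gib` is a hypothetical helper. The estimate covers weights only and ignores quantization constants, any layers kept in higher precision, activations, and optimizer state.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed to store the weights only, in GiB."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 2**30

NOMINAL_PARAMS = 4e9  # assumption: nominal size implied by "4B", not a measured count

for bits in (16, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gib(NOMINAL_PARAMS, bits):.2f} GiB")
```

The actual footprint reported by the GPU will be higher than the 4-bit figure for the reasons listed above.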

In [None]:
# Apply LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"LoRA applied: {trainable:,} trainable / {total:,} total ({100*trainable/total:.2f}%)")
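To make the trainable-parameter percentage printed above less opaque: standard LoRA adds two low-rank matrices per adapted linear layer, `A` of shape `(r, d_in)` and `B` of shape `(d_out, r)`, which is `r * (d_in + d_out)` extra parameters, and scales their product by `lora_alpha / r`. A minimal sketch follows. The layer shapes are illustrative placeholders, not Qwen3-4B's actual dimensions, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

R, LORA_ALPHA = 16, 16  # values from the notebook's get_peft_model call

# Hypothetical (d_in, d_out) shapes, for illustration only
example_layers = {
    "q_proj": (1024, 1024),
    "up_proj": (1024, 4096),
    "down_proj": (4096, 1024),
}

for name, (d_in, d_out) in example_layers.items():
    added = lora_param_count(d_in, d_out, R)
    frozen = d_in * d_out
    print(f"{name}: +{added:,} LoRA params vs {frozen:,} frozen ({100 * added / frozen:.2f}%)")

print(f"LoRA scaling factor (lora_alpha / r): {LORA_ALPHA / R}")
```

With `r` equal to `lora_alpha`, the scaling factor is 1.0, so the low-rank update is applied without extra scaling.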

In [None]:
# Create minimal synthetic instruction dataset with thinking content (5 samples)
# Using self-questioning internal dialogue style for thinking
from datasets import Dataset

synthetic_data = [
    {
        "instruction": "What is machine learning?",
        "thinking": "What is the user asking here? They want to understand machine learning. What are the key concepts I should cover? It's a subset of AI... and it involves learning from data. How should I keep this accessible? Short and clear definition.",
        "response": "Machine learning is a subset of artificial intelligence where computers learn patterns from data."
    },
    {
        "instruction": "Explain Python in one sentence.",
        "thinking": "One sentence only - what's most important about Python? Its readability and versatility are the defining features. How do I capture both in one sentence?",
        "response": "Python is a high-level programming language known for its readability and versatility."
    },
    {
        "instruction": "What is a neural network?",
        "thinking": "How do I explain neural networks simply? What's the core concept? They're inspired by biological neurons... they process information in layers. Should I mention deep learning? Maybe keep it basic for now.",
        "response": "A neural network is a computational model inspired by biological neurons that processes information through connected layers."
    },
    {
        "instruction": "Define supervised learning.",
        "thinking": "What makes supervised learning 'supervised'? It's the labels! The data has known outputs. How do I explain this clearly? Focus on the training process with labeled data.",
        "response": "Supervised learning is training a model on labeled data to predict outcomes for new inputs."
    },
    {
        "instruction": "What is gradient descent?",
        "thinking": "This is a bit technical. What's the intuition behind gradient descent? It's like walking downhill to find the lowest point. The gradient tells us which direction is 'down'. Keep it conceptual.",
        "response": "Gradient descent is an optimization algorithm that minimizes loss by iteratively adjusting parameters in the direction of steepest descent."
    },
]

# Format as chat conversations with thinking content
def format_conversation(sample):
    # Combine thinking and response with proper tags
    assistant_content = f"<think>\n{sample['thinking']}\n</think>\n\n{sample['response']}"
    messages = [
        {"role": "user", "content": sample["instruction"]},
        {"role": "assistant", "content": assistant_content}
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)}

dataset = Dataset.from_list(synthetic_data)
dataset = dataset.map(format_conversation, remove_columns=["instruction", "thinking", "response"])
print(f"Dataset created: {len(dataset)} samples")
print("\nSample formatted text:")
print(dataset[0]['text'][:500] + "...")
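Before handing samples to the chat template, it can help to check that every assistant turn contains exactly one well-formed `<think>` block followed by a non-empty answer. The sketch below is one way to do that with the standard library only. `build_assistant_content` mirrors the f-string used in `format_conversation`, while `is_well_formed` is a hypothetical helper and not part of Unsloth or TRL.

```python
import re

def build_assistant_content(thinking: str, response: str) -> str:
    """Same layout as the notebook: thinking inside <think> tags, then the answer."""
    return f"<think>\n{thinking}\n</think>\n\n{response}"

# One <think>...</think> block at the start, followed by the answer
THINK_PATTERN = re.compile(
    r"\A<think>\n(?P<thinking>.+?)\n</think>\n\n(?P<response>.+)\Z", re.DOTALL
)

def is_well_formed(content: str) -> bool:
    """True if content has exactly one non-empty think block and a non-empty answer."""
    if content.count("<think>") != 1 or content.count("</think>") != 1:
        return False
    match = THINK_PATTERN.match(content)
    if match is None:
        return False
    return bool(match["thinking"].strip()) and bool(match["response"].strip())

good = build_assistant_content("What is the user asking here?", "A short, clear answer.")
bad = "<think>\nunfinished reasoning with no closing tag"
print(is_well_formed(good), is_well_formed(bad))
```

A check like this could be run over `synthetic_data` before `dataset.map`, so malformed samples fail early rather than silently entering training.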

In [None]:
# SFT Training (minimal steps for testing)
from trl import SFTTrainer, SFTConfig

sft_config = SFTConfig(
    output_dir="outputs_sft_qwen_think_test",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    max_steps=3,  # Minimal steps for testing
    warmup_steps=1,
    learning_rate=2e-4,
    logging_steps=1,
    fp16=not is_bf16_supported(),
    bf16=is_bf16_supported(),
    optim="adamw_8bit",
    weight_decay=0.01,
    max_seq_length=1024,  # Increased for thinking content
    seed=42,
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    args=sft_config,
)

print("Starting SFT training with thinking content (3 steps)...")
trainer_stats = trainer.train()
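# "train_loss" is the loss averaged over all training steps; it may be missing from metrics
train_loss = trainer_stats.metrics.get("train_loss")
loss_text = f"{train_loss:.4f}" if train_loss is not None else "N/A"
print(f"Training completed. Average training loss: {loss_text}")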
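With only three optimizer steps, the warmup setting has an outsized effect. The sketch below reproduces the multiplier of a linear schedule with warmup in plain Python. It assumes the trainer uses that schedule, which is the Hugging Face `TrainingArguments` default unless `lr_scheduler_type` is overridden. `linear_warmup_multiplier` is a hypothetical helper written for illustration, not a library function.

```python
def linear_warmup_multiplier(step: int, warmup_steps: int, total_steps: int) -> float:
    """LR multiplier: linear ramp from 0 during warmup, then linear decay to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

BASE_LR, WARMUP_STEPS, MAX_STEPS = 2e-4, 1, 3  # values from the notebook's SFTConfig

schedule = [
    BASE_LR * linear_warmup_multiplier(step, WARMUP_STEPS, MAX_STEPS)
    for step in range(MAX_STEPS)
]
for step, lr in enumerate(schedule):
    print(f"optimizer step {step}: lr = {lr:.1e}")
```

Under this assumption the first update runs at a learning rate of 0, so only two of the three steps actually change the adapter weights. That is fine for a smoke test, but worth keeping in mind when reading the logged losses.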

In [None]:
# Post-training inference test
import re

FastLanguageModel.for_inference(model)

messages = [{"role": "user", "content": "What is deep learning?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# Move inputs to whichever device the model is on instead of hard-coding "cuda"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.6,
        top_p=0.95,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id,
    )

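# Decode only the newly generated tokens so the prompt is not echoed into the parsed output
generated_tokens = outputs[0][inputs["input_ids"].shape[1]:]
response = tokenizer.decode(generated_tokens, skip_special_tokens=True)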

# Parse thinking vs response
def parse_thinking_response(text):
    """Split generated text into (thinking, final_response)."""
    if '</think>' not in text:
        # No closing tag: treat everything as the final response
        return "", text.strip()
    thinking_part, final_part = text.split('</think>', 1)
    # The opening <think> tag may be absent if it was emitted as part of the prompt
    if '<think>' in thinking_part:
        thinking_part = thinking_part.split('<think>', 1)[1]
    return thinking_part.strip(), final_part.strip()

thinking, final_response = parse_thinking_response(response)

print("=" * 60)
print("SFT Training Pipeline Test (Thinking Mode) PASSED")
print("=" * 60)
print("\nTHINKING CONTENT:")
print((thinking[:300] + "...") if len(thinking) > 300 else thinking)
print("\nFINAL RESPONSE:")
print((final_response[:200] + "...") if len(final_response) > 200 else final_response)
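A parser for this kind of output has to handle three shapes: a full `<think>...</think>` block, a closing tag only (possible when the opening tag was emitted as part of the prompt), and no tags at all (for example when generation stops at `max_new_tokens` mid-thought). The standalone sketch below covers all three. `split_thinking` is a hypothetical re-implementation for illustration, and the sample strings are made up, not model output.

```python
def split_thinking(text: str) -> tuple[str, str]:
    """Return (thinking, answer); thinking is empty when no </think> tag is present."""
    thinking, closing_tag, answer = text.partition("</think>")
    if not closing_tag:
        return "", text.strip()
    # Drop everything up to and including an opening tag, if one is present
    _, opening_tag, after_open = thinking.partition("<think>")
    if opening_tag:
        thinking = after_open
    return thinking.strip(), answer.strip()

samples = {
    "full block": "<think>\nWhat is asked?\n</think>\n\nAn answer.",
    "closing tag only": "What is asked?\n</think>\n\nAn answer.",
    "no tags": "Still thinking when the token budget ran out",
}

for label, text in samples.items():
    thinking, answer = split_thinking(text)
    print(f"{label}: thinking={thinking!r}, answer={answer!r}")
```

In the third case the function cannot tell an unfinished thought from a direct answer, so an empty thinking string with a long "answer" is a hint to raise `max_new_tokens`.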

## Test Complete

The SFT Training Pipeline test with thinking content has completed successfully. The kernel will now shut down to release all GPU memory.

### What Was Verified
- FastLanguageModel loading with 4-bit quantization (Qwen3-4B-Thinking-2507)
- LoRA adapter configuration (r=16, all projection modules)
- Synthetic dataset creation with `<think>` tags and self-questioning style
- SFTTrainer training loop (3 steps)
- Post-training inference with thinking output

### Thinking Style Trained
- Self-questioning internal dialogue
- "What is the user asking?" pattern
- "How should I structure this?" reasoning

### Ready for Production
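If this test passed, your environment is set up correctly for the following (this run is a 3-step smoke test of the pipeline, not a measure of model quality):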
- Full SFT fine-tuning on larger datasets with thinking content
- Chain-of-thought training workflows
- Model saving and deployment with thinking capabilities

In [None]:
# Shutdown kernel to release all GPU memory
import IPython
print("Shutting down kernel to release GPU memory...")
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)