# CourtRankRL GRPO Training - Chunk-Based, RTX 5090 Optimized

## Agents.md Specification (Chunk-Based)

This notebook trains the CourtRankRL GRPO-based reranking model on an **RTX 5090 GPU** (32GB VRAM).

### Key features (chunk-based approach):
- **Model**: Qwen/Qwen3-4B-Instruct-2507 (4-bit) + QLoRA (rank=64, alpha=128)
- **Training**: TRL GRPOTrainer with the GRPO algorithm
  - Loss: "dapo" (eliminates length bias)
  - Reward scaling: "batch" (robust - PPO Lite)
  - Importance sampling: "sequence" (stable - GSPO)
- **Dataset**: 98 queries (full set), 20 chunks/slate, **FULL chunk text** (~500-800 chars)
- **Slate strategy**: chunk-level retrieval (no document aggregation!) → the most relevant chunks
- **Baseline**: slate order = fusion ranking [0,1,2,...] (per BM25+FAISS fusion)
- **Hardware**: Batch size 2, grad accumulation 2, 6 generations/prompt
- **Training time**: ~45-60 minutes (500 steps)
- **Input**: training_slates.jsonl (output of the chunk-based prepare_training_slates())
- **Output**: LoRA adapter weights + metrics JSON

### Prerequisites:
- **RTX 5090 GPU (32GB VRAM)** or similar
- HF token in an environment variable: `HUGGINGFACE_TOKEN`
- Chunk-based slate JSONL file: `/workspace/training_slates.jsonl`

### Chunk-based slate format:
```json
{
  "query_id": "Hungarian query text",
  "slate": [
    {
      "chunk_id": "0302-G_20416_2019_11_0",
      "doc_id": "0302-G_20416_2019_11",
      "bm25_score": 12.5,
      "faiss_score": 0.85,
      "relevance": 2,
      "court": "Fővárosi Törvényszék",
      "domain": "G",
      "year": "2019",
      "text": "FULL chunk text (500-800 chars) - not a preview!"
    }
  ]
}
```
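A loader may want to check each JSONL line against this format before training. The following is a minimal sketch: `validate_slate_record` and `REQUIRED_CHUNK_KEYS` are hypothetical names that are not part of the project, and the set of required keys is an assumption derived from the example record above.

```python
import json

# Assumed required keys, taken from the example record above
REQUIRED_CHUNK_KEYS = {"chunk_id", "doc_id", "relevance", "text"}


def validate_slate_record(line: str) -> list:
    """Return a list of problems found in one JSONL line (empty if none)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(record, dict):
        return ["record is not a JSON object"]

    problems = []
    query_id = record.get("query_id")
    if not isinstance(query_id, str) or not query_id:
        problems.append("query_id missing or empty")

    slate = record.get("slate")
    if not isinstance(slate, list) or not slate:
        problems.append("slate missing or empty")
        return problems

    for i, chunk in enumerate(slate):
        if not isinstance(chunk, dict):
            problems.append(f"slate[{i}] is not an object")
            continue
        missing = REQUIRED_CHUNK_KEYS - set(chunk)
        if missing:
            problems.append(f"slate[{i}] missing keys: {sorted(missing)}")
    return problems


example = json.dumps({
    "query_id": "example query",
    "slate": [{"chunk_id": "c0", "doc_id": "d0", "relevance": 2, "text": "..."}],
})
print(validate_slate_record(example))
print(validate_slate_record('{"slate": []}'))
```

Returning a list of problems instead of raising lets a loader report every malformed line in one pass.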

### Why chunk-based?
- ✅ **Relevant context**: BM25+FAISS has already selected the most relevant chunks
- ✅ **Full text**: The model sees WHY a document is relevant
- ✅ **Better learning**: The model learns to judge the actual content, not just metadata


In [None]:
# Environment setup and package installation
%pip install -q torch transformers peft trl datasets accelerate bitsandbytes
%pip install -q numpy huggingface_hub

print("✅ Csomagok telepítve")


In [None]:
import os
import json
import sys
import re
import random
from pathlib import Path
from typing import Dict, List

import numpy as np
import torch
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
from trl.trainer.grpo_trainer import GRPOTrainer
from trl.trainer.grpo_config import GRPOConfig
from huggingface_hub import login

print("✅ Importok betöltve")
print(f"PyTorch verzió: {torch.__version__}")
print(f"CUDA elérhető: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU memória: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")


In [None]:
# Hugging Face login
hf_token = os.getenv("HUGGINGFACE_TOKEN")
if hf_token:
    login(token=hf_token)
    print("✅ HuggingFace bejelentkezés sikeres")
else:
    print("⚠️ Nincs HUGGINGFACE_TOKEN, a modell letöltése korlátozott lehet")


In [None]:
# Configuration (RTX 5090 optimized, per agents.md)
MODEL_NAME = "Qwen/Qwen3-4B-Instruct-2507"

# Dataset (agents.md: all 98 queries, 20 chunks/slate)
SLATE_SIZE = 20
GROUP_SIZE = 20  # = SLATE_SIZE
# (Chunk-based, full text, not a preview)

# LoRA configuration (agents.md: rank=64, alpha=128)
LORA_RANK = 64
LORA_ALPHA = 128
LORA_DROPOUT = 0.05

# Training configuration (RTX 5090 optimized)
LEARNING_RATE = 1e-5
MAX_STEPS = 500  # ~5 epochs (98 queries × 5)
SAVE_STEPS = 500  # Final save only
EVAL_STEPS = 50
LOGGING_STEPS = 10
WARMUP_STEPS = 50  # 10% warmup
GRADIENT_ACCUMULATION_STEPS = 2  # RTX 5090: 2 (vs 4)
NUM_GENERATIONS = 6  # RTX 5090: 6 (vs 4)
PER_DEVICE_BATCH_SIZE = 2  # RTX 5090: batch=2

# GRPO reward (per agents.md)
NDCG_K = 10
ENTROPY_BONUS = 0.01  # Exploration
REWARD_CLIP_MIN = -1.0
REWARD_CLIP_MAX = 1.0

# Train/Eval split (agents.md: 80/20, seed 42)
TRAIN_SPLIT = 0.8
SEED = 42

# Paths (RunPod workspace)
BASE_PATH = Path(os.getenv("WORKSPACE_PATH", "/workspace"))
SLATE_FILE = BASE_PATH / "training_slates.jsonl"
OUTPUT_DIR = BASE_PATH / "artifacts" / "grpo_policy"
METRICS_FILE = OUTPUT_DIR / "metrics.json"

print("📋 RTX 5090 Konfiguráció:")
print(f"  Model: {MODEL_NAME}")
print(f"  Slate size: {SLATE_SIZE} chunk/slate (full chunk text)")
print(f"  LoRA: rank={LORA_RANK}, alpha={LORA_ALPHA}")
print(f"  Batch: {PER_DEVICE_BATCH_SIZE} × {GRADIENT_ACCUMULATION_STEPS} = {PER_DEVICE_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS}")
print(f"  Steps: {MAX_STEPS}, Generations: {NUM_GENERATIONS}")
print(f"  Slate file: {SLATE_FILE}")
print(f"  Output: {OUTPUT_DIR}")

# Sanity check: the slate file must exist
if not SLATE_FILE.exists():
    raise FileNotFoundError(f"❌ Slate fájl nem található: {SLATE_FILE}")

OUTPUT_DIR.mkdir(parents=True, exist_ok=True)


In [None]:
# Helper functions

def calculate_ndcg(ranked_indices: List[int], true_relevance: List[float], k: int = 10) -> float:
    """
    nDCG@k computation per agents.md.
    Formula: DCG = sum(rel_i / log2(i + 2))
    """
    if not true_relevance or not ranked_indices:
        return 0.0

    # DCG calculation
    dcg = 0.0
    for i in range(min(k, len(ranked_indices))):
        idx = ranked_indices[i]
        # Bounds-check the slate index itself, not the rank position
        if 0 <= idx < len(true_relevance):
            dcg += true_relevance[idx] / np.log2(i + 2)

    # IDCG calculation
    sorted_rel = sorted(true_relevance, reverse=True)
    idcg = 0.0
    for i in range(min(k, len(sorted_rel))):
        idcg += sorted_rel[i] / np.log2(i + 2)

    return dcg / idcg if idcg > 0 else 0.0


def calculate_entropy(ranking: List[int]) -> float:
    """Entropy of the ranking, used as a diversity measure."""
    if not ranking:
        return 0.0

    counts = {}
    for idx in ranking:
        counts[idx] = counts.get(idx, 0) + 1

    probs = [count / len(ranking) for count in counts.values()]
    entropy = -sum(p * np.log(p + 1e-10) for p in probs if p > 0)

    return entropy


def parse_model_ranking(completion: str, slate_size: int = SLATE_SIZE) -> List[int]:
    """
    Extract a ranking from the model output.
    Expected format: "1,3,2,4,0" or "1, 3, 2, 4, 0"

    Fallback (agents.md): a random shuffle (not the baseline!), so that the
    model does not learn to emit the baseline order when its output is malformed.
    """
    try:
        numbers = [int(x.strip()) for x in completion.split(",") if x.strip().isdigit()]
        # Keep only valid indices
        valid_numbers = [n for n in numbers if 0 <= n < slate_size]

        if len(valid_numbers) >= slate_size // 2:
            # At least half of the indices are valid: use them
            return valid_numbers[:slate_size]
        else:
            # Too few valid numbers: random shuffle (leads to a penalty)
            indices = list(range(slate_size))
            random.shuffle(indices)
            return indices
    except (ValueError, AttributeError):
        # Parse error: random shuffle (leads to a penalty - agents.md)
        indices = list(range(slate_size))
        random.shuffle(indices)
        return indices


print("✅ Segédfüggvények definiálva")
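To make the nDCG formula concrete, here is a small worked example on a three-chunk toy slate. It is a self-contained sketch: `ndcg_at_k` is a hypothetical stand-in written with the standard library rather than the notebook's `calculate_ndcg`, and the relevance labels are invented for illustration.

```python
import math


def ndcg_at_k(ranked_indices, relevance, k=10):
    """nDCG@k with the linear-gain DCG used above: sum(rel_i / log2(i + 2))."""
    def dcg(rels):
        return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(rels[:k]))

    gains = [relevance[idx] for idx in ranked_indices]
    ideal = dcg(sorted(relevance, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


relevance = [0, 2, 1]   # invented graded labels for slate positions 0, 1, 2
baseline = [0, 1, 2]    # fusion order as stored in the slate
reranked = [1, 2, 0]    # most relevant chunk moved to the top

print(f"baseline nDCG@3: {ndcg_at_k(baseline, relevance, k=3):.3f}")
print(f"reranked nDCG@3: {ndcg_at_k(reranked, relevance, k=3):.3f}")
```

The reranked order matches the ideal order, so it scores 1.0, while the baseline is discounted because its only highly relevant chunk sits in second place.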


In [None]:
# Load slate data
print(f"📂 Slate adatok betöltése: {SLATE_FILE}")

slates_data = []
with open(SLATE_FILE, 'r', encoding='utf-8') as f:
    for line_num, line in enumerate(f, 1):
        line = line.strip()
        if not line:
            continue
        try:
            slate = json.loads(line)
            slates_data.append(slate)
        except json.JSONDecodeError as e:
            print(f"⚠️ JSON hiba a {line_num}. sorban: {e}")
            continue

if not slates_data:
    raise ValueError("❌ Nincs betölthető slate adat!")

print(f"✅ Betöltve: {len(slates_data)} slate")

# Inspect the slate format on a sample
sample = slates_data[0]
print(f"\n📋 Minta slate struktúra:")
print(f"  Query ID: {sample['query_id'][:50]}...")
print(f"  Slate elemek: {len(sample['slate'])}")
print(f"  Minta elem kulcsok: {list(sample['slate'][0].keys())}")


In [None]:
# Detailed Learning-to-Rank Prompt Template (Chunk-Based, Full Context)

def create_training_prompt(query_id: str, slate: List[Dict]) -> str:
    """
    Comprehensive learning-to-rank prompt for GRPO training.

    This prompt is designed to maximize GRPO learning effectiveness by providing:
    - Clear task definition and objectives
    - Detailed relevance criteria explanation
    - Step-by-step reasoning instructions
    - Full document context for accurate assessment
    - Structured format for consistent parsing

    Goal: Optimize nDCG@10 through better document understanding and ranking strategy learning.
    """
    lines = []
    for i, candidate in enumerate(slate):
        # Extract all available metadata for comprehensive assessment
        court = candidate.get('court', 'Unknown Court')
        domain = candidate.get('domain', 'Unknown')
        year = candidate.get('year', 'Unknown')
        bm25 = candidate.get('bm25_score', 0.0)
        faiss = candidate.get('faiss_score', 0.0)
        doc_id = candidate.get('doc_id', f'unknown_doc_{i}')
        chunk_id = candidate.get('chunk_id', f'unknown_chunk_{i}')

        # Full chunk text for complete context understanding
        text = candidate.get('text', 'No text available')

        lines.append(
            f"[{i}] DOCUMENT: {doc_id} | CHUNK: {chunk_id}\n"
            f"    COURT: {court} | DOMAIN: {domain} | YEAR: {year}\n"
            f"    RETRIEVAL_SCORES: BM25={bm25:.3f}, Dense={faiss:.3f}\n"
            f"    CONTENT: {text}\n"
        )

    prompt = f"""You are an expert legal information retrieval system designed to rank Hungarian court decisions for maximum relevance to user queries.

## TASK DEFINITION
Your objective is to learn optimal document ranking through reinforcement learning. Given a query and candidate documents, you must produce a ranking that maximizes nDCG@10, focusing on true relevance over superficial features.

## PROJECT CONTEXT
This is a CourtRankRL system for Hungarian judicial documents. Documents come from various courts (Fővárosi Törvényszék, etc.), legal domains (criminal, civil), and years. Each document represents a real court decision that may contain relevant legal precedents, arguments, or conclusions.

## RELEVANCE ASSESSMENT CRITERIA
When evaluating document relevance to the query, consider:

1. **Semantic Match**: How well does the document content address the specific legal concepts, situations, or outcomes mentioned in the query?
2. **Legal Domain Alignment**: Does the document's court/domain match the legal area of the query?
3. **Temporal Relevance**: Is the document from a relevant time period that would contain applicable precedents?
4. **Precedent Value**: Does the document contain similar fact patterns, legal arguments, or conclusions that would be useful?
5. **Authority Level**: Higher courts and more recent decisions generally have higher precedential value.

## RANKING STRATEGY
Follow these steps for optimal ranking:

1. **Analyze Query Intent**: Identify the core legal concepts, parties involved, and desired outcomes.
2. **Content Evaluation**: Read each document's full text to understand its legal context and relevance.
3. **Multi-factor Scoring**: Combine semantic relevance, domain match, and precedential value.
4. **Confidence Assessment**: Prioritize documents where relevance is clear and direct.
5. **Diversity Consideration**: Include documents from different courts/periods if they provide complementary legal insights.

## CANDIDATE DOCUMENTS
Below are {len(slate)} candidate document chunks retrieved by hybrid BM25+dense retrieval. Each contains full text content for complete relevance assessment.

QUERY: "{query_id}"

CANDIDATES:
{chr(10).join(lines)}

## RESPONSE FORMAT
Provide your ranking as a comma-separated list of indices in descending relevance order.
Format: "index1,index2,index3,..." (e.g., "2,0,15,8,3,11,1,5,9,12")

Guidelines for response:
- Include ALL {len(slate)} documents in your ranking
- Order by decreasing relevance (most relevant first)
- Use only valid indices from [0, {len(slate)-1}]
- Separate indices with commas, no spaces
- Base your ranking on actual content analysis, not retrieval scores

Your ranking will be evaluated using nDCG@10. Focus on identifying truly relevant documents that best match the query's legal intent and requirements."""

    return prompt


# Test enhanced prompt
test_prompt = create_training_prompt(slates_data[0]["query_id"], slates_data[0]["slate"])
print("📝 Enhanced learning-to-rank prompt sample:")
print("="*80)
print(test_prompt[:1500])  # First 1500 chars for preview
print("\n... (truncated)")
print("="*80)
print(f"\n📊 Prompt Statistics:")
print(f"  Total length: {len(test_prompt)} characters")
print(f"  Estimated tokens: ~{len(test_prompt.split())*1.2:.0f}")
print(f"  Average chunk length: {sum(len(c.get('text', '')) for c in slates_data[0]['slate']) / len(slates_data[0]['slate']):.0f} characters")
print(f"  Number of candidates: {len(slates_data[0]['slate'])}")
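The response format asks for a bare comma-separated list, but generated text often arrives wrapped in extra words. The sketch below shows one tolerant way to pull indices out of such output. `extract_indices` is a hypothetical helper, not the notebook's `parse_model_ranking`, and unlike that function it drops repeated indices. It also ignores minus signs, so "-3" would be read as 3.

```python
import re


def extract_indices(completion: str, slate_size: int) -> list:
    """Pull integer indices out of free-form text, keeping valid, first-seen ones."""
    seen = set()
    ranking = []
    for token in re.findall(r"\d+", completion):
        idx = int(token)
        if idx < slate_size and idx not in seen:
            seen.add(idx)
            ranking.append(idx)
    return ranking


print(extract_indices("Ranking: 2, 0, 15, 2, 99, 1", slate_size=20))
```

In this call the repeated `2` and the out-of-range `99` are dropped, and the remaining indices keep their original order.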


In [None]:
# Prepare the dataset for TRL GRPOTrainer (chunk-based, shuffled split)
# Agents.md: TRL-compatible data passing via a global slate lookup dict

training_examples = []
slate_lookup = {}  # Global dict: query_id -> slate_data (for the reward function)

for slate_data in slates_data:
    query_id = slate_data["query_id"]
    prompt = create_training_prompt(query_id, slate_data["slate"])
    
    # The dataset holds only the prompt (TRL best practice)
    training_examples.append({
        "prompt": prompt
    })
    
    # Store slate metadata separately (for the reward function)
    slate_lookup[query_id] = slate_data["slate"]

# Full dataset
full_dataset = Dataset.from_list(training_examples)

# Train/eval split (Agents.md: 80/20, SHUFFLED, seed=42)
shuffled_indices = list(range(len(full_dataset)))
random.Random(SEED).shuffle(shuffled_indices)

train_size = int(len(full_dataset) * TRAIN_SPLIT)
train_indices = shuffled_indices[:train_size]
eval_indices = shuffled_indices[train_size:]

train_dataset = full_dataset.select(train_indices)
eval_dataset = full_dataset.select(eval_indices)

print(f"✅ Dataset létrehozva (shuffled split):")
print(f"  Training: {len(train_dataset)} query (80%)")
print(f"  Evaluation: {len(eval_dataset)} query (20%)")
print(f"  Slate lookup: {len(slate_lookup)} entry (global dict)")
print(f"  Slate size: {SLATE_SIZE} chunk/query")
print(f"  Random seed: {SEED}")
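The seeded split can be checked in isolation. The sketch below reproduces the same idea on plain index lists: `shuffled_split` is a hypothetical helper, and the sizes (98 items, 0.8 fraction, seed 42) simply mirror the constants used in this notebook.

```python
import random


def shuffled_split(n_items: int, train_fraction: float, seed: int):
    """Return (train_indices, eval_indices) from a seeded shuffle of range(n_items)."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # local RNG: global random state is untouched
    cut = int(n_items * train_fraction)
    return indices[:cut], indices[cut:]


train_idx, eval_idx = shuffled_split(98, 0.8, seed=42)
print(len(train_idx), len(eval_idx))
print("overlap:", set(train_idx) & set(eval_idx))
```

Using a dedicated `random.Random(seed)` instance keeps the split reproducible even if other cells call the global `random` functions in between.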


In [None]:
# Model and tokenizer initialization
from transformers import AutoModelForCausalLM, AutoTokenizer

print(f"🔄 Model betöltése: {MODEL_NAME}")

# 4-bit quantization configuration (per agents.md)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_NAME,
    trust_remote_code=True,
    token=hf_token
)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print("✅ Tokenizer betöltve")

# Model (causal LM, loaded 4-bit quantized)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
    token=hf_token
)

print("✅ Model betöltve (4-bit quantized)")

# LoRA configuration (RTX 5090 optimized, agents.md: rank=64, 7 modules)
lora_config = LoraConfig(
    r=LORA_RANK,
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    bias="none",
    task_type="CAUSAL_LM",
    # 7 target modules: full Qwen3 coverage (agents.md RTX 5090 spec)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

print(f"✅ LoRA adapterek alkalmazva:")
print(f"  Rank: {LORA_RANK}, Alpha: {LORA_ALPHA}")
print(f"  Target modules: 7 (full coverage)")
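For intuition about what rank and alpha control, the arithmetic of a standard LoRA layer can be sketched as follows. Both helpers are hypothetical, and the layer shape is an arbitrary illustration rather than a value read from the Qwen3 configuration.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one linear layer.

    The update is B @ A with A of shape (rank, d_in) and B of shape (d_out, rank).
    """
    return rank * (d_in + d_out)


def lora_scaling(alpha: int, rank: int) -> float:
    """Factor applied to the low-rank update in standard (non-rsLoRA) LoRA."""
    return alpha / rank


# Arbitrary square layer used only for illustration
print(lora_param_count(d_in=1024, d_out=1024, rank=64))
print(lora_scaling(alpha=128, rank=64))
```

With rank=64 and alpha=128 as configured here, the update is scaled by 2, and the added parameter count grows linearly with the rank.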


In [None]:
# GRPO reward function (TRL-compatible, chunk-based, revised)

def reward_function(completions, prompts, **kwargs):
    """
    TRL-compatible GRPO reward function (chunk-based, revised).
    
    Revisions per agents.md:
    - Regex-based query_id parsing (robust)
    - Baseline order = slate fusion ranking [0,1,2,...]
    - Negative penalty for parse failures (not zero)
    - nDCG@10 difference (core GRPO objective)
    - Entropy bonus (exploration)
    - Reward clipping (stability)
    
    Args:
        completions: List of outputs generated by the model
        prompts: List of input prompts (used for query_id extraction)
        **kwargs: Additional TRL arguments
    """
    rewards = []
    
    for completion, prompt in zip(completions, prompts):
        try:
            # Extract the query ID with a REGEX (robust - agents.md)
            match = re.search(r'QUERY:\s*"([^"]+)"', prompt)
            
            if not match:
                # Parse failure: negative penalty (agents.md)
                rewards.append(-0.5)
                continue
                
            query_id = match.group(1)
            
            if query_id not in slate_lookup:
                rewards.append(-0.5)
                continue
            
            # Slate metadata lookup (global dict)
            slate = slate_lookup[query_id]
            relevance = [doc.get('relevance', 0) for doc in slate]
            
            # BASELINE ORDER: the slate is already sorted by fusion score (agents.md)
            # The order stored in the slate [0,1,2,...] = fusion baseline!
            baseline = list(range(len(slate)))
            
            # Parse model ranking
            predicted = parse_model_ranking(completion, len(slate))
            
            # GRPO core: nDCG@10 difference
            ndcg_baseline = calculate_ndcg(baseline, relevance, k=NDCG_K)
            ndcg_policy = calculate_ndcg(predicted, relevance, k=NDCG_K)
            reward = ndcg_policy - ndcg_baseline
            
            # Entropy bonus (exploration - agents.md: 0.01 weight)
            if len(predicted) > 1:
                unique_ratio = len(set(predicted)) / len(predicted)
                reward += ENTROPY_BONUS * unique_ratio
            
            # Reward clipping (stability - agents.md)
            reward = max(REWARD_CLIP_MIN, min(REWARD_CLIP_MAX, reward))
            rewards.append(reward)
            
        except Exception:
            # Unexpected error: negative penalty
            rewards.append(-0.5)
    
    return rewards


print("✅ GRPO Reward function definiálva (chunk-based, javított)")
print(f"  Query parsing: regex-based (robust)")
print(f"  Baseline: slate fusion order [0,1,2,...]")
print(f"  Core: nDCG@{NDCG_K} difference")
print(f"  Entropy bonus: {ENTROPY_BONUS}")
print(f"  Parse failure penalty: -0.5")
print(f"  Clipping: [{REWARD_CLIP_MIN}, {REWARD_CLIP_MAX}]")
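An alternative to recovering the query from the prompt text is to carry the relevance labels in the dataset itself. TRL's documentation describes extra dataset columns being forwarded to reward functions as keyword arguments. The sketch below assumes that behaviour along with a hypothetical `relevance` column, so verify it against the installed TRL version. It is deliberately simplified: a parse failure gets the fixed -0.5 penalty instead of a random shuffle, and there is no entropy bonus.

```python
import math


def ndcg_at_k(ranking, relevance, k):
    """nDCG@k with DCG = sum(rel_i / log2(i + 2))."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevance, reverse=True))
    return dcg([relevance[i] for i in ranking]) / ideal if ideal > 0 else 0.0


def reward_from_columns(completions, relevance, k=10, **kwargs):
    """Reward = nDCG@k(policy) - nDCG@k(slate order), clipped to [-1, 1].

    `relevance` is assumed to be a dataset column with one label list per prompt.
    """
    rewards = []
    for completion, labels in zip(completions, relevance):
        tokens = [t.strip() for t in completion.split(",")]
        ranking = [int(t) for t in tokens if t.isdecimal() and int(t) < len(labels)]
        if not ranking:
            rewards.append(-0.5)  # fixed penalty when nothing could be parsed
            continue
        baseline = list(range(len(labels)))
        delta = ndcg_at_k(ranking, labels, k) - ndcg_at_k(baseline, labels, k)
        rewards.append(max(-1.0, min(1.0, delta)))
    return rewards


toy_rewards = reward_from_columns(
    ["1,2,0", "no ranking"], relevance=[[0, 2, 1], [0, 2, 1]], k=3
)
print(toy_rewards)
```

The appeal of this design is that the reward no longer depends on a regex matching the prompt layout, so prompt edits cannot silently break the label lookup.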


In [None]:
# GRPO Trainer configuration (RTX 5090 optimized + TRL best practices)

grpo_config = GRPOConfig(
    output_dir=str(OUTPUT_DIR),
    
    # === GRPO Algorithm (TRL best practices) ===
    loss_type="dapo",  # Eliminates length bias (TRL default)
    scale_rewards="batch",  # Robust (PPO Lite paper)
    mask_truncated_completions=True,  # Stability (DAPO paper)
    importance_sampling_level="sequence",  # Stable (GSPO paper)
    
    # === GRPO Core ===
    epsilon=0.2,  # GRPO clipping
    beta=0.0,  # KL coefficient disabled (agents.md)
    # The GRPO group is defined by num_generations; TRL expects the effective
    # batch size to be divisible by it, so check this on the installed version
    num_generations=NUM_GENERATIONS,
    max_completion_length=128,
    
    # === Training (RTX 5090 optimized) ===
    max_steps=MAX_STEPS,
    learning_rate=LEARNING_RATE,
    warmup_steps=WARMUP_STEPS,
    per_device_train_batch_size=PER_DEVICE_BATCH_SIZE,  # 2 (RTX 5090)
    gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,  # 2
    
    # === Optimization ===
    bf16=True,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},  # Modern PyTorch
    max_grad_norm=1.0,
    
    # === Logging ===
    logging_steps=LOGGING_STEPS,
    logging_first_step=True,
    eval_steps=EVAL_STEPS,
    save_steps=SAVE_STEPS,
    log_completions=True,  # Useful for debugging
    
    # === Other ===
    dataloader_num_workers=2,
    seed=SEED,
)

print("✅ GRPO Trainer konfiguráció (TRL best practices):")
print(f"  Loss: {grpo_config.loss_type} (eliminates length bias)")
print(f"  Reward scaling: {grpo_config.scale_rewards} (robust)")
print(f"  Group size (slate length): {GROUP_SIZE}")
print(f"  Batch: {grpo_config.per_device_train_batch_size} × {grpo_config.gradient_accumulation_steps} = {grpo_config.per_device_train_batch_size * grpo_config.gradient_accumulation_steps}")
print(f"  Steps: {grpo_config.max_steps}, Generations: {grpo_config.num_generations}")
print(f"  KL beta: {grpo_config.beta} (disabled)")
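Batch size, gradient accumulation and generations per prompt interact. TRL's documentation states that the effective batch size must be divisible by `num_generations`, although the exact check may differ between versions. The sketch below only illustrates that arithmetic under the assumption that batch sizes count completions. `prompts_per_step` is a hypothetical helper, not a TRL function.

```python
def prompts_per_step(per_device_batch: int, grad_accum: int,
                     num_generations: int, num_devices: int = 1) -> int:
    """Unique prompts per optimizer step, assuming batch sizes count completions."""
    completions = per_device_batch * grad_accum * num_devices
    if completions % num_generations != 0:
        raise ValueError(
            f"{completions} completions per step is not divisible by "
            f"num_generations={num_generations}"
        )
    return completions // num_generations


for gens in (2, 4, 6):
    try:
        print(gens, "->", prompts_per_step(2, 2, gens), "prompt(s) per step")
    except ValueError as err:
        print(gens, "->", err)
```

Under this assumption, batch 2 with accumulation 2 fits 2 or 4 generations but not 6, so it is worth confirming the combination used here before a long run.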


In [None]:
# GRPO Trainer initialization (train/eval split)

trainer = GRPOTrainer(
    model=model,
    reward_funcs=reward_function,
    args=grpo_config,
    train_dataset=train_dataset,  # 80% (agents.md)
    eval_dataset=eval_dataset,     # 20% (agents.md)
    processing_class=tokenizer,
)

print("✅ GRPO Trainer inicializálva (train/eval split)")
print(f"\n📊 Training információk:")
print(f"  Training queries: {len(train_dataset)}")
print(f"  Eval queries: {len(eval_dataset)}")
print(f"  Effective batch size: {PER_DEVICE_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS}")
print(f"  Total training steps: {MAX_STEPS}")
print(f"  GPU memória: {torch.cuda.memory_allocated() / 1e9:.2f} GB")


In [None]:
# Start training
print("\n🚀 GRPO TRAINING INDÍTÁSA\n")
print("="*60)

try:
    trainer.train()
    print("\n" + "="*60)
    print("✅ Training sikeresen befejezve!")
except Exception as e:
    print(f"\n❌ Training hiba: {e}")
    raise


In [None]:
# Save artifacts
print("\n💾 Artifactumok mentése...")

# Save the LoRA adapter
trainer.save_model(str(OUTPUT_DIR))
print(f"  ✅ LoRA adapter: {OUTPUT_DIR}")

# Save the tokenizer
tokenizer.save_pretrained(str(OUTPUT_DIR))
print(f"  ✅ Tokenizer: {OUTPUT_DIR}")

# Save metrics
final_metrics = {
    "model_name": MODEL_NAME,
    "training_samples": len(train_dataset),
    "slate_size": SLATE_SIZE,
    "group_size": GROUP_SIZE,
    "max_steps": MAX_STEPS,
    "learning_rate": LEARNING_RATE,
    "lora_rank": LORA_RANK,
    "lora_alpha": LORA_ALPHA,
    # Most recent log entry that carries a "loss" value (the last entry may be a run summary)
    "final_loss": next((log["loss"] for log in reversed(trainer.state.log_history) if "loss" in log), 0.0),
    "total_steps": trainer.state.global_step,
    "status": "completed"
}

with open(METRICS_FILE, 'w', encoding='utf-8') as f:
    json.dump(final_metrics, f, ensure_ascii=False, indent=2)

print(f"  ✅ Metrics: {METRICS_FILE}")

print("\n✅ Minden artifact sikeresen mentve!")
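For later analysis the metrics file can be read back with the same encoding it was written in. The sketch below does a round trip in a temporary directory. `load_metrics` is a hypothetical helper and the dictionary holds stand-in values, not real training results.

```python
import json
import tempfile
from pathlib import Path


def load_metrics(path: Path) -> dict:
    """Read a metrics JSON file written with json.dump(..., ensure_ascii=False)."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Round trip with stand-in values in a temporary directory (not real results)
with tempfile.TemporaryDirectory() as tmp:
    metrics_path = Path(tmp) / "metrics.json"
    sample = {"model_name": "example-model", "max_steps": 500, "status": "completed"}
    with open(metrics_path, "w", encoding="utf-8") as f:
        json.dump(sample, f, ensure_ascii=False, indent=2)
    loaded = load_metrics(metrics_path)

print(loaded["status"], loaded["max_steps"])
```

Opening the file with an explicit `encoding="utf-8"` matters because `ensure_ascii=False` writes non-ASCII characters, such as Hungarian query text, verbatim.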


## Training summary

### Generated artifacts:

When training finishes, the following files are created in the `/workspace/artifacts/grpo_policy/` directory:

- `adapter_model.safetensors` - LoRA adapter weights (`adapter_model.bin` with older PEFT versions)
- `adapter_config.json` - LoRA configuration
- `tokenizer_config.json` - Tokenizer configuration
- `tokenizer.json` - Tokenizer vocabulary and merges
- `metrics.json` - Training metrics and configuration

### Next steps:

1. **Download the artifacts:**
   ```bash
   # From the RunPod terminal
   cd /workspace/artifacts/grpo_policy
   ls -lh
   ```

2. **Inspect the metrics:**
   ```bash
   cat metrics.json
   ```

3. **Local analysis:**
   - Download `metrics.json` into the local `data/models/grpo_policy/` directory
   - Leave the adapter weights in the cloud (local inference is not supported)

### Per the Agents.md specification:

- ✅ Cloud-only training (RunPod)
- ✅ Qwen/Qwen3-4B-Instruct-2507 + QLoRA
- ✅ Group size = slate length (20)
- ✅ NDCG@10-based reward
- ✅ Hungarian status messages
- ✅ Artifact export `/workspace/artifacts/grpo_policy/`
