# CourtRankRL FAISS Index Build: Local/Cloud GPU

## Specification
- **Input**: Pre-computed embeddings (`embeddings.npy`) and chunk ID mapping (`embedding_chunk_ids.json`)
- **Index**: OPQ64_256,IVF65536,PQ64x4fsr (chosen as a memory/accuracy trade-off)
- **Metric**: Inner Product (IP) on L2-normalized vectors
- **Output**: `faiss_index.bin` and `chunk_id_map.npy`

## Environment-dependent execution
- **Local (M3 MacBook Air)**: CPU-only training, memory-optimized
- **Cloud GPU (RunPod)**: GPU-accelerated training, faster build
- **Flexible path configuration**: workspace or local artifacts

## FAISS parameters (per agents.md)
- **nlist**: 65536 (for 1M-10M vectors)
- **nprobe**: 256-1024 (recall tuning)
- **Train sample**: 1,966,080 vectors (adaptive: capped at the dataset size)
- **Memory footprint**: ~2-3GB (M3 MacBook Air compatible)
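
A back-of-the-envelope calculation can help when sizing a machine for this build. The sketch below assumes 2,000,000 float32 vectors of dimension 768 (the values the notebook uses for `N_ESTIMATED` and `EMBED_DIM`); the helper names are illustrative and not part of the notebook. It counts only the raw embedding matrix and the PQ codes, ignoring IDs, centroids, and fast-scan padding.

```python
def embeddings_bytes(n_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Size in bytes of a dense embedding matrix (float32 by default)."""
    return n_vectors * dim * bytes_per_value


def pq_code_bytes(n_vectors: int, m: int = 64, nbits: int = 4) -> int:
    """Size in bytes of the PQ codes alone: m sub-quantizers of nbits each per vector."""
    return n_vectors * (m * nbits) // 8


N = 2_000_000  # assumed vector count
DIM = 768      # assumed embedding dimension

raw_gib = embeddings_bytes(N, DIM) / 1024**3
codes_mib = pq_code_bytes(N) / 1024**2

print(f"raw embeddings: {raw_gib:.2f} GiB")   # 5.72 GiB
print(f"PQ64x4 codes:   {codes_mib:.1f} MiB")  # 61.0 MiB
```

Under these assumptions the loaded embedding matrix, not the compressed index, dominates memory during the build.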


In [None]:
%pip install --upgrade --quiet faiss-gpu tqdm numpy psutil


⚠️ faiss-gpu not available, installing faiss-cpu...
❌ FAISS installation failed
✅ Other packages installed


/Users/zelenyianszkimate/Documents/CourtRankRL/.venv/bin/python: No module named pip


In [2]:
import json
import os
import time
from pathlib import Path
from typing import List

import faiss
import numpy as np
import psutil
import torch
from tqdm import tqdm


In [None]:
# --- Configuration ---

# Flexible path configuration: workspace or local artifacts
BASE_PATH = Path(os.getenv("WORKSPACE_PATH", "/workspace"))
ARTIFACTS_PATH = Path(os.getenv("ARTIFACTS_PATH", str(BASE_PATH)))

# Input files
EMBEDDINGS_PATH = ARTIFACTS_PATH / "embeddings.npy"
CHUNK_IDS_PATH = ARTIFACTS_PATH / "embedding_chunk_ids.json"

# Output files
FAISS_PATH = BASE_PATH / "faiss_index.bin"
CHUNK_MAP_PATH = BASE_PATH / "chunk_id_map.npy"

# FAISS parameters (per agents.md), with CPU-optimized alternatives
EMBED_DIM = 768

# CPU-optimized nlist calculation (based on the documentation)
# N = number of vectors; recommended nlist range: 4*sqrt(N) to 16*sqrt(N)
N_ESTIMATED = 2_000_000  # estimated number of vectors
NLIST_MIN = int(4 * (N_ESTIMATED ** 0.5))   # 5656
NLIST_MAX = int(16 * (N_ESTIMATED ** 0.5))  # 22627
NLIST_TARGET = min(16384, NLIST_MAX)  # reduced from 65536 for CPU

# Alternative index configurations for CPU, tried in this order
INDEX_ALTERNATIVES = [
    f"OPQ64_256,IVF{NLIST_TARGET},PQ64x4fsr",  # current (reduced nlist)
    f"IVF{NLIST_TARGET},PQ64x4fs,RFlat",       # PQ + RFlat re-ranking
    "HNSW32",                                  # HNSW alternative
    f"OPQ64,IVF{NLIST_TARGET//4},PQ64x4fsr",   # quarter nlist
]

NPROBE_TARGET = min(256, NLIST_TARGET // 4)  # recall tuning
TRAIN_SAMPLE_SIZE = 1_966_080  # upper bound on the training sample
RNG_SEED = 42

np.random.seed(RNG_SEED)

print(f"📂 Base path: {BASE_PATH}")
print(f"📦 Artifacts path: {ARTIFACTS_PATH}")
print(f"📄 Embeddings: {EMBEDDINGS_PATH}")
print(f"📄 Chunk IDs: {CHUNK_IDS_PATH}")

# Check that the input files exist
if not EMBEDDINGS_PATH.exists():
    raise FileNotFoundError(f"❌ Not found: {EMBEDDINGS_PATH}")
if not CHUNK_IDS_PATH.exists():
    raise FileNotFoundError(f"❌ Not found: {CHUNK_IDS_PATH}")

BASE_PATH.mkdir(parents=True, exist_ok=True)
print("✅ Configuration loaded - FAISS INDEX BUILD MODE")


📂 Base path: /workspace
📦 Artifacts path: /workspace
📄 Embeddings: /workspace/embeddings.npy
📄 Chunk IDs: /workspace/embedding_chunk_ids.json


FileNotFoundError: ❌ Not found: /workspace/embeddings.npy

In [None]:
# --- Embedding loading and robust validation ---

print("="*80)
print("📥 LOADING EMBEDDINGS")
print("="*80)

load_start = time.time()

# Load the embedding vectors
print("🔄 Loading embedding vectors...")
all_embeddings = np.load(EMBEDDINGS_PATH)
print(f"✅ Embeddings loaded: {all_embeddings.shape}")

# Load the chunk ID list
print("🔄 Loading chunk ID list...")
with CHUNK_IDS_PATH.open("r", encoding="utf-8") as handle:
    all_chunk_ids = json.load(handle)
print(f"✅ Chunk IDs loaded: {len(all_chunk_ids)}")

# Basic validation (count and dimension)
n_vectors, d = all_embeddings.shape
assert d == EMBED_DIM, f"Expected embedding dim {EMBED_DIM}, got {d}"
assert len(all_chunk_ids) == n_vectors, (
    f"Number of chunk IDs ({len(all_chunk_ids)}) does not match the number of vectors ({n_vectors})"
)

# Robust cleaning: filter out NaN/Inf and zero-norm vectors
print("🧹 Cleaning embeddings (NaN/Inf/zero norms)...")
finite_mask = np.isfinite(all_embeddings).all(axis=1)
norms = np.linalg.norm(all_embeddings, axis=1)
nonzero_mask = norms > 0
valid_mask = finite_mask & nonzero_mask

num_invalid = int((~valid_mask).sum())
if num_invalid > 0:
    print(f"⚠️ {num_invalid:,} problematic vectors filtered out (NaN/Inf or zero norm)")
    # Apply the same mask to vectors and IDs so the two stay aligned
    all_embeddings = all_embeddings[valid_mask]
    all_chunk_ids = [cid for cid, keep in zip(all_chunk_ids, valid_mask) if keep]
else:
    print("✅ No NaN/Inf/zero-norm problems in the embeddings")

# Safe L2 normalization (the clip guards against division by zero)
norms = np.linalg.norm(all_embeddings, axis=1, keepdims=True)
all_embeddings = all_embeddings / np.clip(norms, 1e-12, None)

# Float32 conversion (FAISS expects float32)
if all_embeddings.dtype != np.float32:
    all_embeddings = all_embeddings.astype(np.float32, copy=False)

# Final checks
assert np.isfinite(all_embeddings).all(), "NaN/Inf remained in the embedding matrix after cleaning"
final_norms = np.linalg.norm(all_embeddings, axis=1)
if not np.allclose(final_norms, 1.0, atol=1e-6):
    print("ℹ️ Norms after normalization are not exactly 1; renormalizing")
    all_embeddings = all_embeddings / np.clip(
        np.linalg.norm(all_embeddings, axis=1, keepdims=True), 1e-12, None
    )

# Refresh the counts after filtering
n_vectors, d = all_embeddings.shape

load_time = time.time() - load_start
print(f"⏱️ Load time: {load_time:.2f} seconds")
print(f"📊 Totals: {n_vectors:,} vectors, {d} dimensions")
print(f"💾 Embedding matrix size: {all_embeddings.nbytes / 1024**3:.2f} GB")
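
The cleaning and normalization step above can be tried on a toy matrix. This is a minimal sketch, assuming NumPy is installed; `clean_and_normalize` and the toy data are illustrative and not part of the notebook. It also shows why Inner Product is a sensible metric here: for unit-length vectors the inner product equals the cosine similarity.

```python
import numpy as np


def clean_and_normalize(x: np.ndarray):
    """Drop non-finite and zero-norm rows, then L2-normalize the remaining rows."""
    finite = np.isfinite(x).all(axis=1)
    norms = np.linalg.norm(x, axis=1)
    keep = finite & (norms > 0)           # NaN norms compare as False
    kept = x[keep].astype(np.float32)
    kept /= np.linalg.norm(kept, axis=1, keepdims=True)
    return kept, keep


toy = np.array(
    [[3.0, 4.0],      # valid, norm 5
     [0.0, 0.0],      # zero norm: dropped
     [np.nan, 1.0],   # NaN: dropped
     [1.0, 0.0]],     # valid, already unit length
    dtype=np.float32,
)
unit, keep = clean_and_normalize(toy)

# Inner product of unit vectors is their cosine similarity (about 0.6 here)
cos = float(unit[0] @ unit[1])
print(keep.tolist(), cos)
```

The boolean `keep` mask plays the role of `valid_mask` in the notebook: applying it to both the vectors and the chunk IDs keeps row `i` of the matrix paired with ID `i`.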


In [None]:
# --- FAISS index build ---

print("="*80)
print("🏗️ FAISS INDEX BUILD")
print("="*80)

build_start = time.time()

# 1) CPU-optimized index selection
print(f"📊 Dataset: {n_vectors:,} vectors, {d} dimensions")
print(f"📏 Recommended nlist range: {NLIST_MIN:,} - {NLIST_MAX:,}")

# Pick a CPU-friendly index by trying the candidates in order
def select_cpu_friendly_index():
    """Return the first candidate index that FAISS can construct, with its factory string."""
    for i, factory_str in enumerate(INDEX_ALTERNATIVES):
        print(f"🔍 Attempt {i+1}/{len(INDEX_ALTERNATIVES)}: {factory_str}")
        try:
            index = faiss.index_factory(d, factory_str, faiss.METRIC_INNER_PRODUCT)
            print(f"✅ Success: {factory_str}")
            return index, factory_str
        except Exception as e:
            print(f"❌ Failed: {factory_str} - {type(e).__name__}: {e}")
            continue
    raise RuntimeError("❌ None of the candidate indexes could be built on CPU")

index_cpu, factory_str = select_cpu_friendly_index()
try:
    # Read nlist from the IVF layer; wrappers such as OPQ do not expose it directly
    nlist = faiss.extract_index_ivf(index_cpu).nlist
except Exception:
    nlist = NLIST_TARGET  # no IVF layer found (e.g. HNSW32)
print(f"✅ Final index: {factory_str} (nlist: {nlist:,})")

# 2) Training sample selection
target_train = min(n_vectors, TRAIN_SAMPLE_SIZE)
rng = np.random.default_rng(RNG_SEED)
if n_vectors > target_train:
    print(f"🎲 Random training sample: {target_train:,} vectors")
    train_idx = rng.choice(n_vectors, size=target_train, replace=False)
    train_matrix = all_embeddings[train_idx]
else:
    print(f"📊 Using all vectors for training: {n_vectors:,}")
    train_matrix = all_embeddings

train_matrix = np.ascontiguousarray(train_matrix.astype("float32", copy=False))

# 3) Device detection and training
USE_GPU = torch.cuda.is_available()
print(f"🚀 GPU available: {USE_GPU}")

index = None
if USE_GPU and hasattr(faiss, "StandardGpuResources"):
    # GPU-first: fall back to CPU if cloning or training on the GPU fails
    try:
        print("🚀 GPU-first training...")
        res = faiss.StandardGpuResources()
        gpu_index = faiss.index_cpu_to_gpu(res, 0, index_cpu)
        print("🎓 Starting FAISS training (GPU)...")
        gpu_index.train(train_matrix)
        index = gpu_index
        print("✅ GPU training done")
    except Exception as e:
        print(f"⚠️ GPU training failed ({type(e).__name__}): {e}")
else:
    print("⚠️ No usable GPU (no CUDA device or a faiss-cpu build), training on CPU")

if index is None:
    print("🧠 CPU training (low-memory settings)...")
    try:
        faiss.omp_set_num_threads(min(os.cpu_count() or 8, 16))
        print(f"🔧 OMP threads: {faiss.omp_get_max_threads()}")
    except Exception as e2:
        print(f"⚠️ OMP configuration failed: {e2}")
    print("🎓 Starting FAISS training (CPU)...")
    index = index_cpu
    index.train(train_matrix)
    print("✅ CPU training done")

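# 5) Set nprobe. Assigning index.nprobe on the object returned by the factory does not
# reach the IVF layer when it is wrapped (OPQ prefix, RFlat suffix), so the value is set
# through a parameter space, which forwards it to the wrapped IVF index.
nprobe = min(int(NPROBE_TARGET), nlist)
if "IVF" in factory_str:
    ps = faiss.ParameterSpace() if index is index_cpu else faiss.GpuParameterSpace()
    ps.set_index_parameter(index, "nprobe", nprobe)
print(f"🎯 nprobe: {nprobe}")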

# 6) Add vectors in batches
def add_in_batches(faiss_index, vectors, batch_size=200_000):
    """Add vectors batch by batch to limit peak memory; return the number added."""
    N = vectors.shape[0]
    added = 0
    for start in range(0, N, batch_size):
        end = min(N, start + batch_size)
        faiss_index.add(vectors[start:end])
        added = end
        # Report every fifth batch and at the very end
        if (start // batch_size) % 5 == 0 or end == N:
            print(f"➕ Add progress: {added:,}/{N:,}")
    return added

print("📥 Adding vectors to the index (batched)...")
added = add_in_batches(index, all_embeddings, batch_size=200_000)
print(f"✅ Index size: {index.ntotal:,} vectors (added: {added:,})")

build_time = time.time() - build_start
print(f"⏱️ Build time: {build_time:.2f} seconds")


In [None]:
# --- Save and validation ---

print("="*80)
print("💾 SAVE AND VALIDATION")
print("="*80)

save_start = time.time()

# 7) GPU → CPU conversion before saving (if needed).
# faiss-cpu builds have no GpuIndex class, so check for it before using isinstance.
if hasattr(faiss, "GpuIndex") and isinstance(index, faiss.GpuIndex):
    print("🔁 GPU → CPU conversion for saving...")
    index_cpu_final = faiss.index_gpu_to_cpu(index)
else:
    index_cpu_final = index

# 8) Save the FAISS index
print("💾 Saving FAISS index...")
faiss.write_index(index_cpu_final, str(FAISS_PATH))
print(f"✅ Saved: {FAISS_PATH}")

# 9) Save the chunk ID mapping (npy format).
# The array has dtype=object, so readers must call np.load(..., allow_pickle=True).
print("💾 Saving chunk ID mapping (npy)...")
chunk_ids_array = np.asarray(all_chunk_ids, dtype=object)
np.save(CHUNK_MAP_PATH, chunk_ids_array)
print(f"✅ Saved: {CHUNK_MAP_PATH} (shape={chunk_ids_array.shape})")

save_time = time.time() - save_start
total_time = time.time() - load_start

# 10) Performance metrics
# Note: virtual_memory().used is system-wide memory in use, not just this process
mem_usage = psutil.virtual_memory().used / 1024**3
idx_size_gb = FAISS_PATH.stat().st_size / 1024**3
# Same value the build step requests; wrapped indexes do not expose nprobe directly
nprobe_used = min(int(NPROBE_TARGET), nlist)

print("="*80)
print("📊 PERFORMANCE METRICS")
print("="*80)
print(f"⏱️ Total processing time: {total_time:.2f} seconds")
print(f"⏱️ Load time: {load_time:.2f} seconds")
print(f"⏱️ Build time: {build_time:.2f} seconds")
print(f"⏱️ Save time: {save_time:.2f} seconds")
print(f"💾 System memory in use: {mem_usage:.1f}GB")
print(f"📦 Index file size: {idx_size_gb:.2f}GB")
print(f"🧮 Vectors: {index_cpu_final.ntotal:,}")
print(f"📏 nlist: {nlist:,}, nprobe: {nprobe_used}")
print(f"🏗️ Index type: {factory_str}")

print("🎉 FAISS index built successfully!")
print(f"   Index: {FAISS_PATH.name}")
print(f"   Mapping: {CHUNK_MAP_PATH.name}")
print(f"   Vectors: {index_cpu_final.ntotal:,}")
print(f"   Index type: {factory_str}")
print(f"   nlist: {nlist:,}, nprobe: {nprobe_used}")


## Summary
- The notebook builds a FAISS dense index with a CPU-optimized configuration.
- It uses the Inner Product metric on L2-normalized vectors, which makes the scores equal to cosine similarity.
- Output files: `faiss_index.bin` and `chunk_id_map.npy` in the configured directory.
- The notebook keeps the first candidate index that FAISS can construct on the current machine.

### CPU-optimized index selection:
The notebook tries the candidates in this order and keeps the first one that builds:
1. **OPQ64_256,IVF16384,PQ64x4fsr** (reduced nlist)
2. **IVF16384,PQ64x4fs,RFlat** (PQ + re-ranking)
3. **HNSW32** (HNSW alternative)
4. **OPQ64,IVF4096,PQ64x4fsr** (quarter nlist)
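
The selection logic is a plain "first candidate that builds" loop, so the order of the list is the priority order. The sketch below illustrates the pattern without FAISS: `first_buildable` and `fake_factory` are hypothetical names, and `fake_factory` is only a stand-in for `faiss.index_factory` that rejects one kind of spec so the fallback can be observed.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def first_buildable(candidates: Iterable[str], build: Callable[[str], T]) -> Tuple[T, str]:
    """Return (built object, spec) for the first spec that `build` accepts."""
    errors: List[str] = []
    for spec in candidates:
        try:
            return build(spec), spec
        except Exception as exc:
            # Remember why this candidate failed and move on to the next one
            errors.append(f"{spec}: {type(exc).__name__}: {exc}")
    raise RuntimeError("no candidate could be built:\n" + "\n".join(errors))


def fake_factory(spec: str) -> dict:
    """Hypothetical stand-in for faiss.index_factory; rejects specs containing 'fsr'."""
    if "fsr" in spec:
        raise ValueError("rejected by the fake factory")
    return {"spec": spec}


built, chosen = first_buildable(
    ["OPQ64_256,IVF16384,PQ64x4fsr", "IVF16384,PQ64x4fs,RFlat", "HNSW32"],
    fake_factory,
)
print(chosen)  # IVF16384,PQ64x4fs,RFlat
```

Because the loop stops at the first success, a later candidate is only reached when every earlier one raises during construction; the loop does not compare recall or speed between candidates.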

### Environment-dependent optimizations:
- **Local (M3)**: CPU-only training, memory-optimized, OMP thread limiting
- **Cloud GPU**: GPU-accelerated training, faster build times (if available)
- **Flexible paths**: workspace or local artifacts

### FAISS parameters (dynamic):
- **nlist**: 5,656-22,627 range for N = 2,000,000 (from 4*sqrt(N) to 16*sqrt(N)); the notebook uses 16384
- **nprobe**: min(256, nlist/4) (recall tuning)
- **Train sample**: 1,966,080 vectors (adaptive: capped at the dataset size)
- **Memory footprint**: ~2-3GB (M3 MacBook Air compatible)
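
The parameter values above follow from simple arithmetic. The sketch below reproduces it in plain Python, assuming N = 2,000,000 as in the notebook's estimate; `nlist_range` is an illustrative helper, not a FAISS function.

```python
import math
from typing import Tuple


def nlist_range(n_vectors: int) -> Tuple[int, int]:
    """Heuristic nlist bounds: 4*sqrt(N) to 16*sqrt(N), truncated to integers."""
    root = math.sqrt(n_vectors)
    return int(4 * root), int(16 * root)


N = 2_000_000                      # assumed dataset size
lo, hi = nlist_range(N)            # (5656, 22627)
nlist = min(16_384, hi)            # CPU target used by the notebook
nprobe = min(256, nlist // 4)      # 256
train_size = min(N, 1_966_080)     # training sample is capped at the dataset size
points_per_centroid = train_size / nlist

print(lo, hi, nlist, nprobe, points_per_centroid)  # 5656 22627 16384 256 120.0
```

With nlist = 16384 the sample provides 120 training points per centroid. The same sample spread over nlist = 65536 would give 30 per centroid, which is below the default minimum of 39 at which FAISS k-means prints a warning.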

### Usage:
```bash
# Local run (M3 MacBook Air); outputs go to WORKSPACE_PATH, which defaults to /workspace:
export ARTIFACTS_PATH="/path/to/artifacts"
export WORKSPACE_PATH="/path/to/output"
jupyter notebook faiss_index_builder.ipynb

# Cloud run (RunPod):
export WORKSPACE_PATH="/workspace"
jupyter notebook faiss_index_builder.ipynb
```

### Next step:
The `scripts/hybrid_retrieval.py` script uses this index for hybrid retrieval.
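
Any consumer of the two output files has to map FAISS row numbers back to chunk IDs. The sketch below shows that step in isolation, assuming NumPy is installed; the chunk IDs and result rows are made up, and nothing here is taken from `scripts/hybrid_retrieval.py`. Because the notebook saves the mapping with `dtype=object`, NumPy pickles it, and `np.load` needs `allow_pickle=True` to read it back.

```python
import os
import tempfile

import numpy as np

chunk_ids = ["doc1_0", "doc1_1", "doc2_0"]  # hypothetical chunk IDs

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "chunk_id_map.npy")
    # Same save call as in the notebook: an object array, stored via pickle
    np.save(path, np.asarray(chunk_ids, dtype=object))
    # Without allow_pickle=True, loading an object array raises ValueError
    id_map = np.load(path, allow_pickle=True)

# Row numbers as a FAISS search returns them; -1 marks a slot with no result
result_rows = np.array([2, 0, -1])
hits = [id_map[i] for i in result_rows if i >= 0]
print(hits)  # ['doc2_0', 'doc1_0']
```

Only load the mapping with `allow_pickle=True` from files you produced yourself, since unpickling untrusted data can execute arbitrary code.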
