# LLM-Judge Fake Receipt Detector — Demo Notebook

This notebook runs the full pipeline from a Jupyter environment:

0. **Setup**: configure the HuggingFace token and dependencies
1. **Dataset**: load the labels and explore the available data
2. **Sampling**: select REAL/FAKE receipts for the evaluation
3. **Forensic analysis**: image signals without an LLM (ELA, noise, copy-move)
4. **Single-receipt demo**: run the 3 LLM judges on one receipt
5. **Full pipeline**: run every sampled receipt
6. **Evaluation**: metrics (accuracy, precision, recall, F1, confusion matrix)
7. **CLI equivalents**: the same steps from the command line

> **Requirement:** a valid `HF_TOKEN` with access to models through the HuggingFace Inference API.
> The models used are `Qwen/Qwen2.5-VL-72B-Instruct` and `InternVL3-14B` (serverless inference).

---
## 0. Setup

In [None]:
import sys
import os
from pathlib import Path

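# Make sure the project root is on sys.path.
# The guard keeps this cell idempotent: without it, re-running the cell after
# the os.chdir() below would resolve '..' again and climb one directory too far.
if 'PROJECT_ROOT' not in globals():
    PROJECT_ROOT = Path('..').resolve()
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

# Change the working directory to the root so that relative paths work
os.chdir(PROJECT_ROOT)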
print(f"Working directory: {os.getcwd()}")

In [None]:
# Option 1: load from a .env file (requires python-dotenv)
# from dotenv import load_dotenv
# load_dotenv()

# Option 2: set the token directly (do NOT commit it to git)
# os.environ['HF_TOKEN'] = 'hf_xxxxxxxxxxxxx'

# Check that the token is available
hf_token = os.environ.get('HF_TOKEN', '')
if not hf_token:
    print("WARNING: HF_TOKEN not found. The LLM judges will fail.")
    print("Set it with: os.environ['HF_TOKEN'] = 'hf_xxx'")
else:
    print(f"HF_TOKEN found: {hf_token[:8]}...")

---
## 1. Load the Dataset

In [None]:
from pipeline.dataset import DatasetManager

dm = DatasetManager()

# Load the labels for the training split.
# Uses the precomputed CSVs in data/dataset/findit2/ if they exist,
# or falls back to the raw file from the extracted dataset.
labels_train = dm.load_labels('train')
print(f"Train — total: {len(labels_train)}, "
      f"REAL: {sum(v=='REAL' for v in labels_train.values())}, "
      f"FAKE: {sum(v=='FAKE' for v in labels_train.values())}")

In [None]:
import pandas as pd

# Load the three splits
all_splits = dm.load_all_splits()

rows = []
for split, labels in all_splits.items():
    real = sum(v == 'REAL' for v in labels.values())
    fake = sum(v == 'FAKE' for v in labels.values())
    rows.append({'split': split, 'total': len(labels), 'REAL': real, 'FAKE': fake,
                 # Guard against an empty split to avoid a ZeroDivisionError
                 'fake_pct': round(100 * fake / len(labels), 1) if labels else 0.0})

pd.DataFrame(rows).set_index('split')

In [None]:
# Locate the image for a specific receipt
# (requires the dataset to be downloaded and extracted)
sample_id = list(labels_train.keys())[0]
sample_label = labels_train[sample_id]

img_path = dm.find_image(sample_id, 'train')
ocr_path = dm.find_ocr_txt(sample_id, 'train')

print(f"ID      : {sample_id}")
print(f"Label   : {sample_label}")
print(f"Image   : {img_path}")
print(f"OCR txt : {ocr_path}")

### 1a. Download and extract the dataset (only if it has not been extracted yet)

If `data/raw/findit2/` already exists, this step is skipped automatically.

In [None]:
# UNCOMMENT to download (~400 MB)
# dm.download()
# dm.extract()

---
## 2. Receipt Sampling

In [None]:
from pipeline.sampler import ReceiptSampler

sampler = ReceiptSampler()
print(f"Config: {sampler.real_count} REAL + {sampler.fake_count} FAKE, "
      f"split='{sampler.split}', seed={sampler.random_seed}")

In [None]:
# Select the sample (reproducible thanks to the fixed seed)
labels = dm.load_labels(sampler.split)
sample = sampler.sample(labels, dataset_manager=dm)

# Show it as a DataFrame
sample_df = pd.DataFrame([
    {
        'id': r['id'],
        'label': r['label'],
        'image_found': bool(r.get('image_path')),
        'ocr_found': bool(r.get('ocr_txt_path')),
    }
    for r in sample
])
print(f"Sample: {len(sample_df)} receipts")
print(sample_df['label'].value_counts().to_string())
sample_df.head(10)

In [None]:
# Save the sample to outputs/samples.json
sampler.save(sample)
print("Sample saved.")
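
`ReceiptSampler` is imported from the project and its code is not shown in this notebook. As a rough illustration of class-balanced sampling that stays reproducible under a fixed seed, the standalone sketch below uses only the standard library. The function `stratified_sample` and the toy IDs are hypothetical and are not part of the project's API; the real sampler may work differently.

```python
import random

def stratified_sample(labels, real_count, fake_count, seed):
    """Pick a fixed number of REAL and FAKE receipt IDs, reproducibly."""
    rng = random.Random(seed)  # local RNG: leaves the global random state alone
    picked = []
    for label, count in (("REAL", real_count), ("FAKE", fake_count)):
        # Sort the IDs so the result does not depend on dict insertion order
        ids = sorted(rid for rid, lab in labels.items() if lab == label)
        chosen = rng.sample(ids, min(count, len(ids)))
        picked.extend({"id": rid, "label": label} for rid in chosen)
    return picked

# Toy labels with made-up IDs (not taken from the dataset): 15 REAL, 5 FAKE
toy_labels = {f"r{i:02d}": ("FAKE" if i % 4 == 0 else "REAL") for i in range(20)}

toy_sample_a = stratified_sample(toy_labels, real_count=3, fake_count=2, seed=42)
toy_sample_b = stratified_sample(toy_labels, real_count=3, fake_count=2, seed=42)
print(toy_sample_a == toy_sample_b)  # same seed, same selection
```

Using a dedicated `random.Random(seed)` instance instead of `random.seed()` keeps the selection stable even if other cells draw random numbers in between.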

---
## 3. Forensic Analysis (without an LLM)

The `ForensicPipeline` extracts image signals:
- **ELA** (Error Level Analysis): detects JPEG recompression artifacts
- **Noise map**: blocks with anomalous variance (possibly pasted regions)
- **Copy-move**: detection of regions duplicated within the same receipt
- **OCR**: extracts structured text for arithmetic verification

It requires neither a GPU nor a HuggingFace token.
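
To make the noise-map idea concrete, here is a minimal standard-library sketch that splits a grayscale image into tiles, computes the intensity variance of each tile, and flags tiles whose variance is a robust outlier. This is an assumption about how such a signal can be computed, not the actual `ForensicPipeline` code: the function names, the 4x4 tile size, and the 3.5 threshold on the median/MAD z-score are all illustrative choices.

```python
from statistics import median, pvariance

def block_variances(gray, block=4):
    """Variance of pixel intensities for each block x block tile of a 2D list."""
    height, width = len(gray), len(gray[0])
    variances = {}
    for top in range(0, height - block + 1, block):
        for left in range(0, width - block + 1, block):
            tile = [gray[y][x]
                    for y in range(top, top + block)
                    for x in range(left, left + block)]
            variances[(top, left)] = pvariance(tile)
    return variances

def anomalous_blocks(variances, z_thresh=3.5):
    """Flag tiles whose variance is a robust outlier (median/MAD z-score)."""
    values = list(variances.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # Every typical tile has the same variance: anything different stands out
        return [pos for pos, v in variances.items() if v != med]
    return [pos for pos, v in variances.items()
            if 0.6745 * abs(v - med) / mad > z_thresh]

# Toy 16x16 "image": a mild repeating texture with one pasted checkerboard tile
toy = [[128 + (7 * x + 13 * y) % 5 for x in range(16)] for y in range(16)]
for y in range(8, 12):
    for x in range(8, 12):
        toy[y][x] = 255 if (x + y) % 2 == 0 else 0

flagged = anomalous_blocks(block_variances(toy))
print(flagged)  # (row, column) of the top-left corner of each flagged tile
```

The median and MAD are used instead of the mean and standard deviation so that a single pasted region does not inflate the baseline it is compared against.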

In [None]:
from pipeline.forensic_pipeline import ForensicPipeline

fp = ForensicPipeline(
    output_dir='outputs/forensic',
    save_images=True,
    verbose=True,
)

# Use the first receipt in the sample that has an image
receipt = next(r for r in sample if r.get('image_path'))
img_p = Path(receipt['image_path'])
ocr_p = receipt.get('ocr_txt_path')

print(f"Analyzing: {receipt['id']} ({receipt['label']})")
forensic_ctx = fp.analyze(img_p, ocr_txt_path=ocr_p)

print(f"\nELA mean error       : {forensic_ctx.ela_mean_error}")
print(f"ELA suspicious ratio : {forensic_ctx.ela_suspicious_ratio:.1%}" 
      if forensic_ctx.ela_suspicious_ratio is not None else "ELA: N/A")
print(f"Copy-move confidence : {forensic_ctx.cm_confidence}")

In [None]:
# Show the text block that is appended to the judge's prompt
print(forensic_ctx.to_prompt_section())

In [None]:
# Display the saved forensic images
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from PIL import Image

forensic_dir = Path('outputs/forensic')
forensic_imgs = sorted(forensic_dir.glob('*.png'))

# Show the original image plus the forensic images
images_to_show = [img_p] + forensic_imgs[:3]
titles = ['Original'] + [p.stem.split('_', 1)[-1] for p in forensic_imgs[:3]]

if images_to_show:
    fig, axes = plt.subplots(1, len(images_to_show), figsize=(5 * len(images_to_show), 6))
    if len(images_to_show) == 1:
        axes = [axes]
    for ax, path, title in zip(axes, images_to_show, titles):
        try:
            img = Image.open(path)
            ax.imshow(img, cmap='gray' if img.mode == 'L' else None)
        except Exception:
            ax.imshow(mpimg.imread(str(path)))
        ax.set_title(title, fontsize=10)
        ax.axis('off')
    plt.suptitle(f"{receipt['id']} ({receipt['label']})", fontsize=12)
    plt.tight_layout()
    plt.show()
else:
    print("No forensic images available. Has the dataset been extracted?")

---
## 4. Demo: One Receipt with the 3 LLM Judges

Requires `HF_TOKEN` to be set in the environment.

The 3 judges are:

| Judge | Model | Persona | Temperature |
|-------|-------|---------|-------------|
| judge_1 | Qwen2.5-VL-72B | Forensic Accountant | 0.1 |
| judge_2 | Qwen2.5-VL-72B | Document Examiner | 0.7 |
| judge_3 | InternVL3-14B | Visual Inspector | 0.3 |

In [None]:
import json

# Pick a receipt from the sample (change the index to try different ones)
RECEIPT_IDX = 0

receipt = sample[RECEIPT_IDX]
receipt_id = receipt['id']
ground_truth = receipt['label']

img_path = Path(receipt['image_path']) if receipt.get('image_path') else dm.find_image(receipt_id)
ocr_path = receipt.get('ocr_txt_path') or dm.find_ocr_txt(receipt_id)

print(f"Receipt ID    : {receipt_id}")
print(f"Ground Truth  : {ground_truth}")
print(f"Image         : {img_path}")
print(f"OCR           : {ocr_path}")

In [None]:
# Preliminary forensic analysis (optional but recommended)
USE_FORENSIC = True  # Set to False to skip

forensic_context = None
if USE_FORENSIC and img_path and img_path.exists():
    fp = ForensicPipeline(output_dir='outputs/forensic', save_images=False, verbose=False)
    forensic_context = fp.analyze(img_path, ocr_txt_path=ocr_path)
    ela = (f"{forensic_context.ela_suspicious_ratio:.1%}" 
           if forensic_context.ela_suspicious_ratio is not None else 'N/A')
    cm = (f"{forensic_context.cm_confidence:.2f}" 
          if forensic_context.cm_confidence is not None else 'N/A')
    print(f"Forensic: ELA={ela}  CopyMove={cm}")
else:
    print("Forensic skipped.")

In [None]:
from judges.qwen_judge import make_forensic_accountant, make_document_examiner
from judges.internvl_judge import InternVLJudge
from judges.voting import VotingEngine

# Instantiate the 3 judges and the voting engine
judges = [
    make_forensic_accountant(),
    make_document_examiner(),
    InternVLJudge(),
]
engine = VotingEngine()

# Run each judge
judge_results = []
for judge in judges:
    print(f"\nRunning {judge.judge_name}...", end=' ', flush=True)
    result = judge.judge(
        receipt_id=receipt_id,
        image_path=img_path,
        forensic_context=forensic_context,
    )
    print(f"{result.label} ({result.confidence:.1f}%)")
    judge_results.append(result)
    print(json.dumps(result.to_dict(), indent=2, ensure_ascii=False))

In [None]:
# Final verdict by voting
verdict = engine.aggregate(judge_results)

is_correct = verdict.label == ground_truth
print("=" * 50)
print(f"FINAL VERDICT   : {verdict.label}")
print(f"Ground Truth    : {ground_truth}")
print(f"Result          : {'✓ CORRECT' if is_correct else '✗ ERROR'}")
print(f"Tally           : {verdict.tally}")
print("=" * 50)
print(json.dumps(verdict.to_dict(), indent=2, ensure_ascii=False))
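
The aggregation itself happens inside `VotingEngine.aggregate`, whose implementation is not shown in this notebook. The sketch below illustrates one plausible reading of a majority strategy with an `uncertain_threshold`: a label wins only if it has at least that many votes and is not tied. Both the `UNCERTAIN` fallback label and this interpretation of the threshold are assumptions, and `majority_vote` is a hypothetical helper rather than project code.

```python
from collections import Counter

def majority_vote(labels, uncertain_threshold=2):
    """Return (final_label, tally) for a list of per-judge labels.

    Assumed semantics: the most common label wins only if it reaches
    `uncertain_threshold` votes and is not tied with another label.
    """
    tally = Counter(labels)
    if not tally:
        return "UNCERTAIN", {}
    (top_label, top_votes), *rest = tally.most_common()
    tied = bool(rest) and rest[0][1] == top_votes
    if tied or top_votes < uncertain_threshold:
        return "UNCERTAIN", dict(tally)
    return top_label, dict(tally)

# Two of three judges agree, so their label wins
print(majority_vote(["FAKE", "FAKE", "REAL"]))
# Every judge says something different, so there is no majority
print(majority_vote(["FAKE", "REAL", "UNCERTAIN"]))
```

With three judges and a threshold of 2, a 2-1 split still produces a verdict, while a three-way split does not.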

---
## 5. Full Pipeline: All Sampled Receipts

Runs the 3 judges on the 20 sampled receipts and saves the results to `outputs/results/`.

In [None]:
import yaml

# Load the voting engine config
with open('configs/judges.yaml', encoding='utf-8') as f:
    cfg = yaml.safe_load(f)
voting_cfg = cfg.get('voting', {})

judges = [
    make_forensic_accountant(),
    make_document_examiner(),
    InternVLJudge(),
]
engine = VotingEngine(
    strategy=voting_cfg.get('strategy', 'majority'),
    uncertain_threshold=voting_cfg.get('uncertain_threshold', 2),
)

# Load the saved sample
sample_loaded = sampler.load()
print(f"Sample loaded: {len(sample_loaded)} receipts")

# Results folder
results_dir = Path('outputs/results')
results_dir.mkdir(parents=True, exist_ok=True)

In [None]:
# Run the full pipeline
USE_FORENSIC = True

fp = ForensicPipeline(output_dir='outputs/forensic', save_images=False, verbose=False) if USE_FORENSIC else None

run_results = []
for receipt in sample_loaded:
    rid = receipt['id']
    img_p = Path(receipt['image_path']) if receipt.get('image_path') else dm.find_image(rid)
    ocr_p = receipt.get('ocr_txt_path') or dm.find_ocr_txt(rid)

    if img_p is None or not img_p.exists():
        print(f"[SKIP] {rid}: image not found")
        continue

    # Forensic analysis
    fctx = None
    if fp:
        fctx = fp.analyze(img_p, ocr_txt_path=ocr_p)
        ela = f"{fctx.ela_suspicious_ratio:.0%}" if fctx.ela_suspicious_ratio is not None else 'N/A'
    else:
        ela = 'OFF'

    # Judges
    jresults = []
    for judge in judges:
        r = judge.judge(receipt_id=rid, image_path=img_p, forensic_context=fctx)
        jresults.append(r)

    verdict = engine.aggregate(jresults)
    output = verdict.to_dict()
    output['ground_truth'] = receipt['label']
    output['forensic_used'] = USE_FORENSIC

    # Save the result as JSON (explicit UTF-8, because ensure_ascii=False can emit non-ASCII text)
    out_path = results_dir / f"{rid}.json"
    with open(out_path, 'w', encoding='utf-8') as f:
        json.dump(output, f, indent=2, ensure_ascii=False)

    match = 'OK' if verdict.label == receipt['label'] else 'WRONG'
    print(f"{rid[:30]:30s} GT={receipt['label']:4s} → {verdict.label:4s} ELA={ela:>5s} [{match}]")
    run_results.append(output)

print(f"\nProcessed: {len(run_results)}/{len(sample_loaded)} receipts")

---
## 6. Evaluating the Results

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from pipeline.evaluator import Evaluator

ground_truth_map = {r['id']: r['label'] for r in sample_loaded}

ev = Evaluator()
ev.load_results()

summary = ev.summary(ground_truth_map)
print(json.dumps(summary, indent=2))

In [None]:
cm = summary['confusion_matrix']
cm_matrix = [
    [cm.get('TP', 0), cm.get('FN', 0)],
    [cm.get('FP', 0), cm.get('TN', 0)],
]

fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(
    cm_matrix, annot=True, fmt='d', cmap='Blues',
    xticklabels=['Pred FAKE', 'Pred REAL'],
    yticklabels=['GT FAKE', 'GT REAL'],
    ax=ax
)
ax.set_title('Confusion Matrix')
plt.tight_layout()
plt.savefig('outputs/eval_confusion_matrix.png', dpi=150)
plt.show()

In [None]:
# Accuracy per individual judge
judge_stats = {}
for result in ev._results:
    gt = ground_truth_map.get(result['receipt_id'])
    if gt is None:
        continue
    for j in result.get('judges', []):
        name = j['judge_name']
        if name not in judge_stats:
            judge_stats[name] = {'correct': 0, 'total': 0}
        judge_stats[name]['total'] += 1
        if j['label'] == gt:
            judge_stats[name]['correct'] += 1

print("=== ACCURACY PER JUDGE ===")
for name, stats in judge_stats.items():
    acc = stats['correct'] / max(stats['total'], 1)
    print(f"  {name:25s}: {acc:.1%}  ({stats['correct']}/{stats['total']})")

print("\n=== FINAL VOTE ===")
print(f"  Accuracy : {summary.get('accuracy', 0):.1%}")
print(f"  Precision: {summary.get('precision', 0):.1%}")
print(f"  Recall   : {summary.get('recall', 0):.1%}")
print(f"  F1 Score : {summary.get('f1', 0):.1%}")
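
The figures printed above are computed inside `Evaluator.summary`, which this notebook does not show. For reference, the standard definitions with FAKE as the positive class (the same orientation as the heatmap labels) can be sketched as follows. The function name `binary_metrics` and the example counts are illustrative assumptions, not project code or real results.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts.

    Each ratio falls back to 0.0 when its denominator is zero.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # flagged FAKE that are truly FAKE
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # true FAKE that were flagged
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up counts for a 20-receipt sample (not actual results)
example_metrics = binary_metrics(tp=7, fp=1, fn=3, tn=9)
for metric_name, value in example_metrics.items():
    print(f"{metric_name:9s}: {value:.1%}")
```

On a small sample such as 20 receipts, a single misclassification moves accuracy by 5 points, so these numbers are best read alongside the confusion matrix rather than on their own.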

In [None]:
# Cases where the judges disagree
cases = ev.disagreement_cases(n=5)
if cases:
    print("=== CASES OF DISAGREEMENT BETWEEN JUDGES ===")
    for case in cases:
        gt = ground_truth_map.get(case['receipt_id'], '?')
        print(f"\n--- {case['receipt_id']} ---")
        print(f"  Ground truth : {gt}")
        print(f"  Verdict      : {case.get('label', '?')} | {case.get('tally', '')}")
        for j in case.get('judges', []):
            print(f"  [{j['judge_name']:20s}] {j['label']:4s} ({j['confidence']:.0f}%)")
            for r in j.get('reasons', [])[:2]:
                print(f"    • {r}")
else:
    print("No disagreement cases, or the judges have not been run yet.")
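
How `Evaluator.disagreement_cases` selects its cases is not visible from this notebook. One simple way to get the same kind of list, assuming each result carries a `judges` list of dicts with a `label` key (the schema used in the per-judge accuracy cell), is to keep the results whose judges produced more than one distinct label. `find_disagreements` and the toy records are hypothetical.

```python
def find_disagreements(results, n=5):
    """Return up to n results whose judges did not all give the same label."""
    cases = []
    for result in results:
        distinct_labels = {j["label"] for j in result.get("judges", [])}
        if len(distinct_labels) > 1:
            cases.append(result)
    return cases[:n]

# Toy records using the assumed schema (IDs are made up)
toy_results = [
    {"receipt_id": "a", "judges": [{"label": "FAKE"}, {"label": "FAKE"}, {"label": "FAKE"}]},
    {"receipt_id": "b", "judges": [{"label": "FAKE"}, {"label": "REAL"}, {"label": "FAKE"}]},
    {"receipt_id": "c", "judges": []},
]
toy_cases = find_disagreements(toy_results)
print([case["receipt_id"] for case in toy_cases])
```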

---
## 7. CLI Equivalents

All of the steps above can also be run from the command line:

In [None]:
# CLI equivalents: display only, nothing is executed
cli_commands = [
    ("Download dataset",           "python main.py download"),
    ("Sample 20 receipts",         "python main.py sample"),
    ("Run pipeline",               "python main.py run"),
    ("Pipeline + forensics",       "python main.py run --forensic"),
    ("Evaluate results",           "python main.py evaluate"),
    ("Demo one receipt",           "python main.py demo X00016469622"),
    ("Demo receipt + forensics",   "python main.py demo X00016469622 --forensic"),
    ("Forensic analysis only",     "python main.py forensic X00016469622"),
]

print("=== EQUIVALENT CLI COMMANDS ===")
for desc, cmd in cli_commands:
    print(f"  {desc:30s}  →  {cmd}")
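
The contents of `main.py` are not part of this notebook. Purely as an illustration of how a command set shaped like the one listed above could be declared, here is a small `argparse` sketch with one subparser per command. The subcommand names and the `--forensic` flag mirror the list; everything else (the `build_parser` function, the `receipt_id` argument name, the `command` destination) is an assumption and may not match the real script.

```python
import argparse

def build_parser():
    """Build a parser that accepts the subcommands listed in the table above."""
    parser = argparse.ArgumentParser(prog="main.py")
    sub = parser.add_subparsers(dest="command", required=True)

    # Commands without extra arguments
    for name in ("download", "sample", "evaluate"):
        sub.add_parser(name)

    run = sub.add_parser("run")
    run.add_argument("--forensic", action="store_true")

    demo = sub.add_parser("demo")
    demo.add_argument("receipt_id")
    demo.add_argument("--forensic", action="store_true")

    forensic = sub.add_parser("forensic")
    forensic.add_argument("receipt_id")
    return parser

# Parse an explicit argument list instead of sys.argv so this runs in a notebook
demo_args = build_parser().parse_args(["demo", "X00016469622", "--forensic"])
run_args = build_parser().parse_args(["run"])
print(demo_args.command, demo_args.receipt_id, demo_args.forensic)
```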

In [None]:
# To run a CLI command from the notebook (the Setup cell already changed the
# working directory to the project root, so main.py is addressed directly):
# !python main.py sample
# !python main.py run --forensic
# !python main.py evaluate