# Manual Query Design and Benchmark – TFM_RAG_NOR

## Table of Contents

1. [Introduction and objectives](#1-introducción-y-objetivos)
2. [Data loading and preparation](#2-carga-de-datos-y-preparación)
3. [Manual creation of queries and benchmark](#3-creación-manual-de-queries-y-benchmark)



In [None]:
import json
from collections import Counter

# Load chunk metadata and texts
with open("../data/chunks/metadata.json", "r", encoding="utf-8") as f:
    metadata = json.load(f)

with open("../data/chunks/texts.json", "r", encoding="utf-8") as f:
    texts = json.load(f)

assert len(metadata) == len(texts), "The two files do not have the same length."
print(f"Loaded {len(texts)} chunks in total.")

# Count how many chunks each PDF contributes
pdfs = [meta["pdf"] for meta in metadata]
contador_chunks = Counter(pdfs)
print("\nChunks per PDF:")
for pdf, n in contador_chunks.items():
    print(f"{pdf}: {n}")

# Optional: list the fields present in the metadata entries
all_keys = set()
for meta in metadata:
    all_keys.update(meta.keys())
print("\nFields present in the metadata:", all_keys)


Loaded 357 chunks in total.

Chunks per PDF:
oecd_ai_classification_framework.pdf: 81
ai_hleg_ethics_guidelines.pdf: 68
nist_privacy_framework_v1.pdf: 48
eu_ai_act_regulation.pdf: 149
oecd_legal_0449_en.pdf: 11

Fields present in the metadata: {'titles', 'n_words', 'chunk_index', 'pages', 'pdf'}


---

## 3. Manual creation of queries and benchmark

Starting from the loaded chunks, I explore the content of the documents and write realistic questions, each with its expected answer and the chunks relevant to it. The result is a manually built evaluation set, which I will later use to compare retrieval systems.

A random chunk is selected from the corpus and its metadata and text are displayed; this output is then passed to an AI model, which generates relevant queries following a standard template.
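
The template itself is not shown in this notebook, so the following is only a sketch of what assembling such a prompt could look like. The function name `build_prompt` and the instruction wording are assumptions; the field names (`pdf`, `chunk_id`, `páginas`, `pregunta`, `respuesta_esperada`, `relevant_chunks`, `dificultad`) mirror the `raw` blocks used in this notebook.

```python
def build_prompt(pdf: str, chunk_id: int, pages: list, text: str) -> str:
    """Format one chunk plus an answer template into a single prompt string.

    Hypothetical helper: the instruction wording is an assumption, only the
    field names are taken from the notebook's raw blocks.
    """
    template = (
        "---\n"
        f"pdf: {pdf}\n"
        f"chunk_id: {chunk_id}\n"
        f"páginas: {pages}\n"
        'pregunta: "<question>"\n'
        'respuesta_esperada: "<expected answer>"\n'
        f"relevant_chunks: [{chunk_id}]\n"
        'dificultad: "<easy|medium|hard>"\n'
        "---"
    )
    return (
        "Write three questions (easy, medium, hard) answerable from the text below.\n"
        "Return each one in exactly this format:\n"
        f"{template}\n\n"
        f"Text:\n{text}"
    )


prompt = build_prompt("eu_ai_act_regulation.pdf", 262, [37], "Example chunk text.")
print(prompt)
```

Keeping the field names identical to the ones the parsing cell expects means the model's answer can be pasted into `raw` without manual renaming.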



In [13]:
import random

# Pick a random chunk
idx = random.randint(0, len(metadata) - 1)
meta = metadata[idx]
texto = texts[idx]

print(f"pdf: {meta['pdf']}")
print(f"chunk_id: {idx}")
print(f"páginas: {meta.get('pages', '-')}")
print("texto:")
print(texto)


pdf: ai_hleg_ethics_guidelines.pdf
chunk_id: 87
páginas: [7]
texto:
is used throughout the Commission s communication. The scope of these Guidelines however aims to encompass not only those AI systems made in Europe, but also those developed elsewhere and deployed or used in Europe. Throughout this document, we hence aim to promote trustworthy AI for Europe. The European Group on Ethics in Science and New Technologies (EGE) is an advisory group of the Commission. See Section 3.3 of COM(2018)237. The Glossary at the end of this document provides a definition of AI systems for the purpose of this document. This definition is further elaborated on in a dedicated document prepared by the AI HLEG that accompanies these Guidelines, titled "A definition of AI: Main capabilities and scientific disciplines". benefits that they can bring. To help Europe realise those benefits, our vision is to ensure and scale Trustworthy AI. Trust in the development, deployment and use of AI systems concerns no

In [7]:
raw = """
---
pdf: eu_ai_act_regulation.pdf
chunk_id: 262
páginas: [37]
pregunta: "What is the main advisory responsibility of the European Artificial Intelligence Board according to the regulation?"
respuesta_esperada: "The Board is responsible for issuing opinions, recommendations, advice, or guidance on matters related to the implementation of the regulation, including technical specifications or existing standards."
relevant_chunks: [262]
dificultad: "easy"
---
pdf: eu_ai_act_regulation.pdf
chunk_id: 262
páginas: [37]
pregunta: "Which requirement is imposed on providers of high-risk AI systems regarding post-market activities?"
respuesta_esperada: "All providers of high-risk AI systems must have a post-market monitoring system in place to enable corrective actions and improvements based on experience from use."
relevant_chunks: [262]
dificultad: "medium"
---
pdf: eu_ai_act_regulation.pdf
chunk_id: 262
páginas: [37]
pregunta: "How do the roles and interactions of national competent authorities, national supervisory authorities, and the European Artificial Intelligence Board contribute to the harmonised implementation and enforcement of the regulation across Member States?"
respuesta_esperada: "Member States must designate one or more national competent authorities to supervise the regulation's application and implementation, including a national supervisory authority as the official point of contact. The European Artificial Intelligence Board provides advice, guidance, and recommendations to support harmonised and effective implementation of the regulation across the Union."
relevant_chunks: [262]
dificultad: "hard"
---
"""


In [11]:
import os

os.makedirs("../data/eval/", exist_ok=True)

In [12]:
import re
import json

blocks = [b.strip() for b in raw.split('---') if b.strip()]
queries = []

for block in blocks:
    # Extract each field with a regex
    pdf = re.search(r'pdf:\s*(.*)', block)
    chunk_id = re.search(r'chunk_id:\s*(\d+)', block)
    paginas = re.search(r'p[aá]ginas:\s*(\[.*?\]|\d+)', block)
    pregunta = re.search(r'pregunta:\s*"(.*)"', block)
    respuesta = re.search(r'respuesta_esperada:\s*"(.*)"', block)
    relevant_chunks = re.search(r'relevant_chunks:\s*(\[.*?\])', block)
    dificultad = re.search(r'dificultad:\s*"(easy|medium|hard)"', block)

    # Build the dictionary; json.loads parses the list literals without
    # executing arbitrary code, which eval would do
    query = {
        "pdf": pdf.group(1).strip() if pdf else "",
        "chunk_id": int(chunk_id.group(1)) if chunk_id else None,
        "paginas": json.loads(paginas.group(1)) if paginas else [],
        "pregunta": pregunta.group(1).strip() if pregunta else "",
        "respuesta_esperada": respuesta.group(1).strip() if respuesta else "",
        "relevant_chunks": json.loads(relevant_chunks.group(1)) if relevant_chunks else [],
        "dificultad": dificultad.group(1) if dificultad else ""
    }
    queries.append(query)

print(f"{len(queries)} queries processed.")

# Save to JSON
with open("../data/eval/qa_eval_set.json", "w", encoding="utf-8") as f:
    json.dump(queries, f, ensure_ascii=False, indent=2)
print("Saved to ../data/eval/qa_eval_set.json")


3 queries processed.
Saved to ../data/eval/qa_eval_set.json
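
The same parsing logic is repeated in a later cell, so it could be wrapped in a function. The following is a minimal sketch, not the notebook's own code: `parse_queries` and its inner helper `find` are hypothetical names, and the regexes are the ones used above. It assumes that no question or answer contains the `---` separator.

```python
import json
import re


def parse_queries(raw: str) -> list:
    """Parse '---'-separated blocks into query dictionaries (hypothetical helper)."""

    def find(pattern, block):
        # Return the first captured group, stripped, or None when absent
        match = re.search(pattern, block)
        return match.group(1).strip() if match else None

    queries = []
    for block in (b.strip() for b in raw.split("---")):
        if not block:
            continue
        chunk_id = find(r"chunk_id:\s*(\d+)", block)
        paginas = find(r"p[aá]ginas:\s*(\[.*?\]|\d+)", block)
        relevant = find(r"relevant_chunks:\s*(\[.*?\])", block)
        queries.append({
            "pdf": find(r"pdf:\s*(.*)", block) or "",
            "chunk_id": int(chunk_id) if chunk_id else None,
            "paginas": json.loads(paginas) if paginas else [],
            "pregunta": find(r'pregunta:\s*"(.*)"', block) or "",
            "respuesta_esperada": find(r'respuesta_esperada:\s*"(.*)"', block) or "",
            "relevant_chunks": json.loads(relevant) if relevant else [],
            "dificultad": find(r'dificultad:\s*"(easy|medium|hard)"', block) or "",
        })
    return queries


# Toy input in the same format as the raw blocks
sample = '''
---
pdf: demo.pdf
chunk_id: 5
páginas: [2]
pregunta: "What is shown?"
respuesta_esperada: "A demo."
relevant_chunks: [5]
dificultad: "easy"
---
'''
parsed = parse_queries(sample)
print(parsed[0]["pregunta"])
```

With a helper like this, each new batch reduces to `new_queries = parse_queries(raw)`.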


#### The same process, repeated to add new queries at different difficulty levels:


In [151]:
import random

# Pick a random chunk
idx = random.randint(0, len(metadata) - 1)
meta = metadata[idx]
texto = texts[idx]

print(f"pdf: {meta['pdf']}")
print(f"chunk_id: {idx}")
print(f"páginas: {meta.get('pages', '-')}")
print("texto:")
print(texto)

pdf: ai_hleg_ethics_guidelines.pdf
chunk_id: 90
páginas: []
texto:
the feedback gathered through this piloting phase, we will revise the assessment list of these Guidelines by early 2020. The piloting phase will be launched by the summer of 2019 and last until the end of the year. All interested stakeholders will be able to participate by indicating their interest through the European AI Alliance. B. A FRAMEWORK FOR TRUSTWORTHY AI These Guidelines articulate a framework for achieving Trustworthy AI based on fundamental rights as enshrined in the Charter of Fundamental Rights of the European Union (EU Charter), and in relevant international human rights law. Below, we briefly touch upon Trustworthy AI s three components. Lawful AI AI systems do not operate in a lawless world. A number of legally binding rules at European, national and international level already apply or are relevant to the development, deployment and use of AI systems today. Legal sources include, but are not limited t

In [157]:
raw = """
---
pdf: oecd_legal_0449_en.pdf
chunk_id: 346
páginas: [3]
pregunta: "What are the five values-based principles for trustworthy AI set out in the OECD Recommendation on Artificial Intelligence?"
respuesta_esperada: "The five principles are: inclusive growth, sustainable development and well-being; human-centred values and fairness; transparency and explainability; robustness, security and safety; and accountability."
relevant_chunks: [346]
dificultad: "easy"
---
pdf: oecd_legal_0449_en.pdf
chunk_id: 346
páginas: [3]
pregunta: "What are the main areas of policy recommendations for trustworthy AI in the OECD Recommendation?"
respuesta_esperada: "The main areas are: investing in AI research and development; fostering a digital ecosystem for AI; shaping an enabling policy environment; building human capacity and preparing for labour market transformation; and international cooperation for trustworthy AI."
relevant_chunks: [346]
dificultad: "medium"
---
pdf: oecd_legal_0449_en.pdf
chunk_id: 346
páginas: [3]
pregunta: "Discuss the purpose and scope of the OECD Recommendation on Artificial Intelligence adopted in 2019."
respuesta_esperada: "The Recommendation aims to foster innovation and trust in AI by promoting responsible stewardship and ensuring respect for human rights and democratic values. It sets flexible standards focused on AI-specific issues, provides value-based principles, and policy recommendations, and includes provisions for developing metrics and an evidence base to assess progress."
relevant_chunks: [346]
dificultad: "hard"
---
"""
blocks = [b.strip() for b in raw.split('---') if b.strip()]
new_queries = []

for block in blocks:
    # Extract each field with a regex
    pdf = re.search(r'pdf:\s*(.*)', block)
    chunk_id = re.search(r'chunk_id:\s*(\d+)', block)
    paginas = re.search(r'p[aá]ginas:\s*(\[.*?\]|\d+)', block)
    pregunta = re.search(r'pregunta:\s*"(.*)"', block)
    respuesta = re.search(r'respuesta_esperada:\s*"(.*)"', block)
    relevant_chunks = re.search(r'relevant_chunks:\s*(\[.*?\])', block)
    dificultad = re.search(r'dificultad:\s*"(easy|medium|hard)"', block)
    # json.loads parses the list literals without executing code, unlike eval
    query = {
        "pdf": pdf.group(1).strip() if pdf else "",
        "chunk_id": int(chunk_id.group(1)) if chunk_id else None,
        "paginas": json.loads(paginas.group(1)) if paginas else [],
        "pregunta": pregunta.group(1).strip() if pregunta else "",
        "respuesta_esperada": respuesta.group(1).strip() if respuesta else "",
        "relevant_chunks": json.loads(relevant_chunks.group(1)) if relevant_chunks else [],
        "dificultad": dificultad.group(1) if dificultad else ""
    }
    new_queries.append(query)

print(f"{len(new_queries)} new queries processed.")

# Append the new queries to the ones already in memory and rewrite the file
queries.extend(new_queries)

with open("../data/eval/qa_eval_set.json", "w", encoding="utf-8") as f:
    json.dump(queries, f, ensure_ascii=False, indent=2)

print(f"Total queries saved: {len(queries)}")

3 new queries processed.
Total queries saved: 150
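
Because `queries.extend(new_queries)` appends unconditionally, running the cell twice with the same `raw` block would store the same questions twice. One way to guard against that is to merge on a key. This is only a sketch: `merge_queries` is a hypothetical helper, and using `(chunk_id, pregunta)` as the identity of a query is an assumption.

```python
def merge_queries(existing: list, new: list) -> list:
    """Return existing + new, skipping entries whose (chunk_id, pregunta)
    pair is already present. Hypothetical helper; the key is an assumption."""
    seen = {(q["chunk_id"], q["pregunta"]) for q in existing}
    merged = list(existing)
    for q in new:
        key = (q["chunk_id"], q["pregunta"])
        if key not in seen:
            seen.add(key)
            merged.append(q)
    return merged


# Toy data: the first new entry duplicates the existing one
existing = [{"chunk_id": 262, "pregunta": "Q1"}]
new = [{"chunk_id": 262, "pregunta": "Q1"}, {"chunk_id": 346, "pregunta": "Q2"}]
merged = merge_queries(existing, new)
print(len(merged))  # → 2
```

Replacing the `extend` call with `queries = merge_queries(queries, new_queries)` would make the cell safe to re-run.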


In [161]:
import json
from collections import Counter

# Load the JSON file with the queries
with open("../data/eval/qa_eval_set.json", "r", encoding="utf-8") as f:
    queries = json.load(f)

# Count the queries per PDF
pdfs = [q["pdf"] for q in queries]
contador = Counter(pdfs)

print("Queries per PDF:")
for pdf, n in contador.items():
    print(f"{pdf}: {n}")


Queries per PDF:
eu_ai_act_regulation.pdf: 48
ai_hleg_ethics_guidelines.pdf: 36
oecd_ai_classification_framework.pdf: 30
nist_privacy_framework_v1.pdf: 30
oecd_legal_0449_en.pdf: 6
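
Beyond counting, the consistency between each query and the chunk metadata can be checked automatically. The sketch below is hypothetical (`find_problems` is not part of the notebook). It assumes that a query's `chunk_id` should always appear in its own `relevant_chunks`, which holds for every example shown here but is not stated as a rule.

```python
def find_problems(queries: list, metadata: list) -> list:
    """Return human-readable descriptions of inconsistent benchmark entries."""
    problems = []
    for i, q in enumerate(queries):
        idx = q.get("chunk_id")
        # chunk_id must be a valid index into the metadata list
        if not isinstance(idx, int) or not 0 <= idx < len(metadata):
            problems.append(f"query {i}: chunk_id {idx!r} is out of range")
            continue
        if metadata[idx]["pdf"] != q.get("pdf"):
            problems.append(f"query {i}: pdf does not match metadata for chunk {idx}")
        # Assumption: the source chunk is always one of the relevant chunks
        if idx not in q.get("relevant_chunks", []):
            problems.append(f"query {i}: chunk_id {idx} missing from relevant_chunks")
        if q.get("dificultad") not in {"easy", "medium", "hard"}:
            problems.append(f"query {i}: unexpected dificultad {q.get('dificultad')!r}")
    return problems


# Toy data: the second query points at a chunk from a different PDF
toy_metadata = [{"pdf": "a.pdf"}, {"pdf": "b.pdf"}]
toy_queries = [
    {"pdf": "a.pdf", "chunk_id": 0, "relevant_chunks": [0], "dificultad": "easy"},
    {"pdf": "a.pdf", "chunk_id": 1, "relevant_chunks": [1], "dificultad": "hard"},
]
problems = find_problems(toy_queries, toy_metadata)
print(problems)
```

An empty list from `find_problems(queries, metadata)` would indicate that none of these checks failed; it says nothing about whether the expected answers themselves are correct.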


In [160]:
import json

# Load the queries
with open("../data/eval/qa_eval_set.json", "r", encoding="utf-8") as f:
    queries = json.load(f)

# (Make sure 'metadata' and 'texts' are loaded, as above)
# with open("../data/chunks/metadata.json", "r", encoding="utf-8") as f:
#     metadata = json.load(f)
# with open("../data/chunks/texts.json", "r", encoding="utf-8") as f:
#     texts = json.load(f)

# For each query, show: pdf, chunk_id, question, and chunk text
for q in queries:
    idx = q["chunk_id"]
    pdf = q["pdf"]
    pregunta = q["pregunta"]
    chunk_pdf = metadata[idx]["pdf"]
    chunk_text = texts[idx][:120].replace("\n", " ")  # Only the start of the text, to keep the output short
    print(f"Query PDF: {pdf} | chunk_id: {idx} | PDF in metadata: {chunk_pdf}")
    print(f"Question: {pregunta}")
    print(f"Chunk text: {chunk_text} ...")
    print("-" * 80)


Query PDF: eu_ai_act_regulation.pdf | chunk_id: 262 | PDF in metadata: eu_ai_act_regulation.pdf
Question: What is the main advisory responsibility of the European Artificial Intelligence Board according to the regulation?
Chunk text: access to Testing and Experimentation Facilities to bodies, groups or laboratories established or accredited pursuant to ...
--------------------------------------------------------------------------------
Query PDF: eu_ai_act_regulation.pdf | chunk_id: 262 | PDF in metadata: eu_ai_act_regulation.pdf
Question: Which requirement is imposed on providers of high-risk AI systems regarding post-market activities?
Chunk text: access to Testing and Experimentation Facilities to bodies, groups or laboratories established or accredited pursuant to ...
--------------------------------------------------------------------------------
Query PDF: eu_ai_act_regulation.pdf | chunk_id: 262 | PDF in metadata: eu_ai_act_regulation.pdf
Question: How do the roles and interac