# RAG Generation en TFM_RAG_NOR

## Índice


1. [Introducción y objetivos](#1-introducción-y-objetivos)
2. [Carga de datos y configuración](#2-carga-de-datos-y-configuración)
3. [Pipeline RAG: Recuperación + Generación](#3-pipeline-rag-recuperación--generación)
    - 3.1. Recuperación de chunks relevantes (Hybrid MPNet α=0.3)
    - 3.2. Construcción del prompt
    - 3.3. Generación de respuesta con LLM
    - 3.4. Visualización de resultados (respuesta + chunks)
4. [Evaluación de la generación](#4-evaluación-de-la-generación)
    - 4.1. Evaluación manual (exactitud, completitud, estilo)
    - 4.2. Evaluación automática (BERTScore y opcionalmente métricas RAGAS simples)
5. [Demo tester](#5-demo-tester)  

---

## 1. Introducción y objetivos

En este notebook se implementa la fase de generación del sistema RAG sobre el corpus de documentos normativos.  
Hasta ahora se han preparado los datos, creado los índices y evaluado diferentes métodos de recuperación. El mejor rendimiento lo dio el método híbrido BM25 + MPNet (α=0.3), que será el que se use aquí como base.

El objetivo es montar un pipeline completo de RAG (retrieval → generación), capaz de:
- Recuperar los chunks más relevantes para una query.
- Construir un prompt con esos chunks y la pregunta.
- Enviar el prompt a un LLM open-source vía API gratuita.
- Obtener una respuesta fundamentada en los documentos recuperados.
- Guardar y evaluar las respuestas generadas.

Este notebook servirá como prototipo mínimo viable de RAG, sobre el que se podrá analizar el alcance, limitaciones y posibles mejoras.

---

## 2. Carga de datos y configuración

En esta parte se importan las librerías necesarias, se cargan los datos de soporte (benchmark de preguntas y respuestas) y se configuran las claves de acceso para el modelo de lenguaje.  

También se definen las rutas de trabajo (`data/`, `results/`) y se cargan las funciones de recuperación que ya se usaron en la fase anterior para garantizar coherencia.


In [2]:
!pip install openai

Collecting openai
  Downloading openai-1.101.0-py3-none-any.whl.metadata (29 kB)
Collecting distro<2,>=1.7.0 (from openai)
  Downloading distro-1.9.0-py3-none-any.whl.metadata (6.8 kB)
Collecting jiter<1,>=0.4.0 (from openai)
  Downloading jiter-0.10.0-cp313-cp313-win_amd64.whl.metadata (5.3 kB)
Downloading openai-1.101.0-py3-none-any.whl (810 kB)
   ---------------------------------------- 0.0/810.8 kB ? eta -:--:--
   ---------------------------------------- 810.8/810.8 kB 7.8 MB/s eta 0:00:00
Downloading distro-1.9.0-py3-none-any.whl (20 kB)
Downloading jiter-0.10.0-cp313-cp313-win_amd64.whl (205 kB)
Installing collected packages: jiter, distro, openai

   ---------------------------------------- 0/3 [jiter]
   ---------------------------------------- 0/3 [jiter]
   ------------- -------------------------- 1/3 [distro]
   -------------------------- ------------- 2/3 [openai]
   -------------------------- ------------- 2/3 [openai]
   -------------------------- ------------- 2/3 [ope


[notice] A new release of pip is available: 25.1.1 -> 25.2
[notice] To update, run: python.exe -m pip install --upgrade pip


In [None]:
import os
import json
import pandas as pd
from datetime import datetime
from openai import OpenAI

DATA_PATH = "../data/"
RESULTS_PATH = "../results/"
os.makedirs(RESULTS_PATH, exist_ok=True)

with open(os.path.join(DATA_PATH, "eval", "qa_eval_set.json"), "r", encoding="utf-8") as f:
    qa_eval_set = json.load(f)

print("Preguntas cargadas:", len(qa_eval_set))

client = OpenAI(
    api_key="GROQ_API_KEY",
    base_url="https://api.groq.com/openai/v1"
)


Preguntas cargadas: 150


---

## 3. Pipeline RAG: Recuperación + Generación

### 3.1. Recuperación de chunks relevantes (Hybrid MPNet α=0.3)

En esta parte se reutilizan los índices creados en la fase anterior para recuperar los chunks más relevantes a partir de una query.  
Se aplicará el mismo pipeline de preprocesado y combinación híbrida (BM25 + MPNet con α=0.3) que dio los mejores resultados en la evaluación.  

El objetivo es garantizar que cualquier nueva pregunta, ya sea escrita a mano o tomada del benchmark, pase por el mismo proceso de recuperación coherente antes de la generación de respuesta.


In [6]:
pip install ipywidgets

Collecting ipywidgets
  Using cached ipywidgets-8.1.7-py3-none-any.whl.metadata (2.4 kB)
Collecting widgetsnbextension~=4.0.14 (from ipywidgets)
  Using cached widgetsnbextension-4.0.14-py3-none-any.whl.metadata (1.6 kB)
Collecting jupyterlab_widgets~=3.0.15 (from ipywidgets)
  Using cached jupyterlab_widgets-3.0.15-py3-none-any.whl.metadata (20 kB)
Using cached ipywidgets-8.1.7-py3-none-any.whl (139 kB)
Using cached jupyterlab_widgets-3.0.15-py3-none-any.whl (216 kB)
Using cached widgetsnbextension-4.0.14-py3-none-any.whl (2.2 MB)
Installing collected packages: widgetsnbextension, jupyterlab_widgets, ipywidgets

   ---------------------------------------- 0/3 [widgetsnbextension]
   ------------- -------------------------- 1/3 [jupyterlab_widgets]
   -------------------------- ------------- 2/3 [ipywidgets]
   -------------------------- ------------- 2/3 [ipywidgets]
   -------------------------- ------------- 2/3 [ipywidgets]
   -------------------------- ------------- 2/3 [ipywidget


[notice] A new release of pip is available: 25.1.1 -> 25.2
[notice] To update, run: python.exe -m pip install --upgrade pip


In [8]:
import faiss
import pickle
from sentence_transformers import SentenceTransformer
import json

# BM25
with open("../data/bm25/bm25_index.pkl", "rb") as f:
    bm25 = pickle.load(f)

# FAISS MPNet
faiss_index = faiss.read_index("../data/faiss_index/faiss_index_mpnet.faiss")

# Cargar chunks y metadatos (listas)
with open("../data/chunks/texts.json", "r", encoding="utf-8") as f:
    texts = json.load(f)

with open("../data/chunks/metadata.json", "r", encoding="utf-8") as f:
    metadata = json.load(f)

# Modelo embeddings MPNet
mpnet_model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

def preprocess(text):
    return text.lower().strip()

def hybrid_retrieval(query, alpha=0.3, top_k=3):
    q = preprocess(query)

    # BM25
    tokenized_q = q.split()
    bm25_scores = bm25.get_scores(tokenized_q)

    # MPNet
    q_emb = mpnet_model.encode([q])
    D, I = faiss_index.search(q_emb, len(texts))
    mpnet_scores = [0] * len(texts)
    for idx, score in zip(I[0], D[0]):
        mpnet_scores[idx] = float(score)

    # Híbrido
    hybrid_scores = [
        alpha * mpnet_scores[i] + (1 - alpha) * bm25_scores[i]
        for i in range(len(texts))
    ]

    # Top-k
    ranked = sorted(enumerate(hybrid_scores), key=lambda x: x[1], reverse=True)[:top_k]

    # Devolver chunks
    results = []
    for idx, score in ranked:
        results.append({
            "chunk": texts[idx],
            "meta": metadata[idx],
            "score": score
        })
    return results

# Test
test_query = "What principles does UNESCO establish on AI ethics?"
chunks = hybrid_retrieval(test_query, alpha=0.3, top_k=3)
for c in chunks:
    print(c["meta"], c["chunk"][:200], "...\n")



{'pdf': 'ai_hleg_ethics_guidelines.pdf', 'pages': [], 'titles': [], 'chunk_index': 50, 'n_words': 300} Fundamental rights: Did you carry out a fundamental rights impact assessment where there could be a negative impact on fundamental rights? Did you identify and document potential trade-offs made betwe ...

{'pdf': 'ai_hleg_ethics_guidelines.pdf', 'pages': [32], 'titles': ['6. Societal and environmental well-being'], 'chunk_index': 56, 'n_words': 300} a set of procedures to avoid creating or reinforcing unfair bias in the AI system, both regarding the use of input data as well as for the algorithm design? Did you assess and acknowledge the possible ...

{'pdf': 'ai_hleg_ethics_guidelines.pdf', 'pages': [], 'titles': [], 'chunk_index': 55, 'n_words': 300} in mind from the start? Did you research and try to use the simplest and most interpretable model possible for the application in question? Did you assess whether you can analyse your training and tes ...



### 3.2. Construcción del prompt

Se genera un prompt que combina la pregunta del usuario con los chunks recuperados.  
El prompt incluye referencias al documento y página de cada chunk, de forma que el modelo pueda usar esa información como contexto y devolver una respuesta fundamentada.

In [9]:
def build_prompt(query, chunks):
    context_parts = []
    for c in chunks:
        ref = f"[Doc: {c['meta']['pdf']}, page: {c['meta']['pages']}]"
        text = c["chunk"]
        context_parts.append(f"{ref}\n{text}")
    
    context = "\n\n".join(context_parts)

    prompt = f"""
Question: {query}

Context:
{context}

Instruction:
Answer the question using only the context above.
If the answer is not in the documents, say clearly that it is not found.
Always include the reference (Doc and page).
"""
    return prompt.strip()

# Test
prompt_example = build_prompt(test_query, chunks)
print(prompt_example[:800], "...\n")


Question: What principles does UNESCO establish on AI ethics?

Context:
[Doc: ai_hleg_ethics_guidelines.pdf, page: []]
Fundamental rights: Did you carry out a fundamental rights impact assessment where there could be a negative impact on fundamental rights? Did you identify and document potential trade-offs made between the different principles and rights? Does the AI system interact with decisions by human (end) users (e.g. recommended actions or decisions to take, presenting of options)? Could the AI system affect human autonomy by interfering with the (end) user s decision-making process in an unintended way? Did you consider whether the AI system should communicate to (end) users that a decision, content, advice or outcome is the result of an algorithmic decision? In case of a chat bot ...



### 3.3. Generación de respuesta con LLM

Se conecta con el modelo `llama-3-8b-instruct` servido por Groq a través de la API compatible con OpenAI.  
Se envía el prompt generado y se obtiene como salida una respuesta en lenguaje natural, fundamentada en los documentos recuperados.


In [11]:
def generate_response(prompt, model="llama3-8b-8192"):
    completion = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are an assistant specialized in AI ethics regulations."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.2,
        max_tokens=512
    )
    return completion.choices[0].message.content.strip()

# Test con el prompt de ejemplo
response = generate_response(prompt_example)
print("=== Answer ===\n")
print(response)


=== Answer ===

According to the UNESCO AI Ethics Guidelines, the following principles are established:

1. Human rights: The guidelines emphasize the importance of conducting a fundamental rights impact assessment to identify potential trade-offs between different principles and rights. It also highlights the need to consider whether the AI system interacts with human decisions and whether it affects human autonomy.

2. Human agency: The guidelines stress the importance of considering the task allocation between the AI system and humans for meaningful interactions and appropriate human oversight and control. It also emphasizes the need to prevent overconfidence in or overreliance on the AI system.

3. Human oversight: The guidelines recommend considering the appropriate level of human control for the particular AI system and use case. It also suggests establishing mechanisms and measures to ensure human control or oversight.

4. Transparency: The guidelines emphasize the importance of

### 3.4. Visualización de resultados (respuesta + chunks)

Se muestra de forma ordenada la pregunta, la respuesta generada y los chunks usados, con sus metadatos (documento, página, título).  

In [12]:
def display_result(query, response, chunks):
    print("=== Query ===")
    print(query, "\n")

    print("=== Answer ===")
    print(response, "\n")

    print("=== Chunks usados ===")
    for c in chunks:
        meta = c["meta"]
        doc = meta.get("pdf", "N/A")
        pages = meta.get("pages", [])
        title = meta.get("titles", [])
        print(f"- Doc: {doc} | Page: {pages} | Title: {title}")
        print(f"  Text: {c['chunk'][:200]}...\n")  # solo los primeros 200 chars


# Ejemplo de flujo completo
test_query = "What principles does UNESCO establish on AI ethics?"
chunks = hybrid_retrieval(test_query, alpha=0.3, top_k=3)
prompt_example = build_prompt(test_query, chunks)
response = generate_response(prompt_example)

display_result(test_query, response, chunks)


=== Query ===
What principles does UNESCO establish on AI ethics? 

=== Answer ===
According to the UNESCO AI Ethics Guidelines, the following principles are established:

1. Human rights: The guidelines emphasize the importance of conducting a fundamental rights impact assessment to identify potential negative impacts on fundamental rights and to document potential trade-offs between different principles and rights.

2. Human agency: The guidelines highlight the need to consider the task allocation between AI systems and humans for meaningful interactions and appropriate human oversight and control.

3. Human oversight: The guidelines emphasize the importance of establishing mechanisms and measures to ensure human control or oversight, including audit and remedy mechanisms.

4. Transparency: The guidelines recommend communicating to end-users that they are interacting with an AI system and not with another human, and establishing mechanisms to inform end-users on the reasons and crite

## 4. Evaluación de la generación

### 4.1. Evaluación manual (exactitud, completitud, estilo)

La evaluación manual se centra en comprobar si las respuestas generadas cumplen con los siguientes criterios:

- **Exactitud:** la respuesta es correcta según el benchmark.
- **Completitud:** cubre todos los aspectos de la pregunta y no se queda a medias.
- **Estilo:** la respuesta es clara, concisa y comprensible.

Se utilizará una tabla con columnas para la query, la respuesta esperada, la respuesta generada y los tres criterios evaluados.  
Esto permite tener una revisión cualitativa directa de la calidad del sistema.


In [None]:
import csv

def evaluate_manual(qa_eval_set, output_file="../results/rag_generation_manual.csv", max_queries=10):
    results = []

    for i, item in enumerate(qa_eval_set):
        if max_queries and i >= max_queries:
            break

        query = item["pregunta"]
        expected = item["respuesta_esperada"]

        chunks = hybrid_retrieval(query, alpha=0.3, top_k=3)
        prompt = build_prompt(query, chunks)
        response = generate_response(prompt)

        results.append({
            "query": query,
            "expected": expected,
            "generated": response,
            "chunks": [c["meta"] for c in chunks],
            "dificultad": item["dificultad"]
        })

        print(f"[{i+1}] {query}")
        print("Generated:", response[:150], "...\n")

    # Guardar en CSV
    with open(output_file, "w", newline="", encoding="utf-8") as csvfile:
        fieldnames = ["query", "expected", "generated", "chunks", "dificultad", "exactitud", "completitud", "estilo"]
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for r in results:
            writer.writerow({
                "query": r["query"],
                "expected": r["expected"],
                "generated": r["generated"],
                "chunks": r["chunks"],
                "dificultad": r["dificultad"],
                "exactitud": "",
                "completitud": "",
                "estilo": ""
            })

    print(f"\nResultados guardados en {output_file}")
    return results



# Test rápido
evaluate_manual(qa_eval_set, max_queries=10)


[1] What is the main advisory responsibility of the European Artificial Intelligence Board according to the regulation?
Generated: According to the European Artificial Intelligence Board, the main advisory responsibility is to provide guidance on the ethical and human-centric deve ...

[2] Which requirement is imposed on providers of high-risk AI systems regarding post-market activities?
Generated: According to the EU AI Act Regulation, the requirement imposed on providers of high-risk AI systems regarding post-market activities is to establish a ...

[3] How do the roles and interactions of national competent authorities, national supervisory authorities, and the European Artificial Intelligence Board contribute to the harmonised implementation and enforcement of the regulation across Member States?
Generated: According to the European Artificial Intelligence Act Regulation, the roles and interactions of national competent authorities, national supervisory a ...

[4] What are the thre

[{'query': 'What is the main advisory responsibility of the European Artificial Intelligence Board according to the regulation?',
  'expected': 'The Board is responsible for issuing opinions, recommendations, advice, or guidance on matters related to the implementation of the regulation, including technical specifications or existing standards.',
  'generated': 'According to the European Artificial Intelligence Board, the main advisory responsibility is to provide guidance on the ethical and human-centric development and deployment of artificial intelligence (AI) systems. This is stated in the document "AI HLEG Ethics Guidelines" (page 27), which recommends that stakeholders consider implementing a process that involves all levels of an organization, including top management, to ensure the trustworthy development and deployment of AI systems.',
  'chunks': [{'pdf': 'eu_ai_act_regulation.pdf',
    'pages': [2],
    'titles': ['EXPLANATORY MEMORANDUM'],
    'chunk_index': 1,
    'n_words

### 4.2. Evaluación automática (BERTScore y opcionalmente métricas RAGAS simples)

Para complementar la evaluación manual, se utilizan métricas automáticas que comparan las respuestas generadas con las respuestas esperadas del benchmark.

- **BERTScore:** mide la similitud semántica entre la respuesta generada y la respuesta esperada, usando embeddings de un modelo pre-entrenado.  
  A diferencia de BLEU o ROUGE, es más robusto frente a paráfrasis.

- **RAGAS (opcional):** framework de métricas específicas para RAG, que evalúa aspectos como *faithfulness* y *answer relevance*.  
  Solo se explorará si el coste computacional y de dependencias lo permite.

In [16]:
!pip install bert-score

Collecting bert-score
  Downloading bert_score-0.3.13-py3-none-any.whl.metadata (15 kB)
Downloading bert_score-0.3.13-py3-none-any.whl (61 kB)
Installing collected packages: bert-score
Successfully installed bert-score-0.3.13



[notice] A new release of pip is available: 25.1.1 -> 25.2
[notice] To update, run: python.exe -m pip install --upgrade pip


In [21]:
from bert_score import score
import csv

def evaluate_bertscore(results, lang="en", model_type="distilbert-base-uncased", output_file="../results/rag_generation_bertscore.csv"):
    """
    Calcula BERTScore con un modelo más ligero y guarda los resultados en CSV.
    """
    expected_answers = [r["expected"] for r in results]
    generated_answers = [r["generated"] for r in results]

    # Calcular BERTScore con modelo pequeño
    P, R, F1 = score(generated_answers, expected_answers, lang=lang, model_type=model_type, verbose=True)

    # Añadir a resultados
    for i, r in enumerate(results):
        r["bertscore_precision"] = float(P[i])
        r["bertscore_recall"] = float(R[i])
        r["bertscore_f1"] = float(F1[i])

    # Guardar en CSV
    with open(output_file, "w", newline="", encoding="utf-8") as csvfile:
        fieldnames = ["query", "expected", "generated", "dificultad", "bertscore_precision", "bertscore_recall", "bertscore_f1"]
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for r in results:
            writer.writerow({
                "query": r["query"],
                "expected": r["expected"],
                "generated": r["generated"],
                "dificultad": r["dificultad"],
                "bertscore_precision": r["bertscore_precision"],
                "bertscore_recall": r["bertscore_recall"],
                "bertscore_f1": r["bertscore_f1"]
            })

    print(f"\nResultados guardados en {output_file}")
    return results

# === Test breve con tus 10 resultados ===
results = evaluate_manual(qa_eval_set, max_queries=10)
results_with_scores = evaluate_bertscore(results, lang="en")

# Resumen en consola (solo query y F1)
for r in results_with_scores:
    print(r["query"][:60], "... | F1:", round(r["bertscore_f1"], 3))



[1] What is the main advisory responsibility of the European Artificial Intelligence Board according to the regulation?
Generated: According to the European Artificial Intelligence Board's advisory responsibility, the main advisory responsibility is to provide guidance on the deve ...

[2] Which requirement is imposed on providers of high-risk AI systems regarding post-market activities?
Generated: According to the EU AI Act Regulation, the requirement imposed on providers of high-risk AI systems regarding post-market activities is to establish a ...

[3] How do the roles and interactions of national competent authorities, national supervisory authorities, and the European Artificial Intelligence Board contribute to the harmonised implementation and enforcement of the regulation across Member States?
Generated: According to the EU AI Act Regulation, the roles and interactions of national competent authorities, national supervisory authorities, and the Europea ...

[4] What are the thre

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
Error while downloading from https://huggingface.co/distilbert-base-uncased/resolve/main/model.safetensors: HTTPSConnectionPool(host='cas-bridge.xethub.hf.co', port=443): Read timed out.
Trying to resume download...
Error while downloading from https://huggingface.co/distilbert-base-uncased/resolve/main/model.safetensors: HTTPSConnectionPool(host='cas-bridge.xethub.hf.co', port=443): Read timed out.
Trying to resume download...


calculating scores...
computing bert embedding.


100%|██████████| 1/1 [00:10<00:00, 10.98s/it]


computing greedy matching.


100%|██████████| 1/1 [00:00<00:00,  4.55it/s]

done in 11.22 seconds, 0.89 sentences/sec

Resultados guardados en ../results/rag_generation_bertscore.csv
What is the main advisory responsibility of the European Art ... | F1: 0.762
Which requirement is imposed on providers of high-risk AI sy ... | F1: 0.815
How do the roles and interactions of national competent auth ... | F1: 0.8
What are the three components that Trustworthy AI should mee ... | F1: 0.82
Why does the document emphasize a holistic and systemic appr ... | F1: 0.875
Explain how the concept of Trustworthy AI in these guideline ... | F1: 0.788
What are some example questions that help determine the tran ... | F1: 0.792
According to the OECD framework, what aspects should policy  ... | F1: 0.727
Describe how the OECD framework recommends assessing both th ... | F1: 0.808
What are the three pillars of the European Commission's visi ... | F1: 0.899





In [22]:
f1_scores = [r["bertscore_f1"] for r in results_with_scores]
print("BERTScore F1 promedio:", round(sum(f1_scores)/len(f1_scores), 3))

BERTScore F1 promedio: 0.809


In [23]:
user_query = "What are the three pillars of the European Commission's vision on AI?"
chunks = hybrid_retrieval(user_query, alpha=0.3, top_k=3)
prompt = build_prompt(user_query, chunks)
response = generate_response(prompt)

display_result(user_query, response, chunks)

=== Query ===
What are the three pillars of the European Commission's vision on AI? 

=== Answer ===
The three pillars of the European Commission's vision on AI are:

1. Increasing public and private investments in AI to boost its uptake.
2. Preparing for socio-economic changes.
3. Ensuring an appropriate ethical and legal framework to strengthen European values.

These pillars are mentioned in the document "AI Hleg Ethics Guidelines.pdf", page [6]. 

=== Chunks usados ===
- Doc: ai_hleg_ethics_guidelines.pdf | Page: [] | Title: []
  Text: are trustworthy. When drafting these Guidelines, Trustworthy AI has, therefore, been our foundational ambition. Trustworthy AI has three components: (1) it should be lawful, ensuring compliance with a...

- Doc: eu_ai_act_regulation.pdf | Page: [] | Title: []
  Text: examined in the White Paper on AI. Consistency and complementarity is therefore ensured with other ongoing or planned initiatives of the Commission that also aim to address those problem

### 4.3. Evaluación automática con métricas RAGAS

Además de BERTScore, se exploran métricas específicas para sistemas RAG utilizando la librería **RAGAS**.  

- **Faithfulness**: mide si la respuesta está respaldada por el contexto proporcionado.  
- **Answer relevance**: mide si la respuesta realmente responde a la pregunta.  
- **Context recall**: mide si los chunks recuperados contienen la información necesaria.  



In [25]:
!pip install ragas

Collecting ragas
  Downloading ragas-0.3.2-py3-none-any.whl.metadata (21 kB)
Collecting datasets (from ragas)
  Downloading datasets-4.0.0-py3-none-any.whl.metadata (19 kB)
Collecting tiktoken (from ragas)
  Downloading tiktoken-0.11.0-cp313-cp313-win_amd64.whl.metadata (6.9 kB)
Collecting appdirs (from ragas)
  Downloading appdirs-1.4.4-py2.py3-none-any.whl.metadata (9.0 kB)
Collecting diskcache>=5.6.3 (from ragas)
  Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Collecting langchain (from ragas)
  Downloading langchain-0.3.27-py3-none-any.whl.metadata (7.8 kB)
Collecting langchain-core (from ragas)
  Downloading langchain_core-0.3.75-py3-none-any.whl.metadata (5.7 kB)
Collecting langchain-community (from ragas)
  Downloading langchain_community-0.3.28-py3-none-any.whl.metadata (2.9 kB)
Collecting langchain_openai (from ragas)
  Downloading langchain_openai-0.3.32-py3-none-any.whl.metadata (2.4 kB)
Collecting instructor (from ragas)
  Downloading instructor-1.10.0-py3-n


[notice] A new release of pip is available: 25.1.1 -> 25.2
[notice] To update, run: python.exe -m pip install --upgrade pip


In [29]:
!pip install ragas langchain_openai




[notice] A new release of pip is available: 25.1.1 -> 25.2
[notice] To update, run: python.exe -m pip install --upgrade pip


In [None]:
from ragas.metrics import faithfulness, answer_relevancy, context_recall
from ragas import evaluate
from datasets import Dataset
from langchain_openai import ChatOpenAI
from langchain_community.embeddings import HuggingFaceEmbeddings
import pandas as pd

def prepare_ragas_dataset(results):
    """
    Convierte los resultados del pipeline a un Dataset de HuggingFace
    en el formato esperado por RAGAS.
    """
    ragas_data = {
        "question": [],
        "answer": [],
        "contexts": [],
        "ground_truth": []
    }

    for r in results:
        ragas_data["question"].append(r["query"])
        ragas_data["answer"].append(r["generated"])
        ragas_data["contexts"].append([c if isinstance(c, str) else str(c) for c in r["chunks"]])
        ragas_data["ground_truth"].append(r["expected"])

    return Dataset.from_dict(ragas_data)


# Crear dataset con resultados
ragas_dataset = prepare_ragas_dataset(results_with_scores)

# Configurar Groq LLaMA-3 como LLM backend
llm = ChatOpenAI(
    model="llama-3-8b-instruct",
    api_key="GROQ_API_KEY",
    base_url="https://api.groq.com/openai/v1"
)

# Configurar embeddings HuggingFace (MPNet)
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

# Evaluación con métricas RAGAS
metrics = [faithfulness, answer_relevancy, context_recall]
ragas_results = evaluate(ragas_dataset, metrics=metrics, llm=llm, embeddings=embeddings)

print("Resultados RAGAS:", ragas_results)

df_ragas = pd.DataFrame([ragas_results])
df_ragas.to_csv("../results/rag_generation_ragas.csv", index=False)
print("Resultados guardados en ../results/rag_generation_ragas.csv")


  embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
Evaluating:   0%|          | 0/30 [00:00<?, ?it/s]Exception raised in Job[15]: NotFoundError(Error code: 404 - {'error': {'message': 'The model `llama-3-8b-instruct` does not exist or you do not have access to it.', 'type': 'invalid_request_error', 'code': 'model_not_found'}})
Evaluating:   3%|▎         | 1/30 [00:01<00:39,  1.37s/it]Exception raised in Job[9]: NotFoundError(Error code: 404 - {'error': {'message': 'The model `llama-3-8b-instruct` does not exist or you do not have access to it.', 'type': 'invalid_request_error', 'code': 'model_not_found'}})
Exception raised in Job[10]: NotFoundError(Error code: 404 - {'error': {'message': 'The model `llama-3-8b-instruct` does not exist or you do not have access to it.', 'type': 'invalid_request_error', 'code': 'model_not_found'}})
Exception raised in Job[11]: NotFoundError(Error code: 404 - {'error': {'message': 'The model `llama-3-8b-instruct` do

Resultados RAGAS: {'faithfulness': nan, 'answer_relevancy': nan, 'context_recall': nan}
Resultados guardados en ../results/rag_generation_ragas.csv


In [35]:
import numpy as np

vals = [
    round(np.mean(ragas_results["faithfulness"]), 3),
    round(np.mean(ragas_results["answer_relevancy"]), 3),
    round(np.mean(ragas_results["context_recall"]), 3)
]

print(vals)


[np.float64(nan), np.float64(nan), np.float64(nan)]


## 5. Demo tester

En esta sección se incluye una demostración práctica del sistema RAG desarrollado.  

In [24]:
# Demo interactiva en el notebook

while True:
    user_query = input("Escribe tu pregunta (o 'exit' para salir): ")
    if user_query.lower() == "exit":
        break
    
    chunks = hybrid_retrieval(user_query, alpha=0.3, top_k=3)
    prompt = build_prompt(user_query, chunks)
    response = generate_response(prompt)
    
    display_result(user_query, response, chunks)


=== Query ===
What are the three pillars of the European Commission's vision on AI 

=== Answer ===
According to the provided context, the three pillars of the European Commission's vision on AI are:

1. Increasing public and private investments in AI to boost its uptake.
2. Preparing for socio-economic changes.
3. Ensuring an appropriate ethical and legal framework to strengthen European values.

These pillars are mentioned in the document "AI Hleg Ethics Guidelines.pdf" on page [6]. 

=== Chunks usados ===
- Doc: ai_hleg_ethics_guidelines.pdf | Page: [] | Title: []
  Text: are trustworthy. When drafting these Guidelines, Trustworthy AI has, therefore, been our foundational ambition. Trustworthy AI has three components: (1) it should be lawful, ensuring compliance with a...

- Doc: eu_ai_act_regulation.pdf | Page: [] | Title: []
  Text: examined in the White Paper on AI. Consistency and complementarity is therefore ensured with other ongoing or planned initiatives of the Commission th