# 🧠 Migliorare le Query RAG con HyDE (Hypothetical Document Embeddings)

## ✨ Cos'è HyDE?

**HyDE (Hypothetical Document Embeddings)** è una tecnica per il retrieval semantico che consiste nel:

> ✅ Generare **documenti ipotetici** che potrebbero rispondere alla query dell'utente
> ✅ Calcolare gli **embedding di questi documenti**
> ✅ Usare tali embedding per recuperare documenti realmente presenti nel **vector store**

### 🎯 Obiettivo

Superare i limiti della query originale generando **risposte ipotetiche** che:

* Riflettono vari **punti di vista**
* Sono più **dense semanticamente**
* Si avvicinano meglio ai documenti nel vector store

---

## 🧪 Differenze con il Multi-Query Approach

| Caratteristica      | Multi-Query Approach           | HyDE (Hypothetical Embedding)           |
| ------------------- | ------------------------------ | --------------------------------------- |
| Tipo di output      | Domande riformulate            | Risposte ipotetiche generate dall'LLM   |
| Input nel retriever | Query linguistiche             | Risposte simulate                       |
| Obiettivo           | Copertura di intenti semantici | Simulare risposte semanticamente ricche |
| Vantaggio           | Adatto a domande specifiche    | Potente per query vaghe o complesse     |

In [1]:
from dotenv import load_dotenv
import os
load_dotenv()
from langchain_openai import OpenAIEmbeddings
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI
from langchain_community.vectorstores.chroma import Chroma
from langchain_community.document_loaders.directory import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = DirectoryLoader("./data", glob="**/*.txt")

docs = loader.load()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=20,
    length_function=len, 
    is_separator_regex=False
)

chunks = text_splitter.split_documents(docs)

embedding_function = OpenAIEmbeddings()
model = ChatOpenAI()

db = Chroma.from_documents(docs, embedding_function)

retriever = db.as_retriever()

libmagic is unavailable but assists in filetype detection. Please consider installing libmagic for better results.
libmagic is unavailable but assists in filetype detection. Please consider installing libmagic for better results.
libmagic is unavailable but assists in filetype detection. Please consider installing libmagic for better results.


---

## ⚙️ Pipeline Step-by-Step

### 1. 📝 Prompt per la generazione delle risposte

In [7]:
from langchain.prompts import PromptTemplate
from langchain_core.runnables import RunnableLambda
import re

HYDE_PROMPT = PromptTemplate(
    input_variables=["question"],
    template="""You are an AI language model assistant. Your task is to generate five hypothetical answers to the user's query. These answers should offer diverse perspectives or interpretations, aiding in a comprehensive understanding of the query. Present the hypothetical answers as follows:

    <<Answer considering a specific perspective>>
    <<Answer from a different angle>>
    <<Answer exploring an alternative possibility>>
    <<Answer providing a contrasting viewpoint>>
    <<Answer that includes a unique insight>>

    Note: Present only the hypothetical answers, without numbering (or "-", "1.", "*") and so on, to provide a range of potential interpretations or solutions related to the query.
    Original question: {question}""",
)

def split_and_clean_text(input_text):
    return [item for item in re.split(r"<<|>>", input_text) if item.strip()]

---

### 2. 🧱 Catena LangChain per HyDE

In [8]:
hyde_chain = (
    HYDE_PROMPT | model | StrOutputParser() | RunnableLambda(split_and_clean_text)
)

---

### 3. 📤 Esecuzione

In [10]:
list_of_answers = hyde_chain.invoke("Who is the owner of the restaurant?")

In [11]:
list_of_answers

['The owner of the restaurant may be a local entrepreneur who invested in creating a unique dining experience in the community.',
 "Considering the size and popularity of the restaurant, it's possible that a successful restaurant chain or corporation owns it.",
 'In a different scenario, the restaurant could be owned by a group of investors who pooled their resources to start the business.',
 "Contrary to common assumptions, the restaurant's ownership might be a family-run business passed down through generations, preserving tradition and authenticity.",
 "It's an interesting possibility that the restaurant is actually owned by a celebrity chef or a well-known public figure, adding a touch of glamour and prestige to the establishment."]

In [12]:
#per togliere i duplicati
def flatten_and_unique_documents(documents):
    flattened_docs = [doc for sublist in documents for doc in sublist]

    unique_docs = []
    unique_contents = set()
    for doc in flattened_docs:
        if doc.page_content not in unique_contents:
            unique_docs.append(doc)
            unique_contents.add(doc.page_content)

    return unique_docs

In [13]:
docs = [retriever.invoke(a) for a in list_of_answers]

flatten_and_unique_documents(documents=docs)

[Document(metadata={'source': 'data\\restaurant.txt'}, page_content="In the charming streets of Palermo, tucked away in a quaint alley, stood Chef Amico, a restaurant that was more than a mere eatery—it was a slice of Sicilian heaven. Founded by Amico, a chef whose name was synonymous with passion and creativity, the restaurant was a mosaic of his life’s journey through the flavors of Italy.\n\nChef Amico’s doors opened to a world where the aromas of garlic and olive oil were as welcoming as a warm embrace. The walls, adorned with photos of Amico’s travels and family recipes, spoke of a rich culinary heritage. The chatter and laughter of patrons filled the air, creating a symphony as delightful as the dishes served.\n\nOne evening, as the sun cast a golden glow over the city, a renowned food critic, Elena Rossi, stepped into Chef Amico. Her mission was to uncover the secret behind the restaurant's growing fame. She was greeted by Amico himself, whose eyes sparkled with the joy of a man

## 🧪 Quando usare HyDE?

### ✅ Vantaggi

* Genera embedding **più ricchi e semanticamente densi**
* Migliora il recupero per domande **ambigue** o **sottospecificate**
* Spesso scopre documenti **più pertinenti** di quanto la query originale riesca a ottenere

### ⚠️ Limitazioni

* ❌ Meno efficace se la domanda è molto specifica (es. domande tecniche o a dominio ristretto)
* ❌ LLM può "hallucinare" risposte troppo astratte o imprecise
* ❌ Richiede più token = maggiore costo computazionale

---

## 🛠 Strumenti di valutazione

Puoi valutare l'efficacia del metodo HyDE rispetto al multi-query con **RAGAS**, che fornisce metriche come:

* 🔹 Answer Relevance
* 🔹 Contextual Recall
* 🔹 Context Precision
* 🔹 Faithfulness

---

## 🚀 Conclusione

HyDE è un approccio potente per migliorare il retrieval nei sistemi RAG, soprattutto quando la domanda originale non è sufficiente a recuperare documenti rilevanti. Integra bene sistemi LLM con vector store intelligenti.

---

## 🔮 Prossimo step

La prossima sezione introdurrà una tecnica chiamata **Parent-Child Retrieval**, utile per migliorare ulteriormente il contesto dei chunk in documenti strutturati.