# 📚 RAGAS – Advanced Chunking Techniques

## ❓ Why Does Chunking Still Matter?

| Reason                  | Explanation                                                                                   |
| ----------------------- | --------------------------------------------------------------------------------------------- |
| 🧠 Context window       | LLMs have token limits (e.g. 16k → 128k → 1M), but chunking remains useful even as they grow  |
| 🔎 Embedding            | Embeddings need **semantically cohesive units** to be effective                               |
| 🎯 Retrieval & synthesis | Better chunking = better **retrieval**, **answering**, and **faithfulness**                  |

📌 *Chunking directly affects the quality of the embeddings and therefore of the generated answers.*

We will therefore build an **LLM-based chunker**, an approach that has already proven very successful in several RAG applications.

---

## 🛠️ Chunking Techniques Explored

### 1. ✂️ `CharacterTextSplitter` (baseline)

```python
from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=200,
    chunk_overlap=20
)
chunks = splitter.split_text(text)
```

📈 *Pros:* simple
📉 *Cons:* not semantic, cuts the text arbitrarily

In [1]:
from dotenv import load_dotenv
import os
load_dotenv()

openai_api_key = os.getenv("OPENAI_API_KEY")

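In [2]:
# Read the file as UTF-8 explicitly so typographic quotes and dashes are decoded correctly
with open("./data/restaurant.txt", encoding="utf-8") as f:
    raw_data = f.read()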

In [3]:
print(raw_data)

In the charming streets of Palermo, tucked away in a quaint alley, stood Chef Amico, a restaurant that was more than a mere eatery—it was a slice of Sicilian heaven. Founded by Amico, a chef whose name was synonymous with passion and creativity, the restaurant was a mosaic of his life’s journey through the flavors of Italy.

Chef Amico’s doors opened to a world where the aromas of garlic and olive oil were as welcoming as a warm embrace. The walls, adorned with photos of Amico’s travels and family recipes, spoke of a rich culinary heritage. The chatter and laughter of patrons filled the air, creating a symphony as delightful as the dishes served.

One evening, as the sun cast a golden glow over the city, a renowned food critic, Elena Rossi, stepped into Chef Amico. Her mission was to uncover the secret behind the restaurant's growing fame. She was greeted by Amico himself, whose eyes sparkled with the joy of a man who loved his work.

Elena was led to a table adorned with a sim

In [5]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=200,
    chunk_overlap=20,
    length_function=len, 
    is_separator_regex=False
)

docs = text_splitter.split_text(raw_data)
print(docs)
print(len(docs))

Created a chunk of size 329, which is longer than the specified 200
Created a chunk of size 331, which is longer than the specified 200
Created a chunk of size 291, which is longer than the specified 200
Created a chunk of size 376, which is longer than the specified 200
Created a chunk of size 291, which is longer than the specified 200


['In the charming streets of Palermo, tucked away in a quaint alley, stood Chef Amico, a restaurant that was more than a mere eatery—it was a slice of Sicilian heaven. Founded by Amico, a chef whose name was synonymous with passion and creativity, the restaurant was a mosaic of his life’s journey through the flavors of Italy.', 'Chef Amico’s doors opened to a world where the aromas of garlic and olive oil were as welcoming as a warm embrace. The walls, adorned with photos of Amico’s travels and family recipes, spoke of a rich culinary heritage. The chatter and laughter of patrons filled the air, creating a symphony as delightful as the dishes served.', "One evening, as the sun cast a golden glow over the city, a renowned food critic, Elena Rossi, stepped into Chef Amico. Her mission was to uncover the secret behind the restaurant's growing fame. She was greeted by Amico himself, whose eyes sparkled with the joy of a man who loved his work.", 'Elena was led to a table adorned wi
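
The "Created a chunk of size 329, which is longer than the specified 200" warnings above follow from splitting on a single separator: a paragraph with no `"\n"` inside it cannot be cut any further. The sketch below is a simplified illustration of that behavior, not LangChain's actual implementation; the helper name `split_on_separator` is hypothetical and overlap is ignored.

```python
def split_on_separator(text: str, separator: str, chunk_size: int) -> list[str]:
    """Greedily merge separator-delimited pieces without ever cutting inside a piece."""
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single piece may already exceed chunk_size: it is kept whole.
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "short line\n" + "x" * 300 + "\nanother short line"
chunks = split_on_separator(sample, "\n", chunk_size=200)
print([len(c) for c in chunks])  # → [10, 300, 18]
```

The 300-character piece survives intact because no separator occurs inside it, which is the same reason the restaurant paragraphs exceed 200 characters.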

---

### 2. 🔁 `RecursiveCharacterTextSplitter` (recommended)

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=20,
    separators=["\n\n", "\n", ".", "!", "?", ",", " ", ""]
)
chunks = splitter.split_text(text)
```

📈 *Pros:* better coherence, uses hierarchical separators
📉 *Cons:* still not semantic: it does not capture the meaning of the text within the chunks


In [6]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size = 200, 
    chunk_overlap=20,
    length_function=len, 
    is_separator_regex=False,
    separators=["\n\n", "\n", ","]
)

docs = splitter.split_text(raw_data)
print(docs)
print(len(docs))

['In the charming streets of Palermo, tucked away in a quaint alley, stood Chef Amico, a restaurant that was more than a mere eatery—it was a slice of Sicilian heaven. Founded by Amico', ', a chef whose name was synonymous with passion and creativity, the restaurant was a mosaic of his life’s journey through the flavors of Italy.', 'Chef Amico’s doors opened to a world where the aromas of garlic and olive oil were as welcoming as a warm embrace. The walls, adorned with photos of Amico’s travels and family recipes', ', spoke of a rich culinary heritage. The chatter and laughter of patrons filled the air, creating a symphony as delightful as the dishes served.', 'One evening, as the sun cast a golden glow over the city, a renowned food critic, Elena Rossi', ", Elena Rossi, stepped into Chef Amico. Her mission was to uncover the secret behind the restaurant's growing fame. She was greeted by Amico himself", ', whose eyes sparkled with the joy of a man who loved his work.', 'Elena 
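
The idea of hierarchical separators can be sketched in a few lines: try the coarsest separator first and fall back to finer ones only for pieces that are still too long. This is a simplified illustration under stated assumptions, not the library's code: the function name `recursive_split` is hypothetical, separators are dropped rather than kept, and small pieces are not merged back together or overlapped as the real splitter does.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split on the first separator, recursing with finer ones on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: fall back to a hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    head, *rest = separators
    chunks: list[str] = []
    for piece in text.split(head):
        chunks.extend(recursive_split(piece, rest, chunk_size))
    return chunks


sample = (
    "First paragraph. It has two sentences.\n\n"
    "Second paragraph, which is a bit longer than the first one."
)
for chunk in recursive_split(sample, ["\n\n", "\n", ", ", " "], chunk_size=45):
    print(repr(chunk))
```

The first paragraph fits within 45 characters and is kept whole; the second does not, so it is split again at the comma, which mirrors the comma-boundary cuts visible in the output above.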

---

### 3. 🧠 `SemanticChunker` (embedding-aware, experimental)

```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

chunker = SemanticChunker(OpenAIEmbeddings())
chunks = chunker.split_text(text)
```

🧪 *How it works:*

* Splits the text into sentences (small chunks) ➡️ embeds each one ➡️ merges adjacent sentences whose embeddings are similar into a larger chunk
* Merging stops at a predefined "semantic break" threshold: when the distance between two consecutive sentences exceeds it, a new chunk starts; otherwise the sentences are combined into a larger, semantically coherent chunk.

📈 *Pros:* semantically cohesive chunks
📉 *Cons:* sometimes yields too few chunks (e.g. only 2)
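
The breakpoint logic described above can be sketched without any API call by using toy 2-D vectors in place of real embeddings. This is a minimal illustration, assuming a percentile-based threshold on the cosine distance between consecutive sentences; the function names and the default of 95 are illustrative choices, not taken from the library source.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def percentile(values: list[float], q: float) -> float:
    """Percentile with linear interpolation, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = math.floor(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def semantic_chunks(sentences: list[str], vectors: list[list[float]], q: float = 95.0) -> list[str]:
    if len(sentences) < 2:
        return list(sentences)
    # Distance between each sentence and the one that follows it.
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    threshold = percentile(distances, q)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:  # semantic break: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sentences = ["The cat sat.", "The cat purred.", "Stocks fell.", "Markets dropped."]
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]  # made-up embeddings
print(semantic_chunks(sentences, toy_vectors))
```

With a high percentile only the largest gaps become breakpoints, which is one plausible reason a short text ends up in very few chunks.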

In [None]:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

print(embeddings.model)

text_splitter = SemanticChunker(embeddings)  # reuse the instance created above


text-embedding-ada-002


In [10]:
docs = text_splitter.split_text(raw_data)
print(docs)
print(len(docs))

["In the charming streets of Palermo, tucked away in a quaint alley, stood Chef Amico, a restaurant that was more than a mere eatery—it was a slice of Sicilian heaven. Founded by Amico, a chef whose name was synonymous with passion and creativity, the restaurant was a mosaic of his life’s journey through the flavors of Italy. Chef Amico’s doors opened to a world where the aromas of garlic and olive oil were as welcoming as a warm embrace. The walls, adorned with photos of Amico’s travels and family recipes, spoke of a rich culinary heritage. The chatter and laughter of patrons filled the air, creating a symphony as delightful as the dishes served. One evening, as the sun cast a golden glow over the city, a renowned food critic, Elena Rossi, stepped into Chef Amico. Her mission was to uncover the secret behind the restaurant's growing fame. She was greeted by Amico himself, whose eyes sparkled with the joy of a man who loved his work.", 'Elena was led to a table adorned with a s

---

### 4. 🤖 `LLM-based Chunking` (custom GPT splitter)

📌 Gave the best results in the author's business use case
⚠️ *High cost (calls to GPT-4 / GPT-3.5)*

#### 👨‍💻 Implementation Example: `GPTSplitter`

📈 *Pros:* semantically guided chunking
🎯 *Few-shot prompting* with domain-relevant examples improves the output
📉 *Cons:* cost + latency

In [15]:
import re
from langchain_openai import ChatOpenAI
from typing import List, Any
from langchain.text_splitter import TextSplitter
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# The TextSplitter base class declares split_text as an abstract method, so it must be implemented
class GPTSplitter(TextSplitter):
    def __init__(self, model_name: str = "gpt-3.5-turbo", **kwargs: Any) -> None:
        super().__init__(**kwargs)
        self.model = ChatOpenAI(model=model_name)

        # define the few-shot prompt
        self.prompt = ChatPromptTemplate.from_template(
            "You are an expert in identifying semantic meaning of text. "
            "You wrap each chunk in <<<>>>.\n\n"
            "Example:\n"
            "Text: \"The curious cat perched on the windowsill, its eyes wide as it watched the fluttering birds outside. "
            "With a swift leap, it was on the ground, stealthily making its way towards the door. "
            "Suddenly, a noise startled it, causing the cat to freeze in place.\"\n"
            "Wrapped:\n"
            "<<<The curious cat perched on the windowsill, its eyes wide as it watched the fluttering birds outside.>>>\n"
            "<<<With a swift leap, it was on the ground, stealthily making its way towards the door.>>>\n"
            "<<<Suddenly, a noise startled it, causing the cat to freeze in place.>>>\n\n"
            "Now, process the following text:\n\n"
            "{text}"
        )

        self.output_parser =  StrOutputParser()

        self.chain = (
            {"text": RunnablePassthrough()}
            | self.prompt
            | self.model
            | self.output_parser
        )

    # required: implements the abstract method of TextSplitter
    def split_text(self, text: str) -> List[str]:
        # The first step of the chain already maps the raw input to {"text": ...},
        # so pass the string itself rather than a dict.
        response = self.chain.invoke(text)
        # Extract every segment enclosed in the <<< and >>> markers.
        # (.*?) is a non-greedy match; re.DOTALL lets "." match newlines too.
        chunks = re.findall(r"<<<(.*?)>>>", response, re.DOTALL)  # returns a list
        return [chunk.strip() for chunk in chunks]  # strip surrounding spaces and newlines



In [16]:
# instantiate our class
gpt_splitter = GPTSplitter()
gpt_docs = gpt_splitter.split_text(raw_data)

In [17]:
print(gpt_docs)

['In the charming streets of Palermo, tucked away in a quaint alley, stood Chef Amico, a restaurant that was more than a mere eatery—it was a slice of Sicilian heaven.', 'Founded by Amico, a chef whose name was synonymous with passion and creativity, the restaurant was a mosaic of his life’s journey through the flavors of Italy.', 'Chef Amico’s doors opened to a world where the aromas of garlic and olive oil were as welcoming as a warm embrace.', 'The walls, adorned with photos of Amico’s travels and family recipes, spoke of a rich culinary heritage.', 'The chatter and laughter of patrons filled the air, creating a symphony as delightful as the dishes served.', 'One evening, as the sun cast a golden glow over the city, a renowned food critic, Elena Rossi, stepped into Chef Amico.', "Her mission was to uncover the secret behind the restaurant's growing fame. She was greeted by Amico himself, whose eyes sparkled with the joy of a man who loved his work.", 'Elena was led to a tabl

In [18]:
print(len(gpt_docs))

15


In [20]:
gpt_docs[0]

'In the charming streets of Palermo, tucked away in a quaint alley, stood Chef Amico, a restaurant that was more than a mere eatery—it was a slice of Sicilian heaven.'
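
The marker-parsing step can be checked offline, without calling the model, by feeding it a hand-written response. The sketch below isolates the same regular expression used in `split_text`; the helper `parse_wrapped_chunks` and its fallback (returning the whole text as one chunk when the model emits no markers) are assumptions added for illustration and are not part of the original class.

```python
import re


def parse_wrapped_chunks(response: str, original_text: str) -> list[str]:
    """Extract the segments wrapped in <<< >>> from an LLM response."""
    found = re.findall(r"<<<(.*?)>>>", response, re.DOTALL)
    chunks = [c.strip() for c in found if c.strip()]
    # Hypothetical fallback: if the model ignored the format, keep the text as one chunk.
    return chunks or [original_text.strip()]


fake_response = "<<<First idea.>>>\n<<< Second idea,\nspanning two lines. >>>"
print(parse_wrapped_chunks(fake_response, "unused"))
# → ['First idea.', 'Second idea,\nspanning two lines.']
print(parse_wrapped_chunks("no markers here", "  the raw text  "))
# → ['the raw text']
```

A check like this makes it easier to iterate on the prompt, since a formatting regression shows up as an empty or single-item list rather than as silently lost text.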

---

## 📊 Technique Comparison

| Technique                        | Semantic | Complexity | Accuracy       | Chunk Count | Cost      |
| -------------------------------- | -------- | ---------- | -------------- | ----------- | --------- |
| `CharacterTextSplitter`          | ❌        | 🟢 Low     | 🔴 Low         | 🟢 \~6      | 🟢 €0     |
| `RecursiveCharacterTextSplitter` | ❌        | 🟢 Low     | 🟠 Medium      | 🟡 \~12     | 🟢 €0     |
| `SemanticChunker`                | ✅        | 🟠 Medium  | 🟢 High        | 🔴 \~2      | 🟠 API    |
| `GPTSplitter (LLM)`              | ✅✅       | 🔴 High    | 🟢🔵 Excellent | 🟢 \~15     | 🔴 \$\$\$ |

---

## 🧠 Final Tips

* 🔁 **Iterate on the LLM prompt** with domain examples (car manuals, medicine, recipes…)
* ⚖️ **Balance chunk size** and **chunk count** so the retriever is not penalized
* 📐 Use the *most semantic chunking available* for embedding-based RAG
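
To balance chunk size against chunk count, it helps to look at a few numbers for each splitter's output. The helper below is a small sketch for that purpose; the name `chunk_stats` and the choice of statistics are illustrative assumptions, and it can be applied to any list of strings such as `docs` or `gpt_docs` from the cells above.

```python
def chunk_stats(chunks: list[str]) -> dict:
    """Summarize a list of chunks: how many there are and how long they are (in characters)."""
    if not chunks:
        return {"count": 0, "min": 0, "mean": 0.0, "max": 0}
    lengths = [len(c) for c in chunks]
    return {
        "count": len(lengths),
        "min": min(lengths),
        "mean": round(sum(lengths) / len(lengths), 1),
        "max": max(lengths),
    }


example = ["abcd", "ab", "abcdef"]
print(chunk_stats(example))  # → {'count': 3, 'min': 2, 'mean': 4.0, 'max': 6}
```

A very low count with a large maximum suggests chunks too coarse for precise retrieval, while a high count of tiny chunks suggests context is being fragmented.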

---

## 🔜 Next Step

➡️ A study of **embedding** models, to complete the semantic chain after chunking!