# Embedding Models and Chunking

Before building our first RAG pipeline, we need to assess the quality of the candidate embedding models.
The documents are in German and include English text passages and English annotations (mutable). The embedding model therefore needs sufficient multilingual capability.

Chunking has a direct impact on search speed and retrieval accuracy. Smaller chunks are represented more faithfully by a single embedding vector, while larger chunks require fewer embedding vectors, which speeds up retrieval. In addition, the chunking quality may itself determine retrieval accuracy. This notebook explores the interplay between embedding model and chunking strategy.
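
To make the size trade-off concrete, here is a minimal sketch that estimates how many chunks (and therefore embedding vectors) a fixed-size sliding window produces. The function name `estimate_chunk_count` and the corpus size of 100,000 tokens are illustrative assumptions, not values measured on our dataset.

```python
import math


def estimate_chunk_count(total_tokens: int, chunk_size: int, chunk_overlap: int = 0) -> int:
    """Estimate how many chunks a sliding window produces over a token sequence."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if total_tokens <= chunk_size:
        return 1
    # Each additional chunk advances by the stride (chunk size minus overlap).
    stride = chunk_size - chunk_overlap
    return 1 + math.ceil((total_tokens - chunk_size) / stride)


# Hypothetical corpus of 100,000 tokens: smaller chunks mean more vectors to search.
for size in (128, 256, 512, 1024):
    print(size, estimate_chunk_count(100_000, size, chunk_overlap=50))
```

The estimate ignores sentence boundaries, so real splitters will deviate from it somewhat.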

## Setup: Dataset

First, let's load the documents into memory.

In [72]:
from pathlib import Path
from llama_index.core import SimpleDirectoryReader

# Replace this path with your actual docs folder
source_dir = Path("data", "evaluation", "rapid_ocr_with_eng_model_text_blocks")

# Load markdown documents
documents = SimpleDirectoryReader(source_dir, required_exts=[".md"], recursive=True).load_data()
print(f"Loaded {len(documents)} documents")

Loaded 40 documents


## Chunking Method

### SentenceSplitter Overview

The `SentenceSplitter` from LlamaIndex is a text chunking utility that breaks documents into semantically meaningful sentences or sentence groups. It preserves sentence boundaries, resulting in more coherent and effective chunks for embedding and retrieval. It is the default chunking method used by llama-index.

**Key Features:**
- Splits by sentence rather than fixed character count.
- Supports parameters like `chunk_size` (maximum number of tokens per chunk) and `chunk_overlap` (number of tokens repeated between consecutive chunks).
- Returns `TextNode` objects containing `text` and `metadata`.
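
The idea behind sentence-preserving chunking can be sketched in a few lines of plain Python. This is a simplified illustration, not the LlamaIndex implementation: it counts words instead of tokens, uses a naive regular expression to find sentence ends, and the helper names `split_sentences` and `pack_sentences` are made up for this example.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def pack_sentences(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Greedily pack whole sentences into chunks of at most `chunk_size` words.

    A single sentence longer than `chunk_size` still becomes its own chunk.
    """
    chunks: List[str] = []
    current: List[str] = []  # sentences of the chunk being built
    current_len = 0          # number of words in `current`
    for sentence in split_sentences(text):
        n_words = len(sentence.split())
        if current and current_len + n_words > chunk_size:
            chunks.append(" ".join(current))
            # Carry trailing sentences over as overlap, up to `chunk_overlap` words.
            carried: List[str] = []
            carried_len = 0
            for prev in reversed(current):
                prev_len = len(prev.split())
                if carried_len + prev_len > chunk_overlap:
                    break
                carried.insert(0, prev)
                carried_len += prev_len
            # Drop the overlap if it would push the new chunk over the limit.
            if carried_len + n_words > chunk_size:
                carried, carried_len = [], 0
            current, current_len = carried, carried_len
        current.append(sentence)
        current_len += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
for chunk in pack_sentences(text, chunk_size=6, chunk_overlap=3):
    print(repr(chunk))
```

Because only whole sentences are moved between chunks, every chunk starts and ends on a sentence boundary, which is the property the splitter is chosen for.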

In [73]:
from utils import display_citation
from llama_index.core.node_parser import SentenceSplitter

In [None]:
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)
sentence_nodes = splitter.get_nodes_from_documents(documents)

print("Second Chunk:")
display_citation(sentence_nodes[1].text) 

print("Third Chunk:")
display_citation(sentence_nodes[2].text) 

Second Chunk:


> Zusammenfassung
> 
> - ● PE/PTT Aktivitäten Abschlußbericht: Hannah verlässt das HIP Team, das DSRPFS Team für PE-PTT Team wird in anderer Form weitergeführt.
> - ● Moritz verlässt das HIP Team bis auf das Thema 'unified cut-offs'.
> - ● Die Möglichkeit einer Entwicklung interner Mehrweg-Trays wird untersucht und bewertet.
> - ● Entscheidung : Der Roboterbau im Elab zur Unterstützung des KI-Trainings wird gecancelt, da er nicht mehr benötigt wird.
> - ● CN110 und CN95 Feinschnitte werden Yilmaz angeboten, ansonsten entsorgt.
> 
> ## 2. Abschluss PE/PTT (Hannah)
> 
> 2024-01-11\_Abschluss\_HIP.pdf
> 
> ## DSRPSF Projektabschluss HIP
> 
> 11.01.2024.CAPCCT
> 
> Hannah Berberich fur Manuela Reddmann, Melanie Kohlmeier, Katrin Meininger.Christian Albrecht, Christoph Schuck, Lisa Hintz, Jasamin Kabiri und Benjamin Reinhard
> 
> Roche Diagnostics GmbH Diagnostics Division
> 
> System Integration Dcs
> 
> Wann?
> 
> Wer?
> 
> Was?
> 
> Impact?
> 
> HIP CAP-CCT
> 
> ## Ubersicht Aufgaben DSRPSF in HIP
> 
> ## Q4 2020 - Q3 2023
> 
> Q4 2020
> 
> Melanie Kohlmeier Manuela Reddmann Katrin Meininger
> 
> Melanie Kohlmeier Manuela Reddmann Katrin Meininger
> 
> Melanie Kohlmeier Manuela Reddmann Katrin Meininger
> 
> Einarbeitung durch TnT Team Erste Aquivalenztests
> 
> 2021 Kevin Kraus tw. Hannah Berberich
> 
> 47 TnT Funktionsbewertungen mit 129 Fertigungsvarianten, Davon:
> 
> - &gt; 18 Freigabeversuche &gt; z.B.

Third Chunk:


> Hannah Berberich
> 
> 47 TnT Funktionsbewertungen mit 129 Fertigungsvarianten, Davon:
> 
> - &gt; 18 Freigabeversuche &gt; z.B. Anlagen, Bauteile, Prozessschritte,Materialien und Einsatzstoffe
> 
> &amp;Unterdosierung &gt;6 Homogenitatsbestimmungen -&gt;z.B Leistungslauf,POCL
> 
> &gt;14weitere Versuche&gt;u.a.Aquivalenztests, Ursachenanalysen
> 
> unter kontinuierlicher Erhohung des Standardisierungsgrades
> 
> o D-Diner POCL1 Bewertung o3 D-Dimer POL Bewertungen
> 
> 2022
> 
> Q1-Q3 2023
> 
> Benjamin Reinhard tw Jasamin Kabiri tw. Christian Albrecht tw. Hannah Berberich
> 
> Aufbau 52 Labor im Geb.272 oAufbau Team4auf 7
> 
> o 2 TnT POCL Bewertungen o 6 TnT POL Bewertungen o 2 FRL TnT Assay Freigaben 4 Stabilitats-Studien 18 Chromatographiezeitbestimmungen o 5 Bewertungen von PE Freigabeversuchen o Freigabe SG008B1
> 
> o D-Dimer Freigabe MBBs o Erstellung Messplan PT PRAD und POCL/POL Prograrmm D-Dimer
> 
> o 2 Bewertungen TnT Versuche o Support Fluidikbewertungen SG008B1+SG\_P01 o 4 Bewertungen Cartridge Versuche
> 
> Prozessentwicklung
> 
> ## AbschlieBende HIP Tasks Q4 2023 und Ausblick 2024 ff
> 
> Go Live DSRP 2.0
> 
> 
>   Image Info:
>   - label: Diagramm
>   - description: Das Bild zeigt ein Organisationsdiagramm, das die Struktur und Aufgaben eines Teams darstellt.

### SemanticSplitterNodeParser

The `SemanticSplitterNodeParser` is a chunking method in LlamaIndex that uses **embedding-based sentence similarity** to identify natural semantic boundaries in a document. It creates chunks that are more coherent and context-aware than character-based or naive sentence-based splitters.

**What It Does**:

- Converts each sentence into an embedding.
- Computes similarity between consecutive sentences.
- Places chunk breaks where sentence similarity drops below a threshold, ensuring semantic boundaries.
- Evaluates each sentence together with its neighbors; `buffer_size` sets how many sentences are grouped.

**Key Parameters**:
- `buffer_size`: Number of neighboring sentences grouped with each sentence when computing similarity, which preserves context.
- `breakpoint_percentile_threshold`: Percentile of the dissimilarity between consecutive sentence groups that must be exceeded to place a break. Lower values produce more chunks.
- `embed_model`: The embedding model used to compute sentence-level similarity.

This method is ideal for use cases requiring semantic integrity in chunking, such as RAG pipelines, summarization, or contextual question answering.
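
The breakpoint logic described above can be sketched with toy vectors. This is a simplified illustration under stated assumptions: it compares single sentences (no `buffer_size` grouping), uses hand-written 2-D vectors in place of real embeddings, and the helper names are hypothetical rather than part of LlamaIndex.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def percentile(values: Sequence[float], q: float) -> float:
    """Percentile with linear interpolation, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower, upper = math.floor(pos), math.ceil(pos)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def find_breakpoints(embeddings: Sequence[Sequence[float]], threshold_percentile: float) -> List[int]:
    """Return every index i such that a chunk break is placed after sentence i."""
    distances = [
        cosine_distance(embeddings[i], embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]
    if not distances:
        return []
    cutoff = percentile(distances, threshold_percentile)
    # A break is placed wherever the distance exceeds the percentile cutoff.
    return [i for i, d in enumerate(distances) if d > cutoff]


# Toy "embeddings": sentences 0-1 share one topic, sentences 2-3 another.
toy = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
print(find_breakpoints(toy, threshold_percentile=50))
```

Since the cutoff is a percentile of the document's own distances, the number of breaks scales with document length rather than with an absolute similarity value.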

In [75]:
import requests
import warnings
import urllib3
from huggingface_hub import configure_http_backend

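def backend_factory() -> requests.Session:
    # WARNING: verify=False disables TLS certificate verification for all
    # Hugging Face Hub requests. Only use this inside a trusted network.
    session = requests.Session()
    session.verify = False
    return session

configure_http_backend(backend_factory=backend_factory)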

# Suppress only InsecureRequestWarning from urllib3
warnings.filterwarnings("ignore", category=urllib3.exceptions.InsecureRequestWarning)

In [None]:
from llama_index.embeddings.langchain import LangchainEmbedding
from langchain.embeddings import HuggingFaceEmbeddings

# Load the MiniLM embedding model (384 dimensions, fast).
# Note: this model is trained mainly on English text.
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
embed_model = LangchainEmbedding(embedding_model)

In [None]:
from llama_index.core.node_parser import SemanticSplitterNodeParser

In [None]:
splitter = SemanticSplitterNodeParser(
    buffer_size=2,
    breakpoint_percentile_threshold=95,
    embed_model=embed_model
)
semantic_nodes = splitter.get_nodes_from_documents(documents)

In [None]:
print("Second Chunk:")
display_citation(semantic_nodes[1].text) 

print("Third Chunk:")
display_citation(semantic_nodes[2].text) 

Second Chunk:


> | Melanie Kohimeier Manuela Reddmann Katrin Meininger | Melanie KohImeier Manuela Reddmann Katrin Meininger                                                                                                                                                                                               | Kevin Kraus tw. 

Third Chunk:


> Hannah Berberich     | Melanie Kohimeier ManuelaReddmann Katrin Meininger                                                                                                                                                                             | Benjamin Reinhard tw — Jasamin Kabiri tw. 

### SemanticDoubleMergingSplitterNodeParser

`SemanticDoubleMergingSplitterNodeParser` is an advanced chunking strategy in LlamaIndex that merges sentences and chunks based on semantic similarity. It uses **spaCy embeddings** and a **two-pass merging process** to create context-rich, coherent chunks suitable for high-quality retrieval.

**The algorithm:**  

1. **Embeds sentences** and measures similarity between them.
2. **Identifies semantic breakpoints** where similarity drops below a defined threshold.
3. **Appends short fragments** to previous chunks if they lack standalone context.
4. **Merges neighboring chunks** if their semantic similarity exceeds the merging threshold.
5. Ensures all final chunks respect the `max_chunk_size` limit.

This produces segments that are semantically coherent, complete in thought, and suitable for vector-based retrieval.

**Key Features**  

- Controlled chunk sizing for model efficiency
- Language-aware via `LanguageConfig` support

**Key Parameters**  

- `initial_threshold`: Similarity score to start a new chunk
- `appending_threshold`: Similarity required to append small fragments
- `merging_threshold`: Similarity for merging neighboring chunks
- `max_chunk_size`: Maximum number of characters per chunk
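
The two-pass idea can be sketched in plain Python. This is a loose illustration of the steps listed above, not the LlamaIndex implementation: word overlap (Jaccard) stands in for spaCy vector similarity, and the function names and the exact way the thresholds are applied are assumptions made for this example.

```python
from typing import Callable, List


def jaccard(a: str, b: str) -> float:
    """Toy similarity: word overlap between two texts (stand-in for spaCy vectors)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def double_merge(
    sentences: List[str],
    initial_threshold: float,
    appending_threshold: float,
    merging_threshold: float,
    max_chunk_size: int,
    similarity: Callable[[str, str], float] = jaccard,
) -> List[str]:
    # Pass 1: group consecutive sentences into initial chunks.
    groups: List[List[str]] = []
    for sentence in sentences:
        if groups:
            current = groups[-1]
            # The second sentence of a chunk is checked against the initial
            # threshold, later sentences against the appending threshold.
            threshold = initial_threshold if len(current) == 1 else appending_threshold
            fits = len(" ".join(current)) + 1 + len(sentence) <= max_chunk_size
            if fits and similarity(current[-1], sentence) >= threshold:
                current.append(sentence)
                continue
        groups.append([sentence])
    chunks = [" ".join(group) for group in groups]

    # Pass 2: merge neighboring chunks that are similar enough and still fit.
    merged: List[str] = []
    for chunk in chunks:
        if merged:
            fits = len(merged[-1]) + 1 + len(chunk) <= max_chunk_size
            if fits and similarity(merged[-1], chunk) >= merging_threshold:
                merged[-1] = merged[-1] + " " + chunk
                continue
        merged.append(chunk)
    return merged


sentences = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "stock prices rose sharply today",
    "stock prices fell sharply today",
]
for chunk in double_merge(sentences, 0.5, 0.5, 0.5, max_chunk_size=1000):
    print(repr(chunk))
```

In this sketch `max_chunk_size` is checked in both passes, so lowering it forces more, smaller chunks regardless of similarity.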

In [78]:
%pip install spacy
!python3 -m spacy download en_core_web_md
!python -m nltk.downloader punkt_tab

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.
Defaulting to user installation because normal site-packages is not writeable
Collecting en-core-web-md==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.8.0/en_core_web_md-3.8.0-py3-none-any.whl (33.5 MB)
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
[nltk_data] Downloading package punkt_tab to /home/vscode/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


In [79]:
from llama_index.core.node_parser import SemanticDoubleMergingSplitterNodeParser, LanguageConfig

config = LanguageConfig(language="english", spacy_model="en_core_web_md")

splitter = SemanticDoubleMergingSplitterNodeParser(
    language_config=config,
    initial_threshold=0.4,
    appending_threshold=0.5,
    merging_threshold=0.5,
    max_chunk_size=5000,
)

double_merged_nodes = splitter.get_nodes_from_documents(documents)



In [80]:
print("Second Chunk:")
display_citation(double_merged_nodes[1].text) 

print("Third Chunk:")
display_citation(double_merged_nodes[2].text) 

Second Chunk:


> Meeting
> 
> Datum:
> 
> 11. Januar 2024
> 
> Meeting
> 
> Ort:
> 
> Präsenzmeeting
> 
> Minutes
> 
> Zeit
> 
> : 10:00 - 12:00
> 
> Teilnehmer:
> 
> Andreas Weller
> 
> Arnhild Thiel
> 
> Barbara Haller
> 
> Christopher Dörr
> 
> David Wirthensohn
> 
> Edda Reiß
> 
> Florian Diehr
> 
> Florian Kemmer
> 
> Gerold Diez
> 
> Jana Böhm
> 
> Gäste:
> 
> Anja Schröter
> 
> Max Müller
> 
> Nächstes CAP CCT Meeting: 11. Januar 2024 Agenda HIP CAP CCT Fortlaufend
> 
> 1/9
> 
> Roche Diagnostics GmbH; Sandhofer Strasse 116; D-68305 Mannheim; Telefon +49 621 759 0; Telefax +49 621 759 2890
> 
> Sitz der Gesellschaft: Mannheim - Registergericht: AG Mannheim HRB 3962 - Geschäftsführung: Claus Haberda; Andreas Schmitz - Aufsichtsratsvorsitzender: Dr. Thomas Schinecker
> 
> Confidentiality Note
> 
> This message is intended only for the use of the named recipient(s) and may contain confidential and/or privileged information. If you are not the intended recipient, please contact the sender and delete the message. Any unauthorized use of the information contained in this message is prohibited. : Simon Gessler
> 
> : Jana Böhm
> 
> Julia Körner
> 
> Jürgen Richter
> 
> Katharine Klyta
> 
> Kristin Eikmeier
> 
> Moritz Marcinowski
> 
> Özgür Dagdelen
> 
> Pamela Espindola
> 
> Sabrina Mehlhase
> 
> Sascha Lutz
> 
> Sebastian Pankalla
> 
> Simon Gessler
> 
> Jürgen Spinke
> 
> Claas Andreae
> 
> Helmut Walter
> 
> Markus Schantz
> 
> Ludger Jürgens
> 
> CC:
> 
> HIP CAP-CCT
> 
> ## 1. Zusammenfassung
> 
> - ● PE/PTT Aktivitäten Abschlußbericht: Hannah verlässt das HIP Team, das DSRPFS Team für PE-PTT Team wird in anderer Form weitergeführt. - ● Moritz verlässt das HIP Team bis auf das Thema 'unified cut-offs'. - ● Die Möglichkeit einer Entwicklung interner Mehrweg-Trays wird untersucht und bewertet. - ● Entscheidung : Der Roboterbau im Elab zur Unterstützung des KI-Trainings wird gecancelt, da er nicht mehr benötigt wird. - ● CN110 und CN95 Feinschnitte werden Yilmaz angeboten, ansonsten entsorgt. ## 2. Abschluss PE/PTT (Hannah)
> 
> 2024-01-11\_Abschluss\_HIP.pdf
> 
> ## DSRPSF Projektabschluss HIP
> 
> 11.01.2024.CAPCCT
> 
> Hannah Berberich fur Manuela Reddmann, Melanie Kohlmeier, Katrin Meininger.Christian Albrecht, Christoph Schuck, Lisa Hintz, Jasamin Kabiri und Benjamin Reinhard
> 
> Roche Diagnostics GmbH Diagnostics Division
> 
> System Integration Dcs
> 
> Wann? Wer?

Third Chunk:


> Was?

## Creating Embeddings

Next, to test the semantic quality of the chunking, we need to select and use an embedding model. An open question is whether the same model should also be used for the semantic chunking itself.

Since we need to use embedding models provided by Amazon Bedrock, our options are limited to:
- "cohere.embed-multilingual-v3"
- "cohere.embed-english-v3"
- "amazon.titan-embed-text-v2:0"

Let's proceed with Cohere's multilingual model, since the texts are in German.

### Using the Portkey API

With Portkey, the embeddings call looks like this:

In [13]:
import os
api_key = os.getenv("API_KEY")
assert api_key, "Set the API_KEY environment variable before running this notebook"

In [None]:
from portkey_ai import Portkey

_portkey = Portkey(
    api_key=api_key,
    base_url="https://eu.aigw.galileo.roche.com/v1"
)


response = _portkey.embeddings.create(
    model="cohere.embed-multilingual-v3",
    input=["okys"],
    input_type="search_query"
)

In [15]:
response.data

[Embedding(embedding=[0.02583313, 0.006164551, -0.012016296, 0.04901123, -0.046173096, -0.0004286766, 0.019760132, -0.003490448, -0.024307251, -0.0070724487, 0.016036987, -0.0036067963, 0.01864624, -0.0022850037, -0.0066490173, 0.0071907043, 0.037719727, 0.02609253, 0.08074951, 0.0072517395, -0.015670776, -0.0046844482, 0.046081543, -0.024353027, -0.025360107, 0.043670654, 0.07501221, -0.0569458, 0.037261963, 0.044036865, -0.015777588, -0.040100098, 0.076049805, 0.02507019, 0.03277588, 0.0005645752, 0.019134521, -0.0027484894, -0.026626587, 0.020141602, 0.029876709, -0.009765625, -0.010093689, -0.0022010803, -0.04324341, 0.02897644, 0.008171082, -0.022857666, -0.006351471, -0.028869629, -0.03363037, 0.07849121, -0.007820129, -0.0519104, 0.020355225, 0.048980713, 0.037719727, -0.0014982224, -0.0063552856, -0.01525116, 0.0034160614, -0.036499023, -0.023773193, 0.035308838, 0.0138168335, -0.062805176, -0.028701782, -0.044921875, -0.019821167, 0.0042762756, -0.009483337, -0.014419556, 0.00

The API for Titan models is somewhat different. It requires plain text input and takes no input type. You can also choose the embedding size:

In [None]:
response = _portkey.embeddings.create(
    model="amazon.titan-embed-text-v2:0",
    input="okys",
    dimensions=512
)

In [17]:
response.data

[Embedding(embedding=[-0.060042575001716614, 0.0296210628002882, 0.08858203142881393, 0.03670750930905342, -0.016825852915644646, 0.0015104481717571616, 0.019672004505991936, 0.011552215553820133, -0.03180985897779465, -0.0035907060373574495, -0.0306951105594635, -0.03401561081409454, -0.0015251456061378121, 0.022168314084410667, 0.005408670753240585, 0.022326594218611717, -0.023470737040042877, -0.016574230045080185, 0.013743269257247448, -0.026170549914240837, 0.0254255011677742, 0.02026442624628544, 0.024230483919382095, 0.005127157550305128, -0.039077192544937134, -0.016510916873812675, -0.014498492702841759, 0.001827009255066514, 0.05058644711971283, -0.036047253757715225, 0.03512018173933029, 0.014681645669043064, -0.040112800896167755, 0.008755172602832317, 0.015665248036384583, 0.05554401874542236, 0.02082066796720028, -0.017981231212615967, 0.017763596028089523, 0.04932020232081413, -0.027323735877871513, -0.052232563495635986, -0.0035138269886374474, -0.04673345014452934, -0.
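
The two request shapes shown above can be summarized in a small helper. This is a sketch under the assumption that the differences observed in the two calls (string vs. list input, `input_type` for Cohere only, `dimensions` for Titan only) are the only ones that matter; the function name `build_embedding_payload` is made up for this example.

```python
from typing import Any, Dict, Optional


def build_embedding_payload(
    model: str,
    text: str,
    is_query: bool = False,
    dimensions: Optional[int] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for an embeddings call, depending on the model family."""
    if "titan" in model:
        # Titan: plain string input, optional embedding size, no input type.
        payload: Dict[str, Any] = {"model": model, "input": text}
        if dimensions is not None:
            payload["dimensions"] = dimensions
        return payload
    # Cohere: list input plus an input type that separates queries from documents.
    return {
        "model": model,
        "input": [text],
        "input_type": "search_query" if is_query else "search_document",
    }


print(build_embedding_payload("cohere.embed-multilingual-v3", "okys", is_query=True))
print(build_embedding_payload("amazon.titan-embed-text-v2:0", "okys", dimensions=512))
```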

### Creating a Vector Store Compatible API Class

The API calls can be condensed into a custom embedding class that implements the LangChain `Embeddings` interface, so it can be used with LangChain vector stores and, via `LangchainEmbedding`, with llama-index.  
Note that Bedrock enforces rate limits to protect itself from being flooded with requests. We use an exponential backoff strategy to
let the service recover when a limit is reached.
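
Exponential backoff on its own can be sketched as follows. This is a minimal illustration, independent of Portkey: `TransientError` is a hypothetical stand-in for a rate-limit exception, and the `sleep` parameter exists only so the waiting can be replaced in tests.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical stand-in for a rate-limit error raised by an API client."""


def call_with_backoff(
    func: Callable[[], T],
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `func`, doubling the wait after each transient failure."""
    delay = base_delay
    for attempt in range(max_retries):
        try:
            return func()
        except TransientError:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            # Random jitter keeps parallel clients from retrying in lockstep.
            sleep(delay + random.uniform(0, 0.5))
            delay *= 2
    raise RuntimeError("max_retries must be at least 1")


# Usage: a function that fails twice before succeeding.
attempts = {"n": 0}

def flaky() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TransientError("rate limited")
    return "ok"

print(call_with_backoff(flaky, base_delay=0.01))
```

Only the transient error type is retried; any other exception propagates immediately, which keeps genuine bugs visible.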

In [149]:
import time
import random
from tqdm import tqdm
from typing import List
from langchain.embeddings.base import Embeddings
from portkey_ai import Portkey
from portkey_ai._vendor.openai import RateLimitError

class PortkeyEmbedding(Embeddings):
    """LangChain-compatible embedding class that calls Bedrock models through Portkey."""

    def __init__(
        self,
        model_name: str,
        api_key: str,
        base_url: str = "https://eu.aigw.galileo.roche.com/v1",
        dimensions: int = 1024,
        verbose: bool = False,
        show_progress: bool = True,
    ):
        self._model_name = model_name
        self._dimensions = dimensions
        self._verbose = verbose
        self._portkey = Portkey(api_key=api_key, base_url=base_url)
        self._show_progress = show_progress

    @property
    def model_name(self):
        return self._model_name

    def _retry_with_backoff(self, func, *args, max_retries=5, **kwargs):
        # Retry with exponential backoff to respect Bedrock rate limits.
        # Any other exception propagates to the caller unchanged.
        delay = 1.0
        for attempt in range(max_retries):
            try:
                return func(*args, **kwargs)
            except RateLimitError as e:
                if self._verbose:
                    print(f"RateLimitError: {e}, retrying in {delay:.2f}s...")
                # Random jitter keeps parallel clients from retrying in lockstep.
                time.sleep(delay + random.uniform(0, 0.5))
                delay *= 2
        raise RuntimeError("Exceeded maximum retries due to rate limits.")

    def embed_documents(self, texts: List[str], is_query: bool = False) -> List[List[float]]:
        embeddings = []
        is_titan = "titan" in self._model_name
        iterator = tqdm(texts, total=len(texts), desc="Embedding documents", disable=not self._show_progress)

        for text in iterator:
            # Titan expects a plain string; Cohere expects a list plus an input type.
            payload = {
                "model": self._model_name,
                "input": text if is_titan else [text],
            }
            if not is_titan:
                payload["input_type"] = "search_query" if is_query else "search_document"
            if self._dimensions:
                payload["dimensions"] = self._dimensions

            response = self._retry_with_backoff(self._portkey.embeddings.create, **payload)
            embeddings.append(response.data[0].embedding)

        return embeddings

    def embed_query(self, text: str) -> List[float]:
        return self.embed_documents([text], is_query=True)[0]


### Creating all Embeddings at Once
We can now create all embeddings at once by passing the chunked nodes to the vector store.

In [84]:
from langchain.schema import Document as LCDocument
from langchain.vectorstores import FAISS


# Initialize custom embedding model
embedding_model = PortkeyEmbedding(
    model_name="amazon.titan-embed-text-v2:0",
    api_key=api_key,
)

converted_documents = [
    LCDocument(
        page_content=node.get_content(), 
        metadata=node.metadata
        )
    for node in sentence_nodes
]

vectorstore = FAISS.from_documents(
    converted_documents, 
    embedding=embedding_model
)


The embeddings are now available as a NumPy array.

In [85]:
import numpy as np

index = vectorstore.index
embeddings = np.array(index.reconstruct_n(0, index.ntotal))
embeddings.shape

(1271, 1024)
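
With the embeddings in a matrix, nearest neighbors can be inspected directly. The sketch below ranks rows by cosine similarity to a query vector; it uses a small random matrix as a stand-in for `embeddings`, and the function name `top_k_cosine` is made up for this example.

```python
import numpy as np


def top_k_cosine(query: np.ndarray, matrix: np.ndarray, k: int = 3) -> np.ndarray:
    """Return the row indices of the k most cosine-similar rows, best first."""
    matrix_unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    scores = matrix_unit @ query_unit
    return np.argsort(-scores)[:k]


# Toy stand-in for the real (n_chunks, dim) embedding matrix.
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(10, 8))

# Querying with row 4 itself should rank row 4 first.
print(top_k_cosine(toy_embeddings[4], toy_embeddings, k=3))
```

Applied to the real matrix, the returned indices should line up with the order in which the chunks were added to the vector store, although that mapping is an assumption worth verifying before relying on it.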

In [None]:
# Save to a directory
from pathlib import Path

store_persist = Path("data", "embedding_eval", "sentence_splitter_512_50" + "+" + "amazon.titan-embed-text-v2:0")
store_persist.mkdir(parents=True, exist_ok=True)

vectorstore.save_local(str(store_persist))