# **Semantic Chunking:**

**Semantic Chunking** is a method of breaking a document into smaller, meaningful units by grouping text segments that share a similar meaning or topic, rather than cutting at fixed sizes or at every sentence boundary.



## **What is Semantic Chunking:**

*Semantic chunking* is a method to split (or segment) a long text into smaller “chunks” (segments) not just by raw size (fixed number of tokens/characters) but by **semantic similarity/boundaries**. The goal is to make each chunk relatively coherent in meaning, so that each is more “self-contained” in terms of topic, idea, or sense.

In contrast to simpler split strategies (e.g. splitting every *n* sentences or *n* tokens), semantic chunking tries to find *natural breakpoints* based on how meaning shifts in the text, using embeddings (vector representations) to detect those shifts.



## **How LangChain’s `SemanticChunker` works:**

LangChain has an experimental component `SemanticChunker` (in `langchain_experimental.text_splitter`) which implements semantic chunking. 

Here are its main characteristics, parameters, and algorithm:



### **Key Steps / Algorithm:**

1. **Sentence splitting**
   First, the text is split into sentences using a regex (default: split on whitespace that follows `.`, `?`, or `!`).

2. **Initial grouping**
   Each sentence is combined with its neighbours before embedding. With `buffer_size=1` (the default), every sentence is joined with one sentence before and one after, so the unit that gets embedded is a sliding window of up to 3 sentences.

3. **Create embeddings**
   Each combined window is converted into an embedding (vector) using the provided embeddings model. This lets the algorithm measure the “semantic distance” between neighbouring positions in the text.

4. **Detect “breakpoints” where meaning shifts**
   The chunker computes the cosine distance between the embeddings of consecutive windows. If a distance is *large enough* (above some threshold), that indicates a semantic “break”, that is, a good place to split.

5. **Build chunks**
   The sentences between consecutive breakpoints are joined into one chunk. Constraints such as `min_chunk_size` can suppress a split: a candidate chunk that is too short is not emitted, and its text is carried into the next chunk.

6. **Thresholding strategy**
   There are different strategies to decide what “large enough” means. `SemanticChunker` supports multiple types:

   * `percentile`: take the distances between consecutive sentences, find the X-th percentile (default 95), and use that as the threshold.
   * `standard_deviation`: the threshold is the mean distance plus X standard deviations (default 3).
   * `interquartile`: the threshold is derived from the IQR (interquartile range) of the distances, scaled by a factor (default 1.5).
   * `gradient`: applies a percentile threshold (default 95) to the gradient of the distance array, so that sudden jumps mark the split points.

7. **Other parameters / constraints**
   There are other knobs:

   * `buffer_size` (int): number of neighbouring sentences on each side that are combined with a sentence before it is embedded. It smooths the distance signal; it does not add overlap between the final chunks.
   * `min_chunk_size`: minimum length of a chunk, measured in characters. Prevents overly small chunks.
   * `sentence_split_regex`: custom regex to split into sentences.
   * `add_start_index`: whether to store each chunk's character start index in its metadata.
   * `number_of_chunks`: optional target number of chunks; when set, the threshold is derived from this target.
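
The steps above can be sketched end to end in plain Python. This is a minimal illustration under stated assumptions, not LangChain's implementation: `toy_embed` is a hypothetical bag-of-words stand-in for a real embeddings model, the sentence-window (buffer) step is omitted, and only the percentile strategy is shown.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    # Split on whitespace that follows '.', '?' or '!'
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]


def toy_embed(sentence):
    # Hypothetical stand-in for an embeddings model: word counts
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine_distance(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - (dot / norm if norm else 0.0)


def percentile(values, q):
    # q-th percentile with linear interpolation between sorted values
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = math.floor(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def semantic_chunks(text, threshold_percentile=95.0):
    sentences = split_sentences(text)
    if len(sentences) < 2:
        return sentences
    vectors = [toy_embed(s) for s in sentences]
    # Distance between each sentence and the next one
    distances = [
        cosine_distance(vectors[i], vectors[i + 1])
        for i in range(len(vectors) - 1)
    ]
    threshold = percentile(distances, threshold_percentile)
    chunks, start = [], 0
    for i, d in enumerate(distances):
        if d > threshold:  # meaning shifts after sentence i
            chunks.append(" ".join(sentences[start:i + 1]))
            start = i + 1
    chunks.append(" ".join(sentences[start:]))
    return chunks


text = "Cats purr. Cats purr loudly. Dogs bark. Dogs bark loudly."
for chunk in semantic_chunks(text):
    print(chunk)
```

With the toy text, the only large distance is between the second and third sentences, so the sketch produces one chunk about cats and one about dogs.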



### Example (from LangChain docs)

* They load a long document (“State of the Union”) and invoke `SemanticChunker(OpenAIEmbeddings())` to split. 
* With the default settings, this creates many chunks (e.g. 26) for the document.
* Changing the `breakpoint_threshold_type` from `percentile` to `standard_deviation` or `interquartile` results in different chunking (fewer or more, depending on how strict the threshold is). 



## Benefits and Trade-offs

**Benefits:**

* Chunks align more closely to shifts in topic or meaning → better for downstream tasks like summarization, question answering, retrieval, etc., since each chunk is more semantically coherent.
* Can reduce noise: avoids cutting off in the middle of a topic just because of a token limit or fixed chunk size.
* More adaptive: documents with different structures will be chunked differently based on content, not just length.

**Trade-offs / challenges:**

* **Computational cost**: embedding every sentence, computing distances etc. can be more expensive than simpler fixed splits.
* **Parameter sensitivity**: thresholds (percentile / std dev / IQR etc.), minimum sizes etc. need tuning. If the threshold is too low, there are too many splits and the result is many small, fragmented chunks. If it is too high, meaningful topic shifts are missed.
* **Dependence on embedding quality**: If embeddings aren’t good, semantic similarity/distance estimates may be noisy, leading to weird splits.
* **Not always perfect**: sudden shifts in writing style or rhetorical devices may mislead; some “semantic shifts” may not align with what a human would want as a chunk break.



## When to use Semantic Chunking

You might prefer semantic chunking when:

* The text is long, varied, covers multiple topics or subtopics.
* You will perform downstream tasks like semantic search, retrieval, QA, summarization etc. where coherent sub-units help.
* You want chunk sizes to reflect meaning rather than arbitrary lengths.


---

## **Understand the Different types of Breakpoints:**

### 1. **Percentile Breakpoint**

* **How it works:**

  * Compute the semantic distance between consecutive sentences (using embeddings).
  * Collect all these distances.
  * Choose a threshold = the *X-th percentile* of all distances (default: 95th).
  * A distance bigger than that threshold → mark it as a breakpoint.

* **Example:**
  Distances between 8 sentence pairs:

  ```
  [0.1, 0.15, 0.12, 0.55, 0.18, 0.2, 0.9, 0.16]
  ```

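  * 95th percentile ≈ 0.78 (linear interpolation between the two largest values, 0.55 and 0.9, which is NumPy's default).
  * Distances above 0.78 → `[0.9]`.
  * Breakpoint occurs at the 7th pair, i.e. between sentences 7 and 8.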

* **Effect:** Splits only at the largest “semantic jumps.”
  → Good for long documents with few big topic shifts.
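
The percentile rule can be checked with the standard library. This is a small sketch: `statistics.quantiles` with `method="inclusive"` uses linear interpolation between sorted values, the same rule `numpy.percentile` applies by default.

```python
import statistics

distances = [0.1, 0.15, 0.12, 0.55, 0.18, 0.2, 0.9, 0.16]

# quantiles(n=100) returns the 1st..99th percentiles; index 94 is the 95th
threshold = statistics.quantiles(distances, n=100, method="inclusive")[94]

# Indices of the sentence pairs whose distance exceeds the threshold
breakpoints = [i for i, d in enumerate(distances) if d > threshold]

print(round(threshold, 4), breakpoints)
```

Index 6 is the seventh pair, so the single breakpoint falls where the distance is 0.9.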



### 2. **Standard Deviation Breakpoint**

* **How it works:**

  * Compute mean and standard deviation of all distances.
  * Breakpoints = distances above **mean + N×std** (N is configurable; LangChain's default is 3, and the example here uses N = 2).

* **Example:**
  Distances:

  ```
  [0.1, 0.15, 0.12, 0.55, 0.18, 0.2, 0.9, 0.16]
  ```

  * Mean ≈ 0.295
  * Std ≈ 0.265 (population standard deviation)
  * Threshold = 0.295 + 2×0.265 = 0.825
  * Distances above 0.825 → `[0.9]`
  * Breakpoint at the same spot as Percentile in this case.

* **Effect:** Adapts to overall variance. Works well if distances are normally distributed.
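
A short sketch of the same calculation, assuming the population standard deviation (what `numpy.std` computes by default) and N = 2 as in the example:

```python
import statistics

distances = [0.1, 0.15, 0.12, 0.55, 0.18, 0.2, 0.9, 0.16]

mean = statistics.fmean(distances)
std = statistics.pstdev(distances)  # population standard deviation
n_std = 2  # number of standard deviations above the mean

threshold = mean + n_std * std
breakpoints = [i for i, d in enumerate(distances) if d > threshold]

print(round(mean, 3), round(std, 3), round(threshold, 3), breakpoints)
```

Raising `n_std` to 3 pushes the threshold above every distance in this list, so no breakpoint would be produced at all.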



### 3. **Interquartile Breakpoint (IQR method)**

* **How it works:**

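  * Compute 25th percentile (Q1) and 75th percentile (Q3).
  * IQR = Q3 - Q1.
  * Threshold = mean + k×IQR (default k=1.5). This is the rule used in the `SemanticChunker` source in `langchain_experimental`; the textbook outlier fence Q3 + k×IQR is a close relative.
  * Breakpoints = distances above this threshold.

* **Example:**
  Distances:

  ```
  [0.1, 0.15, 0.12, 0.55, 0.18, 0.2, 0.9, 0.16]
  ```

  * Sorted: `[0.1, 0.12, 0.15, 0.16, 0.18, 0.2, 0.55, 0.9]`
  * Q1 = 0.1425, Q3 = 0.2875 (linear interpolation) → IQR = 0.145
  * Threshold = 0.295 + 1.5×0.145 = 0.5125
  * Distances above 0.5125 → `[0.55, 0.9]`
  * Breakpoints at the 4th and 7th pairs. (The fence Q3 + 1.5×IQR = 0.505 selects the same two distances here.)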

* **Effect:** Robust to outliers. Works well if a few jumps are much larger than the rest.
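
The sketch below computes the quartiles with linear interpolation and compares two IQR-based thresholds: the textbook fence Q3 + k×IQR, and a mean-based variant mean + k×IQR. That the mean-based form is the one LangChain uses is an assumption to verify against the installed `langchain_experimental` version.

```python
import statistics

distances = [0.1, 0.15, 0.12, 0.55, 0.18, 0.2, 0.9, 0.16]

# Quartiles with linear interpolation (same rule as numpy.percentile's default)
q1, _median, q3 = statistics.quantiles(distances, n=4, method="inclusive")
iqr = q3 - q1
k = 1.5

fence_threshold = q3 + k * iqr                          # textbook outlier fence
mean_threshold = statistics.fmean(distances) + k * iqr  # mean-based variant

fence_breaks = [i for i, d in enumerate(distances) if d > fence_threshold]
mean_breaks = [i for i, d in enumerate(distances) if d > mean_threshold]

print(round(q1, 4), round(q3, 4), round(iqr, 4))
print(round(fence_threshold, 4), fence_breaks)
print(round(mean_threshold, 4), mean_breaks)
```

For this list both thresholds sit just above 0.5, so both flag the pairs at indices 3 and 6.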



### 4. **Gradient Breakpoint**

* **How it works:**

  * Look at the *change in distances* between consecutive pairs (the gradient).
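  * If the gradient suddenly spikes, mark as breakpoint. In LangChain, a “spike” is a gradient value above the X-th percentile of all gradient values (default 95).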

* **Example:**
  Distances:

  ```
  [0.1, 0.15, 0.12, 0.55, 0.18, 0.2, 0.9, 0.16]
  ```

  * Gradients:

    * 0.1 → 0.15 = +0.05
    * 0.15 → 0.12 = -0.03
    * 0.12 → 0.55 = +0.43 ← **big jump**
    * 0.55 → 0.18 = -0.37
    * 0.18 → 0.2 = +0.02
    * 0.2 → 0.9 = +0.7 ← **big jump**
    * 0.9 → 0.16 = -0.74
  * With a hypothetical cut-off of 0.4 on these differences, breakpoints fall at the jumps `+0.43` and `+0.7`.

* **Effect:** Finds *sudden shifts* in topic even if absolute distance isn’t extreme. Good for detecting subtle turning points.
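
A sketch of the gradient strategy combined with a percentile cut-off. It assumes the gradient is computed the way `numpy.gradient` does for unit spacing (central differences inside the array, one-sided differences at the two ends), which differs from the simple forward differences listed above; the helper `gradient` is written here for illustration.

```python
import statistics


def gradient(values):
    """Central differences inside, one-sided differences at both ends."""
    n = len(values)
    grads = []
    for i in range(n):
        if i == 0:
            grads.append(values[1] - values[0])
        elif i == n - 1:
            grads.append(values[-1] - values[-2])
        else:
            grads.append((values[i + 1] - values[i - 1]) / 2.0)
    return grads


distances = [0.1, 0.15, 0.12, 0.55, 0.18, 0.2, 0.9, 0.16]
grads = gradient(distances)

# 95th percentile of the gradient values (linear interpolation)
threshold = statistics.quantiles(grads, n=100, method="inclusive")[94]
breakpoints = [i for i, g in enumerate(grads) if g > threshold]

print([round(g, 3) for g in grads])
print(round(threshold, 3), breakpoints)
```

Because a central difference looks one step ahead, the flagged index (5) is the pair just before the 0.9 spike rather than the spike itself.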



### 🔑 Summary Table

| Method             | Threshold Rule                        | Best For                    |
| ------------------ | ------------------------------------- | --------------------------- |
| Percentile         | Distances above the X-th percentile   | Large, obvious topic shifts |
| Standard Deviation | Mean + N×std deviation                | Normal-like distributions   |
| Interquartile      | Mean + k×IQR (k = 1.5 by default)     | Robust against outliers     |
| Gradient           | Sudden spikes in distance differences | Capturing subtle shifts     |

---

## **Example - How Semantic Chunking Works:**

1. **Compute distances → decide breakpoints**
2. **Use breakpoints to slice the text into chunks**

Let’s go step by step.



### 🔹 Step 1: Start with Sentences

Semantic chunking begins with splitting the document into **sentences** (or very small segments).

Example text (6 sentences):

```
S1: The U.S. President delivered a speech on healthcare reform.  
S2: He emphasized the importance of affordable coverage.  
S3: Later, he shifted to foreign policy and national security.  
S4: He discussed recent diplomatic talks with allies.  
S5: The speech ended with remarks about climate change.  
S6: New funding was announced for renewable energy projects.
```



### 🔹 Step 2: Breakpoints from Distances

Suppose we computed embedding distances between consecutive sentences:

```
S1–S2: 0.1
S2–S3: 0.65   ← Breakpoint (topic shift: healthcare → foreign policy)
S3–S4: 0.15
S4–S5: 0.7    ← Breakpoint (foreign policy → climate change)
S5–S6: 0.2
```

Suppose the chosen method (Percentile, Std Dev, IQR, Gradient) yields a threshold somewhere between 0.2 and 0.65. The distances above it give **breakpoints at S2–S3 and S4–S5**.



### 🔹 Step 3: Select the Chunks

Now we **cut** the text at those breakpoints:

* **Chunk 1:** `[S1, S2]` → Healthcare
* **Chunk 2:** `[S3, S4]` → Foreign policy
* **Chunk 3:** `[S5, S6]` → Climate change
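
Cutting at the breakpoints is a single pass over the distances. In this sketch the threshold value `0.5` is a hypothetical stand-in for whatever the chosen strategy would compute.

```python
def slice_at_breakpoints(sentences, distances, threshold):
    """Start a new chunk after every pair whose distance exceeds threshold."""
    chunks, start = [], 0
    for i, d in enumerate(distances):
        if d > threshold:
            chunks.append(sentences[start:i + 1])
            start = i + 1
    chunks.append(sentences[start:])  # whatever remains is the last chunk
    return chunks


sentences = ["S1", "S2", "S3", "S4", "S5", "S6"]
distances = [0.1, 0.65, 0.15, 0.7, 0.2]  # S1–S2, S2–S3, ..., S5–S6

chunks = slice_at_breakpoints(sentences, distances, threshold=0.5)
print(chunks)
```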



### 🔹 Step 4: Adjust with Constraints

LangChain’s `SemanticChunker` allows tuning:

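* **min\_chunk\_size** → ensures you don’t end up with tiny chunks. It is measured in characters (e.g., at least 200 characters).
* **buffer\_size** → number of neighbouring sentences combined with each sentence before embedding, so each distance reflects some surrounding context. It does not create overlap between chunks.
* **number\_of\_chunks** → target number of chunks; the threshold is chosen to land near this count.

Example:
If `min_chunk_size` is larger than the combined length of S1 and S2, the first breakpoint is skipped and the result is two chunks: `[S1, S2, S3, S4]` and `[S5, S6]`.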
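
The effect of `min_chunk_size` can be sketched as follows. This assumes the size is counted in characters and that a too-short candidate is carried forward rather than merged backward; `chunk_with_min_size` is a hypothetical helper written for this illustration, not a LangChain function.

```python
def chunk_with_min_size(sentences, breakpoint_indices, min_chunk_size=None):
    """Cut after each breakpoint index, skipping cuts that would emit a
    chunk shorter than min_chunk_size characters."""
    chunks, start = [], 0
    for index in breakpoint_indices:
        candidate = " ".join(sentences[start:index + 1])
        if min_chunk_size is not None and len(candidate) < min_chunk_size:
            continue  # too short: keep accumulating into the next chunk
        chunks.append(candidate)
        start = index + 1
    if start < len(sentences):
        chunks.append(" ".join(sentences[start:]))
    return chunks


sentences = ["Alpha one.", "Alpha two.", "Beta one.",
             "Beta two.", "Gamma one.", "Gamma two."]
breaks = [1, 3]  # cut after the 2nd and the 4th sentence

print(chunk_with_min_size(sentences, breaks))
print(chunk_with_min_size(sentences, breaks, min_chunk_size=25))
```

Without a minimum the result is three chunks. With `min_chunk_size=25` the first candidate (21 characters) is too short, so the first cut is skipped and the first four sentences end up in one chunk.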



## **SemanticChunker Method's Arguments:**


* **buffer\_size (int)**
  * **What it is:** the number of neighbouring sentences, on each side, that are combined with a sentence **before it is embedded**. Default: `1`.
  * **How it works:** for sentence `i`, the chunker embeds the joined text of sentences `i - buffer_size` through `i + buffer_size`. Distances are then computed between these overlapping windows, which smooths out noise from very short sentences.
  * **Example:** with sentences S1…S5 and `buffer_size=1`
    * Window for S1: `[S1, S2]`
    * Window for S3: `[S2, S3, S4]`
      The windows are used only for embedding. The final chunks are built from the original sentences and do not overlap.


* **add\_start\_index (bool)**
  * **What it is:** stores the **character start index** (in the original text) for each produced chunk in metadata.
  * **Why it’s useful:** lets you map chunks back to the original document for highlighting, citations, or UI offsets.
  * **Example:** a returned `Document` may have `metadata={"start_index": 1423}` so you can jump the user to that exact spot. 


* **number\_of\_chunks (int | None)**
  * **What it is:** an optional **target count** of chunks to produce.
  * **How it works:** when set, the splitter picks the distance threshold from this target (by mapping the target onto a percentile of the distances) instead of using `breakpoint_threshold_type`.
  * **When to use:** you want predictable paging, for instance “give me \~12 chunks for this chapter so I can show one per screen.”
  * **Example:** if the default threshold would yield 20 chunks but `number_of_chunks=12`, the threshold is raised so that only about the 11 largest distances become breakpoints. The count is approximate, since `min_chunk_size` can still suppress splits.


* **sentence\_split\_regex (str)**
  * **What it is:** the **regex** used for the very first pass of splitting into sentences before any semantic logic. Default: `r"(?<=[.?!])\s+"` (split on `.?!` followed by whitespace).
  * **Why it matters:** better sentence detection → better embeddings → better breakpoint detection.
  * **Example tweaks:**
    * Include colons/semicolons for technical prose: `r"(?<=[.?!;:])\s+"`
    * Keep abbreviations from splitting (e.g., “U.S.”) by using a more careful regex/tokenizer. 

* **min\_chunk\_size (int | None)**
  * **What it is:** a **floor** on how small a chunk can be (in characters).
  * **How it works:** at each breakpoint, if the text accumulated so far is shorter than `min_chunk_size` characters, the breakpoint is skipped and the text is carried forward into the next chunk, avoiding tiny, low-signal chunks.
  * **Example:** natural breaks yield `[S1,S2] | [S3] | [S4,S5,S6]`. If `[S3]` alone is shorter than `min_chunk_size` (and `[S1,S2]` is not), the break after S3 is skipped → `[S1,S2] | [S3,S4,S5,S6]`.
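
The two pre-embedding steps, sentence splitting and buffering, can be sketched in a few lines. The regex is the default quoted above; `combine_with_buffer` is a hypothetical helper that assumes `buffer_size` means "neighbours on each side joined before embedding".

```python
import re


def split_sentences(text, pattern=r"(?<=[.?!])\s+"):
    # Split on whitespace preceded by '.', '?' or '!'
    return re.split(pattern, text)


def combine_with_buffer(sentences, buffer_size=1):
    """Join each sentence with up to buffer_size neighbours on each side.
    These windows are what would be embedded; chunks still use the
    original sentences."""
    windows = []
    for i in range(len(sentences)):
        lo = max(0, i - buffer_size)
        hi = min(len(sentences), i + buffer_size + 1)
        windows.append(" ".join(sentences[lo:hi]))
    return windows


text = "The U.S. economy grew. Inflation fell! Did wages rise? Yes."
sentences = split_sentences(text)
windows = combine_with_buffer(sentences, buffer_size=1)

print(sentences)
print(windows)
```

Note that the default regex splits after "U.S.", which is the abbreviation problem mentioned under `sentence_split_regex`.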



#### **Putting it all together (mini scenario)**

Text of 9 sentences (S1…S9). Semantic distances suggest breaks after S3 and S6.

* **Raw splits:** `[S1–S3] | [S4–S6] | [S7–S9]`
* **With `buffer_size=1`:** each sentence is embedded together with its neighbours (e.g. S4 as `[S3,S4,S5]`), which affects where the breaks are detected. The chunks themselves still contain each sentence exactly once.
* **With `min_chunk_size` larger than the length of `[S1–S3]`** (but not larger than `[S1–S6]`): the first break is skipped → `[S1–S6] | [S7–S9]`.
* **With `number_of_chunks=2`:** the threshold is raised so that only the largest distance remains a breakpoint, giving two chunks even though the raw splits suggested three.
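
One way a target chunk count can be turned into a threshold is linear interpolation onto a percentile: a target of 1 maps to the 100th percentile (no splits), and a target equal to the number of distances maps to the 0th percentile. The sketch below is modelled on how `langchain_experimental` appears to handle `number_of_chunks`; treat the exact interpolation as an assumption, and note that `threshold_for_target` is a name chosen for this example.

```python
import math


def percentile(values, q):
    # q-th percentile with linear interpolation between sorted values
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = math.floor(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def threshold_for_target(distances, number_of_chunks):
    """Map a target chunk count onto a percentile of the distances."""
    x1, y1 = float(len(distances)), 0.0   # many chunks -> low percentile
    x2, y2 = 1.0, 100.0                   # one chunk   -> 100th percentile
    x = max(min(float(number_of_chunks), x1), x2)  # clamp into [1, len]
    if x1 == x2:
        y = y2
    else:
        y = y1 + (y2 - y1) / (x2 - x1) * (x - x1)
    y = min(max(y, 0.0), 100.0)
    return percentile(distances, y)


distances = [0.1, 0.15, 0.12, 0.55, 0.18, 0.2, 0.9, 0.16]
for target in (1, 2, 8):
    print(target, round(threshold_for_target(distances, target), 4))
```

For a target of 2 the threshold lands at the second-largest distance (0.55), so in principle only the largest distance exceeds it. Because the threshold coincides with an actual distance value, floating-point rounding can tip that comparison either way, which is one reason the resulting count is only approximate.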



## **Load Documents:**

```python
# Read the PDF:

from langchain_pymupdf4llm import PyMuPDF4LLMLoader
from langchain_community.document_loaders.parsers import TesseractBlobParser, LLMImageBlobParser
from IPython.display import Markdown, display

loader = PyMuPDF4LLMLoader(
    file_path="Data/36206.pdf",
    mode='single',
    # extract_images=True,
    # images_parser=LLMImageBlobParser(
    #     model=llm
    # ),
    table_strategy='lines'
)

docs = loader.load()
```


```python
# display(Markdown(docs[0].page_content))
len(docs[0].page_content)
```

```
45845
```

## **Load Embeddings:**

```python
from langchain_openai.embeddings import AzureOpenAIEmbeddings
from dotenv import load_dotenv
import os

# Load environment variables from .env file
load_dotenv()

# Required: raises KeyError early if the endpoint is not configured
AZURE_OPENAI_ENDPOINT = os.environ["AZURE_OPENAI_ENDPOINT"]
AZURE_OPENAI_API_KEY = os.getenv("AZURE_OPENAI_API_KEY")
AZURE_API_VERSION = os.getenv("AZURE_API_VERSION")

embeddings = AzureOpenAIEmbeddings(
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
    api_key=AZURE_OPENAI_API_KEY,
    api_version=AZURE_API_VERSION,
    azure_deployment="text-embedding-3-small"
)
```

## **Implementation 01:**

* **`Breakpoints`**:

  * This chunker works by determining when to "break" apart sentences. This is done by looking for differences in embeddings between any two sentences. When that difference is past some threshold, then they are split.

  * There are a few ways to determine what that threshold is, which are controlled by the `breakpoint_threshold_type` kwarg.

  * Note: if the resulting chunk sizes are too small/big, the additional kwargs `breakpoint_threshold_amount` and `min_chunk_size` can be used for adjustments.


* **`Breakpoint Type:`**  **Percentile**
  
  * The default way to split is based on percentile. In this method, all differences between sentences are calculated, and then any difference greater than the X percentile is split. The default value for X is `95.0` and can be adjusted by the keyword argument `breakpoint_threshold_amount` which expects a number between `0.0` and `100.0`.

```python
from langchain_experimental.text_splitter import SemanticChunker

# Semantic Chunker -- Percentile
text_splitter_percentile = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95.0,
    min_chunk_size=10,
)

chunks = text_splitter_percentile.create_documents(texts=[docs[0].page_content])
```

```python
len(chunks)
```

```
17
```

```python
chunks[0].page_content
```

"Received: 19 July 2022\n\n\n## **-**\n\n\n\nRevised: 13 September 2022\n\n\n## **-**\n\n\n\nAccepted: 17 October 2022\n\n\n\n[DOI: 10.1002/cncr.34687](https://doi.org/10.1002/cncr.34687)\n\n\n**O R I G I N A L A R T I C L E**\n\n# **Quality of life with cemiplimab plus chemotherapy for** **first‐line treatment of advanced non–small cell lung cancer:** **Patient‐reported outcomes from phase 3 EMPOWER‐Lung 3**\n\n\n**Tamta Makharadze MD** **[1]** | **Ruben G. W. Quek PhD** **[2]** | **Tamar Melkadze MD** **[3]** |\n**Miranda Gogishvili MD** **[4]** | **Cristina Ivanescu PhD** **[5]** | **Davit Giorgadze MD** **[6]** |\n**Mikhail Dvorkin MD** **[7]** | **Konstantin Penkov MD** **[8]** | **Konstantin Laktionov MD** **[9]** |\n\n**Gia Nemsadze MD** **[10]** | **Marina Nechaeva MD** **[11]** | **Irina Rozhkova MD** **[12]** |\n**Ewa Kalinka MD** **[13]** | **Christian Gessner MD** **[14]** | **Brizio Moreno‐Jaime MD** **[15]** |\n**Rodolfo Passalacqua MD** **[16]** | **Gerasimos Konidaris M

## **Implementation 02:**

* **`Breakpoint Type:` Standard Deviation** 
  * In this method, any difference greater than the mean plus X standard deviations is split. The default value for X is `3.0` and can be adjusted by the keyword argument `breakpoint_threshold_amount`.

```python
text_splitter_standard_deviation = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_amount=3.0, # default
    breakpoint_threshold_type="standard_deviation",
    min_chunk_size=10,
)

chunks = text_splitter_standard_deviation.create_documents(texts=[docs[0].page_content])
```

```python
len(chunks)
```

```
1
```

```python
chunks[0].page_content
```

"Received: 19 July 2022\n\n\n## **-**\n\n\n\nRevised: 13 September 2022\n\n\n## **-**\n\n\n\nAccepted: 17 October 2022\n\n\n\n[DOI: 10.1002/cncr.34687](https://doi.org/10.1002/cncr.34687)\n\n\n**O R I G I N A L A R T I C L E**\n\n# **Quality of life with cemiplimab plus chemotherapy for** **first‐line treatment of advanced non–small cell lung cancer:** **Patient‐reported outcomes from phase 3 EMPOWER‐Lung 3**\n\n\n**Tamta Makharadze MD** **[1]** | **Ruben G. W. Quek PhD** **[2]** | **Tamar Melkadze MD** **[3]** |\n**Miranda Gogishvili MD** **[4]** | **Cristina Ivanescu PhD** **[5]** | **Davit Giorgadze MD** **[6]** |\n**Mikhail Dvorkin MD** **[7]** | **Konstantin Penkov MD** **[8]** | **Konstantin Laktionov MD** **[9]** |\n\n**Gia Nemsadze MD** **[10]** | **Marina Nechaeva MD** **[11]** | **Irina Rozhkova MD** **[12]** |\n**Ewa Kalinka MD** **[13]** | **Christian Gessner MD** **[14]** | **Brizio Moreno‐Jaime MD** **[15]** |\n**Rodolfo Passalacqua MD** **[16]** | **Gerasimos Konidaris M

## **Implementation 03:**

* **`Breakpoint Type:` Interquartile**
    * In this method, the interquartile range is used to split chunks. It can be scaled by the keyword argument `breakpoint_threshold_amount`; the default value is `1.5`.

```python
text_splitter_interquartile = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="interquartile",
    breakpoint_threshold_amount=1.5, # default
    min_chunk_size=10
)

chunks = text_splitter_interquartile.create_documents(texts=[docs[0].page_content])
```

```python
len(chunks)
```

```
17
```

```python
chunks[0].page_content
```

"Received: 19 July 2022\n\n\n## **-**\n\n\n\nRevised: 13 September 2022\n\n\n## **-**\n\n\n\nAccepted: 17 October 2022\n\n\n\n[DOI: 10.1002/cncr.34687](https://doi.org/10.1002/cncr.34687)\n\n\n**O R I G I N A L A R T I C L E**\n\n# **Quality of life with cemiplimab plus chemotherapy for** **first‐line treatment of advanced non–small cell lung cancer:** **Patient‐reported outcomes from phase 3 EMPOWER‐Lung 3**\n\n\n**Tamta Makharadze MD** **[1]** | **Ruben G. W. Quek PhD** **[2]** | **Tamar Melkadze MD** **[3]** |\n**Miranda Gogishvili MD** **[4]** | **Cristina Ivanescu PhD** **[5]** | **Davit Giorgadze MD** **[6]** |\n**Mikhail Dvorkin MD** **[7]** | **Konstantin Penkov MD** **[8]** | **Konstantin Laktionov MD** **[9]** |\n\n**Gia Nemsadze MD** **[10]** | **Marina Nechaeva MD** **[11]** | **Irina Rozhkova MD** **[12]** |\n**Ewa Kalinka MD** **[13]** | **Christian Gessner MD** **[14]** | **Brizio Moreno‐Jaime MD** **[15]** |\n**Rodolfo Passalacqua MD** **[16]** | **Gerasimos Konidaris M

## **Implementation 04:**

* **`Breakpoint Type:` Gradient**
    * In this method, the gradient of the distances is used to split chunks, together with the percentile method. This method is useful when chunks are <u>highly correlated with each other or specific to a domain e.g. legal or medical</u>.
    * The idea is to apply anomaly detection to the gradient array, so that the distribution becomes wider and boundaries are easier to identify in highly semantic data.
    * Similar to the percentile method, the split can be adjusted by the keyword argument `breakpoint_threshold_amount`, which expects a number between `0.0` and `100.0`; the default value is `95.0`.

```python
text_splitter_gradient = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="gradient",
    breakpoint_threshold_amount=95.5,
    min_chunk_size=10
)

chunks = text_splitter_gradient.create_documents(texts=[docs[0].page_content])
```

```python
len(chunks)
```

```
16
```

```python
chunks[0].page_content
```

'Received: 19 July 2022\n\n\n## **-**\n\n\n\nRevised: 13 September 2022\n\n\n## **-**\n\n\n\nAccepted: 17 October 2022\n\n\n\n[DOI: 10.1002/cncr.34687](https://doi.org/10.1002/cncr.34687)\n\n\n**O R I G I N A L A R T I C L E**\n\n# **Quality of life with cemiplimab plus chemotherapy for** **first‐line treatment of advanced non–small cell lung cancer:** **Patient‐reported outcomes from phase 3 EMPOWER‐Lung 3**\n\n\n**Tamta Makharadze MD** **[1]** | **Ruben G.'

## **When to Use Each `Breakpoint` Strategy:**

### **1.** **Percentile**

* **What it does:** Marks splits at distances above the X-th percentile, i.e. the largest (100 − X)% of distances (default: 95th percentile, so the top 5%).
* **When to use:**

  * Documents are long and have **clear, strong topic shifts**.
  * You only care about the **biggest jumps** (e.g., chapter changes, section headings).
  * You want fewer, bigger chunks.
* **Trade-off:** May *miss* subtle shifts if they don’t make it into the top (100 − X)%.

✅ Example: splitting a **book** or **research paper** into high-level thematic chunks.



### **2.** **Standard Deviation**

* **What it does:** Uses mean + *N×std deviation* as threshold.
* **When to use:**

  * Distances look roughly **normally distributed** (most are similar, a few are outliers).
  * You want a method that adapts to both **average distance** and **variance**.
  * Best for **balanced text** where topic shifts aren’t extreme.
* **Trade-off:** Sensitive to skewed distributions (if text is uneven, it might misfire).

✅ Example: splitting a **report or academic article** where topic transitions are moderate.



### **3.** **Interquartile (IQR method)**

* **What it does:** Threshold = mean + 1.5×IQR in LangChain's implementation (similar in spirit to the Q3 + 1.5×IQR rule used for outlier detection).
* **When to use:**

  * Text has **a few big jumps and many small ones** (outliers).
  * You want robustness against skewed data (the IQR is not inflated by a handful of extreme distances).
  * Works well if you expect **occasional but real topic leaps**.
* **Trade-off:** Might under-split if jumps are gradual (no outliers).

✅ Example: splitting **news articles** or **blogs** where most sentences flow smoothly, but once in a while a new section starts abruptly.



### **4.** **Gradient**

* **What it does:** Looks for **sudden spikes** in distance *changes* (derivative), not just absolute values.
* **When to use:**

  * You want to catch **subtle or relative topic shifts** (e.g., when two similar topics diverge).
  * Great when text has **gradual build-ups** but then a sharp turn.
  * Useful for **dialogues, transcripts, or essays** where topics bleed into each other.
* **Trade-off:** Can be noisier (detects many small jumps if text isn’t smooth).

✅ Example: splitting a **podcast transcript** or **meeting notes**, where conversations drift and suddenly pivot.



### 🔑 Quick Heuristic

| Method        | Best for                  | Think of it as…                   |
| ------------- | ------------------------- | --------------------------------- |
| Percentile    | Major topic shifts only   | “Give me just the **headlines**.” |
| Standard Dev  | Balanced, moderate shifts | “Use a **statistical average**.”  |
| Interquartile | Few big outliers          | “Spot the **outlier jumps**.”     |
| Gradient      | Subtle pivots in flow     | “Track the **momentum shifts**.”  |



## **Recursive Character Splitter Chunking vs. Semantic Chunking:**


### 🔹 **1. Recursive Character Splitter Chunking**

* **How it works**
  * Splits text by **characters** (not meaning), using a **hierarchy of separators**.
  * Default separator order: `["\n\n", "\n", " ", ""]`
    → it tries to split by **paragraph**, then **line**, then **space**, and if none work, finally by raw characters.
  * It keeps splitting until chunks fit within `chunk_size`.
  * Often used with **overlap** (e.g. `chunk_overlap=50`) so chunks share some context.

* **Example**
  Text:
  ```
  LangChain is an open-source framework. 
  It helps build LLM-powered apps. 
  It has modules for retrieval, agents, and evaluation.
  ```

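  With `chunk_size=50` and `chunk_overlap=0`, and the three sentences joined on a single line, the splitter falls back to splitting on spaces and packs words up to the limit:

  * Chunk 1: `"LangChain is an open-source framework. It helps"`
  * Chunk 2: `"build LLM-powered apps. It has modules for"`
  * Chunk 3: `"retrieval, agents, and evaluation."`

  Notice: Splits are based only on length, **not meaning**.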

* **When to use**
  * You just need **guaranteed chunk sizes** for token limits.
  * Text is fairly uniform and meaning isn’t critical per chunk (e.g., log files, simple documents).
  * It’s **fast and cheap** (no embeddings required).
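
The separator hierarchy described above can be sketched with a small recursive function. This is a simplified illustration, not LangChain's `RecursiveCharacterTextSplitter`: it has no overlap, drops the separators at cut points, and `recursive_split` is a name chosen for this example.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters, trying
    coarse separators first and finer ones only when needed."""
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, rest = separators[0], separators[1:]
    pieces = list(text) if sep == "" else text.split(sep)
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # piece still fits in the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece alone is too long: retry with the finer separators
            chunks.extend(recursive_split(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks


text = (
    "LangChain is an open-source framework. "
    "It helps build LLM-powered apps. "
    "It has modules for retrieval, agents, and evaluation."
)
for chunk in recursive_split(text, chunk_size=50):
    print(repr(chunk))
```

Since the text has no paragraph or line breaks, the function falls through to the space separator and packs words until the 50-character limit is reached, regardless of sentence boundaries.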



### 🔹 **2. Semantic Chunking**

  * **How it works**
    * Splits text into **sentences**.
    * Embeds each sentence → measures **semantic distance** between neighbors.
    * Uses a **breakpoint strategy** (Percentile, Std Dev, IQR, Gradient) to decide where topics shift.
    * Forms chunks that are **coherent in meaning**, not just length.
    * Extra options like `buffer_size`, `min_chunk_size`, etc. refine the splits.

  * **Example**
    Same text:
    ```
    LangChain is an open-source framework. 
    It helps build LLM-powered apps. 
    It has modules for retrieval, agents, and evaluation.
    ```

    Semantic similarity is high across sentences → **no breakpoints**.
    → The whole passage may remain one chunk.

    If we add a new sentence:

    ```
    Recently, NASA announced a new Mars mission.
    ```

    The distance between “LangChain modules…” and “NASA mission” would be **large**, so a breakpoint is created → 2 chunks:

    * Chunk 1: LangChain-related sentences
    * Chunk 2: NASA mission

  * **When to use**
    * You need chunks to be **semantically meaningful** (for retrieval, QA, summarization).
    * Works best with **long, topic-rich documents** (reports, articles, transcripts).
    * More **expensive** (needs embeddings) and a bit slower.



### 🔑 **Key Differences:**

| Feature                | Recursive Character Splitter       | Semantic Chunking                     |
| ---------------------- | ---------------------------------- | ------------------------------------- |
| **Basis of split**     | Character count / separators       | Semantic meaning via embeddings       |
| **Dependencies**       | None (just text ops)               | Requires embeddings model             |
| **Chunk size control** | Bounded by `chunk_size` (characters by default) | Approximate, depends on topics        |
| **Speed/Cost**         | Fast, cheap                        | Slower, embedding cost                |
| **Chunk coherence**    | May cut mid-sentence/topic         | Keeps topics together                 |
| **Best use case**      | Token-limit control, preprocessing | Semantic retrieval, QA, summarization |

