# 🦛 Chonkie: RecursiveChunker for PDF and Markdown Chunking

![](https://drive.google.com/uc?export=view&id=1FHBnC4CgPjrvSvoXwJiZqg56I8Ck3SbD)

In this notebook, we go over how you can use Chonkie to quickly parse, chunk and run RAG on PDF and Markdown files!

## Overview

Our entire workflow can be summarized in four simple steps (See table of contents sidebar to jump to a step):

1. Convert PDF document to Markdown
2. With Chonkie's RecursiveChunker, chunk our markdown document
3. Prepare Chunks for RAG
4. Use our generated data with an LLM!

## Acknowledgments

We use some awesome tools and services in this notebook. Be sure to check them out too:

- Chonkie (of course)
- arXiv (for our example PDF file)
- Docling (for PDF to Markdown conversion)
- Model2Vec (for embeddings)
- Vicinity (for indexing)
- Together Client (for LLM calls)

## Installs and Imports

Installation might take a while, but only the first time!

**You may have to restart your Colab runtime.** Don't worry, that's completely normal!

Just grab a snack and await the CHONK!

In [1]:
!pip install -q chonkie docling model2vec vicinity together rich[jupyter]

In [2]:
from chonkie import RecursiveChunker, RecursiveRules, RecursiveLevel
from docling.document_converter import DocumentConverter
from google.colab import userdata
from model2vec import StaticModel
import numpy as np
import os
from pprint import pprint
from rich.console import Console
from rich.text import Text
from together import Together
from transformers import AutoTokenizer
from typing import List
from vicinity import Vicinity, Backend, Metric

## Inits and Utils

Let's set up everything we need for a smooth chunking experience.

### Define utility functions

Just some functions to make the `.ipynb` experience better :D

In [3]:
# Rich text console for better printing
console = Console()

# A wrapper to pretty print
def rprint(text: str, console: Console=console, width: int = 80) -> None:
  richtext = Text(text)
  console.print(richtext.wrap(console, width=width))

### Initialize Your Model!

Let's initialize the model we will build our application with!

Here, we are going to be using DeepSeek's R1 model.

In [None]:
# Set your Together API key to use Deepseek R1 with it~
os.environ['TOGETHER_API_KEY'] = userdata.get('TOGETHER_API_KEY')

# Initialise a model2vec model for encoding sentences for retrieval
model = StaticModel.from_pretrained("minishlab/potion-retrieval-32M")

# Initialise the Together client to call upon Deepseek R1
client = Together()

# (Optional) Initialise the tokenizer for Deepseek R1
# We use this to get token counts at various points in this colab.
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1")


## Step 1: Use docling to convert from PDF to Markdown!

**Note**: This step can take about 20-30 seconds depending on your PDF.

In [4]:
# Docling can convert any PDF to markdown!
converter = DocumentConverter()
source = "https://arxiv.org/pdf/1706.03762"
result = converter.convert(source)
text = result.document.export_to_markdown()


In [5]:
# @title A quick look at the text we'll be working with~
rprint(text)

In [6]:
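import html, json
from IPython.display import HTML
# Escape the text so quotes and newlines in the Markdown cannot break the button markup
HTML(f'<button onclick="navigator.clipboard.writeText({html.escape(json.dumps(text))})">Copy</button>')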

### Get the total token counts for this PDF

Our example PDF is made up of ~**9,865** tokens! Keep this in mind as we move forward ;)

In [None]:
total_text_tokens = len(tokenizer.encode(text))
rprint(f"This PDF contains: {total_text_tokens} tokens")

## Step 2: Chunk your texts w/ Chonkie!

### Initialize your Chunker!
For effective Markdown chunking, we will be using Chonkie's **recursive chunker**!

With the recursive chunker, we can define custom RecursiveRules that fit our file's structure and format. Since we are working with Markdown files, we will specify rules that help Chonkie understand its syntax.
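
To make the idea concrete before configuring Chonkie, here is a rough, standard-library-only sketch of recursive splitting. It is not Chonkie's implementation: the helper names (`split_keep_delim`, `recursive_split`), the `LEVELS` list, and the use of character counts instead of token counts are all assumptions made for illustration.

```python
from typing import List

SEP = "\x00"  # sentinel assumed never to occur in the input text


def split_keep_delim(text: str, delimiters: List[str]) -> List[str]:
    """Split after each delimiter, keeping the delimiter attached to the piece before it."""
    for delim in delimiters:
        text = text.replace(delim, delim + SEP)
    return [piece for piece in text.split(SEP) if piece]


def recursive_split(text: str, levels: List[List[str]], max_len: int, depth: int = 0) -> List[str]:
    """Split text into chunks of at most max_len characters, trying coarse delimiters first."""
    if len(text) <= max_len:
        return [text]
    if depth >= len(levels):
        # No delimiters left: fall back to fixed-size cuts.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    chunks: List[str] = []
    current = ""
    for piece in split_keep_delim(text, levels[depth]):
        if len(piece) > max_len:
            # Too big even on its own: flush what we have, then recurse with finer delimiters.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, levels, max_len, depth + 1))
        elif len(current) + len(piece) <= max_len:
            current += piece  # merge small neighbours to fill the budget
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Hypothetical levels: paragraphs, then lines, then sentence endings.
LEVELS = [["\n\n"], ["\n"], [". ", "? ", "! "]]
sample = "# Title\n\nFirst sentence here. Second sentence here. Third one.\n\n## Section\n\nShort."
for chunk in recursive_split(sample, LEVELS, max_len=30):
    print(repr(chunk))
```

The point to notice is the ordering: coarse boundaries (paragraphs) are tried first, and finer ones (lines, sentences) are only used for pieces that are still too large, so chunks tend to follow the document's own structure.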

In [None]:
rules = RecursiveRules(
    levels=[
        RecursiveLevel(delimiters=['######', '#####', '####', '###', '##', '#']),
        RecursiveLevel(delimiters=['\n\n', '\n', '\r\n', '\r']),
        RecursiveLevel(delimiters=['.', '?', '!', ';', ':']),
        RecursiveLevel()
    ]
)
chunker = RecursiveChunker(rules=rules, chunk_size=384)

### Let's Chunk!

In [None]:
# This is all it takes to chunk!
chunks = chunker(text)
print(f"Total number of chunks: {len(chunks)}")

Total number of chunks: 57


In [None]:
# @title A quick look at our chunks~
for chunk in chunks[:4]:
  rprint(chunk.text)
  print('-'*80, '\n\n')

## Step 3: Retrieval Augmented Generation!

Let's set up RAG so we can fetch the most relevant chunks when needed.

### Step 3.1: Get the embeddings for each of the chunks!

In [None]:
items = [chunk.text for chunk in chunks]
vectors = model.encode(items)
print(vectors.shape)

(57, 512)


### Step 3.2: Create an index with the chunks and embeddings for retrieval

In [None]:
# Initialize the Vicinity instance (using basic backend and cosine metric)
vicinity = Vicinity.from_vectors_and_items(
    vectors=vectors,
    items=items,
    backend_type=Backend.BASIC,
    metric=Metric.COSINE
)

### Step 3.3: Pack it all together in a single retrieval function!

Given a query, retrieve the text of the 4 most similar chunks~

In [None]:
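def get_embeddings(query: str) -> List[str]:
  # Despite the name, this returns the text of the 4 chunks closest to the query
  query_vector = model.encode(query)
  # vicinity.query returns one result list per query vector, each entry an (item, distance) pair
  results = vicinity.query(query_vector, k=4)
  return [x[0] for x in results[0]]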
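
For intuition, here is a minimal pure-Python sketch of what a brute-force cosine lookup does conceptually. It is not Vicinity's code: `cosine_similarity`, `top_k`, and the toy 2-dimensional vectors are hypothetical, and it ranks by similarity (higher is closer) purely for readability.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          vectors: Sequence[Sequence[float]],
          items: Sequence[str],
          k: int = 4) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and return the k best (item, score) pairs."""
    scored = [(item, cosine_similarity(query_vec, vec)) for item, vec in zip(items, vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data standing in for chunk texts and their embeddings.
toy_items = ["chunk a", "chunk b", "chunk c"]
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for item, score in top_k([1.0, 0.1], toy_vectors, toy_items, k=2):
    print(f"{item}: {score:.3f}")
```

Scanning every vector like this is perfectly fine for a few dozen chunks; approximate backends only start to matter at much larger scales.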

### (Optional) Step 3.4: Test our function

In [None]:
query = "What is a Multi-Head Self Attention?"
retrieved_chunks = get_embeddings(query)

for chunk in retrieved_chunks:
  rprint(chunk)
  print('-'*80, '\n\n')

## Step 4: Let's build our LLM Application!

Our data is ready! Let's set up our LLM application to answer user queries.

In [None]:
# A simple function to make LLM prompts with chunks
def create_prompt(chunks: List[str], query: str) -> str:
  prompt_template = """<instructions>
  Based on the provided contexts, answer the given question to the best of your ability. Remember to also add citations at appropriate points in the format of square brackets like [1][2][3], especially at sentence or paragraph endings.
  You will be given 4 passages in the context, marked with a label 'Doc [1]:' to denote the passage number. Use that number for citations. Answer only from the given context, and if there's no appropriate context, reply "No relevant context found!".
  </instructions>

  <context>
  {context}
  </context>

  <query>
  {query}
  </query>
  """
  context = "\n\n".join([f"Doc {i+1}: {chunk}" for i, chunk in enumerate(chunks)])
  prompt = prompt_template.format(context=context, query=query)
  return prompt

In [None]:
# Prompt to use
query = "What is a Multi-Head Self Attention?"
retrieved_chunks = get_embeddings(query)
prompt = create_prompt(retrieved_chunks, query)

In [None]:
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[{"role": "user", "content": prompt}],
)
# Print the final response without the thinking tokens
answer = response.choices[0].message.content.split("</think>")[-1]
rprint(answer)
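
The `split("</think>")[-1]` idiom keeps only the text after the last closing tag. Here it is as a small helper you could reuse; `strip_thinking` is a hypothetical name, and the extra `.strip()` is an addition that is not in the cell above.

```python
def strip_thinking(raw: str, end_tag: str = "</think>") -> str:
    """Return the text after the last end_tag, or the whole text if the tag is absent."""
    # str.split always returns at least one element, so [-1] is safe either way.
    return raw.split(end_tag)[-1].strip()


print(strip_thinking("<think>Let me reason about this...</think>\n\nFinal answer."))
print(strip_thinking("No reasoning block here."))
```

If the model emits no `</think>` tag at all, the helper simply returns the full response.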

### Wow, the model works great! And how many tokens did we use?

In [None]:
prompt_tokens = len(tokenizer.encode(prompt))
rprint(f"This prompt contains: {prompt_tokens} tokens")

**1,069 tokens!** Remember from earlier: without chunking we'd have used 9,865 tokens!
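
As a quick sanity check on those two numbers, the saving can be worked out directly. The variable names below are illustrative, and the counts are the ones reported in this notebook run; yours may differ slightly.

```python
full_document_tokens = 9_865  # token count of the whole Markdown text, from Step 1
prompt_tokens = 1_069         # token count of the final RAG prompt

saved_tokens = full_document_tokens - prompt_tokens
saved_fraction = saved_tokens / full_document_tokens
print(f"Saved {saved_tokens:,} tokens ({saved_fraction:.1%} of the full document)")
```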

## Conclusions

We got a great answer back from our model, while saving about **8,800 input tokens**! Chonkie is one efficient hippo 🚀

Hope you found this notebook useful! If you want to learn more about Chonkie, check out our [GitHub](https://github.com/chonkie-ai/chonkie), [Twitter](https://x.com/ChonkieAI), and [Bluesky](https://bsky.app/profile/chonkieai.bsky.social)!

If you have any questions about anything we showed here, reach us at `support@chonkie.ai` or on our [Discord](https://discord.gg/6V5pqvqsCY).

# Fin

Happy Chonking! 🦛✨