## RAG with ChromaDB, Chonkie, and Paul Graham's Essays

Welcome! In this notebook, we’re building a Retrieval-Augmented Generation (RAG) pipeline from scratch.

Here’s what we’ll use:
- ChromaDB for fast vector search
- Chonkie for smart, semantic text chunking
- Inference.net (through the OpenAI-compatible SDK) for embeddings and LLM completions
- A dataset of Paul Graham’s essays (because they’re awesome)

What you’ll learn:
1. How to load and peek at a real-world essay dataset
2. How to chunk text in a way that actually makes sense for retrieval
3. How to store and search those chunks in a vector DB
4. How to use retrieved context to make your LLM answers way better

Let’s dive in 🚀

### 1. Install dependencies
We’ll need the OpenAI-compatible SDK to interact with Inference.net, Chroma for vector storage, the `datasets` library for fetching data, and **Chonkie** for smart chunking. 

In [None]:
!pip install openai chromadb datasets chonkie

### 2. Import dependencies and load the Paul Graham essays dataset

We pull the full set of essays from the 🤗 Hub, convert it to a Pandas DataFrame, and do a quick sanity check on the row count.

In [None]:
from datasets import load_dataset
import chromadb
from chonkie import SemanticChunker
from openai import OpenAI
import os
import pandas as pd

ds = load_dataset("pookie3000/paul_graham_all_essays")
ds = ds["train"].to_pandas()
print(len(ds))
ds.head()

222


Unnamed: 0,text
0,| \n \nFebruary 2009 \n \nHacker News was ...
1,| \n \nMay 2008 \n \nAdults lie constantly...
2,| \n \nNovember 2008 \n \nOne of the diffe...
3,| \n \nDecember 2010 \n \nI was thinking r...
4,| \n \n| **Want to start a startup?** Get f...


### 3. Chunk the essays semantically  
`SemanticChunker` splits each essay into semantically coherent chunks, which is useful when you want retrieval at paragraph level rather than whole-essay level.

In [61]:
chunker = SemanticChunker(
    embedding_model="minishlab/potion-base-8M",  # Default model
    threshold=0.47,                               # Similarity threshold (0-1) or (1-100) or "auto"
    chunk_size=5000,                              # Maximum tokens per chunk
    min_sentences=1                               # Minimum sentences per chunk
)


batch_chunks = chunker.chunk_batch(ds["text"].tolist())

🦛 choooooooooooooooooooonk 100% • 222/222 docs chunked [00:02<00:00, 92.83doc/s] 🌱 
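
To build some intuition for the `threshold` parameter, here is a toy sketch of threshold-based grouping: consecutive sentences stay in the same chunk while the cosine similarity of their embeddings is at or above the threshold. This is not Chonkie's actual implementation. The helper `group_by_similarity`, the sample sentences, and the hand-made 2-D vectors are all made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def group_by_similarity(sentences, vectors, threshold):
    """Start a new chunk whenever a sentence is less similar to the
    previous sentence than `threshold` (hypothetical helper)."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev_vec, vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine_similarity(prev_vec, vec) >= threshold:
            chunks[-1].append(sentence)   # same topic: extend the current chunk
        else:
            chunks.append([sentence])     # topic shift: open a new chunk
    return [" ".join(chunk) for chunk in chunks]

# Hand-made vectors: the first two point in a similar direction, the third does not.
toy_sentences = ["Startups grow fast.", "Growth is the key metric.", "Lisp has macros."]
toy_vectors = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0)]
toy_chunks = group_by_similarity(toy_sentences, toy_vectors, threshold=0.47)
print(toy_chunks)
```

In this sketch, raising the threshold makes splits more frequent and chunks smaller, while lowering it merges more sentences into each chunk.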


### 4. Flatten chunk objects to raw text  
We only need the text content for embedding, not the additional metadata.

In [62]:
chunk_texts = [chunk.text for chunks in batch_chunks for chunk in chunks]

print("Number of chunks:", len(chunk_texts), "\n" + "-"*100)
for chunk in chunk_texts[:2]:
    print(chunk)
    print("-"*100)

Number of chunks: 883 
----------------------------------------------------------------------------------------------------
|  
  
February 2009  
  
Hacker News was two years old last week. Initially it was supposed to be a
side project—an application to sharpen Arc on, and a place for current and
future Y Combinator founders to exchange news. It's grown bigger and taken up
more time than I expected, but I don't regret that because I've learned so
much from working on it.  
  
**Growth**  

----------------------------------------------------------------------------------------------------
  
When we launched in February 2007, weekday traffic was around 1600 daily
uniques. It's since grown to around 22,000. This growth rate is a bit higher
than I'd like. I'd like the site to grow, since a site that isn't growing at
least slowly is probably dead. But I wouldn't want it to grow as large as Digg
or Reddit—mainly because that would dilute the character of the site, but also
because I don'
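
If you later want to trace a retrieved chunk back to its essay, one option is to record the essay index while flattening. The sketch below assumes only that each chunk exposes a `.text` attribute, as used above. `FakeChunk` and `flatten_with_source` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass

@dataclass
class FakeChunk:
    """Stand-in for a chunk object; only the `.text` attribute is assumed."""
    text: str

def flatten_with_source(batch_chunks):
    """Flatten a list of per-essay chunk lists, remembering where each chunk came from."""
    texts, metadatas = [], []
    for essay_idx, chunks in enumerate(batch_chunks):
        for chunk_idx, chunk in enumerate(chunks):
            texts.append(chunk.text)
            metadatas.append({"essay_idx": essay_idx, "chunk_idx": chunk_idx})
    return texts, metadatas

# Two fake essays: the first yields two chunks, the second yields one.
demo_batch = [[FakeChunk("a1"), FakeChunk("a2")], [FakeChunk("b1")]]
demo_texts, demo_metadatas = flatten_with_source(demo_batch)
print(demo_texts)
print(demo_metadatas)
```

The resulting list of dictionaries could be passed to Chroma through the `metadatas=` argument of `collection.add`, alongside the documents and embeddings.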

### 5. Create embeddings with Inference.net  
We hit the Inference.net `/v1/embeddings` endpoint (OpenAI-compatible) in mini-batches of 32. You can use any batch size you want, but batch sizes that are too large may cause requests to be slow or to fail.

In [67]:
client = OpenAI(
    base_url="https://api.inference.net/v1",
    api_key=os.environ.get("INFERENCE_API_KEY"),
)

# Process embeddings in batches of 32
batch_size = 32
all_embeddings = []

for i in range(0, len(chunk_texts), batch_size):
    batch = chunk_texts[i:i + batch_size]
    response = client.embeddings.create(
        model="qwen/qwen3-embedding-4b",
        input=batch
    )
    batch_embeddings = [data.embedding for data in response.data]
    all_embeddings.extend(batch_embeddings)

embeddings = all_embeddings
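
The batching loop above can also be factored into a small reusable helper. The sketch below swaps the real API call for a hypothetical `fake_embed_batch` function so it runs offline. `batched` and `embed_all` are illustrative names, not part of any SDK.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_all(texts, embed_batch, batch_size=32):
    """Call `embed_batch` once per mini-batch and concatenate the results in order."""
    vectors = []
    for batch in batched(texts, batch_size):
        batch_vectors = embed_batch(batch)
        # Guard against a response that silently drops or duplicates items.
        if len(batch_vectors) != len(batch):
            raise RuntimeError("embedding count does not match batch size")
        vectors.extend(batch_vectors)
    return vectors

def fake_embed_batch(batch):
    """Hypothetical stand-in for the API call: one 1-D 'embedding' per text."""
    return [[float(len(text))] for text in batch]

demo_texts = [f"chunk {i}" for i in range(70)]
batch_sizes = [len(b) for b in batched(demo_texts, 32)]
demo_vectors = embed_all(demo_texts, fake_embed_batch)
print(batch_sizes, len(demo_vectors))
```

With the 883 chunks produced earlier and a batch size of 32, the same loop would issue 28 requests, the last one carrying 19 texts.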

### 6. Store the vectors in an in-memory Chroma collection  
`EphemeralClient` keeps everything in RAM, which is fine for demos; switch to a persistent client in production.

In [None]:
# let's insert into chroma
chroma_client = chromadb.EphemeralClient() # Note that this is in memory and not suitable for production. Use a persistent client, a cloud client, or a completely different vector store in production.

collection = chroma_client.create_collection(
    name="paul_graham_all_essays",
    metadata={"hnsw:space": "cosine", "dimension": 2560}
)

collection.add(
    documents=chunk_texts,
    embeddings=embeddings,
    ids=[str(i) for i in range(len(chunk_texts))],
)
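
With `"hnsw:space": "cosine"`, the distance Chroma reports is the cosine distance, that is, one minus the cosine similarity. The sketch below shows the same ranking done by exact brute force on toy 2-D vectors. Chroma itself uses an approximate HNSW index rather than this loop, and `cosine_distance` and `top_k` are hypothetical helpers for illustration.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity: 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return (index, distance) pairs for the k closest vectors, nearest first."""
    scored = [(i, cosine_distance(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]

# Toy "document" vectors and a query that points mostly along the first axis.
toy_docs = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
nearest = top_k((1.0, 0.2), toy_docs, k=2)
print(nearest)
```

Brute force scales linearly with the number of stored vectors, which is why vector databases rely on an index once collections grow large.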

### 7. Helper: `rag_query()`  
Given a natural-language question, we:  
1. Embed the query  
2. Retrieve the top-K nearest chunks  
3. Feed **both** the question and retrieved context into an LLM  
4. Return the answer + a DataFrame of retrieved chunks

In [87]:
def rag_query(question: str, *, k: int = 3, temperature: float = 0.3):
    """
    Return (answer, DataFrame-of-retrieved-chunks).
    """
    # Embed query
    query_vec = client.embeddings.create(
        model="qwen/qwen3-embedding-4b",
        input=question
    ).data[0].embedding

    # Retrieve top-k chunks
    res = collection.query(
        query_embeddings=[query_vec],
        n_results=k,
        include=["documents", "distances"]
    )

    # Feed chunks to the LLM
    context = "\n\n".join(res["documents"][0])
    completion = client.chat.completions.create(
        model="meta-llama/llama-3.1-8b-instruct/fp-8",
        messages=[{"role": "user", "content": f"Question: {question}\nContext:\n{context}\n\nAnswer:"}],
        temperature=temperature
    ).choices[0].message.content.strip()

    df = pd.DataFrame(
        {"document": res["documents"][0], "distance": res["distances"][0]}
    )

    return completion, df
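
The prompt-assembly step inside `rag_query()` can be pulled out into its own function, which makes it easier to inspect what the LLM actually sees. The sketch below is one possible variant, not the notebook's method: it numbers each chunk and caps the context length. `build_prompt` and the `max_chars` budget are assumptions introduced for this example.

```python
def build_prompt(question, chunks, max_chars=6000):
    """Number each chunk and stop adding chunks once the character budget
    would be exceeded (the first chunk is always kept)."""
    parts, used = [], 0
    for rank, chunk in enumerate(chunks, start=1):
        block = f"[{rank}] {chunk.strip()}"
        if parts and used + len(block) > max_chars:
            break
        parts.append(block)
        used += len(block)
    context = "\n\n".join(parts)
    # Same overall layout as the f-string used in rag_query().
    return f"Question: {question}\nContext:\n{context}\n\nAnswer:"

demo_prompt = build_prompt("What is work?", ["  First chunk. ", "Second chunk."])
print(demo_prompt)
```

Numbering the chunks also makes it possible to ask the model to cite which passage supports its answer.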

### 8. Example: “What does Paul Graham consider the meaning of work?”  
We fire a single RAG query and print both the answer and the text chunks that were actually used.

In [85]:
answer, df = rag_query("What does Paul Graham consider the meaning of work?")
print(answer)
df

According to Paul Graham, the meaning of work is not just about doing something to earn a living, but about finding something that you are passionate about and enjoy doing. He argues that people who do great work are often those who have found a way to make their work feel like a project of their own, rather than just a chore.

Graham identifies three key ingredients for great work: natural ability, practice, and effort. He notes that while natural ability can be an asset, it is not enough on its own, and that practice and effort are essential for achieving great results.

Graham also emphasizes the importance of finding work that you love, and that this is not just a matter of doing what you would like to do at any given moment, but about finding something that you can be passionate about and enjoy doing over a longer period of time.

He suggests that people should aim to find work that is challenging and meaningful, and that they should be willing to take risks and face challenges in

Unnamed: 0,document,distance
0,\n \nThe reason some subjects seemed easy wa...,0.457119
1,\nTo do something well you have to like it. ...,0.467175
2,"| \n \nJune 2021 \n \nA few days ago, on t...",0.529098
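
Because the collection uses cosine space, each value in the `distance` column can be read as one minus the cosine similarity, so smaller is closer. A minimal sketch of turning distances into similarities and dropping weak matches follows. The `filter_by_distance` helper and the `0.5` cutoff are arbitrary assumptions, and the texts are shortened placeholders for the chunks in the table above.

```python
def filter_by_distance(rows, max_distance):
    """Keep (text, distance) rows at or below `max_distance`, adding a similarity field."""
    kept = []
    for text, distance in rows:
        if distance <= max_distance:
            kept.append({
                "document": text,
                "distance": distance,
                "similarity": round(1.0 - distance, 6),
            })
    return kept

# Distances copied from the table above; the texts are placeholders.
retrieved = [("chunk 0", 0.457119), ("chunk 1", 0.467175), ("chunk 2", 0.529098)]
relevant = filter_by_distance(retrieved, max_distance=0.5)
print(relevant)
```

A cutoff like this is worth tuning per embedding model, since the typical distance range varies from one model to another.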


> **That’s it!** You now have a fully-working, end-to-end Retrieval-Augmented Generation pipeline using:  
> • Hugging Face `datasets` → text source  
> • Chonkie → semantic chunking  
> • Inference.net → embeddings & LLM completions  
> • Chroma → vector storage and similarity search  
> Feel free to swap in your own dataset, vector database, or target LLM to customise the workflow for your use case.