<img src="https://drive.google.com/uc?export=view&id=1wYSMgJtARFdvTt5g7E20mE4NmwUFUuog" width="200">

[![Build Fast with AI](https://img.shields.io/badge/BuildFastWithAI-GenAI%20Bootcamp-blue?style=for-the-badge&logo=artificial-intelligence)](https://www.buildfastwithai.com/genai-course)
[![EduChain GitHub](https://img.shields.io/github/stars/satvik314/educhain?style=for-the-badge&logo=github&color=gold)](https://github.com/satvik314/educhain)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1Cr_tq7DbbS7pQC5y6NlJ7NtPqqs0UBDl?usp=sharing)

### **Chonkie-AI: Advanced Text Chunking for RAG**

Chonkie-AI is a text-chunking library for Retrieval-Augmented Generation (RAG) applications. It offers several chunking methods, from fixed-size token windows to semantic splitting, that divide text into segments sized for retrieval and downstream processing.

### **🚀 Supported Chunking Methods**  
- **🔢 TokenChunker** – Splits text into fixed-size token chunks.  
- **📝 WordChunker** – Chunks text based on words.  
- **📖 SentenceChunker** – Chunks text at sentence boundaries.  
- **🔄 RecursiveChunker** – Uses hierarchical splitting with customizable rules.  
- **🧠 SemanticChunker** – Splits text based on semantic similarity.  
- **🔍 SDPMChunker** – Utilizes a Semantic Double-Pass Merge approach.  
- **🧪 LateChunker (Experimental)** – Embeds the full text first, then splits it into chunks, so each chunk's embedding retains context from the whole document.
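
To make the simplest method in the list concrete, here is a minimal sketch of fixed-size chunking with overlap. It is not Chonkie's implementation: the `chunk_tokens` helper is hypothetical, and whitespace-separated words stand in for real tokenizer output.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int, overlap: int = 0) -> List[List[str]]:
    """Split a token list into windows of at most `chunk_size` tokens.

    Consecutive windows share `overlap` tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window reaches the end, so no redundant tail window is emitted.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Assumption: whitespace "tokens" stand in for a real tokenizer here.
tokens = "Woah! Chonkie, the chunking library is so cool! I love the tiny hippo hehe.".split()
windows = chunk_tokens(tokens, chunk_size=6, overlap=2)
for window in windows:
    print(len(window), " ".join(window))
```

The overlap repeats the last tokens of one window at the start of the next, which helps when a relevant sentence straddles a chunk boundary.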

### 📦 **Dependency Installation**  








In [None]:
!pip install -q chonkie tiktoken docling model2vec vicinity together openai transformers tokenizers "rich[jupyter]"

### **🔤 Importing TokenChunker and GPT-2 Tokenizer**








In [None]:
from chonkie import TokenChunker
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("gpt2")

### **🛠️ Initializing TokenChunker with GPT-2 Tokenizer**








In [None]:
chunker = TokenChunker(tokenizer)

### **📚 Chunking Text with TokenChunker**








In [None]:
chunks = chunker("Woah! Chonkie, the chunking library is so cool! I love the tiny hippo hehe.")

### **🔍 Iterating Through Chunks and Displaying Details**








In [None]:
for chunk in chunks:
    print(f"Chunk: {chunk.text}")
    print(f"Tokens: {chunk.token_count}")

Chunk: Woah! Chonkie, the chunking library is so cool! I love the tiny hippo hehe.
Tokens: 24


### **📂 Importing Libraries for Document Processing and Embeddings**








In [None]:
import os
from typing import List

from docling.document_converter import DocumentConverter
from google.colab import userdata
from model2vec import StaticModel
from rich.console import Console
from rich.text import Text
from together import Together
from transformers import AutoTokenizer
from vicinity import Backend, Metric, Vicinity

from chonkie import RecursiveChunker, RecursiveLevel, RecursiveRules

### **🖥️ Setting Up Rich Console for Pretty Printing**








In [None]:
from rich.console import Console

console = Console()


# A wrapper to pretty print
def rprint(text: str, console: Console = console, width: int = 80) -> None:
    richtext = Text(text)
    console.print(richtext.wrap(console, width=width))

### **🔑 Setting Up API Keys and Loading Models**








In [None]:
os.environ["TOGETHER_API_KEY"] = userdata.get("TOGETHER_API_KEY")

model = StaticModel.from_pretrained("minishlab/potion-retrieval-32M")

client = Together()

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1")

### **📄 Converting Document to Markdown Format**








In [None]:
converter = DocumentConverter()
source = "https://arxiv.org/pdf/1706.03762"
result = converter.convert(source)
text = result.document.export_to_markdown()



In [None]:
rprint(text[:2000])

### **🔢 Calculating Total Token Count in PDF**








In [None]:
total_text_tokens = len(tokenizer.encode(text))
rprint(f"This PDF contains: {total_text_tokens} tokens")

### **📜 Defining Recursive Chunking Rules**








In [None]:

rules = RecursiveRules(
    levels=[
        RecursiveLevel(delimiters=["######", "#####", "####", "###", "##", "#"], include_delim="next"),
        RecursiveLevel(delimiters=["\n\n", "\n", "\r\n", "\r"]),
        RecursiveLevel(delimiters=[".", "?", "!", ";", ":"]),
        RecursiveLevel(),
    ]
)
chunker = RecursiveChunker(rules=rules, chunk_size=384)
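
The idea behind hierarchical rules can be sketched in plain Python: try the coarsest delimiters first, and only fall back to finer ones for pieces that are still too long. This is an illustrative sketch, not Chonkie's algorithm. The `recursive_split` helper is hypothetical, it measures size in characters rather than tokens, and it keeps each delimiter attached to the piece before it.

```python
from typing import List, Sequence


def recursive_split(text: str, levels: Sequence[Sequence[str]], max_chars: int) -> List[str]:
    """Split text with progressively finer delimiters until every piece fits."""
    if len(text) <= max_chars:
        return [text] if text else []
    if not levels:
        # No delimiters left: fall back to fixed-size character windows.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    delimiters, finer_levels = levels[0], levels[1:]

    # Split after each delimiter, keeping the delimiter with the preceding piece.
    pieces = [text]
    for delim in delimiters:
        next_pieces: List[str] = []
        for piece in pieces:
            parts = piece.split(delim)
            next_pieces.extend(part + delim for part in parts[:-1])
            next_pieces.append(parts[-1])
        pieces = [p for p in next_pieces if p]

    # Greedily merge neighbouring pieces; recurse into any piece that is still too long.
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        if len(piece) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, finer_levels, max_chars))
        elif len(current) + len(piece) <= max_chars:
            current += piece
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Paragraph breaks first, then sentence endings, then single spaces.
levels = [["\n\n"], [". ", "? ", "! "], [" "]]
sample = "# Title\n\nFirst sentence here. Second sentence follows.\n\nA short closing paragraph."
demo_chunks = recursive_split(sample, levels, max_chars=30)
for piece in demo_chunks:
    print(repr(piece))
```

Because delimiters stay attached to their pieces, joining the chunks reproduces the original text exactly.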

### **📊 Chunking Text and Counting Total Chunks**








In [None]:
chunks = chunker(text)
print(f"Total number of chunks: {len(chunks)}")


Total number of chunks: 57


### **🔍 Displaying Sample Chunks from Text**








In [None]:
for chunk in chunks[:4]:
    rprint(chunk.text)
    print("-" * 80, "\n\n")

### **📈 Encoding Chunks into Vector Representations**








In [None]:

items = [chunk.text for chunk in chunks]
vectors = model.encode(items)
print(vectors.shape)

(57, 512)


### **🧭 Creating a Vicinity Index for Similarity Search**








In [None]:
vicinity = Vicinity.from_vectors_and_items(
    vectors=vectors, items=items, backend_type=Backend.BASIC, metric=Metric.COSINE
)
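
For intuition about what a cosine-metric index does, the following sketch ranks a handful of toy vectors against a query using only the standard library. The `cosine_similarity` and `top_k` helpers are hypothetical and perform a brute-force scan. Note that this sketch scores by similarity (higher is closer), whereas an index may report a distance instead.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    vectors: Sequence[Sequence[float]],
    items: Sequence[str],
    k: int,
) -> List[Tuple[str, float]]:
    """Return the k items whose vectors are most similar to the query."""
    scored = [(item, cosine_similarity(query, vec)) for item, vec in zip(items, vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-D vectors; real embeddings have hundreds of dimensions.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_items = ["chunk A", "chunk B", "chunk C"]
toy_results = top_k([1.0, 0.1], toy_vectors, toy_items, k=2)
for item, score in toy_results:
    print(f"{item}: {score:.3f}")
```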


### **🔎 Retrieving Similar Chunks Using Embeddings**








In [None]:

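def get_embeddings(query: str):
    # Despite its name, this returns the text of the 4 chunks closest to the query.
    query_vector = model.encode(query)
    results = vicinity.query(query_vector, k=4)
    # results[0] holds the (item, distance) pairs for the single query vector.
    return [x[0] for x in results[0]]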

### **🤖 Retrieving and Displaying Relevant Chunks**








In [None]:

query = "What is a Multi-Head Self Attention?"
retrieved_chunks = get_embeddings(query)

for chunk in retrieved_chunks:
    rprint(chunk)
    print("-" * 80, "\n\n")

### **📝 Generating a Prompt for Context-Based Question Answering**








In [None]:
def create_prompt(chunks: List[str], query: str) -> str:
    prompt_template = """
  Based on the provided contexts, answer the given question to the best of your ability. Remember to also add citations at appropriate points in the format of square brackets like [1][2][3], especially at sentence or paragraph endings.
  You will be given 4 passages in the context, marked with a label 'Doc [1]:' to denote the passage number. Use that number for citations. Answer only from the given context, and if there's no appropriate context, reply "No relevant context found!".



  {context}



  {query}

  """
    context = "\n\n".join([f"Doc [{i+1}]: {chunk}" for i, chunk in enumerate(chunks)])
    prompt = prompt_template.format(context=context, query=query)
    return prompt
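
As a variation on the same idea, the sketch below derives the passage count from the input instead of hard-coding it, so the instructions stay accurate if the number of retrieved chunks changes. The `format_context` and `build_prompt` helpers and the instruction wording are hypothetical and are not part of the notebook's own prompt.

```python
from typing import List


def format_context(chunks: List[str]) -> str:
    """Number each passage so that citations like [2] can be traced back."""
    return "\n\n".join(f"Doc [{i}]: {chunk}" for i, chunk in enumerate(chunks, start=1))


def build_prompt(chunks: List[str], query: str) -> str:
    """Assemble instructions, numbered context, and the question."""
    instructions = (
        f"You will be given {len(chunks)} passages, each marked with a label "
        "such as 'Doc [1]:'. Answer only from these passages and cite them "
        "with the same bracketed numbers."
    )
    return f"{instructions}\n\nContext:\n{format_context(chunks)}\n\nQuestion: {query}\n"


demo_prompt = build_prompt(["alpha", "beta"], "What comes after alpha?")
print(demo_prompt)
```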

### **🛠️ Creating a Query-Specific Prompt with Retrieved Context**








In [None]:
query = "What is a Multi-Head Self Attention?"
retrieved_chunks = get_embeddings(query)
prompt = create_prompt(retrieved_chunks, query)

### **🤖 Generating an AI Response Using OpenAI GPT-4o**








In [None]:
import openai

os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')

# Rebinds `client`, replacing the Together client created earlier.
client = openai.OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)

answer = response.choices[0].message.content
print(answer)

Multi-Head Self Attention is an attention mechanism used in transformer models where instead of performing a single attention function, the model performs multiple attention functions, or heads, in parallel. Each head independently projects the queries, keys, and values into different subspaces by using learned linear projections, allowing the model to capture information from different representation subspaces at different positions. The outputs from these heads are then concatenated and projected again to form the final output. This mechanism allows the model to attend to information jointly from different sources, avoiding the averaging effect that a single attention head would introduce. In the described setup, there are typically 8 parallel attention layers or heads, each with a reduced dimension, making the computational cost similar to that of single-head attention with full dimensionality [4].
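
Since the answer cites passages with bracketed numbers such as `[4]`, those citations can be mapped back to the retrieved chunks for verification. The sketch below assumes citations always appear as a number inside square brackets. The `cited_passages` helper is hypothetical, and the sample answer and passages are made up for illustration.

```python
import re
from typing import Dict, List


def cited_passages(answer: str, chunks: List[str]) -> Dict[int, str]:
    """Map each citation number found in the answer to its passage text.

    Numbers outside the range of available passages are ignored.
    """
    numbers = sorted({int(n) for n in re.findall(r"\[(\d+)\]", answer)})
    return {n: chunks[n - 1] for n in numbers if 1 <= n <= len(chunks)}


# Made-up answer and passages, for illustration only.
example_answer = "Heads run in parallel [1][4]. Outputs are concatenated [4]. See also [9]."
example_chunks = ["passage one", "passage two", "passage three", "passage four"]
cited = cited_passages(example_answer, example_chunks)
for number, passage in cited.items():
    print(f"[{number}] {passage}")
```

In the notebook, `answer` and `retrieved_chunks` could be passed in place of the made-up values.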


### **📊 Calculating Token Count for the Prompt**








In [None]:
prompt_tokens = len(tokenizer.encode(prompt))
rprint(f"This prompt contains: {prompt_tokens} tokens")
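
A token count like this is often used to keep a prompt within a model's context limit. One possible approach, sketched below, keeps the highest-ranked chunks until a budget is exhausted. The `fit_to_budget` helper and the budget value are hypothetical, and a whitespace word count stands in for a real tokenizer.

```python
from typing import Callable, List


def fit_to_budget(
    chunks: List[str],
    count_tokens: Callable[[str], int],
    budget: int,
) -> List[str]:
    """Keep chunks in rank order until adding the next one would exceed the budget."""
    selected: List[str] = []
    used = 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > budget:
            # Stop at the first chunk that does not fit, preserving rank order.
            break
        selected.append(chunk)
        used += cost
    return selected


def whitespace_count(text: str) -> int:
    """Stand-in token counter: counts whitespace-separated words."""
    return len(text.split())


ranked = ["one two three", "four five", "six seven eight nine", "ten"]
kept = fit_to_budget(ranked, whitespace_count, budget=6)
print(kept)
```

With the notebook's tokenizer, the counter could instead be `lambda t: len(tokenizer.encode(t))`.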