### Semantic Chunking in RAG (Advanced Techniques)

Semantic chunking splits a document into meaningful units based on its content, not just on raw length or character count.

Traditional chunking -> “cut every 500 tokens” (blind splitting).

Semantic chunking -> “cut where topics or ideas change” (context-aware splitting).
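To make the contrast concrete, here is a minimal sketch of the "blind" approach. It is illustrative only: the helper name `fixed_size_chunks` is hypothetical, and whitespace-separated words stand in for tokens, which a real tokenizer would count differently.

```python
from typing import List


def fixed_size_chunks(text: str, chunk_size: int = 8) -> List[str]:
    """Split text into chunks of `chunk_size` words, ignoring meaning.

    Assumption: words approximate tokens for the sake of illustration.
    """
    words = text.split()
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), chunk_size)
    ]


text = (
    "LangChain is a framework for building LLM applications. "
    "The Eiffel Tower is located in Paris."
)

chunks = fixed_size_chunks(text, chunk_size=6)
for idx, chunk in enumerate(chunks, start=1):
    print(f"Chunk {idx}: {chunk}")
```

With `chunk_size=6`, the second chunk contains the end of the LangChain sentence and the start of the Eiffel Tower sentence, which is exactly the kind of half-cut idea that semantic chunking tries to avoid.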

🔹 Benefits

Preserves context → LLM sees complete ideas.

Can reduce hallucination → the LLM is less likely to fill in gaps left by half-cut concepts.

Improves retrieval accuracy → relevant chunks match queries better.

🔹 When to Use

Long documents with multiple sections (research papers, legal docs, manuals).

Knowledge bases where context boundaries matter.

RAG pipelines where answer quality depends on precise retrieval.


###############################################################

✅ Summary in One Line
Semantic chunking = intelligent splitting of documents into meaning-preserving chunks, unlike fixed-size chunking.

In [None]:
import os
from dotenv import load_dotenv
load_dotenv()
# Access keys
groq_api_key = os.getenv("GROQ_AI_API")
hf_api_key = os.getenv("HUGGINGFACE_API_KEY")

In [15]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

docs = """LangChain is a powerful framework for building applications with Large Language Models (LLMs).
It provides modular abstractions to connect and combine LLMs with external tools like OpenAI, Pinecone, and other data sources.
Using LangChain, you can create pipelines that include chains, agents, memory modules, and retrievers to handle complex tasks.
The Eiffel Tower is located in Paris.
France is widely known as a popular tourist destination.
"""

# Split into sentences (this sample has one sentence per line)
sentences = [s.strip() for s in docs.split("\n") if s.strip()]

# Load embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(sentences)

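# Minimum cosine similarity for a sentence to join the current chunk
# (0.31 is the value used for this sample text; other data may need tuning)
similarity_threshold = 0.31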
semantic_chunks = []

# Initialize first chunk
current_chunk = sentences[0]
current_chunk_embeddings = [embeddings[0]]

# Iterate through sentences
for i in range(1, len(sentences)):
    # Compute mean embedding of current chunk
    chunk_mean = np.mean(current_chunk_embeddings, axis=0).reshape(1, -1)
    sim = cosine_similarity(chunk_mean, embeddings[i].reshape(1, -1))[0][0]
    
    if sim >= similarity_threshold:
        current_chunk += " " + sentences[i]
        current_chunk_embeddings.append(embeddings[i])
    else:
        semantic_chunks.append(current_chunk)
        current_chunk = sentences[i]
        current_chunk_embeddings = [embeddings[i]]

# Append the last chunk
semantic_chunks.append(current_chunk)

# Display chunks
for idx, chunk in enumerate(semantic_chunks):
    print(f"Chunk {idx+1}: {chunk}\n")


Chunk 1: LangChain is a powerful framework for building applications with Large Language Models (LLMs). It provides modular abstractions to connect and combine LLMs with external tools like OpenAI, Pinecone, and other data sources. Using LangChain, you can create pipelines that include chains, agents, memory modules, and retrievers to handle complex tasks.

Chunk 2: The Eiffel Tower is located in Paris. France is widely known as a popular tourist destination.
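The merge rule used in the loop above can be seen in isolation with a few toy vectors. This is only a sketch: the 2-D "embeddings" below are made up for illustration (real `all-MiniLM-L6-v2` vectors have 384 dimensions), and the helper names `cosine` and `decide` are hypothetical.

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical embeddings of the sentences already in the current chunk
chunk_vectors = np.array([[1.0, 0.0], [0.8, 0.6]])
# The chunk is represented by the mean of its sentence embeddings
centroid = chunk_vectors.mean(axis=0)

threshold = 0.31


def decide(candidate: np.ndarray) -> str:
    """Return 'append' if the candidate stays in the chunk, else 'new chunk'."""
    return "append" if cosine(centroid, candidate) >= threshold else "new chunk"


on_topic = np.array([1.0, 0.2])    # points roughly the same way as the centroid
off_topic = np.array([-0.2, 1.0])  # points in a mostly different direction

print("on_topic  ->", decide(on_topic))
print("off_topic ->", decide(off_topic))
```

Because each new sentence is compared with the chunk mean rather than only the previous sentence, one slightly off-topic sentence does not necessarily end a chunk. The trade-off is that the mean can drift as a chunk grows.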



### Modular RAG Pipeline Code

In [16]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np  # used by ThresholdSemanticChunker for the chunk mean
from langchain.schema import Document
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.chat_models import init_chat_model

from langchain.schema.runnable import RunnableLambda, RunnableMap
from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

import os
from dotenv import load_dotenv
load_dotenv()
# Access keys
groq_api_key = os.getenv("GROQ_AI_API")
hf_api_key = os.getenv("HUGGINGFACE_API_KEY")

In [None]:
# Custom Semantic Chunking for RAG Pipelines with thresholding

class ThresholdSemanticChunker:
    def __init__(self, model_name='all-MiniLM-L6-v2', similarity_threshold=0.31):
        self.model = SentenceTransformer(model_name)
        self.similarity_threshold = similarity_threshold

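    def chunk(self, documents):
        # Treat each non-empty line of the input text as one sentence
        sentences = [s.strip() for s in documents.split("\n") if s.strip()]
        if not sentences:
            return []  # nothing to chunk; avoids an IndexError on sentences[0]
        embeddings = self.model.encode(sentences)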

        semantic_chunks = []
        current_chunk = sentences[0]
        current_chunk_embeddings = [embeddings[0]]

        for i in range(1, len(sentences)):
            chunk_mean = np.mean(current_chunk_embeddings, axis=0).reshape(1, -1)
            sim = cosine_similarity(chunk_mean, embeddings[i].reshape(1, -1))[0][0]

            if sim >= self.similarity_threshold:
                current_chunk += " " + sentences[i]
                current_chunk_embeddings.append(embeddings[i])
            else:
                semantic_chunks.append(current_chunk)
                current_chunk = sentences[i]
                current_chunk_embeddings = [embeddings[i]]

        semantic_chunks.append(current_chunk)
        return semantic_chunks
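
The class above loads a `SentenceTransformer` in its constructor, so testing the merge logic requires a model download. One possible variation, sketched here under stated assumptions, passes the embedding step in as a callable. The names `chunk_by_threshold` and `toy_embed` are hypothetical, and `toy_embed` is a bag-of-words stand-in, not a real embedding model. The sketch uses only the standard library.

```python
import math
import re
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_vector(vectors: Sequence[Vector]) -> List[float]:
    """Element-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def chunk_by_threshold(
    sentences: List[str],
    embed: Callable[[List[str]], Sequence[Vector]],
    threshold: float,
) -> List[str]:
    """Same merge rule as the class above, with the embedder injected."""
    if not sentences:
        return []
    vectors = embed(sentences)
    chunks: List[str] = []
    current, current_vectors = [sentences[0]], [vectors[0]]
    for sentence, vector in zip(sentences[1:], vectors[1:]):
        if cosine(mean_vector(current_vectors), vector) >= threshold:
            current.append(sentence)
            current_vectors.append(vector)
        else:
            chunks.append(" ".join(current))
            current, current_vectors = [sentence], [vector]
    chunks.append(" ".join(current))
    return chunks


def toy_embed(sentences: List[str]) -> List[List[float]]:
    """Hypothetical stand-in for a model: word counts over a shared vocabulary."""
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    return [[float(tokens.count(word)) for word in vocab] for tokens in tokenized]


sentences = [
    "cats chase mice",
    "cats chase birds",
    "stocks fell today",
    "stocks rose today",
]
chunks = chunk_by_threshold(sentences, toy_embed, threshold=0.3)
for idx, chunk in enumerate(chunks, start=1):
    print(f"Chunk {idx}: {chunk}")
```

In a real pipeline, the `encode` method of a loaded `SentenceTransformer` could presumably be passed as `embed` in place of `toy_embed`, with the threshold re-tuned for that model.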
