**Chunking Techniques in Language Models**


Large Language Models (LLMs) can hallucinate, that is, generate incorrect or nonsensical information. In Retrieval-Augmented Generation (RAG), one way to reduce this is to supply the model with relevant source text while staying within its context window. Whole documents are often too long and too unfocused for that, so texts are broken into smaller parts, a process known as chunking. Well-chosen chunks make retrieval more accurate and efficient, which in turn improves the overall performance of a RAG pipeline.

**Types of Chunking Techniques**
1. Recursive Chunking

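  Recursive chunking splits text hierarchically. The splitter first tries the coarsest separator (for example, paragraph breaks); any piece that is still larger than the target chunk size is split again with the next, finer separator (line breaks, then spaces, then individual characters). This top-down approach keeps paragraphs and sentences together where possible while bringing every chunk down to a manageable size.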
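
To make the recursive idea concrete, here is a minimal pure-Python sketch. The function name `recursive_split` and the default separator order are assumptions chosen for illustration; it is not LangChain's implementation, and it leaves out chunk overlap.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters.

    Tries the coarsest separator first and recurses with finer ones
    only for pieces that are still too long. Illustrative sketch only.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut at fixed character offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small neighbours
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Still too long: retry this piece with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "alpha beta\n\ngamma delta epsilon\nzeta"
print(recursive_split(sample, chunk_size=12))
```

Pieces are merged greedily at each level, so short neighbours stay together, and only oversized pieces fall through to a finer separator.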

2. Semantic Chunking

  Semantic chunking breaks text into contextually related chunks. Instead of cutting at a fixed size, it embeds the sentences and places chunk boundaries where the meaning shifts, that is, where the embedding distance between neighbouring sentences is unusually large. It is typically slower than recursive chunking because every sentence must be passed through an embedding model.
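
The percentile idea behind semantic chunking can be sketched in plain Python. Everything here is an assumption made for illustration: `toy_embed` is a bag-of-words stand-in for a real embedding model, and `semantic_split` is a simplified breakpoint rule, not the `SemanticChunker` implementation.

```python
import math
import re
from collections import Counter


def toy_embed(sentence):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine_distance(a, b):
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / norm if norm else 0.0)


def percentile_value(values, q):
    """Percentile with linear interpolation between the two nearest ranks."""
    ranked = sorted(values)
    pos = (len(ranked) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ranked[lo] + (ranked[hi] - ranked[lo]) * (pos - lo)


def semantic_split(sentences, percentile=95):
    """Group consecutive sentences, breaking where the adjacent distance is unusually high."""
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []
    vectors = [toy_embed(s) for s in sentences]
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    threshold = percentile_value(distances, percentile)
    chunks, current = [], [sentences[0]]
    # distances[i] is the gap between sentence i and sentence i + 1.
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sentences = [
    "Cats purr when cats are happy.",
    "Happy cats purr loudly.",
    "Stock markets fell sharply today.",
    "Markets fell as stock prices dropped.",
]
for chunk in semantic_split(sentences):
    print(chunk)
```

With a real embedding model the distances reflect meaning rather than shared words, but the breakpoint logic stays the same: only the largest gaps between neighbouring sentences become chunk boundaries.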


1. Install the required dependencies

In [7]:
!pip install -qU langchain_experimental langchain_openai langchain_community langchain ragas chromadb langchain-groq fastembed pypdf openai

2. Imports

In [24]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_experimental.text_splitter import SemanticChunker
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_groq import ChatGroq
from google.colab import userdata  # Colab-only: reads secrets stored with the notebook

# Groq API key, stored as a Colab secret named "Groq"
groq_api_key = userdata.get("Groq")
# Local embedding model, used for both semantic chunking and the vector stores
embed_model = FastEmbedEmbeddings(model_name="BAAI/bge-base-en-v1.5")



3. Fetch Sample Docs

In [9]:
! wget "https://arxiv.org/pdf/1810.04805.pdf"

--2024-10-23 05:55:26--  https://arxiv.org/pdf/1810.04805.pdf
Resolving arxiv.org (arxiv.org)... 151.101.131.42, 151.101.195.42, 151.101.3.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.131.42|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://arxiv.org/pdf/1810.04805 [following]
--2024-10-23 05:55:26--  http://arxiv.org/pdf/1810.04805
Connecting to arxiv.org (arxiv.org)|151.101.131.42|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 775166 (757K) [application/pdf]
Saving to: ‘1810.04805.pdf.1’


2024-10-23 05:55:26 (36.1 MB/s) - ‘1810.04805.pdf.1’ saved [775166/775166]



4. Process the PDF Content

In [10]:
loader = PyPDFLoader("1810.04805.pdf")
documents = loader.load()

print("Number of pages:", len(documents))


Number of pages: 16


5. Recursive Chunking (notice the overlap between chunks 1 and 2)

In [11]:
text_splitter = RecursiveCharacterTextSplitter(
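    # Sizes are in characters (length_function=len): chunks of up to 1000
    # characters, with up to 100 characters shared between neighbouring chunks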
    chunk_size=1000,
    chunk_overlap=100,
    length_function=len,
    is_separator_regex=False,
)

recursive_texts = text_splitter.split_documents(documents)
print("Chunk 1: ", recursive_texts[0].page_content)
print("Chunk 2: ", recursive_texts[1].page_content)

Chunk 1:  BERT: Pre-training of Deep Bidirectional Transformers for
Language Understanding
Jacob Devlin Ming-Wei Chang Kenton Lee Kristina Toutanova
Google AI Language
{jacobdevlin,mingweichang,kentonl,kristout }@google.com
Abstract
We introduce a new language representa-
tion model called BERT , which stands for
Bidirectional Encoder Representations from
Transformers. Unlike recent language repre-
sentation models (Peters et al., 2018a; Rad-
ford et al., 2018), BERT is designed to pre-
train deep bidirectional representations from
unlabeled text by jointly conditioning on both
left and right context in all layers. As a re-
sult, the pre-trained BERT model can be ﬁne-
tuned with just one additional output layer
to create state-of-the-art models for a wide
range of tasks, such as question answering and
language inference, without substantial task-
speciﬁc architecture modiﬁcations.
BERT is conceptually simple and empirically
powerful. It obtains new state-of-the-art re-
Chunk 2:  BERT i
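
The overlap between two consecutive chunks can also be checked programmatically. The helper below is a hypothetical utility written for this walkthrough, not part of LangChain: it returns the longest text that ends one chunk and begins the next.

```python
def shared_overlap(previous_chunk, next_chunk):
    """Return the longest suffix of previous_chunk that is also a prefix of next_chunk."""
    limit = min(len(previous_chunk), len(next_chunk))
    for size in range(limit, 0, -1):
        if previous_chunk.endswith(next_chunk[:size]):
            return next_chunk[:size]
    return ""


first = "pre-train deep models. BERT is simple"
second = "BERT is simple and powerful"
print(repr(shared_overlap(first, second)))
```

In the notebook it could be called as `shared_overlap(recursive_texts[0].page_content, recursive_texts[1].page_content)`. Chunks that come from different PDF pages are not expected to overlap, because `split_documents` splits each page separately.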

6. Semantic Chunking (Percentile)

In [25]:
text_splitter = SemanticChunker(embed_model, breakpoint_threshold_type="percentile")

semantic_texts = text_splitter.split_documents(documents)
print(semantic_texts[0].page_content)



BERT: Pre-training of Deep Bidirectional Transformers for
Language Understanding
Jacob Devlin Ming-Wei Chang Kenton Lee Kristina Toutanova
Google AI Language
{jacobdevlin,mingweichang,kentonl,kristout }@google.com
Abstract
We introduce a new language representa-
tion model called BERT , which stands for
Bidirectional Encoder Representations from
Transformers. Unlike recent language repre-
sentation models (Peters et al., 2018a; Rad-
ford et al., 2018), BERT is designed to pre-
train deep bidirectional representations from
unlabeled text by jointly conditioning on both
left and right context in all layers. As a re-
sult, the pre-trained BERT model can be ﬁne-
tuned with just one additional output layer
to create state-of-the-art models for a wide
range of tasks, such as question answering and
language inference, without substantial task-
speciﬁc architecture modiﬁcations. BERT is conceptually simple and empirically
powerful. It obtains new state-of-the-art re-
sults on eleven natural la
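
Because semantic chunks are not bound to a fixed size, comparing chunk counts and lengths is a quick way to see how the two strategies differ on the same document. The helper below is a hypothetical sketch that works on plain strings.

```python
from statistics import mean


def chunk_stats(chunks):
    """Summarise a list of chunk strings by count and character length."""
    lengths = [len(chunk) for chunk in chunks]
    if not lengths:
        return {"count": 0, "min": 0, "max": 0, "mean": 0.0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
    }


print(chunk_stats(["ab", "abcd", "abcdef"]))
```

In the notebook, the inputs would be `[d.page_content for d in recursive_texts]` and `[d.page_content for d in semantic_texts]`.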

7. Instantiate the Vectorstore

In [26]:
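# Give each strategy its own collection. Without collection_name, both calls are
# likely to write to Chroma's default collection, so each retriever would search
# a mix of both chunk sets.
recursive_chunk_vectorstore = Chroma.from_documents(recursive_texts, embedding=embed_model, collection_name="recursive_chunks")
semantic_chunk_vectorstore = Chroma.from_documents(semantic_texts, embedding=embed_model, collection_name="semantic_chunks")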


8. Instantiate Retrieval (Recursive Chunking)


In [46]:
recursive_chunk_retriever = recursive_chunk_vectorstore.as_retriever(search_kwargs={"k" : 1})
recursive_chunk_retriever.invoke("What is BERT is designed for?")


[Document(metadata={'page': 6, 'source': '1810.04805.pdf'}, page_content='BERT LARGE (Ens.+TriviaQA) 86.2 92.2 87.4 93.2\nTable 2: SQuAD 1.1 results. The BERT ensemble\nis 7x systems which use different pre-training check-\npoints and ﬁne-tuning seeds.\nSystem Dev Test\nEM F1 EM F1\nTop Leaderboard Systems (Dec 10th, 2018)\nHuman 86.3 89.0 86.9 89.5\n#1 Single - MIR-MRC (F-Net) - - 74.8 78.0\n#2 Single - nlnet - - 74.2 77.1\nPublished\nunet (Ensemble) - - 71.4 74.9\nSLQA+ (Single) - 71.4 74.4\nOurs\nBERT LARGE (Single) 78.7 81.9 80.0 83.1')]

9. Instantiate Retrieval (Semantic Chunking)

In [47]:
semantic_chunk_retriever = semantic_chunk_vectorstore.as_retriever(search_kwargs={"k" : 1})
semantic_chunk_retriever.invoke("What is BERT is designed for?")

[Document(metadata={'page': 6, 'source': '1810.04805.pdf'}, page_content='BERT LARGE (Ens.+TriviaQA) 86.2 92.2 87.4 93.2\nTable 2: SQuAD 1.1 results. The BERT ensemble\nis 7x systems which use different pre-training check-\npoints and ﬁne-tuning seeds.\nSystem Dev Test\nEM F1 EM F1\nTop Leaderboard Systems (Dec 10th, 2018)\nHuman 86.3 89.0 86.9 89.5\n#1 Single - MIR-MRC (F-Net) - - 74.8 78.0\n#2 Single - nlnet - - 74.2 77.1\nPublished\nunet (Ensemble) - - 71.4 74.9\nSLQA+ (Single) - 71.4 74.4\nOurs\nBERT LARGE (Single) 78.7 81.9 80.0 83.1')]
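
Conceptually, a retriever with `k = 1` returns the stored chunk whose embedding is closest to the query embedding. The sketch below shows that ranking step with hand-written toy vectors. Cosine similarity and the `top_k` helper are assumptions for illustration; the index structure and distance metric Chroma actually uses may differ.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vector, indexed_chunks, k=1):
    """indexed_chunks is a list of (text, vector) pairs; return the k most similar texts."""
    ranked = sorted(
        indexed_chunks,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy two-dimensional "embeddings", made up for the example.
index = [
    ("about cats", [1.0, 0.0]),
    ("about markets", [0.0, 1.0]),
    ("mixed", [0.7, 0.7]),
]
print(top_k([0.9, 0.1], index, k=1))
```

With `k = 1` a single poorly matching chunk becomes the entire context, so raising `k` is a common way to make retrieval more forgiving.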

10. Implementation into RAG (Semantic Chunking)

In [53]:
from langchain_core.prompts import ChatPromptTemplate

rag_template = """\
Use the following context to answer the user's query. If you cannot answer, please respond with 'I don't know'.

User's Query:
{question}

Context:
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(rag_template)
chat_model = ChatGroq(temperature=0,  # keeps answers as repeatable as possible
                      model_name="mixtral-8x7b-32768",
                      api_key=groq_api_key)  # key loaded in the imports cell
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# The retriever fills {context}; the raw query string is passed through as {question}
semantic_rag_chain = (
    {"context" : semantic_chunk_retriever, "question" : RunnablePassthrough()}
    | rag_prompt
    | chat_model
    | StrOutputParser()
)
semantic_rag_chain.invoke("What is BERT is designed for?")



'BERT (Bidirectional Encoder Representations from Transformers) is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. This means that BERT can understand the context of a word by looking at the words that come before and after it. After pre-training, the BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.'
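
Stripped of the LangChain plumbing, the step before the model call is string assembly: the retrieved chunks and the user's query are substituted into the prompt template. The sketch below mirrors that step in plain Python. `build_rag_prompt` is a hypothetical helper, and joining chunks with blank lines is an assumption; the chain above passes the retriever output to the template as-is.

```python
RAG_TEMPLATE = (
    "Use the following context to answer the user's query. "
    "If you cannot answer, please respond with 'I don't know'.\n\n"
    "User's Query:\n{question}\n\n"
    "Context:\n{context}\n"
)


def build_rag_prompt(question, retrieved_chunks):
    """Fill the template with the query and the retrieved chunk texts."""
    context = "\n\n".join(retrieved_chunks)
    return RAG_TEMPLATE.format(question=question, context=context)


print(build_rag_prompt("What is BERT?", ["chunk one", "chunk two"]))
```

Whatever ends up in the context slot is all the model sees of the document, which is why the chunking strategy affects the final answer.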

11. Implementation into RAG (Recursive Chunking)

In [54]:

recursive_rag_chain = (
    {"context" : recursive_chunk_retriever, "question" : RunnablePassthrough()}
    | rag_prompt
    | chat_model
    | StrOutputParser()
)
recursive_rag_chain.invoke("What is BERT is designed for?")


'BERT (Bidirectional Encoder Representations from Transformers) is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. The pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. It consists of two steps: pre-training and fine-tuning. During pre-training, the model is trained on unlabeled data over different pre-training tasks. For fine-tuning, the BERT model is initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labeled data from the downstream tasks. Each downstream task has separate fine-tuned models.'

**Conclusion**

Although the results from both chunking techniques are almost the same, recursive chunking produced the more detailed answer, elaborating on BERT's two stages (pre-training and fine-tuning), while semantic chunking produced a more concise, high-level explanation of BERT's design. In this run, that suggests recursive chunking is better at surfacing extensive information and semantic chunking at giving focused, high-level responses. The comparison rests on a single query with k = 1, for which both retrievers returned the same chunk, so it is indicative rather than conclusive. Depending on the purpose, one strategy may be more appropriate than the other.