In [1]:
# --- Create a RAG chatbot to answer questions based on Chanakya Neeti PDF ---
# important package installations
!pip install langchain openai pypdf faiss-cpu sentence-transformers
!pip install -U langchain-community pypdf
!pip install -qU langchain-openai

!pip install chromadb



In [2]:
# Mount google drive to get the PDF for Chanakya Neeti
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Set environment variables for the Hugging Face token and OpenAI API key

In [None]:
# Environment Variables
import os
import getpass
# -------------- USE YOUR OWN TOKEN ---------------
# key = ""
# if not os.environ.get("OPENAI_API_KEY"):
#   os.environ["OPENAI_API_KEY"] = key

# hf_token = ""
# if not os.environ.get("HF_TOKEN"):
#   os.environ["HF_TOKEN"] = hf_token

## Get PDF document and number of pages in the document


In [3]:
import os
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
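# The OpenAI and FAISS integrations now live in the partner/community packages
# installed above; the old langchain.* import paths are deprecated.
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS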
from langchain.chains import RetrievalQA
from sentence_transformers import SentenceTransformer

# --- Configuration ---
PDF_PATH = "/content/drive/MyDrive/Colab Notebooks/AI-ML(new)/ChanakyaNeeti_in_English.pdf"  # Replace with the actual path to your PDF
# Make sure your OpenAI API key is set as an environment variable
# os.environ["OPENAI_API_KEY"] = "sk-..."

# --- 1. Load the Document ---
print("Loading PDF document...")
loader = PyPDFLoader(PDF_PATH)
documents = loader.load()
print(f"Loaded {len(documents)} pages.")

Loading PDF document...
Loaded 14 pages.


### Using a Hybrid Text Splitter (MarkdownHeaderTextSplitter + RecursiveCharacterTextSplitter)
The combined strategy offers two benefits that work together to produce better chunks for a retrieval-augmented generation (RAG) system:

Maintaining Semantic Integrity: Running MarkdownHeaderTextSplitter first groups content by its logical structure. Instead of cutting text at an arbitrary character limit, it keeps whole sections (such as "Introduction" or "Soil Health") together, so a chunk retrieved for a query is more likely to carry the full context of a single topic. This stage only helps when the text actually contains Markdown headers; text without any comes back as a single section.

Optimizing for LLM Context Windows: RecursiveCharacterTextSplitter then breaks each section into smaller chunks of roughly chunk_size characters, with chunk_overlap characters shared between neighbouring chunks. This keeps every final chunk small enough to fit comfortably in an LLM prompt, and the overlap reduces the chance that a sentence is cut off at a chunk boundary, so long documents can be processed without discarding text.
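
The two stages described above can be sketched in plain Python without LangChain. This is a simplified stand-in, not the library's implementation: the helper names (`split_by_headers`, `split_by_size`, `two_stage_split`) are hypothetical, and stage 2 uses fixed-size overlapping windows rather than the separator-aware logic of RecursiveCharacterTextSplitter. It does show one behaviour worth knowing: text with no header lines comes back from stage 1 as a single section.

```python
import re
from typing import List


def split_by_headers(text: str) -> List[str]:
    """Stage 1: start a new section at every Markdown header line (#, ##, ###)."""
    sections, current = [], []
    for line in text.splitlines():
        if re.match(r"^#{1,3} ", line) and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return sections


def split_by_size(section: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Stage 2: cut a section into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not section:
        return []
    step = chunk_size - chunk_overlap
    chunks, start = [], 0
    while True:
        chunks.append(section[start:start + chunk_size])
        if start + chunk_size >= len(section):
            break
        start += step
    return chunks


def two_stage_split(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Apply stage 1, then stage 2 to each section, and return all chunks in order."""
    chunks = []
    for section in split_by_headers(text):
        chunks.extend(split_by_size(section, chunk_size, chunk_overlap))
    return chunks
```

Because chunks never cross a section boundary, a header and its body stay together as long as the section fits in `chunk_size`.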

In [4]:
from langchain.text_splitter import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter
from langchain.docstore.document import Document
from typing import List

In [23]:
def hybrid_text_splitter(markdown_text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[Document]:
    """
    A two-stage text splitting strategy for Markdown documents.

    This function first splits the document by markdown headers to maintain
    logical sections, then recursively splits each section into smaller chunks
    that fit within an LLM's context window.

    Args:
        markdown_text (str): The full text of the markdown document.
        chunk_size (int): The target size of the final text chunks.
        chunk_overlap (int): The number of characters to overlap between chunks.

    Returns:
        List[Document]: A list of LangChain Document objects, where each
                        document is a semantically-rich text chunk.
    """
    # Define headers to split on.
    # Each tuple pairs a header marker with the metadata key its text is stored under.
    headers_to_split_on = [
        ("#", "Header 1"),
        ("##", "Header 2"),
        ("###", "Header 3"),
    ]

    # --- Stage 1: Split by Markdown headers ---
    # This maintains the logical structure of the document.
    markdown_splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=headers_to_split_on
    )
    markdown_splits = markdown_splitter.split_text(markdown_text)

    print(f"Initial split by headers resulted in {len(markdown_splits)} sections.")

    # --- Stage 2: Recursive split on each header section ---
    # This ensures each final chunk is within the LLM's context window.
    recursive_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", " ", ""]
    )

    final_chunks = []
    for split in markdown_splits:
        # The split_documents method on the recursive splitter
        # takes a list of Document objects.
        section_chunks = recursive_splitter.split_documents([split])
        final_chunks.extend(section_chunks)

    print(f"After recursive splitting, we have {len(final_chunks)} final chunks.")
    return final_chunks

In [24]:
texts = []
for doc in documents:
  # print(doc.page_content)
  chunks = hybrid_text_splitter(doc.page_content)
  # print(chunks)
  texts.append(chunks)

# Flatten the per-page chunk lists into one list of chunk strings
final_texts = [item.page_content for sublist in texts for item in sublist]
print(len(final_texts))
print(final_texts)


Initial split by headers resulted in 1 sections.
After recursive splitting, we have 7 final chunks.
Initial split by headers resulted in 1 sections.
After recursive splitting, we have 6 final chunks.
Initial split by headers resulted in 1 sections.
After recursive splitting, we have 6 final chunks.
Initial split by headers resulted in 1 sections.
After recursive splitting, we have 6 final chunks.
Initial split by headers resulted in 1 sections.
After recursive splitting, we have 6 final chunks.
Initial split by headers resulted in 1 sections.
After recursive splitting, we have 7 final chunks.
Initial split by headers resulted in 1 sections.
After recursive splitting, we have 7 final chunks.
Initial split by headers resulted in 1 sections.
After recursive splitting, we have 7 final chunks.
Initial split by headers resulted in 1 sections.
After recursive splitting, we have 7 final chunks.
Initial split by headers resulted in 1 sections.
After recursive splitting, we have 7 final chunks.


Check the nesting depth of the two lists built above: `texts` (one list of chunks per page) and `final_texts` (the flattened list of chunk strings used for RAG)

In [25]:
def check_dimensions(input_list):
    """
    Recursively checks the number of dimensions in a list.
    Returns 1 for a flat list, 2 for a 2D list, etc.
    Returns 0 if the input is not a list.
    """
    # Base case: if the input is not a list, it adds 0 to the count
    if not isinstance(input_list, list):
        return 0

    # Recursive step: add 1 for the current dimension and
    # call the function on the first element
    if input_list and isinstance(input_list[0], list):
        return 1 + check_dimensions(input_list[0])

    # If the list is empty or the first element is not a list, it's a flat list (1D)
    return 1

In [26]:
print(check_dimensions(texts))
print(check_dimensions(final_texts))


2
1
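
The depth of 2 comes from appending each page's chunk list as a single element. As a small sketch with hypothetical stand-in data (plain strings rather than Document objects), either of the following yields a flat list directly:

```python
from itertools import chain

# Hypothetical stand-in for per-page chunk lists.
pages = [["chunk a", "chunk b"], ["chunk c"], []]

# Option 1: flatten a nested list after it has been built.
flat_from_chain = list(chain.from_iterable(pages))

# Option 2: avoid nesting in the first place by using extend instead of append.
flat_from_extend = []
for page_chunks in pages:
    flat_from_extend.extend(page_chunks)
```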


## Use ChromaDB for Vector Embedding Storage
The embedding model is "all-MiniLM-L6-v2" from the Sentence Transformers library, a solid general-purpose choice for vectorizing text chunks from a PDF.

Why it is a popular and effective model for this task:

Excellent Performance: It is trained on a large corpus of sentence pairs for semantic similarity, so it captures the core meaning of short text chunks well.

Speed and Efficiency: It is a small, fast model that produces 384-dimensional vectors, which matters when embedding many chunks or when latency is a concern.

Good Starting Point: For most general-purpose text embedding tasks it offers a good balance of quality and speed, making it a sensible default to begin with.
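
To make "semantic similarity" concrete, here is a minimal sketch of cosine similarity, one common way to compare two embedding vectors. The three-dimensional vectors are made-up toy values for illustration only (real all-MiniLM-L6-v2 embeddings have 384 dimensions), and the metric a vector store actually uses depends on how it is configured.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as similar to nothing.
    return dot / (norm_a * norm_b)


# Toy vectors (hypothetical values, not real model output).
query_vec = [1.0, 0.0, 1.0]
close_vec = [0.9, 0.1, 0.8]   # points in nearly the same direction as the query
far_vec = [0.0, 1.0, 0.0]     # orthogonal to the query
```

A chunk whose vector points in nearly the same direction as the query vector scores close to 1, while an unrelated chunk scores near 0.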

In [27]:
from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.config import Settings
from chromadb.utils import embedding_functions

model_name = "all-MiniLM-L6-v2"

# Load the sentence-transformers model directly; embed_texts() below uses it
embedding_model = SentenceTransformer(model_name)

# Define Chroma embedding function
hf_embedder = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name=model_name
)

# Function to get embeddings
def embed_texts(texts):
    return embedding_model.encode(texts).tolist()

In [44]:
!rm -rf ./docs/chroma  # remove old database files if any


In [29]:
print(final_texts)

['Chanakya Niti-Shastra\nChanakya Niti-Shastra\n(The Political Ethics of Chanakya Pandit)\nTranslated by Miles Davis (Patita Pavana dasa)\nhttp://www.indiadivine.com/chanakya-niti-shastra.htm\nAbout 2,300 years ago the Greek conqueror Alexan-\nder the Great invaded the Indian sub-continent. His\noﬀensive upon the land’s patchwork of small Hindu em-\npires proved to be highly successful due to the disunity\nof the petty rulers. It was Chanakya Pandit who, feeling\ndeeply distressed at heart, searched for and discovered a\nqualiﬁed leader in the person of Chandragupta Maurya.\nAlthough a mere dasi-putra, that is, a son of a maidser-\nvant by the Magadha King Nanda, Chandragupta was\nhighly intelligent, courageous and physically powerful.\nChanakya cared little that by birth he should not have\ndared to approach the throne. A man of acute discre-\ntion, Chanakya desired only that a ruler of extraordi-\nnary capabilities be raised to the exalted post of King of', 'dared to approach the thr

In [45]:
from langchain.vectorstores import Chroma

persist_directory = 'docs/chroma/'

In [49]:
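# Init Chroma with a persistent on-disk client.
# Note: this path differs from persist_directory ('docs/chroma/') defined above.
chroma_client = chromadb.PersistentClient(path="./docs/chroma2")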

collection = chroma_client.get_or_create_collection(
    name="my_collection",
    embedding_function=hf_embedder
)


In [50]:
collection.add(
    documents=final_texts,
    embeddings = embed_texts(final_texts),
    ids=[f"id_{i}" for i in range(len(final_texts))]
)

In [60]:
import chromadb
from typing import List

def retrieve_documents(query, collection, top_k=3):
    """
    Retrieves documents from a ChromaDB collection based on a user query.
    It can either perform a semantic search or retrieve all documents for summarization.

    Args:
        query (str): The user's query string.
        collection (chromadb.Collection): The ChromaDB collection to query.
        top_k (int): The number of top results to return for a standard query.

    Returns:
        tuple: A tuple containing a list of documents and a list of their corresponding IDs.
    """
    # Keywords to check for a full document summarization request.
    summary_keywords = ["summarize", "summary", "entire document", "all documents", "full text"]

    # Check if the query contains any of the summary keywords (case-insensitive).
    is_summary_request = any(keyword in query.lower() for keyword in summary_keywords)

    if is_summary_request:
        print("Detected a summarization request for the entire document.")
        # Retrieve all documents from the collection for summarization.
        # The .get() method without a query performs a full retrieval.
        try:
            full_collection = collection.get(include=['documents'])
            docs = full_collection['documents']
            ids = full_collection['ids']
            print(f"Retrieved a total of {len(docs)} documents.")
            return docs, ids
        except Exception as e:
            print(f"Error retrieving all documents: {e}")
            return [], []
    else:
        print(f"Performing semantic search for the top {top_k} documents for the query: '{query}'")
        try:
            # Step 1: Embed the user's query. Wrapping it in a list yields a list of
            # embeddings (one per query), the shape query_embeddings expects.
            query_embedding = embed_texts([query])

            # Step 2: Query the Chroma collection with the embedding.
            results = collection.query(
                query_embeddings=query_embedding,
                n_results=top_k
            )

            # Extract the documents and their IDs from the first result set.
            docs = results['documents'][0]
            ids = results['ids'][0]
            print(f"Found {len(docs)} relevant documents.")
            return docs, ids
        except Exception as e:
            print(f"Error during document retrieval: {e}")
            return [], []
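
The routing decision inside `retrieve_documents` can be isolated as a small pure function, which makes it easy to test without a ChromaDB collection. This is a sketch only: `is_summary_request` as a standalone function and the `SUMMARY_KEYWORDS` constant are hypothetical names, although the keyword list and the substring check mirror the code above.

```python
from typing import Sequence

# Same keywords as in retrieve_documents above.
SUMMARY_KEYWORDS = ("summarize", "summary", "entire document", "all documents", "full text")


def is_summary_request(query: str, keywords: Sequence[str] = SUMMARY_KEYWORDS) -> bool:
    """Return True if the query contains any summary keyword (case-insensitive substring match)."""
    lowered = query.lower()
    return any(keyword in lowered for keyword in keywords)
```

Because this is a plain substring match, a narrow question such as "Give me a summary of chapter 3" also triggers full-collection retrieval, which may or may not be the intended behaviour.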

In [62]:
# Example 1: A specific query
query = "What is the importance of discipline?"
docs, ids = retrieve_documents(query, collection)
print("\n--- Specific Query Results ---")
print("Retrieved ", len(docs), " documents")
for i, doc in enumerate(docs):
    print(f"Document {ids[i]}: {doc}")

# Example 2: A summarization query
query = "Please summarize the entire document."
docs, ids = retrieve_documents(query, collection)
print("\n--- Summary Request Results ---")
print("Retrieved ", len(docs), " documents")
for i, doc in enumerate(docs):
    print(f"Document {ids[i]}: {doc}")

Performing semantic search for the top 3 documents for the query: 'What is the importance of discipline?'
Found 3 relevant documents.

--- Specific Query Results ---
Retrieved  3  documents
Document id_32: a lion is that whatever a man intends doing should be
done by him with a whole-hearted and strenuous eﬀort.
17. The wise man should restrain his senses like the
crane and accomplish his purpose with due knowledge
of his place, time and ability.
18. To wake at the proper time; to take a bold stand
and ﬁght; to make a fair division (of property) among
relations; and to earn one’s own bread by personal ex-
ertion are the four excellent things to be learned from a
cock.
19. Union in privacy (with one’s wife); boldness; storing
away useful items; watchfulness; and not easily trusting
others; these ﬁve things are to be learned from a crow.
20. Contentment with little or nothing to eat although
one may have a great appetite; to awaken instantly al-
though one may be in a deep slumber; unﬂin

## Adding the LLM API to answer from the retrieved context

In [70]:
from huggingface_hub import InferenceClient

def generate_answer(query, collection, top_k=3):
    docs, ids = retrieve_documents(query, collection, top_k=top_k)

    # Create a prompt for the LLM
    context = "\n\n".join([f"[{i+1}] {doc}" for i, doc in enumerate(docs)])
    prompt = f"""
                You are an expert assistant. Answer the following question using ONLY the documents provided below.
                Cite each reference in square brackets corresponding to the documents used.

                Documents:
                {context}

                Question:
                {query}

                Answer with citations and use pointers if necessary.
            """

    messages = [{"role": "user", "content": prompt}]

    client = InferenceClient("Qwen/Qwen2.5-Coder-32B-Instruct")

    response = client.chat_completion(messages, max_tokens=1000)

    answer = "Assistant: \n" + str(response.choices[0].message["content"])
    return answer
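
The prompt-building step in `generate_answer` can be sketched on its own, without calling any model. The function name `build_prompt` is hypothetical and the wording is shortened; the numbering scheme (`[1]`, `[2]`, ...) matches the code above. Using `textwrap.dedent` is one way to keep the template indented in the source while sending the model a prompt without leading whitespace.

```python
from textwrap import dedent
from typing import List


def build_prompt(docs: List[str], query: str) -> str:
    """Number the retrieved chunks and place them in a citation-style prompt."""
    context = "\n\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(docs))
    # Dedent the template first, then substitute, so multi-line chunks
    # cannot interfere with the indentation removal.
    template = dedent(
        """\
        You are an expert assistant. Answer the question using ONLY the documents below.
        Cite each reference in square brackets corresponding to the documents used.

        Documents:
        {context}

        Question:
        {query}
        """
    )
    return template.format(context=context, query=query)
```

Keeping the number-to-chunk mapping in one place also makes it possible to map a citation like `[2]` in the model's answer back to the chunk it refers to.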

In [71]:
query = "What does the book say about discipline?"
answer = generate_answer(query, collection, 5)
print(answer)

Performing semantic search for the top 5 documents for the query: 'What does the book say about discipline?'
Found 5 relevant documents.
Assistant: 
The book emphasizes the importance of discipline through several points:

- **Discipline in Learning and Behavior**: "Let not a single day pass without your learning a verse, half a verse, or a fourth of it, or even one letter of it; nor without attending to charity, study and other pious activity." [5:13]
- **Discipline in Parenting**: "Those parents who do not educate their sons are their enemies; for as is a crane among swans, so are ignorant sons in a public assembly." [5:11]
- **Discipline Through Chastisement**: "Many a bad habit is developed through overindulgence, and many a good one by chastisement, therefore beat your son as well as your pupil; never indulge them." [5:12]
- **Discipline in Personal Conduct**: "The wise man should restrain his senses like the crane and accomplish his purpose with due knowledge of his place, time a

In [72]:
query = "Summarize the lessons in the book"
answer = generate_answer(query, collection, 5)
print(answer)

Detected a summarization request for the entire document.
Retrieved a total of 92 documents.
Assistant: 
The book "Chanakya Niti-Shastra" offers a comprehensive set of ethical and practical guidelines for living a virtuous life and governing effectively. Here are the key lessons summarized with citations:

- **Importance of a Qualified Leader**: Chanakya Pandit recognized the need for a capable ruler to protect the nation from external threats like the Greeks. He identified Chandragupta Maurya as the ideal candidate due to his intelligence, courage, and physical strength [1].

- **Strategic Elimination of Enemies**: Chanakya's personal vendetta against King Nanda and his sons led to their swift demise, clearing the path for Chandragupta's rise to power [2].

- **Uniting the Subcontinent**: After defeating the Nandas and Greeks, Chanakya used his political acumen to unify much of the Indian subcontinent under Chandragupta's rule, showcasing his strategic brilliance [3].

- **Application