In [1]:
# --- Create a RAG chatbot to answer questions based on Chanakya Neeti PDF ---
# important package installations
!pip install langchain openai pypdf faiss-cpu sentence-transformers
!pip install -U langchain-community pypdf
!pip install -qU langchain-openai

!pip install chromadb

Collecting pypdf
  Downloading pypdf-6.0.0-py3-none-any.whl.metadata (7.1 kB)
Collecting faiss-cpu
  Downloading faiss_cpu-1.12.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (5.1 kB)
Downloading pypdf-6.0.0-py3-none-any.whl (310 kB)
Downloading faiss_cpu-1.12.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (31.4 MB)
Installing collected packages: pypdf, faiss-cpu
Successfully installed faiss-cpu-1.12.0 pypdf-6.0.0
Collecting langchain-community
  Downloading langchain_community-0.3.29-py3-none-any.whl.metadata (2.9 kB)
Collecting requests<3,>=2.32.5 (from langchain-community)
  Downloading requests-2.32.5-py3-none-any.whl.metadata (4.9 kB)
Collecting dataclasses-json<0.7,>=0.6.7 (from langchain-commun

In [4]:
# Mount Google Drive to get the PDF for Chanakya Neeti
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Set environment variables for the Hugging Face token and OpenAI API key

In [None]:
# Environment Variables
import os
import getpass

# key = ""
# if not os.environ.get("OPENAI_API_KEY"):
#   os.environ["OPENAI_API_KEY"] = key

# hf_token = ""
# if not os.environ.get("HF_TOKEN"):
#   os.environ["HF_TOKEN"] = hf_token
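
One way to supply these secrets without pasting them into the notebook is to prompt for them only when they are missing. This is a minimal sketch: `ensure_env` is a hypothetical helper, not part of any library, and its prompt function is injectable so the logic can be exercised without an interactive terminal.

```python
import getpass
import os
from typing import Callable


def ensure_env(name: str, prompt: Callable[[str], str] = getpass.getpass) -> str:
    """Return environment variable `name`, prompting for it if it is unset or empty.

    Hypothetical helper for this notebook, not a library API.
    """
    value = os.environ.get(name)
    if not value:
        # getpass hides the typed value, so the key never appears in cell output
        value = prompt(f"Enter {name}: ")
        os.environ[name] = value
    return value
```

Calling `ensure_env("OPENAI_API_KEY")` or `ensure_env("HF_TOKEN")` in a cell would then ask for the value once per session and reuse it afterwards.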

## Get PDF document and number of pages in the document


In [6]:
import os
from langchain_community.document_loaders import PyPDFLoader

# --- Configuration ---
PDF_PATH = "/content/drive/MyDrive/Colab Notebooks/AI-ML(new)/ChanakyaNeeti_in_English.pdf"  # Replace with the actual path to your PDF
# Make sure your OpenAI API key is set as an environment variable
# os.environ["OPENAI_API_KEY"] = "sk-..."

# --- 1. Load the Document ---
print("Loading PDF document...")
loader = PyPDFLoader(PDF_PATH)
documents = loader.load()
print(f"Loaded {len(documents)} pages.")

Loading PDF document...
Loaded 14 pages.


### Using a hybrid text splitter (MarkdownHeaderTextSplitter + RecursiveCharacterTextSplitter)
The combined strategy offers two benefits that work together to produce better chunks for a retrieval-augmented generation (RAG) system:

Maintaining semantic integrity: Running MarkdownHeaderTextSplitter first groups the content by its logical structure. Instead of cutting text at an arbitrary character limit, it keeps each headed section together, so a chunk retrieved for a query is more likely to contain the complete context of a single topic. This stage only has an effect when the text actually contains Markdown headers (`#`, `##`, `###`); text without them passes through as a single section.

Optimizing for LLM context windows: RecursiveCharacterTextSplitter then breaks those sections into smaller chunks (by default here, up to 1000 characters with a 200-character overlap). This keeps each final chunk small enough that several retrieved chunks fit in an LLM's limited context window, and the overlap reduces the chance that a passage cut at a chunk boundary loses its surrounding context.
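
The two stages can be illustrated without LangChain. The sketch below is a simplified stand-in, not the library's implementation: `split_by_headers` and `split_with_overlap` are hypothetical helpers, and the second stage uses plain fixed-size character windows rather than the recursive separator logic of RecursiveCharacterTextSplitter.

```python
import re
from typing import List, Tuple


def split_by_headers(markdown_text: str) -> List[Tuple[str, str]]:
    """Stage 1: group lines under the most recent #, ## or ### header."""
    # Text that appears before any header is collected under an empty title
    sections: List[Tuple[str, List[str]]] = [("", [])]
    for line in markdown_text.splitlines():
        match = re.match(r"^(#{1,3})\s+(.*)$", line)
        if match:
            sections.append((match.group(2).strip(), []))
        else:
            sections[-1][1].append(line)

    result = []
    for title, body_lines in sections:
        body = "\n".join(body_lines).strip()
        if body:  # skip sections that have no body text
            result.append((title, body))
    return result


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Stage 2: cut text into chunk_size-character windows overlapping by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the remaining tail is already covered by the previous window
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


sample = "# Intro\nChanakya advises restraint.\n## Details\n" + "word " * 10
chunks = [
    (title, chunk)
    for title, body in split_by_headers(sample)
    for chunk in split_with_overlap(body, chunk_size=20, chunk_overlap=5)
]
for title, chunk in chunks:
    print(f"{title!r}: {chunk!r}")
```

Each chunk keeps the title of the section it came from, which plays the same role as the header metadata that the LangChain splitter attaches to its documents.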

In [7]:
from langchain.text_splitter import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter
from langchain.docstore.document import Document
from typing import List

In [8]:
def hybrid_text_splitter(markdown_text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[Document]:
    """
    A two-stage text splitting strategy for Markdown documents.

    This function first splits the document by markdown headers to maintain
    logical sections, then recursively splits each section into smaller chunks
    that fit within an LLM's context window.

    Args:
        markdown_text (str): The full text of the markdown document.
        chunk_size (int): The target size of the final text chunks.
        chunk_overlap (int): The number of characters to overlap between chunks.

    Returns:
        List[Document]: A list of LangChain Document objects, where each
                        document is a semantically-rich text chunk.
    """
    # Define headers to split on.
    # Each tuple pairs a Markdown header marker with the metadata key
    # under which the matching header text is stored.
    headers_to_split_on = [
        ("#", "Header 1"),
        ("##", "Header 2"),
        ("###", "Header 3"),
    ]

    # --- Stage 1: Split by Markdown headers ---
    # This maintains the logical structure of the document.
    markdown_splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=headers_to_split_on
    )
    markdown_splits = markdown_splitter.split_text(markdown_text)

    print(f"Initial split by headers resulted in {len(markdown_splits)} sections.")

    # --- Stage 2: Recursive split on each header section ---
    # This ensures each final chunk is within the LLM's context window.
    recursive_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", " ", ""]
    )

    final_chunks = []
    for split in markdown_splits:
        # The split_documents method on the recursive splitter
        # takes a list of Document objects.
        section_chunks = recursive_splitter.split_documents([split])
        final_chunks.extend(section_chunks)

    print(f"After recursive splitting, we have {len(final_chunks)} final chunks.")
    return final_chunks

In [9]:
texts = []
for doc in documents:
  # print(doc.page_content)
  chunks = hybrid_text_splitter(doc.page_content)
  # print(chunks)
  texts.append(chunks)

# Flatten the list using a list comprehension
final_texts = [item.page_content for sublist in texts for item in sublist]
print(len(final_texts))
print(final_texts)


Initial split by headers resulted in 1 sections.
After recursive splitting, we have 7 final chunks.
Initial split by headers resulted in 1 sections.
After recursive splitting, we have 6 final chunks.
Initial split by headers resulted in 1 sections.
After recursive splitting, we have 6 final chunks.
Initial split by headers resulted in 1 sections.
After recursive splitting, we have 6 final chunks.
Initial split by headers resulted in 1 sections.
After recursive splitting, we have 6 final chunks.
Initial split by headers resulted in 1 sections.
After recursive splitting, we have 7 final chunks.
Initial split by headers resulted in 1 sections.
After recursive splitting, we have 7 final chunks.
Initial split by headers resulted in 1 sections.
After recursive splitting, we have 7 final chunks.
Initial split by headers resulted in 1 sections.
After recursive splitting, we have 7 final chunks.
Initial split by headers resulted in 1 sections.
After recursive splitting, we have 7 final chunks.


Check the sizes of the nested list (`texts`, one entry per page) and the flattened list (`final_texts`, one entry per chunk) created for RAG.
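
A quick way to run this check is to compare the per-page chunk counts with the length of the flattened list. The sketch below uses made-up stand-in data (plain strings) in place of the real `texts` and `final_texts`:

```python
from itertools import chain

# Stand-in for `texts`: one inner list of chunks per page (made-up data)
nested_chunks = [["p1 chunk a", "p1 chunk b"], ["p2 chunk a"], []]

# Flatten one level, equivalent to the nested list comprehension
flat_chunks = list(chain.from_iterable(nested_chunks))

chunks_per_page = [len(page) for page in nested_chunks]
print(f"pages: {len(nested_chunks)}, chunks per page: {chunks_per_page}")
print(f"total chunks: {len(flat_chunks)}")

# The flattened length must equal the sum of the per-page counts
assert len(flat_chunks) == sum(chunks_per_page)
```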

## Use ChromaDB for Vector Embedding Storage
The embedding model is all-MiniLM-L6-v2 from the Sentence Transformers library, a solid general-purpose choice for vectorizing text chunks from a PDF:

Semantic quality: It is trained for semantic similarity on a large corpus of sentence pairs, so it captures the core meaning of each text chunk well.

Speed and efficiency: It is a small, fast model, which matters when processing a large number of documents or when latency is a concern.

Good starting point: For most general-purpose text embedding tasks it balances quality and efficiency, making it a sensible default to begin with.
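
Once chunks are embedded, retrieval reduces to comparing vectors: the query is embedded too, and chunks are ranked by cosine similarity to it. A minimal sketch with made-up 3-dimensional vectors standing in for real embeddings (all-MiniLM-L6-v2 produces 384-dimensional ones); `cosine` and `rank_by_similarity` are illustrative helpers, not library functions:

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query: Sequence[float],
                       chunk_vectors: Dict[str, Sequence[float]]) -> List[str]:
    """Return chunk ids ordered from most to least similar to the query."""
    return sorted(chunk_vectors,
                  key=lambda chunk_id: cosine(query, chunk_vectors[chunk_id]),
                  reverse=True)


# Made-up vectors; real ones would come from the embedding model
toy_vectors = {
    "id_0": [0.9, 0.1, 0.0],
    "id_1": [0.0, 1.0, 0.0],
    "id_2": [0.7, 0.7, 0.0],
}
toy_query = [1.0, 0.0, 0.0]
ranking = rank_by_similarity(toy_query, toy_vectors)
print(ranking)
```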

In [10]:
from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.config import Settings
from chromadb.utils import embedding_functions

model_name = "all-MiniLM-L6-v2"

# Load sentence-transformers model directly (optional, to check)
embedding_model = SentenceTransformer(model_name)

# Define Chroma embedding function
hf_embedder = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name=model_name
)

# Function to get embeddings
def embed_texts(texts):
    return embedding_model.encode(texts).tolist()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [11]:
!rm -rf ./docs/chroma  # remove old database files if any


In [12]:
print(final_texts)

['Chanakya Niti-Shastra\nChanakya Niti-Shastra\n(The Political Ethics of Chanakya Pandit)\nTranslated by Miles Davis (Patita Pavana dasa)\nhttp://www.indiadivine.com/chanakya-niti-shastra.htm\nAbout 2,300 years ago the Greek conqueror Alexan-\nder the Great invaded the Indian sub-continent. His\noﬀensive upon the land’s patchwork of small Hindu em-\npires proved to be highly successful due to the disunity\nof the petty rulers. It was Chanakya Pandit who, feeling\ndeeply distressed at heart, searched for and discovered a\nqualiﬁed leader in the person of Chandragupta Maurya.\nAlthough a mere dasi-putra, that is, a son of a maidser-\nvant by the Magadha King Nanda, Chandragupta was\nhighly intelligent, courageous and physically powerful.\nChanakya cared little that by birth he should not have\ndared to approach the throne. A man of acute discre-\ntion, Chanakya desired only that a ruler of extraordi-\nnary capabilities be raised to the exalted post of King of', 'dared to approach the thr

In [13]:
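# Directory where Chroma persists the collection; this is the path cleared by the `rm -rf` cell above
persist_directory = 'docs/chroma/'

In [14]:
# Initialize a persistent Chroma client in that directory
chroma_client = chromadb.PersistentClient(path=persist_directory)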

collection = chroma_client.get_or_create_collection(
    name="my_collection",
    embedding_function=hf_embedder
)


In [15]:
collection.add(
    documents=final_texts,
    embeddings = embed_texts(final_texts),
    ids=[f"id_{i}" for i in range(len(final_texts))]
)

In [16]:
!pip install scikit-learn



## Using Maximal Marginal Relevance (MMR)
MMR is a ranking algorithm that selects items (e.g., documents, sentences, embeddings) that are both relevant to a query and diverse from each other.
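
At each step MMR scores every remaining candidate as `lambda * relevance - (1 - lambda) * redundancy`, where relevance is the similarity to the query and redundancy is the highest similarity to any item already selected. The toy example below is a pure-Python sketch with made-up 2-dimensional vectors; `cosine` and `mmr_select` are illustrative helpers, not library functions.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query: Sequence[float], docs: List[Sequence[float]],
               lambda_param: float = 0.6, top_k: int = 2) -> List[int]:
    """Greedily pick up to top_k document indices by MMR score."""
    selected: List[int] = []
    candidates = list(range(len(docs)))
    while candidates and len(selected) < top_k:
        def score(idx: int) -> float:
            relevance = cosine(query, docs[idx])
            # Highest similarity to anything already picked (0 for the first pick)
            redundancy = max((cosine(docs[idx], docs[j]) for j in selected), default=0.0)
            return lambda_param * relevance - (1 - lambda_param) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


toy_query = [1.0, 0.0]
toy_docs = [
    [1.0, 0.0],    # most relevant
    [0.99, 0.14],  # near-duplicate of the first document
    [0.6, 0.8],    # less relevant, but different
]

relevance_only = mmr_select(toy_query, toy_docs, lambda_param=1.0)
diversity_heavy = mmr_select(toy_query, toy_docs, lambda_param=0.3)
print(relevance_only, diversity_heavy)
```

With `lambda_param=1.0` the near-duplicate is picked second because only relevance counts; with `lambda_param=0.3` the redundancy penalty outweighs its relevance and the more distinct third document is chosen instead.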

In [20]:
import chromadb
from typing import List
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def mmr(doc_embeddings, query_embedding, lambda_param=0.6, top_k=3):
    """Select up to top_k document indices by Maximal Marginal Relevance."""
    doc_embeddings = np.array(doc_embeddings)
    query_embedding = np.array(query_embedding).reshape(1, -1)

    selected = []
    candidates = list(range(len(doc_embeddings)))

    # Never ask for more documents than there are candidates
    for _ in range(min(top_k, len(candidates))):
        mmr_scores = []
        for idx in candidates:
            doc_vector = doc_embeddings[idx].reshape(1, -1)
            relevance = cosine_similarity(query_embedding, doc_vector)[0][0]
            # Redundancy: highest similarity to any already-selected document (0 if none yet)
            redundancy = max(
                (cosine_similarity(doc_vector, doc_embeddings[j].reshape(1, -1))[0][0] for j in selected),
                default=0,
            )
            score = lambda_param * relevance - (1 - lambda_param) * redundancy
            mmr_scores.append((score, idx))

        # Pick the candidate with the highest MMR score
        mmr_scores.sort(reverse=True)
        selected_idx = mmr_scores[0][1]
        selected.append(selected_idx)
        candidates.remove(selected_idx)

    return selected

def retrieve_documents(query, collection, top_k=3, lambda_param=0.6):
    """
    Retrieves documents from a ChromaDB collection based on a user query.
    Uses MMR to ensure diversity in the top-k results.

    Args:
        query (str): The user's query string.
        collection (chromadb.Collection): The ChromaDB collection to query.
        top_k (int): Number of top results to return.
        lambda_param (float): Trade-off between relevance and diversity in MMR.

    Returns:
        tuple: A tuple containing a list of documents and a list of their corresponding IDs.
    """
    summary_keywords = ["summarize", "summary", "entire document", "all documents", "full text"]
    is_summary_request = any(keyword in query.lower() for keyword in summary_keywords)

    if is_summary_request:
        print(f"Detected a summarization request for the entire document.")
        try:
            full_collection = collection.get(include=['documents'])
            docs = full_collection['documents']
            ids = full_collection['ids']
            print(f"Retrieved a total of {len(docs)} documents.")
            return docs, ids
        except Exception as e:
            print(f"Error retrieving all documents: {e}")
            return [], []
    else:
        print(f"Performing semantic search with MMR for the top {top_k} documents for the query: '{query}'")
        try:
            # Embed the query as a batch of one so Chroma receives a list of embeddings
            query_embedding = embed_texts([query])

            # Retrieve more candidates than top_k to allow MMR to work effectively
            candidate_results = collection.query(
                query_embeddings=query_embedding,
                n_results=top_k * 3,
                include=['documents', 'embeddings']
            )

            candidate_docs = candidate_results['documents'][0]
            candidate_ids = candidate_results['ids'][0]
            candidate_embeddings = candidate_results['embeddings'][0]

            selected_indices = mmr(candidate_embeddings, query_embedding, lambda_param=lambda_param, top_k=top_k)
            selected_docs = [candidate_docs[i] for i in selected_indices]
            selected_ids = [candidate_ids[i] for i in selected_indices]

            print(f"MMR selected {len(selected_docs)} diverse and relevant documents.")
            return selected_docs, selected_ids
        except Exception as e:
            print(f"Error during MMR-based document retrieval: {e}")
            return [], []


In [21]:
# Example 1: A specific query
query = "What is the importance of discipline?"
docs, ids = retrieve_documents(query, collection)
print("\n--- Specific Query Results ---")
print("Retrieved ", len(docs), " documents")
for i, doc in enumerate(docs):
    print(f"Document {ids[i]}: {doc}")

print("\n-------------------------------------------------------\n")

# Example 2: A summarization query
query = "Please summarize the entire document."
docs, ids = retrieve_documents(query, collection)
print("\n--- Summary Request Results ---")
print("Retrieved ", len(docs), " documents")
for i, doc in enumerate(docs):
    print(f"Document {ids[i]}: {doc}")

Performing semantic search with MMR for the top 3 documents for the query: 'What is the importance of discipline?'
MMR selected 3 diverse and relevant documents.

--- Specific Query Results ---
Retrieved  3  documents
Document id_32: a lion is that whatever a man intends doing should be
done by him with a whole-hearted and strenuous eﬀort.
17. The wise man should restrain his senses like the
crane and accomplish his purpose with due knowledge
of his place, time and ability.
18. To wake at the proper time; to take a bold stand
and ﬁght; to make a fair division (of property) among
relations; and to earn one’s own bread by personal ex-
ertion are the four excellent things to be learned from a
cock.
19. Union in privacy (with one’s wife); boldness; storing
away useful items; watchfulness; and not easily trusting
others; these ﬁve things are to be learned from a crow.
20. Contentment with little or nothing to eat although
one may have a great appetite; to awaken instantly al-
though one may

## Adding an LLM API to generate an answer grounded in the retrieved documents

In [22]:
from huggingface_hub import InferenceClient

def generate_answer(query, collection, top_k=3):
    docs, ids = retrieve_documents(query, collection, top_k=top_k)

    # Create a prompt for the LLM
    context = "\n\n".join([f"[{i+1}] {doc}" for i, doc in enumerate(docs)])
    prompt = f"""
                You are an expert assistant. Answer the following question using ONLY the documents provided below.
                Cite each reference in square brackets corresponding to the documents used.

                Documents:
                {context}

                Question:
                {query}

                Answer with citations and use pointers if necessary.
            """

    messages = [{"role": "user", "content": prompt}]

    client = InferenceClient("Qwen/Qwen2.5-Coder-32B-Instruct")

    response = client.chat_completion(messages, max_tokens=1000)

    answer = "Assistant: \n" + str(response.choices[0].message["content"])
    return answer
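
The prompt assembly can be separated from the API call so that it can be inspected without network access. A sketch: `build_prompt` is a hypothetical helper whose wording mirrors the prompt used in `generate_answer`, but without the leading indentation that the triple-quoted string carries into the request.

```python
from typing import List


def build_prompt(query: str, docs: List[str]) -> str:
    """Number the retrieved documents and wrap them in the instruction text."""
    # [1], [2], ... are the citation labels the model is asked to use
    context = "\n\n".join(f"[{i}] {doc}" for i, doc in enumerate(docs, start=1))
    return (
        "You are an expert assistant. Answer the following question using ONLY "
        "the documents provided below.\n"
        "Cite each reference in square brackets corresponding to the documents used.\n\n"
        f"Documents:\n{context}\n\n"
        f"Question:\n{query}\n\n"
        "Answer with citations and use pointers if necessary."
    )


demo_prompt = build_prompt("What is discipline?", ["first chunk", "second chunk"])
print(demo_prompt)
```

Printing the prompt for a known query is a cheap way to confirm that the citation labels line up with the documents returned by `retrieve_documents`.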

In [23]:
query = "What does the book say about discipline?"
answer = generate_answer(query, collection, 5)
print(answer)

Performing semantic search with MMR for the top 5 documents for the query: 'What does the book say about discipline?'
MMR selected 5 diverse and relevant documents.
Assistant: 
The book does not explicitly provide a comprehensive section on discipline. However, it does touch upon aspects related to discipline through various examples and principles. Here are some relevant points:

- **Self-Control and Restraint**: The wise man should restrain his senses like the crane and accomplish his purpose with due knowledge of his place, time, and ability [2:17].
- **Effort and Diligence**: A lion is characterized by doing whatever one intends with whole-hearted and strenuous effort [2:16].
- **Contentment and Self-Sufficiency**: Contentment with little or nothing to eat, despite having a great appetite, is a quality to be learned from the dog [2:20].
- **Watchfulness and Caution**: Watchfulness is one of the qualities to be learned from a crow [2:19].
- **Avoiding Sense Gratification**: Seeking 

In [24]:
query = "Summarize the lessons in the book"
answer = generate_answer(query, collection, 5)
print(answer)

Detected a summarization request for the entire document.
Retrieved a total of 92 documents.
Assistant: 
The book "Chanakya Niti-Shastra" offers a comprehensive guide to political ethics and wisdom, emphasizing the importance of virtue, knowledge, and practical wisdom in leadership and personal conduct. Here are the key lessons summarized with citations:

- **Importance of Virtuous Leadership**: Chanakya Pandit sought a capable leader, Chandragupta Maurya, to counter the Greek invasion, highlighting the need for virtuous and capable rulers [1].
- **Strategic Elimination of Opponents**: Chanakya's strategic elimination of the Nanda dynasty to pave the way for Chandragupta's rise to power underscores the importance of strategic thinking and decisive action [2].
- **Uniting the Subcontinent**: Under Chanakya's guidance, Chandragupta united much of the Indian subcontinent, demonstrating the power of political acumen and strategic alliances [3].
- **Application of Niti-Shastra**: Chanakya's