# Legal Chatbot Experiments

This notebook walks through loading legal PDF documents, splitting and embedding them, and storing them in ChromaDB for our Indian legal assistant chatbot. It then tests retrieval and a complete RAG chain on sample legal questions.

## 1. Setup and Imports

In [1]:
# Import necessary libraries
import os
from dotenv import load_dotenv
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_chroma import Chroma

# Load environment variables
load_dotenv()

# Set paths
DATA_DIR = "data"
CHROMA_DIR = "chroma_db"

## 2. Document Loading

First, we'll load every PDF in the data directory. Pages from the Constitution of India are collected in a separate list so they can be chunked and weighted differently in later steps.

In [2]:
def load_documents(data_dir=DATA_DIR):
    """Load PDF documents from the data directory"""
    documents = []
    constitution_docs = []
    
    # Get all PDF files from the data directory
    pdf_files = [os.path.join(data_dir, file) for file in os.listdir(data_dir) if file.endswith('.pdf')]
    
    print(f"Found {len(pdf_files)} PDF files")
    
    # Expected path of the Constitution of India PDF; each file is compared against it below
    constitution_path = os.path.join(data_dir, "constitutionOfIndia.pdf")
    
    # Load each PDF file
    for pdf_file in pdf_files:
        try:
            print(f"Loading {pdf_file}")
            loader = PyPDFLoader(pdf_file)
            loaded_docs = loader.load()
            
            # Separate Constitution of India from other documents
            if pdf_file == constitution_path:
                print("Found Constitution of India document - will give special weightage")
                constitution_docs.extend(loaded_docs)
            else:
                documents.extend(loaded_docs)
                
        except Exception as e:
            print(f"Error loading {pdf_file}: {e}")
    
    return documents, constitution_docs

# Load the documents
regular_docs, constitution_docs = load_documents()

print(f"Loaded {len(regular_docs)} regular document pages")
print(f"Loaded {len(constitution_docs)} Constitution document pages")

Found 41 PDF files
Loading data\20230302100 (1).pdf
Loading data\20230302100.pdf
Loading data\2023030213-1.pdf
Loading data\2023030215-1.pdf
Loading data\2023030215.pdf
Loading data\2023030216-3.pdf
Loading data\2023030219-2 (1).pdf
Loading data\2023030219-2.pdf
Loading data\2023030227.pdf
Loading data\2023030233.pdf
Loading data\2023030234-1.pdf
Loading data\2023030234-2.pdf
Loading data\2023030236-1.pdf
Loading data\2023030239-1.pdf
Loading data\2023030241.pdf
Loading data\2023030245.pdf
Loading data\2023030251-2.pdf
Loading data\2023030255-1.pdf
Loading data\2023030259-2.pdf
Loading data\2023030265.pdf
Loading data\2023030274-1.pdf
Loading data\2023030280-1.pdf
Loading data\2023030282-1.pdf
Loading data\2023030287-2.pdf
Loading data\2023030287-3.pdf
Loading data\2023030288-2.pdf
Loading data\2023030293-1.pdf
Loading data\2023030298.pdf
Loading data\20231019830058002.pdf
Loading data\20231019850001859 (1).pdf
Loading data\20231019850001859.pdf
Loading data\20240716890312078.pdf
Loadi
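
The loader above routes the Constitution by comparing full paths, so it depends on the file being named exactly `constitutionOfIndia.pdf`. As a minimal sketch, assuming a case-insensitive match on the file name alone is acceptable, the separation step could look like this. `partition_pdfs` is a hypothetical helper, not part of the notebook.

```python
from pathlib import Path

def partition_pdfs(paths, special_name="constitutionOfIndia.pdf"):
    """Split PDF paths into (special, regular) lists by file name.

    Hypothetical helper: only the final path component is compared,
    ignoring case, so the directory prefix does not matter.
    """
    special, regular = [], []
    for p in paths:
        if Path(p).name.lower() == special_name.lower():
            special.append(p)
        else:
            regular.append(p)
    return special, regular

special, regular = partition_pdfs(
    ["data/constitutionofindia.PDF", "data/2023030227.pdf"]
)
print("special:", special)
print("regular:", regular)
```

Note that a name match only catches the file we expect. A second copy of the same text saved under a different file name would still be treated as a regular document.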

## 3. Document Splitting

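Next, we'll split the documents into smaller chunks for embedding. Sizes are measured in characters (`length_function=len`): regular documents use 1000-character chunks with a 200-character overlap, while the Constitution uses 800 and 150 for finer granularity.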

In [3]:
def split_documents(documents, chunk_size=1000, chunk_overlap=200):
    """Split documents into chunks for embedding"""
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
    )
    
    chunks = text_splitter.split_documents(documents)
    print(f"Split {len(documents)} documents into {len(chunks)} chunks")
    
    return chunks

# Split regular documents
regular_chunks = split_documents(regular_docs)

# Split constitution documents with smaller chunk size for more granularity
constitution_chunks = split_documents(constitution_docs, chunk_size=800, chunk_overlap=150)

Split 2373 documents into 6089 chunks
Split 404 documents into 1471 chunks
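
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window illustration. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraph and line breaks); `sliding_window_chunks` is a hypothetical helper that only shows how consecutive chunks share `chunk_overlap` characters.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

text = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(text, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each window starts 6 characters after the previous one, so the last 4 characters of one chunk reappear at the start of the next. The overlap is what lets a sentence that straddles a boundary appear whole in at least one chunk.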


## 4. Initialize Embeddings Model

Now we'll initialize the HuggingFace embeddings model, `sentence-transformers/all-MiniLM-L6-v2`, running on CPU with normalized output vectors.

In [4]:
def get_embeddings_model():
    """Initialize and return the HuggingFace embeddings model"""
    model_name = "sentence-transformers/all-MiniLM-L6-v2"
    
    # Initialize the embeddings model
    embeddings = HuggingFaceEmbeddings(
        model_name=model_name,
        model_kwargs={'device': 'cpu'},
        encode_kwargs={'normalize_embeddings': True}
    )
    
    return embeddings

# Get embeddings model
embeddings = get_embeddings_model()
print("Embeddings model initialized")

Embeddings model initialized
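
The `normalize_embeddings=True` setting scales every vector to unit length. The sketch below uses plain Python lists in place of real embeddings to show why that matters: for unit vectors the dot product equals the cosine similarity, and the squared Euclidean distance equals `2 - 2 * cosine`, so ranking by either measure gives the same order. The helper names are illustrative only.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale vec to unit length."""
    norm = math.sqrt(dot(vec, vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [4.0, 3.0]
ua, ub = l2_normalize(a), l2_normalize(b)

cos_ab = cosine(a, b)                              # similarity of the raw vectors
dot_unit = dot(ua, ub)                             # dot product after normalization
sq_dist = sum((x - y) ** 2 for x, y in zip(ua, ub))  # squared Euclidean distance

print(cos_ab, dot_unit, sq_dist)
```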


## 5. Create Vector Store with Special Weightage for Constitution

We'll create a vector store that gives the Constitution of India extra weight by:
1. Adding the Constitution chunks three times to increase their representation
2. Tagging every chunk with `source_type` and `priority` metadata so Constitution chunks can be identified

In [5]:
# Add metadata to identify Constitution chunks
for chunk in constitution_chunks:
    if not chunk.metadata:
        chunk.metadata = {}
    chunk.metadata["source_type"] = "constitution"
    chunk.metadata["priority"] = "high"

# Add metadata to regular chunks
for chunk in regular_chunks:
    if not chunk.metadata:
        chunk.metadata = {}
    chunk.metadata["source_type"] = "regular"
    chunk.metadata["priority"] = "normal"

# Duplicate Constitution chunks to give them more weight (3x representation)
weighted_constitution = constitution_chunks * 3
print(f"Created {len(weighted_constitution)} weighted Constitution chunks")

# Combine all chunks, with Constitution chunks appearing multiple times
all_chunks = regular_chunks + weighted_constitution
print(f"Total chunks for vector store: {len(all_chunks)}")

Created 4413 weighted Constitution chunks
Total chunks for vector store: 10502
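
Because the three copies of each Constitution chunk are identical, they get identical embeddings and tie on similarity, so they can occupy several of the `k` result slots at once. Here is a minimal sketch of one way to collapse such repeats after retrieval. `Doc` is a stand-in for the retrieved document objects and `dedupe` is a hypothetical helper; neither is part of the notebook or of LangChain.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Stand-in for a retrieved document with page_content and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def dedupe(docs):
    """Drop repeats that share source, page, and text, keeping first-seen order."""
    seen = set()
    unique = []
    for doc in docs:
        key = (doc.metadata.get("source"), doc.metadata.get("page"), doc.page_content)
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

# Illustrative data: three identical copies of one chunk plus one other chunk
results = [
    Doc("Article 21 text", {"source": "c.pdf", "page": 10}),
    Doc("Article 21 text", {"source": "c.pdf", "page": 10}),
    Doc("Article 21 text", {"source": "c.pdf", "page": 10}),
    Doc("A judgment excerpt", {"source": "j.pdf", "page": 3}),
]
unique = dedupe(results)
print(len(results), "->", len(unique))
```

If we deduplicate this way, it may make sense to retrieve more than 5 results first so that enough distinct chunks remain afterwards.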


In [7]:
# Create and persist the vector store
def create_vector_store(documents, embeddings, persist_directory=CHROMA_DIR):
    """Create and persist a vector store from documents"""
    # Create the vector store with persistence
    vector_store = Chroma.from_documents(
        documents=documents,
        embedding=embeddings,
        persist_directory=persist_directory
    )
    
    # Note: In newer versions of langchain_chroma, persistence is handled automatically
    # when persist_directory is provided
    
    return vector_store

# Create the vector store
vector_store = create_vector_store(all_chunks, embeddings)
print(f"Vector store created and persisted to {CHROMA_DIR}")

Vector store created and persisted to chroma_db
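
One thing to watch: if the persist directory already holds a collection, `Chroma.from_documents` adds the new chunks to it rather than replacing it, so re-running the cell above grows the store. A minimal sketch of clearing the directory first, assuming a full rebuild is what we want. `reset_directory` is a hypothetical helper, demonstrated on a temporary directory rather than on `chroma_db`.

```python
import shutil
import tempfile
from pathlib import Path

def reset_directory(path):
    """Delete path if it exists, then recreate it empty."""
    p = Path(path)
    if p.exists():
        shutil.rmtree(p)
    p.mkdir(parents=True)
    return p

# Demonstration on a throwaway directory containing a stale file
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "chroma_db_demo"
    target.mkdir()
    (target / "stale.bin").write_text("old index data")
    reset_directory(target)
    remaining = list(target.iterdir())

print(remaining)
```

In the notebook this would be called as `reset_directory(CHROMA_DIR)` before `create_vector_store`, ideally in a fresh kernel so that no Chroma client still has the directory open.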


## 6. Test Retrieval

Let's test retrieval with `k=5` and count how many of the returned chunks come from the Constitution, to see whether it is given proper weightage.

In [8]:
# Test queries
test_queries = [
    "What are the fundamental rights in India?",
    "Explain Article 21 of the Indian Constitution",
    "What is the process for impeachment of the President of India?"
]

# Create a retriever
retriever = vector_store.as_retriever(search_kwargs={"k": 5})

# Test each query
for query in test_queries:
    print(f"\nQuery: {query}")
    docs = retriever.invoke(query)
    
    # Count how many results are from the Constitution
    constitution_count = sum(1 for doc in docs if doc.metadata.get("source_type") == "constitution")
    
    print(f"Retrieved {len(docs)} documents")
    print(f"Constitution documents: {constitution_count}")
    print(f"Other documents: {len(docs) - constitution_count}")
    
    # Print the first result
    if docs:
        print("\nFirst result:")
        print(f"Source: {docs[0].metadata.get('source')}")
        print(f"Type: {docs[0].metadata.get('source_type')}")
        print(f"Content preview: {docs[0].page_content[:200]}...")


Query: What are the fundamental rights in India?
Retrieved 5 documents
Constitution documents: 3
Other documents: 2

First result:
Source: data\2025050838.pdf
Type: regular
Content preview: participation of all stakeholders, the notion of representation in a democracy 
would be rendered hollow. [Paras 19, 21, 22]
Constitution of India – Fundamental rights – Conflict of – Voter's right ...

Query: Explain Article 21 of the Indian Constitution
Retrieved 5 documents
Constitution documents: 5
Other documents: 0

First result:
Source: data\constitutionOfIndia.pdf
Type: constitution
Content preview: THE CONSTITUTION OF  INDIA 
(Part XXI.—Temporary, Transitional and Special Provisions) 
235...

Query: What is the process for impeachment of the President of India?
Retrieved 5 documents
Constitution documents: 3
Other documents: 2

First result:
Source: data\20240716890312078.pdf
Type: regular
Content preview: THE CONSTITUTION OF  INDIA(Part V.—The Union)29"I,  A.B., do swear in the name of Go
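
Since every chunk carries a `source_type` tag, another option is to restrict a retriever to Constitution chunks with a metadata filter instead of relying on duplication. The sketch below only builds the `search_kwargs` dictionary. `retriever_kwargs` is a hypothetical helper, and the commented-out call assumes the Chroma retriever forwards a `filter` entry to its similarity search.

```python
def retriever_kwargs(k=5, source_type=None):
    """Build search_kwargs, optionally filtering on the source_type metadata tag."""
    kwargs = {"k": k}
    if source_type is not None:
        kwargs["filter"] = {"source_type": source_type}
    return kwargs

constitution_kwargs = retriever_kwargs(k=5, source_type="constitution")
print(constitution_kwargs)

# Intended use in this notebook (not executed here):
# constitution_retriever = vector_store.as_retriever(search_kwargs=constitution_kwargs)
```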

## 7. Test Specific Legal Queries

Let's test some specific legal queries covering public protest, arrest, freedom of speech, marriage age, and property inheritance.

In [12]:
# Specific legal queries
legal_queries = [
    "Is it legal to protest in public in India?",
    "What are my rights if I am arrested in India?",
    "Can the government restrict freedom of speech in India?",
    "What is the legal age of marriage in India?",
    "What are the laws regarding property inheritance in India?"
]

# Test each query
for query in legal_queries:
    print(f"\nQuery: {query}")
    docs = retriever.invoke(query)
    
    # Count how many results are from the Constitution
    constitution_count = sum(1 for doc in docs if doc.metadata.get("source_type") == "constitution")
    
    print(f"Retrieved {len(docs)} documents")
    print(f"Constitution documents: {constitution_count}")
    print(f"Other documents: {len(docs) - constitution_count}")
    
    # Print the first result
    if docs:
        print("\nFirst result:")
        print(f"Source: {docs[0].metadata.get('source')}")
        print(f"Type: {docs[0].metadata.get('source_type')}")
        print(f"Content preview: {docs[0].page_content[:200]}...")
        print("\n" + "-"*50)


Query: Is it legal to protest in public in India?
Retrieved 5 documents
Constitution documents: 0
Other documents: 5

First result:
Source: data\20240716890312078.pdf
Type: regular
Content preview: freedom of speech, etc.—(1) All citizens shall have the right—(a) to freedom of speech and expression;(b) to assemble peaceably and without arms;(c) to form associations or unions2[or co-operative soc...

--------------------------------------------------

Query: What are my rights if I am arrested in India?
Retrieved 5 documents
Constitution documents: 3
Other documents: 2

First result:
Source: data\20240716890312078.pdf
Type: regular
Content preview: THE CONSTITUTION OF  INDIA(Part III.—Fundamental Rights)12(2) Every person who is arrested and detained in custody shall be produced before the nearest magistrate within a period of twenty-four hours ...

--------------------------------------------------

Query: Can the government restrict freedom of speech in India?
Retrieved 5 documents
C

## 8. Test Complete RAG Chain

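Now let's test the complete RAG chain, which passes the retrieved chunks to an OpenAI chat model (`gpt-4o-mini`) to generate answers. This requires `OPENAI_API_KEY` to be available in the environment, for example through the `.env` file loaded by `load_dotenv()`.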

In [13]:
# Import necessary libraries for the RAG chain
from langchain_openai import ChatOpenAI
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.prompts import ChatPromptTemplate

def create_rag_chain(retriever):
    """Create a RAG chain with the retriever and LLM"""
    # Create the LLM
    llm = ChatOpenAI(model="gpt-4o-mini")
    
    # Create the prompt
    system_template = """You are an expert legal assistant specializing in Indian law. 
Use the following pieces of context to answer the user's question about Indian legal matters.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Keep your answers factual, precise, and based on the provided context.
Always cite the specific legal document, case, or section of the constitution you're referencing.

Context:
{context}

"""
    
    # The context stays in the system message; the question is sent as a human message
    prompt = ChatPromptTemplate.from_messages([
        ("system", system_template),
        ("human", "{input}"),
    ])
    
    # Create the document chain
    document_chain = create_stuff_documents_chain(llm, prompt)
    
    # Create the RAG chain
    rag_chain = create_retrieval_chain(retriever, document_chain)
    
    return rag_chain

# Create the RAG chain
rag_chain = create_rag_chain(retriever)

# Test the RAG chain with specific legal queries
for query in legal_queries:
    print(f"\nQuestion: {query}")
    response = rag_chain.invoke({"input": query})
    print("\nAnswer:")
    print(response["answer"])
    print("\n" + "-"*80)


Question: Is it legal to protest in public in India?

Answer:
Yes, it is legal to protest in public in India. Citizens have the right to assemble peaceably and without arms, as enshrined in Article 19(1)(b) of the Constitution of India. This right is consistent with the principles of a constitutional democracy, where citizens can express their views and grievances concerning government policies and actions. However, this right may be subject to reasonable restrictions as specified in Article 19(2), which outlines the grounds on which such restrictions can be imposed.

--------------------------------------------------------------------------------

Question: What are my rights if I am arrested in India?

Answer:
If you are arrested in India, your rights include the following, as outlined in Article 22 of the Constitution of India:

1. **Right to be Informed**: You have the right to be informed at the time of arrest of the reasons for your arrest.

2. **Right to Legal Counsel**: You ha
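
The system prompt asks the model to cite the document it relies on, but a chunk's text does not necessarily say which file it came from. A minimal sketch of labelling each chunk with its source and page before it goes into the context, assuming the chunks carry the `source` and `page` metadata that the PDF loader sets. `Doc` is a stand-in class and `format_context` is a hypothetical helper, not a LangChain function.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Stand-in for a retrieved document with page_content and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def format_context(docs):
    """Join chunks into one string, prefixing each with its source file and page."""
    parts = []
    for i, doc in enumerate(docs, start=1):
        source = doc.metadata.get("source", "unknown source")
        page = doc.metadata.get("page", "?")  # page numbering follows the loader
        parts.append(f"[{i}] {source}, page {page}\n{doc.page_content}")
    return "\n\n".join(parts)

# Illustrative data only
docs = [
    Doc("All citizens shall have the right...", {"source": "data/constitutionOfIndia.pdf", "page": 11}),
    Doc("Held: ...", {}),
]
context = format_context(docs)
print(context)
```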

## 9. Save and Load Vector Store

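This section demonstrates how to reload the persisted vector store in a later session without re-embedding the documents, and checks how many entries it contains.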

In [9]:
def load_vector_store(embeddings, persist_directory=CHROMA_DIR):
    """Load an existing vector store"""
    if not os.path.exists(persist_directory):
        raise ValueError(f"Vector store directory {persist_directory} does not exist")
    
    vector_store = Chroma(
        persist_directory=persist_directory,
        embedding_function=embeddings
    )
    
    return vector_store

# Load the vector store
loaded_vector_store = load_vector_store(embeddings)
print("Vector store loaded successfully")

# Check collection stats
# Note: _collection is a private attribute of the Chroma wrapper and may change between versions
collection = loaded_vector_store._collection
count = collection.count()
print(f"Vector store contains {count} documents")

Vector store loaded successfully
Vector store contains 21004 documents
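
The reported count of 21004 is exactly twice the 10502 chunks assembled in section 5, which would be consistent with the creation step having run twice against the same `chroma_db` directory. A small sanity check along these lines can flag that situation. `ingestion_multiple` is a hypothetical helper.

```python
def ingestion_multiple(store_count, expected_count):
    """Return how many whole times expected_count fits into store_count.

    Returns None when store_count is not an exact multiple, which points to
    something other than repeated ingestion of the same chunk set.
    """
    if expected_count <= 0:
        raise ValueError("expected_count must be positive")
    multiple, remainder = divmod(store_count, expected_count)
    return multiple if remainder == 0 else None

times_ingested = ingestion_multiple(21004, 10502)
print(f"Chunk set appears to be stored {times_ingested} time(s)")
```

In the notebook this could be called as `ingestion_multiple(count, len(all_chunks))`; any result other than 1 suggests the store should be rebuilt from an empty directory.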
