# RAG (Retrieval Augmented Generation)

#### What is RAG?
**Retrieval Augmented Generation** (RAG) is a way to improve how AI models, such as chatbots, generate text. It combines the model's ability to write text with a retrieval system that finds relevant information in a database or knowledge base and passes it to the model.


#### How Does RAG Work?
1. **You ask a question/query**: You ask the chatbot something.
2. **Find relevant information**: The chatbot searches a knowledge base for the passages most relevant to your question/query.
3. **Generate an accurate response**: The chatbot uses the retrieved passages to write a more accurate answer.
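
The three steps above can be sketched without any libraries. This is only a toy illustration, not the pipeline built later in this notebook: the passages in `KNOWLEDGE_BASE` are made-up placeholders, retrieval is plain word overlap instead of embeddings, and the function names are hypothetical.

```python
# Toy illustration of the three RAG steps (made-up passages, word-overlap retrieval).
KNOWLEDGE_BASE = [
    "Employees accrue vacation leave every pay period.",
    "Emergency grants are allocated to students with financial need.",
    "Parking permits are required on all campuses.",
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text, drop basic punctuation, and return the set of words."""
    return set(text.lower().replace("?", " ").replace(".", " ").split())


def retrieve(query: str, k: int = 1) -> list[str]:
    """Step 2: rank passages by how many words they share with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        KNOWLEDGE_BASE,
        key=lambda passage: len(query_words & tokenize(passage)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str) -> str:
    """Step 3 (first half): place the retrieved text in front of the question."""
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


# Step 1: the user asks a question; the resulting prompt would be sent to an LLM.
print(build_prompt("How are emergency grants allocated?"))
```

In a real system the prompt would be passed to a language model, and the overlap score would be replaced by embedding similarity, which is what the rest of this notebook sets up.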

In [144]:
# Installing required packages
!pip install --upgrade pip
!pip install tf-keras --upgrade -q
!pip install --upgrade transformers numpy sentence-transformers langchain langchain_community chromadb cmake -q

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gptqmodel 1.9.0 requires numpy>=2.2.2, but you have numpy 2.1.3 which is incompatible.
gptqmodel 1.9.0 requires protobuf>=5.29.3, but you have protobuf 4.25.7 which is incompatible.

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.19.0 requires numpy<2.2.0,>=1.26.0, but you have numpy 2.2.5 which is incompatible.
gptqmodel 1.9.0 requires protobuf>=5.29.3, but you have protobuf 4.25.7 which is incompatible.
numba 0.61.0 requires numpy<2.2,>=1.24, but you have numpy 2.2.5 which is incompatible.

In [145]:
# Importing all the required libraries and modules (duplicates removed, grouped by package)
import os
import re
import warnings
from datetime import datetime
from pathlib import Path
from typing import Any, List

import faiss
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
import pandas as pd
from pydantic import Field, BaseModel, Extra
from chromadb import PersistentClient
from chromadb.utils import embedding_functions
from langchain.prompts import PromptTemplate
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA, LLMChain, ConversationalRetrievalChain
from langchain.chains.summarize import load_summarize_chain
from langchain.chains.combine_documents.refine import RefineDocumentsChain
from langchain.memory import ConversationSummaryMemory, ConversationBufferMemory
from langchain.schema import BaseRetriever, Document
from langchain_community.llms import Ollama
from langchain_community.document_loaders import TextLoader, PyPDFLoader
from langchain_community.embeddings import HuggingFaceEmbeddings, FakeEmbeddings
from langchain_community.vectorstores import Chroma, FAISS
from langchain_community.vectorstores.utils import filter_complex_metadata

warnings.filterwarnings('ignore')

In [146]:
from langchain_community.embeddings import HuggingFaceEmbeddings

# Initializing the vectorizer
# "all-MiniLM-L6-v2" is a good general-purpose embedding model that balances quality and speed
vectorizer = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")  # Converts text into vectors in embedding space

# RAG with multiple Policies 

In [147]:
# Load the CSV Report with Policy Metadata

def load_policy_report(csv_file: str) -> pd.DataFrame:
    """
    Loads report.csv, which lists policy titles and reference URLs.
    The file is read as tab-separated despite its .csv extension.
    Expected columns: 'Title', 'URL', and potentially others.
    """
    try:
        # Try UTF-8 first; malformed rows are skipped instead of raising an error
        df = pd.read_csv(csv_file, encoding="utf-8", sep="\t", on_bad_lines='skip')
    except UnicodeDecodeError:
        print("Failed to decode with encoding utf-8. Trying 'utf-16' instead.")
        df = pd.read_csv(csv_file, encoding="utf-16", sep="\t", on_bad_lines='skip')

    return df

In [148]:
# Load the CSV file (adjust the path if needed)
report_df = load_policy_report("report.csv")
print(f"Report loaded: {len(report_df)} policies found in report.csv.")

Failed to decode with encoding utf-8. Trying 'utf-16' instead.
Report loaded: 429 policies found in report.csv.


In [149]:
def map_policy_metadata(report_df: pd.DataFrame) -> dict:
    """ Helper function that maps lowercase policy titles (from the CSV) to their URLs. Rows that share a title collapse into a single entry. """
    mapping = {}
    for _, row in report_df.iterrows():
        title = row["Title"].strip().lower()
        url = row["URL"].strip()
        mapping[title] = url
    return mapping

# Load CSV and create mapping.
report_df = load_policy_report("report.csv")
policy_mapping = map_policy_metadata(report_df)
print(f"[INFO]: Report loaded with {len(policy_mapping)} policies.")

Failed to decode with encoding utf-8. Trying 'utf-16' instead.
[INFO]: Report loaded with 428 policies.


In [150]:
# First, I'm loading all the policy documents (PDFs) while attaching the metadata of each policy to its pages.
def load_policies(folder_path: str, policy_mapping: dict) -> List[Document]:
    """ Helper function that loads all PDFs from the folder and updates each Document's metadata with the source file, policy title, and policy URL. """
    all_docs = []
    for pdf_path in Path(folder_path).glob("*.pdf"):
        loader = PyPDFLoader(str(pdf_path))
        docs = loader.load()  
        file_title = pdf_path.stem.lower()
        matched_url = None
        matched_policy_title = None

        for title_key, url in policy_mapping.items():
            if title_key in file_title:
                matched_url = url
                matched_policy_title = title_key  
                break
        for doc in docs:
            doc.metadata["source_file"] = pdf_path.name
            if matched_url:
                doc.metadata["policy_title"] = matched_policy_title
                doc.metadata["policy_url"] = matched_url
            else:
                # Fallback if no match is found in the CSV.
                doc.metadata["policy_title"] = pdf_path.stem
                doc.metadata["policy_url"] = None
        all_docs.extend(docs)
    return all_docs

raw_documents = load_policies("Policies/", policy_mapping)
print(f"Loaded {len(raw_documents)} document pages from policies.")

Loaded 3139 document pages from policies.


In [151]:
print(f"[INFO]: Loaded {len(raw_documents)} documents (pages) from the 'Policies/' folder.")

[INFO]: Loaded 3139 documents (pages) from the 'Policies/' folder.


In [152]:
# Chunking each document (policy) page while preserving the metadata
def chunk_documents(raw_docs: list[Document], chunk_size: int = 500, overlap: int = 50) -> list[Document]:
    """ Helper function that splits each document into overlapping chunks and preserves its metadata. """
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=overlap)
    chunks = []
    for doc in raw_docs:
        doc_chunks = splitter.split_documents([doc])
        for chunk in doc_chunks:
            chunk.metadata = doc.metadata.copy()  # Preserve original metadata
            chunks.append(chunk)
    return chunks

In [153]:
chunked_docs = chunk_documents(raw_documents)
print(f"[INFO]: Chunked into {len(chunked_docs)} segments.")

[INFO]: Chunked into 18115 segments.
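
To make the `chunk_size` and `overlap` parameters concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and sentences, which this hypothetical `sliding_chunks` helper does not attempt.

```python
def sliding_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Cut text into fixed-size windows; consecutive windows share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window start moves each time
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


chunks = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The last three characters of each chunk reappear at the start of the next one, so a sentence that straddles a boundary is still seen whole by at least one chunk.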


In [154]:
def simple_filter_metadata(metadata: dict, allowed_types=(str, int, float, bool)) -> dict:
    """ 
    Helper function that filters a metadata dictionary so that each value is of type str, int, float, or bool.
    If a value is not one of these types (and not None), it's converted to a string.
    Keys with None values are dropped.
    """
    filtered = {}
    for key, value in metadata.items():
        if value is None:
            continue  # Skip None values
        if isinstance(value, allowed_types):
            filtered[key] = value
        else:
            # Optionally, convert the value to a string.
            filtered[key] = str(value)
    return filtered

In [155]:
# Building Semantic & Keyword based Retriever

# After chunking, I'm filtering the metadata of each document chunk:
# keys with None values are dropped and non-simple values are converted to strings.
for doc in chunked_docs:
    doc.metadata = simple_filter_metadata(doc.metadata)

db_semantic = Chroma.from_documents(
    chunked_docs, 
    vectorizer,
    client=PersistentClient(path="./chroma_db")
)
semantic_retriever = db_semantic.as_retriever(search_kwargs={"k": 5})
print("[INFO]: Semantic retriever set up.")

[INFO]: Semantic retriever set up.


In [156]:
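# NOTE: FakeEmbeddings produces random vectors, so this FAISS index is only a placeholder
# for the keyword retriever and does not perform keyword (BM25-style) matching.
# For true lexical retrieval, consider BM25Retriever from langchain_community.retrievers.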

db_keyword = FAISS.from_documents(chunked_docs, FakeEmbeddings(size=768))
keyword_retriever = db_keyword.as_retriever(search_kwargs={"k": 5})
print("[INFO]: Keyword retriever set up.")

[INFO]: Keyword retriever set up.
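
For comparison, lexical keyword scoring in the BM25 style can be sketched in a few lines of plain Python. This is an illustrative sketch, not the retriever used in this notebook: the function name, the sample documents, the whitespace tokenizer, and the `k1`/`b` defaults are all assumptions, and the IDF term follows one common BM25 variant.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with a BM25-style formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


docs = [
    "emergency grant allocation policy",
    "campus parking permit policy",
    "student travel policy",
]
scores = bm25_scores("emergency grant", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

Unlike a vector index, this score is zero for any document that shares no terms with the query, which is the behavior usually expected from a keyword retriever.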


#### Building GraphRAG Component

Here, I'm building a graph over the document chunks using NetworkX, where nodes represent chunks and edges connect similar chunks. This graph helps propagate context across policy boundaries.

The graph uses approximate nearest neighbor search via FAISS to stay sparse. This avoids computing all pairwise similarities (which can be prohibitively expensive for 500+ PDFs) by retrieving only the nearest neighbors of each chunk.
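
The idea of a thresholded nearest-neighbor graph can be shown on a handful of toy vectors. The sketch below is an exact brute-force version in plain Python, so it compares every pair (the very cost the FAISS index avoids at scale); the vectors and the `knn_edges` helper are made up for illustration. Vectors are L2-normalized first so that the inner product equals cosine similarity.

```python
import math


def normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def knn_edges(vectors: list[list[float]], k: int = 2, threshold: float = 0.9) -> dict:
    """Return {(i, j): similarity} for each node's top-k neighbors above the threshold."""
    unit = [normalize(v) for v in vectors]
    edges = {}
    for i, u in enumerate(unit):
        # Exact search: score every other vector, then keep the k most similar.
        sims = sorted(((dot(u, w), j) for j, w in enumerate(unit) if j != i), reverse=True)
        for sim, j in sims[:k]:
            if sim >= threshold:
                edges[(min(i, j), max(i, j))] = sim  # undirected edge, stored once
    return edges


vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
edges = knn_edges(vectors, k=2, threshold=0.9)
print(sorted(edges))
# [(0, 1), (2, 3)]
```

Only the two nearly parallel pairs are linked; all other pairs fall below the 0.9 threshold, which is why a high threshold yields a sparse graph.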

In [157]:
'''
Approach: I'm using FAISS IndexIVFFlat to perform approximate nearest neighbor search. An edge is added between two nodes if their inner product exceeds the specified threshold. The inner product equals cosine similarity only when the embeddings are L2-normalized.
'''

def build_policy_graph(
    docs: list[Document], 
    vectorizer, 
    k_neighbors: int = 5, 
    threshold: float = 0.9, 
    nlist: int = 100
) -> nx.Graph:
    """ Helper function that builds a sparse similarity graph over the document chunks. """

    # Computing embeddings for each document chunk
    embeddings = []
    for doc in docs:
        emb = np.array(vectorizer.embed_query(doc.page_content)).astype("float32")
        embeddings.append(emb)
    embeddings = np.stack(embeddings)
    dim = embeddings.shape[1]

    # Building an approximate FAISS index with inner-product
    quantizer = faiss.IndexFlatIP(dim)
    index = faiss.IndexIVFFlat(quantizer, dim, nlist, faiss.METRIC_INNER_PRODUCT)
    index.train(embeddings)
    index.add(embeddings)

    # Retrieving approximate k_neighbors for each chunk (+1 because a chunk usually retrieves itself)
    # returns: D = inner-product similarities, I = indices (-1 where fewer hits were found)
    D, I = index.search(embeddings, k_neighbors + 1)

    # Building a graph based on neighbors with similarity above threshold
    G = nx.Graph()
    
    # Adding one node per chunk, with the Document (and its metadata) attached
    for idx, doc in enumerate(docs):
        G.add_node(idx, doc=doc)

    # For each chunk, adding edges to its approximate neighbors.
    # Self-matches and -1 padding are skipped explicitly, since the chunk itself
    # is not guaranteed to be the first hit (e.g. when chunks are duplicated).
    for i, (neighbors, sims) in enumerate(zip(I, D)):
        for j, sim in zip(neighbors, sims):
            if j < 0 or j == i:
                continue
            if sim >= threshold:
                # Adding an edge with weight=similarity
                G.add_edge(i, int(j), weight=float(sim))

    return G

In [158]:
# Building the graph on our chunked documents 

policy_graph = build_policy_graph(chunked_docs, vectorizer, k_neighbors=5, threshold=0.9, nlist=100)
print(f"[INFO]: Approximate graph created with {policy_graph.number_of_nodes()} nodes and {policy_graph.number_of_edges()} edges.")

[INFO]: Approximate graph created with 18115 nodes and 7435 edges.


In [159]:
def print_policy_graph_info(graph):
    ''' Helper function that prints a summary of the policy graph. '''
    print("Graph Summary:")
    print(f"  Number of nodes: {graph.number_of_nodes()}")
    print(f"  Number of edges: {graph.number_of_edges()}")
    print("\nSample Nodes (first 5):")
    for node in list(graph.nodes(data=True))[:5]:
        print(node)
    print("\nSample Edges (first 5):")
    for edge in list(graph.edges(data=True))[:5]:
        print(edge)

print_policy_graph_info(policy_graph)

Graph Summary:
  Number of nodes: 18115
  Number of edges: 7435

Sample Nodes (first 5):
(0, {'doc': Document(metadata={'producer': 'Prince 12.5.1 (www.princexml.com)', 'creator': 'PolicyStat', 'creationdate': '', 'subject': 'The California State University', 'author': 'Grommo, April: Asst VC, Enroll Mgmt Srvcs', 'title': '2021 – 2022 Emergency Grant Allocation', 'source': 'Policies/2021 - 2022 Emergency Grant Allocation.pdf', 'total_pages': 3, 'page': 0, 'page_label': '1', 'source_file': '2021 - 2022 Emergency Grant Allocation.pdf', 'policy_title': '2021 - 2022 Emergency Grant Allocation'}, page_content='COPY\nStatus Active PolicyStat ID 10719972 \nOrigination 12/7/2021 \nEffective 12/7/2021 \nReviewed 12/7/2021 \nNext Review 12/7/2023 \nOwner April Grommo: \nAsst VC, Enroll \nMgmt Srvcs \nArea Academic and \nStudent Affairs \n2021 – 2022 Emergency Grant Allocation \nPolicy \nThis policy provides procedural guidance related to the allocation of $30 million of one-time funding \nissued

In [160]:
# question_prompt_template = """
# You are a highly knowledgeable assistant with expertise in CSU policies. Your task is to answer the following question using the context provided.
# IMPORTANT: If the question is not related to CSU policies, respond with: 
# "I'm sorry, I can only answer questions related to CSU policies. Could you please rephrase your query accordingly?"
# Question: {question}
# Context: {context}
# Answer:
# """

# question_prompt_template = """
# You are a highly knowledgeable assistant with deep expertise in CSU policies. Your responses must strictly pertain to CSU policies and internal policy matters.
# IMPORTANT: If the user's question is not about CSU policies or policy-related information, immediately respond with:
# "I'm sorry, I can only answer questions related to CSU policies. Could you please rephrase your query accordingly?"
# Do not provide any additional content in that case.
# Otherwise, answer the following question using the context provided.
# Question: {query}
# Context: {context}
# Answer:
# """


# question_prompt = PromptTemplate(
#     input_variables=["question", "context"],
#     template="""
# You are a highly knowledgeable assistant with deep expertise in CSU policies.
# Only answer questions related to CSU policies.
# If the question is not related to CSU policies, respond with:
# "I'm sorry, I can only answer questions related to CSU policies. Could you please rephrase your query accordingly?"

# Question: {question}
# Context: {context}
# Answer:
# """,
# )

# question_prompt_template = """
# You are a CSU Policy Assistant. You are only allowed to answer questions directly related to California State University policies using official policy documents as your source.
# If the user's question is not related to CSU policies, respond exactly with:
# "I'm sorry, I can only answer questions related to CSU policies. Could you please rephrase your query accordingly?"
# If the user's question is unclear, ask for clarification.
# Otherwise, answer the question using the context provided.

# Question: {question}
# Context: {context}
# Answer:
# """

question_prompt_template = """
You are a CSU Policy Assistant. Your job is to answer ONLY questions that are directly about California State University (CSU) policies, using official policy documents as your source.

Step 1: Before answering, check if the user's question is about CSU policies.
- If the question is NOT about CSU policies, respond ONLY with:
"I'm sorry, I can only answer questions related to CSU policies. Could you please rephrase your query accordingly?"
- Do NOT attempt to answer or provide any information outside CSU policies.

Step 2: If the question IS about CSU policies:
- If unclear, ask the user to clarify.
- Otherwise, answer using the provided context.

Question: {question}
Context: {context}
Answer:
"""



In [161]:
# refine_prompt_template = """
# The initial answer is: {existing_answer}
# Additional context: {context}
# IMPORTANT: Ensure the question is clearly related to CSU policies. 
# If it is not, respond with: 
# "I'm sorry, I can only answer questions related to CSU policies. Could you please ask a question related to CSU policies?"
# Otherwise, refine and elaborate on the answer, providing clear details and citing evidence where applicable.
# Refined answer:
# """
# refine_prompt_template = """
# The initial answer is: {existing_answer}
# Additional context: {context}

# IMPORTANT: Before refining, ensure the question is clearly related to CSU policies.
# If you determine that the query is off-topic, immediately respond with:
# "I'm sorry, I can only answer questions related to CSU policies. Could you please ask a question related to CSU policies?"
# Otherwise, please refine and elaborate on the answer, providing clear details and citing evidence as needed.
# Refined answer:
# """


# refine_prompt_template = """
# [INTERNAL: If the question contains "summarize", "clarify", or "rephrase", do NOT output the off-topic message; simply refine the previous answer.]

# The initial answer is: {existing_answer}
# Additional context: {context}

# IMPORTANT: If the new context does not clearly indicate that the question pertains to CSU policies and no follow-up instruction is present, respond with exactly:
# "I'm sorry, I can only answer questions related to CSU policies. Could you please ask a question related to CSU policies?"
# Otherwise, refine and expand the answer.

# Refined answer:
# """


refine_prompt_template = """
You are a CSU Policy Assistant. Your response must be strictly based on CSU policies.

Step 1: Check if the question or additional context is about CSU policies.
- If NOT, respond ONLY with:
"I'm sorry, I can only answer questions related to CSU policies. Could you please ask a question related to CSU policies?"

Step 2: If it IS about CSU policies, refine and expand the answer using the context provided.

The initial answer is: {existing_answer}
Additional context: {context}
Refined answer:
"""


In [162]:
# Creating PromptTemplate objects
question_prompt = PromptTemplate(
    template=question_prompt_template,
    input_variables=["question", "context"]
)
refine_prompt = PromptTemplate(
    template=refine_prompt_template,
    input_variables=["existing_answer", "context"]
)

In [163]:
# Subclass RetrievalQA to allow extra chain_type_kwargs.
class CustomRetrievalQA(RetrievalQA):
    class Config:
        extra = Extra.allow

In [164]:
def is_csu_policy_question(query: str) -> bool:
    """Check if the query relates to CSU policies using keywords."""
    csu_keywords = ["CSU", "California State University", "policy", "academic integrity", "code of conduct"]
    return any(keyword.lower() in query.lower() for keyword in csu_keywords)

from langchain.retrievers import EnsembleRetriever  # required by the subclass below

# Retriever variant that returns no documents for non-policy questions
class PolicyFilteredRetriever(EnsembleRetriever):
    def get_relevant_documents(self, query: str):
        if not is_csu_policy_question(query):
            return []  # Return empty list for non-policy questions
        return super().get_relevant_documents(query)


#### Creating Hybrid Retriever (Combines Semantic, Keyword & Graph Retrieval)

I'm creating a custom hybrid retriever class that:
1. Retrieves documents via semantic and keyword search.
2. Uses the graph to add neighboring nodes of the retrieved chunks (for additional context).
3. Deduplicates and returns a final list of relevant documents.
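
The three steps above can be sketched on plain document IDs. This is a simplified illustration under stated assumptions, not the class defined next: `hybrid_merge` is a hypothetical helper, the graph is a plain adjacency dict, and the two ranked lists are interleaved (an assumption of this sketch) so that hits from both retrievers survive the final `top_k` cut.

```python
from itertools import zip_longest


def hybrid_merge(semantic: list[str], keyword: list[str],
                 neighbors: dict[str, list[str]], top_k: int = 5) -> list[str]:
    """Interleave two ranked ID lists, append graph neighbors, dedupe, and truncate."""
    # Step 1: alternate between the two rankings instead of concatenating them.
    merged = []
    for pair in zip_longest(semantic, keyword):
        merged.extend(doc_id for doc_id in pair if doc_id is not None)

    # Step 2: add the graph neighbors of every retrieved document.
    expanded = list(merged)
    for doc_id in merged:
        expanded.extend(neighbors.get(doc_id, []))

    # Step 3: dict.fromkeys drops repeats while keeping first-seen order.
    return list(dict.fromkeys(expanded))[:top_k]


semantic_hits = ["a", "b", "c"]
keyword_hits = ["c", "d"]
graph_neighbors = {"a": ["e"], "d": ["f"]}
print(hybrid_merge(semantic_hits, keyword_hits, graph_neighbors, top_k=5))
# ['a', 'c', 'b', 'd', 'e']
```

Looking neighbors up through a dict keyed by document ID also avoids scanning every graph node for each retrieved document.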

In [165]:
class HybridGraphRetriever(BaseRetriever, BaseModel):
    semantic_retriever: Any
    keyword_retriever: Any
    policy_graph: nx.Graph
    top_k: int = Field(default=5)
    graph_hops: int = Field(default=1)

    class Config:
        extra = "allow"

    def _get_relevant_documents(self, query: str) -> List[Document]:
        sem_docs = self.semantic_retriever.get_relevant_documents(query)
        key_docs = self.keyword_retriever.get_relevant_documents(query)
        combined = sem_docs + key_docs
        expanded_docs = combined.copy()
        for doc in combined:
            for node, data in self.policy_graph.nodes(data=True):
                if data["doc"].page_content.strip() == doc.page_content.strip():
                    neighbors = nx.single_source_shortest_path_length(self.policy_graph, node, cutoff=self.graph_hops)
                    for n in neighbors:
                        neighbor_doc = self.policy_graph.nodes[n]["doc"]
                        expanded_docs.append(neighbor_doc)
                    break
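        # Deduplicating on (policy title, page, text); dict order keeps the first-seen ranking
        seen = {}
        for doc in expanded_docs:
            key = (doc.metadata.get("policy_title", ""), doc.metadata.get("page", ""), doc.page_content)
            seen[key] = doc
        unique_docs = list(seen.values())
        # NOTE: semantic hits come first, so when top_k does not exceed the number of
        # semantic hits, keyword hits and graph neighbors are cut off by this slice.
        return unique_docs[:self.top_k]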

    async def _aget_relevant_documents(self, query: str) -> List[Document]:
        raise NotImplementedError("Async retrieval is not implemented for HybridGraphRetriever.")

In [166]:
hybrid_retriever = HybridGraphRetriever(
    semantic_retriever=semantic_retriever,
    keyword_retriever=keyword_retriever,
    policy_graph=policy_graph,
    top_k=5,
    graph_hops=1
)

# hybrid_retriever = PolicyFilteredRetriever(retrievers=[bm25_retriever, tfidf_retriever], weights=[0.5, 0.5])

print("[INFO]: HybridGraphRetriever with approximate graph is ready.")

[INFO]: HybridGraphRetriever with approximate graph is ready.


In [167]:
# Create a custom ConversationalRetrievalChain subclass that allows extra keys.
class CustomConversationalRetrievalChain(ConversationalRetrievalChain):
    class Config:
        extra = Extra.allow

In [168]:
# Create the LLM and the initial/refine chains
llm = Ollama(model="mistral", temperature=0.3)

initial_llm_chain = LLMChain(llm=llm, prompt=question_prompt)
refine_llm_chain = LLMChain(llm=llm, prompt=refine_prompt)

combine_docs_chain = RefineDocumentsChain(
    initial_llm_chain=initial_llm_chain,
    refine_llm_chain=refine_llm_chain,
    document_variable_name="context",
    initial_response_name="existing_answer"
)
print("[INFO]: Custom refine documents chain is ready.")

[INFO]: Custom refine documents chain is ready.


In [169]:
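# Pass-through prompt for the question generator; it ignores chat_history.
# NOTE: on follow-up turns the chain uses the LLM's output for this prompt as the
# retrieval query, so a dedicated "condense question" prompt may work better here.
dummy_question_prompt = PromptTemplate(
    template="{question}",
    input_variables=["question"]
)
question_generator = LLMChain(llm=llm, prompt=dummy_question_prompt)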

In [170]:
# memory = ConversationBufferMemory(
#     memory_key="chat_history", 
#     output_key="answer", 
#     return_messages=True
# )
# rag_chain = CustomRetrievalQA.from_chain_type(
#     llm=llm_for_chain,
#     chain_type="refine",
#     retriever=hybrid_retriever,
#     return_source_documents=True,
#     chain_type_kwargs={
#         "question_prompt": question_prompt,
#         "refine_prompt": refine_prompt,
#         "document_variable_name": "context"
#     }
# )

# rag_chain_custom = CustomConversationalRetrievalChain.from_llm(
#     llm=llm_for_chain, 
#     retriever=hybrid_retriever,
#     memory=memory,
#     output_key="answer",
#     return_source_documents=True,
#     chain_type_kwargs={
#          "question_prompt": question_prompt,
#          "refine_prompt": refine_prompt,
#          "document_variable_name": "context"
#     }
# )

memory = ConversationBufferMemory(
    memory_key="chat_history", 
    return_messages=True,
    output_key="answer"
)

rag_chain = ConversationalRetrievalChain(
    retriever=hybrid_retriever,               
    combine_docs_chain=combine_docs_chain,   
    question_generator=question_generator,
    memory=memory,
    output_key="answer",
    return_source_documents=True,
    callbacks=[]
)
print("[✅] Custom ConversationalRetrievalChain is ready.")

[✅] Custom ConversationalRetrievalChain is ready.


In [171]:
def is_on_topic(question: str, llm) -> bool:
    """Use a simple prompt to ask the LLM whether the query is related to CSU policies."""
    check_prompt = PromptTemplate(
        template="Is the following question related to CSU policies? Answer with 'yes' or 'no'.\nQuestion: {question}",
        input_variables=["question"]
    )
    check_chain = LLMChain(llm=llm, prompt=check_prompt)
    response = check_chain.predict(question=question)
    return "yes" in response.lower()

### Creating a Simple Intent Classifier Using the Existing LLM

In [172]:
class SimpleIntentClassifier:
    def __init__(self, llm):
        """Initialize a simple intent classifier using an existing LLM"""
        self.llm = llm
        self.intents = [
            "policy_lookup",
            "policy_comparison",
            "policy_application",
            "summarize_previous",
            "clarification",
            "out_of_scope"
        ]
        
        # Create the classification prompt
        self.prompt_template = PromptTemplate(
            input_variables=["query"],
            template="""
            Classify the following query into exactly one of these intents:
            - policy_lookup: Questions about what a specific CSU policy is or contains
            - policy_comparison: Questions comparing two or more CSU policies
            - policy_application: Questions about how a CSU policy applies to a situation
            - summarize_previous: Requests to summarize previous information
            - clarification: Requests to clarify previous information
            - out_of_scope: Questions not related to CSU policies
            
            Query: {query}
            
            Also extract any policy entities mentioned in the query.
            
            Respond in this exact format:
            Intent: [intent name]
            Confidence: [0.0-1.0]
            Entities: [list of policy entities or "none"]
            """
        )
        
        self.chain = LLMChain(llm=self.llm, prompt=self.prompt_template)
    
    def parse(self, query):
        """Parse the query and return intent classification"""
        result = self.chain.run(query=query)
        
        # Extract intent, confidence, and entities using regex
        intent_match = re.search(r"Intent: (\w+)", result)
        confidence_match = re.search(r"Confidence: (0\.\d+|1\.0)", result)
        entities_match = re.search(r"Entities: (.+)", result)
        
        intent = intent_match.group(1) if intent_match else "out_of_scope"
        confidence = float(confidence_match.group(1)) if confidence_match else 0.5
        
        entities = []
        if entities_match and "none" not in entities_match.group(1).lower():
            entity_names = entities_match.group(1).strip("[]").split(",")
            for entity in entity_names:
                entity = entity.strip()
                if entity:
                    entities.append({"entity": "policy", "value": entity})
        
        return {
            "intent": {"name": intent, "confidence": confidence},
            "entities": entities
        }
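
To see how the response format is turned into a structured result, the parsing step can be exercised on a hand-written reply without calling the LLM. The sketch below is a standalone variant, not the method above: `parse_intent_response` and the sample text are made up, and the confidence pattern is slightly more permissive (it also accepts a bare `1` or `0`).

```python
import re


def parse_intent_response(text: str) -> dict:
    """Extract intent, confidence, and entities from an 'Intent/Confidence/Entities' reply."""
    intent = re.search(r"Intent:\s*(\w+)", text)
    confidence = re.search(r"Confidence:\s*([01](?:\.\d+)?)", text)
    entities = re.search(r"Entities:\s*(.+)", text)

    names = []
    if entities:
        raw = entities.group(1).strip().strip("[]")
        if raw.lower() != "none":
            names = [name.strip() for name in raw.split(",") if name.strip()]

    return {
        # Fall back to safe defaults when the model ignores the requested format.
        "intent": intent.group(1) if intent else "out_of_scope",
        "confidence": float(confidence.group(1)) if confidence else 0.5,
        "entities": names,
    }


sample = "Intent: policy_comparison\nConfidence: 0.85\nEntities: [travel policy, hospitality policy]"
print(parse_intent_response(sample))
```

Testing the parser on fixed strings like this makes it easier to notice format drift in the model's replies before it silently degrades routing.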


### Implementing Conversation Context Management


In [173]:
class PolicyConversationContext:
    def __init__(self, max_turns=5):
        self.context = []
        self.max_turns = max_turns
        self.current_policies = set()
        self.last_answer = ""
        self.last_intent = None
        self.last_source_docs = []

    def add_turn(self, user_message, bot_response, intent, entities=None, source_docs=None):
        # Adding new conversation turn
        self.context.append({
            "user": user_message,
            "bot": bot_response,
            "intent": intent,
            "timestamp": datetime.now()
        })

        # Updating tracking variables
        self.last_answer = bot_response
        self.last_intent = intent
        if source_docs:
            self.last_source_docs = source_docs

        # Tracking mentioned policies
        if entities:
            for entity in entities:
                if entity["entity"] == "policy":
                    self.current_policies.add(entity["value"])
        
        # Maintaining context window size
        if len(self.context) > self.max_turns:
            self.context.pop(0)
    
    # Accessors are defined at class level so they are callable as methods
    def get_last_answer(self):
        return self.last_answer

    def get_last_source_docs(self):
        return self.last_source_docs

    def get_relevant_policies(self):
        return list(self.current_policies)

    def get_context_history(self):
        return self.context
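
The bounded context window (`max_turns`) can also be expressed with `collections.deque`, which discards the oldest item automatically once `maxlen` is reached. This is an alternative sketch with made-up turns, not a change to the class above.

```python
from collections import deque

# Keep only the three most recent turns; older ones fall off the left end.
history = deque(maxlen=3)
for turn in range(1, 6):
    history.append({"user": f"question {turn}", "bot": f"answer {turn}"})

print([entry["user"] for entry in history])
# ['question 3', 'question 4', 'question 5']
```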
                        

### Creating an Intent-Aware Retriever Wrapper
Wrapping our existing hybrid retriever with intent-aware capabilities.

In [174]:
class IntentAwareRetrieverWrapper:
    def __init__(self, hybrid_retriever, intent_classifier):
        self.hybrid_retriever = hybrid_retriever
        self.intent_classifier = intent_classifier
    
    def get_relevant_documents(self, query):
        # Classify intent
        intent_result = self.intent_classifier.parse(query)
        intent = intent_result["intent"]["name"]
        confidence = intent_result["intent"]["confidence"]
        entities = intent_result.get("entities", [])
        
        # Handle special intents
        if intent in ["summarize_previous", "clarification"]:
            return []
            
        if intent == "out_of_scope" and confidence > 0.6:
            return []
        
        # For policy comparison, enhance retrieval
        if intent == "policy_comparison":
            policy_entities = [e["value"] for e in entities if e["entity"] == "policy"]
            if len(policy_entities) >= 2:
                all_docs = []
                for policy in policy_entities:
                    enhanced_query = f"{policy} policy CSU"
                    docs = self.hybrid_retriever.get_relevant_documents(enhanced_query)
                    all_docs.extend(docs)
                return all_docs
        
        # For regular policy questions, use the hybrid retriever
        return self.hybrid_retriever.get_relevant_documents(query)


In [175]:
# Rule-based query decomposition (keyword and punctuation patterns; spaCy is not used)
def simple_decompose_query(query):
    """Simple query decomposition based on common patterns"""
    # Check for comparison queries
    comparison_keywords = ["compare", "difference", "versus", "vs", "similarities", "differences"]
    is_comparison = any(keyword in query.lower() for keyword in comparison_keywords)
    
    if is_comparison:
        # This is a comparison query, handle it as is
        return [query]
    
    # Split on question marks for multiple questions
    if "?" in query:
        parts = query.split("?")
        # Filter out empty parts and add back the question marks
        return [part.strip() + "?" for part in parts if part.strip()]
    
    # Not a complex query
    return [query]


In [176]:
# def policy_chatbot(query: str):

#     # First, check if the query is on-topic.
#     if not is_on_topic(query, llm):
#         print("\nI'm sorry, I can only answer questions related to CSU policies. Could you please rephrase your query accordingly?\n")
#         return
    
#     # Note: Use "question" as the input key.
#     result = rag_chain({"question": query})
#     answer = result.get("answer", "")
#     source_docs = result.get("source_documents", [])
    
#     print(f"\n💬 Query: {query}\n")
#     print(f"🤖 Answer:\n{answer}\n")
    
#     print("📚 References:")
#     for i, doc in enumerate(source_docs, start=1):
#         metadata = doc.metadata
#         policy_title = metadata.get("policy_title", "Unknown Policy")
#         policy_url = metadata.get("policy_url", None)
#         page = metadata.get("page")
#         page_num = page + 1 if isinstance(page, int) else "?"
#         if policy_url and isinstance(policy_url, str) and policy_url.startswith("http"):
#             link = f"{policy_url}#page={page_num} ({policy_title})"
#         else:
#             link = f"{policy_title} (Page {page_num})"
#         print(f"[{i}] {link}")
#     print("\n" + "-" * 80 + "\n")


# def policy_chatbot(question: str):  
#     result = rag_chain({"question": question})
#     answer = result.get("answer", "")
#     source_docs = result.get("source_documents", [])
    
#     print(f"\n💬 Query: {question}\n")
#     print(f"🤖 Answer:\n{answer}\n")
#     print("📚 References:")
#     for i, doc in enumerate(source_docs, 1):
#         meta = doc.metadata
#         title = meta.get("policy_title", "Unknown Policy")
#         url = meta.get("policy_url", "")
#         page = meta.get("page")
#         page_disp = page + 1 if isinstance(page, int) else "?"
#         if url:
#             print(f"[{i}] {url}#page={page_disp} ({title})")
#         else:
#             print(f"[{i}] {title} (Page {page_disp})")
#     print("\n" + "-" * 80 + "\n")

In [177]:
# def is_on_topic(question: str) -> bool:
#     """
#     Uses a simple LLMChain to check if the question is directly related to CSU policies.
#     Returns True if the answer is 'yes', otherwise False.
#     """
#     on_topic_prompt = PromptTemplate(
#         template="Is the following question related to CSU policies? Answer only 'yes' or 'no'.\nQuestion: {question}",
#         input_variables=["question"]
#     )
#     on_topic_chain = LLMChain(llm=llm, prompt=on_topic_prompt)
#     response = on_topic_chain.predict(question=question)
#     return "yes" in response.lower()


In [178]:
# def policy_chatbot(question: str):
#     # Pre-check: if question is off-topic, immediately return the fixed off-topic message.
#     if not is_on_topic(question):
#         print("\nI'm sorry, I can only answer questions related to CSU policies. Could you please rephrase your query accordingly?\n")
#         return

#     result = rag_chain({"question": question})
#     answer = result.get("answer", "")
#     source_docs = result.get("source_documents", [])
    
#     print(f"\n💬 Query: {question}\n")
#     print(f"🤖 Answer:\n{answer}\n")
#     print("📚 References:")
#     for i, doc in enumerate(source_docs, start=1):
#         meta = doc.metadata
#         title = meta.get("policy_title", "Unknown Policy")
#         url = meta.get("policy_url", "")
#         page = meta.get("page")
#         page_disp = page + 1 if isinstance(page, int) else "?"
#         if url and isinstance(url, str) and url.startswith("http"):
#             print(f"[{i}] {url}#page={page_disp} ({title})")
#         else:
#             print(f"[{i}] {title} (Page {page_disp})")
#     print("\n" + "-" * 80 + "\n")

In [179]:
def policy_chatbot(question: str):
    """Enhanced policy chatbot with NLU capabilities"""
    global intent_classifier, context_manager
    
    # Ensure components are initialized
    if 'intent_classifier' not in globals() or 'context_manager' not in globals():
        initialize_chatbot()
    
    # Classify intent
    intent_result = intent_classifier.parse(question)
    intent = intent_result["intent"]["name"]
    confidence = intent_result["intent"]["confidence"]
    entities = intent_result.get("entities", [])
    
    print(f"\n💬 Query: {question}\n")
    
    # Handle special intents
    if intent == "summarize_previous":
        last_answer = context_manager.get_last_answer()
        if not last_answer:
            answer = "I don't have any previous information to summarize."
        else:
            # Use a simple summarization prompt
            summarization_prompt = f"Summarize the following text in a concise way:\n\n{last_answer}"
            answer = llm_for_chain.predict(text=summarization_prompt)
        
        print(f"🤖 Answer:\n{answer}\n")
        context_manager.add_turn(question, answer, intent)
        return
    
    if intent == "clarification":
        last_answer = context_manager.get_last_answer()
        if not last_answer:
            answer = "I'm sorry, but I don't have any previous information to clarify. Could you ask a specific question about CSU policies?"
        else:
            clarification_prompt = f"The user is asking for clarification on this response: '{last_answer}'. Provide a clearer explanation."
            answer = llm_for_chain.predict(text=clarification_prompt)
        
        print(f"🤖 Answer:\n{answer}\n")
        context_manager.add_turn(question, answer, intent)
        return
    
    # Check if it's a non-policy question
    if intent == "out_of_scope" and confidence > 0.6:
        answer = "I'm sorry, I can only answer questions related to CSU policies. Could you please rephrase your query accordingly?"
        print(f"🤖 Answer:\n{answer}\n")
        context_manager.add_turn(question, answer, intent)
        return
    
    # Answer with the existing RAG chain (built on the original hybrid_retriever);
    # intent-specific handling happens in this function rather than in the retriever
    
    # Check if query needs decomposition
    sub_queries = simple_decompose_query(question)
    
    if len(sub_queries) > 1:
        # Process each sub-query and combine results
        combined_answer = ""
        all_sources = []
        
        for sub_q in sub_queries:
            # Use your existing rag_chain
            result = rag_chain({"question": sub_q})
            sub_answer = result.get("answer", "")
            if sub_answer:
                combined_answer += sub_answer + "\n\n"
                all_sources.extend(result.get("source_documents", []))
        
        answer = combined_answer.strip()
        source_docs = all_sources
    else:
        # Process normally for simple queries
        result = rag_chain({"question": question})
        answer = result.get("answer", "")
        source_docs = result.get("source_documents", [])
    
    # Format the response based on intent
    if intent == "policy_comparison":
        policy_entities = [e["value"] for e in entities if e["entity"] == "policy"]
        if len(policy_entities) >= 2:
            answer = format_comparison_response(answer, policy_entities)
    
    # Update conversation context
    context_manager.add_turn(question, answer, intent, entities, source_docs)
    
    # Output the answer
    print(f"🤖 Answer:\n{answer}\n")
    
    # Print references
    print("📚 References:")
    for i, doc in enumerate(source_docs, start=1):
        meta = doc.metadata
        title = meta.get("policy_title", "Unknown Policy")
        url = meta.get("policy_url", "")
        page = meta.get("page")
        page_disp = page + 1 if isinstance(page, int) else "?"
        if url and isinstance(url, str) and url.startswith("http"):
            print(f"[{i}] {url}#page={page_disp} ({title})")
        else:
            print(f"[{i}] {title} (Page {page_disp})")
    
    print("\n" + "-" * 80 + "\n")
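
The reference-printing loop at the end of `policy_chatbot` can be factored into a small pure function, which makes the page-number and URL rules easy to check in isolation. This is a sketch, not the notebook's code: `format_reference` and `dedupe_metadata` are hypothetical names, and the example URL is a placeholder. Deduplication is included because decomposed sub-queries may retrieve the same page more than once.

```python
def format_reference(index, metadata):
    """Build one reference line from a document's metadata dict."""
    title = metadata.get("policy_title", "Unknown Policy")
    url = metadata.get("policy_url", "")
    page = metadata.get("page")
    # Page indices are assumed to be zero-based, hence the +1 for display
    page_disp = page + 1 if isinstance(page, int) else "?"
    if isinstance(url, str) and url.startswith("http"):
        return f"[{index}] {url}#page={page_disp} ({title})"
    return f"[{index}] {title} (Page {page_disp})"


def dedupe_metadata(metadatas):
    """Drop repeated (title, url, page) entries while preserving order."""
    seen = set()
    unique = []
    for meta in metadatas:
        key = (meta.get("policy_title"), meta.get("policy_url"), meta.get("page"))
        if key not in seen:
            seen.add(key)
            unique.append(meta)
    return unique


sources = [
    {"policy_title": "Academic Integrity", "policy_url": "https://example.edu/p.pdf", "page": 0},
    {"policy_title": "Academic Integrity", "policy_url": "https://example.edu/p.pdf", "page": 0},
    {"policy_title": "Student Conduct"},
]
for i, meta in enumerate(dedupe_metadata(sources), start=1):
    print(format_reference(i, meta))
```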


In [180]:
def format_comparison_response(answer, policy_entities):
    """Format the response as a comparison table for policy comparison intents"""
    # Create a header for the comparison
    comparison = f"## Comparison of {' and '.join(policy_entities)}\n\n"
    
    # Candidate aspects for table rows (not used yet: only the header row is emitted)
    aspects = ["Focus", "Scope", "Enforcement", "Penalties", "Application"]
    
    # Create the Markdown table header and separator rows
    comparison += "| Aspect | " + " | ".join(policy_entities) + " |\n"
    comparison += "|--------|" + "|".join(["---------" for _ in policy_entities]) + "|\n"
    
    # Add the original answer after the table
    comparison += "\n\n" + answer
    
    return comparison

def format_policy_lookup_response(answer, policy_name):
    """Format the response for policy lookup intents"""
    formatted = f"## {policy_name.title()} Policy\n\n"
    
    # Add key points section
    formatted += "### Key Points:\n"
    
    # Add the original answer
    formatted += "\n" + answer
    
    return formatted
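
`format_comparison_response` emits a table header without any rows. A minimal sketch of filling in one row per aspect is shown below, assuming the per-policy cell text has already been extracted. The `cells` mapping and the function name `build_comparison_table` are hypothetical, and how the cells get populated is left open.

```python
def build_comparison_table(policies, aspects, cells):
    """Render a Markdown table with one column per policy and one row per aspect.

    cells maps (aspect, policy) -> text; missing entries render as "n/a".
    """
    lines = [
        "| Aspect | " + " | ".join(policies) + " |",
        "|" + "|".join(["---"] * (len(policies) + 1)) + "|",
    ]
    for aspect in aspects:
        row = [cells.get((aspect, policy), "n/a") for policy in policies]
        lines.append(f"| {aspect} | " + " | ".join(row) + " |")
    return "\n".join(lines)


# Illustrative values only; real cell text would come from the retrieved policies
example_cells = {("Scope", "Policy A"): "students"}
print(build_comparison_table(["Policy A", "Policy B"], ["Scope", "Enforcement"], example_cells))
```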

In [181]:
def initialize_chatbot():
    """Initialize all components for the enhanced policy chatbot"""
    global intent_classifier, context_manager
    
    # Initialize NLU components if not already initialized
    if 'intent_classifier' not in globals():
        print("Initializing simple intent classifier...")
        intent_classifier = SimpleIntentClassifier(llm_for_chain)
    
    if 'context_manager' not in globals():
        print("Initializing conversation context manager...")
        context_manager = PolicyConversationContext()
    
    # Create the intent-aware retriever wrapper
    intent_aware_retriever = IntentAwareRetrieverWrapper(hybrid_retriever, intent_classifier)
    
    print("Enhanced policy chatbot initialized successfully!")
    return intent_aware_retriever

In [189]:
# Initialize all components
initialize_chatbot()

Enhanced policy chatbot initialized successfully!


<__main__.IntentAwareRetrieverWrapper at 0x7fd1d7e99040>

In [190]:
policy_chatbot("What is the academic integrity policy?")


💬 Query: What is the academic integrity policy?

🤖 Answer:
 The California State University (CSU) operates under a budget system, where the authorizations contained in the previous budget are used until the new one is approved. Each campus has a Budget Review Board, which is convened annually prior to budget preparation. This board includes the Associated Students' President and the Dean of Students, among others.

The role of this review board is to familiarize themselves with the procedures involved in the budget process. While the CSU values academic integrity and encourages responsible behaviors from students, there may be instances where a student's behavior does not align with the Student Conduct Code. In such cases, an educational process is initiated to promote safety and good citizenship, and appropriate consequences may be imposed according to the CSU policy. These consequences can range from receiving a lower grade on an assignment or in a course, being placed on academic p

#### Trying our chatbot with different queries

In [183]:
policy_chatbot("What are the approval procedures for academic freedom-related policies?")


💬 Query: What are the approval procedures for academic freedom-related policies?

🤖 Answer:
 The process for conferring the title of President Emeritus at California State University (CSU) involves a review of an individual's valuable contributions to their university and to this system of higher education, as outlined in Resolution RBOT 07-03-07 for conferring the Title Trustee Emeritus and Agenda Item 2 of the Committee on Educational Policy at the March 16-17, 2010 Board Meeting.

In cases where there is disagreement with this determination, it should be noted on the outside employment disclosure form and escalated to the next level of review. This second and final level of review should be conducted by an independent review committee appointed by the President or Chancellor or his/her designee. The recommendation provided at this level shall be the final determination.

Regarding your additional context, it is important to note that the process does not involve using questionnaire

In [184]:
query = "How many states are there in USA?"
policy_chatbot(query)


💬 Query: How many states are there in USA?

🤖 Answer:
I'm sorry, I can only answer questions related to CSU policies. Could you please rephrase your query accordingly?



In [185]:
policy_chatbot("What is the capital of France?")


💬 Query: What is the capital of France?

🤖 Answer:
I'm sorry, I can only answer questions related to CSU policies. Could you please rephrase your query accordingly?



In [186]:
query = "What is the annual fees of MS in CS fees at San Jose State University?"
policy_chatbot(query)


💬 Query: What is the annual fees of MS in CS fees at San Jose State University?

🤖 Answer:
 According to the provided context, the consultation regarding future assessments of approved tuition schedules within the California State University (CSU) system will begin in August, involving CSSA leadership-elect, students, faculty, and staff. The assessment prepared by the Chancellor's Office will include a comparison of CSU systemwide tuition rates to public four-year institutions of higher education in the United States as reported.

Regarding the Physical Science Replacement Building, Wing A at California State University, Los Angeles (CSULA), it has been prepared in accordance with the requirements of the California Environmental Quality Act (CEQA). The proposed project is expected to have no significant adverse impacts on the environment and will benefit the CSU. The schematic plans for this project have been approved at a cost of $42,595,000 at CCCI 4019.

For the most accurate and u

#### Experimenting with complex queries

The query below requires the chatbot to retrieve information from the Academic Freedom Policy (which discusses research freedom and potential conflicts) and from additional policy documents on research funding or conflicts of interest. The answer must integrate details from more than one policy.

In [187]:
policy_chatbot("How does the university policy address conflicts between faculty research priorities and commercial interests, and what approval procedures are in place to manage these conflicts?")


💬 Query: How does the university policy address conflicts between faculty research priorities and commercial interests, and what approval procedures are in place to manage these conflicts?



KeyboardInterrupt: 

**Complex Query-2:**

The query below demands a comparison between two distinct policies. The policystat chatbot needs to extract guidelines from both the Academic Access Policy and the Student Code of Conduct (or similar documents) and then synthesize them to highlight the differences and their impact on enforcement.

**How it works:**
1. The RAG pipeline retrieves chunks from both policies—semantic search picks up nuanced guidelines while keyword search fetches exact phrases like “faculty responsibilities” or “enforcement.”
2. GraphRAG further enhances the process by connecting sections that use similar language across policies.
3. The LLM then collates these details into a comparative answer with contextual references that indicate the policy source and page number for each piece of information.
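
One common way to merge a semantic result list with a keyword result list is reciprocal rank fusion (RRF). The sketch below is illustrative only: the notebook's `hybrid_retriever` may combine results differently, and the document IDs and the constant `k=60` are assumptions made for the example.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1; higher fused scores come first.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by score (descending), then by ID so ties resolve deterministically
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical chunk IDs from the two retrieval modes
semantic_hits = ["access_p3", "conduct_p1", "access_p7"]
keyword_hits = ["conduct_p1", "conduct_p4", "access_p3"]
print(reciprocal_rank_fusion([semantic_hits, keyword_hits]))
```

Chunks that appear in both lists rise to the top, which is the behavior a comparison query needs: passages supported by both meaning and exact wording are passed to the LLM first.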

In [None]:
policy_chatbot("What are the key differences between the Academic Access Policy and the Student Code of Conduct regarding faculty responsibilities, and how do these differences influence policy enforcement at the institution?")

**Complex Query-3:**
The query below is multi-faceted and requires integrating information from several policies: the Academic Freedom Policy, the tenure guidelines, and the research compliance standards. The answer must present a holistic view that covers both academic independence and regulatory compliance.

**How the chatbot works:**
The chatbot uses the hybrid retrieval module to gather relevant documents from all three policy areas. Semantic retrieval captures conceptual links about “independence” and “compliance,” while keyword retrieval homes in on technical terms like “tenure” or “regulations.” The graph-based component connects these overlapping concepts across multiple documents. With conversational memory, the system preserves context across turns, and the final answer generated by the LLM includes inline citations that reference the exact policy and page number where each requirement is stated.
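
The conversational memory mentioned above can be pictured as a bounded buffer of recent turns. This is a minimal sketch under stated assumptions: `ConversationMemory` is a hypothetical class, and the notebook's own `PolicyConversationContext` (defined in an earlier cell) may store more fields and behave differently.

```python
from collections import deque


class ConversationMemory:
    """Keep only the most recent question/answer turns (hypothetical helper)."""

    def __init__(self, max_turns=5):
        # A deque with maxlen discards the oldest turn automatically when full
        self.turns = deque(maxlen=max_turns)

    def add_turn(self, question, answer):
        self.turns.append({"question": question, "answer": answer})

    def get_last_answer(self):
        return self.turns[-1]["answer"] if self.turns else None

    def as_prompt_context(self):
        """Render stored turns as plain text to prepend to the next prompt."""
        return "\n".join(
            f"User: {t['question']}\nAssistant: {t['answer']}" for t in self.turns
        )


memory = ConversationMemory(max_turns=2)
memory.add_turn("What is the tenure policy?", "It covers faculty review.")
memory.add_turn("Can you summarize that?", "Faculty are reviewed periodically.")
print(memory.as_prompt_context())
```

Capping the number of stored turns keeps the prompt short, at the cost of forgetting older context.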

In [None]:
query = "Considering the university’s policies on academic freedom, tenure, and research compliance, what are the combined requirements for faculty to maintain academic independence while ensuring adherence to institutional regulations? Give me the brief summary of the entire answer in the end."
policy_chatbot(query)