# RAG (Retrieval Augmented Generation)

#### What is RAG?
**Retrieval Augmented Generation** is a way to improve how AI models, such as chatbots, generate text. It combines the model's ability to generate text with a system that finds and uses relevant information from a database or knowledge base.


#### How Does RAG Work?
1. **You ask a question/query**: You ask the chatbot something.
2. **Find relevant information**: The chatbot searches a knowledge base for the information most relevant to your question/query.
3. **Generate an accurate response**: The chatbot uses this information to produce a more accurate answer.
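
The three steps above can be sketched in plain Python. This is only a toy illustration: the mini knowledge base, the word-overlap scoring, and the `toy_generate` stand-in for the LLM are assumptions made for the example, not how the pipeline in this notebook is implemented.

```python
import re

# Hypothetical mini knowledge base (illustrative only)
knowledge_base = [
    "Ollama runs large language models on your local machine.",
    "ChromaDB is an open-source vector database.",
    "LangChain is a framework for building applications with language models.",
]

def tokenize(text):
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def toy_retrieve(query, documents):
    """Step 2: return the document that shares the most words with the query."""
    query_words = tokenize(query)
    return max(documents, key=lambda doc: len(query_words & tokenize(doc)))

def toy_generate(query, context):
    """Step 3: stand-in for the LLM; a real system would send this prompt to a model."""
    return f"Question: {query}\nContext: {context}"

query = "What is ChromaDB?"                      # Step 1: the user asks a question
context = toy_retrieve(query, knowledge_base)    # Step 2: find relevant information
print(toy_generate(query, context))              # Step 3: build the grounded prompt
```

Real systems replace the word-overlap score with embedding similarity, which is what the rest of this notebook builds up to.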

In [10]:
# Installing required packages
!pip install --upgrade pip
!pip install langchain -q
!pip install langchain_community -q
!pip install chromadb -q
!pip install Cmake
!pip install transformers



In [62]:
# Importing all the required libraries and modules
import os
import numpy as np
import pandas as pd
import re
import networkx as nx
import matplotlib.pyplot as plt
import faiss
from pathlib import Path
from typing import Any, List
from pydantic import Field, BaseModel, Extra
from langchain.prompts import PromptTemplate
from langchain_community.llms import Ollama
from langchain.chains.llm import LLMChain
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import TextLoader, PyPDFLoader
from chromadb.utils import embedding_functions
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain_community.embeddings import FakeEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA, LLMChain, ConversationalRetrievalChain
from langchain.chains.summarize import load_summarize_chain
from langchain.memory import ConversationSummaryMemory, ConversationBufferMemory
from langchain.schema import BaseRetriever
from langchain.schema import Document
from langchain_community.vectorstores.utils import filter_complex_metadata
from langchain.chains.combine_documents.refine import RefineDocumentsChain
from langchain.chains.conversational_retrieval.base import ConversationalRetrievalChain
from langchain.retrievers import BM25Retriever, EnsembleRetriever

import warnings
warnings.filterwarnings('ignore')

### LangChain & Ollama

Here we're working with LangChain, a powerful framework for building applications with language models.
LangChain provides utilities for working with various language model providers, integrating embeddings, and even creating chains for more complex applications. 

LangChain is especially useful for creating Retrieval Augmented Generation (RAG) workflows, which improve response accuracy by combining LLMs with real-time data retrieval. It’s open-source, flexible, and widely used across industries for building scalable and efficient AI solutions.

### Setup

We're using Ollama, which is a platform for running LLMs on your local machine. 
To get started with Ollama for our RAG tutorial, follow these simple steps:

1. Open up a terminal window in JupyterLab and type: `ollama serve` (This fires up the Ollama service, which acts like a local AI assistant.)

2. Now, open another terminal window in JupyterLab. Here, we'll download the Mistral model by typing: `ollama pull mistral` (This grabs the Mistral model and makes it ready for use on your computer.)

### Part-1: Retrieval

* In this section, we'll focus on the retrieval part of RAG: first understanding vectorization, then storing and retrieving vectors efficiently. 

#### Vectorization: 

In [9]:
!pip install --upgrade pip
!pip install tf-keras --upgrade -q
!pip install --upgrade transformers numpy sentence-transformers langchain_community -q

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gptqmodel 1.9.0 requires numpy>=2.2.2, but you have numpy 2.1.3 which is incompatible.
gptqmodel 1.9.0 requires protobuf>=5.29.3, but you have protobuf 4.25.6 which is incompatible.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.19.0 requires numpy<2.2.0,>=1.26.0, but you have numpy 2.2.5 which is incompatible.
gptqmodel 1.9.0 requires protobuf>=5.29.3, but you have protobuf 4.25.6 which is incompatible.
numba 0.61.0 requires numpy<2.2,>=1.24, but you have numpy 2.2.5 which is incompatible.

In [8]:
from langchain_community.embeddings import HuggingFaceEmbeddings

# Initializing the vectorizer
# "all-MiniLM-L6-v2": is a good general-purpose embedding model that balances performance and efficiency
vectorizer = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2") # This vectorizer converts text into vectors in embedding space

In [None]:
# Let's take an example of converting "CSUCO" into a series of numbers
vectorizer.embed_query("CSUCO")[0:10] 

Here, we'll write one function that takes 2 strings, vectorizes them, and returns their cosine similarity. 

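**Cosine Similarity**: the dot product of two vectors divided by the product of their norms, $\frac{A \cdot B}{\|A\| \, \|B\|}$. It ranges from -1 to 1, and values close to 1 mean the two vectors point in nearly the same direction.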

In [None]:
def get_similarity_score(word1, word2):
    """ Helper function to vectorize two strings and return the cosine similarity """
    word1_vector = vectorizer.embed_query(word1)
    word2_vector = vectorizer.embed_query(word2)
    dot_product = np.dot(word1_vector, word2_vector)
    norm_vec1 = np.linalg.norm(word1_vector)
    norm_vec2 = np.linalg.norm(word2_vector)
    return dot_product / (norm_vec1 * norm_vec2)
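
As a quick sanity check of the formula, the same computation can be run on small vectors whose answers are easy to verify by hand. The vectors below are made up for illustration, and `cosine_similarity` is a standalone pure-Python equivalent of the helper above rather than part of the notebook.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Same direction, different length: similarity close to 1
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors: similarity 0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
# Opposite directions: similarity -1
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])

print(same_direction, orthogonal, opposite)
```

Note that vector length does not matter: only the angle between the two vectors affects the score.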

In [None]:
# Observe the similarity scores
# From similarity score you can quantify how similar both words are.

print("Similarity of 'colour' and 'color': ",get_similarity_score("colour","color"))
print("Similarity of 'cars' and 'car': ",get_similarity_score("cars","car"))
print("Similarity of 'cars' and 'truck': ",get_similarity_score("cars","truck"))

Which of the words in the list `words` are most related to the word **'car'**? The function `similarity_list` takes a list of words and returns each word with its similarity score, sorted from highest to lowest.

In [None]:
def similarity_list(words):
    """ Helper function that returns a list of (word, cosine similarity) tuples for the given words relative to 'car', sorted by similarity in descending order. """
    car = "car"
    results = [(word, get_similarity_score(car, word)) for word in words]
    return sorted(results, key=lambda x: x[1], reverse=True)

In [None]:
words = ["cars", "truck", "bike", "trees", "mountains"]
similarity_list(words)

Here, we'll write a function that matches a query with its most related/relevant text. 

In [None]:
# Each query below has an appropriate text that allows you to answer the question.
# Example list of existing query-text pairs
qa_pairs = [
    {"query": "What is RAG?", "text": "Retrieval Augmented Generation improves response accuracy by combining retrieval with generation."},
    {"query": "How does vectorization help?", "text": "Vectorization converts text into numerical representations that models can work with."},
    {"query": "What is the importance of embeddings?", "text": "Embeddings allow machine learning models to understand text by converting words into vectors."},
    {"query": "How do retrieval systems work?", "text": "Retrieval systems search large databases and find the most relevant documents based on the query."},
    {"query": "Why is local LLM deployment beneficial?", "text": "Deploying LLMs locally can reduce latency and improve data privacy."}
]

In [None]:
def match_queries_with_pairs(queries, qa_pairs):
    """ Helper function that matches each query with the most related text from qa_pairs based on cosine similarity """
    matched_results = []
    for query in queries:
        best_match = None
        best_score = -1
        for pair in qa_pairs:
            score = get_similarity_score(query, pair["text"])
            if score > best_score:
                best_score = score
                best_match = pair
        matched_results.append({"query": query, "matched_text": best_match["text"], "similarity": best_score})
    return matched_results

In [None]:
queries = ["What is RAG?", "How are embeddings used?", "Benefits of local LLMs?"]
matches = match_queries_with_pairs(queries, qa_pairs)
for match in matches:
    print(match)

In [None]:
# Now separate the queries and text

queries = ["What is RAG?",
            "How does vectorization help?",
            "What is the importance of embeddings?",
            "How do retrieval systems work?",
            "Why is local LLM deployment beneficial?"]

texts = ["Retrieval Augmented Generation improves response accuracy by combining retrieval with generation.",
         "Vectorization converts text into numerical representations that models can work with.",
         "Embeddings allow machine learning models to understand text by converting words into vectors.",
         "Retrieval systems search large databases and find the most relevant documents based on the query.",
         "Deploying LLMs locally can reduce latency and improve data privacy."]

In [None]:
def match_queries_with_texts(queries, texts):
    """ Helper function that matches each query with the most related text based on cosine similarity """
    
    # Calculate similarities between each query and text
    similarities = np.zeros((len(queries), len(texts)))
    
    for i, query in enumerate(queries):
        for j, text in enumerate(texts):
            similarities[i, j] = get_similarity_score(query, text)
    
    # Match each query to the text with the highest similarity
    matches = {}
    for i, query in enumerate(queries):
        best_match_idx = np.argmax(similarities[i])
        matches[query] = texts[best_match_idx]
    
    return matches

In [None]:
# Let's shuffle the queries and texts 

import random
random.shuffle(queries)
random.shuffle(texts)

match_queries_with_texts(queries, texts)

### Database: ChromaDB

Let us see how we can store these for efficient vector retrieval. There are many storage options available, but here we will use **ChromaDB**, an open-source vector database.

In LangChain, we can wrap the database in a **retriever object**, which exposes a **uniform query interface**: calling `retriever.invoke(query)` returns the stored documents most similar to the query.
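
To make the idea of a vector store concrete, here is a minimal in-memory sketch. `ToyVectorStore` is a hypothetical class written for this example and the 2-D vectors are made up; it does a brute-force cosine search, whereas real vector databases add indexing and persistence on top of the same idea.

```python
import math

class ToyVectorStore:
    """Keeps (vector, text, metadata) records and returns the top-k most similar."""

    def __init__(self):
        self.records = []

    def add(self, vector, text, metadata=None):
        self.records.append((vector, text, metadata or {}))

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))

    def search(self, query_vector, k=1):
        # Rank every stored record by similarity to the query (brute force)
        ranked = sorted(
            self.records,
            key=lambda rec: self._cosine(query_vector, rec[0]),
            reverse=True,
        )
        return [(text, meta) for _, text, meta in ranked[:k]]

store = ToyVectorStore()
store.add([1.0, 0.0], "text about retrieval", {"id": 0})
store.add([0.0, 1.0], "text about privacy", {"id": 1})
store.add([0.7, 0.7], "text about both", {"id": 2})

top = store.search([0.9, 0.1], k=2)
print(top)
```

The `k` argument plays the same role as `search_kwargs={"k": ...}` on a LangChain retriever: it caps how many of the closest records come back.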

In [None]:
# First just to give it a try, store the already defined queries and texts and load them into ChromaDB

ids = list(range(len(texts)))
db = Chroma.from_texts(texts, vectorizer, metadatas=[{"id": i} for i in ids])

retriever = db.as_retriever(search_kwargs={"k": 1})

texts

In [None]:
# Let's try to get texts and the metadata

retriever.invoke('What is RAG?')

In [None]:
retriever.invoke('What is the importance of embeddings?')

Now, I'll apply the same retrieval logic to a file `policy_report.csv` that contains various information (Title, Area and Owner) of several policies. 

In [None]:
def create_policy_retriever(csv_file, top_k=10, combine_fields=False):
    """ Helper function that reads a CSV file containing policy data (title, area and owner) and returns a ChromaDB retriever. """
    try:
        # Try reading with the default engine and UTF-8 encoding
        df = pd.read_csv(csv_file, encoding="utf-8", sep="\t", on_bad_lines='skip')
    except UnicodeDecodeError:
        print("Failed to decode with encoding utf-8. Trying 'utf-16' instead.")
        df = pd.read_csv(csv_file, encoding="utf-16", sep="\t", on_bad_lines='skip')

    required_columns = {'Title', 'Area', 'Owner'}
    if not required_columns.issubset(df.columns):
        raise ValueError(f"CSV file must contain the columns: {required_columns}")

    texts = []
    metadatas = []

    for _, row in df.iterrows():
        if combine_fields:
            # Combine all fields into the text for richer context
            text = f"Title: {row['Title']}\nArea: {row['Area']}\nOwner: {row['Owner']}"
        else:
            # Use only the title as the main text
            text = row['Title']
        metadata = {"Area": row['Area'], "Owner": row['Owner']}
        texts.append(text)
        metadatas.append(metadata)

    # Create the vector store with the provided texts and metadata
    db = Chroma.from_texts(texts, vectorizer, metadatas=metadatas)
    
    # Convert the vector store into a retriever object with the specified top_k results.
    retriever = db.as_retriever(search_kwargs={"k": top_k})
    
    # Extra feature: Log the number of policies loaded.
    print(f"Loaded {len(texts)} policies into the vector store.")

    return retriever

In [None]:
retriever = create_policy_retriever("policy_report.csv", top_k=5, combine_fields=True)

In [None]:
retriever.invoke('Budget Oversight')

In [None]:
retriever.invoke('Accounting for Banking Activity')

In [None]:
def query_policy(retriever, query):
    """ Helper function to query the retriever for a policy and print the matching policy details in a table format. """
    results = retriever.invoke(query)
    policies = []
    for doc in results:
        metadata = doc.metadata
        title = doc.page_content
        # Use the correct case keys based on your metadata (e.g., 'Area' and 'Owner')
        area = metadata.get('Area')
        owner = metadata.get('Owner')
        policies.append({"Title": title, "Area": area, "Owner": owner})

    df = pd.DataFrame(policies)
    print(df.to_string(index=False))

In [None]:
query_policy(retriever, "Academic and Student Affairs")

In [None]:
query_policy(retriever, "Business and Finance")

In [None]:
def get_policy_titles_by_area(retriever, area_query, max_results=1000, sort_results=False, return_full_info=False):
    """ Helper function that takes an area query string and finds the policies whose metadata "Area" matches it.
    Returns a list of policy titles (str), or a list of dictionaries with full info when return_full_info is True. """
    
    # Query the vector store directly; its similarity_search accepts the metadata filter and k explicitly
    results = retriever.vectorstore.similarity_search("", k=max_results, filter={"Area": area_query})
    
    policies = []
    for doc in results:
        if doc.page_content.startswith("Title:"):
            first_line = doc.page_content.split("\n")[0]
            title = first_line.replace("Title: ", "").strip()
        else:
            title = doc.page_content.strip()
        policy_info = {"Title": title, "Area": doc.metadata.get("Area"), "Owner": doc.metadata.get("Owner")}
        policies.append(policy_info)
    
    if sort_results:
        policies = sorted(policies, key=lambda x: x["Title"])
    
    print(f"Found {len(policies)} matching policies for area '{area_query}'.")
    
    if return_full_info:
        return policies
    else:
        return [p["Title"] for p in policies]

In [None]:
area = "Academic and Student Affairs"
titles = get_policy_titles_by_area(retriever, area, max_results=100, sort_results=True, return_full_info=False)
print("Policy Titles for area '{}':".format(area))
for t in titles:
    print("-", t)

In [None]:
area = "Business and Finance"
titles = get_policy_titles_by_area(retriever, area, max_results=100, sort_results=True, return_full_info=True)
print("Policy details for area '{}':".format(area))
for t in titles:
    print("-", t)

### Chunking

The data we just looked at was conveniently split into rows, with each row representing a distinct and meaningful chunk of information. This straightforward structure makes it easier to process and analyze the text data.

However, when dealing with larger or more complex documents, the text is often not so neatly structured. In such cases, it’s essential to handle the formatting and structure efficiently. We can break down a not-so-simply formatted file into manageable chunks using LangChain's `TextLoader` and `RecursiveCharacterTextSplitter`. This allows us to preprocess and chunk the data effectively for further use in our RAG pipeline.
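
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified sliding-window chunker. `chunk_text` is a hypothetical helper written for this illustration; `RecursiveCharacterTextSplitter` additionally tries to split on natural separators (paragraphs, lines, words) before falling back to raw characters, which this sketch ignores.

```python
def chunk_text(text, chunk_size=20, chunk_overlap=5):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz0123456789"  # 36 characters
chunks = chunk_text(sample, chunk_size=20, chunk_overlap=5)
for chunk in chunks:
    print(len(chunk), chunk)
```

Each chunk repeats the last few characters of the previous one, so a sentence that straddles a boundary is still fully visible in at least one chunk.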

#### Example Code

```python
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load text data from a file
loader = TextLoader("your_document.txt")  # Replace with your file path
documents = loader.load()

# Initialize the text splitter with a defined chunk size and overlap
# - chunk_size: The maximum number of characters in each chunk.
# - chunk_overlap: The number of overlapping characters between chunks.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)

# Split the loaded documents into chunks
chunks = text_splitter.split_documents(documents)

# Display the number of chunks created
print(f"Total chunks created: {len(chunks)}")
```

Let me load a policy document here and apply the same retrieval logic to it.

In [None]:
!pip install pypdf

source_url = "http://calstate.policystat.com/policy/17347260/"
pdf_file = "Policies/Academic Freedom Policy.pdf"
pdf_loader = PyPDFLoader(pdf_file)
documents = pdf_loader.load()
print(f"Loaded {len(documents)} document(s) from '{pdf_file}'.")

In [None]:
# Small chunk size
small_text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20
)
small_chunks = small_text_splitter.split_documents(documents)
print(f"Document has been split into {len(small_chunks)} small chunks.")

# Large chunk size
large_text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)
large_chunks = large_text_splitter.split_documents(documents)
print(f"Document has been split into {len(large_chunks)} large chunks.")

In [None]:
# Creating ChromaDB retriever for small chunks
db_small = Chroma.from_documents(small_chunks, vectorizer)
retriever_small = db_small.as_retriever(search_kwargs={"k": 2})

# Creating ChromaDB retriever for large chunks
db_large = Chroma.from_documents(large_chunks, vectorizer)
retriever_large = db_large.as_retriever(search_kwargs={"k": 2})

In [None]:
def chunk_retrieval(query, retriever):
    """ Helper function that retrieves relevant chunks for the given query using the provided retriever. """
    results = retriever.invoke(query)
    print(f"Query: {query} \n")
    print(f"Retrieved {len(results)} results from retriever.\n")
    
    for i, doc in enumerate(results, start=1):
        chunk_text = doc.page_content
        chunk_length = len(chunk_text)
        print(f"[Result {i}] - Length: {chunk_length} characters")
        print(chunk_text)
        print("-" * 80)
    
    return results

In [None]:
# TODO: revisit this comparison

query = "How does the policy address controversial topics in the classroom?"

# Fetching the relevant information using retriever_small
chunk_retrieval(query, retriever_small)

# Fetching the relevant information using retriever_large
chunk_retrieval(query, retriever_large)

### RAG With LLMs

In [None]:
# Chunking while preserving page metadata
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)

In [None]:
# Splitting each page into smaller chunks and preserving metadata
for doc in documents:
    doc.metadata["source"] = source_url

all_chunks = []
for doc in documents:
    # Using metadata from the original page (like 'page' or 'source')
    chunks = text_splitter.split_documents([doc])
    for chunk in chunks:
        chunk.metadata = doc.metadata.copy()  # Copy the page's metadata so chunks do not share one dict
        all_chunks.append(chunk)

print(f"Total chunks after splitting: {len(all_chunks)}")

In [None]:
# Vector store from chunks (with page metadata now preserved!)
db = Chroma.from_documents(all_chunks, vectorizer)
retriever = db.as_retriever(search_kwargs={"k": 3})

In [None]:
llm = Ollama(model="mistral", temperature=0.7)
llm.invoke("Hello there, how are you doing?")

In [None]:
rag_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="map_reduce",
    retriever=retriever,
    return_source_documents=True
)

In [None]:
query = "What responsibilities accompany academic freedom, as outlined in the policy?"
result = rag_chain.invoke({"query": query})

In [None]:
result

In [None]:
print(f"Query: {query} ")
print(f"Answer: {result['result']}")

#### Adding custom prompts and refine question prompts

In [None]:
question_prompt_template = """
You're a highly knowledgeable assistant. I want you to answer the following question using the context provided.
Question: {question}
Context: {context}
Answer:
"""

refine_prompt_template = """
The initial answer is: {existing_answer}
Additional context: {context}
Please refine and elaborate on the answer, providing clear details and citing evidence where applicable.
Refined answer: 
"""

In [None]:
question_prompt = PromptTemplate(template=question_prompt_template, input_variables=["question", "context"])
refine_prompt = PromptTemplate(template=refine_prompt_template, input_variables=["existing_answer", "context"])

In [None]:
# Building the advanced RAG chain using RetrievalQA with the "refine" strategy
rag_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="refine",
    retriever=retriever,
    return_source_documents=True,
    chain_type_kwargs={
        "question_prompt": question_prompt,
        "refine_prompt": refine_prompt,
        "document_variable_name": "context"
    }
)
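
Conceptually, the "refine" strategy configured above answers from the first retrieved chunk and then revises that answer once per additional chunk. The sketch below illustrates the loop with a hypothetical `fake_llm` stand-in and made-up chunk names; it is a simplification for intuition, not LangChain's implementation.

```python
def refine_answer(question, documents, llm):
    """Answer from the first document, then revise the answer once per remaining document."""
    first, *rest = documents
    answer = llm(f"Question: {question}\nContext: {first}\nAnswer:")
    for doc in rest:
        answer = llm(
            f"The initial answer is: {answer}\n"
            f"Additional context: {doc}\n"
            "Refined answer:"
        )
    return answer

# Hypothetical stand-in for a model: records each prompt and returns a numbered draft
calls = []

def fake_llm(prompt):
    calls.append(prompt)
    return f"draft {len(calls)}"

final = refine_answer("What is RAG?", ["chunk A", "chunk B", "chunk C"], fake_llm)
print(final, len(calls))
```

The loop makes one model call per retrieved chunk, in sequence, so latency grows with `k`; in exchange, each chunk gets a chance to correct or extend the running answer.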

In [None]:
def summarize_sources(source_documents):
    """ Helper function that summarizes the source documents. """
    summary_chain = load_summarize_chain(llm, chain_type="map_reduce")
    summary = summary_chain.invoke({"input_documents": source_documents})["output_text"]
    return summary

In [None]:
# Defining RAG pipeline
def rag_pipeline(query):
    """ Helper function that runs a RAG pipeline with custom refine prompts, evidence summarization, and references of source documents. """
    result = rag_chain.invoke({"query": query})
    answer = result['result']
    source_docs = result['source_documents']

    print(f"Query: {query}")
    print(f"Answer: {answer}\n")
    print("-" * 100)
    summary = summarize_sources(source_docs)
    print(f"Summary from Source Documents: {summary}")
    print("-" * 100)

    print("\nReferences (click to view source page):\n")
    for i, doc in enumerate(source_docs, start=1):
        metadata = doc.metadata
        page = metadata.get("page")
        source = metadata.get("source")

        if source and page is not None:
            link = f"{source}#page={page + 1}"
            print(f"[{i}] Page {page + 1}: {link}")
        elif source:
            print(f"[{i}] Source: {source}")
        else:
            print(f"[{i}] Source: Unknown")

    print("-" * 100)

In [None]:
query = "Who is responsible for overseeing the academic freedom policy and what are their roles?"
rag_pipeline(query)

In [None]:
query = "What procedures are specified for the review and revision of the academic freedom policy?"
rag_pipeline(query)

In [None]:
query = "What are the key principles that define academic freedom as outlined in the policy?"
rag_pipeline(query)

# RAG with multiple Policies 

In [12]:
#############################################################
# STEP 1: Load the CSV Report with Policy Metadata
#############################################################

def load_policy_report(csv_file: str) -> pd.DataFrame:
    """
    Loads the report.csv that includes policy titles and reference URLs.
    Expected CSV columns: 'Title', 'URL', and potentially others.
    """
    try:
        # Try reading with the default engine and UTF-8 encoding
        df = pd.read_csv(csv_file, encoding="utf-8", sep="\t", on_bad_lines='skip')
    except UnicodeDecodeError:
        print("Failed to decode with encoding utf-8. Trying 'utf-16' instead.")
        df = pd.read_csv(csv_file, encoding="utf-16", sep="\t", on_bad_lines='skip')

    return df

In [13]:
# Load the CSV file (adjust the path if needed)
report_df = load_policy_report("report.csv")
print(f"Report loaded: {len(report_df)} policies found in report.csv.")

Failed to decode with encoding utf-8. Trying 'utf-16' instead.
Report loaded: 429 policies found in report.csv.


In [14]:
def map_policy_metadata(report_df: pd.DataFrame) -> dict:
    """ Helper function that creates a mapping where keys are lowercase policy titles (from the CSV) and values are the corresponding URL. """
    mapping = {}
    for _, row in report_df.iterrows():
        title = row["Title"].strip().lower()
        url = row["URL"].strip()
        mapping[title] = url
    return mapping

# Load CSV and create mapping.
report_df = load_policy_report("report.csv")
policy_mapping = map_policy_metadata(report_df)
print(f"[INFO]: Report loaded with {len(policy_mapping)} policies.")

Failed to decode with encoding utf-8. Trying 'utf-16' instead.
[INFO]: Report loaded with 428 policies.


In [15]:
# First, load all the policy documents (PDFs) and tag each page with its policy metadata; chunking happens in a later step.
def load_policies(folder_path: str, policy_mapping: dict) -> List[Document]:
    """ Helper function that loads all PDFs from the folder and updates each Document's metadata with the source file name and the matching policy title and URL from the CSV mapping. """
    all_docs = []
    for pdf_path in Path(folder_path).glob("*.pdf"):
        loader = PyPDFLoader(str(pdf_path))
        docs = loader.load()  
        file_title = pdf_path.stem.lower()
        matched_url = None
        matched_policy_title = None

        for title_key, url in policy_mapping.items():
            if title_key in file_title:
                matched_url = url
                matched_policy_title = title_key  
                break
        for doc in docs:
            doc.metadata["source_file"] = pdf_path.name
            if matched_url:
                doc.metadata["policy_title"] = matched_policy_title
                doc.metadata["policy_url"] = matched_url
            else:
                # Fallback if no match is found in the CSV.
                doc.metadata["policy_title"] = pdf_path.stem
                doc.metadata["policy_url"] = None
        all_docs.extend(docs)
    return all_docs

raw_documents = load_policies("Policies/", policy_mapping)
print(f"Loaded {len(raw_documents)} document pages from policies.")

Loaded 3139 document pages from policies.


In [16]:
print(f"[INFO]: Loaded {len(raw_documents)} documents (pages) from the 'Policies/' folder.")

[INFO]: Loaded 3139 documents (pages) from the 'Policies/' folder.


#### Update Document Metadata from CSV Report

In [17]:
# def update_doc_metadata_with_report(docs: List[Document], report_df: pd.DataFrame) -> List[Document]:
#     """ Helper function that uses the 'source_file' and the CSV report to update metadata for each policy document. """
#     for doc in docs:
#         source_file = doc.metadata.get("source_file", "").lower()
#         matches = report_df[report_df["Title"].str.lower().str.contains(source_file, na=False)]
#         if not matches.empty:
#             row = matches.iloc[0]
#             doc.metadata["policy_title"] = row["Title"]
#             doc.metadata["policy_url"] = row["URL"]
#         else:
#             doc.metadata["policy_title"] = source_file
#             doc.metadata["policy_url"] = None
#     return docs

In [18]:
# raw_documents = update_doc_metadata_with_report(raw_documents, report_df)

In [19]:
# Chunking each document (policy) page while preserving the metadata
def chunk_documents(raw_docs: list[Document], chunk_size: int = 500, overlap: int = 50) -> list[Document]:
    """ Helper function that chunks each document and preserves its metadata. """
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=overlap)
    chunks = []
    for doc in raw_docs:
        doc_chunks = splitter.split_documents([doc])
        for chunk in doc_chunks:
            chunk.metadata = doc.metadata.copy()  # Preserve original metadata
            chunks.append(chunk)
    return chunks

In [20]:
chunked_docs = chunk_documents(raw_documents)
print(f"[INFO]: Chunked into {len(chunked_docs)} segments.")

[INFO]: Chunked into 18115 segments.


In [21]:
def simple_filter_metadata(metadata: dict, allowed_types=(str, int, float, bool)) -> dict:
    """ 
    Helper function that filters a metadata dictionary so that each value is of type str, int, float, or bool.
    If a value is not one of these types (and not None), it's converted to a string.
    Keys with None values are dropped.
    """
    filtered = {}
    for key, value in metadata.items():
        if value is None:
            continue  # Skip None values
        if isinstance(value, allowed_types):
            filtered[key] = value
        else:
            # Optionally, convert the value to a string.
            filtered[key] = str(value)
    return filtered

In [22]:
# Building Semantic & Keyword based Retriever

# After chunking, filter the metadata of each document chunk:
# keys with None values are dropped and non-simple values are converted to strings.
for doc in chunked_docs:
    doc.metadata = simple_filter_metadata(doc.metadata)

db_semantic = Chroma.from_documents(chunked_docs, vectorizer)
semantic_retriever = db_semantic.as_retriever(search_kwargs={"k": 5})
print("[INFO]: Semantic retriever set up.")

[INFO]: Semantic retriever set up.


In [23]:
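# Keyword search with BM25. FakeEmbeddings produces random vectors, so a FAISS index
# built on them cannot rank chunks by keyword overlap; BM25Retriever can.
# Note: BM25Retriever requires the `rank_bm25` package (pip install rank_bm25).
from langchain_community.retrievers import BM25Retriever

keyword_retriever = BM25Retriever.from_documents(chunked_docs)
keyword_retriever.k = 5
print("[INFO]: Keyword retriever set up.")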

[INFO]: Keyword retriever set up.
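
For reference, keyword retrieval in the BM25 sense ranks chunks by exact term overlap rather than by embedding similarity. The sketch below is a minimal pure-Python version of Okapi BM25 scoring; the sample documents are made up, and `k1=1.5` and `b=0.75` are commonly used defaults assumed here rather than values taken from this notebook.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score every document against the query with the Okapi BM25 formula."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency weight
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # Term frequency saturates and is normalized by document length
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(toks) / avg_len))
        scores.append(score)
    return scores

docs = [
    "Emergency grant allocation for students",
    "Accounting for banking activity",
    "Budget oversight policy",
]
scores = bm25_scores("budget oversight", docs)
best = docs[scores.index(max(scores))]
print(best)
```

Documents that share no terms with the query score exactly zero, which is the main behavioral difference from embedding search, where every document receives some similarity score.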


#### Building GraphRAG Component

Here, I'm building a graph over the document chunks using NetworkX, where nodes represent chunks and edges connect similar chunks. This graph helps propagate context across policy boundaries. 

The graph is built with approximate nearest neighbor search via FAISS, which yields a sparse graph over our policy document chunks. This avoids computing all pairwise similarities (prohibitively expensive for 500+ PDFs) by retrieving only the nearest neighbors of each chunk. 
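
The graph idea can be illustrated on a tiny scale without FAISS or NetworkX. The sketch below uses made-up 2-D vectors, an exact all-pairs search (fine for four vectors, and exactly what the FAISS index avoids at scale), and a plain adjacency dictionary; the final neighbor-expansion step is an assumption about how such a graph might be used to pull in related context.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def build_knn_graph(vectors, k_neighbors=2, threshold=0.9):
    """Connect each vector to its k most similar vectors whose similarity >= threshold."""
    graph = {i: set() for i in range(len(vectors))}
    for i, vec in enumerate(vectors):
        sims = [(cosine(vec, other), j) for j, other in enumerate(vectors) if j != i]
        for sim, j in sorted(sims, reverse=True)[:k_neighbors]:
            if sim >= threshold:
                graph[i].add(j)
                graph[j].add(i)  # undirected edge
    return graph

# Made-up "embeddings": chunks 0-2 point in similar directions, chunk 3 does not
vectors = [[1.0, 0.0], [0.95, 0.1], [0.9, 0.3], [0.0, 1.0]]
graph = build_knn_graph(vectors)

# Expanding a retrieved chunk with its graph neighbors pulls in related context
retrieved = 0
expanded = sorted({retrieved} | graph[retrieved])
print(graph)
print(expanded)
```

Chunk 3 ends up isolated because none of its similarities reach the threshold, which is how a high threshold keeps the graph sparse.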

In [24]:
'''
Approach: I'm using FAISS IndexIVFFlat to perform approximate nearest neighbor search. An edge is added between two nodes if their inner product (equal to cosine similarity for unit-length embeddings) meets or exceeds the specified threshold.
'''

def build_policy_graph(
    docs: list[Document], 
    vectorizer, 
    k_neighbors: int = 5, 
    threshold: float = 0.9, 
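    nlist: int = 100
) -> nx.Graph:
    """ Helper function that builds a similarity graph: one node per chunk, with an edge between approximate nearest neighbors whose similarity is at least `threshold`. """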

    # Computing embeddings for all document chunks in one batch
    embeddings = np.array(
        vectorizer.embed_documents([doc.page_content for doc in docs]), dtype="float32"
    )
    # Normalizing to unit length so that the inner product equals cosine similarity
    faiss.normalize_L2(embeddings)
    dim = embeddings.shape[1]

    # Building an approximate FAISS index with inner-product
    quantizer = faiss.IndexFlatIP(dim)
    index = faiss.IndexIVFFlat(quantizer, dim, nlist, faiss.METRIC_INNER_PRODUCT)
    index.train(embeddings)
    index.add(embeddings)

    # Retrieving approximate k_neighbors for each chunk (include self in results)
    # returns: D = inner-product similarities, I = indices
    D, I = index.search(embeddings, k_neighbors + 1)

    # Building a graph based on neighbors with similarity above threshold
    G = nx.Graph()
    
    # Adding nodes with metadata (source and page)
    for idx, doc in enumerate(docs):
        G.add_node(idx, doc=doc)

    # For each document, adding edges from its approximate neighbors (skipping self)
    for i, (neighbors, distances) in enumerate(zip(I, D)):
        for j, sim in zip(neighbors[1:], distances[1:]):
            if sim >= threshold:
                # Adding an edge with weight=similarity
                G.add_edge(i, j, weight=float(sim))

    return G

In [25]:
# Building the graph on our chunked documents 

policy_graph = build_policy_graph(chunked_docs, vectorizer, k_neighbors=5, threshold=0.9, nlist=100)
print(f"[INFO]: Approximate graph created with {policy_graph.number_of_nodes()} nodes and {policy_graph.number_of_edges()} edges.")

[INFO]: Approximate graph created with 18115 nodes and 7435 edges.
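
A quick sanity check on these numbers, as a sketch: the average degree of an undirected graph is 2E/N, and since each edge touches at most two nodes, at least N - 2E nodes have no edge at all. The helper names below are my own.

```python
def average_degree(n_nodes: int, n_edges: int) -> float:
    """Average number of edges per node in an undirected graph."""
    return 2 * n_edges / n_nodes


def min_isolated_nodes(n_nodes: int, n_edges: int) -> int:
    """Lower bound on nodes without any edge (each edge covers at most two nodes)."""
    return max(0, n_nodes - 2 * n_edges)


print(round(average_degree(18_115, 7_435), 3))  # 0.821
print(min_isolated_nodes(18_115, 7_435))        # 3245
```

With a threshold of 0.9 the graph is therefore very sparse: on average a chunk has fewer than one neighbor, and at least 3,245 chunks have none, so graph expansion adds nothing for those chunks.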


In [26]:
def print_policy_graph_info(graph):
    ''' Helper function that prints a summary of the Policy Graph. '''
    print("Graph Summary:")
    print(f"  Number of nodes: {graph.number_of_nodes()}")
    print(f"  Number of edges: {graph.number_of_edges()}")
    print("\nSample Nodes (first 5):")
    for node in list(graph.nodes(data=True))[:5]:
        print(node)
    print("\nSample Edges (first 5):")
    for edge in list(graph.edges(data=True))[:5]:
        print(edge)

print_policy_graph_info(policy_graph)

Graph Summary:
  Number of nodes: 18115
  Number of edges: 7435

Sample Nodes (first 5):
(0, {'doc': Document(metadata={'producer': 'Prince 12.5.1 (www.princexml.com)', 'creator': 'PolicyStat', 'creationdate': '', 'subject': 'The California State University', 'author': 'Grommo, April: Asst VC, Enroll Mgmt Srvcs', 'title': '2021 – 2022 Emergency Grant Allocation', 'source': 'Policies/2021 - 2022 Emergency Grant Allocation.pdf', 'total_pages': 3, 'page': 0, 'page_label': '1', 'source_file': '2021 - 2022 Emergency Grant Allocation.pdf', 'policy_title': '2021 - 2022 Emergency Grant Allocation'}, page_content='COPY\nStatus Active PolicyStat ID 10719972 \nOrigination 12/7/2021 \nEffective 12/7/2021 \nReviewed 12/7/2021 \nNext Review 12/7/2023 \nOwner April Grommo: \nAsst VC, Enroll \nMgmt Srvcs \nArea Academic and \nStudent Affairs \n2021 – 2022 Emergency Grant Allocation \nPolicy \nThis policy provides procedural guidance related to the allocation of $30 million of one-time funding \nissued

In [47]:
# question_prompt_template = """
# You are a highly knowledgeable assistant with expertise in CSU policies. Your task is to answer the following question using the context provided.
# IMPORTANT: If the question is not related to CSU policies, respond with: 
# "I'm sorry, I can only answer questions related to CSU policies. Could you please rephrase your query accordingly?"
# Question: {question}
# Context: {context}
# Answer:
# """

# question_prompt_template = """
# You are a highly knowledgeable assistant with deep expertise in CSU policies. Your responses must strictly pertain to CSU policies and internal policy matters.
# IMPORTANT: If the user's question is not about CSU policies or policy-related information, immediately respond with:
# "I'm sorry, I can only answer questions related to CSU policies. Could you please rephrase your query accordingly?"
# Do not provide any additional content in that case.
# Otherwise, answer the following question using the context provided.
# Question: {query}
# Context: {context}
# Answer:
# """


# question_prompt = PromptTemplate(
#     input_variables=["question", "context"],
#     template="""
# You are a highly knowledgeable assistant with deep expertise in CSU policies.
# Only answer questions related to CSU policies.
# If the question is not related to CSU policies, respond with:
# "I'm sorry, I can only answer questions related to CSU policies. Could you please rephrase your query accordingly?"

# Question: {question}
# Context: {context}
# Answer:
# """,
# )

# question_prompt_template = """
# You are a CSU Policy Assistant. You are only allowed to answer questions directly related to California State University policies using official policy documents as your source.
# If the user's question is not related to CSU policies, respond exactly with:
# "I'm sorry, I can only answer questions related to CSU policies. Could you please rephrase your query accordingly?"
# If the user's question is unclear, ask for clarification.
# Otherwise, answer the question using the context provided.

# Question: {question}
# Context: {context}
# Answer:
# """

question_prompt_template = """
You are a CSU Policy Assistant. Your job is to answer ONLY questions that are directly about California State University (CSU) policies, using official policy documents as your source.

Step 1: Before answering, check if the user's question is about CSU policies.
- If the question is NOT about CSU policies, respond ONLY with:
"I'm sorry, I can only answer questions related to CSU policies. Could you please rephrase your query accordingly?"
- Do NOT attempt to answer or provide any information outside CSU policies.

Step 2: If the question IS about CSU policies:
- If unclear, ask the user to clarify.
- Otherwise, answer using the provided context.

Question: {question}
Context: {context}
Answer:
"""



In [48]:
# refine_prompt_template = """
# The initial answer is: {existing_answer}
# Additional context: {context}
# IMPORTANT: Ensure the question is clearly related to CSU policies. 
# If it is not, respond with: 
# "I'm sorry, I can only answer questions related to CSU policies. Could you please ask a question related to CSU policies?"
# Otherwise, refine and elaborate on the answer, providing clear details and citing evidence where applicable.
# Refined answer:
# """
# refine_prompt_template = """
# The initial answer is: {existing_answer}
# Additional context: {context}

# IMPORTANT: Before refining, ensure the question is clearly related to CSU policies.
# If you determine that the query is off-topic, immediately respond with:
# "I'm sorry, I can only answer questions related to CSU policies. Could you please ask a question related to CSU policies?"
# Otherwise, please refine and elaborate on the answer, providing clear details and citing evidence as needed.
# Refined answer:
# """


# refine_prompt_template = """
# [INTERNAL: If the question contains "summarize", "clarify", or "rephrase", do NOT output the off-topic message; simply refine the previous answer.]

# The initial answer is: {existing_answer}
# Additional context: {context}

# IMPORTANT: If the new context does not clearly indicate that the question pertains to CSU policies and no follow-up instruction is present, respond with exactly:
# "I'm sorry, I can only answer questions related to CSU policies. Could you please ask a question related to CSU policies?"
# Otherwise, refine and expand the answer.

# Refined answer:
# """


refine_prompt_template = """
You are a CSU Policy Assistant. Your response must be strictly based on CSU policies.

Step 1: Check if the question or additional context is about CSU policies.
- If NOT, respond ONLY with:
"I'm sorry, I can only answer questions related to CSU policies. Could you please ask a question related to CSU policies?"

Step 2: If it IS about CSU policies, refine and expand the answer using the context provided.

Question: {question}
The initial answer is: {existing_answer}
Additional context: {context}
Refined answer:
"""


In [31]:
# Creating PromptTemplate objects
question_prompt = PromptTemplate(
    template=question_prompt_template,
    input_variables=["question", "context"]
)
# The refine prompt also receives the question, since Step 1 of its template asks the model to check it
refine_prompt = PromptTemplate(
    template=refine_prompt_template,
    input_variables=["question", "existing_answer", "context"]
)

In [32]:
# Subclass RetrievalQA to allow extra chain_type_kwargs.
class CustomRetrievalQA(RetrievalQA):
    class Config:
        extra = Extra.allow

In [65]:
def is_csu_policy_question(query: str) -> bool:
    """Check if the query relates to CSU policies using keywords."""
    csu_keywords = ["CSU", "California State University", "policy", "academic integrity", "code of conduct"]
    return any(keyword.lower() in query.lower() for keyword in csu_keywords)

# Modify your retriever to return empty results for non-policy questions
class PolicyFilteredRetriever(EnsembleRetriever):
    def get_relevant_documents(self, query: str):
        if not is_csu_policy_question(query):
            return []  # Return empty list for non-policy questions
        return super().get_relevant_documents(query)
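
To see how the keyword check behaves, here is a small self-contained sketch that repeats the function from the cell above and runs it on a few example questions of my own. It shows that plain substring matching produces both false positives and false negatives.

```python
def is_csu_policy_question(query: str) -> bool:
    """Check if the query relates to CSU policies using keywords (same logic as above)."""
    csu_keywords = ["CSU", "California State University", "policy", "academic integrity", "code of conduct"]
    return any(keyword.lower() in query.lower() for keyword in csu_keywords)


examples = [
    "What is the CSU policy on emergency grants?",             # on-topic, matched
    "How many states are there in USA?",                       # off-topic, rejected
    "What is France's foreign policy?",                        # off-topic, but contains "policy"
    "What does the student conduct code say about cheating?",  # on-topic, but no keyword matches
]
results = [is_csu_policy_question(q) for q in examples]
for question, matched in zip(examples, results):
    print(f"{matched!s:5} | {question}")
```

The last two examples are misclassified, which is one reason a keyword gate like this may be too coarse to use on its own.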


#### Creating Hybrid Retriever (Combines Semantic, Keyword & Graph Retrieval)

I'm creating a custom hybrid retriever class that:
1. Retrieves documents via semantic and keyword search.
2. Uses the graph to add neighboring nodes of the retrieved chunks (for additional context).
3. Deduplicates and returns a final list of relevant documents.
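
The three steps above can be sketched without LangChain on toy data. In this sketch chunks are plain string IDs and the graph is an adjacency dictionary; `hybrid_retrieve` and the sample data are hypothetical and only illustrate the flow.

```python
from typing import Dict, List


def hybrid_retrieve(
    semantic_hits: List[str],
    keyword_hits: List[str],
    graph: Dict[str, List[str]],
    top_k: int = 5,
) -> List[str]:
    # Step 1: merging the results of both retrievers
    combined = semantic_hits + keyword_hits

    # Step 2: appending the one-hop graph neighbors of every retrieved chunk
    expanded = list(combined)
    for chunk_id in combined:
        expanded.extend(graph.get(chunk_id, []))

    # Step 3: deduplicating while keeping first-seen order, then truncating
    unique = list(dict.fromkeys(expanded))
    return unique[:top_k]


toy_graph = {"A": ["C"], "B": ["D"], "C": ["A"], "D": ["B"]}
print(hybrid_retrieve(["A", "B"], ["B", "E"], toy_graph, top_k=5))  # ['A', 'B', 'E', 'C', 'D']
print(hybrid_retrieve(["A", "B"], ["B", "E"], toy_graph, top_k=3))  # ['A', 'B', 'E']
```

Because neighbors are appended after the direct hits, a small `top_k` can cut them off entirely, as the second call shows.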

In [59]:
class HybridGraphRetriever(BaseRetriever, BaseModel):
    semantic_retriever: Any
    keyword_retriever: Any
    policy_graph: nx.Graph
    top_k: int = Field(default=5)
    graph_hops: int = Field(default=1)

    class Config:
        extra = "allow"

    def _get_relevant_documents(self, query: str) -> List[Document]:
        # Step 1: retrieving candidates from both the semantic and the keyword retriever
        sem_docs = self.semantic_retriever.get_relevant_documents(query)
        key_docs = self.keyword_retriever.get_relevant_documents(query)
        combined = sem_docs + key_docs

        # Step 2: locating each retrieved chunk in the graph (matched by text) and
        # appending every node within graph_hops edges of it
        expanded_docs = combined.copy()
        for doc in combined:
            for node, data in self.policy_graph.nodes(data=True):
                if data["doc"].page_content.strip() == doc.page_content.strip():
                    neighbors = nx.single_source_shortest_path_length(self.policy_graph, node, cutoff=self.graph_hops)
                    for n in neighbors:
                        neighbor_doc = self.policy_graph.nodes[n]["doc"]
                        expanded_docs.append(neighbor_doc)
                    break

        # Step 3: deduplicating on (policy title, page, text) while keeping first-seen order
        seen = {}
        for doc in expanded_docs:
            key = (doc.metadata.get("policy_title", ""), doc.metadata.get("page", ""), doc.page_content)
            seen[key] = doc
        unique_docs = list(seen.values())

        # Graph neighbors sit after the direct hits, so they are returned only if top_k leaves room
        return unique_docs[:self.top_k]

    async def _aget_relevant_documents(self, query: str) -> List[Document]:
        raise NotImplementedError("Async retrieval is not implemented for HybridGraphRetriever.")

In [73]:
hybrid_retriever = HybridGraphRetriever(
    semantic_retriever=semantic_retriever,
    keyword_retriever=keyword_retriever,
    policy_graph=policy_graph,
    top_k=5,
    graph_hops=1
)

# hybrid_retriever = PolicyFilteredRetriever(retrievers=[bm25_retriever, tfidf_retriever], weights=[0.5, 0.5])

print("[INFO]: HybridGraphRetriever with approximate graph is ready.")

[INFO]: HybridGraphRetriever with approximate graph is ready.


In [74]:
# Create a custom ConversationalRetrievalChain subclass that allows extra keys.
class CustomConversationalRetrievalChain(ConversationalRetrievalChain):
    class Config:
        extra = Extra.allow

In [75]:
# Creating the LLM and the initial/refine chains
llm = Ollama(model="mistral", temperature=0.3)

initial_llm_chain = LLMChain(llm=llm, prompt=question_prompt)
refine_llm_chain = LLMChain(llm=llm, prompt=refine_prompt)

combine_docs_chain = RefineDocumentsChain(
    initial_llm_chain=initial_llm_chain,
    refine_llm_chain=refine_llm_chain,
    document_variable_name="context",
    initial_response_name="existing_answer"
)
print("[INFO]: Custom refine documents chain is ready.")

[INFO]: Custom refine documents chain is ready.


In [76]:
# condense_question_prompt = PromptTemplate(
#     input_variables=["chat_history", "question"],
#     template="""
# Given the following conversation history and a follow-up question, rephrase the follow-up question into a standalone query.

# Chat History:
# {chat_history}
# Follow-up question: {question}
# Standalone question:
# """
# )
# question_generator = LLMChain(llm=llm, prompt=condense_question_prompt)

In [77]:
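# Pass-through prompt used in place of a condensing prompt.
# Note: on follow-up turns the chain sends the bare question to the LLM (without chat history)
# and uses the model's reply as the standalone query.
dummy_question_prompt = PromptTemplate(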
    template="{question}",
    input_variables=["question"]
)
question_generator = LLMChain(llm=llm, prompt=dummy_question_prompt)
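
For comparison, here is a library-free sketch of what a condensing step does, using the wording of the commented-out `condense_question_prompt` above. `condense_question` and `fake_llm` are hypothetical names, and `fake_llm` stands in for a real model call.

```python
from typing import Callable, List, Tuple

CONDENSE_TEMPLATE = (
    "Given the following conversation history and a follow-up question, "
    "rephrase the follow-up question into a standalone query.\n\n"
    "Chat History:\n{chat_history}\n"
    "Follow-up question: {question}\n"
    "Standalone question:"
)


def condense_question(
    question: str,
    history: List[Tuple[str, str]],
    llm: Callable[[str], str],
) -> str:
    """Rewrite a follow-up question so it can be understood without the chat history."""
    if not history:
        return question  # first turn: nothing to condense
    history_text = "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in history)
    prompt = CONDENSE_TEMPLATE.format(chat_history=history_text, question=question)
    return llm(prompt).strip()


def fake_llm(prompt: str) -> str:
    # Stand-in for a model call; returns a fixed rewrite for demonstration only
    return " Summarize the scope of the Student Conduct Code. "


history = [("Who does the Student Conduct Code apply to?", "It applies to all students.")]
print(condense_question("Could you summarize that?", history, fake_llm))
```

A follow-up such as "Could you summarize that?" carries no retrievable content by itself, so the rewrite is what gives the retriever something to search for.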

In [78]:
# memory = ConversationBufferMemory(
#     memory_key="chat_history", 
#     output_key="answer", 
#     return_messages=True
# )
# rag_chain = CustomRetrievalQA.from_chain_type(
#     llm=llm_for_chain,
#     chain_type="refine",
#     retriever=hybrid_retriever,
#     return_source_documents=True,
#     chain_type_kwargs={
#         "question_prompt": question_prompt,
#         "refine_prompt": refine_prompt,
#         "document_variable_name": "context"
#     }
# )

# rag_chain_custom = CustomConversationalRetrievalChain.from_llm(
#     llm=llm_for_chain, 
#     retriever=hybrid_retriever,
#     memory=memory,
#     output_key="answer",
#     return_source_documents=True,
#     chain_type_kwargs={
#          "question_prompt": question_prompt,
#          "refine_prompt": refine_prompt,
#          "document_variable_name": "context"
#     }
# )

memory = ConversationBufferMemory(
    memory_key="chat_history", 
    return_messages=True,
    output_key="answer"
)

rag_chain = ConversationalRetrievalChain(
    retriever=hybrid_retriever,               
    combine_docs_chain=combine_docs_chain,   
    question_generator=question_generator,
    memory=memory,
    output_key="answer",
    return_source_documents=True,
    callbacks=[]
)
print("[✅] Custom ConversationalRetrievalChain is ready.")

[✅] Custom ConversationalRetrievalChain is ready.


In [79]:
def is_on_topic(question: str, llm) -> bool:
    """Use a simple prompt to ask the LLM whether the query is related to CSU policies."""
    check_prompt = PromptTemplate(
        template="Is the following question related to CSU policies? Answer with 'yes' or 'no'.\nQuestion: {question}",
        input_variables=["question"]
    )
    check_chain = LLMChain(llm=llm, prompt=check_prompt)
    response = check_chain.predict(question=question)
    return "yes" in response.lower()

In [80]:
# def policy_chatbot(query: str):

#     # First, check if the query is on-topic.
#     if not is_on_topic(query, llm):
#         print("\nI'm sorry, I can only answer questions related to CSU policies. Could you please rephrase your query accordingly?\n")
#         return
    
#     # Note: Use "question" as the input key.
#     result = rag_chain({"question": query})
#     answer = result.get("answer", "")
#     source_docs = result.get("source_documents", [])
    
#     print(f"\n💬 Query: {query}\n")
#     print(f"🤖 Answer:\n{answer}\n")
    
#     print("📚 References:")
#     for i, doc in enumerate(source_docs, start=1):
#         metadata = doc.metadata
#         policy_title = metadata.get("policy_title", "Unknown Policy")
#         policy_url = metadata.get("policy_url", None)
#         page = metadata.get("page")
#         page_num = page + 1 if isinstance(page, int) else "?"
#         if policy_url and isinstance(policy_url, str) and policy_url.startswith("http"):
#             link = f"{policy_url}#page={page_num} ({policy_title})"
#         else:
#             link = f"{policy_title} (Page {page_num})"
#         print(f"[{i}] {link}")
#     print("\n" + "-" * 80 + "\n")


# def policy_chatbot(question: str):  
#     result = rag_chain({"question": question})
#     answer = result.get("answer", "")
#     source_docs = result.get("source_documents", [])
    
#     print(f"\n💬 Query: {question}\n")
#     print(f"🤖 Answer:\n{answer}\n")
#     print("📚 References:")
#     for i, doc in enumerate(source_docs, 1):
#         meta = doc.metadata
#         title = meta.get("policy_title", "Unknown Policy")
#         url = meta.get("policy_url", "")
#         page = meta.get("page")
#         page_disp = page + 1 if isinstance(page, int) else "?"
#         if url:
#             print(f"[{i}] {url}#page={page_disp} ({title})")
#         else:
#             print(f"[{i}] {title} (Page {page_disp})")
#     print("\n" + "-" * 80 + "\n")

In [81]:
def is_on_topic(question: str) -> bool:
    """
    Uses a simple LLMChain to check if the question is directly related to CSU policies.
    Returns True if the answer is 'yes', otherwise False.
    """
    on_topic_prompt = PromptTemplate(
        template="Is the following question related to CSU policies? Answer only 'yes' or 'no'.\nQuestion: {question}",
        input_variables=["question"]
    )
    on_topic_chain = LLMChain(llm=llm, prompt=on_topic_prompt)
    response = on_topic_chain.predict(question=question)
    return "yes" in response.lower()
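
The substring test `"yes" in response.lower()` also fires on words such as "yesterday". A stricter parser is sketched below as one possible alternative; `parse_yes_no` is a hypothetical helper, and it assumes the model puts its verdict in the first word of the reply.

```python
import re


def parse_yes_no(response: str) -> bool:
    """Return True only when the first word of the reply is 'yes'."""
    match = re.match(r"\W*(\w+)", response.lower())
    return bool(match) and match.group(1) == "yes"


replies = [" Yes.", "yes, it is", "No, yesterday's news is not a CSU policy.", ""]
parsed = [parse_yes_no(r) for r in replies]
print(parsed)  # [True, True, False, False]
```

The trade-off is that a reply like "The answer is yes" would be rejected, so this only fits prompts that ask for a bare 'yes' or 'no'.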


In [82]:
def policy_chatbot(question: str):
    # Pre-check: if question is off-topic, immediately return the fixed off-topic message.
    if not is_on_topic(question):
        print("\nI'm sorry, I can only answer questions related to CSU policies. Could you please rephrase your query accordingly?\n")
        return

    result = rag_chain({"question": question})
    answer = result.get("answer", "")
    source_docs = result.get("source_documents", [])
    
    print(f"\n💬 Query: {question}\n")
    print(f"🤖 Answer:\n{answer}\n")
    print("📚 References:")
    for i, doc in enumerate(source_docs, start=1):
        meta = doc.metadata
        title = meta.get("policy_title", "Unknown Policy")
        url = meta.get("policy_url", "")
        page = meta.get("page")
        page_disp = page + 1 if isinstance(page, int) else "?"
        if url and isinstance(url, str) and url.startswith("http"):
            print(f"[{i}] {url}#page={page_disp} ({title})")
        else:
            print(f"[{i}] {title} (Page {page_disp})")
    print("\n" + "-" * 80 + "\n")
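
The reference formatting can also be pulled into a small pure function, which makes it easy to check in isolation. This is a sketch with the same logic as the loop above; `format_reference` is a name I'm introducing, and the URL in the second example is a placeholder.

```python
def format_reference(index: int, metadata: dict) -> str:
    """Build one reference line from a chunk's metadata (pages are stored 0-based)."""
    title = metadata.get("policy_title", "Unknown Policy")
    url = metadata.get("policy_url", "")
    page = metadata.get("page")
    page_disp = page + 1 if isinstance(page, int) else "?"
    if isinstance(url, str) and url.startswith("http"):
        return f"[{index}] {url}#page={page_disp} ({title})"
    return f"[{index}] {title} (Page {page_disp})"


print(format_reference(1, {"policy_title": "2021 - 2022 Emergency Grant Allocation", "page": 0}))
# [1] 2021 - 2022 Emergency Grant Allocation (Page 1)
print(format_reference(2, {"policy_title": "Example Policy", "policy_url": "https://example.org/policy.pdf", "page": 2}))
# [2] https://example.org/policy.pdf#page=3 (Example Policy)
```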

#### Trying our chatbot with different queries

In [83]:
policy_chatbot("What are the approval procedures for academic freedom-related policies?")


💬 Query: What are the approval procedures for academic freedom-related policies?

🤖 Answer:
 To provide more specific information regarding your inquiry, here are the relevant CSU policies related to the topics you've mentioned:

1. Policies on Student Fees and Financial Aid can be found in the Tuition, Fees, and Financial Aid section of each campus's catalog. Each campus may have slightly different policies, so it is best to consult your specific campus's catalog. (Referenced policy: Tuition, Fees, and Financial Aid)

2. Policies on the Transfer of Credit Earned at Other Institutions can be found in the Transfer Credit section of each campus's catalog. Again, each campus may have slightly different policies, so it is best to consult your specific campus's catalog. (Referenced policy: Transfer Credit)

3. Catalog Rights policies are outlined in the Catalog Rights and Academic Standards section of each campus's catalog. Each campus may have slightly different procedures, so it is best 

In [84]:
query = "How many states are there in USA?"
policy_chatbot(query)


I'm sorry, I can only answer questions related to CSU policies. Could you please rephrase your query accordingly?



In [86]:
policy_chatbot("What is the capital of France?")


I'm sorry, I can only answer questions related to CSU policies. Could you please rephrase your query accordingly?



In [85]:
query = "What is the annual fees of MS in CS fees at San Jose State University?"
policy_chatbot(query)


💬 Query: What is the annual fees of MS in CS fees at San Jose State University?

🤖 Answer:
 The fee for international students in the Master of Science (MS) in Computer Science program at San Jose State University may have been the Student Success, Excellence and Technology Fee during certain academic years, which was set at $630 per term according to the CSU policy document from 2013-14. However, it's essential to note that this fee is not a systemwide CSU policy but rather a fee that may be assessed by individual academic programs within the university.

For the 2023-24 academic year, the nonresident tuition fee per unit is $396 for semester-based programs and $264 for quarter-based programs. The total nonresident tuition paid per term will be determined by the number of units taken.

In addition to this, a supplemental Graduate Business Professional Fee has been set at rates of $231 per term for graduate students in certain programs. For accurate and current information about fees 

#### Experimenting with complex queries

The query below requires the chatbot to retrieve information from the Academic Freedom Policy (which discusses research freedom and potential conflicts) and from additional policy documents on research funding or conflicts of interest. The answer must integrate details from more than one policy.

In [87]:
policy_chatbot("How does the university policy address conflicts between faculty research priorities and commercial interests, and what approval procedures are in place to manage these conflicts?")


💬 Query: How does the university policy address conflicts between faculty research priorities and commercial interests, and what approval procedures are in place to manage these conflicts?

🤖 Answer:
 In the California State University (CSU), conflicts of interest are managed, resolved, and reported for various activities, including business transactions involving campus alumni associations and personal or business affairs of directors, officers, or staff members. These transactions require advance approval by the governing body.

In cases where a conflict of interest is identified, an independent review committee is appointed to assess and make recommendations for its management. This process aligns with the CSU's commitment to maintaining academic integrity and adhering to both Federal Conflict of Interest regulations and California Conflict of Interest requirements.

For federally funded research, each CSU campus assists investigators, students, and research staff in determining po

**Complex Query-2:**

The query below demands a comparison between two distinct policies. The PolicyStat chatbot needs to extract guidelines from both the Academic Access Policy and the Student Code of Conduct (or similar documents) and then synthesize them to highlight the differences and their impact on enforcement.

**How it works:**
1. The RAG pipeline retrieves chunks from both policies: semantic search picks up nuanced guidelines, while keyword search fetches exact phrases like “faculty responsibilities” or “enforcement.”
2. GraphRAG further enhances the process by connecting sections that use similar language across policies.
3. The LLM then collates these details into a comparative answer with contextual references that indicate the policy source and page number for each piece of information.

In [89]:
policy_chatbot("What are the key differences between the Academic Access Policy and the Student Code of Conduct regarding faculty responsibilities, and how do these differences influence policy enforcement at the institution?")


💬 Query: What are the key differences between the Academic Access Policy and the Student Code of Conduct regarding faculty responsibilities, and how do these differences influence policy enforcement at the institution?

🤖 Answer:
 In the California State University (CSU), the Student Conduct Code applies to all students, including applicants, enrolled students, students between academic terms, graduates awaiting degrees, and students who withdraw from school while a disciplinary matter is pending. Any behavior that threatens the safety or security of the university community, or substantially disrupts the functions or operation of the university may lead to disciplinary action, regardless of whether a law enforcement investigation has concluded. The Student Conduct Process outlines the procedure for addressing these violations and may result in various sanctions such as restitution, loss of financial aid, educational and remedial sanctions, denial of access to Campus or persons, disci


💬 Query: Could you please summerize the above answer?

🤖 Answer:
 Inquiries about the handling of allegations within the California State University (CSU) system often concern the investigative process and evidence collection in such cases, particularly when it comes to Title IX sexual harassment policy. Here's a more detailed explanation:

1. Allegation: An accusation made against an individual within the CSU system, which can encompass a broad spectrum of issues related to Title IX, such as sexual misconduct or gender-based discrimination.

2. Investigative Process: When an allegation is reported, it triggers an investigation. This process involves several steps: gathering evidence, interviewing relevant parties, and reviewing any available documentation. The goal is to establish whether the allegations are substantiated or not. (California State University, 2021)

3. Evidence Collection: In addition to the standard evidence considered during an investigation, such as witness statem

**Complex Query-3:**
The query below is multi-faceted and requires integrating information from several policies. It involves not only the Academic Freedom Policy but also the tenure guidelines and research compliance standards. The answer must present a holistic view that covers both academic independence and regulatory compliance.

**How the chatbot works:**
The chatbot uses the hybrid retrieval module to gather relevant documents from all three policy areas. Semantic retrieval captures conceptual links about “independence” and “compliance,” while keyword retrieval homes in on technical terms like “tenure” or “regulations.” The graph-based component connects these overlapping concepts across multiple documents. With conversational memory, the system preserves context across turns, and the final answer generated by the LLM is followed by references to the policy and page number of each source chunk.

In [90]:
query = "Considering the university’s policies on academic freedom, tenure, and research compliance, what are the combined requirements for faculty to maintain academic independence while ensuring adherence to institutional regulations? Give me the brief summary of the entire answer in the end."
policy_chatbot(query)


💬 Query: Considering the university’s policies on academic freedom, tenure, and research compliance, what are the combined requirements for faculty to maintain academic independence while ensuring adherence to institutional regulations? Give me the brief summary of the entire answer in the end.

🤖 Answer:
 Faculty members at California State University (CSU) are entitled to full freedom in research, teaching, and publication as outlined in the Academic Freedom Policy (CSU Executive Order 1096). This includes academic freedom to teach their subject matter without external constraints other than those normally denoted by scholarly standards.

In terms of authority, it is important to note that students are also subject to discipline for conduct that threatens the safety or security of the campus community, or substantially disrupts the functions or operation of the University, regardless of whether it occurs on or off campus (5 Cal. Code Regs. § 41301 (d)).

Regarding research, faculty 