# Enhanced Agentic RAG with Policy Selection

This notebook demonstrates an improved Retrieval-Augmented Generation (RAG) system. An agentic step first asks the model which policies are most relevant to the user's query; retrieval and response generation then use only the chunks that belong to those policies.

In [None]:
# Install required packages
# !pip install faiss-cpu mistralai beautifulsoup4 requests numpy python-dotenv

## Load API key

In [23]:
import os
import dotenv

# Load from .env file if it exists
dotenv.load_dotenv()

# Set MISTRAL_API_KEY in a .env file or in your shell; do not hard-code or print it
api_key = os.getenv("MISTRAL_API_KEY")
if not api_key:
    raise RuntimeError("MISTRAL_API_KEY is not set")


## Import Required Libraries

In [2]:
import requests
from bs4 import BeautifulSoup
import re
import numpy as np
import faiss
from mistralai import Mistral, UserMessage
import time
from urllib.parse import urlparse
import pickle

## Define Utilities for Fetching and Processing Web Content

This function downloads a page, removes script, style, and navigation elements, and returns the main text together with the page's domain.

In [3]:
def get_content_from_url(url):
    """
    Fetch content from a given URL and extract the main text.
    """
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
        }
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        html_doc = response.text
        soup = BeautifulSoup(html_doc, "html.parser")

        # Remove script, style, and other non-content elements
        for element in soup(["script", "style", "header", "footer", "nav"]):
            element.extract()

        # Extract text from main content areas
        main_content = soup.find("main") or soup.find("article") or soup.find("div", class_="content") or soup.find("body")
        if main_content:
            text = main_content.get_text(separator='\n', strip=True)
        else:
            text = soup.get_text(separator='\n', strip=True)

        # Clean up the text: collapse every whitespace run (newlines included) into one space
        text = re.sub(r'\s+', ' ', text)

        domain = urlparse(url).netloc
        return text, domain
    except Exception as e:
        print(f"Error fetching content from {url}: {e}")
        return None, None
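
The clean-up step can be tried without any network access. The following is a minimal sketch on an invented sample string; `normalize_whitespace` is a hypothetical helper name, not part of the notebook. Because `\s` also matches newlines, a single substitution is enough to flatten the text into one line.

```python
import re

def normalize_whitespace(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return re.sub(r"\s+", " ", text).strip()

# Invented sample text, only for illustration
sample = "4.1  Attendance\n\n\nStudents   must attend\tall sessions. "
print(normalize_whitespace(sample))
```

Flattening the text this way discards paragraph boundaries, which is acceptable here because the later chunking step works on fixed character windows rather than on paragraphs.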

## Define Policies and Gather Data from Multiple Sources

Each policy is defined by its URL and a short title. The content of each page is fetched, and the source string stored alongside it records the policy title, the domain, and the URL.

In [5]:
# List of policies with their URLs and titles
policies = [
    {"url": "https://www.udst.edu.qa/about-udst/institutional-excellence-ie/policies-and-procedures/sport-and-wellness-facilities-and", "title": "Sport and Wellness"},
    {"url": "https://www.udst.edu.qa/about-udst/institutional-excellence-ie/policies-and-procedures/student-attendance-policy", "title": "Attendance"},
    {"url": "https://www.udst.edu.qa/about-udst/institutional-excellence-ie/policies-and-procedures/final-grade-policy", "title": "Final Grade"},
    {"url": "https://www.udst.edu.qa/about-udst/institutional-excellence-ie/policies-and-procedures/student-conduct-policy", "title": "Student Conduct"},
    {"url": "https://www.udst.edu.qa/about-udst/institutional-excellence-ie/udst-policies-and-procedures/academic-schedule-policy", "title": "Academic Schedule"},
    {"url": "https://www.udst.edu.qa/about-udst/institutional-excellence-ie/policies-and-procedures/student-appeals-policy", "title": "Student Appeals"},
    {"url": "https://www.udst.edu.qa/about-udst/institutional-excellence-ie/policies-and-procedures/transfer-policy", "title": "Transfer Policy"},
    {"url": "https://www.udst.edu.qa/about-udst/institutional-excellence-ie/policies-and-procedures/admissions-policy", "title": "Admissions Policy"},
    {"url": "https://www.udst.edu.qa/about-udst/institutional-excellence-ie/policies-and-procedures/registration-policy", "title": "Registration Policy"},
    {"url": "https://www.udst.edu.qa/about-udst/institutional-excellence-ie/udst-policies-and-procedures/graduation-policy", "title": "Graduation Policy"},
    {"url": "https://www.udst.edu.qa/about-udst/institutional-excellence-ie/policies-and-procedures/academic-annual-leave-policy", "title": "Academic Annual Leave Policy"},
    {"url": "https://www.udst.edu.qa/about-udst/institutional-excellence-ie/policies-and-procedures/academic-credentials-policy", "title": "Academic Credentials Policy"},
    {"url": "https://www.udst.edu.qa/about-udst/institutional-excellence-ie/policies-and-procedures/academic-freedom-policy", "title": "Academic Freedom Policy"},
    {"url": "https://www.udst.edu.qa/about-udst/institutional-excellence-ie/policies-and-procedures/academic-professional-development", "title": "Academic Professional Development"},
    {"url": "https://www.udst.edu.qa/about-udst/institutional-excellence-ie/policies-and-procedures/academic-qualifications-policy", "title": "Academic Qualifications Policy"},
    {"url": "https://www.udst.edu.qa/about-udst/institutional-excellence-ie/policies-and-procedures/credit-hour-policy", "title": "Credit Hour Policy"},
    {"url": "https://www.udst.edu.qa/about-udst/institutional-excellence-ie/policies-and-procedures/intellectual-property-policy", "title": "Intellectual Property Policy"},
    {"url": "https://www.udst.edu.qa/about-udst/institutional-excellence-ie/policies-and-procedures/joint-appointment-policy", "title": "Joint Appointment Policy"},
    {"url": "https://www.udst.edu.qa/about-udst/institutional-excellence-ie/policies-and-procedures/program-accreditation-policy", "title": "Program Accreditation Policy"},
    {"url": "https://www.udst.edu.qa/about-udst/institutional-excellence-ie/policies-and-procedures/examination-policy", "title": "Examination Policy"}
]

# Fetch content from each policy URL
all_texts = []
all_sources = []

for policy in policies:
    url = policy["url"]
    title = policy["title"]
    text, domain = get_content_from_url(url)
    if text and len(text) > 100:  # ensure meaningful content
        all_texts.append(text)
        source_info = f"Policy: {title} - Source: {domain} - {url}"
        all_sources.append(source_info)

print(f"Processed {len(all_texts)} policies successfully")

# Save the combined text to a file (create the output directory if needed)
os.makedirs("assets", exist_ok=True)
combined_text = "\n\n---\n\n".join(all_texts)
file_name = "assets/combined_documents.txt"
with open(file_name, 'w', encoding='utf-8') as file:
    file.write(combined_text)

Processed 20 policies successfully


## Chunk the Text with Overlap for Better Context Preservation

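This function splits each document into chunks of `chunk_size` characters, with consecutive chunks sharing `overlap` characters so that text cut at a boundary still appears intact in one of them.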

In [6]:
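def chunk_text(texts, sources, chunk_size=512, overlap=100):
    """
    Split each document into overlapping character chunks while preserving source info.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    chunk_sources = []

    for i, doc_text in enumerate(texts):
        if len(doc_text) < 50:  # skip documents too short to be useful
            continue
        start = 0
        while start < len(doc_text):
            end = min(start + chunk_size, len(doc_text))
            chunks.append(doc_text[start:end])
            chunk_sources.append(sources[i])
            if end == len(doc_text):
                break  # the tail is covered; avoid a redundant overlap-only chunk
            start += chunk_size - overlap
    return chunks, chunk_sources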

chunks, chunk_sources = chunk_text(all_texts, all_sources)
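
To see where the window boundaries fall, here is a small sketch that computes only the `(start, end)` offsets for a document of a given length. `chunk_spans` is a hypothetical helper introduced for illustration; it uses the same default `chunk_size` and `overlap` as above and stops as soon as a window reaches the end of the text.

```python
def chunk_spans(text_len, chunk_size=512, overlap=100):
    """Return the (start, end) offsets of a sliding window with overlap."""
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    spans = []
    start = 0
    while start < text_len:
        end = min(start + chunk_size, text_len)
        spans.append((start, end))
        if end == text_len:
            break  # the last window already reaches the end of the text
        start += step
    return spans

print(chunk_spans(1000))
# → [(0, 512), (412, 924), (824, 1000)]
```

Each window starts 412 characters after the previous one, so neighbouring chunks share exactly 100 characters.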

## Get Embeddings Using Mistral API

In [9]:
def get_text_embedding(list_txt_chunks, batch_size=20):
    """
    Get embeddings for text chunks in batches to avoid rate limits.
    """
    client = Mistral(api_key=api_key)
    all_embeddings = []
    for i in range(0, len(list_txt_chunks), batch_size):
        batch = list_txt_chunks[i:i+batch_size]
        try:
            embeddings_batch_response = client.embeddings.create(model="mistral-embed", inputs=batch)
            all_embeddings.extend(embeddings_batch_response.data)
            time.sleep(2)  # pause between batches to stay under the rate limit
        except Exception as e:
            print(f"Error getting embeddings for batch {i}:{i+batch_size}: {e}")
            for _ in range(len(batch)):
                all_embeddings.append(None)
    return all_embeddings
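
A failed batch above is recorded as `None` and later dropped, so its chunks never reach the index. One possible alternative, sketched here under the assumption that failures are transient (for example rate limiting), is to retry with exponential backoff. `call_with_retries` is a hypothetical helper, and the flaky function below is a stand-in for the real API call.

```python
import time

def call_with_retries(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); after a failure wait base_delay * 2**attempt, then try again."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller decide what to do
            delay = base_delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.1f}s")
            sleep(delay)

# Stand-in for an API call that fails twice before succeeding
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return ["embedding-placeholder"]

# A no-op sleep keeps the demonstration instant
result = call_with_retries(flaky, sleep=lambda seconds: None)
print(result, calls["n"])
```

In the notebook, the wrapped callable could be `lambda: client.embeddings.create(model="mistral-embed", inputs=batch)`; whether three attempts and a one-second base delay suit the actual rate limits is an assumption to tune.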

In [10]:
# Get embeddings for all chunks
text_embeddings = get_text_embedding(chunks)

# Filter out any failed embeddings
valid_embeddings = []
valid_chunks = []
valid_sources = []

for i, embedding in enumerate(text_embeddings):
    if embedding is not None:
        valid_embeddings.append(embedding.embedding)
        valid_chunks.append(chunks[i])
        valid_sources.append(chunk_sources[i])

print(f"Embedded {len(valid_chunks)} of {len(chunks)} chunks")
# FAISS works on float32 vectors, so convert explicitly
embeddings = np.array(valid_embeddings, dtype=np.float32)

## Create and Populate the Vector Database

In [11]:
# Build the FAISS index using the dimension of embeddings
d = len(valid_embeddings[0])
index = faiss.IndexFlatL2(d)
index.add(embeddings)
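
`IndexFlatL2` performs an exact, brute-force search and ranks vectors by squared Euclidean distance. The pure-Python sketch below illustrates that computation on toy two-dimensional vectors; `l2_search` is a hypothetical name, and the real index is of course far faster.

```python
def l2_search(vectors, query, k):
    """Return (squared_distances, indices) of the k vectors nearest to query."""
    scored = [
        (sum((a - b) ** 2 for a, b in zip(vec, query)), idx)
        for idx, vec in enumerate(vectors)
    ]
    scored.sort()  # smallest squared distance first
    top = scored[:k]
    return [dist for dist, _ in top], [idx for _, idx in top]

# Toy vectors, invented for illustration
vectors = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
distances, indices = l2_search(vectors, [1.0, 1.0], k=2)
print(distances, indices)
# → [1.0, 2.0] [1, 0]
```

The two return values correspond to the `D` and `I` arrays that `index.search` yields for a single query.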

## Define Agentic Policy Selection Function

This function uses the Mistral API to choose the top two policies that are most relevant to the given question. It is provided with the list of all available policy titles.

In [19]:
def agentic_policy_selection(question, policy_titles):
    prompt = (
        f"Given the following policies: {', '.join(policy_titles)}. "
        f"Which two policies are most relevant to answer the question: '{question}'? "
        "Please provide your answer as a comma-separated list with no additional text."
    )
    client = Mistral(api_key=api_key)
    messages = [UserMessage(content=prompt)]
    try:
        chat_response = client.chat.complete(
            model="mistral-large-latest",
            messages=messages
        )
        answer = chat_response.choices[0].message.content
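        # Assume the answer is a comma-separated list; drop empty entries,
        # since an empty string would match every source in the later filter
        selected = [p.strip() for p in answer.split(",") if p.strip()]
        return selected[:2]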
    except Exception as e:
        print(f"Error selecting policies: {e}")
        return []
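
The model's answer is free text, so it may come back with quotes, different casing, or a title that does not exist. A defensive option, sketched here with a hypothetical helper `match_policy_titles`, is to map each returned item onto the known titles and discard anything that does not match.

```python
def match_policy_titles(answer, policy_titles, limit=2):
    """Map a comma-separated model answer onto known titles, ignoring case and quotes."""
    lookup = {title.lower(): title for title in policy_titles}
    matched = []
    for part in answer.split(","):
        # Remove surrounding whitespace, quotes, backticks, and trailing periods
        key = part.strip().strip("'\"`.").strip().lower()
        title = lookup.get(key)
        if title and title not in matched:
            matched.append(title)
    return matched[:limit]

titles = ["Attendance", "Student Conduct", "Final Grade"]
# Invented model answer with quoting, casing, and an unknown title
print(match_policy_titles(' "attendance", Student Conduct. , Parking', titles))
# → ['Attendance', 'Student Conduct']
```

Returning only known titles means the later filter never receives a string the sources cannot contain.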

## RAG Query Function with Agentic Policy Selection

This function first calls the agentic step to select the top two policies. It then filters the vector database to only include chunks from those policies before performing the retrieval and generating a response.

In [22]:
def rag_query(question, k=10):
    # Extract all unique policy titles from the source strings
    all_policy_titles = list({
        src.split(' - ')[0].replace('Policy: ', '') for src in valid_sources
    })

    # Agentic step: determine which two policies are most relevant
    selected_policies = agentic_policy_selection(question, all_policy_titles)
    print("Selected policies:", selected_policies)

    # Filter valid chunks and embeddings to only those whose source contains one of the selected policies
    filtered_indices = [i for i, src in enumerate(valid_sources) if any(policy in src for policy in selected_policies)]
    if len(filtered_indices) == 0:
        print("No matching policies found in the sources. Using all chunks.")
        filtered_indices = list(range(len(valid_sources)))

    filtered_embeddings = embeddings[filtered_indices]
    filtered_chunks = [valid_chunks[i] for i in filtered_indices]
    filtered_sources = [valid_sources[i] for i in filtered_indices]

    # Get embedding for the question
    question_embeddings = get_text_embedding([question])
    if not question_embeddings or question_embeddings[0] is None:
        # Return the same 3-tuple shape as the success path so callers can unpack it
        return "Error: Could not generate embeddings for the question.", "", []
    query_embedding = np.array([question_embeddings[0].embedding], dtype=np.float32)

    # Build a temporary FAISS index for the filtered embeddings
    d = embeddings.shape[1]
    temp_index = faiss.IndexFlatL2(d)
    temp_index.add(filtered_embeddings)
    D, I = temp_index.search(query_embedding, k=min(k, len(filtered_chunks)))

    # Retrieve the matching chunks and their sources
    retrieved_chunks = [filtered_chunks[i] for i in I.tolist()[0]]
    retrieved_sources = [filtered_sources[i] for i in I.tolist()[0]]

    # Format context with source information
    context = ""
    for i, chunk in enumerate(retrieved_chunks):
        context += f"\nChunk {i+1}:\n{chunk}\n{retrieved_sources[i]}\n---\n"

    prompt = f"""
    You are given the following context information. Use it to answer the user's question accurately.
    If the information needed is not in the context, please say \"I don't have enough information to answer this question.\"
    
    Context information:
    ---------------------
    {context}
    ---------------------
    
    Question: {question}
    
    Please provide a comprehensive answer based solely on the context information provided.
    Include references to the policy used to get that answer formatted like this: Policy: Name - (Policy URL). Don't mention the chunk numbers.

    """
    print("Prompt:", prompt)

    client = Mistral(api_key=api_key)
    messages = [UserMessage(content=prompt)]
    try:
        chat_response = client.chat.complete(
            model="mistral-large-latest",
            messages=messages
        )
        response = chat_response.choices[0].message.content
    except Exception as e:
        response = f"Error generating response: {str(e)}"

    return response, context, retrieved_sources
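
The filter inside `rag_query` uses a substring test (`policy in src`), so a short or partial title can match sources from unrelated policies. A stricter variant, sketched below, parses the title out of the source string and compares it exactly. Both helper names are hypothetical, and the `example.org` sources are placeholders in the same `Policy: ... - Source: ... - ...` format.

```python
def title_from_source(source):
    """Extract the title from a 'Policy: <title> - Source: <domain> - <url>' string."""
    prefix = "Policy: "
    head = source.split(" - Source: ", 1)[0]
    return head[len(prefix):] if head.startswith(prefix) else head

def filter_by_titles(sources, selected_titles):
    """Return the indices of sources whose title is exactly one of the selected titles."""
    wanted = set(selected_titles)
    return [i for i, src in enumerate(sources) if title_from_source(src) in wanted]

# Placeholder sources, invented for illustration
sources = [
    "Policy: Attendance - Source: example.org - https://example.org/attendance",
    "Policy: Academic Schedule - Source: example.org - https://example.org/schedule",
    "Policy: Attendance - Source: example.org - https://example.org/attendance",
]
print(filter_by_titles(sources, ["Attendance"]))
# → [0, 2]
print(filter_by_titles(sources, ["Academic"]))
# → []
```

With the substring test, the partial title `"Academic"` would have selected the second source; the exact comparison rejects it.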

## Test the Agentic RAG System

In [21]:
# Test with a sample question
question = "Can I skip classes?"
response, context, sources = rag_query(question)
print(f"Question: {question}\n\nAnswer:\n{response}")

Selected policies: ['Attendance', 'Student Conduct']
Prompt: 
    You are given the following context information. Use it to answer the user's question accurately.
    If the information needed is not in the context, please say "I don't have enough information to answer this question."
    
    Context information:
    ---------------------
    
Chunk 1:
bilities 4.4.1 Students are responsible for the regular, Punctual Attendance of all Learning Sessions, and prescribed activities for the Courses in which they are enrolled. 4.5 Admissions and Registration Department Responsibilities 4.5.1 The Admissions and Registration Department is the custodian of all Student Attendance records. 4.6 Course Requirements 4.6.1 Absence from a Learning Session does not relieve Students from completing any missed Course Requirements. 4.6.2 The Academic Member may grant Studen
Policy: Attendance - Source: www.udst.edu.qa - https://www.udst.edu.qa/about-udst/institutional-excellence-ie/policies-and-procedu

## Save the RAG Components for Streamlit App

In [15]:
# Save the FAISS index
faiss.write_index(index, "assets/rag_index.faiss")

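# Save the metadata (chunks and sources). The API key is deliberately left out:
# secrets should not be written to disk, and the app can read MISTRAL_API_KEY
# from its own environment instead.
with open("assets/rag_data.pkl", "wb") as f:
    pickle.dump({
        "chunks": valid_chunks,
        "sources": valid_sources
    }, f)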
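
On the loading side, the app would read the index back with `faiss.read_index` and unpickle the metadata. The sketch below covers only the metadata round trip, using a temporary directory and toy data; `save_metadata` and `load_metadata` are hypothetical helpers, and the length check is an added safeguard rather than something the notebook does.

```python
import os
import pickle
import tempfile

def save_metadata(path, chunks, sources):
    """Write chunks and their sources to a pickle file."""
    with open(path, "wb") as f:
        pickle.dump({"chunks": chunks, "sources": sources}, f)

def load_metadata(path):
    """Read the metadata back and check that chunks and sources line up."""
    with open(path, "rb") as f:
        data = pickle.load(f)
    if len(data["chunks"]) != len(data["sources"]):
        raise ValueError("chunks and sources are out of sync")
    return data["chunks"], data["sources"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "rag_data.pkl")
    save_metadata(path, ["chunk a", "chunk b"], ["source a", "source b"])
    loaded_chunks, loaded_sources = load_metadata(path)

print(loaded_chunks, loaded_sources)
```

Pickle files should only be loaded from trusted locations, since unpickling can execute arbitrary code.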