# Simple agent system for PDF documents

This notebook demonstrates how to build a simple agent system that answers questions about PDF documents. The workflow is:

1. Load a PDF document.
2. Split the document into chunks.
3. Create embeddings for the chunks and store them in a vector index.
4. Create a language model.
5. Create an agent that uses the retriever and language model to answer questions about the document.
6. Ask questions of the agent and get answers.
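
Step 2 can be pictured with a toy word-window splitter. This is only a sketch under simple assumptions: `split_into_chunks`, `chunk_size`, and `overlap` are hypothetical names that count whitespace-separated words, and the node parsers LlamaIndex uses later in this notebook are more elaborate.

```python
def split_into_chunks(text: str, chunk_size: int = 8, overlap: int = 2) -> list[str]:
    """Split text into windows of `chunk_size` words that overlap by `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


sample = "one two three four five six seven eight nine ten"
chunks = split_into_chunks(sample, chunk_size=4, overlap=1)
for i, chunk in enumerate(chunks):
    print(i, chunk)
```

The overlap keeps a few words of shared context between neighbouring chunks, so a sentence that straddles a boundary is less likely to be cut off from its context.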

## Prerequisites

Before running this notebook, you need:
- An OpenAI API key
- PDF documents to query (by default, the `pdf` folder already contains two PDF documents)

In [None]:
# Install required packages
!pip install llama-index llama-index-llms-openai llama-index-agent-openai python-dotenv openai nest-asyncio nbconvert requests

# Verify installations
import importlib

def check_package(package_name):
    try:
        importlib.import_module(package_name)
        return True
    except ImportError:
        return False

packages = {
    "llama_index": "llama-index core",
    "llama_index.llms.openai": "llama-index-llms-openai",
    "llama_index.agent.openai": "llama-index-agent-openai",
    "dotenv": "python-dotenv",
    "openai": "OpenAI API",
    "nest_asyncio": "nest-asyncio", 
    "nbconvert": "nbconvert",
    "requests": "requests",
}

all_installed = True
for package, display_name in packages.items():
    installed = check_package(package)
    print(f"{display_name}: {'✅ Installed' if installed else '❌ Not installed'}")
    all_installed = all_installed and installed

if all_installed:
    print("\n✅ All required packages are installed!")
else:
    print("\n⚠️ Some packages are missing. Run the installation command again.")

## Environment Setup

Load environment variables from the `.env` file and set up for PDF processing.

Note: `load_dotenv()` looks for a `.env` file in the current directory and then in each parent directory, so the file can live at the repository root.
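
python-dotenv locates the file by walking from a starting directory up through its parents. A minimal sketch of that kind of upward search follows; `find_env_file` is a hypothetical helper written for illustration, not part of python-dotenv, and it can help you check which `.env` file would be picked up from the notebook's working directory.

```python
from pathlib import Path
from typing import Optional


def find_env_file(start: Path, filename: str = ".env") -> Optional[Path]:
    """Return the first `filename` found in `start` or any of its parent directories."""
    start = start.resolve()
    for directory in [start, *start.parents]:
        candidate = directory / filename
        if candidate.is_file():
            return candidate
    return None


env_path = find_env_file(Path.cwd())
print(env_path if env_path else "No .env file found in this directory or its parents")
```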

In [None]:
import os
from dotenv import load_dotenv
import nest_asyncio

# Apply nest_asyncio to allow nested event loops (needed for some async operations)
nest_asyncio.apply()

# Load environment variables from .env file
load_dotenv()

# Read the API key from the environment (populated from .env by load_dotenv)
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

# If the key was not loaded, you can set it here instead
# OPENAI_API_KEY = "your-openai-api-key"

# Export the key only when one is available, so we never store an empty string
if OPENAI_API_KEY:
    os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

# Verify API key is set
if not OPENAI_API_KEY:
    print("⚠️ Warning: OPENAI_API_KEY is not set")
else:
    print("✅ API key is set")

## Import Required Libraries

Let's import all the libraries we'll need for this notebook.

In [None]:
# Import core LlamaIndex components
from llama_index.core import SimpleDirectoryReader, Settings, VectorStoreIndex
from llama_index.core.tools import QueryEngineTool, ToolMetadata

# Import agent components
from llama_index.agent.openai import OpenAIAgent

# Import OpenAI LLM
from llama_index.llms.openai import OpenAI

# Import other utilities
import logging
import sys
from IPython.display import Markdown, display

# Configure basic logging. basicConfig attaches a stdout handler itself;
# adding a second StreamHandler would print each message twice.
logging.basicConfig(stream=sys.stdout, level=logging.INFO)

print("✅ Libraries imported successfully")

## Configure LLM

Set up the language model we'll use to answer queries. Indexing relies on the embedding model in `Settings.embed_model`, which this notebook leaves at its default.
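
The cell below sets a low `temperature` for more consistent results. Temperature is usually described as a divisor applied to the model's raw token scores before they are turned into probabilities. The sketch below illustrates that standard formulation with made-up scores; `softmax_with_temperature` is a hypothetical helper, and how the OpenAI API applies temperature internally is an assumption here.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.2, 1.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

At the lower temperature almost all of the probability mass sits on the top-scoring token, which is why repeated runs tend to produce the same answer.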

In [None]:
# Initialize the OpenAI LLM
llm = OpenAI(
    model="gpt-4.1-nano",  # You can change this to another model if needed
    temperature=0.2,      # Lower temperature for more consistent results
    streaming=True,       # Enable streaming for better UX
    system_prompt="You are a helpful assistant that provides accurate information about topics found in documents. Be thorough and make sure to search through the entire document, including any lists or tables that might appear on pages."
)

# Set up the global LlamaIndex configuration
Settings.llm = llm

print(f"✅ LLM configured: {llm.model}")

## Load and Index PDF Documents

Let's load some PDF documents and create vector indices for them.
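
To make "vector index" concrete, here is a toy version that ranks chunks by cosine similarity. It is a sketch, not how `VectorStoreIndex` is implemented: `embed`, `cosine_similarity`, and `ToyVectorIndex` are hypothetical names, the bag-of-words "embedding" stands in for a real embedding model, and the sample sentences are invented placeholders.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (real embeddings are dense floats)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorIndex:
    def __init__(self, chunks: list[str]):
        # Store each chunk next to its vector so retrieval only embeds the query
        self.entries = [(chunk, embed(chunk)) for chunk in chunks]

    def retrieve(self, query: str, top_k: int = 2) -> list[str]:
        """Return the `top_k` chunks most similar to the query."""
        query_vector = embed(query)
        ranked = sorted(
            self.entries,
            key=lambda entry: cosine_similarity(query_vector, entry[1]),
            reverse=True,
        )
        return [chunk for chunk, _ in ranked[:top_k]]


toy_index = ToyVectorIndex([
    "apples and pears grow on trees",
    "bicycles have two wheels and a chain",
    "bread is baked from flour and water",
])
print(toy_index.retrieve("how many wheels do bicycles have", top_k=1))
```

The `top_k` argument plays the same role as `similarity_top_k` later in this notebook: it bounds how many chunks are handed to the language model.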

In [None]:
# Define the paths to our PDF files
# Update these paths to match your file locations
pdf_paths = {
    "brochure-info-gestion": "pdf/hesso-brochure-a5-info-gestion-fr-web.pdf",
    "brochure-eco-entreprise": "pdf/hesso-brochure-a5-eco-entreprise-fr-web-cor2.pdf",
}

# Check if the files exist
pdf_exists = {}
for key, path in pdf_paths.items():
    exists = os.path.exists(path)
    pdf_exists[key] = exists
    if not exists:
        print(f"⚠️ Warning: {key} PDF file not found at {path}")

# Only proceed with files that exist
pdf_documents = {}
pdf_indices = {}

for key, path in pdf_paths.items():
    if pdf_exists[key]:
        try:
            print(f"Loading {key} document...")
            # Load the PDF (the default reader yields one document per page);
            # filename_as_id=True derives document IDs from the file name
            pdf_documents[key] = SimpleDirectoryReader(
                input_files=[path],
                filename_as_id=True
            ).load_data()
            print(f"✅ Successfully loaded {len(pdf_documents[key])} pages from {key}")
            
            print(f"Creating vector index for {key}...")
            pdf_indices[key] = VectorStoreIndex.from_documents(pdf_documents[key])
            print(f"✅ Successfully created index for {key}")
        except Exception as e:
            print(f"❌ Error loading {key}: {e}")

print(f"Total PDF indices created: {len(pdf_indices)}")

## Create Agent with Query Tools

Let's create individual query engines for each PDF source, then create an OpenAI agent that can use these tools to answer questions.
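
The pattern is "one named tool per source, each with a description the agent can read". A library-free sketch of that structure follows; `ToyTool` and `make_tools` are hypothetical stand-ins for illustration, and the placeholder engines simply echo the question instead of searching an index.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToyTool:
    """Hypothetical stand-in for a query-engine tool: a callable plus metadata."""
    name: str
    description: str
    run: Callable[[str], str]


def make_tools(engines: dict[str, Callable[[str], str]]) -> list[ToyTool]:
    """Wrap each engine in a tool whose description mentions a readable document name."""
    tools = []
    for key, engine in engines.items():
        readable = key.replace("-", " ").replace("_", " ").title()
        tools.append(ToyTool(
            name=key,
            description=f"Searches the {readable} document.",
            run=engine,
        ))
    return tools


# Placeholder engines that echo the question; real ones would query an index
toy_source_engines = {
    "doc-a": lambda q: f"[doc-a] answer to: {q}",
    "doc-b": lambda q: f"[doc-b] answer to: {q}",
}
toy_tools = make_tools(toy_source_engines)
for tool in toy_tools:
    print(tool.name, "->", tool.description)
```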

In [None]:
# Create query engines for each PDF index
pdf_query_engines = {}
for key, index in pdf_indices.items():
    # similarity_top_k=10 returns the 10 most relevant chunks; increase it if you need more context.
    # A higher top_k sends more tokens to the LLM, so queries become slower and more expensive.
    pdf_query_engines[key] = index.as_query_engine(similarity_top_k=10)

# Create a list of query engine tools
query_engine_tools = []

# Add PDF query engines to the tools list
for key, engine in pdf_query_engines.items():
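    # Keys use hyphens (e.g. "brochure-info-gestion"), so replace those to get a readable name
    display_name = key.replace("-", " ").replace("_", " ").title()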
    query_engine_tools.append(
        # Create a QueryEngineTool for each PDF query engine
        # This allows us to query each PDF source
        QueryEngineTool(
            query_engine=engine,
            metadata=ToolMetadata(
                name=key, 
                description=f"Provides comprehensive information about {display_name}. Use this tool to search for information specifically in the {display_name} document."
            )
        )
    )

print(f"✅ Created {len(query_engine_tools)} query engine tools")

# Create the OpenAI Agent with the query tools
agent = OpenAIAgent.from_tools(
    tools=query_engine_tools,
    llm=llm,
    verbose=True,
    system_prompt="You are a helpful AI assistant that can search through PDF documents to answer questions. "
                  "You have access to tools that can search through different PDF documents. "
                  "When a user asks a question, think about which document(s) might contain the relevant information "
                  "and use the appropriate tools to search for the answer. "
                  "If you need information from multiple documents, use multiple tools. "
                  "Always provide comprehensive and accurate answers based on the document content."
)

print("✅ Created OpenAI Agent that can query across all PDF sources using tools")

## Query Your PDF Documents with the Agent

Now we can use our agent to query all our PDF data sources intelligently!

In [None]:
def query_agent(query_text):
    """Query the agent and display the response."""
    print(f"Querying agent: '{query_text}'")
    print("\nAgent is thinking and using tools...\n")
    
    # Execute the query using the agent
    response = agent.chat(query_text)
    
    # Display the response
    print("\n" + "-"*50 + "\n")
    display(Markdown(f"**Agent Response:**\n\n{response}"))
    
    return response

# Example query about HES-SO fields of study
if "brochure-info-gestion" in pdf_query_engines:
    query_text = "What is the Bachelor's degree in Business Information Technology?"
    response = query_agent(query_text)

**Note:** The agent chooses which tools to call based on the query. It reasons about which documents are most likely to contain the relevant information and calls the matching tools. This is more flexible than the previous SubQuestionQueryEngine approach.
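
In the real agent, the language model makes this choice by reading each tool's name and description. As a crude stand-in, the sketch below picks tools by word overlap between the query and a description; `select_tools` and the toy descriptions are hypothetical and only illustrate the idea of description-driven routing.

```python
def select_tools(query: str, descriptions: dict[str, str], min_overlap: int = 1) -> list[str]:
    """Return tool names whose description shares at least `min_overlap` words with the query."""
    query_words = set(query.lower().split())
    selected = []
    for name, description in descriptions.items():
        overlap = query_words & set(description.lower().split())
        if len(overlap) >= min_overlap:
            selected.append(name)
    return selected


toy_descriptions = {
    "technology-doc": "programming software databases",
    "business-doc": "management accounting marketing",
}
print(select_tools("which courses cover databases", toy_descriptions))
print(select_tools("compare software and marketing courses", toy_descriptions))
```

This is also why the tool descriptions matter: a vague description gives the model little to go on when deciding which document to search.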

In [None]:
# Example of a query that requires the agent to use multiple sources
if len(query_engine_tools) > 1:
    query_text = "List all the bachelor programs mentioned in all the documents. Compare what each document offers."
    response = query_agent(query_text)

## Querying Specific PDF Sources

You can also query individual PDF sources directly.

In [None]:
# You can also ask the agent to focus on specific documents
if "brochure-info-gestion" in pdf_query_engines:
    print("Asking the agent to focus on the Business Information Technology document")
    
    query_text = "Tell me specifically about the Business Information Technology bachelor's degree from the info-gestion document. Please provide a bullet point list of key information."
    response = query_agent(query_text)
    
    print("\n" + "="*50 + "\n")
    
    # Compare with direct query engine access
    print("For comparison, here's a similar query using the direct query engine:")
    inf_gestion_engine = pdf_query_engines["brochure-info-gestion"]
    direct_response = inf_gestion_engine.query("What is the Bachelor's degree in Business Information Technology? Please provide a bullet point list")
    
    print("\n" + "-"*50 + "\n")
    display(Markdown(f"**Direct Query Engine Response:**\n\n{direct_response}"))

**Note:** You can see the difference between using an agent vs. direct query engines:

- **Agent approach**: The agent can reason about which tools to use, combine information from multiple sources, and provide more contextual answers. It's more flexible and can handle complex queries that span multiple documents.

- **Direct query engine**: Faster for simple queries on a single document, but lacks the reasoning capabilities of an agent.

The agent approach is particularly powerful when you have multiple data sources and need to answer complex questions that might require information synthesis from multiple documents.
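
The trade-off can be sketched without any library. Below, `ask_direct` makes a single call to one source, while `ask_all_and_combine` fans out to every source and merges the labeled answers. Both functions and the placeholder engines are hypothetical, and a real agent would decide which sources to call and synthesize the results with the language model rather than concatenating them.

```python
from typing import Callable

Engine = Callable[[str], str]


def ask_direct(engine: Engine, question: str) -> str:
    """Direct approach: one engine, one call, no routing."""
    return engine(question)


def ask_all_and_combine(engines: dict[str, Engine], question: str) -> str:
    """Simplified multi-source approach: query every source and merge labeled answers."""
    parts = [f"{name}: {engine(question)}" for name, engine in engines.items()]
    return "\n".join(parts)


# Placeholder engines; real query engines would search an index
toy_engines = {
    "source-1": lambda q: "answer from source 1",
    "source-2": lambda q: "answer from source 2",
}
print(ask_direct(toy_engines["source-1"], "example question"))
print(ask_all_and_combine(toy_engines, "example question"))
```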

In [None]:
# Demonstration: Agent's reasoning capabilities
print("Demonstrating the agent's ability to reason about tool selection:")
print("="*60)

# Ask a complex question that requires the agent to think about which tools to use
complex_query = "I'm trying to decide between different programs. Can you help me understand the differences between the programs offered in the two documents? Which document focuses more on technology and which on business?"

response = query_agent(complex_query)