# Simple agent system for PDF documents

This notebook demonstrates how to build a simple question-answering agent over PDF documents. The workflow has six steps:

1. Load a PDF document.
2. Split the document into chunks.
3. Embed the chunks and store them in a vector index.
4. Create a language model.
5. Create an agent that uses the index and the language model to answer questions about the document.
6. Ask questions of the agent and get answers.
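
LlamaIndex handles step 2 for you later in this notebook. As a rough illustration of what "splitting into chunks" means, the hypothetical helper `split_into_chunks` below cuts a string into fixed-size character windows that overlap. It is a sketch only: the sizes are arbitrary, and real splitters usually respect sentence boundaries.

```python
def split_into_chunks(text: str, chunk_size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into character windows of `chunk_size` that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_into_chunks(sample, chunk_size=10, overlap=4)
print(chunks)
```

The overlap means the end of one chunk is repeated at the start of the next, so a sentence cut at a boundary still appears whole in at least one chunk when it is shorter than the overlap.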

## Prerequisites

Before running this notebook, you need:
- An OpenAI API key
- PDF documents to query (by default, the `pdf` folder already contains two PDF documents)
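
To confirm the second prerequisite, you could list the PDFs that are available before going further. This is a minimal sketch; `list_pdfs` is a hypothetical helper, and it assumes the `pdf` folder sits next to the notebook.

```python
from pathlib import Path


def list_pdfs(folder: str = "pdf") -> list[Path]:
    """Return the PDF files directly inside `folder`, sorted by path."""
    root = Path(folder)
    if not root.is_dir():
        return []  # a missing folder simply yields no files
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


for pdf in list_pdfs():
    print(pdf.name)
```

If nothing is printed, the folder is missing or empty, and the loading step later in the notebook will warn about each missing file.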

In [None]:
# Install required packages
%pip install llama-index llama-index-llms-openai python-dotenv openai nest-asyncio nbconvert requests

# Verify installations
import importlib

def check_package(package_name):
    try:
        importlib.import_module(package_name)
        return True
    except ImportError:
        return False

packages = {
    "llama_index": "llama-index core",
    "llama_index.llms.openai": "llama-index-llms-openai",
    "dotenv": "python-dotenv",
    "openai": "OpenAI API",
    "nest_asyncio": "nest-asyncio", 
    "nbconvert": "nbconvert",
    "requests": "requests",
}

all_installed = True
for package, display_name in packages.items():
    installed = check_package(package)
    print(f"{display_name}: {'✅ Installed' if installed else '❌ Not installed'}")
    all_installed = all_installed and installed

if all_installed:
    print("\n✅ All required packages are installed!")
else:
    print("\n⚠️ Some packages are missing. Run the installation command again.")

## Environment Setup

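Load environment variables from the `.env` file and enable nested event loops for the notebook.

N.B. `load_dotenv()` looks for a `.env` file starting from the working directory and then moves up through the parent directories, so the file can live at the repository root.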
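
As a rough sketch of how a `.env` file can be located by walking up from a starting directory, consider the hypothetical helper `find_env_file` below. It only illustrates the idea with the standard library; python-dotenv's own lookup may differ in its details.

```python
from pathlib import Path
from typing import Optional


def find_env_file(start: Path, filename: str = ".env") -> Optional[Path]:
    """Return the nearest `filename` in `start` or one of its parents, else None."""
    start = start.resolve()
    for directory in (start, *start.parents):
        candidate = directory / filename
        if candidate.is_file():
            return candidate  # the closest match wins
    return None


print(find_env_file(Path.cwd()))
```

If this prints `None`, no `.env` file is reachable from the notebook's working directory, and you will need to set the key directly in the next cell.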

In [None]:
import os
from dotenv import load_dotenv
import nest_asyncio

# Apply nest_asyncio to allow nested event loops (needed for some async operations)
nest_asyncio.apply()

# Load environment variables from .env file
load_dotenv()

# Get API keys from environment variables or set them directly
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

# If environment variables are not loaded, you can set them here
# OPENAI_API_KEY = "your-openai-api-key"

# Set environment variables
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY or ""

# Verify API key is set
if not OPENAI_API_KEY:
    print("⚠️ Warning: OPENAI_API_KEY is not set")
else:
    print("✅ API key is set")

## Import Required Libraries

Let's import all the libraries we'll need for this notebook.

In [None]:
# Import core LlamaIndex components
from llama_index.core import SimpleDirectoryReader, Settings, VectorStoreIndex
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.core.query_engine import SubQuestionQueryEngine

# Import OpenAI LLM
from llama_index.llms.openai import OpenAI

# Import other utilities
import logging
import sys
from IPython.display import Markdown, display

# Configure basic logging
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

print("✅ Libraries imported successfully")

## Configure LLM

Set up the language model we'll use to answer queries. (Building the vector indices relies on an embedding model, not on this LLM.)
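
The exercise in the next cell uses a low `temperature`. To give an intuition for that parameter, the sketch below applies temperature scaling to a softmax over made-up scores. The function name and the scores are illustrative assumptions, and the sampling done by a hosted model may differ in detail.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn raw scores into probabilities; a lower temperature sharpens the result."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (1.0, 0.2):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At the lower temperature almost all of the probability goes to the highest-scoring candidate, which is why low values tend to give more repeatable answers.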

In [None]:
# Initialize the OpenAI LLM

# Exercise 1: Create the LLM, in a variable named `llm`, with these parameters:
#   model="gpt-4.1-nano"
#   temperature=0.2
#   streaming=True
#   system_prompt="You are a helpful assistant that provides accurate information about topics found in documents. Be thorough and make sure to search through the entire document, including any lists or tables that might appear on pages."

## YOUR CODE HERE

# Some documentation to help you:
# - https://docs.llamaindex.ai/en/stable/examples/llm/openai/#configure-model



# Exercise 2: Set up the global LlamaIndex configuration

## YOUR CODE HERE (one line)

# Some documentation to help you:
# - https://docs.llamaindex.ai/en/stable/module_guides/supporting_modules/settings/#configuring-settings

print(f"✅ LLM configured: {llm.model}")

## Load and Index PDF Documents

Let's load some PDF documents and create vector indices for them.
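
To give an idea of what a vector index does, here is a toy version built with the standard library. It is only a sketch: `embed`, `cosine`, and `ToyVectorIndex` are hypothetical names, the "embedding" is a simple word count rather than a learned dense vector, and the sample sentences are made up.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorIndex:
    def __init__(self, chunks: list[str]):
        # Store each chunk next to its vector so queries only embed the question.
        self.entries = [(chunk, embed(chunk)) for chunk in chunks]

    def query(self, question: str, top_k: int = 2) -> list[str]:
        q = embed(question)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:top_k]]


index = ToyVectorIndex([
    "apples grow on trees",
    "bread is baked in an oven",
    "rivers flow to the sea",
])
print(index.query("where do apples grow", top_k=1))
```

The real index follows the same outline (embed once at build time, compare at query time) but uses an embedding model, so it can match passages that share meaning rather than exact words.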

In [None]:
# This code is provided for you. Read it to understand what it does: the main goal of this notebook
# is to set up an LLM that can answer questions about the documents you provide, not to build the index itself.
# ======================================================================

# Define the paths to our PDF files
# Update these paths to match your file locations
pdf_paths = {
    "brochure-info-gestion":"pdf/hesso-brochure-a5-info-gestion-fr-web.pdf",
    "brochure-eco-entreprise":"pdf/hesso-brochure-a5-eco-entreprise-fr-web-cor2.pdf"
} 

# Check if the files exist
pdf_exists = {}
for key, path in pdf_paths.items():
    exists = os.path.exists(path)
    pdf_exists[key] = exists
    if not exists:
        print(f"⚠️ Warning: {key} PDF file not found at {path}")

# Only proceed with files that exist
pdf_documents = {}
pdf_indices = {}

for key, path in pdf_paths.items():
    if pdf_exists[key]:
        try:
            print(f"Loading {key} document...")
            # Load the PDF (one Document per page); filename_as_id=True derives document IDs from the file path
            pdf_documents[key] = SimpleDirectoryReader(
                input_files=[path],
                filename_as_id=True
            ).load_data()
            print(f"✅ Successfully loaded {len(pdf_documents[key])} pages from {key}")
            
            print(f"Creating vector index for {key}...")
            pdf_indices[key] = VectorStoreIndex.from_documents(pdf_documents[key])
            print(f"✅ Successfully created index for {key}")
        except Exception as e:
            print(f"❌ Error loading {key}: {e}")

print(f"Total PDF indices created: {len(pdf_indices)}")

## Create Query Engines

Let's create individual query engines for each PDF source, then combine them into a SubQuestionQueryEngine.
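
The idea of combining several sources can be sketched without any library. In the toy version below, each "tool" is a plain function, and the hypothetical `answer_across_sources` asks every source a source-specific version of the question before joining the partial answers. This is a simplification under stated assumptions: the real `SubQuestionQueryEngine` relies on the LLM to generate the sub-questions and to synthesize the final answer.

```python
from typing import Callable

# Each "tool" answers questions about exactly one source.
Tool = Callable[[str], str]


def answer_across_sources(question: str, tools: dict[str, Tool]) -> str:
    """Ask every source a tailored sub-question and join the partial answers."""
    partial_answers = []
    for name, ask in tools.items():
        sub_question = f"According to {name}: {question}"
        partial_answers.append(f"[{name}] {ask(sub_question)}")
    return "\n".join(partial_answers)


# Hypothetical stand-ins for real query engines.
tools = {
    "source_a": lambda q: f"answer from A to '{q}'",
    "source_b": lambda q: f"answer from B to '{q}'",
}
print(answer_across_sources("What is covered?", tools))
```

The tool names matter in the real engine as well: it uses each tool's name and description to decide which source a sub-question should go to.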

In [None]:
# Create query engines for each PDF index
pdf_query_engines = {}
for key, index in pdf_indices.items():
    # similarity_top_k=10 retrieves the 10 most relevant chunks for each query.
    # Increase it if answers miss relevant passages, but note that a higher top_k sends
    # more tokens to the LLM, which makes queries slower and more expensive.
    pdf_query_engines[key] = index.as_query_engine(similarity_top_k=10)

# Create a list of query engine tools
query_engine_tools = []

# Add PDF query engines to the tools list
for key, engine in pdf_query_engines.items():
    # The keys use hyphens, so replace both "-" and "_" to get a readable name
    display_name = key.replace("-", " ").replace("_", " ").title()
    
    query_engine_tools.append(
        # Exercise 3 : Create a QueryEngineTool for each PDF query engine
        # This allows us to query each PDF source
        
        # YOUR CODE HERE
        
        # Some documentation to help you:
        # - https://docs.llamaindex.ai/en/stable/module_guides/deploying/agents/tools/#queryenginetool
        # - https://docs.llamaindex.ai/en/stable/api_reference/tools/query_engine/#llama_index.core.tools.query_engine.QueryEngineTool
        # - https://docs.llamaindex.ai/en/stable/api_reference/tools/#llama_index.core.tools.types.ToolMetadata
      
      
    )

print(f"✅ Created {len(query_engine_tools)} query engine tools")

# Exercise 4: Create a SubQuestionQueryEngine named `multi_source_engine` that can query across all PDF sources, using the query engine tools defined above

## YOUR CODE HERE

# Some documentation to help you:
# - https://docs.llamaindex.ai/en/stable/examples/cookbooks/oreilly_course_cookbooks/Module-6/Router_And_SubQuestion_QueryEngine/#pydanticsingleselector

print("✅ Created SubQuestionQueryEngine that can query across all PDF sources")

## Query Your PDF Documents

Now we can query all our PDF data sources at once!

In [None]:
# Exercise 5 : Try to query the multi_source_engine with a question about the documents you provided

## YOUR CODE HERE

# Some documentation to help you:
# - https://docs.llamaindex.ai/en/stable/module_guides/deploying/query_engine/

## Querying Specific PDF Sources

You can also query individual PDF sources directly.

In [None]:
# Query just one of the PDF indices
if "brochure-info-gestion" in pdf_query_engines:
   
    # Exercise 6: Query pdf_query_engines["brochure-info-gestion"] with a question about that document
    
    ## YOUR CODE HERE (the `if` block needs at least one statement to run)
    
    # Some documentation to help you:
    # - https://docs.llamaindex.ai/en/stable/module_guides/deploying/query_engine/

N.B. Results here are often more focused, because the query engine retrieves from a single document instead of several.