# Simple PDF Query Tool

This notebook shows how to build a query system for PDF documents with LlamaIndex. It covers:

1. How to load PDF documents
2. How to create vector indices for these documents
3. How to use SubQuestionQueryEngine to query across different PDF sources
4. How to customize the system with different LLMs and prompts

## Prerequisites

Before running this notebook, you need:
- An OpenAI API key
- PDF documents to query. If they are not in the examples folder, update the file paths in the corresponding cells.
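
If you would rather not hard-code each file path, one option is to collect the PDFs from a folder. The sketch below is only an illustration: `collect_pdf_paths` is a hypothetical helper, not part of LlamaIndex, and the key naming scheme (lowercased file stem) is an assumption. It returns a dictionary of short keys mapped to file paths.

```python
from pathlib import Path


def collect_pdf_paths(folder: str) -> dict[str, str]:
    """Map a short key (the lowercased file stem) to each PDF path in `folder`."""
    paths = {}
    # glob("*.pdf") is case-sensitive on most Linux file systems
    for pdf in sorted(Path(folder).glob("*.pdf")):
        key = pdf.stem.lower().replace(" ", "_")
        paths[key] = str(pdf)
    return paths


# Hypothetical usage: "examples" is the folder mentioned above
found_paths = collect_pdf_paths("examples")
print(f"Found {len(found_paths)} PDF file(s)")
```

A missing folder simply yields an empty dictionary, so the result is worth checking before indexing.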

In [None]:
# Install required packages
!pip install llama-index llama-index-llms-openai python-dotenv openai nest-asyncio

# Verify installations
import importlib

def check_package(package_name):
    try:
        importlib.import_module(package_name)
        return True
    except ImportError:
        return False

# Keys are import names, which can differ from the pip package names
packages = {
    "llama_index.core": "llama-index core",
    "llama_index.llms.openai": "llama-index-llms-openai",
    "dotenv": "python-dotenv",
    "openai": "OpenAI API",
    "nest_asyncio": "nest-asyncio",
}

all_installed = True
for package, display_name in packages.items():
    installed = check_package(package)
    print(f"{display_name}: {'✅ Installed' if installed else '❌ Not installed'}")
    all_installed = all_installed and installed

if all_installed:
    print("\n✅ All required packages are installed!")
else:
    print("\n⚠️ Some packages are missing. Run the installation command again.")

## Environment Setup

Load environment variables from the `.env` file and set up for PDF processing. <br>
N.b. `load_dotenv()` looks for a `.env` file in the current working directory first, then in each parent directory in turn.
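
To make the lookup concrete, here is a simplified standard-library sketch of an upward search for a `.env` file, followed by a minimal parser. It is not python-dotenv's implementation: `find_env_file` and `parse_env_file` are hypothetical names, and the real library also handles details such as `export` prefixes and variable interpolation.

```python
import os
from pathlib import Path
from typing import Optional


def find_env_file(start_dir: str, filename: str = ".env") -> Optional[Path]:
    """Return the first `filename` found in `start_dir` or any of its parents."""
    current = Path(start_dir).resolve()
    for directory in [current, *current.parents]:
        candidate = directory / filename
        if candidate.is_file():
            return candidate
    return None


def parse_env_file(path: Path) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and one layer of quotes
        values[key.strip()] = value.strip().strip("'\"")
    return values


env_path = find_env_file(os.getcwd())
print(f".env location: {env_path}")
```

If this prints `None`, no `.env` file sits in the working directory or above it, and the key has to be set another way.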

In [None]:
import os
from dotenv import load_dotenv
import nest_asyncio
import asyncio

# Apply nest_asyncio to allow nested event loops (needed for some async operations)
nest_asyncio.apply()

# Load environment variables from .env file
load_dotenv()

# Get API keys from environment variables or set them directly
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

# If environment variables are not loaded, you can set them here
# OPENAI_API_KEY = "your-openai-api-key"

# Export the key only if one was found, so an empty value never masks a missing key
if OPENAI_API_KEY:
    os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

# Verify API key is set
if not OPENAI_API_KEY:
    print("⚠️ Warning: OPENAI_API_KEY is not set")
else:
    print("✅ API key is set")

## Import Required Libraries

Let's import all the libraries we'll need for this notebook.

In [None]:
# Import core LlamaIndex components
from llama_index.core import SimpleDirectoryReader, Settings, VectorStoreIndex, SummaryIndex
from llama_index.core.response.pprint_utils import pprint_response
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.core.query_engine import SubQuestionQueryEngine

# Import OpenAI LLM
from llama_index.llms.openai import OpenAI

# Import other utilities
import logging
import sys
from IPython.display import Markdown, display

# Configure basic logging with a single stdout handler (a second handler would print every message twice)
logging.basicConfig(stream=sys.stdout, level=logging.INFO, force=True)

print("✅ Libraries imported successfully")

## Configure LLM

Set up the language model we'll use for our queries and indexing.

In [None]:
# Initialize the OpenAI LLM
llm = OpenAI(
    model="gpt-4o-mini",  # You can change this to another model like "gpt-3.5-turbo"
    temperature=0.2,      # Lower temperature for more consistent results
    # Streaming is requested per query engine, e.g. index.as_query_engine(streaming=True), not on the LLM
    system_prompt="You are a helpful assistant that provides accurate information about topics found in the documents."
)

# Set up the global LlamaIndex configuration
Settings.llm = llm

print(f"✅ LLM configured: {llm.model}")

## Load and Index PDF Documents

Let's load some PDF documents and create vector indices for them.
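
As a rough mental model of what a vector index does, the toy sketch below stores text chunks next to a vector and returns the chunks most similar to a query. It is an illustration only: the bag-of-words `embed` function stands in for a learned embedding model, `ToyVectorIndex` is a hypothetical class rather than LlamaIndex's implementation, and the chunks are made-up placeholders.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (stand-in for a learned embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorIndex:
    """Stores (chunk, vector) pairs and returns the top-k most similar chunks."""

    def __init__(self, chunks: list[str]):
        self.entries = [(chunk, embed(chunk)) for chunk in chunks]

    def retrieve(self, query: str, top_k: int = 3) -> list[str]:
        query_vec = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(query_vec, e[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:top_k]]


# Made-up placeholder chunks, not taken from the PDFs
index = ToyVectorIndex([
    "page one describes hull design",
    "page two describes engine layout",
    "page three lists crew sizes",
])
print(index.retrieve("engine layout", top_k=1))
```

The `top_k` argument plays the same role as `similarity_top_k` on a query engine: it caps how many chunks are handed to the LLM as context.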

In [None]:
# Define the paths to our PDF files
# Update these paths to match your file locations
pdf_paths = {
    "japanese_destroyers_1": "Imperial Japanese Navy Destroyers 1919–45 (1) Minekaze to Shiratsuyu Classes (Mark Stille, Paul Wright (Illustrator)) (Z-Library).pdf",
    "japanese_destroyers_2": "Imperial Japanese Navy Destroyers 1919–45 (2) Asashio to Tachibana Classes (Mark Stille, Paul Wright (Illustrator)) (Z-Library).pdf"
    # Add more PDF files here as needed
}

# Check if the files exist
pdf_exists = {}
for key, path in pdf_paths.items():
    exists = os.path.exists(path)
    pdf_exists[key] = exists
    if not exists:
        print(f"⚠️ Warning: {key} PDF file not found at {path}")

# Only proceed with files that exist
pdf_documents = {}
pdf_indices = {}

for key, path in pdf_paths.items():
    if pdf_exists[key]:
        try:
            print(f"Loading {key} document...")
            pdf_documents[key] = SimpleDirectoryReader(input_files=[path]).load_data()
            print(f"✅ Successfully loaded {len(pdf_documents[key])} pages from {key}")
            
            print(f"Creating vector index for {key}...")
            pdf_indices[key] = VectorStoreIndex.from_documents(pdf_documents[key])
            print(f"✅ Successfully created index for {key}")
        except Exception as e:
            print(f"❌ Error loading {key}: {e}")

print(f"Total PDF indices created: {len(pdf_indices)}")

## Create Query Engines

Let's create individual query engines for each PDF source, then combine them into a SubQuestionQueryEngine.
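
The sub-question pattern can be sketched without any LLM: split a query into one sub-question per tool, answer each with that tool, then combine the partial answers. The code below is a toy illustration under loud assumptions: `Tool`, `generate_sub_questions`, and `sub_question_query` are hypothetical names, the decomposition is a fixed template where the real engine asks an LLM, and the answers are placeholders.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Tool:
    name: str
    description: str
    answer: Callable[[str], str]  # stand-in for a query engine's query method


def generate_sub_questions(query: str, tools: list[Tool]) -> list[tuple[str, str]]:
    """Toy decomposition: ask every tool the same question, scoped to its description.

    In the real engine an LLM chooses the sub-questions and the tools.
    """
    return [(tool.name, f"{query} (according to: {tool.description})") for tool in tools]


def sub_question_query(query: str, tools: list[Tool]) -> str:
    """Answer each sub-question with its tool, then join the partial answers."""
    by_name = {tool.name: tool for tool in tools}
    partial_answers = []
    for tool_name, sub_question in generate_sub_questions(query, tools):
        answer = by_name[tool_name].answer(sub_question)
        partial_answers.append(f"[{tool_name}] {answer}")
    # The real engine passes the partial answers to an LLM for a final synthesis
    return "\n".join(partial_answers)


tools = [
    Tool("source_a", "first document", lambda q: "placeholder answer from A"),
    Tool("source_b", "second document", lambda q: "placeholder answer from B"),
]
print(sub_question_query("What changed between the two volumes?", tools))
```

Because the tool descriptions steer which source receives which sub-question, specific descriptions tend to matter more than the tool names.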

In [None]:
# Create query engines for each PDF index
pdf_query_engines = {}
for key, index in pdf_indices.items():
    pdf_query_engines[key] = index.as_query_engine(similarity_top_k=3)

# Create a list of query engine tools
query_engine_tools = []

# Add PDF query engines to the tools list
for key, engine in pdf_query_engines.items():
    display_name = key.replace("_", " ").title()
    query_engine_tools.append(
        QueryEngineTool(
            query_engine=engine,
            metadata=ToolMetadata(
                name=key, 
                description=f"Provides information about {display_name}"
            )
        )
    )

print(f"✅ Created {len(query_engine_tools)} query engine tools")

# Create the SubQuestionQueryEngine
multi_source_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=query_engine_tools
)

print("✅ Created SubQuestionQueryEngine that can query across all PDF sources")

## Query Your PDF Documents

Now we can query all our PDF data sources at once!

In [None]:
def query_system(query_text):
    """Query the multi-source PDF system and display the response."""
    print(f"Querying: '{query_text}'")
    print("\nThinking...\n")
    
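    # Temporarily set logging to DEBUG to show the sub-questions
    root_logger = logging.getLogger()
    previous_level = root_logger.level
    root_logger.setLevel(logging.DEBUG)
    
    try:
        # Execute the query
        response = multi_source_engine.query(query_text)
    finally:
        # Restore the previous logging level even if the query fails
        root_logger.setLevel(previous_level)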
    
    # Display the response
    print("\n" + "-"*50 + "\n")
    display(Markdown(f"**Answer:**\n\n{response}"))
    
    return response

# Example query about Japanese destroyers
if "japanese_destroyers_1" in pdf_query_engines:
    query_text = "What was the Shiratsuyu class destroyer?"
    response = query_system(query_text)

In [None]:
# Example query that might combine information from multiple PDF sources
if len(query_engine_tools) > 1:
    query_text = "Compare the Japanese naval vessels of the 1930s with their role in military strategy."
    response = query_system(query_text)

## Querying Specific PDF Sources

You can also query individual PDF sources directly.

In [None]:
# Query just one of the PDF indices
if "japanese_destroyers_2" in pdf_query_engines:
    print("Querying just the second destroyer data...")
    destroyers_engine = pdf_query_engines["japanese_destroyers_2"]
    
    query_text = "What were the main Japanese destroyers of World War II?"
    response = destroyers_engine.query(query_text)
    
    print("\n" + "-"*50 + "\n")
    display(Markdown(f"**Destroyer Information:**\n\n{response}"))

## Conclusion

In this notebook, we demonstrated how to:
1. Load and index PDF documents using LlamaIndex
2. Create query engines for each PDF document
3. Combine these query engines into a multi-source research system
4. Query the system to get answers that draw from all available PDF knowledge

This approach suits research assistants that need to answer questions by drawing on several PDF sources at once.