# Gather Evidence Using PaperQA2

![Agent with scientific knowledge tool](../../../static/agent-gather-evidence.png)

This lab introduces several advanced RAG techniques, including:

- Re-ranking of retrieved passages for relevance
- Contextual summarization of evidence
- Precise citation tracking with token count monitoring
- Cost estimation for different model configurations

This lab uses PaperQA2, a retrieval-augmented generation (RAG) agent developed by FutureHouse that specializes in scientific literature analysis. It uses a technique called "reranking and contextual summarization" (RCS) to improve the quality of retrieved evidence. This process begins with a standard top-k dense vector retrieval step that uses embedding vectors to identify potentially relevant document chunks. High-scoring chunks are then summarized by an LLM and reranked so that only the most relevant text influences the final answer. PaperQA2 demonstrated superior performance, achieving 85.2% precision in question answering tasks compared to 73.8% human performance [1].
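
The retrieve, summarize, and rerank flow described above can be sketched in plain Python. This is a toy illustration, not PaperQA2's implementation: `embed` is a bag-of-words stand-in for a dense embedding model, `summarize_and_score` is a keyword-overlap stand-in for the LLM summarization call, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a dense embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def summarize_and_score(question: str, chunk: str) -> tuple[str, int]:
    """Hypothetical stand-in for the LLM call: return (summary, relevance 0-10)."""
    q_terms = set(embed(question))
    sentences = [s.strip() for s in chunk.split(".") if s.strip()]
    if not sentences or not q_terms:
        return "", 0
    # Use the sentence sharing the most terms with the question as the "summary"
    best = max(sentences, key=lambda s: len(q_terms & set(embed(s))))
    overlap = len(q_terms & set(embed(best)))
    return best, round(10 * overlap / len(q_terms))


def rerank_with_contextual_summaries(question, chunks, top_k=3, max_sources=2):
    """Top-k retrieval, then per-chunk summarization, then reranking by score."""
    q_vec = embed(question)
    # Step 1: keep the top_k chunks by vector similarity to the question
    candidates = sorted(
        chunks, key=lambda c: cosine(q_vec, embed(c)), reverse=True
    )[:top_k]
    # Step 2: summarize each candidate in the context of the question
    scored = [summarize_and_score(question, c) for c in candidates]
    # Step 3: rerank by relevance score and keep only the best summaries
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:max_sources]
```

In this sketch, `top_k` and `max_sources` play roles roughly analogous to the `evidence_k` and `answer_max_sources` settings used later in this notebook.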

In this lab, you will explore the impact of advanced RAG techniques on the quality of the agentic response. You will also experiment with different LLMs and observe the impact on output quality and inference cost. Finally, you will incorporate PaperQA2 into a `gather_evidence` tool and add it to a Strands agent deployed on Bedrock AgentCore.

## Sources

1. Skarlinski, Michael D., et al. "Language agents achieve superhuman synthesis of scientific knowledge." arXiv preprint arXiv:2409.13740, 26 Sept. 2024, doi.org/10.48550/arXiv.2409.13740.

## 1. Prerequisites

- Python 3.10 or later
- AWS account configured with appropriate permissions
- Access to the Anthropic Claude Sonnet 4 model on Amazon Bedrock
- Basic understanding of Python programming

In [None]:
%pip install -U boto3 strands-agents strands-agents-tools defusedxml httpx bedrock_agentcore_starter_toolkit paper-qa

## 2. Experiment with PaperQA2

### 2.1. Basic usage

In [None]:
import warnings

warnings.filterwarnings("ignore", category=UserWarning, module="pydantic")
warnings.filterwarnings("ignore", module="litellm")

Download the example paper: Rodriguez, Patricia J., et al. "Semaglutide vs Tirzepatide for Weight Loss in Adults With Overweight or Obesity." JAMA Internal Medicine, vol. 184, no. 9, 8 July 2024, pp. 1056-1064, doi:10.1001/jamainternmed.2024.2525.


In [None]:
from pathlib import Path

import boto3
from botocore import UNSIGNED
from botocore.config import Config

# The PMC Open Access bucket is public, so send unsigned (anonymous) requests
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

pmc_id = "PMC9438179"
bucket_name = "pmc-oa-opendata"
object_key = f"oa_comm/txt/all/{pmc_id}.txt"

# Create the papers directory if it doesn't exist
papers_dir = Path(f"my_papers/{pmc_id}/txt")
papers_dir.mkdir(parents=True, exist_ok=True)

# Download the full-text article into the papers directory
local_file_path = papers_dir / f"{pmc_id}.txt"
s3.download_file(bucket_name, object_key, str(local_file_path))
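
If you want to repeat the download for other articles, the key and path construction can be factored into a small helper. This is a sketch under the assumption that every article you need lives under the same `oa_comm/txt/all/` prefix used above; `pmc_locations` is a hypothetical name, not part of boto3 or PaperQA2.

```python
from pathlib import Path


def pmc_locations(pmc_id: str, root: str = "my_papers") -> tuple[str, Path]:
    """Return the S3 object key and local file path for one PMC article.

    Assumes the article is in the commercial-use subset (oa_comm) and mirrors
    the folder layout used in this notebook: <root>/<pmc_id>/txt/<pmc_id>.txt
    """
    object_key = f"oa_comm/txt/all/{pmc_id}.txt"
    local_path = Path(root) / pmc_id / "txt" / f"{pmc_id}.txt"
    return object_key, local_path


# Example usage with the client defined earlier (requires network access):
# key, path = pmc_locations("PMC9438179")
# path.parent.mkdir(parents=True, exist_ok=True)
# s3.download_file("pmc-oa-opendata", key, str(path))
```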

In [None]:
from paperqa import Settings, ask
from paperqa.settings import (
    AgentSettings,
    ParsingSettings,
    IndexSettings,
    AnswerSettings,
)

LLM = "global.anthropic.claude-sonnet-4-20250514-v1:0"
SUMMARY_LLM = "bedrock/openai.gpt-oss-120b-1:0"
AGENT_TYPE = "fake"
EVIDENCE_K = 5
EVIDENCE_SUMMARY_LENGTH = "25 to 50 words"

answer_response = await ask(
    "How safe and effective are GLP-1 drugs for long term use?",
    settings=Settings(
        llm=LLM,
        summary_llm=SUMMARY_LLM,
        agent=AgentSettings(
            agent_llm=LLM,
            index=IndexSettings(
                index_directory="my_papers/PMC9438179/index",
                paper_directory="my_papers/PMC9438179/txt",
            ),
            agent_type=AGENT_TYPE,
        ),
        embedding="bedrock/amazon.titan-embed-text-v2:0",
        parsing=ParsingSettings(use_doc_details=False),
        answer=AnswerSettings(
            answer_max_sources=1,
            evidence_k=EVIDENCE_K,
            evidence_summary_length=EVIDENCE_SUMMARY_LENGTH,
        ),
    ),
)

In [None]:
def pretty_print_paperqa_results(answer) -> None:
    """Pretty print the output from the PaperQA2 query."""
    session_output = answer.session

    print("**Question**\n")
    print(session_output.question + "\n")
    print("**Answer**\n")
    print(session_output.answer + "\n")
    print("**Evidence**\n")
    for context in session_output.contexts:
        print(f"{context.text.name}:\t{context.context}")
    print("**Token Counts**\n")
    print(f"{'Model':<45} {'Input':<8} {'Output':<8}")
    print("-" * 65)
    for model, values in session_output.token_counts.items():
        print(f"{model:<45} {values[0]:<8} {values[1]:<8}")
    print()
    print("**Estimated Cost**\n")
    print(round(session_output.cost, 3))

In [None]:
pretty_print_paperqa_results(answer_response)

### 2.2. Cost Optimization with Multiple LLMs

Try different LLMs for each role. From the PaperQA2 documentation:

- `llm`: LLM for general use, including metadata inference and answer generation. This should be the 'best' LLM. Uses include:
    1. Inferring citation information from documents when left unspecified
    1. Extracting title, DOI, and authors from citation information when left unspecified
    1. Optionally injecting pre-answer information
    1. Generating an answer given evidence
    1. Optionally injecting post-answer information
- `summary_llm`: LLM for creating contextual summaries.
- `agent_llm`: LLM inside the agent making tool selections.
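
One way to organize this experiment is to list the role assignments before running anything. The sketch below only builds plain dictionaries and does not call PaperQA2; `role_assignments` is a hypothetical helper, and it assumes, as in the cells in this notebook, that the agent uses the same model as `llm`.

```python
from itertools import product


def role_assignments(answer_models, summary_models):
    """Enumerate every (llm, summary_llm) pairing as a dict of role -> model ID."""
    return [
        # agent_llm follows llm here; it is passed via AgentSettings in this notebook
        {"llm": llm, "summary_llm": summary, "agent_llm": llm}
        for llm, summary in product(answer_models, summary_models)
    ]


configs = role_assignments(
    answer_models=["global.anthropic.claude-sonnet-4-20250514-v1:0"],
    summary_models=[
        "global.anthropic.claude-sonnet-4-20250514-v1:0",
        "bedrock/openai.gpt-oss-120b-1:0",
    ],
)
for config in configs:
    print(config)
```

Each dictionary can then be used to fill in the `LLM` and `SUMMARY_LLM` constants in the next cell, one run per configuration.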

In [None]:
from paperqa import Settings, ask
from paperqa.settings import (
    AgentSettings,
    ParsingSettings,
    IndexSettings,
    AnswerSettings,
)

LLM = "global.anthropic.claude-sonnet-4-20250514-v1:0"
SUMMARY_LLM = "bedrock/openai.gpt-oss-120b-1:0"
AGENT_TYPE = "fake"
EVIDENCE_K = 5
EVIDENCE_SUMMARY_LENGTH = "25 to 50 words"

answer_response = await ask(
    "How safe and effective are GLP-1 drugs for long term use?",
    settings=Settings(
        llm=LLM,
        summary_llm=SUMMARY_LLM,
        agent=AgentSettings(
            agent_llm=LLM,
            index=IndexSettings(
                index_directory="my_papers/PMC9438179/index",
                paper_directory="my_papers/PMC9438179/txt",
            ),
            agent_type=AGENT_TYPE,
        ),
        embedding="bedrock/amazon.titan-embed-text-v2:0",
        parsing=ParsingSettings(use_doc_details=False),
        answer=AnswerSettings(
            answer_max_sources=1,
            evidence_k=EVIDENCE_K,
            evidence_summary_length=EVIDENCE_SUMMARY_LENGTH,
        ),
    ),
)

pretty_print_paperqa_results(answer_response)

By using a different model for the evidence summarization step, we were able to reduce the estimated cost by nearly half!
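
To see where a cost difference comes from, you can recompute an estimate from the per-model token counts. The sketch below assumes the same shape printed by `pretty_print_paperqa_results` (model name mapped to `[input_tokens, output_tokens]`). The model names and prices are placeholders, not real Bedrock pricing, and `estimate_cost` and `percent_reduction` are hypothetical helpers.

```python
def estimate_cost(token_counts, prices_per_1k):
    """Sum input and output token costs across models.

    token_counts:  {model: [input_tokens, output_tokens]}
    prices_per_1k: {model: (input_price, output_price)} in USD per 1,000 tokens
    """
    total = 0.0
    for model, (n_in, n_out) in token_counts.items():
        price_in, price_out = prices_per_1k[model]
        total += n_in / 1000 * price_in + n_out / 1000 * price_out
    return total


def percent_reduction(baseline, candidate):
    """Percentage by which candidate is cheaper than baseline."""
    return 100 * (baseline - candidate) / baseline


# Placeholder numbers for illustration only
example_counts = {"answer-model": [2000, 500], "summary-model": [8000, 400]}
example_prices = {"answer-model": (0.003, 0.015), "summary-model": (0.0005, 0.002)}
print(round(estimate_cost(example_counts, example_prices), 4))
```

Because the summarization step reads every retrieved chunk, it tends to dominate input tokens, which is why the choice of `summary_llm` can have a large effect on the total.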

## 3. Create Strands Agent and Tool

### 3.1. Call Gather Evidence Tool Directly

The file `gather_evidence.py` shows how to wrap PaperQA2 in a Strands tool. Let's add it to a test agent and call it directly. This may take 1-2 minutes to complete.

In [None]:
from strands import Agent
from gather_evidence import gather_evidence_tool

MODEL_ID = "global.anthropic.claude-sonnet-4-20250514-v1:0"
agent = Agent(tools=[gather_evidence_tool], model=MODEL_ID)

agent.tool.gather_evidence_tool(
    pmcid="PMC9438179",
    question="How safe and effective are GLP-1 drugs for long term use?",
)

### 3.2. Define Agent

Let's define an agent that can use the `search_pmc` tool from notebook 01 and the new `gather_evidence` tool from this notebook to find and understand scientific articles.

In [None]:
from strands import Agent
from search_pmc import search_pmc_tool
from gather_evidence import gather_evidence_tool

MODEL_ID = "global.anthropic.claude-sonnet-4-20250514-v1:0"
QUERY = "How safe and effective are GLP-1 drugs for long term use?"

SYSTEM_PROMPT = """You are a life science research assistant. When given a scientific question, follow this process:

1. Use search_pmc_tool to find highly-cited papers. Search broadly first, then narrow down. Use temporal filters like "last 2 years"[dp] for recent work.
2. Identify the PMC ID values for the 2 most relevant papers, then submit their ID values and the query to the gather_evidence_tool.
3. Generate a concise answer to the question based on the most relevant evidence, along with source citations.
"""


# Initialize your agent
agent = Agent(
    system_prompt=SYSTEM_PROMPT,
    tools=[search_pmc_tool, gather_evidence_tool],
    model=MODEL_ID,
)

# Send a message to the agent
response = agent(QUERY)
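
When debugging the agent, it can help to check which PMC IDs appear in a block of search output before they are passed to `gather_evidence_tool`. The following is a minimal sketch that assumes IDs are written as `PMC` followed by digits; `extract_pmc_ids` is a hypothetical helper and is not part of the workshop tools.

```python
import re

# Assumes PMC IDs appear in the form "PMC" followed by one or more digits
PMC_ID_PATTERN = re.compile(r"\bPMC\d+\b")


def extract_pmc_ids(text: str, limit: int = 2) -> list[str]:
    """Return up to `limit` unique PMC IDs in order of first appearance."""
    unique_ids = []
    for pmc_id in PMC_ID_PATTERN.findall(text):
        if pmc_id not in unique_ids:
            unique_ids.append(pmc_id)
    return unique_ids[:limit]


sample = "Top hits: PMC9438179 (cited 120 times), PMC123 and PMC9438179 again."
print(extract_pmc_ids(sample))
```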

## 4. Deploy to Amazon Bedrock AgentCore Runtime

In [None]:
import boto3
from bedrock_agentcore_starter_toolkit import Runtime

ssm = boto3.client("ssm")

agentcore_runtime = Runtime()
agentcore_runtime.configure(
    agent_name="pmc_gather_evidence_agent",
    auto_create_ecr=True,
    execution_role=ssm.get_parameter(
        Name="/deep-research-workshop/agentcore-runtime-role-arn"
    )["Parameter"]["Value"],
    entrypoint="agent.py",
    memory_mode="NO_MEMORY",
    requirements_file="requirements.txt",
)

In [None]:
agentcore_runtime.launch(auto_update_on_conflict=True)

In [None]:
%%time

agentcore_runtime.invoke(
    {"prompt": "How safe and effective are GLP-1 drugs for long term use?"}
)

## 5. (Optional) Interact with agent using AgentCore Chat

Follow these steps to open an interactive chat session with your new agent.

1. Open a command line terminal in your notebook environment.
2. Navigate to the project root folder (where `pyproject.toml` is located).
3. Run `pip install .` to install the workshop tools including the chat CLI.
4. Run `agentcore-chat` to launch the CLI.
5. Select the `pmc_gather_evidence_agent` by typing its name or index in the terminal and press Enter.
6. Ask your question at the `You:` prompt and press Enter.


## 6. (Optional) Clean Up

Run the next notebook cell to delete the AgentCore runtime environment.

In [None]:
import boto3

agentcore_client = boto3.client("bedrock-agentcore-control")
agent_status = agentcore_runtime.status()

agentcore_client.delete_agent_runtime(agentRuntimeId=agent_status.config.agent_id)