Update eval framework to use SummarizerAgent by jjasinsk-amd · Pull Request #269 · amd/gaia

jjasinsk-amd · 2026-01-27T15:42:35Z

Summary

This PR extends the summarization pipeline with synthetic PDF creation and groundtruth generation, and integrates a summarization agent for inference and evaluation across Lemonade and Claude models.

Key Changes

- Adds a utility to generate synthetic PDF files for testing and benchmarking.
- Modifies groundtruth.py to create accurate groundtruth PDF files used in summarization evaluation.
- Updates batch_experiment.py to use the SummarizerAgent for summarizing content on both Lemonade Server and Claude.
- Adjusts evaluation criteria to match the output format and expectations of the summarization agent, so that benchmarking stays consistent and meaningful.

jjasinsk-amd · 2026-01-27T15:58:32Z

@claude please review this PR

github-actions · 2026-01-27T15:59:27Z

Claude finished @jjasinsk-amd's task

PR Review: Update eval framework to use SummarizerAgent

Summary

This PR integrates the SummarizerAgent into the evaluation framework, replacing direct PDF extraction and summarization logic with the centralized agent. The changes touch 4 files with significant refactoring.

🔴 Critical Issues

1. Security Finding - False Positive
The GitHub Advanced Security finding at pdf_document_generator.py:680 about "clear-text logging of sensitive information (password)" appears to be a false positive. The line in question:

print(f"  Total API tokens (input+output): {total_api_tokens:,}")

This logs token counts (usage metrics), not passwords or API keys. The scanner likely flagged "token" incorrectly. No action needed.
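To make the false-positive reasoning concrete, here is a minimal sketch of what a line like the flagged one computes. The function name and its parameters are hypothetical and are not taken from pdf_document_generator.py; only the format string matches the quoted line.

```python
def format_token_usage(input_tokens: int, output_tokens: int) -> str:
    """Build a usage line from integer token counts (hypothetical helper)."""
    total_api_tokens = input_tokens + output_tokens
    # ":," adds thousands separators; the value is a count, not a credential.
    return f"  Total API tokens (input+output): {total_api_tokens:,}"


print(format_token_usage(12_000, 3_456))
```

Nothing secret flows into the string: both inputs are plain integers, which is consistent with the scanner reacting to the word "token" rather than to the data.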

🟡 Important Observations

1. Good: SummarizerAgent Integration (batch_experiment.py:14-19, 149-153)
The PR correctly imports and instantiates SummarizerAgent for PDF extraction:

from gaia.agents.summarize.agent import SummarizerAgent
from gaia.agents.summarize.prompts import SUMMARY_STYLES, SYSTEM_PROMPTS
# ...
self.summarizer_agent = SummarizerAgent()

This promotes code reuse and centralizes PDF processing logic.

2. Good: Improved PDF Extraction (batch_experiment.py:155-170)
The new _extract_text_from_pdf method leverages SummarizerAgent's built-in caching:

pdf_text = self.summarizer_agent.get_summary_content_from_file(Path(pdf_path))
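The delegation described above can be sketched roughly as follows. Only the method name `get_summary_content_from_file` and the attribute `summarizer_agent` come from the quoted code; `StubAgent`, the `SummaryContentProvider` protocol, the `str` return type, and the path-keyed cache are assumptions so the example runs without gaia installed.

```python
from pathlib import Path
from typing import Protocol


class SummaryContentProvider(Protocol):
    """Anything exposing the one agent method quoted in the review."""

    def get_summary_content_from_file(self, path: Path) -> str: ...


class StubAgent:
    """Hypothetical stand-in for SummarizerAgent so the sketch runs offline."""

    def __init__(self) -> None:
        self.calls = 0
        self._cache: dict[Path, str] = {}

    def get_summary_content_from_file(self, path: Path) -> str:
        # Assumed caching strategy: key by path, extract at most once.
        if path not in self._cache:
            self.calls += 1
            self._cache[path] = f"text extracted from {path.name}"
        return self._cache[path]


class BatchExperimentRunner:
    def __init__(self, agent: SummaryContentProvider) -> None:
        self.summarizer_agent = agent

    def _extract_text_from_pdf(self, pdf_path: str) -> str:
        # Delegate extraction to the agent instead of reading the PDF directly.
        return self.summarizer_agent.get_summary_content_from_file(Path(pdf_path))
```

Injecting the agent through the constructor is a choice made for this sketch; it keeps the runner testable with a stub, whereas the PR instantiates `SummarizerAgent()` directly.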

3. Good: Local LLM Support in Evaluator (eval.py:24-43)
The Evaluator class now supports both Claude and local LLM for evaluation, providing flexibility for different deployment scenarios.
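A minimal sketch of such a dual-backend evaluator, assuming the backends can be modelled as plain callables. The names `use_local_llm` and `_send_prompt` appear elsewhere in this thread; the constructor signature, the default value, and the two callables are hypothetical.

```python
from typing import Callable


class Evaluator:
    """Sketch: route prompts to a hosted or local backend behind one method."""

    def __init__(
        self,
        claude_send: Callable[[str], str],
        local_send: Callable[[str], str],
        use_local_llm: bool = False,  # assumed default
    ) -> None:
        self.use_local_llm = use_local_llm
        self._claude_send = claude_send
        self._local_send = local_send

    def _send_prompt(self, prompt: str) -> str:
        # Single dispatch point, so callers never branch on the backend.
        send = self._local_send if self.use_local_llm else self._claude_send
        return send(prompt)
```

Keeping the branch in one method means the evaluation logic above it stays identical for both deployment scenarios.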

4. Concern: Removed pypdf Import
The diff shows removal of pypdf import and direct PDF reading code. Ensure SummarizerAgent handles all edge cases that were previously handled directly (large PDFs, malformed PDFs, etc.).
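One way to probe the edge cases mentioned here is a thin guard around whatever extractor is in use. This is a sketch under assumptions: `extract_or_none` is a hypothetical helper, and it is not known from this thread how SummarizerAgent itself reports malformed or empty PDFs.

```python
from pathlib import Path
from typing import Callable, Optional


def extract_or_none(extract: Callable[[Path], str], pdf_path: Path) -> Optional[str]:
    """Return extracted text, or None when the file is unusable."""
    # Reject paths that are not existing .pdf files before calling the extractor.
    if pdf_path.suffix.lower() != ".pdf" or not pdf_path.is_file():
        return None
    try:
        text = extract(pdf_path)
    except Exception:
        # A malformed PDF may raise inside the extractor; treat it as "no text".
        return None
    # Whitespace-only output (for example, a scanned PDF without a text layer) counts as empty.
    return text if text and text.strip() else None
```

Passing `self.summarizer_agent.get_summary_content_from_file` as `extract` would let the same checks run against the agent; whether the broad `except` is acceptable depends on how much diagnostic detail the eval run needs.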

5. New File: pdf_document_generator.py
A new 700+ line file for generating synthetic PDF documents using Claude. While functional, consider:

- The file is quite long and could benefit from being split into smaller modules
- Error handling and cost tracking are well implemented
- The template-based approach is well-structured

🟢 Minor Suggestions

1. groundtruth.py Additions
The new UseCase.PDF enum and associated prompts are well-structured. The PDF-specific ground truth generation prompt looks comprehensive.

2. Evaluator Class Improvements (eval.py)

- Good addition of _send_prompt() method for abstracting Claude vs local LLM
- Comprehensive timing and progress tracking

Questions for Author

- Were there any test failures when switching to SummarizerAgent for PDF extraction?
- Is the SummarizerAgent initialization at class level (__init__) intentional, or should it be lazy-loaded?
- Should the pdf_document_generator.py be in eval/ or would apps/ be more appropriate?
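For the lazy-loading question, a minimal sketch using the standard library's `functools.cached_property`. `ExpensiveAgent` and `Runner` are hypothetical stand-ins; the real classes and their construction cost are not shown in this thread.

```python
from functools import cached_property


class ExpensiveAgent:
    """Hypothetical stand-in for an agent that is costly to construct."""

    instances = 0

    def __init__(self) -> None:
        ExpensiveAgent.instances += 1


class Runner:
    @cached_property
    def summarizer_agent(self) -> ExpensiveAgent:
        # Built on first access, then stored on the instance and reused.
        return ExpensiveAgent()
```

With this shape, runs that never touch a PDF would not pay for agent construction, at the cost of moving any construction error from `__init__` to the first access.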

Checklist

- Code compiles/parses correctly
- Security finding is false positive
- Integration with existing SummarizerAgent is appropriate
- Tests should be verified (recommend running eval tests)

Overall Assessment: Approve with minor suggestions. The integration looks clean and follows good practices for code reuse.

kovtcharov-amd · 2026-01-28T20:35:13Z

@claude a couple of things:

- Does this change break any pre-existing functionality, especially anything described in the documentation?
- Does the documentation need to be updated? What needs to be added, removed, or modified?

github-actions · 2026-01-28T20:36:01Z

Claude finished @kovtcharov-amd's task

Analyzing Breaking Changes and Documentation Impact

- Read PR diff to understand all changes
- Check existing documentation for eval framework
- Identify any breaking changes to pre-existing functionality
- Determine documentation updates needed

Does this change break pre-existing functionality?

No breaking changes to the documented CLI interface. The PR is backwards-compatible:

| Aspect | Status | Details |
| --- | --- | --- |
| CLI commands | No change | `gaia groundtruth`, `gaia batch-experiment`, `gaia eval` work identically |
| Existing config files | Compatible | `basic_summarization.json`, `basic_qa.json` unchanged |
| Default workflows | Compatible | Meeting/email/QA workflows work without modification |
| Evaluation output format | Enhanced | New performance metrics added, existing fields preserved |
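The "existing fields preserved" claim in the last row can be checked mechanically. The sketch below assumes evaluation results are flat dictionaries; the field names `score`, `explanation`, and `latency_seconds` are made up for illustration and are not taken from the eval output schema.

```python
def missing_fields(old_result: dict, new_result: dict) -> set:
    """Keys present in the old output but absent from the new one."""
    return set(old_result) - set(new_result)


# Hypothetical before/after records; real field names may differ.
old = {"score": 0.8, "explanation": "covers key points"}
new = {"score": 0.8, "explanation": "covers key points", "latency_seconds": 1.2}

# An empty set means the new format is a superset of the old one.
print(missing_fields(old, new))
```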

Internal implementation changes (not user-facing):

- BatchExperimentRunner now uses SummarizerAgent for PDF extraction instead of direct pypdf (batch_experiment.py:149)
- Evaluator class now supports local LLM via use_local_llm parameter (eval.py:24-43)
- Evaluation criteria adjusted for different content types (PDF, email, transcript) for more nuanced evaluation

Does documentation need to be updated?

Yes, updates needed in docs/reference/eval.mdx:

1. Add new `--use-case pdf` option (Priority: High)

The PR adds UseCase.PDF (groundtruth.py:23) but documentation doesn't mention it.

Add to "Ground Truth Creation" section (~line 567):

gaia groundtruth -d ./output/pdfs -p "*.pdf" --use-case pdf -o ./output/groundtruth

Update "Use case options" list (~line 589):

- `pdf` - PDF document summarization (supports embedded groundtruth workflow)
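As a rough illustration of how a `pdf` value could map onto the enum, here is a sketch. Only `UseCase.PDF` and the CLI value `pdf` are confirmed by this thread; the other members, every string value besides `"pdf"`, and the `parse_use_case` helper are assumptions.

```python
from enum import Enum


class UseCase(Enum):
    # Only PDF is confirmed by the review; the rest are illustrative guesses.
    SUMMARIZATION = "summarization"
    QA = "qa"
    EMAIL = "email"
    PDF = "pdf"


def parse_use_case(value: str) -> UseCase:
    """Map a CLI string such as "pdf" to the enum, case-insensitively."""
    try:
        return UseCase(value.strip().lower())
    except ValueError:
        valid = ", ".join(u.value for u in UseCase)
        raise ValueError(f"unknown use case {value!r}; expected one of: {valid}") from None
```

Listing the valid values in the error message is the part most relevant to the documentation gap: a user who types an undocumented value at least sees what is accepted.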

2. Add PDF Summarization Workflow (Priority: High)

Add new "Workflow 5: PDF Document Summarization" section after Workflow 4:

### Workflow 5: PDF Document Summarization

**Complete example: Evaluate PDF document summarization capabilities**

Step 1: Generate synthetic PDF documents
```bash
python -m gaia.eval.pdf_document_generator -o ./output/test_data/pdfs --count-per-type 2
```

Step 2: Create PDF ground truth
```bash
gaia groundtruth -d ./output/test_data/pdfs -p "*.pdf" --use-case pdf -o ./output/groundtruth
```

Step 3: Run experiments (use pdf_summarization.json config)
```bash
gaia batch-experiment -c ./src/gaia/eval/configs/pdf_summarization.json -i ./output/groundtruth/consolidated_pdf_groundtruth.json -o ./output/experiments
```
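The three steps above could also be driven from a script. This is a sketch, not part of the PR: the commands and paths are copied from the steps, while `build_pipeline`, `run_pipeline`, and the dry-run default are hypothetical.

```python
import shlex
import subprocess

PDF_DIR = "./output/test_data/pdfs"
GROUNDTRUTH_DIR = "./output/groundtruth"


def build_pipeline(count_per_type: int = 2) -> list[list[str]]:
    """Return the argv list for each workflow step, in order."""
    return [
        ["python", "-m", "gaia.eval.pdf_document_generator",
         "-o", PDF_DIR, "--count-per-type", str(count_per_type)],
        # "*.pdf" is passed literally (no shell), matching the quoted pattern above.
        ["gaia", "groundtruth", "-d", PDF_DIR, "-p", "*.pdf",
         "--use-case", "pdf", "-o", GROUNDTRUTH_DIR],
        ["gaia", "batch-experiment",
         "-c", "./src/gaia/eval/configs/pdf_summarization.json",
         "-i", f"{GROUNDTRUTH_DIR}/consolidated_pdf_groundtruth.json",
         "-o", "./output/experiments"],
    ]


def run_pipeline(dry_run: bool = True) -> None:
    for argv in build_pipeline():
        print(shlex.join(argv))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(argv, check=True)
```

Calling `run_pipeline()` only prints the commands; `run_pipeline(dry_run=False)` would execute them and requires gaia to be installed.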

3. Document new config file (Priority: Medium)

Mention pdf_summarization.json config in the Batch Experimentation section (~line 636):

Available config examples in `./src/gaia/eval/configs/`:
- `basic_summarization.json` - Meeting/email summarization
- `basic_qa.json` - Q&A experiments  
- `pdf_summarization.json` - PDF document summarization

4. Document pdf_document_generator.py utility (Priority: Medium)

Add "Synthetic PDF Generation" subsection under "Synthetic Data Generation" (~line 519):

#### PDF Documents
Generate synthetic PDF documents for testing:
\`\`\`bash
python -m gaia.eval.pdf_document_generator -o ./output/test_data/pdfs --count-per-type 1
\`\`\`

**Available document types:** technical_spec, business_proposal, research_report, project_plan, policy_document, white_paper, user_manual, financial_report
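A small sketch of how `--count-per-type` might relate to the total number of generated files. It assumes the generator produces `count_per_type` documents for every listed type, which this thread does not confirm; `expected_pdf_count` is a hypothetical helper.

```python
DOCUMENT_TYPES = (
    "technical_spec",
    "business_proposal",
    "research_report",
    "project_plan",
    "policy_document",
    "white_paper",
    "user_manual",
    "financial_report",
)


def expected_pdf_count(count_per_type: int, types: tuple = DOCUMENT_TYPES) -> int:
    """Total PDFs, assuming every type is generated count_per_type times."""
    if count_per_type < 0:
        raise ValueError("count_per_type must be non-negative")
    return count_per_type * len(types)


print(expected_pdf_count(1))
```

Under that assumption, the `--count-per-type 1` invocation above would yield eight files and the `--count-per-type 2` invocation in the workflow would yield sixteen, which matters for estimating API cost.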

5. Update System Architecture diagram (Priority: Low)

The mermaid diagram already shows "Documents/PDFs" as input, so no change needed there.

Summary

| Item | Required Action |
| --- | --- |
| Breaking changes | None - PR is backwards-compatible |
| CLI documentation | Add `--use-case pdf` option |
| New workflow | Add "PDF Document Summarization" workflow |
| Config reference | Document `pdf_summarization.json` |
| Utility tool | Document `pdf_document_generator.py` |

Recommendation: The PR should either include these documentation updates, or a follow-up issue should be created to track them.

Use SummarizerAgent inside eval framework

ae8be22

github-actions Bot added the eval and performance labels Jan 27, 2026

Merge branch 'main' into jjasinsk/eval_summarization

d79ac02

github-advanced-security AI found potential problems Jan 27, 2026

Comment thread src/gaia/eval/pdf_document_generator.py Fixed

Adjust evaluation criteria for emails, transcripts and pdf documents

bd6132c

github-advanced-security AI found potential problems Jan 28, 2026

Comment thread src/gaia/eval/pdf_document_generator.py Fixed

Merge branch 'main' into jjasinsk/eval_summarization

387e0c6

jjasinsk-amd marked this pull request as ready for review January 28, 2026 15:20

jjasinsk-amd requested a review from kovtcharov-amd as a code owner January 28, 2026 15:20

kovtcharov-amd reviewed Jan 28, 2026

Comment thread src/gaia/eval/eval.py

kovtcharov-amd reviewed Jan 28, 2026

Comment thread src/gaia/eval/eval.py

kovtcharov-amd reviewed Jan 28, 2026

Comment thread src/gaia/eval/pdf_document_generator.py Outdated

Review changes

f0419d8

github-actions Bot added the documentation, agents, and cli labels Jan 29, 2026

Merge branch 'main' into jjasinsk/eval_summarization

825d25e

jjasinsk-amd requested a review from kovtcharov-amd January 30, 2026 09:08

jjasinsk-amd added 2 commits February 2, 2026 10:53

Merge branch 'main' into jjasinsk/eval_summarization

26dd0e4

Merge branch 'main' into jjasinsk/eval_summarization

a9b8db5

kovtcharov reviewed Feb 3, 2026

Comment thread src/gaia/eval/batch_experiment.py

kovtcharov reviewed Feb 3, 2026

Comment thread src/gaia/eval/eval.py Outdated

kovtcharov approved these changes Feb 3, 2026

jjasinsk-amd added 2 commits February 4, 2026 11:25

Merge remote-tracking branch 'origin' into jjasinsk/eval_summarization

29b7b01

Keep all config knobs, review changes

13ad79a

github-actions Bot added the dependencies and chat labels Feb 5, 2026

jjasinsk-amd and others added 6 commits February 5, 2026 17:02

Merge branch 'main' into jjasinsk/eval_summarization

a93a72d

Code formatting

22b7d0a

Merge branch 'main' into jjasinsk/eval_summarization

05ede91

Merge branch 'main' into jjasinsk/eval_summarization

dce8567

Final cleanup

b57ee80

Style changes

d403cdd

jjasinsk-amd added this pull request to the merge queue Feb 6, 2026

Merged via the queue into main with commit 2cca205 Feb 6, 2026
51 checks passed

jjasinsk-amd deleted the jjasinsk/eval_summarization branch February 6, 2026 11:20

jjasinsk-amd merged 16 commits into main from jjasinsk/eval_summarization
