
Handle password-protected and corrupted PDFs gracefully #451

@kovtcharov

Problem

`src/gaia/rag/sdk.py:413-578` does not detect encrypted or password-protected PDFs. Text extraction from an encrypted PDF returns empty strings, yet the operation reports success:

reader = PdfReader(pdf_path)       # Succeeds even if encrypted
for i, page in enumerate(reader.pages):
    pypdf_text = page.extract_text()  # Returns empty string for encrypted pages

The user sees the document as "indexed" with 0 chunks, and every query against it returns no results.

Impact

  • Encrypted PDFs silently produce empty indices
  • Corrupted PDFs crash with unhelpful exceptions
  • User doesn't know their PDF wasn't actually readable
  • Error message ("No text content found") is misleading for encrypted files

Proposed Fix

  1. Check `reader.is_encrypted` before processing and raise a specific error
  2. Catch `PdfReadError` for corrupted PDFs and report a user-friendly message
  3. Return specific error codes: `encrypted`, `corrupted`, `empty`, `unsupported`
  4. Provide remediation guidance: "Remove PDF password using qpdf or pdftk"
  5. Handle partially readable PDFs (some pages encrypted, some not)
  6. Add `pdf_status` to metadata: `readable`, `partial`, `encrypted`, `corrupted`
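
A minimal sketch of how the steps above could fit together, not the actual `_extract_text_from_pdf` implementation. The names `PdfStatus`, `PdfExtractionError`, `PdfExtraction`, `REMEDIATION`, and `extract_pdf_text` are hypothetical, and the remediation wording is illustrative. The reader factory and the parse-error type are passed in so the sketch stays independent of the PDF library; it only assumes the reader exposes `is_encrypted` and `pages`, and that each page has `extract_text()`, as in the snippet under Problem.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Callable, List, Type


class PdfStatus(str, Enum):
    """Values proposed for the `pdf_status` metadata field."""
    READABLE = "readable"
    PARTIAL = "partial"
    ENCRYPTED = "encrypted"
    CORRUPTED = "corrupted"


# Hypothetical remediation text, keyed by the proposed error codes.
REMEDIATION = {
    "encrypted": "PDF is password protected. Remove the password using qpdf or pdftk, then re-index.",
    "corrupted": "PDF is corrupted and could not be parsed. Try re-exporting or repairing the file.",
    "empty": "No extractable text found. The PDF may contain only scanned images.",
    "unsupported": "File is not a supported PDF.",
}


class PdfExtractionError(Exception):
    """Carries a machine-readable code plus a user-facing message."""

    def __init__(self, code: str, path: str):
        super().__init__(f"{path}: {REMEDIATION[code]}")
        self.code = code
        self.path = path


@dataclass
class PdfExtraction:
    pages: List[str]
    pdf_status: PdfStatus
    failed_pages: List[int] = field(default_factory=list)


def extract_pdf_text(
    pdf_path: str,
    open_reader: Callable[[str], Any],
    read_error: Type[Exception] = Exception,
) -> PdfExtraction:
    # Opening the file and listing pages can both fail on a damaged PDF.
    try:
        reader = open_reader(pdf_path)
        encrypted = bool(reader.is_encrypted)
        pages = [] if encrypted else list(reader.pages)
    except read_error as exc:
        raise PdfExtractionError("corrupted", pdf_path) from exc
    if encrypted:
        raise PdfExtractionError("encrypted", pdf_path)

    # Extract page by page so one unreadable page does not discard the rest.
    texts: List[str] = []
    failed: List[int] = []
    for index, page in enumerate(pages):
        try:
            texts.append(page.extract_text() or "")
        except read_error:
            texts.append("")
            failed.append(index)

    if not any(text.strip() for text in texts):
        raise PdfExtractionError("corrupted" if failed else "empty", pdf_path)
    status = PdfStatus.PARTIAL if failed else PdfStatus.READABLE
    return PdfExtraction(texts, status, failed)
```

If the module uses pypdf, the call would presumably look like `extract_pdf_text(path, PdfReader, PdfReadError)`. One caveat: pypdf documents `is_encrypted` as remaining true even after a successful `decrypt()`, so a real check may need to distinguish "encrypted and still locked" from "encrypted but already decrypted".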

Files

  • `src/gaia/rag/sdk.py` (lines 413-578, `_extract_text_from_pdf`)

Acceptance Criteria

  • Encrypted PDFs return clear "password protected" error
  • Corrupted PDFs return clear "corrupted file" error
  • User gets actionable guidance for each error type
  • Unit tests with encrypted and corrupted PDF fixtures
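
The fixtures for these tests could be generated at test time instead of being committed as binary files. The sketch below is one possible approach under stated assumptions: `make_corrupted_pdf` and `make_encrypted_pdf` are hypothetical helper names, the encrypted fixture assumes pypdf is installed in the test environment, and the corrupted fixture is a valid header followed by a truncated object with no cross-reference table or trailer, which a PDF parser should reject.

```python
from pathlib import Path


def make_corrupted_pdf(path: Path) -> Path:
    """Write a file that starts like a PDF but is cut off mid-object."""
    path.write_bytes(b"%PDF-1.7\n1 0 obj\n<< /Type /Catalog")
    return path


def make_encrypted_pdf(path: Path, password: str = "secret") -> Path:
    """Write a one-page PDF protected by a user password (assumes pypdf)."""
    from pypdf import PdfWriter  # imported lazily: only this helper needs it

    writer = PdfWriter()
    writer.add_blank_page(width=72, height=72)
    writer.encrypt(password)
    with path.open("wb") as handle:
        writer.write(handle)
    return path
```

Both helpers take a destination path, so they work with a temporary directory such as pytest's `tmp_path` fixture. The tests would then assert that indexing each file yields the corresponding "password protected" or "corrupted file" error.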

Labels

  • bug: Something isn't working
  • p1: medium priority
  • rag: RAG system changes
  • robustness: Reliability, error handling, and hardening
