Implement streamdown plugin ReadPdfFile (text-layer PDF only) for ReadFile #46

@gravity-api

Description

Create a streamdown plugin ReadPdfFile that extracts text from PDF files with a text layer. This plugin is executed via the ReadFile meta plugin when FileFormat=Pdf.

OCR is explicitly out of scope.

Inputs

Name Type Mandatory Description
Path String Expression
Url String Expression
Base64 String Expression

Outputs (for meta normalization)

Return/emit values that the meta plugin can normalize:

Output Key    Description
Text          Extracted text
PagesCount    Total pages (optional but recommended)

Behavior / Error Handling

  • If the PDF is valid but no text can be extracted, fail with a clear message indicating:

    • the PDF may be scanned / image-only
    • OCR is not supported in the current version
  • Fail on invalid base64 / missing file / download failures / non-PDF content.
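
The failure rules and output keys above could be sketched roughly as follows. This is a Python illustration of the logic only, not the plugin's actual implementation (the `File.ReadAllBytes` reference suggests the real plugin targets .NET). The names `ReadPdfError`, `ensure_pdf`, and `build_outputs` are hypothetical, and the non-PDF check assumes that looking for the `%PDF-` header near the start of the data is sufficient.

```python
from typing import Dict, List, Union


class ReadPdfError(Exception):
    """Hypothetical error type used when the plugin must fail."""


def ensure_pdf(data: bytes) -> None:
    # A PDF starts with a "%PDF-" header. Searching the first 1024 bytes
    # (instead of requiring offset 0) is an assumption made here to
    # tolerate files with a few leading bytes.
    if b"%PDF-" not in data[:1024]:
        raise ReadPdfError("Content is not a PDF (missing %PDF- header).")


def build_outputs(page_texts: List[str]) -> Dict[str, Union[str, int]]:
    # page_texts holds one extracted string per page, as produced by
    # whatever text extraction library is chosen (no OCR).
    text = "\n".join(page_texts)
    if not text.strip():
        raise ReadPdfError(
            "No text could be extracted. The PDF may be scanned or "
            "image-only; OCR is not supported in the current version."
        )
    # Keys match the Outputs table so the meta plugin can normalize them.
    return {"Text": text, "PagesCount": len(page_texts)}
```

Joining pages with a newline is one possible choice; the issue does not specify a page separator.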

Implementation Notes

  • Resolve bytes from source:

    • Base64 → bytes
    • Path → File.ReadAllBytes
    • Url → HTTP GET bytes
  • Extract text using a PDF text extraction library (no OCR).
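
The source resolution step above might look something like the sketch below. It is a Python illustration under stated assumptions, not the plugin's real code: `resolve_bytes` and `ReadPdfError` are hypothetical names, the 30-second timeout is arbitrary, and the precedence (Base64, then Path, then Url) simply follows the order of the list, since the issue does not say what happens when several inputs are set.

```python
import base64
import binascii
import urllib.request
from pathlib import Path
from typing import Optional


class ReadPdfError(Exception):
    """Hypothetical error type used when the plugin must fail."""


def resolve_bytes(
    path: Optional[str] = None,
    url: Optional[str] = None,
    base64_data: Optional[str] = None,
    timeout: float = 30.0,
) -> bytes:
    if base64_data:
        try:
            # validate=True rejects characters outside the base64 alphabet
            # instead of silently discarding them.
            return base64.b64decode(base64_data, validate=True)
        except (binascii.Error, ValueError) as exc:
            raise ReadPdfError("Invalid base64 input.") from exc
    if path:
        try:
            return Path(path).read_bytes()
        except OSError as exc:
            raise ReadPdfError(f"Cannot read file: {path}") from exc
    if url:
        try:
            # URLError and HTTPError are subclasses of OSError;
            # a malformed URL raises ValueError.
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return response.read()
        except (OSError, ValueError) as exc:
            raise ReadPdfError(f"Download failed: {url}") from exc
    raise ReadPdfError("One of Path, Url, or Base64 must be provided.")
```

The returned bytes would then be passed to the chosen PDF text extraction library.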