Create a streamdown plugin ReadPdfFile that extracts text from PDF files with a text layer. This plugin is executed via the ReadFile meta plugin when FileFormat=Pdf.
OCR is explicitly out of scope.
Inputs
| Name | Type | Mandatory | Description |
|---|---|---|---|
| Path | String Expression | ❌ | |
| Url | String Expression | ❌ | |
| Base64 | String Expression | ❌ | |
Outputs (for meta normalization)
Return/emit values that the meta plugin can normalize:
| Output Key | Description |
|---|---|
| Text | Extracted text |
| PagesCount | Total pages (optional but recommended) |
Behavior / Error Handling
- If the PDF is valid but no text can be extracted, fail with a clear message indicating:
  - the PDF may be scanned / image-only
  - OCR is not supported in the current version
- Fail on invalid base64 / missing file / download failures / non-PDF content.
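
The failure cases above could be checked roughly as follows. This is a minimal Python sketch for illustration only; the plugin's real runtime and failure mechanism are not specified here. `ReadPdfError`, `decode_base64_strict`, `ensure_pdf`, and `ensure_text` are hypothetical names, and the `%PDF-` header check is an assumed, simplified way to detect non-PDF content.

```python
import base64


class ReadPdfError(Exception):
    """Hypothetical error type; a real plugin would use the host's failure mechanism."""


def decode_base64_strict(value: str) -> bytes:
    """Decode base64, rejecting characters outside the base64 alphabet."""
    try:
        return base64.b64decode(value, validate=True)
    except ValueError as exc:  # binascii.Error is a subclass of ValueError
        raise ReadPdfError("Base64 input is not valid base64.") from exc


def ensure_pdf(data: bytes) -> None:
    """Assumed check: treat content as PDF only if it starts with the %PDF- header."""
    if not data.startswith(b"%PDF-"):
        raise ReadPdfError("Content is not a PDF (missing %PDF- header).")


def ensure_text(text: str) -> str:
    """Fail with the message required above when extraction yields no text."""
    if not text.strip():
        raise ReadPdfError(
            "No text could be extracted. The PDF may be scanned or image-only; "
            "OCR is not supported in the current version."
        )
    return text
```

Missing files and download failures are not covered here; they would surface from whatever file and HTTP calls the plugin uses and could be wrapped in the same error type.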
Implementation Notes
- Resolve bytes from source:
  - Base64 → bytes
  - Path → File.ReadAllBytes
  - Url → HTTP GET bytes
- Extract text using a PDF text extraction library (no OCR).
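
The two steps above can be sketched as below. This is a hedged Python illustration, not the plugin's actual code: `File.ReadAllBytes` suggests a .NET target, so the real calls would differ. `resolve_bytes`, `read_pdf`, and the injected `extract_pages` callable are hypothetical names. Requiring exactly one source and joining pages with a newline are assumptions, since the spec states neither.

```python
import base64
import urllib.request
from pathlib import Path
from typing import Callable, List, Optional


def resolve_bytes(path: Optional[str] = None,
                  url: Optional[str] = None,
                  base64_data: Optional[str] = None) -> bytes:
    """Return raw PDF bytes from whichever input is set."""
    provided = [v for v in (path, url, base64_data) if v]
    if len(provided) != 1:
        # Assumption: exactly one source is expected; the spec does not say.
        raise ValueError("Provide exactly one of Path, Url, or Base64.")
    if base64_data:
        return base64.b64decode(base64_data, validate=True)
    if path:
        return Path(path).read_bytes()  # raises FileNotFoundError if missing
    # urlopen issues an HTTP GET when no request body is supplied.
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def read_pdf(extract_pages: Callable[[bytes], List[str]], **inputs) -> dict:
    """Resolve the bytes, run the injected text extractor, and build the outputs."""
    data = resolve_bytes(**inputs)
    pages = extract_pages(data)  # stand-in for a PDF text extraction library
    return {"Text": "\n".join(pages), "PagesCount": len(pages)}
```

Passing the extractor in as a callable keeps the sketch independent of any particular PDF library; for example, `read_pdf(my_extractor, path="report.pdf")`, where `my_extractor` wraps whichever library is chosen.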