Process (by ignoring) inline images by alanz · Pull Request #82 · Yuras/pdf-toolbox

alanz · 2023-06-13T22:21:51Z

I am interested in extracting text from bank statement PDFs. Some of these contain inline images, which the toolkit does not currently support.
This PR processes them by simply skipping over them in the stream and letting the parse continue.
It also allows exporting the raw operators, not just page glyphs.

Inline images occur in a content stream as: BI params ID image-content EI. So when we see the ID, we skip forward until we see an "EI" in the text. This is the way it is done in the Rust pdf library: https://github.com/pdf-rs/pdf/blob/677152fa8e84a2dcfbc3d927535148bb8d0369ba/pdf/src/content.rs#L127

This allows simple processing of things like generated bank statements.
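
As a rough illustration of the skip-until-EI idea described above (not the code in this PR), here is a minimal sketch over a plain String. The name skipInlineImageData is hypothetical, and the sketch assumes its input is the text that follows the ID operator. It only accepts an EI that is preceded by whitespace and followed by whitespace or end of input. That is stricter than a bare substring match, but it is still a heuristic, since binary image data can contain the same byte sequence.

```haskell
import Data.Char (isSpace)

-- | Hypothetical helper: drop everything up to and including the first "EI"
-- that is preceded by whitespace and followed by whitespace or end of input.
-- Returns the remaining input, or Nothing if no such marker is found.
skipInlineImageData :: String -> Maybe String
skipInlineImageData = go
  where
    -- Candidate marker: accept it only when it is properly delimited;
    -- otherwise the guard fails and we fall through to the next clause.
    go (w : 'E' : 'I' : rest)
      | isSpace w && atBoundary rest = Just rest
    -- Not a marker at this position: advance by one character.
    go (_ : rest) = go rest
    go [] = Nothing

    -- "EI" must be followed by whitespace or the end of the stream.
    atBoundary [] = True
    atBoundary (c : _) = isSpace c
```

With this check, an "EI" embedded in the data without surrounding whitespace (as in "xEIy EI Q") is passed over, and the scan stops at the delimited marker instead.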

Yuras · 2023-06-14T19:04:10Z

+parseInlineImage :: Parser Expr
+parseInlineImage = do
+  Parser.string "ID"
+  Parser.manyTill Parser.anyChar (Parser.string "EI")


This is very suspicious. What if image data contains EI?

might be interesting: mozilla/pdf.js#16461

I agree, it worried me too. The best approach is to somehow derive the expected length of the image blob from the preamble information.
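
One possible way to derive the length from the preamble, sketched under assumptions: for an inline image with no filter, each row occupies width × components × bits-per-component bits, padded up to a whole byte, and the data is height rows long. The InlineImageParams record and expectedDataLength below are hypothetical names, not part of pdf-toolbox, and the caller is assumed to have already mapped the colour space to a component count. When a filter is present the encoded size cannot be computed this way, so the sketch returns Nothing.

```haskell
-- | Hypothetical summary of the inline image parameters needed here.
data InlineImageParams = InlineImageParams
  { iiWidth            :: Int  -- ^ /W, pixels per row
  , iiHeight           :: Int  -- ^ /H, number of rows
  , iiBitsPerComponent :: Int  -- ^ /BPC
  , iiComponents       :: Int  -- ^ e.g. 1 for /G, 3 for /RGB, 4 for /CMYK
  , iiFiltered         :: Bool -- ^ True if a /F entry is present
  }

-- | Expected number of data bytes between ID and EI for an unfiltered image.
-- Returns Nothing for filtered images, whose encoded size is not derivable
-- from these parameters alone.
expectedDataLength :: InlineImageParams -> Maybe Int
expectedDataLength p
  | iiFiltered p = Nothing
  | otherwise    = Just (iiHeight p * rowBytes)
  where
    rowBits  = iiWidth p * iiComponents p * iiBitsPerComponent p
    -- Each row is padded up to a whole number of bytes.
    rowBytes = (rowBits + 7) `div` 8
```

A parser could then consume exactly that many bytes after ID and only fall back to scanning for EI when the result is Nothing.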

Yuras · 2023-06-14T19:10:11Z

Hi, thank you for the PR!

There seem to be two independent features:

1. Skip inline images.
   It's pretty clear what this is about, though I'll need some time to dig into the spec to figure out what exactly is going on here.
2. Collect all operators.
   Why do you need them? Can pageExtractOperators be implemented separately outside of the library? Is it general enough to be useful for other people? It would help if you described your use case.

Yuras · 2023-06-17T10:35:08Z

@alanz I tried to make a test case to reproduce the issue with text extraction in the presence of inline images, i.e. I created a PDF file with an inline image and tried to extract text. Everything works well so far, see #83. It probably works purely by accident (we treat any unknown thing as an operator), but it does work.
I assume you are running into some kind of corner case. Could you please help me identify the underlying issue? E.g. share the PDF file or, if that is not possible, the problematic part of the content stream.

alanz added 3 commits June 7, 2023 23:05

Return all the operations for a page.

e1b0d35

This allows simple processing of things like generated bank statements.

Export Pdf.Content.Ops.Object(..)

17d2e3e

Yuras reviewed Jun 14, 2023


Yuras mentioned this pull request Jun 17, 2023

Add test for inline images #83

Open

alanz wants to merge 3 commits into Yuras:master from alanz:extract-operations
