Process (by ignoring) inline images#82

Open
alanz wants to merge 3 commits into Yuras:master from alanz:extract-operations

@alanz alanz commented Jun 13, 2023

I am interested in extracting text from bank statement PDFs. Some of these contain inline images, which the toolkit does not currently support.
This PR handles them by skipping the image data in the content stream so that parsing can continue.
It also allows exporting the raw operators, not just the page glyphs.

alanz added 3 commits June 7, 2023 23:05
Inline images occur in a content stream as

BI
  params
ID
  image content
EI

So when we see the ID, skip forward until we see an "EI" in the text.

This is how the Rust pdf library does it:
https://github.com/pdf-rs/pdf/blob/677152fa8e84a2dcfbc3d927535148bb8d0369ba/pdf/src/content.rs#L127
This allows simple processing of things like generated bank
statements.
parseInlineImage :: Parser Expr
parseInlineImage = do
  Parser.string "ID"
  Parser.manyTill Parser.anyChar (Parser.string "EI")
Owner

@Yuras Yuras Jun 14, 2023

This is very suspicious. What if image data contains EI?
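
To make the concern concrete, here is a minimal sketch using plain `String`s and only `base`, not the library's parser types. The names `skipNaive` and `skipDelimited` are hypothetical and exist only for this illustration. The first function stops at the first `EI`, like the reviewed parser; the second accepts `EI` only when it is preceded by whitespace and followed by whitespace or end of input. That is a heuristic, not a guarantee, since binary image data can contain that pattern too.

```haskell
import Data.Char (isSpace)
import Data.List (isPrefixOf)

-- | Skip up to and including the first "EI", mimicking the reviewed
-- parser. Returns the remainder of the stream after the marker.
skipNaive :: String -> Maybe String
skipNaive [] = Nothing
skipNaive s@(_ : rest)
  | "EI" `isPrefixOf` s = Just (drop 2 s)
  | otherwise = skipNaive rest

-- | Stricter heuristic: accept "EI" only when it is preceded by
-- whitespace and followed by whitespace or end of input.
skipDelimited :: String -> Maybe String
skipDelimited (c : s@('E' : 'I' : after))
  | isSpace c && endsToken after = Just after
  | otherwise = skipDelimited s
  where
    endsToken [] = True
    endsToken (x : _) = isSpace x
skipDelimited (_ : rest) = skipDelimited rest
skipDelimited [] = Nothing
```

With the image data `"abEIcd"` followed by the real terminator, as in `"abEIcd EI Q"`, `skipNaive` stops inside the data and leaves `"cd EI Q"` to be parsed as operators, while `skipDelimited` resumes at `" Q"`.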

Owner

might be interesting: mozilla/pdf.js#16461

Author

@alanz alanz Jun 17, 2023

I agree, it worried me too. The best approach is to somehow derive the expected length of the image blob from the preamble information.
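
As a rough sketch of that idea, the data length of an unfiltered inline image follows from the width, height, bits per component and number of colour components given between `BI` and `ID`. The `ImageParams` type and `expectedLength` function below are hypothetical names for illustration and are not part of the library. The calculation assumes no `/F` filter is present; compressed data (for example Flate or DCT) cannot be sized this way.

```haskell
-- | Inline-image parameters needed to size unfiltered data
-- (hypothetical type, for illustration only).
data ImageParams = ImageParams
  { ipWidth :: Int            -- ^ /W, samples per row
  , ipHeight :: Int           -- ^ /H, number of rows
  , ipBitsPerComponent :: Int -- ^ /BPC
  , ipComponents :: Int       -- ^ 1 for /G, 3 for /RGB, 4 for /CMYK
  }

-- | Number of data bytes for an unfiltered inline image.
-- Each row is padded up to a whole number of bytes.
expectedLength :: ImageParams -> Int
expectedLength p = bytesPerRow * ipHeight p
  where
    bitsPerRow = ipWidth p * ipComponents p * ipBitsPerComponent p
    bytesPerRow = (bitsPerRow + 7) `div` 8
```

For example, a 10x2 image with 1 bit per component and one component needs 2 bytes per row, so 4 bytes in total. For filtered images, a parser would need either the optional `/L` (Length) entry that, as far as I know, PDF 2.0 adds for inline images, or a fallback to scanning for `EI`.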

Owner

Yuras commented Jun 14, 2023

Hi, thank you for the PR!

There seem to be two independent features:

  • skip inline images
    It's pretty clear what this is about, though I'll need some time to dig into the spec to figure out what exactly is going on here.
  • collect all operators.
    Why do you need them? Can pageExtractOperators be implemented separately outside of the library? Is it general enough to be useful for other people? It would help if you describe your use case.

Owner

Yuras commented Jun 17, 2023

@alanz I tried to make a test case to reproduce the issue with text extraction in the presence of inline images, i.e. I created a PDF file with an inline image and tried to extract text. Everything works well so far, see #83. It probably works purely by accident (i.e. we treat any unknown thing as an operator), but it does work.
I assume you are running into some kind of corner case. Could you please help me identify the underlying issue? E.g. share the PDF file or, if that is not possible, the problematic part of the content stream.
