Inline images occur as: BI params ID image content EI. So when we see the ID, we skip forward until we see an "EI" in the text. This is the way it is done in the Rust pdf library: https://github.com/pdf-rs/pdf/blob/677152fa8e84a2dcfbc3d927535148bb8d0369ba/pdf/src/content.rs#L127
This allows simple processing of things like generated bank statements.
parseInlineImage :: Parser Expr
parseInlineImage = do
  Parser.string "ID"
  Parser.manyTill Parser.anyChar (Parser.string "EI")
This is very suspicious. What if image data contains EI?
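To make the concern concrete, here is a minimal sketch of a somewhat stricter terminator check: it accepts "EI" only when it is preceded by PDF whitespace and followed by whitespace or end of input. The names splitAtEI and isPdfSpace are hypothetical and not part of the toolkit, and the sketch works on a plain ByteString rather than on the PR's Parser. It is still only a heuristic, since binary image data can legitimately contain whitespace-delimited "EI" bytes.

```haskell
import qualified Data.ByteString.Char8 as BS

-- | PDF whitespace characters: NUL, HT, LF, FF, CR and SP.
isPdfSpace :: Char -> Bool
isPdfSpace c = c `elem` ['\0', '\t', '\n', '\f', '\r', ' ']

-- | Split the bytes that follow "ID" into (image data, rest after "EI").
-- Hypothetical helper: "EI" counts as the terminator only when it is
-- preceded by whitespace and followed by whitespace or end of input.
-- The whitespace byte before "EI" is dropped from the image data.
-- Returns Nothing when no such terminator is found.
splitAtEI :: BS.ByteString -> Maybe (BS.ByteString, BS.ByteString)
splitAtEI input = go 1
  where
    len = BS.length input

    go i
      | i + 2 > len = Nothing
      | isTerminator i = Just (BS.take (i - 1) input, BS.drop (i + 2) input)
      | otherwise = go (i + 1)

    -- Index i is always >= 1 here, so (i - 1) is a valid position.
    isTerminator i =
      BS.index input i == 'E'
        && BS.index input (i + 1) == 'I'
        && isPdfSpace (BS.index input (i - 1))
        && (i + 2 == len || isPdfSpace (BS.index input (i + 2)))
```

With this check, the "EI" embedded in "abEIcd" is skipped because it is not surrounded by whitespace, while a data blob containing " EI " would still be cut short.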
I agree, it worried me too. The best approach is to somehow derive the expected length of the image blob from the preamble information.
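As a rough sketch of what deriving the length could look like: for an unfiltered inline image, the byte count follows from the width, height, bits per component and colour space in the preamble, with each row padded to a whole byte. The type ColorSpace and the functions components and expectedLength are hypothetical names, not toolkit API. The sketch assumes one of the three device colour spaces and no filter entry; filtered (compressed) data, indexed colour spaces and image masks would need separate handling.

```haskell
-- | Device colour spaces only; other colour spaces are out of scope here.
data ColorSpace = DeviceGray | DeviceRGB | DeviceCMYK

-- | Number of colour components per sample.
components :: ColorSpace -> Int
components DeviceGray = 1
components DeviceRGB = 3
components DeviceCMYK = 4

-- | Expected byte length of *unfiltered* inline image data.
-- Arguments: width, height, bits per component, colour space.
-- Every row is padded up to a whole number of bytes.
expectedLength :: Int -> Int -> Int -> ColorSpace -> Int
expectedLength width height bitsPerComponent cs = height * bytesPerRow
  where
    bitsPerRow = width * bitsPerComponent * components cs
    bytesPerRow = (bitsPerRow + 7) `div` 8
```

For example, a 10x4 image with 1 bit per component in DeviceGray needs 2 bytes per row, so 8 bytes in total. A parser could take exactly that many bytes after ID and then expect EI, instead of scanning for it.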
Hi, thank you for the PR! There seem to be two independent features:
@alanz I tried to make a test case to reproduce the issue with text extraction in the presence of inline images, i.e. I created a PDF file with an inline image and tried to extract text. Everything works well so far, see #83. It probably works purely by accident (i.e. we treat any unknown thing as an operator), but it does work.
I am interested in extracting text from bank statement PDFs. Some of these have inline images, which the toolkit does not currently support.
Process them by simply skipping over them in the stream and letting the parse continue.
Also, allow exporting the raw operators, not just page glyphs.