Garbled Text Extraction #447

Closed
alexneblett opened this issue May 2, 2022 · 2 comments

@alexneblett

Hi,

Thank you for this amazing component. I have run into an issue extracting text from the attached PDF. The text from page.text is garbled, but if I open the PDF in Adobe Acrobat Reader, select all, then copy and paste into Notepad, the pasted text is what you see in the PDF (usually both are garbled because of font mapping issues, etc.). This gives me hope (hopefully not false hope) that there is a way to extract the text. To be fair, I tried a few other components and they extracted the same garbled text.

Cheers,

Alex

fc30326e-64a0-4a6e-895a-c3d4aeae2974.pdf

@BobLd
Collaborator

BobLd commented May 2, 2022

Hi @alexneblett, just had a quick look at your PDF doc.

This will need to be confirmed but it seems the character data is missing, meaning the pdf doesn't 'know' which letter is which.

When I copied the text from Adobe Acrobat Reader into Notepad++ I got nonsense text (see the screenshot below; not sure if this is what you meant in your post or if you managed to get the actual text).
[image: screenshot of the pasted text]

If I'm correct and some data is missing, you will be limited in what you can do with PdfPig alone...

One possible solution to get the text is to use optical character recognition (OCR). The main C# library is the C# wrapper for Tesseract, available here: https://github.com/charlesw/tesseract

Would be nice if @EliotJones or someone else could check inside the pdf if it is not properly built, or if PdfPig can be improved to get the data. I guess one possible improvement would be to have the correct bounding boxes, for the moment they have height 0 and I guess each character path

@EliotJones
Member

Hi @alexneblett, as @BobLd found, when I open the file in Edge/Firefox/Adobe Acrobat Reader I only get the 'nonsense' content by copying. Is it possible you're using a version of Acrobat that does some OCR or something?

Inspecting the content of the file in iText RUPS, it looks like all the fonts in the file are Type3 fonts without meaningful Encoding dictionaries. A Type3 font defines each glyph as a small content stream of path-painting operators, so the glyphs themselves carry no semantic meaning. Unless a special version of Adobe has some way to interpret those drawing operations and work out which characters they correspond to, I can't see a way any code could extract text content from this file.

For example here is a Type3 font defined in the file:

9 0 obj
<</CharProcs<</.notdef 10 0 R /0 11 0 R  ... etc>>/Encoding 124 0 R /FirstChar 0/FontBBox[ 0 0 1 -1]/FontMatrix[ 1 0 0 1 0 0]/LastChar 114/Subtype/Type3/Type/Font/Widths[ 1 1 ...etc]>>
endobj

And the corresponding Encoding object:

124 0 obj
<</Differences[ 0/0/1/2/3/4/.notdef/6/7/8/9/10/11/12/13/14/15/16/17/18/19/20/21/22/.notdef/24/25/26/27/28/29/30/31/32/33/34/35/36/37/38/39/40/41/42/43/44/45/46/47/48/49/50/51/52/53/54/55/56/57/58/59/60/61/62/63/64/65/66/67/68/69/70/71/72/73/74/75/76/77/78/79/80/81/82/83/84/85/86/87/88/89/90/91/92/93/94/95/96/97/98/99/100/101/102/103/104/105/106/107/108/109/110/111/112/113/114]/Type/Encoding>>
endobj

The expected content is a mapping of numeric character codes to recognized Adobe glyph names (for example /A or /space). Here the names are just the numbers again (/0, /1, /2, ...), so unfortunately there doesn't appear to be any way to map this back to text content.
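
To make the problem concrete, here is a rough Python sketch (not PdfPig code) of how a Differences array turns into a code-to-glyph-name map, and why the one in this file can't be resolved to text. The names `parse_differences`, `glyph_to_text` and `AGL_SUBSET` are made up for this illustration, the glyph table is only a tiny hand-picked subset of the Adobe Glyph List, and only the simple `uniXXXX` naming form is handled.

```python
import re

# Tiny illustrative subset of the Adobe Glyph List (assumption: the real
# list has thousands of entries; this is only enough for the demo).
AGL_SUBSET = {
    "space": " ", "A": "A", "B": "B", "a": "a", "b": "b",
    "zero": "0", "one": "1", "two": "2", "period": ".", "comma": ",",
}

# A token is either a PDF name (/something) or a bare integer character code.
TOKEN = re.compile(r"/([^\s/\[\]<>()]*)|(\d+)")


def parse_differences(array_body):
    """Parse the contents of a /Differences array into {code: glyph_name}.

    An integer sets the current character code; each name that follows is
    assigned to the current code, which then increases by one.
    """
    mapping = {}
    code = None
    for match in TOKEN.finditer(array_body):
        name, number = match.group(1), match.group(2)
        if number is not None:
            code = int(number)
        elif code is not None:
            mapping[code] = name
            code += 1
    return mapping


def glyph_to_text(name):
    """Return the text for a glyph name, or None if the name is not recognized."""
    if name in AGL_SUBSET:
        return AGL_SUBSET[name]
    # Simple 'uniXXXX' form only (four uppercase hex digits).
    uni = re.fullmatch(r"uni([0-9A-F]{4})", name)
    if uni:
        return chr(int(uni.group(1), 16))
    return None


# Start of the Differences array from this file: names are just numbers.
from_issue = parse_differences("0/0/1/2/3/4/.notdef/6/7")
print({code: glyph_to_text(name) for code, name in from_issue.items()})

# A conventional array for comparison: names are real glyph names.
conventional = parse_differences("65/A/B 97/a/b")
print({code: glyph_to_text(name) for code, name in conventional.items()})
```

For the array from this file every glyph name comes back unresolved, while the conventional one maps straight to letters. That is the missing information: the file never says which letter each glyph is.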
