pdfminer can't parse characters outside the ASCII encoding #136

marianorodriguez · 2019-10-17T14:34:11Z

Attached is a document in Spanish that shows that pdfminer can't process Latin characters like:
á, é, í, ó, ú, ñ, etc.

peter-vandenabeele-axa · 2019-10-17T15:25:22Z

Sorry, I fail to understand the title in relation to the problem statement.

I would read this as "pdfminer can't parse characters outside the ASCII encoding"
(because the accented characters from French, Spanish, etc. that you refer to are part of UTF-8).

Perhaps I am just misunderstanding?
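To make the ASCII versus UTF-8 distinction concrete, here is a minimal sketch. The `isAscii` helper is hypothetical and exists only for illustration. It checks whether every UTF-16 code unit of a string falls inside the 7-bit ASCII range; accented characters such as ñ fall outside that range even though UTF-8 can encode them.

```typescript
// Hypothetical helper for illustration only: returns true when every
// UTF-16 code unit of the string is within the 7-bit ASCII range (0x00-0x7F).
function isAscii(text: string): boolean {
  for (let i = 0; i < text.length; i++) {
    if (text.charCodeAt(i) > 0x7f) {
      return false;
    }
  }
  return true;
}

// Unaccented text is plain ASCII; the accented form is not.
const plain = isAscii('espanol');
const accented = isAscii('español');
```

Here `plain` is `true` and `accented` is `false`, since ñ is U+00F1, which lies above 0x7F.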

royjohal · 2019-10-17T23:49:01Z

> Attached is a document in Spanish that shows that pdfminer can't process Latin characters like:
> á, é, í, ó, ú, ñ, etc.

It's actually more complicated than that.
Also, this behavior can manifest on standard Latin characters too: it all depends on how the PDF document was encoded. A PDF can lack a complete textual mapping, because providing the textual equivalent of its content is not an inherent purpose of the PDF format.

Lines producing the '?':

Parsr/server/src/input/pdfminer/pdfminer.ts

Lines 196 to 205 in 6586953

    
    /**
     * Fetches the character a particular pdfminer's textual output represents
     * TODO: This placeholder will accomodate the solution at https://github.com/aarohijohal/pdfminer.six/issues/1 ...
     * TODO: ... For now, it returns a '?' when a (cid:) is encountered
     * @param character the character value outputted by pdfminer
     * @param font the font associated with the character  -- TODO to be taken into consideration here
     */
    function getValidCharacter(character: string): string {
        return RegExp(/\(cid:/gm).test(character) ? '?' : character;
    }
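For illustration, here is a minimal sketch of a variant that replaces each `(cid:N)` token individually and records which CIDs went unmapped, instead of collapsing the whole value to a single '?'. This is not Parsr's implementation. `replaceCidTokens`, `CidReplacement` and `CID_TOKEN` are hypothetical names, and the sketch assumes the unmapped glyphs appear in the text as literal `(cid:N)` tokens with a decimal number.

```typescript
// Hypothetical sketch, not part of Parsr: handles strings that may contain
// several "(cid:N)" tokens mixed with correctly mapped characters.
const CID_TOKEN = /\(cid:(\d+)\)/g;

interface CidReplacement {
  text: string;       // input with each (cid:N) token replaced by the placeholder
  unmapped: number[]; // numeric CIDs that had no textual mapping, in order of appearance
}

function replaceCidTokens(input: string, placeholder: string = '?'): CidReplacement {
  const unmapped: number[] = [];
  // String.prototype.replace with a global regex visits every match.
  const text = input.replace(CID_TOKEN, (_match: string, cid: string) => {
    unmapped.push(Number(cid));
    return placeholder;
  });
  return { text, unmapped };
}

// Example input with one unmapped glyph (the CID value is arbitrary).
const result = replaceCidTokens('Espa(cid:81)a');
```

Keeping the list of unmapped CIDs means a later step could try to resolve them, for example with the font information that the TODO above mentions, without re-parsing the text.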

Related Issues:

royjohal · 2020-05-19T06:02:35Z

An interesting related patent: https://patents.google.com/patent/US20060288281

royjohal · 2020-06-18T00:33:54Z

This is a general problem with the extractors used, rather than a problem specific to Parsr.

marianorodriguez added the bug and input / extraction labels Oct 17, 2019

marianorodriguez changed the title ~~pdfminer can't parse characters outside the utf-8 encoding~~ pdfminer can't parse characters outside the ASCII encoding Oct 18, 2019

royjohal closed this as completed Jun 18, 2020
