pdfminer can't parse characters outside the ASCII encoding #136

Closed
marianorodriguez opened this issue Oct 17, 2019 · 4 comments
Labels
bug Something isn't working input / extraction

Comments

@marianorodriguez
Contributor

Attached is a document in Spanish showing that pdfminer can't process Latin characters such as:
á, é, í, ó, ú, ñ, etc.

[Image: Screenshot 2019-10-17 at 16 33 40]

caixa-one-page-spanish.pdf

marianorodriguez added the "bug" and "input / extraction" labels on Oct 17, 2019

peter-vandenabeele-axa commented Oct 17, 2019

Sorry, I don't understand the title in relation to the problem statement.

I would read this as "pdfminer can't parse characters outside the ASCII encoding"
(because the accented characters from French, Spanish, etc. that you refer to are part of UTF-8).

Perhaps I am just misunderstanding?
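
The distinction can be checked directly. The following is only a minimal sketch, and `isAscii` is a hypothetical helper rather than anything from Parsr or pdfminer: the accented characters from the report fall outside the 7-bit ASCII range, even though they are ordinary Unicode text that UTF-8 can encode.

```typescript
// Hypothetical helper for illustration only; not part of Parsr or pdfminer.
// Returns true when every UTF-16 code unit is in the 7-bit ASCII range (0x00-0x7F).
function isAscii(text: string): boolean {
  return /^[\x00-\x7F]*$/.test(text);
}

// The characters listed in the original report.
const reported: string[] = ['á', 'é', 'í', 'ó', 'ú', 'ñ'];

// Every one of them is non-ASCII, so this keeps all six.
const nonAscii: string[] = reported.filter((ch) => !isAscii(ch));
```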

@royjohal
Contributor

royjohal commented Oct 17, 2019

> Attached is a document in Spanish showing that pdfminer can't process Latin characters such as:
> á, é, í, ó, ú, ñ, etc.

It's actually more complicated than that.
This behavior can also show up with standard Latin characters. It all depends on how the PDF document was encoded: a PDF can simply omit a complete mapping from glyphs to text, because providing a textual equivalent is not an inherent purpose of the PDF format.
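
One way to see how much of a document is affected is to count the `(cid:nn)` placeholders that pdfminer emits for glyphs it cannot map to text. The sketch below assumes that placeholder format; `countUnmappedGlyphs` and the sample string are hypothetical and are not taken from Parsr or from the attached PDF.

```typescript
// Matches the placeholder emitted for a glyph with no known text mapping, e.g. "(cid:123)".
const CID_TOKEN = /\(cid:\d+\)/g;

// Hypothetical helper: counts unmapped-glyph placeholders in a piece of extracted text.
function countUnmappedGlyphs(extracted: string): number {
  const matches = extracted.match(CID_TOKEN);
  return matches === null ? 0 : matches.length;
}

// Illustrative input only; not real output from the attached document.
const sample = 'Informaci(cid:243)n de la p(cid:243)liza';
const unmapped = countUnmappedGlyphs(sample);
```

A count of zero does not prove the mapping is complete, only that no placeholder made it into the output.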

Lines producing the '?':

/**
 * Returns the character represented by a piece of pdfminer's textual output.
 * TODO: This placeholder will accommodate the solution at https://github.com/aarohijohal/pdfminer.six/issues/1 ...
 * TODO: ... For now, it returns a '?' when a (cid:) is encountered
 * @param character the character value output by pdfminer
 * TODO: also take the font associated with the character into consideration here
 */
function getValidCharacter(character: string): string {
  // pdfminer emits "(cid:nn)" for glyphs it cannot map to text.
  return /\(cid:/.test(character) ? '?' : character;
}
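
The TODO above hints at a font-aware replacement. A rough sketch of what that might look like follows, assuming a per-font lookup table is available. `CidMap`, `resolveCidTokens`, and the table entries are all hypothetical, and building such a table from a real font is the hard part that this sketch does not address.

```typescript
// Hypothetical per-font table from CID number to the character it stands for.
// The entries used below are made up for illustration.
type CidMap = { [cid: number]: string };

// Replaces each "(cid:nn)" token with its mapped character, or '?' when the CID is unknown.
function resolveCidTokens(text: string, cidMap: CidMap): string {
  return text.replace(/\(cid:(\d+)\)/g, (_token: string, cid: string): string => {
    const mapped: string | undefined = cidMap[Number(cid)];
    return mapped === undefined ? '?' : mapped;
  });
}

const exampleMap: CidMap = { 243: 'ó' };
const resolved = resolveCidTokens('p(cid:243)liza n(cid:9)', exampleMap); // 'póliza n?'
```

Unlike the placeholder function, this variant handles tokens embedded in longer strings and still falls back to '?' when no mapping is known.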

Related Issues:

  1. pdfminer does not always produce clean textual output #75
  2. Find out equivalent character replacements for (cid:nn) characters royjohal/pdfminer.six#1
  3. Japanese characters shown as (cid:3821) etc. pdfminer/pdfminer.six#130

marianorodriguez changed the title from "pdfminer can't parse characters outside the utf-8 encoding" to "pdfminer can't parse characters outside the ASCII encoding" on Oct 18, 2019
@royjohal
Contributor

An interesting related patent: https://patents.google.com/patent/US20060288281

@royjohal
Contributor

This is a general limitation of the extractors used rather than a problem specific to Parsr.
