pdfminer does not always produce clean textual output #75

Closed
royjohal opened this issue Sep 19, 2019 · 1 comment
Labels: bug (Something isn't working), input / extraction

Comments

royjohal commented Sep 19, 2019

Summary
pdfminer sometimes omits characters from the extracted text, so the output is incomplete.

Environment

  • Reference commit/version: a7b4b0c
royjohal added the bug (Something isn't working) and input / extraction labels on Sep 19, 2019
royjohal self-assigned this on Sep 19, 2019
royjohal (Contributor, Author) commented

The issue has been traced to pdfminer's current implementation of PDF text extraction and, in part, to the PDF format itself.
See Issue royjohal/pdfminer.six#1

When a PDF points to glyphs inside a font spec that has no glyph-to-character mapping, the corresponding text cannot always be recovered.
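To illustrate the failure mode, here is a minimal sketch of a glyph lookup. It is a toy model only: `decode_glyphs`, the glyph IDs, and the mapping dictionaries are hypothetical names made up for this example and are not pdfminer's API or its actual behaviour.

```python
# Toy model only: these names are hypothetical and are not pdfminer's API.
from typing import Dict, List, Optional


def decode_glyphs(glyph_ids: List[int], to_unicode: Optional[Dict[int, str]]) -> str:
    """Map glyph IDs to text, dropping any glyph that has no known character."""
    if to_unicode is None:
        # The font spec carries no glyph-to-character mapping at all.
        return ""
    return "".join(to_unicode[g] for g in glyph_ids if g in to_unicode)


# A page that draws glyphs 1, 2, 3, 3, 4 ("Hello" in this made-up font).
glyphs = [1, 2, 3, 3, 4]

full_map = {1: "H", 2: "e", 3: "l", 4: "o"}
partial_map = {1: "H", 2: "e", 4: "o"}  # glyph 3 has no mapping

print(decode_glyphs(glyphs, full_map))     # Hello
print(decode_glyphs(glyphs, partial_map))  # Heo
print(repr(decode_glyphs(glyphs, None)))   # ''
```

The glyphs are still drawn correctly on the page, because rendering only needs the glyph shapes. Text extraction needs the reverse direction, from glyph to character, and has nothing to fall back on when that mapping is absent.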

This needs to be investigated.
