pdfminer does not always produce clean textual output #75

royjohal · 2019-09-19T12:24:59Z

Summary
pdfminer sometimes omits characters from its text output, so the extracted text has characters missing.

Environment

Reference commit/version: a7b4b0c


royjohal · 2019-09-27T07:02:41Z

The issue has been traced to pdfminer's current implementation of PDF text extraction and, to some extent, to the PDF format itself.
See Issue royjohal/pdfminer.six#1

When a PDF points to glyphs in a font spec that has no glyph->character mapping, the corresponding text cannot always be recovered.

This needs to be investigated.
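
To illustrate the failure mode described above, here is a minimal sketch in plain Python. It does not use pdfminer's API; `decode_glyphs`, `partial_map`, and the glyph codes are hypothetical names and values chosen for illustration. It only shows how a glyph that is present on the page but absent from the glyph->character mapping ends up missing from the text.

```python
from typing import Dict, Iterable, List, Tuple


def decode_glyphs(
    glyph_codes: Iterable[int],
    to_unicode: Dict[int, str],
    placeholder: str = "\ufffd",
) -> Tuple[str, List[int]]:
    """Map glyph codes to text.

    Returns the decoded text and the list of glyph codes that had no
    entry in the mapping. Unmapped glyphs are replaced by `placeholder`.
    """
    chars: List[str] = []
    unmapped: List[int] = []
    for code in glyph_codes:
        char = to_unicode.get(code)
        if char is None:
            # The glyph can be drawn, but nothing says which character it is.
            unmapped.append(code)
            chars.append(placeholder)
        else:
            chars.append(char)
    return "".join(chars), unmapped


# Hypothetical font spec: glyph 3 is used on the page but has no mapping.
partial_map = {1: "p", 2: "d", 4: "f"}

text, missing = decode_glyphs([1, 2, 3, 4], partial_map)
print(repr(text), missing)

# With an empty placeholder the unmapped glyph silently disappears,
# which matches the "missing characters" symptom reported in this issue.
silent_text, _ = decode_glyphs([1, 2, 3, 4], partial_map, placeholder="")
print(repr(silent_text))
```

Returning the unmapped codes alongside the text is a design choice for this sketch: it makes the loss visible to the caller instead of dropping characters silently.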

royjohal added the bug and input / extraction labels Sep 19, 2019

royjohal self-assigned this Sep 19, 2019

royjohal mentioned this issue Sep 19, 2019

Cleaning pdfminer's textual import #76

Merged

jvalls-axa closed this as completed in ded9ce7 Oct 10, 2019

royjohal mentioned this issue Oct 17, 2019

pdfminer can't parse characters outside the ASCII encoding #136

Closed
