Fix encoding for pdftotext (Russian characters, German umlauts etc). Fix version in download instructions #813
Fixes #805
It seems that pdftotext uses different default encodings depending on the host machine. While all PDFs in #805 are converted correctly on my machine, the converter in a Google Colab notebook just produced gibberish.
It might be related to environment variables on the host such as `LC_ALL` or `LANG`.

With this PR we enable users to pass a custom `encoding` to `PDFToTextConverter.convert()`.
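For illustration, here is a minimal sketch of how an encoding can be forwarded to the pdftotext command line via its `-enc` option. The helper names `build_pdftotext_command` and `pdf_to_text` are hypothetical and are not the converter code from this PR. The sketch also assumes the chosen pdftotext encoding name is accepted by Python as a codec name, which holds for "Latin1" and "UTF-8" but not necessarily for every encoding pdftotext lists.

```python
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, encoding="Latin1", layout=True):
    """Assemble the pdftotext argument list (hypothetical helper)."""
    command = ["pdftotext"]
    if layout:
        command.append("-layout")  # keep the physical page layout
    command += ["-enc", encoding]  # output encoding, e.g. "Latin1" or "UTF-8"
    command += [str(Path(pdf_path)), "-"]  # "-" sends the text to stdout
    return command


def pdf_to_text(pdf_path, encoding="Latin1", layout=True):
    """Run pdftotext and decode its output with the same encoding.

    Assumption: the pdftotext encoding name is also a valid Python codec name.
    """
    command = build_pdftotext_command(pdf_path, encoding, layout)
    raw = subprocess.run(command, capture_output=True, check=True).stdout
    return raw.decode(encoding, errors="ignore")
```

With pdftotext installed, calling `pdf_to_text("sample.pdf", encoding="UTF-8")` would return the extracted text of a (hypothetical) `sample.pdf` decoded as UTF-8.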
"Latin 1" is the default encoding of pdftotext. While this works well for many PDFs, you may need to switch to "UTF-8" or another encoding if your document contains special characters (e.g. German umlauts, Cyrillic characters ...).
Note: With "UTF-8" we saw cases where a simple "fi" gets wrongly parsed as "\xef\xac\x81c"; the bytes \xef\xac\x81 are the UTF-8 encoding of the Latin small ligature "ﬁ" (see test cases). That's why we keep "Latin 1" as the default here. (See the list of available encodings by running `pdftotext -listencodings` in the terminal.)

Also fixing the pdftotext install instructions from 4.02 to 4.03 (thanks for the pointer @m1kol).
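As a side note on the "fi" ligature issue mentioned above, the following sketch shows where those bytes come from and one possible way to fold the ligature back into plain letters. The helper `expand_ligatures` is hypothetical and not part of this PR. NFKC normalization also rewrites other compatibility characters (for example superscripts and full-width forms), so it may not be appropriate for every document.

```python
import unicodedata

# The three bytes from the note decode to a single code point, U+FB01.
ligature = b"\xef\xac\x81".decode("utf-8")


def expand_ligatures(text):
    """Fold compatibility characters such as U+FB01 into plain letters.

    Hypothetical post-processing helper; NFKC affects more than ligatures.
    """
    return unicodedata.normalize("NFKC", text)


def fits_in_latin1(text):
    """Report whether every character of text can be encoded as Latin-1."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False
```

Here `unicodedata.name(ligature)` identifies the character as the Latin small ligature fi. `fits_in_latin1(ligature)` is false, so Latin-1 cannot represent that code point.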