Fix encoding for pdftotext (Russian characters, German umlauts etc). Fix version in download instructions #813

tholor · 2021-02-08T16:07:35Z

Fixes #805

pdftotext appears to use different default encodings depending on the host machine. On my machine, all PDFs from #805 are converted correctly, but the converter on a Google Colab notebook produced only gibberish.
This might be related to environment variables on the host, such as LC_ALL or LANG.
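
To compare two hosts (e.g. a local machine and a Colab notebook), a small sketch like the following can print the locale settings mentioned above. The helper name `locale_report` is hypothetical, and whether pdftotext actually consults these variables is only the hypothesis stated here, not something this snippet confirms.

```python
import locale
import os


def locale_report(env=None):
    """Collect locale-related settings that may differ between hosts.

    `env` defaults to the current process environment; passing a dict
    makes the function easy to try out without touching os.environ.
    """
    env = os.environ if env is None else env
    return {
        "LC_ALL": env.get("LC_ALL"),  # None if the variable is unset
        "LANG": env.get("LANG"),      # None if the variable is unset
        # Encoding Python itself would pick for text I/O on this host
        "preferred_encoding": locale.getpreferredencoding(False),
    }


if __name__ == "__main__":
    for key, value in locale_report().items():
        print(f"{key}: {value}")
```

Running this on both machines and diffing the output is a quick way to check whether the locale differs between them.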

This PR lets users pass a custom encoding to PDFToTextConverter.convert().

"Latin 1" is the default encoding of pdftotext. While this works well for many PDFs, you might need to switch to "UTF-8" or
another encoding if your document contains special characters (e.g. German umlauts, Cyrillic characters ...).
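
As a rough illustration of what passing an encoding through to the command-line tool can look like, here is a minimal sketch that calls pdftotext via `subprocess` with its `-enc` option. The helper names (`build_pdftotext_command`, `pdf_to_text`), the `python_codec` parameter, and the encoding name `"Latin1"` are assumptions made for this example; this is not the converter code from this PR.

```python
import subprocess


def build_pdftotext_command(pdf_path, encoding="Latin1", layout=True):
    """Build the argument list for a pdftotext call (hypothetical helper).

    `encoding` is handed to pdftotext's -enc option; "Latin1" is an
    assumed spelling, check `pdftotext -listencodings` on your host.
    """
    command = ["pdftotext", "-enc", encoding]
    if layout:
        command.append("-layout")  # keep the physical page layout
    # "-" as the output file makes pdftotext write to stdout
    command += [str(pdf_path), "-"]
    return command


def pdf_to_text(pdf_path, encoding="Latin1", python_codec="latin-1"):
    """Run pdftotext and decode its output (requires pdftotext on PATH).

    `python_codec` must name the Python codec that matches `encoding`.
    """
    result = subprocess.run(
        build_pdftotext_command(pdf_path, encoding),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode(python_codec, errors="replace")
```

For a document with Cyrillic text, a call could look like `pdf_to_text("sample.pdf", encoding="UTF-8", python_codec="utf-8")`, where `sample.pdf` is a placeholder file name.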

Note: With "UTF-8" we experienced cases where a simple "fi" gets wrongly parsed as "\xef\xac\x81c", i.e. the UTF-8 bytes of the Latin small ligature fi (U+FB01) followed by "c" (see test cases). That's why we keep "Latin 1" as the default here. (To see the list of available encodings, run pdftotext -listencodings in the terminal.)
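
The byte sequence from the note can be inspected directly in Python. The sketch below decodes it once as UTF-8 and once as Latin 1, and shows one possible way to expand the ligature afterwards using Unicode NFKC normalization. That normalization step is an illustration only and is not part of this PR; `expand_ligatures` is a hypothetical helper.

```python
import unicodedata

# UTF-8 bytes of U+FB01 (LATIN SMALL LIGATURE FI) followed by the letter "c"
raw = b"\xef\xac\x81c"

# Decoded as UTF-8 this is two characters: the ligature and "c"
as_utf8 = raw.decode("utf-8")

# Decoded as Latin 1 every byte becomes its own character (mojibake)
as_latin1 = raw.decode("latin-1")


def expand_ligatures(text):
    """Replace compatibility characters such as U+FB01 with plain letters.

    NFKC also rewrites other compatibility characters (e.g. superscript
    digits), so it may change more than just ligatures.
    """
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    print(repr(as_utf8))
    print(repr(as_latin1))
    print(expand_ligatures(as_utf8))
```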

Also fixes the version in the pdftotext install instructions from 4.02 to 4.03 (thanks for the pointer, @m1kol).


fix encoding of pdftotext. fix version in download instructions

0701d51

tholor requested a review from tanaysoni February 8, 2021 16:07

tholor self-assigned this Feb 8, 2021

tholor changed the title ~~fix encoding of pdftotext. fix version in download instructions~~ Fix encoding for pdftotext (Russian characters, German umlauts etc). Fix version in download instructions Feb 8, 2021

tholor and others added 2 commits February 9, 2021 09:50

fix test

693aed8

Add latest docstring and tutorial changes

f13b506

tanaysoni approved these changes Feb 9, 2021


tholor and others added 3 commits February 9, 2021 12:25

make latin-1 default encoding again

780dd5f

Merge branch 'fix_pdf_encoding' of github.com:deepset-ai/haystack int…

fec2ecb

…o fix_pdf_encoding

Add latest docstring and tutorial changes

9dbdca0

tholor merged commit ac9f924 into master Feb 9, 2021

tholor deleted the fix_pdf_encoding branch February 9, 2021 12:42

tholor mentioned this pull request Feb 9, 2021

PDF Converter on Russian language #805

Closed
