.odt and .pdf: Dealing with special characters and white space #53
I wrote a file with LibreOffice (4.2.3.x) named 'Àéìøèãò áîëñîí ÿâäàë.odt' containing the characters 'Àéìøèãò áîëñîí ÿâäàë' and exported it to PDF.
I then tried to extract text from the two files. Here are the results:
ODT to TXT: the special characters are not recognized:
PDF to TXT (via poppler) can't handle a filename containing white space:
PDF to TXT does display the special characters contained in the file if we remove the white space from the filename:
@tintamarre this is a great test; thanks for raising the issue.
It sounds like there is a two-fold problem here.
The first issue is that the current odt parser isn't properly dealing with the non-ASCII characters in the file. This is something that I hope is addressed by #39 (as @ShawnMilo suggests), where we try to deal more systematically with the problem of byte-string encodings and Unicode code points (which remains very confusing to me). I'll try to add a test to #39 that covers the situation you mention.
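To make the byte-string versus code-point distinction concrete, here is a small Python 3 sketch. It is only an illustration of the general idea, not code from textract or #39, and the choice of UTF-8 and Latin-1 as the two encodings is an assumption made for the example.

```python
# The same text seen as Unicode code points and as an encoded byte string.
text = "Àé"                    # two code points: U+00C0 and U+00E9
raw = text.encode("utf-8")     # four bytes, two per character in UTF-8

print([hex(ord(ch)) for ch in text])
print(raw)

# Decoding with the encoding that produced the bytes recovers the text.
print(raw.decode("utf-8") == text)

# Decoding with a different encoding (Latin-1, chosen for illustration)
# does not fail, but silently yields different characters.
garbled = raw.decode("latin-1")
print(garbled == text)
```

Garbled output like the ODT result above is typically what this kind of mismatch looks like, although the exact cause in the odt parser is not established in this thread.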
The second issue is that it looks like textract (and specifically the pdf parser) does not properly handle filenames with non-ASCII characters. That's definitely not something that I anticipated, so I'm not surprised it is breaking things. Good catch! I'll try to take a look at that when I have time, but it wouldn't surprise me if we need to reuse some code from #39 to decode the byte-string that is specified on the command line, just like we do with the content of files.
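As a rough sketch of what decoding a filename given as a byte-string could look like, here is a minimal Python 3 helper. The name `decode_argument` and the UTF-8 default are hypothetical, chosen for illustration; this is not the code in #39.

```python
def decode_argument(arg, encoding="utf-8"):
    """Return arg as text, decoding it first if it is a byte string.

    Hypothetical helper: the UTF-8 default is an assumption, and a real
    implementation might consult the locale or filesystem encoding instead.
    """
    if isinstance(arg, bytes):
        return arg.decode(encoding)
    return arg


# A filename as it might arrive in byte-string form.
raw_name = "Àéìøèãò áîëñîí ÿâäàë.odt".encode("utf-8")
print(decode_argument(raw_name))
```

Text input passes through unchanged, so the same helper can be applied to a filename regardless of which form it arrives in.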
This fixes the spaces in the filename problem (they need to be quoted when running a command from the command line). It can handle unicode filenames just fine.
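For readers who want to see the quoting idea in isolation, here is a minimal Python 3 sketch using the standard library's `shlex.quote`. The function name `build_pdftotext_command` is hypothetical, and the `pdftotext FILE -` invocation (poppler's command-line tool writing to standard output) is an assumption about how the parser is called; this is not necessarily how the fix itself is written.

```python
import shlex


def build_pdftotext_command(filename):
    """Build a shell command string with the filename safely quoted.

    Hypothetical helper: assumes the parser shells out to
    `pdftotext FILE -`, which writes the extracted text to stdout.
    """
    return "pdftotext {} -".format(shlex.quote(filename))


command = build_pdftotext_command("Àéìøèãò áîëñîí ÿâäàë.pdf")
print(command)

# Splitting the string the way a POSIX shell would shows that the
# filename survives as a single argument despite the spaces.
print(shlex.split(command))
```

An alternative with the same effect is to pass the arguments as a list to `subprocess.run` without `shell=True`, which avoids shell parsing altogether.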
I believe the issue with Unicode in the .odt file will be fixed in #39, so I'm going to close this out for now.
Thanks again for reporting the problem!