.odf and .pdf : Dealing with special characters and white space #53

Merged
merged 2 commits into master from issue53 on Aug 14, 2014

@deanmalmgren
Owner

deanmalmgren commented Aug 14, 2014

I wrote a file with LibreOffice (4.2.3.x) named 'Àéìøèãò áîëñîí ÿâäàë.odt' containing the characters 'Àéìøèãò áîëñîí ÿâäàë' and exported it to PDF.

I then tried to extract text from the two files. Here are the results:

ODT to TXT: special characters are not recognized:

/tmp # textract Àéìøèãò\ áîëñîí\ ÿâäàë.odt
??????? ?????? ?????

PDF to TXT (via poppler) can't handle a filename containing whitespace:

/tmp # textract Àéìøèãò\ áîëñîí\ ÿâäàë.pdf
The command pdftotext Àéìøèãò áîëñîí ÿâäàë.pdf - failed with exit code 99

PDF to TXT displays the special characters in the file correctly once the whitespace is removed from the filename:

/tmp # mv Àéìøèãò\ áîëñîí\ ÿâäàë.pdf Àéìøèãò_áîëñîí_ÿâäàë.pdf
/tmp # textract Àéìøèãò_áîëñîí_ÿâäàë.pdf
Àéìøèãò áîëñîí ÿâäàë

@ShawnMilo

Contributor

ShawnMilo commented Aug 13, 2014

Is this something that will be addressed with this pull request?
#39

@deanmalmgren

Owner

deanmalmgren commented Aug 13, 2014

@tintamarre this is a great test; thanks for raising the issue.

It sounds like there is a two-fold problem here.

The first issue is that the current odt parser isn't properly dealing with the non-ascii characters in the file. This is something that I hope is addressed by #39 (as @ShawnMilo suggests) where we try to more systematically deal with the problem of byte-string encodings and unicode code points (which remains very confusing to me). I'll try to add a test to #39 to deal with the situation you mention.

The second issue is that it looks like textract (and specifically the pdf parser) does not properly handle filenames with non-ascii characters. That's definitely not something that I anticipated so I'm not surprised it is breaking things—good catch! I'll try to take a look at that when I have time, but it wouldn't surprise me if we need to reuse some code that is in #39 for decoding the byte-string that is specified on the command line just like we do with the content of files.
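To make the byte-string versus unicode distinction concrete, here is a minimal Python 3 sketch. It is not textract's code: `lossy_ascii` and `decode_filename` are hypothetical helpers, and the UTF-8 default is an assumption about the terminal. The first function shows one plausible way output like `??????? ?????? ?????` can arise (forcing text into ASCII with replacement); the second shows decoding a byte-string filename into text before using it.

```python
import unicodedata

# Sample text from the report, normalized so each accented letter
# is a single code point.
SAMPLE = unicodedata.normalize("NFC", "Àéìøèãò áîëñîí ÿâäàë")


def lossy_ascii(text):
    """Encode to ASCII, replacing every unencodable character with '?'."""
    return text.encode("ascii", errors="replace").decode("ascii")


def decode_filename(raw, encoding="utf-8"):
    """Hypothetical helper: turn a byte-string filename into text.

    The encoding is an assumption; a real tool would need to pick it
    from the environment rather than hard-code it.
    """
    if isinstance(raw, bytes):
        return raw.decode(encoding)
    return raw


print(lossy_ascii(SAMPLE))
print(decode_filename(SAMPLE.encode("utf-8") + b".odt"))
```

Spaces survive the lossy conversion because they are plain ASCII, which matches the shape of the output in the report.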

@tintamarre tintamarre changed the title from .odf and .pdf : Dealing with special chararcters and white space to .odf and .pdf : Dealing with special characters and white space Aug 13, 2014

@deanmalmgren

Owner

deanmalmgren commented Aug 14, 2014

This fixes the problem with spaces in the filename (filenames need to be quoted when running a command from the command line). textract can handle unicode filenames just fine.
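As an illustration of why quoting matters, here is a small Python 3 sketch using the standard-library `shlex` module. It is not necessarily what the merged commit does; the command string is only assembled and split, never executed, so `pdftotext` does not need to be installed.

```python
import shlex

filename = "Àéìøèãò áîëñîí ÿâäàë.pdf"

# Naive approach: the unquoted filename is split on whitespace, so the
# program would receive three separate (nonexistent) file arguments.
naive = "pdftotext {} -".format(filename).split()

# Quoting the filename keeps it together as a single argument.
quoted = "pdftotext {} -".format(shlex.quote(filename))
args = shlex.split(quoted)

print(naive)
print(quoted)
print(args)
```

An alternative design is to skip the shell entirely and pass the arguments to `subprocess` as a list, in which case no quoting is needed at all.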

I believe the issue with unicode in the .odt file will be fixed in #39, so I'm going to close this out for now.

Thanks again for reporting the problem!

deanmalmgren added a commit that referenced this pull request Aug 14, 2014

Merge pull request #53 from deanmalmgren/issue53
.odf and .pdf : Dealing with special characters and white space

@deanmalmgren deanmalmgren merged commit 359f9dd into master Aug 14, 2014


@deanmalmgren deanmalmgren deleted the issue53 branch Aug 14, 2014
