unoconv doesn't output UTF-8 text files on Windows. #185

Closed
fasar opened this issue Feb 18, 2014 · 5 comments

fasar commented Feb 18, 2014

I've got the same error.

I am running the latest commit on master (a2c7b2f...caad43) of unoconv.
I run unoconv on Windows with the LibreOffice Python 3.3.0.
I use LibreOffice version 4.1.3.2 from http://portableapps.com/

My document testAccent contains:

    Exemple : Pouvez-vous aller « à » la librairie me commander ce livre ?
    (Il s'agit bien d'une préposition puisque la conjugaison est impossible !)

I convert it with :

program\python.exe contribs\unoconv\unoconv -f txt --output="C:\Users\fabien_s\Desktop\ici" "C:\Users\fabien_s\Desktop\testAccent.odt"

The file command reports the output as: ici: ISO-8859 text, with CRLF line terminators
With -f text it reports: ici: ISO-8859 text.

My document testChinese contains:

見勝不過眾人之所知,非善之善者也; 
兵法:一曰度,二曰量,三曰数,四曰稱,五曰勝。地生度,度生量,量生数,数生稱,稱生勝。故勝兵若以鎰稱銖,敗兵若以銖稱鎰。勝者之戰,若決積水於千仞之谿者,形也。
兵勢第五

When I run:

program\python.exe contribs\unoconv\unoconv -f txt --output="C:\Users\fabien_s\Desktop\ici" "C:\Users\fabien_s\Desktop\testChinese.odt"

It gives me a lot of ???? characters.
The file command gives me: ici: ASCII text, with CRLF line terminators

The hex dump gives me:
hexdump -C ici

00000000  3f 3f 3f 3f 3f 3f 3f 3f  3f 2c 3f 3f 3f 3f 3f 3f  |?????????,??????|
00000010  3b 20 0d 0a 3f 3f 3a 3f  3f 3f 2c 3f 3f 3f 2c 3f  |; ..??:???,???,?|
00000020  3f 3f 2c 3f 3f 3f 2c 3f  3f 3f 3f 3f 3f 3f 2c 3f  |??,???,???????,?|
00000030  3f 3f 2c 3f 3f 3f 2c 3f  3f 3f 2c 3f 3f 3f 3f 3f  |??,???,???,?????|
00000040  3f 3f 3f 3f 3f 3f 3f 2c  3f 3f 3f 3f 3f 3f 3f 3f  |???????,????????|
00000050  3f 3f 3f 3f 2c 3f 3f 3f  3f 3f 3f 3f 3f 3f 3f 2c  |????,??????????,|
00000060  3f 3f 3f 0d 0a 3f 3f 3f  3f 0d 0a 0d 0a           |???..????....|
0000006d

With the same document, when I run:

program\python.exe contribs\unoconv\unoconv -f text --output="C:\Users\fabien_s\Desktop\ici" "C:\Users\fabien_s\Desktop\testAccent.odt"

The file command gives me: ici: data
The hex dump gives me:
hexdump -C ici

00000000  8b dd 0d 4e 3e ba 4b 40  e5 0c 5e 84 4b 84 05 5f  |...N>.K@..^.K.._|
00000010  1b 20 0a 75 d5 1a 00 f0  a6 0c 8c f0 cf 0c 09 f0  |. .u............|
00000020  70 0c db f0 31 0c 94 f0  dd 02 30 1f a6 0c a6 1f  |p...1.....0.....|
00000030  cf 0c cf 1f 70 0c 70 1f  31 0c 31 1f dd 02 45 dd  |....p.p.1.1...E.|
00000040  75 e5 e5 b0 31 96 0c 57  75 e5 e5 96 31 b0 02 dd  |u...1..Wu...1...|
00000050  05 4b 30 0c e5 7a 4d 34  bc 43 de 4b 3f 05 0c 62  |.K0..zM4.C.K?..b|
00000060  5f 02 0a 75 e2 2c 94 0a  0a                       |_..u.,...|
00000069

I use Cygwin to get the Linux commands.

I think it's because the default file encoding on Windows is not UTF-8.
Maybe Python doesn't handle files the same way there.
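
The hypothesis above can be illustrated in plain Python, independent of unoconv. This is a minimal sketch, assuming cp1252 as the legacy Windows code page (the issue does not state which code page the machine used); the helper names are illustrative only. Encoding to a single-byte code page with replacement turns every unrepresentable character into `?` (byte 3f), which looks like the first hex dump, while accented Latin letters survive as single bytes that are not UTF-8.

```python
import locale

# Assumption: cp1252 stands in for the Windows "ANSI" code page; the issue
# does not say which code page the machine actually used.
LEGACY_CODEC = "cp1252"

def encode_legacy(text, codec=LEGACY_CODEC):
    """Encode text the way a lossy single-byte export would."""
    # errors="replace" substitutes b"?" for characters the codec lacks.
    return text.encode(codec, errors="replace")

def legacy_to_utf8(data, codec=LEGACY_CODEC):
    """Re-encode legacy bytes as UTF-8 (only helps if nothing was lost)."""
    return data.decode(codec).encode("utf-8")

accent = "« à » préposition"
chinese = "見勝不過"

print(locale.getpreferredencoding(False))     # platform default text encoding
print(encode_legacy(accent))                  # accented letters become single bytes
print(encode_legacy(chinese))                 # every CJK character becomes b"?"
print(legacy_to_utf8(encode_legacy(accent)))  # the Latin text can be recovered
```

Re-encoding after the fact can only rescue a file like testAccent; once the characters have been replaced by `?`, as with testChinese, the original text is gone.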

Tomorrow I'll set up a Linux machine and try to convert these examples.

fasar commented Feb 18, 2014

This is the same as the old issue #148, but that one is closed.

fasar commented Feb 18, 2014

I tried it on Ubuntu with the deb package of unoconv.

The option "-f text" gives the same result as on Windows.

On Ubuntu, the option "-f txt" works very well.
It produces UTF-8 files.
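
One way to confirm the encoding without the file command is to decode the bytes strictly. This is a minimal sketch; the helper names `is_valid_utf8` and `file_is_utf8` are illustrative and not part of unoconv.

```python
def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")  # strict decoding raises on any invalid sequence
    except UnicodeDecodeError:
        return False
    return True

def file_is_utf8(path):
    """Read a file in binary mode and check its bytes."""
    with open(path, "rb") as handle:
        return is_valid_utf8(handle.read())

print(is_valid_utf8("à".encode("utf-8")))    # UTF-8 bytes
print(is_valid_utf8("à".encode("latin-1")))  # a lone 0xe0 byte is not valid UTF-8
```

Note that pure ASCII output, such as a file full of `?`, also passes this check, so it verifies the encoding and not whether characters were lost.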

fasar commented Feb 20, 2014

A quick workaround is to convert to PDF and then use a PDF text-extraction tool.
I use PDFBox, which works perfectly from Java.
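
The detour through PDF could be scripted roughly as follows. This is only a sketch under assumptions: it swaps PDFBox for Poppler's `pdftotext` command-line tool so the example stays in Python, it assumes both `unoconv` and `pdftotext` are on the PATH (on the Windows setup above, unoconv was started through `program\python.exe` instead), and the file names are placeholders.

```python
import subprocess

def build_commands(source, pdf_path, txt_path):
    """Return the two command lines of the PDF detour, without running them."""
    to_pdf = ["unoconv", "-f", "pdf", "-o", pdf_path, source]
    # Assumption: Poppler's pdftotext is installed; it replaces PDFBox here.
    to_txt = ["pdftotext", "-enc", "UTF-8", pdf_path, txt_path]
    return [to_pdf, to_txt]

def run_workaround(source, pdf_path, txt_path):
    """Run both steps, stopping at the first failure."""
    for command in build_commands(source, pdf_path, txt_path):
        subprocess.run(command, check=True)

# Show the commands that would be executed for a placeholder document.
for command in build_commands("testChinese.odt", "testChinese.pdf", "testChinese.txt"):
    print(" ".join(command))
```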

fasar commented Aug 10, 2014

OK, I think it's a LibreOffice version problem.
Thanks.

dagwieers (Member) commented

Thanks for the very detailed troubleshooting and the feedback.
Feedback like this helps other users to investigate their own issues because you show your exact thought process and commands.
