Default options for Text (encoded) export filter with LibreOffice #98

Closed
jkhradil opened this Issue · 5 comments

3 participants

@jkhradil

unoconv's (version 0.6) default filter options for the Text (encoded) output filter are "76,LF" (UTF-8, line feeds for paragraph breaks). With LibreOffice (3.4.5 and 3.5.7; I haven't tested other versions) the output is not encoded in UTF-8.

Setting FilterOptions="UTF8,LF" seems to produce the desired result. It seems the LibreOffice developers changed the mapping of the encoding options.

@jkhradil

Upon further inspection I found out that unoconv v0.5 used "UTF8,LF" as default filter options. This got changed in commit ad3c68d.

I guess this change was based on information on the OpenOffice wiki (http://wiki.openoffice.org/wiki/Documentation/DevGuide/Spreadsheets/Filter_Options). However, the Text (encoded) output filter is a document filter, not a spreadsheet filter, so the same rules need not apply.

@dagwieers
Owner

OK, this is likely a regression. However, I would like to understand how it is supposed to work, because we likely document it incorrectly in the manual page as well. So before I change it back, I would like to have an authoritative source confirming this, modify the manual page accordingly, and make sure we are not breaking something else along the way.

@graaff

I'm seeing the same thing with LibreOffice 4.0.4.2. Using an explicit FilterOptions=UTF8,LF fixes things for me.

@dagwieers dagwieers added support and removed blocks_release bug labels
@dagwieers
Owner

Sorry for not getting back to this sooner.

I did some tests to understand what is going on and whether this is still relevant. Here are my results:

Current unoconv 0.6 behavior using FilterOptions=76,LF

[dag@moria unoconv]$ /opt/libreoffice5.0/program/python ./unoconv -f txt test.fodt
[dag@moria unoconv]$ file test.txt
test.txt: UTF-8 Unicode (with BOM) English text, with very long lines, with overstriking
[dag@moria unoconv]$ /opt/libreoffice4.4/program/python ./unoconv -f txt test.fodt
[dag@moria unoconv]$ file test.txt
test.txt: UTF-8 Unicode (with BOM) English text, with very long lines, with overstriking
[dag@moria unoconv]$ /opt/libreoffice4.3/program/python ./unoconv -f txt test.fodt
[dag@moria unoconv]$ file test.txt
test.txt: UTF-8 Unicode (with BOM) English text, with very long lines, with overstriking
[dag@moria unoconv]$ /opt/libreoffice4.2/program/python ./unoconv -f txt test.fodt
[dag@moria unoconv]$ file test.txt
test.txt: UTF-8 Unicode (with BOM) English text, with very long lines, with overstriking
[dag@moria unoconv]$ /opt/libreoffice4.1/program/python ./unoconv -f txt test.fodt
[dag@moria unoconv]$ file test.txt
test.txt: data
[dag@moria unoconv]$ /opt/libreoffice4.0/program/python ./unoconv -f txt test.fodt
[dag@moria unoconv]$ file test.txt
test.txt: data
[dag@moria unoconv]$ /opt/libreoffice3.6/program/python ./unoconv -f txt test.fodt
[dag@moria unoconv]$ file test.txt
test.txt: data
[dag@moria unoconv]$ /opt/libreoffice3.5/program/python ./unoconv -f txt test.fodt
[dag@moria unoconv]$ file test.txt
test.txt: data
[dag@moria unoconv]$ /opt/libreoffice3.4/program/python ./unoconv -f txt test.fodt
[dag@moria unoconv]$ file test.txt
test.txt: data

Patched unoconv behavior using FilterOptions=UTF8,LF

[dag@moria unoconv]$ /opt/libreoffice5.0/program/python ./unoconv -f txt test.fodt
[dag@moria unoconv]$ file test.txt
test.txt: UTF-8 Unicode (with BOM) English text, with very long lines, with overstriking
[dag@moria unoconv]$ /opt/libreoffice4.4/program/python ./unoconv -f txt test.fodt
[dag@moria unoconv]$ file test.txt
test.txt: UTF-8 Unicode (with BOM) English text, with very long lines, with overstriking
[dag@moria unoconv]$ /opt/libreoffice4.3/program/python ./unoconv -f txt test.fodt
[dag@moria unoconv]$ file test.txt
test.txt: UTF-8 Unicode (with BOM) English text, with very long lines, with overstriking
[dag@moria unoconv]$ /opt/libreoffice4.2/program/python ./unoconv -f txt test.fodt
[dag@moria unoconv]$ file test.txt
test.txt: UTF-8 Unicode (with BOM) English text, with very long lines, with overstriking
[dag@moria unoconv]$ /opt/libreoffice4.1/program/python ./unoconv -f txt test.fodt
[dag@moria unoconv]$ file test.txt
test.txt: data
[dag@moria unoconv]$ /opt/libreoffice4.0/program/python ./unoconv -f txt test.fodt
[dag@moria unoconv]$ file test.txt
test.txt: data
[dag@moria unoconv]$ /opt/libreoffice3.6/program/python ./unoconv -f txt test.fodt
[dag@moria unoconv]$ file test.txt
test.txt: data
[dag@moria unoconv]$ /opt/libreoffice3.5/program/python ./unoconv -f txt test.fodt
[dag@moria unoconv]$ file test.txt
test.txt: data
[dag@moria unoconv]$ /opt/libreoffice3.4/program/python ./unoconv -f txt test.fodt
[dag@moria unoconv]$ file test.txt
test.txt: data

So there seems to be absolutely no difference between FilterOptions=76,LF and FilterOptions=UTF8,LF for any of the versions we tested.

Comparing the two generated files (which are identical in size, BTW):
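As a cross-check that does not depend on `file`, the encoding of each output could also be inspected directly. The following is only a minimal sketch: the function name `describe_encoding` is hypothetical, and it only checks for a UTF-8 BOM and for valid UTF-8, so it does not reproduce the "data" verdict that `file` gives above.

```python
import codecs


def describe_encoding(data: bytes) -> str:
    """Roughly classify a byte string as UTF-8 (with or without BOM) or not.

    Hypothetical helper for illustration; it is not part of unoconv.
    Pure ASCII input counts as valid UTF-8 without a BOM.
    """
    has_bom = data.startswith(codecs.BOM_UTF8)
    try:
        # Strict decoding raises on any byte sequence that is not valid UTF-8.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "not valid UTF-8"
    return "UTF-8 with BOM" if has_bom else "UTF-8 without BOM"
```

To use it on a converted file, read the file in binary mode (for example `open("test.txt", "rb").read()`) and pass the bytes to the function.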

[dag@moria unoconv]$ diff -u <(hexdump -C test-76.txt) <(hexdump -C test-utf8.txt)
--- /dev/fd/63  2015-07-05 11:53:51.483061227 +0200
 +++ /dev/fd/62  2015-07-05 11:53:51.483061227 +0200
 @@ -113,7 +113,7 @@
  00000700  20 6f 72 61 63 6c 65 69  69 6d 62 2d 72 67 2e 20  | oracleiimb-rg. |
  00000710  49 74 73 20 70 72 65 66  65 72 72 65 64 20 73 65  |Its preferred se|
  00000720  72 76 65 72 20 69 73 20  73 67 75 66 63 20 69 6e  |rver is sgufc in|
 -00000730  20 04 43 53 4d 05 2e 0a  54 68 69 73 20 72 65 73  | .CSM...This res|
 +00000730  20 07 43 53 4d 08 2e 0a  54 68 69 73 20 72 65 73  | .CSM...This res|
  00000740  6f 75 72 63 65 20 67 72  6f 75 70 20 61 6c 73 6f  |ource group also|
  00000750  20 68 6f 73 74 73 20 74  68 65 20 4d 51 20 71 75  | hosts the MQ qu|
  00000760  65 75 65 20 6d 61 6e 61  67 65 72 20 6f 6e 20 74  |eue manager on t|
 @@ -273,10 +273,10 @@
  00001100  74 65 72 73 0a 0a 50 61  72 61 6d 65 74 65 72 0a  |ters..Parameter.|
  00001110  56 61 6c 75 65 0a 44 69  73 61 73 74 65 72 20 73  |Value.Disaster s|
  00001120  65 72 76 65 72 20 61 6e  64 20 6c 6f 63 61 74 69  |erver and locati|
 -00001130  6f 6e 0a 73 67 75 66 63  20 40 20 04 43 53 4d 05  |on.sgufc @ .CSM.|
 +00001130  6f 6e 0a 73 67 75 66 63  20 40 20 07 43 53 4d 08  |on.sgufc @ .CSM.|
  00001140  0a 46 61 69 6c 6f 76 65  72 20 73 65 72 76 65 72  |.Failover server|
 -00001150  20 61 6e 64 20 6c 6f 63  61 74 69 6f 6e 0a 04 73  | and location..s|
 -00001160  67 75 67 63 05 20 40 20  04 4d 61 72 6e 69 78 05  |gugc. @ .Marnix.|
 +00001150  20 61 6e 64 20 6c 6f 63  61 74 69 6f 6e 0a 07 73  | and location..s|
 +00001160  67 75 67 63 08 20 40 20  07 4d 61 72 6e 69 78 08  |gugc. @ .Marnix.|
  00001170  0a 0a 43 6f 6e 66 69 67  75 72 61 74 69 6f 6e 0a  |..Configuration.|
  00001180  0a 43 6c 75 73 74 65 72  20 53 65 72 76 65 72 0a  |.Cluster Server.|
  00001190  4c 6f 63 61 74 69 6f 6e  0a 49 50 0a 73 67 75 66  |Location.IP.sguf|
 @@ -290,43 +290,43 @@
  00001210  43 6c 6f 76 65 72 6c 65  61 66 0a 6f 72 61 63 6c  |Cloverleaf.oracl|
  00001220  65 69 69 6d 62 2d 72 67  0a 63 67 75 69 69 6d 69  |eiimb-rg.cguiimi|
  00001230  69 6d 62 0a 31 30 2e 36  36 2e 31 32 30 2e 31 33  |imb.10.66.120.13|
 -00001240  0a 73 67 75 66 63 0a 04  43 53 4d 05 0a 53 47 49  |.sgufc..CSM..SGI|
 +00001240  0a 73 67 75 66 63 0a 07  43 53 4d 08 0a 53 47 49  |.sgufc..CSM..SGI|
  00001250  49 4d 42 0a 0a 63 67 75  69 69 6d 69 78 66 62 2d  |IMB..cguiimixfb-|
  00001260  72 67 0a 63 67 75 69 69  6d 69 78 66 62 0a 31 30  |rg.cguiimixfb.10|
  00001270  2e 36 36 2e 31 32 30 2e  31 36 35 0a 73 67 75 66  |.66.120.165.sguf|
 -00001280  63 0a 04 43 53 4d 05 0a  2d 0a 54 69 76 6f 6c 69  |c..CSM..-.Tivoli|
 +00001280  63 0a 07 43 53 4d 08 0a  2d 0a 54 69 76 6f 6c 69  |c..CSM..-.Tivoli|
  00001290  20 54 4d 46 0a 73 63 69  69 6d 74 6d 66 61 2d 72  | TMF.sciimtmfa-r|
  000012a0  67 0a 63 67 75 69 69 6d  74 6d 66 61 0a 31 30 2e  |g.cguiimtmfa.10.|
  000012b0  36 36 2e 31 32 30 2e 31  34 0a 73 67 75 67 63 0a  |66.120.14.sgugc.|
 -000012c0  04 4d 61 72 6e 69 78 05  0a 2d 0a 46 69 6e 61 6e  |.Marnix..-.Finan|
 +000012c0  07 4d 61 72 6e 69 78 08  0a 2d 0a 46 69 6e 61 6e  |.Marnix..-.Finan|
  000012d0  63 65 20 4b 69 74 0a 63  67 75 69 69 6d 73 79 62  |ce Kit.cguiimsyb|
  000012e0  62 2d 72 67 0a 63 67 75  69 69 6d 73 79 62 62 0a  |b-rg.cguiimsybb.|
...

The only difference is that bytes 0x04 and 0x05 in one file appear as 0x07 and 0x08, respectively, in the other, and that's the only difference between LibreOffice 4.1 and older, and LibreOffice 4.2 and newer. I don't mind putting the default FilterOptions back to using UTF8 based on this analysis.
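For files of identical size, the same comparison can be done without `hexdump` and `diff` by listing the offsets where the bytes differ. This is a sketch under that equal-size assumption; `diff_bytes` is a hypothetical helper name, and the sample bytes are taken from the row at offset 0x730 in the hexdump above.

```python
def diff_bytes(a: bytes, b: bytes) -> list:
    """Return (offset, byte_in_a, byte_in_b) for every position where a and b differ.

    Hypothetical helper for illustration; assumes both inputs have the same length.
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


# First 8 bytes of the row at offset 0x730 from each side of the diff.
old_row = bytes.fromhex("20 04 43 53 4d 05 2e 0a")
new_row = bytes.fromhex("20 07 43 53 4d 08 2e 0a")

for offset, x, y in diff_bytes(old_row, new_row):
    print(f"offset {offset}: 0x{x:02x} -> 0x{y:02x}")
```

For the two real outputs, the inputs would be the full contents of `test-76.txt` and `test-utf8.txt` read in binary mode.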

More info in commit 3b25f64

@dagwieers dagwieers closed this
@dagwieers
Owner

I closed the ticket. If anyone finds this makes a difference, please reopen this ticket and add the info needed to reproduce it.
