error in text output if heading is first element on a page #9

Closed
manjarofan opened this Issue May 2, 2016 · 4 comments

manjarofan commented May 2, 2016

Explanatory introduction:

  • LibreOffice/OpenOffice apparently places elements inside .odt files that produce a page break when the document is displayed in the GUI program.
  • The exact location of these soft page breaks, and therefore of the inserted elements, can apparently differ for the same document between office versions.
  • When the .odt file is converted to raw XML with the --raw option of odt2txt, tags named <text:soft-page-break/> appear in the XML stream/file.
  • When the output of odt2txt is plain text, a heading in the .odt file should be rendered as
    This is a heading!
    ------------------
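
To make the expected heading format concrete, here is a minimal sketch in Python. It is only an illustration of the output described above, not odt2txt's actual code (odt2txt is a C program); the function name `underline_heading` is hypothetical.

```python
def underline_heading(title: str, char: str = "-") -> str:
    """Return the heading followed by an underline of the same length."""
    # One underline character per character of the heading text.
    return f"{title}\n{char * len(title)}"


if __name__ == "__main__":
    print(underline_heading("This is a heading!"))
```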

The bug:

  1. When a heading is the first element on a new page of an .odt document (because the previous page is full), then in the plain text representation generated from this .odt file by odt2txt the underlining for this heading will be missing.
  2. When the next element that follows this heading in the .odt file is also a heading, then the line break between these two headings will be missing.

Expected behaviour:
The elements LibreOffice inserts to mark soft page breaks are not part of the real content of an .odt document; they are metadata.
-> They should not influence the plain text output.
-> They should be ignored in XML output, too, or this should be configurable with an option.

Affected Versions:
Tested with: version 0.5 (compiled from source), odt2txt-0.4-12.1.2@openSUSE 13.2, odt2txt-0.4-10.1.2@openSUSE 13.1

Additional Information:
For a git repository I configured odt2txt to work as a diff tool for .odt files, as suggested here.
Here's an example of incorrect 'git diff' output:

diff --git a/doc/tuxminds_de/tuxminds_anleitung_3.81.odt b/doc/tuxminds_de/tuxminds_anleitung_3.81.odt
index 09fc4eb..16cef78 100644
--- a/doc/tuxminds_de/tuxminds_anleitung_3.81.odt
+++ b/doc/tuxminds_de/tuxminds_anleitung_3.81.odt
@@ -1,7 +1,7 @@

   TuxMinds

-  Benutzerhandbuch für die Version >= 3.81
+  Benutzerhandbuch für die Version >= 3.96

   Inhaltsverzeichnis

@@ -719,11 +719,13 @@
   nicht)

   Abbruch
-  -------

   siehe Start -nomen est omen-

-  Grundfunktionen der MaustastenLinke Maustaste:
+  Grundfunktionen der Maustasten
+  ------------------------------
+
+  Linke Maustaste:
   ----------------

   Klick auf die Arbeitsfläche+Drag zum Markieren / Umreißen eines

Comments to that output:
@@ -1,7 +1,7 @@: I loaded the .odt file into LibreOffice (v4.1.6.2; I also tried 5.0), changed '3.81' to '3.96' and saved the document in order to test 'odt2txt' in 'git diff'.
So this difference is OK.
@@ -719,11 +719,13 @@: In the version of LibreOffice/OpenOffice in which the document was last modified (before my test edit), the soft page break had apparently been at a different position, namely in front of the heading Grundfunktionen der Maustasten. When I executed File->Save, it moved to before Abbruch. Because of the bug, ------- is deleted after "Abbruch" and ------------------------------ is added after Grundfunktionen der Maustasten.
The line break, the underlining and the empty line between the two headings Grundfunktionen der Maustasten and Linke Maustaste: are correct in the new version and incorrect in the old.

Here is the same section from the XML output of the unmodified .odt file:

<text:p text:style-name="P10"/>
<text:h text:style-name="Heading_20_2" text:outline-level="2">Abbruch</text:h>
<text:p text:style-name="Standard">
  <text:span text:style-name="T12">siehe Start -nomen est omen-</text:span>
</text:p>
<text:h text:style-name="Heading_20_2" text:outline-level="2"><text:soft-page-break/>Grundfunktionen der Maustasten</text:h>
<text:h text:style-name="Heading_20_2" text:outline-level="2">Linke Maustaste:</text:h>
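
The expected behaviour can be sketched in a few lines of Python: drop every <text:soft-page-break/> before headings are converted, so that both headings in the XML sample above get their underline and their own line. This is an assumption-laden illustration, not odt2txt's real implementation; the names `strip_soft_page_breaks` and `headings_to_text` and both regular expressions are hypothetical.

```python
import re

# Matches the empty element LibreOffice inserts at soft page breaks.
SOFT_PAGE_BREAK = re.compile(r"<text:soft-page-break\s*/>")
# Matches a heading element and captures its (already tag-free) text.
HEADING = re.compile(r"<text:h\b[^>]*>(.*?)</text:h>", re.DOTALL)


def strip_soft_page_breaks(xml: str) -> str:
    """Remove soft page-break markers; they carry no document content."""
    return SOFT_PAGE_BREAK.sub("", xml)


def headings_to_text(xml: str) -> str:
    """Render each heading as its text plus an underline of equal length."""
    def render(match: re.Match) -> str:
        title = match.group(1)
        return f"\n{title}\n{'-' * len(title)}\n"

    # Stripping first keeps the marker from ending up inside the title.
    return HEADING.sub(render, strip_soft_page_breaks(xml))


if __name__ == "__main__":
    sample = (
        '<text:h text:style-name="Heading_20_2" text:outline-level="2">'
        "<text:soft-page-break/>Grundfunktionen der Maustasten</text:h>\n"
        '<text:h text:style-name="Heading_20_2" text:outline-level="2">'
        "Linke Maustaste:</text:h>"
    )
    print(headings_to_text(sample))
```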

@manjarofan manjarofan changed the title from error in text output if headings to error in text output if heading is first element on a page May 2, 2016

Owner

dstosberg commented May 2, 2016

Hi, can you provide (link to) a sample file that triggers this behaviour?

manjarofan commented May 4, 2016

(I got an error message when trying to upload a file here. Personal mailing seems impossible on GitHub?)
So I suggest extracting the file from an archive:
Here's the link to the project.
There you'll find the file 'doc/tuxminds_de/tuxminds_anleitung_3.81.odt' in the archive 'src_tuxminds_3.96_20141203.tbz'.

The bug occurs at the top of
<Page>: <Heading>
5: 4 Installation und Start
6: 5 Zugriff auf die Roboter über Linux Devices
8: Vorbemerkungen
10: Der Entwurfsmodus
12: Abbruch
13: 11 Skins (unchanged file only)

I looked at the whole text file generated from 'src_tuxminds_3.96_20141203.tbz'
and noticed that the bug occurs in at least one other case, too:

  • When a text:list ends directly before a heading (</text:list> is the closing XML tag in raw XML output):
    'Grundlegende Bedienung' and the immediately following 'Betrieb ohne Oberfläche (Batchbetrieb)' are both headings (on page 7 of the .odt file). They are not at the top of that page, yet the bug occurs in the generated txt file.

Contributor

albfan commented Aug 7, 2017

I guess uploading the file here and posting the actual and the expected output is the best way to see this issue.

Omit any details that are not relevant to it. (Do not upload a 20-page file to show a problem that appears only in the first two lines.)

@dstosberg dstosberg closed this in 1488cae Sep 28, 2017

dstosberg pushed a commit that referenced this issue Sep 28, 2017

root
Convert xml-protected spaces to real spaces
Because they can disturb later processing. See #9.

Owner

dstosberg commented Sep 28, 2017

Fixed the <text:soft-page-break/> issue. Fixed <text:s> as well, which caused the same symptoms. Thank you.
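
For readers unfamiliar with <text:s>: in ODF it stands for one or more space characters, with the optional text:c attribute giving the count. A minimal Python sketch of "converting xml-protected spaces to real spaces" might look like the following. It is not the code from the referenced commit; the function name `expand_spaces` and the regular expression are assumptions made for illustration.

```python
import re

# <text:s/> is one space; <text:s text:c="N"/> is N spaces.
TEXT_S = re.compile(r'<text:s(?:\s+text:c="(\d+)")?\s*/>')


def expand_spaces(xml: str) -> str:
    """Replace every <text:s> element with the spaces it represents."""
    return TEXT_S.sub(lambda m: " " * int(m.group(1) or 1), xml)


if __name__ == "__main__":
    print(repr(expand_spaces('Linke<text:s/>Maustaste:')))
```

Running such a substitution before heading detection would keep the element from interrupting the heading text, which matches the symptom described in this issue.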
