Unoconv gets stuck converting one HTML into PDF #191

Closed
mauricioblur opened this issue Mar 4, 2014 · 7 comments

@mauricioblur

I'm trying to convert an HTML file into a PDF file. The process works fine for a large share of my HTML files. The command looks like:

unoconv -f pdf -vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv search_003.html

The output is:

Verbosity set to level 49
Connection type: socket,host=localhost,port=2002;urp;StarOffice.ComponentContext
Office base location: /usr/lib/libreoffice
Office binary location: /usr/lib/libreoffice/program
Existing listener not found.
Launching our own listener using /usr/lib/libreoffice/program/soffice.bin.
LibreOffice listener successfully started. (pid=30901)
Input file: search_003.html
Selected output format: Portable Document Format [.pdf]
Selected office filter: writer_pdf_Export
Used doctype: document
Output file: search_003.pdf
Terminating LibreOffice instance.
Waiting for LibreOffice instance to exit.

But for a few files, the process gets stuck on "Input file":

Verbosity set to level 49
Connection type: socket,host=localhost,port=2002;urp;StarOffice.ComponentContext
Office base location: /usr/lib/libreoffice
Office binary location: /usr/lib/libreoffice/program
Existing listener not found.
Launching our own listener using /usr/lib/libreoffice/program/soffice.bin.
LibreOffice listener successfully started. (pid=30978)
Input file: tvrnews.tvr.ro.html

If I press CTRL + C, then:

Verbosity set to level 49
Connection type: socket,host=localhost,port=2002;urp;StarOffice.ComponentContext
Office base location: /usr/lib/libreoffice
Office binary location: /usr/lib/libreoffice/program
Existing listener not found.
Launching our own listener using /usr/lib/libreoffice/program/soffice.bin.
LibreOffice listener successfully started. (pid=30978)
Input file: tvrnews.tvr.ro.html
^Cunoconv: SystemError during update-indexes phase: Couldn't instantiate python representation of structered UNO type com.sun.star.lang.DisposedException
Traceback (most recent call last):
  File "/usr/bin/unoconv", line 1053, in <module>
    die(exitcode)
  File "/usr/bin/unoconv", line 919, in die
    if convertor.desktop.getCurrentFrame():
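
One way to keep a hang like this from blocking a script is to run the conversion under a time limit. The following is a minimal sketch, not part of unoconv: `run_with_timeout` is a hypothetical helper, and any limit passed to it is an arbitrary assumption to tune per document.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return its exit code, or None if it ran longer than timeout_s seconds."""
    try:
        # On timeout, subprocess.run kills the direct child process and
        # then raises TimeoutExpired.
        completed = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return None
    return completed.returncode


# Hypothetical usage (the 120-second limit is an assumption):
#   run_with_timeout(["unoconv", "-f", "pdf", "tvrnews.tvr.ro.html"], 120)
```

Only the process started directly is killed on timeout. A listener that unoconv launched itself (the `soffice.bin` pid in the log above) may keep running and may need to be cleaned up separately.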

My unoconv version is:

unoconv 0.5
Written by Dag Wieers <dag@wieers.com>
Homepage at http://dag.wieers.com/home-made/unoconv/

platform posix/linux2
python 2.7.4 (default, Sep 26 2013, 03:20:26) 
[GCC 4.7.3]
LibreOffice 4.0

build revision $Rev$

Any idea or suggestion of how to manage this problem?

Thank you very much!

@IzzySoft

IzzySoft commented Sep 8, 2014

Exactly the same here when trying to convert an .odt to .doc. Did you find a work-around meanwhile? Trouble for me is, LibreOffice itself also hangs when I want to load that document, so I'm at a loss: if I cannot get it editable again, I've lost 3 days of work, and I need to continue with it.

PS: just found an older related issue, #13 has the same error message. There @dagwieers suggested it might be a size limit of some kind concerning the output format. I'm just trying some other output formats meanwhile (.ott and ooxml), but it doesn't look successful up to now.

Some interesting side effects:

  • CPU hogs at 100% for one core (core alternating – so it switches between the cores)
  • there are two processes of libreoffice doing the conversion. Running strace -p <pid> against process-1, it seems to do nothing: Just sticks at futex(0x25e8d60, FUTEX_WAIT_PRIVATE, 2, NULL – while the second permanently repeats mprotect(0x7fc361935000, 4096, PROT_READ|PROT_WRITE) = 0 (just altering the address with each repeated line)
  • I don't see any "memory hog": both processes show a RES of ~100M and VIRT of ~1G (input .odt is ~500k)
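
The "one process idle, one process spinning" observation above can also be checked without strace by sampling CPU time from `/proc`. This is a rough sketch that assumes Linux; the function names are hypothetical and the sampling interval is arbitrary.

```python
import os
import time


def parse_cpu_ticks(stat_line):
    """Return utime + stime (in clock ticks) from one /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the last ')' before counting fields.
    fields = stat_line.rsplit(")", 1)[1].split()
    # fields[0] is field 3 (state); utime is field 14 and stime is field 15.
    return int(fields[11]) + int(fields[12])


def cpu_seconds_used(pid, interval_s=5.0):
    """CPU seconds consumed by pid (all threads) during interval_s seconds."""
    ticks_per_second = os.sysconf("SC_CLK_TCK")
    path = "/proc/%d/stat" % pid
    with open(path) as f:
        before = parse_cpu_ticks(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        after = parse_cpu_ticks(f.read())
    return (after - before) / ticks_per_second
```

A result close to `interval_s` suggests the process is saturating one core, while a result near zero matches a process blocked in `futex`.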

I've started the first conversion try in the morning before leaving for work, it still hung like that when I returned ~10h later.

@mauricioblur
Author

Hi!

No luck. A couple of days ago I looked into the same issue again (this time with the latest version of unoconv, 0.6), and the result is the same.

I stopped using unoconv (at least for this conversion) and started exploring other alternatives.

Good luck!

@IzzySoft

@mauricioblur Same for me. I've ended up with an adventure, but was able to get the document fixed that way. Just in case, as it might be helpful to others stuck in the same path:

Apache OO seems to be a little more forgiving here. It took quite long (a few hours), but it was able to load the document. As it turned out later, simply saving it again doesn't solve the issue (the "corrupt part" stays corrupt). "Luckily" I was anticipating something like that, and also generated a "Master Document" (File → Send → Create Master Document), splitting at the "top-level" of the headlines used. That generates as many .odt files as there are "chapters" at the given level, plus a "controlling .odm" file.

Tried opening the resulting .odt files, which all did fine except for one (the one containing the "corrupt block"). After a few hours of loading it again succeeded, and I repeated above steps for "level 2". Then again the same game, "level 3". No more levels to split at, but the "broken part" was down to 3 pages. First page showed up fine, switching to the second page resulted in a "hang".

Luckily I had an older version of the same document from 2 days earlier. I split that up the very same way, copy-pasted the changes from the still readable "page 1", then replaced the resulting piece in the "broken document" with the updated one from the old version.

As I obviously didn't want to continue working with that "split document", the final steps were then "merging up" the parts (File → Export lets you do that, converting a "Master Document" with all its "split files" into a single .odt containing sections, which you then might need to clean up). In the end, I decided to stay with the "Master Document" at "level 1", so if anything breaks again, it would (hopefully) affect only a part of my work.

@dagwieers
Member

@IzzySoft We have seen various problems with the (fragile) way unoconv converts documents (using an office listener and communicating with it). In most cases the cause is in LibreOffice/OpenOffice. If you cannot reproduce it using the graphical interface and/or using the native LibreOffice conversion options, it may be specific to unoconv itself.
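
To compare against the native conversion path, LibreOffice can be run headless without unoconv or a listener. The following is a minimal sketch under a few assumptions: `soffice` is on the PATH, the helper names are hypothetical, and the 300-second limit is arbitrary.

```python
import subprocess


def native_convert_command(input_path, output_format="pdf", outdir="."):
    """Build a headless LibreOffice conversion command that bypasses unoconv."""
    return [
        "soffice",  # assumed to be on PATH
        "--headless",
        "--convert-to", output_format,
        "--outdir", outdir,
        input_path,
    ]


def native_convert(input_path, output_format="pdf", outdir=".", timeout_s=300):
    """Return True if soffice exited with status 0 within timeout_s seconds."""
    cmd = native_convert_command(input_path, output_format, outdir)
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode == 0
    except subprocess.TimeoutExpired:
        # The same document hanging here points at LibreOffice, not unoconv.
        return False
```

A zero exit status may not guarantee that an output file was written, so checking that the file exists in `outdir` is a sensible extra step.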

Using the latest LibreOffice/OpenOffice and/or trying older versions could be useful to find regressions and identify when it stopped working correctly. Or it might as well show the latest version fixes it.

PS If you can send me such a document I can test it on my laptop (which has every LibreOffice installed since v3.3 up to v4.3).

@IzzySoft

IzzySoft commented Oct 2, 2014

@dagwieers After having finally resolved the issue on my end (with my document), my conclusion is that it's not a bug in reading the document: the document was definitely broken. How do I conclude that? Literally all software I used for reading had trouble at the same page: unoconv, OO, LO, even the MS Word Viewer. So there must be a problem in the document itself. My comment above shows how I broke that down, and finally figured out how to (hopefully) avoid it in the future:

The document in question uses a lot of "frames" (mostly for graphics/pictures with their captions). It's a book I'm the author of: I use LO to write it, convert it to .doc, and send it to the publisher. The editor does all the formatting, and finally sends it for print. For the next edition, I get the finalized .doc back. And now comes the culprit: I must not touch any of those "frames" the editor inserted. Whenever I do, I risk being unable to open the document again. So it's something introduced when editing/saving the document in LO (I remember having had a similar issue back when I used OO, but cannot tell if that was the same; it was at least related, as I solved it then by removing all "frames").

Long story short: It's in my case obviously not a specific unoconv issue. But if you think it might be helpful, I still have the "borked document", so I could send it to you.

@dagwieers dagwieers added LibreOffice and removed bug labels Jul 5, 2015
@dagwieers
Member

It may be useful to test this document again with a recent LibreOffice and in case it still has difficulties, report this to the LibreOffice project. They may be able to improve their support for broken documents.

I will close this issue. Thanks again for the feedback!

@charlescurley

charlescurley commented Oct 22, 2017

I seem to have hit this issue. In my case, I did open the offending document and convert to text manually.

The error message I get, after hitting control C, is:

unoconv: SystemError during import phase:
Couldn't instantiate python representation of structured UNO type com.sun.star.lang.DisposedException
Traceback (most recent call last):
  File "/usr/bin/unoconv", line 1278, in <module>
    die(exitcode)
  File "/usr/bin/unoconv", line 1131, in die
    if convertor.desktop.getCurrentFrame():
uno.DisposedException: Binary URP bridge already disposed

No output file is created.

This is on Debian 8.9, jessie. Libreoffice version is 1:5.2.7-1~bpo8+1, and unoconv is 0.7-1.1~bpo8+1. The offending document is http://legisweb.state.wy.us/statutes/compress/title35.docx. It is one of several (http://legisweb.state.wy.us/LSOWEB/StatutesDownload.aspx); the rest all converted successfully using unoconv.

A quick examination in LibreOffice's Navigator indicates that there are no frames in the document.
