Fixes #3019 - Better import from LO / OO / soffice #3783

Merged: 9 commits, Apr 8, 2020


JohnMcLear
Member

@JohnMcLear JohnMcLear commented Mar 29, 2020

Discussion here: #3019

I don't like the return in contentcollector.js -- I'm +1 for a better method.

This fixes copy/paste and import too, in theory. Not fully tested.

  • Once OO is working, test AbiWord to see whether it accepts the same filter for the html value.

This change is meant to ease using LibreOffice as a converter. When LibreOffice
converts a file, it adds some classes to the <title> tag.
This is a quick & dirty way of matching the <title> and commenting it out
independently of the classes that are set on it.
@muxator
Contributor

muxator commented Mar 30, 2020

Crashes the import process. Etherpad itself survives. No logs.

See #3019 (comment).

@JohnMcLear
Member Author

Ugh, LibreOffice gets this wrong.

If LibreOffice isn't fully installed, i.e. you only installed libreoffice-common, it fails.

sudo apt-get install libreoffice

is required.

Instead of returning "Error: source file could not be loaded", LibreOffice should explain why it's not loading and how to remedy it. Quite disappointed!

@JohnMcLear
Member Author

When using NodeJS spawn

Error in option: --convert-to html:"XHTML Writer File:UTF8"

When using CLI

soffice --headless --convert-to html:"XHTML Writer File:UTF8" --outdir . /home/jose/testOneLiner.odt --invisible --nologo --nolockcheck --writer

Works

@JohnMcLear
Member Author

JohnMcLear commented Mar 30, 2020

I dunno tbh, this fix is fine as far as its design goes; the problem is with nodejs not properly passing the arguments to spawn.

I tried hardcoding and shell:true to no avail..

      var soffice = spawn(settings.soffice, [
        '--headless',
        '--invisible',
        '--nologo',
        '--nolockcheck',
        '--writer',
        '--convert-to', 'html:"XHTML Writer File:UTF8"',
        task.srcFile,
        '--outdir', tmpDir
      ],
        {
          shell : true
        }
      );


@muxator can you take a look and sanity check my work on this? If you set shell to false you can see that spawn isn't passing the correct command to soffice

@JohnMcLear
Member Author

Bump @muxator for help :D

@JohnMcLear JohnMcLear changed the title Fixes #3019 Fixes #3019 - Better import from LO / OO / soffice - don't import 2 bullets in place of one - Need help Apr 1, 2020
@muxator
Contributor

muxator commented Apr 4, 2020

@JohnMcLear I confirm your problem.

When invoked directly on the shell this command works flawlessly:

libreoffice6.4 --headless --convert-to html:"XHTML Writer File:UTF8" --outdir . /tmp/oneline.odt --invisible --nologo --nolockcheck --writer
convert /tmp/oneline.odt -> /tmp/oneline.html using filter : XHTML Writer File:UTF8

The same command wrapped in a spawn() call in pure Javascript, without Etherpad being involved at all, fails:

// file: test-libreoffice.js
const { spawn } = require('child_process');

//const ls = spawn('ls', ['-lh', '/usr']);
const ls = spawn('libreoffice6.4', [
  '--headless',
  '--convert-to',
  'html:"XHTML Writer File:UTF8"',
  '--outdir',
  '.',
  '--invisible',
  '--nologo',
  '--nolockcheck',
  '--writer',
  '/tmp/oneline.odt',
]);

ls.stdout.on('data', (data) => {
  console.log(`stdout: ${data}`);
});

ls.stderr.on('data', (data) => {
  console.log(`stderr: ${data}`);
});

ls.on('close', (code) => {
  console.log(`child process exited with code ${code}`);
});

Result:

$ node test-libreoffice.js 
convert /tmp/oneline.odt -> /tmp/oneline.html using filter : "XHTML Writer File:UTF8"
Error: Please verify input parameters... (SfxBaseModel::impl_store <file:///tmp/oneline.html> failed: 0x81a(Error Area:Io Class:Parameter Code:26))
child process exited with code 0

Tested on nodejs 12.13.1.

I'll try to profile the LibreOffice execution.

@JohnMcLear
Member Author

I'm glad you sanity checked this, glad it's not me being an idiot! Thanks! Please let me know if you solve and how, this one pissed me off good :D

@muxator
Contributor

muxator commented Apr 4, 2020

This is not nodejs specific.

The same program as above, converted to Python, gives the exact same result:

#!/usr/bin/env python3
#
# file: /tmp/test.py

import subprocess

args = [
  'libreoffice6.4',
  '--headless',
  '--convert-to',
  'html:"XHTML Writer File:UTF8"',
  '--outdir',
  '.',
  '/tmp/oneline.odt',
  '--invisible',
  '--nologo',
  '--nolockcheck',
  '--writer',
]

result = subprocess.run(args, capture_output=True)

print('stdout:', result.stdout)
print('stderr:', result.stderr)

When executing:

$ ./test.py 
stdout: b'convert /tmp/oneline.odt -> /tmp/oneline.html using filter : "XHTML Writer File:UTF8"\n'
stderr: b'Warning: failed to read path from javaldx\nError: Please verify input parameters... (SfxBaseModel::impl_store <file:///tmp/oneline.html> failed: 0x81a(Error Area:Io Class:Parameter Code:26))\n'

@muxator
Contributor

muxator commented Apr 4, 2020

Aaaand this is the reason why everyone always needs to have a good diff tool at hand.

Here's the difference between the output of the succeeding (directly on the shell) and the failing (js via child_process.spawn) execution:

-convert /tmp/oneline.odt -> /tmp/oneline.html using filter : XHTML Writer File:UTF8
+convert /tmp/oneline.odt -> /tmp/oneline.html using filter : "XHTML Writer File:UTF8"

Do you see that in the js (& python) execution there are two extra double quotes?

Now, in hindsight, it is easy to understand: the double quotes are needed only when executing via the shell, because the shell needs escaping for spaces.

# bash requires the double quotes BECAUSE of the spaces, to convey
# that "html:XHTML Writer File:UTF8" is really a single parameter
#
--convert-to "html:XHTML Writer File:UTF8"   

When we explicitly pass parameters to a subprocess via execve() (which is what all these library calls are doing in the end), there is no need to escape anything, because the parameters are already cleanly separated among the array elements.

Leave out the double quotes and it should work.

On my toy Javascript & python programs it worked.
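
The difference can be observed without LibreOffice at all. The following is a minimal sketch, assuming only that a `node` binary is available: it spawns a child Node.js process that echoes back its last argument, so the quotes that survive (or not) are visible directly. The helper name `argSeenByChild` is made up for this illustration.

```javascript
const { spawnSync } = require('child_process');

// Child script: print the last argument exactly as the child received it.
const echoLastArg = 'console.log(process.argv[process.argv.length - 1])';

// Hypothetical helper: returns the argument as seen by the child process.
function argSeenByChild(arg) {
  // No shell is involved here: each array element reaches the child verbatim.
  const result = spawnSync(process.execPath, ['-e', echoLastArg, arg], {
    encoding: 'utf8',
  });
  return result.stdout.trim();
}

const quoted = argSeenByChild('html:"XHTML Writer File:UTF8"');
const unquoted = argSeenByChild('html:XHTML Writer File:UTF8');

console.log(quoted);   // the double quotes are still part of the value
console.log(unquoted); // one clean argument, spaces included
```

Because the argument array is handed over element by element, the spaces never need protecting, and any quotes written into the string become part of the filter name that LibreOffice receives.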

@JohnMcLear
Member Author

JohnMcLear commented Apr 4, 2020

woop nice will add to my list of things to do 2mrw. Thanks!

@muxator
Contributor

muxator commented Apr 4, 2020

After looking at this snippet, I think we might have one more problem.

LibreOffice exits with a 0 exit code even when the conversion fails. It did when we were passing a bad parameter.

If Etherpad just checks for the process exit code it may mistake a failed conversion for a successful one.

We should check for something else in addition.

  1. the process stdout/stderr? That's fragile
  2. that the converted file was really created where it was supposed to be <-- I would trust this more.
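
Option 2 could look roughly like the sketch below. It is not Etherpad code: the helper name `outputFileLooksValid` and the "non-empty regular file" criterion are assumptions made for illustration, and the demo uses throwaway files instead of a real conversion.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: treat a conversion as successful only if the
// expected output file exists and is a non-empty regular file.
function outputFileLooksValid(outputPath) {
  try {
    const stats = fs.statSync(outputPath);
    return stats.isFile() && stats.size > 0;
  } catch (err) {
    if (err.code === 'ENOENT') return false; // the file was never created
    throw err; // anything else is unexpected, so surface it
  }
}

// Small demonstration with throwaway files.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), 'convert-check-'));
const present = path.join(demoDir, 'oneline.html');
const missing = path.join(demoDir, 'missing.html');
fs.writeFileSync(present, '<html></html>');

const presentOk = outputFileLooksValid(present);
const missingOk = outputFileLooksValid(missing);
console.log(presentOk, missingOk);

// Clean up the demo files.
fs.unlinkSync(present);
fs.rmdirSync(demoDir);
```

A check like this would run after the child process closes, regardless of its exit code, since the exit code was 0 even for the failed conversion shown earlier.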

@JohnMcLear
Member Author

  • Fix spawn command to not use "
  • Ensure output file exists with an fs.existsSync lookup.

@JohnMcLear JohnMcLear assigned JohnMcLear and unassigned muxator Apr 4, 2020
@JohnMcLear
Member Author

JohnMcLear commented Apr 5, 2020

Lol I just looked and I already WAS leaving out the double quotes btw :D

  if(type === "html") type = "html:XHTML Writer File:UTF8" // note how I'm not escaping the "
    queue.push({"srcFile": srcFile, "destFile": destFile, "type": type, "...
 var soffice = spawn(settings.soffice, [
        '--headless',
        '--invisible',
        '--nologo',
        '--nolockcheck',
        '--writer',
        '--convert-to', task.type,
        task.srcFile,

gonna test in isolation now to generate some logic that works/fails.

The below works:

var spawn = require("child_process").spawn;

var soffice = spawn("/usr/bin/soffice", [
  '--headless',
  '--invisible',
  '--nologo',
  '--nolockcheck',
  '--writer',
  '--convert-to', 'html:XHTML Writer File:UTF8',
  '/home/jose/test.doc',
  '--outdir', '/home/jose/'
]);

@JohnMcLear
Member Author

Yea, I'm fucking stumped.

Sanity check for me..

  1. Checkout develop && specify soffice in settings.json
  2. Import a word file, it works.
  3. Edit src/node/utils/LibreOffice.js Insert if (type === 'html') type = "html:XHTML Writer File:UTF8"; @ Line 43
  4. Restart Etherpad
  5. Import a word file, it fails.

Yet running this isolated:

var soffice = spawn("/usr/bin/soffice", [
  '--headless',
  '--invisible',
  '--nologo',
  '--nolockcheck',
  '--writer',
  '--convert-to', 'html:XHTML Writer File:UTF8',
  '/home/jose/test.doc',
  '--outdir', '/home/jose/'
]);

Works fine...

Can you confirm? Am I losing my mind?

@muxator
Contributor

muxator commented Apr 6, 2020

Sanity check for me..
[...]

I am going to try it now.

@muxator
Contributor

muxator commented Apr 6, 2020

Confirmed the issue.

What nodejs version are you running? Is it 12?

Edit: the error happens on both node 10 and 12, but on 12 it is harder to spot

@muxator
Contributor

muxator commented Apr 6, 2020

I have a suspicion.

To see what happens, you'll have to run Etherpad under nodejs 10 (on nodejs 12 a deprecated feature was removed and you won't see anything useful in the logs). See #3834 and #3841.

The type parameter somehow ends up in the filename of a temporary file we generate, or think we generated (a brittle choice in the first place 😄).

  • When type === 'html', the temporary file name is /tmp/upload_XXX.html.

  • When type === 'html:XHTML Writer File:UTF8', this is what is generated in the logs:

    [2020-04-06 22:39:31.270] [WARN] console - Converting Error: { [Error: ENOENT: no such file or directory, rename '/tmp/upload_f3354ec47b4d3989f15cd7bbaedd81de.html:XHTML Writer File:UTF8' -> '/tmp/etherpad_import_1137536363.html']
      errno: -2,
      code: 'ENOENT',
      syscall: 'rename',
      path:
     '/tmp/upload_f3354ec47b4d3989f15cd7bbaedd81de.html:XHTML Writer File:UTF8',
      dest: '/tmp/etherpad_import_1137536363.html' }
    Error: ENOENT: no such file or directory, rename '/tmp/upload_f3354ec47b4d3989f15cd7bbaedd81de.html:XHTML Writer File:UTF8' -> '/tmp/etherpad_import_1137536363.html'
    

    Moreover, in the next line, you'll see why we need to integrate #3841 ("Rewrite the customError module with class syntax"), or upgrade log4js, and why this is not visible under nodejs 12:

    [2020-04-06 22:39:31.274] [ERROR] console - (node:13895) [DEP0079] DeprecationWarning: Custom inspection function on Objects via .inspect() is deprecated
    

@muxator
Contributor

muxator commented Apr 6, 2020

I think that the line that needs to be changed is this one:

var sourceFilename = filename.substr(0, filename.lastIndexOf('.')) + '.' + task.type;

But I really would like to see this become more robust (without encoding information in the temporary file name).

Some more monkey patching on that file would be almost immoral. 😭
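
The effect of that line can be reproduced in isolation. In the sketch below, the first function wraps the quoted line unchanged; the second is a hypothetical variant that keeps only the part of the type before the first ":". The upload path is invented for the example.

```javascript
// The line quoted above, wrapped in a function for illustration.
function sourceFilenameCurrent(filename, type) {
  return filename.substr(0, filename.lastIndexOf('.')) + '.' + type;
}

// Hypothetical variant: only the part of the type before the first ':'
// is used as the file extension.
function sourceFilenameStripped(filename, type) {
  const extension = type.split(':')[0];
  return filename.slice(0, filename.lastIndexOf('.')) + '.' + extension;
}

const upload = '/tmp/upload_abc123.doc'; // made-up example path
const longType = 'html:XHTML Writer File:UTF8';

const current = sourceFilenameCurrent(upload, longType);
const stripped = sourceFilenameStripped(upload, longType);

console.log(current);  // /tmp/upload_abc123.html:XHTML Writer File:UTF8
console.log(stripped); // /tmp/upload_abc123.html
```

The first result has the same shape as the path in the ENOENT log above: the filter name leaks into the extension, so the later rename looks for a file that LibreOffice never wrote.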

@JohnMcLear
Member Author

How the funk did I not see that?!

I'm running v13.10.1 and that might explain why but wow, what a crazy situation tho..

Solution, pass the padId+somerandomnonce() as the output filename string?

@JohnMcLear
Member Author

Yea, agreed. This whole import/export pass to third party binaries is a mess. Gonna fix this, make it stable then will do a good review for future versions.

@muxator
Contributor

muxator commented Apr 6, 2020

  1. Solution, pass the padId+somerandomnonce() as the output filename string?

    You mean, ask libreoffice to generate a converted file with a name we control? Something like that, if LibreOffice's command line allows it.

    I would prefer that.

  2. or mess with task.type here:

    var sourceFilename = filename.substr(0, filename.lastIndexOf('.')) + '.' + task.type;

    Removing everything after the first ":", if it exists.

    But it stinks.
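
Option 1 might be approximated without any extra LibreOffice flag by giving every conversion its own output directory. This is only a sketch under assumptions: it relies on LibreOffice naming the result after the source file's base name (as in the "/tmp/oneline.odt -> /tmp/oneline.html" output earlier in this thread), and `planConversion` and the `etherpad_convert_` prefix are hypothetical names. It builds the argument list but does not run soffice.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: plan a conversion into a fresh, unique directory so
// that the location of the converted file is known in advance.
function planConversion(srcFile, type) {
  const outDir = fs.mkdtempSync(path.join(os.tmpdir(), 'etherpad_convert_'));
  const extension = type.split(':')[0]; // "html" for "html:XHTML Writer File:UTF8"
  const baseName = path.basename(srcFile, path.extname(srcFile));
  return {
    outDir,
    // Assumption: the output is named <source base name>.<extension>.
    expectedOutput: path.join(outDir, baseName + '.' + extension),
    // Only flags that already appear in this thread are used.
    args: ['--headless', '--convert-to', type, '--outdir', outDir, srcFile],
  };
}

const plan = planConversion('/tmp/oneline.odt', 'html:XHTML Writer File:UTF8');
console.log(plan.args.join(' '));
console.log(plan.expectedOutput);

// Remove the (still empty) demo directory.
fs.rmdirSync(plan.outDir);
```

With a layout like this, the caller would pass `plan.args` to spawn and afterwards look only at `plan.expectedOutput`, so no conversion detail has to be encoded in a temporary file name.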

In the next commit, we are going to change the conversion method to
"html:XHTML Writer File:UTF8". Without this change, that conversion method name
would end up in the extension of the temporary file that is created as an
intermediate step. In this way, the file extension will always stay ".html".

No functional changes, hopefully. Only the extension of the temporary file
should change.
This yields better conversion results, but requires the previous change,
otherwise there would have been difficulties in locating the temporary file
name.
@JohnMcLear
Member Author

Right, let's get this moving. Please merge this PR as it is now and then I can keep moving forward with changes. I'm going to hit problems with contentcollector.js and my list work really soon!

@JohnMcLear JohnMcLear changed the title Fixes #3019 - Better import from LO / OO / soffice - don't import 2 bullets in place of one - Need help [PRIORITY #2] Fixes #3019 - Better import from LO / OO / soffice - READY TO MERGE Apr 7, 2020
@muxator muxator force-pushed the 3019 branch 2 times, most recently from 1351c22 to 2b83227 Compare April 8, 2020 20:49
@muxator muxator merged commit 08b83ae into ether:develop Apr 8, 2020
@muxator
Contributor

muxator commented Apr 8, 2020

Merged with some reordering of the commits.

Thanks.

@muxator muxator changed the title [PRIORITY #2] Fixes #3019 - Better import from LO / OO / soffice - READY TO MERGE Fixes #3019 - Better import from LO / OO / soffice Apr 8, 2020