Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Safer ImageMagick conversions #663

Closed
dginev opened this issue Sep 23, 2015 · 19 comments
Closed

Safer ImageMagick conversions #663

dginev opened this issue Sep 23, 2015 · 19 comments
Assignees
Milestone

Comments

@dginev
Copy link
Collaborator

dginev commented Sep 23, 2015

I had reported in the last CorTeX run (way back when in 2014) that there are "out of control" gs processes, running for hours and hogging tons of RAM, essentially in an infinite loop.

I have seen this problem at work also, and would like to share the taste of the solution - although we may need to think how exactly to adapt it to LaTeXML.

The best fix I know of is applicable to manually invoking convert and looks like this (the numbers can be of course adjusted):

timeout --signal=9 600 nice -n 15 convert [options] ...

It ensures two aspects:

  • convert (and its children) will never block the CPU by hogging all the cycles in some busy loop, due to the very high niceness
  • convert (and its children) will never end up badly broken / looping forever - via the external monitoring of timeout and the merciless 9 signal.

I am not immediately aware of the best way to transfer this to the LaTeXML calls to convert, but I definitely consider it a wise idea, as it would help the predictability of LaTeXML in all use cases.

@dginev dginev added this to the LaTeXML-0.8.3 milestone Sep 23, 2015
@dginev
Copy link
Collaborator Author

dginev commented Sep 23, 2015

Also, since I am planning on rerunning arXiv in the coming couple of weeks, and I am certain I will encounter this exact problem, having some solution prior the run would be very helpful.

On the other hand I am running latexmlc explicitly there, so I could do the timeout+nice combo as a prefix to the latexmlc call. Yes, that's certainly a stopgap solution.

@brucemiller
Copy link
Owner

On 09/23/2015 12:58 PM, Deyan Ginev wrote:

I am not immediately aware of the best way to transfer this to the
LaTeXML calls to convert, but I definitely consider it a wise idea, as
it would help the predictability of LaTeXML in all use cases.

One big difficult is that we make the conversions through
method calls to Image::Magick, rather via a command line.

Indeed, the very fact that external programs are being invoked
is pretty much hidden from LaTeXML; we simply ask Image::Magick
to read an eps (in this case) image and rasterize it, without
any overt knowledge of how that's being done. Image::Magick
is in control of the effective command line being invoked
and really should be the target of a fix -- or alternatively
ghostscript. They might be responsive (though I've had difficulties).

It is conceivable (though non-trivial, if I remember the coding)
to convert the sequence of Image::Magick calls into a command line.
But we really want to avoid that since it may introduce
more problems through portability than it would solve.

In any case, do you have a test document that reproducibly
causes this problem (presumably just a bad *.eps with an
\includegraphics command?) Then we can explore alternatives.

@dginev
Copy link
Collaborator Author

dginev commented Sep 23, 2015

Yes, I now have a really bad EPS file that I can share, but not yet publicly. I'll get permission to mail it to you in private. It's embargo date expires soon, so we could eventually make it part of the test suite.

@dginev
Copy link
Collaborator Author

dginev commented Sep 24, 2015

So, just checked, and it seems the Perl ImageMagick module uses XS to directly interface with the imagemagick system libraries. Which is nice because it is efficient, but also explains why LaTeXML itself becomes "very dead" in some cases, where only a kill -9 would terminate the process.

I do, however, have a pragmatic suggestion - and that is to do a Perl-level fork() call for each and every image conversion job, and monitor the converting process from the main LaTeXML process. That way we remain insulated from any critical errors induced by the image conversion process.

Does that sound acceptable to you? I can whip up a PR today/tomorrow to illustrate.

@brucemiller
Copy link
Owner

On 09/24/2015 11:52 AM, Deyan Ginev wrote:

So, just checked, and it seems the Perl ImageMagick module uses XS to
directly interface with the imagemagick system libraries. Which is nice
because it is efficient, but also explains why LaTeXML itself becomes
"very dead" in some cases, where only a |kill -9| would terminate the
process.

Yes, but it sounds like the only problem area is when the imagemagick
libraries are "calling out" to ghostscript. However they make the
arrangements, they're not protecting us; that's where the nicety
and timeouts would be useful.

And it turns out there's a file delegates.xml that describes how
to decode such formats, but it isn't clear that we can/should modify
it from within LaTeXML?

Alternatively, if it's really only ps (pdf? AI?) that we need to
worry about, perhaps we should handle the rasterization ourselves?
(by our own calls to gs? ugh...)

I do, however, have a pragmatic suggestion - and that is to do a
Perl-level |fork()| call for each and every image conversion job, and
monitor the converting process from the main LaTeXML process. That way
we remain insulated from any critical errors induced by the image
conversion process.

Does that sound acceptable to you? I can whip up a PR today/tomorrow to
illustrate.

Yeah, that sounds like the necessary approach, I guess.
It's probably the $image->Read that's the main concern,
if it's the gs conversion. (after $image = image_object(...))

Kinda worried about performance: math to image conversion
is an expensive step (when performed), so this will add to it.

@dginev
Copy link
Collaborator Author

dginev commented Sep 24, 2015

I think we can only fork when it is a user supplied image file, not when we are doing latex-based images. I am yet to see that fail so badly.

@brucemiller
Copy link
Owner

On 09/24/2015 01:41 PM, Deyan Ginev wrote:

I think we can only fork when it is a user supplied image file, not when
we are doing latex-based images. I am yet to see that fail so badly.

probably right and I was just thinking the same thing: we can likely
trust the dvi engines to generate "proper" postscript, but author
supplied...
we'll they're really author supplied programs! Probably should sandbox them
as well!!

@dginev
Copy link
Collaborator Author

dginev commented Sep 24, 2015

Btw, I just surveyed a random sample of the arXiv submissions in 2014 and 2015 - there are still a LOT of .eps files in the bundles, the format is here to stay. And I anticipate the more modern the image, the more likely to give convert trouble - still digging for that example from a paper submitted just a couple of months ago.

@brucemiller
Copy link
Owner

On 09/24/2015 02:06 PM, Deyan Ginev wrote:

Btw, I just survery a random sample of the arXiv submissions in 2014 and
2015 - there are still a LOT of |.eps| files in the bundles, the format
is here to stay. And I anticipate the more modern the image, the more
likely to give convert trouble - still digging for that example from a
paper submitted just a couple of months ago.

It might be that only LaTeXML::Util::Image::image_read needs the
special treatment, so it could be built in there?

[although LaTeXImages uses the sequence image_object, ->Read, which
probably is supposed to be equivalent ??? don't remember)

@brucemiller
Copy link
Owner

On 09/24/2015 02:06 PM, Deyan Ginev wrote:

Btw, I just survery a random sample of the arXiv submissions in 2014 and
2015 - there are still a LOT of |.eps| files in the bundles, the format
is here to stay. And I anticipate the more modern the image, the more
likely to give convert trouble - still digging for that example from a
paper submitted just a couple of months ago.

Oh, and I think that ghostscript gets invoked on pdf and ai (adobe
illustrator,
which I think is just a variant of pdf); don't know if pdf is capable
of wedging ghostscript though. At any rate, pdf is probably becoming
more common.

@dginev
Copy link
Collaborator Author

dginev commented Sep 24, 2015

Correct, I have also seen recent use of PDF for storing very high quality vector graphics (which shocked me out of my mind).

@brucemiller
Copy link
Owner

Pull request #666 gave a seemingly sensible patch to manage this problem, via setting Image Magick specific environment variables. However, testing strongly suggested that the timing control MAGICK_TIME_LIMIT actually is measuring the calling program's total execution time (in this case latexmlpost or latexmlc; presumably expecting convert), instead of the time actually processing images. If this is true, that would suggest that a long running server, processing many jobs will eventually get timed out by the ImageMagick time limit!

And this raises the question of what the various memory and disk limits are measuring; are they also just measuring the current process?

I'm inclined to comment out the code in this patch, leaving it (and maybe additional comments here) as documentation of what you might try if you're in a server situation. I would also probably close this bug as "unfixable" for routine use: if you're running from the command line, just kill it; in server situations, you've just got to protect it more.

@dginev
Copy link
Collaborator Author

dginev commented Oct 8, 2015

It is fixable, we just need to fork() the image conversion and track its time from the parent process. Perl can also kill -9. It's sad that the option doesn't do what we expect it to, but we can emulate it nicely.

@dginev
Copy link
Collaborator Author

dginev commented Oct 8, 2015

Also, if we measure the execution time of the current process, we can add the time on top of the max allowed image conversion time to the current runtime and set MAGICK_TIME_LIMIT just before we start the image conversion. There are many ways to bend over and adapt an otherwise reliable timeout mechanism.

@brucemiller
Copy link
Owner

On 10/08/2015 04:00 PM, Deyan Ginev wrote:

Also, if we measure the execution time of the current process, we can
add the time on top of the max allowed image conversion time to the
current runtime and set |MAGICK_TIME_LIMIT| just before we start the
image conversion. There are many ways to bend over and adapt an
otherwise reliable timeout mechanism.

Yeah, perhaps you're right. It still seems like
a very server-specific approach, though.

@brucemiller
Copy link
Owner

Have you had any more exprerience with bad ghostscript on CorTeX? Does it suggest anything sane that can be done from within LaTeXML? Or is this just an issue that any hardened server will have to protect against using timeouts & environment variables?

@dginev
Copy link
Collaborator Author

dginev commented Dec 16, 2015

I have had plenty of bad experiences yes. But now I have the ENV timeouts in place for ghostscript, which work wonderfully: https://github.com/dginev/LaTeXML-Plugin-Cortex/blob/master/bin/latexml_worker#L50

I still consider this an aspect LaTeXML can be better at, yes. The benefit of doing it internally is that we can have informative error messages on why the timeout occurred, as opposed to a dirty kill of latexml (which CorTeX still does after 20 minutes and 20 seconds, for some pathological jobs). And that we can do it once and not force any latexml user do it from scratch, especially since there are situations where the problems are pathological and require a KILL signal.

@dginev
Copy link
Collaborator Author

dginev commented May 24, 2016

Upon my recent investigation of Perl's safe signals in #741 , I was reminded of the ImageMagick breakage from arXiv, not to mention the newly found http://imagetragick.com

I would love to have LaTeXML take some protective measures against its image conversion dependencies, as otherwise using it for images in production is a no-go (luckily no need for that in the Authorea context).

@brucemiller
Copy link
Owner

I'm going to claim that LaTeXML is doing as much as it feasibly can, here; partly just to shorten the list of issues to "possible' ones.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants