Convert PDF graphics to scalable SVGs #902

Open
bfirsh opened this issue Dec 8, 2017 · 10 comments

Comments

@bfirsh
Contributor

bfirsh commented Dec 8, 2017

When using --graphicsmap=pdf.svg, the graphic is converted to an SVG that contains a raster rendering of the PDF. I would expect it to be converted to a vector graphic. The same presumably applies to EPS, AI, and PS.

For Engrafo, we had success using pdf2svg. Presumably the same result can be achieved by piping it through the same LaTeX rendering system that renders math/tikzpicture as SVG.

@bfirsh bfirsh changed the title from "PDF, EPS, etc graphics should be converted to scalable SVGs" to "Convert PDF graphics to scalable SVGs" Dec 8, 2017
@brucemiller
Owner

But does it always generate a raster? I'd think it would depend on what kinds of drawing are in the PDF itself: more line-oriented drawings would generate vectors, but a PDF that has a raster embedded is going to generate a raster in the SVG. Do you have any small samples where you'd expect vectors?

[We're using ImageMagick for almost all image conversion. It's finicky, but it's nice to be able to rely on a single tool/dependency]

@bfirsh
Contributor Author

bfirsh commented Dec 8, 2017

If it’s vector in the source PDF, it’s vector in the output SVG. If it’s raster in the source PDF, it’s raster in the output SVG. That’s what pdf2svg does.
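For reference, a minimal sketch of driving pdf2svg from a script. It assumes the `pdf2svg` binary is on PATH and uses its documented `pdf2svg in.pdf out.svg [page]` invocation; the helper names `pdf2svg_command` and `convert_pdf_to_svg` and the 60-second default timeout are hypothetical choices for illustration, not anything latexml or Engrafo is known to use.

```python
import shutil
import subprocess
from pathlib import Path


def pdf2svg_command(pdf_path, svg_path, page=None):
    """Build the argument list: pdf2svg IN.pdf OUT.svg [PAGE]."""
    cmd = ["pdf2svg", str(pdf_path), str(svg_path)]
    if page is not None:
        cmd.append(str(page))
    return cmd


def convert_pdf_to_svg(pdf_path, svg_path=None, page=1, timeout=60):
    """Convert one page of a PDF to SVG and return the output path.

    Raises RuntimeError if pdf2svg is missing, CalledProcessError if it
    exits non-zero, and TimeoutExpired if it runs longer than `timeout`.
    """
    if shutil.which("pdf2svg") is None:
        raise RuntimeError("pdf2svg is not installed or not on PATH")
    pdf_path = Path(pdf_path)
    # Default output: same name as the input, with an .svg suffix.
    svg_path = Path(svg_path) if svg_path else pdf_path.with_suffix(".svg")
    subprocess.run(pdf2svg_command(pdf_path, svg_path, page),
                   check=True, timeout=timeout)
    return svg_path
```

Whether the result is vector or raster still depends entirely on what the source PDF contains, as described above.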

Presumably the same process which produces math and tikz SVGs would do the same thing? (I’m not sure how that works but it seems to run them through TeX then somehow outputs an SVG.) If that system were used, then it’s not adding any additional dependencies.

@brucemiller
Owner

brucemiller commented Dec 8, 2017 via email

@dginev
Collaborator

dginev commented Dec 8, 2017

Oh, this is a topic I have some very painful experience with, so maybe I can say a few words.

First, imagemagick has "pathological" behavior on certain (very hard to classify or predict) PDF/EPS inputs, in particular PDFs that encode vectorial graphics. What I mean by pathological is that it will do any of the following: loop forever at runtime, run out of memory, or fail silently with no image produced and files left over on the filesystem...

One workaround that was widely used and approved in places such as StackOverflow was to delegate vectorial PDFs to a different processing engine, in particular a headless inkscape process. I have seen this work very reliably in the past, but it also shows the inverse problem: there are pathological images that don't convert in inkscape in, say, 10 minutes, yet finish in a few seconds in imagemagick. And vice versa, generally following the vectors vs pixels distinction.

On the upside, and to go back to your discussion here, when inkscape succeeds with the conversion the resulting SVG truly preserves the vectorial definitions in the PDF, and the final image is high quality (if you get the appropriate web fonts to match the PDF fonts; otherwise you get authors writing to you that the kerning is off by several millimeters...).
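A rough sketch of that delegation idea: try one engine, and fall back to the other if it fails or takes too long. The helper names are hypothetical. The inkscape flags shown are the Inkscape 1.x ones (`--export-type`, `--export-filename`); older 0.9x releases used different options. `convert` is the ImageMagick 6 entry point (ImageMagick 7 uses `magick`).

```python
import subprocess


def inkscape_command(pdf_path, svg_path):
    # Inkscape 1.x syntax (assumption: a 1.x install is on PATH).
    return ["inkscape", "--export-type=svg",
            f"--export-filename={svg_path}", str(pdf_path)]


def imagemagick_command(pdf_path, png_path):
    # ImageMagick 6 "convert"; rasterizes the PDF.
    return ["convert", str(pdf_path), str(png_path)]


def first_success(commands, timeout):
    """Run each command in order with a per-command timeout.

    Returns the index of the first command that exits with status 0,
    or None if every command fails, times out, or cannot be started.
    """
    for index, cmd in enumerate(commands):
        try:
            result = subprocess.run(cmd, timeout=timeout,
                                    stdout=subprocess.DEVNULL,
                                    stderr=subprocess.DEVNULL)
        except (subprocess.TimeoutExpired, OSError):
            # Timed out, or the tool is not installed: try the next one.
            continue
        if result.returncode == 0:
            return index
    return None
```

For example, `first_success([inkscape_command("fig.pdf", "fig.svg"), imagemagick_command("fig.pdf", "fig.png")], timeout=60)` would prefer the vector route and only rasterize when inkscape gives up; the caller then needs to check which output file was actually produced.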

Should latexml rely on inkscape as well for these cases? Hard to say, it is a rather large dependency and may feel better as a plugin than as a core component. There are entire companies that deal with image conversion / hosting so it's an admittedly large and non-trivial problem. There may also be some space to think about where an exact line needs to be drawn between latexml and a general-purpose image conversion tool.

Lastly, arXiv conversions have given me endless grief in these "pathological" cases: you can't rely on the latexml process to time out or back out of an underlying infinite loop in C (where imagemagick can get stuck), so you need an external watchdog process to monitor it. This is a big part of what LaTeXML-Plugin-Cortex ended up covering.
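To illustrate what such an external watchdog amounts to, here is a minimal POSIX-only sketch. It is not how LaTeXML-Plugin-Cortex is implemented; `run_with_watchdog` is a hypothetical helper. The child is started in its own session so that helpers it spawns (for example a ghostscript delegate) share its process group and are killed along with it.

```python
import os
import signal
import subprocess


def run_with_watchdog(cmd, timeout):
    """Run `cmd`, killing its whole process group after `timeout` seconds.

    Returns the exit status, or None if the watchdog had to kill it.
    POSIX only: relies on sessions/process groups and os.killpg.
    """
    # start_new_session=True makes the child a session and group leader,
    # so its process group id equals its pid.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # SIGKILL cannot be caught, so a stuck C loop is still terminated.
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()  # reap the child so it does not linger as a zombie
        return None
```

The important design point is that the timeout lives outside the process that can hang, which is exactly why an in-process timer inside latexml is not enough.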

Related issues for some background: #663 and #666

@dginev dginev added this to the LaTeXML-0.8.4 milestone Dec 8, 2017
@bfirsh
Contributor Author

bfirsh commented Dec 14, 2017

@brucemiller Here's an example of a vector PDF figure: cifar10_48-48-10_batch_10_plot.pdf

Here's the latexml output: https://www.arxiv-vanity.com/papers/1703.00441v2/#S4.F2.sf3

It is also particularly fuzzy because the DPI is not configurable, but that's another problem!

pdf2svg converted this to a scalable SVG without problems.

@bfirsh
Contributor Author

bfirsh commented Dec 14, 2017

@dginev Agreed that a plugin is a good place to start. Perhaps it could be optional core functionality, so it isn't a hard dependency.

I might take a shot at a pdf2svg plugin to fix this for Arxiv Vanity, if I get round to it.

@brucemiller
Owner

Ah, yes, of course ImageMagick isn't preserving the vectorness; it's basically a pipeline of raster operations, so the first thing it'll want to do is convert to an internal raster. By the same token, even if we introduce a dependency on pdf2svg to keep the image in vector form, we'd need a vector alternative duplicating the whole transform sequence (all the stuff that graphicx brings in). With SVG, this is of course possible, maybe even "easy" in some sense, but it means a whole bunch of new code & testing. In other words, a bit tricky.
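As a very rough illustration of what a vector version of one such transform step could look like, the sketch below maps a small subset of graphicx-style options (`width`, `scale`, `angle`) to an SVG `transform` string. Everything here is an assumption for illustration: `svg_transform` is a hypothetical helper, it only handles a scale-then-rotate sequence, and it ignores option ordering, rotation origin, clipping, and bounding-box recomputation, which a real implementation would have to cover.

```python
def svg_transform(natural_width, width=None, scale=None, angle=0.0):
    """Return an SVG transform string for a scale-then-rotate sequence.

    `natural_width` and `width` must be in the same unit. `angle` is in
    degrees, counter-clockwise as in graphicx; SVG's rotate() turns
    clockwise in its default y-down coordinate system, hence the sign flip.
    """
    factor = 1.0
    if width is not None:
        factor = width / natural_width  # uniform scale to reach the width
    elif scale is not None:
        factor = scale
    parts = []
    # SVG applies the rightmost transform first, so scale comes last here.
    if angle:
        parts.append(f"rotate({-angle:g})")
    if factor != 1.0:
        parts.append(f"scale({factor:g})")
    return " ".join(parts)
```

The string could then be placed on a `<g transform="...">` wrapper around the converted graphic instead of resampling pixels.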

@dginev
Collaborator

dginev commented Apr 8, 2019

A bit too open-ended for 0.8.4, pushing back to 0.9 until we have an attack plan in mind.

@dginev
Collaborator

dginev commented Jan 25, 2022

To pin down a concrete high-difficulty test for this direction of work, today I encountered arXiv:1804.00311.

That article has multiple graphics using PDF assets which take north of 5 minutes to convert via ghostscript -- and even encounter API errors for metadata operations, such as obtaining the size. If we can become more efficient and correct in such cases, as we also start producing SVG for them, that would be an excellent outcome.

@dginev
Collaborator

dginev commented Mar 1, 2024

Today I also stumbled on another gs-intensive example from arXiv:1807.01606. Attaching one PDF asset for future testing - it takes 11 minutes to execute gs on my machine.


fig8.pdf

\documentclass{article}
\usepackage{graphicx}
\begin{document}
\includegraphics[width=10cm]{fig8.pdf}
\end{document}

Resulting PNG:

image

Since the article has 15 of these PDFs, it reliably times out with the current build setup.


3 participants