Speed up entity bounding box detection #116
Also of note is this alternative technique for colorizing spans of text in LaTeX. I suspect it would run into the very same errors of whatsits causing subtle changes in the spacing of text, though it might be worth looking at further: https://tex.stackexchange.com/a/116907/198728
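For context, the kind of colorization under discussion wraps each entity in a color group along these lines (a minimal sketch; the macro name is hypothetical, not the pipeline's actual command):

```latex
% Hypothetical colorization macro: wrap an entity in a color group.
% \color is implemented with whatsit nodes (color specials or the
% pdfcolorstack), and those extra nodes in the token stream are what
% can subtly perturb line breaking and spacing.
\newcommand{\scolorize}[2]{{\color[rgb]{#1}#2}}
% usage: $\scolorize{1,0,0}{x_i}$ renders the symbol x_i in red
```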
One additional idea for scaling up the coloring is to copy over the output and auxiliary files from the uncolorized LaTeX before compiling the colorized code, with the hopes that only the last LaTeX compilation needs to be re-run.
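A rough sketch of that copy step, assuming the colorized sources are compiled in a separate directory and that the usual LaTeX auxiliary extensions are the ones worth carrying over (the extension list is a guess, not taken from AutoTeX):

```python
import shutil
from pathlib import Path

# A guess at the auxiliary files whose reuse lets latex skip the early
# passes (cross-references, bibliography, tables of contents, etc.).
AUX_EXTENSIONS = {".aux", ".bbl", ".toc", ".lof", ".lot", ".out"}

def copy_aux_files(uncolorized_dir: Path, colorized_dir: Path) -> list:
    """Copy auxiliary output from a prior uncolorized compile so that,
    with luck, only the final pass of the colorized compile must re-run."""
    copied = []
    for path in uncolorized_dir.iterdir():
        if path.suffix in AUX_EXTENSIONS:
            shutil.copy(path, colorized_dir / path.name)
            copied.append(path.name)
    return copied
```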
Another idea for speeding up entity bounding box detection is:
Comments from @kyleclo on the above comment, copied from Zoom chat: let's suppose we can get bounding boxes for every token in grobid, and only for math symbols in latex (e.g., between $$).
In latex:
Bboxes from grobid:
Bboxes from latex:
An NLP model (e.g., Dongyeop's) takes the LaTeX as input and says that it wants the LaTeX sentence above. But we only have bounding boxes for the symbols, not the entire sentence. How do we surface the bounding box for the entire sentence? We can identify grobid bbox candidates by somehow matching the grobid tokens to the LaTeX sentence.
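One way the matching step could be sketched: score each grobid sentence by how many of the LaTeX symbol bounding boxes its token boxes overlap, then surface the token boxes of the best-scoring sentence. The names and the scoring rule here are hypothetical, not something settled in this thread:

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom), same page

def overlaps(a: Box, b: Box) -> bool:
    """True if two axis-aligned boxes intersect."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def best_sentence(symbol_boxes: List[Box],
                  grobid_sentences: List[List[Box]]) -> int:
    """Index of the grobid sentence (a list of token boxes) that overlaps
    the most LaTeX symbol boxes; its token boxes could then be surfaced
    as the bounding boxes for the whole sentence."""
    def score(token_boxes: List[Box]) -> int:
        return sum(any(overlaps(s, t) for t in token_boxes)
                   for s in symbol_boxes)
    return max(range(len(grobid_sentences)),
               key=lambda i: score(grobid_sentences[i]))
```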
Following up on the above idea...
I spent a bit of time looking into the AutoTeX source code. My conclusion is that we will need to make some light modifications to AutoTeX, or monkey-patch it, if we want to reduce the number of compilations needed per paper. The issue seems to come down to one method in the AutoTeX code. I think this effort is probably worth it. It just stinks that it means we won't be able to simply install the most recent version of AutoTeX through cpanm.
Some of these ideas were implemented as of a recent commit 3e0abb4. Specifically:
One important source of speedup is to make it faster to compile TeX projects and raster their pages. For one of the papers we're processing for the S2 reading group---https://arxiv.org/abs/1909.13433---there's a case where it takes 40 seconds to raster the pages of the paper. It turns out the PDF embeds about a dozen other PDFs as figures, and those embedded PDFs, I believe, contain thousands of objects. This seems to be why rastering takes so long. My fix for processing the paper was to open the directory, convert all of the figures from PDFs to PNGs (updating the references to those figures to point to the PNGs), and then package up the directory again, using this as the archive for that TeX project. This decreased the rastering time to no more than 1 or 2 seconds. In the future, this might be something that could usefully be automated in the pipeline.
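That manual fix could plausibly be automated along these lines, using pdftoppm from poppler-utils to rasterize each embedded PDF figure and a regex to repoint \includegraphics at the PNGs. This is a sketch under those assumptions, not the pipeline's code:

```python
import re
import subprocess
from pathlib import Path

def convert_figure(pdf_path: Path) -> Path:
    """Rasterize one embedded PDF figure to a PNG with pdftoppm.
    Assumes poppler-utils is installed and the figure is a single page."""
    png_stem = pdf_path.with_suffix("")
    subprocess.run(
        ["pdftoppm", "-png", "-singlefile", str(pdf_path), str(png_stem)],
        check=True,
    )
    return png_stem.with_suffix(".png")

def rewrite_figure_references(tex: str) -> str:
    """Repoint \\includegraphics commands from .pdf figures to the
    converted .png versions."""
    return re.sub(r"(\\includegraphics(?:\[[^\]]*\])?\{[^}]*?)\.pdf\}",
                  r"\1.png}", tex)
```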
This issue is being closed as it is overly broad. See #132 for a concrete discussion of one idea for further improving the speed of the pipeline.
I'm reopening this issue as I'm hoping to speed up entity processing for some of the papers. In my recent analyses of pipeline performance on a small number of long-running papers, I have found that the longest-running stage is always compiling the LaTeX. The second-longest stage is either locating hues in the image diffs or, in just a few cases, rastering the images. It does not appear to take very long to difference the images, or to scan and colorize the TeX. The implication is that we will likely get the most payoff by decreasing the time spent compiling LaTeX. To repeat some of the ideas we've thought through, they include:
In the current version of the pipeline, the highest accuracy entity detections come from detecting one entity at a time. This does not scale well (i.e., it leads to some papers taking an hour or more to process).
This issue proposes how to speed up the detection of entities. Ideas include:
To batch process despite LaTeX compilation errors: Add a print message to the LaTeX before each entity colorization command. If that print message appears right before a LaTeX parse error, add that colorization command to a skiplist, remove it from the batch, terminate TeX compilation as quickly as possible, and try again.
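The log-scanning half of this idea might look like the following, where the marker string and the assumption that the printed messages land on their own log lines are both hypothetical:

```python
from typing import Optional

# Hypothetical marker printed (e.g., via \typeout) before each
# colorization command in the batched LaTeX.
MARKER_PREFIX = "S2-COLORIZE-MARKER:"

def find_failing_entity(log: str) -> Optional[str]:
    """Return the id from the last colorization marker printed before the
    first LaTeX error, i.e., the entity to add to the skiplist.
    Returns None if the log contains no error."""
    last_marker = None
    for line in log.splitlines():
        if line.startswith(MARKER_PREFIX):
            last_marker = line[len(MARKER_PREFIX):].strip()
        elif line.startswith("!"):  # LaTeX error messages begin with "!"
            return last_marker
    return None
```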
To batch process despite colorization changing text spacing: A better way is needed to detect which colorization commands change the spacing of the text. In single-column papers, there is a simple approach: find the first symbol directly to the left or above the shifted text; it is likely the cause of the shifted text. Add it to the skiplist, remove it from the batch, and try again. For two-column papers, or trickier cases in single-column papers, a more sophisticated approach is needed.
Perhaps optical flow can be used to detect which symbols in a batch have shifted positions, and the first symbol before the shifted symbols is marked as disruptive and removed from the batch.
Heuristics similar to those proposed for single-column papers can be used, accepting that batch processing will sometimes still be inefficient because the wrong symbols get removed from the batch.
The text after each symbol can be given a color. When shifted text has a specific color, it is known which symbol it follows, and that symbol can be removed from the batch. My intuition is that this approach would provide the best trade-off for accuracy in detecting which symbols cause spacing issues, while being fairly straightforward to implement.
The text for each paragraph can be assigned a different color. That way, it is known in which paragraph the text first started to shift, and the symbol that caused the shift would be the first one in the paragraph to appear at a pixel position before the shifted text (i.e., just to the left of it, or just above it). This gets trickier if a paragraph is split across columns, though heuristics could be used to detect which part of the paragraph appeared in earlier columns (e.g., by looking at the horizontal spacing between chunks of color belonging to a column). The advantage of this approach is that colorization commands added at the very start and end of a paragraph would, I expect (though don't know for sure), be less likely to introduce changes to the text spacing themselves.
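The hue-based attribution in the ideas above (color the text following each symbol, then see which hues moved) could be sketched like this; the data layout and all names are hypothetical:

```python
from typing import Dict, List, Set, Tuple

Pixel = Tuple[int, int]  # (row, col) in the rastered page

def shifted_hues(before: Dict[int, Set[Pixel]],
                 after: Dict[int, Set[Pixel]]) -> Set[int]:
    """Hues whose pixel footprint moved between the baseline raster and
    the colorized raster."""
    return {hue for hue in before if after.get(hue, set()) != before[hue]}

def symbols_to_skip(moved: Set[int],
                    hue_to_symbol: Dict[int, str]) -> List[str]:
    """Each hue colors the text *following* one symbol, so a moved hue
    implicates that symbol; these get removed from the batch and
    colorized on their own."""
    return sorted({hue_to_symbol[h] for h in moved if h in hue_to_symbol})
```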