Are there any samples or a corpus for a test kit? #835
Comments
Um, hello. What exactly are you requesting in this issue? LaTeXML already has an (integration) test suite that covers most of the core LaTeX conversion features. The suite is lacking in post-processing tests, and we welcome external contributions: you can grab a showcase article of your choice, run it through the JATS conversion, and compare against an ideal test case. Ideally we would have had this already, but with limited time we tend to prioritize core features over the periphery. As to a large corpus LaTeXML has been run over, there is an active effort (although also low on manpower in recent months) for converting arXiv.org to HTML. A report on that can be found here: And the latest stats here:
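For illustration, running a single showcase article could look roughly like the sketch below (assuming `latexml` and `latexmlpost` are on your PATH; the `LaTeXML-jats.xsl` stylesheet name is just a placeholder, so check your installation for the actual option that produces JATS):

```python
import subprocess
from pathlib import Path

def convert_to_jats(tex_file: str, workdir: str = "out") -> Path:
    """Convert one LaTeX article to a JATS candidate via LaTeXML's two-step pipeline."""
    out = Path(workdir)
    out.mkdir(exist_ok=True)
    xml = out / (Path(tex_file).stem + ".xml")
    jats = out / (Path(tex_file).stem + ".jats.xml")

    # Step 1: TeX -> LaTeXML's internal XML.
    subprocess.run(["latexml", f"--destination={xml}", tex_file], check=True)

    # Step 2: post-process toward JATS.
    # "LaTeXML-jats.xsl" is a hypothetical stylesheet name used for illustration.
    subprocess.run(
        ["latexmlpost", f"--destination={jats}",
         "--stylesheet=LaTeXML-jats.xsl", str(xml)],
        check=True,
    )
    return jats

if __name__ == "__main__":
    produced = convert_to_jats("showcase-article.tex")
    print(f"JATS candidate at {produced}; compare it against your ideal reference.")
```

The interesting part is then the comparison: an XML-aware diff against the reference JATS file tells you where the conversion falls short.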
Hi @dginev, thanks! My concern is precisely the "lacking in post-processing tests" that you mention. This issue is a suggestion to enhance this repo (or a second git repository) with such a set of samples.
Perhaps, for a "large corpus" for LaTeXML, you can use preprint manuscripts that were later published in PubMed Central, SciELO and others... by contacting journals and authors. I can help by contacting 1 or 2 journals.
If you're requesting what I think you're requesting, I'd like to turn the request around! :> Basically, none of us currently on the project are very familiar with JATS, but we have a (hopefully) good proof of concept. As I recall, it passed validation as JATS documents, but that of course doesn't tell you whether it faithfully captures the semantics of the original document. So, what we'd really appreciate is for someone who is familiar with JATS and how it's used to apply LaTeXML to a sample of real-life documents to determine how well it's working and what the faults are (I'm sure there are some). If faults arise, we can try to fix them. Once it's working convincingly, we could easily derive some small unit tests for regression purposes. You up for that? :>
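To give a concrete idea, such a regression check could be as small as the sketch below (written in Python for brevity, although the actual test suite is Perl-based; the file paths are invented). It just compares a freshly converted document against a stored reference, ignoring insignificant whitespace:

```python
import xml.etree.ElementTree as ET

def canonical(path: str) -> str:
    # canonicalize() (Python 3.8+) normalizes attribute order and, with
    # strip_text=True, surrounding whitespace, so the comparison focuses on structure.
    return ET.canonicalize(from_file=path, strip_text=True)

def test_showcase_article_matches_reference():
    produced = canonical("out/showcase-article.jats.xml")
    expected = canonical("t/jats/showcase-article.reference.xml")
    assert produced == expected
```

Freezing the outputs of a handful of convincingly converted articles as references in this way would guard against regressions.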
I guess not...
Please reopen this - I will be exploring LaTeXML for arxiv -> JATS conversion, with the goal to produce a ScienceFair datasource. I will document experiments and progress here, and contribute back any bugs (+fixes) or improvements needed. If a test corpus is useful, I can contribute that back too.
Hi all, sorry for abandoning this for a while... My manuscripts will be converted only to "Simple JATS", not real JATS... When I obtain complete JATS I will come back with the files. Hi @blahah, let's see what you offer there (!); I can help with JATS analysis (some quality control).
Oh wow, @blahah thanks for the interest! Let's keep the issue open then and see if we can reap some mutual benefits here. I've been involved in arXiv conversion with latexml, so I can answer some of the tricky aspects you may encounter on the way. I would offer some of my arXiv latex->html build system for reuse, but it's not generalized enough and hasn't been maintained recently - it may at least be worth a glance at the part where the actual latexml workers are called, which is here: https://github.com/dginev/latexml-plugin-cortex While the code may not be terribly useful to you, the comments may be an early warning of what could go wrong:
@dginev do you have any stats on the time and (compute) resources your HTML conversion took? Curious to see what I'm letting myself in for :)
Hi @blahah, I also suggest some samples at http://jats4r.org/validator/ PS: I feel a little rusty, but I can do "human analysis" as a JATS expert.
Short answer: it's tricky and slow; too slow without at least 20+ CPUs around. I do, somewhere in the email archives... here is the last time I publicly shared some data on the LaTeXML list (10.2016): Apparently our arXMLiv-specific email archives are private? Here is an email snippet with runtime stats from January 2016, when I did the last detailed email report w.r.t. runtime:
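Stats aside, the mechanical part is just fanning `latexmlc` out over a pool of workers with a per-document timeout. A toy sketch of that pattern is below (this is not the actual CorTeX setup, and the timeout value is only a guess):

```python
import subprocess
from multiprocessing import Pool
from pathlib import Path

def convert_one(tex_file):
    """Convert one paper with latexmlc; report a coarse status instead of raising."""
    dest = Path(tex_file).with_suffix(".html")
    try:
        subprocess.run(
            ["latexmlc", f"--destination={dest}", "--format=html5", tex_file],
            check=True, capture_output=True,
            timeout=1200,  # some arXiv papers legitimately run for many minutes
        )
        return tex_file, "ok"
    except subprocess.TimeoutExpired:
        return tex_file, "timeout"
    except subprocess.CalledProcessError as err:
        return tex_file, f"failed ({err.returncode})"

if __name__ == "__main__":
    sources = [str(p) for p in Path("corpus").glob("*/main.tex")]
    with Pool(processes=20) as pool:  # the "20+ CPUs" mentioned above
        for name, status in pool.imap_unordered(convert_one, sources):
            print(status, name)
```

The fan-out itself is the easy part; the long tail of documents that time out or fail outright is where most of the effort goes.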
I think this is also a good time to quickly remark that there has been a lot going on "behind the curtains" of the project targeting exactly the performance deficiencies of LaTeXML, and there may be (exciting?) developments in that vein later in 2017. But I can't share more for the moment, save for this "light hint".
Wow! This is a nice development!! Looking forward to some bug reports :> Thanks!
Thanks @dginev - I do have access to 64+ core machines I can use, so that sounds totally achievable. And I'm excited to see what the secret developments are.
Marking this as a documentation enhancement for 0.8.4 (2 releases from now); feel free to send us updates as things progress, @blahah!
Hey @blahah, could you share whether your effort to get arxiv->JATS made some progress / produced results, and whether there are any blockers we can help with? My research group upped its hardware capacity a week ago and we seem to have found a viable compromise for research-only redistribution of the HTML5 of arXiv, so I'll be sharing some @KWARC news here soon, and I will try to address the requests @ppKrauss had about documenting corpus-level conventions and best practices (at least the ones I arrived at). The folks at arxiv-vanity (cc @bfirsh) are now also doing the latexml dance over arXiv, so that makes 3 separate parties working on converting that corpus, and it would be excellent to share notes and upgrades as we go along. Pretty exciting actually. I may be a lot more active on this front in 2018, so this feels like a good time to drop a note here.
@dginev I will put all my code and results online. The basic story is that I got it working pretty nicely, but there are lots of edge cases. I'd be very interested to sync up. When you say:
do you mean a licensing compromise or a technical one? The license issue seems to me to be the biggest one.
@blahah sounds great, and that sounds about right regarding the edge cases - it would be great to solve those together. The compromise solves the licensing problem, but it is a "legally technical" one, basically mitigating risk by having a dedicated organization do the redistribution with an extremely limited purpose (non-commercial + research). We can't really wish away the default arXiv license, I am afraid... The only "ultimate" solution remains having Cornell itself host the alternative formats, but that still seems to be a long-term prospect only. I am just happy we found some way to make the data available to the wider scientific community; we should be moving from "unavailable" to "slightly inconvenient direct download" soon.
@dginev I'd like to know when the "slightly inconvenient" download is ready! Would love to use this for R-factor stuff :)
We have just posted live our arXiv.org 08.2017 HTML5 dataset, together with a token model and word embeddings, intended for redistribution for research and tool development. Advertising them here as requested, and we welcome any and all community feedback:
Hi @ppKrauss, this is where the "slightly inconvenient" part of the download comes in. You need to sign an NDA with the SIGMathLing organization to be given access to the downloads, which is the legal workaround for mitigating any weird licensing troubles with arXiv (a long topic I won't go into here). Detailed instructions here. For now we are testing that redistribution route, so both the dataset and the embeddings follow these guidelines. Hopefully, in a bright mid-term future, we'll have an official path to distribution that won't need NDAs and hassle; sorry for the inconvenience. PS: The large files are indeed hosted via git LFS on GitLab, but they are hidden for licensing reasons.
I will close this issue for now; feel free to drop a comment or open a new one that is more specific to JATS, given that we covered a lot of ground here. For now, I added a pointer to the arXMLiv corpus I mentioned in the latexml wiki pages: https://github.com/brucemiller/LaTeXML/wiki/Interesting-Applications And I will use the main #896 issue for discussing improvements to the markup documentation. I'm not doing any active JATS work at the moment, so we may want to find a different driver for that.
The link to arXMLiv seems to be dead: https://kwarc.info/systems/arXMLiv/
I'd like to download the corpus, if possible, for fair-use purposes; can you link it here?
Thanks for spotting that dead link, fixed. A bit too much is happening on that wiki page; if you specifically care about the dataset, you can find it here: and my download explanations are in the comment here: #835 (comment)
Real-life samples (not only samples picked for a demo) are important for several reasons.
Example: I need to "see" (with samples) the potential of using LaTeXML for LaTeX-to-JATS conversions...
About JATS and JATS-samples
Ideally, we would select some samples (e.g. 2, 10 or 100 documents): the LaTeX manuscripts or LaTeX articles that were the sources of articles published in PubMed Central... A good sample set depends on community use: a small sample set works well as "standard examples", and a larger sample set can serve as a text corpus.