There are some samples or corpus for test-kit? #835

Closed
ppKrauss opened this Issue Apr 6, 2017 · 26 comments

@ppKrauss

ppKrauss commented Apr 6, 2017

Real-life samples (not just "samples for demo") are important for:

  • showing the software's features and the conventions it adopts;
  • tests: avoiding unexpected failures after code modifications (software regression);
  • selecting specific "standard samples" to discuss features and test enhancements.

Example: I need to "see" (with samples) the potential of using LaTeXML for LaTeX-to-JATS conversion...

About JATS and JATS samples

The ideal would be to select some samples (e.g. 2, 10, or 100 documents): the LaTeX manuscripts or articles that were the source of articles in PubMed Central... A good sample set depends on community use: a small sample set works well as "standard examples", while a larger one can serve as a text corpus.

@dginev

Collaborator

dginev commented Apr 6, 2017

Um, hello. What exactly are you requesting in this issue? LaTeXML already has an (integration) test suite that covers most of the core LaTeX conversion features.

The suite is lacking in post-processing tests, and we welcome external contributions: you can grab a showcase article of your choice, run it through the JATS conversion, and compare the output against an ideal test case. Ideally we would have had this already, but with limited time we tend to prioritize core features over the periphery.
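For a contributor picking this up, the basic two-step workflow might look like the sketch below. The stylesheet name and filenames are assumptions, not confirmed paths; check the XSLT resources shipped with your LaTeXML install for the actual JATS stylesheet.

```shell
# Hypothetical sketch; article.tex and the stylesheet name are placeholders.
latexml --destination=article.xml article.tex            # TeX -> LaTeXML XML
latexmlpost --stylesheet=LaTeXML-jats.xsl \
            --destination=article.jats.xml article.xml   # XML -> JATS via XSLT
```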

As for a large corpus that LaTeXML has been run over: there is an active effort (though also low on manpower in recent months) to convert arXiv.org to HTML. A report on it can be found here:
https://lists.kwarc.info/pipermail/project-latexml/2016-October/002196.html

And the latest stats here:
http://cortex.mathweb.org/corpus/arXMLiv/tex_to_html

@ppKrauss

ppKrauss commented Apr 6, 2017

Hi @dginev, thanks! My concern is exactly the "lacking in post-processing tests" you mention. This issue is a suggestion to enhance this repo (or a second git repository) with:

  • post-processing results: each LaTeXML version produces a new set of results, which git can check for stability or changes.

  • ideal results: each sample needs a source and a "target", the ideal output produced by other software... Example: the official JATS in PubMed Central.

  • a diff methodology and metrics: this is not easy, but for JATS XML it is possible to compare C14N differences (ideal vs. result) by attributes, by pre-selected XPaths, etc.

Perhaps, for a "large LaTeXML corpus", you could use preprint manuscripts that were published in PubMed Central, SciELO and others... contacting journals and authors. I can help by contacting 1 or 2 journals.
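To make the C14N comparison idea concrete, here is a minimal sketch using Python's standard-library canonicalizer (Python 3.8+). The inlined documents are toy stand-ins; in practice one would load the ideal PMC JATS and the LaTeXML result from disk.

```python
from xml.etree.ElementTree import canonicalize

# Two serializations that differ only in whitespace inside tags;
# Canonical XML (C14N) normalizes such details, so only real
# structural differences between ideal and result remain.
ideal  = '<article><front><article-title>A</article-title></front></article>'
result = '<article ><front ><article-title >A</article-title></front></article>'

print(canonicalize(ideal) == canonicalize(result))  # True
```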

@brucemiller

Owner

brucemiller commented Apr 6, 2017

If you're requesting what I think you're requesting, I'd like to turn the request around! :>

Basically, none of us currently on the project are very familiar with JATS, but we have a (hopefully) good proof of concept. As I recall, it passed validation as JATS documents, but that of course doesn't tell you whether it faithfully captures the semantics of the original document.

So, what we'd really appreciate is for someone who is familiar with JATS and how it's used to apply LaTeXML to a sample of real-life documents to determine how well it's working, what the faults are --- I'm sure there are some. If faults arise, we can try to fix them. Once it's working convincingly, we could easily derive some small unit tests for regression purposes.

You up for that? :>

@brucemiller

Owner

brucemiller commented May 27, 2017

I guess not...

@blahah

blahah commented Jul 6, 2017

Please reopen this - I will be exploring LaTeXML for arxiv -> JATS conversion, with the goal to produce a ScienceFair datasource.

I will document experiments and progress here, and contribute back any bugs (+fixes) or improvements needed.

If a test corpus is useful, I can contribute that back too.

@ppKrauss

ppKrauss commented Jul 6, 2017

Hi all, sorry for abandoning this... My manuscripts will be converted only to "Simple JATS", not real JATS... When I obtain complete JATS I will come back with the files.

Hi @blahah, let's see what you offer there (!). I can help with JATS analysis (some quality control).

@dginev dginev reopened this Jul 6, 2017

@dginev

Collaborator

dginev commented Jul 6, 2017

Oh wow, @blahah thanks for the interest! Let's keep the issue open then and see if we can reap some mutual benefits here. I've been involved in arXiv conversion with LaTeXML, so I can answer some of the tricky aspects you may encounter along the way. I would offer some of my arXiv latex->html build system for reuse, but it's not generalized enough and hasn't been maintained recently; still, it may at least be worth a glance at the part where the actual latexml workers are called, which is here:

https://github.com/dginev/latexml-plugin-cortex

While the code may not be terribly useful to you, the comments may be an early warning of what could go wrong:
https://github.com/dginev/LaTeXML-Plugin-Cortex/blob/master/bin/latexml_worker#L12

@blahah

blahah commented Jul 6, 2017

thanks @dginev, I've been browsing that code and it was very useful. I may well be back with questions :)

@ppKrauss thanks - we will validate against the JATS DTD and (the ultimate test) check whether it works in the Lens viewer. If you have any other tips for validation I'd welcome them :)
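DTD validation can also be scripted for a whole batch. A hedged sketch with xmllint (the DTD filename and article path are placeholders; the JATS DTDs themselves must be downloaded separately):

```shell
# Validate a converted article against a locally downloaded JATS DTD.
# --noout suppresses the document echo; only validation errors are printed.
xmllint --noout --dtdvalid JATS-journalpublishing1.dtd article.jats.xml \
  && echo "valid JATS"
```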

@blahah

blahah commented Jul 6, 2017

@dginev do you have any stats on the time and (compute) resources your HTML conversion took? Curious to see what I'm letting myself in for :)

@ppKrauss

ppKrauss commented Jul 6, 2017

Hi @blahah, I suggest also some samples at http://jats4r.org/validator/

PS: I feel a little rusty, but I can do "human analysis" as JATS expert.

@blahah

blahah commented Jul 7, 2017

Managed to get it working pretty well :)

[two screenshots of the conversion result, 2017-07-07]

A few things to fix but very close!

@dginev

Collaborator

dginev commented Jul 7, 2017

@dginev do you have any stats on the time and (compute) resources your HTML conversion took? Curious to see what I'm letting myself in for :)

Short answer: It's tricky and slow, too slow without at least 20+ CPUs around.

I do, somewhere in the email archives... here is the last time I publicly shared some data on the LaTeXML list (10.2016):
https://lists.kwarc.info/pipermail/project-latexml/2016-October/002196.html

Apparently our arXMLiv-specific email archives are private? Here is an email snippet with runtime stats from January 2016, when I wrote the last detailed email report on runtime:

Dear all,
The first "dataset" run is now complete.

  • It took almost exactly 101 hours (4 days and 5 hours), or just about
    2.82 jobs/second.
  • That means the average arXiv job took 2.5 minutes to convert.
  • CorTeX is officially "stable"!
    • the dispatcher, database and workers processed the entirety of
      arXiv without a single unforeseen failure. There was zero admin
      intervention during the run.
    • We had the full array of workers operational from beginning to end,
      and all workers and HULK machines remain online and healthy after the run.

The final success rate is:

No Problems 8.13% 83492
Warning 45.81% 470344
Error 36.27% 372436
Fatal 9.78% 100442

http://cortex.mathweb.org/corpus/arXMLiv/tex_to_html
(there may be small fluctuations when you get to the live site, as ~100 jobs were yet to return and I marked them as timeouts to wrap up)

I am attaching the numbers from the previous run at [1]. We can record a small deterioration percentage-wise, but given that the current run is a lot more honest about post-processing errors and fatals, and imposes a hard 2GB memory limit, this is understandable.

While I have stated that CorTeX is now stable, it is not yet error-free. I believe a significant portion of the "file not found" errors could be due to the workers cleaning up files too aggressively, but that is yet to be established. Some of the reporting menus in the frontend are currently broken due to URI escaping issues in nickel.js (the web framework), so I'll try to fix them over the weekend, so that we get a better overview of the best venues for improvement before the next rerun. We can also think of redesigning the categorization of certain messages, so that they don't create an enormous variety of "what" classes (e.g. missing figure filenames).

I intend to package the results of the current run as an "arXMLiv-12-2015" dataset, and follow up with new dataset releases on a quarterly or 6-month basis, depending on HULK's availability and our progress with improving the conversion rates.

Greetings,
Deyan

[1] 4th stability run, December 2015

No Problems 8.16% 81005
Warning 47.01% 466706
Error 35.5% 352371
Fatal 9.33% 92625
On 01/18/2016 04:43 PM, Deyan Ginev wrote:

Dear all,

After we observed a few niggles w.r.t error-reporting and worker
stability in the last run, Bruce and I added some upgrades that
hopefully give us a stable-enough setup to run a first "dataset" run for
arXMLiv. Thus I have just started a full rerun from scratch.

All workers are using LaTeXML's latest HEAD (git version
29a47e5).

We're rerunning 1,025,914 arXiv sources with:

  • 420 CPUs (HULK, beryl, local laptops)
  • 20 minute job timeouts
  • 2 GB RAM memory limit

You can monitor the run at:
http://cortex.mathweb.org/corpus/arXMLiv/tex_to_html

Keep in mind the sub-report pages are cached, and you can see the
timestamp in the footer to check their freshness. They should be at best
a few minutes, and at worst a few hours, behind the main report page,
which isn't cached.

Greetings,
Deyan
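As a sanity check on the quoted figures (my own back-of-envelope, not part of the original email): with ~420 CPUs and ~2.5 minutes per job, the reported throughput and wall-clock time follow directly.

```python
# Back-of-envelope check of the reported arXiv run statistics.
jobs = 1_025_914                       # arXiv sources in the run
cpus = 420                             # workers (HULK, beryl, laptops)
avg_job_seconds = 150                  # ~2.5 minutes per conversion job
throughput = cpus / avg_job_seconds    # jobs finishing per second
hours = jobs / throughput / 3600       # total wall-clock time
print(f"{throughput:.2f} jobs/s, ~{hours:.0f} hours")  # 2.80 jobs/s, ~102 hours
```

This agrees with the reported 101 hours and ~2.82 jobs/second within rounding of the average job time.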

@dginev

Collaborator

dginev commented Jul 7, 2017

I think this is also a good time to quickly remark that a lot has been going on "behind the curtains" of the project, targeting exactly the performance deficiencies of LaTeXML, and there may be (exciting?) developments in that vein later in 2017. But I can't share more for the moment, save for this "light hint".

@brucemiller

Owner

brucemiller commented Jul 7, 2017

Wow! This is a nice development!! Looking forward to some bug reports :> Thanks!

@blahah

blahah commented Jul 7, 2017

Thanks @dginev - I do have access to 64+ core machines I can use so that sounds totally achievable.

And excited to see what the secret developments are 😄

@dginev dginev added the question label Jul 19, 2017

@dginev dginev added this to the LaTeXML-0.8.4 milestone Sep 5, 2017

@dginev

Collaborator

dginev commented Sep 5, 2017

Marking this as a documentation enhancement for 0.8.4 (2 releases from now), feel free to send us updates as things progress @blahah ! 👍

@dginev

Collaborator

dginev commented Dec 26, 2017

Hey @blahah, could you share whether your arxiv->JATS effort has made progress or produced results, and whether there are any blockers we can help with?

My research group upped its hardware capacity a week ago, and we seem to have found a viable compromise for research-only redistribution of the HTML5 of arXiv, so I'll be sharing some @KWARC news here soon. I will also try to address the requests @ppKrauss had about documenting corpus-level conventions and best practices (at least the ones I arrived at).

The folks at arxiv-vanity (cc @bfirsh) are now also doing the latexml dance over arXiv, so that makes three separate parties working on converting that corpus, and it would be excellent to share notes and upgrades as we go along. Pretty exciting, actually. I may be a lot more active on this front in 2018, so this feels like a good time to drop a note here.

@blahah

blahah commented Dec 30, 2017

@dginev I will put all my code and results online. The basic story is that I got it working pretty nicely, but there are lots of edge cases. I'd be very interested to sync up.

When you say:

we seem to have found a viable compromise for research-only redistribution of the HTML5 of arXiv

do you mean a licensing compromise or a technical one?

The license issue seems to me the biggest one.

@dginev

Collaborator

dginev commented Jan 2, 2018

@blahah sounds great, and sounds about right about the edge cases - great to mutually solve those.

The compromise solves the licensing problem, but it is a "legally technical" one: it basically mitigates risk by having a dedicated organization do the redistribution for an extremely limited purpose (non-commercial + research). We can't really wish away the default arXiv license, I'm afraid... The only "ultimate" solution remains having Cornell itself host the alternative formats, but that still seems to be a long-term prospect only. I am just happy we found some way to make the data available to the wider scientific community; we should be moving from "unavailable" to "slightly inconvenient direct download" soon.

@jmnicholson

jmnicholson commented Jan 9, 2018

@dginev I'd like to know when "slightly inconvenient" download is ready! Would love to use this for R-factor stuff :)

@ppKrauss ppKrauss changed the title from Samples or corpus for test-kit? to There are some samples or corpus for test-kit? Jan 9, 2018

@dginev

Collaborator

dginev commented Jan 24, 2018

We have just posted live our arXiv.org 08.2017 HTML5 dataset, together with a token model and word embeddings, intended for redistribution for research and tool development. Advertising them here as requested, and we welcome any and all community feedback:

https://sigmathling.kwarc.info/news/2018/01/24/dataset/

@ppKrauss

ppKrauss commented Jan 24, 2018

Hi @dginev, congratulations on your work!

I am trying to download arXMLiv_08_2017_no_problem.zip, but "Authorize gl.kwarc.info" fails.

PS: you could use Git LFS in a public repo to offer your big ~5 GB file; no need to hide it, there is no cost.

@dginev

Collaborator

dginev commented Jan 24, 2018

Hi @ppKrauss, this is where the "slightly inconvenient" part of the download comes in. You need to sign an NDA with the SIGMathLing organization to be given access to the downloads; that is the legal workaround for mitigating any weird licensing troubles with arXiv (a long topic I won't go into here). Detailed instructions here.

For now we are testing that redistribution route so both the dataset and embeddings follow these guidelines. Hopefully in a bright mid-term future we'll have an official path to distribution that won't need NDAs and hassle, sorry for the inconvenience.

PS: The large files are indeed hosted via git LFS in gitlab, but they are hidden for licensing reasons.

@dginev

Collaborator

dginev commented Apr 20, 2018

I will close this issue for now; feel free to drop a comment, or open a new issue that is more specific to JATS, given that we covered a lot of ground here.

For now, I added a pointer to the arXMLiv corpus I mentioned in the latexml wiki pages:

https://github.com/brucemiller/LaTeXML/wiki/Interesting-Applications

And I will use the main #896 issue for discussing improvements to the markup documentation. I'm not doing any active JATS work at the moment, so we may want to find a different driver for that.

@dginev dginev closed this Apr 20, 2018

@jmnicholson

jmnicholson commented Apr 20, 2018

@dginev

Collaborator

dginev commented Apr 20, 2018

Thanks for spotting that dead link, fixed. A bit too much is happening on that wiki page, if you specifically care about the dataset you can find it here:
https://sigmathling.kwarc.info/resources/arxmliv-dataset-082017/

and my download explanations are in the comment here #835 (comment)
