[feature] First iteration of TEI-implementation for LaTeXML. #871

Merged
merged 10 commits into from Nov 28, 2017

@Thanathan-k
Contributor

Thanathan-k commented Sep 4, 2017

Based on the LaTeXML-jats.xsl stylesheet; I only changed the parts that currently need changing. Testing and feedback would be appreciated.

@Thanathan-k Thanathan-k changed the title from First iteration of TEI-implementation for LaTeXML. to [feature] First iteration of TEI-implementation for LaTeXML. Sep 4, 2017

@dginev dginev requested a review from brucemiller Sep 4, 2017

@brucemiller

Owner

brucemiller commented Sep 5, 2017

Is there a recommended TEI validation tool, or RelaxNG schema, or...?

@dginev

Collaborator

dginev commented Sep 5, 2017

+1 to Bruce's comment. If there is a validator/schema and you (@Thanathan-k) can add a test to latexml's post-processing suite, it would be a killer PR. It would make it 10x easier to merge, I think.

@Thanathan-k

Contributor

Thanathan-k commented Sep 6, 2017

Currently trying to get a validator in the form of an .rng file. However it seems complicated to use the MathML specification even though it should work with Roma... Do any of you already have some experience with it?

@Thanathan-k

Contributor

Thanathan-k commented Sep 6, 2017

The validation tool I used was jing (https://code.google.com/archive/p/jing-trang/).

My document uses references within formulas. To my knowledge this isn't allowed in MathML. However, I was not able to handle this in the XSL file, because the formula generation seems to be done completely by LaTeXML itself. Is my assumption correct? And if so, how should references within formulas be handled?

@dginev

Collaborator

dginev commented Sep 6, 2017

I am quite sure LaTeXML used to verify its MathML was valid, I think back in the days of outputting XHTML 4, but @brucemiller may have also done this for the HTML5 output. He would certainly know.

Changed more transform functions
for example formula group
@physikerwelt

Contributor

physikerwelt commented Sep 14, 2017

@dginev I think the problem is that the mtext element is hard to validate. In the stylesheet, @Thanathan-k set up explicit rules to validate the mtext element, which makes sense according to https://www.w3.org/TR/MathML3/chapter3.html#presm.mtext , https://www.w3.org/TR/MathML3/appendixa.html#parsing_mtext . However, it is also reasonable to treat the mtext element like the outer document format and to allow other tags inside it. On the MathML website I was looking at the mtext definition
https://www.w3.org/TR/MathML3/appendixa.html#parsing_token.content
which refers to token.content*, which refers to text, which is not clickable on the website.
So whether additional tags are allowed depends on the definition of text.

@physikerwelt

Contributor

physikerwelt commented Sep 14, 2017

One option could be to allow all valid TEI text in mtext, as described in https://www.w3.org/TR/MathML3/chapter6.html#interf.html for MathML in HTML5.

Added rnc file
Works better than rng (included Grobid and Mathml3)
@Thanathan-k

Contributor

Thanathan-k commented Sep 14, 2017

I've added a new validator file. It works with jing -c LaTeXML-tei.rnc myfile.

To recreate the error we were talking about, use the .tex file of https://arxiv.org/format/1708.08473 (sorry, it's too large for a gist),
transform it with latexml, and then use latexmlpost with the tei format to generate the output. jing will then show an error:
result_pmml.tei:10738:365: error: element "ref" not allowed here; expected the element end-tag, text or element "m:malignmark" or "m:mglyph"

However, I've found even more problems. If we use latexmlpost with the additional --cmml flag, the generated content MathML has even more errors. For example:
result.tei:22498:1809: error: element "text" not allowed here; expected the element end-tag, text or element "m:malignmark" or "m:mglyph"
and
result.tei:22498:4462: error: element "m:mtext" not allowed here; expected the element end-tag or element "m:abs", "m:and", "m:apply", "m:approx", "m:arccos", "m:arccosh", "m:arccot", "m:arccoth", "m:arccsc", "m:arccsch", "m:arcsec", "m:arcsech", "m:arcsin", "m:arcsinh", "m:arctan", "m:arctanh", "m:arg", "m:bind", "m:card", "m:cartesianproduct", "m:cbytes", "m:ceiling", "m:cerror", "m:ci", "m:cn", "m:codomain", "m:complexes", "m:compose", "m:conjugate", "m:cos", "m:cosh", "m:cot", "m:coth", "m:cs", "m:csc", "m:csch", "m:csymbol", "m:curl", "m:declare", "m:determinant", "m:diff", "m:divergence", "m:divide", "m:domain", "m:emptyset", "m:eq", "m:equivalent", "m:eulergamma", "m:exists", "m:exp", "m:exponentiale", "m:factorial", "m:factorof", "m:false", "m:floor", "m:fn", "m:forall", "m:gcd", "m:geq", "m:grad", "m:gt", "m:ident", "m:image", "m:imaginary", "m:imaginaryi", "m:implies", "m:in", "m:infinity", "m:int", "m:integers", "m:intersect", "m:interval", "m:inverse", "m:lambda", "m:laplacian", "m:lcm", "m:leq", "m:limit", "m:list", "m:ln", "m:log", "m:lt", "m:matrix", "m:matrixrow", "m:max", "m:mean", "m:median", "m:min", "m:minus", "m:mode", "m:moment", "m:naturalnumbers", "m:neq", "m:not", "m:notanumber", "m:notin", "m:notprsubset", "m:notsubset", "m:or", "m:outerproduct", "m:partialdiff", "m:pi", "m:piecewise", "m:plus", "m:power", "m:primes", "m:product", "m:prsubset", "m:quotient", "m:rationals", "m:real", "m:reals", "m:reln", "m:rem", "m:root", "m:scalarproduct", "m:sdev", "m:sec", "m:sech", "m:selector", "m:semantics", "m:set", "m:setdiff", "m:share", "m:sin", "m:sinh", "m:subset", "m:sum", "m:tan", "m:tanh", "m:tendsto", "m:times", "m:transpose", "m:true", "m:union", "m:variance", "m:vector", "m:vectorproduct" or "m:xor"

I've checked the corresponding rnc files provided by MathML and the errors are indeed correct and not due to any rnc mashup I've done.
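
For anyone triaging a large batch of such messages, a small parser can help group them by the rejected element. The following is a minimal sketch, assuming jing's diagnostics keep the `file:line:col: level: message` shape seen in the errors quoted above; the function names are illustrative and not part of LaTeXML or jing.

```python
import re
from collections import Counter

# Matches diagnostics shaped like the ones quoted in this thread, e.g.
#   result.tei:22498:1809: error: element "text" not allowed here; expected ...
JING_LINE = re.compile(
    r'^(?P<file>.+?):(?P<line>\d+):(?P<col>\d+): (?P<level>\w+): (?P<msg>.*)$'
)
NOT_ALLOWED = re.compile(r'element "(?P<name>[^"]+)" not allowed here')


def parse_jing_output(output):
    """Return one dict per recognizable diagnostic line (hypothetical helper)."""
    records = []
    for raw in output.splitlines():
        match = JING_LINE.match(raw.strip())
        if not match:
            continue  # skip blank lines and anything not in file:line:col form
        record = match.groupdict()
        record["line"] = int(record["line"])
        record["col"] = int(record["col"])
        rejected = NOT_ALLOWED.search(record["msg"])
        record["element"] = rejected.group("name") if rejected else None
        records.append(record)
    return records


def count_rejected_elements(records):
    """Count how often each element name was reported as not allowed."""
    return Counter(r["element"] for r in records if r["element"])


if __name__ == "__main__":
    sample = (
        'result_pmml.tei:10738:365: error: element "ref" not allowed here; '
        'expected the element end-tag, text or element "m:malignmark" or "m:mglyph"\n'
        'result.tei:22498:1809: error: element "text" not allowed here; '
        'expected the element end-tag, text or element "m:malignmark" or "m:mglyph"\n'
    )
    for name, count in count_rejected_elements(parse_jing_output(sample)).items():
        print(name, count)
```

Grouping by element name makes it easier to see whether a run of errors comes from one cause (for example `ref` inside `m:mtext`) or from several.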

@physikerwelt

Contributor

physikerwelt commented Sep 14, 2017

@Thanathan-k the second problem could also be an independent bug in LaTeXML. Can you identify the LaTeX source formula that caused that problem? Maybe there is just a wrapper missing that would wrap the mtext element in a standards-conformant way for the content branch
https://w3c.github.io/mathml/mathml.html#chapter5_mixing.pmincm

Thanathan-k added some commits Sep 20, 2017

fixed rnc file and theorem/proof in xslt
RNC file didn't allow rows as part of tables and theorem/proof are now handled as notes
Fixed more TEI specifications
Affiliation and AddrLine corrected
@brucemiller

Owner

brucemiller commented Oct 22, 2017

I haven't responded much, since I have no familiarity with TEI, and since patches kept coming, I figured to wait for the dust to settle... but maybe it has?

On the issue of validation: "pure" MathML allows very limited elements within m:mtext, but we tried to write the spec to imply that when MathML is embedded within a larger document schema, the m:mtext should allow (at least) that document schema's inline elements as well --- at least, that's my interpretation! And it's what LaTeXML's schema does, and pretty much has to do in order to model what is typically done in LaTeX. Consequently, any other target format including MathML is going to have to do the same thing in order to validate.

In @Thanathan-k's September 13 comment, the first error appears to be due to putting ref within mtext. That's exactly the kind of inline that is not allowed by "pure" MathML, but should be allowed by the embedded MathML (imho). The second error, involving cmml, I can't quite interpret; perhaps it's the same problem?
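
To see where a converted document departs from "pure" MathML without running a full schema validation, one can list the non-MathML children of every m:mtext. This is a minimal sketch using only the Python standard library; the allowed set (mglyph, malignmark) is taken from the jing message quoted earlier in the thread, and the sample document is a hypothetical fragment, not actual output of the stylesheet.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"
# Element children that the "pure" MathML schema accepts inside mtext,
# according to the jing message quoted in this thread.
ALLOWED_IN_MTEXT = {
    f"{{{MATHML_NS}}}mglyph",
    f"{{{MATHML_NS}}}malignmark",
}


def foreign_children_in_mtext(xml_text):
    """Return the qualified tags of mtext children outside the allowed set."""
    root = ET.fromstring(xml_text)
    found = []
    for mtext in root.iter(f"{{{MATHML_NS}}}mtext"):
        for child in mtext:
            if child.tag not in ALLOWED_IN_MTEXT:
                found.append(child.tag)
    return found


if __name__ == "__main__":
    # Hypothetical fragment: a TEI <ref> embedded in m:mtext.
    sample = (
        '<TEI xmlns="http://www.tei-c.org/ns/1.0" '
        'xmlns:m="http://www.w3.org/1998/Math/MathML">'
        '<formula><m:math><m:mtext>see <ref target="#S1">1</ref>'
        '</m:mtext></m:math></formula>'
        '</TEI>'
    )
    print(foreign_children_in_mtext(sample))
```

Each tag reported here is an element that an extended, embedding-aware schema would have to admit inside m:mtext for the document to validate.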

@brucemiller

Owner

brucemiller commented Oct 22, 2017

Oh: on that second error: if it is indeed a distinct latexml error, it should be reproducible with simple html w/o the complications (and unknowns, for me) of TEI. A small sample producing the error would be helpful.

@Thanathan-k

Contributor

Thanathan-k commented Oct 23, 2017

Hey @brucemiller ,

Yeah I worked on it and used more documents from ArXiv to check if everything works. Currently everything looks fine, but I wanted to check some more samples and fix everything before commenting here again. I'll also prepare a sample which produces the cmml error I was talking about.

@Thanathan-k

Contributor

Thanathan-k commented Oct 26, 2017

Okay I've checked the script again against more ArXiv documents and everything looks good currently. (Except for the errors within MathML).

Here's a tex document producing both the ref and m:mtext errors after running latexml and latexmlpost with my TEI extension:

The exact commands were: latexml -destination=error.xml error.tex and latexmlpost -cmml -destination=error.tei -format=tei error.xml

https://gist.github.com/Thanathan-k/ae812977da44a16cb5f8cf5f507b3445
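
The reproduction steps can be scripted. The sketch below only assembles the three command lines mentioned in this thread (latexml, latexmlpost with the tei format and cmml, then jing with the compact-syntax schema) and runs them if the tools are installed. It assumes that latexmlpost reads the error.xml file produced by latexml, and it uses the double-dash spelling of the options; the helper names and the schema file name are illustrative.

```python
import shutil
import subprocess


def build_commands(tex_file, xml_file, tei_file, schema_file):
    """Return the argv lists for conversion, post-processing and validation."""
    return [
        # LaTeX source -> LaTeXML XML
        ["latexml", f"--destination={xml_file}", tex_file],
        # LaTeXML XML -> TEI with content MathML (assumes the tei format from this PR)
        ["latexmlpost", "--cmml", "--format=tei",
         f"--destination={tei_file}", xml_file],
        # Validate against a RELAX NG compact schema, as done in this thread
        ["jing", "-c", schema_file, tei_file],
    ]


def run_pipeline(commands):
    """Run each command in order and return (argv, returncode) pairs."""
    results = []
    for argv in commands:
        if shutil.which(argv[0]) is None:
            raise FileNotFoundError(f"{argv[0]} is not on PATH")
        completed = subprocess.run(argv, capture_output=True, text=True)
        results.append((argv, completed.returncode))
    return results


if __name__ == "__main__":
    for argv in build_commands("error.tex", "error.xml", "error.tei",
                               "LaTeXML-tei.rnc"):
        print(" ".join(argv))
```

A non-zero return code from the last step is expected for the sample document, since that is the validation failure being reported.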

Added the correct rnc files and deleted the defective rng file
Due to the external I have not yet been able to convert the rnc to rng. Will work on that. If it's possible to get rid of the mathml3.rnc in favor of LaTeXML-math.rnc, that would be great as well.
@brucemiller

Owner

brucemiller commented Nov 15, 2017

OK, that's a nice small test case. Just running it with pmml first: it includes \ref within the math (of a label "pent", which is nowhere defined, btw) and that generates a <ref> element within the m:mtext, as you'd expect. But that is exactly where you need to extend the MathML schema that's included into TEI's schema: its m:mtext should accept all of TEI's "inline" elements.

Actually, it's a bit worse than that: you don't really want to just go in and edit the TEI & MathML schemas to change them, if at all possible. That's very difficult to maintain, as it isn't clear which parts of the original schema have been modified.

However: Often, you can start with a small schema that includes TEI and MathML's schema, and then define a couple of magic rules like

Inline.class |= VRML

which would add the VRML element to a set of element names that defines the inline elements (an example taken from DLMF's enhancement of LaTeXML's schema). It's tricky, and sometimes the original schema writers make it hard or even impossible to do cleanly, but that's the approach to try for.

And, in fact, although it is really nice to have a schema to check things against, I don't really want to embed TEI's or MathML's schema into the LaTeXML distribution.

Coming back to --cmml, hmm, yeah, LaTeXML's putting an m:mtext into the cmml, which I guess isn't right. But it really has no idea what the "semantics" are to convert it into something better.... not sure what the best approach here is, yet.

@Thanathan-k

Contributor

Thanathan-k commented Nov 15, 2017

> (of a label "pent", which is nowhere defined, btw)

Sorry I focused on the error and just deleted as much as possible. Must have missed that.

> And, in fact, although it is really nice to have a schema to check things against, I don't really want to embed TEI's or MathML's schema into the LaTeXML distribution.

So what do you suggest? Or do you generally not want to include TEI in LaTeXML?

@brucemiller

Owner

brucemiller commented Nov 15, 2017

@Thanathan-k

Contributor

Thanathan-k commented Nov 16, 2017

> Let me start by apologizing (again) for not being on top of this issue and responding in a timely fashion... I've been rather distracted lately.

Don't worry about it!

So what do you suggest as next steps? As far as my testing goes, the created documents should be in line with the Grobid-TEI specification (which I focused on because it's used by our software). I could create an issue there and ask them to add MathML to their rng/rnc files.

@brucemiller

Owner

brucemiller commented Nov 16, 2017

Ah, I wasn't aware of Grobid-TEI; a bit of shallow googling suggests they're aware of MathML, but it isn't clear to what extent they've already accommodated it. Maybe they have an interest in pursuing it, and you (& we) can help?

As for next steps: to the extent that you're happy with the current status of the transformation, and it's close enough to be useful at least for your purposes, we can probably merge a subset of the PR; namely, MANIFEST, latexmlpost and the XSL stylesheet. I hope that if further experience with it leads to corrections or improvements, you'll be willing to submit PRs for that?

Does that sound reasonable?
@dginev: any comments?

@Thanathan-k

Contributor

Thanathan-k commented Nov 16, 2017

I'll open a ticket at their GitHub project. Let's see if they have an interest in joining forces.

Sure, sounds good! Should I make a new commit with only the files you mentioned? Or will you just delete the rnc files while merging?

@brucemiller

Owner

brucemiller commented Nov 16, 2017

Hmm, I'm not actually sure that I know how. In principle, I would have cherry-picked the commits that only affected the desired files, and omitted any that dealt with the schema. But it appears that a few of the commits affect both xsl & rnc, so that won't work cleanly. Probably the solution is something like pulling the whole set of commits, removing the unwanted files and then doing a "squash" (I haven't gotten fluent with squashing, yet :> ). I'm trying to avoid getting the largish schema in, and polluting the history (which is already cluttered enough).

Maybe I should ping @dginev again for advice. He's good at this stuff!

@dginev

Collaborator

dginev commented Nov 16, 2017

Sounds like it is easiest to remove the files not needed, add that as a new commit to the branch of this PR, and then "squash and merge". Squash is forgiving of mistakes.

@Thanathan-k

Contributor

Thanathan-k commented Nov 17, 2017

Okay will do that!

@brucemiller brucemiller merged commit 465cbe4 into brucemiller:master Nov 28, 2017

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
@brucemiller

Owner

brucemiller commented Nov 28, 2017

Great new feature; will be interesting to see use cases.
Ultimately, we'll need to add some hooks in latexmlc to recognize the format, but this is a really good start. Thanks so much for your patience!

@physikerwelt

Contributor

physikerwelt commented Nov 28, 2017

Thank you @brucemiller!
@amstart has already converted all NTCIR-11 arXiv documents. Maybe you can share some statistics on how many documents could be converted without errors.
