# [feature] First iteration of TEI-implementation for LaTeXML. #871

Merged
10 commits merged on Nov 28, 2017

## Conversation

4 participants
Contributor

### Thanathan-k commented Sep 4, 2017

 Based on the LaTeXML-jats.xsl stylesheet, changing only the parts that are currently necessary. Testing and feedback would be appreciated.
 First iteration of TEI-implementation for LaTeXML. Based on the LaTeXML-jats.xsl Stylesheet and only changed currently necessary parts.
 7796a27 

### Thanathan-k changed the title from "First iteration of TEI-implementation for LaTeXML." to "[feature] First iteration of TEI-implementation for LaTeXML." Sep 4, 2017

 Deleted some comments 
 5b47b25 

Owner

### brucemiller commented Sep 5, 2017

 Is there a recommended TEI validation tool, or a RELAX NG schema, or...?
Collaborator

### dginev commented Sep 5, 2017

 +1 to Bruce's comment. If there is a validator/schema and you (@Thanathan-k) can add a test to LaTeXML's post-processing suite, it would be a killer PR. I think it would make it 10x easier to merge.
Contributor

### Thanathan-k commented Sep 6, 2017

 I'm currently trying to get a validator in the form of an .rng file. However, it seems complicated to use the MathML specification, even though it should work with Roma... Do any of you already have experience with it?

### Thanathan-k added some commits Sep 6, 2017

 Changes to make the document TEI-compliant and added RNG file. 
The RNG file consists of TEI_math (http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_math.rng) and Grobid (https://grobid.readthedocs.io/en/latest/TEI-encoding-of-results/)
 e013d54 
 Merge branch 'TEI' of github.com:Thanathan-k/LaTeXML into TEI 
 3e30a82 
Contributor

### Thanathan-k commented Sep 6, 2017 • edited

 The validation tool I used was jing (https://code.google.com/archive/p/jing-trang/). My document uses references within formulas. To my knowledge this isn't allowed in MathML. However, I was not able to handle this in the XSL file, because the formula generation seems to be done completely by LaTeXML itself. Is my assumption correct? And if so, how should references within formulas be handled?
Collaborator

### dginev commented Sep 6, 2017

 I am quite sure LaTeXML used to verify its MathML was valid, I think back in the days of outputting XHTML 4, but @brucemiller may have also done this for the HTML5 output. He would certainly know.
 Changed more transform functions 
for example formula group
 c4b06d6 
Contributor

### physikerwelt commented Sep 14, 2017

 @dginev I think the problem is that the mtext element is hard to validate. In the stylesheet, @Thanathan-k set up explicit rules to validate the mtext element, which makes sense according to https://www.w3.org/TR/MathML3/chapter3.html#presm.mtext , https://www.w3.org/TR/MathML3/appendixa.html#parsing_mtext . However, it is also reasonable to treat the mtext element like the outer document format and to allow other tags inside it. On the MathML website I looked at the mtext definition https://www.w3.org/TR/MathML3/appendixa.html#parsing_token.content , which refers to token.content*, which in turn refers to text, which is not clickable on the website. Whether additional tags are allowed therefore depends on the definition of text.
Contributor

### physikerwelt commented Sep 14, 2017

 One option could be to allow all valid TEI text in mtext, as described in https://www.w3.org/TR/MathML3/chapter6.html#interf.html for MathML in HTML5.
 Added rnc file 
Works better than rng (included Grobid and Mathml3)
 66fe02a 
Contributor

### Thanathan-k commented Sep 14, 2017 • edited

I've added a new validator file. It works with jing -c LatTeXML-tei.rnc myfile.

To recreate the error we were talking about, use the .tex file of https://arxiv.org/format/1708.08473 (it's too large for a gist, sorry), transform it with latexml, and then run latexmlpost with the tei format to generate the output. jing will then show this error:

result_pmml.tei:10738:365: error: element "ref" not allowed here; expected the element end-tag, text or element "m:malignmark" or "m:mglyph"

However, I've found even more problems. If we use latexmlpost with the additional --cmml flag, the generated content MathML has even more errors. For example:

result.tei:22498:1809: error: element "text" not allowed here; expected the element end-tag, text or element "m:malignmark" or "m:mglyph"

and

result.tei:22498:4462: error: element "m:mtext" not allowed here; expected the element end-tag or element "m:abs", "m:and", "m:apply", "m:approx", "m:arccos", "m:arccosh", "m:arccot", "m:arccoth", "m:arccsc", "m:arccsch", "m:arcsec", "m:arcsech", "m:arcsin", "m:arcsinh", "m:arctan", "m:arctanh", "m:arg", "m:bind", "m:card", "m:cartesianproduct", "m:cbytes", "m:ceiling", "m:cerror", "m:ci", "m:cn", "m:codomain", "m:complexes", "m:compose", "m:conjugate", "m:cos", "m:cosh", "m:cot", "m:coth", "m:cs", "m:csc", "m:csch", "m:csymbol", "m:curl", "m:declare", "m:determinant", "m:diff", "m:divergence", "m:divide", "m:domain", "m:emptyset", "m:eq", "m:equivalent", "m:eulergamma", "m:exists", "m:exp", "m:exponentiale", "m:factorial", "m:factorof", "m:false", "m:floor", "m:fn", "m:forall", "m:gcd", "m:geq", "m:grad", "m:gt", "m:ident", "m:image", "m:imaginary", "m:imaginaryi", "m:implies", "m:in", "m:infinity", "m:int", "m:integers", "m:intersect", "m:interval", "m:inverse", "m:lambda", "m:laplacian", "m:lcm", "m:leq", "m:limit", "m:list", "m:ln", "m:log", "m:lt", "m:matrix", "m:matrixrow", "m:max", "m:mean", "m:median", "m:min", "m:minus", "m:mode", "m:moment", "m:naturalnumbers", "m:neq", "m:not", "m:notanumber", "m:notin", "m:notprsubset", "m:notsubset", "m:or", "m:outerproduct", "m:partialdiff", "m:pi", "m:piecewise", "m:plus", "m:power", "m:primes", "m:product", "m:prsubset", "m:quotient", "m:rationals", "m:real", "m:reals", "m:reln", "m:rem", "m:root", "m:scalarproduct", "m:sdev", "m:sec", "m:sech", "m:selector", "m:semantics", "m:set", "m:setdiff", "m:share", "m:sin", "m:sinh", "m:subset", "m:sum", "m:tan", "m:tanh", "m:tendsto", "m:times", "m:transpose", "m:true", "m:union", "m:variance", "m:vector", "m:vectorproduct" or "m:xor"

I've checked the corresponding rnc files provided by MathML, and the errors are indeed correct and not due to any rnc mashup I've done.
Contributor

### physikerwelt commented Sep 14, 2017

 @Thanathan-k the second problem could also be an independent bug in LaTeXML. Can you identify the LaTeX source formula that caused that problem? Maybe there is just a wrapper missing that would embed the mtext element in the content branch in a standards-conformant way: https://w3c.github.io/mathml/mathml.html#chapter5_mixing.pmincm

### Thanathan-k added some commits Sep 20, 2017

 fixed rnc file and theorem/proof in xslt 
RNC file didn't allow rows as part of tables and theorem/proof are now handled as notes
 6e9aa19 
 Fixed more TEI specifications 
Affiliation and AddrLine corrected
 b20992e 
Owner

### brucemiller commented Oct 22, 2017

I haven't responded much, since I have no familiarity with TEI, and since patches kept coming, I figured to wait for the dust to settle... but maybe it has?

On the issue of validation: "pure" MathML allows very limited elements within m:mtext, but we tried to write the spec to imply that when MathML is embedded within a larger document schema, the m:mtext should allow (at least) that document schema's inline elements as well (at least, that's my interpretation!). And it's what LaTeXML's schema does, and pretty much has to, to model what is typically done in LaTeX. Consequently, any other target format including MathML is going to have to do the same thing in order to validate.

In @Thanathan-k's September 13 comment, the first error appears to be due to putting ref within mtext. That's exactly the kind of inline that is not allowed by "pure" MathML, but should be allowed by the embedded MathML (imho). The second error, involving cmml, I can't quite interpret; perhaps it's the same problem?
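
A rough way to see which inline elements an embedding schema would have to admit inside m:mtext is to list the children that "pure" MathML rejects. The sketch below is only an illustration: the function name, the sample fragment, and the `#eq1` target are hypothetical, and the set of permitted children is taken from the jing messages quoted earlier in this thread rather than from the MathML schema files.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"

# Children reported as acceptable inside m:mtext by the jing messages
# quoted in this thread (assumption: this is the complete "pure" set).
PURE_MTEXT_CHILDREN = {
    f"{{{MATHML_NS}}}mglyph",
    f"{{{MATHML_NS}}}malignmark",
}


def foreign_mtext_children(xml_text: str) -> list[str]:
    """Return the tags of all m:mtext children that pure MathML would reject."""
    root = ET.fromstring(xml_text.strip())
    found = []
    for mtext in root.iter(f"{{{MATHML_NS}}}mtext"):
        for child in mtext:
            if child.tag not in PURE_MTEXT_CHILDREN:
                found.append(child.tag)
    return found


# Hypothetical fragment: a TEI ref placed inside m:mtext.
SAMPLE = """
<formula xmlns="http://www.tei-c.org/ns/1.0"
         xmlns:m="http://www.w3.org/1998/Math/MathML">
  <m:math>
    <m:mrow>
      <m:mi>x</m:mi>
      <m:mtext>see <ref target="#eq1">1</ref></m:mtext>
    </m:mrow>
  </m:math>
</formula>
"""

if __name__ == "__main__":
    for tag in foreign_mtext_children(SAMPLE):
        print(tag)
```

Run over real output, the distinct tags it reports would be the candidates for whatever inline class the combined schema lets m:mtext accept.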
Owner

### brucemiller commented Oct 22, 2017

 Oh: on that second error: if it is indeed a distinct latexml error, it should be reproducible with simple html w/o the complications (and unknowns, for me) of TEI. A small sample producing the error would be helpful.
Contributor

### Thanathan-k commented Oct 23, 2017

 Hey @brucemiller , Yeah I worked on it and used more documents from ArXiv to check if everything works. Currently everything looks fine, but I wanted to check some more samples and fix everything before commenting here again. I'll also prepare a sample which produces the cmml error I was talking about.
Contributor

### Thanathan-k commented Oct 26, 2017 • edited

Okay, I've checked the script again against more arXiv documents, and everything currently looks good (except for the errors within MathML). Here's a tex document that produces both the ref and m:text errors after running latexml and latexmlpost with my tei extension. The exact commands were:

latexml -destination=error.xml error.tex

and

latexmlpost -cmml -destination=error.tei -format=tei error.tei

https://gist.github.com/Thanathan-k/ae812977da44a16cb5f8cf5f507b3445
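
When checking many arXiv documents like this, it can help to tally the validator messages by offending element instead of reading each report. The following is a minimal sketch, assuming the reports look like the jing lines quoted earlier in this thread (`file:line:column: error: message`); the function name and regular expressions are illustrative, and the third sample message has its long "expected" list shortened.

```python
import re
from collections import Counter

# Assumed line shape, based on the jing messages quoted in this thread:
#   <file>:<line>:<column>: error: <message>
JING_LINE = re.compile(
    r'^(?P<file>.+?):(?P<line>\d+):(?P<col>\d+): (?P<level>\w+): (?P<msg>.*)$'
)
OFFENDER = re.compile(r'^element "(?P<name>[^"]+)" not allowed here')


def summarize(report: str) -> Counter:
    """Count 'element ... not allowed here' errors per offending element."""
    counts = Counter()
    for raw in report.splitlines():
        m = JING_LINE.match(raw.strip())
        if not m or m.group("level") != "error":
            continue  # skip blank lines and anything that is not an error
        hit = OFFENDER.match(m.group("msg"))
        counts[hit.group("name") if hit else "(other)"] += 1
    return counts


# Messages from this thread; the last "expected" list is shortened.
sample = "\n".join([
    'result_pmml.tei:10738:365: error: element "ref" not allowed here; '
    'expected the element end-tag, text or element "m:malignmark" or "m:mglyph"',
    'result.tei:22498:1809: error: element "text" not allowed here; '
    'expected the element end-tag, text or element "m:malignmark" or "m:mglyph"',
    'result.tei:22498:4462: error: element "m:mtext" not allowed here; '
    'expected the element end-tag or element "m:abs"',
])

if __name__ == "__main__":
    for name, n in summarize(sample).most_common():
        print(f"{n:4d}  {name}")
```

Grouping by element name makes it easier to tell a schema gap (such as ref inside m:mtext) from a one-off problem in a single document.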
 Added the correct rnc files and deleted defect rng file 
Due to the external I was not able yet to convert the rnc to rng. Will work on that. If it's possible to get rid of the mathml3.rnc in favor of LaTeXML-math.rnc, that would be great as well.
 4498764 
Owner

### brucemiller commented Nov 15, 2017

OK, that's a nice small test case. Just running it with pmml, first: it includes \ref within the math (of a label "pent", which is nowhere defined, btw), and that generates a ref element within the m:mtext, as you'd expect. But that is exactly where you need to extend the MathML schema that's included into TEI's schema: its m:mtext should accept all of TEI's "inline" elements.

Actually, it's a bit worse than that: you don't really want to just go in and edit the TEI & MathML schemas to change them, if at all possible. That's very difficult to maintain, as it isn't clear which parts of the original schema have been modified. However, you can often start with a small schema that includes TEI's and MathML's schemas, and then define a couple of magic rules like Inline.class |= VRML, which would add the VRML element to the set of element names that defines the inline elements (an example taken from DLMF's enhancement of LaTeXML's schema). It's tricky, and sometimes the original schema writers make it hard or even impossible to do cleanly, but that's the approach to try for.

And, in fact, although it is really nice to have a schema to check things against, I don't really want to embed TEI's or MathML's schema into the LaTeXML distribution.

Coming back to --cmml: hmm, yeah, LaTeXML is putting an m:mtext into the cmml, which I guess isn't right. But it really has no idea what the "semantics" are to convert it into something better... not sure what the best approach here is, yet.
Contributor

### Thanathan-k commented Nov 15, 2017 • edited

> (of a label "pent", which is nowhere defined, btw)

Sorry, I focused on the error and just deleted as much as possible. Must have missed that.

> And, in fact, although it is really nice to have a schema to check things against, I don't really want to embed TEI's or MathML's schema into the LaTeXML distribution.

So what do you suggest? Or you generally don't want to include TEI in LaTeXML?
Owner

### brucemiller commented Nov 15, 2017

On 11/15/2017 06:24 PM, Michael Kramer wrote:

> > And, in fact, although it is really nice to have a schema to check things against, I don't really want to embed TEI's or MathML's schema into the LaTeXML distribution.
>
> So what do you suggest? Or you generally don't want to include TEI in LaTeXML?

Let me start by apologizing (again) for not being on top of this issue and responding in a timely fashion... I've been rather distracted lately.

Actually, I am very interested in including TEI, and really appreciate y'all's efforts to put it together! The changes to latexmlpost (and eventually to latexmlc), as well as the xsl stylesheet, definitely should go into the distribution.

But I don't think the schema should go in, for several reasons. Firstly, there's no direct way to use it anyway (and we don't include schemas for other target formats like jats and various flavors of html). Secondly, it will inevitably get out of sync with the "master" copy of the TEI schema, particularly if we have a modified copy of it. Yes, if there are changes to TEI, we'll have to modify the stylesheet, but hopefully not have to re-invent the modified schema. And in fact, rather than us carrying around that modified schema, it would be better to lobby the TEI folks to include MathML in their master copy (hopefully dealing with the mtext issue at the same time).

Hopefully that explains my position? Seriously: thanks for your effort!
Contributor

### Thanathan-k commented Nov 16, 2017

> Let me start by apologizing (again) for not being on top of this issue and responding in a timely fashion... I've been rather distracted lately.

Don't worry about it!

So what do you suggest as next steps? As far as my testing goes, the created documents should be in line with the Grobid-TEI specification (which I focused on because it's used by our software). I could create an issue there and ask them to add MathML in their rng/rnc files.
Owner

### brucemiller commented Nov 16, 2017

Ah, I wasn't aware of Grobid-TEI; a bit of shallow googling suggests they're aware of MathML, but it isn't clear to what extent they've already accommodated it. Maybe they have an interest in pursuing it, and you (& we) can help?

As for next steps: to the extent that you're happy with the current status of the transformation, and it's close enough to be useful at least for your purposes, we can probably merge a subset of the PR; namely, MANIFEST, latexmlpost and the XSL stylesheet. I hope that if further experience with it leads to corrections or improvements, you'll be willing to submit PRs for that? Does that sound reasonable?

@dginev: any comments?
Contributor

### Thanathan-k commented Nov 16, 2017

 I'll open a ticket at their github project. Let's see if they have an interest in joining forces. Sure sounds good! Should I make a new commit with only the files you mentioned? Or will you just delete the rnc files while merging?
Owner

### brucemiller commented Nov 16, 2017

 Hmm, I'm not actually sure that I know how. In principle, I would have cherry-picked the commits that only affected the desired files, and omitted any that dealt with the schema. But it appears that a few of the commits affect both xsl & rnc, so that won't work cleanly. Probably the solution is something like pulling the whole set of commits, removing the unwanted files and then doing a "squash" (I haven't gotten fluent with squashing, yet :> ). I'm trying to avoid getting the largish schema in and polluting the history (which is already cluttered enough). Maybe I should ping @dginev again for advice. He's good at this stuff!
Collaborator

### dginev commented Nov 16, 2017 • edited

 Sounds like it is easiest to remove the files not needed, add that as a new commit to the branch of this PR, and then "squash and merge". Squash is forgiving of mistakes.
Contributor

### Thanathan-k commented Nov 17, 2017

 Okay will do that!
 Deleted RNC files for Pull Request 
 8dbe361 

### brucemiller merged commit 465cbe4 into brucemiller:master Nov 28, 2017

#### 1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Owner

### brucemiller commented Nov 28, 2017

 Great new feature; will be interesting to see use cases. Ultimately, we'll need to add some hooks in latexmlc to recognize the format, but this is a really good start. Thanks so much for your patience!
Contributor

### physikerwelt commented Nov 28, 2017

 Thank you @brucemiller! @amstart has already converted all NTCIR-11 arXiv documents. Maybe you can share some statistics on how many documents could be converted without errors.