[feature] First iteration of TEI-implementation for LaTeXML. #871
…ML-jats.xsl Stylesheet and only changed currently necessary parts.
Is there a recommended TEI validation tool, or RELAX NG schema, or...? |
+1 to Bruce's comment. If there is a validator/schema, if you ( @Thanathan-k ) can add a test to latexml's post-processing suite, it would be a killer PR. Would make it 10x easier to merge I think. |
Currently trying to get a validator in the form of an .rng file. However it seems complicated to use the MathML specification even though it should work with Roma... Do any of you already have some experience with it? |
The validation tool I used was jing (https://code.google.com/archive/p/jing-trang/). My document used references within formulas. To my knowledge this isn't allowed in MathML. However, I was not able to handle this in the XSL file, because the formula generation seems to be done completely by LaTeXML itself. Is my assumption correct? And if so, how should cases with references within formulas be handled? |
I am quite sure LaTeXML used to verify its MathML was valid, I think back in the days of outputting XHTML 4, but @brucemiller may have also done this for the HTML5 output. He would certainly know. |
for example formula group
@dginev I think the problem is that the mtext element is hard to validate. In the stylesheet, @Thanathan-k set up explicit rules to validate the mtext element, which makes sense according to https://www.w3.org/TR/MathML3/chapter3.html#presm.mtext , https://www.w3.org/TR/MathML3/appendixa.html#parsing_mtext . However, it is also reasonable to treat the mtext element like the outer document format and to allow other tags inside it. On the MathML website I was looking at the mtext definition |
One option could be to allow all valid TEI text in mtext, as described in https://www.w3.org/TR/MathML3/chapter6.html#interf.html for MathML in HTML5. |
Works better than rng (included Grobid and Mathml3)
I've added a new validator file. It works with jing -c LaTeXML-tei.rnc myfile. To recreate the error we were talking about, use the .tex file of https://arxiv.org/format/1708.08473 (it's too large to create a gist, sorry). However, I've found even more problems. If we use latexmlpost with the additional --cmml flag, the created content MathML has even more errors. For example: I've checked the corresponding rnc files provided by MathML and the errors are indeed correct and not due to any rnc mashup I've done. |
@Thanathan-k the second problem could also be an independent bug in LaTeXML. Can you identify the LaTeX source formula that caused that problem? Maybe there is just a wrapper missing that would wrap the mtext element in a standards-conforming way in the content branch. |
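The jing invocation in the comment above can be wrapped in a small script when many converted documents need checking. This is only a minimal sketch: it assumes jing is installed and on the PATH, that it exits with a non-zero status when validation fails, and the helper names (jing_command, validate) and the document name paper.tei.xml are hypothetical and not part of the PR.

```python
import shutil
import subprocess
from typing import List, Tuple


def jing_command(schema: str, document: str) -> List[str]:
    """Build the argument list for jing.

    The -c flag tells jing that the schema uses RELAX NG compact syntax,
    so it is only added for .rnc files.
    """
    cmd = ["jing"]
    if schema.endswith(".rnc"):
        cmd.append("-c")
    cmd += [schema, document]
    return cmd


def validate(schema: str, document: str) -> Tuple[bool, str]:
    """Run jing and return (is_valid, combined output).

    Assumption: a non-zero exit status means the document is invalid
    (or jing could not read one of the files).
    """
    if shutil.which("jing") is None:
        raise RuntimeError("jing was not found on PATH")
    result = subprocess.run(
        jing_command(schema, document),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, result.stdout + result.stderr


if __name__ == "__main__":
    # Hypothetical file names, shown only to illustrate the call.
    print(" ".join(jing_command("LaTeXML-tei.rnc", "paper.tei.xml")))
```

Keeping the command construction separate from the subprocess call makes it easy to check the arguments without having jing installed.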
RNC file didn't allow rows as part of tables and theorem/proof are now handled as notes
Affiliation and AddrLine corrected
I haven't responded much, since I have no familiarity with TEI, and since patches kept coming, I figured to wait for the dust to settle... but maybe it has? On the issue of validation: "pure" MathML allows very limited elements within In @Thanathan-k's September 13 comment, the first error appears to be due to putting ref within mtext. That's exactly the kind of inline that is not allowed by "pure" MathML, but should be allowed by the embedded MathML (imho). The second error involving cmml, I can't quite interpret; perhaps it's the same problem? |
Oh: on that second error: if it is indeed a distinct latexml error, it should be reproducible with simple html w/o the complications (and unknowns, for me) of TEI. A small sample producing the error would be helpful. |
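To locate the formulas that trigger this kind of error before running a full schema validation, one could scan the output for element children of mtext. The sketch below is an illustration only: it assumes that "pure" MathML 3 permits just mglyph and malignmark as element children of mtext, the function name is hypothetical, and the sample document is hand-written rather than actual output of the stylesheet in this PR.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"

# Assumption: these are the only element children that "pure" MathML 3
# allows inside a token element such as mtext.
ALLOWED_IN_MTEXT = {
    f"{{{MATHML_NS}}}mglyph",
    f"{{{MATHML_NS}}}malignmark",
}


def foreign_children_of_mtext(xml_text):
    """Return the local names of elements inside mtext that pure MathML rejects."""
    root = ET.fromstring(xml_text)
    hits = []
    for mtext in root.iter(f"{{{MATHML_NS}}}mtext"):
        for child in mtext:
            if child.tag not in ALLOWED_IN_MTEXT:
                # Strip the "{namespace}" prefix that ElementTree adds to tags.
                hits.append(child.tag.rsplit("}", 1)[-1])
    return hits


# Hand-written sample (hypothetical): a TEI ref embedded in mtext.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"
     xmlns:m="http://www.w3.org/1998/Math/MathML">
  <formula>
    <m:math>
      <m:mtext>see <ref target="#eq1">(1)</ref></m:mtext>
      <m:mi>x</m:mi>
    </m:math>
  </formula>
</TEI>"""

if __name__ == "__main__":
    print(foreign_children_of_mtext(SAMPLE))
```

A check like this only reports where host-format inlines occur; whether they should be accepted is a schema decision, as discussed in the comment above.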
Hey @brucemiller , Yeah I worked on it and used more documents from ArXiv to check if everything works. Currently everything looks fine, but I wanted to check some more samples and fix everything before commenting here again. I'll also prepare a sample which produces the cmml error I was talking about. |
Okay I've checked the script again against more ArXiv documents and everything looks good currently. (Except for the errors within MathML). Here's a tex document producing both the ref and m:text error after using latexml and latexmlpost with my tei extension: The exact commands were: https://gist.github.com/Thanathan-k/ae812977da44a16cb5f8cf5f507b3445 |
Due to the external I was not able yet to convert the rnc to rng. Will work on that. If it's possible to get rid of the mathml3.rnc in favor of LaTeXML-math.rnc, that would be great as well.
OK, that's a nice small test case; Just running it with pmml, first. it includes Actually, it's a bit worse than that: you don't really want to just go in and edit the TEI & MathML schemas to change them, if at all possible. That's very difficult to maintain, as it isn't clear which However: Often, you can start with a small schema that
which would add the VRML element to a set of element names that defines the inline And, in fact, although it is really nice to have a schema to check things against, I don't really want to embed TEI's or MathML's schema into the LaTeXML distribution. Coming back to |
Sorry I focused on the error and just deleted as much as possible. Must have missed that.
So what do you suggest? Or you generally don't want to include TEI in LaTeXML? |
On 11/15/2017 06:24 PM, Michael Kramer wrote:
And, in fact, although it is really nice to have a schema to check
things against, I don't really want to embed TEI's or MathML's
schema into the LaTeXML distribution.
So what do you suggest? Or you generally don't want to include TEI in
LaTeXML?
Let me start by apologizing (again) for not being on top
of this issue and responding in a timely fashion... I've been
rather distracted lately.
Actually, I am very interested in including TEI, and really
appreciate y'alls efforts to put it together!
The changes to latexmlpost (and eventually to latexmlc), as
well as the xsl stylesheet definitely should go into the
distribution. But I don't think the schema should go in
for several reasons. Firstly, there's no direct way to use it
anyway (and we don't include schema for other target formats
like jats and various flavors of html). Secondly, it will
inevitably get out-of-sync with the "master" copy of the tei
schema, particularly if we have a modified copy of it. Yes,
if there are changes to TEI, we'll have to modify the stylesheet,
but hopefully not have to re-invent the modified schema.
And in fact, rather than us carry around that modified schema,
it would be better to lobby the TEI folks to include MathML in
their master copy (hopefully dealing with the mtext issue at the
same time).
Hopefully that explains my position?
Seriously: thanks for your effort!
|
Don't worry about it! So what do you suggest as next steps? As far as my testing goes, the created documents should be in line with the Grobid-TEI specification (which I focused on because it's used by our software). I could create an issue there and ask them to add MathML in their rng/rnc files. |
Ah, I wasn't aware of Grobid-TEI; a bit of shallow googling suggests they're aware of MathML, but it isn't clear to what extent they've already accommodated it. Maybe they have an interest in pursuing it, and you (& we) can help? As for next steps: to the extent that you're happy with the current status of the transformation, and it's close enough to be useful at least for your purposes, we can probably merge a subset of the PR; namely, MANIFEST, latexmlpost and the XSL stylesheet. I hope that if further experience with it leads to corrections or improvements, you'll be willing to submit PRs for that? Does that sound reasonable? |
I'll open a ticket at their github project. Let's see if they have an interest in joining forces. Sure sounds good! Should I make a new commit with only the files you mentioned? Or will you just delete the rnc files while merging? |
Hmm, I'm not actually sure that I know how. In principle, I would have cherry-picked the commits that only affected the desired files, and omitted any that dealt with the schema. But it appears that a few of the commits affect both xsl & rnc, so that won't work cleanly. Probably the solution is something like pulling the whole set of commits, removing the unwanted files and then doing a "squash" (I haven't gotten fluent with squashing, yet :> ). I'm trying to avoid getting the largish schema in, and polluting the history (which is already cluttered enough). Maybe I should ping @dginev again for advice. He's good at this stuff! |
Sounds like it is easiest to remove the files not needed, add that as a new commit to the branch of this PR, and then "squash and merge". Squash is forgiving of mistakes. |
Okay will do that! |
Great new feature; will be interesting to see use cases. |
Thank you @brucemiller! |
Based on the LaTeXML-jats.xsl Stylesheet and only changed currently necessary parts. Testing and feedback would be appreciated.