Generated XHTML fails (a few) validity checks #1440
Comments
For reference, I looked at the
I guess that the proper fix is to use [1] https://html.spec.whatwg.org/multipage/semantics.html#attr-meta-http-equiv-content-type
I have a couple of XSLT scripts which fix up the output produced by v0.8.5, and which generate EPUB OPF and NAV files from that XHTML, which are such that I can get a no-warnings score from the DAISY validator (thus warranting a WCAG 2.0 AA conformance claim). This is using latexml+latexmlpost to generate the XHTML, rather than the EPUB mode of latexmlc (I've nothing against latexmlc, but the former is the toolchain I've been working with; see also issue #1441). One or two of the things I fix up have, I see from above, been already addressed by other commits. Would it be useful to post the current versions of these scripts here? I doubt they would be directly useful in LaTeXML code, but might be useful for reference. Or would it be useful to create a feature request 'produce WCAG-conforming EPUB output'? |
I think both would be at least partially helpful. If we can reuse/get inspired by your upgrades to have latexml start producing WCAG-conforming epub, that's great. And then if we had a way to test that the epub generation continues to be conformant, that's also great, so that we don't lose the newly added upgrade. And yes, thank you for all the reports and issues, much appreciated - it is often the case that we end up polishing mostly the bits we use ourselves (HTML5), so it's good to get someone with a keen eye on epub to stress-test what is generated.
Seconding what @dginev said: It would be nice to see the scripts you've used. We'll probably want to adapt and incorporate them: We certainly would prefer to produce valid ePub out of the box (while still allowing customization). And of course, still be able to produce valid, natural xhtml --- When I pondered this Issue before, I was slightly stuck on whether a separate ePub flag was needed in the XSLT, or whether it could be finessed from other flags, as @xworld21 suggested. |
Done! I've attached a zip file containing the kit which I've circulated to one or two colleagues: latex-epub-kit-2021-08-12.tar.gz; it has had very minimal usage by anyone other than me. See the README for details. This works with LaTeXML 0.8.5, and is probably very version-specific. The XHTML here is purely the output of Here, the metadata that appears in the OPF file is obtained from an XMP file which is generated as step one. I found this a more reliable way of getting such information from the source to the OPF, than trying to smuggle it through the The NAV information in the EPUB is scavenged from the structure of the generated XHTML files, so is probably quite sensitive to the details of that serialisation. That obviously wouldn't be an issue if you were generating both at the same time. I hope this is useful. |
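To illustrate what "scavenging" NAV information from generated XHTML can look like, here is a minimal sketch in Python. It is not the XSLT from the attached kit. It assumes that headings are `h1`/`h2` elements carrying an `id` attribute, which may not match what LaTeXML actually emits, and the function name `build_nav` and the file name `ch1.xhtml` are hypothetical. It produces a flat list only, whereas a real EPUB NAV would normally be nested.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
EPUB_NS = "http://www.idpf.org/2007/ops"


def build_nav(xhtml_text, href):
    """Build a flat <nav epub:type="toc"> element from the headings of one XHTML file.

    Assumption: only h1/h2 headings that carry an id attribute are linkable.
    """
    root = ET.fromstring(xhtml_text)
    heading_tags = {"{%s}h1" % XHTML_NS, "{%s}h2" % XHTML_NS}
    nav = ET.Element("{%s}nav" % XHTML_NS, {"{%s}type" % EPUB_NS: "toc"})
    ol = ET.SubElement(nav, "{%s}ol" % XHTML_NS)
    for el in root.iter():
        if el.tag in heading_tags and el.get("id"):
            li = ET.SubElement(ol, "{%s}li" % XHTML_NS)
            a = ET.SubElement(li, "{%s}a" % XHTML_NS,
                              {"href": href + "#" + el.get("id")})
            # Flatten any inline markup in the heading to plain text.
            a.text = "".join(el.itertext()).strip()
    return nav


NAV_SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<h1 id="S1">Intro</h1><h2>No id</h2>'
    '<h2 id="S1.SS1">Scope <em>and</em> aims</h2>'
    '</body></html>'
)
```

Calling `build_nav(NAV_SAMPLE, "ch1.xhtml")` returns an element with two links; the heading without an `id` is skipped, which shows how sensitive this kind of approach is to the details of the serialisation.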
I should add – as a postscript, and in case it's not obvious from what I've written above – that I think the current LaTeXML XHTML output is very close to what's required for EPUB. The sanitisation steps are merely addressing buglets (or nearly so), and the NAV script is generating something it's very simple for LaTeXML to do (and may do already, in the The OPF step is probably the step that would require (and that required for me) a certain quantum of thought. You of course will have thought through some of these points when producing OPF files already. As specific recommendations:
Both of these points, as well as the need for the author to provide some document unique ID, such as a UUID, would seem to me to point towards some sort of simple I'll also point out that the DAISY conformance checker includes a command-line tool which can be invoked at the end of a toolchain to verify that all the checks have passed. And parenthetically (and I'm sure gratuitously), I'll note that although I'm here slightly obsessing over the accessibility features of EPUB, that's far from the only function of the EPUB format. As a final observation, I'll note that although my document ticks all of the DAISY boxes, and has had a ‘looks OK to me’ from my institution's accessibility coordinator, I haven't yet managed to get any feedback from an actual student user dependent on the assistive elements of the materials. I hope to nag the relevant office to help me out there before long, but to repurpose Knuth's famous remark: ‘Beware of bugs in the above code; I have only proved it correct, not tried it.’ |
As I'm reading the detailed comments of @nxg above, let me go on the record that you've triggered some red alerts to my own developer preferences. In particular, the mention of XMP strikes me as a completely unnecessary complication of our current software stack. I tend to be the dev who keeps harping on that we need more simplicity in latexml, both user-facing and in the implementation internals. The risk is really high to make the project unmaintainable due to compounding effects on complexity, unless we keep that in check. So, while I will consult the code you've written, I will do my best to avoid any novelty introductions to the epub or xhtml generation for as long as I can. Ideally we make some of the current code more reusable across latexml (e.g. manifest-related bookkeeping), and factor out enough until someone can jump in, read the code, and contribute a PR in a matter of minutes. This is just to set expectations for the PRs to come from my side on this issue. I've been quite fond of the lightweight approach taken by @xworld21's PRs, and will try to keep things in that spirit.
Much gratitude goes to @xworld21 who has basically solved this issue entirely on his own effort. Very impressive! I would appreciate an example from @nxg on the issue of both Hence, most of the issue fixes will land already for 0.8.6 thanks to Vincenzo, and the tabular pieces will have to wait for 0.8.7. Thanks again to everyone for the contributions here! |
After a bit of digging, and going back to my source revision where I added my fixup, I don't think I can now generate an example of You can be sure that I'll be able to detect and report any re-emergence, though! And yes, it seems likely that this was the result of a complicated equation array, both because there were plenty of these in my source, and because I dimly remember that sort of context when I was looking at this. |
I think mention of XMP should have anyone reaching for their red flags, including me. I'm not sure what's written on your red flags, @dginev, but I'll suggest that one or two of them might potentially be lowered. The following is a rather ruminative comment – apologies for its length. First (and just in case it's not clear) I don't anticipate that anything in my attachment above would make it into LaTeXML – it's there just to illustrate what I did to get to my intended destination. Second, I know you're familiar with the details here – I'm setting them out this way partly to organise my own thoughts. I think there are three strands here.
It's really only strand 1 that matches the nominal title of this issue; it occurred to me to create feature requests touching on the others, and I could happily do that if that would be useful. Strands 1 and 2 are, I think, very loosely coupled, and keeping them decoupled seems useful, in an architectural sense. There is some mild coupling in that the process of producing XHTML may or may not leave behind, or carry through from the source document, the extra information required to make strand 2 easy. Because the best practices of Strand 3 touches on both strands 1 and 2, it provides some mild coupling. Remark: I'm now certain that I'm going to continue using LaTeXML as my preferred route for generating XHTML+MathML from LaTeX. In practice, however, I'm probably going to continue using my own code to bundle that XHTML into EPUB (for various reasons). Thus for me, the clear blue water between strands 1 and 2 is both natural and valuable. And thus any side-products that make that bundling easy (containing eg TOCs or aggregations of metadata) are of particular interest to me. So where does XMP come in? Having spent a fair amount of time with XMP and with RDF, I know that XMP is hard to love. It looks ugly, it's fiddly to write, and the XMP spec is in my opinion very poorly written. It does have some advantages, though, which bear on strands 2 and 3.
I don't think that XMP is necessary anywhere in this process. However, when I was doing the XHTML-to-WCAG-EPUB step, it seemed to solve a lot of problems at once.
There are other ways to do each of these things, of course, including for example That is, in my eyes, factoring out all of the project metadata into a single XMP-shaped blob, whether it's put there by the author or by another part of the Makefile, is itself a simplification (admittedly, I have spent a fair amount of time with RDF, so have a fairly high pain threshold where that's concerned). I'll also mention that JSON-LD is part of the family of related formats which are potentially convertible to and from XMP, so that (and I haven't thought about this in any detail) it wouldn't be unreasonable to gather metadata as JSON – potentially a more convenient format – and manipulate it that way. Having arrived there, dumping some XMP to stuff into the OPF becomes merely a final party-trick. All that said (at length), I'm not here to be dogmatic about your design of your code! |
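As a rough illustration of the "gather metadata as JSON" idea, the following Python sketch turns a small JSON blob into an OPF `<metadata>` element using Dublin Core elements. It skips XMP entirely and is not taken from the attached kit; the JSON keys, the function name `opf_metadata`, and the sample values are all hypothetical choices made for the example.

```python
import json
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
OPF_NS = "http://www.idpf.org/2007/opf"

# Hypothetical convention: JSON keys are named after the Dublin Core elements.
DC_KEYS = ("identifier", "title", "language", "creator")


def opf_metadata(json_text):
    """Return an OPF <metadata> element built from a JSON object.

    Each value may be a string or a list of strings (e.g. several creators).
    """
    data = json.loads(json_text)
    metadata = ET.Element("{%s}metadata" % OPF_NS)
    for key in DC_KEYS:
        values = data.get(key, [])
        if isinstance(values, str):
            values = [values]
        for value in values:
            child = ET.SubElement(metadata, "{%s}%s" % (DC_NS, key))
            child.text = value
    return metadata


META_SAMPLE = json.dumps({
    "identifier": "urn:uuid:00000000-0000-0000-0000-000000000000",  # placeholder ID
    "title": "Lecture notes",
    "language": "en",
    "creator": ["A. Author", "B. Author"],
})
```

The point of the sketch is only that, once the metadata sits in one place, serialising it into the OPF is a small mechanical step; whether that one place is XMP, JSON, or something else is a separate design decision.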
At the very least I genuinely thank you for the detailed examination of the question in the comments here @nxg, truly appreciated. I think both Bruce and I can take some time to carefully consider what kinds of upgrades are worth investing time and maintenance into, and which directions reap the most benefits for effort invested. There are indeed some existing solutions in latexml that can be made to evolve in various directions, and the ePub support has plenty of room to grow in sophistication... Ideally, however, we can get a lot on the generation side with as little technical investment as possible. In my experience metadata-related bits can be kept quite compact most of the time, but as usual the devil is in the details.
So, I'm thinking that all the subissues here have been addressed, along with giving us some thoughts for future directions. If you find examples that fail, please open a new issue with a minimal test case and we'll look into it. Thanks for the report, and ideas!
The XHTML generated by LaTeXML passes a large fraction of the checks provided by the (very thorough) W3C EPUB validator, but not all of them. The attached stylesheet (see internal comments for notes and rationale) normalises the XHTML so that it passes these checks.
The stylesheet also reworks footnotes into a form which is EPUB-friendly and which is, more generally, the format recommended by the DAISY accessibility consortium; I think it is also implicitly recommended by the XHTML specification (though it's hard to pin down a specific location within the (X)HTML(5) sprawl).
The failures are:
- <object> elements with an @alt attribute (the fallback content should be element-content).
- <tbody> and <tr> as children of <table>.
- <meta> content-type indicating XHTML rather than HTML.
The third one is a specifically EPUB issue; the first two are XHTML validity issues. Though this is using an EPUB validator – because in my particular case I'm generating XHTML en route to EPUB – I'm reporting XHTML validity issues which will very probably be relevant for the xhtml output as well. The validity errors are also reported by Emacs nxml-mode, so, again, this isn't an EPUB-specific issue.
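The three failures can be detected mechanically. The Python sketch below is an illustration only, not the attached stylesheet: it reports the problems rather than fixing them. It assumes that the second failure means a `<table>` with both `<tbody>` and `<tr>` as direct children, and that the third shows up as a `<meta http-equiv="Content-Type">` whose content mentions "xhtml". The function name and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"


def find_validity_problems(xhtml_text):
    """Return one short message per element matching the three reported failures."""
    root = ET.fromstring(xhtml_text)
    problems = []
    # 1. <object> with an alt attribute (fallback should be element content).
    for obj in root.iter(XHTML + "object"):
        if "alt" in obj.attrib:
            problems.append("object has an alt attribute")
    # 2. <table> with both <tbody> and <tr> as direct children (assumed reading).
    for table in root.iter(XHTML + "table"):
        child_tags = {child.tag for child in table}
        if XHTML + "tbody" in child_tags and XHTML + "tr" in child_tags:
            problems.append("table mixes tbody and tr children")
    # 3. <meta> content-type declaring XHTML rather than HTML.
    for meta in root.iter(XHTML + "meta"):
        if meta.get("http-equiv", "").lower() == "content-type":
            if "xhtml" in meta.get("content", "").lower():
                problems.append("meta content-type indicates XHTML")
    return problems


SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><meta http-equiv="Content-Type" content="application/xhtml+xml; charset=UTF-8"/></head>
<body>
<object data="f.svg" alt="A figure"/>
<table><tbody><tr><td>a</td></tr></tbody><tr><td>b</td></tr></table>
</body></html>"""
```

Running `find_validity_problems(SAMPLE)` yields one message for each of the three failure classes. A check like this could sit in a test suite to catch any re-emergence, though a full validator remains the authoritative test.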
Stylesheet: sanitise-xhtml.xslt.gz