Allow verbatim copies of XML sub-trees #9
Comments
This doesn't feel like an average use case, and I don't see what the problem is with just parsing the MathML and re-outputting it as is. I don't know what you mean by verbatim, but from an XML information set point of view you won't lose any data in doing so.
Ah, OK, you'd lose the entities as originally specified. I'm reluctant to add that kind of ad hoc option to the current xmlm, which is already a horrendous and tricky state machine for handling whitespace and the bastardized notion of lexing that XML is. It is a fact that xmlm is ill-suited to writing XML filters (e.g. PIs are dropped); that was supposed to be fixed in a hypothetical xmlm 2.0.0 that never materialized. That said, most of the tricky bits are on input, and no change is needed on input for your problem: you can simply have the entity function return an entity reference for MathML entities. The problem is then that the & of these entities will be escaped on output. Having a flag to temporarily disable this escaping on output shouldn't be too tricky. What do you think?
Yes, as a minimally intrusive workaround, it would suffice if Xmlm.output had an option for not automatically escaping ampersands et al. That way I could declare the entity transformation function to simply re-output the entities: (This was in fact my first approach; only then did I notice that the output was always escaped...)
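The snippet from this comment did not survive, so the following is only a minimal sketch of the approach it describes, not the original code. It assumes the xmlm library is installed; `keep_reference` and `roundtrip` are hypothetical names introduced for illustration. The `?entity` callback of `Xmlm.make_input` hands each non-predefined entity back as its own reference text, and the round trip shows how `Xmlm.output` then escapes the leading ampersand.

```ocaml
(* Requires the xmlm library (for example via `opam install xmlm`). *)

(* Hypothetical helper: resolve every non-predefined entity to its own
   reference text, e.g. the name "alpha" becomes the string "&alpha;". *)
let keep_reference name = Some ("&" ^ name ^ ";")

(* Hypothetical helper: parse [src] with [keep_reference] and write every
   signal straight back out into a buffer. *)
let roundtrip (src : string) : string =
  let i = Xmlm.make_input ~entity:keep_reference (`String (0, src)) in
  let b = Buffer.create 64 in
  let o = Xmlm.make_output ~decl:false (`Buffer b) in
  let rec loop depth =
    let s = Xmlm.input i in
    Xmlm.output o s;
    match s with
    | `El_start _ -> loop (depth + 1)
    (* Stop once the root element has been closed. *)
    | `El_end -> if depth > 1 then loop (depth - 1)
    | `Dtd _ | `Data _ -> loop depth
  in
  loop 0;
  Buffer.contents b

let () =
  (* The "&" of the re-emitted reference is escaped on output. *)
  print_endline (roundtrip "<m>&alpha;</m>")
```

The input side needs no change here; only the escaping on output stands in the way, which is why the discussion turns to an output-side option.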
Just to clarify: the option controlling automatic escaping could be a parameter of Xmlm.make_output, not Xmlm.output. The former is less flexible, but fits better with the existing API (either is fine for my purposes).
Mmmh there would be a problem with
Yeah. I reckon the issue may be solvable, but not in an entirely backwards compatible way. Suppose a new variant This approach has the advantage that existing code would still run fine (though it would require tweaking to remove the compiler warning about the unhandled variant). What do you think?
I'm not very fond of that solution. If we were to break compatibility, I think it would be better to provide full support for writing XML filters and solve that problem along the way through an uncut codec like in Jsonm; this would include reporting comments and processing instructions. But this is a non-trivial amount of work. Did you maybe investigate PXP?
Yes, the present implementation uses PXP in fact, and I was investigating Xmlm as a possible lighter-weight alternative. But anyway, I understand not wanting to break backwards compatibility (even if it's only because of an easily fixed warning). Feel free to close this ticket or stash it away as a "features to have for 2.0"...
This won't happen anytime soon so I'm closing.
I'm parsing an XML document which contains embedded MathML. However, I don't want to touch these embedded sub-trees; instead I would just like to retrieve them verbatim (or as an equivalent minus whitespace).
In theory, this should be simple to achieve: whenever I encounter a <math> tag, I could just create a new buffer output and siphon the input directly into it until the closing tag is found. In practice, this is presently not possible with Xmlm. The problem lies in Xmlm's handling of character entity references, which makes a verbatim copy impossible. Therefore, one possible solution would be to temporarily turn off Xmlm.input's expansion of entities and Xmlm.output's automatic escaping of ampersands et al.
Could this feature be added to Xmlm, or am I missing a more straightforward way of achieving the same goal?
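For reference, the "siphon into a buffer" idea can be sketched with the existing signal API, assuming the xmlm library is installed. `copy_subtree` and `first_math` are hypothetical names, not part of Xmlm. The copy is equivalent at the information set level only: character references come back already expanded, and named entities such as `&alpha;` would additionally need an `?entity` callback on input. The sketch also assumes any namespace used in the sub-tree is declared inside it.

```ocaml
(* Requires the xmlm library (for example via `opam install xmlm`). *)

(* Hypothetical helper: copy the element whose `El_start signal [start]
   was just read from [i] into a fresh buffer and return it serialized. *)
let copy_subtree (i : Xmlm.input) (start : Xmlm.signal) : string =
  let b = Buffer.create 256 in
  let o = Xmlm.make_output ~decl:false (`Buffer b) in
  (* Xmlm.output expects a `Dtd signal before the root element. *)
  Xmlm.output o (`Dtd None);
  Xmlm.output o start;
  let rec loop depth =
    if depth > 0 then begin
      let s = Xmlm.input i in
      Xmlm.output o s;
      match s with
      | `El_start _ -> loop (depth + 1)
      | `El_end -> loop (depth - 1)
      | `Data _ | `Dtd _ -> loop depth
    end
  in
  (* Depth 1: the start tag has been seen, its end tag has not. *)
  loop 1;
  Buffer.contents b

(* Hypothetical driver: return the first <math> sub-tree of [src], if any. *)
let first_math (src : string) : string option =
  let i = Xmlm.make_input (`String (0, src)) in
  let rec find depth =
    match Xmlm.input i with
    | `El_start ((_, "math"), _) as s -> Some (copy_subtree i s)
    | `El_start _ -> find (depth + 1)
    | `El_end -> if depth > 1 then find (depth - 1) else None
    | `Dtd _ | `Data _ -> find depth
  in
  find 0
```

Counting depth rather than comparing tag names keeps the loop correct when `<math>` elements are nested.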