Allow verbatim copies of XML sub-trees #9

darioteixeira · 2014-12-27T16:14:02Z

I'm parsing an XML document which contains embedded MathML. However, I don't want to touch these embedded sub-trees; instead I would just like to retrieve them verbatim (or as an equivalent minus whitespace).

In theory, this should be simple to achieve: whenever I encounter a <math> tag, I could just create a new output buffer and siphon the input directly into it until the closing tag is found. In practice, this is presently not possible with Xmlm. The problem lies in Xmlm's handling of character entity references, which makes a verbatim copy impossible. Therefore, one possible solution would be to temporarily turn off Xmlm.input's expansion of entities and Xmlm.output's automatic escaping of ampersands et al.
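A minimal sketch of the "siphon into a buffer" idea, using only the existing Xmlm API (Xmlm.peek, Xmlm.input, Xmlm.output). The names copy_subtree and first_math are hypothetical helpers, and the match on the local name "math" ignores namespaces for brevity. The copy is faithful at the signal level only: entity references inside the sub-tree are still expanded on input and re-escaped on output, which is exactly the limitation described above.

```ocaml
(* Requires the xmlm library. *)

(* Copy the element that starts at the next signal of [i] into a fresh
   buffer and return its serialization. Assumes the next signal is an
   `El_start. *)
let copy_subtree i =
  let buf = Buffer.create 256 in
  let o = Xmlm.make_output ~decl:false (`Buffer buf) in
  Xmlm.output o (`Dtd None);            (* an output sequence starts with `Dtd *)
  let rec go depth =
    let s = Xmlm.input i in
    match s with
    | `Dtd _ -> invalid_arg "copy_subtree: unexpected `Dtd"
    | `El_start _ -> Xmlm.output o s; go (depth + 1)
    | `El_end -> Xmlm.output o s; if depth > 1 then go (depth - 1)
    | `Data _ -> Xmlm.output o s; go depth
  in
  go 0;
  Buffer.contents buf

(* Skip signals until a <math> element shows up, then copy it.
   Raises Xmlm.Error if the input ends without one. *)
let first_math src =
  let i = Xmlm.make_input (`String (0, src)) in
  let rec scan () =
    match Xmlm.peek i with
    | `El_start ((_, "math"), _) -> copy_subtree i
    | _ -> ignore (Xmlm.input i); scan ()
  in
  scan ()
```

The depth counter is what finds the matching closing tag: it is incremented on every `El_start and the copy stops at the `El_end that closes the element it started on.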

Could this feature be added to Xmlm, or am I missing a more straightforward way of achieving the same goal?


dbuenzli · 2014-12-27T16:26:14Z

Doesn't feel like an average use case, and I don't see what the problem is with just parsing the MathML and re-outputting it as is. I don't know what you mean by verbatim, but from an XML information set point of view you won't lose any data in doing so.

dbuenzli · 2014-12-27T16:50:19Z

Ah ok, you'd lose the entities as specified originally. I'm reluctant to add that kind of ad hoc option to the current xmlm, which is already a horrendous and tricky state machine for handling whitespace and the bastardized notion of lexing that XML is. It is a fact that xmlm is ill suited for writing XML filters (e.g. processing instructions are dropped); that was supposed to be fixed in a hypothetical xmlm 2.0.0 that never materialized.

That being said, most tricky bits are on input, but no change is needed on input for your problem: you can simply have the entity function return an entity reference for MathML entities. Now the problem is that the & of these entities will be escaped on output. Having a flag to temporarily disable this escaping on output shouldn't be too tricky. What do you think?

darioteixeira · 2014-12-27T17:00:54Z

> That being said, most tricky bits are on input, but no change is needed on input for your problem: you can simply have the entity function return an entity reference for MathML entities. Now the problem is that the & of these entities will be escaped on output. Having a flag to temporarily disable this escaping on output shouldn't be too tricky. What do you think?

Yes, as a minimally intrusive workaround, it would suffice if Xmlm.output had an option for not automatically escaping ampersands et al. That way I could declare the entity transformation function to simply re-output the entities: let input = Xmlm.make_input ~entity:(fun x -> Some ("&" ^ x ^ ";")) ...

(This was in fact my first approach; only then did I notice that the output was always escaped...)
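To illustrate why this first approach falls short, here is a small sketch that round-trips a document through Xmlm with the entity callback quoted above. The helper name roundtrip is hypothetical; the Xmlm calls are the library's existing ones.

```ocaml
(* Requires the xmlm library. *)

(* Read [src] and write every signal back out unchanged. Entity
   references unknown to Xmlm are handed to the ~entity callback,
   which returns them as the literal text "&name;". *)
let roundtrip src =
  let buf = Buffer.create 64 in
  let i =
    Xmlm.make_input ~entity:(fun x -> Some ("&" ^ x ^ ";")) (`String (0, src))
  in
  let o = Xmlm.make_output ~decl:false (`Buffer buf) in
  while not (Xmlm.eoi i) do
    Xmlm.output o (Xmlm.input i)
  done;
  Buffer.contents buf
```

Feeding it "<math>&alpha;</math>" gives back "<math>&amp;alpha;</math>": the callback's replacement text arrives as ordinary character data, so Xmlm.output escapes its ampersand like any other.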

darioteixeira · 2014-12-27T17:06:39Z

Just to clarify: the option controlling automatic escaping could be a parameter of Xmlm.make_output, not Xmlm.output. The former is less flexible, but fits better with the existing API (either is fine for my purposes).

dbuenzli · 2014-12-27T17:17:51Z

Mmmh there would be a problem with & in the original input.
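A short sketch of that problem, under the same entity callback as suggested earlier in the thread (data_of is a hypothetical helper): an escaped ampersand in the source and a real entity reference end up as the same `Data string, so an output that skipped escaping could not tell which one to write back.

```ocaml
(* Requires the xmlm library. *)

(* Return the first `Data signal of [src], with unknown entity
   references turned back into "&name;" text by the callback. *)
let data_of src =
  let i =
    Xmlm.make_input ~entity:(fun x -> Some ("&" ^ x ^ ";")) (`String (0, src))
  in
  let rec find () =
    match Xmlm.input i with
    | `Data d -> d
    | _ -> find ()
  in
  find ()

(* A real entity reference and an escaped ampersand followed by text. *)
let from_entity = data_of "<m>&alpha;</m>"
let from_escaped = data_of "<m>&amp;alpha;</m>"

(* Both are the same string once parsed. *)
let ambiguous = (from_entity = from_escaped)
```

With escaping disabled on output, the second document would be written as "<m>&alpha;</m>", silently turning literal text into an entity reference.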

darioteixeira · 2015-01-05T19:50:33Z

> Mmmh there would be a problem with & in the original input.

Yeah. I reckon the issue may be solvable, but not in an entirely backwards compatible way.

Suppose a new variant Entity were added to the type signal. By default, Xmlm.input would perform entity expansion, and therefore this particular signal would never occur. For that to happen, a special option must be provided to Xmlm.make_input. As for Xmlm.output, it would always accept Entity, of course.

This approach has the advantage that existing code would still run fine (though it would require tweaking to remove the compiler warning about the unhandled variant). What do you think?
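A rough, self-contained sketch of how such a variant could behave on the output side. This is a simplified model, not Xmlm's actual signal type or API: the signal type below, escape and serialize are all hypothetical, tags are reduced to plain names, and attributes and the `Dtd signal are left out.

```ocaml
(* Hypothetical, simplified signal type with the proposed `Entity case. *)
type signal =
  [ `El_start of string | `El_end | `Data of string | `Entity of string ]

(* Escape the characters that must not appear raw in character data. *)
let escape s =
  let b = Buffer.create (String.length s) in
  String.iter
    (function
      | '&' -> Buffer.add_string b "&amp;"
      | '<' -> Buffer.add_string b "&lt;"
      | '>' -> Buffer.add_string b "&gt;"
      | c -> Buffer.add_char b c)
    s;
  Buffer.contents b

(* Serialize a signal list: `Data is escaped, `Entity is written back
   as a reference, untouched. The stack tracks open element names. *)
let serialize (signals : signal list) =
  let b = Buffer.create 64 in
  let step stack = function
    | `El_start name -> Buffer.add_string b ("<" ^ name ^ ">"); name :: stack
    | `El_end ->
        (match stack with
         | name :: rest -> Buffer.add_string b ("</" ^ name ^ ">"); rest
         | [] -> invalid_arg "serialize: unbalanced `El_end")
    | `Data d -> Buffer.add_string b (escape d); stack
    | `Entity e -> Buffer.add_string b ("&" ^ e ^ ";"); stack
  in
  ignore (List.fold_left step [] signals);
  Buffer.contents b

let example =
  serialize [ `El_start "math"; `Entity "alpha"; `Data "a & b"; `El_end ]
(* example = "<math>&alpha;a &amp; b</math>" *)
```

Because entity references and character data travel as distinct signals, a literal ampersand in the data stays escaped while the reference is reproduced as written, which avoids the ambiguity raised above.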

dbuenzli · 2015-01-08T10:31:24Z

Not very fond of that solution. If we were to break compatibility, I think it would be better to provide full support for writing XML filters and solve that problem along the way through an uncut codec like in Jsonm; this would include reporting comments and processing instructions. But this is a non-trivial amount of work. Did you maybe investigate PXP?

darioteixeira · 2015-01-08T15:45:22Z

Yes, the present implementation uses PXP in fact, and I was investigating Xmlm as a possible lighter-weight alternative. But anyway, I understand not wanting to break backwards compatibility (even if it's only because of an easily fixed warning). Feel free to close this ticket or stash it away as a "features to have for 2.0"...

dbuenzli · 2017-03-15T22:03:27Z

This won't happen anytime soon so I'm closing.

darioteixeira mentioned this issue Jan 13, 2015

Consider switching to a lighter-weight XML parser for Lambxml darioteixeira/lambdoc#27

Closed

dbuenzli closed this as completed Mar 15, 2017
