Allow verbatim copies of XML sub-trees #9

Closed
darioteixeira opened this issue Dec 27, 2014 · 9 comments

Comments

@darioteixeira

I'm parsing an XML document which contains embedded MathML. However, I don't want to touch these embedded sub-trees; instead I would just like to retrieve them verbatim (or as an equivalent minus whitespace).

In theory, this should be simple to achieve: whenever I encounter a <math> tag, I could just create a new output buffer and siphon the input directly into it until the closing tag is found. In practice, this is presently not possible with Xmlm. The problem lies in Xmlm's handling of character entity references, which makes a verbatim copy impossible. Therefore, one possible solution would be to temporarily turn off Xmlm.input's expansion of entities and Xmlm.output's automatic escaping of ampersands et al.

Could this feature be added to Xmlm, or am I missing a more straightforward way of achieving the same goal?

@dbuenzli
Owner

This doesn't feel like an average use case, and I don't see what the problem is with just parsing the MathML and re-outputting it as is. I don't know what you mean by verbatim, but from an XML information set point of view you won't lose any data in doing so.

@dbuenzli
Owner

Ah ok, you'd lose the entities as specified originally. I'm reluctant to add that kind of ad hoc option to the current xmlm, which is already a horrendous and tricky state machine for handling whitespace and the bastardized notion of lexing that XML is. It is a fact that xmlm is ill-suited for writing XML filters (e.g. PIs are dropped); that was supposed to be fixed in a hypothetical xmlm 2.0.0 that never materialized.

That being said most tricky bits are on input, but no change is needed on input for your problem: you can simply have the entity function return an entity reference on mathml entities. Now the problem is that the & of these entities will be escaped on output. Having a flag on output to temporarily disable this on output shouldn't be too tricky. What do you think ?

@darioteixeira
Author

> That being said most tricky bits are on input, but no change is needed on input for your problem: you can simply have the entity function return an entity reference on mathml entities. Now the problem is that the & of these entities will be escaped on output. Having a flag on output to temporarily disable this on output shouldn't be too tricky. What do you think ?

Yes, as a minimally intrusive workaround, it would suffice if Xmlm.output had an option for not automatically escaping ampersands et al. That way I could declare the entity transformation function to simply re-output the entities: let input = Xmlm.make_input ~entity:(fun x -> Some ("&" ^ x ^ ";")) ...

(This was in fact my first approach; only then did I notice that the output was always escaped...)
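A minimal sketch of that first approach, assuming the xmlm 1.x API (`Xmlm.make_input`, `Xmlm.input`, `Xmlm.make_output`, `Xmlm.output`). The names `keep_entity` and `roundtrip` are hypothetical helpers introduced only for illustration. It copies every signal from input to output and shows where the re-created entity reference gets escaped again:

```ocaml
(* Requires the xmlm library. *)

(* Hypothetical helper: resolve every non-predefined entity to its own
   reference text, so that "&alpha;" reaches the caller as "&alpha;". *)
let keep_entity name = Some ("&" ^ name ^ ";")

(* Hypothetical helper: parse a document held in a string and write
   every signal straight back out into a buffer. *)
let roundtrip (xml : string) : string =
  let buf = Buffer.create 64 in
  let i = Xmlm.make_input ~entity:keep_entity (`String (0, xml)) in
  let o = Xmlm.make_output ~decl:false (`Buffer buf) in
  let rec copy depth =
    let s = Xmlm.input i in
    Xmlm.output o s;
    match s with
    | `Dtd _ -> copy depth
    | `El_start _ -> copy (depth + 1)
    | `Data _ -> copy depth
    | `El_end -> if depth > 1 then copy (depth - 1)
  in
  copy 0;
  Buffer.contents buf

(* The '&' that keep_entity put back is escaped once more on output,
   so the result is no longer a verbatim copy of the input. *)
let () = print_endline (roundtrip "<mi>&alpha;</mi>")
```

The input side behaves as wanted here; only the output escaping stands in the way, which is why an option on the output side was being discussed.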

@darioteixeira
Author

Just to clarify: the option controlling automatic escaping could be a parameter of Xmlm.make_output, not Xmlm.output. The former is less flexible, but fits better with the existing API (either is fine for my purposes).

@dbuenzli
Owner

Mmmh there would be a problem with & in the original input.
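To make the concern concrete, here is a small stand-alone sketch that does not use Xmlm at all. `escape` and `emit` are hypothetical stand-ins for the output side, and the `~escaping` flag models the proposed option. Once parsed, a re-created entity reference and a literal ampersand are both plain strings containing `&`, so a single flag cannot treat them differently:

```ocaml
(* Hypothetical stand-in for the escaping done on output. *)
let escape s =
  let b = Buffer.create (String.length s) in
  String.iter
    (function
      | '&' -> Buffer.add_string b "&amp;"
      | '<' -> Buffer.add_string b "&lt;"
      | '>' -> Buffer.add_string b "&gt;"
      | c -> Buffer.add_char b c)
    s;
  Buffer.contents b

(* Hypothetical output function with the proposed escaping flag. *)
let emit ~escaping data = if escaping then escape data else data

(* Character data as it would look after parsing. *)
let from_entity = "&alpha;" (* source text was: &alpha;  *)
let from_literal = "AT&T"   (* source text was: AT&amp;T *)

let () =
  (* Escaping off: the entity survives, but the literal '&' is
     written raw, which is not well-formed XML. *)
  print_endline (emit ~escaping:false from_entity);
  print_endline (emit ~escaping:false from_literal);
  (* Escaping on: the literal is fine, but the entity is mangled. *)
  print_endline (emit ~escaping:true from_entity);
  print_endline (emit ~escaping:true from_literal)
```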

@darioteixeira
Author

> Mmmh there would be a problem with & in the original input.

Yeah. I reckon the issue may be solvable, but not in an entirely backwards compatible way.

Suppose a new variant Entity were added to the type signal. By default, Xmlm.input would perform entity expansion, and therefore this particular signal would never occur. For that to happen, a special option must be provided to Xmlm.make_input. As for Xmlm.output, it would always accept Entity, of course.

This approach has the advantage that existing code would still run fine (though it would require tweaking to remove the compiler warning about the unhandled variant). What do you think?
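The proposal could be sketched as follows. This is a stand-alone model, not Xmlm code: the `signal` type is a simplified, hypothetical version of Xmlm's signals (element names only, no attributes or DTD), `` `Entity `` is the variant being proposed, and `escape` and `output_all` are illustrative helpers. Because entity references travel in their own signal, the output side can keep escaping ordinary data unconditionally:

```ocaml
(* Simplified, hypothetical signal type with the proposed `Entity case. *)
type signal =
  [ `El_start of string | `El_end | `Data of string | `Entity of string ]

(* Escaping applied to ordinary character data. *)
let escape s =
  let b = Buffer.create (String.length s) in
  String.iter
    (function
      | '&' -> Buffer.add_string b "&amp;"
      | '<' -> Buffer.add_string b "&lt;"
      | '>' -> Buffer.add_string b "&gt;"
      | c -> Buffer.add_char b c)
    s;
  Buffer.contents b

(* Serialize a signal list; open element names are kept on a stack so
   that `El_end can emit the matching closing tag. *)
let output_all (signals : signal list) : string =
  let b = Buffer.create 64 in
  let step stack = function
    | `El_start name ->
        Buffer.add_string b ("<" ^ name ^ ">");
        name :: stack
    | `El_end -> (
        match stack with
        | [] -> invalid_arg "unbalanced `El_end"
        | name :: rest ->
            Buffer.add_string b ("</" ^ name ^ ">");
            rest)
    | `Data d ->
        Buffer.add_string b (escape d);
        stack
    | `Entity e ->
        (* Written back as a reference, never escaped. *)
        Buffer.add_string b ("&" ^ e ^ ";");
        stack
  in
  ignore (List.fold_left step [] signals);
  Buffer.contents b

let () =
  print_endline
    (output_all [ `El_start "mi"; `Entity "alpha"; `Data " & "; `El_end ])
```

In this model the literal ampersand in the data is still escaped while the entity reference passes through untouched, which is the distinction a single escaping flag cannot make.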

@dbuenzli
Owner

dbuenzli commented Jan 8, 2015

Not very fond of that solution. If we were to break compatibility, I think it would be better to provide full support for writing XML filters and solve that problem along the way through an uncut codec like in Jsonm; this would include reporting comments and processing instructions. But this is a non-trivial amount of work. Did you maybe investigate PXP?

@darioteixeira
Author

Yes, the present implementation uses PXP in fact, and I was investigating Xmlm as a possible lighter-weight alternative. But anyway, I understand not wanting to break backwards compatibility (even if it's only because of an easily fixed warning). Feel free to close this ticket or stash it away as a "features to have for 2.0"...

@dbuenzli
Owner

This won't happen anytime soon so I'm closing.
