Skip to content

Check characters in Text, CDATA, Attribute names and Entity References. #57

Merged
merged 1 commit into from Oct 1, 2011

2 participants

@hvesalai

All characters in XML must conform to the Regex specified in http://www.w3.org/TR/xml/#NT-Char

Added checking to throw IllegalArgumentException if an illegal character was used.

Changed Text and Attribute escaping to conform to
http://www.w3.org/TR/xml/#NT-CharData and http://www.w3.org/TR/xml/#NT-AttValue respectively.

More precisely:

Changed Attribute escaping so that " and ' are only used when necessary, > is not escaped at all. Instead of foo="bar"foo" it now writes foo='bar"foo' and instead of foo="bar'foo" it now writes foo="bar'foo".

Also changed text escaping not to escape " and ' since they are not special in text. Left escaping of > intact because (while not strictly necessary per specification) that is the simplest way to avoid the illegal ]]> sequence in text.

Heikki Vesalainen Check characters in Text, CDATA, Attribute names and Entity References.
All characters in XML must conform to the Regex specified in http://www.w3.org/TR/xml/#NT-Char

Also changed Text and Attribute escaping to conform to
http://www.w3.org/TR/xml/#NT-CharData and http://www.w3.org/TR/xml/#NT-AttValue respectively.

Changed Attribute escaping so that " and ' are only used when necessary.
bece398
@djspiewak
Owner

This isn't really the intention of Text (or Attribute, for that matter). They're supposed to handle all of the escaping. Any characters that are not valid should be escaped automatically, not result in an exception. Similarly, CDATA should allow anything that does not contain the sequence ]]>.

The only reason not to do this is if there is text that is not valid XML character data and cannot be escaped in some way (or, analogously, text that is not valid inside of a CDATA node). I didn't think this existed; am I misunderstanding the spec?

@hvesalai

Whether or not you have misunderstood the spec depends on which spec you mean. In XML 1.0 you really cannot include the control characters that this patch checks against in any part of the document (not even CDATA). And what's more, there is even no way to escape them. See http://www.w3.org/TR/xml/#wf-Legalchar. Thus, there is no way to represent the vertical tab (i.e. is not legal).

Note by the way that <, >, ", ' and & are all legal characters in XML 1.0. They just have to be encoded as character references in some positions.

In XML 1.1 this has changed so that the control characters are allowed as character references (i.e. &#x0B is legal). Even in XML 1.1, the null character is not allowed (\u0000).

Based on the XML declaration written in XMLSerializer I thought anti-xml is trying to conform to XML 1.0. If I understood correctly, then my patch is doing the right thing throwing the exception, since there is no way to escape these characters acording to the specification. Should XMLSerializer be upgraded to XML 1.1? If so, then we need to add the escaping of the restricted characters. We could ofcourse have both XML 1.0 and XML 1.1 support in XMLSerializer and have the XMLSerializer throw the exception, not the constructors of the nodes.

@hvesalai

(for some reason github ate my &#x0B from the above comment. The last word on line 3 should be (i.e. &#x0B is not legal).

@hvesalai

Instead of exceptions we could also silently remove the control characters or change them to whitespace when serializing the output.

@djspiewak djspiewak merged commit bece398 into djspiewak:master Oct 1, 2011
@djspiewak
Owner

Exceptions are definitely the right call. Eeesh, I really hate XML. Thanks for the patch!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.