Remapper from XMLIllegal to PUA ignores pre-existing PUA characters#1247
Conversation
…n the input data. Existing PUA chars were producing SDE in the InfosetOutputter to XML. This is too late for a parser to backtrack because of these characters. Turns out if you do fuzz testing on data formats that use unicode charsets (like zip file format), then it's very easy to end up with some PUA characters in the data. This fix is needed for Daffodil 3.8.0 because layer algorithms for things like zip files were hitting this with their fuzz testing. DAFFODIL-2883
|
No Daffodil tests fail due to this change. |
| replaceCRWithLF = true | ||
| ) | ||
|
|
||
| def remapXMLIllegalCharactersToPUA(s: String): String = remapXMLToPUA.remap(s) |
There was a problem hiding this comment.
The ticket mentions possibly having a switch to make this configurable. I think that remapping in infoset outputters should probably never care about existing PUA so this change makes alot of sense to me.
But maybe a switch is wanted in some internal function (e.g. getSomeString) that checks for PUA characters and creates a PE/SDE? Should we create a ticket for that, or maybe it's not necessary at all and we should just always allow PUA characters with the caveat that they probably won't round trip. In which case maybe a new ticket should be added to remove the checkForExistingPUA feature? I'm not sure if there's overhead in that.
There was a problem hiding this comment.
Yeah, the DFDL Infoset definitely allows all PUA characters. So DFDL Infoset to XML Infoset projection should not be causing errors on any unicode chars.
Whether we allow existing PUA chars at parse time... it's an XML-specific parse-time option. We have no feature to declare them illegal, and if we did it ought to depend on dfdl:encodingErrorPolicy whether they get replaced or a parse error is signaled.
Right now, suppose you had data in UTF-8 that has existing PUA characters in it that happen to overlap with those remapped by Daffodil. E.g, U+E000 which we use for NUL. Well your data cannot be faithfully converted to/from XML by Daffodil, simple as that. We will turn your E000 into a NUL in the infoset on unparsing.
This is not a Daffodil issue. It is simply not possible to convert data into XML if it uses every unicode code point, since we need some to represent the XML-illegal characters that are allowed in the DFDL Infoset.
The workaround is just say no to XML: convert to JSON first, then remap all your PUA chars that collide with the Daffodil remappings, converting those to some part of the PUA that neither your data, nor Daffodil, use. Then unparse. Now you have data that can be faithfully converted to XML and back by Daffodil. However, to unparse back to exactly what you started from, you have to invert this JSON operation as well.
To me, this is sufficient as a work-around for this very obscure problem.
Existing PUA chars were producing SDE in the InfosetOutputter to XML. This is too late for a parser to backtrack because of these characters.
Turns out if you do fuzz testing on data formats that use unicode charsets (like zip file format), then it's very easy to end up with some PUA characters in the data.
This fix is needed for Daffodil 3.8.0 because layer algorithms for things like zip files were hitting this with their fuzz testing.
DAFFODIL-2883