Skip to content

Remapper from XMLIllegal to PUA ignores pre-existing PUA characters#1247

Merged
mbeckerle merged 1 commit into
apache:mainfrom
mbeckerle:daf-2883-pua
May 29, 2024
Merged

Remapper from XMLIllegal to PUA ignores pre-existing PUA characters#1247
mbeckerle merged 1 commit into
apache:mainfrom
mbeckerle:daf-2883-pua

Conversation

@mbeckerle
Copy link
Copy Markdown
Contributor

Existing PUA chars were producing SDE in the InfosetOutputter to XML. This is too late for a parser to backtrack because of these characters.

Turns out if you do fuzz testing on data formats that use unicode charsets (like zip file format), then it's very easy to end up with some PUA characters in the data.

This fix is needed for Daffodil 3.8.0 because layer algorithms for things like zip files were hitting this with their fuzz testing.

DAFFODIL-2883

…n the input data.

Existing PUA chars were producing SDE in the InfosetOutputter to XML. This is too late for a
parser to backtrack because of these characters.

Turns out if you do fuzz testing on data formats that use unicode charsets (like zip
file format), then it's very easy to end up with some PUA characters in the data.

This fix is needed for Daffodil 3.8.0 because layer algorithms for things like zip files were
hitting this with their fuzz testing.

DAFFODIL-2883
@mbeckerle
Copy link
Copy Markdown
Contributor Author

No Daffodil tests fail due to this change.

Copy link
Copy Markdown
Member

@stevedlawrence stevedlawrence left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

replaceCRWithLF = true
)

def remapXMLIllegalCharactersToPUA(s: String): String = remapXMLToPUA.remap(s)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ticket mentions possibly having a switch to make this configurable. I think that remapping in infoset outputters should probably never care about existing PUA so this change makes alot of sense to me.

But maybe a switch is wanted in some internal function (e.g. getSomeString) that checks for PUA characters and creates a PE/SDE? Should we create a ticket for that, or maybe it's not necessary at all and we should just always allow PUA characters with the caveat that they probably won't round trip. In which case maybe a new ticket should be added to remove the checkForExistingPUA feature? I'm not sure if there's overhead in that.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the DFDL Infoset definitely allows all PUA characters. So DFDL Infoset to XML Infoset projection should not be causing errors on any unicode chars.

Whether we allow existing PUA chars at parse time... it's an XML-specific parse-time option. We have no feature to declare them illegal, and if we did it ought to depend on dfdl:encodingErrorPolicy whether they get replaced or a parse error is signaled.

Right now, suppose you had data in UTF-8 that has existing PUA characters in it that happen to overlap with those remapped by Daffodil. E.g, U+E000 which we use for NUL. Well your data cannot be faithfully converted to/from XML by Daffodil, simple as that. We will turn your E000 into a NUL in the infoset on unparsing.

This is not a Daffodil issue. It is simply not possible to convert data into XML if it uses every unicode code point, since we need some to represent the XML-illegal characters that are allowed in the DFDL Infoset.

The workaround is just say no to XML: convert to JSON first, then remap all your PUA chars that collide with the Daffodil remappings, converting those to some part of the PUA that neither your data, nor Daffodil, use. Then unparse. Now you have data that can be faithfully converted to XML and back by Daffodil. However, to unparse back to exactly what you started from, you have to invert this JSON operation as well.

To me, this is sufficient as a work-around for this very obscure problem.

Copy link
Copy Markdown
Contributor

@olabusayoT olabusayoT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@mbeckerle mbeckerle merged commit 8433f4d into apache:main May 29, 2024
@mbeckerle mbeckerle deleted the daf-2883-pua branch May 29, 2024 15:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants