Remapper from XMLIllegal to PUA ignores pre-existing PUA characters by mbeckerle · Pull Request #1247 · apache/daffodil

mbeckerle · 2024-05-28T21:53:47Z

Existing PUA chars were producing SDE in the InfosetOutputter to XML. This is too late for a parser to backtrack because of these characters.

Turns out if you do fuzz testing on data formats that use unicode charsets (like zip file format), then it's very easy to end up with some PUA characters in the data.

This fix is needed for Daffodil 3.8.0 because layer algorithms for things like zip files were hitting this with their fuzz testing.

DAFFODIL-2883


mbeckerle · 2024-05-28T21:55:36Z

No Daffodil tests fail due to this change.

stevedlawrence

+1

stevedlawrence · 2024-05-29T12:58:23Z

+      replaceCRWithLF = true
+    )

  def remapXMLIllegalCharactersToPUA(s: String): String = remapXMLToPUA.remap(s)


The ticket mentions possibly having a switch to make this configurable. I think that remapping in infoset outputters should probably never care about existing PUA, so this change makes a lot of sense to me.

But maybe a switch is wanted in some internal function (e.g. getSomeString) that checks for PUA characters and creates a PE/SDE? Should we create a ticket for that, or maybe it's not necessary at all and we should just always allow PUA characters with the caveat that they probably won't round trip. In which case maybe a new ticket should be added to remove the checkForExistingPUA feature? I'm not sure if there's overhead in that.

mbeckerle · May 29, 2024

Yeah, the DFDL Infoset definitely allows all PUA characters. So DFDL Infoset to XML Infoset projection should not be causing errors on any unicode chars.

Whether we allow existing PUA chars at parse time... it's an XML-specific parse-time option. We have no feature to declare them illegal, and if we did it ought to depend on dfdl:encodingErrorPolicy whether they get replaced or a parse error is signaled.

Right now, suppose you have UTF-8 data containing existing PUA characters that happen to overlap with those Daffodil remaps to, e.g., U+E000, which we use for NUL. Well, your data cannot be faithfully converted to/from XML by Daffodil, simple as that. We will turn your E000 into a NUL in the infoset on unparsing.
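To make the collision concrete, here is a toy model of the round trip. It is only a sketch, not Daffodil's remapper: the object and method names are hypothetical, and the only mapping it models is the NUL to U+E000 one mentioned above.

```scala
// Toy model only: NOT Daffodil's implementation.
// The single mapping NUL <-> U+E000 comes from the discussion;
// the object and method names are hypothetical.
object PuaCollisionSketch {
  val Nul: Char = '\u0000'
  val PuaNul: Char = '\uE000'

  // DFDL Infoset -> XML: hide the XML-illegal NUL in the PUA.
  // A U+E000 already present in the data passes through unchanged.
  def toXmlSafe(s: String): String =
    s.map(c => if (c == Nul) PuaNul else c)

  // XML -> DFDL Infoset: every U+E000 becomes NUL, because nothing
  // records which ones were original data and which were remapped.
  def fromXmlSafe(s: String): String =
    s.map(c => if (c == PuaNul) Nul else c)
}
```

In this model a string that really contains NUL round-trips fine, but a string that originally contained U+E000 comes back with NUL in its place.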

This is not a Daffodil issue. It is simply not possible to convert data into XML if it uses every unicode code point, since we need some to represent the XML-illegal characters that are allowed in the DFDL Infoset.

The workaround is to just say no to XML: convert to JSON first, then remap all your PUA chars that collide with the Daffodil remappings, converting those to some part of the PUA that neither your data nor Daffodil uses. Then unparse. Now you have data that can be faithfully converted to XML and back by Daffodil. However, to unparse back to exactly what you started from, you have to invert this JSON operation as well.
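A rough sketch of that remap-and-invert step, applied to string values taken from the JSON infoset. The reserved range and the spare PUA block below are assumptions chosen for illustration, not the ranges Daffodil actually uses, and the names are hypothetical.

```scala
// Sketch of the workaround's remapping step, under stated assumptions.
object PuaShiftSketch {
  // Assumed for illustration: code points the XML remapping might claim.
  val ReservedStart = 0xE000
  val ReservedEnd = 0xE0FF
  // Assumed: shifting by this offset lands in a PUA block
  // (U+F000..U+F0FF) that neither the data nor the remapper uses.
  val Offset = 0x1000

  // Move colliding PUA characters out of the reserved range.
  def shiftOut(s: String): String = s.map { c =>
    if (c >= ReservedStart && c <= ReservedEnd) (c + Offset).toChar else c
  }

  // Inverse step, needed to get back exactly the original data.
  def shiftBack(s: String): String = s.map { c =>
    val lo = ReservedStart + Offset
    val hi = ReservedEnd + Offset
    if (c >= lo && c <= hi) (c - Offset).toChar else c
  }
}
```

The inverse is only exact if the original data never uses the spare block itself, which is why the target region has to be chosen per data set.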

To me, this is sufficient as a work-around for this very obscure problem.

olabusayoT

+1

mbeckerle requested review from jadams-tresys, olabusayoT and stevedlawrence May 28, 2024 21:53

stevedlawrence approved these changes May 29, 2024

olabusayoT approved these changes May 29, 2024

mbeckerle merged commit 8433f4d into apache:main May 29, 2024

mbeckerle deleted the daf-2883-pua branch May 29, 2024 15:09
