Add DaffodilUnparseContentHandler.finish() function to SAX unparse API by stevedlawrence · Pull Request #1572 · apache/daffodil

stevedlawrence · 2025-10-13T12:48:30Z

When unparsing using the SAX API, if an XMLReader ends the parse early (usually due to invalid XML), then it stops sending events to the DaffodilUnparseContentHandler. This can lead to dangling threads since the DaffodilUnparseCotentHandler and SAXInfosetInputter are coroutines, which uses threads behind the scenes.

To fix this, we add a new finish() method to the
DaffodilUnparseContentHandler which should be called by SAX unparse API users after the XMLReader parse completes, regardless of success. For example,

val contentHandler = dp.newContentHandlerInstance(...)
xmlReader.setContentHandler(contentHandler)
try {
  xmlReader.parse(...)
} catch (...) {
  ...
} finally {
  contentHandler.finish()
}

The new finish function cleans up the coroutines setting the next event to null, which indicates to the SAXInfosetInputter coroutine that no more events are coming, ands leads to ending the unparse and the coroutine. This now also means getUnparseResult will always return a result, even if the XMLReader sees invalid XML, which makes our API a bit more consistent.

Note that this discovered a case where the InfosetInputter.initialize() method could throw an UnparseError if there was no StartDocument event, which was not caught correctly. Fixing this requires moving the initialize call inside the try/catch block, which is after the UState is created. This means the UState can no longer assert that the InfosetInputter is initialized, but this isn't necessary UState creation to succeed, as long as the infoset inputter is initialized shortly thereafter.

Deprecation/Compatibility:

A new DaffodilUnparseContentHandler.finish() method is added for users of the SAX unparse API. This should be called at the end of the XMLReader.parse() method to to ensure all internal state is cleaned up and an unparse result is available. For example:

DaffodilUnparseContentHandler contentHandler = dp.newContentHandlerInstance();
xmlReader.setContentHandler(contentHandler);
try {
    xmlReader.parse(...):
} catch (...)
    ...
} finally {
    contentHandler.finish();
}
UnparseResult ur = contentHandler.getUnparseResult();

DAFFODIL-2768

mbeckerle

+1 I have just one issue which is you should add an XML Comment and a processing instruction and some whitespace after the end-document element in one of the test cases that terminates normally. I just want to be sure we're not erroring on that.


<!-- an XML comment is allowed -->

<?processingInstruction isAllowed="true"?>

mbeckerle · 2025-10-14T14:18:42Z

daffodil-core/src/main/java/org/apache/daffodil/api/package-info.java

+ *  } catch (DaffodilUnparseErrorSAXException e)
+ *    // generally can be ignored, use getUnparseResult() instead
+ *  } catch (DaffodilUnhandledSAXException e) {
+ *    // should never happen, indicates a bug in daffodil


Explain why you are bothering to illustrate a catch for it then. Shouldn't it just be allowed to propagate if it is a daffodil bug? If so, then why catch it at all?

Good question. The issue is that DaffodilUnhandledSAXException extends SAXException, which is a checked exception so needs to either be caught or declared as thrown. For an example, it's probably best to catch it--users can mark it as thrown if they want, which is probably what they should do.

That said, because this exception indicates a bug in Daffodil, maybe the correct thing to do is to change DaffodilUnhandledSAXException to extend RuntimeException instead of SAXException. That way it isn't a checked checked exception and no one needs to worry about it, and if it does occur it can just bubble up and be handled like normal RuntimeExceptions.

And I think we could safely do that change without any real backwards incompatibility issues? The only real issue is that anyone just catching SAXException will no longer catch this exception, but that's probably the right thing since it indicates a bug that should be handled differently. It's also pretty unlikely to be thrown, since these kinds of bugs are relatively rare.

Does that make sense? Should we switch it to a RuntimeException as part of this PR?

+1 for RuntimeException for this.

mbeckerle · 2025-10-14T14:26:52Z

daffodil-core/src/main/scala/org/apache/daffodil/runtime1/infoset/SAXInfosetInputter.scala

-    // If we haven't reached an EndDocument event yet, there must be more
-    // events on their way, even if we don't know for sure yet.
-    currentEvent.eventType.get ne EndDocument
+    // If we haven't reached an EndDocument event yet, there must be more events on their way,


Is there an issue with an XML Comment that appears after the end document element? The XML Spec says after the end of the root element you can have whitespace, processing-instructions, or comments. Or are we disregarding those elsewhere? I can imagine if an XML infoset was crafted by hand as a test case that the author might want to put comments at the end.

The EndDocument SAX event isn't so much about that last element, but about reaching the end of the XML source and telling the ContentHandler that there won't be any more events. So even if there are comments, PI, etc. after the last element, we won't get the EndDocument event until after those are all processed. And the Daffodil unparser also won't finish uparsing until it gets the EndDocument event from the InfosetInputter as well. So even if the XMLReader does parse the last element, but then sees an invalid comment or something, then it will error, we'll never see EndDocument, and we have the same issue.

The DaffodilUnparseContentHandler does disregard a number of things, like PI and ignoble whitespace:

daffodil/daffodil-core/src/main/scala/org/apache/daffodil/runtime1/processors/DaffodilUnparseContentHandlerImpl.scala

Lines 459 to 473 in 1a51c31

override def ignorableWhitespace(ch: Array[Char], start: Int, length: Int): Unit = {

// do nothing

}

override def processingInstruction(target: String, data: String): Unit = {

// do nothing

}

override def skippedEntity(name: String): Unit = {

// do nothing

}

override def setDocumentLocator(locator: Locator): Unit = {

// do nothing

}

Also, I just learned that the SAX ContentHandler API does not even receive comments--you need a different LexicalHandler if you care about those. So that's why comments aren't in our ContentHandler anywhere.

stevedlawrence · 2025-10-14T15:50:09Z

I have just one issue which is you should add an XML Comment and a processing instruction and some whitespace after the end-document element in one of the test cases that terminates normally.

Will, do. This is an important edge case that I'm not sure we test.

jadams-tresys · 2025-10-27T15:10:32Z

...c/main/scala/org/apache/daffodil/runtime1/processors/DaffodilUnparseContentHandlerImpl.scala

+      null

-  private var unparseResult: Maybe[DFDL.UnparseResult] = Nope
+  private var maybeUnparseResultOrException: Maybe[Either[Exception, DFDL.UnparseResult]] = Nope


So this Either[Exception, DFDL.UnparseResult] is essentially catching any exceptions that get thrown as part of this.resume(...) calls? Pretty neat way to handle things. Feels a little weird to wrap it in a maybe, but I think I understand the logic here where this may not ever be getting set by the XMLReader and we need to be able to detect that.

Yep, if the unparse side of the coroutine throws an unexpected exception (usually indicating bug) we capture that and send it back to the main coroutine where the main thread can handle it.

Yeah, the Maybe is a bit weird, but its needed because the unparse side of the coroutine could still be processing and just needs more events from the XMLReader. A value of None indicates that it's not done so we can gather more events and resume it until it indicates some kind of result (either an UnparseResult or an exception/bug).

jadams-tresys

+1

When unparsing using the SAX API, if an XMLReader ends the parse early (usually due to invalid XML), then it stops sending events to the DaffodilUnparseContentHandler. This can lead to dangling threads since the DaffodilUnparseCotentHandler and SAXInfosetInputter are coroutines, which uses threads behind the scenes. To fix this, we add a new finish() method to the DaffodilUnparseContentHandler which should be called by SAX unparse API users after the XMLReader parse completes, regardless of success. For example, val contentHandler = dp.newContentHandlerInstance(...) xmlReader.setContentHandler(contentHandler) try { xmlReader.parse(...) } catch (...) { ... } finally { contentHandler.finish() } The new finish function cleans up the coroutines setting the next event to null, which indicates to the SAXInfosetInputter coroutine that no more events are coming, ands leads to ending the unparse and the coroutine. This now also means getUnparseResult will always return a result, even if the XMLReader sees invalid XML, which makes our API a bit more consistent. Note that this discovered a case where the InfosetInputter.initialize() method could throw an UnparseError if there was no StartDocument event, which was not caught correctly. Fixing this requires moving the initialize call inside the try/catch block, which is after the UState is created. This means the UState can no longer assert that the InfosetInputter is initialized, but this isn't necessary UState creation to succeed, as long as the infoset inputter is initialized shortly thereafter. - Change DaffodilUnhandledSAXException from a SAXException to a RuntimeException. This exception generally indicatesa bug and so we should not require that it be caught. Instead, it should be handled just like any other RuntimeException. Deprecation/Compatibility: A new DaffodilUnparseContentHandler.finish() method is added for users of the SAX unparse API. This should be called at the end of the XMLReader.parse() method to to ensure all internal state is cleaned up and an unparse result is available. For example: DaffodilUnparseContentHandler contentHandler = dp.newContentHandlerInstance(); xmlReader.setContentHandler(contentHandler); try { xmlReader.parse(...): } catch (...) ... } finally { contentHandler.finish(); } UnparseResult ur = contentHandler.getUnparseResult(); Deprecation/Compatibility: The DaffodilUnhandledSAXException, which can be thrown when unparsing using the SAX API and usually indicates a Daffodil bug, is changed from a SAXException to a RuntimeException. It is no longer a checked exception and does not need to be caught. It should usually be handled just like any other unchecked RuntimeException. DAFFODIL-2768

mbeckerle approved these changes Oct 14, 2025

View reviewed changes

jadams-tresys reviewed Oct 27, 2025

View reviewed changes

jadams-tresys approved these changes Oct 27, 2025

View reviewed changes

stevedlawrence force-pushed the daffodil-2768-sax-unparse-cleanup branch from 5910e81 to 12a8986 Compare October 27, 2025 16:14

stevedlawrence merged commit 3f31018 into apache:main Oct 27, 2025
11 checks passed

stevedlawrence deleted the daffodil-2768-sax-unparse-cleanup branch October 27, 2025 16:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add DaffodilUnparseContentHandler.finish() function to SAX unparse API#1572

Add DaffodilUnparseContentHandler.finish() function to SAX unparse API#1572
stevedlawrence merged 1 commit intoapache:mainfrom
stevedlawrence:daffodil-2768-sax-unparse-cleanup

stevedlawrence commented Oct 13, 2025

Uh oh!

mbeckerle left a comment

Uh oh!

mbeckerle Oct 14, 2025

Uh oh!

stevedlawrence Oct 14, 2025

Uh oh!

mbeckerle Oct 15, 2025

Uh oh!

mbeckerle Oct 14, 2025

Uh oh!

stevedlawrence Oct 14, 2025

Uh oh!

stevedlawrence commented Oct 14, 2025

Uh oh!

jadams-tresys Oct 27, 2025

Uh oh!

stevedlawrence Oct 27, 2025

Uh oh!

jadams-tresys left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

	override def ignorableWhitespace(ch: Array[Char], start: Int, length: Int): Unit = {
	// do nothing
	}

	override def processingInstruction(target: String, data: String): Unit = {
	// do nothing
	}

	override def skippedEntity(name: String): Unit = {
	// do nothing
	}

	override def setDocumentLocator(locator: Locator): Unit = {
	// do nothing
	}

Conversation

stevedlawrence commented Oct 13, 2025

Uh oh!

mbeckerle left a comment

Choose a reason for hiding this comment

Uh oh!

mbeckerle Oct 14, 2025

Choose a reason for hiding this comment

Uh oh!

stevedlawrence Oct 14, 2025

Choose a reason for hiding this comment

Uh oh!

mbeckerle Oct 15, 2025

Choose a reason for hiding this comment

Uh oh!

mbeckerle Oct 14, 2025

Choose a reason for hiding this comment

Uh oh!

stevedlawrence Oct 14, 2025

Choose a reason for hiding this comment

Uh oh!

stevedlawrence commented Oct 14, 2025

Uh oh!

jadams-tresys Oct 27, 2025

Choose a reason for hiding this comment

Uh oh!

stevedlawrence Oct 27, 2025

Choose a reason for hiding this comment

Uh oh!

jadams-tresys left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants