Add DaffodilUnparseContentHandler.finish() function to SAX unparse API#1572
Conversation
mbeckerle
left a comment
There was a problem hiding this comment.
+1 I have just one issue which is you should add an XML Comment and a processing instruction and some whitespace after the end-document element in one of the test cases that terminates normally. I just want to be sure we're not erroring on that.
<!-- an XML comment is allowed -->
<?processingInstruction isAllowed="true"?>
| * } catch (DaffodilUnparseErrorSAXException e) | ||
| * // generally can be ignored, use getUnparseResult() instead | ||
| * } catch (DaffodilUnhandledSAXException e) { | ||
| * // should never happen, indicates a bug in daffodil |
There was a problem hiding this comment.
Explain why you are bothering to illustrate a catch for it then. Shouldn't it just be allowed to propagate if it is a daffodil bug? If so, then why catch it at all?
There was a problem hiding this comment.
Good question. The issue is that DaffodilUnhandledSAXException extends SAXException, which is a checked exception so needs to either be caught or declared as thrown. For an example, it's probably best to catch it--users can mark it as thrown if they want, which is probably what they should do.
That said, because this exception indicates a bug in Daffodil, maybe the correct thing to do is to change DaffodilUnhandledSAXException to extend RuntimeException instead of SAXException. That way it isn't a checked checked exception and no one needs to worry about it, and if it does occur it can just bubble up and be handled like normal RuntimeExceptions.
And I think we could safely do that change without any real backwards incompatibility issues? The only real issue is that anyone just catching SAXException will no longer catch this exception, but that's probably the right thing since it indicates a bug that should be handled differently. It's also pretty unlikely to be thrown, since these kinds of bugs are relatively rare.
Does that make sense? Should we switch it to a RuntimeException as part of this PR?
There was a problem hiding this comment.
+1 for RuntimeException for this.
| // If we haven't reached an EndDocument event yet, there must be more | ||
| // events on their way, even if we don't know for sure yet. | ||
| currentEvent.eventType.get ne EndDocument | ||
| // If we haven't reached an EndDocument event yet, there must be more events on their way, |
There was a problem hiding this comment.
Is there an issue with an XML Comment that appears after the end document element? The XML Spec says after the end of the root element you can have whitespace, processing-instructions, or comments. Or are we disregarding those elsewhere? I can imagine if an XML infoset was crafted by hand as a test case that the author might want to put comments at the end.
There was a problem hiding this comment.
The EndDocument SAX event isn't so much about that last element, but about reaching the end of the XML source and telling the ContentHandler that there won't be any more events. So even if there are comments, PI, etc. after the last element, we won't get the EndDocument event until after those are all processed. And the Daffodil unparser also won't finish uparsing until it gets the EndDocument event from the InfosetInputter as well. So even if the XMLReader does parse the last element, but then sees an invalid comment or something, then it will error, we'll never see EndDocument, and we have the same issue.
The DaffodilUnparseContentHandler does disregard a number of things, like PI and ignoble whitespace:
Also, I just learned that the SAX ContentHandler API does not even receive comments--you need a different LexicalHandler if you care about those. So that's why comments aren't in our ContentHandler anywhere.
Will, do. This is an important edge case that I'm not sure we test. |
| null | ||
|
|
||
| private var unparseResult: Maybe[DFDL.UnparseResult] = Nope | ||
| private var maybeUnparseResultOrException: Maybe[Either[Exception, DFDL.UnparseResult]] = Nope |
There was a problem hiding this comment.
So this Either[Exception, DFDL.UnparseResult] is essentially catching any exceptions that get thrown as part of this.resume(...) calls? Pretty neat way to handle things. Feels a little weird to wrap it in a maybe, but I think I understand the logic here where this may not ever be getting set by the XMLReader and we need to be able to detect that.
There was a problem hiding this comment.
Yep, if the unparse side of the coroutine throws an unexpected exception (usually indicating bug) we capture that and send it back to the main coroutine where the main thread can handle it.
Yeah, the Maybe is a bit weird, but its needed because the unparse side of the coroutine could still be processing and just needs more events from the XMLReader. A value of None indicates that it's not done so we can gather more events and resume it until it indicates some kind of result (either an UnparseResult or an exception/bug).
When unparsing using the SAX API, if an XMLReader ends the parse early
(usually due to invalid XML), then it stops sending events to the
DaffodilUnparseContentHandler. This can lead to dangling threads since the
DaffodilUnparseCotentHandler and SAXInfosetInputter are coroutines,
which uses threads behind the scenes.
To fix this, we add a new finish() method to the
DaffodilUnparseContentHandler which should be called by SAX unparse API
users after the XMLReader parse completes, regardless of success. For
example,
val contentHandler = dp.newContentHandlerInstance(...)
xmlReader.setContentHandler(contentHandler)
try {
xmlReader.parse(...)
} catch (...) {
...
} finally {
contentHandler.finish()
}
The new finish function cleans up the coroutines setting the next event
to null, which indicates to the SAXInfosetInputter coroutine that no
more events are coming, ands leads to ending the unparse and the
coroutine. This now also means getUnparseResult will always return a
result, even if the XMLReader sees invalid XML, which makes our API a
bit more consistent.
Note that this discovered a case where the InfosetInputter.initialize()
method could throw an UnparseError if there was no StartDocument event,
which was not caught correctly. Fixing this requires moving the initialize call
inside the try/catch block, which is after the UState is created.
This means the UState can no longer assert that the InfosetInputter is
initialized, but this isn't necessary UState creation to succeed, as
long as the infoset inputter is initialized shortly thereafter.
- Change DaffodilUnhandledSAXException from a SAXException to a
RuntimeException. This exception generally indicatesa bug and so we
should not require that it be caught. Instead, it should be handled
just like any other RuntimeException.
Deprecation/Compatibility:
A new DaffodilUnparseContentHandler.finish() method is added for users
of the SAX unparse API. This should be called at the end of the
XMLReader.parse() method to to ensure all internal state is cleaned up
and an unparse result is available. For example:
DaffodilUnparseContentHandler contentHandler = dp.newContentHandlerInstance();
xmlReader.setContentHandler(contentHandler);
try {
xmlReader.parse(...):
} catch (...)
...
} finally {
contentHandler.finish();
}
UnparseResult ur = contentHandler.getUnparseResult();
Deprecation/Compatibility:
The DaffodilUnhandledSAXException, which can be thrown when unparsing
using the SAX API and usually indicates a Daffodil bug, is changed from
a SAXException to a RuntimeException. It is no longer a checked
exception and does not need to be caught. It should usually be handled
just like any other unchecked RuntimeException.
DAFFODIL-2768
5910e81 to
12a8986
Compare
When unparsing using the SAX API, if an XMLReader ends the parse early (usually due to invalid XML), then it stops sending events to the DaffodilUnparseContentHandler. This can lead to dangling threads since the DaffodilUnparseCotentHandler and SAXInfosetInputter are coroutines, which uses threads behind the scenes.
To fix this, we add a new finish() method to the
DaffodilUnparseContentHandler which should be called by SAX unparse API users after the XMLReader parse completes, regardless of success. For example,
The new finish function cleans up the coroutines setting the next event to null, which indicates to the SAXInfosetInputter coroutine that no more events are coming, ands leads to ending the unparse and the coroutine. This now also means getUnparseResult will always return a result, even if the XMLReader sees invalid XML, which makes our API a bit more consistent.
Note that this discovered a case where the InfosetInputter.initialize() method could throw an UnparseError if there was no StartDocument event, which was not caught correctly. Fixing this requires moving the initialize call inside the try/catch block, which is after the UState is created. This means the UState can no longer assert that the InfosetInputter is initialized, but this isn't necessary UState creation to succeed, as long as the infoset inputter is initialized shortly thereafter.
Deprecation/Compatibility:
A new DaffodilUnparseContentHandler.finish() method is added for users of the SAX unparse API. This should be called at the end of the XMLReader.parse() method to to ensure all internal state is cleaned up and an unparse result is available. For example:
DAFFODIL-2768