Fix hang caused by a zero-length first separated infix element by stevedlawrence · Pull Request #517 · apache/daffodil

stevedlawrence · 2021-03-24T13:49:46Z

With infix separators, we only consume a separator after we have parsed
the first field, i.e. when groupPos > 1. The sequence logic has a
condition that prevents groupPos from being incremented when a
zero-length field is parsed when the field was positional. But this
means that if the first field is zero-length, and this condition was
hit, we do not increment groupPos, we never consume the first infix
separator, and thus we parse the same zero-length data infinitely,
causing a hang.

To fix this, we need to remove the isPositional condition that prevented
the groupPos from being incremented when zero length separated field was
parsed. Whether or not the field is positional should not matter--if a
separated field was successfully parsed, even if it was zero length, we
must still increment groupPos for the following separators to be parsed
correctly.

DAFFODIL-2487

stevedlawrence · 2021-03-24T13:51:38Z

...c/test/resources/org/apache/daffodil/section14/sequence_groups/SequenceGroupNestedArray.tdml

          <ex:record>
            <ex:item>notice lines of all commas are skipped</ex:item>
          </ex:record>
+          <ex:record />


The line of all commas resulted in an empty record element. Is that that expected, or does this mean there is some other bug that is preventing this empty record element from being removed?

Interesting question. I looked at the DFDL 1.0 spec and then searched for dfdl:emptyValueDelimiterPolicy in the codebase. Lines 90-94 in SequenceChildBases.scala say:

An element of complex type can have EmptyRep, but dfdl:emptyValueDelimiterPolicy does not apply. TBD: CONFIRM THIS. When a complex type element is parsed, and zero data is consumed, but the parse is successful, then any infoset created is the "empty value" for this complex type element. When the element is required, this infoset is retained. When the element is optional, this infoset is discarded, any side-effects that occurred in its creation are backtracked.

The DFDL 1.0 spec says:

9.4.2.3 Complex element
Required occurrence: An item is added to the Infoset.

Optional occurrence: if dfdl:emptyValueDelimiterPolicy is applicable and is not 'none' [29], then an item is added to the Infoset, otherwise nothing is added to the Infoset.

This TDML file inherits DFDLGeneralFormat.dfdl.xsd without changing dfdl:emptyValueDelimiterPolicy from its original value which is "both". This implies the empty record item should be added to the infoset. However, the comment in SequenceChildBases.scala is either out of date or it implies Daffodil does not check dfdl:emptyValueDelimiterPolicy for complex elements, which would seem to be contrary to the DFDL 1.0 specification.

9.4.2.4 has an interesting example and seems related to what's going on here. The last line of that section says:

Upon this successful parse of E1, it is therefore known-to-exist. However, because the position in the data has not changed, E1 therefore has the empty representation. Because E1 is empty and optional (it has XSD minOccurs='0') and dfdl:emptyValueDelimiterPolicy does not apply, it is not added to the Infoset, and the temporary Infoset item for E1 containing E2 is discarded.

I think this is very similar in this case, where E1 is record, and E2 is item. However, there's one differnce related to the wording

However, because the position in the data has not changed, ...

In our case, the empty <ex:record /> element comes from the line that is all commas. So when parsing record, the position in the data does change while parsing the item elements, those elements just happened to not contribute anything to the infoset.

I also noticed that if I change the line of all commas to be an empty line, then the empty record element doesn't show up in the infoset. So it seems this position changing or not changing while parsing the complex record element might be the difference. In which case, this empty record element is maybe correct?

I guess the question is if you have a optional complex element and all it's children are empty rep and contribute nothign to the infoset, but they consume separator data, should that complex element end up in the infoset?

If I were you, I also would run an experiment. Edit SequenceGroupNestedArray.tdml, define emptyValueDelimiterPolicy="none" after inheriting DFDLGeneralFormat.dfdl.xsd, and rerun the test. Does Daffodil still add the empty record item to the infoset? The spec says:

Optional occurrence: if dfdl:emptyValueDelimiterPolicy is applicable and is not 'none' [29], then an item is added to the Infoset, otherwise nothing is added to the Infoset.

I know that there is a caveat "if dfdl:emptyValueDelimiterPolicy is applicable" and that policy applies only when an element defines an initiator and/or terminator, otherwise it does not apply. Example 9.4.2.4 also says emptyValueDelimiterPolicy does not apply to it. However, this experiment might reveal a bug in Daffodil if Daffodil suddenly reverses its position and stops adding the empty record item to the infoset.

Good idea. I've tried setting emptyValueDelimiterPolicy to all possible values to be sure, and it has no affect on the result. Daffodil always adds the empty record element.

Looking at the emptyValueDelimiterPolicy section it says the property is

Ignored if both dfdl:initiator and dfdl:terminator are "" (empty string).

So I think this property is just going to be ignored, and isn't applicable in this case, as you suggest.

So it's probably a good thing that the behavior is the same regardless of the value of that property (assuming that behavior is correct).

I managed to get the ibm DFDL crosstester working, and it has the same behavior as Daffodil--there is an empty <ex:record /> element when a line is all commas. Likewise, empty lines that contain no commas do not create a record element at all. So I think we're are consistent with IBM DFDL and are interpreting the spec the same. The other test_csv_hang_* tests pass as well.

Unfortunately, test_csv_nohang_1 does not pass in IBM DFDL. It fails with a parse error with message

Unexpected data found at offset '94' after parsing completed. Data: '0x2c...'

The 0x2c byte at offset 94 is the first comma on the line

,preceded by an empty field,

So it seems IBM does not like a line starting with a zero-length field in the nohang test that has minOccurs="1". It feels like Daffodil has the correct behavior, but this area of the spec is complicated, so I can't say for sure.

IBM DFDL differs from Daffodil in this new emptyElementParsePolicy property. They only implement the treatAsAbsent semantics.

I'm thinking the minOccurs of this vector is 1. That means the first element is actually required, not optional. IBM DFDL treats this as absent, and causes the failure since an absent required element causes an error.

I think.

We can try specifying dfdlx:emptyElementParsePolicy='treatAsMissing' (which is Daffodil's current realization of this property, and that behavior should then be the same.

Or add a minOccurs="0" to the array, then both IBM and Daffodil should behave the same.

That's my theory anyway.

emptyElementParsePolicy="treatAsMissing" does change Daffodil's behavior, but it is still slightly different from IBM's default behavior. With that property set, Daffodil stops parsing when it hits that comma, considers the parse a success, and outputs an infoset, just with left over data starting at that comma. So it does seem like we both think that comma is an error, but Daffodil interprets it as the end of the record array, whereas IBM DFDL interprets it as a fatal error.

We both have the same behavior when minOccurs="0". So it seems there's only one discrepency about how this parse error should be handled.

I want to capture this for DFDL workgroup discussion on what the "right" behavior is. We get a bug ticket or IBM does.

Sounds good. I'll merge this as is then since it fixes a pretty serious issue. If we determine daffodil's behavior is wrong, we can create a new ticket.

tuxji

+1

If there is really an issue with Daffodil not paying attention to dfdl:emptyValueDelimiterPolicy for complex elements, we can file another issue and fix it in another PR.

tuxji · 2021-03-24T15:24:11Z

...c/test/resources/org/apache/daffodil/section14/sequence_groups/SequenceGroupNestedArray.tdml

          <ex:record>
            <ex:item>notice lines of all commas are skipped</ex:item>
          </ex:record>
+          <ex:record />


Interesting question. I looked at the DFDL 1.0 spec and then searched for dfdl:emptyValueDelimiterPolicy in the codebase. Lines 90-94 in SequenceChildBases.scala say:

An element of complex type can have EmptyRep, but dfdl:emptyValueDelimiterPolicy does not apply. TBD: CONFIRM THIS. When a complex type element is parsed, and zero data is consumed, but the parse is successful, then any infoset created is the "empty value" for this complex type element. When the element is required, this infoset is retained. When the element is optional, this infoset is discarded, any side-effects that occurred in its creation are backtracked.

The DFDL 1.0 spec says:

9.4.2.3 Complex element
Required occurrence: An item is added to the Infoset.

Optional occurrence: if dfdl:emptyValueDelimiterPolicy is applicable and is not 'none' [29], then an item is added to the Infoset, otherwise nothing is added to the Infoset.

This TDML file inherits DFDLGeneralFormat.dfdl.xsd without changing dfdl:emptyValueDelimiterPolicy from its original value which is "both". This implies the empty record item should be added to the infoset. However, the comment in SequenceChildBases.scala is either out of date or it implies Daffodil does not check dfdl:emptyValueDelimiterPolicy for complex elements, which would seem to be contrary to the DFDL 1.0 specification.

mbeckerle

+1

With infix separators, we only consume a separator after we have parsed the first field, i.e. when groupPos > 1. The sequence logic has a condition that prevents groupPos from being incremented when a zero-length field is parsed when the field was positional. But this means that if the first field is zero-length, and this condition was hit, we do not increment groupPos, we never consume the first infix separator, and thus we parse the same zero-length data infinitely, causing a hang. To fix this, we need to remove the isPositional condition that prevented the groupPos from being incremented when zero length separated field was parsed. Whether or not the field is positional should not matter--if a separated field was successfully parsed, even if it was zero length, we must still increment groupPos for the following separators to be parsed correctly. DAFFODIL-2487

stevedlawrence commented Mar 24, 2021

View reviewed changes

tuxji approved these changes Mar 24, 2021

View reviewed changes

mbeckerle approved these changes Mar 25, 2021

View reviewed changes

stevedlawrence force-pushed the daffodil-2487-infix-zl-hang branch from a387f7a to 507f503 Compare March 25, 2021 18:49

stevedlawrence merged commit 5289f59 into apache:master Mar 25, 2021

stevedlawrence deleted the daffodil-2487-infix-zl-hang branch March 25, 2021 19:46

Conversation

stevedlawrence commented Mar 24, 2021

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tuxji left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mbeckerle left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants