Skip to content

Add support for validation of length facet#1192

Merged
olabusayoT merged 1 commit intoapache:mainfrom
olabusayoT:daf-2842-add-length-limitedValidation
Mar 27, 2024
Merged

Add support for validation of length facet#1192
olabusayoT merged 1 commit intoapache:mainfrom
olabusayoT:daf-2842-add-length-limitedValidation

Conversation

@olabusayoT
Copy link
Copy Markdown
Contributor

  • add support for length, also set minLength and maxLength with the value of length, when hasLength is true, and those variables are queried
  • ensure length and minLength and maxLength facets are not declared together (Xerces Error, but we have asserts incase Xerces validation is turned off)
  • add tests for validation=limited and validation=on

DAFFODIL-2842

Copy link
Copy Markdown
Contributor

@mbeckerle mbeckerle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Copy Markdown
Contributor

@tuxji tuxji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 with a question

Assert.invariant(hasLength)
schemaDefinitionUnless(
isSimpleType,
"Facet length is allowed only on types string and hexBinary.",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is calling isSimpleType supposed to tell you if the type is string, hexBinary, or something else?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right! I updated the message to reflect the check in both length and computeMinMaxLength (where this was copied from)

@olabusayoT olabusayoT force-pushed the daf-2842-add-length-limitedValidation branch from 7fd68d5 to 225b876 Compare March 25, 2024 23:20
Copy link
Copy Markdown
Member

@stevedlawrence stevedlawrence left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 minor suggestions and things to confirm with the XSD spec

repElement.hasMaxLength,
"String with dfdl:lengthKind='implicit' must have an XSD maxLength facet value.",
repElement.hasMaxLength || repElement.hasLength,
"String with dfdl:lengthKind='implicit' must have an XSD maxLength or XSD Length facet value.",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make XSD Length a lowercase l

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even better, could dispense with XSD (some people won't know XSD means XML Schema Language) and shorten the rest to just must have a length or maxLength facet value.

typeDef.optRestriction.flatMap { _.enumerationValues }
}

final lazy val length: java.math.BigDecimal = {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like mostly a copy from computeMinMaxLength, I wonder if we can move this logic into computeMinMaxLenghth I'd like to avoid duplicating code--if we ever need to change one place we're probably going to forget to change it in the other.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, something like:

final lazy val (length: java.math.BigDecimal, minLength: java.math.BigDecimal, maxLength: java.math.BigDecimal) =
    computeMinMaxLength

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might not even need length in the tuple? Maybe deep down in this function we can do something like

if (hasLength)
  (length, length)
else
  (minLength, maxLength)

So the fact that length was used kindof disappears, and length just means minLength == maxLength

} else {
computeMinMaxLength
}
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This condition goes away if we can merge length into computeMinMaxLength, so it simplifies the code a bit.

schemaDefinitionUnless(
isSimpleType,
"Facets minLength and maxLength are allowed only on types string and hexBinary.",
"Facets minLength and maxLength are allowed only on simple types or types derived from simple types.",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this correct, do we allow min/maxLength on numeric types? I wonder if the message was correct but the check was wrong?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the check for the correct type (string or hexBinary) and corresponding error message is performed on lines 994 and 1065 of this ElementBase.scala's class already (note: Steve already suggested merging all checks together into one function, computeMinMaxLength, so these two checks will be merged together later):

// length at line 994
        val typeOK = pt == PrimType.String || pt == PrimType.HexBinary
        schemaDefinitionWhen(
          !typeOK && hasLength,
          "Facet length is not allowed on types derived from type %s.\nIt is allowed only on types derived from string and hexBinary.",
          pt.name,
        )
// minLength, maxLength at line 1065
        val typeOK = pt == PrimType.String || pt == PrimType.HexBinary
        schemaDefinitionWhen(
          !typeOK && (hasMinLength || hasMaxLength),
          "Facets minLength and maxLength are not allowed on types derived from type %s.\nThey are allowed only on typed derived from string and hexBinary.",
          pt.name,
        )

Therefore, what we have here is a preliminary check saying you can't use the min/max/length facets on a complex type. Calling isSimpleType is what's being checked here. I do agree with Steve, the error message should not say the facets are allowed on all simple types - a more correct message would say Facets length, minLength, maxLength are not allowed on complex types.

// so we won't get to this code in those cases unless Xerces validation is turned off
Assert.usage(
ml.isEmpty || localLengthValue.isEmpty,
"Length and minLength cannot be specified together",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be wroth specify something like "Length and minLength facets ...", just to make it clear this is about facets and now about dfdl:length?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, and let's put the word "facets" first so we don't have to capitalize length (because the length facet always begins with the lowercase letter l): "Facets length and maxLength ..."

}
case Facet.maxLength | Facet.fractionDigits => {
case Facet.length | Facet.maxLength | Facet.fractionDigits => {
errorOnLocalGreaterThanBaseFacet(theLocalFacet, theRemoteFacet, facetType)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means that a local length facet must be less than the base length facet, is that correct? Have we confirmed with the XSD spec that that is the right behavior? Feels like make a local length should always be the same as a base length, since there is no way to narrow a single value instead of a range like min/maxValue have.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found this in the XSD spec:

Schema Component Constraint: length valid restriction
It is an ·error· if length is among the members of {facets} of {base type definition} and {value} is not equal to the {value} of the parent length.

So I think this is wrong. This needs to error if local is not equal to base.

else if (hasMinLength || hasMaxLength) {
if (maxLengthLong != -1 && len > maxLengthLong) warn("maxL", maxLengthLong)
Assert.invariant(minLengthLong >= 0)
if (minLengthLong > 0 && len < minLengthLong) warn("minL", minLengthLong)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest we just pass in the string "minLength", "maxLength" and "length", the above %sength is kindof hard to parse and only saves us a few characters.

e.length.foreach { length =>
val lenAsLong = length.longValue()
val isLengthGreaterThanEqToZero = lenAsLong.compareTo(0L) >= 0
if (isLengthGreaterThanEqToZero) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really understand why we do a length >= 0 check. I see it's just copied from min/maxLength, but seems like that should never happen and should be caught during complication? length, minLength, and maxLength can never be negative.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codecov also warns that line 353 is not covered by tests, so that may be another clue that this code is redundant.

val data = diNode.dataValue.getByteArray

val dataLen = data.length.toLong
val isDataLengthEqual = dataLen.compareTo(lenAsLong) == 0
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should length compare against the actual number of bytes, or the number of chars in the string representation of the infoset?

Note that min/maxLength have the exact same logic, but shouldn't they have less-than or greather-than? I wonder if those are unrelated bugs?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found this in the XSD spec:

1.2 if {primitive type definition} is hexBinary or base64Binary, then the length of the value, as measured in octets of the binary data, must be equal to {value};

So what we have is correct since we are checking the number of bytes. I'm still not sure why min/maxLength use equals for minLength/maxLength, feels like a bug.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For hexBinary, it's definitely number of bytes, not number of characters.

For strings,...it's supposed to be characters, but a little DFDL tricky thing is that if dfdl:lengthKind is 'implicit' then the length is in characters, but the charset must be SBCS (single-byte character set) when dfdl:lengthUnits='bytes' so that the maxLength facet, which is used to obtain the length, will be correct for either bytes or characters. Otherwise DFDL's interpretation of the maxLength value could be inconsistent with XSDs interpretation.

Copy link
Copy Markdown
Contributor

@tuxji tuxji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, but could be improved with some changes.

repElement.hasMaxLength,
"String with dfdl:lengthKind='implicit' must have an XSD maxLength facet value.",
repElement.hasMaxLength || repElement.hasLength,
"String with dfdl:lengthKind='implicit' must have an XSD maxLength or XSD Length facet value.",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even better, could dispense with XSD (some people won't know XSD means XML Schema Language) and shorten the rest to just must have a length or maxLength facet value.

typeDef.optRestriction.flatMap { _.enumerationValues }
}

final lazy val length: java.math.BigDecimal = {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, something like:

final lazy val (length: java.math.BigDecimal, minLength: java.math.BigDecimal, maxLength: java.math.BigDecimal) =
    computeMinMaxLength

}
}
// TODO: why are we using java.math.BigDecimal, when scala has a much
// nicer decimal class?
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we're all reviewing this PR, can anyone explain why we're using java.math.BigDecimal, or should we change this comment to // TODO: Consider replacing all uses of java.math.BigDecimal with Scala's much nicer decimal class?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we moved away from Scala types in most cases a while ago. I don't remember why, but for example all of our Infoset primitives just java types.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this was just avoiding boxes on the boxes on the boxes. Scala's decimal is built on java's decimal so one more layer of object between the actual data and operations on it. We did not strictly speaking need the additional functionality of the scala decimal class, so we decided to go with Java's boxed number types as our representation types in the infoset. I rather wish we had gone with ULong for xs:unsignedLong because it is far more efficient, but this decision is 10 years ago and probably before we learned of the ULong library.

schemaDefinitionUnless(
isSimpleType,
"Facets minLength and maxLength are allowed only on types string and hexBinary.",
"Facets minLength and maxLength are allowed only on simple types or types derived from simple types.",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the check for the correct type (string or hexBinary) and corresponding error message is performed on lines 994 and 1065 of this ElementBase.scala's class already (note: Steve already suggested merging all checks together into one function, computeMinMaxLength, so these two checks will be merged together later):

// length at line 994
        val typeOK = pt == PrimType.String || pt == PrimType.HexBinary
        schemaDefinitionWhen(
          !typeOK && hasLength,
          "Facet length is not allowed on types derived from type %s.\nIt is allowed only on types derived from string and hexBinary.",
          pt.name,
        )
// minLength, maxLength at line 1065
        val typeOK = pt == PrimType.String || pt == PrimType.HexBinary
        schemaDefinitionWhen(
          !typeOK && (hasMinLength || hasMaxLength),
          "Facets minLength and maxLength are not allowed on types derived from type %s.\nThey are allowed only on typed derived from string and hexBinary.",
          pt.name,
        )

Therefore, what we have here is a preliminary check saying you can't use the min/max/length facets on a complex type. Calling isSimpleType is what's being checked here. I do agree with Steve, the error message should not say the facets are allowed on all simple types - a more correct message would say Facets length, minLength, maxLength are not allowed on complex types.

// so we won't get to this code in those cases unless Xerces validation is turned off
Assert.usage(
ml.isEmpty || localLengthValue.isEmpty,
"Length and minLength cannot be specified together",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, and let's put the word "facets" first so we don't have to capitalize length (because the length facet always begins with the lowercase letter l): "Facets length and maxLength ..."

e.length.foreach { length =>
val lenAsLong = length.longValue()
val isLengthGreaterThanEqToZero = lenAsLong.compareTo(0L) >= 0
if (isLengthGreaterThanEqToZero) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codecov also warns that line 353 is not covered by tests, so that may be another clue that this code is redundant.

if (isDataLengthEqual) java.lang.Boolean.TRUE
else java.lang.Boolean.FALSE
}
case _ => e.SDE("Length facet is only valid for string and hexBinary.")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is not covered by tests, codecov says, so an earlier check prevents execution from coming here. I'd still leave the check in for the fallback "match everything else" case, but put the word "Facet" first and spell length with a lowercase l.

Comment on lines +445 to +451
<xs:element name="e2_8" dfdl:lengthKind="delimited" dfdl:encoding="iso-8859-1">
<xs:simpleType>
<xs:restriction base="xs:hexBinary">
<xs:length value="0"/>
</xs:restriction>
</xs:simpleType>
</xs:element>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please note that e2_8 is a type derived from hexBinary like the others, but the comment on the test case that uses e2_8 says the type "int" which indicates either e2_8 should be a type derived from int or that comment is incorrect.

Also please note that element names like e2_2, etc., are not well chosen or easily maintainable names. You can't insert new elements between them easily without having to renumber or tack on new numbers. In fact, the numbers tacked on the original element names seem like they were intended to provide some information about the facets, not just give some kind of ordering to the elements, see this example below:

    <xs:element name="e2_2" dfdl:lengthKind="delimited">
      <xs:simpleType>
        <xs:restriction base="ex:stMinLength_1">
          <xs:minLength value="2" />
        </xs:restriction>
      </xs:simpleType>
    </xs:element>

A name like ex:stMinLength_1 is much more clear that it's defining a string with a minLength of 1. So, these tests would become easier to maintain if they used similarly descriptive names such as hex_length_0 instead of e2_8.


<tdml:parserTestCase name="validation_testFacets_03"
root="e2_8" model="TestFacets"
description="checks that int doesn't work with length facet"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either this description or the entire test is incorrect since e2_8 is really a hex_length_0.

<tdml:documentPart type="byte">deadbeef</tdml:documentPart>
</tdml:document>

<!-- this is a Xerces Error -->
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is Xerces used in both validation modes "limited" and "on"?? I thought the point of "limited" validation was to validate faster by using only Daffodil itself, skipping calling Xerces since the user doesn't want to wait for Xerces to validate the infoset.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this comes from validating the DFDL schema, not the infoset. So limited/on/full doesn't affect this.

The Java and Scala API's don't allow turning off validating the DFDL schema. It can be disabled using the internal ProcessorFactory:

https://github.com/apache/daffodil/blob/main/daffodil-core/src/main/scala/org/apache/daffodil/core/compiler/Compiler.scala#L64

I think the TDML runner can also disable validating dfdl schemas, but that's only done in special cases I think.

// if hasLength is true, then optRestriction is defined
val r = st.optRestriction.get
(r.lengthValue, r.lengthValue)
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If repType is defined then we need to get length information from the repTypeElementDecl, not this element (i.e r). Since repTypeElementDec.minLength and repTypeElementDecl.maxlength come from computeMinMaxLength, and since that takes into account the length facet of the rep type element, I think we can remove this new case and just rely on the case below. That should should use the correct minLength/maxLength/length facet.

schemaDefinitionWhen(
(pt == PrimType.String || pt == PrimType.HexBinary) && lengthKind == LengthKind.Implicit,
"Facets minLength and maxLength must be defined for type %s with lengthKind='implicit'",
"The length, minLength and maxLength facets must be defined for type %s with lengthKind='implicit'",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes it sound like we need all three of length, minLength, and maxLength. We should reword this to make it more clear you need minLength/maxLength or length.

case (false, false, false, _) => (zeroBD, unbBD)
case (false, true, false, _) => (r.minLengthValue, unbBD)
case (true, _, _, _) =>
SDE("Facet length cannot be defined with minLength and maxLength facets")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a minor organizational thing, but it might be helpful to move this case right after the (true, false,false, _)case. That way the first two cases are when hasLength is true--we either use it or error if either minLength/maxLength is defined. That handles all the cases where hasLength is true, so it's maybe a little more mentally easier to think about the more complex minLength/maxLength cases that follow, since we don't have to consider hasLength anymore.

Also, should this be an assert.invariantFailed, since xerces should fail before we get here if length is defined with a minLength/maxLength?

Copy link
Copy Markdown
Member

@stevedlawrence stevedlawrence left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to latest changes

@olabusayoT olabusayoT force-pushed the daf-2842-add-length-limitedValidation branch from 165e14b to b417b85 Compare March 27, 2024 16:17
- add support for length facet, setting minLength and maxLength with the value of length, when hasLength is true, and those variables are queried
- ensure length and minLength and maxLength facets are not declared together (Xerces Error, but we have asserts in case Xerces validation is turned off)
- fix error message for isSimpleType check in computeMinMaxLength definition
- add tests for validation=limited and validation=on
- added hexBinary tests including test with length=0
- remove greater than or equal to zero checks for check[Length,MinLength,MaxLength] functions
- remove some commented out code
- add test to verify local facet len must be equal to base restriction length
- add test that ensures dfdl:length is not less than base length facet
- add reptype tests
- convert SDE to Assert.invariantFailed for the case when hasLength is true and minLen/maxLen is anything but false

DAFFODIL-2842
@olabusayoT olabusayoT force-pushed the daf-2842-add-length-limitedValidation branch from b417b85 to 250eaef Compare March 27, 2024 16:38
@olabusayoT olabusayoT merged commit 3e936cf into apache:main Mar 27, 2024
@olabusayoT olabusayoT deleted the daf-2842-add-length-limitedValidation branch March 27, 2024 17:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants