Skip to content

Add support for dfdl:checkRangeInclusive/checkRangeExclusive functions#1199

Merged
olabusayoT merged 1 commit intoapache:mainfrom
olabusayoT:daf-1515-dfdlCheckRangeIncExc
Apr 2, 2024
Merged

Add support for dfdl:checkRangeInclusive/checkRangeExclusive functions#1199
olabusayoT merged 1 commit intoapache:mainfrom
olabusayoT:daf-1515-dfdlCheckRangeIncExc

Conversation

@olabusayoT
Copy link
Copy Markdown
Contributor

  • add implementation for checkRangeInclusive/Exclusive in Expressions and DFDLFunctions
  • add tests for checkRangeInclusive/Exclusive for numeric types and non-numeric types

DAFFODIL-1515

Copy link
Copy Markdown
Contributor

@tuxji tuxji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some questions first...

checkArgCount(3)
val argDPath = args(0).compiledDPath
val rangeFrom = args(1).compiledDPath
val rangeTo = args(2).compiledDPath
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use arg1, arg2, and arg3 which are evaluated only once (their evaluation calls checkArgCount(3) too) instead of repeating work. That is, let's remove the checkArgCount(3) call and change the next 3 lines to:

    val argDPath = arg1.compiledDPath
    val rangeFrom = arg2.compiledDPath
    val rangeTo = arg3.compiledDPath

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, if you reference arg1/arg2/arg3 it will force evaluation of line 3056 and call checkArgCount(3), so you can also remove the duplicate check on 3061.

checkArgCount(3)
val argDPath = args(0).compiledDPath
val rangeFrom = args(1).compiledDPath
val rangeTo = args(2).compiledDPath
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise, can use arg1, arg2, arg3 here.

val res = executeCheck(
dataVal: DataValue.DataValuePrimitiveNullable,
minVal: DataValue.DataValuePrimitiveNullable,
maxVal: DataValue.DataValuePrimitiveNullable,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused by the syntax here. If this is a function call, which it appears to be, why do the arguments look like parameter definitions? That is, why not just call executeCheck(dataVal, minVal, maxVal)?

Comment on lines +95 to +103
val saved = dstate.currentNode
dataRecipe.run(dstate)
val dataVal = dstate.currentValue
dstate.setCurrentNode(saved)
minRecipe.run(dstate)
val minVal = dstate.currentValue
dstate.setCurrentNode(saved)
maxRecipe.run(dstate)
val maxVal = dstate.currentValue
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar with evalution of expressions. Why does this code call dstate.setCurrentNode(saved) after both dataRecipe.run(state) and minRecipe.run(dstate), but not after maxRecipe.run(dstate)? Why doesn't dstate need to have its saved state restored after the last maxRecipe,run(dstate)?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The currentNode DState can be used by a run implementation to know where to evaluate a relative path expression from (e.g. ../foo/bar/baz).

But calling run can also potentially mutate currentNode, changing it to something else, which can interfere with other calls to run, so it needs to be reset.

For example, say we had something like this:

dfdl:checkRangeExclusive(../foo/bar/baz, ../uno/dos/tres, ../one/two/three)`

Before we call run(), let's say currentNode points to an element called root.

After we call dataRecipe.run() which evaluates ../foo/bar/baz relative to root, the currentNode will point the baz element, because DState was mutated. If we do not reset currentNode back to root, then when we call minRecipe.run() to evaluate ../uno/dos/tres that relative path will be evaluated relative to the baz element instead of the root element. So that's why we need to save and restore curentNode after each call to run.

Note that we don't have to restore currentNode after calling maxReipce.run() because we aren't personally calling anymore runs, so whatever currentNode ended up as doesn't matter to us. If some other expression was called that led to this run() being called, then they were responsible for saving and restoring currentNode if they needed it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that we don't have to restore currentNode after calling maxReipce.run() because we aren't personally calling anymore runs, so whatever currentNode ended up as doesn't matter to us. If some other expression was called that led to this run() being called, then they were responsible for saving and restoring currentNode if they needed it.

Actually, why do run implementations even mutate expression state (current node, current value) inside dstate at all? A more functional way would be to pass dstate or a smaller expression state contained by dstate to run(exprState) as a parameter and receive a new expression state (with the new current node and new current value) as a result value from the run(exprState) call. Then these dstate.setCurrentNode(saved) calls would go away because dstate's expression state (current node, current value) never gets mutated even after it's passed to the run implementation.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't remember, but maybe we do it to mimic PState, which is mutable? Back in the early days PState wasn't mutable and we just copied it each time something changed, but they overhead of copies started becoming very noticeable so we switched to a mutable PState.

DState might be small enough that the overhead of creating multiple DStates might not be noticeable? Hard to say with out making the change, and it would probably be a pretty significant change--all the run() functions would need to be changed to allocate a new DState. I would be much cleaner and less prone to error though.

(dataVal.value, minVal.value, maxVal.value) match {
case (data: Integer, min: Integer, max: Integer) => data >= min && data < max
case (data: java.lang.Double, min: java.lang.Double, max: java.lang.Double) =>
data >= min && data < max
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This range check looks like it's still doing an inclusive comparison on the min side and doing an exclusive comparison only on the max side. Is that how the DFDL function "checkRangeExclusive" is supposed to work?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There wasn't any information on the DFDL spec about how to treat exclusive, and the redmine link on the tickets leads to a 404 error and attempts to find the issue referenced were not successful so I used the mathematical definition of exclusive, which deals with just the upper value (as seen in this SO post: https://stackoverflow.com/questions/39010041/what-is-the-meaning-of-exclusive-and-inclusive-when-describing-number-ranges)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's the latest DFDL spec: https://github.com/OpenGridForum/DFDL/blob/master/docs/current/gwdrp-dfdl-v1.0.8-GFD-R-P.240.pdf

Page 184 talks about these functions, but doesn't clarify what it means by inclusive/exclusive.

An alternative interpretation is that the min/max values are treated similar to minInclusive, maxInclusive, minExclusive, and maxExclusive facet restrictions, which is more like what tuxji (and me) were expecting. But your reasoning makes sense too.

@mbeckerle, do you know the intention of these functions?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inclusive means both ends are part of the range, Exclusive means both ends are outside the range.

This is only seems ambiguous in English as in "1 to 10 inclusive" where it becomes a question as to whether you mean inclusive to apply to "1 to 10", or just "10". But as "inclusive" is an adjective that describes a characteristic of an interval/range, not a single number. So I claim this is not ambiguous. It applies to both ends of the range, always. It requires more words to talk about the mixed combinations of including and excluding.

In the DFDL functions, both ends of the ranges are definitely what was intended.

<!-- <isInRangeExc>false</isInRangeExc>-->
<!-- </expMinMax_mixedIntAndFloat>-->
<!-- </tdml:dfdlInfoset>-->
<!-- </tdml:infoset>-->
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commented out code shouldn't be left around for a long time.

Copy link
Copy Markdown
Member

@stevedlawrence stevedlawrence left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small suggestions to reduce duplication.

checkArgCount(3)
val argDPath = args(0).compiledDPath
val rangeFrom = args(1).compiledDPath
val rangeTo = args(2).compiledDPath
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, if you reference arg1/arg2/arg3 it will force evaluation of line 3056 and call checkArgCount(3), so you can also remove the duplicate check on 3061.

val rangeFrom = args(1).compiledDPath
val rangeTo = args(2).compiledDPath
val c = conversions
val res = new CompiledDPath(DFDLCheckRangeExclusive(argDPath, rangeFrom, rangeTo) +: c)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DFDLCheckRangeExclusiveExpr is exactly the same as the Inclusive version except for this one line. What are your thoughts on only having a single DFDLCheckRangeExpr that takes a boolean parameter (e.g. isExcsluve) that signifies inclusive vs exclusive? Then, using a suggestion below you could have a single DFDLCheckRange class that accepts min/max comparision operations, something like:

val (_, _, lt, le, _) = ComparisonOpts.forType(targetType)
val compare = if (isExclusive) lt else le
val res = new CompiledDPath(DFDLCheckRange(argDPath, rangeFrom, rangeTo, compare) +: c)

Then we can combine DFDLCheckRangeInclusive/Exclusive to a single DFDLCheckRange class that does something like:

def run(...) {
  ...
  val res = compare.operate(minValue, dataValue) && compare.operate(dataValue, maxValue)
  res
}

The benefit to this is approach is we avoid lots of duplication and we also figure out the right comparison logic/function at schema compile time instead of run-time.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a note that the above code is run treats the minVal as inclusive, or exclusive depending on the function called, answering the question within this comment: #1199 (comment)

case (data: java.math.BigDecimal, min: java.math.BigDecimal, max: java.math.BigDecimal) =>
data.compareTo(min) >= 0 && data.compareTo(max) <= 0
case (data: BigInteger, min: BigInteger, max: BigInteger) =>
data.compareTo(min) >= 0 && data.compareTo(max) <= 0
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This match case isn't exhaustive. It's missing Short, Byte, and Long.

Though, instead of doing the type check at runtime and duplicating a bunch of checks, what if we figure out the right ComparisonOps to use at compile time?

I suggest we create a new static function, something like

object ComparisonOps {
  def forType(type: NodeInfo.Kind): (CompareOpBase, CompareOpBase, CompareOpBase, CompareOpBase, CompareOpBase, CompareOpBase) = {
    type match {
      case PrimType.Boolean => (EQ_Compare, NE_Compare, LT_Boolean, LE_Boolean, GT_Boolean, GE_Boolean)
      case PrimType.Integer => (EQ_Compare, NE_Compare, LT_Integer, LE_Integer, GT_Integer, GE_Integer)
      ...
    }
  }
  ...
}

So it returns a 6-tuple of all the comparison functions that might be needed for a specific type. We can use this anywhere we need comparison functions, like in the DFDLCheckRangeExpr suggested above. We could also use it in ComparisonExpression. That probably wants to be something like this:

lazy val compareOp = {
  val (eq, ne, lt, le, gt, ge) = CopmarisonOps.forType(convertedArgType)
  op match {
    ...
    case "eq"=> eq
    case "ne" => ne
    case "lt" => lt
    ...
  }
}

Note that all the comparison functions used in ComparisionExpression should be used in the new forType function.

We could also could use it in repTypeComparers, e.g.

lazy val (
  repTypeCompareLT: NumberCompareOp,
  repTypeCompareLE: NumberCompareOp,
) = LV('repTypeComparers) {
  val (_, _, lt, le, _, _) = ComparisonOps.forType(repTypeElementDec.primType)
  (lt, le)
}

There might be other places where we have a match/case to figure out comparison ops that could be simplified in this way an remove some duplication.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An improvement, maybe forType can be implemented as a Map for constant lookup and reduced allocations? e.g.

object ComparisonOps {
  val forType: Map[NodeInfo.Kind, (CompareOpBase, CompareOpBase, CompareOpBase, CompareOpBase, CompareOpBase, CompareOpBase)] =
    Map(
       PrimType.Boolean -> (EQ_Compare, NE_Compare, LT_Boolean, LE_Boolean, GT_Boolean, GE_Boolean),
       PrimType.Integer -> (EQ_Compare, NE_Compare, LT_Integer, LE_Integer, GT_Integer, GE_Integer),
       ...
    )
}

The use is still the same, e.g. ComparisonOps.forType(somePrimType)

Comment on lines +95 to +103
val saved = dstate.currentNode
dataRecipe.run(dstate)
val dataVal = dstate.currentValue
dstate.setCurrentNode(saved)
minRecipe.run(dstate)
val minVal = dstate.currentValue
dstate.setCurrentNode(saved)
maxRecipe.run(dstate)
val maxVal = dstate.currentValue
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The currentNode DState can be used by a run implementation to know where to evaluate a relative path expression from (e.g. ../foo/bar/baz).

But calling run can also potentially mutate currentNode, changing it to something else, which can interfere with other calls to run, so it needs to be reset.

For example, say we had something like this:

dfdl:checkRangeExclusive(../foo/bar/baz, ../uno/dos/tres, ../one/two/three)`

Before we call run(), let's say currentNode points to an element called root.

After we call dataRecipe.run() which evaluates ../foo/bar/baz relative to root, the currentNode will point the baz element, because DState was mutated. If we do not reset currentNode back to root, then when we call minRecipe.run() to evaluate ../uno/dos/tres that relative path will be evaluated relative to the baz element instead of the root element. So that's why we need to save and restore curentNode after each call to run.

Note that we don't have to restore currentNode after calling maxReipce.run() because we aren't personally calling anymore runs, so whatever currentNode ended up as doesn't matter to us. If some other expression was called that led to this run() being called, then they were responsible for saving and restoring currentNode if they needed it.

Copy link
Copy Markdown
Member

@stevedlawrence stevedlawrence left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 👍 just a couple small suggestion

case "gt" => gt
case "le" => le
case "ge" => ge
case _ => subsetError("Unsupported operation '%s' on type %s.", op, convergedArgType)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we currently hit this subsetError at compile time for lt, le, gt, ge operations with xs:hexBinary, but I don't think we will anymore? Instead we will create the new LT_HexBinary, etc. classes at compile time and fail at runtime with an assertion. We probably want to keep the same subset behavior so it SDEs at compile time?

So, can you add a check somewhere in this function that creates this subset error for hex binary and those operations so we get the compile time subset error instead of a runtime assertion?

Also, I think this case probably wants to be an Assert.invariantFailed now, since this ComparisonExpression should only be created with one of these operators (see def Comp in DFDLExpressionParser.scala). If we ever hit this case, it's probably a bug.

Also, should we create a ticket to implement the rest of the operators for hexBinary and can remove this subset error and support those operators for hexBinary? Maybe we wait until we drop support for Java 8 and can use Arrays.compare() so it should be a trivial change.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created https://issues.apache.org/jira/browse/DAFFODIL-2886

Though I think for ease of creating TDML tests, we should leave the error as a subsetError that results in a SDE rather than an Assert.invariantFailed. I can easily see someone creating a schema that tries to compare hexBinary elems and an SDE would be a better error for them to get until this is implemented.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should leave the error as a subsetError that results in a SDE rather than an Assert.invariantFailed

Yeah, the way you made the change this should stay a subset error. In my head the change would be something like this:

if (convergedArgType == PrimType.HexBinary && (op == "lt || op == "gt" || op == "le" || op == ge")) {
  subsetError(...)
}
op match {
  ...
}

So the hexBinary test would be outside the match case. If it were done that way the default case would want to be an Assert.invariant. But they way did in the match case you're right that it should be a subset error. And your way is probably more scala-y so is better than what I was thinking. And your new hexBinary tests confirm its doing the right thing 👍

import org.apache.daffodil.runtime1.infoset.DataValue.DataValuePrimitive

object ComparisonOps {
lazy val forType: Map[
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add scaladoc for this, especially making sure to mention the order of the returned 6-tuple so we don't have to read the code to figure out which operation is which? It's not too hard to figure it out, but it would make things just a little easier. Maybe also include an example of its use with the val (...) = ... to easily extract the comparison ops?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option might be to create a new case class and make that the value of the map, e.g.:

case class ComparisonOps(eq: ComparisonOp, ne: ComparisonOp, lt: ComparisonOp, ...) 

...

Map(
  PrimType.Boolean -> ComparisonOps(EQ_Compare, NE_Compare, LT_Boolean, ...),
  ...
)

Then uses become something like:

val compOps = ComparisonOps.forType(targetTypeForSubExpr)
val compare = if (isExclusive) compOps.lt else compOps.le

It's a little more verbose, but using it is a bit more clear and less prone to errors since it's impossible for callers of forType to get the order of the 6-tuple wrong.

// call to targetTypeForSubexpression since we don't care about the specific
// value of subexpr in targetTypeForSubexpression, and we need all 3 sub
// expressions to come to the right target target for the args
private lazy val targetTypeForSubExpr: NodeInfo.Kind = {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in places where we do something like this (i.e. figure out a common type for all arguments to convert to) we call this convergedArgType. We should do the same here for consistentcy.

<data>1</data>
<min>1.9</min>
<max>2.5</max>
<isInRangeInc>false</isInRangeInc>
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice test 👍, if we converged the types wrong there's a good chance 1.9 would be cast to int, become 1, and isInRangeInc would be true and the test would fail.

Copy link
Copy Markdown
Contributor

@tuxji tuxji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Looks very nice, especially the descriptive names used in the tests!

* @param eq represents the equality comparison
* @param ne represents the unequality comparison
* @param lt represents the less than comparison
* @param le represents the less than and equal to comparison
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

less than or equal to

* @param lt represents the less than comparison
* @param le represents the less than and equal to comparison
* @param gt represents the greater than comparison
* @param ge represents the greater than and equal to comparison
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

greater than or equal to

- add implementation for checkRangeInclusive/Exclusive in Expressions and DFDLFunctions
- create case class ComparisonOp instead of the 6-tuple for the forType Map value
- add ComparisonOps.forType lazy val map and use in ComparisonExpression, RepTypeMixin and DFDLCheckRangeExpr
- add tests for checkRangeInclusive/Exclusive for numeric types and non-numeric types
- add unittest to test out usage exceptions for hexBinary comparisons

DAFFODIL-1515
@olabusayoT olabusayoT force-pushed the daf-1515-dfdlCheckRangeIncExc branch from 22ac0dd to 8bf9547 Compare April 1, 2024 22:29
@olabusayoT olabusayoT merged commit fe6646f into apache:main Apr 2, 2024
@olabusayoT olabusayoT deleted the daf-1515-dfdlCheckRangeIncExc branch April 2, 2024 15:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants