Skip to content

[RFC] Add special case for textNumberPattern with a '0' pad character#1138

Closed
stevedlawrence wants to merge 1 commit intoapache:mainfrom
stevedlawrence:daffodil-DRAFT-icu-zero-padded-numbers
Closed

[RFC] Add special case for textNumberPattern with a '0' pad character#1138
stevedlawrence wants to merge 1 commit intoapache:mainfrom
stevedlawrence:daffodil-DRAFT-icu-zero-padded-numbers

Conversation

@stevedlawrence
Copy link
Member

Note

I've marked this as a RFC/Draft--although I think this changes the behavior to what a user would expect with this specific type of textNumberPattern, it is behavior that is not specified by the DFDL specification

If textNumberPattern specifies a pad character before the number pattern and without a positive prefix, then ICU defaults to a pad position of PAD_BEFORE_PREFIX with no way to change it with just the pattern. This is reasonable for most cases, like when the pad character is a space. However, if the pad character in textNumberPattern is '0', then negative numbers are padded with a '0' before the negative sign. For example, a pattern of "*0####0" unparses -123 to "0-123". This is very unlikely to be what the user wants with this pattern.

So in this very specific case, we change the pad position to PAD_AFTER_PREFIX so the zero pad character appears after the negative sign, e.g. "-0123".

If a user really wants '0' characters to the left of the negative sign, they can use textPadKind/textTrimKind and textNumberPadCharacter to uses Daffodils padding logic.

If textNumberPattern specifies a pad character before the number pattern
and without a positive prefix, then ICU defaults to a pad position of
PAD_BEFORE_PREFIX with no way to change it with just the pattern. This
is reasonable for most cases, like when the pad character is a space.
However, if the pad character in textNumberPattern is '0', then negative
numbers are padded with a '0' before the negative sign. For example, a
pattern of "*0####0" unparses -123 to "0-123". This is very unlikely to
be what the user wants with this pattern.

So in this very specific case, we change the pad position to
PAD_AFTER_PREFIX so the zero pad character appears after the negative
sign, e.g. "-0123".

If a user really wants '0' characters to the left of the negative sign,
they can use textPadKind/textTrimKind and textNumberPadCharacter to uses
Daffodils padding logic
@stevedlawrence
Copy link
Member Author

This is one idea for a way to fix the issue mentioned in DFDLSchemas/NITF#19. Without this, I could not get ICU to unparse the way a user would expect. The alternative that Mike suggests of using a choice and different elements for positive and negative numbers works but feels unsatisfying and makes the schema much more complex.

// If a user really wants '0' characters to the left of the negative sign, they can use
// textPadKind/textTrimKind and textNumberPadCharacter to uses Daffodils padding logic
// instead of ICUs.
if (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to explore a different fix idea.

Seems to me that ICU just has the behavior wrong, as in a design flaw/bug.

The ICU notion that only the positive pattern is used to determine everything except the prefix and suffix (something I read in ICU doc) is simply incorrect.

The actual use case we have is one where the pattern wants to be "00000;-0000" but where those 4 zeros in the negative pattern are respected and used to create only 4 digits for the negative case. But ICU doesn't do that. The documented ICU behavior is that the prefix minus sign is taken from the negative pattern, but the digits there are treated as if the pattern was "00000;-#" so that the number of negative digits is ignored.

Let's call that behavior ICU1.

To me that design point in ICU is simply incorrect. Sometimes the number of digits should be different for positive and negative values because of the space taken up by prefix/suffix in fixed length contexts.

Now let's consider how to fix.

I believe the desired behavior should be that if the value is negative, the negative pattern is used, ignoring the positive pattern entirely.

Let's call that behavior ICU2.

We can implement ICU2 behavior for unparsing by just creating two separate patterns. One is the existing positive pattern, the other the existing negative pattern, but as a positive pattern. Ex: if the pattern was "00000;-0000" we would create "00000" and "-0000" with the latter being a positive pattern.
Then we would examine the value being unparsed and if negative, negate it so it is positive, then use the second pattern to unparse it, which will lay down the prefix minus sign as desired, along with 4 digits.

This eliminates any need to fuss with their padding stuff for this specific use case, but it also fixes the broken behavior of padding because for the case where you have "00000;-*00000" the padding would be after the prefix for the negative case as the positive pattern is completely ignored for the negative case.

This variation, what I'm calling ICU2 behavior, is fairly easy for us to explain in Daffodil documentation, and can be motivated with this specific use case as an example of why we don't just use regular ICU behavior.

This behavior is not backward compatible, so we probably want a property to enable this, and I'd suggest we eventually would want this to become the default behavior.

I think this has negligible performance impact. Just a test for whether a value is negative.

Thoughts?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this run into issues related to parsing, since ICU only uses the positive pattern for parsing?

For example, with the pattern *000000;-*00000 (note that we would need the pad char defined in both pos and neg patterns, since ICU only gets pad from the positive pattern) ICU will require 5 digits even for negative numbers when parsing since the length comes from the positive pattern. Only negative prefix and suffix come from the negative pattern. So this approach would unparse negative numbers with 4 digits but parse would expect 5 digits.

I guess this could be resolved by using this as your pattern: *0####0;-*0####0? So on parsing, it requires 5 characters total, including the minus sign for negative numbers. And unparsing negative numbers would have the padding on he correct side of the minus sign since there's a positive prefix for a negated negative number. Though, parsing this would still expect padding on the left side of the negative sign.

This also could get confusing if someone uses very different positive and negative patterns. Parse would only accept the positive pattern (plus prefix/suffix from negative pattern), but unparse would unparse with the negative pattern. Feels like that could be confusing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think we'd have to play a similar game on parsing, split the pattern into two parts as previously described, parse the data with the negative pattern (using it as a positive pattern) and if that succeeds use that negative pattern result (but negate it), otherwise use the positive pattern parse result.

That would address the concern of what if the positive and negative patterns are very different, and be symmetric in behavior between parse and unparse.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That does mean double the effort for negative numbers. The flexibility is nice, but is it worth it?

// instead of ICUs.
if (
df.getFormatWidth > 0 &&
df.getPadCharacter == '0' &&
Copy link
Contributor

@mbeckerle mbeckerle Jan 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've seen data that does this with spaces, not just leading zeros E.g., where a field looks like "- 12.3" with the minus sign way over to the left. I have also seen "( 12.34)"

So I would remove the check that the ICU pad character is a '0'.

I think the right behavior for this fix (not the ICU1 vs. ICU2 fix mentioned in the other comment) is just for prefix is the positive prefix empty and the negative prefix non-empty, then the padding behavior should come from the negative case. Ditto for suffix.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good points. Though I'm not sure we can remove the '0' check, if we did that I think we couldn't format numbers with padding to the left of the number like (12.34).

I wonder if maybe the simplest solution is a new dfdl extension property? Something like dfdlx:textNumberPatternPadPosition, with values beforePrefix, afterPrefix, beforeSuffix, afterSuffix, and fromPattern as the default for current behavior? Then we can easily support those other padding styles without having to guess what the user wanted. And it keeps all the existing ICU behavior (as weird as it is) the same.

There might be confusion with dfdlx:textNumberPadCharacter, which works different and is applied after ICU formats a number to a string, but that feels like less of a big deal?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me explore the property idea.

I think the property name should contain "ICU" e.g., dfdlx:icuNumberPatternPadPosition to distinguish it from the DFDL properties for text numbers like dfdl:textNumberJustification, etc.

It does bring up the issue of having that property defined, but having the pattern contradicting it, as in
the property has value beforePrefix, but the pattern is `"+* 00000;-* 0000" which is after prefix. One would still need the "*" in the pattern to define the ICU pad character to be used, but this brings up the possibility of it being ambiguous and or contradictory.

We would still need the "*" and padding character in the pattern as a way of specifying the padding character ICU would use, but... we would ignore where that was expressed and use the property unless the value is 'fromPattern" which defaults to beforePrefix behavior when ambiguous. (Not sure what it defaults to if the ambiguity is about the position relative to a suffix, but it's clearly one or other of beforeSuffix or afterSuffix)

This has the drawback that our patterns would not behave like ICU patterns really in that in many cases where the "* " notation appears in the positive pattern would be ignored.

I guess we could document as the property only being used if the prefix or suffix of the positive pattern is empty so that it is ambiguous. But one could still have contradictory information in the negative pattern like:

dfdlx:icuNumberPatternPadPosition="beforePrefix"
dfdlx:textNumberPattern="* 00000;-* 00000"

In this the positive pattern is ambiguous about the pad position w.r.t prefix because the prefix is empty, but the negative pattern is not. It clearly shows the padding as after prefix. But I guess ICU lets you put all sorts of noise in the negative pattern that ends up being ignored.

I really do think that if the pad position is ambiguous in the positive pattern, then the negative pattern should be used to determine it, as I think this would be the typical case. If the negative pattern is absent, doesn't express pad position, or is also ambiguous then I guess it should default to beforePrefix which is the current behavior.

Ok, I've decided all this seems rather squishy and hard to explain compared with saying that:

  • Daffodil parses and unparses text numbers using the textNumberPattern where the positive pattern is used for positive numbers, the negative pattern (if defined) is used for negative numbers.
  • This differs from the ICU specification which says that only the prefix and suffix definitions are taken from the negative pattern, and other parts of the negative pattern are ignored.
  • This enables the negative pattern to have different (usually fewer) digit specifiers which is a capability needed for fixed length data situations when the sign indication (prefix and suffix) occupies character positions that can be digits for a positive value.
  • An example is a fixed length 5 character field that can contain 5 digits for a positive value, but a minus sign and only 4 digits for a negative value. Normal ICU behavior using pattern "*000000;-0000" will output 6 characters for -12 being "0-0012"as normal ICU ignores the fact that the negative pattern has only 4 zeros. DFDL will respect the 4 zeros in the negative pattern and will output "-0012" occupying only the allowed 5 characters.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One other question: I didn't see a description saying there could be only one pad character placement.
Do we know that this is illegal: * +*x####;* -*x####" , which has padding both before (spaces) and after ("x" chars) the prefix?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another wild idea. Why not add a whole second textNumberPattern property to be used for negative numbers, when those need special treatment.

So you can specify a regular ICU pattern with pos + neg parts if you want the regular ICU behavior.
You specify textNumberPositivePattern and textNumberNegativePattern as separate properties if you need negative numbers to be parsed with the same full behavior that is used for positive numbers (i.e.,without ignoring things in the negative pattern like ICU normally would).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I've decided all this seems rather squishy and hard to explain compared with saying that:

I agree that your explanation and justification for handling textNumberPattern differently than ICU is logical.

You specify textNumberPositivePattern and textNumberNegativePattern as separate properties:

Altenatively, textNumberNegativePattern by itself could be sufficient as the signal to handle textNumberPattern differently than ICU. Daffodil could extract the positive pattern from textNumberPattern and the negative pattern from textNumberNegativePattern. I would accept the reasoning that both textNumberPositivePattern and textNumberNegativePattern are necessary for symmetry, but I would argue that if only one of the two is defined while the other is missing, Daffodil needs to either issue a schema definition error or extract the missing pattern from textNumberPattern.

@mbeckerle
Copy link
Contributor

Subsumed by #1139 which fixes the same issue.

@stevedlawrence stevedlawrence deleted the daffodil-DRAFT-icu-zero-padded-numbers branch January 16, 2024 15:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants