Unicode above \uFFFF - how to express? #208

Closed

StrayAlien opened this issue Dec 17, 2018 · 16 comments

@StrayAlien
Contributor

Hi all, apols for the spam.

I want to get some sanity test coverage in for Unicode support - especially for those code points above \uffff.

The spec says:

Note that the character range that includes all Unicode characters is [\u0-\u10FFF].

But then describes a code point as:

code point = "\u", hexadecimal digit, 4 * [hexadecimal digit] ;

4 digits. I do not see anywhere in the spec where it defines how to address those extra chars.

One could assume they are described as surrogate pairs but then assuming stuff is often a bad idea.

Is this one of those 'oh we'll fix it in the next spec rev' things?

All advice appreciated.

Greg.

@agilepro
Contributor

agilepro commented Dec 17, 2018

You are mixing two things. The first is "code points", which should cover the entire Unicode range, which is 32 bits large (most of those points are undefined). The other is the \u notation.

If the spec says up to 10FFF then that would correspond to the limits of UTF-16 encoding. That involves what are called surrogate characters which are sequences of two 16-bit values. That is, you would have to put two 16-bit values together to express a single code point, but you can not use the \u notation this way to express those values.

All this is avoided if you use UTF-8 encoding which uses sequences of bytes to represent values in the entire 32-bit code point range. One option, of course, is to simply put the value into the string without using the \u representation.

Your question, though, is about "string literals" and how to use the \u representation. As you point out, \u only allows four hex values, so you can't use this for all code points. This appears to be a limitation of the \u notation.

Since JSON encoded in UTF-8 is allowed, why not just put the character values directly into the string literal without using the \u notation? In Java the UTF-8 characters are properly read into the UTF-16 internal representation. I don't know about other languages. I personally would like to see tests with UTF-8 encoded characters up to 10FFF just to make sure that it works.
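For illustration, here is a tiny Java fragment (my own sketch, not anything from the spec): the raw UTF-8 bytes for U+1F40E decode to a single code point, which Java then stores internally as two UTF-16 code units.

    // UTF-8 encoding of U+1F40E (the horse emoji) is F0 9F 90 8E
    byte[] utf8 = {(byte) 0xF0, (byte) 0x9F, (byte) 0x90, (byte) 0x8E};
    String s = new String(utf8, java.nio.charset.StandardCharsets.UTF_8);
    System.out.println(Integer.toHexString(s.codePointAt(0))); // 1f40e -- one Unicode code point
    System.out.println(s.length());                            // 2     -- two UTF-16 code units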

@StrayAlien
Contributor Author

StrayAlien commented Dec 17, 2018

Thanks Keith, yes - I am talking about string literals. I neglected to say that. Apols.

The notation is language dependent. Java handles it as surrogates; ES6 (for example) can handle it in another way. The \uhhhh notation is a DMN (and others) limitation. But, in my experience, most languages can handle to-10FFF using \uhhhh surrogate pairs. I'm just 'disambiguating' here as I'm writing tests.

Re:

why not just put the character values directly into the string literal without using the \u notation?

Because the spec says we're to support a "\u" representation. :-)

Though the spec doesn't say whether that representation is to support to-10FFF chars in a string literal.

At a guess (danger) I would say string literals with to-10FFF should be surrogate pairs, so I am writing tests that use that. Happy for input. Tell me if I shouldn't.

Note - this usage of to-10FFF chars has implications for other stuff as well. The "string length" function has to take it into account. Other string functions that use offsets, like substring() and so on, have to take the two Unicode code units per supplementary char into account.
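For instance, in rough Java terms (just a sketch of the kind of thing the tests need to pin down, not anything from the spec):

    String s = "\uD83D\uDC0E\uD83D\uDCA9";               // horse + poop emoji
    System.out.println(s.length());                      // 4 -- UTF-16 code units
    System.out.println(s.codePointCount(0, s.length())); // 2 -- Unicode code points
    // an offset-based function that counts code points has to convert offsets first:
    int end = s.offsetByCodePoints(0, 1);                // 2 -- code-unit offset of the 2nd code point
    System.out.println(s.substring(0, end));             // just the horse emoji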

Those need tests ... (which I will work on tomorrow).

re:

In Java the UTF-8 characters are properly read into the UTF-16 internal representation

For literals - that is parser dependent. ANTLR and others may handle it automagically, but my lexer/parser is hand-crafted. No .g4 file.

Re:

I personally would like to see tests with UTF-8 encoded characters up to 10FFF just to make sure that it works.

If you can note a few scenarios you're particularly interested in, maybe I can get them coded into tests. I've another few days before I reckon I hit a brick wall in test writing. :-)

Greg.

@StrayAlien
Contributor Author

StrayAlien commented Dec 17, 2018

example test (this asserts '2')

    <decision name="decision_004" id="_decision_004">
        <variable name="decision_004"/>
        <literalExpression>
            <!-- horse + poop emoji -->
            <text>string length("\ud83d\udc0e\uD83D\uDCA9")</text>
        </literalExpression>
    </decision>

Obviously .. the 'poop' emoji had to make it into extended unicode tests .... :-)

edit:

As does this:

    <decision name="decision_004_a" id="_decision_004_a">
        <variable name="decision_004_a"/>
        <literalExpression>
            <!-- horse + poop emoji -->
            <text>string length("🐎💩")</text>
        </literalExpression>
    </decision>

@StrayAlien
Contributor Author

StrayAlien commented Dec 18, 2018

@agilepro I have pushed some sanity checking unicode tests into the PR. #204 (comment).

To keep it family friendly, I have used nicer emojis. :-)

The tests assume surrogate pairs for supplementary chars. But they also test literal as well as hex-encoded forms. A few string func assertions in there too. Let me know what you think.

Happy to close this if you/folks think that is good for now - including the surrogate pairs assumption.

Greg.

@agilepro
Contributor

agilepro commented Dec 28, 2018

Java Spec section 3.3 "Unicode Escapes" states that \uXXXX allows you to specify a "UTF-16 code unit", which is the 16-bit quantity that takes up one space in a Java String. This is NOT a Unicode code point value.

https://docs.oracle.com/javase/specs/jls/se7/html/jls-3.html

The DMN spec says that \u is followed by five hex digits (one mandatory, 4 optional) and it says that those five digits form the unicode code point.

It also incorrectly says that \u10FFF would be the highest unicode code point, but UTF-16 actually supports 16 times as many characters, up to \u10FFFF (note that takes 6 hex digits). I believe this is a bug / typo that nobody caught and it should allow for 5 optional digits if they wish to support all the characters that Java supports.

The value should be a code point, since that is the real value. Java's use of "UTF-16 code unit" is a by-product of the language starting out with the UCS-2 encoding and then switching to UTF-16 later. Having \u produce one place in a string requires the user to know too much about the internal representation. Specifying a code point is a far "friendlier" value.

Also, FEEL allows more than 4 hex digits and so has a way to express code points larger than FFFF; once again, the fact that characters larger than that have to be encoded using surrogate characters should not be something the user is bothered by.
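As a concrete illustration (just a Java fragment I put together, nothing normative):

    // a code point above U+FFFF occupies two UTF-16 code units (a surrogate pair) in a Java String
    char[] units = Character.toChars(0x1F40E);          // the horse emoji, U+1F40E
    System.out.println(Integer.toHexString(units[0]));  // d83d (high surrogate)
    System.out.println(Integer.toHexString(units[1]));  // dc0e (low surrogate)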

The first test you wrote above should assert 4, not 2.

@StrayAlien
Contributor Author

StrayAlien commented Dec 28, 2018

Hi Keith, thanks. "Happy holidays" to you (as I believe you say that side of the pond). I do value your feedback, and certainly will defer to your greater knowledge in this area. I am no expert, sounds like you know a lot more than me here for sure.

Said nicely with a smile - if you can comment sooner than 10 days that'd be great. :-) I've waaaaaaaay moved on since writing those tests. :-)

I totally missed the \u 1+4 (optional) in the spec. I must have looked at that thing 20 darn times and read it as \u + 4! My failing. If I'd seen that, then I guess this thread would not have been created. It seems we should not be doing surrogate pairs, but supporting (at this moment) up to 5 hex digits (though your comment above, "As you point out, \u only allows four hex values, so you can't use this for all code points", kind of meant I charged ahead with surrogates - still, I should have read the darn spec correctly!). I understand the '6' thing, but for now, the spec says 5. I will amend the tests in the new year to not use surrogate pairs, but \u 1 + 4.

Re the test asserting 4, not 2. DMN is not Java - so I'm not sure why the Java focus here. If I were implementing DMN in JS or Fortran or Smalltalk or VB I'd not be referencing that spec at all.

Having said that, in Java terms - as I understand it "\ud83d\udc0e\uD83D\uDCA9" is 2 code points and 4 code units. As it happens, Java counts this as two. Using java.lang.String.codePointCount() the above is 2, not 4.

I have reread your post above a number of times and I believe you are saying we should count points, not units - which is what I have done. But then you say "\ud83d\udc0e\uD83D\uDCA9" should be 4 .... which is actually just 2 code points.

My apologies, but I have missed something here - regardless of how it is represented, surrogate pairs or otherwise, we should count points, yes?

Many thanks,

Greg.

PS. Please take any comments and questions as constructive. If it leads to a better and more definitive TCK then all good. This is a complex area - always happy to learn from those that know more.

@agilepro
Contributor

agilepro commented Dec 28, 2018

Greg, sorry for the long delay. I have been enjoying the holidays.

The FEEL syntax says that: "\ud83d\udc0e\uD83D\uDCA9" is four characters. Each \u defines a character.

In Java syntax, that would be four UTF-16 code values, but only 2 characters, because it takes two 16-bit values to make a single character for code values in that range. If you were to swap the position of the first two codes, it would be considered an invalid String in Java, because surrogates in that order are not allowed. A surrogate by itself is not allowed either. However, my guess is that Java just allows improperly formed UTF-16 strings to exist for most operations, and will complain about improper format only when using the advanced Character-oriented operations. What Java does is immaterial to us, since the DMN spec does not say that interpretation according to Java is required; instead, FEEL is supposed to be language independent.

However, those character values are strange ones. Unicode does not define any glyphs for them, and they cannot be displayed. That range has been reserved in the Unicode standard so that UTF-16 will not be missing any characters. According to the Unicode spec, "Isolated surrogate code points have no interpretation; consequently, no character code charts or names lists are provided for this range." (https://unicode.org/charts/PDF/UDC00.pdf) Thus your example has four undisplayable characters.

CORRECTION:

In the first version of this comment, I said that the DMN spec says that characters from D800 to F8FF are not considered valid characters (FEEL grammar rule #30, S-FEEL grammar rule #25).

On reexamination that grammar rule is about "name start characters" only, not characters in general. Later, in section 10.3.1.1 it says "the character range that includes all Unicode characters is [\u0-\u10FFF]."

This statement contradicts a reference in grammar rule #30 to character \uEFFFF, which is larger than the largest allowable character. I assume that they mean the range is [\u0-\u10FFFF], which is what UTF-16 supports.

So, those are valid, but unprintable characters.

@agilepro
Contributor

SORRY. The variable length \u was only in reference to the EBNF expressions, and NOT FEEL expressions. FEEL and S-FEEL only allow exactly four digits.

The spec is completely silent on the meaning of that four digit value. See my write-up at: https://s06.circleweaver.com/weaver/t/wfmc/dmn-reference-implementation/noteZoom2034.htm

It seems that we are stuck with exactly four digits, and so the only reasonable conclusion is that the values represent UTF-16 code values NOT unicode code points. And thus your tests are correct: the first example is two characters long.
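In other words, an implementation would do something like this (a rough sketch only, not the TCK's or any vendor's code):

    // treat each \uXXXX in a FEEL string literal as one UTF-16 code unit,
    // then count Unicode code points for the string length() function
    static int feelStringLength(String literal) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < literal.length()) {
            if (literal.charAt(i) == '\\' && i + 6 <= literal.length() && literal.charAt(i + 1) == 'u') {
                sb.append((char) Integer.parseInt(literal.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                sb.append(literal.charAt(i++));
            }
        }
        return sb.codePointCount(0, sb.length());    // a surrogate pair counts once
    }
    // feelStringLength("\\ud83d\\udc0e\\uD83D\\uDCA9") == 2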

@StrayAlien
Contributor Author

Thanks Keith - good write up in the link. Nice.

Yes. Those chars are not printable, but are valid halves of a surrogate pair. And re 2, not 4. Phew! I'm glad I wasn't missing something. Thank you for the clarification.

Btw, with reference to your writeup. Another possibility re code point notation is to adopt the way ES6 does it. https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String#Escape_notation

It keeps \uxxxx to represent code points (and surrogate pairs), but it also has \u{X} ... \u{XXXXXX} - which is variable length and delimited.

Regardless, I guess the big question for us is, "what should we do in the TCK"?

I have a feeling that perhaps nobody has fully implemented "DMN unicode" support as yet and anything we lock down in the TCK feels 'speculative'.

In times like this is it possible to just contact the spec author(s) for clarification of their intent? If we know that we can create tests now and their 'intent' can be further clarified in a subsequent spec version. We should be working together with them on this stuff shouldn't we? I would have thought they'd welcome this effort and be happy to help. Maybe?

Pending clarification from the captains of the DMN spec shall I remove the unicode tests from my PR?

Greg.

@agilepro
Contributor

agilepro commented Jan 11, 2019

We discussed this on Jan 4. Edson and Matteo said they would review the way UTF-16 works. This is important. There are two possibilities: either the hex value is a Unicode code point (an actual character), or the hex value is a UTF-16 code value (which sometimes takes two values to make a single character). The following string:

"\ud83d\udc0e\uD83D\uDCA9"

is interpreted as four characters in the first case, and two characters in the second case.

When interpreted as two characters, those characters are "🐎💩". When interpreted as four Unicode characters, this particular example can not be displayed because those characters are defined in Unicode not to have any visual representation, but they still exist as four separate code points. Also, these particular Unicode characters (characters in the surrogate range) can not be legally represented in UTF-16 strings.

I will bet anything that the Camunda and DROOLS implementations interpret these as UTF-16 code values, because that is what Java does, and because each hex value takes one spot in a Java String. However, the "string length" function is defined in FEEL to count Unicode characters, and the correct answer is 2. The Java String.length() function will return 4. But I bet that both Camunda and DROOLS answer 4 at this time. (This problem is obscure and seems not to have been discussed in the RTF.)

My recommendation to the RTF is to adopt a new syntax, \u{XXXXXX}, which can take 1 to 6 hex digits inside braces; in this case the value represents a Unicode code point. But we still need to determine what the four-hex-digit version represents.
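Sketched in Java (hypothetical syntax handling, nothing that exists in the spec today):

    // proposed \u{XXXXXX}: the 1-6 hex digits inside the braces are a Unicode code point,
    // and the implementation emits whatever code units its internal encoding needs
    int codePoint = Integer.parseInt("1F40E", 16);  // digits taken from inside the braces
    StringBuilder sb = new StringBuilder();
    sb.appendCodePoint(codePoint);                  // appends the surrogate pair D83D DC0E
    System.out.println(sb.length());                // 2 code units, but 1 code point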

@StrayAlien
Contributor Author

@agilepro @etirelli @tarilabs. There are some tests in the current PR. As discussed, they assert string length 2 and support surrogate pairs. I am happy to remove them if this is all in flux, but for now, I'll hang tight and leave the tests in. I do think the \u{xxxxxx} notation is the right way to go though.

@agilepro
Contributor

agilepro commented Jan 18, 2019

I believe the correct string length is 4 for the example given above. That is, count the number of UTF-16 code units. That is what JavaScript does, and that is what Java does (in the normal string length function).

There are many obscure corners of Unicode that will get in the way of counting the number of "character like" units. You might find this page useful: https://s06.circleweaver.com/wu/utf8genie.htm

(1) You can create ñ two different ways: "\u{F1}" and "n\u{303}". While this appears to the user to be a single character, the second form requires two Unicode code points.
(2) Combined emoji: This character "👩‍❤️‍💋‍👩" is counted by JavaScript as string length = 11. It looks like one character, but it actually is 8 Unicode code points, and that is 11 UTF-16 code units. Counting code points is no better than counting UTF-16 units.
(3) Some of these combined characters can be manipulated using backspace, and it takes more than one backspace to delete the combined characters. So the user "knows" it is not a single character in any sense.
(4) There will be a strong dependency on the Unicode implementation for how they appear. For example, Internet Explorer displays case 2 above as three separate graphemes. So even if we work hard to exactly represent what people see in terms of counting characters, it still might appear wrong (different) on different platforms.

All of this convinces me that the "correct" answer for string length is the number of UTF-16 code units. That is what JavaScript does, that is what Java does, and that is what I believe people will expect when it comes to these exotic characters. The answer is well defined and will be exactly the same on all implementations.

Here are the escaped examples above:

(1) mañana, mañana = "mañana, man\u0303ana"
(2) 👩‍❤️‍💋‍👩 = "\uD83D\uDC69\u200D\u2764\uFE0F\u200D\uD83D\uDC8B\u200D\uD83D\uDC69"
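A quick Java check of those counts (just a sanity sketch, using Java's String methods as the yardstick):

    String combined = "\uD83D\uDC69\u200D\u2764\uFE0F\u200D\uD83D\uDC8B\u200D\uD83D\uDC69";
    System.out.println(combined.length());                             // 11 UTF-16 code units
    System.out.println(combined.codePointCount(0, combined.length())); // 8 Unicode code points

    String combining = "man\u0303ana";                // 'n' followed by a combining tilde
    System.out.println(combining.length());           // 7 -- renders as the 6-glyph "mañana"
    System.out.println("ma\u00F1ana".length());       // 6 -- precomposed ñ (U+00F1)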

@StrayAlien
Contributor Author

Thanks @agilepro, sounds like it is wise I remove the tests until the spec has clarity here.

@tarilabs @etirelli

@agilepro
Contributor

agilepro commented Feb 8, 2019

I filed a ticket at the RTF:

Key: INBOX-822
Status: pending
Source: fujitsu america ( keith swenson)
Summary:

Rule #66 on page 111 says that a character in a string can be expressed as:

"\u", hex digit, hex digit, hex digit, hex digit

For example "\uD83D"

That is, exactly four hex digits. I believe the intent is that FEEL only allows exactly four digits, and does not allow the kinds of expressions that we see in the EBNF.

What is never specified is the exact meaning of that hex value. There are two possibilities:

(a) Is that value a Unicode code point? In this case it is easy: the hex value is the code point value. However, you are then limited to 64K characters, not the 1.1M character range normally considered, and not even the values that are mentioned in the spec as having significance.

(b) Or is it a UTF-16 code value? UTF-16 has encoding rules about values in the surrogate character range. In UTF-16, a high-surrogate code value must be followed by a low-surrogate code value, or else the sequence of values is invalid and undefined. Using surrogate characters you can address the entire 1.1 million characters, but the user is required to understand surrogate pairs.

The spec never mentions that UTF-16 encoding is required! It always uses "Unicode" and talks about "characters" and "code points". It does not mention anything about surrogate pairs. It never says that these values are "just like Java" or any other UTF-16 implementation.

Page 124 says that the FEEL string value is the same as java.lang.String. Should we infer from that that internal representations must be in UTF-16? However, it also says that it is equivalent to an XML string (which is NOT constrained to UTF-16) and a PMML string, which I looked up and seems to be based on XML. XML allows characters to be expressed as &#nnnn; - that is, an ampersand, a hash, a decimal number, terminated by a semicolon. In this case, the decimal value is the actual code point, and not the UTF-16 value. So page 124 does not say unambiguously that Java defines the string values that can be used.

Unicode is mentioned in only three places: on page 108 (about EBNF character ranges), on page 111 (that tokens are a sequence of Unicode characters), and on page 114 (in an example).

While it might be nice to be a "code point", the syntax clearly limits you to four digits, leaving you no way to express larger code point values. If it was a code point, you would be limited to only specifying 64,000 characters (minus several thousand code points that are not allowed for various reasons).

The easiest repair is to state clearly that the \u notation assumes that UTF-16 is being used to encode the strings, and that UTF-16 rules must be used when specifying hex values for characters.

I believe most implementations to date have assumed that these are UTF-16 code unit values. That is what Java does. That is what JavaScript does. I don't know of any environments that do anything different for this kind of expression.

Reported: DMN 1.2 Beta 1 — Fri, 8 Feb 2019 18:33 GMT
Updated: Fri, 8 Feb 2019 18:33 GMT

@agilepro
Contributor

agilepro commented May 3, 2019

We are now three months after the submission of the issue to RTF and a full 5 months after the issue was raised by Greg on Dec 16.

Apparently somebody thinks that it is a good idea to use UTF-32 as an encoding, breaking literally every implementation out there.

Nobody is able to answer the simple question "What does the following string mean?"

"\uD83D\uDC07"

Some people think this is one character, and others think this is two characters. Of course, those same people disagree on what character(s) this represents. Which of these is correct?

string length( "\uD83D\uDC07") == 1
string length( "\uD83D\uDC07") == 2

We still don't know and we are not getting there very quickly.

Currently all the known implementations have a clear answer for this. So ... how long are we going to wait for the RTF to change the spec to make all existing implementations invalid? What is the utility of doing that?

@agilepro
Contributor

agilepro commented May 3, 2019

Discussed in meeting. These should be interpreted the same as Java.

@agilepro agilepro closed this as completed May 3, 2019