
Range of the Char #317

Closed
yous opened this issue Dec 14, 2014 · 9 comments

@yous
Contributor

yous commented Dec 14, 2014

Looking at this line, a Char can have an ord of at most 0x1FFFFF. But UTF-8 ends at 0x10FFFF per RFC 3629: the 4-byte range originally went up to 0x1FFFFF, but it was restricted in November 2003.

The Invalid byte sequences section indicates:

A 4-byte sequence (starting with 0xF4) that decodes to a value greater than U+10FFFF

Also, the write_utf8 function in the sample code there is the same as our each_byte, except it uses 0x10FFFF, not 0x1FFFFF.

Does Char support UTF-16 for 0x10FFFF < ord <= 0x1FFFFF, or is this a mistake?
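
For reference, here is a minimal sketch of the boundary in question: a C approximation of the write_utf8 sample mentioned above (my own reconstruction, not Crystal's actual each_byte), showing where the 0x10FFFF cap applies:

#include <stdio.h>

/* Write a code point to stdout as UTF-8 bytes.
   Per RFC 3629, anything above 0x10FFFF is invalid. */
void write_utf8(unsigned code_point)
{
  if (code_point < 0x80) {
    putchar(code_point);
  } else if (code_point <= 0x7FF) {
    putchar((code_point >> 6) + 0xC0);
    putchar((code_point & 0x3F) + 0x80);
  } else if (code_point <= 0xFFFF) {
    putchar((code_point >> 12) + 0xE0);
    putchar(((code_point >> 6) & 0x3F) + 0x80);
    putchar((code_point & 0x3F) + 0x80);
  } else if (code_point <= 0x10FFFF) { /* the cap: 0x10FFFF, not 0x1FFFFF */
    putchar((code_point >> 18) + 0xF0);
    putchar(((code_point >> 12) & 0x3F) + 0x80);
    putchar(((code_point >> 6) & 0x3F) + 0x80);
    putchar((code_point & 0x3F) + 0x80);
  } else {
    /* error: not a valid Unicode code point */
  }
}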

@asterite
Member

Hi @yous,

We are kind of busy these days, but we'll give you an answer soon. My first guess is that it's a mistake, but I have to check with @waj and others.

I'm curious, how did you find this little error in the code?

@yous
Contributor Author

yous commented Dec 16, 2014

I was expecting "\u{FF}".bytes to return [255], but it returns [195, 191]. So I started looking into where that comes from.

@waj
Member

waj commented Dec 16, 2014

Good catch!

Strings in Crystal are always represented in memory as UTF-8. That's why the character U+FF is represented by two bytes.
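
Concretely, the arithmetic (a quick C check of the two-byte pattern 110xxxxx 10xxxxxx):

#include <stdio.h>

int main(void)
{
  unsigned cp = 0xFF;
  /* two-byte form: 110xxxxx 10xxxxxx */
  printf("%u %u\n", 0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)); /* prints "195 191" */
  return 0;
}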

@yous
Contributor Author

yous commented Dec 16, 2014

I think this line of CharReader is also affected. But we may need to be a bit more specific and check the first two bytes:

if first < 0xf4 || first == 0xf4 && second < 0x90
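
The boundary this check encodes: the largest valid 4-byte sequence is F4 8F BF BF (U+10FFFF), so a lead byte of 0xF4 is only in range when the second byte is below 0x90. A quick sanity check in C (hypothetical helper, just to illustrate the condition):

#include <stdio.h>

/* For a 4-byte sequence, report whether these two lead bytes
   can still decode to a code point <= U+10FFFF. */
int lead_bytes_in_range(unsigned char first, unsigned char second)
{
  return first < 0xf4 || (first == 0xf4 && second < 0x90);
}

int main(void)
{
  printf("%d\n", lead_bytes_in_range(0xf4, 0x8f)); /* 1: can encode U+10FFFF   */
  printf("%d\n", lead_bytes_in_range(0xf4, 0x90)); /* 0: would exceed U+10FFFF */
  return 0;
}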

@simnalamburt

👍

@asterite
Member

@yous You are right. We should probably adopt the read_code_point_from_utf8 algorithm from Wikipedia, right?

#include <stdio.h>

unsigned read_code_point_from_utf8()
{
  int code_unit1, code_unit2, code_unit3, code_unit4;

  code_unit1 = getchar();
  if (code_unit1 < 0x80) {
    return code_unit1;
  } else if (code_unit1 < 0xC2) {
    /* continuation or overlong 2-byte sequence */
    goto ERROR1;
  } else if (code_unit1 < 0xE0) {
    /* 2-byte sequence */
    code_unit2 = getchar();
    if ((code_unit2 & 0xC0) != 0x80) goto ERROR2;
    return (code_unit1 << 6) + code_unit2 - 0x3080;
  } else if (code_unit1 < 0xF0) {
    /* 3-byte sequence */
    code_unit2 = getchar();
    if ((code_unit2 & 0xC0) != 0x80) goto ERROR2;
    if (code_unit1 == 0xE0 && code_unit2 < 0xA0) goto ERROR2; /* overlong */
    code_unit3 = getchar();
    if ((code_unit3 & 0xC0) != 0x80) goto ERROR3;
    return (code_unit1 << 12) + (code_unit2 << 6) + code_unit3 - 0xE2080;
  } else if (code_unit1 < 0xF5) {
    /* 4-byte sequence */
    code_unit2 = getchar();
    if ((code_unit2 & 0xC0) != 0x80) goto ERROR2;
    if (code_unit1 == 0xF0 && code_unit2 < 0x90) goto ERROR2; /* overlong */
    if (code_unit1 == 0xF4 && code_unit2 >= 0x90) goto ERROR2; /* > U+10FFFF */
    code_unit3 = getchar();
    if ((code_unit3 & 0xC0) != 0x80) goto ERROR3;
    code_unit4 = getchar();
    if ((code_unit4 & 0xC0) != 0x80) goto ERROR4;
    return (code_unit1 << 18) + (code_unit2 << 12) + (code_unit3 << 6) + code_unit4 - 0x3C82080;
  } else {
    /* > U+10FFFF */
    goto ERROR1;
  }

  ERROR4:
    ungetc(code_unit4, stdin);
  ERROR3:
    ungetc(code_unit3, stdin);
  ERROR2:
    ungetc(code_unit2, stdin);
  ERROR1:
    return code_unit1 + 0xDC00;
}

I think it catches more invalid sequences than what we have right now (and I'd like to have a spec for each invalid case).
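
For instance, a hypothetical driver (assuming the function above, compiled in the same file) makes it easy to pipe each invalid case through stdin; bytes from invalid sequences come back as the byte plus 0xDC00, i.e. in the U+DC80-U+DCFF range:

int main(void)
{
  int c;
  while ((c = getchar()) != EOF) {
    ungetc(c, stdin); /* peek only; the decoder consumes the bytes itself */
    printf("U+%04X\n", read_code_point_from_utf8());
  }
  return 0;
}

/* printf '\xC0\xAF' | ./a.out      (overlong "/") -> U+DCC0 U+DCAF            */
/* printf '\xF4\x90\x80\x80' | ./a.out (> U+10FFFF) -> U+DCF4 U+DC90 U+DC80 U+DC80 */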

@asterite
Member

@yous Could you take a look at the last commit to see if it's ok? It's more or less Wikipedia's code with some common code refactored.

I see Wikipedia's code returns an invalid code point (the offending byte plus 0xDC00, landing in the low surrogate range) when it encounters an invalid byte sequence, instead of raising an exception. I wonder if that's what we really want to do.

Another thing that we don't do very well is this:

puts 0x1FFFF.chr # assume utf-8 encoding, works but maybe should raise

But in Ruby:

puts 0x110000.chr(Encoding::UTF_8)

Gives:

foo.cr:1:in `chr': invalid codepoint 0x110000 in UTF-8 (RangeError)

Do you think we should raise as well?

@yous
Contributor Author

yous commented Dec 19, 2014

@asterite I think the commit is okay.

For the first thing, looking at the Codepage layout section:

Red cells must never appear in a valid UTF-8 sequence. The first two (C0 and C1) could only be used for overlong encoding of basic ASCII characters (i.e., trying to encode a 7-bit ASCII value between 0 and 127 using 2 bytes instead of 1).

I think we don't need to worry about encoding a 7-bit ASCII value using 2 bytes, so raising makes sense. See Overlong encodings for further details:

The standard specifies that the correct encoding of a code point uses only the minimum number of bytes required to hold the significant bits of the code point. Longer encodings are called overlong and are not valid UTF-8 representations of the code point. This rule maintains a one-to-one correspondence between code points and their valid encodings, so that there is a unique valid encoding for each code point; this makes string comparisons and searches well-defined.
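
To make the overlong case concrete (a hypothetical C check; 0xC0 and 0xC1 can only start overlong forms, which is why decoders reject any lead byte below 0xC2):

#include <stdio.h>

int main(void)
{
  /* "/" (U+002F) encoded with 2 bytes instead of 1: C0 AF.
     It decodes to 0x2F arithmetically, but is overlong and invalid. */
  unsigned char overlong[2] = { 0xC0, 0xAF };
  unsigned decoded = ((overlong[0] & 0x1F) << 6) | (overlong[1] & 0x3F);
  printf("decodes to U+%04X, but lead byte 0x%02X < 0xC2 -> reject\n",
         decoded, overlong[0]);
  return 0;
}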

Are we already raising an error for 0x110000.chr? Also, Ruby works for:

0x1FFFF.chr(Encoding::UTF_8)

If you are thinking about passing the encoding to chr, I think this depends on our specification. If we presume every character is UTF-8 encoded, then the current way would be okay. We already encode 0x80-0xFF using 2 bytes. See the difference between the two in Ruby:

>> 0x80.chr.encoding
=> #<Encoding:ASCII-8BIT>
>> 0x80.chr.bytes
=> [128]
>> 0x80.chr(Encoding::UTF_8).encoding
=> #<Encoding:UTF-8>
>> 0x80.chr(Encoding::UTF_8).bytes
=> [194, 128]

@asterite
Member

Thanks for the detailed answer!

0x110000.chr is not raising an exception right now. I was thinking maybe we could raise. But maybe not... I always saw chr as a way to convert a number to a Char, almost like a cast. I didn't know it had a check in Ruby. That check would make it a bit slower, and that worries me because chr is used in some low-level code, or at least in some parsers we have. So for now I think we can leave it without the check. Maybe later we can add a chr! that does the check.

About your last comment: yes, Char always represents a codepoint in the UTF-8 encoding, because when you ask for its bytes it gives them assuming that encoding. We could add overloads for bytes and each_byte that take an encoding, or maybe that would be the job of an Encoding type. But we can think about it later :-)
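
A sketch of what the checked variant could look like, written in C for illustration (the names and the surrogate exclusion are my assumptions, not the committed design):

#include <stdio.h>
#include <stdlib.h>

/* Unchecked: just reinterpret the number, like chr today. */
unsigned chr(unsigned ord) { return ord; }

/* Checked: reject anything that is not a valid Unicode scalar value. */
unsigned chr_checked(unsigned ord)
{
  if (ord > 0x10FFFF || (ord >= 0xD800 && ord <= 0xDFFF)) {
    fprintf(stderr, "invalid codepoint 0x%X\n", ord);
    exit(1);
  }
  return ord;
}

int main(void)
{
  printf("U+%04X\n", chr(0x110000));         /* passes through unchecked */
  printf("U+%04X\n", chr_checked(0x110000)); /* aborts with an error     */
  return 0;
}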
