
Range of the Char #317

Closed
yous opened this issue Dec 14, 2014 · 9 comments

@yous
Contributor

yous commented Dec 14, 2014

Looking at this line, a Char can have an ord of at most 0x1FFFFF. But UTF-8 ends at 0x10FFFF per RFC 3629: the 4-byte range originally went up to 0x1FFFFF, but it was restricted in November 2003.

The Invalid byte sequences section indicates:

A 4-byte sequence (starting with 0xF4) that decodes to a value greater than U+10FFFF

Also, the write_utf8 function in the sample code there is the same as our each_byte, except it uses 0x10FFFF, not 0x1FFFFF.

Does Char support UTF-16 for 0x10FFFF < ord <= 0x1FFFFF, or is this a mistake?
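
For reference, here is a minimal sketch of the boundary in question: a C approximation of the write_utf8 sample mentioned above (my own reconstruction, not Crystal's actual each_byte), showing where the 0x10FFFF cap applies:

#include <stdio.h>

/* Write a code point to stdout as UTF-8 bytes.
   Per RFC 3629, anything above 0x10FFFF is invalid. */
void write_utf8(unsigned code_point)
{
  if (code_point < 0x80) {
    putchar(code_point);
  } else if (code_point <= 0x7FF) {
    putchar((code_point >> 6) + 0xC0);
    putchar((code_point & 0x3F) + 0x80);
  } else if (code_point <= 0xFFFF) {
    putchar((code_point >> 12) + 0xE0);
    putchar(((code_point >> 6) & 0x3F) + 0x80);
    putchar((code_point & 0x3F) + 0x80);
  } else if (code_point <= 0x10FFFF) { /* the cap: 0x10FFFF, not 0x1FFFFF */
    putchar((code_point >> 18) + 0xF0);
    putchar(((code_point >> 12) & 0x3F) + 0x80);
    putchar(((code_point >> 6) & 0x3F) + 0x80);
    putchar((code_point & 0x3F) + 0x80);
  } else {
    /* error: not a valid Unicode code point */
  }
}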

@asterite
Member

Hi @yous,

We are kind of busy these days, but we'll give you an answer soon. My first guess is that it's a mistake, but I have to check with @waj and others.

I'm curious, how did you find this little error in the code?

@yous
Contributor Author

yous commented Dec 16, 2014

I was expecting "\u{FF}".bytes to return [255], but it returns [195, 191]. So I started looking into where that comes from.

@waj
Member

waj commented Dec 16, 2014

Good catch!

Strings in Crystal are always represented in memory as UTF-8. That's why the character U+FF is represented by two bytes.
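
Concretely, the arithmetic (a quick C check of the two-byte pattern 110xxxxx 10xxxxxx):

#include <stdio.h>

int main(void)
{
  unsigned cp = 0xFF;
  /* two-byte form: 110xxxxx 10xxxxxx */
  printf("%u %u\n", 0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)); /* prints "195 191" */
  return 0;
}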

@yous
Contributor Author

yous commented Dec 16, 2014

I think this line of CharReader is also affected. But we may need to be a bit more specific and check the first two bytes:

if first < 0xf4 || first == 0xf4 && second < 0x90
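
The boundary this check encodes: the largest valid 4-byte sequence is F4 8F BF BF (U+10FFFF), so a lead byte of 0xF4 is only in range when the second byte is below 0x90. A quick sanity check in C (hypothetical helper, just to illustrate the condition):

#include <stdio.h>

/* For a 4-byte sequence, report whether these two lead bytes
   can still decode to a code point <= U+10FFFF. */
int lead_bytes_in_range(unsigned char first, unsigned char second)
{
  return first < 0xf4 || (first == 0xf4 && second < 0x90);
}

int main(void)
{
  printf("%d\n", lead_bytes_in_range(0xf4, 0x8f)); /* 1: can encode U+10FFFF   */
  printf("%d\n", lead_bytes_in_range(0xf4, 0x90)); /* 0: would exceed U+10FFFF */
  return 0;
}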

@simnalamburt

👍

@asterite
Member

@yous You are right. We should probably adopt the read_code_point_from_utf8 algorithm from Wikipedia, right?

#include <stdio.h>

unsigned read_code_point_from_utf8()
{
  int code_unit1, code_unit2, code_unit3, code_unit4;

  code_unit1 = getchar();
  if (code_unit1 < 0x80) {
    return code_unit1;
  } else if (code_unit1 < 0xC2) {
    /* continuation or overlong 2-byte sequence */
    goto ERROR1;
  } else if (code_unit1 < 0xE0) {
    /* 2-byte sequence */
    code_unit2 = getchar();
    if ((code_unit2 & 0xC0) != 0x80) goto ERROR2;
    return (code_unit1 << 6) + code_unit2 - 0x3080;
  } else if (code_unit1 < 0xF0) {
    /* 3-byte sequence */
    code_unit2 = getchar();
    if ((code_unit2 & 0xC0) != 0x80) goto ERROR2;
    if (code_unit1 == 0xE0 && code_unit2 < 0xA0) goto ERROR2; /* overlong */
    code_unit3 = getchar();
    if ((code_unit3 & 0xC0) != 0x80) goto ERROR3;
    return (code_unit1 << 12) + (code_unit2 << 6) + code_unit3 - 0xE2080;
  } else if (code_unit1 < 0xF5) {
    /* 4-byte sequence */
    code_unit2 = getchar();
    if ((code_unit2 & 0xC0) != 0x80) goto ERROR2;
    if (code_unit1 == 0xF0 && code_unit2 < 0x90) goto ERROR2; /* overlong */
    if (code_unit1 == 0xF4 && code_unit2 >= 0x90) goto ERROR2; /* > U+10FFFF */
    code_unit3 = getchar();
    if ((code_unit3 & 0xC0) != 0x80) goto ERROR3;
    code_unit4 = getchar();
    if ((code_unit4 & 0xC0) != 0x80) goto ERROR4;
    return (code_unit1 << 18) + (code_unit2 << 12) + (code_unit3 << 6) + code_unit4 - 0x3C82080;
  } else {
    /* > U+10FFFF */
    goto ERROR1;
  }

  ERROR4:
    ungetc(code_unit4, stdin);
  ERROR3:
    ungetc(code_unit3, stdin);
  ERROR2:
    ungetc(code_unit2, stdin);
  ERROR1:
    return code_unit1 + 0xDC00;
}

I think it catches more invalid sequences than what we have right now (and I'd like to have a spec for each invalid case).
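
For instance, a hypothetical driver (assuming the function above, compiled in the same file) makes it easy to pipe each invalid case through stdin; bytes from invalid sequences come back as the byte plus 0xDC00, i.e. in the U+DC80-U+DCFF range:

int main(void)
{
  int c;
  while ((c = getchar()) != EOF) {
    ungetc(c, stdin); /* peek only; the decoder consumes the bytes itself */
    printf("U+%04X\n", read_code_point_from_utf8());
  }
  return 0;
}

/* printf '\xC0\xAF' | ./a.out      (overlong "/") -> U+DCC0 U+DCAF            */
/* printf '\xF4\x90\x80\x80' | ./a.out (> U+10FFFF) -> U+DCF4 U+DC90 U+DC80 U+DC80 */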

@asterite
Member

@yous Could you take a look at the last commit to see if it's ok? It's more or less Wikipedia's code with some common code refactored.

I see Wikipedia's code returns an invalid code point (the offending byte plus 0xDC00, landing in the low surrogate range) when it encounters an invalid byte sequence, instead of raising an exception. I wonder if that's what we really want to do.

Another thing that we don't do very well is this:

puts 0x1FFFF.chr # assume utf-8 encoding, works but maybe should raise

But in Ruby:

puts 0x110000.chr(Encoding::UTF_8)

Gives:

foo.cr:1:in `chr': invalid codepoint 0x110000 in UTF-8 (RangeError)

Do you think we should raise as well?

@yous
Contributor Author

yous commented Dec 19, 2014

@asterite I think the commit is okay.

For the first thing, looking at the Codepage layout section:

Red cells must never appear in a valid UTF-8 sequence. The first two (C0 and C1) could only be used for overlong encoding of basic ASCII characters (i.e., trying to encode a 7-bit ASCII value between 0 and 127 using 2 bytes instead of 1).

I think we don't need to worry about encoding a 7-bit ASCII value using 2 bytes, so raising makes sense. See Overlong encodings for further details:

The standard specifies that the correct encoding of a code point uses only the minimum number of bytes required to hold the significant bits of the code point. Longer encodings are called overlong and are not valid UTF-8 representations of the code point. This rule maintains a one-to-one correspondence between code points and their valid encodings, so that there is a unique valid encoding for each code point; this makes string comparisons and searches well-defined.
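
To make the overlong case concrete (a hypothetical C check; 0xC0 and 0xC1 can only start overlong forms, which is why decoders reject any lead byte below 0xC2):

#include <stdio.h>

int main(void)
{
  /* "/" (U+002F) encoded with 2 bytes instead of 1: C0 AF.
     It decodes to 0x2F arithmetically, but is overlong and invalid. */
  unsigned char overlong[2] = { 0xC0, 0xAF };
  unsigned decoded = ((overlong[0] & 0x1F) << 6) | (overlong[1] & 0x3F);
  printf("decodes to U+%04X, but lead byte 0x%02X < 0xC2 -> reject\n",
         decoded, overlong[0]);
  return 0;
}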

Are we already raising an error for 0x110000.chr? Also, Ruby works for:

0x1FFFF.chr(Encoding::UTF_8)

If you are thinking about passing the encoding to chr, I think this depends on our specification. If we presume every character is UTF-8 encoded, then the current way would be okay. We already encode 0x80-0xFF using 2 bytes. See the difference between the two in Ruby:

>> 0x80.chr.encoding
=> #<Encoding:ASCII-8BIT>
>> 0x80.chr.bytes
=> [128]
>> 0x80.chr(Encoding::UTF_8).encoding
=> #<Encoding:UTF-8>
>> 0x80.chr(Encoding::UTF_8).bytes
=> [194, 128]

@asterite
Member

Thanks for the detailed answer!

0x110000.chr is not raising an exception right now. I was thinking maybe we could raise. But maybe not... I always saw chr as a way to convert a number to a Char, almost like a cast. I didn't know it had a check in Ruby. That check would make it a bit slower, and that worries me because chr is used in some low-level code, or at least in some parsers we have. So for now I think we can leave it without the check. Maybe later we can add a chr! that does the check.

About your last comment: yes, Char always represents a codepoint in the UTF-8 encoding, because when you ask for its bytes it gives them assuming that encoding. We could add overloads for bytes and each_byte that take an encoding, or maybe that would be the job of an Encoding type. But we can think about it later :-)
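
A sketch of what the checked variant could look like, written in C for illustration (the names and the surrogate exclusion are my assumptions, not the committed design):

#include <stdio.h>
#include <stdlib.h>

/* Unchecked: just reinterpret the number, like chr today. */
unsigned chr(unsigned ord) { return ord; }

/* Checked: reject anything that is not a valid Unicode scalar value. */
unsigned chr_checked(unsigned ord)
{
  if (ord > 0x10FFFF || (ord >= 0xD800 && ord <= 0xDFFF)) {
    fprintf(stderr, "invalid codepoint 0x%X\n", ord);
    exit(1);
  }
  return ord;
}

int main(void)
{
  printf("U+%04X\n", chr(0x110000));         /* passes through unchecked */
  printf("U+%04X\n", chr_checked(0x110000)); /* aborts with an error     */
  return 0;
}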
