Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Emoji support #53

Open
arpit opened this issue Apr 27, 2019 · 10 comments
Open

Emoji support #53

arpit opened this issue Apr 27, 2019 · 10 comments
Labels
type-bug Incorrect behavior (everything from a crash to more subtle misbehavior)

Comments

@arpit
Copy link

arpit commented Apr 27, 2019

Current parser crashes when trying to load a yaml document with emojis in it.

    var emoji = "馃槂";
    String s = "YAML:  ${emoji} YAML Ain't Markup Language";
    var doc = loadYaml(s);
    print(doc['YAML']);

Error:

Error on line 1, column 8: Unexpected character.
  鈺
1 鈹 YAML:  馃槂 YAML Ain't Markup Language
  鈹        ^
  鈺

package:yaml/src/parser.dart 50:7    Parser.parse
package:yaml/src/loader.dart 165:37  Loader._loadMapping
package:yaml/src/loader.dart 86:16   Loader._loadNode
package:yaml/src/loader.dart 62:20   Loader._loadDocument
@fawazhussain
Copy link

flutter_emoji is a package which can parse emojis and can be stored as strings for later reference.
https://pub.dev/packages/flutter_emoji

@kevmoo kevmoo added the type-bug Incorrect behavior (everything from a crash to more subtle misbehavior) label Mar 1, 2021
@kevmoo
Copy link
Member

kevmoo commented Mar 1, 2021

I just hit this, too!

@kevmoo
Copy link
Member

kevmoo commented Mar 1, 2021

It seems that the yaml package restricts the INPUT parsing to the same restrictions as the YAML spec defines for OUTPUT, which may be the problem.

See https://yaml.org/spec/1.2/spec.html#id2770814

@littoy
Copy link

littoy commented Apr 14, 2022

Any plan to fix this bug?

@kevmoo
Copy link
Member

kevmoo commented Apr 14, 2022

PR welcome!

myConsciousness added a commit to myConsciousness/yaml that referenced this issue Apr 18, 2022
@AdrKacz
Copy link

AdrKacz commented May 2, 2022

Hello 馃槉 Thank you for this package!

Any updates on that? It would great to load emojis!

**Edit: ** I just figured out that everything works fine if I double quote my emojis (YAML: Hello 馃榿 vs YAML: "Hello 馃榿"), so that's fine 馃憤

@tamcy
Copy link

tamcy commented Mar 24, 2024

Just want to add that this is not just an emoji issue, but a more general one.

The yaml package uses string_scanner for string parsing, and it mostly utilizes the package's peekChar method, which returns the code unit of the string in a specific offset. We all know Dart uses UTF-16 in its internal representation, but the implementation of _isStandardCharacter() in scanner.dart obviously assumes a UTF-32 codepoint be supplied:

yaml/lib/src/scanner.dart

Lines 1594 to 1598 in e598443

bool _isStandardCharacter(int char) =>
(char >= 0x00020 && char <= 0x00007E) ||
(char >= 0x000A0 && char <= 0x00D7FF) ||
(char >= 0x0E000 && char <= 0x00FFFD) ||
(char >= 0x10000 && char <= 0x10FFFF);

The last line checks for a range between 0x10000 and 0x10FFFF. If my observation is correct, this will never be true because char will never be larger than 0xFFFF. So, the actual issue is that the YAML parser doesn't handle surrogate pairs correctly for plain (unquoted) style. This does not only affects emojis. For instance, the following line is fine:

char: 

But the following would yield an "Unexpected character" exception:

char: 馉敟

Both are CJK characters, parsing would fail for the second case because 馉敟 (U+20525) requires two code units. This is what I have encountered. This requires YAML file author to remember to quote certain characters, or to quote all characters to play safe.

@kevmoo
Copy link
Member

kevmoo commented Mar 24, 2024 via email

@tamcy
Copy link

tamcy commented Mar 25, 2024

(Edit: please see my comment below)

string_scanner seems fine, and the latest version (1.2.0) also added codepoint scanning which yaml can take advantage of to fix the problem I described. Perhaps after dart-lang/string_scanner#69 is accepted, which make .peekCodepoint() API compatible with .peekChar().

On the other hand, while there seems no harm to migrate the yaml package to use codepoint as the scanning unit (but it may cause performance issue), it also seems to me that changes like #125 is already good enough or even inevitable. Here're my points:

  1. It is unlikely to produce a malformed utf-16 string inadvertently, and in my opinion it should not be yaml package's responsibility to check against it.
  2. I've run some simple test and found that the yaml package doesn't check against an invalid string if it is quoted. So I think the behavior should be the same for unquoted/plain strings. (I could be wrong here, I don't know the YAML spec good enough.)
  3. It is still possible for the new .peekCodepoint() method to return a surrogate code point if the string is not formed correctly. And because of (2), yaml should still accept it as a part of the string, instead of throwing an invalid character exception. This means that it is still needed to allow the surrogate codepoint range in Scanner._isStandardCharacter(). The difference is that we should not remove the last check (char >= 0x10000 && char <= 0x10FFFF) if codepoint scanning is to be used.

@tamcy
Copy link

tamcy commented Mar 25, 2024

oh, I saw @kevmoo points to https://yaml.org/spec/1.2-old/spec.html#id2770814 in #125. I've glanced through the document. It looks like the pull request wasn't accepted because the spec clearly excludes surrogate block from the allowed character range, so accepting \xD800-\xDFFF somehow violates the spec. But as said, currently the value passed into Scanner._isStandardCharacter() is actually a code unit (due to the fact that the peekChar() and readChar() methods in string_scanner actually return a code unit). In this case (or in order to adhere to the spec and fix the issue) bumping string_scanner to 1.2.0 (actually I mean something like 1.3.0, i.e. after my pull request is accepted) and make use of its new codepoint APIs seems the way to go. And maybe leave quoted string checking to the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type-bug Incorrect behavior (everything from a crash to more subtle misbehavior)
Projects
None yet
Development

No branches or pull requests

6 participants