Unicode escapes in string/character literals #179

Closed
gavinking opened this Issue · 7 comments

Gavin King
Owner

We've decided not to support Unicode escapes in source code in general, but only inside literals. I have a couple of questions about the syntax for this.

The traditional syntax is \uXXXX, but there are a couple of problems with that:

  1. you need to have a different escape \UXXXXXXXX to cover the full Unicode character set, and
  2. the character code tends to run into the rest of the text in string literals - "\u00E5ngstr\u00F6ms" is not very readable.
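Both problems are easy to reproduce in any language that has the traditional escapes. A minimal illustration in Python, used here purely for demonstration since it happens to implement both `\uXXXX` and `\UXXXXXXXX`:

```python
# Traditional escapes as Python implements them; shown only to illustrate
# the two problems listed above.

# Problem 1: \u takes exactly four hex digits, so code points above U+FFFF
# need the separate eight-digit \U form.
bmp_char = "\u00E5"          # U+00E5 LATIN SMALL LETTER A WITH RING ABOVE
astral_char = "\U0001D11E"   # U+1D11E MUSICAL SYMBOL G CLEF

# Problem 2: the hex digits run straight into the following text, so the
# reader has to count four digits to see where the escape ends.
word = "\u00E5ngstr\u00F6ms"

print(word)                   # ångströms
print(hex(ord(astral_char)))  # 0x1d11e
```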

We could improve this by changing the syntax slightly. The most natural options seem to be \u{XXXX} or \uXXXX;, for example:

"\u{00E5}ngstr\u{00F6}ms"

or:

"\u00E5;ngstr\u00F6;ms"

OTOH, it does make me question whether we even need special syntax at all. Is the above really much better than something like this:

"" uc('00E5') "ngstr" uc('00F6') "ms"

Finally, Python supports named escapes using the syntax \N{LATIN SMALL LETTER A}. I think we should support this.
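For reference, this is roughly how the named escapes behave in Python itself. The sketch uses only the standard `unicodedata` module; the variable names are illustrative.

```python
import unicodedata

# Python resolves \N{...} against its bundled Unicode name table when the
# literal is compiled.
a = "\N{LATIN SMALL LETTER A}"
word = (
    "\N{LATIN SMALL LETTER A WITH RING ABOVE}ngstr"
    "\N{LATIN SMALL LETTER O WITH DIAERESIS}ms"
)

# The same table is reachable at run time, in both directions.
looked_up = unicodedata.lookup("LATIN SMALL LETTER A WITH RING ABOVE")
name = unicodedata.name("\u00F6")

print(a)     # a
print(word)  # ångströms
print(name)  # LATIN SMALL LETTER O WITH DIAERESIS
```

The named form is self-documenting but much longer than a hex escape, which is the usual trade-off.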

Gavin King
Owner

What about:

"\u00E5\ngstr\u00F6\ms"

I guess not, since that winds up looking like there is a \n in the literal.

Gavin King
Owner

This seems to work pretty nicely:

"\{00E5}ngstr\{00F6}ms"

i.e. just drop the u completely.
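A rough sketch of how a `\{XXXX}` escape could be decoded, written in Python for illustration only. It is not the Ceylon compiler's implementation: the function name `decode_braced_escapes` is hypothetical, and the limit of one to six hex digits is an assumption that the discussion does not settle.

```python
import re

# Hypothetical pattern: a backslash, then 1 to 6 hex digits in braces.
# The digit-count limit is an assumption, not taken from the discussion.
_BRACED_ESCAPE = re.compile(r"\\\{([0-9A-Fa-f]{1,6})\}")


def decode_braced_escapes(raw: str) -> str:
    """Replace each \\{XXXX} in raw with the character at that code point."""

    def replace(match):
        code_point = int(match.group(1), 16)
        if code_point > 0x10FFFF:
            # Six hex digits can exceed the Unicode range, so check it.
            raise ValueError(f"code point out of range: {match.group(0)}")
        return chr(code_point)

    # Anything that does not match the pattern is left untouched.
    return _BRACED_ESCAPE.sub(replace, raw)


print(decode_braced_escapes(r"\{00E5}ngstr\{00F6}ms"))  # ångströms
```

Because the closing brace delimits the escape, a single form covers both four-digit and longer code points, and the digits cannot run into the text that follows.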

Gavin King
Owner

I have gone ahead and implemented this last suggestion, for both String and Character literals.

Tako Schotanus
Collaborator

Ok, I had not seen this issue before (why does GitHub not mail me for new issues on ceylon-spec??) but it seems pretty readable.
Do you have an idea yet how you want to fit the Python named character symbols in there?

Gavin King
Owner

@quintesse named character literals seem to be a bit of a problem because java.lang.Character does not offer any mechanism to look up a character by name, so we would have to maintain our own set of code tables somewhere. I guess we could build a Map of name->codepoint at compiler startup by iterating over all Unicode characters. Not sure how long it would take to do that.

Gavin King
Owner

> I guess we could build a Map of name->codepoint at compiler startup by iterating over all unicode characters.

Sorry, that's just nonsense. I forgot that Java doesn't even know the name of a character once you have it. So yeah, we would need to have a table of character names somewhere in the compiler.
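To illustrate what building such a name->codepoint table by iteration looks like when the runtime does know character names, here is a sketch in Python, whose `unicodedata` module carries the name data. The function name `build_name_table` is hypothetical, and the timing it prints depends on the machine and Python version.

```python
import sys
import time
import unicodedata


def build_name_table() -> dict:
    """Map the Unicode name of every named code point to that code point."""
    table = {}
    for code_point in range(sys.maxunicode + 1):
        # The second argument is returned for code points without a name
        # (controls, surrogates, unassigned values) instead of raising.
        name = unicodedata.name(chr(code_point), "")
        if name:
            table[name] = code_point
    return table


start = time.perf_counter()
NAME_TABLE = build_name_table()
elapsed = time.perf_counter() - start

print(f"{len(NAME_TABLE)} names collected in {elapsed:.2f}s")
print(hex(NAME_TABLE["LATIN SMALL LETTER A WITH RING ABOVE"]))  # 0xe5
```

On the Java side, later releases added `Character.getName(int)` (Java 7) and `Character.codePointOf(String)` (Java 9), so the same approach, or a direct lookup, is possible on newer JDKs.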

Tako Schotanus
Collaborator

Ok, then I don't know if it's worth the trouble, at least for now.
