Unicode escapes in string/character literals #179

Closed
gavinking opened this Issue Feb 18, 2012 · 7 comments

@gavinking
Member

We've decided to support Unicode escapes only inside literals, not elsewhere in source code. I have a couple of questions about the syntax for this.

The traditional syntax is \uXXXX, but there are a couple of problems with that:

  1. you need a second escape, \UXXXXXXXX, to cover the full Unicode character set, and
  2. the character code tends to run into the rest of the text in string literals: "\u00E5ngstr\u00F6ms" is not very readable.
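
For reference, Python is one language that uses this traditional two-escape scheme, so it makes a convenient illustration of both points. This is only a sketch of existing Python behaviour, not of anything in Ceylon:

```python
# Python's traditional escapes: \uXXXX takes exactly 4 hex digits,
# \UXXXXXXXX takes exactly 8 and is needed for code points above U+FFFF.
angstroms = "\u00E5ngstr\u00F6ms"   # escapes run straight into the letters
clef = "\U0001D11E"                 # MUSICAL SYMBOL G CLEF, outside the BMP

print([hex(ord(c)) for c in angstroms[:2]])  # first two code points
print(len(clef), hex(ord(clef)))             # one character, code point 0x1d11e
```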

We could improve this by changing the syntax slightly. The most natural options seem to be \u{XXXX} or \uXXXX;, for example:

"\u{00E5}ngstr\u{00F6}ms"

or:

"\u00E5;ngstr\u00F6;ms"

OTOH, it does make me question whether we even need special syntax at all. Is the above really much better than something like this:

"" uc('00E5') "ngstr" uc('00F6') "ms"

Finally, Python supports named escapes using the syntax \N{LATIN SMALL LETTER A}. I think we should support this.
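
As a short illustration of the Python feature mentioned here: the names are the official Unicode character names, and the standard `unicodedata` module exposes the same lookup at run time. The variable names are just for the example.

```python
import unicodedata

# Named escapes are resolved when the literal is compiled.
a = "\N{LATIN SMALL LETTER A}"
a_ring = "\N{LATIN SMALL LETTER A WITH RING ABOVE}"

# The same name-to-character lookup is available at run time.
o_diaeresis = unicodedata.lookup("LATIN SMALL LETTER O WITH DIAERESIS")

word = a_ring + "ngstr" + o_diaeresis + "ms"
print(hex(ord(a)), hex(ord(a_ring)), hex(ord(o_diaeresis)))
```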

@gavinking gavinking was assigned Feb 18, 2012
@gavinking
Member

What about:

"\u00E5\ngstr\u00F6\ms"

I guess not, since that winds up looking like there is a \n in the literal.

@gavinking
Member

This seems to work pretty nicely:

"\{00E5}ngstr\{00F6}ms"

i.e. just drop the u completely.
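
To make the proposed \{XXXX} form concrete, here is a minimal sketch of how such escapes could be expanded. It is not the Ceylon compiler's implementation: the helper name `expand_braced_escapes` is hypothetical, and accepting 1 to 6 hex digits is an assumption, since the thread does not say how many digits are allowed.

```python
import re

# Hypothetical pattern: a backslash, then 1-6 hex digits inside braces.
# The digit count is an assumption, not something stated in the thread.
_BRACED_ESCAPE = re.compile(r"\\\{([0-9A-Fa-f]{1,6})\}")

def expand_braced_escapes(text: str) -> str:
    """Replace every \\{XXXX} escape in text with the code point it names."""
    def replace(match: re.Match) -> str:
        code_point = int(match.group(1), 16)
        if code_point > 0x10FFFF:
            raise ValueError(f"not a Unicode code point: {match.group(1)}")
        return chr(code_point)
    return _BRACED_ESCAPE.sub(replace, text)

# Raw string, so Python itself leaves the backslashes alone.
source = r"\{00E5}ngstr\{00F6}ms"
decoded = expand_braced_escapes(source)
print(len(source), len(decoded))
```

Because the closing brace ends the escape, a single form covers both short and long code points, so no separate \U variant is needed.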

@gavinking gavinking closed this in 2c77726 Apr 24, 2012
@gavinking
Member

I have gone ahead and implemented this last suggestion, for both String and Character literals.

@quintesse
Member

Ok, I had not seen this issue before (why does GitHub not mail me for new issues on ceylon-spec??) but it seems pretty readable.
Do you have an idea yet how you want to fit the Python named character symbols in there?

@gavinking
Member

@quintesse named character literals seem to be a bit of a problem because java.lang.Character does not offer any mechanism to look up a character by name, so we would have to maintain our own set of code tables somewhere. I guess we could build a Map of name->codepoint at compiler startup by iterating over all Unicode characters. Not sure how long it would take to do that.

@gavinking
Member

> I guess we could build a Map of name->codepoint at compiler startup by iterating over all Unicode characters.

Sorry, that's just nonsense. I forgot that Java doesn't even know the name of a character once you have it. So yeah, we would need to have a table of character names somewhere in the compiler.
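
For illustration only, this is roughly what such a name table amounts to. The sketch uses Python, whose standard `unicodedata` module does ship character names, so it says nothing about what is available on the Java side; `build_name_table` is a hypothetical helper.

```python
import unicodedata

def build_name_table() -> dict[str, int]:
    """Map each Unicode character name known to this Python build to its code point."""
    table = {}
    for code_point in range(0x110000):
        # The second argument is returned for unnamed code points
        # (controls, surrogates, unassigned) instead of raising ValueError.
        name = unicodedata.name(chr(code_point), None)
        if name is not None:
            table[name] = code_point
    return table

table = build_name_table()
print(hex(table["LATIN SMALL LETTER A WITH RING ABOVE"]))
```

The table is a plain one-time dictionary build; the real cost for a compiler would be shipping and maintaining the name data itself.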

@quintesse
Member

Ok, then I don't know if it's worth the trouble, at least for now.
