How many letters do you see: 𝐚́? Go/Python say it's 2. Java/JavaScript say it's 3. Am I the only one seeing 1 letter?! In fact, if I use a slightly different á, suddenly everyone agrees on 1! So what's going on?
Diacritic marks as separate symbols
Suppose you're defining characters. Latin letters are done; now what about modified letters like á (the marks are called diacritics)? You decide to define each one as a separate letter and assign it a unique identifier (Code Point). One downside is that the number of characters doubles: with and without the accent. This is unfortunate, but we'll survive.
Now you want this: a̠. What, another set of letters? 🤔 And what if we want to combine them: á̠? Okay, this is getting ridiculous… At some point you have to stop and rethink the strategy.
Unicode folks ran into the same problem, and after some iterations they decided to separate letters from their modifications (Diacritic Marks): accents, hats, underscores, etc. The way this works is: we put a letter followed by all of its modifications as separate symbols. It's then the job of the Font Engine to recognize that these are modifications, not letters, and to draw them combined with the letter instead of separately.
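Here's a quick Go sketch of that idea (the combining marks are my arbitrary picks, U+0320 and U+0301): the string renders as one visual letter, yet it holds three Code Points.

```go
package main

import "fmt"

func main() {
	// "a" + U+0320 (combining minus sign below) + U+0301 (combining
	// acute accent): three separate code points, drawn as one glyph.
	s := "a\u0320\u0301"
	fmt.Println(s) // one visual letter
	for _, r := range s {
		fmt.Printf("U+%04X ", r) // U+0061 U+0320 U+0301
	}
	fmt.Println()
}
```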
But! When calculating the length of a string, programming platforms don't check whether a symbol is a Diacritic Mark or an actual letter. And that's why the length of 𝐚́ isn't 1.
Precomposed vs Decomposed symbols
Okay, 𝐚́ is clear, but why is á 1 symbol? Because Unicode has combined symbols too! If it's a single unit, it's called a precomposed character. If the diacritic marks go separately, it's a decomposed character. Thus, in some cases there are 2 ways of representing the same letter in Unicode.
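To see the two representations side by side, here's a small Go sketch; it leans on the golang.org/x/text/unicode/norm package, where the NFC form composes and the NFD form decomposes:

```go
package main

import (
	"fmt"

	"golang.org/x/text/unicode/norm"
)

func main() {
	precomposed := "\u00E1" // á as a single code point
	decomposed := "a\u0301" // a + combining acute accent

	// Both render identically, but they are different code point sequences.
	fmt.Println(precomposed == decomposed)                  // false
	fmt.Println(norm.NFC.String(decomposed) == precomposed) // true: NFC composes
	fmt.Println(norm.NFD.String(precomposed) == decomposed) // true: NFD decomposes
}
```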
Precomposed letters are considered legacy, but I'm not sure that's right. If a language has a popular accented letter, you type it frequently on the keyboard. If it were a decomposed character, we'd have to insert 2 symbols every time, and all the software in the world would have to handle this flawlessly. Which isn't realistic. So precomposed characters, while called "legacy", seem to make sense even today.
Representing a String in Java/JS vs Python/Go
But why do languages like Java and Python disagree on the length? Java/JS/C# keep chars in UTF-16 internally, so some symbols take 1 char (2 bytes), and the less popular ones take 2 chars (a surrogate pair). The letter 𝐚 is one of those 2-char values. Throw in an accent mark, and we get a length of 3.
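We don't even need Java to check this: Go's standard unicode/utf16 package produces the same 16-bit units, so here's a sketch reproducing the Java/JS count for 𝐚́:

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

func main() {
	// U+1D41A lives outside the Basic Multilingual Plane, so UTF-16
	// has to encode it as a surrogate pair (two 16-bit units).
	units := utf16.Encode([]rune("\U0001D41A\u0301"))
	fmt.Printf("%X\n", units) // [D835 DC1A 301]
	fmt.Println(len(units))   // 3 — what Java's length() and JS's .length report
}
```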
Python doesn't have a notion of a character: the expression str[0] returns a 1-symbol string, and len(str) counts Code Points. So 𝐚 has a length of 1, plus 1 for the accent. Internally, Python can keep strings in different representations depending on whether they're ASCII or wider Unicode, and it stores both the byte length and the code point length; see PEP 393.
Go keeps UTF-8 bytes, and its len(str) returns the number of BYTES (not chars), which for 𝐚́ is 6. That's useless for anything but English. But it's possible to convert a string into Code Points with []rune(str), which can then be used to get the length. Though it's a more expensive operation. Also, who in Helheim decided to call them Runes instead of the usual Code Points?!
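And here's a minimal sketch of the Go behavior itself, reproducing both the byte count and the Python-style Code Point count for 𝐚́:

```go
package main

import "fmt"

func main() {
	s := "\U0001D41A\u0301" // 𝐚 + combining acute accent

	fmt.Println(len(s)) // 6 — UTF-8 bytes (Go's answer)

	runes := []rune(s)        // allocates a new slice, hence the extra cost
	fmt.Println(len(runes))   // 2 — code points (Python's answer)
	fmt.Printf("%U\n", runes) // [U+1D41A U+0301]
}
```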
Summary
Unicode is a complicated beast, and we certainly didn't discuss all the intricacies. For instance, guess how many symbols are in 🇫🇮?
To get notifications about new posts, Watch this repository.