How many letters do you see: 𝐚́? Go/Python say it's 2. Java/JavaScript say it's 3. Am I the only one seeing 1 letter?! In fact, if I use a slightly different á, suddenly everyone agrees on 1! So what's going on?
Diacritic marks as separate symbols
Suppose you're defining characters. Latin letters are done; now what about modified letters like á (the marks are called diacritics)? You decide to define each one as a separate letter and assign it a unique identifier (Code Point). One downside is that the number of characters doubles: with and without the accent. This is unfortunate, but we'll survive.
Now you want this: a̠. What, another set of letters? 🤔 And what if we want to combine them: á̠? Okay, this is getting ridiculous… At some point you have to stop and rethink the strategy.
Unicode folks ran into the same problem, and after some iterations they decided to separate letters from their modifications (Diacritic Marks): accents, hats, underscores, etc. The way this works is: we put a letter followed by all of its modifications as separate symbols. It's then the job of the Font Engine to recognize that these are modifications, not letters, and to draw them combined with the letter instead of separately.
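Here's a quick Go sketch of that idea (the combining marks are my arbitrary picks, U+0320 and U+0301): the string renders as one visual letter, yet it holds three Code Points.

```go
package main

import "fmt"

func main() {
	// "a" + U+0320 (combining minus sign below) + U+0301 (combining
	// acute accent): three separate code points, drawn as one glyph.
	s := "a\u0320\u0301"
	fmt.Println(s) // one visual letter
	for _, r := range s {
		fmt.Printf("U+%04X ", r) // U+0061 U+0320 U+0301
	}
	fmt.Println()
}
```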
But! When calculating the length of a string, programming platforms don't check whether a symbol is a Diacritic Mark or an actual letter. And that's why the length of 𝐚́ isn't 1.
Precomposed vs Decomposed symbols
Okay, 𝐚́ is clear, but why is á 1 symbol? Because Unicode has combined symbols too! If it's a single unit, it's called a precomposed character. If the diacritic marks go separately, it's a decomposed character. Thus, in some cases there are 2 ways of representing the same letter in Unicode.
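To see the two representations side by side, here's a small Go sketch; it leans on the golang.org/x/text/unicode/norm package, where the NFC form composes and the NFD form decomposes:

```go
package main

import (
	"fmt"

	"golang.org/x/text/unicode/norm"
)

func main() {
	precomposed := "\u00E1" // á as a single code point
	decomposed := "a\u0301" // a + combining acute accent

	// Both render identically, but they are different code point sequences.
	fmt.Println(precomposed == decomposed)                  // false
	fmt.Println(norm.NFC.String(decomposed) == precomposed) // true: NFC composes
	fmt.Println(norm.NFD.String(precomposed) == decomposed) // true: NFD decomposes
}
```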
Precomposed letters are considered legacy, but I'm not sure that's right. If a language has a popular accented letter, you type it frequently on the keyboard. If it were a decomposed character, we'd have to insert 2 symbols every time, and all the software in the world would have to handle this flawlessly. Which isn't realistic. So precomposed characters, while called "legacy", seem to make sense even today.
Representing a String in Java/JS vs Python/Go
But why do languages like Java and Python disagree on the length? Java/JS/C# keep chars in UTF-16 internally, so some symbols take 1 char (2 bytes), and the less popular ones take 2 chars (a surrogate pair). The letter 𝐚 is one of those 2-char values. Throw in an accent mark, and we get a length of 3.
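We don't even need Java to check this: Go's standard unicode/utf16 package produces the same 16-bit units, so here's a sketch reproducing the Java/JS count for 𝐚́:

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

func main() {
	// U+1D41A lives outside the Basic Multilingual Plane, so UTF-16
	// has to encode it as a surrogate pair (two 16-bit units).
	units := utf16.Encode([]rune("\U0001D41A\u0301"))
	fmt.Printf("%X\n", units) // [D835 DC1A 301]
	fmt.Println(len(units))   // 3 — what Java's length() and JS's .length report
}
```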
Python doesn't have a notion of a character: the expression str[0] returns a 1-symbol string, and len(str) counts Code Points. So 𝐚 has a length of 1, plus 1 for the accent. Internally, Python can keep strings in different representations depending on whether they're ASCII or wider Unicode, and it stores both the byte length and the code point length; see PEP 393.
Go keeps UTF-8 bytes, and its len(str) returns the number of BYTES (not chars), which for 𝐚́ is 6. That's useless for anything but English. But it's possible to convert a string into Code Points with []rune(str), which can then be used to get the length. Though it's a more expensive operation. Also, who in Helheim decided to call them Runes instead of the usual Code Points?!
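And here's a minimal sketch of the Go behavior itself, reproducing both the byte count and the Python-style Code Point count for 𝐚́:

```go
package main

import "fmt"

func main() {
	s := "\U0001D41A\u0301" // 𝐚 + combining acute accent

	fmt.Println(len(s)) // 6 — UTF-8 bytes (Go's answer)

	runes := []rune(s)        // allocates a new slice, hence the extra cost
	fmt.Println(len(runes))   // 2 — code points (Python's answer)
	fmt.Printf("%U\n", runes) // [U+1D41A U+0301]
}
```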
Summary
Unicode is a complicated beast, and we certainly didn't discuss all the intricacies. For instance, guess how many symbols are in 🇫🇮?
To get notifications about new posts, Watch this repository.