Char.toUpper('ß') returns two characters as Char #1001

malaire · 2018-11-12T12:54:26Z

Char is defined to be a single Unicode character and Char.toUpper is defined as returning a single Char.

But Char.toUpper('ß') returns two Unicode characters as a single Char. While the returned value is correct according to the Unicode specification ('ß' is uppercased to two characters), it is not correct according to Elm's specification of Char and Char.toUpper.

$ elm repl
> Char.toUpper('ß')
'SS' : Char

UPDATE: There are also other characters where case conversion results in a different number of characters; for example, Char.toUpper('\u{FB02}') returns 'FL' : Char. The full list seems to be available at ftp://ftp.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt
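A minimal sketch of how the affected characters could be detected from Elm code, assuming that String.toUpper follows the Unicode special-casing rules (as the String -> String functions are reported to do later in this thread). The module name UpperCheck and the helper expandsWhenUppercased are hypothetical and not part of elm/core.

```elm
module UpperCheck exposing (expandsWhenUppercased)

-- Hypothetical helper, not part of elm/core.
-- Uppercases the character via String.toUpper and checks whether the
-- result contains more than one character.


expandsWhenUppercased : Char -> Bool
expandsWhenUppercased char =
    char
        |> String.fromChar
        |> String.toUpper
        |> String.toList
        |> List.length
        |> (\count -> count > 1)
```

Under the behaviour reported above, expandsWhenUppercased 'ß' would be expected to be True and expandsWhenUppercased 'a' to be False.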

malaire · 2018-11-12T15:39:50Z

Compared to other implementations: in Haskell, Data.Char.toUpper :: Char -> Char returns ß unchanged, while Data.Text.toUpper :: Text -> Text converts ß to SS.

edgerunner · 2020-08-01T21:51:54Z

It seems that German speakers had and solved the same problem.

In 2017, the Council for German Orthography ultimately adopted capital ß (ẞ) into German orthography, ending a long orthographic debate. — Wikipedia referencing Quartz

I think Elm can ~~do away with~~ alleviate this problem by casing Unicode U+00DF ß into U+1E9E ẞ and vice versa.

malaire · 2020-08-02T09:34:53Z

> I think Elm can do away with this problem by casing Unicode U+00DF ß into U+1E9E ẞ and vice versa.

That would be incompatible with the Unicode standard, so if Elm wants to follow the Unicode standard it can't do this.

Also, ß isn't the only Unicode character for which case conversion results in more characters; there are well over 100 such characters. So "fixing" this one case in a way that is not compatible with Unicode would be wrong in my opinion: it would not fix the whole problem, and it would also make Elm incompatible with Unicode.

(I updated OP with a comment that ß isn't the only such character.)

edgerunner · 2020-08-03T20:14:22Z

Then it seems that we need to accept that casing cannot be Char -> Char. There are already String -> String casing functions that work as expected (probably following the JS engine implementation), and we can have others that technically can't go wrong. The first ones that come to mind:

Char -> String: The simplest… Forces you to unnecessarily consider empty strings in the output
Char -> List Char: Technically the same, but somewhat more explicit.
Char -> Maybe Char: Just sweep it under the rug. Get Nothing if the Unicode casing is not a proper Char
Char -> ( Char, String ): Output all possibilities. You can map it as needed, but if you don't check for it, you can still get 'SS' : Char and friends, or get a single 'S' : Char for a 'ß' : Char. (Thanks @dullbananas)

Char -> Cased: A new type to handle specific requirements. It needs to be designed to handle all possibilities properly, and can be built to handle other casing oddities, like my native Turkish dotless ı to I and i to İ.
An incomplete implementation could be:

type Cased
  -- This is the most usual case. A Char for a Char
  = Single Char
  -- This means there are more Chars after the first.
  | Multi Char Cased
  -- Lang is the alias for however we define the language; String, union etc.
  -- The first Cased is for when the language matches, the second is for when it doesn't.
  | LanguageDependent Lang Cased Cased

The ß problem could be expressed as

Char.toUpper 'ß' => Multi 'S' (Single 'S')

Similarly the Turkish i can be output as

Char.toUpper 'i' => LanguageDependent Tr (Single 'İ') (Single 'I')

This is just a preliminary idea that needs further exploration to cover all the casing oddities properly. There are so many languages out there.
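To illustrate how a consumer might use such a type, here is a rough sketch that flattens a Cased value into a String for a given language. It repeats the type definition so that the module is self-contained. The Lang type with its Tr and Other constructors, the module name, and the resolve function are assumptions made for this example only; the proposal above deliberately leaves Lang open.

```elm
module Cased exposing (Cased(..), Lang(..), resolve)

-- Placeholder language type, assumed for this sketch only.


type Lang
    = Tr
    | Other


type Cased
    = Single Char
    | Multi Char Cased
    | LanguageDependent Lang Cased Cased



-- Flatten a Cased value into a String for the given language.


resolve : Lang -> Cased -> String
resolve lang cased =
    case cased of
        Single char ->
            String.fromChar char

        Multi char rest ->
            String.cons char (resolve lang rest)

        LanguageDependent target whenMatching otherwise ->
            if lang == target then
                resolve lang whenMatching

            else
                resolve lang otherwise
```

With these definitions, resolve Other (Multi 'S' (Single 'S')) yields "SS", and resolve Tr (LanguageDependent Tr (Single 'İ') (Single 'I')) yields "İ".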

dullbananas · 2020-08-03T20:24:48Z

If it shouldn't be Char -> Char then it should be Char -> ( Char, String )

dullbananas · 2020-08-03T20:39:32Z

It might be best to keep the existing Char -> Char function and make it output the same character if the result can't be a single character, and create a separate Char -> String function.
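A possible sketch of this suggestion in user-land Elm, assuming that String.toUpper applies the Unicode special-casing rules. The module name CharCase and the functions toUpperString and toUpperSingle are hypothetical names for illustration, not existing elm/core functions.

```elm
module CharCase exposing (toUpperSingle, toUpperString)

-- Char -> String variant: returns the full uppercase mapping,
-- which may contain more than one character.


toUpperString : Char -> String
toUpperString char =
    String.toUpper (String.fromChar char)



-- Char -> Char variant: returns the uppercase character only when the
-- mapping is exactly one character, otherwise the input unchanged.


toUpperSingle : Char -> Char
toUpperSingle char =
    case String.toList (toUpperString char) of
        [ upper ] ->
            upper

        _ ->
            char
```

Under that assumption, toUpperSingle 'ß' would leave 'ß' unchanged (matching the Haskell Data.Char.toUpper behaviour mentioned earlier), while toUpperString 'ß' would give the two-character string.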

evancz added the breaking would require a MAJOR release label Feb 9, 2021

lue-bird mentioned this issue Aug 8, 2022

Char.toUpper on special-cased unicode characters returns 2-character Char gren-lang/core#30

Open

ahankinson mentioned this issue Sep 5, 2023

A large utility library elmcraft/core-extra#1

Closed
