Char.toUpper('ß') returns two characters as Char #1001

Open
malaire opened this issue Nov 12, 2018 · 6 comments
Labels
breaking would require a MAJOR release

Comments

@malaire

malaire commented Nov 12, 2018

Char is defined to be a single Unicode character and Char.toUpper is defined as returning a single Char.

But Char.toUpper('ß') returns two Unicode characters as a single Char. While the returned value is correct according to the Unicode specification ('ß' is uppercased to two characters), it is not correct according to Elm's specification of Char and Char.toUpper.

$ elm repl
> Char.toUpper('ß')
'SS' : Char

UPDATE: There are also other characters where case conversion results in a different number of characters, for example Char.toUpper('\u{FB02}') returns 'FL' : Char. The full list seems to be available at ftp://ftp.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt
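
To check whether a given character is affected, a minimal sketch could go through String instead of Char. It assumes String.toUpper applies the full Unicode case mapping (which the 'SS' result above suggests); the module name CaseExpansion and both function names are made up for illustration.

```elm
module CaseExpansion exposing (expandsOnUpper, upperLength)

-- Number of characters in the uppercased form of a single Char.
-- String.toList is used instead of String.length so that one code
-- point outside the BMP still counts as one character.


upperLength : Char -> Int
upperLength char =
    String.fromChar char
        |> String.toUpper
        |> String.toList
        |> List.length



-- True when uppercasing produces more than one character,
-- i.e. when the result cannot honestly be called a Char.


expandsOnUpper : Char -> Bool
expandsOnUpper char =
    upperLength char > 1
```

With this, expandsOnUpper 'ß' and expandsOnUpper '\u{FB02}' should both be True, while expandsOnUpper 'a' is False.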

@malaire
Author

malaire commented Nov 12, 2018

Compared to other implementations: in Haskell, Data.Char.toUpper :: Char -> Char returns ß unchanged, while Data.Text.toUpper :: Text -> Text converts ß to SS.

@edgerunner

edgerunner commented Aug 1, 2020

It seems that German speakers had and solved the same problem.

In 2017, the Council for German Orthography ultimately adopted capital ß (ẞ) into German orthography, ending a long orthographic debate. — Wikipedia referencing Quartz

I think Elm can do away with this problem by casing Unicode U+00DF ß into U+1E9E ẞ and vice versa.

@malaire
Author

malaire commented Aug 2, 2020

I think Elm can do away with this problem by casing Unicode U+00DF ß into U+1E9E ẞ and vice versa.

That would be incompatible with the Unicode standard, so if Elm wants to follow the Unicode standard it can't do this.

Also ß isn't the only Unicode character for which case-conversion results in more characters - there are well over 100 such characters. So "fixing" this one case in a way that is not compatible with Unicode would be wrong in my opinion, as that won't fix the whole problem and also makes Elm Unicode incompatible.

(I updated OP with a comment that ß isn't the only such character.)

@edgerunner

edgerunner commented Aug 3, 2020

Then it seems that we need to accept that casing cannot be Char -> Char. There are already String -> String casing functions that work as expected (probably following the JS engine implementation), and we can have others that can't technically go wrong. The first ones that come to my mind:

  • Char -> String: The simplest… Forces you to unnecessarily consider empty strings in the output
  • Char -> List Char: Technically the same, but somewhat more explicit.
  • Char -> Maybe Char: Just sweep it under the rug. Get Nothing if the Unicode casing is not a proper Char
  • Char -> ( Char, String ): Output all possibilities. You can map it as needed, but if you don't check for it, you can still get 'SS' : Char and friends, or get a single 'S' : Char for a 'ß' : Char. (Thanks @dullbananas)
  • Char -> Cased: A new type to handle specific requirements. Needs to be designed to handle all possibilities properly, and can be built to handle other casing oddities like my native Turkish dotless ı to I and i to İ
    An incomplete implementation could be:
    type Cased
      -- This is the most usual case. A Char for a Char
      = Single Char
      -- This means there are more Char's after the first.
      | Multi Char Cased
      -- Lang is the alias for however we define the language; String, union etc.
      -- The first Cased is for when the language matches, the second is for when it doesn't.
      | LanguageDependent Lang Cased Cased
    The ß problem could be expressed as
    Char.toUpper 'ß' => Multi 'S' (Single 'S')
    Similarly the Turkish i can be output as
    Char.toUpper 'i' => LanguageDependent Tr (Single 'İ') (Single 'I')
    This is just a preliminary idea that needs further exploration to cover all the casing oddities properly. There are so many languages out there.
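
The language-independent part of the Cased idea above can be sketched roughly as follows. This is only an illustration: it drops the LanguageDependent constructor, assumes String.toUpper performs the full Unicode mapping, and the module name and the helpers toUpperCased, fromNonEmpty and toList are hypothetical.

```elm
module Cased exposing (Cased(..), toList, toUpperCased)

-- Simplified version of the proposed type: a non-empty chain of Chars.


type Cased
    = Single Char
    | Multi Char Cased



-- Uppercase through String and repackage the result as a Cased value.


toUpperCased : Char -> Cased
toUpperCased char =
    case String.toList (String.toUpper (String.fromChar char)) of
        [] ->
            -- Not expected in practice; fall back to the input.
            Single char

        first :: rest ->
            fromNonEmpty first rest



-- Build a Cased value from a head Char and the remaining Chars.


fromNonEmpty : Char -> List Char -> Cased
fromNonEmpty first rest =
    case rest of
        [] ->
            Single first

        next :: more ->
            Multi first (fromNonEmpty next more)



-- Flatten a Cased value back into its Chars, in order.


toList : Cased -> List Char
toList cased =
    case cased of
        Single c ->
            [ c ]

        Multi c more ->
            c :: toList more
```

Under those assumptions, toUpperCased 'ß' would give Multi 'S' (Single 'S'), and toList turns that back into [ 'S', 'S' ]. Because the type is non-empty by construction, callers never have to handle an empty result, which is the main advantage over Char -> String.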

@dullbananas

If it shouldn't be Char -> Char then it should be Char -> ( Char, String ).

@dullbananas

It might be best to keep the existing Char -> Char function and make it output the same character if the result can't be a single character, and create a separate Char -> String function.
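
A rough sketch of that pair of functions, written in user code rather than as a change to elm/core. It assumes String.toUpper follows the full Unicode mapping; the module name SafeCase and the names toUpperChar and toUpperString are invented for the example.

```elm
module SafeCase exposing (toUpperChar, toUpperString)

-- Char -> String: always returns the full uppercase mapping,
-- even when it is longer than one character.


toUpperString : Char -> String
toUpperString char =
    String.toUpper (String.fromChar char)



-- Char -> Char: returns the uppercase form only when it is exactly
-- one character, and otherwise leaves the input unchanged.


toUpperChar : Char -> Char
toUpperChar char =
    case String.toList (toUpperString char) of
        [ single ] ->
            single

        _ ->
            char
```

Here toUpperChar 'a' would be 'A' and toUpperChar 'ß' would stay 'ß', which matches the behaviour of Haskell's Data.Char.toUpper mentioned earlier in the thread, while toUpperString 'ß' gives "SS".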


4 participants