Make Char.isLower and Char.isUpper Unicode-aware #970

Open. Janiczek wants to merge 1 commit into master.

Contributor

Janiczek commented Jul 24, 2018

This lets people with non-ASCII alphabets work with `Char.isLower` and `Char.isUpper`. Both are implemented on top of `toUpper` and `toLower`, which use JavaScript's `String.prototype.toUpperCase()` and `String.prototype.toLowerCase()`.

The second condition in each function distinguishes characters that have an upper/lower-case pairing from those that don't (`'0' == Char.toLower '0'`, but we don't want `isLower '0'` to be true).
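
As a rough illustration of the two conditions described above, here is a minimal JavaScript sketch. The helper names `isLower` and `isUpper` are hypothetical stand-ins written for this example; the actual change in this PR is in Elm and may be structured differently.

```javascript
// Hypothetical helpers, not the PR's code: a character counts as lower case
// when lowercasing leaves it unchanged AND uppercasing changes it.
// The second check rules out characters without a case pairing, such as "0".
function isLower(ch) {
  return ch === ch.toLowerCase() && ch !== ch.toUpperCase();
}

// Mirror image of isLower for upper-case characters.
function isUpper(ch) {
  return ch === ch.toUpperCase() && ch !== ch.toLowerCase();
}
```

With this shape, `isLower("ž")` is true while `isLower("0")` is false, because `"0"` is left unchanged by both conversions.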


Contributor

Janiczek commented Jul 24, 2018

Is related to #385.

drathier commented Jul 26, 2018

What's considered an uppercase character depends on your locale. This PR is still a major improvement.

Related to #942.

Member

evancz commented Jul 27, 2018

For future reference, the documentation for `toLocaleUpperCase` describes cases where this will break:

> The toLocaleUpperCase() method returns the value of the string converted to upper case according to any locale-specific case mappings. toLocaleUpperCase() does not affect the value of the string itself. In most cases, this will produce the same result as toUpperCase(), but for some locales, such as Turkish, whose case mappings do not follow the default case mappings in Unicode, there may be a different result.

So it seems that toUpperCase() is a pure function, but toLocaleUpperCase() is not. My instinct is that the "correct" version of this function takes Language as an argument. (Not sure if the idea of "locale" is better. Maybe it is geographical? Maybe there is some standards body that defines locales?)
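
To make the purity point concrete: `toLocaleUpperCase` accepts an optional locale argument, and passing it explicitly removes the dependence on the host environment. A minimal sketch, in which `upperIn` is a hypothetical helper name and the Turkish result assumes a JavaScript engine built with locale (ICU) support:

```javascript
// Hypothetical helper: uppercase `text` using an explicit BCP 47 language tag,
// so the result does not depend on the machine's default locale.
function upperIn(languageTag, text) {
  return text.toLocaleUpperCase(languageTag);
}

// Locale-independent: always uses the default Unicode case mapping.
const plain = "i".toUpperCase();

// Turkish distinguishes dotted and dotless i, so its mapping differs
// from the default one (assumes the engine has ICU support).
const turkish = upperIn("tr", "i");
const english = upperIn("en-US", "i");
```

Here `plain` and `english` are both `"I"`, while `turkish` is `"İ"` (U+0130). Calling `toLocaleUpperCase()` with no argument would pick whichever of these behaviours matches the host's default locale.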

I do not want us to theorize about these things here. The next step is to find nice links that describe:

  1. How "upper case" is defined by Unicode. Is there a big table somewhere?
  2. How a "locale" is defined and who manages that. Are there "new locales" if human culture changes? Who captures that, and how do browsers know about it?

I would prefer to understand the problem more completely before changing things.

Contributor

Janiczek commented Jul 27, 2018

From my cursory googling and research:

1. How is "upper case" defined?

I think this FAQ is the link you want.

In short, yes, there is a big table. Three, in fact.

  1. ftp://ftp.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
  2. ftp://ftp.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt
  3. ftp://ftp.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt

(Neither GitHub nor Markdown supports FTP links 🤷‍♂️)

Here is the relevant section of the standard.

It offers some degree of stability across Unicode versions.
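
For a sense of what the first table contains: each line of `UnicodeData.txt` is a semicolon-separated record. The sketch below parses one such line. The field positions (general category at index 2, simple uppercase and lowercase mappings at indexes 12 and 13) follow my reading of the file format, and `parseUnicodeDataLine` is a hypothetical helper name.

```javascript
// Hypothetical helper: parse one record of UnicodeData.txt.
// Assumed layout: field 0 = code point (hex), 1 = name, 2 = general category,
// 12 = simple uppercase mapping, 13 = simple lowercase mapping.
function parseUnicodeDataLine(line) {
  const fields = line.split(";");
  // Empty mapping fields mean "no simple mapping for this character".
  const toCodePoint = (hex) => (hex === "" ? null : parseInt(hex, 16));
  return {
    codePoint: parseInt(fields[0], 16),
    name: fields[1],
    category: fields[2], // "Lu" = uppercase letter, "Ll" = lowercase letter
    simpleUpper: toCodePoint(fields[12]),
    simpleLower: toCodePoint(fields[13]),
  };
}

// The record for U+0061 as it appears in the file.
const record = parseUnicodeDataLine(
  "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041"
);
```

Here `record.category` is `"Ll"` and `record.simpleUpper` is `0x41`. This file only holds one-to-one mappings; multi-character and context-dependent ones are what `SpecialCasing.txt` is for.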

2. How is "locale" defined?

Again, a Unicode FAQ; and this time there's a whole homepage.

You can download the current version. It contains a lot of XML files with various data (casing of dates, languages, etc.), to be interpreted according to LDML.

The data is also published as JSON, converted from the XML, which might be a better fit for Elm?

Contributor

Janiczek commented Jul 28, 2018

We might try to be extra-pure and host the big table ourselves, in Elm format, but I imagine that would make elm/core very big. The browser already ships that data behind `.toLowerCase()`.
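
One way to lean on the data the engine already carries, without going through case conversion, is a Unicode property escape in a regular expression. This is only a sketch: it assumes an engine that supports ES2018 `\p{...}` escapes, it checks the general category rather than case pairing, and `isUpperByCategory` and `isLowerByCategory` are hypothetical names.

```javascript
// Hypothetical helpers: classify a single character by its Unicode
// general category, using the engine's built-in character data.
// Lu = uppercase letter, Ll = lowercase letter. The `u` flag is required.
const isUpperByCategory = (ch) => /^\p{Lu}$/u.test(ch);
const isLowerByCategory = (ch) => /^\p{Ll}$/u.test(ch);
```

Like the conversion-based check, this treats `"0"` as neither upper nor lower case, and it does not depend on the host locale.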

The `.toLocaleLowerCase()` family of functions would benefit from the `Language` argument (to become pure), but I wonder if it's important. In what situation would the function start misbehaving? (Off the top of my head: the computer's location changing? System settings changing?) And is it important? Would it affect the user somehow?

I mean, even a `Date`, when passed to `toString`, shows different things on different machines, depending on the timezone.

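import Date
import Html exposing (Html)

-- Elm 0.18: parse a date string, then render the result's debug string
main : Html msg
main =
    "2018-05-10"
        |> Date.fromString
        |> toString
        |> Html.text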

shows `Ok <Thu May 10 2018 02:00:00 GMT+0200 (Central European Summer Time)>` on my machine (Ellie). It will presumably show something different on yours. It's not pure. Is that problematic?
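
The same trade-off can be sketched in JavaScript: formatting an instant depends on the machine's time zone unless the zone is passed explicitly. `hourIn` below is a hypothetical helper, and the sketch assumes the engine ships IANA time zone data.

```javascript
// Hypothetical helper: the hour of day (0-23) of an instant, as seen in an
// explicitly named IANA time zone rather than the machine's own zone.
function hourIn(timeZone, isoDate) {
  const parts = new Intl.DateTimeFormat("en-US", {
    timeZone,
    hour: "numeric",
    hourCycle: "h23",
  }).formatToParts(new Date(isoDate));
  return Number(parts.find((part) => part.type === "hour").value);
}

// A date-only ISO string is parsed as midnight UTC.
const inPrague = hourIn("Europe/Prague", "2018-05-10");
const inTokyo = hourIn("Asia/Tokyo", "2018-05-10");
```

The two results differ for the same instant, but each call returns the same value on every machine because the zone is an argument rather than ambient state.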
