-
-
Notifications
You must be signed in to change notification settings - Fork 704
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Updated std.string.capitalize to use std.uni.asCapitalized internally #3848
Conversation
| @@ -2267,7 +2239,7 @@ unittest | |||
| assert(s2 !is s1); | |||
| s1 = to!S("\u0131 \u0130"); | |||
| s2 = capitalize(s1); | |||
| assert(cmp(s2, "\u0049 \u0069") == 0); | |||
| assert(cmp(s2, "\u0049 i\u0307") == 0); | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have a reference I can read to learn why this is correct (and why the old behavior was not)? This merge looks good but I'm not sure how to review this behavior change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly it was an assumption, coupled with the fact that it's what asCapitalized outputs. It really doesn't make sense to me to think that İ's lower case version is just a regular lower case i.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Be careful here. There are some places in Unicode where the uppercase of the lowercase of something is not equal to the original thing. I vaguely recall Ali Cehreli mentioning before certain issues involving the upper/lower-casing of the Turkish dotted/undotted i's. Are you absolutely certain this isn't an instance of that?
|
@quickfur according to http://www.fileformat.info/info/unicode/char/0130/index.htm İ is a Turkish character and the lowercase version is i. This means that |
|
@quickfur @Hackerpilot Based upon the evidence presented in https://issues.dlang.org/show_bug.cgi?id=15440#c1, it could go either way. The "simple lowercase" is |
|
Python3 outputs |
|
Because Python 3 is more Unicode-correct? |
|
I read some of the information linked to in issue 15440, and I think that the new behavior of outputting |
|
Auto-merge toggled on |
Updated std.string.capitalize to use std.uni.asCapitalized internally
|
Perhaps somebody should bring this to Ali's attention so that we can have a more informed opinion on how exactly upper-/lowercasing should work with these letters. |
|
This new treatment of iİ is new to me and it makes sense especially hearing that Java 7 went the same way. Unicode special casing: http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt And a blog post: http://grepalex.com/2013/02/14/java-7-and-the-dotted--and-dotless-i/ I can't look at this in detail at the moment but we should make sure that dotless-i is handled correctly as well. (I will get back to this in a couple of days.) |
|
@Hackerpilot thanks! |
iwas treated as the lower case version ofİ, when it should bei\u0307