Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updated std.string.capitalize to use std.uni.asCapitalized internally #3848

Merged
merged 2 commits into from
Jan 8, 2016

Conversation

JackStouffer
Copy link
Member

  • on average, this tends to be 2 nanoseconds slower, so completely negligible
  • in a separate commit, added missing shortcut to the top menu
  • this change revealed a bug in the old implementation that was treated as a feature. i was treated as the lower case version of İ, when it should be i\u0307

@@ -2267,7 +2239,7 @@ unittest
assert(s2 !is s1);
s1 = to!S("\u0131 \u0130");
s2 = capitalize(s1);
assert(cmp(s2, "\u0049 \u0069") == 0);
assert(cmp(s2, "\u0049 i\u0307") == 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a reference I can read to learn why this is correct (and why the old behavior was not)? This merge looks good but I'm not sure how to review this behavior change.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly it was an assumption, coupled with the fact that it's what asCapitalized outputs. It really doesn't make sense to me to think that İ's lower case version is just a regular lower case i.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Be careful here. There are some places in Unicode where the uppercase of the lowercase of something is not equal to the original thing. I vaguely recall Ali Cehreli mentioning before certain issues involving the upper/lower-casing of the Turkish dotted/undotted i's. Are you absolutely certain this isn't an instance of that?

@JackStouffer
Copy link
Member Author

@quickfur according to http://www.fileformat.info/info/unicode/char/0130/index.htm İ is a Turkish character and the lowercase version is i. This means that std.uni.asCapitalized is outputting the wrong thing, so that needs to be fixed first before this is pulled.

@JackStouffer
Copy link
Member Author

@quickfur @Hackerpilot Based upon the evidence presented in https://issues.dlang.org/show_bug.cgi?id=15440#c1, it could go either way. The "simple lowercase" is \u0069 but the "[proper] lowercase" is \u0069\u0307. So it could be argued that std.uni is doing the right thing by not assuming that the letter is Turkish, or it could be argued that std.string is doing the right thing by assuming the letter is Turkish.

@JackStouffer
Copy link
Member Author

Python3 outputs \u0069\u0307 while Python2 outputs \u0069. That should say everything.

@Hackerpilot
Copy link
Member

Because Python 3 is more Unicode-correct?

@Hackerpilot
Copy link
Member

I read some of the information linked to in issue 15440, and I think that the new behavior of outputting \u0069\u0307 makes sense. Unless @quickfur objects, I'm going to merge this.

@Hackerpilot
Copy link
Member

Auto-merge toggled on

Hackerpilot added a commit that referenced this pull request Jan 8, 2016
Updated std.string.capitalize to use std.uni.asCapitalized internally
@Hackerpilot Hackerpilot merged commit 9df3aed into dlang:master Jan 8, 2016
@quickfur
Copy link
Member

quickfur commented Jan 8, 2016

Perhaps somebody should bring this to Ali's attention so that we can have a more informed opinion on how exactly upper-/lowercasing should work with these letters.

@acehreli
Copy link
Contributor

acehreli commented Jan 9, 2016

This new treatment of iİ is new to me and it makes sense especially hearing that Java 7 went the same way.

Unicode special casing: http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt

And a blog post: http://grepalex.com/2013/02/14/java-7-and-the-dotted--and-dotless-i/

I can't look at this in detail at the moment but we should make sure that dotless-i is handled correctly as well. (I will get back to this in a couple of days.)

@JackStouffer
Copy link
Member Author

@Hackerpilot thanks!

@JackStouffer JackStouffer deleted the string branch January 9, 2016 02:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants