Ordinal Ignore Case Optimization #40910

Merged
merged 2 commits into from Aug 17, 2020

@tarekgh tarekgh commented Aug 16, 2020

This change optimizes ordinal ignore-case operations for all scenarios (e.g. String/Span Compare, StartsWith, EndsWith, IndexOf, LastIndexOf, etc.) when using ICU. NLS is treated as the baseline for comparison with ICU.

Below are some perf numbers from before and after the change. They were collected on a Windows machine, since that is where the main regression appears when switching from NLS to ICU. The change targets ordinal operations only; no optimization has been done yet for linguistic operations.
Note that in some ASCII scenarios the numbers are very close before and after the optimization. That is because we already have code that handles ASCII cases without calling the underlying NLS/ICU. You'll still notice some minor improvements there too.
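As a rough illustration of the kind of ASCII shortcut mentioned above, here is a minimal sketch. It is not the runtime's implementation: `AsciiFastPathSketch`, `EqualsIgnoreCase`, and `ToUpperAscii` are hypothetical names, and the fallback simply calls the public `MemoryExtensions.Equals` API in place of whatever the runtime does internally.

```csharp
using System;

public static class AsciiFastPathSketch
{
    // Hypothetical helper: uppercases a-z only, leaves every other char untouched.
    private static char ToUpperAscii(char c) =>
        c >= 'a' && c <= 'z' ? (char)(c - 0x20) : c;

    public static bool EqualsIgnoreCase(ReadOnlySpan<char> a, ReadOnlySpan<char> b)
    {
        // Ordinal ignore-case equality compares char by char, so lengths must match.
        if (a.Length != b.Length)
        {
            return false;
        }

        for (int i = 0; i < a.Length; i++)
        {
            char ca = a[i];
            char cb = b[i];

            if ((ca | cb) > 0x7F)
            {
                // First non-ASCII char seen: hand the remainder to the general
                // routine (a stand-in here for the NLS/ICU-backed path).
                return MemoryExtensions.Equals(
                    a.Slice(i), b.Slice(i), StringComparison.OrdinalIgnoreCase);
            }

            if (ToUpperAscii(ca) != ToUpperAscii(cb))
            {
                return false;
            }
        }

        return true;
    }
}
```

Inputs that are entirely ASCII never reach the fallback call, which would be consistent with the ASCII rows in the tables moving very little.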

This change also includes the initial refactoring of the ordinal operations. It introduces Ordinal classes that contain the ordinal operations, but I didn't do a full refactoring, to avoid bigger code churn. Note that some scattered ordinal code has moved to the new Ordinal classes.

NLS (Baseline) 5.0.100-preview.8.20362.3

Method Mean Error StdDev Median
IndexOf_OrdinalIgnoreCase_ShortAscii 82.681 ns 1.6594 ns 2.1577 ns 82.887 ns
IndexOf_OrdinalIgnoreCase_LongAscii 833.006 ns 16.1442 ns 15.8557 ns 835.015 ns
IndexOf_OrdinalIgnoreCase_ShortNonAscii 39.629 ns 0.8296 ns 1.9060 ns 39.328 ns
IndexOf_OrdinalIgnoreCase_LongNonAscii 170.314 ns 3.4508 ns 7.2789 ns 168.718 ns
LastIndexOf_OrdinalIgnoreCase_ShortAscii 48.099 ns 1.0119 ns 2.8870 ns 47.298 ns
LastIndexOf_OrdinalIgnoreCase_LongAscii 47.085 ns 0.9703 ns 1.5390 ns 46.931 ns
LastIndexOf_OrdinalIgnoreCase_ShortNonAscii 39.789 ns 0.8308 ns 2.1887 ns 39.719 ns
LastIndexOf_OrdinalIgnoreCase_LongNonAscii 171.161 ns 3.4428 ns 6.1195 ns 171.122 ns
Compare_OrdinalIgnoreCase_ShortAscii 11.133 ns 0.2412 ns 0.4704 ns 11.056 ns
Compare_OrdinalIgnoreCase_LongAscii 213.874 ns 4.2155 ns 6.9262 ns 212.655 ns
StartsWith_OrdinalIgnoreCase_ShortAscii 7.497 ns 0.1839 ns 0.4112 ns 7.465 ns
StartsWith_OrdinalIgnoreCase_LongAscii 63.026 ns 1.2622 ns 3.2127 ns 62.343 ns
EndsWith_OrdinalIgnoreCase_ShortAscii 12.651 ns 0.2852 ns 0.6496 ns 12.476 ns
EndsWith_OrdinalIgnoreCase_LongAscii 8.543 ns 0.2021 ns 0.3846 ns 8.561 ns
Compare_OrdinalIgnoreCase_ShortNonAscii 12.190 ns 0.2766 ns 0.6889 ns 12.060 ns
Compare_OrdinalIgnoreCase_LongNonAscii 11.879 ns 0.2707 ns 0.4214 ns 11.896 ns
StartsWith_OrdinalIgnoreCase_ShortNonAscii 32.850 ns 0.6560 ns 1.5843 ns 32.798 ns
StartsWith_OrdinalIgnoreCase_LongNonAscii 32.282 ns 0.6862 ns 1.5063 ns 32.089 ns
EndsWith_OrdinalIgnoreCase_ShortNonAscii 12.612 ns 0.2854 ns 0.4923 ns 12.556 ns
EndsWith_OrdinalIgnoreCase_LongNonAscii 12.788 ns 0.2788 ns 0.6516 ns 12.677 ns

ICU (Baseline) 5.0.100-preview.8.20362.3

Method Mean Error StdDev Median
IndexOf_OrdinalIgnoreCase_ShortAscii 281.915 ns 5.5935 ns 9.0324 ns 282.806 ns
IndexOf_OrdinalIgnoreCase_LongAscii 4,654.658 ns 91.2758 ns 157.4462 ns 4,656.749 ns
IndexOf_OrdinalIgnoreCase_ShortNonAscii 81.924 ns 1.5851 ns 3.3779 ns 82.129 ns
IndexOf_OrdinalIgnoreCase_LongNonAscii 553.243 ns 10.9822 ns 23.4040 ns 549.160 ns
LastIndexOf_OrdinalIgnoreCase_ShortAscii 102.842 ns 2.0976 ns 3.7823 ns 102.124 ns
LastIndexOf_OrdinalIgnoreCase_LongAscii 103.048 ns 2.0764 ns 4.1468 ns 102.458 ns
LastIndexOf_OrdinalIgnoreCase_ShortNonAscii 81.418 ns 1.5798 ns 2.3157 ns 81.939 ns
LastIndexOf_OrdinalIgnoreCase_LongNonAscii 536.305 ns 10.4910 ns 11.2252 ns 537.333 ns
Compare_OrdinalIgnoreCase_ShortAscii 12.388 ns 0.4576 ns 1.3131 ns 11.950 ns
Compare_OrdinalIgnoreCase_LongAscii 224.877 ns 4.4548 ns 9.2000 ns 224.065 ns
StartsWith_OrdinalIgnoreCase_ShortAscii 6.940 ns 0.1724 ns 0.3785 ns 6.923 ns
StartsWith_OrdinalIgnoreCase_LongAscii 61.033 ns 1.2384 ns 2.5015 ns 60.804 ns
EndsWith_OrdinalIgnoreCase_ShortAscii 13.519 ns 0.3038 ns 0.6205 ns 13.434 ns
EndsWith_OrdinalIgnoreCase_LongAscii 9.008 ns 0.2101 ns 0.5074 ns 8.961 ns
Compare_OrdinalIgnoreCase_ShortNonAscii 12.963 ns 0.2912 ns 0.6392 ns 12.884 ns
Compare_OrdinalIgnoreCase_LongNonAscii 12.220 ns 0.2748 ns 0.6637 ns 12.181 ns
StartsWith_OrdinalIgnoreCase_ShortNonAscii 39.288 ns 0.8146 ns 2.0586 ns 39.227 ns
StartsWith_OrdinalIgnoreCase_LongNonAscii 39.845 ns 0.8206 ns 1.7489 ns 40.118 ns
EndsWith_OrdinalIgnoreCase_ShortNonAscii 13.222 ns 0.2878 ns 0.6376 ns 13.305 ns
EndsWith_OrdinalIgnoreCase_LongNonAscii 13.152 ns 0.2982 ns 0.6158 ns 12.996 ns

(Baseline) 3.1

Method Mean Error StdDev Median
IndexOf_OrdinalIgnoreCase_ShortAscii 77.048 ns 1.5508 ns 2.5910 ns 76.470 ns
IndexOf_OrdinalIgnoreCase_LongAscii 826.192 ns 16.0535 ns 15.0164 ns 824.904 ns
IndexOf_OrdinalIgnoreCase_ShortNonAscii 34.301 ns 0.7206 ns 1.7265 ns 34.156 ns
IndexOf_OrdinalIgnoreCase_LongNonAscii 175.133 ns 3.4291 ns 9.4447 ns 174.532 ns
LastIndexOf_OrdinalIgnoreCase_ShortAscii 41.660 ns 0.8008 ns 2.0384 ns 41.199 ns
LastIndexOf_OrdinalIgnoreCase_LongAscii 40.870 ns 0.8510 ns 1.7191 ns 40.812 ns
LastIndexOf_OrdinalIgnoreCase_ShortNonAscii 36.558 ns 1.0939 ns 3.1563 ns 35.709 ns
LastIndexOf_OrdinalIgnoreCase_LongNonAscii 164.163 ns 3.2957 ns 4.1680 ns 164.756 ns
Compare_OrdinalIgnoreCase_ShortAscii 13.685 ns 0.3012 ns 0.4864 ns 13.692 ns
Compare_OrdinalIgnoreCase_LongAscii 229.758 ns 4.5842 ns 6.7195 ns 228.661 ns
StartsWith_OrdinalIgnoreCase_ShortAscii 7.783 ns 0.1894 ns 0.5088 ns 7.689 ns
StartsWith_OrdinalIgnoreCase_LongAscii 64.775 ns 1.3240 ns 2.9613 ns 64.815 ns
EndsWith_OrdinalIgnoreCase_ShortAscii 14.431 ns 0.3251 ns 0.7271 ns 14.368 ns
EndsWith_OrdinalIgnoreCase_LongAscii 9.639 ns 0.2264 ns 0.3247 ns 9.618 ns
Compare_OrdinalIgnoreCase_ShortNonAscii 13.184 ns 0.2971 ns 0.6266 ns 13.174 ns
Compare_OrdinalIgnoreCase_LongNonAscii 13.214 ns 0.2667 ns 0.5265 ns 13.197 ns
StartsWith_OrdinalIgnoreCase_ShortNonAscii 25.277 ns 0.5374 ns 1.0607 ns 25.334 ns
StartsWith_OrdinalIgnoreCase_LongNonAscii 24.705 ns 0.5292 ns 1.2780 ns 24.511 ns
EndsWith_OrdinalIgnoreCase_ShortNonAscii 13.586 ns 0.3067 ns 0.7465 ns 13.592 ns
EndsWith_OrdinalIgnoreCase_LongNonAscii 14.096 ns 0.3219 ns 0.9441 ns 14.062 ns

ICU (After optimization)

Method Mean Error StdDev
IndexOf_OrdinalIgnoreCase_ShortAscii 56.725 ns 1.1710 ns 3.3597 ns
IndexOf_OrdinalIgnoreCase_LongAscii 696.613 ns 13.6316 ns 19.5501 ns
IndexOf_OrdinalIgnoreCase_ShortNonAscii 25.295 ns 0.5328 ns 1.0392 ns
IndexOf_OrdinalIgnoreCase_LongNonAscii 130.638 ns 2.5330 ns 5.1742 ns
LastIndexOf_OrdinalIgnoreCase_ShortAscii 28.075 ns 0.5972 ns 1.6147 ns
LastIndexOf_OrdinalIgnoreCase_LongAscii 28.237 ns 0.5948 ns 1.2675 ns
LastIndexOf_OrdinalIgnoreCase_ShortNonAscii 35.266 ns 0.7289 ns 1.1349 ns
LastIndexOf_OrdinalIgnoreCase_LongNonAscii 132.163 ns 2.6766 ns 6.6657 ns
Compare_OrdinalIgnoreCase_ShortAscii 11.097 ns 0.2520 ns 0.4413 ns
Compare_OrdinalIgnoreCase_LongAscii 225.093 ns 3.6706 ns 3.2539 ns
StartsWith_OrdinalIgnoreCase_ShortAscii 7.167 ns 0.1742 ns 0.3096 ns
StartsWith_OrdinalIgnoreCase_LongAscii 62.781 ns 1.2765 ns 2.5494 ns
EndsWith_OrdinalIgnoreCase_ShortAscii 11.466 ns 0.2631 ns 0.5374 ns
EndsWith_OrdinalIgnoreCase_LongAscii 7.460 ns 0.1795 ns 0.2999 ns
Compare_OrdinalIgnoreCase_ShortNonAscii 11.297 ns 0.2583 ns 0.4387 ns
Compare_OrdinalIgnoreCase_LongNonAscii 11.295 ns 0.2573 ns 0.4369 ns
StartsWith_OrdinalIgnoreCase_ShortNonAscii 12.219 ns 0.2735 ns 0.4790 ns
StartsWith_OrdinalIgnoreCase_LongNonAscii 11.957 ns 0.2694 ns 0.3688 ns
EndsWith_OrdinalIgnoreCase_ShortNonAscii 11.770 ns 0.2638 ns 0.4260 ns
EndsWith_OrdinalIgnoreCase_LongNonAscii 11.856 ns 0.2644 ns 0.4116 ns

NLS (After optimization)

Method Mean Error StdDev Median
IndexOf_OrdinalIgnoreCase_ShortAscii 76.700 ns 1.5640 ns 2.6979 ns 75.863 ns
IndexOf_OrdinalIgnoreCase_LongAscii 822.702 ns 16.0500 ns 15.7633 ns 819.457 ns
IndexOf_OrdinalIgnoreCase_ShortNonAscii 33.952 ns 0.7083 ns 1.6557 ns 33.791 ns
IndexOf_OrdinalIgnoreCase_LongNonAscii 163.543 ns 3.3181 ns 5.1659 ns 163.107 ns
LastIndexOf_OrdinalIgnoreCase_ShortAscii 40.847 ns 0.8527 ns 1.7225 ns 40.646 ns
LastIndexOf_OrdinalIgnoreCase_LongAscii 40.291 ns 0.8005 ns 0.7862 ns 40.375 ns
LastIndexOf_OrdinalIgnoreCase_ShortNonAscii 33.811 ns 0.7107 ns 1.8722 ns 33.311 ns
LastIndexOf_OrdinalIgnoreCase_LongNonAscii 162.103 ns 3.2151 ns 5.3718 ns 162.527 ns
Compare_OrdinalIgnoreCase_ShortAscii 10.816 ns 0.2495 ns 0.5783 ns 10.613 ns
Compare_OrdinalIgnoreCase_LongAscii 223.861 ns 4.4112 ns 5.8888 ns 223.382 ns
StartsWith_OrdinalIgnoreCase_ShortAscii 7.169 ns 0.1800 ns 0.4064 ns 7.045 ns
StartsWith_OrdinalIgnoreCase_LongAscii 63.064 ns 1.3015 ns 3.2889 ns 62.302 ns
EndsWith_OrdinalIgnoreCase_ShortAscii 11.395 ns 0.2573 ns 0.5079 ns 11.389 ns
EndsWith_OrdinalIgnoreCase_LongAscii 7.660 ns 0.1859 ns 0.4698 ns 7.545 ns
Compare_OrdinalIgnoreCase_ShortNonAscii 11.388 ns 0.2591 ns 0.4470 ns 11.411 ns
Compare_OrdinalIgnoreCase_LongNonAscii 11.352 ns 0.2586 ns 0.5398 ns 11.254 ns
StartsWith_OrdinalIgnoreCase_ShortNonAscii 11.650 ns 0.2673 ns 0.3917 ns 11.549 ns
StartsWith_OrdinalIgnoreCase_LongNonAscii 11.688 ns 0.2675 ns 0.5816 ns 11.670 ns
EndsWith_OrdinalIgnoreCase_ShortNonAscii 11.763 ns 0.2623 ns 0.4084 ns 11.814 ns
EndsWith_OrdinalIgnoreCase_LongNonAscii 11.716 ns 0.2637 ns 0.2338 ns 11.678 ns

ghost commented Aug 16, 2020

Tagging subscribers to this area: @tarekgh, @safern, @krwq
See info in area-owners.md if you want to be subscribed.

tarekgh commented Aug 16, 2020

@safern @GrabYourPitchforks could you please help review the change here? I hope to merge it before the deadline tomorrow, so if you have any comments, please say whether they block this change or can be addressed later in another PR.

@GrabYourPitchforks I have changed the coding style in some methods that used goto; I hope this is OK with you.

safern commented Aug 17, 2020

@tarekgh FYI, the runtime test failure is: #40885

tarekgh commented Aug 17, 2020

For completeness, here are the perf numbers on my WSL Ubuntu 18.04:

Linux Baseline (3.1)

Method Mean Error StdDev Median
IndexOf_OrdinalIgnoreCase_ShortAscii 153.989 ns 3.1440 ns 9.2701 ns 154.114 ns
IndexOf_OrdinalIgnoreCase_LongAscii 2,575.618 ns 64.7196 ns 185.6927 ns 2,525.609 ns
IndexOf_OrdinalIgnoreCase_ShortNonAscii 54.585 ns 1.1196 ns 2.2870 ns 54.411 ns
IndexOf_OrdinalIgnoreCase_LongNonAscii 321.643 ns 6.4673 ns 13.3560 ns 321.997 ns
LastIndexOf_OrdinalIgnoreCase_ShortAscii 69.103 ns 1.6689 ns 4.8682 ns 69.025 ns
LastIndexOf_OrdinalIgnoreCase_LongAscii 67.707 ns 1.7037 ns 4.9155 ns 67.039 ns
LastIndexOf_OrdinalIgnoreCase_ShortNonAscii 56.050 ns 1.4809 ns 4.2965 ns 56.146 ns
LastIndexOf_OrdinalIgnoreCase_LongNonAscii 335.489 ns 8.5171 ns 24.0227 ns 331.953 ns
Compare_OrdinalIgnoreCase_ShortAscii 32.471 ns 0.7306 ns 2.1542 ns 32.559 ns
Compare_OrdinalIgnoreCase_LongAscii 270.407 ns 6.5089 ns 19.0896 ns 269.830 ns
StartsWith_OrdinalIgnoreCase_ShortAscii 8.564 ns 0.2140 ns 0.4874 ns 8.525 ns
StartsWith_OrdinalIgnoreCase_LongAscii 73.859 ns 1.6127 ns 4.6788 ns 72.517 ns
EndsWith_OrdinalIgnoreCase_ShortAscii 34.015 ns 1.0407 ns 3.0192 ns 33.279 ns
EndsWith_OrdinalIgnoreCase_LongAscii 28.954 ns 0.6095 ns 1.2450 ns 28.817 ns
Compare_OrdinalIgnoreCase_ShortNonAscii 34.650 ns 0.8253 ns 2.3547 ns 34.223 ns
Compare_OrdinalIgnoreCase_LongNonAscii 33.184 ns 0.7023 ns 1.7875 ns 33.034 ns
StartsWith_OrdinalIgnoreCase_ShortNonAscii 42.865 ns 1.0150 ns 2.8793 ns 42.635 ns
StartsWith_OrdinalIgnoreCase_LongNonAscii 42.814 ns 0.9564 ns 2.7286 ns 42.802 ns
EndsWith_OrdinalIgnoreCase_ShortNonAscii 34.069 ns 0.7571 ns 2.1965 ns 34.021 ns
EndsWith_OrdinalIgnoreCase_LongNonAscii 33.374 ns 0.6929 ns 1.5210 ns 33.148 ns

Linux Baseline (5.0.0-preview.8.20361.2)

Method Mean Error StdDev
IndexOf_OrdinalIgnoreCase_ShortAscii 171.314 ns 3.4618 ns 9.2401 ns
IndexOf_OrdinalIgnoreCase_LongAscii 2,530.398 ns 50.1847 ns 117.3051 ns
IndexOf_OrdinalIgnoreCase_ShortNonAscii 65.324 ns 1.7259 ns 4.9796 ns
IndexOf_OrdinalIgnoreCase_LongNonAscii 321.646 ns 6.4492 ns 17.3253 ns
LastIndexOf_OrdinalIgnoreCase_ShortAscii 79.391 ns 2.1797 ns 6.1120 ns
LastIndexOf_OrdinalIgnoreCase_LongAscii 79.923 ns 1.7343 ns 5.0316 ns
LastIndexOf_OrdinalIgnoreCase_ShortNonAscii 65.837 ns 1.5754 ns 4.6204 ns
LastIndexOf_OrdinalIgnoreCase_LongNonAscii 331.291 ns 6.7892 ns 19.8045 ns
Compare_OrdinalIgnoreCase_ShortAscii 28.217 ns 0.6025 ns 1.3475 ns
Compare_OrdinalIgnoreCase_LongAscii 250.868 ns 5.0631 ns 13.2494 ns
StartsWith_OrdinalIgnoreCase_ShortAscii 8.355 ns 0.2060 ns 0.4608 ns
StartsWith_OrdinalIgnoreCase_LongAscii 70.660 ns 1.4495 ns 3.3305 ns
EndsWith_OrdinalIgnoreCase_ShortAscii 27.416 ns 0.5876 ns 1.1037 ns
EndsWith_OrdinalIgnoreCase_LongAscii 25.726 ns 0.6210 ns 1.8212 ns
Compare_OrdinalIgnoreCase_ShortNonAscii 29.755 ns 0.6429 ns 1.8549 ns
Compare_OrdinalIgnoreCase_LongNonAscii 29.212 ns 0.7063 ns 2.0038 ns
StartsWith_OrdinalIgnoreCase_ShortNonAscii 35.549 ns 2.2755 ns 6.6738 ns
StartsWith_OrdinalIgnoreCase_LongNonAscii 29.127 ns 0.5807 ns 0.5431 ns
EndsWith_OrdinalIgnoreCase_ShortNonAscii 22.088 ns 0.4197 ns 0.3926 ns
EndsWith_OrdinalIgnoreCase_LongNonAscii 21.528 ns 0.3856 ns 0.3220 ns

Linux with the Optimization

Method Mean Error StdDev
IndexOf_OrdinalIgnoreCase_ShortAscii 55.132 ns 1.1119 ns 2.0609 ns
IndexOf_OrdinalIgnoreCase_LongAscii 673.226 ns 10.4223 ns 9.7490 ns
IndexOf_OrdinalIgnoreCase_ShortNonAscii 19.971 ns 0.4136 ns 0.3869 ns
IndexOf_OrdinalIgnoreCase_LongNonAscii 126.101 ns 2.2821 ns 2.0230 ns
LastIndexOf_OrdinalIgnoreCase_ShortAscii 23.490 ns 0.3425 ns 0.4077 ns
LastIndexOf_OrdinalIgnoreCase_LongAscii 22.956 ns 0.2715 ns 0.2407 ns
LastIndexOf_OrdinalIgnoreCase_ShortNonAscii 19.403 ns 0.1864 ns 0.1556 ns
LastIndexOf_OrdinalIgnoreCase_LongNonAscii 127.550 ns 2.5765 ns 3.3502 ns
Compare_OrdinalIgnoreCase_ShortAscii 8.947 ns 0.1400 ns 0.1241 ns
Compare_OrdinalIgnoreCase_LongAscii 178.202 ns 2.3762 ns 2.1065 ns
StartsWith_OrdinalIgnoreCase_ShortAscii 6.476 ns 0.1605 ns 0.1501 ns
StartsWith_OrdinalIgnoreCase_LongAscii 56.053 ns 0.4483 ns 0.3743 ns
EndsWith_OrdinalIgnoreCase_ShortAscii 10.298 ns 0.0981 ns 0.0819 ns
EndsWith_OrdinalIgnoreCase_LongAscii 6.888 ns 0.1083 ns 0.0960 ns
Compare_OrdinalIgnoreCase_ShortNonAscii 11.199 ns 0.2417 ns 0.2482 ns
Compare_OrdinalIgnoreCase_LongNonAscii 11.133 ns 0.1970 ns 0.1645 ns
StartsWith_OrdinalIgnoreCase_ShortNonAscii 10.521 ns 0.1371 ns 0.1282 ns
StartsWith_OrdinalIgnoreCase_LongNonAscii 10.754 ns 0.1534 ns 0.1360 ns
EndsWith_OrdinalIgnoreCase_ShortNonAscii 12.965 ns 0.2162 ns 0.1917 ns
EndsWith_OrdinalIgnoreCase_LongNonAscii 13.210 ns 0.2505 ns 0.3429 ns

for (int i = 0; i < 256; i++)
{
// Unfortunately, to ensure one-to-one simple mapping we have to call u_toupper on every character.
// Using string casing ICU APIs cannot give such results even when using NULL locale to force root behavior.

FWIW, this is because Unicode itself doesn't have 1:1 case mapping.

Member Author

Actually, if we limit the functionality to UnicodeData.txt, it will be 1:1. Yes, I understand that in general Unicode casing is not 1:1.

jefgen commented Aug 17, 2020

This is really impressive @tarekgh! 👍

@GrabYourPitchforks GrabYourPitchforks left a comment

Looks good to me! However, we're now carrying some of our own ICU data (especially with regard to surrogate handling), and with this comes the possibility that our own ICU data will become out-of-sync with the ICU data of the underlying operating system.

The first immediate consequence is that the two lines below might now return different values, depending on runtime version and underlying OS:

// assume 'foo' and 'bar' are strings or ROS<char>
bool areEqual1 = foo.ToUpperInvariant() == bar.ToUpperInvariant();
bool areEqual2 = string.Equals(foo, bar, StringComparison.OrdinalIgnoreCase);

Per MSDN, these two lines are guaranteed to produce the same result.

There was some discussion on this over at #30960, where we proposed making the string.Equals method use simple case folding semantics rather than "convert to uppercase" semantics. One of the reasons given for pushback was that it could break this contract.

We might choose to say that it's ok to break this contract and that the two lines above shouldn't be considered equal. But if we make this claim we should do so consciously and deliberately.
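One way to probe the contract described above is to check whether the two formulations agree for a given pair of strings. The sketch below is only an assumption-laden illustration: `CasingContractProbe`, `Agree`, and `FirstDisagreement` are hypothetical names, and the scan covers only single BMP chars paired with their invariant uppercase form, so a clean run does not prove the contract holds in general.

```csharp
using System;

public static class CasingContractProbe
{
    // True when both formulations give the same answer for this pair.
    public static bool Agree(string foo, string bar)
    {
        bool viaUpper = foo.ToUpperInvariant() == bar.ToUpperInvariant();
        bool viaOrdinal = string.Equals(foo, bar, StringComparison.OrdinalIgnoreCase);
        return viaUpper == viaOrdinal;
    }

    // Scans the BMP (skipping surrogate code units) and returns the first char
    // whose pairing with its invariant uppercase form makes the two
    // formulations disagree, or -1 when none is found.
    public static int FirstDisagreement()
    {
        for (int i = 0; i <= 0xFFFF; i++)
        {
            char c = (char)i;
            if (char.IsSurrogate(c))
            {
                continue;
            }

            if (!Agree(c.ToString(), char.ToUpperInvariant(c).ToString()))
            {
                return i;
            }
        }

        return -1;
    }
}
```

The result of `FirstDisagreement()` may differ by runtime version and underlying OS, which is exactly the concern raised here.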

Second (and this can come later), we should introduce a unit test that validates the data carried by OrdinalHelper is always up-to-date with the other data like CharUnicodeData that we carry within the runtime. A unit test for this might look like the following (see here for more info).

using System.Text.Unicode;

[Fact]
public void OrdinalIgnoreCaseTestForAllChars()
{
    for (int i = 0; i < 0xD800; i++)
    {
        RunTest(i);
    }
    // skip unpaired surrogates
    for (int i = 0xE000; i <= 0x10FFFF; i++)
    {
        RunTest(i);
    }

    static void RunTest(int codePoint)
    {
        int upperCodePoint = UnicodeData.GetData(codePoint).SimpleUppercaseMapping;
        if (codePoint != upperCodePoint)
        {
            // 'codePoint' and 'upperCodePoint' should compare as case-insensitive equal

            string s1 = new Rune(codePoint).ToString();
            string s2 = new Rune(upperCodePoint).ToString();

            Assert.True(string.Equals(s1, s2, StringComparison.OrdinalIgnoreCase));
            Assert.Equal(0, string.Compare(s1, s2, StringComparison.OrdinalIgnoreCase));
        }
    }
}

continue;
}

// we come here only if we have valid full surrogates
Member

Nit: This comment isn't 100% correct. It's possible for the contents of the ROS<char> input to have changed between the first read (line 235) and the second read (line 261), which makes this no longer a valid surrogate pair. ToUpperSurrogate appears to be resilient to malformed surrogate pairs, but I wouldn't make such a solid "this is valid" statement in a comment.

(This applies to a few other places in this file, such as the pointer-based routine below.)

Member Author

Wow. I like the security thinking here :-) If someone changes the buffer contents during the operation, they have already messed up anyway, so I wouldn't worry much about that. We may clarify the comment later, but I would rather avoid it, as it may suggest that changing the underlying buffer is allowed.


// s_casingTable is covering the Unicode BMP plane only. Surrogate casing is handled separately.
// Every cell in the table is covering the casing of 256 characters in the BMP.
// Every cell is array of 512 character for uppercasing mapping.
Member

Worst case: this results in 26.5 KB of static cached data that never gets cleaned up during the lifetime of the application. Is this acceptable? Is it worth pointing out as a comment?

Member Author

I see this as very reasonable for the worst case.

Contributor

@tarekgh I do not think so. I have some IoT applications that can use very little memory.

Member Author

And do you expect such an application to perform casing on all Unicode ranges? We carry more data than that, and what I am allocating is in memory that can get swapped to the paging file. The ICU data file alone is much bigger than that anyway.
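The paged table discussed in this thread can be sketched roughly as follows. This is a simplified model, not the PR's code: `PagedCasingTableSketch`, `s_pages`, `s_noCasingPage`, and `InitPage` are names chosen for illustration, and `char.ToUpperInvariant` stands in for the per-character `u_toupper` call. Each allocated page holds 256 chars (512 bytes), so a 26.5 KB worst case would correspond to 53 allocated pages.

```csharp
using System;
using System.Threading;

public static class PagedCasingTableSketch
{
    // One slot per 256-char page of the BMP; slots are filled on first use.
    private static readonly char[][] s_pages = new char[256][];

    // Shared marker for pages where no char changes when uppercased.
    private static readonly char[] s_noCasingPage = Array.Empty<char>();

    public static char ToUpper(char c)
    {
        int pageIndex = c >> 8;
        char[] page = Volatile.Read(ref s_pages[pageIndex]) ?? InitPage(pageIndex);
        return ReferenceEquals(page, s_noCasingPage) ? c : page[c & 0xFF];
    }

    private static char[] InitPage(int pageIndex)
    {
        char[] page = new char[256];
        bool anyCasing = false;

        for (int i = 0; i < 256; i++)
        {
            char c = (char)((pageIndex << 8) | i);
            char upper = char.ToUpperInvariant(c); // stand-in for u_toupper
            page[i] = upper;
            anyCasing |= upper != c;
        }

        // Two threads may race to build the same page; both produce identical
        // content, so whichever write lands last is harmless.
        char[] result = anyCasing ? page : s_noCasingPage;
        Volatile.Write(ref s_pages[pageIndex], result);
        return result;
    }

    // Number of pages that actually hold a 512-byte mapping array.
    public static int AllocatedPageCount()
    {
        int count = 0;
        foreach (char[] page in s_pages)
        {
            if (page != null && !ReferenceEquals(page, s_noCasingPage))
            {
                count++;
            }
        }
        return count;
    }
}
```

An application that only ever cases ASCII text would allocate a single page under this model, which is the point made in the reply above about not touching all Unicode ranges.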

tarekgh commented Aug 17, 2020

@GrabYourPitchforks

> Per MSDN, these two lines are guaranteed to produce the same result.

Even before this change we were not consistent. When upper/lower casing with invariant we were using u_toupper/u_tolower, but when comparing strings with invariant we were using the ICU collation APIs, which can give different results.

> We might choose to say that it's ok to break this contract and that the two lines above shouldn't be considered equal. But if we make this claim we should do so consciously and deliberately.

As I pointed out before, we were not really 100% conforming to the contract anyway. But it is a good call to update the docs to clarify the behavior. Thanks for pointing that out.

> Second (and this can come later), we should introduce a unit test that validates the data carried by OrdinalHelper is always up-to-date with the other data like CharUnicodeData that we carry within the runtime. A unit test for this might look like the following (see here for more info).

Fully agree. I was already planning to do that but didn't have a chance to complete it. I'll track it to be done later.

Last, thanks for your review and thoughts.

@tarekgh tarekgh merged commit 43bc8e8 into dotnet:master Aug 17, 2020
@tarekgh tarekgh deleted the OrdinalCasing branch August 17, 2020 23:53
tarekgh added a commit to tarekgh/runtime that referenced this pull request Aug 18, 2020
MichalStrehovsky added a commit that referenced this pull request Sep 10, 2020
This is unused after #40910 and breaks building standalone System.Globalization.Native (`error G94D986AD: unused function 'AreEqualOrdinalIgnoreCase' [-Werror,-Wunused-function]`).
jkotas pushed a commit that referenced this pull request Sep 10, 2020
This is unused after #40910 and breaks building standalone System.Globalization.Native (`error G94D986AD: unused function 'AreEqualOrdinalIgnoreCase' [-Werror,-Wunused-function]`).
@ghost ghost locked as resolved and limited conversation to collaborators Dec 7, 2020