string.IndexOf bug when using Thai culture #75616

Closed
dnickless opened this issue Sep 13, 2022 · 5 comments

@dnickless
Contributor

How to reproduce:

Create a net472 Console project and paste the following code:

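using System;
using System.Globalization;
using System.Threading;
public class Program
{
    public static void Main()
    {
        // Make Thai (Thailand) the current culture for this thread.
        Thread.CurrentThread.CurrentCulture = new CultureInfo("th-TH");
        // IndexOf(string) with no StringComparison argument is culture-sensitive,
        // so the result depends on the collation rules of the current culture.
        Console.WriteLine("#".IndexOf("["));
    }
}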

Output (as expected): -1

Switch the .csproj to the net6.0 TFM and run again.

Output (not expected): 0

@gfoidl
Member

gfoidl commented Sep 14, 2022

Most likely cause: starting with .NET 5, the globalization APIs use the ICU libraries on Windows 10.
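
To confirm which globalization library a given process is using, the sort version of the invariant culture can be inspected. The following is a sketch only: the class name `IcuProbe` is hypothetical, and the check assumes that an ICU-backed runtime reports a `SortVersion` whose `SortId` encodes the same non-zero number as its `FullVersion`, while an NLS-backed runtime does not.

```csharp
using System;
using System.Globalization;

public static class IcuProbe
{
    // Heuristic check (assumption: under ICU, the first four bytes of SortId
    // encode the same non-zero value as FullVersion; under NLS they do not).
    public static bool IsIcuMode()
    {
        SortVersion sortVersion = CultureInfo.InvariantCulture.CompareInfo.Version;
        byte[] bytes = sortVersion.SortId.ToByteArray();
        int version = bytes[3] << 24 | bytes[2] << 16 | bytes[1] << 8 | bytes[0];
        return version != 0 && version == sortVersion.FullVersion;
    }

    public static void Main()
    {
        Console.WriteLine(IsIcuMode() ? "ICU appears to be in use" : "ICU does not appear to be in use");
    }
}
```

Running this under both target frameworks from the repro should show whether the difference in `IndexOf` results coincides with a change of globalization library.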

@mairaw mairaw transferred this issue from dotnet/core Sep 14, 2022
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Sep 14, 2022
@ghost

ghost commented Sep 14, 2022

Tagging subscribers to this area: @dotnet/area-system-globalization
See info in area-owners.md if you want to be subscribed.

Author: dnickless
Assignees: -
Labels: area-System.Globalization

Milestone: -

@tarekgh
Member

tarekgh commented Sep 14, 2022

@dnickless

The Thai language has specific collation behavior, which is what you are seeing here. It gives some characters, such as #, [, and -, zero sort weight, which means these characters are treated as if they did not exist in the string at all. A culture-sensitive search for "[" is therefore equivalent to a search for the empty string, which matches at index 0.
Starting with .NET 5, .NET switched to the ICU library for globalization support to conform more closely to the Unicode Standard. If you don't need linguistic behavior in your string search, call IndexOf with the StringComparison.Ordinal option. If you want to revert to the old search behavior (which we don't recommend), use the approach described in the doc.
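
A minimal sketch of the ordinal workaround, assuming the search only needs to find the literal character and not a linguistic match. The class `OrdinalSearchDemo` and the helper `FindOrdinal` are illustrative names, not part of any library.

```csharp
using System;
using System.Globalization;

public static class OrdinalSearchDemo
{
    // Ordinal search compares UTF-16 code units directly and ignores
    // the collation rules of the current culture.
    public static int FindOrdinal(string source, string value) =>
        source.IndexOf(value, StringComparison.Ordinal);

    public static void Main()
    {
        // Use the same culture as the original repro.
        CultureInfo.CurrentCulture = new CultureInfo("th-TH");

        // Ordinal string search: "[" does not occur in "#".
        Console.WriteLine(FindOrdinal("#", "["));   // prints -1

        // The char overload of IndexOf is always ordinal as well.
        Console.WriteLine("#".IndexOf('['));        // prints -1
    }
}
```

Passing `StringComparison.Ordinal` explicitly gives the same result regardless of the current culture or the globalization library in use, which is why it suits parsing tasks such as looking for a bracket in a format string.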

I am closing the issue, but feel free to ask any questions and we'll be happy to help answer them.

@tarekgh tarekgh closed this as completed Sep 14, 2022
@ghost ghost removed the untriaged New issue has not been triaged by the area owner label Sep 14, 2022
@dnickless
Contributor Author

@tarekgh, thanks for the explanation, which makes a lot of sense. Thanks even more for the workaround, which we will need to apply since we do not own the problematic source code (https://github.com/ClosedXML/ClosedXML/blob/78150efbbd4a36d65e95ef3c793f12feb12c1a9c/ClosedXML/Excel/XLWorkbook_Load.cs#L1249).

I realize that you've already answered very similar questions here:
#43772
and here:
https://developercommunity.visualstudio.com/t/stringstartswith-and-stringendwith-returns-wrong-v/1218489

...and I suspect that tons of applications will stop functioning in Thailand (and probably elsewhere, too) as we speak due to this change...

In case anyone reading this cares, here's the Thai alphabet in UTF-8 (nota bene it does indeed lack the square bracket or any other special character...): https://www.utf8-chartable.de/unicode-utf8-table.pl?start=3584&number=128&utf8=0x

@tarekgh
Member

tarekgh commented Sep 14, 2022

Thanks @dnickless for the feedback.

> we do not own the problematic source code (ClosedXML/ClosedXML@78150ef/ClosedXML/Excel/XLWorkbook_Load.cs#L1249).

I have opened an issue for that library to get this fixed on their side: ClosedXML/ClosedXML#1862. If you see similar issues elsewhere, I suggest you open issues for those cases or contact us so we can follow up.

> In case anyone reading this cares, here's the Thai alphabet in UTF-8 (nota bene it does indeed lack the square bracket or any other special character...): utf8-chartable.de/unicode-utf8-table.pl?start=3584&number=128&utf8=0x

Unicode lists the characters of different languages separately, and a language's character list does not need to include all ASCII characters. The collation for the language still decides how characters from the ASCII range behave.
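
The role of collation can be observed directly by asking a culture's `CompareInfo` whether a string compares equal to the empty string. This is a sketch under assumptions: `CollationProbe` and `IsIgnorable` are hypothetical names, and the printed values depend on the runtime and on whether ICU or NLS is in use, so no output is shown.

```csharp
using System;
using System.Globalization;

public static class CollationProbe
{
    // Returns true when the given culture's collation treats `text`
    // as equal to the empty string, i.e. as carrying no sort weight.
    public static bool IsIgnorable(string text, CultureInfo culture) =>
        culture.CompareInfo.Compare(text, string.Empty, CompareOptions.None) == 0;

    public static void Main()
    {
        var thai = new CultureInfo("th-TH");

        // Compare how the Thai and invariant cultures treat a few ASCII characters.
        foreach (string s in new[] { "#", "[", "-", "a" })
        {
            Console.WriteLine(
                $"'{s}': th-TH ignorable = {IsIgnorable(s, thai)}, " +
                $"invariant ignorable = {IsIgnorable(s, CultureInfo.InvariantCulture)}");
        }
    }
}
```

A character reported as ignorable for a culture is one that culture-sensitive `IndexOf`, `StartsWith`, and `EndsWith` calls will effectively skip under that culture.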

@dotnet dotnet locked as resolved and limited conversation to collaborators Oct 15, 2022