Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support using ICU when running on Windows #826

Closed
tarekgh opened this issue Oct 8, 2019 · 20 comments · Fixed by #35383
Closed

Support using ICU when running on Windows #826

tarekgh opened this issue Oct 8, 2019 · 20 comments · Fixed by #35383

Comments

@tarekgh
Copy link
Member

tarekgh commented Oct 8, 2019

Windows 10 now carrying a version of ICU. It will be good to allow .NET Core take advantage of that.

Goals

There are 2 goals which if we do will get us in the much better position regarding the globalization support.

  • Allow using the installed Windows version of ICU when running on Windows 10 19H1 and up.
  • Allow using the app ICU version which the app carry with.

Doing these 2 goals will give the flexibility to the apps to decide if want to use whatever ICU version shipping with Windows or decide to carry its own version of ICU.

How we can support ICU on Windows

  • We'll merge System.Globalization.Native code to coreclr and include this code for Windows build too. System.Globalization.Native has the ICU interop.
  • Adding a shim to load the ICU libraries like what we are doing on Linux, but we’ll try to load it from different places (From System32 folder or app folder for Windows).
  • Having the inteop cs code which is compiled for Linux to be included for Windows too. Of course, we must refactor the current code more and instead of having Windows and Linux cs files, we’ll have NLS and ICU code paths instead.

Loading Default installed ICU on Windows

Windows RS2 started carrying ICU. The ICU DLLs are installed on %WINDIR%\System32 folder.
On RS2, there are 2 DLLs installed there named icuuc.dll and icuin.dll. On 19H1, Windows combined these 2 DLLs into a single DLL named icu.dll.
We'll support the loading ICU on Windows starting from 19H1 version and up. The reason is the previous versions require COM initialization which can cause some issues.

NLS or ICU as a default?

We'll use ICU by default as this is the future direction and we want to get users switch to that. Of course this will be kind of breaking change especially for sorting behavior and locale data. We are going to provide a config switch to allow using NLS if any app wanted to switch back to the old behavior.

Loading non-ICU default libraries

As the second goal is to allow using a specific version of ICU which not necessary the same version shipped with Windows. We are working with Windows team to have them produce ICU NuGet package which apps can get and install with their app (under the app folder). The apps will install it under the app folder for the reasons:

  • Not try to access any files outside the app folder. That will be more secure.
  • The app can be self-contained and possibly can get the ICU package work seamlessly with the .NET Core csproj which will place the ICU DLLs under the app published folder.

The app can opt-in to the app version using the config switch. We’ll try to find the libraries under the app folder according to the layout decided by NuGet package.
The ICU NuGet package will support running on down level Windows versions (e.g. Windows 7). That can give advantage of using ICU there too.

Config Switch

We need the config switch for 2 reasons:

To opt-in/out to or from using ICU/NLS.
To opt-in use the local app ICU version instead of the global version.

Note, we can have this switch work on Linux/macOS too. so, we’ll allow apps to carry their own ICU version if they need to. The issue here is, so far no one producing NuGet packages for ICU on Linux and macOS. The apps on Linux/macOS will need manually to carry the ICU libs to allow local ICU support.

The suggesting is to have a config switch named System.Globalization.ICU and will have the possible values:

  • None means use the ICU (default value).
  • NLS means use NLS APIs and not ICU. this value will be supported on Windows only.
  • Local use app local version (maybe we need to specify the version in the value but need some more investigation about that).

We’ll support the switch through environment variable too called DOTNET_SYSTEM_GLOBALIZATION_ICU.

Falling back

  • If the app didn't opt-in to use NLS and we fail to load the ICU libraries, we’ll fallback to NLS APIs
  • If the app opt-in to use local app version of ICU and failed to load it, we'll fail the app and fail fast.

Invariant Mode

If the globalization invariant mode is enabled, we honor it and will not try to load ICU at all.

@danmoseley
Copy link
Member

I believe this also tracks potentially carrying a private version of ICU, whether on Linux or Windows. (We can split that out of course)

@tarekgh
Copy link
Member Author

tarekgh commented Oct 15, 2019

Right. This issue intended to track all related ICU work we need to do. we can add all information here.

@danmoseley
Copy link
Member

danmoseley commented Oct 21, 2019

You mention supporting app-local ICU in Windows and Linux. Will we support it in macOS?

Are there any special caveats for Alpine here (or containers)? I know Alpine originally motivated the "no ICU" mode -- I assume ICU can be deployed on Alpine manually -- will it be possible to use an app-local ICU there as well?

@tarekgh
Copy link
Member Author

tarekgh commented Oct 21, 2019

You mention supporting app-local ICU in Windows and Linux. Will we support it in macOS?

Yes, macOS should be covered too. I'll edit the text to clarify one thing for supporting ICU as local app on Linux/macOS which is, the ICU NuGet packages which going to be produced by Windows team is going to support Windows only. So, on Linux/macOS, the app will need manually include the ICU libs if want to have local support. That till we'll have some source producing ICU NuGet packages for Linux/macOS.

Are there any special caveats for Alpine here (or containers)? I know Alpine originally motivated the "no ICU" mode -- I assume ICU can be deployed on Alpine manually -- will it be possible to use an app-local ICU there as well?

If the globalization invariant mode is enabled on Alpine, then we'll not load ICU at all (as we do today).
I believe there is a way to install ICU on Alpine as mentioned in https://pkgs.alpinelinux.org/package/edge/main/x86/icu
Alpine will not be different than other Linux distrors.

@jkotas
Copy link
Member

jkotas commented Oct 29, 2019

ICU has different performance characteristics, so we are going to see performance changes as side-effect of this change (related: https://github.com/dotnet/coreclr/issues/27525).

@tarekgh
Copy link
Member Author

tarekgh commented Oct 29, 2019

@jkotas that is why I was inclining to make ICU as opt-in in the beginning. I am not sure if we can address all perf concerns in 5.0.

@jkotas
Copy link
Member

jkotas commented Oct 29, 2019

It is ok to have performance regressions if they are understood and done intentionally for a good reason. We have done this trade-off many times in the history of .NET.

@GrabYourPitchforks
Copy link
Member

We also have a few options available to address any performance losses if they show up in hot paths. For example, we already have a few places where we special-case OrdinalIgnoreCase comparison and go down a fast path if the input is all-ASCII. We could investigate expanding the number of places which receive such an optimization. We could cache commonly-used cultures at the ICU interop layer. We could even contribute changes back to ICU if we identify ways to make the algorithms themselves better.

@iSazonov
Copy link
Contributor

From my experiments with Simple Case Folding a two-level table is as fast as ASCII workarounds. I believe MSFT experts can make it even faster.

@adamsitnik
Copy link
Member

We'll use ICU by default as this is the future direction and we want to get users switch to that. Of course this will be kind of breaking change especially for sorting behavior and locale data.

I think that we should revisit all known platform differences and group them into 3 categories:

  • ICU bugs (and report|fix them)
  • invalid ICU API usage (requires fix on our side)
  • by design (and document these differences)

An example of ICU bug that we've already fixed: unicode-org/icu#840

Examples of things that can be bugs on our side or ICU side:

Console.WriteLine(CultureInfo.InvariantCulture.CompareInfo.IsPrefix("''b", "b", CompareOptions.IgnoreSymbols)); // false
Console.WriteLine(CultureInfo.InvariantCulture.CompareInfo.IsPrefix("a''b", "ab", CompareOptions.IgnoreSymbols)); // true
CompareInfo frenchCompare = new CultureInfo("fr-FR").CompareInfo;
Console.WriteLine(frenchCompare.Compare("\u0153", "oe")); // 1, should be 0

@adamsitnik
Copy link
Member

We could investigate expanding the number of places which receive such an optimization.

Personally I am not a big fan of such optimizations. It adds a LOT of complexity on our side.

We could cache commonly-used cultures at the ICU interop layer.

I am not sure if I understand correctly, but we already have a cache:

https://github.com/dotnet/coreclr/blob/c6697a00ef7ecba6012b1a52f13401ce06b8295b/src/System.Private.CoreLib/shared/System/Globalization/CompareInfo.Unix.cs#L42

We could even contribute changes back to ICU if we identify ways to make the algorithms themselves better.

I had the "pleasure" to get more familiar with ICU code and I can say that there is definitely a place for improvement. The lack of possibility of breaking a 20 years old API makes it a little bit harder, but not impossible.

An example:

before we call the equivalent of IndexOf or LastIndexOf we need to call usearch_openFromCollator which calls method initialize, which calls initializePattern method which calls initializePatternCETable

initializePatternCETable allocates an array of integers of length of the "pattern" string and translates every character in the pattern string to an int32 (collation element). The problem is that it does it for every input, even for things like "a".IndexOf(new string('a', 1_000_000)) It's OK for most of the use cases, the point that I am trying to make here is that the API that we need the most (collators) does not seem to be optimized for perf.

@tarekgh are there other parties at Microsoft interested in using ICU in performance-sensitive code? Maybe we should just identify the perf bottlenecks, submit proposals and somehow divide the work amongst teams. Also, if we really want to use ICU as our default we should ask for expanding the usearch API with StartsWith and EndsWith methods

@tarekgh
Copy link
Member Author

tarekgh commented Oct 31, 2019

are there other parties at Microsoft interested in using ICU in performance-sensitive code?

We got some parties asking for using ICU on Windows. They already used it on Linux and never heard any complain from them about the perf there. So far, we didn't get anyone talked about the expected performance at all.

@jkotas
Copy link
Member

jkotas commented Oct 31, 2019

It is super rare to see globalization APIs used on performance critical paths. If somebody uses globalization API on a performance critical path, it is most likely a bug - a wrong API being called.

@adamsitnik
Copy link
Member

It is super rare to see globalization APIs used on performance critical paths.

I agree. The problem is that StringComparison.CurrentCulture is our default for all string based operations. So anytime our customers call string.IndexOf, string.LastIndexOf, string.StartsWith, string.EndsWith without specifying the StringComparison in explicit way, they hit the "slow path". Most software engineers don't measure the performance regulary, so typically they realize that something is "hot" after they hit a problem in produciton. In my opinion, we should try to prevent that from happening.

I agree with @kevingosse who wrote in https://github.com/dotnet/corefx/issues/40674#issuecomment-526319388

The problem is that developers on .net framework or .net core/windows aren't really used to adding StringComparer.Ordinal, since the performance of culture-based string comparisons is pretty good on Windows (so it's mostly seen as a micro optimization). And then it breaks when we start using those bits on Linux.

I believe that if we really want to switch to ICU by default without optimizing it first, we should consider adding a Ordinal global mode, similar to global invariant mode we have.

@jkotas
Copy link
Member

jkotas commented Nov 1, 2019

anytime our customers call string.IndexOf, string.LastIndexOf, string.StartsWith, string.EndsWith without specifying the StringComparison in explicit way, they hit the "slow path"

They will hit the "slow path" only if the strings do contain globalization-sensitive characters.

we should consider adding a Ordinal global mode, similar to global invariant mode we have.

Part of the plan is to have a switch to use NLS. So if somebody hits a problem with ICU (whether it is functionality or performance), they can go back to the legacy NLS to workaround it.

@jkotas
Copy link
Member

jkotas commented Nov 1, 2019

StringComparer.Ordinal, since the performance of culture-based string comparisons is pretty good on Windows (so it's mostly seen as a micro optimization)

Using StringComparer.Ordinal properly is not just a performance micro optimization. It is often required for security.

@adamsitnik
Copy link
Member

adamsitnik commented Nov 1, 2019

They will hit the "slow path" only if the strings do contain globalization-sensitive characters.

This is true only for en* cultures. The "slow path" is the default for every other culture (edit: except invariant culture).

@kevingosse
Copy link
Contributor

I believe that if we really want to switch to ICU by default without optimizing it first, we should consider adding a Ordinal global mode, similar to global invariant mode we have.

I'm a big supporter of this. It is really weird to say on one hand "people should use Ordinal most of the time anyway" and yet have APIs use culture-sensitive comparison per default (thus requiring extra knowledge to do the right thing).

@safern safern self-assigned this Nov 25, 2019
@danmoseley danmoseley transferred this issue from dotnet/corefx Dec 13, 2019
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-System.Globalization untriaged New issue has not been triaged by the area owner labels Dec 13, 2019
@danmoseley danmoseley removed the untriaged New issue has not been triaged by the area owner label Dec 13, 2019
@tarekgh tarekgh added this to the 5.0 milestone Dec 14, 2019
@danmoseley danmoseley moved this from Committed to In progress in .NET Core impacting internal partners Feb 6, 2020
@safern safern added this to To do in ICU on Windows, Mobile, and Wasm via automation Feb 13, 2020
@safern safern moved this from To do to In progress in ICU on Windows, Mobile, and Wasm Feb 13, 2020
@maryamariyan maryamariyan added the untriaged New issue has not been triaged by the area owner label Feb 23, 2020
@safern
Copy link
Member

safern commented Jun 11, 2020

Thanks @tarekgh -- also pasting the official docs site for this since those might suffer updates:

https://docs.microsoft.com/en-us/dotnet/standard/globalization-localization/globalization-icu

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Development

Successfully merging a pull request may close this issue.

10 participants