Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Invariant globalization #10264

Merged
merged 11 commits into from
Mar 19, 2017
Merged

Invariant globalization #10264

merged 11 commits into from
Mar 19, 2017

Conversation

tarekgh
Copy link
Member

@tarekgh tarekgh commented Mar 17, 2017

This change is to introduce the config switch to control using the OS globalization or just run in the invariant mode.

@tarekgh
Copy link
Member Author

tarekgh commented Mar 17, 2017

@tarekgh tarekgh changed the title Neutral globalization Invariant globalization Mar 17, 2017
@@ -49,6 +55,12 @@ internal static bool IsNormalized(String strInput, NormalizationForm normForm)

internal static String Normalize(String strInput, NormalizationForm normForm)
{
if (CultureData.InvariantMode)
{
// work ordinal, then all charcaters are normalized in this mode
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo "charcaters " in a few places

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks Dan :-)

internal class CLRConfig
{

#if CORECLR
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not be needed - CORECLR is always defined.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the #else case to avoid compiling the code for non CORECLR. I may be missing something here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean CoreRT? I do not think we will want any of CLRConfig.cs in CoreRT.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless this code may one day be shared with CoreRT, it will never get compiled without CORECLR defined. It's defined in System.Private.CoreLib.csproj

@@ -170,6 +170,13 @@ bool FindLibWithMajorMinorSubVersion(int* majorVer, int* minorVer, int* subVer)
return false;
}

static int32_t g_icuPresent = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may not actually matter, but did you consider reversing this? i.e. starting it at 0 == not present, and then setting it to 1 only once ICU has been detected and all of the function pointers appropriately initialized?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll do that

@tarekgh
Copy link
Member Author

tarekgh commented Mar 17, 2017

@dotnet-bot help

@@ -60,6 +60,10 @@ public override bool IsInvalid

protected override bool ReleaseHandle()
{
if (CultureData.InvariantMode)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In what situation would ReleaseHandle be called but find InvariantMode == true?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not really needed now. I had the initial implementation which was keeping the sort handle but now I don't initialize it at all so it is not needed now. I'll convert this to assert instead

public partial class CompareInfo
{
internal static unsafe int InvariantIndexOfOrdinal(string source, string value, int startIndex, int count, bool ignoreCase)
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presumably these arguments have all already been validated?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. but I can add some asserts for it

fixed (char* pSource = source) fixed (char* pValue = value)
{
char* pSrc = &pSource[startIndex];
int index = InvariantFindStringOrdinal(pSrc, count, pValue, value.Length, true, ignoreCase);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: can you use named arguments for things like the true here and false below? From context I'm assuming it's about first vs last, but it'd be easier to read if it was clear from the call site.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure. it is good idea anyway :-)

@tarekgh
Copy link
Member Author

tarekgh commented Mar 17, 2017

@dotnet-bot test Ubuntu x64 corefx_baseline
@dotnet-bot test Windows_NT x64 corefx_baseline

for (ctrValue = 1; ctrValue < valueCount; ctrValue++)
{
sourceChar = InvariantToUpper(source[ctrSource + ctrValue]);
valueChar = InvariantToUpper(value[ctrValue]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have routines for doing this kind of thing (though I don't know if we have it for this exact thing), used for ASCII-only performance paths. Can we look at reusing where possible? That would help to a) ensure the fallback is using well-tested code paths, b) any optimizations to one would apply to the other, and c) for places where we don't currently have ASCII-only fast paths, we might be able to leverage this work to add some.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related to that, how are we thinking about the performance of this invariant mode? Presumably one of the scenarios for it is apps that don't want to carry the ICU data around because they care about being lean, in which case throughput may also be of importance?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have already used the existing current optimized implementation in the functionality like string comparisons. but for IndexOf, looks we don't have a version for that. in short, I used whatever exist and provided the missing. Please point me to any implementation I should used if I missed it.
For the performance I agree in general. for the areas we can optimize it more we can look at it and do whatever needed there. I think the current implementation is reasonable for this check in and we can follow up for the optimization later. is this ok?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@benadams who may know of something.

}
}

private static unsafe int InvariantFindStringOrdinal(char* source, int sourceCount, char* value, int valueCount, bool start, bool ignoreCase)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does it mean that we use both "Invariant" and "Ordinal" in the name? Do they connote different things?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll remove the word Ordinal to reduce the confusion. it was meant in the invariant mode we do ordinal comparison.

@@ -49,6 +55,12 @@ internal static bool IsNormalized(String strInput, NormalizationForm normForm)

internal static String Normalize(String strInput, NormalizationForm normForm)
{
if (CultureData.InvariantMode)
{
// work ordinal, then all charcaters are normalized in this mode
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not understand what is work ordinal trying to say.


namespace System.Text
{
static partial class Normalization
{
public static bool IsNormalized(this string strInput, NormalizationForm normalizationForm)
{
if (CultureData.InvariantMode)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check should be after ValidateArguments

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do that

@@ -25,6 +32,12 @@ public static bool IsNormalized(this string strInput, NormalizationForm normaliz

public static string Normalize(this string strInput, NormalizationForm normalizationForm)
{
if (CultureData.InvariantMode)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do that

char *pChar = &ch;
return IsSortable(pChar, 1);
}

public static unsafe bool IsSortable(string text)
{
if (CultureData.InvariantMode)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be after the argument validation at least; or even better in the common worked method.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will move it after the validation. there is no common working method. we have forked implementation for different OS's

@@ -60,6 +60,10 @@ public override bool IsInvalid

protected override bool ReleaseHandle()
{
if (CultureData.InvariantMode)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can we get here for the invariant mode?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes you are right. I have converted this check to assert instead.

@@ -14,6 +14,20 @@ namespace System.Globalization
{
internal partial class CultureData
{
private static bool s_invariantGlobalizationMode = InitGlobalizationInvariantMode();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be initialized explicitly during startup, not using implicit constructor. The implicit constructor as written will turn into two checks each time. 1. have we initialized already? 2. What the value is? So these checks are twice as expensive than what they need to be.

Or even better - we may want to cache the flag in a field in the objects with hot paths like TextInfo, and check the instance field there. Instance fields are faster to access than static fields.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since corelib is crossgen'd, I assume the normal tricks around static readonly bools becoming JIT-time consts wouldn't apply, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, that's why it would be best to cache the bool in instance fields where possible - instance fields are the fastest to access.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry I am not clear about that. I am using this bool static in many static methods. if I cache it in instance field how I am going to use it in such static methods? could you please clarify more what you mean here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, this would not work everywhere. But it would work for the string manipulation methods on TextInfo or CompareInfo that are probably the most impacted ones.

Feel free to ignore this suggestion if you do not think it is a good idea. It would be useful to pick a few of the leanest paths that the check is getting added to and measure perf impact of this change on them.

pResult[j] = pSource[j];
}

pResult[i] = (Char)(pSource[i] & ~0x20);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lower case char
and above also

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do that

_sortHandle);
}

if (0 == ret)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see anything in the coding guidelines about Yoda conditionals but I think we avoid them.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't change this code. also I am not sure why this bad to use?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you know in native code people often write constant == variable (aka Yoda conditionals) out of habit so that if they mistype constant = variable it will cause a compile error. In C# though there is no implicit conversion in most cases to boolean so eg for integers there is no danger of successfully compiling constant = variable. So in code bases I"ve worked on, typically it's written variable == constant for readability.
Just a nit and I don't see it in the style guidelines although it does seem largely followed here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll fix this one

if (!invariantEnabled && Interop.GlobalizationInterop.ICUPresent() == 0)
{
string message = "Couldn't find a valid ICU package installed on the system. " +
"Set the configuration flag System.Globalization.Invariant to true if want to run with no globalization support.";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if want to -> if you want to

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will fix that

@@ -459,6 +491,34 @@ internal static unsafe int CompareOrdinalIgnoreCase(string strA, int indexA, int
length--;
}

if (CultureData.InvariantMode)
{
while (length != 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the sole difference between this section and the one above it that this one doesn't limit itself to just ASCII chars, so that we don't call to the OS in invariant mode? Consider a) adding a comment, and b) moving this out into a separate function.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do that

if (CultureData.InvariantMode)
{
return true;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than having these guards in every function, would it be possible to, for example, implement a special CultureInfo for invariant all in managed code with functions like you have, and then if ICU isn't present / the config switch is thrown, just return that invariant CulutureInfo from anything that can give back a CultureInfo? Maybe that's not feasible, but it seems like it could help to consolidate the logic better.

Copy link
Member Author

@tarekgh tarekgh Mar 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have spent some time thinking about how we can implement the whole thing here. I have looked at the alternatives. for CultureInfo we already doing almost your idea mentioned here. but for other classes like CompareInfo/TextInfo I preferred the current implementation. To make it like what you suggested, I'll have to subclass CompareInfo/TextInfo and override all methods and then return such objects to the callers of such properties (like CultureInfo.CompareInfo, CultureInfo.TextInfo) . I dismissed this idea because a) we are not returning CompareInfo as we used to but we are returning a subclass of CompareInfo b) I was trying to avoid the unknown perf cost c) this will restrict us if we decided to expose a constructors for such classes in the future d) we'll have to use most of the existing code in CompareInfo anyway. we can refactor the code for this one though.

@@ -459,6 +491,34 @@ internal static unsafe int CompareOrdinalIgnoreCase(string strA, int indexA, int
length--;
}

if (CultureData.InvariantMode)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it necessary to duplicate this block? can't the condition above be changed from
while (length != 0 && (*a <= 0x80) && (*b <= 0x80))
to
while (length != 0 && ((*a <= 0x80) && (*b <= 0x80)) || CultureData.InvariantMode)

Copy link
Member Author

@tarekgh tarekgh Mar 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't do that for the perf reason. this will add extra check with every character.

Copy link
Member

@danmoseley danmoseley Mar 18, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about

             byte comparator = CultureData.InvariantMode ? 0xFF : 0x80;
		 
                 while (length != 0 && (*a <= comparator) && (*b <= comparator))
                 {

that would avoid the duplication al,so.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or a separate function as @stephentoub suggests. It's just confusing two have two blocks like this IMO.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good idea. I'll apply it. I'll use 0xFFFF (not 0xFF)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes thanks I meant 0xFFFF :) ... this should get a good comment, though, as I had to think about it.

@stephentoub
Copy link
Member

How are we testing this?

@danmoseley
Copy link
Member

What is your thinking about testing that we don't accidentally load ICU when we aren't supposed to? It would be easy to introduce a code path a month from now that does this.
Should we be able to periodically run certain tests in a situation where ICU isn't available? Or we rely on just checking periodically under the debugger?
Could we add code that gets called when ICU was loaded, and make it (when enabled by a flag that some of our tests set) cause a failfast?

@tarekgh
Copy link
Member Author

tarekgh commented Mar 17, 2017

For the testing question, I have already created a test which testing almost all the globalization functionality when having the config switch is on. The test expect different values that if we have the switch is off. I have been testing it on Windows and Linux. also I have tested on system having ICU installed and not having the ICU installed with and without the config switch.
Also as you notice I have many asserts in the code in the paths that calls the OS. actually in my testing I was having these asserts as exception thrown. and I have successfully ran the test which exercise all the globalization functionality. also the test ran fine on system doesn't have ICU installed

@danmoseley
Copy link
Member

I have tested on system having ICU installed and not having the ICU installed with and without the config switch.

I guess what I'm wondering is how easy it would be for a dev 6 months from now to accidentally introduce a load of ICU in a non-ICU codepath and not notice. Unless we either have something in the product that notices it loads (if that's even possible) or we regularly run our tests on a box without ICU available, as you've done manually, we probably wouldn't notice.

@tarekgh
Copy link
Member Author

tarekgh commented Mar 17, 2017

@danmosemsft I understand your point. we need to be more robust to guard against mistakes in the future. you are talking about ICU here but we should think about Windows too which I am seeing them as same. I think the best here is we create a static analysis tool that can scan the whole code and ensure all paths to the pinvokes are guarded.

return LastIndexOfCore(source, value, startIndex, count, options);
}

internal unsafe int LastIndexOfOrdinal(string source, string value, int startIndex, int count, bool ignoreCase)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

{
internal sealed partial class GlobalizationMode
{
internal static bool GetGlobalizationInvariantMode()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

private?

{
internal sealed partial class GlobalizationMode
{
internal static bool GetGlobalizationInvariantMode()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

@jkotas
Copy link
Member

jkotas commented Mar 19, 2017

LGTM modulo few nits.

internal sealed partial class GlobalizationMode
{
private const string c_InvariantModeConfigSwitch = "System.Globalization.Invariant";
private static bool s_invariantMode = GetGlobalizationInvariantMode();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be readonly

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or even internal static bool Invariant { get; } = GetGlobalizationInvariantMode();

@tarekgh
Copy link
Member Author

tarekgh commented Mar 19, 2017

@jkotas thanks for your review and the feedback. I have one question. one of your comments you asked to remove the #if CORECLR. my question is, when these changes get merged with the corert, I think this will break. right? GlobalizationMode.Invariant get initialized by the QCall which is not in coreRT. so I think we'll need to add the #if CORECLR in the initialization to avoid the QCall calling in corert.

@tarekgh
Copy link
Member Author

tarekgh commented Mar 19, 2017

test OSX10.12 x64 Checked Build and Test please

@jkotas
Copy link
Member

jkotas commented Mar 19, 2017

when these changes get merged with the corert, I think this will break. right?

I expect that we are not going to share this file with CoreRT - this file will be CoreCLR specific. There will be CoreRT specific version of it that reads it directly from AppContext and has potentially other CoreRT specific tweaks.

@jkotas
Copy link
Member

jkotas commented Mar 19, 2017

Also, could you please work @russellhadley and @erozenfeld to figure out the plan how to get this integrated with the linker that they are working on? This can make standalone apps that opt into invariant globalization smaller; and it can be also used to test that you got all paths to ICU covered with the condition - the linker should tree shake all calls to ICU.

@tarekgh
Copy link
Member Author

tarekgh commented Mar 19, 2017

I expect that we are not going to share this file with CoreRT - this file will be CoreCLR specific. There will be CoreRT specific version of it that reads it directly from AppContext and has potentially other CoreRT specific tweaks.

I was just wondering about the calls of GlobalizationMode.Invarian in many places inside the globalization code. so corert should have the definition of GlobalizationMode or otherwise the code will not compile. I may be missing something but anyway I'll go ahead and merge the changes here and we can handle any break if we get any.

@tarekgh
Copy link
Member Author

tarekgh commented Mar 19, 2017

test OSX10.12 x64 Checked Build and Test please

@tarekgh
Copy link
Member Author

tarekgh commented Mar 19, 2017

I'll work with @russellhadley and @erozenfeld as you suggested too.

@tarekgh tarekgh merged commit d905f67 into dotnet:master Mar 19, 2017
@tarekgh tarekgh deleted the NeutralGlobalization branch March 19, 2017 23:20
@russellhadley
Copy link

Yes, we should be able to deadcode the appropriate code paths if the invariant flags is know at publish time. @erozenfeld can you make sure that we have a backlog item for this?

@tarekgh
Copy link
Member Author

tarekgh commented Mar 20, 2017

@erozenfeld please let me know if you need anything from my side. I'll be happy to help.

jorive pushed a commit to guhuro/coreclr that referenced this pull request May 4, 2017
* Invariant Globalization Work

* Convert the testing Exceptions to asserts

* Remove un-needed comment

* Fix typos

* Fix unrelated typo

* Address the PR feedback

* More feedback addressing

* More feedback addressing

* Fix Linux break

* More feedback addressing

* cleanup
@karelz karelz modified the milestone: 2.0.0 Aug 28, 2017
jkotas added a commit to dotnet/corert that referenced this pull request Oct 3, 2017
dotnet-bot pushed a commit that referenced this pull request Oct 6, 2017
Partial port of #10264

Fixes #4640

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
jkotas added a commit to jkotas/coreclr that referenced this pull request Oct 6, 2017
Partial port of dotnet#10264

Fixes dotnet#4640

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
jkotas added a commit that referenced this pull request Oct 6, 2017
Partial port of #10264

Fixes #4640

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
dotnet-bot pushed a commit to dotnet/corefx that referenced this pull request Jan 13, 2018
Partial port of dotnet/coreclr#10264

Fixes #4640

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
Signed-off-by: dotnet-bot-corefx-mirror <dotnet-bot@microsoft.com>
dotnet-bot pushed a commit to dotnet/corefx that referenced this pull request Jan 13, 2018
Partial port of dotnet/coreclr#10264

Fixes #4640

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
Signed-off-by: dotnet-bot-corefx-mirror <dotnet-bot@microsoft.com>
safern pushed a commit to dotnet/corefx that referenced this pull request Jan 16, 2018
Partial port of dotnet/coreclr#10264

Fixes #4640

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
Signed-off-by: dotnet-bot-corefx-mirror <dotnet-bot@microsoft.com>
safern pushed a commit to dotnet/corefx that referenced this pull request Jan 16, 2018
Partial port of dotnet/coreclr#10264

Fixes #4640

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
Signed-off-by: dotnet-bot-corefx-mirror <dotnet-bot@microsoft.com>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
9 participants