Fix long-standing String.GetHashCode x64 early halt bug #229
As documented in Connect bug 519104, String.GetHashCode ignores any characters in the string beyond the first null character in the x64 runtime. This means that if someone embeds a NULL character in a string, the hash-code will only be computed up to that character. This is wrong: Strings are Length-specified, so embedded NULL characters **are** significant and the characters after them **are** important to the hash-code. NOTE:
1) The return value in 32-bit builds DOES take all characters into account, so strings that have a nice distribution in 32-bit builds might generate horrific collisions in x64 builds.
2) While as an implementation detail the String buffer is always followed by a NULL character, this is merely to ease interop issues and has NOTHING to do with how strings should be handled in pure CLR.
3) Changing the hash-code in a new build is NOT a breaking change; just look at the comment in lines 892-896.
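To make the failure mode concrete, here is a minimal C++ sketch of the two loop shapes being contrasted. It is an illustration only: the mixing step is a simple FNV-1a-style multiply/xor chosen for brevity, not the CLR's hash function, and the names `hash_until_null` and `hash_by_length` are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Illustrative mixing constants (FNV-1a style); NOT the CLR's algorithm.
constexpr std::uint32_t kOffsetBasis = 2166136261u;
constexpr std::uint32_t kPrime = 16777619u;

// Buggy pattern: treats the buffer as null-terminated, so it stops at the
// first zero code unit and never sees anything after an embedded NULL.
std::uint32_t hash_until_null(const char16_t* chars) {
    std::uint32_t hash = kOffsetBasis;
    for (const char16_t* p = chars; *p != 0; ++p) {
        hash = (hash ^ *p) * kPrime;
    }
    return hash;
}

// Length-based pattern: every code unit up to `length` contributes,
// including embedded NULLs and whatever follows them.
std::uint32_t hash_by_length(const char16_t* chars, std::size_t length) {
    std::uint32_t hash = kOffsetBasis;
    for (std::size_t i = 0; i < length; ++i) {
        hash = (hash ^ chars[i]) * kPrime;
    }
    return hash;
}
```

With two length-specified strings such as `std::u16string(u"ab\0cd", 5)` and `std::u16string(u"ab\0ce", 5)`, the null-terminated loop returns the same value for both because it only ever sees `ab`, while the length-based loop distinguishes them.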
We need to be careful here. While changing the hash code of string should not be a breaking change, since folks should not be relying on its stability, in practice we have been bitten by these sorts of changes in the past. We need to take a careful look at this change in our internal compat lab before merging this. |
This seems more relevant than ever: http://xkcd.com/1172/ It almost sounds like .NET should have a compatibility shim system like the one Windows has. |
@IDisposable thanks for your contribution. There are a couple of things to call out here:
While this is a good change, based on the contribution guidelines outlined here (https://github.com/dotnet/coreclr/wiki/Contribution-guidelines#bug-bar) it does not meet our current bar. |
So, just like the Connect bug before it, you're going to continue to ignore a REAL ERROR that was documented because it "might" break something for people that are using GetHashCode wrong. Not surprised. I applaud the change to the new algorithm; just make sure you actually implement it correctly on all bit widths this time. |
Walking the tangled web of the FEATURE_RANDOMIZED_STRING_HASHING define, we go from src\vm\ecalllist.h: to src\classlibnative\bcltype\stringnative.cpp: to src\classlibnative\nls\nlsinfo.cpp: to src\vm\comutilnative.cpp: The default behavior seems to be in https://github.com/dotnet/coreclr/blob/master/src/vm/comutilnative.cpp#L2860 which does seem to use the string length as it should, but it also looks like it's really up to the current NlsHashProvider. I'm wondering what the rationale is for making something this sensitive to change swappable... Also, I wonder about the performance cost of all those indirections, and the wisdom of leaving something this sensitive at the whim of three different defines (x86, x64 or FEATURE_RANDOMIZED_STRING_HASHING) and a runtime flag that selects whether hashing is really randomized or not. |
@IDisposable. It's really complexity built up over the years by requests for a bunch of features:
Could things be better? Most likely. We're not opposed to change. However, it's important to understand that this code is shared with the desktop runtime, which is extremely risk sensitive. That's the price you pay when you are installed on billions of machines and you want the flexibility to move folks towards a new version of the framework via Windows Update. So we need to make sure the change is done in a way that mitigates risk. In mscorlib, we have a way to change the behavior of a method depending on what version of the runtime your application was developed against. It's possible that we could consider the change under a guard like that. @AlexGhiondea would know better if that's something we could explore. One issue is that I believe the bug you point out is also present in the hashing routines in the VM (at a minimum I think this is a problem for case-insensitive ordinal hashing) when not using the new hashing routines. So a complete solution would likely have to touch code there as well, and be guarded in some way too. There have been a bunch of improvements we've wanted to make in the past which would break invariants that folks should not be relying on. Often, we've caused issues and we've had to roll back the fix or figure out a way to make it more targeted. It's not fun and it frustrates us as well, but in some ways this is the price you pay when you are at the foundation of a popular platform. Hopefully this helps explain our position. |
I wish we had fixed the original back when I reported it during the beta of x64 runtime so we weren't saddled with the broken implementation all this time. As for using Thanks for taking the time to spell out the requirements you're working with... this is the sort of thing OPEN source should enable. |
Going to close this up. My comment about linguistic hashing was to the effect that for non ordinal case sensitive hashes, we do most of the work in the VM. I think that for hash codes for linguistic strings we are not susceptible to the null termination issue, as IIRC, we compute the sort key and then hash that (there are some optimizations here but they should be doing the same thing). However, for ordinal ignore case hashing, we essentially ToUpper the string and then compute the hash code in the VM. I sort of remember when I did the Marvin32 work I noticed that all of that code would get tripped up by embedded nulls. This codepath would be hit if you say created a Dictionary with a string as a key and used Completely agree that if you use GetHashCode for ordering you are asking for trouble. Thanks for taking the time to understand our position here! |
@ellismg What would you say about fixing this under a |
When spinning up an app domain, we enable it on CoreCLR unless you opt out. If you wanted to change this for CoreCLR, we could consider it, but I think just pushing folks towards randomized hashing is the right long term move, and that's what we've done. |
@ellismg Maybe if #4696 is merged, we could simply fix the |
Disappointed this is still being debated... as if fixing this bug will hurt anyone. |
@IDisposable It very well could; under normal circumstances |
Do I understand correctly that CoreCLR will use randomized hashing by default? That is great news, because it means that users cannot take a bad dependency on stable hash codes. Any such reliance will be detected at the next application test run when their app fails. This should have been done on Desktop starting with 1.0. Maybe we can move to that scheme eventually on Desktop as well? Maybe at the next major release (5.0?) with compatibility switches. |
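As a rough illustration of why randomized hashing surfaces any reliance on stable hash codes, the C++ sketch below mixes a seed into the starting state of a simple hash. Everything in it is an assumption made for illustration: `seeded_hash` and `make_seed` are hypothetical names, the mixing is FNV-1a style rather than the runtime's actual algorithm, and a real runtime would pick its seed once per app domain rather than per call.

```cpp
#include <cstdint>
#include <random>
#include <string>

// Hypothetical helper: draws a fresh seed, standing in for whatever seed a
// runtime would choose when an app domain starts.
std::uint32_t make_seed() {
    std::random_device device;
    return static_cast<std::uint32_t>(device());
}

// Illustrative seeded hash (FNV-1a-style mixing); NOT the runtime's algorithm.
// The seed perturbs the initial state, and iterating over the string by its
// length means embedded NULLs are hashed like any other code unit.
std::uint32_t seeded_hash(const std::u16string& text, std::uint32_t seed) {
    std::uint32_t hash = 2166136261u ^ seed;
    for (char16_t unit : text) {
        hash = (hash ^ unit) * 16777619u;
    }
    return hash;
}
```

For a fixed seed the function is deterministic, so dictionaries keep working within one run; with a different seed the same string hashes to a different value, which is what breaks code that persisted or hard-coded a hash code from an earlier run.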
@GSPP Correct. In fact, this behavior is in the product today:
This is the long term plan. There are switches to force code in an app domain to use randomized hashing for |