Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Fix long-standing String.GetHashCode x64 early halt bug #229

Closed
wants to merge 1 commit into from

Conversation

IDisposable
Copy link

As documented in connect bug 519104, String.GetHashCode ignores any characters in the string beyond the first null character in x64 runtime. This means that if someone embeds a NULL character in a string, the hash-code will only be computed up to that character. This is wrong in that CLR String instances are Length specified and embedded NULL character are significant and characters after it are important to the hash-code.

NOTE:

  1. The return value in 32-bit builds DOES take all characters into account, so the strings that have a nice distribution in x32 builds might generate horrific collisions in x64 builds.
  2. While as an implementation detail the String buffer is always followed by a NULL character, this is merely to ease the interop issues and has NOTHING to do with how strings should be handled in pure CLR.
  3. Changing the hash-code in a new build is NOT a breaking change, just look at the comment in line 892-896.

As documented in connect bug 519104, String.GetHashCode ignores any characters in the string beyond the first null character in x64 runtime. This means that if someone embeds a NULL character in a string, hash-code will only be computed up to that character. This is wrong in that Strings are Length specified and embedded NULL character **are** significant and characters after it **are** important to the hash-code.

NOTE:
1) The return value in 32-bit builds DOES take all characters into account, so the strings that have a nice distribution in x32 builds might generate horrific collisions in x64 builds.
2) While as an implementation detail the String buffer is always followed by a NULL character, this is merely to ease the interop issues and has NOTHING to do with how strings should be handled in pure CLR.
3) Changing the hash-code in a new build is NOT a breaking change, just look at the comment in line 892-896.
@ellismg
Copy link

ellismg commented Feb 12, 2015

We need to be careful here. While changing the hashcode of string should not be a breaking change, since folks should not be relying on it's stability, in practice we have been bitten by these sorts of changes in the past.

We need to take a careful look at this change in our internal compat lab before merging this.

@poizan42
Copy link

This seems more relevant than ever: http://xkcd.com/1172/

Almost sounds like .NET should start having a compatibility shim system like Windows has it.

@AlexGhiondea AlexGhiondea self-assigned this Feb 12, 2015
@AlexGhiondea
Copy link

@IDisposable thanks for your contribution.

There are a couple of things to call out here:

  • As @ellismg said, while changing the hash code should not be a breaking change, in our experience, it does turn out to be an issue.
  • The code that adds the daily build number to the hash code was meant to help us not take dependencies on it. It is also only present in debug builds.
  • In CoreCLR we are actually moving away from this hash code and into a randomized hash code algorithm (which you can see here: https://github.com/dotnet/coreclr/blob/master/src/vm/marvin32.cpp) . On CoreCLR this is turned on by default, meaning the code you are fixing should not be invoked.

While this is a good change, based on the contribution guidelines outlined here (https://github.com/dotnet/coreclr/wiki/Contribution-guidelines#bug-bar) it does not meet our current bar.

@IDisposable
Copy link
Author

So, just like the connect bug before it, you're going to continue to ignore a REAL ERROR that was documented because it "might" break something for people that are using GetHashCode wrong.

not surprised

I applaud the change to the new algorithm, just make sure you actually implement it correctly on all bit widths this time.

@IDisposable
Copy link
Author

Walking the tangled web of the FEATURE_RANDOMIZED_STRING_HASHING define we go

from src\vm\ecalllist.h: FCFuncElement("InternalMarvin32HashString", COMString::Marvin32HashString)

to src\classlibnative\bcltype\stringnative.cpp: GetCurrentNlsHashProvider()->HashString(..

to src\classlibnative\nls\nlsinfo.cpp: result = curDomain->m_pNlsHashProvider->HashString( or result = COMNlsHashProvider::s_NlsHashProvider.HashString(

to src\vm\comutilnative.cpp: INT32 COMNlsHashProvider::HashString(

The default behavior seems to be in https://github.com/dotnet/coreclr/blob/master/src/vm/comutilnative.cpp#L2860 which does seem to use the string length as it should, but it also looks like it's really up the current NlsHashProvider. I'm wondering what the rationale of something that is this sensitive to change being swappable is...

Also, I wonder about the huge performance cost of all those indirections and the wisdom of something this sensitive to being at the whim of three different defines (x86, x64 or FEATURE_RANDOMIZED_STRING_HASHING) and a runtime flag that can select really randomized or not.

@ellismg
Copy link

ellismg commented Feb 12, 2015

@IDisposable. It's really complexity built up over the years by requests for a bunch of features:

  1. In the Desktop runtime, some customers want to be able to control how sorting works (essentially they have requirements that collation behavior is stable across frameworks and OS versions, which is something the CLR does not want to provide, so we define an interface that allows them to hook this behavior).
  2. We need to add a feature to do this randomized string hashing (after the attacks against ASP.NET with hash flooding). This is problematic in itself because doing so would change the hash codes and unfortunately folks assume these things are going to be stable when they are not. So we have to put it behind some checks by default. But partner teams like ASP.NET would rather always opt into this algorithm, so there needs to be a way to meet their needs. Then you have to rationalize this change with the above interface you added at an earlier period of time.
  3. Since we (at least at one time, I think this may have changed recently) picked up a library to do the Marvin32 hashing which we didn't build ourselves, and built for multiple platforms, we need additional guards.
  4. The randomized string hashing was added rather late in the release cycle where we are under pressure to fix issues quick and mitigate risk. Obviously, these are at tension and we make engineering tradeoffs. The team is small and sometimes we accumulate debt that we'd like to pay down but don't get the chance to.

Could things be better? Most likely. We're not opposed to change. However, it's important to understand that this code is shared with the desktop runtime which is extremely risk sensitive. That's the price you pay when you are installed on billions of machines and you want the flexibility to move folks towards a new version of the framework via Windows Update. So we need to make sure the change is done in a way that mitigates risk.

In mscorlib, we have a way to change the behavior of a method depending on what version of the runtime your application was developed against. It's possible that we could consider under a guard like that. @AlexGhiondea would know better if that's something we could explore.

One issue would be that I believe the bug you point out is also present in the hashing routines in the VM (at a minimum I think this is a problem for case insensitive ordinal hashing), when not using the new hashing routines. So a complete solution would likely have to touch code there as well and be guarded in some way as well.

There have been a bunch of improvements we've wanted to make in the past which would break invariants that folks should not be relying on. Often, we've caused issues and we've had to roll back the fix or figure out a way to make it more targeted. It's not fun and it frustrates us as well, but in some ways this is the price you pay when you are at the foundation of a popular platform.

Hopefully this helps explain our position.

@IDisposable
Copy link
Author

IDisposable commented Feb 12, 2015

I wish we had fixed the original back when I reported it during the beta of x64 runtime so we weren't saddled with the broken implementation all this time.

As for using GetHashCode for ANY kind of ordering, you're doing it wrong. That's not what that method is for... never was. If someone wants linguistically sensitive values for sorting, there's code for that... just ask Michael Kaplan sadly he passed before this got fixed

Thanks for taking the time to spell out the requirements you're working with... this is the sort of thing OPEN source should enable.

@ellismg
Copy link

ellismg commented Feb 13, 2015

Going to close this up.

My comment about linguistic hashing was to the effect that for non ordinal case sensitive hashes, we do most of the work in the VM. I think that for hash codes for linguistic strings we are not susceptible to the null termination issue, as IIRC, we compute the sort key and then hash that (there are some optimizations here but they should be doing the same thing). However, for ordinal ignore case hashing, we essentially ToUpper the string and then compute the hash code in the VM. I sort of remember when I did the Marvin32 work I noticed that all of that code would get tripped up by embedded nulls. This codepath would be hit if you say created a Dictionary with a string as a key and used StringComparer.OrdinalIgnoreCase as your comparer.

Completely agree that if you use GetHashCode for ordering you are asking for trouble.

Thanks for taking the time to understand our position here!

@ellismg ellismg closed this Feb 13, 2015
@jamesqo
Copy link

jamesqo commented Apr 29, 2016

@ellismg What would you say about fixing this under a #if FEATURE_CORECLR, like what you proposed in #4654? As far as I can see, randomized string hashing looks like it's always being set to false at the moment, even for coreclr.

@ellismg
Copy link

ellismg commented Apr 29, 2016

As far as I can see, randomized string hashing looks like it's always being set to false at the moment, even for coreclr.

When spinning up an app domain, we enable it on CoreCLR unless you opt out.

If you wanted to change this for CoreCLR, we could consider it, but I think just pushing folks towards randomized hashing is the right long term move, and that's what we've done.

@jamesqo
Copy link

jamesqo commented Apr 30, 2016

@ellismg Maybe if #4696 is merged, we could simply fix the GetLegacyNonRandomizedHashCode for coreclr builds? This would kill 2 birds with 1 stone, since it would fix this for both 1) strings getting hashed by Dictionary, which uses non-randomized hashing by default and 2) users who choose to opt-out of randomized hashing.

@IDisposable
Copy link
Author

Disappointed this is still being debated... as if fixing this bug will hurt anyone.

@jamesqo
Copy link

jamesqo commented Apr 30, 2016

@IDisposable It very well could; under normal circumstances GetHashCode shouldn't be guaranteed to have a certain output, but these sources are shared with the full framework (which has a big emphasis on compatibility) and String is a pretty widely used class, so you never know if there might be an app out there depending on this behavior (however bizarre that would be). That's why I'm proposing to fix it only under coreclr.

@GSPP
Copy link

GSPP commented May 2, 2016

Do I understand correctly that CoreClr will by default use randomized hashing? That is great news because that means that users cannot take a bad dependency on stable hash codes. Any such reliance will be detected at the next application test run when their app fails.

This should have been done on Desktop starting with 1.0. Maybe we can move to that scheme eventually on Desktop as well? Maybe at the next major release (5.0?) with compatibility switches.

@ellismg
Copy link

ellismg commented May 2, 2016

@GSPP Correct. In fact, this behavior is in the product today:

[matell@matell-rhel hello]$ cat Program.cs
using System;

namespace ConsoleApplication
{
    public class Program
    {
        public static void Main(string[] args)
        {
            Console.WriteLine("Hello World!".GetHashCode());
        }
    }
}
[matell@matell-rhel hello]$ dotnet run
Project hello (.NETCoreApp,Version=v1.0) was previously compiled. Skipping compilation.
-512602046
[matell@matell-rhel hello]$ dotnet run
Project hello (.NETCoreApp,Version=v1.0) was previously compiled. Skipping compilation.
-1394896098
[matell@matell-rhel hello]$

Maybe we can move to that scheme eventually on Desktop as well? Maybe at the next major release (5.0?) with compatibility switches.

This is the long term plan. There are switches to force code in an app domain to use randomized hashing for string.GetHashCode() (ASP.NET uses this to protect against hash flooding).

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
7 participants