
Protect Dictionary<,> against hash collision flood attacks #4761

Open
jskeet opened this issue Dec 9, 2015 · 21 comments
Assignees
Labels
area-System.Collections, enhancement (Product code improvement that does NOT require public API changes/additions)
Milestone

Comments

@jskeet

jskeet commented Dec 9, 2015

Java 8 introduced a new feature into HashMap whereby when a significant number of keys fall into the same hash bucket and the key type implements Comparable<T> (broadly equivalent to .NET's IComparable<T>), the bucket is converted into a tree using that ordering.
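The behaviour described above can be observed directly: a key type whose hashCode always collides but which implements Comparable still allows fast insertion, because Java 8's HashMap converts the overloaded bucket into a red-black tree. A minimal sketch (class names here are illustrative):

```java
import java.util.HashMap;

// Every BadKey collides into one bucket, but because BadKey implements
// Comparable<BadKey>, Java 8's HashMap converts that bucket into a
// red-black tree, so inserts stay O(log n) instead of O(n).
final class BadKey implements Comparable<BadKey> {
    final int id;
    BadKey(int id) { this.id = id; }
    @Override public int hashCode() { return 42; }   // deliberate total collision
    @Override public boolean equals(Object o) {
        return o instanceof BadKey && ((BadKey) o).id == id;
    }
    @Override public int compareTo(BadKey other) { return Integer.compare(id, other.id); }
}

public class TreeifyDemo {
    public static void main(String[] args) {
        HashMap<BadKey, Integer> map = new HashMap<>();
        for (int i = 0; i < 50_000; i++) {
            map.put(new BadKey(i), i);               // completes quickly despite collisions
        }
        System.out.println(map.size());              // prints "50000"
    }
}
```

The same loop against a map whose keys lack Comparable degrades to a linear scan per insert, which is exactly the flooding behaviour the issue describes.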

This is particularly relevant when creating a map/dictionary with keys coming in over the network from an untrusted source. An attacker with appropriate knowledge of the hash code algorithm used can send N keys which require O(N²) time to insert, as each insertion currently requires O(N) time due to a linear search of the bucket. Even if the bucket only contains a single hash code, multiple keys can share that exact hash code - and in many cases it's trivial to construct such keys. Using the tree approach, the per-key lookup time becomes O(log N), leading to an overall dictionary construction time of O(N log N).
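To make "trivial to construct such keys" concrete: Java's Long.hashCode(v) is (int)(v ^ (v >>> 32)), essentially the same high/low folding that .NET's long.GetHashCode performs, so an attacker can cancel the two halves and produce unlimited distinct keys with an identical hash code. A sketch:

```java
import java.util.stream.LongStream;

public class CollidingKeys {
    // For any 32-bit i, the key (i << 32) | i has equal high and low halves,
    // so v ^ (v >>> 32) leaves only high bits, and the truncation to int is 0.
    static long collidingKey(long i) {
        return (i << 32) | i;
    }

    public static void main(String[] args) {
        boolean allZero = LongStream.range(0, 100_000)
                .map(CollidingKeys::collidingKey)
                .allMatch(k -> Long.hashCode(k) == 0);
        System.out.println(allZero);                 // prints "true"
    }
}
```

A hundred thousand distinct keys, every one hashing to 0 - exactly the degenerate single-bucket case described above.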

There are downsides to this approach - anything implementing the comparison in a way which is inconsistent with Equals can lead to keys not being found. Introducing this as a default implementation would therefore be a breaking change, strictly speaking (although unlikely to hit users who don't deliberately seek it out). It would be really nice to have support for this within the BCL however, either as a separate IDictionary<,> implementation, or an option when creating a Dictionary<,>.

@jskeet
Author

jskeet commented Dec 9, 2015

It's good to see that for strings, but it's entirely possible for requests to include dictionaries with other key types. The scenario I'm currently interested in is using Protocol Buffers for RPC requests, but I'm sure that's not the only plausible scenario. Imagine a Dictionary<long, string> - that would have the same problem. Now it's possible that HashHelpers could be modified to handle all the numeric types and provide randomized equality comparers for all of those... but it's not ideal for it to require BCL support like that.

(It does suggest another approach that could be used, mind you - one where I create my own randomized equality comparers for each of the key types I'm interested in.)
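That comparer-per-key-type idea can be sketched as follows - shown in Java for illustration, mixing a per-process random seed into a SplitMix64-style finalizer (all names here are hypothetical, not an existing API):

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch of a per-process randomized hash for 64-bit keys: a random seed is
// added and then diffused by a SplitMix64-style finalizer, so a collision
// pattern learned against one process doesn't transfer to another.
final class RandomizedLongHasher {
    private static final long SEED = ThreadLocalRandom.current().nextLong();

    static int hash(long v) {
        long z = v + SEED;
        z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
        z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
        return (int) (z ^ (z >>> 31));
    }
}
```

In .NET the same idea would live in an IEqualityComparer<TKey> whose GetHashCode applies the seeded mix, passed to the Dictionary<,> constructor.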

@benaadams
Member

I did implement a randomly seeded hasher, though it has a much smaller surface area (the collection is bounded); it may be helpful in the interim until you get a better response - it works on uint/byte chunks.

https://github.com/benaadams/KestrelHttpServer/blob/479a54634e7c789bc314358630c3df4c73e68667/src/Microsoft.AspNet.Server.Kestrel/Infrastructure/MemoryPoolIterator2Extensions.cs#L14-L61

Useful in that no two servers will have the same pattern and the collisions will differ between runs, so attack learnings aren't transferable.
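In the same spirit (not the linked implementation itself, just an illustration of the idea), a byte-chunk hasher with a randomized basis can look like this - here an FNV-1a-style loop whose offset basis is perturbed by a per-process random seed:

```java
import java.security.SecureRandom;

// Illustrative FNV-1a-style hash over a byte stream with a per-process
// random offset basis, so the collision pattern differs between runs.
final class SeededByteHasher {
    private static final int SEED = new SecureRandom().nextInt();

    static int hash(byte[] data) {
        int h = 0x811c9dc5 ^ SEED;                 // FNV offset basis XOR random seed
        for (byte b : data) {
            h = (h ^ (b & 0xff)) * 0x01000193;     // FNV-1a prime multiply
        }
        return h;
    }
}
```

Because the seed is folded in before any input bytes, two runs of the same process image produce unrelated collision sets, which is the property being described.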

@OtherCrashOverride

The scenario I'm currently interested in is using Protocol Buffers for RPC requests

Why not just override Object.GetHashCode to provide your own non-predictable, custom-tailored values?

Alternatively, since it has the potential to break existing code, why not make it an entirely new class that implements IDictionary<K, V>?

What is the performance impact of this change for varying set sizes and complexity, including value type vs reference type keys? Comparing O(N^2) and O(N log N) is only meaningful when the work involved per step is comparable. It's entirely possible (and does happen, as is evident in sort algorithms) that the O(N^2) approach may actually be faster.

@jskeet
Author

jskeet commented Dec 9, 2015

The keys I'm interested in are strings, the ByteString class (which is mine) and various numeric types. So I could definitely implement an IEqualityComparer for those which uses a random seed - it just feels like it would be nicer if this were taken into consideration in the dictionary itself. (In particular, I don't want to rely on the "sometimes on" prevention behaviour that already exists, which means I need to implement my own string hashing, which is unlikely to be as fast or as good as the framework code.)

In terms of the O(N^2) vs O(N log N), aside from possible locality of reference issues, I would expect this to be faster than the current implementation in situations where it triggers at all (i.e. large hash buckets). I can't immediately see any reason why it would be worse, although obviously performance can be very hard to predict. Of course, the detection of the situation has some impact as well. I wouldn't expect any boxing to occur, assuming we really detected IComparable<T> rather than just IComparable.

@OtherCrashOverride

Another issue to consider is that it would change the enumeration order for items stored in the Dictionary. Dictionary (last I checked) does not provide any guarantees about ordering; however, that does not mean changing it will not break code. This is a real-world issue that was filed against Mono's implementation of Dictionary.

The performance profile is the most important thing to establish, in my opinion. We need some real-world data to show what benefit or regression it will offer. For those not using Dictionary as an ad hoc database, it's important this does not slow them down. There is a lot of code and algorithms out there using very small Dictionaries that do not have key collision issues; they are often part of high-traffic APIs, so even a small increase in time would add up very quickly.

@ellismg
Contributor

ellismg commented Dec 14, 2015

I think this is a reasonable idea to explore. The general push-back would be around compatibility. As you can see from the existing code we only cut over to the different hashing algorithms when we have good reason to believe we are under a hash-flooding attack and not as an opportunity to improve performance. We would also need to make sure to do this only in cases we were reasonably sure that moving to IComparable<T> was safe (e.g. default equality comparer is in use or the comparer object also implements IComparable<T>).

It's a bit of a bummer that the Dictionary<TKey, TValue> implementation lives so low in the stack. We'd have to pull in some sort of balanced binary tree to do this the right way, but at the same time we're trying to move code out of mscorlib instead of adding a bunch more. It may make sense to hold off on this until we can pull all the generic collections out of mscorlib.dll and into their own assembly for .NET Core.

FWIW, I believe we have plans to expose the randomized string hashing equality comparer publicly so you can ensure you always use it when you care about these sorts of things. When we do that work, it probably also makes sense to figure out how to expose the primitives in a way that lets you hash whatever binary stream of data you have. I think doing this will be easier from an engineering point of view than adding a bunch more smarts to Dictionary<TKey, TValue>, and it would let you build a resilient IEqualityComparer<T>.

@jskeet
Author

jskeet commented Dec 14, 2015

I agree with the trickiness being on the compatibility front, unless it was opt-in via a factory method or new constructor parameter.

I believe that there are at least theoretical hash flooding attacks with any hashing implementation, but I'll be the first to admit I'm out of my depth here - and it certainly feels like a randomized hash is better than nothing for hash flood prevention. Thanks for thinking about it, and let me know anything else you'd like me to contribute to the matter - bearing in mind my position of relative ignorance on security matters.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 30, 2020
@msftgits msftgits added this to the Future milestone Jan 30, 2020
@maryamariyan maryamariyan added the untriaged label Feb 26, 2020
@eiriktsarpalis
Member

Given this thread hasn't been updated in a while, @GrabYourPitchforks could you comment on what the current state of dotnet is w.r.t. randomizing hashcode generation? My understanding is that it has been added in quite a few places.

@eiriktsarpalis eiriktsarpalis removed the untriaged label May 6, 2020
@GrabYourPitchforks
Member

On .NET Core, string.GetHashCode is always randomized, regardless of whether it's used in a dictionary.

On .NET Full Framework, string.GetHashCode is not randomized by default. However, Dictionary<string, ...> (and other similar in-box collection types) are always protected against hash collision attacks as long as the dictionary is instantiated with the null comparer or one of the in-box StringComparer implementations.

For Dictionary<TKey, TValue> where TKey is anything other than string, there are no in-box protections across either .NET Core or .NET Framework against hash collision attacks. Either TKey.GetHashCode needs to be randomized, or the entity which instantiated the dictionary instance needs to pass a randomizing comparer. As Jon said, this implies that it is not safe to instantiate a Dictionary<long, ...> and to populate it with untrusted input. We check first-party code for these patterns during security review. And if somebody discovers that any first-party Microsoft code is susceptible to this attack they can disclose it responsibly and the report should qualify for a bounty. But it's unlikely that third-party code follows the same rigorous review process.

Personally I'd like to see this added as a defense-in-depth measure. Given that there's likely considerable third-party usage of Dictionary<TKey, ...> where TKey is a non-string primitive, this would give that code a bit of extra protection.

@eiriktsarpalis
Member

eiriktsarpalis commented May 7, 2020

@GrabYourPitchforks would randomizing hashcode generation of, say, long be considered a breaking change? We already have a unary HashCode.Combine method which could perhaps be used for this purpose?

Another way of phrasing this: is there any case (other than perhaps System.Int32 itself) where deterministic hashcode generation is a desirable property?

@benaadams
Member

Another way of phrasing this, is there any case (other than perhaps System.Int32 itself) where deterministic hashcode generation is a desirable property?

The current performance of the hashcode generation for the primitive ValueTypes is a desirable property

@eiriktsarpalis
Member

The current performance of the hashcode generation for the primitive ValueTypes is a desirable property

Would combining a primitive with a precomputed random seed have an observable perf impact when calling GetHashCode()?

@benaadams
Member

benaadams commented May 7, 2020

Would combining a primitive with a precomputed random seed have an observable perf impact when calling GetHashCode()?

That's not what the unary HashCode.Combine does.

But yes, simply XORing all hashcodes with a random readonly static inside Dictionary (and other hash-based types) probably wouldn't have a huge impact, and isolating it inside these data structures rather than changing .GetHashCode for the types would likely minimise any breaks; it would also give a degree of per-process isolation for hashcode collisions (e.g. a sequence causing collisions in one instantiation of an exe wouldn't immediately be applicable to another).

@eiriktsarpalis
Member

probably wouldn't have a huge impact and isolating it inside these datastructures rather than changing .GetHashCode for the types would likely minimise any breaks

Agreed. The obvious downside is that the mitigation only works with the data structures we apply it to. So it raises the question: should we restrict this to Dictionary only, or apply it to other commonly used types as well?

I suppose another thing we have to be careful about is to apply a transformation that mitigates these types of attacks without hurting the distribution characteristics of the original hashcode.

@GrabYourPitchforks
Member

You don't need to change the hash code calculation characteristics of any of the built in types to pull this off. The change can be made entirely within dictionary and related types.

In Full Framework all dictionary-like types have this same protection built in for strings. HashSet, SortedList, etc. It's around 7 or 8 types total I believe.

@GrabYourPitchforks
Member

Also, using a random 32-bit seed and XORing it with TKey.GetHashCode isn't a viable mechanism.

Keep in mind we don't have to solve this right now. It's not slated for 5.0.

@markusschaber

You don't need to change the hash code calculation characteristics of any of the built in types to pull this off. The change can be made entirely within dictionary and related types.

Hmm... when the key type produces the same hash code from GetHashCode(), any random value the Dictionary applies after the fact will not resolve the collision.
So mixing a static nonce into the container will not protect against attacks which manage to produce the same GetHashCode() result, which should be trivial e.g. for GUIDs or longs.

A generic solution would be a separate GetHashCode(int nonce) overload, but that would essentially amount to a parallel implementation of the whole GetHashCode() ecosystem, and probably some performance loss - the GetHashCode implementations would need to be a bit more complex, as the trivial XOR-based implementations will not fix the problem.
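The objection above can be demonstrated in a couple of lines (Java shown for illustration; Long.hashCode folds the halves the same way as .NET's long.GetHashCode): once two keys already share a hash code, XORing in any container-level nonce leaves them colliding.

```java
public class XorNonceDemo {
    // A container-level nonce is applied after the key's own hash, so equal
    // hashes remain equal: h ^ nonce collides whenever h collides.
    static int seededHash(long key, int nonce) {
        return Long.hashCode(key) ^ nonce;
    }

    public static void main(String[] args) {
        long a = (7L << 32) | 7L;                  // Long.hashCode(a) == 0
        long b = (9L << 32) | 9L;                  // Long.hashCode(b) == 0
        int nonce = 0x5DEECE66;                    // arbitrary illustrative value
        System.out.println(seededHash(a, nonce) == seededHash(b, nonce)); // prints "true"
    }
}
```

A real mitigation has to mix the seed into the hash computation itself, before the input is consumed, rather than transform the finished hash code.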

@theodorzoulias
Contributor

I have a question. AFAIK Dictionary<string, TValue> is protected against HashDoS attacks. Is it also protected when the dictionary is constructed with a custom IEqualityComparer<string> implementation that delegates to EqualityComparer<string>.Default? Or does this indirection invalidate the protection?

class MyEqualityComparer : IEqualityComparer<string>
{
    public bool Equals(string x, string y) => EqualityComparer<string>.Default.Equals(x, y);
    public int GetHashCode(string x) => EqualityComparer<string>.Default.GetHashCode(x);
}

Is the dictionary below protected from attacks?

Dictionary<string, int> dict = new(new MyEqualityComparer());

If not, is there any easy way to implement a custom IEqualityComparer<string> that maintains the protection?

@benaadams
Member

I have a question. AFAIK the Dictionary<string, TValue> is protected against HashDoS attacks. Is it also protected in case the dictionary is constructed with a custom IEqualityComparer<string> implementation, that delegates to the EqualityComparer<string>.Default? Or this indirection invalidates the protection?

String.GetHashCode has the protection, so you should be fine.

@GrabYourPitchforks
Member

So mixing a static nonce into the container will not protect against attacks which manage to produce the same GetHashCode() result, which should be trivial e.g. for GUIDs or longs.

Correct. That's what I meant by this comment. I don't really see us changing the implementation of any of the existing primitive GetHashCode routines since people currently depend on them to be fast and predictable.

A parallel implementation of GetHashCode is what we prototyped earlier and it does indeed work. It's also not as invasive as you might think. One needs to special-case some commonly used structs (Guid, long, DateTime, etc.); enlighten ValueTuple, Tuple, and HashCode to understand these special cases; and wire the whole thing up through Dictionary and friends. That's sufficient to address all the use cases we see in practice. There might be some niche use cases which aren't covered by this, but those should be vanishingly rare.


10 participants