
Investigate possibilities to improve performance of Object.hash() #50693

Open
2 of 11 tasks
mkustermann opened this issue Dec 12, 2022 · 4 comments
Labels
area-vm, type-performance

Comments

@mkustermann
Member

mkustermann commented Dec 12, 2022

We should investigate whether we can improve the performance of our main corelib API for combining hashcodes, namely Object.hash.

  • Make benchmarks of Object.hash()
  • Avoid the overhead of optional parameters, especially for a small number of arguments; e.g. inlining may avoid the cost and also allow de-virtualizing some obj<X>.hashCode calls at the call site.
    => An Object.hash(a, b) call should be as efficient as finalizeHash(combineHash(a.hashCode, b.hashCode)) (see the sketch after this list)
  • Try to avoid optional parameters in SystemHash.hashX()
  • Consider taking the full 64 bits into account when combining hashes (currently it seems to take only the lower 29 bits into account)
    => Avoiding the bitwise-and operations may also speed it up.
  • Possibly optimize generated code and/or evaluate different algorithm.
  • Make it deterministic: Object identity hashes are already based on random numbers. For objects with custom / deterministic hash codes, it should be just fine to also produce a deterministic hash code. It seems unnecessary to introduce yet another random element in the function that combines hash codes.
    => This may also avoid the overhead of the initial combine(seed, obj1.hashCode).
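For reference, a minimal sketch of the combine/finalize pattern the points above refer to. This is not necessarily the SDK's exact SystemHash code; the constants are the commonly used Jenkins-style ones, and the 0x1fffffff masks are what limits the result to the lower 29 bits mentioned above. The names combineHash, finalizeHash, and hashPair follow the wording in the list and are illustrative.

// A minimal sketch of a Jenkins-style combiner; not the SDK's actual code.
int combineHash(int hash, int value) {
  hash = 0x1fffffff & (hash + value);
  hash = 0x1fffffff & (hash + ((0x0007ffff & hash) << 10));
  return hash ^ (hash >> 6);
}

int finalizeHash(int hash) {
  hash = 0x1fffffff & (hash + ((0x03ffffff & hash) << 3));
  hash = hash ^ (hash >> 11);
  return 0x1fffffff & (hash + ((0x00003fff & hash) << 15));
}

// The goal stated above: Object.hash(a, b) should ideally be as cheap as
// writing the combination out by hand, e.g.
int hashPair(Object a, Object b) =>
    finalizeHash(combineHash(a.hashCode, b.hashCode));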

Ideally we'll end up with a core library API that we're willing to use ourselves in our own code, e.g. replacing existing hand-written hash-combining helpers.

Arguably some implementations want to call combine on a runtime-dependent number of objects and only call finish afterwards. That would make it hard to use Object.hash() as an API (see e.g. fidl).
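For illustration, this is the kind of call site meant here: combining a runtime-dependent number of elements incrementally and finalizing once at the end. The sketch reuses the combineHash/finalizeHash helpers sketched above; hashElements is a hypothetical name.

// Hypothetical: a hash over a runtime-dependent number of elements,
// combined incrementally and finalized exactly once.
int hashElements(List<Object?> elements) {
  var hash = 0;
  for (final element in elements) {
    hash = combineHash(hash, element.hashCode);
  }
  return finalizeHash(hash);
}

A fixed-arity Object.hash(a, b, ...) cannot express this directly, which is the concern raised above.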

/cc @aam

@mkustermann added the area-vm and type-performance labels Dec 12, 2022
@mkustermann
Member Author

/cc @lrhn @rakudrama

@lrhn
Member

lrhn commented Dec 16, 2022

As mentioned elsewhere, the goal should be to be as efficient as

finalizeHash(combineHash(combineHash(0, a.hashCode), b.hashCode))

There should be a combineHash call per actual hash code; skipping the first one risks losing some entropy if the first object's hash code is not perfectly distributed.

Providing a seed, instead of using 0, should not affect performance much, but obviously it needs to come from somewhere. Clearing a register using xor is cheaper than loading from memory, but I don't think that's going to be a measurable part of the computation time of something which calls multiple hashCode getters.

The current implementation is shared between platforms. We could make all the public methods external and allow platforms to specialize any way they want to. There should be no issue with that.

A platform could also choose to use a randomized seed in debug mode, to ensure that users do not end up relying on, say, a particular ordering of a hash map. In production mode, that value could be fixed, or even zero, instead.
I'm not sold on determinism. It comes with a cost too. And I'd rather force nondeterminism into (at least) debug builds, than allow users to start thinking that the current value is somehow prescribed.
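As a rough sketch of that idea (the dart.vm.product environment flag is the usual way to detect a product build; the name hashSeed and the seed range are illustrative):

import 'dart:math';

// Hypothetical: a random seed in non-product (debug/JIT) runs, and a fixed
// zero seed in product builds, so nobody can come to rely on a particular
// hash ordering while developing.
final int hashSeed = const bool.fromEnvironment('dart.vm.product')
    ? 0
    : Random().nextInt(0x1fffffff);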

If we even consider using a different algorithm, then we should definitely make sure that we document that the hash value is not stable between executions. We've so far ensured that by using a random seed for user calls.

The optional seed parameters on SystemHash.hashX don't need to be optional. It's internal-only code, so we can just make them required, and add , 0 to the calls that don't pass an argument.
I'd encourage compilers to try to inline both Object.hash and those methods.
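A hypothetical before/after shape of such a method, reusing the combineHash/finalizeHash helpers sketched earlier (the real SystemHash signatures may differ):

// Before: the seed is optional, so every call pays for default-argument handling.
int hash2WithOptionalSeed(int v1, int v2, [int seed = 0]) =>
    finalizeHash(combineHash(combineHash(seed, v1), v2));

// After: the seed is required; internal callers that used to omit it pass 0 explicitly.
int hash2WithRequiredSeed(int v1, int v2, int seed) =>
    finalizeHash(combineHash(combineHash(seed, v1), v2));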

If it's a problem for the VM that someone might get access to the sentinel value used to detect the actual number of arguments passed, I wouldn't worry overly about it. If Object.hash(o1, o2) does not give the same result as Object.hash(o1, o2, leakedSentinel), even though the library source code looks like it should, just say that the library source code is not normative. The documentation is; the implementation we allow people to see is at most a suggestion.
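For readers unfamiliar with the sentinel trick being referred to, a hypothetical sketch (the names _absent and hashUpTo3 are made up for illustration): a private default value marks "argument not passed", so one method can accept a varying number of arguments, and passing that value explicitly would make a 3-argument call behave like a 2-argument one.

// Hypothetical sentinel-based arity detection, reusing the helpers above.
const Object _absent = Object();

int hashUpTo3(Object? o1, Object? o2, [Object? o3 = _absent]) {
  var hash = combineHash(combineHash(0, o1.hashCode), o2.hashCode);
  if (!identical(o3, _absent)) {
    hash = combineHash(hash, o3.hashCode);
  }
  return finalizeHash(hash);
}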

copybara-service bot pushed a commit that referenced this issue Jan 2, 2023
Issue #50693

Change-Id: Ib587b70bcb57cbd2d16319b7814e2569c7e41213
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/276161
Commit-Queue: Lasse Nielsen <lrn@google.com>
Reviewed-by: Martin Kustermann <kustermann@google.com>
@mkustermann
Member Author

... skipping the first one risks losing some entropy if the first object's hash code is not perfectly distributed.

Aren't we assuming that any obj.hashCode is well distributed?

Users write code such as

class Foo {
  final a;

  Foo(this.a);

  int get hashCode => a.hashCode;
  bool operator ==(Object other) =>
      identical(this, other) || other is Foo && a == other.a;
}

Once they add a field, it changes to

class Foo {
  final a;
  final b;

  Foo(this.a, this.b);

  int get hashCode => Object.hash(a, b);
  bool operator ==(Object other) =>
      identical(this, other) || other is Foo && a == other.a && b == other.b;
}

In fact, Object.hash() requires at least two objects to hash, so it encourages this code.

Given the above, the assumption is that a.hashCode is well distributed.

If the primitives in the core library have well-distributed .hashCode implementations (int, double, String, ...) and anything that combines such hash codes uses our Object.hash() / ... functions, things are always well distributed.

@lrhn
Member

lrhn commented Jan 4, 2023

Aren't we assuming that any obj.hashCode is well distributed?

That has historically not been a sound assumption.
The fact that we've now fixed int.hashCode doesn't give me faith that there are no other bad hash codes out there.

Consider a class like:

class SomethingEntity {
  static int _counter = 0;
  // Each instance has a unique ID.
  final int id = _counter++;
  // More efficient than identityHashCode.
  int get hashCode => id;
  // Still use identity for equality.
  bool operator ==(Object other) => identical(this, other);
}

That seems like a safe and correct hashCode implementation, but it's not well distributed.
It's something reasonable people will do.

(I can see we've changed a bunch of => id; into => id.hashCode;, but we haven't done them all, and third-party code probably hasn't either.)
