Fix MessageRouter hash inconsistent on C++/Java client #1029

Merged
merged 14 commits into from
Jan 31, 2018

Conversation

@Licht-T Licht-T commented Jan 4, 2018

Motivation

This closes #1017.

Modifications

  • Implement MurmurHash3 32 and use this when partitioning on C++ client.
  • Use MurmurHash3 32, implemented by Google Guava, on Java client.
  • Add hash select option.

Result

The hash inconsistency between the clients is resolved.

@merlimat merlimat added the type/feature The PR added a new feature or issue requested a new feature label Jan 4, 2018
@merlimat merlimat added this to the 1.22.0-incubating milestone Jan 4, 2018
@@ -69,6 +69,9 @@ class ProducerConfiguration {
ProducerConfiguration& setMessageRouter(const MessageRoutingPolicyPtr& router);
const MessageRoutingPolicyPtr& getMessageRouterPtr() const;

ProducerConfiguration& setUseMurmurHash(const bool useMurmurHash);
Contributor

I would prefer to have an enum type to select the default hashing scheme ("default", since it can always be overridden in a custom message router).

Something like:

ProducerConfiguration& setDefaultHashingScheme(HashingScheme hashingScheme);

This way we could provide the same hash functions across C++ and Java, including the existing ones (e.g. JavaStringHash).
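
A minimal sketch of the enum-based option suggested above, written in Java for illustration. The `HashingScheme` name and the constants `JavaStringHash` and `Murmur3_32Hash` appear in this PR; the `ProducerConfigurationSketch` class, its accessor names, and the choice of default are assumptions, not the merged API.

```java
// Hypothetical sketch; class and method names are illustrative only.
enum HashingScheme {
    JavaStringHash,   // pre-existing behaviour based on String.hashCode()
    Murmur3_32Hash    // MurmurHash3 32-bit, the scheme added by this PR
}

class ProducerConfigurationSketch {
    // Assumed default: the pre-existing scheme, so behaviour is unchanged
    // unless the user opts in to a different one.
    private HashingScheme hashingScheme = HashingScheme.JavaStringHash;

    // Fluent setter, mirroring the style of the C++ setter proposed above.
    ProducerConfigurationSketch setHashingScheme(HashingScheme hashingScheme) {
        this.hashingScheme = hashingScheme;
        return this;
    }

    HashingScheme getHashingScheme() {
        return hashingScheme;
    }
}
```

An enum leaves room for more schemes later, which a boolean such as `useMurmurHash` cannot express.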

@@ -0,0 +1,108 @@
/**
Contributor

@Licht-T - As per your comment in #1017

boost::hash uses MurmurHash3

Then why are we reimplementing it and changing the code on the C++ side?

Contributor Author

@Licht-T Licht-T Jan 4, 2018

@jai1 There are various implementations of MurmurHash3. The one in boost::hash differs considerably from Google Guava's, which is based on the original implementation.

Contributor

Is the actual result different between Guava and boost versions?

Contributor Author

@merlimat Yes, and the result of my implementation is the same as Google Guava's.
Google Guava: https://ideone.com/Jq4stk
boost::hash: https://ideone.com/W4jEBT

Contributor

Nice! Thanks for checking

}

@Override
public int choosePartition(Message msg) {
// If the message has a key, it supersedes the single partition routing policy
if (msg.hasKey()) {
- return ((msg.getKey().hashCode() & Integer.MAX_VALUE) % numPartitions);
+ if (useMurmurHash) {
+ HashFunction hf = Hashing.murmur3_32(0);
Member

I think this logic is not allocation-friendly. It allocates two objects, a HashFunction and a HashCode, per message.

I would suggest using a MurmurHash class with a static method that takes a byte array or a string and computes a hash value, rather than using Guava's HashFunction or Hasher. E.g.


public class MurmurHash3 {

     public static int hash32(ByteBuf data, int offset, int length, int seed) { ... }

}

Another suggestion: it would be good to write a JMH microbenchmark that measures the difference between MurmurHash3 and the simple hash.
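
As a rough illustration of the static-method approach suggested above, here is a sketch of the standard 32-bit MurmurHash3 (x86 variant) in plain Java. It is not the class merged in this PR: the name `MurmurHash3Sketch` is hypothetical, it takes a `byte[]` instead of a `ByteBuf` to stay self-contained, and encoding string keys as UTF-8 is an assumption.

```java
// Hypothetical helper; a sketch of MurmurHash3 x86 32-bit, not the merged code.
final class MurmurHash3Sketch {
    private static final int C1 = 0xcc9e2d51;
    private static final int C2 = 0x1b873593;

    private MurmurHash3Sketch() {}

    // Convenience overload; note that getBytes() still allocates a byte array.
    static int hash32(String s, int seed) {
        byte[] b = s.getBytes(java.nio.charset.StandardCharsets.UTF_8);
        return hash32(b, 0, b.length, seed);
    }

    static int hash32(byte[] data, int offset, int length, int seed) {
        int h = seed;
        int nblocks = length / 4;

        // Body: consume the input four bytes at a time, little-endian.
        for (int i = 0; i < nblocks; i++) {
            int p = offset + i * 4;
            int k = (data[p] & 0xff)
                    | (data[p + 1] & 0xff) << 8
                    | (data[p + 2] & 0xff) << 16
                    | (data[p + 3] & 0xff) << 24;
            h ^= mixK(k);
            h = Integer.rotateLeft(h, 13);
            h = h * 5 + 0xe6546b64;
        }

        // Tail: mix in the remaining 1 to 3 bytes, if any.
        int tail = offset + nblocks * 4;
        int k = 0;
        switch (length & 3) {
            case 3: k ^= (data[tail + 2] & 0xff) << 16; // fall through
            case 2: k ^= (data[tail + 1] & 0xff) << 8;  // fall through
            case 1: k ^= (data[tail] & 0xff);
                    h ^= mixK(k);
        }

        // Finalization: fold in the length and avalanche the bits.
        h ^= length;
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    private static int mixK(int k) {
        k *= C1;
        k = Integer.rotateLeft(k, 15);
        k *= C2;
        return k;
    }
}
```

The `byte[]` overload creates no intermediate objects per call, which is the point of the suggestion. The result is a signed `int`, so a caller still has to make it non-negative before taking the remainder by the partition count.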

Licht-T commented Jan 4, 2018

@merlimat @jai1 @sijie Thanks for your review! I will update the code.

Licht-T commented Jan 13, 2018

I will fix this after #1033 is merged.

@merlimat

@Licht-T This should be unblocked. I have merged all the other PRs.

Licht-T commented Jan 26, 2018

Thanks @merlimat, I'll fix it up this weekend.

Licht-T commented Jan 26, 2018

Just fixed up the C++ client. The Java client will be re-implemented in the same way.

Licht-T commented Jan 26, 2018

retest this please

@merlimat merlimat left a comment

@Licht-T The C++ LGTM. Just one comment on moving the .h files into the lib/ directory, since they're not going to be part of the user-facing API.


#include <cstdint>
#include <string>
#include <boost/functional/hash.hpp>
Contributor

I think this class should go in the lib/ directory since it should not be part of the user API. As a user, I just need to know about the "Boost" hash enumerator and not its implementation.

#include <string>

namespace pulsar {
class Hash {
Contributor

Same for Hash: if the user interaction is with the enum, there is no need to expose this interface. In any case the user won't be able to directly supply a different hash function (other than by writing a custom router).

@@ -46,7 +46,7 @@
private boolean blockIfQueueFull = false;
private int maxPendingMessages = 1000;
private MessageRoutingMode messageRouteMode = MessageRoutingMode.SinglePartition;
- private boolean useMurmurHash = true;
+ private HashingScheme hashingScheme = HashingScheme.Murmur3_32Hash;
Contributor

We cannot change the default from Java to Murmur hash, since it would unexpectedly break ordering for people upgrading the library.

Contributor Author

@merlimat Does this also apply to the C++ client?

Contributor

Yes, unfortunately we cannot change the default in either library implementation. C++ would have to keep using the current BoostHash.

MessageRouterBase(ProducerConfiguration.HashingScheme hashingScheme) {
switch (hashingScheme) {
case JavaStringHash:
this.hash = new JavaStringHash();
Contributor

Maybe we can keep static instances of the Hashing objects since they can be shared around

package org.apache.pulsar.client.impl;

public interface Hash {
long makeHash(String s);
Contributor

Since we're converting to an int later, we should just have an int here.

Contributor Author

@Licht-T Licht-T Jan 27, 2018

@merlimat MurmurHash returns an unsigned int value, so we have to return it as a positive value to calculate the partition.

Contributor

Sure, but the number is masked with & Integer.MAX_VALUE anyway, so the sign bit will be stripped.

Contributor Author

@merlimat Does this conversion preserve the quality of the hash?

Contributor Author

I think this is not a good workaround because it may degrade the quality of the hash.

Contributor

@merlimat merlimat Jan 27, 2018

I don't think it really changes anything, because the hash is then taken modulo the number of partitions, so the high-order bits are discarded anyway.

Contributor Author

@Licht-T Licht-T Jan 27, 2018

@merlimat IMO, this conversion may break the uniform distribution of the hash. I am concerned that message routing would be biased toward specific partitions.

Contributor Author

@merlimat On second thought, you are right! Excuse me!
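
To make the point of this thread concrete, the sketch below compares two ways of turning a signed 32-bit hash into a partition index in Java: masking off the sign bit, as discussed above, and reinterpreting the same bits as an unsigned value. The class and method names are hypothetical and only serve the illustration.

```java
// Hypothetical helper illustrating the sign-bit discussion; not code from this PR.
final class PartitionIndexSketch {
    private PartitionIndexSketch() {}

    // Clear the sign bit, then reduce modulo the partition count.
    static int maskedIndex(int hash, int numPartitions) {
        return (hash & Integer.MAX_VALUE) % numPartitions;
    }

    // Alternative: treat the 32 bits as an unsigned value, then reduce.
    static int unsignedIndex(int hash, int numPartitions) {
        return (int) ((hash & 0xffffffffL) % numPartitions);
    }
}
```

In Java, `%` applied to a negative `int` yields a negative result, so one of these conversions is needed before indexing partitions. Both always return a value in `[0, numPartitions)`, but they can pick different partitions for the same hash, so every client has to apply the same conversion.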

- else {
- return ((msg.getKey().hashCode() & Integer.MAX_VALUE) % topicMetadata.numPartitions());
- }
+ return (int) (hash.makeHash(msg.getKey()) % topicMetadata.numPartitions());
Contributor

This depends on makeHash() always returning a non-negative number, which is true in the current implementation. We should mention that in the makeHash() javadoc.
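
A sketch of how that contract could be written down, also using an `int` return type and a shared static instance as suggested in earlier comments. All names ending in `Sketch` are hypothetical and the merged interface may differ.

```java
// Hypothetical sketch of the documented contract; not the merged interface.
interface HashSketch {
    /**
     * Computes a hash of the given key.
     *
     * @return a non-negative value, so callers can apply {@code %} directly
     */
    int makeHash(String s);
}

final class JavaStringHashSketch implements HashSketch {
    // Stateless, so one shared instance is enough.
    static final JavaStringHashSketch INSTANCE = new JavaStringHashSketch();

    private JavaStringHashSketch() {}

    @Override
    public int makeHash(String s) {
        // Clearing the sign bit enforces the non-negative contract.
        return s.hashCode() & Integer.MAX_VALUE;
    }
}

final class RouterSketch {
    private final HashSketch hash;
    private final int numPartitions;

    RouterSketch(HashSketch hash, int numPartitions) {
        this.hash = hash;
        this.numPartitions = numPartitions;
    }

    // Relies on the makeHash() contract: the result is never negative.
    int choosePartition(String key) {
        return hash.makeHash(key) % numPartitions;
    }
}
```

Keeping the mask inside each `makeHash()` implementation means the router needs no scheme-specific handling.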

@merlimat

@Licht-T The change looks good to me. Just the last thing is to move the new headers added in pulsar-client-cpp/include/pulsar/BoostHash.h into pulsar-client-cpp/lib/... because they should not be exposed as public API to the user.

Licht-T commented Jan 28, 2018

retest this please

Licht-T commented Jan 29, 2018

@merlimat The C++ client is now fixed. I'll add some tests.

@merlimat merlimat left a comment

👍 Nice change!

Licht-T commented Jan 30, 2018

@merlimat Tests are now added!

@merlimat merlimat merged commit 8d159ef into apache:master Jan 31, 2018
@saandrews

@Licht-T Some of the changes, like "constexpr", seem to be C++11-specific and are breaking our build.

Licht-T commented Feb 2, 2018

@saandrews Oops! Excuse me! I'll fix soon!

Labels
type/feature The PR added a new feature or issue requested a new feature
Development

Successfully merging this pull request may close these issues.

Hashcode of message key is inconsistent between different language clients
5 participants