DRILL-6825: Apply different hash algorithms to different data types #1662
Conversation
@Ben-Zvi could you review this PR?
(force-pushed from d4c4e2d to d011bca)
import org.apache.drill.exec.record.MaterializedField;
import org.apache.drill.exec.record.RecordBatch;
import org.apache.drill.exec.record.TypedFieldId;
import org.apache.drill.exec.record.VectorAccessible;
import org.apache.drill.exec.record.VectorContainer;
import org.apache.drill.exec.vector.ValueVector;
(This comment should be at lines 71-77 below; GitHub does not allow putting a comment there.)
We also need to remove the code generation for "getHashBuild" and "getHashProbe".
Since HashJoin needs to handle implicit casting, it treats all numeric data types as double. So this PR only applies to HashAgg.
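The implicit-casting point can be illustrated with a minimal sketch (the class and method names here are hypothetical, not Drill code): when one join side is BIGINT and the other is DOUBLE, both sides must hash the widened double representation, otherwise equal keys would land in different buckets.

```java
// Hypothetical illustration, not Drill code: why HashJoin widens all
// numeric join-key types to double before hashing.
public class WidenBeforeHash {
    // Hash the widened double representation of any numeric key.
    public static int hashNumeric(double widened) {
        long bits = Double.doubleToLongBits(widened);
        return (int) (bits ^ (bits >>> 32));
    }

    public static void main(String[] args) {
        long leftKey = 42L;      // BIGINT side of the join
        double rightKey = 42.0;  // DOUBLE side of the join
        // Widening both sides makes equal keys hash equally; hashing
        // the raw long bits would not match, since the IEEE-754 bit
        // pattern of 42.0 is not the integer 42.
        System.out.println(hashNumeric((double) leftKey) == hashNumeric(rightKey));
    }
}
```

This is also why a type-aware integer hash cannot simply be dropped into HashJoin: the widened-double representation is what both sides agree on.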
@weijietong can you add a detailed description of your work? The hash functions are a sensitive part of the execution, and possibly more people would like to look into the changes.
@Ben-Zvi I have added some description.
Adding the hash32() method to the ValueVector is useful; however, picking algorithms just because a paper recommends them or because they are famous may not be good enough. At my previous employer I evaluated many hash functions by actually running (standalone) performance and distribution tests. One clear result back then was that murmur performed well on long strings, but less well on shorter data. How do the new hash functions in IntegerHashing compare with the existing one in HashHelper?

The Boost implementation of hash_combine looks "fishy" (e.g., some bits get used more than others) -- see some more critique at https://stackoverflow.com/questions/35985960/c-why-is-boosthash-combine-the-best-way-to-combine-hash-values . Why can't the seed be given directly to the hash function instead of being "combined" later?

Another good hash function used in the past (I don't recall its name) worked with a map of 256 prime numbers; starting with the seed, the code used each input byte as an index into the map -- rotate the old value, XOR with the new mapped value, and continue. The above experience was in C++; maybe things perform differently in Java.

Also -- do you know of any open source hash functions we can just import instead of writing the code in Drill?
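For readers unfamiliar with it, the Boost hash_combine pattern being questioned above can be transliterated to Java roughly as follows (shown only to illustrate the pattern under discussion; 0x9e3779b9 is the 32-bit golden-ratio constant Boost uses):

```java
// Java transliteration of Boost's 32-bit hash_combine, for illustration
// of the pattern critiqued in the linked StackOverflow thread.
public class HashCombineSketch {
    public static int hashCombine(int seed, int valueHash) {
        // Boost: seed ^= hash(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
        // (>>> because Boost's size_t shift is unsigned)
        return seed ^ (valueHash + 0x9e3779b9 + (seed << 6) + (seed >>> 2));
    }
}
```

The critique in the linked thread is essentially that this mixes the seed's bits unevenly, which matters when many key columns are combined in sequence.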
The IntegerHashing method is also used in ClickHouse for integer types (see the intHash32 method in https://github.com/yandex/ClickHouse/blob/master/dbms/src/Common/HashTable/Hash.h ). ClickHouse chooses its hashing method according to the data types and key width, which is valuable for us to learn from. As you mentioned, Murmur3Hash does not perform well on shorter integers, so it's better to use IntegerHash for integer keys.

I had read the discussion of the Boost implementation you mentioned before. But I think there is a reason why Boost, as a base library, still keeps the current implementation. The reason to keep the seed out of the hash32 function and involve Boost's hash_combine method is that I want to change the current hashing strategy later. I plan to change the hash32(hash32(hash32)) row-iterate model to

Other hashing methods I know of can be found here: https://github.com/benalexau/hash-bench. It's a collection of Java hashing methods. The benchmark I ran showed that city_1_1 from https://github.com/OpenHFT/Zero-Allocation-Hashing/blob/master/src/main/java/net/openhft/hashing/LongHashFunction.java has the best performance at 32- and 64-byte key widths.

I also wonder whether we can do the join-key data type implication at the project node later, so that the HashJoin and Exchange nodes can also benefit from this PR.
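The ClickHouse intHash32 referenced above is essentially a Thomas Wang style 64-bit-to-32-bit integer mix. A Java sketch (the up-front seed XOR stands in for ClickHouse's salt template parameter; this is an illustration, not the PR's exact code):

```java
// Sketch of a Thomas Wang style 64-bit -> 32-bit integer hash, the
// family used by ClickHouse's intHash32. The salt is XORed in first.
public class IntegerHashSketch {
    public static int intHash32(long key, long salt) {
        key ^= salt;
        key = (~key) + (key << 18);
        key = key ^ ((key >>> 31) | (key << 33));
        key = key * 21;
        key = key ^ ((key >>> 11) | (key << 53));
        key = key * 37;
        key = key ^ (key >>> 22);
        return (int) key;   // truncate the mixed value to 32 bits
    }
}
```

The whole function is a handful of shifts and multiplies per key, which is why a hash of this family can beat byte-at-a-time Murmur3 on fixed-width integer keys.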
@Ben-Zvi I have done a benchmark of IntegerHash and MurmurHash3_32. The result shows that IntegerHash has nearly twice the performance of MurmurHash3_32. The city_1_1 hash algorithm mentioned above performs better than Murmur3_32 but not better than IntegerHash. I also ran a comparison of two SQL queries that use IntegerHash and MurmurHash respectively on long data types. The queries' hotspot is hashing:
Q2:
The dataset is TPC-H scale 100. The query result shows that the more integer-datatype columns are grouped by, the more query performance is gained.
 * 32bit hashing for string and so on.
 * murmurhash3
 */
public class MurmurHashing
BTW, it seems that the same code already exists in org.apache.drill.exec.expr.fn.impl.MurmurHash3
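For comparison with the integer hash above, the 32-bit Murmur3 path for a single 8-byte key looks roughly like this (a standalone sketch, not the existing Drill MurmurHash3 class): two multiply-rotate block rounds plus the standard five-step finalizer.

```java
// Standalone sketch of MurmurHash3 x86_32 specialized to one long key,
// processed as two 4-byte little-endian blocks.
public class Murmur3LongSketch {
    public static int hash32(long key, int seed) {
        final int c1 = 0xcc9e2d51, c2 = 0x1b873593;
        int h = seed;
        for (int block = 0; block < 2; block++) {
            int k = (int) (key >>> (32 * block));   // low block first
            k *= c1;
            k = Integer.rotateLeft(k, 15);
            k *= c2;
            h ^= k;
            h = Integer.rotateLeft(h, 13);
            h = h * 5 + 0xe6546b64;
        }
        h ^= 8;        // total input length in bytes
        h ^= h >>> 16; // fmix32 avalanche finalizer
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }
}
```

The per-key cost of the block rounds plus the finalizer is plausibly where the roughly 2x gap against the single-pass integer hash comes from on fixed-width keys.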
 * @throws SchemaChangeException
 */
@Override
public int getBuildHashCodeTreatDataTypes(int incomingRowIdx) throws SchemaChangeException {
A benefit of this work can be to extract the Hash-Value calculation from the HashTable class(es). Conceptually the two are independent -- the hash-value is a parameter given to the hash table API.
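The separation suggested here could look something like the following (the interface and method names are hypothetical, not Drill's actual API): the caller computes the hash value once, and the hash table only ever consumes it as a parameter.

```java
// Hypothetical sketch of decoupling hash-value computation from the
// hash table: the table receives a precomputed hash, nothing more.
public class DecoupledHashTable {
    // A pluggable, possibly type-aware, per-row hasher.
    interface RowHasher {
        int hash32(int rowIdx);
    }

    // The hash-table side: the hash value is just an input parameter,
    // so the table is independent of which algorithm produced it.
    public static int bucketOf(int hashValue, int numBuckets) {
        return (hashValue & 0x7fffffff) % numBuckets;
    }

    public static void main(String[] args) {
        RowHasher hasher = rowIdx -> rowIdx * 0x9e3779b9;  // stand-in hasher
        int h = hasher.hash32(7);
        System.out.println(bucketOf(h, 16));
    }
}
```

With this split, swapping Murmur3 for an integer hash (or a per-type choice of hashes) never touches the hash-table code itself.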
@Ben-Zvi The hash code computation of HashAgg has been moved to
@weijietong - how about postponing this work to 1.17.0? It is a change in a critical part of the engine, and we need to get a better understanding of the impact; 1.16.0 is too close. (Also, I am very busy with some 1.16 work now and have no time to test it myself.) In 1.17.0 the work can be extended to Hash-Join and exchanges to be more complete.
@Ben-Zvi I agree with you. I have changed the issue to 1.17.0.
@Ben-Zvi Do you have time to continue this work?
@weijietong what's the status of this PR?
This PR introduces these main changes: