Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DRILL-6825: Apply different hash algorithms to different data types #1662

Closed
wants to merge 3 commits into from

Conversation

weijietong
Copy link

@weijietong weijietong commented Feb 27, 2019

This PR introduces those main changes:

@weijietong
Copy link
Author

@Ben-Zvi could you review this PR ?

@weijietong weijietong force-pushed the DRILL-6825 branch 3 times, most recently from d4c4e2d to d011bca Compare February 28, 2019 04:04
import org.apache.drill.exec.record.MaterializedField;
import org.apache.drill.exec.record.RecordBatch;
import org.apache.drill.exec.record.TypedFieldId;
import org.apache.drill.exec.record.VectorAccessible;
import org.apache.drill.exec.record.VectorContainer;
import org.apache.drill.exec.vector.ValueVector;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(This comment should be at lines 71-77 below; github does not let putting a comment there....)
Need also to remove the code generation for "getHashBuild" and "getHashProbe".

@weijietong
Copy link
Author

Since HashJoin needs to solve implicit casting,it treats all different numeric data types as double. So this PR only apply to HashAgg.

@Ben-Zvi
Copy link
Contributor

Ben-Zvi commented Feb 28, 2019

@weijietong can you add a detailed description of your work - the hash functions are a sensitive part of the execution, and possibly more people would like to look into the changes.

@weijietong
Copy link
Author

@Ben-Zvi I have added some description.

@Ben-Zvi
Copy link
Contributor

Ben-Zvi commented Mar 2, 2019

Adding the hash32() method to the ValueVector is useful; however picking up algorithms just based on a paper or being famous may not be good enough. At my previous employer I evaluated many hash functions by actually running (stand alone) performance and distribution tests. One clear result shown back then is that murmur performed well on long strings, less good on shorter data.

How do the new hash functions in IntegerHashing compare with the existing one in HashHelper ?

The Boost implementation of hash_combine looks "fishy" (e.g., some bits get more used than others) -- see some more critique at https://stackoverflow.com/questions/35985960/c-why-is-boosthash-combine-the-best-way-to-combine-hash-values

Why can't the seed be given directly to the hash function instead of being "combined" later ?

Another good hash function used in the past (don't recall any name) worked with a map of 256 prime numbers, and the code (starting with the seed) was using each input byte as an index to the map - rotate old value, XOR with new mapped value, continue ....

The above experience was in C++; maybe things perform differently in Java.

Also - do you know of any open source hash functions we can just import instead of writing the code in Drill ?

@weijietong
Copy link
Author

The IntegerHashing's method was also used in ClickHouse for integer types(see: https://github.com/yandex/ClickHouse/blob/master/dbms/src/Common/HashTable/Hash.h intHash32 method). CK does a fine hashing method choosing according to the data types and keys width which is valuable for us to learn. As you mentioned Murmur3Hash does not have a good performance at the shorter integer case.So it's better to use the IntegerHash at the integer keys case.

The Boost implementation's discussion you mentioned I had read before. But I think it's reasonable why Boost still keep the current implementation now as a base library.

The reason to keep seed away from the hash32 function and involve the Boost's hash_combine method is that I want to change the current hashing strategy later. I plan to change the hash32(hash32(hash32)) row iterate model to hash32() hash_combine hash32() hash_combine hash32() column combine model at the multi-keys case. The row iterate module has a data dependency and will hurt the cpu pipeline performance.

Other hashing methods I know can be found here: https://github.com/benalexau/hash-bench. It's a java hashing method collection. The benchmark I run showed that https://github.com/OpenHFT/Zero-Allocation-Hashing/blob/master/src/main/java/net/openhft/hashing/LongHashFunction.java 's city_1_1 has a best performance at 32,64 bytes key width.

I also wonder whether we can do the join keys data type implication at the project node later. So the HashJoin and Exchange node can also benefit from this PR.

@weijietong
Copy link
Author

weijietong commented Mar 6, 2019

@Ben-Zvi I have done a benchmark about IntegerHash and MurmurHash3_32. The result shows that IntegerHash has nearly 2 times performance than MurmurHash3_32.

HashBenchInteger.hashInteger  IntegerHash       N/A  avgt    5   3.987 ± 0.162  ns/op
HashBenchInteger.hashInteger   Murmur3_32       N/A  avgt    5   6.626 ± 0.085  ns/op
HashBenchInteger.hashLong     IntegerHash       N/A  avgt    5   4.903 ± 0.320  ns/op
HashBenchInteger.hashLong      Murmur3_32       N/A  avgt    5   8.525 ± 0.649  ns/op

The city_1_1 hash algorithm mentioned above only has a better performance than Murmur3_32 but not good than IntegerHash.

I also run two sql queries comparison which are using IntegerHash and MurmurHash separately to the long data types. The queries are hashing hotspot:
Q1:

select c.c_custkey, count(*)
from dfs.`/tpch100/customer` c
group by c.c_custkey

Q2:

select c.c_custkey,c.c_nationkey, count(*)
from dfs.`/tpch100/customer` c
group by c.c_custkey,c.c_nationkey

The dataset is tpc-h scale 100. The query result shows that :
To Q1:
IntegerHash has a 5% query performance improvement than MurmurHash3_32(IntegerHash: 14.029 sec, MurmurHash3_32: 14.870 sec ).
To Q2:
IntegerHash has a 16% query performance improvement than MurmurHash3_32(IntegerHash: 20.499 sec,MurmurHash3_32: 23.921 sec).

It is clear that the more integer datatype columns grouped by, the more query performance will be gained.

* 32bit hashing for string and so on.
* murmurhash3
*/
public class MurmurHashing
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, seems that the same code already exists in org.apache.drill.exec.expr.fn.impl.MurmurHash3

* @throws SchemaChangeException
*/
@Override
public int getBuildHashCodeTreatDataTypes(int incomingRowIdx) throws SchemaChangeException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A benefit of this work can be to extract the Hash-Value calculation from the HashTable class(es). Conceptually the two are independent -- the hash-value is a parameter given to the hash table API.

@weijietong
Copy link
Author

@Ben-Zvi The hash code computation of HashAgg has been moved to HashAggTemplate. Your review advices have been applied. Please check, tks!

@Ben-Zvi
Copy link
Contributor

Ben-Zvi commented Mar 14, 2019

@weijietong - how about postponing this work to 1.17.0 ? It is a change in a critical part of the engine, and we need to get a better understanding of the impact, and 1.16.0 is too close. (Also I am very busy with some 1.16 work now, and have no time to test myself). In 1.17.0 the work can be extended to Hash-Join and exchanges to be more complete.

@weijietong
Copy link
Author

@Ben-Zvi agree with you. I have changed the issue to 1.17.0

@weijietong
Copy link
Author

@Ben-Zvi Do you have the time to go on this work ?

@arina-ielchiieva arina-ielchiieva changed the title DRILL-6825: apply different hash algorithms to different data types DRILL-6825: Apply different hash algorithms to different data types Oct 14, 2019
@arina-ielchiieva
Copy link
Member

@weijietong that's the status of this PR?

@weijietong weijietong closed this Oct 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants