DRILL-6825: Apply different hash algorithms to different data types #1662
Conversation
@Ben-Zvi could you review this PR?
(force-pushed from d4c4e2d to d011bca)
import org.apache.drill.exec.record.MaterializedField;
import org.apache.drill.exec.record.RecordBatch;
import org.apache.drill.exec.record.TypedFieldId;
import org.apache.drill.exec.record.VectorAccessible;
import org.apache.drill.exec.record.VectorContainer;
import org.apache.drill.exec.vector.ValueVector;
(This comment should be at lines 71-77 below; GitHub does not allow putting a comment there.)
We also need to remove the code generation for "getHashBuild" and "getHashProbe".
Since HashJoin needs to handle implicit casting, it treats all numeric data types as double. So this PR only applies to HashAgg.
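The implicit-casting point can be illustrated with a minimal sketch (the class and method names here are hypothetical, not Drill code): when one join side is BIGINT and the other is DOUBLE, both sides must hash the widened double representation, otherwise equal keys would land in different buckets.

```java
// Hypothetical illustration, not Drill code: why HashJoin widens all
// numeric join-key types to double before hashing.
public class WidenBeforeHash {
    // Hash the widened double representation of any numeric key.
    public static int hashNumeric(double widened) {
        long bits = Double.doubleToLongBits(widened);
        return (int) (bits ^ (bits >>> 32));
    }

    public static void main(String[] args) {
        long leftKey = 42L;      // BIGINT side of the join
        double rightKey = 42.0;  // DOUBLE side of the join
        // Widening both sides makes equal keys hash equally; hashing
        // the raw long bits would not match, since the IEEE-754 bit
        // pattern of 42.0 is not the integer 42.
        System.out.println(hashNumeric((double) leftKey) == hashNumeric(rightKey));
    }
}
```

This is also why a type-aware integer hash cannot simply be dropped into HashJoin: the widened-double representation is what both sides agree on.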
@weijietong can you add a detailed description of your work? The hash functions are a sensitive part of the execution, and possibly more people would like to look into the changes.
@Ben-Zvi I have added some description.
Adding the hash32() method to the ValueVector is useful; however, picking algorithms just because a paper recommends them or because they are famous may not be good enough. At my previous employer I evaluated many hash functions by actually running (standalone) performance and distribution tests. One clear result back then was that murmur performed well on long strings, but less well on shorter data. How do the new hash functions in IntegerHashing compare with the existing one in HashHelper?

The Boost implementation of hash_combine looks "fishy" (e.g., some bits get used more than others) -- see some more critique at https://stackoverflow.com/questions/35985960/c-why-is-boosthash-combine-the-best-way-to-combine-hash-values . Why can't the seed be given directly to the hash function instead of being "combined" later?

Another good hash function used in the past (I don't recall its name) worked with a map of 256 prime numbers; starting with the seed, the code used each input byte as an index into the map -- rotate the old value, XOR with the new mapped value, and continue. The above experience was in C++; maybe things perform differently in Java.

Also -- do you know of any open source hash functions we can just import instead of writing the code in Drill?
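For readers unfamiliar with it, the Boost hash_combine pattern being questioned above can be transliterated to Java roughly as follows (shown only to illustrate the pattern under discussion; 0x9e3779b9 is the 32-bit golden-ratio constant Boost uses):

```java
// Java transliteration of Boost's 32-bit hash_combine, for illustration
// of the pattern critiqued in the linked StackOverflow thread.
public class HashCombineSketch {
    public static int hashCombine(int seed, int valueHash) {
        // Boost: seed ^= hash(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
        // (>>> because Boost's size_t shift is unsigned)
        return seed ^ (valueHash + 0x9e3779b9 + (seed << 6) + (seed >>> 2));
    }
}
```

The critique in the linked thread is essentially that this mixes the seed's bits unevenly, which matters when many key columns are combined in sequence.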
The IntegerHashing method is also used in ClickHouse for integer types (see the intHash32 method in https://github.com/yandex/ClickHouse/blob/master/dbms/src/Common/HashTable/Hash.h ). ClickHouse chooses its hashing method according to the data types and key width, which is valuable for us to learn from. As you mentioned, Murmur3Hash does not perform well on shorter integers, so it's better to use IntegerHash for integer keys.

I had read the discussion of the Boost implementation you mentioned before. But I think there is a reason why Boost, as a base library, still keeps the current implementation. The reason to keep the seed out of the hash32 function and involve Boost's hash_combine method is that I want to change the current hashing strategy later. I plan to change the hash32(hash32(hash32)) row-iterate model to

Other hashing methods I know of can be found here: https://github.com/benalexau/hash-bench. It's a collection of Java hashing methods. The benchmark I ran showed that city_1_1 from https://github.com/OpenHFT/Zero-Allocation-Hashing/blob/master/src/main/java/net/openhft/hashing/LongHashFunction.java has the best performance at 32- and 64-byte key widths.

I also wonder whether we can do the join-key data type implication at the project node later, so that the HashJoin and Exchange nodes can also benefit from this PR.
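The ClickHouse intHash32 referenced above is essentially a Thomas Wang style 64-bit-to-32-bit integer mix. A Java sketch (the up-front seed XOR stands in for ClickHouse's salt template parameter; this is an illustration, not the PR's exact code):

```java
// Sketch of a Thomas Wang style 64-bit -> 32-bit integer hash, the
// family used by ClickHouse's intHash32. The salt is XORed in first.
public class IntegerHashSketch {
    public static int intHash32(long key, long salt) {
        key ^= salt;
        key = (~key) + (key << 18);
        key = key ^ ((key >>> 31) | (key << 33));
        key = key * 21;
        key = key ^ ((key >>> 11) | (key << 53));
        key = key * 37;
        key = key ^ (key >>> 22);
        return (int) key;   // truncate the mixed value to 32 bits
    }
}
```

The whole function is a handful of shifts and multiplies per key, which is why a hash of this family can beat byte-at-a-time Murmur3 on fixed-width integer keys.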
@Ben-Zvi I have done a benchmark of IntegerHash and MurmurHash3_32. The result shows that IntegerHash has nearly twice the performance of MurmurHash3_32. The city_1_1 hash algorithm mentioned above performs better than Murmur3_32 but not better than IntegerHash. I also ran a comparison of two SQL queries that use IntegerHash and MurmurHash respectively on long data types. The queries' hotspot is hashing:
Q2:
The dataset is TPC-H scale 100. The query result shows that the more integer-datatype columns are grouped by, the more query performance is gained.
 * 32bit hashing for string and so on.
 * murmurhash3
 */
public class MurmurHashing
BTW, it seems that the same code already exists in org.apache.drill.exec.expr.fn.impl.MurmurHash3
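For comparison with the integer hash above, the 32-bit Murmur3 path for a single 8-byte key looks roughly like this (a standalone sketch, not the existing Drill MurmurHash3 class): two multiply-rotate block rounds plus the standard five-step finalizer.

```java
// Standalone sketch of MurmurHash3 x86_32 specialized to one long key,
// processed as two 4-byte little-endian blocks.
public class Murmur3LongSketch {
    public static int hash32(long key, int seed) {
        final int c1 = 0xcc9e2d51, c2 = 0x1b873593;
        int h = seed;
        for (int block = 0; block < 2; block++) {
            int k = (int) (key >>> (32 * block));   // low block first
            k *= c1;
            k = Integer.rotateLeft(k, 15);
            k *= c2;
            h ^= k;
            h = Integer.rotateLeft(h, 13);
            h = h * 5 + 0xe6546b64;
        }
        h ^= 8;        // total input length in bytes
        h ^= h >>> 16; // fmix32 avalanche finalizer
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }
}
```

The per-key cost of the block rounds plus the finalizer is plausibly where the roughly 2x gap against the single-pass integer hash comes from on fixed-width keys.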
 * @throws SchemaChangeException
 */
@Override
public int getBuildHashCodeTreatDataTypes(int incomingRowIdx) throws SchemaChangeException {
A benefit of this work can be to extract the Hash-Value calculation from the HashTable class(es). Conceptually the two are independent -- the hash-value is a parameter given to the hash table API.
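The separation suggested here could look something like the following (the interface and method names are hypothetical, not Drill's actual API): the caller computes the hash value once, and the hash table only ever consumes it as a parameter.

```java
// Hypothetical sketch of decoupling hash-value computation from the
// hash table: the table receives a precomputed hash, nothing more.
public class DecoupledHashTable {
    // A pluggable, possibly type-aware, per-row hasher.
    interface RowHasher {
        int hash32(int rowIdx);
    }

    // The hash-table side: the hash value is just an input parameter,
    // so the table is independent of which algorithm produced it.
    public static int bucketOf(int hashValue, int numBuckets) {
        return (hashValue & 0x7fffffff) % numBuckets;
    }

    public static void main(String[] args) {
        RowHasher hasher = rowIdx -> rowIdx * 0x9e3779b9;  // stand-in hasher
        int h = hasher.hash32(7);
        System.out.println(bucketOf(h, 16));
    }
}
```

With this split, swapping Murmur3 for an integer hash (or a per-type choice of hashes) never touches the hash-table code itself.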
@Ben-Zvi The hash code computation of HashAgg has been moved to
@weijietong - how about postponing this work to 1.17.0? It is a change in a critical part of the engine, and we need to get a better understanding of the impact; 1.16.0 is too close. (Also, I am very busy with some 1.16 work now and have no time to test it myself.) In 1.17.0 the work can be extended to Hash-Join and exchanges to be more complete.
@Ben-Zvi I agree with you. I have changed the issue to 1.17.0.
@Ben-Zvi Do you have time to continue this work?
@weijietong what's the status of this PR?
This PR introduces these main changes: