Labels
enhancement (New feature or request), performance (Make DataFusion faster)
Description
Is your feature request related to a problem or challenge?
If the build side of the join is large, a significant bottleneck can be building the hash table.
We can explore some opportunities to improve the performance of building this map.
Describe the solution you'd like
Core Idea
The slowest part of building the hash map is finding and then inserting the items (hash + offset) into the map for each element.
We should be able to test the following:
- Sort the items by hash (and offset) so that duplicate hashes become adjacent and can be deduplicated. Sorting introduces some overhead, but the hope is that it pays off when inserting into the table.
 - For the first occurrence of each hash, we can use `insert_unique` (https://docs.rs/hashbrown/latest/hashbrown/struct.HashTable.html#method.insert_unique) rather than `entry` (https://docs.rs/hashbrown/latest/hashbrown/struct.HashTable.html#method.entry), which should be quite a bit faster because it does not have to search for existing items.
 - For duplicated elements, keep using the previous entry, saving a call to `entry` for each duplicate.
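
To make the idea concrete, here is a minimal sketch of the sort-then-insert approach, not the actual DataFusion implementation. It uses only the standard library, so `std::collections::HashMap::insert` stands in for hashbrown's `HashTable::insert_unique`; the function name `build_sorted`, the `END` sentinel, and the `heads`/`next` layout are hypothetical choices made for illustration. It also assumes, as is usual for a hash join, that rows sharing a hash but holding different keys are told apart later at probe time.

```rust
use std::collections::HashMap;

/// Hypothetical sentinel marking the end of a chain of rows with equal hashes.
const END: u32 = u32::MAX;

/// Builds a join-style lookup structure from precomputed row hashes.
///
/// Returns `(heads, next, table_inserts)`:
/// - `heads` maps each distinct hash to the smallest row offset with that hash,
/// - `next[i]` is the next row offset sharing row `i`'s hash, or `END`,
/// - `table_inserts` counts how often the hash table itself was touched.
fn build_sorted(hashes: &[u64]) -> (HashMap<u64, u32>, Vec<u32>, usize) {
    // Pair every hash with its row offset and sort, so equal hashes are adjacent.
    let mut pairs: Vec<(u64, u32)> = hashes.iter().copied().zip(0u32..).collect();
    pairs.sort_unstable();

    let mut heads = HashMap::new();
    let mut next = vec![END; hashes.len()];
    let mut table_inserts = 0;
    let mut prev: Option<(u64, u32)> = None;

    for &(hash, offset) in &pairs {
        match prev {
            // Same hash as the previous pair: extend the chain without
            // touching the hash table at all.
            Some((prev_hash, prev_offset)) if prev_hash == hash => {
                next[prev_offset as usize] = offset;
            }
            // First occurrence of this hash: it cannot be in the table yet,
            // so an insert that skips the lookup (insert_unique) would be valid here.
            _ => {
                heads.insert(hash, offset);
                table_inserts += 1;
            }
        }
        prev = Some((hash, offset));
    }
    (heads, next, table_inserts)
}
```

For the hashes `[7, 3, 7, 3, 7]`, this touches the table twice (once per distinct hash) instead of five times, and the rows for hash `7` are linked as `0 -> 2 -> 4` through `next`. Whether the saved lookups outweigh the cost of sorting is exactly what would need to be benchmarked.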
If this doesn't introduce any regressions, there are further opportunities to improve performance and simplify the join algorithm, for example by using the sorted order to improve the "chain" data structure as well (I'll do some experiments on this later).
Describe alternatives you've considered
No response
Additional context
No response