
[C++][Compute] Hash Join microbenchmarks #30038

Closed
asfimport opened this issue Oct 26, 2021 · 2 comments
asfimport commented Oct 26, 2021

Implement a series of microbenchmarks giving a good picture of the performance of the hash join implemented in Arrow across different sets of dimensions.
Compare the performance against some other product(s).
Add scripts for generating useful visual reports that give a good picture of the costs of hash join.

Examples of dimensions to explore in microbenchmarks:

  • number of duplicate keys on build side
  • relative size of build side to probe side
  • selectivity of the join
  • number of key columns
  • number of payload columns
  • filtering performance for semi- and anti-joins
  • dense integer key vs sparse integer key vs string key
  • build size
  • scaling of build, filtering, probe
  • inner vs left outer, inner vs right outer
  • left semi vs right semi, left anti vs right anti, left outer vs right outer
  • non-uniform key distribution
  • monotonic key values in input, partitioned key values in input (with and without per batch min-max metadata)
  • chain of multiple hash joins
  • overhead of Bloom filter for non-selective Bloom filter

Reporter: Michal Nowakiewicz / @michalursa
Assignee: Sasha Krassovsky / @save-buffer

Note: This issue was originally created as ARROW-14479. Please see the migration documentation for further details.


Michal Nowakiewicz / @michalursa:

Dimensions

Below is the list of useful dimensions for hash join performance testing.

We would like to have one test specifically targeting each of these dimensions. At the end of each dimension's description below there may be a note suggesting how to combine it with other related dimensions for multi-dimensional performance analysis. We can roughly assume that exploring a space of up to 3 dimensions means producing one graph for each value of the smallest-cardinality dimension, with one curve for each value of the second-smallest-cardinality dimension, using the highest-cardinality dimension as the x axis.

1. Size of the hash table

The size of the hash table can be measured in bytes or in the number of inserted elements (either rows or unique keys on the build side). Different implementations can organize the data in hash tables and related data structures differently and, because of that, occupy different amounts of memory for the same set of input rows. It therefore seems reasonable to use the number of inserted rows as the input parameter for benchmarks and the memory used as a measured output.

Measuring performance for varying size of the hash table is the fundamental test of the core hash table implementation. Going from tens to millions of inserted items in the hash table, we can observe the impact of various types of CPU cache misses on the performance. The graph should show performance dropping at the points where each type of CPU cache becomes smaller than what is required for hash table data. The exact input value point where it happens depends on the organization of data in the hash table and may vary from one implementation to another (some implementations may for instance use light-weight compression of data in the hash table to move these points further to the right on the graph). 

The extremes on both sides of this graph are of particular interest. Testing with hash tables fitting into the CPU L1 cache takes away the cost of memory access and stresses the implementation of things like hashing, traversing the hash table, and fetching, transforming and moving data from the hash table.

Testing with a hash table filling a significant chunk of all available memory exercises the scenario where the main bottleneck by far is loading cache lines from main memory. In this case, techniques like large/huge pages, memory prefetching, and partitioning before accessing the hash table play a significant role in achieving good performance, even though the same techniques would hurt performance on small hash tables.

Another interesting data point is the throughput, in bytes per second, output by hash join in that scenario. What makes it interesting is that this number can often be similar to, or even lower than (at least for a single-threaded hash join), the read/write bandwidth of SSD storage. This means that for the largest hash tables an implementation that spills to disk may not be much worse than in-memory processing, and there may be a point (defined by the hash table size combined with the number of processing threads) at which fully in-memory processing offers no benefit in execution time.

2. Data types

Another interesting dimension is data types and the number of key columns. Hashing keys, storing them in a hash table, and looking them up are typically very different between fixed-length data types (e.g. integer keys) and varying-length data types (e.g. string keys). There are also optimizations available only for integer keys from a dense domain (a range of key values comparable to the number of distinct keys).

Typically the number of key columns is small (single digit). There are some interesting performance questions related to the number of keys. An example would be comparing performance on 4x int64_t key columns to performance on a 1x 32-byte binary key column. Since there is an easy way to transform data from the first case to the second, a large difference in performance could be a sign of a problem with the implementation for one of these cases. A similar scenario for strings could compare 4x string columns of 0 to 16 characters each to a single string column of 0 to 64 characters.

Combine with 1.

3. Selectivity

One of the functions of the hash join (for semi or inner join) is to remove rows on the probe side with no matches on the build side. Hash join may choose for instance to build a Bloom filter for keys on the build side to quickly eliminate such rows on the probe side, providing a fast membership test but with a small probability of false positives. It may also be interesting to compare the filtering performance of semi hash join to that of in-list filter, since both implement roughly the same functionality.

Filtering performance of hash join mostly depends on two factors:

a) number of distinct keys on build side (or number of inserted rows on build side for some implementations, which may be different in case of duplicate keys);

b) probability of accepting a key on probe side (selectivity).

The first factor affects the cost of a single filter query, while the second affects the relative portion of processing time spent doing post-filter tasks, like hash table lookups and materialization of matching rows from build side.

The join filtering performance tests should include a comparison of semi-join to anti-join. A side-by-side view of the results for these two join types at the same selectivity is interesting. On one hand, it seems that anti-join could be implemented exactly like semi-join with the filter results negated. On the other hand, there are some important differences between these two cases: e.g. semi-join can use Bloom filters (which allow false positives) for fast elimination of rows, while anti-join cannot (a Bloom filter would introduce false negatives in that case).

Combine with 1. and 2.

4. Join types

There are 8 basic types of join: left/right semi, left/right anti, inner, left/right outer and full outer. The processing pipeline for each of them is slightly different, with a different subset of required steps. The join filtering tests will cover the semi and anti variants, and the basic tests will be based on inner join, which leaves outer joins as the missing area for performance evaluation. Full outer join combines the unique steps of left and right outer joins, so it should be enough to test left and right outer joins separately to get a good picture of the hash join steps unique to outer joins.

Combine with 1. and 3.

5. Number of matches

The work required to process many-to-many joins may present unique challenges that are not present in the case of many-to-one joins. For instance, the possibility of multiple matches for an input row on the probe side means that we may need to produce multiple copies of that row - one for each match in the hash table. We may also need to traverse chains of rows stored in the hash table for a single key in order to retrieve the build-side payloads for all matching rows.

 

Of particular interest is looking at the average number of matches in the context of inequality joins. Hash join can perform a join operation with a predicate that mixes key equalities with key inequalities. The simplest implementation uses a hash table for equalities and then traverses potentially many matches coming from it, evaluating inequalities for each of them. The usefulness of such an implementation depends on the cost of traversal of chains of matches for a given key and materialization of their data for the purpose of residual predicate evaluation. It may be acceptable for high selectivity residual predicates (high percentage of candidate matching rows pass the predicate) but not for low selectivity ones (most of the matching rows are false positives rejected by subsequent inequality tests).

 

Combine with 1.

6. Payload size

Size of payload columns affects the cost of data movement when partitioning, copying or fetching rows from a hash table. It seems reasonable to look at varying payload sizes in the range of perhaps 0 to 100 bytes. 

 

Combine with 1. and 5.

7. Degree of parallelism

Probe side should scale very well with increasing number of executing threads, since (almost) all data structures, including the hash table, are read-only during probe side processing and do not require any synchronization between the threads. Also, all threads share the same read-only copy of data, which means that they use CPU cache during probe side processing in an efficient way (without competing needs across threads executing hash join together). 

 

Build side is of particular interest with respect to degree of parallelism, since the hash table build process is not trivially parallelizable.

 

Combine with: 1. and 2.

8. Dictionaries

Arrow data format supports dictionary encoded values (there is an array of values called a dictionary and another array storing only integer ids pointing to values in the dictionary). There is an opportunity to process data faster in hash join in the presence of dictionary encoded key columns, since a dictionary already accomplishes some of the work that hash join needs to do when mapping duplicate keys to the same hash table entry. The most meaningful scenario for hash join involves normalized dictionaries - dictionaries with only unique values. It is interesting to compare hash join performance with the same key column when using a normalized dictionary and when not using it. There are 4 interesting cases: no dictionaries, dictionary on build side but not on probe side, dictionary on probe side but not on build side, (different) dictionaries on both sides.

 

Combine with 1. and 2.

9. [Optional] Distribution of key frequencies

Tests involving the dimensions above should be performed on key columns with a uniform distribution for the most meaningful results. Real-world data, however, often does not follow a uniform distribution and is closer to some form of exponential distribution. Although it is not commonly implemented, it is possible to add to hash join performance optimizations specifically targeting probe-side keys with very different frequencies. Keeping the most frequent keys and their payloads together in memory should change the cache hit ratio for hash tables larger than the CPU cache.

 

Combine with 1.

Separating build side and probe side costs

In the course of hash join processing, the build side and the probe side are typically processed in separate phases: the join starts with a hash table build that involves only the build side, then switches to hash table lookups for all rows on the probe side, and finally, optionally, scans the hash table (outputting hash table rows with any matches, or with no matches).

Most dimensions of hash join performance testing mentioned in this document affect both sides of the join, but the challenges for the build side and the probe side can be quite different. When testing, it is easy to vary the relative size (in number of rows) of the build side vs the probe side. Generating 10x more rows on the probe side will make almost all of the build-side processing overshadowed by the probe-side processing cost. Symmetrically, testing with 10x fewer rows on the probe side will mostly show only the build-side processing costs. There is no need to vary the relative sizes between these two extremes, since the costs in between should, with high probability, follow a linear combination of the costs at the extremes.


Weston Pace / @westonpace:
Issue resolved by pull request #11876
