benchmark: investigate interleaved table join performance #20586

Closed
richardwu opened this issue Dec 9, 2017 · 6 comments
Labels: C-investigation (further steps needed to qualify; label will change), C-performance (perf of queries or internals; solution not expected to change functional behavior)

@richardwu (Contributor) commented Dec 9, 2017

Outline

In #19853, full planning support for interleaved table joins was introduced. We would like to quantify the performance improvement from the new InterleavedReaderJoiner in the distributed execution engine.

There are four benchmark scenarios:

  1. Joins on non-interleaved tables (master)
  2. Joins on interleaved tables (master) - baseline
  3. Joins on non-interleaved tables (w/ patch)
  4. Joins on interleaved tables (w/ patch) - introduces InterleavedReaderJoiner

Our expectations are:

  • (1) and (3) results should not differ
  • (4) results should be strictly better than (2)
  • (4) results should be comparable to, if not better than, (1)/(3)

Configuration

The most recent HEAD on master (d4f3b14) was used as the baseline. The patch was rebased on top of this HEAD.

Four identical nodes were deployed using roachprod. The machine specs were:

  • 4 CPUs
  • 15GB RAM
  • GCE machine type: n1-standard-4
  • local SSDs

The cluster was launched on nodes 1-3 with each binary (master and patch) and with a default replication factor of 3. The following notable flags were used to start Cockroach (via the roachperf utility):

--cache=25%
--max-sql-memory=25%
# Write to local SSD
--store=path=/mnt/data1/cockroach

The interleave benchmark ran on the 4th node and targeted node 1 (gateway). This benchmark configured a hierarchy of tables with a specified number of rows each:

merchants
    products
        variants
    stores

Concurrency was set to 2 * runtime.NumCPU() = 8 (i.e. 8 workers concurrently executed the same query). The random seed used to generate the data was left at the default value of 42. All sub-cases were run at least twice for 60s (only results from two trials are shown below), and the data directories were wiped between sub-cases.
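
For context, an interleaved hierarchy like the one above is declared with INTERLEAVE IN PARENT clauses, where each child table's primary key is prefixed by its parent's primary key. A minimal sketch of such a schema (the m_id/p_m_id/s_m_id/v_m_id column names come from the join predicates used below; the remaining columns and exact definitions are assumptions, not copied from the benchmark source):

CREATE TABLE merchant (m_id INT PRIMARY KEY);

CREATE TABLE product (
    p_m_id INT,
    p_id INT,
    PRIMARY KEY (p_m_id, p_id)
) INTERLEAVE IN PARENT merchant (p_m_id);

-- Grandchild: interleaved in product, so its key is prefixed by (merchant, product).
CREATE TABLE variant (
    v_m_id INT,
    v_p_id INT,
    v_id INT,
    PRIMARY KEY (v_m_id, v_p_id, v_id)
) INTERLEAVE IN PARENT product (v_m_id, v_p_id);

CREATE TABLE store (
    s_m_id INT,
    s_id INT,
    PRIMARY KEY (s_m_id, s_id)
) INTERLEAVE IN PARENT merchant (s_m_id);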

Cases

We benchmarked the four scenarios with different queries and table sizes.

Case 1 (Simple)

SELECT * FROM merchant JOIN product ON m_id = p_m_id

--merchants=1000
--products=100000

SHOW TESTING_RANGES FROM TABLE product
+-----------+---------+----------+----------+--------------+
| Start Key | End Key | Range ID | Replicas | Lease Holder |
+-----------+---------+----------+----------+--------------+
| NULL      | NULL    |       32 | {1,2,3}  |            3 |
+-----------+---------+----------+----------+--------------+

(1) non-interleaved (master)

# Run 1
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         328             5.5   1442.8   1476.4   1543.5   1610.6   1610.6

# Run 2
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         328             5.5   1456.3   1476.4   1543.5   1543.5   1610.6

(2) interleaved (master)

# Run 1
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         240             4.0   1975.4   2013.3   2147.5   2281.7   2281.7

# Run 2
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         240             4.0   1962.9   2013.3   2147.5   2147.5   2281.7

(3) non-interleaved (patch)

# Run 1
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         321             5.3   1470.7   1543.5   1610.6   1610.6   1677.7

# Run 2
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         327             5.4   1461.4   1543.5   1610.6   1610.6   1677.7

(4) interleaved (patch)

# Run 1
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         317             5.3   1493.1   1543.5   1677.7   1811.9   1811.9

# Run 2
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         318             5.3   1495.4   1543.5   1610.6   1677.7   1677.7

Summary of Case 1

  • Our assumption that (1) and (3) should not differ (after correcting for noise) held true
  • (4) performed comparably to (1) - there was no trade-off with interleaving
  • (4) with the patch improved interleaved-table join throughput by 19.5% and decreased latency by ~15%

Since all the data fit on one node (although the gateway node (node 1) was not the leaseholder (node 3)), the performance gain can be attributed to (4) performing a single scan of the interleaved hierarchy where (2) performs two separate scans.
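
To see why a single scan suffices, recall how interleaving lays out the key space: each product row is stored physically adjacent to its parent merchant row (a schematic based on the interleaved range boundaries shown in Case 2 below, e.g. /11785/#/52/1/2211785; the exact key encoding here is illustrative):

/merchant/<m_id>                    -- merchant row
/merchant/<m_id>/#/product/<p_id>   -- that merchant's product rows
/merchant/<next m_id>               -- next merchant row, and so on

One contiguous scan therefore produces both sides of the join in primary-key order, whereas the old interleaved plan (2) issues two separate scans over the same data.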

We also see that joins on interleaved tables performed comparably to joins on non-interleaved tables (although (4) had slightly higher tail latencies, which could be empirical noise).

Case 2 (Multiple ranges)

We tried to structure the query such that it touched all three nodes and overlapped splits in both the non-interleaved and interleaved scenarios.

SELECT COUNT(*) FROM merchant JOIN product ON m_id = p_m_id WHERE (m_id >= 10000 AND m_id <= 14000) OR (m_id >= 70000 AND m_id <= 75000)

--merchants=100000
--products=3000000

##########################
# Non-interleaved tables #
##########################

# No range splits: table < 64MB
SHOW TESTING_RANGES FROM TABLE merchant
+-----------+---------+----------+----------+--------------+
| Start Key | End Key | Range ID | Replicas | Lease Holder |
+-----------+---------+----------+----------+--------------+
| NULL      | NULL    |       19 | {1,2,3}  |            1 |
+-----------+---------+----------+----------+--------------+

ALTER TABLE product TESTING_RELOCATE
   SELECT ARRAY[((i/12000 - 1)%3)::INT+1], i FROM generate_series(12000, 96000, 12000) AS g(i)

# Wait for ranges to replicate fully
SHOW TESTING_RANGES FROM TABLE product
+----------------+----------------+----------+----------+--------------+
|   Start Key    |    End Key     | Range ID | Replicas | Lease Holder |
+----------------+----------------+----------+----------+--------------+
| NULL           | /12061/512061  |       20 | {1,2,3}  |            1 |
| /12061/512061  | /24126/1124126 |       41 | {1,2,3}  |            2 |
| /24126/1124126 | /36183/936183  |       24 | {1,2,3}  |            3 |
| /36183/936183  | /48284/348284  |       28 | {1,2,3}  |            1 |
| /48284/348284  | /60865/1160865 |       23 | {1,2,3}  |            2 |
| /60865/1160865 | /73911/673911  |       27 | {1,2,3}  |            3 |
| /73911/673911  | /86947/486947  |       25 | {1,2,3}  |            1 |
| /86947/486947  | NULL           |       26 | {1,2,3}  |            2 |
+----------------+----------------+----------+----------+--------------+
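
To make the round-robin placement explicit, the relocation expression maps each split point i to a target store cycling through 1, 2, 3 (a worked expansion for illustration, not output captured during the run):

SELECT i, ((i/12000 - 1)%3)::INT+1 AS target_store
  FROM generate_series(12000, 96000, 12000) AS g(i)
-- 12000 -> 1, 24000 -> 2, 36000 -> 3, 48000 -> 1, ..., 96000 -> 2

This matches the leaseholder pattern 1, 2, 3, 1, 2, 3, 1, 2 in the SHOW TESTING_RANGES output above.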

######################
# Interleaved tables #
######################

ALTER TABLE product TESTING_RELOCATE
   SELECT ARRAY[(((i-11000)/12000)%3)::INT+1], i FROM generate_series(11000, 96000, 12000) AS g(i)

# Wait for ranges to replicate fully
SHOW TESTING_RANGES FROM TABLE (merchant|product)
+-----------------------+-----------------------+----------+----------+--------------+
|       Start Key       |        End Key        | Range ID | Replicas | Lease Holder |
+-----------------------+-----------------------+----------+----------+--------------+
| NULL                  | /11785/#/52/1/2211785 |       19 | {1,2,3}  |            1 |
| /11785/#/52/1/2211785 | /23634/#/52/1/223634  |       27 | {1,2,3}  |            2 |
| /23634/#/52/1/223634  | /35415/#/52/1/2035415 |       25 | {1,2,3}  |            3 |
| /35415/#/52/1/2035415 | /47532/#/52/1/47532   |       28 | {1,2,3}  |            1 |
| /47532/#/52/1/47532   | /60388/#/52/1/260388  |       23 | {1,2,3}  |            2 |
| /60388/#/52/1/260388  | /73252/#/52/1/173252  |       26 | {1,2,3}  |            3 |
| /73252/#/52/1/173252  | /86604/#/52/1/886604  |       24 | {1,2,3}  |            1 |
| /86604/#/52/1/886604  | NULL                  |       41 | {1,2,3}  |            2 |
+-----------------------+-----------------------+----------+----------+--------------+

(1) non-interleaved (master)

# Run 1
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         208             3.5   2248.3   2281.7   2415.9   2550.1   2684.4

# Run 2
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         211             3.5   2237.5   2415.9   2550.1   2684.4   2684.4

(2) interleaved (master)

# Run 1
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         186             3.1   2494.8   2550.1   2952.8   3221.2   3221.2

# Run 2
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         184             3.1   2557.6   2684.4   3087.0   3355.4   3623.9

(3) non-interleaved (patch)

# Run 1
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         209             3.5   2226.5   2415.9   2550.1   2684.4   2952.8

# Run 2
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         214             3.6   2218.5   2281.7   2550.1   2684.4   3087.0

(4) interleaved (patch)

# Run 1
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         365             6.1   1300.8   1342.2   1543.5   1543.5   1677.7

# Run 2
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         364             6.1   1305.2   1342.2   1543.5   1610.6   1744.8

Summary of Case 2

  • (1) ~ (3) held
  • Huge improvement from (4) compared to (1) and (2)
    • vs (1): 74% increase in throughput and ~38% decrease in latency
    • vs (2): 97% increase in throughput and ~48% decrease in latency

The patch yielded tremendous performance improvements (+97% throughput, -48% latency) for joins on interleaved tables. This can be attributed to performing one scan instead of two and to fewer node-to-node RPC calls, since joins are localized. We can diff this improvement against the Case 1 results to approximate the improvement from the localized joins within InterleavedReaderJoiner.

We see that interleaved joins were also much better than non-interleaved joins (+74% throughput, -38% latency), which is expected and desired because of data locality. In fact, the non-interleaved case was rather pessimistic (for the interleaved joins) since the entire merchant table was on the gateway node: we should expect interleaved joins to perform even better in the average and optimal cases.

Case 3 (Multiple ranges + grandchild rows)

Case 2 verified the most optimistic scenario, where we join on all tables in the interleaved hierarchy.

Case 3 verified a pessimistic case and an "average" case, where we join on tables comprising a minority and a majority of the data in the interleaved hierarchy, respectively.

# Pessimistic: Join on approximately 3% of the hierarchy (merchant + store = 120,000 of 3,820,000 total rows)
SELECT COUNT(*) FROM merchant JOIN store ON m_id = s_m_id
# Average: Join on 1.6% but across multiple nodes
SELECT COUNT(*) FROM merchant JOIN variant ON m_id = v_m_id WHERE (m_id >= 10000 AND m_id <= 14000) OR (m_id >= 35000 AND m_id <= 40000)

--merchants=50000
--products=200000
--variants=3500000
--stores=70000

##########################
# Non-interleaved tables #
##########################

# Note: we made this case even more pessimistic by locating all single-range tables on the gateway node

SHOW TESTING_RANGES FROM TABLE merchant
+-----------+---------+----------+----------+--------------+
| Start Key | End Key | Range ID | Replicas | Lease Holder |
+-----------+---------+----------+----------+--------------+
| NULL      | NULL    |       31 | {1,2,3}  |            1 |
+-----------+---------+----------+----------+--------------+

SHOW TESTING_RANGES FROM TABLE product
+-----------+---------+----------+----------+--------------+
| Start Key | End Key | Range ID | Replicas | Lease Holder |
+-----------+---------+----------+----------+--------------+
| NULL      | NULL    |       32 | {1,2,3}  |            1 |
+-----------+---------+----------+----------+--------------+

SHOW TESTING_RANGES FROM TABLE store;
+-----------+---------+----------+----------+--------------+
| Start Key | End Key | Range ID | Replicas | Lease Holder |
+-----------+---------+----------+----------+--------------+
| NULL      | NULL    |       34 | {1,2,3}  |            1 |
+-----------+---------+----------+----------+--------------+

ALTER TABLE variant TESTING_RELOCATE
    SELECT ARRAY[(((i/6000)-1)%3)::INT+1], i FROM generate_series(6000, 48000, 6000) as g(i)

SHOW TESTING_RANGES FROM TABLE variant;
+-----------------------+-----------------------+----------+----------+--------------+
|       Start Key       |        End Key        | Range ID | Replicas | Lease Holder |
+-----------------------+-----------------------+----------+----------+--------------+
| NULL                  | /6144/56144/656144    |       33 | {1,2,3}  |            1 |
| /6144/56144/656144    | /12337/162337/362337  |       41 | {1,2,3}  |            2 |
| /12337/162337/362337  | /18615/68615/2268615  |       37 | {1,2,3}  |            3 |
| /18615/68615/2268615  | /24901/74901/474901   |       38 | {1,2,3}  |            1 |
| /24901/74901/474901   | /31168/181168/781168  |       35 | {1,2,3}  |            2 |
| /31168/181168/781168  | /37438/137438/937438  |       39 | {1,2,3}  |            3 |
| /37438/137438/937438  | /43705/193705/2993705 |       36 | {1,2,3}  |            1 |
| /43705/193705/2993705 | NULL                  |       40 | {1,2,3}  |            2 |
+-----------------------+-----------------------+----------+----------+--------------+

######################
# Interleaved tables #
######################

ALTER TABLE variant TESTING_RELOCATE                                 
    SELECT ARRAY[(((i-5000)/6000)%3)::INT+1], i FROM generate_series(5000, 47000, 6000) AS g(i)

SHOW TESTING_RANGES FROM TABLE (merchant|product|variant|store)
+-------------------------------------+-------------------------------------+----------+----------+--------------+
|              Start Key              |               End Key               | Range ID | Replicas | Lease Holder |
+-------------------------------------+-------------------------------------+----------+----------+--------------+
| NULL                                | /5822/#/54/1/55822                  |       19 | {1,2,3}  |            1 |
| /5822/#/54/1/55822                  | /11672/#/52/1/111672/#/53/1/111672  |       41 | {1,2,3}  |            2 |
| /11672/#/52/1/111672/#/53/1/111672  | /17514/#/54/1/17514                 |       25 | {1,2,3}  |            3 |
| /17514/#/54/1/17514                 | /23478/#/52/1/173478/#/53/1/173478  |       28 | {1,2,3}  |            1 |
| /23478/#/52/1/173478/#/53/1/173478  | /29973/#/52/1/79973/#/53/1/1079973  |       23 | {1,2,3}  |            2 |
| /29973/#/52/1/79973/#/53/1/1079973  | /36557/#/52/1/186557/#/53/1/986557  |       27 | {1,2,3}  |            3 |
| /36557/#/52/1/186557/#/53/1/986557  | /43257/#/52/1/193257/#/53/1/1593257 |       24 | {1,2,3}  |            1 |
| /43257/#/52/1/193257/#/53/1/1593257 | NULL                                |       26 | {1,2,3}  |            2 |
+-------------------------------------+-------------------------------------+----------+----------+--------------+



(1) non-interleaved (master)

### Pessimistic
# Run 1
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         454             7.6   1046.8   1073.7   1476.4   1543.5   1677.7

# Run 2
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         449             7.5   1057.5   1073.7   1476.4   1677.7   1879.0

### Average
# Run 1
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s          87             1.4   5373.3   5637.1   5905.6   6174.0   6174.0

# Run 2
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s          81             1.3   5509.6   5637.1   5905.6   6174.0   6710.9

(2) interleaved (master)

### Pessimistic 
# Run 1
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s          25             0.4  15236.4  15569.3  16643.0  16643.0  16643.0

# Run 2
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s          25             0.4  15311.6  15569.3  17179.9  18253.6  18253.6

### Average
# Run 1
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s          64             1.1   7170.2   7247.8   8321.5   8589.9   9126.8

# Run 2
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s          64             1.1   7126.1   7516.2   8053.1   8321.5   8321.5

(3) non-interleaved (patch)

### Pessimistic
# Run 1
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         446             7.4   1068.7   1140.9   1275.1   1409.3   1610.6

# Run 2
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         442             7.4   1074.3   1140.9   1275.1   1476.4   1610.6

### Average
# Run 1
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s          86             1.4   5364.0   5637.1   5905.6   5905.6   5905.6

# Run 2
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s          82             1.4   5513.6   5637.1   6174.0   6174.0   6174.0

(4) interleaved (patch)

### Pessimistic
# Run 1
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s          56             0.9   7612.1   8053.1   8321.5   8589.9   8589.9

# Run 2
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s          56             0.9   7923.6   8053.1   8589.9   8589.9   8589.9

### Average
# Run 1
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         122             2.0   3810.4   3892.3   4160.7   4295.0   4563.4

# Run 2
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         121             2.0   3852.9   3892.3   4160.7   4295.0   4295.0

Summary of Case 3

  • There was some deviation between (1) and (3), but this does not influence our conclusion (see below)
  • Non-interleaved tables performed far better if the tables being joined were small enough to fit on the same nodes (pessimistic case)
  • Interleaved tables with the improved join logic performed the best in an "average case" where joins occurred on all nodes, even if it was only on a small proportion of the hierarchy
    • vs (1): 45% increase in throughput and ~30% decrease in latency
    • vs (2): 90% increase in throughput and ~50% decrease in latency

Intuitively, interleaved tables are not recommended over non-interleaved tables if the commonly joined tables (merchant and store in the pessimistic case) can both fit on one range or one node and make up a small proportion of the interleaved hierarchy.

Also, interleaved tables are great if the join happens on tables that make up a majority of the hierarchy (merchant and variant). That said, joining on interleaved tables was still more efficient than joining on non-interleaved tables even when the join touched only a very small minority of the two tables (average case). This implies that the actual overhead of scanning the entire interleaved hierarchy was relatively insignificant compared to the improved locality and reduced RPC traffic that came with the InterleavedReaderJoiner.

Case 4 (Multiple ranges + 70ms latency)

We used the comcast utility to simulate 70ms trans-Atlantic latencies.

# Same query and table sizes as case 2

# 35ms * 2 latency (node-to-node)
./comcast --device=ens5 --latency=35

tc -s qdisc
qdisc noqueue 0: dev lo root refcnt 2
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc htb 10: dev ens5 root refcnt 5 r2q 10 default 1 direct_packets_stat 3 direct_qlen 1000
 Sent 228625 bytes 262 pkt (dropped 0, overlimits 476 requeues 0)
 backlog 461b 4p requeues 0
qdisc netem 100: dev ens5 parent 10:10 limit 1000 delay 35.0ms
 Sent 226653 bytes 248 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 461b 4p requeues 0


# Same table ranges as case 2

(1) non-interleaved (master)

# Run 1
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         209             3.5   2239.3   2415.9   2550.1   2550.1   2550.1

# Run 2
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         207             3.4   2304.6   2415.9   2550.1   2550.1   2818.6

(2) interleaved (master)

# Run 1
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         180             3.0   2599.3   2684.4   2818.6   3221.2   3221.2

# Run 2
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         183             3.0   2530.2   2684.4   2952.8   3221.2   3355.4

(3) non-interleaved (patch)

# Run 1
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         216             3.6   2184.1   2281.7   2550.1   2550.1   2684.4

# Run 2
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         217             3.6   2195.9   2415.9   2550.1   2684.4   2684.4

(4) interleaved (patch)

# Run 1
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         365             6.1   1301.1   1342.2   1476.4   1543.5   1543.5

# Run 2
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         366             6.1   1301.6   1342.2   1476.4   1543.5   1543.5

Summary of Case 4

  • Both non-interleaved and interleaved performed similarly to Case 2 with a 70ms latency overhead
    • This was expected since each query took ~1800ms: a 70ms overhead was proportionally small
  • Latencies sometimes decreased from Case 2 to Case 4; we attribute this to noise
  • Regardless, (4) still trumped all
    • vs (1): 75% increase in throughput and ~42% decrease in latency
    • vs (2): 101% increase in throughput and ~45% decrease in latency

Perhaps shorter/smaller queries would show more interesting results.

Case 5 (Multiple ranges + 160ms latency)

We simulated 160ms Southeast Asia to US latencies.

# Same query and table sizes as case 2

# 80ms * 2 latency (node-to-node)
./comcast --device=ens5 --latency=80

tc -s qdisc
qdisc noqueue 0: dev lo root refcnt 2
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc htb 10: dev ens5 root refcnt 5 r2q 10 default 1 direct_packets_stat 3 direct_qlen 1000
 Sent 3589252 bytes 12452 pkt (dropped 0, overlimits 32607 requeues 0)
 backlog 102b 1p requeues 0
qdisc netem 100: dev ens5 parent 10:10 limit 1000 delay 80.0ms
 Sent 3587088 bytes 12434 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 102b 1p requeues 0


# Same table ranges as case 2

(1) non-interleaved (master)

# Run 1
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         213             3.5   2250.5   2415.9   2684.4   2684.4   2684.4

# Run 2
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         209             3.5   2235.5   2415.9   2684.4   2684.4   2684.4

(2) interleaved (master)

# Run 1
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         179             3.0   2676.9   2818.6   2952.8   2952.8   3087.0

# Run 2
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         173             2.9   2681.1   2818.6   3087.0   3087.0   3087.0

(3) non-interleaved (patch)

# Run 1
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         215             3.6   2184.9   2281.7   2550.1   2684.4   2818.6

# Run 2
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         209             3.5   2237.4   2415.9   2550.1   2684.4   2818.6

(4) interleaved (patch)

# Run 1
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         339             5.6   1393.0   1409.3   1677.7   1677.7   1744.8

# Run 2
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         340             5.7   1397.6   1409.3   1610.6   1677.7   1744.8

Summary of Case 5

  • Interleaved joins did take a hit compared to Case 4, but they still performed far better than non-interleaved joins
    • vs (1): 61% increase in throughput and ~40% decrease in latency
    • vs (2): 93% increase in throughput and ~45% decrease in latency

Future Work

  • Outer joins (not yet implemented with interleaved table joins)
  • Larger hierarchies
  • Trade-offs with point lookups or range scans on descendant tables
  • More complex queries (with aggregations, grouping, multiple joins)
  • Experiment with other network clamps (e.g. packet loss, throttling)
richardwu added the C-investigation and C-performance labels on Dec 9, 2017
@richardwu (Contributor, Author) commented:

cc: @RaduBerinde @arjunravinarayan @petermattis

@petermattis (Collaborator) commented:

@richardwu This is a fantastic write-up. One nit: you use the future tense ("we will") a lot even though you've already done the work.

I wonder why interleaved tables do not have the same performance as non-interleaved tables for Case 1. The performance is only 10% worse, but that's still surprising. There might be something fairly simple that explains it, such as slightly more memory allocations or different memory accounting. For example, the inner loop of interleavedReaderJoiner might be allocating a postProcHelper on every row, and that looks easy to fix.

@petermattis (Collaborator) commented:

@@ -242,9 +242,10 @@ func (irj *interleavedReaderJoiner) Run(ctx context.Context, wg *sync.WaitGroup)
                        break
                }

-               var helper postProcHelper
+               var helper *postProcHelper
                helperIdx := -1
-               for i, h := range irj.tablePosts {
+               for i := range irj.tablePosts {
+                       h := &irj.tablePosts[i]
                        if desc.ID == h.tableID && index.ID == h.indexID {
                                helper = h
                                helperIdx = i

Note that sizeof(postProcHelper) == 928. So in addition to avoiding an allocation, this would avoid copying those bytes on every iteration of the for-loop.

@richardwu (Contributor, Author) commented:

Looks like you were right: interleaved join performance in Case 1 is now comparable to non-interleaved tables, and the results from the other cases also improved.

@petermattis (Collaborator) commented:

@richardwu Not sure if you have the time in your last week, but it would be useful to add Go benchmarks for mergeJoiner and interleavedReaderJoiner. Even better if these benchmarks use the same schema as BenchmarkHashJoiner so that the numbers are comparable.

@rjnn (Contributor) commented Dec 10, 2017

This is an excellent write-up! Far more thorough than I expected from our discussion! 🎉

Just one thing: please merge the interleave benchmark! It's an open PR now, and if you need anything unblocked that's currently blocking the merge (nothing, AFAIK, from our discussion on Thursday), let me know ASAP.
