benchmark: investigate interleaved table join performance #20586

Closed
richardwu opened this issue Dec 9, 2017 · 6 comments
Labels: C-investigation (further steps needed to qualify; label will change), C-performance (perf of queries or internals; solution not expected to change functional behavior)

@richardwu (Contributor) commented Dec 9, 2017

Outline

In #19853, full planning support for interleaved table joins was introduced. We would like to quantify the performance improvement from the new InterleavedReaderJoiner in the distributed execution engine.

There are four benchmark scenarios:

  1. Joins on non-interleaved tables (master)
  2. Joins on interleaved tables (master) - baseline
  3. Joins on non-interleaved tables (w/ patch)
  4. Joins on interleaved tables (w/ patch) - introduces InterleavedReaderJoiner

Our expectations are:

  • (1) and (3) results should not differ
  • (4) results should be strictly better than (2)
  • (4) results should be comparable to, if not better than, (1)/(3)

Configuration

The most recent HEAD on master (d4f3b14) was used as the baseline. The patch was rebased on top of this HEAD.

Four identical nodes were deployed using roachprod. The machine specs were:

  • 4 CPUs
  • 15GB RAM
  • GCE machine type: n1-standard-4
  • local SSDs

The cluster was launched on nodes 1-3 with each binary (master and patch) and with a default replication factor of 3. The following notable flags were used to start Cockroach (via the roachperf utility):

--cache=25%
--max-sql-memory=25%
# Write to local SSD
--store=path=/mnt/data1/cockroach

The interleave benchmark ran on the 4th node and targeted node 1 (gateway). This benchmark configured a hierarchy of tables with a specified number of rows each:

merchants
    products
        variants
    stores

Concurrency was set to 2 * runtime.NumCPU() = 8 (i.e. 8 workers concurrently executed the same query). The random seed used to generate the data was left at the default value of 42. All sub-cases were run at least twice for 60s (only results from two trials are shown below), and the data directories were wiped between sub-cases.
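
For context, an interleaved hierarchy like the one above is declared with INTERLEAVE IN PARENT clauses, where each child table's primary key is prefixed by its parent's primary key. A minimal sketch of such a schema (the m_id/p_m_id/s_m_id/v_m_id column names come from the join predicates used below; the remaining columns and exact definitions are assumptions, not copied from the benchmark source):

CREATE TABLE merchant (m_id INT PRIMARY KEY);

CREATE TABLE product (
    p_m_id INT,
    p_id INT,
    PRIMARY KEY (p_m_id, p_id)
) INTERLEAVE IN PARENT merchant (p_m_id);

-- Grandchild: interleaved in product, so its key is prefixed by (merchant, product).
CREATE TABLE variant (
    v_m_id INT,
    v_p_id INT,
    v_id INT,
    PRIMARY KEY (v_m_id, v_p_id, v_id)
) INTERLEAVE IN PARENT product (v_m_id, v_p_id);

CREATE TABLE store (
    s_m_id INT,
    s_id INT,
    PRIMARY KEY (s_m_id, s_id)
) INTERLEAVE IN PARENT merchant (s_m_id);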

Cases

We benchmarked the four scenarios with different queries and table sizes.

Case 1 (Simple)

SELECT * FROM merchant JOIN product ON m_id = p_m_id

--merchants=1000
--products=100000

SHOW TESTING_RANGES FROM TABLE product
+-----------+---------+----------+----------+--------------+
| Start Key | End Key | Range ID | Replicas | Lease Holder |
+-----------+---------+----------+----------+--------------+
| NULL      | NULL    |       32 | {1,2,3}  |            3 |
+-----------+---------+----------+----------+--------------+

(1) non-interleaved (master)

# Run 1
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         328             5.5   1442.8   1476.4   1543.5   1610.6   1610.6

# Run 2
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         328             5.5   1456.3   1476.4   1543.5   1543.5   1610.6

(2) interleaved (master)

# Run 1
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         240             4.0   1975.4   2013.3   2147.5   2281.7   2281.7

# Run 2
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         240             4.0   1962.9   2013.3   2147.5   2147.5   2281.7

(3) non-interleaved (patch)

# Run 1
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         321             5.3   1470.7   1543.5   1610.6   1610.6   1677.7

# Run 2
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         327             5.4   1461.4   1543.5   1610.6   1610.6   1677.7

(4) interleaved (patch)

# Run 1
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         317             5.3   1493.1   1543.5   1677.7   1811.9   1811.9

# Run 2
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         318             5.3   1495.4   1543.5   1610.6   1677.7   1677.7

Summary of Case 1

  • Our assumption that (1) and (3) should not differ (after correcting for noise) held true
  • (4) performed comparably to (1) - there was no trade-off with interleaving
  • (4) with the patch improved interleaved-table join throughput by 19.5% and decreased latency by ~15%

Since all the data fit on one node (although the gateway node (node 1) was not the leaseholder (node 3)), the performance gain can be attributed to (4) performing a single scan of the interleaved hierarchy where (2) performs two separate scans.
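
To see why a single scan suffices, recall how interleaving lays out the key space: each product row is stored physically adjacent to its parent merchant row (a schematic based on the interleaved range boundaries shown in Case 2 below, e.g. /11785/#/52/1/2211785; the exact key encoding here is illustrative):

/merchant/<m_id>                    -- merchant row
/merchant/<m_id>/#/product/<p_id>   -- that merchant's product rows
/merchant/<next m_id>               -- next merchant row, and so on

One contiguous scan therefore produces both sides of the join in primary-key order, whereas the old interleaved plan (2) issues two separate scans over the same data.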

We also see that joins on interleaved tables performed comparably to joins on non-interleaved tables (although (4) had slightly higher tail latencies, which could be empirical noise).

Case 2 (Multiple ranges)

We tried to structure the query such that it touched all three nodes and overlapped splits in both the non-interleaved and interleaved scenarios.

SELECT COUNT(*) FROM merchant JOIN product ON m_id = p_m_id WHERE (m_id >= 10000 AND m_id <= 14000) OR (m_id >= 70000 AND m_id <= 75000)

--merchants=100000
--products=3000000

##########################
# Non-interleaved tables #
##########################

# No range splits: table < 64MB
SHOW TESTING_RANGES FROM TABLE merchant
+-----------+---------+----------+----------+--------------+
| Start Key | End Key | Range ID | Replicas | Lease Holder |
+-----------+---------+----------+----------+--------------+
| NULL      | NULL    |       19 | {1,2,3}  |            1 |
+-----------+---------+----------+----------+--------------+

ALTER TABLE product TESTING_RELOCATE
   SELECT ARRAY[((i/12000 - 1)%3)::INT+1], i FROM generate_series(12000, 96000, 12000) AS g(i)

# Wait for ranges to replicate fully
SHOW TESTING_RANGES FROM TABLE product
+----------------+----------------+----------+----------+--------------+
|   Start Key    |    End Key     | Range ID | Replicas | Lease Holder |
+----------------+----------------+----------+----------+--------------+
| NULL           | /12061/512061  |       20 | {1,2,3}  |            1 |
| /12061/512061  | /24126/1124126 |       41 | {1,2,3}  |            2 |
| /24126/1124126 | /36183/936183  |       24 | {1,2,3}  |            3 |
| /36183/936183  | /48284/348284  |       28 | {1,2,3}  |            1 |
| /48284/348284  | /60865/1160865 |       23 | {1,2,3}  |            2 |
| /60865/1160865 | /73911/673911  |       27 | {1,2,3}  |            3 |
| /73911/673911  | /86947/486947  |       25 | {1,2,3}  |            1 |
| /86947/486947  | NULL           |       26 | {1,2,3}  |            2 |
+----------------+----------------+----------+----------+--------------+
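
To make the round-robin placement explicit, the relocation expression maps each split point i to a target store cycling through 1, 2, 3 (a worked expansion for illustration, not output captured during the run):

SELECT i, ((i/12000 - 1)%3)::INT+1 AS target_store
  FROM generate_series(12000, 96000, 12000) AS g(i)
-- 12000 -> 1, 24000 -> 2, 36000 -> 3, 48000 -> 1, ..., 96000 -> 2

This matches the leaseholder pattern 1, 2, 3, 1, 2, 3, 1, 2 in the SHOW TESTING_RANGES output above.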

######################
# Interleaved tables #
######################

ALTER TABLE product TESTING_RELOCATE
   SELECT ARRAY[(((i-11000)/12000)%3)::INT+1], i FROM generate_series(11000, 96000, 12000) AS g(i)

# Wait for ranges to replicate fully
SHOW TESTING_RANGES FROM TABLE (merchant|product)
+-----------------------+-----------------------+----------+----------+--------------+
|       Start Key       |        End Key        | Range ID | Replicas | Lease Holder |
+-----------------------+-----------------------+----------+----------+--------------+
| NULL                  | /11785/#/52/1/2211785 |       19 | {1,2,3}  |            1 |
| /11785/#/52/1/2211785 | /23634/#/52/1/223634  |       27 | {1,2,3}  |            2 |
| /23634/#/52/1/223634  | /35415/#/52/1/2035415 |       25 | {1,2,3}  |            3 |
| /35415/#/52/1/2035415 | /47532/#/52/1/47532   |       28 | {1,2,3}  |            1 |
| /47532/#/52/1/47532   | /60388/#/52/1/260388  |       23 | {1,2,3}  |            2 |
| /60388/#/52/1/260388  | /73252/#/52/1/173252  |       26 | {1,2,3}  |            3 |
| /73252/#/52/1/173252  | /86604/#/52/1/886604  |       24 | {1,2,3}  |            1 |
| /86604/#/52/1/886604  | NULL                  |       41 | {1,2,3}  |            2 |
+-----------------------+-----------------------+----------+----------+--------------+

(1) non-interleaved (master)

# Run 1
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         208             3.5   2248.3   2281.7   2415.9   2550.1   2684.4

# Run 2
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         211             3.5   2237.5   2415.9   2550.1   2684.4   2684.4

(2) interleaved (master)

# Run 1
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         186             3.1   2494.8   2550.1   2952.8   3221.2   3221.2

# Run 2
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         184             3.1   2557.6   2684.4   3087.0   3355.4   3623.9

(3) non-interleaved (patch)

# Run 1
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         209             3.5   2226.5   2415.9   2550.1   2684.4   2952.8

# Run 2
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         214             3.6   2218.5   2281.7   2550.1   2684.4   3087.0

(4) interleaved (patch)

# Run 1
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         365             6.1   1300.8   1342.2   1543.5   1543.5   1677.7

# Run 2
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         364             6.1   1305.2   1342.2   1543.5   1610.6   1744.8

Summary of Case 2

  • (1) ~ (3) held
  • Huge improvement from (4) compared to (1) and (2)
    • vs (1): 74% increase in throughput and ~38% decrease in latency
    • vs (2): 97% increase in throughput and ~48% decrease in latency

The patch yielded tremendous performance improvements (+97% throughput, -48% latency) for joins on interleaved tables. This can be attributed to performing one scan instead of two and to fewer node-to-node RPC calls, since joins are localized. We can diff this improvement against the Case 1 results to approximate the improvement from the localized joins within InterleavedReaderJoiner.

We see that interleaved joins were also much better than non-interleaved joins (+74% throughput, -38% latency), which is expected and desired because of data locality. In fact, the non-interleaved case was rather pessimistic (for the interleaved joins) since the entire merchant table was on the gateway node: we should expect interleaved joins to perform even better in the average and optimal cases.

Case 3 (Multiple ranges + grandchild rows)

Case 2 verified the most optimistic scenario, where we join on all tables in the interleaved hierarchy.

Case 3 verified a pessimistic case and an "average" case, where we join on tables comprising a minority and a majority of the data in the interleaved hierarchy, respectively.

# Pessimistic: Join on approximately 3% of the hierarchy (merchant + store = 120,000 of 3,820,000 total rows)
SELECT COUNT(*) FROM merchant JOIN store ON m_id = s_m_id
# Average: Join on 1.6% but across multiple nodes
SELECT COUNT(*) FROM merchant JOIN variant ON m_id = v_m_id WHERE (m_id >= 10000 AND m_id <= 14000) OR (m_id >= 35000 AND m_id <= 40000)

--merchants=50000
--products=200000
--variants=3500000
--stores=70000

##########################
# Non-interleaved tables #
##########################

# Note: we made this case even more pessimistic by locating all single-range tables on the gateway node

SHOW TESTING_RANGES FROM TABLE merchant
+-----------+---------+----------+----------+--------------+
| Start Key | End Key | Range ID | Replicas | Lease Holder |
+-----------+---------+----------+----------+--------------+
| NULL      | NULL    |       31 | {1,2,3}  |            1 |
+-----------+---------+----------+----------+--------------+

SHOW TESTING_RANGES FROM TABLE product
+-----------+---------+----------+----------+--------------+
| Start Key | End Key | Range ID | Replicas | Lease Holder |
+-----------+---------+----------+----------+--------------+
| NULL      | NULL    |       32 | {1,2,3}  |            1 |
+-----------+---------+----------+----------+--------------+

SHOW TESTING_RANGES FROM TABLE store;
+-----------+---------+----------+----------+--------------+
| Start Key | End Key | Range ID | Replicas | Lease Holder |
+-----------+---------+----------+----------+--------------+
| NULL      | NULL    |       34 | {1,2,3}  |            1 |
+-----------+---------+----------+----------+--------------+

ALTER TABLE variant TESTING_RELOCATE
    SELECT ARRAY[(((i/6000)-1)%3)::INT+1], i FROM generate_series(6000, 48000, 6000) as g(i)

SHOW TESTING_RANGES FROM TABLE variant;
+-----------------------+-----------------------+----------+----------+--------------+
|       Start Key       |        End Key        | Range ID | Replicas | Lease Holder |
+-----------------------+-----------------------+----------+----------+--------------+
| NULL                  | /6144/56144/656144    |       33 | {1,2,3}  |            1 |
| /6144/56144/656144    | /12337/162337/362337  |       41 | {1,2,3}  |            2 |
| /12337/162337/362337  | /18615/68615/2268615  |       37 | {1,2,3}  |            3 |
| /18615/68615/2268615  | /24901/74901/474901   |       38 | {1,2,3}  |            1 |
| /24901/74901/474901   | /31168/181168/781168  |       35 | {1,2,3}  |            2 |
| /31168/181168/781168  | /37438/137438/937438  |       39 | {1,2,3}  |            3 |
| /37438/137438/937438  | /43705/193705/2993705 |       36 | {1,2,3}  |            1 |
| /43705/193705/2993705 | NULL                  |       40 | {1,2,3}  |            2 |
+-----------------------+-----------------------+----------+----------+--------------+

######################
# Interleaved tables #
######################

ALTER TABLE variant TESTING_RELOCATE                                 
    SELECT ARRAY[(((i-5000)/6000)%3)::INT+1], i FROM generate_series(5000, 47000, 6000) AS g(i)

SHOW TESTING_RANGES FROM TABLE (merchant|product|variant|store)
+-------------------------------------+-------------------------------------+----------+----------+--------------+
|              Start Key              |               End Key               | Range ID | Replicas | Lease Holder |
+-------------------------------------+-------------------------------------+----------+----------+--------------+
| NULL                                | /5822/#/54/1/55822                  |       19 | {1,2,3}  |            1 |
| /5822/#/54/1/55822                  | /11672/#/52/1/111672/#/53/1/111672  |       41 | {1,2,3}  |            2 |
| /11672/#/52/1/111672/#/53/1/111672  | /17514/#/54/1/17514                 |       25 | {1,2,3}  |            3 |
| /17514/#/54/1/17514                 | /23478/#/52/1/173478/#/53/1/173478  |       28 | {1,2,3}  |            1 |
| /23478/#/52/1/173478/#/53/1/173478  | /29973/#/52/1/79973/#/53/1/1079973  |       23 | {1,2,3}  |            2 |
| /29973/#/52/1/79973/#/53/1/1079973  | /36557/#/52/1/186557/#/53/1/986557  |       27 | {1,2,3}  |            3 |
| /36557/#/52/1/186557/#/53/1/986557  | /43257/#/52/1/193257/#/53/1/1593257 |       24 | {1,2,3}  |            1 |
| /43257/#/52/1/193257/#/53/1/1593257 | NULL                                |       26 | {1,2,3}  |            2 |
+-------------------------------------+-------------------------------------+----------+----------+--------------+



(1) non-interleaved (master)

### Pessimistic
# Run 1
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         454             7.6   1046.8   1073.7   1476.4   1543.5   1677.7

# Run 2
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         449             7.5   1057.5   1073.7   1476.4   1677.7   1879.0

### Average
# Run 1
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s          87             1.4   5373.3   5637.1   5905.6   6174.0   6174.0

# Run 2
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s          81             1.3   5509.6   5637.1   5905.6   6174.0   6710.9

(2) interleaved (master)

### Pessimistic 
# Run 1
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s          25             0.4  15236.4  15569.3  16643.0  16643.0  16643.0

# Run 2
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s          25             0.4  15311.6  15569.3  17179.9  18253.6  18253.6

### Average
# Run 1
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s          64             1.1   7170.2   7247.8   8321.5   8589.9   9126.8

# Run 2
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s          64             1.1   7126.1   7516.2   8053.1   8321.5   8321.5

(3) non-interleaved (patch)

### Pessimistic
# Run 1
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         446             7.4   1068.7   1140.9   1275.1   1409.3   1610.6

# Run 2
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         442             7.4   1074.3   1140.9   1275.1   1476.4   1610.6

### Average
# Run 1
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s          86             1.4   5364.0   5637.1   5905.6   5905.6   5905.6

# Run 2
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s          82             1.4   5513.6   5637.1   6174.0   6174.0   6174.0

(4) interleaved (patch)

### Pessimistic
# Run 1
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s          56             0.9   7612.1   8053.1   8321.5   8589.9   8589.9

# Run 2
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s          56             0.9   7923.6   8053.1   8589.9   8589.9   8589.9

### Average
# Run 1
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         122             2.0   3810.4   3892.3   4160.7   4295.0   4563.4

# Run 2
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         121             2.0   3852.9   3892.3   4160.7   4295.0   4295.0

Summary of Case 3

  • There was some deviation between (1) and (3), but this does not influence our conclusion (see below)
  • Non-interleaved tables performed far better if the tables being joined were small enough to fit on the same nodes (pessimistic case)
  • Interleaved tables with the improved join logic performed the best in an "average case" where joins occurred on all nodes, even if it was only on a small proportion of the hierarchy
    • vs (1): 45% increase in throughput and ~30% decrease in latency
    • vs (2): 90% increase in throughput and ~50% decrease in latency

Intuitively, interleaved tables are not recommended over non-interleaved tables if the commonly joined tables (merchant and store in the pessimistic case) can both fit on one range or one node and make up a small proportion of the interleaved hierarchy.

Also, interleaved tables are great if the join happens on tables that make up a majority of the hierarchy (merchant and variant). That said, joining on interleaved tables was still more efficient than joining on non-interleaved tables even when the join touched only a very small minority of the two tables (average case). This implies that the actual overhead of scanning the entire interleaved hierarchy was relatively insignificant compared to the improved locality and reduced RPC traffic that came with the InterleavedReaderJoiner.

Case 4 (Multiple ranges + 70ms latency)

We used the comcast utility to simulate 70ms trans-Atlantic latencies.

# Same query and table sizes as case 2

# 35ms * 2 latency (node-to-node)
./comcast --device=ens5 --latency=35

tc -s qdisc
qdisc noqueue 0: dev lo root refcnt 2
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc htb 10: dev ens5 root refcnt 5 r2q 10 default 1 direct_packets_stat 3 direct_qlen 1000
 Sent 228625 bytes 262 pkt (dropped 0, overlimits 476 requeues 0)
 backlog 461b 4p requeues 0
qdisc netem 100: dev ens5 parent 10:10 limit 1000 delay 35.0ms
 Sent 226653 bytes 248 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 461b 4p requeues 0


# Same table ranges as case 2

(1) non-interleaved (master)

# Run 1
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         209             3.5   2239.3   2415.9   2550.1   2550.1   2550.1

# Run 2
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         207             3.4   2304.6   2415.9   2550.1   2550.1   2818.6

(2) interleaved (master)

# Run 1
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         180             3.0   2599.3   2684.4   2818.6   3221.2   3221.2

# Run 2
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         183             3.0   2530.2   2684.4   2952.8   3221.2   3355.4

(3) non-interleaved (patch)

# Run 1
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         216             3.6   2184.1   2281.7   2550.1   2550.1   2684.4

# Run 2
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         217             3.6   2195.9   2415.9   2550.1   2684.4   2684.4

(4) interleaved (patch)

# Run 1
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         365             6.1   1301.1   1342.2   1476.4   1543.5   1543.5

# Run 2
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         366             6.1   1301.6   1342.2   1476.4   1543.5   1543.5

Summary of Case 4

  • Both non-interleaved and interleaved performed similarly to Case 2 with a 70ms latency overhead
    • This was expected since each query took ~1800ms: a 70ms overhead was proportionally small
  • Latencies sometimes decreased from Case 2 to Case 4; we attribute this to noise
  • Regardless, (4) still trumped all
    • vs (1): 75% increase in throughput and ~42% decrease in latency
    • vs (2): 101% increase in throughput and ~45% decrease in latency

Perhaps shorter/smaller queries would show more interesting results.

Case 5 (Multiple ranges + 160ms latency)

We simulated 160ms Southeast Asia to US latencies.

# Same query and table sizes as case 2

# 80ms * 2 latency (node-to-node)
./comcast --device=ens5 --latency=80

tc -s qdisc
qdisc noqueue 0: dev lo root refcnt 2
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc htb 10: dev ens5 root refcnt 5 r2q 10 default 1 direct_packets_stat 3 direct_qlen 1000
 Sent 3589252 bytes 12452 pkt (dropped 0, overlimits 32607 requeues 0)
 backlog 102b 1p requeues 0
qdisc netem 100: dev ens5 parent 10:10 limit 1000 delay 80.0ms
 Sent 3587088 bytes 12434 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 102b 1p requeues 0


# Same table ranges as case 2

(1) non-interleaved (master)

# Run 1
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         213             3.5   2250.5   2415.9   2684.4   2684.4   2684.4

# Run 2
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         209             3.5   2235.5   2415.9   2684.4   2684.4   2684.4

(2) interleaved (master)

# Run 1
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         179             3.0   2676.9   2818.6   2952.8   2952.8   3087.0

# Run 2
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         173             2.9   2681.1   2818.6   3087.0   3087.0   3087.0

(3) non-interleaved (patch)

# Run 1
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         215             3.6   2184.9   2281.7   2550.1   2684.4   2818.6

# Run 2
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         209             3.5   2237.4   2415.9   2550.1   2684.4   2818.6

(4) interleaved (patch)

# Run 1
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         339             5.6   1393.0   1409.3   1677.7   1677.7   1744.8

# Run 2
_elapsed___________ops_____ops/s(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
   60.0s         340             5.7   1397.6   1409.3   1610.6   1677.7   1744.8

Summary of Case 5

  • Interleaved joins did take a hit compared to Case 4, but they still performed far better than non-interleaved joins
    • vs (1): 61% increase in throughput and ~40% decrease in latency
    • vs (2): 93% increase in throughput and ~45% decrease in latency

Future Work

  • Outer joins (not yet implemented with interleaved table joins)
  • Larger hierarchies
  • Trade-offs with point lookups or range scans on descendant tables
  • More complex queries (with aggregations, grouping, multiple joins)
  • Experiment with other network clamps (e.g. packet loss, throttling)
richardwu added the C-investigation and C-performance labels on Dec 9, 2017
@richardwu (Contributor, Author) commented:

cc: @RaduBerinde @arjunravinarayan @petermattis

@petermattis (Collaborator) commented:

@richardwu This is a fantastic write-up. One nit: you use the future tense ("we will") a lot even though you've already done the work.

I wonder why interleaved tables do not have the same performance as non-interleaved tables for Case 1. The performance is only 10% worse, but that's still surprising. There might be something fairly simple that explains it, such as slightly more memory allocations or different memory accounting. For example, the inner loop of interleavedReaderJoiner might be allocating a postProcHelper on every row, and that looks easy to fix.

@petermattis (Collaborator) commented:

@@ -242,9 +242,10 @@ func (irj *interleavedReaderJoiner) Run(ctx context.Context, wg *sync.WaitGroup)
                        break
                }

-               var helper postProcHelper
+               var helper *postProcHelper
                helperIdx := -1
-               for i, h := range irj.tablePosts {
+               for i := range irj.tablePosts {
+                       h := &irj.tablePosts[i]
                        if desc.ID == h.tableID && index.ID == h.indexID {
                                helper = h
                                helperIdx = i

Note that sizeof(postProcHelper) == 928. So in addition to avoiding an allocation, this would avoid copying those bytes on every iteration of the for-loop.

@richardwu (Contributor, Author) commented:

Looks like you were right: interleaved join performance in Case 1 is now comparable to non-interleaved tables, and the results from the other cases also improved.

@petermattis (Collaborator) commented:

@richardwu Not sure if you have the time in your last week, but it would be useful to add Go benchmarks for mergeJoiner and interleavedReaderJoiner. Even better if these benchmarks use the same schema as BenchmarkHashJoiner so that the numbers are comparable.

@rjnn (Contributor) commented Dec 10, 2017

This is an excellent write-up! Far more thorough than I expected from our discussion! 🎉

Just one thing: please merge the interleave benchmark! It's an open PR now, and if you need anything unblocked that's currently blocking the merge (nothing, AFAIK, from our discussion on Thursday), let me know ASAP.
