
Give scans control over cache via scan dispatchers #1383 #1440

Merged

keith-turner merged 2 commits into apache:master from keith-turner:accumulo-1383 on Jan 2, 2020

Conversation

@keith-turner
Contributor

This commit enables per-scan control over cache usage. This was done
by extending the scope of ScanDispatchers to include cache control.

The built-in SimpleScanDispatcher was improved to support mapping
scanner execution hints of the form scan_type=X to cache usage
directives for a scan.

Accumulo caches open files. Before this change, the cache was bound to
a file when it was opened. Now a cache can be bound to an already
open file. The internal interface CacheProvider was created to
facilitate this.

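To make the extended dispatcher scope concrete, here is a minimal sketch of a custom dispatcher under the new SPI. It assumes the builder-style ScanDispatch API with per-scan index/data cache usage settings that this change introduces; treat the exact class and method names as assumptions rather than the merged API.

  import java.util.Map;

  import org.apache.accumulo.core.spi.scan.ScanDispatch;
  import org.apache.accumulo.core.spi.scan.ScanDispatcher;

  // Sketch: send scans hinted with scan_type=background to a dedicated
  // executor and make them use the caches opportunistically.
  public class BackgroundScanDispatcher implements ScanDispatcher {
    @Override
    public ScanDispatch dispatch(DispatchParameters params) {
      Map<String,String> hints = params.getScanInfo().getExecutionHints();
      if ("background".equals(hints.get("scan_type"))) {
        return ScanDispatch.builder()
            .setExecutorName("bge") // executor name reused from the test setup below
            .setDataCacheUsage(ScanDispatch.CacheUsage.OPPORTUNISTIC)
            .setIndexCacheUsage(ScanDispatch.CacheUsage.OPPORTUNISTIC)
            .build();
      }
      // Other scans keep default dispatching and the table's cache settings.
      return ScanDispatch.builder().build();
    }
  }

For the tests below no custom dispatcher was needed; the built-in SimpleScanDispatcher's executor.* and cacheUsage.* options express the same mapping.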
@keith-turner
Contributor Author

I ran a performance test on a small EC2 cluster to give this a try. I had the following setup (the two scan workloads are sketched below).

  • Table with around 192 million entries
  • 8 tablets
  • 3 tservers (d2.xlarge)
  • Data cache sized at 30% of 4G
  • 6 threads running circular full table scans (each with a different random start) with the execution hint scan_type=background
  • 16 threads doing random lookups from a set of 8K random rows
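
For reference, here is a rough sketch of the two kinds of client threads, assuming an Accumulo 2.x client, a table named dc, and a client.properties file; the only feature-specific call is setExecutionHints, the rest is standard client API.

  import java.util.List;
  import java.util.Map;
  import java.util.Random;

  import org.apache.accumulo.core.client.Accumulo;
  import org.apache.accumulo.core.client.AccumuloClient;
  import org.apache.accumulo.core.client.Scanner;
  import org.apache.accumulo.core.data.Key;
  import org.apache.accumulo.core.data.Range;
  import org.apache.accumulo.core.data.Value;
  import org.apache.accumulo.core.security.Authorizations;

  public class WorkloadSketch {

    // Background thread: full table scan tagged with the scan_type=background hint.
    static void backgroundScan(AccumuloClient client) throws Exception {
      try (Scanner scanner = client.createScanner("dc", Authorizations.EMPTY)) {
        scanner.setExecutionHints(Map.of("scan_type", "background"));
        for (Map.Entry<Key,Value> entry : scanner) {
          // consume entries; the actual test restarted from a random row
        }
      }
    }

    // Lookup thread: repeated lookups from a fixed set of random rows, no hint.
    static void randomLookups(AccumuloClient client, List<String> rows) throws Exception {
      Random random = new Random();
      try (Scanner scanner = client.createScanner("dc", Authorizations.EMPTY)) {
        while (true) {
          scanner.setRange(new Range(rows.get(random.nextInt(rows.size()))));
          for (Map.Entry<Key,Value> entry : scanner) {
            // read the looked-up row
          }
        }
      }
    }

    public static void main(String[] args) throws Exception {
      try (AccumuloClient client = Accumulo.newClient().from("client.properties").build()) {
        // start 6 backgroundScan threads and 16 randomLookups threads here
      }
    }
  }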

For the threads doing random lookups on a subset of the table, their data just fit in the data cache; the full table did not fit. In this test scenario all of the data was local to each tserver and fit in the OS cache, so a cache miss was not terribly slow (HDFS is also optimized for reading local data). Cache misses were still noticeable from a latency perspective, which is all I cared about.

I ran three tests, all set up with the following Accumulo shell commands. They make all scans that set the execution hint scan_type=background go to a dedicated executor with a single thread.

  createtable dc
  config -s tserver.scan.executors.bge.threads=1
  config -t dc -s table.scan.dispatcher.opts.executor.background=bge
  config -t dc -s table.cache.index.enable=true
  config -t dc -s table.cache.block.enable=true
  config -t dc -s table.file.compress.type=snappy

For one test run I set the following to make scans that set the execution hint scan_type=background use opportunistic caching. This means those scans would use data already in the cache, but would never load missing data into the cache.

  config -t dc -s table.scan.dispatcher.opts.cacheUsage.background=opportunistic

For another test run I set the following to make scans that set the execution hint scan_type=background fully use the cache.

  config -t dc -s table.scan.dispatcher.opts.cacheUsage.background=enabled

For the last test run, I did not run the background scans at all; I only ran the scans doing random lookups.

Below are the average times for the scans doing random lookups across the three test runs. If I had run in a situation where cache misses had higher latency, I suspect the plots would differ more dramatically. I wish I had run a fourth test where the random lookup threads did not use the cache; in all three tests the random lookup threads always used the cache.

[plot: test-results — average random lookup scan times for the three test runs]

