
Give scans control over cache via scan dispatchers #1383 #1440

Merged

keith-turner merged 2 commits into apache:master from keith-turner:accumulo-1383 on Jan 2, 2020

Conversation

@keith-turner
Contributor

This commit enables per-scan control over cache usage. This was done
by extending the scope of ScanDispatchers to include cache control.

The built-in SimpleScanDispatcher was improved to support mapping
scanner execution hints of the form scan_type=X to cache usage
directives for a scan.

Accumulo caches open files. Before this change, the cache was bound to
a file when it was opened. Now a cache can be bound to an already
open file. The internal interface CacheProvider was created to
facilitate this.

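To make the extended dispatcher scope concrete, here is a minimal sketch of a custom dispatcher under the new SPI. It assumes the builder-style ScanDispatch API with per-scan index/data cache usage settings that this change introduces; treat the exact class and method names as assumptions rather than the merged API.

  import java.util.Map;

  import org.apache.accumulo.core.spi.scan.ScanDispatch;
  import org.apache.accumulo.core.spi.scan.ScanDispatcher;

  // Sketch: send scans hinted with scan_type=background to a dedicated
  // executor and make them use the caches opportunistically.
  public class BackgroundScanDispatcher implements ScanDispatcher {
    @Override
    public ScanDispatch dispatch(DispatchParameters params) {
      Map<String,String> hints = params.getScanInfo().getExecutionHints();
      if ("background".equals(hints.get("scan_type"))) {
        return ScanDispatch.builder()
            .setExecutorName("bge") // executor name reused from the test setup below
            .setDataCacheUsage(ScanDispatch.CacheUsage.OPPORTUNISTIC)
            .setIndexCacheUsage(ScanDispatch.CacheUsage.OPPORTUNISTIC)
            .build();
      }
      // Other scans keep default dispatching and the table's cache settings.
      return ScanDispatch.builder().build();
    }
  }

For the tests below no custom dispatcher was needed; the built-in SimpleScanDispatcher's executor.* and cacheUsage.* options express the same mapping.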
@keith-turner
Contributor Author

I ran a performance test on a small EC2 cluster to give this a try. I had the following setup (the two scan workloads are sketched below).

  • Table with around 192 million entries
  • 8 tablets
  • 3 tservers (d2.xlarge)
  • Data cache sized at 30% of 4G
  • 6 threads running circular full table scans (each with a different random start) with the execution hint scan_type=background
  • 16 threads doing random lookups from a set of 8K random rows
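
For reference, here is a rough sketch of the two kinds of client threads, assuming an Accumulo 2.x client, a table named dc, and a client.properties file; the only feature-specific call is setExecutionHints, the rest is standard client API.

  import java.util.List;
  import java.util.Map;
  import java.util.Random;

  import org.apache.accumulo.core.client.Accumulo;
  import org.apache.accumulo.core.client.AccumuloClient;
  import org.apache.accumulo.core.client.Scanner;
  import org.apache.accumulo.core.data.Key;
  import org.apache.accumulo.core.data.Range;
  import org.apache.accumulo.core.data.Value;
  import org.apache.accumulo.core.security.Authorizations;

  public class WorkloadSketch {

    // Background thread: full table scan tagged with the scan_type=background hint.
    static void backgroundScan(AccumuloClient client) throws Exception {
      try (Scanner scanner = client.createScanner("dc", Authorizations.EMPTY)) {
        scanner.setExecutionHints(Map.of("scan_type", "background"));
        for (Map.Entry<Key,Value> entry : scanner) {
          // consume entries; the actual test restarted from a random row
        }
      }
    }

    // Lookup thread: repeated lookups from a fixed set of random rows, no hint.
    static void randomLookups(AccumuloClient client, List<String> rows) throws Exception {
      Random random = new Random();
      try (Scanner scanner = client.createScanner("dc", Authorizations.EMPTY)) {
        while (true) {
          scanner.setRange(new Range(rows.get(random.nextInt(rows.size()))));
          for (Map.Entry<Key,Value> entry : scanner) {
            // read the looked-up row
          }
        }
      }
    }

    public static void main(String[] args) throws Exception {
      try (AccumuloClient client = Accumulo.newClient().from("client.properties").build()) {
        // start 6 backgroundScan threads and 16 randomLookups threads here
      }
    }
  }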

For the threads doing random lookups on a subset of the table, their data just fit in the data cache; the full table did not fit. In this test scenario all of the data was local to each tserver and fit in the OS cache, so a cache miss was not terribly slow (HDFS is also optimized for reading local data). Cache misses were still noticeable from a latency perspective, which is all I cared about.

I ran three tests, all set up with the following Accumulo shell commands. They make all scans that set the execution hint scan_type=background go to a dedicated executor with a single thread.

  createtable dc
  config -s tserver.scan.executors.bge.threads=1
  config -t dc -s table.scan.dispatcher.opts.executor.background=bge
  config -t dc -s table.cache.index.enable=true
  config -t dc -s table.cache.block.enable=true
  config -t dc -s table.file.compress.type=snappy

For one test run I set the following to make scans that set the execution hint scan_type=background use opportunistic caching. This means those scans would use data already in the cache, but would never load missing data into the cache.

  config -t dc -s table.scan.dispatcher.opts.cacheUsage.background=opportunistic

For another test run I set the following to make scans that set the execution hint scan_type=background fully use the cache.

  config -t dc -s table.scan.dispatcher.opts.cacheUsage.background=enabled

For the last test run, I did not run the background scans at all; I only ran the scans doing random lookups.

Below are the average times for the scans doing random lookups across the three test runs. If I had run in a situation where cache misses had higher latency, I suspect the plots would differ more dramatically. I wish I had run a fourth test where the random lookup threads did not use the cache; in all three tests the random lookup threads always used the cache.

[plot: test-results — average random lookup scan times for the three test runs]

