ACCUMULO-4752 Create documentation on improving performance (#46)
* Also, created documentation on RFile along with diagram
mikewalch committed Dec 8, 2017
1 parent 269dfb4 commit 0a525d5
Showing 3 changed files with 67 additions and 5 deletions.
20 changes: 15 additions & 5 deletions _docs-2-0/getting-started/design.md
@@ -107,17 +107,28 @@ ingest and query load is balanced across the cluster.
When a write arrives at a TabletServer it is written to a Write-Ahead Log and
then inserted into a sorted data structure in memory called a MemTable. When the
MemTable reaches a certain size, the TabletServer writes out the sorted
key-value pairs to a file in HDFS called a Relative Key File (RFile), which is a
kind of Indexed Sequential Access Method (ISAM) file. This process is called a
minor compaction. A new MemTable is then created and the fact of the compaction
is recorded in the Write-Ahead Log.
key-value pairs to a file in HDFS called an [RFile](#rfile). This process is
called a minor compaction. A new MemTable is then created and the fact of the
compaction is recorded in the Write-Ahead Log.
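
The write path described above can be sketched as a toy model. This is only an illustration, not Accumulo's implementation: the class `ToyTabletServer`, its fields, and the entry-count threshold `memtable_limit` are all hypothetical (a real TabletServer tracks MemTable size in bytes and writes to HDFS).

```python
import bisect


class ToyTabletServer:
    """Toy model of the write path: log first, then insert into sorted memory."""

    def __init__(self, memtable_limit=3):
        self.memtable_limit = memtable_limit  # hypothetical threshold, in entries
        self.wal = []       # stand-in for the Write-Ahead Log
        self.memtable = []  # sorted list of (key, value) pairs
        self.rfiles = []    # each flushed "file" is an immutable sorted tuple

    def write(self, key, value):
        # 1. Record the mutation in the log before touching memory.
        self.wal.append(("mutation", key, value))
        # 2. Insert into the in-memory structure, keeping it sorted.
        bisect.insort(self.memtable, (key, value))
        # 3. Flush once the MemTable reaches the configured size.
        if len(self.memtable) >= self.memtable_limit:
            self.minor_compact()

    def minor_compact(self):
        # Write the sorted pairs out as a new file and start a fresh MemTable.
        self.rfiles.append(tuple(self.memtable))
        self.memtable = []
        # Note the compaction in the log.
        self.wal.append(("minor_compaction", len(self.rfiles) - 1))
```

Because the MemTable is kept sorted on insert, the flush is a plain sequential write with no extra sorting step.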

When a request to read data arrives at a TabletServer, the TabletServer does a
binary search across the MemTable as well as the in-memory indexes associated
with each RFile to find the relevant values. If clients are performing a scan,
several key-value pairs are returned to the client in order from the MemTable
and the set of RFiles by performing a merge-sort as they are read.
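
The merge step of a scan can be illustrated with a short sketch. It assumes each source (the MemTable and every RFile) is already sorted by key, and it ignores timestamps, deletes, and iterators. The sample data and the `scan` function are hypothetical.

```python
import heapq

# Hypothetical sorted sources: one MemTable and two RFiles.
memtable = [("row1", "a"), ("row4", "d")]
rfile_1 = [("row2", "b"), ("row5", "e")]
rfile_2 = [("row3", "c"), ("row6", "f")]


def scan(*sorted_sources):
    """Lazily yield key-value pairs from all sources in key order."""
    return heapq.merge(*sorted_sources, key=lambda kv: kv[0])


merged = list(scan(memtable, rfile_1, rfile_2))
```

Since `heapq.merge` is lazy, pairs can be streamed to the client as they are read rather than after all sources have been fully consumed.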

## RFile

An RFile (short for Relative Key File) is a file that stores Accumulo's sorted key-value
pairs. TabletServers write RFiles to HDFS during minor compactions. RFiles are
organized using the Indexed Sequential Access Method (ISAM). Each RFile consists of data (key-value) blocks,
index blocks (which are used to find data blocks), and meta blocks (which contain
metadata for bloom filters and summary statistics). Data in an RFile is separated by
locality group. The diagram below shows the logical view and the HDFS file view of an RFile.

![rfile diagram]({{ site.url }}/images/docs/rfile_diagram.png)
<!-- Source at https://docs.google.com/presentation/d/1w9BgfgUtZ-3M14K-lIgv0UmvnOhVg10Zof6AUi-7pcc/edit?usp=sharing -->
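
How an index is used to find a data block can be sketched as follows. This is a simplified, single-level model that assumes the index records the last key of each data block. The sample blocks and the `lookup` function are hypothetical and do not reflect the on-disk RFile format.

```python
from bisect import bisect_left

# Hypothetical data blocks, each a sorted list of (key, value) pairs.
data_blocks = [
    [("apple", 1), ("banana", 2)],
    [("cherry", 3), ("grape", 4)],
    [("melon", 5), ("pear", 6)],
]

# Assumption: the index holds the last key of every data block.
index = [block[-1][0] for block in data_blocks]


def lookup(key):
    """Binary-search the index, then scan only the one candidate block."""
    block_no = bisect_left(index, key)
    if block_no == len(data_blocks):
        return None  # key sorts after the last key in the file
    for k, v in data_blocks[block_no]:
        if k == key:
            return v
    return None
```

Only one data block is read per lookup, which is why the size of data blocks and the size of the index trade off against each other.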

## Compactions

In order to manage the number of files per tablet, periodically the TabletServer
@@ -167,4 +178,3 @@ TabletServer failures are noted on the Master's monitor page, accessible via
[clients]: {{page.docs_baseurl}}/getting-started/clients
[merging]: {{page.docs_baseurl}}/getting-started/table_configuration#merging-tablets
[compaction]: {{page.docs_baseurl}}/getting-started/table_configuration#compaction

52 changes: 52 additions & 0 deletions _docs-2-0/troubleshooting/performance.md
@@ -0,0 +1,52 @@
---
title: Performance
category: troubleshooting
order: 5
---

Accumulo can be tuned to improve read and write performance.

## Read performance

1. Enable [caching] on tables to reduce reads to disk.

1. Enable [bloom filters][bloom-filters] on tables to limit the number of disk lookups.

1. Decrease the [major compaction ratio][compaction] of a table to decrease the number of
files per tablet. Fewer files per tablet means lower read latency.

1. Decrease the size of [data blocks in RFiles][rfile] by lowering [table.file.compress.blocksize], which can improve
random seek performance. However, smaller data blocks increase the size of the indexes in the RFile. If the indexes
are too large to fit in cache, performance can suffer. Also, as the index size increases, the depth of the
index tree in each file may increase. Increasing [table.file.compress.blocksize.index] can reduce the depth of
the tree.
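
The block size trade-off in the last item can be made concrete with a rough back-of-the-envelope model. It assumes one index entry per data block and a fixed, made-up entry size. The function `index_estimate` and all the numbers are hypothetical, chosen only to show the direction of the effect.

```python
def index_estimate(file_bytes, data_block_bytes, index_block_bytes, entry_bytes=50):
    """Return (index entries, index tree depth) under a toy model.

    Assumes one index entry per data block and a fixed entry size.
    """
    entries = -(-file_bytes // data_block_bytes)  # ceiling division
    fanout = max(2, index_block_bytes // entry_bytes)
    depth = 1
    capacity = fanout
    # Add tree levels until the index can address every data block.
    while capacity < entries:
        capacity *= fanout
        depth += 1
    return entries, depth


baseline = index_estimate(1_000_000, 1000, 500)
smaller_data_blocks = index_estimate(1_000_000, 100, 500)
larger_index_blocks = index_estimate(1_000_000, 100, 5000)
```

In this model, shrinking data blocks tenfold multiplies the number of index entries by ten and deepens the tree, while enlarging index blocks raises the fanout and flattens it again.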

## Write performance

1. Enable [native maps][native-maps] on tablet servers to avoid Java garbage collection pauses,
which can slow ingest.

1. [Pre-split new tables][split] to distribute writes across multiple tablet servers.

1. Ingest data using [multiple clients][multi-client] or [bulk ingest][bulk] to increase ingest throughput.

1. Increase the [major compaction ratio][compaction] of a table to reduce the number of major compactions,
which improves ingest performance.

1. On large Accumulo clusters, use [multiple HDFS volumes][multivolume] to increase write performance.

1. Change the compression format used by [blocks in RFiles][rfile] by setting [table.file.compress.type] to
`snappy`. This increases write speed at the expense of using more disk space.
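
The effect of the major compaction ratio on both lists above can be sketched with a simplified selection rule. The assumption here is that a set of files is compacted when the sum of its sizes is at least the ratio times the size of its largest file, and that otherwise the largest file is dropped and the rest are reconsidered. The function `files_to_compact` is hypothetical and omits details such as the maximum number of files per tablet.

```python
def files_to_compact(sizes, ratio):
    """Return the file sizes that would be compacted, or [] if none qualify."""
    candidates = sorted(sizes, reverse=True)
    while len(candidates) > 1:
        # Compact when the whole set is large relative to its biggest file.
        if sum(candidates) >= ratio * candidates[0]:
            return candidates
        # Otherwise set the largest file aside and retry with the rest.
        candidates.pop(0)
    return []


sizes = [100, 30, 30, 30]
low_ratio = files_to_compact(sizes, 1.5)
mid_ratio = files_to_compact(sizes, 3)
high_ratio = files_to_compact(sizes, 4)
```

With the same files, a low ratio compacts everything (fewer files, better reads), while a high ratio compacts nothing (less compaction work, better ingest).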

[caching]: {{ page.docs_baseurl }}/administration/caching
[bloom-filters]: {{ page.docs_baseurl }}/getting-started/table_configuration#bloom-filters
[compaction]: {{ page.docs_baseurl }}/getting-started/table_configuration#compaction
[rfile]: {{ page.docs_baseurl }}/getting-started/design#rfile
[native-maps]: {{ page.docs_baseurl }}/administration/in-depth-install#native-map
[split]: {{ page.docs_baseurl }}/getting-started/table_configuration#pre-splitting-tables
[multi-client]: {{ page.docs_baseurl }}/development/high_speed_ingest#multiple-ingest-clients
[bulk]: {{ page.docs_baseurl }}/development/high_speed_ingest#bulk-ingest
[multivolume]: {{ page.docs_baseurl }}/administration/multivolume
[table.file.compress.blocksize]: {{ page.docs_baseurl }}/administration/properties#table_file_compress_blocksize
[table.file.compress.blocksize.index]: {{ page.docs_baseurl }}/administration/properties#table_file_compress_blocksize_index
[table.file.compress.type]: {{ page.docs_baseurl }}/administration/properties#table_file_compress_type
Binary file added images/docs/rfile_diagram.png