Update docs from master
apurtell committed Aug 17, 2015
1 parent ea67e68 commit 3c4b78d
Showing 15 changed files with 338 additions and 42 deletions.
11 changes: 1 addition & 10 deletions src/main/asciidoc/_chapters/architecture.adoc
@@ -1016,16 +1016,7 @@ For background, see link:https://issues.apache.org/jira/browse/HBASE-2643[HBASE-

===== Performance Improvements during Log Splitting

WAL log splitting and recovery can be resource intensive and take a long time, depending on the number of RegionServers involved in the crash and the size of the regions. <<distributed.log.splitting>> and <<distributed.log.replay>> were developed to improve performance during log splitting.

[[distributed.log.splitting]]
====== Distributed Log Splitting

_Distributed Log Splitting_ was added in HBase version 0.92 (link:https://issues.apache.org/jira/browse/HBASE-1364[HBASE-1364]) by Prakash Khemani from Facebook.
It reduces the time to complete log splitting dramatically, improving the availability of regions and tables.
For example, recovering a crashed cluster took around 9 hours with single-threaded log splitting, but only about six minutes with distributed log splitting.

The information in this section is sourced from Jimmy Xiang's blog post at http://blog.cloudera.com/blog/2012/07/hbase-log-splitting/.
WAL log splitting and recovery can be resource intensive and take a long time, depending on the number of RegionServers involved in the crash and the size of the regions. <<distributed.log.splitting>> was developed to improve performance during log splitting.

.Enabling or Disabling Distributed Log Splitting

10 changes: 5 additions & 5 deletions src/main/asciidoc/_chapters/community.adoc
@@ -62,11 +62,11 @@ Any -1 on a patch by anyone vetos a patch; it cannot be committed until the just
.How to set fix version in JIRA on issue resolve
Here is how link:http://search-hadoop.com/m/azemIi5RCJ1[we agreed] to set versions in JIRA when we resolve an issue.
If trunk is going to be 0.98.0 then:
If master is going to be 0.98.0 then:
* Commit only to trunk: Mark with 0.98
* Commit to 0.95 and trunk : Mark with 0.98, and 0.95.x
* Commit to 0.94.x and 0.95, and trunk: Mark with 0.98, 0.95.x, and 0.94.x
* Commit only to master: Mark with 0.98
* Commit to 0.95 and master: Mark with 0.98, and 0.95.x
* Commit to 0.94.x and 0.95, and master: Mark with 0.98, 0.95.x, and 0.94.x
* Commit to 89-fb: Mark with 89-fb.
* Commit site fixes: no version
@@ -103,7 +103,7 @@ Owners do not need to be committers.
[[hbase.commit.msg.format]]
== Commit Message format
We link:http://search-hadoop.com/m/Gwxwl10cFHa1[agreed] to the following SVN commit message format:
We link:http://search-hadoop.com/m/Gwxwl10cFHa1[agreed] to the following Git commit message format:
[source]
----
HBASE-xxxxx <title>. (<contributor>)
29 changes: 17 additions & 12 deletions src/main/asciidoc/_chapters/developer.adoc
@@ -569,8 +569,8 @@ Checkin the _CHANGES.txt_ and any version changes.
. Update the documentation.
+
Update the documentation under _src/main/docbkx_.
This usually involves copying the latest from trunk and making version-particular adjustments to suit this release candidate version.
Update the documentation under _src/main/asciidoc_.
This usually involves copying the latest from master and making version-particular adjustments to suit this release candidate version.
. Build the source tarball.
+
@@ -1689,7 +1689,9 @@ $ make_patch.sh [-a] [-p <patch_dir>]
If you decline, the script uses +git diff+ instead.
The patch is saved in a configurable directory and is ready to be attached to your JIRA.

* .Patching WorkflowAlways patch against the master branch first, even if you want to patch in another branch.
.Patching Workflow

* Always patch against the master branch first, even if you want to patch in another branch.
HBase committers always apply patches first to the master branch, and backport if necessary.
* Submit one single patch for a fix.
If necessary, squash local commits to merge local commits into a single one first.
@@ -1725,17 +1727,20 @@ Please understand that not every patch may get committed, and that feedback will
However, at times it is easier to refer to different version of a patch if you add `-vX`, where the [replaceable]_X_ is the version (starting with 2).
* If you need to submit your patch against multiple branches, rather than just master, name each version of the patch with the branch it is for, following the naming conventions in <<submitting.patches.create,submitting.patches.create>>.

.Methods to Create PatchesEclipse::
.Methods to Create Patches
Eclipse::
Select the menu item.

Git::
`git format-patch` is preferred because it preserves commit messages.
`git format-patch` is preferred:
- It preserves the committer and commit message.
- It handles binary files by default, whereas `git diff` ignores them unless
you use the `--binary` option.
Use `git rebase -i` first, to combine (squash) smaller commits into a single larger one.

Subversion::

Make sure you review <<eclipse.code.formatting,eclipse.code.formatting>> and <<common.patch.feedback,common.patch.feedback>> for code style.
If your patch was generated incorrectly or your code does not adhere to the code formatting guidelines, you may be asked to redo some work.
Make sure you review <<eclipse.code.formatting,eclipse.code.formatting>> and <<common.patch.feedback,common.patch.feedback>> for code style.
If your patch was generated incorrectly or your code does not adhere to the code formatting guidelines, you may be asked to redo some work.

[[submitting.patches.tests]]
==== Unit Tests
@@ -1846,13 +1851,13 @@ The instructions and preferences around the way to create patches have changed,
This is the indication that the patch was not created with `--no-prefix`.
+
----
diff --git a/src/main/docbkx/developer.xml b/src/main/docbkx/developer.xml
diff --git a/src/main/asciidoc/_chapters/developer.adoc b/src/main/asciidoc/_chapters/developer.adoc
----

* If the first line of the patch looks similar to the following (without the `a` and `b`), the patch was created with +git diff --no-prefix+ and you need to add `-p0` to the +git apply+ command below.
+
----
diff --git src/main/docbkx/developer.xml src/main/docbkx/developer.xml
diff --git src/main/asciidoc/_chapters/developer.adoc src/main/asciidoc/_chapters/developer.adoc
----

+
@@ -1930,7 +1935,7 @@ If the contributor used +git format-patch+ to generate the patch, their commit m
[[committer.amending.author]]
====== Add Amending-Author when a conflict cherrypick backporting

We've established the practice of committing to trunk and then cherry picking back to branches whenever possible.
We've established the practice of committing to master and then cherry picking back to branches whenever possible.
When there is a minor conflict we can fix it up and just proceed with the commit.
The resulting commit retains the original author.
When the amending author is different from the original committer, add notice of this at the end of the commit message as: `Amending-Author: Author
@@ -1951,7 +1956,7 @@ A committer should.
In the thread link:http://search-hadoop.com/m/DHED4EiwOz[HBase, mail # dev - ANNOUNCEMENT: Git Migration In Progress (WAS =>
Re: Git Migration)], it was agreed on the following patch flow

. Develop and commit the patch against trunk/master first.
. Develop and commit the patch against master first.
. Try to cherry-pick the patch when backporting if possible.
. If this does not work, manually commit the patch to the branch.

9 changes: 5 additions & 4 deletions src/main/asciidoc/_chapters/getting_started.adoc
@@ -80,15 +80,16 @@ See <<java,Java>> for information about supported JDK versions.
This will take you to a mirror of _HBase
Releases_.
Click on the folder named _stable_ and then download the binary file that ends in _.tar.gz_ to your local filesystem.
Be sure to choose the version that corresponds with the version of Hadoop you are likely to use later.
In most cases, you should choose the file for Hadoop 2, which will be called something like _hbase-0.98.3-hadoop2-bin.tar.gz_.
For versions prior to 1.x, be sure to choose the version that corresponds with the version of Hadoop you are
likely to use later (in most cases, you should choose the file for Hadoop 2, which will be called
something like _hbase-0.98.13-hadoop2-bin.tar.gz_).
Do not download the file ending in _src.tar.gz_ for now.
. Extract the downloaded file, and change to the newly-created directory.
+
----
$ tar xzvf hbase-<?eval ${project.version}?>-hadoop2-bin.tar.gz
$ cd hbase-<?eval ${project.version}?>-hadoop2/
$ tar xzvf hbase-<?eval ${project.version}?>-bin.tar.gz
$ cd hbase-<?eval ${project.version}?>/
----
. For HBase 0.98.5 and later, you are required to set the `JAVA_HOME` environment variable before starting HBase.
236 changes: 236 additions & 0 deletions src/main/asciidoc/_chapters/hbase_mob.adoc
@@ -0,0 +1,236 @@
////
/**
*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
////
[[hbase_mob]]
== Storing Medium-sized Objects (MOB)
:doctype: book
:numbered:
:toc: left
:icons: font
:experimental:
:source-language: java
Data comes in many sizes, and saving all of your data in HBase, including binary
data such as images and documents, is ideal. While HBase can technically handle
binary objects with cells that are larger than 100 KB in size, HBase's normal
read and write paths are optimized for values smaller than 100 KB in size. When
HBase deals with large numbers of objects over this threshold, referred to here
as medium objects, or MOBs, performance is degraded due to write amplification
caused by splits and compactions. When using MOBs, ideally your objects will be between
100 KB and 10 MB. HBase ***FIX_VERSION_NUMBER*** adds support
for better managing large numbers of MOBs while maintaining performance,
consistency, and low operational overhead. MOB support is provided by the work
done in link:https://issues.apache.org/jira/browse/HBASE-11339[HBASE-11339]. To
take advantage of MOB, you need to use <<hfilev3,HFile version 3>>. Optionally,
configure the MOB file reader's cache settings for each RegionServer (see
<<mob.cache.configure>>), then configure specific columns to hold MOB data.
Client code does not need to change to take advantage of HBase MOB support. The
feature is transparent to the client.

=== Configuring Columns for MOB

You can configure columns to support MOB during table creation or alteration,
either in HBase Shell or via the Java API. The two relevant properties are the
boolean `IS_MOB` and the `MOB_THRESHOLD`, which is the number of bytes at which
an object is considered to be a MOB. Only `IS_MOB` is required. If you do not
specify the `MOB_THRESHOLD`, the default threshold value of 100 KB is used.

.Configure a Column for MOB Using HBase Shell
====
----
hbase> create 't1', {NAME => 'f1', IS_MOB => true, MOB_THRESHOLD => 102400}
hbase> alter 't1', {NAME => 'f1', IS_MOB => true, MOB_THRESHOLD => 102400}
----
====

.Configure a Column for MOB Using the Java API
====
[source,java]
----
...
HColumnDescriptor hcd = new HColumnDescriptor("f");
hcd.setMobEnabled(true);
...
hcd.setMobThreshold(102400L);
...
----
====
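
As a rough illustration of how `MOB_THRESHOLD` partitions values, the following sketch classifies a few value sizes against the 102400-byte threshold used in the examples above. The class `MobThresholdSketch` and its members are hypothetical and are not part of the HBase API. The sketch also assumes that only values strictly larger than the threshold are treated as MOBs; how a value exactly equal to the threshold is handled is not stated in this section.

```java
/** Hypothetical helper for illustration only; not part of the HBase API. */
final class MobThresholdSketch {

  /** Threshold from the shell and Java examples above: 102400 bytes (100 KB). */
  static final long EXAMPLE_THRESHOLD_BYTES = 102400L;

  private MobThresholdSketch() {
  }

  /**
   * Assumption: a value counts as a MOB only when it is strictly larger
   * than the configured threshold.
   */
  static boolean isMob(long valueLengthBytes, long thresholdBytes) {
    return valueLengthBytes > thresholdBytes;
  }

  public static void main(String[] args) {
    long[] sizes = {512L, 102400L, 102401L, 5L * 1024 * 1024};
    for (long size : sizes) {
      // Print the classification of each sample value size.
      System.out.println(size + " bytes -> MOB? " + isMob(size, EXAMPLE_THRESHOLD_BYTES));
    }
  }
}
```

Under that assumption, a 5 MB document stored in column family `f1` would follow the MOB path, while a 512-byte value in the same family would be handled like any other cell.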


=== Testing MOB

The utility `org.apache.hadoop.hbase.IntegrationTestIngestMOB` is provided to assist with testing
the MOB feature. The utility is run as follows:
[source,bash]
----
$ sudo -u hbase hbase org.apache.hadoop.hbase.IntegrationTestIngestMOB \
-threshold 102400 \
-minMobDataSize 512 \
-maxMobDataSize 5120
----

* `*threshold*` is the threshold at which cells are considered to be MOBs.
The default is 1 kB, expressed in bytes.
* `*minMobDataSize*` is the minimum value for the size of MOB data.
The default is 512 B, expressed in bytes.
* `*maxMobDataSize*` is the maximum value for the size of MOB data.
The default is 5 kB, expressed in bytes.


[[mob.cache.configure]]
=== Configuring the MOB Cache


Because there can be a large number of MOB files at any time, as compared to the number of HFiles,
MOB files are not always kept open. The MOB file reader cache is an LRU cache which keeps the most
recently used MOB files open. To configure the MOB file reader's cache on each RegionServer, add
the following properties to the RegionServer's `hbase-site.xml`, customize the configuration to
suit your environment, and then restart or rolling-restart the RegionServer.

.Example MOB Cache Configuration
====
[source,xml]
----
<property>
<name>hbase.mob.file.cache.size</name>
<value>1000</value>
<description>
Number of opened file handlers to cache.
A larger value will benefit reads by providing more file handlers per mob
file cache and would reduce frequent file opening and closing.
However, if this is set too high, this could lead to a "too many opened file handlers" error.
The default value is 1000.
</description>
</property>
<property>
<name>hbase.mob.cache.evict.period</name>
<value>3600</value>
<description>
The amount of time in seconds after which an unused file is evicted from the
MOB cache. The default value is 3600 seconds.
</description>
</property>
<property>
<name>hbase.mob.cache.evict.remain.ratio</name>
<value>0.5f</value>
<description>
A multiplier (between 0.0 and 1.0) that determines what fraction of
`hbase.mob.file.cache.size` files remain cached after a cache eviction, which
is triggered when the number of cached files reaches the `hbase.mob.file.cache.size` threshold.
The default value is 0.5f, which means that half the files (the least-recently-used
ones) are evicted.
</description>
</property>
----
====
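
The interaction between `hbase.mob.file.cache.size` and `hbase.mob.cache.evict.remain.ratio` can be worked through with a little arithmetic. The sketch below is only an illustration of the property descriptions above: the class `MobCacheEvictionSketch` and its methods are hypothetical names, not HBase classes, and truncating the product to a whole number of files is an assumption.

```java
/** Hypothetical arithmetic sketch; not the HBase MOB file cache implementation. */
final class MobCacheEvictionSketch {

  private MobCacheEvictionSketch() {
  }

  /**
   * Number of files expected to stay cached after an eviction.
   * Assumption: the product is truncated to a whole number of files.
   */
  static int filesRemaining(int cacheSize, double remainRatio) {
    if (cacheSize < 0 || remainRatio < 0.0 || remainRatio > 1.0) {
      throw new IllegalArgumentException("cacheSize must be >= 0 and ratio within [0.0, 1.0]");
    }
    return (int) (cacheSize * remainRatio);
  }

  /** Number of least-recently-used files dropped when the cache currently holds cachedFiles. */
  static int filesEvicted(int cachedFiles, int cacheSize, double remainRatio) {
    return Math.max(0, cachedFiles - filesRemaining(cacheSize, remainRatio));
  }

  public static void main(String[] args) {
    int cacheSize = 1000;     // hbase.mob.file.cache.size
    double remainRatio = 0.5; // hbase.mob.cache.evict.remain.ratio
    System.out.println("remain: " + filesRemaining(cacheSize, remainRatio));
    System.out.println("evicted: " + filesEvicted(cacheSize, cacheSize, remainRatio));
  }
}
```

With the default values, a full cache of 1000 open files keeps 500 and closes the other 500. Raising the ratio keeps more files open after each eviction, at the cost of reaching the size threshold again sooner.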

=== MOB Optimization Tasks

==== Manually Compacting MOB Files

To manually compact MOB files, rather than waiting for the
<<mob.cache.configure,configuration>> to trigger compaction, use the
`compact_mob` or `major_compact_mob` HBase shell commands. These commands
require the first argument to be the table name, and take an optional column
family as the second argument. If the column family is omitted, all MOB-enabled
column families are compacted.

----
hbase> compact_mob 't1', 'c1'
hbase> compact_mob 't1'
hbase> major_compact_mob 't1', 'c1'
hbase> major_compact_mob 't1'
----

These commands are also available via `Admin.compactMob` and
`Admin.majorCompactMob` methods.

==== MOB Sweeper

HBase MOB includes a MapReduce job called the Sweeper tool for
optimization. The Sweeper tool coalesces small MOB files or MOB files with many
deletions or updates. The Sweeper tool is not required if you use native MOB compaction, which
does not rely on MapReduce.

To configure the Sweeper tool, set the following options:

[source,xml]
----
<property>
<name>hbase.mob.sweep.tool.compaction.ratio</name>
<value>0.5f</value>
<description>
If there are too many cells deleted in a mob file, it's regarded
as an invalid file and needs to be merged.
If existingCellsSize/mobFileSize is less than ratio, it's regarded
as an invalid file. The default value is 0.5f.
</description>
</property>
<property>
<name>hbase.mob.sweep.tool.compaction.mergeable.size</name>
<value>134217728</value>
<description>
If the size of a mob file is less than this value, it's regarded as a small
file and needs to be merged. The default value is 128MB.
</description>
</property>
<property>
<name>hbase.mob.sweep.tool.compaction.memstore.flush.size</name>
<value>134217728</value>
<description>
The flush size for the memstore used by sweep job. Each sweep reducer owns such a memstore.
The default value is 128MB.
</description>
</property>
<property>
<name>hbase.master.mob.ttl.cleaner.period</name>
<value>86400</value>
<description>
The period at which ExpiredMobFileCleanerChore runs, in seconds.
The default value is one day.
</description>
</property>
----

Next, add the HBase install directory, _`$HBASE_HOME`/*_, and HBase library directory to
_yarn-site.xml_. Adjust this example to suit your environment.
[source,xml]
----
<property>
<description>Classpath for typical applications.</description>
<name>yarn.application.classpath</name>
<value>
$HADOOP_CONF_DIR,
$HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,
$HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,
$HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,
$HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*,
$HBASE_HOME/*, $HBASE_HOME/lib/*
</value>
</property>
----

Finally, run the `sweeper` tool for each column which is configured for MOB.
[source,bash]
----
$ org.apache.hadoop.hbase.mob.compactions.Sweeper _tableName_ _familyName_
----
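
The descriptions of `hbase.mob.sweep.tool.compaction.ratio` and `hbase.mob.sweep.tool.compaction.mergeable.size` above suggest two independent reasons for the Sweeper tool to merge a MOB file. The following sketch restates them as plain Java. `SweeperSelectionSketch` and its methods are hypothetical names, not the Sweeper's actual code, and treating either condition alone as sufficient is an assumption drawn from the property descriptions.

```java
/** Hypothetical restatement of the sweeper property descriptions; not HBase code. */
final class SweeperSelectionSketch {

  private SweeperSelectionSketch() {
  }

  /** "Invalid" file: existingCellsSize / mobFileSize is less than the compaction ratio. */
  static boolean isInvalid(long existingCellsSize, long mobFileSize, double ratio) {
    if (mobFileSize <= 0) {
      return false; // avoid dividing by zero for an empty file
    }
    return (double) existingCellsSize / mobFileSize < ratio;
  }

  /** "Small" file: the file size is less than the mergeable size. */
  static boolean isSmall(long mobFileSize, long mergeableSize) {
    return mobFileSize < mergeableSize;
  }

  /** Assumption: a file is merged when it is either invalid or small. */
  static boolean needsMerge(long existingCellsSize, long mobFileSize,
      double ratio, long mergeableSize) {
    return isInvalid(existingCellsSize, mobFileSize, ratio)
        || isSmall(mobFileSize, mergeableSize);
  }

  public static void main(String[] args) {
    final long mb = 1024L * 1024L;
    double ratio = 0.5;            // hbase.mob.sweep.tool.compaction.ratio
    long mergeableSize = 128 * mb; // hbase.mob.sweep.tool.compaction.mergeable.size

    // 256 MB file with only 64 MB of live cells: 0.25 < 0.5, so it is invalid.
    System.out.println(needsMerge(64 * mb, 256 * mb, ratio, mergeableSize));
    // 256 MB file with 200 MB of live cells: neither invalid nor small.
    System.out.println(needsMerge(200 * mb, 256 * mb, ratio, mergeableSize));
    // 10 MB file with no deletions: small, so it is merged.
    System.out.println(needsMerge(10 * mb, 10 * mb, ratio, mergeableSize));
  }
}
```

Lowering the ratio makes the tool tolerate more deleted cells per file before rewriting it, while lowering the mergeable size leaves more small files untouched.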
2 changes: 1 addition & 1 deletion src/main/asciidoc/_chapters/performance.adoc
Expand Up @@ -546,7 +546,7 @@ To disable the WAL, see <<wal.disable>>.
=== HBase Client: Group Puts by RegionServer
In addition to using the writeBuffer, grouping `Put`s by RegionServer can reduce the number of client RPC calls per writeBuffer flush.
There is a utility `HTableUtil` currently on TRUNK that does this, but you can either copy that or implement your own version for those still on 0.90.x or earlier.
There is a utility `HTableUtil` currently on MASTER that does this, but you can either copy that or implement your own version for those still on 0.90.x or earlier.
[[perf.hbase.write.mr.reducer]]
=== MapReduce: Skip The Reducer
