Update docs from master

apache · Aug 17, 2015 · 3c4b78d
1 parent ea67e68
commit 3c4b78d

Showing 15 changed files with 338 additions and 42 deletions.
diff --git a/src/main/asciidoc/_chapters/architecture.adoc b/src/main/asciidoc/_chapters/architecture.adoc
@@ -1016,16 +1016,7 @@ For background, see link:https://issues.apache.org/jira/browse/HBASE-2643[HBASE-
 
 ===== Performance Improvements during Log Splitting
 
-WAL log splitting and recovery can be resource intensive and take a long time, depending on the number of RegionServers involved in the crash and the size of the regions. <<distributed.log.splitting>> and <<distributed.log.replay>> were developed to improve performance during log splitting.
-
-[[distributed.log.splitting]]
-====== Distributed Log Splitting
-
-_Distributed Log Splitting_ was added in HBase version 0.92 (link:https://issues.apache.org/jira/browse/HBASE-1364[HBASE-1364]) by Prakash Khemani from Facebook.
-It reduces the time to complete log splitting dramatically, improving the availability of regions and tables.
-For example, recovering a crashed cluster took around 9 hours with single-threaded log splitting, but only about six minutes with distributed log splitting.
-
-The information in this section is sourced from Jimmy Xiang's blog post at http://blog.cloudera.com/blog/2012/07/hbase-log-splitting/.
+WAL log splitting and recovery can be resource intensive and take a long time, depending on the number of RegionServers involved in the crash and the size of the regions. <<distributed.log.splitting>> was developed to improve performance during log splitting.
 
 .Enabling or Disabling Distributed Log Splitting
 

diff --git a/src/main/asciidoc/_chapters/community.adoc b/src/main/asciidoc/_chapters/community.adoc
@@ -62,11 +62,11 @@ Any -1 on a patch by anyone vetos a patch; it cannot be committed until the just
 .How to set fix version in JIRA on issue resolve
 
 Here is how link:http://search-hadoop.com/m/azemIi5RCJ1[we agreed] to set versions in JIRA when we resolve an issue.
-If trunk is going to be 0.98.0 then: 
+If master is going to be 0.98.0 then:
 
-* Commit only to trunk: Mark with 0.98 
-* Commit to 0.95 and trunk : Mark with 0.98, and 0.95.x 
-* Commit to 0.94.x and 0.95, and trunk: Mark with 0.98, 0.95.x, and 0.94.x 
+* Commit only to master: Mark with 0.98
+* Commit to 0.95 and master: Mark with 0.98, and 0.95.x
+* Commit to 0.94.x and 0.95, and master: Mark with 0.98, 0.95.x, and 0.94.x
 * Commit to 89-fb: Mark with 89-fb. 
 * Commit site fixes: no version 
 
@@ -103,7 +103,7 @@ Owners do not need to be committers.
 [[hbase.commit.msg.format]]
 == Commit Message format
 
-We link:http://search-hadoop.com/m/Gwxwl10cFHa1[agreed] to the following SVN commit message format: 
+We link:http://search-hadoop.com/m/Gwxwl10cFHa1[agreed] to the following Git commit message format: 
 [source]
 ----
 HBASE-xxxxx <title>. (<contributor>)

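Note on the commit message format in the community.adoc hunk above: the `HBASE-xxxxx <title>. (<contributor>)` shape can be checked mechanically. The following is a minimal sketch, assuming the JIRA id is numeric and the title ends with a period. The class name `CommitMessageCheck` and the regular expression are illustrative assumptions, not part of any HBase tooling.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of HBase: checks the first line of a commit
// message against the "HBASE-xxxxx <title>. (<contributor>)" shape.
class CommitMessageCheck {
    // Assumption: numeric JIRA id, title terminated by a period,
    // contributor name in parentheses at the end of the line.
    private static final Pattern FORMAT =
        Pattern.compile("^HBASE-\\d+ .+\\. \\(.+\\)$");

    static boolean isWellFormed(String firstLine) {
        return FORMAT.matcher(firstLine).matches();
    }

    public static void main(String[] args) {
        // A message in the agreed shape, then one that lacks the JIRA id.
        System.out.println(isWellFormed("HBASE-12345 Fix typo in docs. (Jane Doe)"));
        System.out.println(isWellFormed("Fix typo in docs"));
    }
}
```

A check like this could run in a local commit hook, though the project does not prescribe one.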
diff --git a/src/main/asciidoc/_chapters/developer.adoc b/src/main/asciidoc/_chapters/developer.adoc
@@ -569,8 +569,8 @@ Checkin the _CHANGES.txt_ and any version changes.
 
 . Update the documentation.
 +
-Update the documentation under _src/main/docbkx_.
-This usually involves copying the latest from trunk and making version-particular adjustments to suit this release candidate version. 
+Update the documentation under _src/main/asciidoc_.
+This usually involves copying the latest from master and making version-particular adjustments to suit this release candidate version. 
 
 . Build the source tarball.
 +
@@ -1689,7 +1689,9 @@ $ make_patch.sh [-a] [-p <patch_dir>]
   If you decline, the script uses +git diff+ instead.
   The patch is saved in a configurable directory and is ready to be attached to your JIRA.
 
-* .Patching WorkflowAlways patch against the master branch first, even if you want to patch in another branch.
+.Patching Workflow
+
+* Always patch against the master branch first, even if you want to patch in another branch.
   HBase committers always apply patches first to the master branch, and backport if necessary.
 * Submit one single patch for a fix.
   If necessary, squash local commits to merge local commits into a single one first.
@@ -1725,17 +1727,20 @@ Please understand that not every patch may get committed, and that feedback will
   However, at times it is easier to refer to different version of a patch if you add `-vX`, where the [replaceable]_X_ is the version (starting with 2).
 * If you need to submit your patch against multiple branches, rather than just master, name each version of the patch with the branch it is for, following the naming conventions in <<submitting.patches.create,submitting.patches.create>>.
 
-.Methods to Create PatchesEclipse::
+.Methods to Create Patches
+Eclipse::
   Select the  menu item.
 
 Git::
-  `git format-patch` is preferred because it preserves commit messages.
+  `git format-patch` is preferred:
+     - It preserves the committer and commit message.
+     - It handles binary files by default, whereas `git diff` ignores them unless
+     you use the `--binary` option.
   Use `git rebase -i` first, to combine (squash) smaller commits into a single larger one.
 
 Subversion::
-
-Make sure you review <<eclipse.code.formatting,eclipse.code.formatting>> and <<common.patch.feedback,common.patch.feedback>> for code style.
-If your patch was generated incorrectly or your code does not adhere to the code formatting guidelines, you may be asked to redo some work.
+  Make sure you review <<eclipse.code.formatting,eclipse.code.formatting>> and <<common.patch.feedback,common.patch.feedback>> for code style.
+  If your patch was generated incorrectly or your code does not adhere to the code formatting guidelines, you may be asked to redo some work.
 
 [[submitting.patches.tests]]
 ==== Unit Tests
@@ -1846,13 +1851,13 @@ The instructions and preferences around the way to create patches have changed,
   This is the indication that the patch was not created with `--no-prefix`.
 +
 ----
-diff --git a/src/main/docbkx/developer.xml b/src/main/docbkx/developer.xml
+diff --git a/src/main/asciidoc/_chapters/developer.adoc b/src/main/asciidoc/_chapters/developer.adoc
 ----
 
 * If the first line of the patch looks similar to the following (without the `a` and `b`), the patch was created with +git diff --no-prefix+ and you need to add `-p0` to the +git apply+ command below.
 +
 ----
-diff --git src/main/docbkx/developer.xml src/main/docbkx/developer.xml
+diff --git src/main/asciidoc/_chapters/developer.adoc src/main/asciidoc/_chapters/developer.adoc
 ----
 
 +
@@ -1930,7 +1935,7 @@ If the contributor used +git format-patch+ to generate the patch, their commit m
 [[committer.amending.author]]
 ====== Add Amending-Author when a conflict cherrypick backporting
 
-We've established the practice of committing to trunk and then cherry picking back to branches whenever possible.
+We've established the practice of committing to master and then cherry picking back to branches whenever possible.
 When there is a minor conflict we can fix it up and just proceed with the commit.
 The resulting commit retains the original author.
 When the amending author is different from the original committer, add notice of this at the end of the commit message as: `Amending-Author: Author
@@ -1951,7 +1956,7 @@ A committer should.
 In the thread link:http://search-hadoop.com/m/DHED4EiwOz[HBase, mail # dev - ANNOUNCEMENT: Git Migration In Progress (WAS =>
 Re: Git Migration)], the following patch flow was agreed on:
 
-. Develop and commit the patch against trunk/master first.
+. Develop and commit the patch against master first.
 . Try to cherry-pick the patch when backporting if possible.
 . If this does not work, manually commit the patch to the branch.                        
 

diff --git a/src/main/asciidoc/_chapters/getting_started.adoc b/src/main/asciidoc/_chapters/getting_started.adoc
@@ -80,15 +80,16 @@ See <<java,Java>> for information about supported JDK versions.
   This will take you to a mirror of _HBase
   Releases_.
   Click on the folder named _stable_ and then download the binary file that ends in _.tar.gz_ to your local filesystem.
-  Be sure to choose the version that corresponds with the version of Hadoop you are likely to use later.
-  In most cases, you should choose the file for Hadoop 2, which will be called something like _hbase-0.98.3-hadoop2-bin.tar.gz_.
+  For versions prior to 1.x, be sure to choose the version that corresponds with the version of Hadoop you are
+  likely to use later (in most cases, you should choose the file for Hadoop 2, which will be called
+  something like _hbase-0.98.13-hadoop2-bin.tar.gz_).
   Do not download the file ending in _src.tar.gz_ for now.
 . Extract the downloaded file, and change to the newly-created directory.
 +
 ----
 
-$ tar xzvf hbase-<?eval ${project.version}?>-hadoop2-bin.tar.gz
-$ cd hbase-<?eval ${project.version}?>-hadoop2/
+$ tar xzvf hbase-<?eval ${project.version}?>-bin.tar.gz
+$ cd hbase-<?eval ${project.version}?>/
 ----
 
 . For HBase 0.98.5 and later, you are required to set the `JAVA_HOME` environment variable before starting HBase.

diff --git a/src/main/asciidoc/_chapters/hbase_mob.adoc b/src/main/asciidoc/_chapters/hbase_mob.adoc
@@ -0,0 +1,236 @@
+////
+/**
+ *
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+////
+
+[[hbase_mob]]
+== Storing Medium-sized Objects (MOB)
+:doctype: book
+:numbered:
+:toc: left
+:icons: font
+:experimental:
+:toc: left
+:source-language: java
+
+Data comes in many sizes, and saving all of your data in HBase, including binary
+data such as images and documents, is ideal. While HBase can technically handle
+binary objects with cells that are larger than 100 KB in size, HBase's normal
+read and write paths are optimized for values smaller than 100KB in size. When
+HBase deals with large numbers of objects over this threshold, referred to here
+as medium objects, or MOBs, performance is degraded due to write amplification
+caused by splits and compactions. When using MOBs, ideally your objects will be between
+100KB and 10MB. HBase ***FIX_VERSION_NUMBER*** adds support
+for better managing large numbers of MOBs while maintaining performance,
+consistency, and low operational overhead. MOB support is provided by the work
+done in link:https://issues.apache.org/jira/browse/HBASE-11339[HBASE-11339]. To
+take advantage of MOB, you need to use <<hfilev3,HFile version 3>>. Optionally,
+configure the MOB file reader's cache settings for each RegionServer (see
+<<mob.cache.configure>>), then configure specific columns to hold MOB data.
+Client code does not need to change to take advantage of HBase MOB support. The
+feature is transparent to the client.
+
+=== Configuring Columns for MOB
+
+You can configure columns to support MOB during table creation or alteration,
+either in HBase Shell or via the Java API. The two relevant properties are the
+boolean `IS_MOB` and the `MOB_THRESHOLD`, which is the number of bytes at which
+an object is considered to be a MOB. Only `IS_MOB` is required. If you do not
+specify the `MOB_THRESHOLD`, the default threshold value of 100 KB is used.
+
+.Configure a Column for MOB Using HBase Shell
+====
+----
+hbase> create 't1', {NAME => 'f1', IS_MOB => true, MOB_THRESHOLD => 102400}
+hbase> alter 't1', {NAME => 'f1', IS_MOB => true, MOB_THRESHOLD => 102400}
+----
+====
+
+.Configure a Column for MOB Using the Java API
+====
+[source,java]
+----
+...
+HColumnDescriptor hcd = new HColumnDescriptor("f");
+hcd.setMobEnabled(true);
+...
+hcd.setMobThreshold(102400L);
+...
+----
+====
+
+
+=== Testing MOB
+
+The utility `org.apache.hadoop.hbase.IntegrationTestIngestMOB` is provided to assist with testing
+the MOB feature. The utility is run as follows:
+[source,bash]
+----
+$ sudo -u hbase hbase org.apache.hadoop.hbase.IntegrationTestIngestMOB \
+            -threshold 102400 \
+            -minMobDataSize 512 \
+            -maxMobDataSize 5120
+----
+
+* `*threshold*` is the threshold at which cells are considered to be MOBs.
+   The default is 1 kB, expressed in bytes.
+* `*minMobDataSize*` is the minimum value for the size of MOB data.
+   The default is 512 B, expressed in bytes.
+* `*maxMobDataSize*` is the maximum value for the size of MOB data.
+   The default is 5 kB, expressed in bytes.
+
+
+[[mob.cache.configure]]
+=== Configuring the MOB Cache
+
+
+Because there can be a large number of MOB files at any time, as compared to the number of HFiles,
+MOB files are not always kept open. The MOB file reader cache is an LRU cache which keeps the most
+recently used MOB files open. To configure the MOB file reader's cache on each RegionServer, add
+the following properties to the RegionServer's `hbase-site.xml`, customize the configuration to
+suit your environment, and restart or rolling restart the RegionServer.
+
+.Example MOB Cache Configuration
+====
+[source,xml]
+----
+<property>
+    <name>hbase.mob.file.cache.size</name>
+    <value>1000</value>
+    <description>
+      Number of opened file handlers to cache.
+      A larger value will benefit reads by providing more file handlers per mob
+      file cache and would reduce frequent file opening and closing.
+      However, if this is set too high, this could lead to a "too many opened file handlers" error.
+      The default value is 1000.
+    </description>
+</property>
+<property>
+    <name>hbase.mob.cache.evict.period</name>
+    <value>3600</value>
+    <description>
+      The amount of time in seconds after which an unused file is evicted from the
+      MOB cache. The default value is 3600 seconds.
+    </description>
+</property>
+<property>
+    <name>hbase.mob.cache.evict.remain.ratio</name>
+    <value>0.5f</value>
+    <description>
+      A multiplier (between 0.0 and 1.0) that determines how many files remain cached
+      after a cache eviction. An eviction is triggered when the number of cached files
+      reaches the `hbase.mob.file.cache.size` threshold.
+      The default value is 0.5f, which means that half the files (the least-recently-used
+      ones) are evicted.
+    </description>
+</property>
+----
+====
+
+=== MOB Optimization Tasks
+
+==== Manually Compacting MOB Files
+
+To manually compact MOB files, rather than waiting for the
+<<mob.cache.configure,configuration>> to trigger compaction, use the
+`compact_mob` or `major_compact_mob` HBase shell commands. These commands
+require the first argument to be the table name, and take an optional column
+family as the second argument. If the column family is omitted, all MOB-enabled
+column families are compacted.
+
+----
+hbase> compact_mob 't1', 'c1'
+hbase> compact_mob 't1'
+hbase> major_compact_mob 't1', 'c1'
+hbase> major_compact_mob 't1'
+----
+
+These commands are also available via `Admin.compactMob` and
+`Admin.majorCompactMob` methods.
+
+==== MOB Sweeper
+
+HBase MOB includes a MapReduce job called the Sweeper tool for
+optimization. The Sweeper tool coalesces small MOB files or MOB files with many
+deletions or updates. The Sweeper tool is not required if you use native MOB compaction, which
+does not rely on MapReduce.
+
+To configure the Sweeper tool, set the following options:
+
+[source,xml]
+----
+<property>
+    <name>hbase.mob.sweep.tool.compaction.ratio</name>
+    <value>0.5f</value>
+    <description>
+      If there are too many cells deleted in a mob file, it's regarded
+      as an invalid file and needs to be merged.
+      If existingCellsSize/mobFileSize is less than ratio, it's regarded
+      as an invalid file. The default value is 0.5f.
+    </description>
+</property>
+<property>
+    <name>hbase.mob.sweep.tool.compaction.mergeable.size</name>
+    <value>134217728</value>
+    <description>
+      If the size of a mob file is less than this value, it's regarded as a small
+      file and needs to be merged. The default value is 128MB.
+    </description>
+</property>
+<property>
+    <name>hbase.mob.sweep.tool.compaction.memstore.flush.size</name>
+    <value>134217728</value>
+    <description>
+      The flush size for the memstore used by sweep job. Each sweep reducer owns such a memstore.
+      The default value is 128MB.
+    </description>
+</property>
+<property>
+    <name>hbase.master.mob.ttl.cleaner.period</name>
+    <value>86400</value>
+    <description>
+      The period at which ExpiredMobFileCleanerChore runs, in seconds.
+      The default value is one day.
+    </description>
+</property>
+----
+
+Next, add the HBase install directory, _`$HBASE_HOME`/*_, and HBase library directory to
+_yarn-site.xml_. Adjust this example to suit your environment.
+[source,xml]
+----
+<property>
+    <description>Classpath for typical applications.</description>
+    <name>yarn.application.classpath</name>
+    <value>
+        $HADOOP_CONF_DIR,
+        $HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,
+        $HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,
+        $HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,
+        $HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*,
+        $HBASE_HOME/*, $HBASE_HOME/lib/*
+    </value>
+</property>
+----
+
+Finally, run the `sweeper` tool for each column family that is configured for MOB.
+[source,bash]
+----
+$ hbase org.apache.hadoop.hbase.mob.compactions.Sweeper _tableName_ _familyName_
+----
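
Note on the Sweeper options in the hbase_mob.adoc hunk above: the descriptions of `hbase.mob.sweep.tool.compaction.ratio` and `hbase.mob.sweep.tool.compaction.mergeable.size` imply a simple per-file decision. The following is a minimal sketch of that decision as described in the property text, not the Sweeper's actual implementation. The class and method names are hypothetical, and the constants mirror the documented defaults.

```java
// Hypothetical illustration, not HBase code: decides whether a MOB file is a
// merge candidate based on the two documented Sweeper thresholds.
class SweepCandidate {
    // Documented default of hbase.mob.sweep.tool.compaction.ratio.
    static final float COMPACTION_RATIO = 0.5f;
    // Documented default of hbase.mob.sweep.tool.compaction.mergeable.size (128MB).
    static final long MERGEABLE_SIZE = 134217728L;

    static boolean needsMerge(long existingCellsSize, long mobFileSize) {
        // A file smaller than the mergeable size is regarded as a small file.
        if (mobFileSize < MERGEABLE_SIZE) {
            return true;
        }
        // If existingCellsSize/mobFileSize is below the ratio, too many cells
        // were deleted and the file is regarded as invalid.
        double liveRatio = (double) existingCellsSize / mobFileSize;
        return liveRatio < COMPACTION_RATIO;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        System.out.println(needsMerge(200 * mb, 256 * mb)); // large, mostly live
        System.out.println(needsMerge(64 * mb, 256 * mb));  // large, mostly deleted
        System.out.println(needsMerge(10 * mb, 16 * mb));   // small file
    }
}
```

Checking the size first means an empty file never reaches the division.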
diff --git a/src/main/asciidoc/_chapters/performance.adoc b/src/main/asciidoc/_chapters/performance.adoc
@@ -546,7 +546,7 @@ To disable the WAL, see <<wal.disable>>.
 === HBase Client: Group Puts by RegionServer
 
 In addition to using the writeBuffer, grouping `Put`s by RegionServer can reduce the number of client RPC calls per writeBuffer flush.
-There is a utility `HTableUtil` currently on TRUNK that does this, but you can either copy that or implement your own version for those still on 0.90.x or earlier.
+The `HTableUtil` utility, currently on master, does this. If you are still on 0.90.x or earlier, you can either copy it or implement your own version.
 
 [[perf.hbase.write.mr.reducer]]
 === MapReduce: Skip The Reducer
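
Note on the "Group Puts by RegionServer" paragraph in the performance.adoc hunk above: for readers who implement their own version, the core of the technique is a bucketing step. The following is a minimal sketch, assuming the caller can supply a function that maps each write to the name of the server hosting its row. `PutGrouper`, `groupByServer`, and the toy locator are hypothetical and do not reproduce `HTableUtil`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration, not HBase code: buckets pending writes by the
// server that hosts them, so each bucket can be sent in a single batch.
class PutGrouper {
    static <P> Map<String, List<P>> groupByServer(
            List<P> puts, Function<? super P, String> serverOf) {
        // LinkedHashMap keeps servers in first-seen order for predictable output.
        Map<String, List<P>> groups = new LinkedHashMap<>();
        for (P put : puts) {
            groups.computeIfAbsent(serverOf.apply(put), k -> new ArrayList<>()).add(put);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Stand-in data: row keys instead of real Put objects, and a toy
        // locator that assigns rows to servers by their first character.
        List<String> rows = Arrays.asList("a1", "b1", "a2", "c1", "b2");
        Function<String, String> locator = row -> "rs-" + row.charAt(0);
        groupByServer(rows, locator)
            .forEach((server, batch) -> System.out.println(server + " " + batch));
    }
}
```

In a real client the locator would come from the region location lookup, and each bucket would be flushed with one batched call.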