More doc on fixing
saintstack committed Oct 23, 2018
1 parent 7c13e2c commit 7ed9e83cfe1e78759c92c387afc35eb89383d3b4
Showing 1 changed file with 150 additions and 62 deletions.
@@ -1,32 +1,89 @@
# HBCK2
# Apache HBase HBCK2 Tool

HBCK2 is the successor to [hbck](https://hbase.apache.org/book.html#hbck.in.depth),
the hbase-1.x fixup tool (A.K.A _hbck1_). Use it in place of _hbck1_ when making repairs
against hbase-2.x installs.

## _hbck1_
The _hbck_ that ships with hbase-1.x (A.K.A _hbck1_) should not be run against an
The _hbck_ tool that ships with hbase-1.x (A.K.A _hbck1_) should not be run against an
hbase-2.x cluster. It may do damage. While _hbck1_ is still bundled inside hbase-2.x
-- to minimize surprise (it has a fat pointer to _HBCK2_ at the head of its help
output) -- its write-facility (`-fix`) has been removed. It can report on the state
of an hbase-2.x cluster but its assessments are likely inaccurate since it does not
understand the workings of an hbase-2.x.
understand the internal workings of an hbase-2.x.

_HBCK2_ does much less than _hbck1_ because many of the classes of problems
_hbck1_ addressed are either no longer issues in hbase-2.x, or we've made
(or will make) a dedicated tool to do what _hbck1_ used to do. _HBCK2_ also
works in a manner that differs from how _hbck1_ worked, asking the HBase
(or will make) a dedicated tool to do what _hbck1_ used to incorporate. _HBCK2_ also
works in a manner that differs from how _hbck1_ operated, asking the HBase
Master to do its bidding, rather than replicate functionality outside of the
Master inside the _hbck1_ tool.


## Running _HBCK2_
`org.apache.hbase.HBCK2` is the name of the main class. Running the below
will dump out the _HBCK2_ usage:
`org.apache.hbase.HBCK2` is the name of the _HBCK2_ main class. After building
_HBCK2_ to generate the _HBCK2_ jar file, running the below will dump out the _HBCK2_ usage:

~~~~
$ HBASE_CLASSPATH_PREFIX=/tmp/hbase-hbck2-1.0.0-SNAPSHOT.jar ./bin/hbase org.apache.hbase.HBCK2
$ HBASE_CLASSPATH_PREFIX=./hbase-hbck2-1.0.0-SNAPSHOT.jar ./bin/hbase org.apache.hbase.HBCK2
~~~~

```
usage: HBCK2 [OPTIONS] COMMAND <ARGS>

Options:
 -d,--debug                                 run with debug output
 -h,--help                                  output this help message
 -p,--hbase.zookeeper.property.clientPort   port of target hbase ensemble
 -q,--hbase.zookeeper.quorum <arg>          ensemble of target hbase
 -v,--version                               this hbck2 version
 -z,--zookeeper.znode.parent                parent znode of target hbase

Commands:
 assigns [OPTIONS] <ENCODED_REGIONNAME>...
   Options:
    -o,--override  override ownership by another procedure
   A 'raw' assign that can be used even during Master initialization.
   Skirts Coprocessors. Pass one or more encoded RegionNames.
   1588230740 is the hard-coded name for the hbase:meta region and
   de00010733901a05f5a2a3a382e27dd4 is an example of what a user-space
   encoded Region name looks like. For example:
     $ HBCK2 assign 1588230740 de00010733901a05f5a2a3a382e27dd4
   Returns the pid(s) of the created AssignProcedure(s) or -1 if none.

 bypass [OPTIONS] <PID>...
   Options:
    -o,--override   override if procedure is running/stuck
    -r,--recursive  bypass parent and its children. SLOW! EXPENSIVE!
    -w,--lockWait   milliseconds to wait on lock before giving up;
                    default=1
   Pass one (or more) procedure 'pid's to skip to procedure finish.
   Parent of bypassed procedure will also be skipped to the finish.
   Entities will be left in an inconsistent state and will require
   manual fixup. May need Master restart to clear locks still held.
   Bypass fails if procedure has children. Add 'recursive' if all
   you have is a parent pid to finish parent and children. This
   is SLOW, and dangerous so use selectively. Does not always work.

 unassigns <ENCODED_REGIONNAME>...
   Options:
    -o,--override  override ownership by another procedure
   A 'raw' unassign that can be used even during Master initialization.
   Skirts Coprocessors. Pass one or more encoded RegionNames:
   1588230740 is the hard-coded name for the hbase:meta region and
   de00010733901a05f5a2a3a382e27dd4 is an example of what a user-space
   encoded Region name looks like. For example:
     $ HBCK2 unassign 1588230740 de00010733901a05f5a2a3a382e27dd4
   Returns the pid(s) of the created UnassignProcedure(s) or -1 if none.

 setTableState <TABLENAME> <STATE>
   Possible table states: ENABLED, DISABLED, DISABLING, ENABLING
   To read current table state, in the hbase shell run:
     hbase> get 'hbase:meta', '<TABLENAME>', 'table:state'
   A value of \x08\x00 == ENABLED, \x08\x01 == DISABLED, etc.
   An example making table name 'user' ENABLED:
     $ HBCK2 setTableState users ENABLED
   Returns whatever the previous table state was.
```
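Typing the full classpath prefix and main class name for every invocation gets tedious. A minimal sketch of a wrapper follows. The function name `hbck2` and the `HBCK2_JAR` and `HBASE_BIN` variables are made up for illustration and are not part of the tool; the jar path and the `hbase` launcher location are assumptions to adjust for your install.

```shell
# Assumptions: jar path and launcher name below; override either via the environment.
HBCK2_JAR="${HBCK2_JAR:-./hbase-hbck2-1.0.0-SNAPSHOT.jar}"
HBASE_BIN="${HBASE_BIN:-hbase}"

# hbck2 (hypothetical helper): forward all arguments to the HBCK2 main class.
hbck2() {
  HBASE_CLASSPATH_PREFIX="$HBCK2_JAR" "$HBASE_BIN" org.apache.hbase.HBCK2 "$@"
}

# Example invocations, commented out so that sourcing this file runs nothing:
# hbck2 --help
# hbck2 assigns 1588230740
# hbck2 setTableState users ENABLED
```

Source the snippet in your shell and the commands in the usage output can then be run as `hbck2 <COMMAND> <ARGS>`.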
## _HBCK2_ Overview
_HBCK2_ is currently a simple tool that does one thing at a time only.
@@ -41,73 +98,80 @@ _HBCK2_ works by making use of an intentionally obscured `HbckService` hosted on
Master. The Service publishes a few methods for the _HBCK2_ tool to pull on. The
first thing _HBCK2_ does is poke the cluster to ensure the service is available.
It will fail if it is not or if the `HbckService` is lacking a wanted facility.
_HBCK2_ versions should be able to work across multiple hbase-2 releases; it will
fail with a message if it is unable to run. There is no `HbckService` in versions
of hbase before 2.0.3 and 2.1.1; _HBCK2_ will not work against these versions.
_HBCK2_ versions should be able to work across multiple hbase-2 releases. It will
fail with a complaint if it is unable to run. There is no `HbckService` in versions
of hbase before 2.0.3 and 2.1.1. _HBCK2_ will not work against these versions.
## Finding Problems
While _hbck1_ performed an analysis reporting your cluster good or bad, _HBCK2_
does no such thing (not currently). The operator figures what needs fixing and
then uses tools including _HBCK2_ to do fixup.
While _hbck1_ performed analysis reporting your cluster GOOD or BAD, _HBCK2_
is less presumptuous. In hbase-2.x, the operator figures what needs fixing and
then uses tooling including _HBCK2_ to do fixup.
To figure issues in assignment, make use of the following utilities.
To figure if there are issues in assignment, check the Master logs, the Master UI home
page _table_ tab at `https://YOUR_HOST:YOUR_PORT/master-status#tables`,
the current _Procedures & Locks_ tab at
`https://YOUR_HOST:YOUR_PORT/procedures.jsp` off the Master UI home page,
the HBase Canary tool, and reading Region state out of the `hbase:meta`
table. Let's look at each in turn. We'll follow this review with a set of
scenarios in which we use the below tooling to do various fixes.
### Diagnosis Tooling
### Master Logs
#### Master Logs
The Master runs all assignments, server crash handling, cluster start and
stop, etc. In hbase-2.x, all that the Master does has been cast as
Procedures run on a state machine engine. See [Procedure Framework](https://hbase.apache.org/book.html#pv2)
Procedures run on a state machine engine. See
[Procedure Framework](https://hbase.apache.org/book.html#pv2)
and [Assignment Manager](https://hbase.apache.org/book.html#amv2)
for detail on how this infrastructure works. Each Procedure has a
Procedure `id`, its `pid`. You can trace the lifecycle of a
Procedure as it logs each of its macro steps denoted by its
`pid`. Procedures start, step through states and finish. Some
Procedures spawn sub-procedures, wait on their Children, and then
themselves finish.
for detail on how this new infrastructure works. Each Procedure has a
unique Procedure `id`, its `pid`, that it lists on each log line.
Following the _pid_, you can trace the lifecycle of a Procedure in the
Master logs as Procedures transition from start, through each of the
Procedure's various stages to finish. Some Procedures spawn sub-procedures,
wait on their Children, and then themselves finish. Each child logs
its _pid_ but also its _ppid_, its parent's _pid_.
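Following a _pid_ through the Master log can be done with `grep`. The sketch below assumes the log lines carry the literal tokens `pid=<N>` and, for children, `ppid=<N>`; check a few lines of your own Master log to confirm the exact spelling. The function names are invented for this example.

```shell
# own_steps PID: print only the log lines written by the Procedure itself.
# The leading "(^|[^p])" keeps "ppid=12" from matching a search for "pid=12".
own_steps() {
  grep -E "(^|[^p])pid=$1([^0-9]|\$)"
}

# children_of PID: print the log lines of sub-procedures whose parent is PID.
children_of() {
  grep -E "ppid=$1([^0-9]|\$)"
}

# Both read the log on stdin, for example (log path is a placeholder):
# own_steps 1234 < /path/to/master.log
# children_of 1234 < /path/to/master.log
```

The trailing `([^0-9]|$)` stops a search for pid 12 from also returning pid 123.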
Generally all runs problem free but if some unforeseen circumstance
arises, the assignment framework may sustain damage requiring
operator intervention. Below we will discuss some such scenarios
but they manifest in the Master log as a Region being _STUCK_ or
a Procedure transitioning an entity -- a Region of a Table --
but they can manifest in the Master log as a Region being _STUCK_. Alternatively,
a Procedure transitioning an entity -- a Region or a Table --
may be blocked because another Procedure holds the exclusive lock
and is not letting go. More on these scenarios below.
and is not letting go.
_STUCK_ Procedures look like this:
```
2018-09-12 15:29:06,558 WARN org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK Region-In-Transition rit=OPENING, location=va1001.example.org,22101,1536173230599, table=IntegrationTestBigLinkedList_20180626110336, region=dbdb56242f17610c46ea044f7a42895b
```
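To collect the encoded Region names of everything reported _STUCK_, the log can be filtered on the message format shown in the sample line. This is a sketch that relies only on that format (`STUCK Region-In-Transition` and a trailing `region=<encoded name>`); the function name is made up for the example.

```shell
# stuck_regions: read a Master log on stdin, print each STUCK encoded Region name once.
stuck_regions() {
  grep 'STUCK Region-In-Transition' \
    | sed -n 's/.*region=\([0-9a-f]\{1,\}\).*/\1/p' \
    | sort -u
}

# Example (log path is a placeholder):
# stuck_regions < /path/to/master.log
```

The resulting names are in the form that the _HBCK2_ `assigns` and `unassigns` commands take as arguments.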
### /master-status#tables
This tab on the Master UI home-page shows a list of tables with
columns showing whether a
table _ENABLED_, _ENABLING_, _DISABLING_, or _DISABLED_ as well
as other attributes of table. Also listed are columns with counts
#### /master-status#tables
This section, about midway down the Master UI home page, shows a list of tables
with columns for whether the table is _ENABLED_, _ENABLING_, _DISABLING_, or
_DISABLED_ among other attributes. Also listed are columns with counts
of Regions in their various transition states: _OPEN_, _CLOSED_,
etc. A read of this table is good for figuring if the Regions of
this table have a proper disposition. For example if a table is
_ENABLED_ and there are Regions that are not in the _OPEN_ state
and the Master Log is silent about any ongoing assigns, then
something is amiss.
### Procedures & Locks
#### Procedures & Locks
This page off the Master UI home page under the
_Procedures & Locks_ tab lists all ongoing Procedures and
Locks as well as the current set of Master Proc WALs (named
_pv2-0000000000000000###.log_ under the _MasterProcWALs_
_Procedures & Locks_ menu item in the page heading lists all ongoing
Procedures and Locks as well as the current set of Master Procedure WALs
(named _pv2-0000000000000000###.log_ under the _MasterProcWALs_
directory in your hbase install). On startup, on a large
cluster when furious assigning is afoot, this page is
filled with lists of Procedures and Locks. The count of
MasterProcWALs will bloat too. If after the cluster settles,
there is a stuck Lock or Procedure or the count of WALs
doesn't ever come down but only grows, then operator intervention
is required.
is needed to alleviate the blockage.
Lists of locks and procedures can also be obtained via the hbase shell:
### The [HBase Canary Tool](http://hbase.apache.org/book.html#_canary)
```
$ echo "list_locks"| hbase shell &> /tmp/locks.txt
$ echo "list_procedures"| hbase shell &> /tmp/procedures.txt
```
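Whether the count of Master Procedure WALs is coming down or only growing can be checked by counting the `pv2-*.log` files from time to time. The sketch below counts matching names in a directory listing read from stdin. The function name is invented, and the `/hbase/MasterProcWALs` path in the usage comment assumes your hbase root directory is `/hbase` on HDFS.

```shell
# count_proc_wals: count lines of a directory listing that name a pv2-<digits>.log file.
count_proc_wals() {
  grep -c 'pv2-[0-9]\{1,\}\.log'
}

# Example, run periodically and compare the numbers:
# hdfs dfs -ls /hbase/MasterProcWALs | count_proc_wals
```

A count that keeps rising after the cluster has settled is the signal described above.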
#### The [HBase Canary Tool](http://hbase.apache.org/book.html#_canary)
The Canary tool is useful for verifying the state of assignment.
It can be run with a table focus or against the whole cluster.
@@ -121,16 +185,17 @@ fetches and the _-t 6000000_ tells the Canary run for ~two hours
maximum. When done, check out _/tmp/canary.log_. Grep for
_ERROR_ lines to find problematic Region assigns.
To do similar to what the Canary does against a single Region,
in the hbase shell, do something similar. In our example, the
Region belongs to the table _testtable_ and the Region
start row is _d1dddd0c_ (For overview on parsing a Region
name into its constituent parts, see
[RegionInfo API](https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/RegionInfo.html)):
You can do a probe like the Canary's in the hbase shell.
For example, given a Region that has a start row of _d1dddd0c_
belonging to the table _testtable_, do as follows:
```
hbase> scan 'testtable', {STARTROW => 'd1dddd0c', LIMIT => 10}
```
### Other Tools
For an overview on parsing a Region name into its constituent parts, see
[RegionInfo API](https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/RegionInfo.html).
#### Other Tools
To figure the list of Regions that are not _OPEN_ on an
_ENABLED_ or _ENABLING_ table, read the _hbase:meta_ table _info:state_ column.
@@ -152,7 +217,7 @@ can do bulk assigning.
General principles include that a Region cannot be assigned if
it is in _CLOSING_ state (or the inverse, unassigned if in
_OPENING_ state) without first transitioning via _CLOSED_:
Regions always move from _CLOSED_, to _OPENING_, to _OPEN_,
Regions must always move from _CLOSED_, to _OPENING_, to _OPEN_,
and then to _CLOSING_, _CLOSED_.
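The cycle stated above can be written down as a small lookup, handy as a sanity check before issuing an assign or unassign by hand. This is a sketch of only the four-state cycle named in the text; the function name is made up, and any other Region states your cluster reports are not modelled here.

```shell
# valid_transition FROM TO: succeed only if TO directly follows FROM in the
# cycle CLOSED -> OPENING -> OPEN -> CLOSING -> CLOSED.
valid_transition() {
  case "$1>$2" in
    "CLOSED>OPENING"|"OPENING>OPEN"|"OPEN>CLOSING"|"CLOSING>CLOSED") return 0 ;;
    *) return 1 ;;
  esac
}

# Example: a Region in CLOSING cannot go straight to OPENING.
# valid_transition CLOSING OPENING || echo "must pass through CLOSED first"
```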
When making repair, do fixup a table at a time.
@@ -169,25 +234,48 @@ assign, and then set it back again after the unassign.
_HBCK2_ has facility to allow you to do this. See the
_HBCK2_ usage output.
#### Start-over
### Start-over
If the Master is distraught and all attempts at fixup only
At an extreme, if the Master is distraught and all attempts at fixup only
turn up undoable locks or Procedures that won't finish, and/or
the set of MasterProcWALs is growing without bound, it is
possible to have the Master start over. Just move aside the
possible to wipe the Master state clean. Just move aside the
_/hbase/MasterProcWALs/_ directory under your hbase install and
restart the Master. It will come back as a tabula rasa without
restart the Master process. It will come back as a `tabula rasa` without
memory of the bad times past.
If at the time of the erasure, all Regions were happily
assigned or offlined, then on Master restart, all should be
like a brand new day. But if there were Regions-In-Transition
assigned or offlined, then on Master restart, the Master should
pick up and continue as though nothing happened. But if there were Regions-In-Transition
at the time, then the operator may have to intervene to bring outstanding
assigns/unassigns to their terminal point.
assigns/unassigns to their terminal point. Read the _hbase:meta_
_info:state_ columns as described above to figure what needs
assigning/unassigning. Having erased all history by moving aside
the _MasterProcWALs_, none of the entities should be locked so
you are free to bulk assign/unassign.
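Moving the directory aside, rather than deleting it, keeps the old Procedure WALs around should you need to inspect them later. The sketch below only prints the rename command so that it can be reviewed before being run by hand. It assumes the hbase root directory is `/hbase` on HDFS; the function name, the `PROC_WAL_DIR` variable, and the backup suffix are all invented for this example.

```shell
# Assumption: Master Procedure WALs live under /hbase on HDFS; override if not.
PROC_WAL_DIR="${PROC_WAL_DIR:-/hbase/MasterProcWALs}"

# move_aside_cmd SUFFIX: print (do not run) the command that renames the directory.
move_aside_cmd() {
  printf 'hdfs dfs -mv %s %s.%s\n' "$PROC_WAL_DIR" "$PROC_WAL_DIR" "$1"
}

# Example: print the command with a timestamped suffix, review it, then run it.
# move_aside_cmd "bak-$(date +%Y%m%d%H%M%S)"
```

After running the printed command, restart the Master process as described above.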
### Assigning/Unassigning
Generally, on assign, the Master will persist until successful.
An assign takes an exclusive lock on the Region. This precludes
a concurrent assign or unassign from running. An assign against
a locked Region will wait until the lock is released before
making progress. See the _Procedures & Locks_ section above for the
current list of outstanding Locks.
### _Master startup cannot progress, in holding-pattern until region onlined_
This should never happen. If it does, here is what it looks like:
```
2018-10-01 22:07:42,792 WARN org.apache.hadoop.hbase.master.HMaster: hbase:meta,,1.1588230740 is NOT online; state={1588230740 state=CLOSING, ts=1538456302300, server=ve1017.example.org,22101,1538449648131}; ServerCrashProcedures=true. Master startup cannot progress, in holding-pattern until region onlined.
```
The Master is unable to continue startup because there is no Procedure to assign
_hbase:meta_ (or _hbase:namespace_). To inject one, use the _HBCK2_ tool:
```
$ HBASE_CLASSPATH_PREFIX=./hbase-hbck2-1.0.0-SNAPSHOT.jar hbase org.apache.hbase.HBCK2 assigns 1588230740
```
...where 1588230740 is the encoded name of the _hbase:meta_ Region.
by table
file issues...
TODO: fix version file.
TODO: a rebuild of hbase:meta table by reading the fs content.
The same may happen to the _hbase:namespace_ system table. Look for the
encoded Region name of the _hbase:namespace_ Region and do similar to
what we did for _hbase:meta_.
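The encoded Region name and its current state can be read straight off the holding-pattern log line. The sketch below relies only on the message format in the sample above (`is NOT online; state={<encoded name> state=<STATE>, ...}`); the function name is made up for this example.

```shell
# offline_system_regions: read a Master log on stdin and print
# "<encoded region name> <state>" for each "is NOT online" warning, deduplicated.
offline_system_regions() {
  sed -n 's/.*is NOT online; state={\([0-9a-f]\{1,\}\) state=\([A-Z_]\{1,\}\),.*/\1 \2/p' \
    | sort -u
}

# Example (log path is a placeholder):
# offline_system_regions < /path/to/master.log
```

The first column is what to pass to the _HBCK2_ `assigns` command.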
