HBASE-24405 : Document hbase:slowlog related operations (#1747)

Signed-off-by: ramkrish86 <ramkrishna@apache.org> Signed-off-by: Anoop Sam John <anoopsamjohn@apache.org>
apache · May 22, 2020 · e1d7dc8 · e1d7dc8
1 parent 9d9f07b
commit e1d7dc8
Show file tree

Hide file tree

Showing 4 changed files with 151 additions and 0 deletions.
diff --git a/hbase-common/src/main/resources/hbase-default.xml b/hbase-common/src/main/resources/hbase-default.xml
@@ -1947,6 +1947,18 @@ possible configurations would overwhelm and obscure the important.
       responses with complete data.
     </description>
   </property>
+  <property>
+    <name>hbase.regionserver.slowlog.systable.enabled</name>
+    <value>false</value>
+    <description>
+      Should be enabled only if hbase.regionserver.slowlog.buffer.enabled is enabled. If enabled
+      (true), all slow/large RPC logs would be persisted to system table hbase:slowlog (in addition
+      to in-memory ring buffer at each RegionServer). The records are stored in increasing
+      order of time. Operators can scan the table with various combination of ColumnValueFilter.
+      More details are provided in the doc section:
+      "Get Slow/Large Response Logs from System table hbase:slowlog"
+    </description>
+  </property>
   <property>
     <name>hbase.master.metafixer.max.merge.count</name>
     <value>64</value>

diff --git a/src/main/asciidoc/_chapters/hbase-default.adoc b/src/main/asciidoc/_chapters/hbase-default.adoc
@@ -2246,6 +2246,24 @@ The percent of region server RPC threads failed to abort RS.
 `false`
 
 
+[[hbase.regionserver.slowlog.systable.enabled]]
+*`hbase.regionserver.slowlog.systable.enabled`*::
++
+.Description
+
+      Should be enabled only if hbase.regionserver.slowlog.buffer.enabled is enabled.
+      If enabled (true), all slow/large RPC logs would be persisted to system table
+      hbase:slowlog (in addition to in-memory ring buffer at each RegionServer).
+      The records are stored in increasing order of time.
+      Operators can scan the table with various combination of ColumnValueFilter and
+      time range.
+      More details are provided in the doc section:
+      "Get Slow/Large Response Logs from System table hbase:slowlog"
+
++
+.Default
+`false`
+
 
 [[hbase.master.metafixer.max.merge.count]]
 *`hbase.master.metafixer.max.merge.count`*::

diff --git a/src/main/asciidoc/_chapters/ops_mgt.adoc b/src/main/asciidoc/_chapters/ops_mgt.adoc
@@ -2079,6 +2079,9 @@ Examples:
 
 ----
 
+include::slow_log_responses_from_systable.adoc[]
+
+
 === Block Cache Monitoring
 
 Starting with HBase 0.98, the HBase Web UI includes the ability to monitor and report on the performance of the block cache.

diff --git a/src/main/asciidoc/_chapters/slow_log_responses_from_systable.adoc b/src/main/asciidoc/_chapters/slow_log_responses_from_systable.adoc
@@ -0,0 +1,118 @@
+////
+/**
+ *
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+////
+
+[[slow_log_responses_from_systable]]
+==== Get Slow/Large Response Logs from System table hbase:slowlog
+
+The above section provides details about Admin APIs:
+
+* get_slowlog_responses
+* get_largelog_responses
+* clear_slowlog_responses
+
+All of the above APIs access online in-memory ring buffers from
+individual RegionServers and accumulate logs from ring buffers to display
+to end user. However, since the logs are stored in memory, after RegionServer is
+restarted, all the objects held in memory of that RegionServer will be cleaned up
+and previous logs are lost. What if we want to persist all these logs forever?
+What if we want to store them in such a manner that operator can get all historical
+records with some filters? e.g get me all large/slow RPC logs that are triggered by
+user1 and are related to region:
+cluster_test,cccccccc,1589635796466.aa45e1571d533f5ed0bb31cdccaaf9cf. ?
+
+If we have a system table that stores such logs in increasing (not so strictly though)
+order of time, it can definitely help operators debug some historical events
+(scan, get, put, compaction, flush etc) with detailed inputs.
+
+Config which enabled system table to be created and store all log events is
+`hbase.regionserver.slowlog.systable.enabled`.
+
+The default value for this config is `false`. If provided `true`
+(Note: `hbase.regionserver.slowlog.buffer.enabled` should also be `true`),
+a cron job running in every RegionServer will persist the slow/large logs into
+table hbase:slowlog. By default cron job runs every 10 min. Duration can be configured
+with key: `hbase.slowlog.systable.chore.duration`. By default, RegionServer will
+store upto 1000(config key: `hbase.regionserver.slowlog.systable.queue.size`)
+slow/large logs in an internal queue and the chore will retrieve these logs
+from the queue and perform batch insertion in hbase:slowlog.
+
+hbase:slowlog has single ColumnFamily: `info`
+`info` contains multiple qualifiers which are the same attributes present as
+part of `get_slowlog_responses` API response.
+
+* info:call_details
+* info:client_address
+* info:method_name
+* info:param
+* info:processing_time
+* info:queue_time
+* info:region_name
+* info:response_size
+* info:server_class
+* info:start_time
+* info:type
+* info:username
+
+And example of 2 rows from hbase:slowlog scan result:
+[source]
+----
+
+ \x024\xC1\x03\xE9\x04\xF5@                                  column=info:call_details, timestamp=2020-05-16T14:58:14.211Z, value=Scan(org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ScanRequest)
+ \x024\xC1\x03\xE9\x04\xF5@                                  column=info:client_address, timestamp=2020-05-16T14:58:14.211Z, value=172.20.10.2:57347
+ \x024\xC1\x03\xE9\x04\xF5@                                  column=info:method_name, timestamp=2020-05-16T14:58:14.211Z, value=Scan
+ \x024\xC1\x03\xE9\x04\xF5@                                  column=info:param, timestamp=2020-05-16T14:58:14.211Z, value=region { type: REGION_NAME value: "hbase:meta,,1" } scan { column { family: "info" } attribute { name: "_isolationle
+                                                             vel_" value: "\x5C000" } start_row: "cluster_test,33333333,99999999999999" stop_row: "cluster_test,," time_range { from: 0 to: 9223372036854775807 } max_versions: 1 cache_blocks
+                                                             : true max_result_size: 2097152 reversed: true caching: 10 include_stop_row: true readType: PREAD } number_of_rows: 10 close_scanner: false client_handles_partials: true client_
+                                                             handles_heartbeats: true track_scan_metrics: false
+ \x024\xC1\x03\xE9\x04\xF5@                                  column=info:processing_time, timestamp=2020-05-16T14:58:14.211Z, value=18
+ \x024\xC1\x03\xE9\x04\xF5@                                  column=info:queue_time, timestamp=2020-05-16T14:58:14.211Z, value=0
+ \x024\xC1\x03\xE9\x04\xF5@                                  column=info:region_name, timestamp=2020-05-16T14:58:14.211Z, value=hbase:meta,,1
+ \x024\xC1\x03\xE9\x04\xF5@                                  column=info:response_size, timestamp=2020-05-16T14:58:14.211Z, value=1575
+ \x024\xC1\x03\xE9\x04\xF5@                                  column=info:server_class, timestamp=2020-05-16T14:58:14.211Z, value=HRegionServer
+ \x024\xC1\x03\xE9\x04\xF5@                                  column=info:start_time, timestamp=2020-05-16T14:58:14.211Z, value=1589640743732
+ \x024\xC1\x03\xE9\x04\xF5@                                  column=info:type, timestamp=2020-05-16T14:58:14.211Z, value=ALL
+ \x024\xC1\x03\xE9\x04\xF5@                                  column=info:username, timestamp=2020-05-16T14:58:14.211Z, value=user2
+ \x024\xC1\x06X\x81\xF6\xEC                                  column=info:call_details, timestamp=2020-05-16T14:59:58.764Z, value=Scan(org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ScanRequest)
+ \x024\xC1\x06X\x81\xF6\xEC                                  column=info:client_address, timestamp=2020-05-16T14:59:58.764Z, value=172.20.10.2:57348
+ \x024\xC1\x06X\x81\xF6\xEC                                  column=info:method_name, timestamp=2020-05-16T14:59:58.764Z, value=Scan
+ \x024\xC1\x06X\x81\xF6\xEC                                  column=info:param, timestamp=2020-05-16T14:59:58.764Z, value=region { type: REGION_NAME value: "cluster_test,cccccccc,1589635796466.aa45e1571d533f5ed0bb31cdccaaf9cf." } scan { a
+                                                             ttribute { name: "_isolationlevel_" value: "\x5C000" } start_row: "cccccccc" time_range { from: 0 to: 9223372036854775807 } max_versions: 1 cache_blocks: true max_result_size: 2
+                                                             097152 caching: 2147483647 include_stop_row: false } number_of_rows: 2147483647 close_scanner: false client_handles_partials: true client_handles_heartbeats: true track_scan_met
+                                                             rics: false
+ \x024\xC1\x06X\x81\xF6\xEC                                  column=info:processing_time, timestamp=2020-05-16T14:59:58.764Z, value=24
+ \x024\xC1\x06X\x81\xF6\xEC                                  column=info:queue_time, timestamp=2020-05-16T14:59:58.764Z, value=0
+ \x024\xC1\x06X\x81\xF6\xEC                                  column=info:region_name, timestamp=2020-05-16T14:59:58.764Z, value=cluster_test,cccccccc,1589635796466.aa45e1571d533f5ed0bb31cdccaaf9cf.
+ \x024\xC1\x06X\x81\xF6\xEC                                  column=info:response_size, timestamp=2020-05-16T14:59:58.764Z, value=211227
+ \x024\xC1\x06X\x81\xF6\xEC                                  column=info:server_class, timestamp=2020-05-16T14:59:58.764Z, value=HRegionServer
+ \x024\xC1\x06X\x81\xF6\xEC                                  column=info:start_time, timestamp=2020-05-16T14:59:58.764Z, value=1589640743932
+ \x024\xC1\x06X\x81\xF6\xEC                                  column=info:type, timestamp=2020-05-16T14:59:58.764Z, value=ALL
+ \x024\xC1\x06X\x81\xF6\xEC                                  column=info:username, timestamp=2020-05-16T14:59:58.764Z, value=user1
+----
+
+Operator can use ColumnValueFilter to filter records based on region_name, username,
+client_address etc.
+
+Time range based queries will also be very useful.
+Example:
+[source]
+----
+scan 'hbase:slowlog', { TIMERANGE => [1589621394000, 1589637999999] }
+----