diff --git a/hbase-common/src/main/resources/hbase-default.xml b/hbase-common/src/main/resources/hbase-default.xml index d1a5cb94d985..da6ab0f39812 100644 --- a/hbase-common/src/main/resources/hbase-default.xml +++ b/hbase-common/src/main/resources/hbase-default.xml @@ -1947,6 +1947,18 @@ possible configurations would overwhelm and obscure the important. responses with complete data. + + hbase.regionserver.slowlog.systable.enabled + false + + Should be enabled only if hbase.regionserver.slowlog.buffer.enabled is enabled. If enabled + (true), all slow/large RPC logs would be persisted to system table hbase:slowlog (in addition + to in-memory ring buffer at each RegionServer). The records are stored in increasing + order of time. Operators can scan the table with various combination of ColumnValueFilter. + More details are provided in the doc section: + "Get Slow/Large Response Logs from System table hbase:slowlog" + + hbase.master.metafixer.max.merge.count 64 diff --git a/src/main/asciidoc/_chapters/hbase-default.adoc b/src/main/asciidoc/_chapters/hbase-default.adoc index 6da700423247..cdefb5cfcd75 100644 --- a/src/main/asciidoc/_chapters/hbase-default.adoc +++ b/src/main/asciidoc/_chapters/hbase-default.adoc @@ -2246,6 +2246,24 @@ The percent of region server RPC threads failed to abort RS. `false` +[[hbase.regionserver.slowlog.systable.enabled]] +*`hbase.regionserver.slowlog.systable.enabled`*:: ++ +.Description + + Should be enabled only if hbase.regionserver.slowlog.buffer.enabled is enabled. + If enabled (true), all slow/large RPC logs would be persisted to system table + hbase:slowlog (in addition to in-memory ring buffer at each RegionServer). + The records are stored in increasing order of time. + Operators can scan the table with various combination of ColumnValueFilter and + time range. + More details are provided in the doc section: + "Get Slow/Large Response Logs from System table hbase:slowlog" + ++ +.Default +`false` + [[hbase.master.metafixer.max.merge.count]] *`hbase.master.metafixer.max.merge.count`*:: diff --git a/src/main/asciidoc/_chapters/ops_mgt.adoc b/src/main/asciidoc/_chapters/ops_mgt.adoc index e159a325d664..96a7d80305fc 100644 --- a/src/main/asciidoc/_chapters/ops_mgt.adoc +++ b/src/main/asciidoc/_chapters/ops_mgt.adoc @@ -2079,6 +2079,9 @@ Examples: ---- +include::slow_log_responses_from_systable.adoc[] + + === Block Cache Monitoring Starting with HBase 0.98, the HBase Web UI includes the ability to monitor and report on the performance of the block cache. diff --git a/src/main/asciidoc/_chapters/slow_log_responses_from_systable.adoc b/src/main/asciidoc/_chapters/slow_log_responses_from_systable.adoc new file mode 100644 index 000000000000..72189045e8cb --- /dev/null +++ b/src/main/asciidoc/_chapters/slow_log_responses_from_systable.adoc @@ -0,0 +1,118 @@ +//// +/** + * + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +//// + +[[slow_log_responses_from_systable]] +==== Get Slow/Large Response Logs from System table hbase:slowlog + +The above section provides details about Admin APIs: + +* get_slowlog_responses +* get_largelog_responses +* clear_slowlog_responses + +All of the above APIs access online in-memory ring buffers from +individual RegionServers and accumulate logs from ring buffers to display +to end user. However, since the logs are stored in memory, after RegionServer is +restarted, all the objects held in memory of that RegionServer will be cleaned up +and previous logs are lost. What if we want to persist all these logs forever? +What if we want to store them in such a manner that operator can get all historical +records with some filters? e.g get me all large/slow RPC logs that are triggered by +user1 and are related to region: +cluster_test,cccccccc,1589635796466.aa45e1571d533f5ed0bb31cdccaaf9cf. ? + +If we have a system table that stores such logs in increasing (not so strictly though) +order of time, it can definitely help operators debug some historical events +(scan, get, put, compaction, flush etc) with detailed inputs. + +Config which enabled system table to be created and store all log events is +`hbase.regionserver.slowlog.systable.enabled`. + +The default value for this config is `false`. If provided `true` +(Note: `hbase.regionserver.slowlog.buffer.enabled` should also be `true`), +a cron job running in every RegionServer will persist the slow/large logs into +table hbase:slowlog. By default cron job runs every 10 min. Duration can be configured +with key: `hbase.slowlog.systable.chore.duration`. By default, RegionServer will +store upto 1000(config key: `hbase.regionserver.slowlog.systable.queue.size`) +slow/large logs in an internal queue and the chore will retrieve these logs +from the queue and perform batch insertion in hbase:slowlog. + +hbase:slowlog has single ColumnFamily: `info` +`info` contains multiple qualifiers which are the same attributes present as +part of `get_slowlog_responses` API response. + +* info:call_details +* info:client_address +* info:method_name +* info:param +* info:processing_time +* info:queue_time +* info:region_name +* info:response_size +* info:server_class +* info:start_time +* info:type +* info:username + +And example of 2 rows from hbase:slowlog scan result: +[source] +---- + + \x024\xC1\x03\xE9\x04\xF5@ column=info:call_details, timestamp=2020-05-16T14:58:14.211Z, value=Scan(org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ScanRequest) + \x024\xC1\x03\xE9\x04\xF5@ column=info:client_address, timestamp=2020-05-16T14:58:14.211Z, value=172.20.10.2:57347 + \x024\xC1\x03\xE9\x04\xF5@ column=info:method_name, timestamp=2020-05-16T14:58:14.211Z, value=Scan + \x024\xC1\x03\xE9\x04\xF5@ column=info:param, timestamp=2020-05-16T14:58:14.211Z, value=region { type: REGION_NAME value: "hbase:meta,,1" } scan { column { family: "info" } attribute { name: "_isolationle + vel_" value: "\x5C000" } start_row: "cluster_test,33333333,99999999999999" stop_row: "cluster_test,," time_range { from: 0 to: 9223372036854775807 } max_versions: 1 cache_blocks + : true max_result_size: 2097152 reversed: true caching: 10 include_stop_row: true readType: PREAD } number_of_rows: 10 close_scanner: false client_handles_partials: true client_ + handles_heartbeats: true track_scan_metrics: false + \x024\xC1\x03\xE9\x04\xF5@ column=info:processing_time, timestamp=2020-05-16T14:58:14.211Z, value=18 + \x024\xC1\x03\xE9\x04\xF5@ column=info:queue_time, timestamp=2020-05-16T14:58:14.211Z, value=0 + \x024\xC1\x03\xE9\x04\xF5@ column=info:region_name, timestamp=2020-05-16T14:58:14.211Z, value=hbase:meta,,1 + \x024\xC1\x03\xE9\x04\xF5@ column=info:response_size, timestamp=2020-05-16T14:58:14.211Z, value=1575 + \x024\xC1\x03\xE9\x04\xF5@ column=info:server_class, timestamp=2020-05-16T14:58:14.211Z, value=HRegionServer + \x024\xC1\x03\xE9\x04\xF5@ column=info:start_time, timestamp=2020-05-16T14:58:14.211Z, value=1589640743732 + \x024\xC1\x03\xE9\x04\xF5@ column=info:type, timestamp=2020-05-16T14:58:14.211Z, value=ALL + \x024\xC1\x03\xE9\x04\xF5@ column=info:username, timestamp=2020-05-16T14:58:14.211Z, value=user2 + \x024\xC1\x06X\x81\xF6\xEC column=info:call_details, timestamp=2020-05-16T14:59:58.764Z, value=Scan(org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ScanRequest) + \x024\xC1\x06X\x81\xF6\xEC column=info:client_address, timestamp=2020-05-16T14:59:58.764Z, value=172.20.10.2:57348 + \x024\xC1\x06X\x81\xF6\xEC column=info:method_name, timestamp=2020-05-16T14:59:58.764Z, value=Scan + \x024\xC1\x06X\x81\xF6\xEC column=info:param, timestamp=2020-05-16T14:59:58.764Z, value=region { type: REGION_NAME value: "cluster_test,cccccccc,1589635796466.aa45e1571d533f5ed0bb31cdccaaf9cf." } scan { a + ttribute { name: "_isolationlevel_" value: "\x5C000" } start_row: "cccccccc" time_range { from: 0 to: 9223372036854775807 } max_versions: 1 cache_blocks: true max_result_size: 2 + 097152 caching: 2147483647 include_stop_row: false } number_of_rows: 2147483647 close_scanner: false client_handles_partials: true client_handles_heartbeats: true track_scan_met + rics: false + \x024\xC1\x06X\x81\xF6\xEC column=info:processing_time, timestamp=2020-05-16T14:59:58.764Z, value=24 + \x024\xC1\x06X\x81\xF6\xEC column=info:queue_time, timestamp=2020-05-16T14:59:58.764Z, value=0 + \x024\xC1\x06X\x81\xF6\xEC column=info:region_name, timestamp=2020-05-16T14:59:58.764Z, value=cluster_test,cccccccc,1589635796466.aa45e1571d533f5ed0bb31cdccaaf9cf. + \x024\xC1\x06X\x81\xF6\xEC column=info:response_size, timestamp=2020-05-16T14:59:58.764Z, value=211227 + \x024\xC1\x06X\x81\xF6\xEC column=info:server_class, timestamp=2020-05-16T14:59:58.764Z, value=HRegionServer + \x024\xC1\x06X\x81\xF6\xEC column=info:start_time, timestamp=2020-05-16T14:59:58.764Z, value=1589640743932 + \x024\xC1\x06X\x81\xF6\xEC column=info:type, timestamp=2020-05-16T14:59:58.764Z, value=ALL + \x024\xC1\x06X\x81\xF6\xEC column=info:username, timestamp=2020-05-16T14:59:58.764Z, value=user1 +---- + +Operator can use ColumnValueFilter to filter records based on region_name, username, +client_address etc. + +Time range based queries will also be very useful. +Example: +[source] +---- +scan 'hbase:slowlog', { TIMERANGE => [1589621394000, 1589637999999] } +----