HBASE-21405 [DOC] Add Details about Output of "status 'replication'" #1894

Merged
merged 2 commits into from
Jun 16, 2020
85 changes: 85 additions & 0 deletions src/main/asciidoc/_chapters/ops_mgt.adoc
@@ -2629,6 +2629,91 @@ You can use the HBase Shell command `status 'replication'` to monitor the replic
* `status 'replication', 'source'` -- prints the status for each replication source, sorted by hostname.
* `status 'replication', 'sink'` -- prints the status for each replication sink, sorted by hostname.

==== Understanding the output

The command output varies with the state of replication. For example, right after a restart,
if the destination peer is not reachable, no replication source threads are running,
so no metrics are displayed:

----
hbase01.home:
SOURCE: PeerID=1
Normal Queue: 1
No Reader/Shipper threads runnning yet.
SINK: TimeStampStarted=1591985197350, Waiting for OPs...
----

Under normal circumstances, a healthy, active-active replication deployment would
show the following:

----
hbase01.home:
SOURCE: PeerID=1
Normal Queue: 1
AgeOfLastShippedOp=0, TimeStampOfLastShippedOp=Fri Jun 12 18:49:23 BST 2020, SizeOfLogQueue=1, EditsReadFromLogQueue=1, OpsShippedToTarget=1, TimeStampOfNextToReplicate=Fri Jun 12 18:49:23 BST 2020, Replication Lag=0
SINK: TimeStampStarted=1591983663458, AgeOfLastAppliedOp=0, TimeStampsOfLastAppliedOp=Fri Jun 12 18:57:18 BST 2020
Contributor

Just a question: For active-active replication, in order to get peerId of both clusters (defined in each other), we need to run status 'replication' at both clusters side right?
Getting ageOfLastShipped etc metric values from remote cluster is also not that easy even if we want to display here.

Contributor Author

> Just a question: For active-active replication, in order to get peerId of both clusters (defined in each other), we need to run status 'replication' at both clusters side right?

Yes. The command only shows the context of an individual cluster, listing overall stats about the given cluster source queues and sink threads.

> Getting ageOfLastShipped etc metric values from remote cluster is also not that easy even if we want to display here.

This "ageOfLastShipped" metric is related to the source cluster. On the source, we have the ReplicationSourceShipper thread reading entries from the WAL and making synchronous RPC calls to ReplicationSink in the target. If the call succeeds, we take that time in ReplicationSourceShipper, subtract the edit entry time from it, and record the result as "ageOfLastShipped". So "ageOfLastShipped" is how long a given edit took from when it entered the source cluster until the source cluster considered it successfully replicated.

@virajjasani do you think this metric description should be improved? Looks like the current text was not that clear.

Contributor

@virajjasani virajjasani Jun 15, 2020

Oh no, my bad. I know the ageOfLastShipped metric, I just took it as an example to state that we can't get similar metrics from the destination cluster anyways.

My question was, do we really have any metric (from replication viewpoint) available to us from destination cluster? (I don't know of any such metric so far, I hope we can't know), like cluster A knows its own ageOfLastShipped and ageOfLastApplied but does it know cluster B's ageOfLastShipped and ageOfLastApplied if both clusters are each other's pairs?

And by this command output, I was thinking what if we could display both clusters' metrics together (in case of active-active), but that might not be possible (and might not even be worth spending time)

Something like this could be really fancy but nothing necessary (hbase01.home belongs to cluster A and hbase02.home belongs to cluster B, only if we have 2 way replication setup):

    hbase01.home:
      SOURCE: PeerID=1
         Normal Queue: 1
           AgeOfLastShippedOp=0, TimeStampOfLastShippedOp=Fri Jun 12 18:49:23 BST 2020, SizeOfLogQueue=1, EditsReadFromLogQueue=1, OpsShippedToTarget=1, TimeStampOfNextToReplicate=Fri Jun 12 18:49:23 BST 2020, Replication Lag=0
      SINK: TimeStampStarted=1591983663458, AgeOfLastAppliedOp=0, TimeStampsOfLastAppliedOp=Fri Jun 12 18:57:18 BST 2020

    hbase02.home:
      SOURCE: PeerID=1
         Normal Queue: 1
           AgeOfLastShippedOp=0, TimeStampOfLastShippedOp=Fri Jun 12 18:49:23 BST 2020, SizeOfLogQueue=1, EditsReadFromLogQueue=1, OpsShippedToTarget=1, TimeStampOfNextToReplicate=Fri Jun 12 18:49:23 BST 2020, Replication Lag=0
      SINK: TimeStampStarted=1591983663458, AgeOfLastAppliedOp=0, TimeStampsOfLastAppliedOp=Fri Jun 12 18:57:18 BST 2020

Contributor

@virajjasani virajjasani Jun 15, 2020

Anyways, even if one cluster can know another one's metrics, it's not related to this PR, it is good to merge anyways :)
Moreover, description looks all good to me.

Contributor Author

> My question was, do we really have any metric (from replication viewpoint) available to us from destination cluster? (I don't know of any such metric so far, I hope we can't know), like cluster A knows its own ageOfLastShipped and ageOfLastApplied but does it know cluster B's ageOfLastShipped and ageOfLastApplied if both clusters are each other's pairs?

Ah, yeah, metrics from remote cluster are not available at source, and vice-versa.

> And by this command output, I was thinking what if we could display both clusters' metrics together (in case of active-active), but that might not be possible (and might not even be worth spending time)

Maybe too much for a shell command, but it looks like a great idea for the replication stats page on the UI. The main problem I see is that it's not so easy to identify whether a given cluster is actually a target, as it always exposes the sink RPC interface. I will dig around further and see if I can come up with something not too complex.

Contributor

@virajjasani virajjasani Jun 16, 2020

I agree, for shell, it is too much, and also the fact that determining whether the target cluster is also actively pushing WAL Edits to current cluster is not that straightforward.

Anyways, this was just a thought, maybe for some time in future, who knows we might have this status display feature in future :)

----

The definition for each of these metrics is detailed below:

[cols="1,1,1", options="header"]
|===
| Type
| Metric Name
| Description

| Source
| AgeOfLastShippedOp
| How long the last successfully shipped edit took to be replicated on the target.

| Source
| TimeStampOfLastShippedOp
| The actual date of the last successful edit shipment.

| Source
| SizeOfLogQueue
| Number of WAL files in this queue.

| Source
| EditsReadFromLogQueue
| How many edits have been read from this given queue since this source thread started.

| Source
| OpsShippedToTarget
| How many edits have been shipped to target since this source thread started.

| Source
| TimeStampOfNextToReplicate
| Date of the edit that the source is currently attempting to replicate.

| Source
| Replication Lag
| The elapsed time (in milliseconds) from when the last edit to replicate was read by this source
thread until it was replicated to the target.

| Sink
| TimeStampStarted
| Time (in milliseconds since the epoch) at which this sink thread started.

| Sink
| AgeOfLastAppliedOp
| How long it took to apply the last successfully shipped edit.

| Sink
| TimeStampsOfLastAppliedOp
| Date of the last successfully applied edit.

|===
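
As a rough illustration of how these metrics could be consumed by a script, the following Python sketch splits one metrics line of the output into a dictionary. The function name `parse_metrics` is hypothetical and not part of HBase. It assumes the `Name=Value` pairs are separated by `, ` exactly as in the sample output shown earlier; the shell output format may differ between HBase versions.

```python
def parse_metrics(line):
    """Split a 'Name=Value, Name=Value' metrics line into a dict of strings.

    Assumption: pairs are separated by ', ' and values contain no commas,
    as in the sample output of status 'replication'.
    """
    line = line.strip()
    # Sink lines carry a 'SINK:' prefix before the first metric.
    if line.startswith("SINK:"):
        line = line[len("SINK:"):]
    metrics = {}
    for pair in line.split(", "):
        name, sep, value = pair.partition("=")
        if sep:  # skip fragments without '=', such as 'Waiting for OPs...'
            metrics[name.strip()] = value.strip()
    return metrics


sample = (
    "AgeOfLastShippedOp=0, TimeStampOfLastShippedOp=Fri Jun 12 18:49:23 BST 2020, "
    "SizeOfLogQueue=1, EditsReadFromLogQueue=1, OpsShippedToTarget=1, "
    "TimeStampOfNextToReplicate=Fri Jun 12 18:49:23 BST 2020, Replication Lag=0"
)
source = parse_metrics(sample)
print(source["Replication Lag"])
```

All values are kept as strings, since some metrics are dates and others are counters; a caller would convert the numeric ones as needed.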

Growing values for `Source.AgeOfLastShippedOp` and/or
`Source.Replication Lag` indicate replication delays. If those numbers keep going
up while `Source.TimeStampOfLastShippedOp`, `Source.EditsReadFromLogQueue`,
`Source.OpsShippedToTarget` or `Source.TimeStampOfNextToReplicate` do not change at all,
then replication is failing to progress, and there might be problems with
communication between the clusters. This can also happen if replication is manually paused
(via the hbase shell `disable_peer` command, for example) while data keeps getting ingested
into the source cluster tables.
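
The rule of thumb above can be sketched as a comparison between two samples of the source metrics taken some time apart. This is a minimal Python sketch, not an HBase API: the function `looks_stalled` and the dictionary layout are assumptions made for illustration, with each sample holding metric values as strings keyed by the names from the table. It takes the stricter reading of the rule, requiring that none of the progress metrics changed.

```python
# Metrics that are expected to change between samples while edits are flowing.
PROGRESS_METRICS = (
    "TimeStampOfLastShippedOp",
    "EditsReadFromLogQueue",
    "OpsShippedToTarget",
    "TimeStampOfNextToReplicate",
)


def looks_stalled(earlier, later):
    """Return True if Replication Lag grew while no progress metric changed."""
    lag_grew = int(later["Replication Lag"]) > int(earlier["Replication Lag"])
    no_progress = all(earlier[m] == later[m] for m in PROGRESS_METRICS)
    return lag_grew and no_progress


first = {
    "TimeStampOfLastShippedOp": "Fri Jun 12 18:49:23 BST 2020",
    "EditsReadFromLogQueue": "1",
    "OpsShippedToTarget": "1",
    "TimeStampOfNextToReplicate": "Fri Jun 12 18:49:23 BST 2020",
    "Replication Lag": "0",
}
# Hypothetical later sample: the lag has grown and nothing else has moved.
second = dict(first, **{"Replication Lag": "60000"})
print(looks_stalled(first, second))
```

A positive result only flags a candidate problem: as noted, a peer paused with `disable_peer` produces the same pattern.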

== Running Multiple Workloads On a Single Cluster

HBase provides the following mechanisms for managing the performance of a cluster