Add cluster stats re. snapshot activity (#93680)
Shows how many ongoing snapshots/clones/deletions/etc. there are, and
summarises the shard-level status too for progress tracking.
DaveCTurner committed Feb 13, 2023
1 parent 265d392 commit 9c8c952
Showing 8 changed files with 960 additions and 5 deletions.
5 changes: 5 additions & 0 deletions docs/changelog/93680.yaml
@@ -0,0 +1,5 @@
pr: 93680
summary: Add cluster stats re. snapshot activity
area: Snapshot/Restore
type: enhancement
issues: []
150 changes: 148 additions & 2 deletions docs/reference/cluster/stats.asciidoc
@@ -1307,6 +1307,148 @@ Number of selected nodes using the distribution flavor and file type.
=====
====

`snapshots`::
(object)
Contains statistics about the <<snapshot-restore,snapshot>> activity in the cluster.
+
.Properties of `snapshots`
[%collapsible%open]
=====
`current_counts`:::
(object)
Contains statistics which report the numbers of various ongoing snapshot activities in the cluster.
+
.Properties of `current_counts`
[%collapsible%open]
======
`snapshots`:::
(integer)
The total number of snapshots and clones currently being created by the cluster.

`shard_snapshots`:::
(integer)
The total number of outstanding shard snapshots in the cluster.

`snapshot_deletions`:::
(integer)
The total number of snapshot deletion operations that the cluster is currently
running.

`concurrent_operations`:::
(integer)
The total number of snapshot operations that the cluster is currently running
concurrently. This is the total of the `snapshots` and `snapshot_deletions`
entries, and is limited by <<snapshot-max-concurrent-ops,the
`snapshot.max_concurrent_operations` setting>>.

`cleanups`:::
(integer)
The total number of repository cleanup operations that the cluster is currently
running. These operations do not count towards the total number of concurrent
operations.
======
`repositories`:::
(object)
Contains statistics which report the progress of snapshot activities broken down
by repository. This object contains one entry for each repository registered
with the cluster.
+
.Properties of `repositories`
[%collapsible%open]
======

`current_counts`:::
(object)
Contains statistics which report the numbers of various ongoing snapshot
activities for this repository.
+
.Properties of `current_counts`
[%collapsible%open]
=======
`snapshots`:::
(integer)
The total number of ongoing snapshots in this repository.
`clones`:::
(integer)
The total number of ongoing snapshot clones in this repository.
`finalizations`:::
(integer)
The total number of this repository's ongoing snapshots and clone operations
which are mostly complete except for their last "finalization" step.
`deletions`:::
(integer)
The total number of ongoing snapshot deletion operations in this repository.
`snapshot_deletions`:::
(integer)
The total number of snapshots that are currently being deleted from this
repository.
`active_deletions`:::
(integer)
The total number of ongoing snapshot deletion operations which are currently
active in this repository. Snapshot deletions do not run concurrently with other
snapshot operations, so this may be `0` if any pending deletes are waiting for
other operations to finish.
`shards`:::
(object)
Contains statistics which report the shard-level progress of ongoing snapshot
activities for a repository. Note that these statistics relate only to ongoing
snapshots.
+
.Properties of `shards`
[%collapsible%open]
========

`total`:::
(integer)
The total number of shard snapshots currently tracked by this repository. This
statistic only counts shards in ongoing snapshots, so it will drop when a
snapshot completes and will be `0` if there are no ongoing snapshots.

`complete`:::
(integer)
The total number of tracked shard snapshots which have completed in this
repository. This statistic only counts shards in ongoing snapshots, so it will
drop when a snapshot completes and will be `0` if there are no ongoing
snapshots.

`incomplete`:::
(integer)
The total number of tracked shard snapshots which have not completed in this
repository. This is the difference between the `total` and `complete` values.

`states`:::
(object)
The total number of shard snapshots in each of the named states in this
repository. These states are an implementation detail of the snapshotting
process which may change between versions. They are included here for expert
users, but should otherwise be ignored.

========
=======

`oldest_start_time`:::
(string)
The start time of the oldest running snapshot in this repository.

`oldest_start_time_in_millis`:::
(integer)
The start time of the oldest running snapshot in this repository, represented as
milliseconds since the Unix epoch.

======
=====

[[cluster-stats-api-example]]
==== {api-examples-title}

@@ -1587,6 +1729,9 @@ The API returns the following response:
...
}
]
},
"snapshots": {
...
}
}
--------------------------------------------------
@@ -1596,6 +1741,7 @@ The API returns the following response:
// TESTRESPONSE[s/"processor_stats": \{[^\}]*\}/"processor_stats": $body.$_path/]
// TESTRESPONSE[s/"count": \{[^\}]*\}/"count": $body.$_path/]
// TESTRESPONSE[s/"packaging_types": \[[^\]]*\]/"packaging_types": $body.$_path/]
// TESTRESPONSE[s/"snapshots": \{[^\}]*\}/"snapshots": $body.$_path/]
// TESTRESPONSE[s/"field_types": \[[^\]]*\]/"field_types": $body.$_path/]
// TESTRESPONSE[s/"runtime_field_types": \[[^\]]*\]/"runtime_field_types": $body.$_path/]
// TESTRESPONSE[s/"search": \{[^\}]*\}/"search": $body.$_path/]
@@ -1606,8 +1752,8 @@ The API returns the following response:
// 1. Ignore the contents of the `plugins` object because we don't know all of
// the plugins that will be in it. And because we figure folks don't need to
// see an exhaustive list anyway.
// 2. Similarly, ignore the contents of `network_types`, `discovery_types`, and
// `packaging_types`.
// 2. Similarly, ignore the contents of `network_types`, `discovery_types`,
// `packaging_types` and `snapshots`.
// 3. Ignore the contents of the (nodes) count object, as what's shown here
// depends on the license. Voting-only nodes are e.g. only shown when this
// test runs with a basic license.
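
The `current_counts` and `shards` documentation added above states two arithmetic relationships: `concurrent_operations` is the sum of `snapshots` and `snapshot_deletions` (cleanups are excluded), and `incomplete` is the difference between `total` and `complete`. A minimal Java sketch of those relationships follows. The class and method names are hypothetical and are not part of this commit or of Elasticsearch.

```java
/**
 * Illustrative helper only: restates the arithmetic described in the
 * cluster stats documentation. All names here are hypothetical.
 */
final class SnapshotStatsMath {
    private SnapshotStatsMath() {}

    /** concurrent_operations = snapshots + snapshot_deletions; cleanups do not count. */
    static int concurrentOperations(int snapshots, int snapshotDeletions) {
        return snapshots + snapshotDeletions;
    }

    /** incomplete = total - complete, for the shards of ongoing snapshots. */
    static int incompleteShards(int total, int complete) {
        return total - complete;
    }

    /** True when the concurrent operation count has reached a configured limit. */
    static boolean atLimit(int snapshots, int snapshotDeletions, int maxConcurrentOperations) {
        return concurrentOperations(snapshots, snapshotDeletions) >= maxConcurrentOperations;
    }
}
```

For example, three ongoing snapshots and two deletions give five concurrent operations, regardless of how many repository cleanups are running.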
@@ -184,3 +184,47 @@
- gt: { indices.mappings.total_field_count: 0 }
- gt: { indices.mappings.total_deduplicated_field_count: 0 }
- gt: { indices.mappings.total_deduplicated_mapping_size_in_bytes: 0 }


---
"snapshot stats reported in get cluster stats":
- skip:
version: " - 8.7.99"
reason: "snapshot stats reported from 8.8 onwards"

- do:
snapshot.create_repository:
repository: test_repo_for_stats
body:
type: fs
settings:
location: "test_repo_for_stats_loc"

- do:
cluster.stats:
human: true

- gte: { snapshots.current_counts.snapshots: 0 }
- gte: { snapshots.current_counts.shard_snapshots: 0 }
- gte: { snapshots.current_counts.snapshot_deletions: 0 }
- gte: { snapshots.current_counts.concurrent_operations: 0 }
- gte: { snapshots.current_counts.cleanups: 0 }
- is_true: snapshots.repositories.test_repo_for_stats.type
- gte: { snapshots.repositories.test_repo_for_stats.current_counts.snapshots: 0 }
- gte: { snapshots.repositories.test_repo_for_stats.current_counts.clones: 0 }
- gte: { snapshots.repositories.test_repo_for_stats.current_counts.finalizations: 0 }
- gte: { snapshots.repositories.test_repo_for_stats.current_counts.deletions: 0 }
- gte: { snapshots.repositories.test_repo_for_stats.current_counts.snapshot_deletions: 0 }
- gte: { snapshots.repositories.test_repo_for_stats.current_counts.active_deletions: 0 }
- gte: { snapshots.repositories.test_repo_for_stats.current_counts.shards.total: 0 }
- gte: { snapshots.repositories.test_repo_for_stats.current_counts.shards.complete: 0 }
- gte: { snapshots.repositories.test_repo_for_stats.current_counts.shards.incomplete: 0 }
- gte: { snapshots.repositories.test_repo_for_stats.current_counts.shards.states.INIT: 0 }
- gte: { snapshots.repositories.test_repo_for_stats.current_counts.shards.states.SUCCESS: 0 }
- gte: { snapshots.repositories.test_repo_for_stats.current_counts.shards.states.FAILED: 0 }
- gte: { snapshots.repositories.test_repo_for_stats.current_counts.shards.states.ABORTED: 0 }
- gte: { snapshots.repositories.test_repo_for_stats.current_counts.shards.states.MISSING: 0 }
- gte: { snapshots.repositories.test_repo_for_stats.current_counts.shards.states.WAITING: 0 }
- gte: { snapshots.repositories.test_repo_for_stats.current_counts.shards.states.QUEUED: 0 }
- gte: { snapshots.repositories.test_repo_for_stats.oldest_start_time_millis: 0 }
- is_true: snapshots.repositories.test_repo_for_stats.oldest_start_time
@@ -12,6 +12,7 @@
import org.elasticsearch.action.FailedNodeException;
import org.elasticsearch.action.support.nodes.BaseNodesResponse;
import org.elasticsearch.cluster.ClusterName;
import org.elasticsearch.cluster.ClusterSnapshotStats;
import org.elasticsearch.cluster.health.ClusterHealthStatus;
import org.elasticsearch.common.Strings;
import org.elasticsearch.common.io.stream.StreamInput;
@@ -28,6 +29,7 @@ public class ClusterStatsResponse extends BaseNodesResponse<ClusterStatsNodeResp
final ClusterStatsNodes nodesStats;
final ClusterStatsIndices indicesStats;
final ClusterHealthStatus status;
final ClusterSnapshotStats clusterSnapshotStats;
final long timestamp;
final String clusterUUID;

@@ -46,6 +48,12 @@ public ClusterStatsResponse(StreamInput in) throws IOException {
}
this.clusterUUID = clusterUUID;

if (in.getTransportVersion().onOrAfter(TransportVersion.V_8_8_0)) {
clusterSnapshotStats = ClusterSnapshotStats.readFrom(in);
} else {
clusterSnapshotStats = ClusterSnapshotStats.EMPTY;
}

// built from nodes rather than from the stream directly
nodesStats = new ClusterStatsNodes(getNodes());
indicesStats = new ClusterStatsIndices(getNodes(), mappingStats, analysisStats, versionStats);
@@ -59,7 +67,8 @@ public ClusterStatsResponse(
List<FailedNodeException> failures,
MappingStats mappingStats,
AnalysisStats analysisStats,
VersionStats versionStats
VersionStats versionStats,
ClusterSnapshotStats clusterSnapshotStats
) {
super(clusterName, nodes, failures);
this.clusterUUID = clusterUUID;
@@ -75,6 +84,7 @@ public ClusterStatsResponse(
}
}
this.status = status;
this.clusterSnapshotStats = clusterSnapshotStats;
}

public String getClusterUUID() {
@@ -108,6 +118,9 @@ public void writeTo(StreamOutput out) throws IOException {
if (out.getTransportVersion().onOrAfter(TransportVersion.V_7_11_0)) {
out.writeOptionalWriteable(indicesStats.getVersions());
}
if (out.getTransportVersion().onOrAfter(TransportVersion.V_8_8_0)) {
clusterSnapshotStats.writeTo(out);
}
}

@Override
@@ -134,6 +147,10 @@ public XContentBuilder toXContent(XContentBuilder builder, Params params) throws
builder.startObject("nodes");
nodesStats.toXContent(builder, params);
builder.endObject();

builder.field("snapshots");
clusterSnapshotStats.toXContent(builder, params);

return builder;
}
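
The hunks above gate the new field on the transport version in both `writeTo` and the `StreamInput` constructor, so that a node on an older version neither receives nor is expected to send the extra bytes. The sketch below illustrates that general pattern with plain `java.io` streams. It is not the Elasticsearch implementation: the version constant, the class name and the use of a single integer in place of `ClusterSnapshotStats` are all assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Illustrative sketch of a version-gated wire field; all names are hypothetical. */
final class VersionGatedField {
    /** Hypothetical first wire version that carries the snapshot stats. */
    static final int VERSION_WITH_SNAPSHOT_STATS = 2;
    /** Stand-in for an "empty stats" default used when the field is absent. */
    static final int EMPTY_STATS = 0;

    private VersionGatedField() {}

    /** Writes the existing field, then the new field only if the peer understands it. */
    static byte[] write(int wireVersion, String clusterUuid, int snapshotCount) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(clusterUuid);
            if (wireVersion >= VERSION_WITH_SNAPSHOT_STATS) {
                out.writeInt(snapshotCount);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the new field if the sender's version includes it, else uses the default. */
    static int readSnapshotCount(int wireVersion, byte[] payload) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(payload))) {
            in.readUTF(); // skip the pre-existing field
            return wireVersion >= VERSION_WITH_SNAPSHOT_STATS ? in.readInt() : EMPTY_STATS;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The reader and the writer must apply the same version check. If only one side were gated, an older peer would either read past the end of the payload or leave unread bytes in the stream.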

@@ -19,6 +19,7 @@
import org.elasticsearch.action.admin.indices.stats.ShardStats;
import org.elasticsearch.action.support.ActionFilters;
import org.elasticsearch.action.support.nodes.TransportNodesAction;
import org.elasticsearch.cluster.ClusterSnapshotStats;
import org.elasticsearch.cluster.ClusterState;
import org.elasticsearch.cluster.health.ClusterHealthStatus;
import org.elasticsearch.cluster.health.ClusterStateHealth;
@@ -122,6 +123,10 @@ protected void newResponseAsync(
final CancellableTask cancellableTask = (CancellableTask) task;
final ClusterState state = clusterService.state();
final Metadata metadata = state.metadata();
final ClusterSnapshotStats clusterSnapshotStats = ClusterSnapshotStats.of(
state,
clusterService.threadPool().absoluteTimeInMillis()
);

final StepListener<MappingStats> mappingStatsStep = new StepListener<>();
final StepListener<AnalysisStats> analysisStatsStep = new StepListener<>();
@@ -139,7 +144,8 @@ protected void newResponseAsync(
failures,
mappingStats,
analysisStats,
VersionStats.of(metadata, responses)
VersionStats.of(metadata, responses),
clusterSnapshotStats
)
),
listener::onFailure