[Data Tiers] Add telemetry enhancements for data tiers utilization #71204
Labels
:Data Management/Indices APIs
>enhancement
Team:Data Management
Telemetry for data tiers was added in this PR.
Currently collected data:
node_count :: number of nodes with this tier/role
index_count :: number of indices on this tier
total_shard_count :: total number of shards for all nodes in this tier
primary_shard_count :: number of primary shards for all nodes in this tier
doc_count :: number of documents for all nodes in this tier
total_size_bytes :: total number of bytes for all shards for all nodes in this tier
primary_size_bytes :: number of bytes for all primary shards on all nodes in this tier
primary_shard_size_avg_bytes :: average size of the primary shards in this tier
primary_shard_size_median_bytes :: median size of the primary shards in this tier
primary_shard_size_mad_bytes :: median absolute deviation of the primary shard sizes in this tier
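To make the three primary shard size statistics concrete, here is a minimal sketch of how they could be derived from a list of primary shard sizes. The function name and the empty-list behavior are assumptions for illustration, not the actual Elasticsearch implementation; only the field names come from the list above.

```python
from statistics import mean, median


def primary_shard_size_stats(sizes_bytes):
    """Summarize primary shard sizes (in bytes) for one tier.

    Hypothetical helper: returns zeros for an empty tier, which is an
    assumption made for this sketch.
    """
    if not sizes_bytes:
        return {
            "primary_shard_size_avg_bytes": 0,
            "primary_shard_size_median_bytes": 0,
            "primary_shard_size_mad_bytes": 0,
        }
    med = median(sizes_bytes)
    # Median absolute deviation: the median of each size's distance
    # from the median size.
    mad = median([abs(size - med) for size in sizes_bytes])
    return {
        "primary_shard_size_avg_bytes": mean(sizes_bytes),
        "primary_shard_size_median_bytes": med,
        "primary_shard_size_mad_bytes": mad,
    }
```

The median and the median absolute deviation are far less sensitive to a single oversized shard than the average, which is presumably why they are reported alongside it.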
Challenges with the current data:
The existing telemetry does not let us distinguish actual utilization per tier. Because the stats are aggregated by node role, a node tagged with multiple data roles causes fields such as index_count to be reported under every tier that node belongs to. To report accurately on the actual utilization of each tier, we need telemetry that associates these fields with the tier the data is currently assigned to.
For example, I would expect a query like the following against our telemetry data to return only clusters with data that is “actively associated” with the warm tier:
stack_stats.xpack.data_tiers.data_warm.index_count > 1
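The double counting described above can be illustrated with a small sketch. The cluster layout, the one-node-per-index simplification, and the rule of attributing an index to the first preferred tier that its hosting node provides are all assumptions made for this example, not the behavior of any existing code.

```python
from collections import defaultdict

# Hypothetical cluster: node-1 carries both the hot and the warm role.
NODES = {
    "node-1": {"data_hot", "data_warm"},
    "node-2": {"data_cold"},
}

# Hypothetical indices, each simplified to a single hosting node.
INDICES = {
    "logs-1": {"node": "node-1", "tier_preference": ["data_hot"]},
    "logs-2": {"node": "node-1", "tier_preference": ["data_warm", "data_hot"]},
    "logs-3": {"node": "node-2", "tier_preference": ["data_cold", "data_warm"]},
}


def index_count_by_node_role(nodes, indices):
    """Role-based counting: an index counts once per role of its node."""
    counts = defaultdict(int)
    for index in indices.values():
        for role in nodes[index["node"]]:
            counts[role] += 1
    return dict(counts)


def index_count_by_assigned_tier(nodes, indices):
    """Tier-based counting: an index counts once, under the first
    preferred tier that its hosting node actually provides."""
    counts = defaultdict(int)
    for index in indices.values():
        roles = nodes[index["node"]]
        tier = next((t for t in index["tier_preference"] if t in roles), None)
        if tier is not None:
            counts[tier] += 1
    return dict(counts)
```

With this sample data, role-based counting reports two indices on both the hot and the warm tier even though node-1 holds only two indices in total, while tier-based counting reports one index per tier.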
A concrete example of how this data will be used is to report on and visualize the number of unique clusters that have data residing on a given tier. The ability to drill down into more detailed stats, such as the doc_count or index_count for the data residing on each tier, would also be useful.
It would also be useful to be able to distinguish whether the tier an index is located on matches its first preference (index.routing.allocation.include._tier_preference). For example, an index might specify cold as its first preference, but if no cold nodes are available it could reside on its second-preference tier (say warm). We could use this distinction to suggest actions to the user, such as scaling or enabling autoscaling.
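One way the first-preference check might look is sketched below. It assumes the preference is available as a comma-separated string (for example "data_cold,data_warm") and that the data roles of the hosting node are known; the function name and return shape are hypothetical.

```python
def resolve_tier(tier_preference, node_roles):
    """Return (tier, is_first_preference) for an index.

    tier_preference: comma-separated tier names, most preferred first.
    node_roles: set of data roles on the node holding the index.
    Returns (None, False) when no preferred tier is available.
    """
    preferences = [t.strip() for t in tier_preference.split(",") if t.strip()]
    for rank, tier in enumerate(preferences):
        if tier in node_roles:
            # rank 0 means the index landed on its first-choice tier.
            return tier, rank == 0
    return None, False
```

An index for which the second value is False is a candidate for the kind of scaling or autoscaling suggestion mentioned above.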
cc @dakrone @sajjadwahmed