
Add auto-sharding APM metrics #107593

Merged
merged 12 commits into elastic:main from auto-sharding-apm-metrics on Apr 26, 2024

Conversation


@parkertimmins commented Apr 17, 2024

It's useful to know when data stream auto-sharding increases or decreases the shard count during rollover, or when a cooldown prevents an increase or decrease. This information is already in the logs, but we should add APM metrics to improve tracking. We will add four APM metrics:

  • es.auto_sharding.increase_shards.total
  • es.auto_sharding.decrease_shards.total
  • es.auto_sharding.cooldown_prevented_increase.total
  • es.auto_sharding.cooldown_prevented_decrease.total

Additionally, each metric will be tagged with a "data_stream" field containing the data stream name, allowing the metrics to be grouped by data stream.
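
As a rough illustration of the intended shape (the descriptions, unit string, attribute map, and dataStreamName variable below are assumptions, not the final code), the counters could be registered once and then incremented with the data stream name attached:

// Illustrative sketch only; not taken from the actual change.
for (String name : List.of(
        "es.auto_sharding.increase_shards.total",
        "es.auto_sharding.decrease_shards.total",
        "es.auto_sharding.cooldown_prevented_increase.total",
        "es.auto_sharding.cooldown_prevented_decrease.total")) {
    meterRegistry.registerLongCounter(name, "auto-sharding events during rollover", "unit");
}

// On rollover, increment the matching counter, tagged with the data stream name:
meterRegistry.getLongCounter("es.auto_sharding.increase_shards.total")
    .incrementBy(1L, Map.of("data_stream", dataStreamName));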

@parkertimmins changed the title from "Add auto-sharding increase/decrease APM metrics" to "Add auto-sharding APM metrics" on Apr 23, 2024
A test is failing intermittently, and I am unsure why. Perhaps a rollover is
happening asynchronously and adding additional metrics? That doesn't seem likely ...
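// Assertion that intermittently fails: every auto-sharding metric other than the expected one should have no recorded measurements.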
MetadataRolloverService.AUTO_SHARDING_METRIC_NAMES.values()
.stream()
.filter(metric -> metric.equals(expectedEmittedMetric) == false)
.forEach(metric -> assertThat(measurements.get(metric), empty()));
Contributor Author

I had a failure on this line in the CI build, which I was unable to reproduce locally. This means some metric that should not have been emitted was emitted. I added the first call to resetTelemetry on line 126, as I thought that perhaps the tests were run out of order and somehow the same telemetry plugin was being used in another test, causing the measurement to still be in the recorder (though this doesn't seem likely). Any ideas what could have caused this?

@@ -444,6 +447,8 @@ protected void configure() {
bind(ShardsAllocator.class).toInstance(shardsAllocator);
bind(ShardRoutingRoleStrategy.class).toInstance(shardRoutingRoleStrategy);
bind(AllocationStatsService.class).toInstance(allocationStatsService);
bind(TelemetryProvider.class).toInstance(telemetryProvider);
bind(MetadataRolloverService.class).asEagerSingleton();
Contributor Author

There had been multiple instances of MetadataRolloverService being instantiated, causing repeated calls to registerLongCounter for the same metric. Making it a singleton fixes this while allowing the metric definitions to remain in the MetadataRolloverService constructor.

Contributor

I think this should be fine, but I am still unsure how it worked when we had multiple MetadataRolloverService instances. Was it intentional to have multiple instances of this service?

Contributor Author

I suspect the intention was to have one instance. The reason I think this was okay is that MetadataRolloverService is almost a pure function, so there was no harm in having multiple copies. I will be sure to get another look from someone more familiar with this code before merging.

Contributor

but I am still unsure how it worked when we had multiple MetadataRolloverService instances

The service was stateless so it wouldn't have mattered.

I'm not super sure about this change. @pgomulka is this the general pattern for adding metrics? i.e. the services need to become stateful and keep track of their local counters?
The pattern in my mind is going through the MeterRegistry to say something like:

registry.registerLongCounter( "myCounter", ... );

...

registry.getLongCounter("myCounter").incrementBy(1L, ...);

Am I missing something?

Also, another question about the general output here - gauges are meant to go up and down - are we expecting the auto-sharding metrics to decrease?

Contributor

is this the general pattern for adding metrics? i.e. the services need to become stateful and keep track of their local counters?
The pattern in my mind is going through the MeterRegistry to say something like:
registry.registerLongCounter( "myCounter", ... );
...
registry.getLongCounter("myCounter").incrementBy(1L, ...);

Yes, currently you need to have state per service that is creating metrics. We will be working to allow static metric registration - think of a logger - but it is not ready yet.

Contributor Author

My main reason for wanting to do the registration here is that it feels weird to do the registration up in NodeConstruction, then reference the same metric name far away in this code. Though I do see that adding state to this class is not ideal.

Contributor Author

Also, another question about the general output here - gauges are meant to go up and down - are we expecting the auto-sharding metrics to decrease?

My thought is that we are not expecting the metrics to decrease. I believe LongCounter is for metrics that only increase, whereas LongGauge can decrease. Do you see a situation where we'd want the metrics to decrease?

Contributor

@pgomulka

Yes, currently you need to have state per service that is creating metrics. We will be working to allow static metric registration - think of a logger - but it is not ready yet.

Sorry if I'm missing something super obvious - but why do we need to have a Counter both in the service and in the metrics registry? I understand a service must register its desired counters (e.g. doing this bit: registry.registerLongCounter( "myCounter", ... ); ).
But what I don't understand is why we need to keep hold of the counter in the service itself (like it's proposed here) as opposed to fetching it from the MeterRegistry? (e.g. registry.getLongCounter("myCounter").incrementBy(1L, ...); )

The autoShardingMetricCounters state is added to MetadataRolloverService solely so we can do autoShardingMetricCounters.get("myCounter").increment()

Contributor Author

We discussed a bit, and I had misunderstood. It seems that @andreidan is correct that the autoShardingMetricCounters map is not needed. We can just call registry.registerLongCounter( "myCounter", ... ); in the constructor without saving the counters in a map, then access the counter with registry.getLongCounter("myCounter").incrementBy(1L, ...); shortly before emitting the metric.

I had been thinking the suggestion was to move registry.registerLongCounter( "myCounter", ... ); out of the class and up near NodeConstruction, to avoid making MetadataRolloverService a singleton. But it was just to remove the redundant state in autoShardingMetricCounters.
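
A minimal sketch of that agreed-upon shape, assuming the TelemetryProvider exposes the MeterRegistry and using illustrative names (not the final code):

// Register once in the constructor (hence the eager singleton binding above):
public MetadataRolloverService(TelemetryProvider telemetryProvider /* , other deps */) {
    this.meterRegistry = telemetryProvider.getMeterRegistry();
    meterRegistry.registerLongCounter(
        "es.auto_sharding.increase_shards.total",
        "auto-sharding shard increases during rollover",  // description is illustrative
        "unit"
    );
    // ... register the remaining three counters the same way
}

// At emit time, fetch the counter from the registry instead of a local map:
meterRegistry.getLongCounter("es.auto_sharding.increase_shards.total").incrementBy(1L, attributes);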

Contributor

Sorry if I'm missing something super obvious - but why do we need to have a Counter both in the service and in the metrics registry? I understand a service must register its desired counters (e.g. doing this bit: registry.registerLongCounter( "myCounter", ... ); ).
But what I don't understand is why we need to keep hold of the counter in the service itself (like it's proposed here) as opposed to fetching it from the MeterRegistry? (e.g. registry.getLongCounter("myCounter").incrementBy(1L, ...); )
The autoShardingMetricCounters state is added to MetadataRolloverService solely so we can do autoShardingMetricCounters.get("myCounter").increment()

You can do this; I think the effect is the same (one map less). You would still have to keep the registry as a field, which is what I meant by keeping state.
I think it is up to you to decide whether you prefer to keep the counter fields local to the class or use a registry.get call. Think of logger usage, where you keep the instance local to the class.
Some parts of our codebase 'aggregate' metrics as records - see the S3RepositoriesMetrics usages.
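
For illustration, such a record-style aggregate might look like the following (hypothetical names, loosely modeled on that idea rather than taken from the codebase):

// Hypothetical aggregate of related counters, created once and passed around as a unit.
public record AutoShardingMetrics(LongCounter increaseShards, LongCounter decreaseShards) {
    public static AutoShardingMetrics create(MeterRegistry registry) {
        return new AutoShardingMetrics(
            registry.registerLongCounter("es.auto_sharding.increase_shards.total", "shard increases", "unit"),
            registry.registerLongCounter("es.auto_sharding.decrease_shards.total", "shard decreases", "unit")
        );
    }
}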

The current state PR still looks good to me.

@parkertimmins marked this pull request as ready for review April 23, 2024 17:00
@elasticsearchmachine added the needs:triage (Requires assignment of a team area label) label Apr 23, 2024
@pgomulka added the Team:Data Management (Meta label for data/management team) and :Core/Infra/Metrics (Metrics and metering infrastructure) labels Apr 24, 2024
@elasticsearchmachine added the Team:Core/Infra (Meta label for core/infra team) label and removed the Team:Data Management and needs:triage labels Apr 24, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

Contributor

@pgomulka left a comment

LGTM,
but I left one comment to clarify the multiple instances of the service


@elasticsearchmachine
Collaborator

Hi @parkertimmins, I've created a changelog YAML for you.

Contributor

@andreidan left a comment

Thanks for working on this, Parker. I've left a few questions about the approach.

Comment on lines 357 to 360
LongCounter metricCounter = autoShardingMetricCounters.get(autoShardingResult.type());
if (metricCounter != null) {
metricCounter.increment();
}
Contributor

nit: we can use computeIfPresent here
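
For reference, the computeIfPresent variant being suggested might look roughly like this (a sketch; it became moot once the map was removed, as noted below):

// Increment only if a counter is registered for this auto-sharding result type.
autoShardingMetricCounters.computeIfPresent(autoShardingResult.type(), (type, counter) -> {
    counter.increment();
    return counter;
});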

Contributor Author

This doesn't apply after removing autoShardingMetricCounters, right? 🤔
The multiple if statements with null checks aren't very pretty. I thought about using something like Optional.ofNullable/isPresent, but I haven't seen any Optional usage in the codebase, so decided to keep it as is.


Contributor

@andreidan left a comment

LGTM, thanks for iterating on this Parker

@parkertimmins self-assigned this Apr 26, 2024
@parkertimmins merged commit 3ed42f3 into elastic:main Apr 26, 2024
14 checks passed
@parkertimmins deleted the auto-sharding-apm-metrics branch April 26, 2024 14:24
Labels
:Core/Infra/Metrics, >enhancement, Team:Core/Infra, v8.15.0