
Add auto-sharding APM metrics #107593

Merged
merged 12 commits into elastic:main from auto-sharding-apm-metrics on Apr 26, 2024

Conversation


@parkertimmins commented Apr 17, 2024

It's useful to know when data stream auto-sharding increases or decreases the shard count during rollover, or when a cooldown prevents an increase or decrease. This information is already in the logs, but we should add APM metrics to improve tracking. We will add four APM metrics:

  • es.auto_sharding.increase_shards.total
  • es.auto_sharding.decrease_shards.total
  • es.auto_sharding.cooldown_prevented_increase.total
  • es.auto_sharding.cooldown_prevented_decrease.total

Additionally, each metric will be tagged with a "data_stream" field containing the data stream name, allowing the metrics to be grouped by data stream.
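
As a rough illustration of the intended shape (the descriptions, unit string, attribute map, and dataStreamName variable below are assumptions, not the final code), the counters could be registered once and then incremented with the data stream name attached:

// Illustrative sketch only; not taken from the actual change.
for (String name : List.of(
        "es.auto_sharding.increase_shards.total",
        "es.auto_sharding.decrease_shards.total",
        "es.auto_sharding.cooldown_prevented_increase.total",
        "es.auto_sharding.cooldown_prevented_decrease.total")) {
    meterRegistry.registerLongCounter(name, "auto-sharding events during rollover", "unit");
}

// On rollover, increment the matching counter, tagged with the data stream name:
meterRegistry.getLongCounter("es.auto_sharding.increase_shards.total")
    .incrementBy(1L, Map.of("data_stream", dataStreamName));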

@parkertimmins changed the title from "Add auto-sharding increase/decrease APM metrics" to "Add auto-sharding APM metrics" on Apr 23, 2024
A test is failing intermittently, and I am unsure why. Perhaps a rollover is
happening asynchronously and adding additional metrics? That doesn't seem likely ...
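// Assertion that intermittently fails: every auto-sharding metric other than the expected one should have no recorded measurements.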
MetadataRolloverService.AUTO_SHARDING_METRIC_NAMES.values()
.stream()
.filter(metric -> metric.equals(expectedEmittedMetric) == false)
.forEach(metric -> assertThat(measurements.get(metric), empty()));
Contributor Author

I had a failure on this line in the CI build, which I was unable to reproduce locally. This means some metric that should not have been emitted was emitted. I added the first call to resetTelemetry on line 126, as I thought that perhaps the tests were run out of order and somehow the same telemetry plugin was being used in another test, causing the measurement to still be in the recorder (though this doesn't seem likely). Any ideas what could have caused this?

@@ -444,6 +447,8 @@ protected void configure() {
bind(ShardsAllocator.class).toInstance(shardsAllocator);
bind(ShardRoutingRoleStrategy.class).toInstance(shardRoutingRoleStrategy);
bind(AllocationStatsService.class).toInstance(allocationStatsService);
bind(TelemetryProvider.class).toInstance(telemetryProvider);
bind(MetadataRolloverService.class).asEagerSingleton();
Contributor Author

There had been multiple instances of MetadataRolloverService being instantiated, causing repeated calls to registerLongCounter for the same metric. Making it a singleton fixes this while allowing the metric definitions to remain in the MetadataRolloverService constructor.

Contributor

I think this should be fine, but I am still unsure how it worked when we had multiple MetadataRolloverService instances. Was it intentional to have multiple instances of this service?

Contributor Author

I suspect the intention was to have one instance. The reason I think this was okay is that MetadataRolloverService is almost a pure function, so there was no harm in having multiple copies. I will be sure to get another look from someone more familiar with this code before merging.

Contributor

but I am still unsure how it worked when we had multiple MetadataRolloverService instances

The service was stateless so it wouldn't have mattered.

I'm not super sure about this change. @pgomulka is this the general pattern for adding metrics? i.e. the services need to become stateful and keep track of their local counters?
The pattern in my mind is going through the MeterRegistry to say something like:

registry.registerLongCounter( "myCounter", ... );

...

registry.getLongCounter("myCounter").incrementBy(1L, ...);

Am I missing something?

Also, another question about the general output here - gauges are meant to go up and down - are we expecting the auto-sharding metrics to decrease?

Contributor

is this the general pattern for adding metrics? i.e. the services need to become stateful and keep track of their local counters?
The pattern in my mind is going through the MeterRegistry to say something like:
registry.registerLongCounter( "myCounter", ... );
...
registry.getLongCounter("myCounter").incrementBy(1L, ...);

Yes, currently you need to have state per service that is creating metrics. We will be working to allow static metric registration - think of a logger - but it is not ready yet.

Contributor Author

My main reason for wanting to do the registration here is that it feels weird to do the registration up in NodeConstruction, then reference the same metric name far away in this code. Though I do see that adding state to this class is not ideal.

Contributor Author

Also, another question about the general output here - gauges are meant to go up and down - are we expecting the auto-sharding metrics to decrease?

My thought is that we are not expecting the metrics to decrease. I believe LongCounter is for metrics that only increase, whereas LongGauge can decrease. Do you see a situation where we'd want the metrics to decrease?

Contributor

@pgomulka

Yes, currently you need to have state per service that is creating metrics. We will be working to allow static metric registration - think of a logger - but it is not ready yet.

Sorry if I'm missing something super obvious - but why do we need to have a Counter both in the service and in the metrics registry? I understand a service must register its desired counters (e.g. doing this bit: registry.registerLongCounter( "myCounter", ... ); ).
But what I don't understand is why we need to keep hold of the counter in the service itself (like it's proposed here) as opposed to fetching it from the MeterRegistry? (e.g. registry.getLongCounter("myCounter").incrementBy(1L, ...); )

The autoShardingMetricCounters state is added to MetadataRolloverService solely so we can do autoShardingMetricCounters.get("myCounter").increment()

Contributor Author

We discussed a bit, and I had misunderstood. It seems that @andreidan is correct that the autoShardingMetricCounters map is not needed. We can just call registry.registerLongCounter( "myCounter", ... ); in the constructor without saving the counters in a map, then access the counter with registry.getLongCounter("myCounter").incrementBy(1L, ...); shortly before emitting the metric.

I had been thinking the suggestion was to move registry.registerLongCounter( "myCounter", ... ); out of the class and up near NodeConstruction, to avoid making MetadataRolloverService a singleton. But it was just to remove the redundant state in autoShardingMetricCounters.
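
A minimal sketch of that agreed-upon shape, assuming the TelemetryProvider exposes the MeterRegistry and using illustrative names (not the final code):

// Register once in the constructor (hence the eager singleton binding above):
public MetadataRolloverService(TelemetryProvider telemetryProvider /* , other deps */) {
    this.meterRegistry = telemetryProvider.getMeterRegistry();
    meterRegistry.registerLongCounter(
        "es.auto_sharding.increase_shards.total",
        "auto-sharding shard increases during rollover",  // description is illustrative
        "unit"
    );
    // ... register the remaining three counters the same way
}

// At emit time, fetch the counter from the registry instead of a local map:
meterRegistry.getLongCounter("es.auto_sharding.increase_shards.total").incrementBy(1L, attributes);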

Contributor

Sorry if I'm missing something super obvious - but why do we need to have a Counter both in the service and in the metrics registry? I understand a service must register its desired counters (e.g. doing this bit: registry.registerLongCounter( "myCounter", ... ); ).
But what I don't understand is why we need to keep hold of the counter in the service itself (like it's proposed here) as opposed to fetching it from the MeterRegistry? (e.g. registry.getLongCounter("myCounter").incrementBy(1L, ...); )
The autoShardingMetricCounters state is added to MetadataRolloverService solely so we can do autoShardingMetricCounters.get("myCounter").increment()

You can do this; I think the effect is the same (one map less). You would still have to keep the registry as a field, which is what I meant by keeping state.
I think it is up to you to decide whether you prefer to keep the counter fields local to the class or use a registry.get call. Think of logger usage, where you keep the instance local to the class.
Some parts of our codebase 'aggregate' metrics as records - see the S3RepositoriesMetrics usages.
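
For illustration, such a record-style aggregate might look like the following (hypothetical names, loosely modeled on that idea rather than taken from the codebase):

// Hypothetical aggregate of related counters, created once and passed around as a unit.
public record AutoShardingMetrics(LongCounter increaseShards, LongCounter decreaseShards) {
    public static AutoShardingMetrics create(MeterRegistry registry) {
        return new AutoShardingMetrics(
            registry.registerLongCounter("es.auto_sharding.increase_shards.total", "shard increases", "unit"),
            registry.registerLongCounter("es.auto_sharding.decrease_shards.total", "shard decreases", "unit")
        );
    }
}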

The current state PR still looks good to me.

@parkertimmins marked this pull request as ready for review April 23, 2024 17:00
@elasticsearchmachine added the needs:triage (Requires assignment of a team area label) label Apr 23, 2024
@pgomulka added the Team:Data Management (Meta label for data/management team) and :Core/Infra/Metrics (Metrics and metering infrastructure) labels Apr 24, 2024
@elasticsearchmachine added the Team:Core/Infra (Meta label for core/infra team) label and removed the Team:Data Management and needs:triage labels Apr 24, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

Contributor

@pgomulka left a comment

LGTM,
but I left one comment to clarify the multiple instances of the service


@elasticsearchmachine
Collaborator

Hi @parkertimmins, I've created a changelog YAML for you.

Contributor

@andreidan left a comment

Thanks for working on this, Parker. I've left a few questions about the approach.

Comment on lines 357 to 360
LongCounter metricCounter = autoShardingMetricCounters.get(autoShardingResult.type());
if (metricCounter != null) {
metricCounter.increment();
}
Contributor

nit: we can use computeIfPresent here
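
For reference, the computeIfPresent variant being suggested might look roughly like this (a sketch; it became moot once the map was removed, as noted below):

// Increment only if a counter is registered for this auto-sharding result type.
autoShardingMetricCounters.computeIfPresent(autoShardingResult.type(), (type, counter) -> {
    counter.increment();
    return counter;
});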

Contributor Author

This doesn't apply after removing autoShardingMetricCounters, right? 🤔
The multiple if statements with null checks aren't very pretty. I thought about using something like Optional.ofNullable/isPresent, but I haven't seen any Optional usage in the codebase, so decided to keep it as is.


Contributor

@andreidan left a comment

LGTM, thanks for iterating on this Parker

@parkertimmins self-assigned this Apr 26, 2024
@parkertimmins merged commit 3ed42f3 into elastic:main Apr 26, 2024
14 checks passed
@parkertimmins deleted the auto-sharding-apm-metrics branch April 26, 2024 14:24
Labels
:Core/Infra/Metrics, >enhancement, Team:Core/Infra, v8.15.0