Add auto-sharding APM metrics #107593
Conversation
A test is failing intermittently, and I am unsure why. Perhaps a rollover is happening asynchronously and adding additional metrics? That doesn't seem likely ...
MetadataRolloverService.AUTO_SHARDING_METRIC_NAMES.values()
    .stream()
    .filter(metric -> metric.equals(expectedEmittedMetric) == false)
    .forEach(metric -> assertThat(measurements.get(metric), empty()));
I had a failure on this line in the CI build, which I was unable to reproduce locally. This means some metric that should not have been emitted, was emitted. I added the first call to resetTelemetry on line 126, as I thought that perhaps the tests were run out of order and somehow the same telemetry plugin was being used in another test, causing the measurement to still be in the recorder (though this doesn't seem likely). Any ideas what could have caused this?
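For context, a hypothetical sketch of what a resetTelemetry helper could look like in an Elasticsearch integration test, clearing the recorded measurements on every node's TestTelemetryPlugin between cases (the PR's actual helper may differ):

private void resetTelemetry() {
    // Clear the recording meter on each node so measurements from a previous
    // case cannot leak into the next assertion.
    for (PluginsService pluginsService : internalCluster().getInstances(PluginsService.class)) {
        pluginsService.filterPlugins(TestTelemetryPlugin.class).forEach(TestTelemetryPlugin::resetMeter);
    }
}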
@@ -444,6 +447,8 @@ protected void configure() {
     bind(ShardsAllocator.class).toInstance(shardsAllocator);
     bind(ShardRoutingRoleStrategy.class).toInstance(shardRoutingRoleStrategy);
     bind(AllocationStatsService.class).toInstance(allocationStatsService);
+    bind(TelemetryProvider.class).toInstance(telemetryProvider);
+    bind(MetadataRolloverService.class).asEagerSingleton();
There had been multiple instances of MetadataRolloverService being instantiated, causing repeated attempts to call registerLongCounter for the same metric. Making it a singleton fixes this while allowing the metric definitions to remain in the MetadataRolloverService constructor.
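A rough sketch (not the PR's exact code) of why the singleton matters: the counters are registered in the service constructor, and re-registering the same metric name is rejected by the meter registry, so a second MetadataRolloverService instance would trip over the first. Binding the service as an eager singleton makes this constructor run exactly once.

public MetadataRolloverService(TelemetryProvider telemetryProvider) {
    MeterRegistry registry = telemetryProvider.getMeterRegistry();
    for (String metricName : AUTO_SHARDING_METRIC_NAMES.values()) {
        // description and unit strings here are illustrative
        registry.registerLongCounter(metricName, "auto-sharding rollover events", "unit");
    }
}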
I think this should be fine, but I am still unsure how it worked before such that we had multiple MetadataRolloverService instances. Was it intentional to have multiple instances of this service?
I suspect the intention was to have one instance. The reason I think this was okay is that MetadataRolloverService is almost a pure function, so there was no harm in having multiple copies. I will be sure to get another look from someone more familiar with this code before merging.
> but I am still unsure how it worked so that we had multiple MetadataRolloverService instances

The service was stateless so it wouldn't have mattered.

I'm not super sure about this change. @pgomulka is this the general pattern for adding metrics? i.e. the services need to become stateful and keep track of their local counters?

The pattern in my mind is going through the MeterRegistry to say something like:

registry.registerLongCounter("myCounter", ...);
...
registry.getLongCounter("myCounter").incrementBy(1L, ...);

Am I missing something?

Also, another question for the general output here - gauges are meant to go up and down - are we expecting the auto-sharding metrics to decrease?
> is this the general pattern for adding metrics? i.e. the services need to become stateful and keep track of their local counters?
>
> The pattern in my mind is going through the MeterRegistry to say something like:
>
> registry.registerLongCounter("myCounter", ...);
> ...
> registry.getLongCounter("myCounter").incrementBy(1L, ...);

Yes, currently you need to have state per service that creates metrics. We will be working to allow static metric registration - think of a logger - but it is not ready yet.
My main reason for wanting to do the registration here is that it feels weird to do the registration up in NodeConstruction, then reference the same metric name far away in this code. Though I do see that adding state to this class is not ideal.
> Also, another question for the general output here - gauges are meant to go up and down - are we expecting the auto-sharding metrics to decrease?

My thought is that we are not expecting the metrics to decrease. I believe LongCounter is for metrics that only increase, whereas LongGauge can decrease. Do you see a situation where we'd want the metrics to decrease?
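To illustrate the distinction, a minimal sketch assuming the org.elasticsearch.telemetry.metric API (the gauge's observer-based signature and all names and descriptions here are my assumptions, not code from this PR):

import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;
import org.elasticsearch.telemetry.metric.LongCounter;
import org.elasticsearch.telemetry.metric.LongWithAttributes;
import org.elasticsearch.telemetry.metric.MeterRegistry;

class InstrumentTypesSketch {
    private final AtomicLong openIndices = new AtomicLong(); // hypothetical value that can go up and down

    void register(MeterRegistry registry) {
        // Counter: monotonically increasing, which fits the auto-sharding event counts.
        LongCounter increases = registry.registerLongCounter(
            "es.auto_sharding.increase_shards.total", "shard increase events", "unit");
        increases.increment();

        // Gauge: sampled through a supplier, free to decrease between observations.
        registry.registerLongGauge(
            "es.example.open_indices.current", "hypothetical gauge", "unit",
            () -> new LongWithAttributes(openIndices.get(), Map.of()));
    }
}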
> Yes, currently you need to have state per service that creates metrics. We will be working to allow static metric registration - think of a logger - but it is not ready yet.

Sorry if I'm missing something super obvious - but why do we need to have a Counter both in the service and in the metrics registry? I understand a service must register its desired counters (e.g. doing this bit: registry.registerLongCounter("myCounter", ...);).

But what I don't understand is why we need to keep hold of the counter in the service itself (like it's proposed here), as opposed to fetching it from the MeterRegistry (e.g. registry.getLongCounter("myCounter").incrementBy(1L, ...);).

The autoShardingMetricCounters state is added to MetadataRolloverService solely so we can do autoShardingMetricCounters.get("myCounter").increment().
We discussed a bit, and I had misunderstood. It seems that @andreidan is correct that the autoShardingMetricCounters map is not needed. We can just do the registry.registerLongCounter("myCounter", ...); in the constructor, not save the counters in a map, and then access the counter with registry.getLongCounter("myCounter").incrementBy(1L, ...); shortly before emitting the metric.

I had been thinking the suggestion was to move registry.registerLongCounter("myCounter", ...); out of the class and up near NodeConstruction, to avoid making MetadataRolloverService a singleton. But it was just to remove the redundant state in autoShardingMetricCounters.
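A minimal sketch of the pattern the thread converges on, assuming a MeterRegistry field on the service (everything except the registry methods and the es.auto_sharding.* metric name is illustrative):

import java.util.Map;
import org.elasticsearch.telemetry.metric.MeterRegistry;

class RolloverMetricsSketch {
    private final MeterRegistry meterRegistry;

    RolloverMetricsSketch(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
        // Register once in the constructor; no local map of counters is kept.
        meterRegistry.registerLongCounter(
            "es.auto_sharding.increase_shards.total", "shard increase events", "unit");
    }

    void onShardIncrease(String dataStreamName) {
        // Fetch the counter from the registry right before emitting.
        meterRegistry.getLongCounter("es.auto_sharding.increase_shards.total")
            .incrementBy(1L, Map.of("data_stream", dataStreamName));
    }
}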
> Sorry if I'm missing something super obvious - but why do we need to have a Counter both in the service and in the metrics registry? I understand a service must register its desired counters (e.g. doing this bit: registry.registerLongCounter("myCounter", ...);).
>
> But what I don't understand is why we need to keep hold of the counter in the service itself (like it's proposed here), as opposed to fetching it from the MeterRegistry (e.g. registry.getLongCounter("myCounter").incrementBy(1L, ...);).
>
> The autoShardingMetricCounters state is added to MetadataRolloverService solely so we can do autoShardingMetricCounters.get("myCounter").increment().
You can do this; I think the effect is the same (one map less). You would still have to keep the registry as state (a field somewhere), so this is what I meant by keeping state.

I think it is up to you to decide if you prefer to keep the counter fields local to the class or use a registry.get. Think of logger usage, where you would keep the instance local to the class. Some parts of our codebase are 'aggregating' metrics as records - see the S3RepositoriesMetrics usages.

The PR in its current state still looks good to me.
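For reference, a hedged sketch of the 'metrics as a record' aggregation pattern mentioned above, loosely modeled on how S3RepositoriesMetrics groups related instruments (the record name and fields are illustrative, not the actual S3RepositoriesMetrics definition):

import org.elasticsearch.telemetry.metric.LongCounter;
import org.elasticsearch.telemetry.metric.MeterRegistry;

record AutoShardingMetrics(LongCounter increaseShards, LongCounter decreaseShards) {
    static AutoShardingMetrics create(MeterRegistry registry) {
        // Register and capture the counters once, then pass the record around.
        return new AutoShardingMetrics(
            registry.registerLongCounter("es.auto_sharding.increase_shards.total", "shard increases", "unit"),
            registry.registerLongCounter("es.auto_sharding.decrease_shards.total", "shard decreases", "unit"));
    }
}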
Pinging @elastic/es-core-infra (Team:Core/Infra)
LGTM, but I left one comment to clarify the multiple instances of the service.
Hi @parkertimmins, I've created a changelog YAML for you.
Thanks for working on this, Parker. I've left a few questions about the approach.
LongCounter metricCounter = autoShardingMetricCounters.get(autoShardingResult.type());
if (metricCounter != null) {
    metricCounter.increment();
}
nit: we can use computeIfPresent here
This doesn't apply after removing autoShardingMetricCounters, right? 🤔

The multiple if statements with null checks aren't very pretty. I thought about using something like Optional.ofNullable/isPresent, but I haven't seen any Optional usage in the codebase, so decided to keep it as is.
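For reference, a minimal sketch of the computeIfPresent suggestion (moot once the autoShardingMetricCounters map was removed); Map.computeIfPresent only runs the function when the key is mapped:

// Increment the counter only if one is registered for this rollover type;
// returning the counter unchanged keeps the mapping in place.
autoShardingMetricCounters.computeIfPresent(
    autoShardingResult.type(),
    (type, counter) -> {
        counter.increment();
        return counter;
    });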
LGTM, thanks for iterating on this Parker
It's useful to know when data stream auto-sharding increases or decreases the number of shards during rollover, or when a cooldown prevents such an increase or decrease. This information is already in the logs, but we should add APM metrics to improve tracking. We will add four APM metrics:
es.auto_sharding.increase_shards.total
es.auto_sharding.decrease_shards.total
es.auto_sharding.cooldown_prevented_increase.total
es.auto_sharding.cooldown_prevented_decrease.total
Additionally, each metric will be tagged with a "data_stream" attribute containing the data stream name, allowing the metrics to be grouped by data stream.
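A hedged sketch of how these four metrics might be registered and emitted with the data_stream tag (descriptions and units are illustrative; the API usage reflects org.elasticsearch.telemetry.metric as I understand it, not necessarily the PR's final code):

import java.util.List;
import java.util.Map;
import org.elasticsearch.telemetry.metric.MeterRegistry;

class AutoShardingMetricsEmitter {
    private static final List<String> METRIC_NAMES = List.of(
        "es.auto_sharding.increase_shards.total",
        "es.auto_sharding.decrease_shards.total",
        "es.auto_sharding.cooldown_prevented_increase.total",
        "es.auto_sharding.cooldown_prevented_decrease.total");

    private final MeterRegistry registry;

    AutoShardingMetricsEmitter(MeterRegistry registry) {
        this.registry = registry;
        METRIC_NAMES.forEach(name -> registry.registerLongCounter(name, "auto-sharding event", "unit"));
    }

    void emit(String metricName, String dataStreamName) {
        // The data_stream attribute allows grouping the metric per data stream.
        registry.getLongCounter(metricName).incrementBy(1L, Map.of("data_stream", dataStreamName));
    }
}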