Skip to content

Conversation

@JeremyDahlgren
Copy link
Contributor

This change registers counters for initial and reconnect attempt failures. The change also required minor refactoring to make the metrics registry available from the TransportService that is passed to the RemoteConnectionStrategy constructor. This PR builds on the work done in #134415.

Resolves: ES-12695

This change registers counters for initial and reconnect
attempt failures.  The change also required minor refactoring
to make the metrics registry available from the TransportService
that is passed to the RemoteConnectionStrategy constructor.
This change builds on the work done in elastic#134415.

Resolves: ES-12695
@JeremyDahlgren JeremyDahlgren added >non-issue :Distributed Coordination/Network Http and internode communication implementations Team:Distributed Coordination Meta label for Distributed Coordination team v9.3.0 labels Oct 30, 2025
@JeremyDahlgren JeremyDahlgren requested a review from ywangd October 30, 2025 21:22
@JeremyDahlgren JeremyDahlgren marked this pull request as ready for review October 30, 2025 21:22
@JeremyDahlgren JeremyDahlgren requested a review from a team as a code owner October 30, 2025 21:22
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

Copy link
Member

@ywangd ywangd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some comments for the production code changes.

this.transportService = transportService;
this.connectionManager = connectionManager;
this.maxPendingConnectionListeners = config.maxPendingConnectionListeners();
registerMetrics(transportService.getTelemetryProvider());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The metrics registration should be node wide, not per Java instance. I suggest we register the metrics in RemoteClusterService instead and you can access them here by either passing them directly or use meterRegistry.getMeterRegistry(String name).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started refactoring to registering in RemoteClusterService, but realized the result was turning out the same, a single static instance initialized in a static synchronized method. Also we would be leaking knowledge of the metrics being used down in the strategy class. Aren't the single static instances in RemoteConnectionStrategy already node wide? Perhaps I'm misunderstanding what you mean by 'node wide' and 'per Java instance'. The way it is coded there are only two static counters, registered once.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way it is coded there are only two static counters, registered once.

You are right that they are registered once as two static fields. But we should not need these static fields and the synchronized initialization. It is more idmoatic to register the metrics once at a common place instead of attempting registration by each individual object. For example, we register repository related metrics once in RepositoriesModule instead of inside each individual repository.

Static fields also do not work well with internal cluster tests where multiple nodes are running in the same Java process. We want most objects, including MetricRegistry and its registration, to be per ES node. Static fields break this encapsulation because they are shared by multiple ES nodes in the test cluster.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, makes sense now, thank you for clearing this up Yang, forgive me for it not clicking before. I refactored to register the single metric in RemoteClusterService. I needed to refactor the unit tests for RemoteClusterService since they were creating an unnecessary RemoteClusterService instance, which would cause duplicate metric registration. Since this is a big change and separate from the focus of this PR I split it off into #137647.

logger.warn(msgSupplier, e);
// TODO: ES-12695: Increment either the initial or retry connection failure metric.
final var counter = isInitialAttempt ? initialConnectionAttemptFailures : reconnectAttemptFailures;
counter.increment();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should lablel the metric with information like target project alias etc. with counter.incrementBy(long, Map).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you Yang, I added linked_project_id and linked_project_alias. Are there others you had in mind?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. We can also have a label (an enum) to differentiate between initial connection and reconnection so that we need only a single APM metric.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could use a label for ConnectionStrategy as well. Though CPS only uses proxy, this metric is in a more general place that can be used in other environments.

Comment on lines +217 to +220
assert transportService.getTelemetryProvider() != null;
final var meterRegistry = transportService.getTelemetryProvider().getMeterRegistry();
assert meterRegistry instanceof RecordingMeterRegistry;
final var metricRecorder = ((RecordingMeterRegistry) meterRegistry).getRecorder();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related to my other comments about static fields. If we create two TransportService instances representing two nodes and test connection strategies for each of them, the static fields will make RemoteConnectionStrategy objects from different TransportService instance share the same metrics give incorrect results.

JeremyDahlgren added a commit to JeremyDahlgren/elasticsearch that referenced this pull request Nov 5, 2025
This refactoring is a split from elastic#137406.  A TransportService
instance already creates a RemoteClusterService instance,
with a getter method to get a reference to it.  The unit
tests were creating another instance, passing the existing
TransportService instance into the constructor.  This will
cause one time metric registration to fail since the metrics
have already been registered in the TransportService's
internal RemoteClusterService instance.

Relates: ES-12695
JeremyDahlgren added a commit that referenced this pull request Nov 6, 2025
This refactoring is a split from #137406.  A TransportService
instance already creates a RemoteClusterService instance,
with a getter method to get a reference to it.  The unit
tests were creating another instance, passing the existing
TransportService instance into the constructor.  This will
cause one time metric registration to fail since the metrics
have already been registered in the TransportService's
internal RemoteClusterService instance.

Relates: ES-12695
@JeremyDahlgren JeremyDahlgren requested a review from ywangd November 6, 2025 22:23
Copy link
Member

@ywangd ywangd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment on lines 105 to 113
final var telemetryProvider = transportService == null ? null : transportService.getTelemetryProvider();
final var meterRegistry = telemetryProvider == null ? null : telemetryProvider.getMeterRegistry();
if (meterRegistry != null) {
meterRegistry.registerLongCounter(
CONNECTION_ATTEMPT_FAILURES_COUNTER_NAME,
"linked project connection attempt failure count",
"count"
);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these null checks necessary? I assume they are only useful for tests? Does TransportService not return a NOOP telemetryProvider if a real one is not configured?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was able to eliminate most of the null checks here, the remaining one I'll address in a followup PR with some additional refactoring in RemoveClusterService tests.

@JeremyDahlgren JeremyDahlgren merged commit 4f5d001 into elastic:main Nov 8, 2025
34 checks passed
Kubik42 pushed a commit to Kubik42/elasticsearch that referenced this pull request Nov 10, 2025
)

This refactoring is a split from elastic#137406.  A TransportService
instance already creates a RemoteClusterService instance,
with a getter method to get a reference to it.  The unit
tests were creating another instance, passing the existing
TransportService instance into the constructor.  This will
cause one time metric registration to fail since the metrics
have already been registered in the TransportService's
internal RemoteClusterService instance.

Relates: ES-12695
Kubik42 pushed a commit to Kubik42/elasticsearch that referenced this pull request Nov 10, 2025
…37406)

This change registers a counter to track initial and reconnect
attempt failures.  The change also required minor refactoring
to make the metrics registry available from the TransportService
that is passed to the RemoteClusterService constructor.
This change builds on the work done in elastic#134415.

Resolves: ES-12695
JeremyDahlgren added a commit to JeremyDahlgren/elasticsearch that referenced this pull request Nov 11, 2025
This change is a follow up to elastic#137406, removing a null check
around the TransportService instance when registering the metric.
This was only needed for a few test cases in the
RemoteClusterServiceTests where a null instance was passed to the
constructor.  The change also cleans up the code for creating the
TransportService instances in the test, cutting down on the LOC.

Relates: elastic#137406
JeremyDahlgren added a commit that referenced this pull request Nov 13, 2025
…37939)

This change is a follow up to #137406, removing a null check
around the TransportService instance when registering the metric.
This was only needed for a few test cases in the
RemoteClusterServiceTests where a null instance was passed to the
constructor.  The change also cleans up the code for creating the
TransportService instances in the test, cutting down on the LOC.

Relates: #137406
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed Coordination/Network Http and internode communication implementations >non-issue Team:Distributed Coordination Meta label for Distributed Coordination team v9.3.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants