-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Add connection failure metrics in RemoteConnectionStrategy #137406
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add connection failure metrics in RemoteConnectionStrategy #137406
Conversation
This change registers counters for initial and reconnect attempt failures. The change also required minor refactoring to make the metrics registry available from the TransportService that is passed to the RemoteConnectionStrategy constructor. This change builds on the work done in elastic#134415. Resolves: ES-12695
|
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination) |
ywangd
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have some comments for the production code changes.
| this.transportService = transportService; | ||
| this.connectionManager = connectionManager; | ||
| this.maxPendingConnectionListeners = config.maxPendingConnectionListeners(); | ||
| registerMetrics(transportService.getTelemetryProvider()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The metrics registration should be node wide, not per Java instance. I suggest we register the metrics in RemoteClusterService instead and you can access them here by either passing them directly or use meterRegistry.getMeterRegistry(String name).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I started refactoring to registering in RemoteClusterService, but realized the result was turning out the same, a single static instance initialized in a static synchronized method. Also we would be leaking knowledge of the metrics being used down in the strategy class. Aren't the single static instances in RemoteConnectionStrategy already node wide? Perhaps I'm misunderstanding what you mean by 'node wide' and 'per Java instance'. The way it is coded there are only two static counters, registered once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way it is coded there are only two static counters, registered once.
You are right that they are registered once as two static fields. But we should not need these static fields and the synchronized initialization. It is more idmoatic to register the metrics once at a common place instead of attempting registration by each individual object. For example, we register repository related metrics once in RepositoriesModule instead of inside each individual repository.
Static fields also do not work well with internal cluster tests where multiple nodes are running in the same Java process. We want most objects, including MetricRegistry and its registration, to be per ES node. Static fields break this encapsulation because they are shared by multiple ES nodes in the test cluster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, makes sense now, thank you for clearing this up Yang, forgive me for it not clicking before. I refactored to register the single metric in RemoteClusterService. I needed to refactor the unit tests for RemoteClusterService since they were creating an unnecessary RemoteClusterService instance, which would cause duplicate metric registration. Since this is a big change and separate from the focus of this PR I split it off into #137647.
| logger.warn(msgSupplier, e); | ||
| // TODO: ES-12695: Increment either the initial or retry connection failure metric. | ||
| final var counter = isInitialAttempt ? initialConnectionAttemptFailures : reconnectAttemptFailures; | ||
| counter.increment(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should lablel the metric with information like target project alias etc. with counter.incrementBy(long, Map).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you Yang, I added linked_project_id and linked_project_alias. Are there others you had in mind?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. We can also have a label (an enum) to differentiate between initial connection and reconnection so that we need only a single APM metric.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could use a label for ConnectionStrategy as well. Though CPS only uses proxy, this metric is in a more general place that can be used in other environments.
| assert transportService.getTelemetryProvider() != null; | ||
| final var meterRegistry = transportService.getTelemetryProvider().getMeterRegistry(); | ||
| assert meterRegistry instanceof RecordingMeterRegistry; | ||
| final var metricRecorder = ((RecordingMeterRegistry) meterRegistry).getRecorder(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Related to my other comments about static fields. If we create two TransportService instances representing two nodes and test connection strategies for each of them, the static fields will make RemoteConnectionStrategy objects from different TransportService instance share the same metrics give incorrect results.
This refactoring is a split from elastic#137406. A TransportService instance already creates a RemoteClusterService instance, with a getter method to get a reference to it. The unit tests were creating another instance, passing the existing TransportService instance into the constructor. This will cause one time metric registration to fail since the metrics have already been registered in the TransportService's internal RemoteClusterService instance. Relates: ES-12695
This refactoring is a split from #137406. A TransportService instance already creates a RemoteClusterService instance, with a getter method to get a reference to it. The unit tests were creating another instance, passing the existing TransportService instance into the constructor. This will cause one time metric registration to fail since the metrics have already been registered in the TransportService's internal RemoteClusterService instance. Relates: ES-12695
ywangd
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
| final var telemetryProvider = transportService == null ? null : transportService.getTelemetryProvider(); | ||
| final var meterRegistry = telemetryProvider == null ? null : telemetryProvider.getMeterRegistry(); | ||
| if (meterRegistry != null) { | ||
| meterRegistry.registerLongCounter( | ||
| CONNECTION_ATTEMPT_FAILURES_COUNTER_NAME, | ||
| "linked project connection attempt failure count", | ||
| "count" | ||
| ); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these null checks necessary? I assume they are only useful for tests? Does TransportService not return a NOOP telemetryProvider if a real one is not configured?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was able to eliminate most of the null checks here, the remaining one I'll address in a followup PR with some additional refactoring in RemoveClusterService tests.
) This refactoring is a split from elastic#137406. A TransportService instance already creates a RemoteClusterService instance, with a getter method to get a reference to it. The unit tests were creating another instance, passing the existing TransportService instance into the constructor. This will cause one time metric registration to fail since the metrics have already been registered in the TransportService's internal RemoteClusterService instance. Relates: ES-12695
…37406) This change registers a counter to track initial and reconnect attempt failures. The change also required minor refactoring to make the metrics registry available from the TransportService that is passed to the RemoteClusterService constructor. This change builds on the work done in elastic#134415. Resolves: ES-12695
This change is a follow up to elastic#137406, removing a null check around the TransportService instance when registering the metric. This was only needed for a few test cases in the RemoteClusterServiceTests where a null instance was passed to the constructor. The change also cleans up the code for creating the TransportService instances in the test, cutting down on the LOC. Relates: elastic#137406
…37939) This change is a follow up to #137406, removing a null check around the TransportService instance when registering the metric. This was only needed for a few test cases in the RemoteClusterServiceTests where a null instance was passed to the constructor. The change also cleans up the code for creating the TransportService instances in the test, cutting down on the LOC. Relates: #137406
This change registers counters for initial and reconnect attempt failures. The change also required minor refactoring to make the metrics registry available from the
TransportServicethat is passed to theRemoteConnectionStrategyconstructor. This PR builds on the work done in #134415.Resolves: ES-12695