Skip to content

Fix agent lifecycle events: Start instance name and Shutdown delivery#810

Merged
wu-sheng merged 1 commit into
mainfrom
fix/agent-lifecycle-events
Jul 1, 2026
Merged

Fix agent lifecycle events: Start instance name and Shutdown delivery#810
wu-sheng merged 1 commit into
mainfrom
fix/agent-lifecycle-events

Conversation

@wu-sheng

@wu-sheng wu-sheng commented Jul 1, 2026

Copy link
Copy Markdown
Member

Fix agent lifecycle events: the Start event missed the service instance name, and the Shutdown event was never delivered

  • Add a unit test to verify that the fix works.
  • Explain briefly why the bug exists and how to fix it.

Why the bug exists

BootService#priority() is documented as "BootServices with higher priorities will be started earlier, and shut down later", but ServiceManager sorted the phases the opposite way — prepare()/startup() ascending and shutdown() descending. So the Integer.MAX_VALUE services ran last on boot and first on shutdown, breaking two lifecycle events:

  • ServiceInstanceGenerator (MAX) prepared last, so EventReportServiceClient built the Start event before INSTANCE_NAME was generated → empty serviceInstance.
  • GRPCChannelManager (MAX) shut down first, closing the gRPC channel before EventReportServiceClient.shutdown() could send the Shutdown event. A second defect compounded it: the event stub's deadline was fixed once at connect time (withDeadlineAfter), so for any JVM alive longer than the upstream timeout the Shutdown RPC failed with DEADLINE_EXCEEDED even on a live channel.

How it's fixed

  • ServiceManager: sort prepare()/startup() descending and shutdown() ascending, matching the documented contract (foundational services come up first and go down last).
  • EventReportServiceClient: align with the stable TraceSegmentServiceClient / ServiceManagementClient idiom — volatile status/stub, a plain stub built on CONNECTED before the status is published, and a fresh deadline applied per send. The Start-event gate now uses explicit readiness (bootCompleted + CONNECTED), with reported meaning sent/in-flight, so a failed send retries on the next connect instead of being dropped (this also fixes the previously inverted compareAndSet gate).
  • KafkaProducerManager: no longer derives its priority from GRPCChannelManager — that coupling flipped under the corrected order and would close the shared producer before the Kafka reporters stop. It now uses a constant priority higher than the default-priority Kafka reporters, so the producer is closed only after they stop sending.

Verification

Added ServiceManagerOrderingTest (pins boot descending / shutdown ascending) and a KafkaProducerManager priority test; apm-agent-core and kafka-reporter-plugin suites pass with 0 checkstyle violations. Also verified end-to-end against a local OAP + BanyanDB: a service instrumented with the built agent reports a Start event carrying the instance name and, on graceful shutdown (including JVMs running well past the upstream timeout), a Shutdown event.

The BootService priority contract (higher priority is started earlier and
shut down later) was inverted in ServiceManager: prepare()/startup() sorted
ascending and shutdown() sorted descending. As a result:

- ServiceInstanceGenerator (priority MAX_VALUE) prepared last, so the Start
  event was built before INSTANCE_NAME was generated, reporting an empty
  serviceInstance.
- GRPCChannelManager (priority MAX_VALUE) shut down first and closed the
  gRPC channel before EventReportServiceClient could send the Shutdown event.

Sort prepare()/startup() descending and shutdown() ascending to match the
documented contract, so foundational services come up first and go down last.

Align EventReportServiceClient with the TraceSegmentServiceClient /
ServiceManagementClient idiom: volatile status/stub, a plain stub built on
CONNECTED before the status is published, and a fresh gRPC deadline applied
per send. Previously the Shutdown event reused a stub whose absolute deadline
was fixed at connect time and had already expired for any JVM alive longer
than the upstream timeout. The Start-event gate now uses explicit readiness
(bootCompleted + CONNECTED) with reported meaning sent/in-flight, so a failed
send can retry on the next connect instead of being dropped.

KafkaProducerManager no longer derives its priority from GRPCChannelManager;
it uses a constant priority higher than the default-priority Kafka reporters
that share its producer, so the producer is closed only after they stop.

Add ServiceManagerOrderingTest and a KafkaProducerManager priority test.
@wu-sheng wu-sheng added this to the 9.7.0 milestone Jul 1, 2026
@wu-sheng wu-sheng added the bug Something isn't working label Jul 1, 2026
@wu-sheng wu-sheng merged commit a1d69cb into main Jul 1, 2026
216 of 273 checks passed
@wu-sheng wu-sheng deleted the fix/agent-lifecycle-events branch July 1, 2026 09:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants