Feature/otlp ksm node enrichment #2072
Conversation
…or AI agents (~5,100 tokens)

Adds a sparse tree of AGENTS.md files across the repository to help AI coding agents navigate the codebase effectively. Each file surfaces hidden context that isn't obvious from reading code alone: architectural invariants, registration traps, pitfall patterns, and cross-cutting contracts.

Key context captured:
- Dual runtime architecture (Telegraf + OTel) and the adapter bridge between them
- Silent failure traps: OTel component registration, Telegraf blank imports
- CloudWatch Logs concurrency trap (concurrency >1 introduces HOL blocking)
- fd_release + auto_removal log loss interaction
- Prometheus dual pipeline (completely different component chains for CW vs AMP)
- EntityStore service name resolution priority chain
- OTLP/HTTP validator as a security boundary
- SigV4 handler ordering constraint (compression before signing)
- Forked dependency warnings (go.mod replace directives)
- Partition-aware credential handling

All claims verified against current source code.
… enrichment

Adds an end-to-end pipeline for enriching KSM metrics with per-node IMDS host attributes (host.id, host.name, host.type, host.image.id, cloud.availability_zone) via Kubernetes Leases.
- Expand EC2Info struct with InstanceType, ImageID, AvailabilityZone, Hostname fields
- Rename setInstanceIDAccountID to setEC2Metadata to reflect expanded scope
- Add LeaseWriter that publishes IMDS metadata as Lease annotations per node
- Add nodemetadatacache extension that watches Leases via informer and caches metadata
- Add nodemetadataenricher processor that enriches KSM metrics from the cache
- Register nodemetadatacache and nodemetadataenricher in default components
- Extract getEnv and getK8sConfig as package-level vars for testability
- Add startLeaseWriter tests for missing env var and K8s config failure paths
```go
lw.logger.Info("Node metadata Lease already exists, adopting via update",
	zap.String("name", lw.leaseName()),
)
existing, getErr := lw.client.Leases(lw.namespace).Get(context.Background(), lw.leaseName(), metav1.GetOptions{})
```
When AlreadyExists, the code does Get then Update. If Get fails (any non-NotFound error), it logs and calls continue — which immediately retries Create, gets AlreadyExists again, tries Get again, fails again. There is no backoff and no done channel check in this path. The same applies when Update fails: continue goes back to Create with no sleep. Under K8s API throttling or transient errors, this becomes a tight loop hammering the API server until the agent is killed.
Good catch. The bare continue statements bypass the backoff entirely. Fixed by restructuring the AlreadyExists branch so that Get/Update failures fall through to the shared backoff+done-check block at the bottom of the loop instead of jumping back to the top.
I considered introducing wait.Backoff but opted against it: the function already has its own backoff state, and mixing two independent backoff mechanisms in the same retry loop would be harder to follow. The fall-through reuses the single existing backoff variable for all failure paths.
The trade-off is that after a Get/Update failure, the next iteration calls Create again (which will return AlreadyExists). This is intentional — the redundant Create is one extra call per backoff interval, and it handles the edge case where the Lease was deleted between retries (Create would succeed directly instead of looping on Get for a nonexistent object).
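For readers following along, here is a minimal, self-contained sketch of the loop shape being described; the `attemptCreateOrAdopt` callback and the field names are illustrative stand-ins, not the actual LeaseWriter code. Every failure path, including Get/Update failures after AlreadyExists, falls through to the single backoff + done-check block.

```go
// Sketch of the restructured retry loop: every failure falls through to the
// shared backoff + done check instead of jumping straight back to Create.
// Names (leaseWriterSketch, attemptCreateOrAdopt, done) are illustrative.
package main

import (
	"errors"
	"time"
)

type leaseWriterSketch struct {
	done chan struct{}
}

// attemptCreateOrAdopt stands in for Create -> (on AlreadyExists) Get + Update.
// Any error it returns is handled the same way: back off, then retry from Create.
func (lw *leaseWriterSketch) run(attemptCreateOrAdopt func() error) {
	backoff := 1 * time.Second
	const maxBackoff = 1 * time.Minute

	for {
		if err := attemptCreateOrAdopt(); err == nil {
			return // Lease created or adopted successfully.
		}

		// Shared failure path for Create, Get, and Update errors:
		// no bare `continue` that would hammer the API server.
		select {
		case <-lw.done:
			return // Shutdown requested while backing off.
		case <-time.After(backoff):
		}
		backoff *= 2
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}

func main() {
	lw := &leaseWriterSketch{done: make(chan struct{})}
	calls := 0
	lw.run(func() error {
		calls++
		if calls < 3 {
			return errors.New("transient API error") // e.g. throttled Get after AlreadyExists
		}
		return nil
	})
}
```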
```go
annotationHostName = "cwagent.amazonaws.com/host.name"
annotationHostType = "cwagent.amazonaws.com/host.type"
annotationImageID  = "cwagent.amazonaws.com/host.image.id"
annotationAZ       = "cwagent.amazonaws.com/cloud.availability_zone"
```
can these consts be moved to a shared location? they are defined in both leasewriter and here
Created lease.go as the single source of truth — internal/k8sCommon/ is where the codebase already keeps shared K8s utilities (k8sclient, k8sutil, kubeletutil). Both the writer and reader now import from there. Kept it as a dedicated lease sub-package rather than mixing into k8sutil, since these constants define the Lease contract between two specific components.
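As a rough illustration of the shared contract file, something along these lines (the exported identifier names and the exact file layout are assumptions; the annotation values and lease name prefix come from this PR):

```go
// Hypothetical sketch of internal/k8sCommon/lease/lease.go: the single source of
// truth for the Lease contract shared by the writer (entitystore) and the reader
// (nodemetadatacache). Identifier names are assumptions, not the merged code.
package lease

// LeaseNamePrefix is the prefix for per-node metadata Leases,
// e.g. cwagent-node-metadata-<nodeName>.
const LeaseNamePrefix = "cwagent-node-metadata-"

// Annotation keys carrying IMDS-derived host attributes.
const (
	AnnotationHostID   = "cwagent.amazonaws.com/host.id" // key assumed; the others appear in the diff above
	AnnotationHostName = "cwagent.amazonaws.com/host.name"
	AnnotationHostType = "cwagent.amazonaws.com/host.type"
	AnnotationImageID  = "cwagent.amazonaws.com/host.image.id"
	AnnotationAZ       = "cwagent.amazonaws.com/cloud.availability_zone"
)
```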
```go
}
c.mutex.Lock()
defer c.mutex.Unlock()
c.cache = make(map[string]*NodeMetadata)
```
Shutdown closes stopCh (signaling the informer to stop) and then immediately acquires the write lock to clear the cache. But closing stopCh only signals the informer — it does not wait for in-flight event handler goroutines to exit. An onLeaseAdd or onLeaseUpdate call that was waiting for the write lock will acquire it after Shutdown releases it, re-populating the cache. After Shutdown returns, the cache may contain stale data. Downstream code that calls Get() after Shutdown could receive non-nil results.
Added an atomic.Bool shutdown flag. Shutdown() sets it before closing stopCh. Get() returns nil after shutdown, and handleLeaseEvent()/onLeaseDelete() bail out early — so in-flight handlers can't repopulate the cache after it's cleared. The check is a single atomic read, so the overhead on the event handler path is negligible.
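A minimal sketch of the pattern (type, field, and method names are illustrative, not the real NodeMetadataCache):

```go
// Sketch of the shutdown-flag pattern: Shutdown sets the flag before signaling,
// so in-flight event handlers cannot repopulate the cache after it is cleared,
// and Get returns nothing once shutdown has started.
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type metadataCache struct {
	mu       sync.RWMutex
	cache    map[string]string
	shutdown atomic.Bool
	stopCh   chan struct{}
}

// handleLeaseEvent bails out early once shutdown is set.
func (c *metadataCache) handleLeaseEvent(node, value string) {
	if c.shutdown.Load() {
		return
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	c.cache[node] = value
}

// Get returns nothing after shutdown, even if stale entries linger in the map.
func (c *metadataCache) Get(node string) (string, bool) {
	if c.shutdown.Load() {
		return "", false
	}
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.cache[node]
	return v, ok
}

// Shutdown sets the flag first, then signals the informer and clears the cache.
func (c *metadataCache) Shutdown() {
	c.shutdown.Store(true)
	close(c.stopCh)
	c.mu.Lock()
	defer c.mu.Unlock()
	c.cache = make(map[string]string)
}

func main() {
	c := &metadataCache{cache: map[string]string{}, stopCh: make(chan struct{})}
	c.handleLeaseEvent("node-1", "i-abc")
	c.Shutdown()
	c.handleLeaseEvent("node-1", "i-abc") // no-op after shutdown
	_, ok := c.Get("node-1")
	fmt.Println(ok) // false
}
```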
```go
// Check staleness: renewTime + leaseDuration must be >= now
expiry := entry.RenewTime.Add(time.Duration(entry.LeaseDuration) * time.Second)
if time.Now().After(expiry) {
	return nil
```
this returns nil for stale entries but does not remove them from the map. If a node is decommissioned and its lease expires naturally (TTL) without a delete event (e.g., the informer missed the delete, or the lease was force-deleted), the entry stays in the map forever. In a large cluster with frequent node churn, this is unbounded memory growth. The informer's 5-minute resync period helps with missed deletes, but does not guarantee cleanup of TTL-expired entries.
Considered three approaches here: delete-on-read (write-lock upgrade in Get()), background eviction goroutine, and no change.
Going with no change. The correctness concern is already handled — Get() checks renewTime + leaseDuration and returns nil for stale entries, so no wrong data is ever served. The memory concern is the inert map entries for decommissioned nodes. Each entry is ~300 bytes (5 short strings + a timestamp + an int32). Even 1000 decommissioned nodes over the process lifetime is ~300KB, and the map resets on any pod restart (deployment rollout, Helm upgrade, OOM, etc.).
Adding eviction logic (either delete-on-read or a background sweep) would duplicate the TTL-based staleness handling that motivated choosing Leases over ConfigMaps in the first place. The delete-on-read approach also only helps for nodes that are still actively queried — truly decommissioned nodes (no KSM metrics) would never be cleaned up anyway.
Happy to add a background sweep if you feel the memory bound isn't tight enough, but I think the current design is the right trade-off.
```go
// IMDS metadata as a Kubernetes Lease. Must be called after ec2Info is
// initialized (the LeaseWriter's waitForEC2Info handles the race).
func (e *EntityStore) startLeaseWriter() {
	nodeName := getEnv("K8S_NODE_NAME")
```
should check if LW has already been initialized
Added the nil guard. The current call site only invokes it once, but it's a cheap defensive check against future refactors.
```go
annotations := lease.Annotations

// All five annotations must be present
hostID, ok1 := annotations[annotationHostID]
```
should there be a check for any empty or nil values?
The ok1–ok5 checks handle the nil/missing case (key not present in the annotations map). Added empty string checks as well — none of these IMDS fields can legitimately be empty, so rejecting empty values is strictly more correct. The LeaseWriter won't write empty annotations in practice, but it's good defensive validation on the reader side.
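A small sketch of what the combined presence + non-empty validation might look like on the reader side (the function name and the host.id key are assumptions; the other keys appear in the diff above):

```go
// Sketch of reader-side validation: every required annotation must be present
// and non-empty before the Lease is accepted into the cache.
package main

import "fmt"

const (
	annotationHostID   = "cwagent.amazonaws.com/host.id" // assumed key
	annotationHostName = "cwagent.amazonaws.com/host.name"
	annotationHostType = "cwagent.amazonaws.com/host.type"
	annotationImageID  = "cwagent.amazonaws.com/host.image.id"
	annotationAZ       = "cwagent.amazonaws.com/cloud.availability_zone"
)

// extractMetadata rejects a Lease whose annotations are missing or empty.
func extractMetadata(annotations map[string]string) (map[string]string, bool) {
	required := []string{annotationHostID, annotationHostName, annotationHostType, annotationImageID, annotationAZ}
	out := make(map[string]string, len(required))
	for _, key := range required {
		value, ok := annotations[key]
		if !ok || value == "" {
			return nil, false // missing or empty: reject the whole entry
		}
		out[key] = value
	}
	return out, true
}

func main() {
	_, ok := extractMetadata(map[string]string{annotationHostID: ""})
	fmt.Println(ok) // false: empty value rejected
}
```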
```go
cache := p.cache.Load()
if cache == nil {
	// Extension may not have been ready at creation time — retry.
	cache = nodemetadatacache.GetNodeMetadataCache()
```
could add logs here to help with debugging
Added a debug log on successful lazy init. Intentionally skipped logging the "not yet available" case — processMetrics is called per metric batch, so that message would be too noisy.
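A sketch of the lazy-init path with that debug log, using illustrative names for the processor field and the global accessor (not the actual enricher code):

```go
// Sketch of lazy cache lookup with a debug log only on successful init.
// getGlobalCache stands in for nodemetadatacache.GetNodeMetadataCache().
package main

import (
	"sync/atomic"

	"go.uber.org/zap"
)

type nodeCache struct{}

var getGlobalCache = func() *nodeCache { return nil }

type enricher struct {
	cache  atomic.Pointer[nodeCache]
	logger *zap.Logger
}

// lookupCache retries the global lookup if the extension was not ready when the
// processor was created. The "not yet available" case is intentionally silent
// because this runs once per metric batch.
func (e *enricher) lookupCache() *nodeCache {
	if c := e.cache.Load(); c != nil {
		return c
	}
	c := getGlobalCache()
	if c != nil {
		e.cache.Store(c)
		e.logger.Debug("node metadata cache initialized lazily")
	}
	return c
}

func main() {
	e := &enricher{logger: zap.NewNop()}
	_ = e.lookupCache() // nil: extension not registered in this sketch
}
```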
```go
// Stop stops the renewal goroutine, waits for it to exit, then performs a
// best-effort delete of the Lease.
func (lw *LeaseWriter) Stop() {
	close(lw.done)
```
If this gets called twice (e.g., from a test that calls Stop() directly and then EntityStore.Shutdown() also calls it), the second close panics. There is no sync.Once or closed-flag guard.
Wrapped in sync.Once
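For reference, the idempotent-Stop pattern looks roughly like this (field names assumed):

```go
// Sketch of guarding Stop with sync.Once: a second call is a no-op instead of
// panicking on a double close of the done channel.
package main

import "sync"

type leaseWriter struct {
	done     chan struct{}
	stopOnce sync.Once
}

func (lw *leaseWriter) Stop() {
	lw.stopOnce.Do(func() {
		close(lw.done)
		// best-effort Lease delete would follow here
	})
}

func main() {
	lw := &leaseWriter{done: make(chan struct{})}
	lw.Stop()
	lw.Stop() // safe: no panic on the second call
}
```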
```go
var _ extension.Extension = (*NodeMetadataCache)(nil)

// SetForTest populates the cache with test data. Exported for cross-package test use.
func (c *NodeMetadataCache) SetForTest(nodeName string, metadata *NodeMetadata) {
```
Can this be moved to a testutil package for export purposes?
Looked into this — a testutil package can't access the unexported cache map and mutex, so it would need reflect/unsafe or a new exported Set() method (which is worse for the public API). Moving to a _test.go file doesn't work either since processor_test.go in the enricher package needs cross-package access. SetNodeMetadataCacheForTest in factory.go already follows the same pattern. Happy to move them if you see a cleaner approach, but I think the ForTest suffix + doc comments is the least-bad option here.
```go
// leaseWriter creates and renews a Kubernetes Lease with IMDS metadata
// for KSM node metadata enrichment
leaseWriter *LeaseWriter
```
How are we going to configure this extension in the helm-chart yaml? Specifically fields such as kubernetes_mode. The helm-chart could be used in EKS / K8s on EC2 / K8s on Prem etc.
This is handled today in the CWA config translation, but that does not exist for OTel CI.
The entitystore config (including kubernetes_mode, mode, region) continues to come from the existing CWA config translator; this PR doesn't change that path. The LeaseWriter doesn't depend on the specific kubernetes_mode value; it only gates on mode == EC2 && kubernetesMode != "", so it works identically on EKS, K8s-on-EC2, or any EC2-backed K8s cluster. On-prem (no EC2) is excluded because mode != EC2 means no IMDS. The LeaseWriter doesn't use region, profile, or credential config: it reads IMDS via the existing EC2Info and uses in-cluster K8s config. No new entitystore config fields or Helm chart config are needed.
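In other words, the gate is a two-condition check, sketched here with illustrative names and string values (the real code uses the agent's mode constants):

```go
// Sketch of the gating condition: the LeaseWriter starts only on EC2-backed
// Kubernetes nodes, regardless of which kubernetes_mode value is set.
package main

func shouldStartLeaseWriter(mode, kubernetesMode string) bool {
	return mode == "EC2" && kubernetesMode != ""
}

func main() {
	_ = shouldStartLeaseWriter("EC2", "EKS")        // true: EKS
	_ = shouldStartLeaseWriter("EC2", "K8sOnEC2")   // true: K8s on EC2 (illustrative value)
	_ = shouldStartLeaseWriter("onPrem", "K8s")     // false: no IMDS on-prem
}
```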
We are then adding a hard dependency on the JSON config for OTel CI to work, so it'll no longer be a pure YAML experience.
Do you have any ideas on how to change this, or is it OK for now?
I think let's just copy the extension as-is into our YAML as well. As long as it is identical to what the agent JSON translation generates, the merge shouldn't complain.
@mitali-salvi is going to validate that.
Agree with Kaushik here; we need to ensure this stays a pure YAML experience.
I validated that copying the entitystore extension to the YAML won't cause merge conflicts during agent startup. The agent will merge/combine the 2 instances of the extension and instantiate it as a singleton.
Description of the issue
KSM (kube-state-metrics) metrics scraped by the cluster-scraper lack per-node
host attributes (instance ID, instance type, AMI ID, availability zone, hostname).
These attributes are only available via IMDS on each node, but the cluster-scraper
runs as a single Deployment and cannot access IMDS for every node.
Description of changes
Introduces a three-component pipeline to bridge IMDS data from DaemonSet nodes to the cluster-scraper:

1. LeaseWriter (`extension/entitystore/leasewriter.go`): Runs on each DaemonSet node. After EC2Info is populated from IMDS, creates a Kubernetes Lease (`cwagent-node-metadata-<nodeName>`) with host attributes as annotations. Includes jitter to prevent thundering herd, exponential backoff on failures, and best-effort cleanup on shutdown.

2. nodemetadatacache extension (`extension/nodemetadatacache/`): Runs on the cluster-scraper. Watches Leases via a K8s informer scoped to the addon namespace. Maintains an in-memory cache keyed by node name with staleness checks (renewTime + leaseDuration). Degrades gracefully if K8s client setup fails.

3. nodemetadataenricher processor (`plugins/processors/nodemetadataenricher/`): Runs in the cluster-scraper's KSM pipeline. For each ResourceMetrics with a `k8s.node.name` attribute, looks up the node in the cache and sets `host.id`, `host.name`, `host.type`, `host.image.id`, and `cloud.availability_zone` (see the sketch after this list). Uses `atomic.Pointer` for thread-safe lazy initialization of the cache reference.
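A compact sketch of that enrichment step, using the OTel pdata API with an assumed cache-lookup signature (the real processor uses the nodemetadatacache extension; the `nodeMetadata` type and `lookup` function here are illustrative):

```go
// Illustrative sketch of the enrichment step: for each ResourceMetrics carrying
// k8s.node.name, copy the cached IMDS attributes onto the resource.
package main

import (
	"go.opentelemetry.io/collector/pdata/pmetric"
)

type nodeMetadata struct {
	HostID, HostName, HostType, ImageID, AZ string
}

// lookup stands in for the cache's Get(nodeName); nil means cache miss,
// in which case the metrics pass through unchanged.
func enrich(md pmetric.Metrics, lookup func(string) *nodeMetadata) pmetric.Metrics {
	rms := md.ResourceMetrics()
	for i := 0; i < rms.Len(); i++ {
		attrs := rms.At(i).Resource().Attributes()
		nodeName, ok := attrs.Get("k8s.node.name")
		if !ok || nodeName.Str() == "" {
			continue
		}
		meta := lookup(nodeName.Str())
		if meta == nil {
			continue // cache miss: pass through untouched
		}
		attrs.PutStr("host.id", meta.HostID)
		attrs.PutStr("host.name", meta.HostName)
		attrs.PutStr("host.type", meta.HostType)
		attrs.PutStr("host.image.id", meta.ImageID)
		attrs.PutStr("cloud.availability_zone", meta.AZ)
	}
	return md
}

func main() {
	md := pmetric.NewMetrics()
	rm := md.ResourceMetrics().AppendEmpty()
	rm.Resource().Attributes().PutStr("k8s.node.name", "node-1")
	enrich(md, func(string) *nodeMetadata {
		return &nodeMetadata{HostID: "i-0123456789abcdef0", AZ: "us-west-2a"}
	})
}
```

Metrics whose node is not in the cache pass through unchanged, which matches the pass-through cases in the test list below.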
Supporting changes:
- `setInstanceIDAccountID` renamed to `setEC2Metadata`
- `startLeaseWriter` called in K8s mode with testable `getEnv`/`getK8sConfig` vars

License
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution,
under the terms of your choice.
Tests
- ec2Info_test.go: Updated `TestSetEC2Metadata` table-driven tests for all 6 fields, added `TestGettersReturnEmptyBeforeInit`, `TestHostnameFailureProceedsWithoutIt`
- extension_test.go: Added `TestStartLeaseWriter_MissingNodeName`, `TestStartLeaseWriter_K8sConfigFailure`, `TestStartLeaseWriter_DefaultNamespace`
- leasewriter_test.go: Tests for buildLease, create, renewal (calls actual `renewLeaseWithRetry`), stop/delete, leaseName, jitter bounds, default values
- extension_test.go (nodemetadatacache): Tests for cache hit/miss, stale lease, missing annotations, missing renewTime/leaseDuration, concurrent read/write, tombstone handling, update overwrite, prefix filtering
- processor_test.go: Tests for enrichment with cache hit, pass-through on cache miss, no node name, empty node name, metric count preservation, AZ overwrite, mixed metrics (enriched + pass-through)
Deployed to a 13-node EKS cluster with standard, GPU, Neuron, EFA, and attr-limit node groups. Verified all 13 DaemonSet agents created Leases with correct IMDS annotations (host.id, host.name, host.type, host.image.id, cloud.availability_zone) and leaseDurationSeconds=7200. Confirmed Lease adoption on pod restart (AlreadyExists handling) and renewal via agent logs. Verified the cluster-scraper's nodemetadatacache extension started and synced the Lease informer.
Ran the full integration test suite (2769 tests, 0 failures) which validates: