Skip to content

Add OTEL MetricsV2 standard cluster integration tests#671

Merged
sky333999 merged 2 commits into
aws:mainfrom
spanaik:otel/standard-cluster-clean
Apr 29, 2026
Merged

Add OTEL MetricsV2 standard cluster integration tests#671
sky333999 merged 2 commits into
aws:mainfrom
spanaik:otel/standard-cluster-clean

Conversation

@spanaik
Copy link
Copy Markdown
Contributor

@spanaik spanaik commented Apr 21, 2026

Port OTEL integration tests into the agent test framework for the
standard EKS cluster (2x t3.medium). Tests validate metric correctness
by querying the CloudWatch PromQL API with SigV4 auth and cross-
validating against K8s API ground truth.

Components:

  • util/otelmetrics/: Shared PromQL client library with SigV4 signing, query cache, ZIP-0006 label parsing, and rate limiting
  • terraform/eks/daemon/otel/: Ephemeral EKS cluster with Helm chart (release-6.1.0), Pod Identity, and nginx test workload
  • test/otel/standard/: 145 integration tests covering cadvisor, node_exporter, kubeletstats, KSM, control plane, dedup, resolution, host validation, and cross-source label correctness
  • generator/test_case_generator.go: Added to eks_daemon matrix

Validated in CI: 145/145 PASS via existing EKSIntegrationTest workflow.

Workflow Run: https://github.com/aws/amazon-cloudwatch-agent/actions/runs/24726297813

@spanaik spanaik requested a review from a team as a code owner April 21, 2026 14:32
@spanaik spanaik force-pushed the otel/standard-cluster-clean branch 2 times, most recently from d27798c to 1b0028f Compare April 21, 2026 14:35
@louisall
Copy link
Copy Markdown
Contributor

I know these tests run in us-west-2 but could be worth putting a region guard. If anyone adds a different region to the matrix that doesn't support metrics v2 it'll all just silently fail

Comment thread util/otelmetrics/query_cache.go Outdated
Comment thread test/otel/standard/setup_test.go Outdated
Comment thread test/otel/standard/assertions_test.go Outdated
@spanaik spanaik force-pushed the otel/standard-cluster-clean branch 3 times, most recently from 089afbe to d9cba8f Compare April 22, 2026 10:04
@spanaik spanaik requested a review from louisall April 22, 2026 10:51
Comment thread util/otelmetrics/client.go Outdated
Comment thread terraform/eks/daemon/otel/main.tf Outdated
Comment thread terraform/eks/daemon/otel/variables.tf Outdated
Comment thread terraform/eks/daemon/otel/variables.tf Outdated
@spanaik spanaik force-pushed the otel/standard-cluster-clean branch 2 times, most recently from d6e91d4 to 6af180b Compare April 23, 2026 15:49
@spanaik spanaik requested a review from sky333999 April 26, 2026 09:07
Comment thread terraform/eks/daemon/otel-multi-efa/helm-charts Outdated
Comment thread terraform/eks/daemon/otel/main.tf Outdated
}
qc.mu.RUnlock()

qc.mu.Lock()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 :

  The write lock is held for the entire miss path, which includes the blocking HTTP calls. While the tests run sequentially today (no t.Parallel()), this design is fragile: if anyone enables t.Parallel() on these subtests, the cache becomes a single-flight serialization point across all metrics, not just per metric.
  The intra-metric fan-out via qc.sem ends up executing inside the held lock anyway, so the semaphore's inter-metric budget is effectively unused.

  Fix options:

  - Double-check the miss path with per-metric sentinels (e.g. x/sync/singleflight) so different metric names proceed concurrently.
  - Or release the lock after adding a placeholder entry, do the HTTP work, then re-acquire to store.

  Either approach keeps the RLock fast-path on cache hits intact.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 :

Once ^ is addressed, adding t.Parallel() at the outer TestXxx level would significantly cut wall-clock time — almost every test is cache-backed reads of the same metric set, and the ground-truth client is already guarded by sync.Once.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, lemme know if this works

Comment thread util/otelmetrics/models.go Outdated
Comment thread test/otel/standard/assertions_test.go Outdated
@spanaik spanaik force-pushed the otel/standard-cluster-clean branch 8 times, most recently from b3f595c to a129439 Compare April 28, 2026 10:53
  Port OTEL integration tests into the agent test framework for the
  standard EKS cluster (2x t3.medium). Tests validate metric correctness
  by querying the CloudWatch PromQL API with SigV4 auth and cross-
  validating against K8s API ground truth.

  Components:
  - util/otelmetrics/: Shared PromQL client library with SigV4 signing,
    query cache, ZIP-0006 label parsing, and rate limiting
  - terraform/eks/daemon/otel/: Ephemeral EKS cluster with Helm chart
    (release-6.1.0), Pod Identity, and nginx test workload
  - test/otel/standard/: 145 integration tests covering cadvisor,
    node_exporter, kubeletstats, KSM, control plane, dedup, resolution,
    host validation, and cross-source label correctness
  - generator/test_case_generator.go: Added to eks_daemon matrix

  Validated in CI: 145/145 PASS via existing EKSIntegrationTest workflow.
@spanaik spanaik force-pushed the otel/standard-cluster-clean branch from a129439 to e39240b Compare April 28, 2026 11:02
@spanaik spanaik requested a review from sky333999 April 28, 2026 15:19
@sky333999 sky333999 merged commit d7e0a03 into aws:main Apr 29, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants