Feature/efa ec2 integration test #688

Merged
mitali-salvi merged 8 commits into main from feature/efa-ec2-integration-test on May 29, 2026
@mitali-salvi commented May 8, 2026

Summary

Adds EFA EC2 integration test with Terraform module and test matrix entry.

Changes

  • EFA EC2 integration test (test/efa_ec2/) — validates EFA metrics are published to CloudWatch with correct short dimension names (device, port, eniId) and InstanceId from append_dimensions
  • Terraform module — provisions c5n.9xlarge instance with EFA-enabled ENI in a placement group
  • Test matrix entry (generator/resources/ec2_efa_test_matrix.json) — defines AMI and instance config for CI
  • Dimension name validation — verifies the transform processor correctly renames OTel-style attributes (aws.efa.device, aws.efa.port, aws.efa.eni.id) to CW-friendly short names
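
The renaming described in the last bullet can be pictured as a simple key mapping. The following is a minimal sketch, not the agent's actual transform processor: `renameDimensions` and the sample attribute values are hypothetical, and only the three name pairs come from this PR.

```go
package main

import "fmt"

// shortDimensionNames maps the OTel-style attribute names listed in this PR
// to the short CloudWatch dimension names the integration test expects.
var shortDimensionNames = map[string]string{
	"aws.efa.device": "device",
	"aws.efa.port":   "port",
	"aws.efa.eni.id": "eniId",
}

// renameDimensions is a hypothetical helper. It returns a copy of attrs in
// which known OTel-style keys are replaced by their short names; all other
// keys (for example InstanceId) pass through unchanged.
func renameDimensions(attrs map[string]string) map[string]string {
	out := make(map[string]string, len(attrs))
	for k, v := range attrs {
		if short, ok := shortDimensionNames[k]; ok {
			k = short
		}
		out[k] = v
	}
	return out
}

func main() {
	// All values below are placeholders for illustration only.
	dims := renameDimensions(map[string]string{
		"aws.efa.device": "rdmap0s6",
		"aws.efa.port":   "1",
		"aws.efa.eni.id": "eni-0123456789abcdef0",
		"InstanceId":     "i-0123456789abcdef0",
	})
	fmt.Println(dims["device"], dims["port"], dims["eniId"], dims["InstanceId"])
}
```

The lookup is case-sensitive, so a dimension published as `eni_id` or `eniid` would not satisfy a check for `eniId`.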

Related PRs

Testing

- test/efa_ec2/: Go integration test following nvidia_gpu EC2 pattern
  - Copies agent config, starts agent, sleeps 2min, stops, validates 9 EFA metrics via CloudWatch API
- terraform/ec2/efa/: EFA-specific Terraform module
  - EFA network interface (interface_type=efa), cluster placement group
  - Self-referencing security group for EFA OS-bypass
  - EIP for SSH access, EFA driver installation via aws-efa-installer
  - Hard failure if EFA device not detected after driver install
- Add ec2_efa to testTypeToTestConfig with testDir=./test/efa_ec2, terraformDir=terraform/ec2/efa
- Create ec2_efa_test_matrix.json with AL2023 on c5n.9xlarge
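
For illustration, the `testTypeToTestConfig` registration mentioned above might look roughly like the sketch below. The `testConfig` struct shape and the `lookupTestConfig` helper are assumptions made for this example; only the map name, the `ec2_efa` key, and the two directory values come from the commit message.

```go
package main

import "fmt"

// testConfig is a hypothetical shape for one entry; the real generator's
// struct may carry more fields.
type testConfig struct {
	testDir      string
	terraformDir string
}

// testTypeToTestConfig associates a test type with the Go test directory
// and the Terraform module used to provision its infrastructure.
var testTypeToTestConfig = map[string]testConfig{
	"ec2_efa": {
		testDir:      "./test/efa_ec2",
		terraformDir: "terraform/ec2/efa",
	},
}

// lookupTestConfig is a hypothetical helper that fails loudly on an
// unregistered test type instead of returning a zero-value config.
func lookupTestConfig(testType string) (testConfig, error) {
	cfg, ok := testTypeToTestConfig[testType]
	if !ok {
		return testConfig{}, fmt.Errorf("unknown test type %q", testType)
	}
	return cfg, nil
}

func main() {
	cfg, err := lookupTestConfig("ec2_efa")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(cfg.testDir, cfg.terraformDir)
}
```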
@mitali-salvi requested a review from a team as a code owner May 8, 2026 17:54
Verify that EFA metrics use short CW-friendly dimension names (device,
port, eni_id) instead of OTel-style dotted names (aws.efa.device,
aws.efa.port, aws.efa.eni.id).
@mitali-salvi force-pushed the feature/efa-ec2-integration-test branch from 5754307 to c541fa4 Compare May 8, 2026 19:48
mitali-salvi and others added 3 commits May 27, 2026 12:58
The EFA kernel module cannot be reloaded without a reboot.
Move driver installation to user_data so it runs during boot,
then reboot to load the module. The remote-exec provisioner
connects after reboot and verifies EFA is available.

- instance_initiated_shutdown_behavior: terminate -> stop
- EFA install moved to user_data with -n (skip ping test)
- Reboot at end of user_data to load kernel module
- Setup provisioner now only verifies EFA + clones + installs agent
The remote-exec provisioner dies when cloud-init reboots the
instance mid-session. Fix by using the repo's existing pattern:

1. integration_test_setup: install EFA driver, trigger shutdown -r
2. integration_test_reboot_wait: local-exec sleep 60s
3. integration_test_post_reboot: reconnect, verify EFA, clone, install
4. integration_test_run: execute tests

Also reverts user_data back to SSH hardening only (EFA install
stays in remote-exec where cloud-init status --wait ensures
the instance is ready before starting).
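
The PR waits a fixed 60 seconds in a Terraform local-exec step before reconnecting. As a point of comparison only, the same "wait, then reconnect" idea can be written as a poll-until-ready loop. This is a sketch of an alternative, not what the PR implements; `waitUntil`, `sshReachable`, and `errNotReady` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"time"
)

// errNotReady is a hypothetical sentinel used by the demo probe in main.
var errNotReady = errors.New("not ready")

// waitUntil calls probe, sleeping interval between attempts, until probe
// returns nil or timeout has elapsed. On timeout it wraps the last error.
func waitUntil(probe func() error, interval, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		err := probe()
		if err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("gave up after %s: %w", timeout, err)
		}
		time.Sleep(interval)
	}
}

// sshReachable returns a probe that succeeds once a TCP connection to addr
// (for example "host:22") can be opened. It only checks reachability; it
// does not authenticate or verify that EFA is available.
func sshReachable(addr string) func() error {
	return func() error {
		conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
		if err != nil {
			return err
		}
		return conn.Close()
	}
}

func main() {
	// Demo with a fake probe that becomes ready on the third attempt, so
	// the example runs without network access.
	attempts := 0
	probe := func() error {
		attempts++
		if attempts < 3 {
			return errNotReady
		}
		return nil
	}
	err := waitUntil(probe, 10*time.Millisecond, time.Second)
	fmt.Println("ready:", err == nil, "attempts:", attempts)
}
```

Against a real instance the probe would be something like `sshReachable("<instance-ip>:22")` with a timeout of a few minutes. A fixed sleep is simpler to express in Terraform, while polling finishes as soon as the instance is back.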
Comment thread test/efa_ec2/efa_ec2_unix.go Outdated

var (
	expectedEfaEC2LinuxMetrics = []string{
		"efa_tx_bytes", "efa_rx_bytes", "efa_tx_pkts", "efa_rx_pkts", "efa_rx_dropped",
		"efa_rdma_read_bytes", "efa_rdma_write_bytes", "efa_send_bytes", "efa_recv_bytes",
	}
	expectedEfaDimensionNames = []string{"device", "port", "eniId"}
)
Contributor

The PR description says the dimension is eni_id, but in the code it's eniId.

Contributor Author

Thanks for catching the discrepancy — the code is correct. The EFA receiver uses eniId (camelCase) as the dimension name. I'll fix the PR description to match.

Comment thread test/efa_ec2/efa_ec2_unix.go Outdated
- Replace errors.New(fmt.Sprintf(...)) with fmt.Errorf
- Return error from ValidateMetric in the metrics loop
- Remove unused errors import
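
The cleanup listed above follows a common Go pattern, sketched here under assumptions: `validateMetric` is a hypothetical stand-in for the test's `ValidateMetric` call, and the `published` map replaces the real CloudWatch query.

```go
package main

import "fmt"

// validateMetric is a hypothetical stand-in for the test's ValidateMetric.
// It reports an error when the named metric is absent from published.
func validateMetric(name string, published map[string]bool) error {
	if !published[name] {
		// fmt.Errorf formats and constructs the error in one call, which
		// replaces errors.New(fmt.Sprintf(...)) and drops the errors import.
		return fmt.Errorf("metric %q not found", name)
	}
	return nil
}

// validateAll returns the first validation error from the loop instead of
// discarding it, so a missing metric fails the caller.
func validateAll(names []string, published map[string]bool) error {
	for _, name := range names {
		if err := validateMetric(name, published); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	published := map[string]bool{"efa_tx_bytes": true}
	err := validateAll([]string{"efa_tx_bytes", "efa_rx_bytes"}, published)
	fmt.Println(err) // prints: metric "efa_rx_bytes" not found
}
```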
@mitali-salvi merged commit f46f66b into main May 29, 2026
6 checks passed
@mitali-salvi deleted the feature/efa-ec2-integration-test branch May 29, 2026 17:59