Feature/efa ec2 integration test #688
Merged
- test/efa_ec2/: Go integration test following nvidia_gpu EC2 pattern
  - Copies agent config, starts agent, sleeps 2min, stops, validates 9 EFA metrics via CloudWatch API
- terraform/ec2/efa/: EFA-specific Terraform module
  - EFA network interface (interface_type=efa), cluster placement group
  - Self-referencing security group for EFA OS-bypass
  - EIP for SSH access, EFA driver installation via aws-efa-installer
  - Hard failure if EFA device not detected after driver install

- Add ec2_efa to testTypeToTestConfig with testDir=./test/efa_ec2, terraformDir=terraform/ec2/efa
- Create ec2_efa_test_matrix.json with AL2023 on c5n.9xlarge
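The test flow described above (start the agent, wait, stop it, then check each expected metric) can be sketched roughly as follows. This is only an illustration under stated assumptions: the `agent` interface, `metricFetcher` type, `runEfaTest` function, and `fakeAgent` are hypothetical names, not the repository's helpers, and the CloudWatch lookup is replaced by an in-memory stub. The metric names are the nine listed in the test file.

```go
package main

import (
	"fmt"
	"time"
)

// The nine EFA metric names listed in the test file under review.
var expectedMetrics = []string{
	"efa_tx_bytes", "efa_rx_bytes", "efa_tx_pkts", "efa_rx_pkts", "efa_rx_dropped",
	"efa_rdma_read_bytes", "efa_rdma_write_bytes", "efa_send_bytes", "efa_recv_bytes",
}

// agent is a hypothetical abstraction over starting and stopping the agent.
type agent interface {
	Start() error
	Stop() error
}

// metricFetcher is a hypothetical stand-in for a CloudWatch query; it returns
// the number of datapoints found for a metric name.
type metricFetcher func(name string) (int, error)

// runEfaTest starts the agent, waits, stops it, and fails on the first
// expected metric that has no datapoints.
func runEfaTest(a agent, fetch metricFetcher, wait time.Duration) error {
	if err := a.Start(); err != nil {
		return fmt.Errorf("start agent: %w", err)
	}
	time.Sleep(wait) // the real test sleeps for 2 minutes
	if err := a.Stop(); err != nil {
		return fmt.Errorf("stop agent: %w", err)
	}
	for _, name := range expectedMetrics {
		n, err := fetch(name)
		if err != nil {
			return fmt.Errorf("fetch %s: %w", name, err)
		}
		if n == 0 {
			return fmt.Errorf("metric %s has no datapoints", name)
		}
	}
	return nil
}

// fakeAgent does nothing; it lets the sketch run without a real agent.
type fakeAgent struct{}

func (fakeAgent) Start() error { return nil }
func (fakeAgent) Stop() error  { return nil }

func main() {
	// Stub fetcher that pretends every metric has one datapoint.
	fetch := func(string) (int, error) { return 1, nil }
	// A short wait keeps the sketch fast.
	fmt.Println(runEfaTest(fakeAgent{}, fetch, 10*time.Millisecond))
}
```

Passing the fetcher as a function keeps the validation loop testable without AWS credentials.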
Verify that EFA metrics use short CW-friendly dimension names (device, port, eni_id) instead of OTel-style dotted names (aws.efa.device, aws.efa.port, aws.efa.eni.id).
5754307 to c541fa4
The EFA kernel module cannot be reloaded without a reboot. Move driver installation to user_data so it runs during boot, then reboot to load the module. The remote-exec provisioner connects after reboot and verifies EFA is available.
- instance_initiated_shutdown_behavior: terminate -> stop
- EFA install moved to user_data with -n (skip ping test)
- Reboot at end of user_data to load kernel module
- Setup provisioner now only verifies EFA + clones + installs agent
The remote-exec provisioner dies when cloud-init reboots the instance mid-session. Fix by using the repo's existing pattern:
1. integration_test_setup: install EFA driver, trigger shutdown -r
2. integration_test_reboot_wait: local-exec sleep 60s
3. integration_test_post_reboot: reconnect, verify EFA, clone, install
4. integration_test_run: execute tests

Also reverts user_data back to SSH hardening only (EFA install stays in remote-exec where cloud-init status --wait ensures the instance is ready before starting).
JayPolanco
reviewed
May 28, 2026
JayPolanco
reviewed
May 28, 2026
var (
	expectedEfaEC2LinuxMetrics = []string{"efa_tx_bytes", "efa_rx_bytes", "efa_tx_pkts", "efa_rx_pkts", "efa_rx_dropped", "efa_rdma_read_bytes", "efa_rdma_write_bytes", "efa_send_bytes", "efa_recv_bytes"}
	expectedEfaDimensionNames = []string{"device", "port", "eniId"}
Contributor
The PR description says the dimension is eni_id, but the code uses eniId.
Contributor
Author
Thanks for catching the discrepancy — the code is correct. The EFA receiver uses eniId (camelCase) as the dimension name. I'll fix the PR description to match.
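A check along the lines discussed in this thread might look like the sketch below. It assumes the expected names are `device`, `port`, and `eniId` as in the diff, and that any name starting with `aws.efa.` counts as an OTel-style dotted name. The `checkDimensionNames` helper is hypothetical and is not taken from the test code.

```go
package main

import (
	"fmt"
	"strings"
)

// expectedEfaDimensionNames mirrors the list in the diff under review.
var expectedEfaDimensionNames = []string{"device", "port", "eniId"}

// checkDimensionNames is a hypothetical helper: it rejects OTel-style dotted
// names and reports the first expected short name that is missing.
func checkDimensionNames(got []string) error {
	present := make(map[string]bool, len(got))
	for _, name := range got {
		if strings.HasPrefix(name, "aws.efa.") {
			return fmt.Errorf("dimension %q still uses the OTel-style dotted name", name)
		}
		present[name] = true
	}
	for _, want := range expectedEfaDimensionNames {
		if !present[want] {
			return fmt.Errorf("missing expected dimension %q", want)
		}
	}
	return nil
}

func main() {
	// Extra dimensions such as InstanceId are allowed.
	fmt.Println(checkDimensionNames([]string{"device", "port", "eniId", "InstanceId"}))
	// The snake_case spelling is not the name the receiver emits.
	fmt.Println(checkDimensionNames([]string{"device", "port", "eni_id"}))
	// Dotted names are rejected outright.
	fmt.Println(checkDimensionNames([]string{"aws.efa.device"}))
}
```

Because the match is exact, a mismatch like eni_id versus eniId surfaces as a missing dimension rather than passing silently.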
JayPolanco
reviewed
May 28, 2026
- Replace errors.New(fmt.Sprintf(...)) with fmt.Errorf
- Return error from ValidateMetric in the metrics loop
- Remove unused errors import
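The pattern behind these fixes can be illustrated with a small sketch. The `validateMetric` and `validateAll` functions and the datapoint map are hypothetical stand-ins, not the actual ValidateMetric helper. The point is only that `fmt.Errorf` formats and constructs the error in one call, and that the loop propagates the error instead of discarding it.

```go
package main

import "fmt"

// validateMetric is a hypothetical stand-in for the test's ValidateMetric
// helper; it fails when a metric has no recorded datapoints.
func validateMetric(name string, datapoints map[string]int) error {
	if datapoints[name] == 0 {
		// Previously: errors.New(fmt.Sprintf("metric %q has no datapoints", name))
		return fmt.Errorf("metric %q has no datapoints", name)
	}
	return nil
}

// validateAll returns the first validation error rather than ignoring it.
func validateAll(names []string, datapoints map[string]int) error {
	for _, name := range names {
		if err := validateMetric(name, datapoints); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	seen := map[string]int{"efa_tx_bytes": 3}
	err := validateAll([]string{"efa_tx_bytes", "efa_rx_bytes"}, seen)
	fmt.Println(err) // prints: metric "efa_rx_bytes" has no datapoints
}
```

With no remaining call to `errors.New`, the `errors` import is unused, and Go refuses to compile a file with an unused import, which is why the third item removes it.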
JayPolanco
approved these changes
May 29, 2026
Summary
Adds EFA EC2 integration test with Terraform module and test matrix entry.
Changes
- test/efa_ec2/: validates EFA metrics are published to CloudWatch with correct short dimension names (device, port, eniId) and InstanceId from append_dimensions
- generator/resources/ec2_efa_test_matrix.json: defines AMI and instance config for CI
- […] (aws.efa.device, aws.efa.port, aws.efa.eni.id) to CW-friendly short names
Related PRs
Testing