DAOS-17639 test: Detect all server fabric_ifaces #16913
Conversation
Launch.py will detect all of the fastest interfaces common to all the specified server hosts and use them to populate the engine fabric_iface entries if no overrides are provided in the test yaml.
Skip-unit-tests: true
Skip-fault-injection-test: true
Test-tag: IorSmall
Signed-off-by: Phil Henderson <phillip.henderson@hpe.com>
Ticket title is 'Support newly named ib devices for functional tests'
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/4/execution/node/557/log
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/3/execution/node/805/log
Skip-unit-tests: true Skip-fault-injection-test: true Test-tag: IorSmall Signed-off-by: Phil Henderson <phillip.henderson@hpe.com>
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/6/execution/node/747/log
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/5/execution/node/805/log
Skip-unit-tests: true Skip-fault-injection-test: true Test-tag: IorSmall Signed-off-by: Phil Henderson <phillip.henderson@hpe.com>
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/7/execution/node/805/log
Skip-unit-tests: true Skip-fault-injection-test: true Test-tag: IorSmall Signed-off-by: Phil Henderson <phillip.henderson@hpe.com>
Skip-unit-tests: true Skip-fault-injection-test: true Signed-off-by: Phil Henderson <phillip.henderson@hpe.com>
Test stage Functional on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/10/execution/node/665/log
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/11/execution/node/557/log
Skip-unit-tests: true Skip-fault-injection-test: true Signed-off-by: Phil Henderson <phillip.henderson@hpe.com>
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/14/execution/node/895/log
Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/14/execution/node/954/log
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/14/execution/node/909/log
Failures seen in https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16913/14/testReport/ are known issues or should not be related to PR changes - in all cases the servers started successfully:
JohnMalmberg
left a comment
I have some concerns about how robust this solution is, as on the Ice Lake systems we usually run two HDR adapter speeds.
And we currently run two OPA adapter speeds on at least one of our Omni-Path clusters.
targets: 4
nr_xs_helpers: 0
fabric_iface_port: 31416
fabric_iface: eth0
Is this a default template value?
There is nothing guaranteeing that an eth0 device will be active right now if a system has multiple ethernet adapters, which is the case for many of our systems.
This is documented in the Red Hat EL-6 manuals, and as of the EL-8 release several systems in the lab have been observed to alternate at random which adapter is eth0 at each boot, something not seen on the older distro versions.
You have to look up the interface properties, such as whether it is online (carrier detected) and its IP subnet, in order to find interfaces on a common network.
The eth and ib interface names have been deprecated from Linux support and are only available by specifying a kernel option to enable them. In the future we want to stop setting that option so that we are prepared for when the kernels stop supporting it.
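As an illustration of that kind of lookup, here is a minimal local sketch that walks /sys/class/net and keeps only interfaces reporting an up operstate with a detected carrier. The helper name and filtering are illustrative, not the launch.py code; per the logs quoted later in this thread, the PR runs a comparable operstate check remotely over the server hosts.

import glob
import os

def active_interfaces():
    """Return local interfaces whose operstate is 'up' and whose carrier is detected."""
    active = []
    for operstate_path in glob.glob("/sys/class/net/*/operstate"):
        iface_dir = os.path.dirname(operstate_path)
        iface = os.path.basename(iface_dir)
        if iface == "lo":
            # Skip the loopback device
            continue
        try:
            with open(operstate_path, encoding="utf-8") as handle:
                state = handle.read().strip()
            with open(os.path.join(iface_dir, "carrier"), encoding="utf-8") as handle:
                carrier = handle.read().strip()
        except OSError:
            # Reading carrier can fail when the link is administratively down
            continue
        if state == "up" and carrier == "1":
            active.append(iface)
    return sorted(active)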
This is what currently works with our VM tests/clusters. This particular test defines a dual engine config where we only have one active interface on the VM host. The simplest way to handle this outlier of a test is to manually set the same expected CI interface for both engines in this and the src/tests/ftest/control/dmg_system_start.py test.
I understand this may not always be the case for our VMs, but for now it will work. The alternative approach - which was considered - involves re-working the entire test yaml replacement code to support keyword replacement. This would be a larger undertaking at this time.
I've adjusted the logic used to determine the default fabric_iface value for each engine to reuse the last known interface from the DAOS_TEST_FABRIC_IFACE env var when the index exceeds the list. This means if DAOS_TEST_FABRIC_IFACE=foo0 (or whatever is discovered on the VM server hosts) and the test requests dual engines, then engine 0 and engine 1 will both use fabric_iface: foo0. If the DAOS_TEST_FABRIC_IFACE is not set, then it will fall back to using eth0 and eth1, respectively.
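A condensed, standalone sketch of the fallback just described (hypothetical helper name; not the exact launch.py code, which is reviewed in a snippet further below):

import os

def default_fabric_iface(index):
    """Pick the default fabric_iface for the engine at this index."""
    entries = list(filter(None, os.environ.get("DAOS_TEST_FABRIC_IFACE", "").split(",")))
    if not entries:
        # No env var set: fall back to eth0, eth1, ... per engine index
        return f"eth{index}"
    # Reuse the last known interface when the engine index exceeds the list
    return entries[index] if index < len(entries) else entries[-1]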
try:
    # Select the fastest active interface available by sorting the speed
    interface = available_interfaces[sorted(available_interfaces)[-1]]
    fastest_speed = sorted(interfaces_at_speed)[-1]
On the Mellanox HDR network, the test controller node will usually have a reported speed of 100G for the ConnectX-[5/6] interface and on Ice Lake systems the rest of the cluster will have a reported speed of 200G for the ConnectX-6 interface.
On the Omni-Path network, the test controller may have a reported speed of 56G and the rest of the cluster will have a reported speed of 100G.
On VMs, the reported network speed may be pure fiction, and may not even have a value to look up or parse as a numeric value.
In this PR we are only checking the speeds of the adapters on the server hosts - excluding the launch node.
For VMs we have (currently and historically) only matched one device, so even when the speed is fictional (e.g. -1) we still get the expected device:
2025/10/02 07:28:23 DEBUG _default_interface: Detecting network devices on brd-103vm[02-09] - DAOS_TEST_FABRIC_IFACE not set
2025/10/02 07:28:23 DEBUG run_remote: Running on brd-103vm[02-09] with a 120 second timeout: grep -l 'up' /sys/class/net/*/operstate | grep -Ev '/(lo|bond)/' | sort
2025/10/02 07:28:24 DEBUG log_result_data: brd-103vm[02-09] (rc=0): /sys/class/net/eth0/operstate
2025/10/02 07:28:24 INFO get_common_interfaces: Active network interfaces detected:
2025/10/02 07:28:24 INFO get_common_interfaces: - eth0 on brd-103vm[02-09] (Common=True)
2025/10/02 07:28:24 DEBUG run_remote: Running on brd-103vm[02-09] with a 120 second timeout: cat /sys/class/net/eth0/speed
2025/10/02 07:28:24 DEBUG log_result_data: brd-103vm[02-09] (rc=0): -1
2025/10/02 07:28:24 INFO get_fastest_interfaces: Active network interface speeds on brd-103vm[02-09]:
2025/10/02 07:28:24 INFO get_fastest_interfaces: - speed: -1 => ['eth0']
2025/10/02 07:28:24 INFO get_fastest_interfaces: Fastest interfaces detected on brd-103vm[02-09]: ['eth0']
2025/10/02 07:28:24 DEBUG _default_interface: Found interface(s): eth0
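The speed read from /sys/class/net/<iface>/speed can therefore be -1 or otherwise unparseable on VMs. A small grouping helper along these lines (hypothetical name and -1 fallback, not the exact launch.py code) still yields a single usable group in that case:

def group_interfaces_by_speed(speed_by_interface):
    """Group interface names by their reported speed, treating unknown speeds as -1."""
    interfaces_at_speed = {}
    for interface, raw_speed in speed_by_interface.items():
        try:
            speed = int(raw_speed)
        except (TypeError, ValueError):
            # VMs may report nothing usable; bucket those interfaces together as -1
            speed = -1
        interfaces_at_speed.setdefault(speed, []).append(interface)
    return interfaces_at_speed

# Example matching the VM log above: {"eth0": "-1"} -> {-1: ["eth0"]}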
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/15/execution/node/553/log
for speed in sorted(interfaces_at_speed):
    logger.info(" - speed: %6s => %s", speed, sorted(interfaces_at_speed[speed]))
Maybe we can just sort the dictionary once upfront so we don't have to keep calling sorted?
for speed in sorted(interfaces_at_speed):
    logger.info(" - speed: %6s => %s", speed, sorted(interfaces_at_speed[speed]))

Suggested change:

interfaces_at_speed = {k: interfaces_at_speed[k] for k in sorted(interfaces_at_speed)}
for speed, interfaces in interfaces_at_speed.items():
    logger.info(" - speed: %6s => %s", speed, interfaces)
fastest_speed = sorted(interfaces_at_speed)[-1]
fastest_interfaces = interfaces_at_speed[fastest_speed]
And then this also becomes
fastest_speed = sorted(interfaces_at_speed)[-1]
fastest_interfaces = interfaces_at_speed[fastest_speed]

Suggested change:

fastest_speed, fastest_interfaces = list(interfaces_at_speed.items())[-1]
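Putting the two suggestions together, a self-contained sketch (with purely illustrative interface data) shows why sorting the dictionary once is enough: dicts preserve insertion order in Python 3.7+, so the highest speed ends up last.

import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Illustrative data only: interface names grouped by reported speed (Mb/s)
interfaces_at_speed = {100000: ["ib1"], -1: ["eth0"], 200000: ["ib0", "ib2"]}

# Sort once so later lookups can rely on the ordering
interfaces_at_speed = {k: interfaces_at_speed[k] for k in sorted(interfaces_at_speed)}
for speed, interfaces in interfaces_at_speed.items():
    logger.info(" - speed: %6s => %s", speed, interfaces)

# The fastest group is then simply the last item
fastest_speed, fastest_interfaces = list(interfaces_at_speed.items())[-1]
# fastest_speed == 200000, fastest_interfaces == ["ib0", "ib2"]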
try:
    _defaults = os.environ.get("DAOS_TEST_FABRIC_IFACE").split(",")
    default_interface = list(filter(None, _defaults))[index]
except (AttributeError, IndexError):
    default_interface = f"eth{index}"
default_port = int(os.environ.get("D_PORT", 31317 + (100 * index)))
So if someone sets, for example, DAOS_TEST_FABRIC_IFACE=ib0 and the test is dual engine, we'll end up trying to use ib0 and eth0 by default? Maybe if you set DAOS_TEST_FABRIC_IFACE it should be expected that you pass an interface for each engine.
If the user attempts to run a dual engine test and only sets DAOS_TEST_FABRIC_IFACE=ib0 we will run the test with a server config using:
engines:
  - fabric_iface: ib0
    fabric_iface_port: 31317
  - fabric_iface: eth1
    fabric_iface_port: 31417
So, yes, they should have set DAOS_TEST_FABRIC_IFACE=ib0,ib1 in this case.
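As a quick check of how the reviewed defaults expand when the env var carries one entry per engine (standalone snippet for illustration; assumes D_PORT is not set):

import os

os.environ["DAOS_TEST_FABRIC_IFACE"] = "ib0,ib1"
for index in range(2):
    try:
        _defaults = os.environ.get("DAOS_TEST_FABRIC_IFACE").split(",")
        default_interface = list(filter(None, _defaults))[index]
    except (AttributeError, IndexError):
        default_interface = f"eth{index}"
    default_port = int(os.environ.get("D_PORT", 31317 + (100 * index)))
    print(index, default_interface, default_port)
# Prints: 0 ib0 31317, then 1 ib1 31417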
… hendersp/DAOS-17639 Skip-unit-tests: true Skip-fault-injection-test: true Signed-off-by: Phil Henderson <phillip.henderson@hpe.com>
afcc9d8
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/25/execution/node/925/log
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/25/execution/node/880/log
Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/25/execution/node/1066/log
Skip-unit-tests: true Skip-fault-injection-test: true Signed-off-by: Phil Henderson <phillip.henderson@hpe.com>
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/26/execution/node/917/log
Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/26/execution/node/948/log
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/26/execution/node/927/log
Skip-unit-tests: true Skip-fault-injection-test: true Signed-off-by: Phil Henderson <phillip.henderson@hpe.com>
Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/27/execution/node/948/log
JohnMalmberg
left a comment
I have not gotten this to work with the new udev rules that I am currently testing.
Fixing that should be a minor change that can be done after this PR, so I do not want to have that be a blocker.
Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16913/27/testReport/
Skip-unit-tests: true Skip-fault-injection-test: true Test-tag: CatRecovCoreTest IorInterceptMultiClient Signed-off-by: Phil Henderson <phillip.henderson@hpe.com>
988491b
The failures in https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16913/27/ are known issues:
https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16913/28/ passed running the two test yaml files updated due to merge conflicts after #16913 (comment) - which included the fix for the
Per #16913 Test-tag: test_rebuild_interactive Test-repeat: 3 Skip-func-hw-test-large: false Skip-unit-tests: true Skip-fault-injection-test: true Signed-off-by: Dalton Bohning <dalton.bohning@hpe.com>
Launch.py will detect all of the fastest interfaces common to all the specified server hosts and use them to populate the engine fabric_iface entries if no overrides are provided in the test yaml.
Skip-unit-tests: true
Skip-fault-injection-test: true
Test-tag: IorSmall