
Conversation

@phender phender commented Sep 25, 2025

Launch.py will detect all of the fastest interfaces common to all the specified server hosts and use them to populate the engine fabric_iface entries if no overrides are provided in the test yaml.
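The detection described above can be sketched as follows. This is a minimal, hypothetical standalone version — the function name `fastest_common_interfaces` and the data shape are assumptions, not the actual launch.py helpers — which intersects the active interfaces across hosts and keeps those tied for the highest speed:

```python
def fastest_common_interfaces(interfaces_by_host):
    """Pick the fastest interfaces present on every server host.

    interfaces_by_host maps a hostname to {interface: speed} for its
    active interfaces, e.g. {"server1": {"ib0": 200, "eth0": 10}, ...}.
    """
    # Interfaces common to all hosts
    common = set.intersection(
        *(set(ifaces) for ifaces in interfaces_by_host.values()))
    if not common:
        return []
    # For each common interface, use its slowest reported speed across hosts
    speeds = {
        iface: min(host_ifaces[iface] for host_ifaces in interfaces_by_host.values())
        for iface in common
    }
    fastest = max(speeds.values())
    return sorted(iface for iface, speed in speeds.items() if speed == fastest)


hosts = {
    "server1": {"ib0": 200, "ib1": 200, "eth0": 10},
    "server2": {"ib0": 200, "ib1": 200, "eth0": 10},
}
print(fastest_common_interfaces(hosts))  # ['ib0', 'ib1']
```

All matching fastest interfaces are returned so a multi-engine config can assign one per engine.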

Skip-unit-tests: true
Skip-fault-injection-test: true
Test-tag: IorSmall

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@phender phender requested review from a team as code owners September 25, 2025 23:13

github-actions bot commented Sep 25, 2025

Ticket title is 'Support newly named ib devices for functional tests'
Status is 'In Review'
Labels: 'testp1'
https://daosio.atlassian.net/browse/DAOS-17639

@phender phender marked this pull request as draft September 25, 2025 23:14
@daosbuild3

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/4/execution/node/557/log

@daosbuild3

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/3/execution/node/805/log

@daosbuild3

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/6/execution/node/747/log

@daosbuild3

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/5/execution/node/805/log

@daosbuild3

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/7/execution/node/805/log

@daosbuild3

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/11/execution/node/557/log

@daosbuild3

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/14/execution/node/895/log

@daosbuild3

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/14/execution/node/954/log

@daosbuild3

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/14/execution/node/909/log


phender commented Oct 6, 2025

Failures seen in https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16913/14/testReport/ are known issues or are unrelated to the PR changes; in all cases the servers started successfully.

@phender phender marked this pull request as ready for review October 6, 2025 13:12

@JohnMalmberg JohnMalmberg left a comment


I have some concerns about how robust this solution is: on the Ice Lake systems we usually run two HDR adapter speeds, and we currently run two OPA adapter speeds on at least one of our Omni-Path clusters.

targets: 4
nr_xs_helpers: 0
fabric_iface_port: 31416
fabric_iface: eth0
Contributor


Is this a default template value?

There is nothing guaranteeing that an eth0 device will be active right now if a system has multiple ethernet adapters, which is the case for many of our systems.

This is documented in the Red Hat EL-6 manuals, and as of the EL-8 release several systems in the lab have been observed to randomly alternate which adapter is eth0 at each boot, which was not seen on older distro versions.

You have to look up the interface properties, such as whether it is online (carrier detected) and its IP subnet, in order to find interfaces on a common network.

The eth and ib interface names have been deprecated in Linux and are only available by specifying a kernel option to enable them. In the future we want to stop setting that option so that we are prepared for when kernels stop supporting it.
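A minimal sketch of checking interface liveness instead of trusting a name like eth0, using the standard Linux sysfs attributes; the function name `active_interfaces` is hypothetical and assumes the usual `/sys/class/net` layout:

```python
import os


def active_interfaces(net_root="/sys/class/net"):
    """Return interface names whose operstate is 'up', skipping lo and bond devices."""
    active = []
    for iface in sorted(os.listdir(net_root)):
        if iface == "lo" or iface.startswith("bond"):
            continue
        try:
            # operstate reads 'up', 'down', 'unknown', etc.
            with open(os.path.join(net_root, iface, "operstate")) as handle:
                state = handle.read().strip()
        except OSError:
            continue
        if state == "up":
            active.append(iface)
    return active
```

A fuller version would also compare IP subnets across hosts, as suggested above, to confirm the interfaces share a common network.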

Contributor Author


This is what currently works with our VM tests/clusters. This particular test defines a dual-engine config where we only have one active interface on the VM host. The simplest way to handle this outlier of a test is to manually set the same expected CI interface for both engines in this test and in the src/tests/ftest/control/dmg_system_start.py test.

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand this may not always be the case for our VMs, but for now it will work. The alternative approach, which was considered, involves reworking the entire test yaml replacement code to support keyword replacement; that would be a larger undertaking at this time.


@phender phender Oct 7, 2025


I've adjusted the logic used to determine the default fabric_iface value for each engine to reuse the last known interface from the DAOS_TEST_FABRIC_IFACE env var when the index exceeds the list. This means if DAOS_TEST_FABRIC_IFACE=foo0 (or whatever is discovered on the VM server hosts) and the test requests dual engines, then engine 0 and engine 1 will both use fabric_iface: foo0. If the DAOS_TEST_FABRIC_IFACE is not set, then it will fall back to using eth0 and eth1, respectively.
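The adjusted fallback can be sketched like this; `default_fabric_iface` is a hypothetical standalone helper for illustration, not the actual launch.py code:

```python
import os


def default_fabric_iface(index):
    """Reuse the last DAOS_TEST_FABRIC_IFACE entry when index exceeds the list."""
    value = os.environ.get("DAOS_TEST_FABRIC_IFACE", "")
    interfaces = [item for item in value.split(",") if item]
    if interfaces:
        # Clamp to the last entry instead of raising IndexError
        return interfaces[min(index, len(interfaces) - 1)]
    return f"eth{index}"


os.environ["DAOS_TEST_FABRIC_IFACE"] = "foo0"
print(default_fabric_iface(0), default_fabric_iface(1))  # foo0 foo0
```

With a single discovered interface on the VM server hosts, both engines get the same fabric_iface; with the variable unset, engine N falls back to ethN.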

try:
    # Select the fastest active interface available by sorting the speed
    interface = available_interfaces[sorted(available_interfaces)[-1]]
    fastest_speed = sorted(interfaces_at_speed)[-1]
Contributor


On the Mellanox HDR network, the test controller node will usually have a reported speed of 100G for the ConnectX-[5/6] interface and on Ice Lake systems the rest of the cluster will have a reported speed of 200G for the ConnectX-6 interface.

On the Omni-Path network, the test controller may have a reported speed of 56G and the rest of the cluster will have a reported speed of 100G.

On VMs, the reported network speed may be pure fiction, and may not even have a value to look up or parse as a numeric value.

Contributor Author


In this PR we are only checking the speeds of the adapters on the server hosts - excluding the launch node.


@phender phender Oct 6, 2025


For VMs we have (currently and historically) only matched one device, so even when the speed is fictional (e.g. -1) we still get the expected device:

2025/10/02 07:28:23 DEBUG             _default_interface: Detecting network devices on brd-103vm[02-09] - DAOS_TEST_FABRIC_IFACE not set
2025/10/02 07:28:23 DEBUG                     run_remote: Running on brd-103vm[02-09] with a 120 second timeout: grep -l 'up' /sys/class/net/*/operstate | grep -Ev '/(lo|bond)/' | sort
2025/10/02 07:28:24 DEBUG                log_result_data:   brd-103vm[02-09] (rc=0): /sys/class/net/eth0/operstate
2025/10/02 07:28:24 INFO           get_common_interfaces: Active network interfaces detected:
2025/10/02 07:28:24 INFO           get_common_interfaces:   - eth0     on brd-103vm[02-09] (Common=True)
2025/10/02 07:28:24 DEBUG                     run_remote: Running on brd-103vm[02-09] with a 120 second timeout: cat /sys/class/net/eth0/speed
2025/10/02 07:28:24 DEBUG                log_result_data:   brd-103vm[02-09] (rc=0): -1
2025/10/02 07:28:24 INFO          get_fastest_interfaces: Active network interface speeds on brd-103vm[02-09]:
2025/10/02 07:28:24 INFO          get_fastest_interfaces:   - speed:     -1 => ['eth0']
2025/10/02 07:28:24 INFO          get_fastest_interfaces: Fastest interfaces detected on brd-103vm[02-09]: ['eth0']
2025/10/02 07:28:24 DEBUG             _default_interface:   Found interface(s): eth0

@daosbuild3

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/15/execution/node/553/log

Comment on lines 421 to 422
for speed in sorted(interfaces_at_speed):
logger.info(" - speed: %6s => %s", speed, sorted(interfaces_at_speed[speed]))
Contributor


Maybe we can just sort the dictionary once upfront so we don't have to keep calling sorted?

Suggested change
for speed in sorted(interfaces_at_speed):
logger.info(" - speed: %6s => %s", speed, sorted(interfaces_at_speed[speed]))
interfaces_at_speed = {k: interfaces_at_speed[k] for k in sorted(interfaces_at_speed)}
for speed, interfaces in interfaces_at_speed.items():
    logger.info(" - speed: %6s => %s", speed, sorted(interfaces))

Comment on lines 425 to 426
fastest_speed = sorted(interfaces_at_speed)[-1]
fastest_interfaces = interfaces_at_speed[fastest_speed]
Contributor


And then this also becomes

Suggested change
fastest_speed = sorted(interfaces_at_speed)[-1]
fastest_interfaces = interfaces_at_speed[fastest_speed]
fastest_speed, fastest_interfaces = list(interfaces_at_speed.items())[-1]
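A runnable illustration of the sort-once pattern proposed above, with hypothetical sample data; note that iterating the re-sorted dict needs `.items()` rather than `enumerate`, and that dicts preserve insertion order in Python 3.7+:

```python
# Hypothetical sample data: speed (Gb/s or -1 for unknown) -> interface names
interfaces_at_speed = {100: ["eth0"], 200: ["ib1", "ib0"], -1: ["eth1"]}

# Sort the dictionary once up front instead of calling sorted() repeatedly
interfaces_at_speed = {
    speed: sorted(interfaces_at_speed[speed]) for speed in sorted(interfaces_at_speed)
}
for speed, interfaces in interfaces_at_speed.items():
    print(f" - speed: {speed:6} => {interfaces}")

# The last entry of the speed-sorted dict holds the fastest interfaces
fastest_speed, fastest_interfaces = list(interfaces_at_speed.items())[-1]
print(fastest_speed, fastest_interfaces)  # 200 ['ib0', 'ib1']
```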

Comment on lines 489 to 494
try:
_defaults = os.environ.get("DAOS_TEST_FABRIC_IFACE").split(",")
default_interface = list(filter(None, _defaults))[index]
except (AttributeError, IndexError):
default_interface = f"eth{index}"
default_port = int(os.environ.get("D_PORT", 31317 + (100 * index)))
Contributor


So if someone sets, for example, DAOS_TEST_FABRIC_IFACE=ib0 and the test is dual engine, we'll end up trying to use ib0 and eth1 by default? Maybe if you set DAOS_TEST_FABRIC_IFACE it should be expected that you pass an interface for each engine.

Contributor Author


If the user attempts to run a dual engine test and only sets DAOS_TEST_FABRIC_IFACE=ib0 we will run the test with a server config using:

engines:
- fabric_iface: ib0
  fabric_iface_port: 31317
- fabric_iface: eth1
  fabric_iface_port: 31417

So, yes, they should have set DAOS_TEST_FABRIC_IFACE=ib0,ib1 in this case.
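For illustration, a runnable reproduction of the quoted default logic showing the mixed result described above; the wrapper function `engine_defaults` is hypothetical:

```python
import os


def engine_defaults(index):
    # Same shape as the quoted launch.py default logic
    try:
        _defaults = os.environ.get("DAOS_TEST_FABRIC_IFACE").split(",")
        default_interface = list(filter(None, _defaults))[index]
    except (AttributeError, IndexError):
        # Unset variable (AttributeError) or too few entries (IndexError)
        default_interface = f"eth{index}"
    default_port = int(os.environ.get("D_PORT", 31317 + (100 * index)))
    return default_interface, default_port


os.environ["DAOS_TEST_FABRIC_IFACE"] = "ib0"
os.environ.pop("D_PORT", None)
print(engine_defaults(0))  # ('ib0', 31317)
print(engine_defaults(1))  # ('eth1', 31417)
```

A single-entry DAOS_TEST_FABRIC_IFACE with a dual-engine test yields ib0 for engine 0 but the eth1 fallback for engine 1, hence the advice to pass one interface per engine.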

… hendersp/DAOS-17639

@phender phender dismissed stale reviews from JohnMalmberg and daltonbohning via afcc9d8 November 21, 2025 20:24
@daosbuild3

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/25/execution/node/925/log

@daosbuild3

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/25/execution/node/880/log

@daosbuild3

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/25/execution/node/1066/log

@daosbuild3

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/26/execution/node/917/log

@daosbuild3

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/26/execution/node/948/log

@daosbuild3

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/26/execution/node/927/log

@daosbuild3

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16913/27/execution/node/948/log

daltonbohning previously approved these changes Dec 17, 2025
JohnMalmberg previously approved these changes Dec 17, 2025

@JohnMalmberg JohnMalmberg left a comment


I have not gotten this to work with the new udev rules that I am currently testing.

Fixing that should be a minor change that can be done after this PR, so I do not want to have that be a blocker.

@daosbuild3

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16913/27/testReport/

Skip-unit-tests: true
Skip-fault-injection-test: true
Test-tag: CatRecovCoreTest IorInterceptMultiClient

Signed-off-by: Phil Henderson <phillip.henderson@hpe.com>
@phender phender dismissed stale reviews from JohnMalmberg and daltonbohning via 988491b December 19, 2025 16:22

phender commented Dec 19, 2025

The failures in https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16913/27/ are known issues:


phender commented Dec 22, 2025

https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16913/28/ passed when running the two test yaml files that were updated due to merge conflicts after #16913 (comment), which included the fix for the 4-./ior/intercept_multi_client.py:IorInterceptMultiClient.test_ior_intercept_libpil4dfs failure.

@phender phender added the forced-landing The PR has known failures or has intentionally reduced testing, but should still be landed. label Dec 22, 2025
@daltonbohning daltonbohning requested a review from a team December 22, 2025 18:49
@daltonbohning daltonbohning merged commit 4acf5a3 into master Dec 22, 2025
33 checks passed
@daltonbohning daltonbohning deleted the hendersp/DAOS-17639 branch December 22, 2025 18:49
daltonbohning added a commit that referenced this pull request Dec 22, 2025
Per #16913

Test-tag: test_rebuild_interactive
Test-repeat: 3
Skip-func-hw-test-large: false
Skip-unit-tests: true
Skip-fault-injection-test: true

Signed-off-by: Dalton Bohning <dalton.bohning@hpe.com>
phender added a commit that referenced this pull request Jan 16, 2026
Launch.py will detect all of the fastest interfaces common to all the
specified server hosts and use them to populate the engine fabric_iface
entries if no overrides are provided in the test yaml.

Signed-off-by: Phil Henderson <phillip.henderson@hpe.com>