DAOS-17468 control: Prevent start if transparent hugepages are enabled#16313

Merged
daltonbohning merged 23 commits into master from tanabarr/control-no-thp
Dec 19, 2025

Conversation

@tanabarr tanabarr commented Apr 25, 2025

When the transparent hugepages (THP) feature is enabled on Linux
platforms, SPDK-related hugepage management in DAOS performs
sub-optimally. The resulting problems relate to memory accounting and
fragmentation. To remedy this, refuse to start daos_server if THP is
enabled on the platform and recommend disabling THP by applying kernel
command-line parameters that take effect on reboot.
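
The check described above could look roughly like the following. This is a minimal sketch, not the code merged in this PR. It reads the standard Linux sysfs file `/sys/kernel/mm/transparent_hugepage/enabled`, whose content marks the active mode with brackets (for example `always madvise [never]`), and suggests the standard kernel parameter `transparent_hugepage=never`. The function names `parseTHPMode` and `checkTHPDisabled` are hypothetical, and treating every mode other than `never` (including `madvise`) as "enabled" is an assumption.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// thpSysfsPath is the standard Linux sysfs file that reports the THP mode.
const thpSysfsPath = "/sys/kernel/mm/transparent_hugepage/enabled"

// parseTHPMode (hypothetical helper) returns the active mode from sysfs
// content such as "always madvise [never]", where brackets mark the
// currently selected value.
func parseTHPMode(content string) (string, error) {
	for _, field := range strings.Fields(content) {
		if strings.HasPrefix(field, "[") && strings.HasSuffix(field, "]") {
			return strings.Trim(field, "[]"), nil
		}
	}
	return "", fmt.Errorf("no active THP mode found in %q", content)
}

// checkTHPDisabled (hypothetical helper) returns an error unless the active
// mode is "never". Assumption: "madvise" is treated as enabled too.
func checkTHPDisabled(content string) error {
	mode, err := parseTHPMode(content)
	if err != nil {
		return err
	}
	if mode != "never" {
		return fmt.Errorf("transparent hugepages are enabled (mode %q); "+
			"boot with transparent_hugepage=never to disable them", mode)
	}
	return nil
}

func main() {
	raw, err := os.ReadFile(thpSysfsPath)
	if err != nil {
		fmt.Println("cannot read THP state:", err)
		return
	}
	if err := checkTHPDisabled(string(raw)); err != nil {
		fmt.Println("refusing to start:", err)
		return
	}
	fmt.Println("THP disabled, safe to start")
}
```

Keeping the parsing separate from the file read makes the decision testable without touching the host's sysfs.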

Features: control

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@tanabarr tanabarr requested review from a team as code owners April 25, 2025 11:45
github-actions bot commented Apr 25, 2025

Ticket title is 'Prevent start if transparent hugepages are enabled'
Status is 'In Review'
https://daosio.atlassian.net/browse/DAOS-17468

@tanabarr tanabarr self-assigned this Apr 25, 2025
Features: control
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@tanabarr tanabarr force-pushed the tanabarr/control-no-thp branch from d46e506 to 5c07867 Compare April 25, 2025 15:16
@tanabarr tanabarr requested a review from a team as a code owner April 25, 2025 15:16
Features: control
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Features: control
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@tanabarr tanabarr requested review from kjacque, knard38 and mjmac April 29, 2025 22:12
Features: control
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
kjacque previously approved these changes Apr 30, 2025

@kjacque kjacque left a comment

All my issues were addressed. You'll want to fix the typo in the title of the PR (DARS -> DAOS), otherwise looks good.

knard38 previously approved these changes May 5, 2025

@tanabarr (Contributor Author)

@ryon-jensen @JohnMalmberg can we please ensure that the transparent hugepages feature is disabled on all CI test runners? If not, it will create problems with DAOS and this PR will cause failures. TIA

tanabarr added 2 commits May 17, 2025 10:39
…-thp

Features: control
Allow-unstable-test: true
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
…-thp

Features: control
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@tanabarr tanabarr changed the title DARS-17468 control: Prevent start if transparent hugepages are enabled DAOS-17468 control: Prevent start if transparent hugepages are enabled Jun 4, 2025
…-thp

Features: control
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@tanabarr (Contributor Author)

@ryon-jensen functional tests are failing, presumably because THP is enabled on the test runner: https://jenkins.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16313/8/#showFailuresLink I wonder whether THP needs to be enabled on the runner? If we find situations where THP needs to be enabled, e.g. VMs, then we can add an override flag to skip the check.
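
An override of the kind suggested here could be modelled as below. This is only an illustrative sketch: `allow_thp` is the global server config parameter that appears later in this conversation, while the `serverConfig` type, the `validateTHP` function and the error text are hypothetical stand-ins rather than the DAOS implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// serverConfig is a hypothetical stand-in for the server config; only the
// allow_thp option is modelled here.
type serverConfig struct {
	AllowTHP bool // corresponds to "allow_thp" in the global config section
}

// errTHPEnabled is a hypothetical error returned when startup is refused.
var errTHPEnabled = errors.New(
	"transparent hugepages enabled; disable THP or set allow_thp: true")

// validateTHP decides whether startup may proceed given the detected THP
// state. Startup is refused only when THP is on and no override is set.
func validateTHP(cfg serverConfig, thpEnabled bool) error {
	if thpEnabled && !cfg.AllowTHP {
		return errTHPEnabled
	}
	return nil
}

func main() {
	cases := []struct{ allow, enabled bool }{
		{false, false},
		{false, true},
		{true, true},
	}
	for _, tc := range cases {
		err := validateTHP(serverConfig{AllowTHP: tc.allow}, tc.enabled)
		fmt.Printf("allow_thp=%v thp_enabled=%v -> err=%v\n",
			tc.allow, tc.enabled, err)
	}
}
```

Because the zero value of `AllowTHP` is `false`, a config that omits the option keeps the strict behaviour by default.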

…-thp

Features: control
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
kjacque previously approved these changes Dec 16, 2025

@kjacque kjacque left a comment

Overall looks good. Nothing I would block on.

@daosbuild3 (Collaborator)

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16313/21/execution/node/1254/log

knard38 previously approved these changes Dec 17, 2025

If `allow_thp: true` parameter is set in server config file global section, the behavior will change
and the server will start (with THP enabled. SCM tmpfs will be mounted with `huge=always` on `dmg
Contributor

NIT

Suggested change
- and the server will start (with THP enabled. SCM tmpfs will be mounted with `huge=always` on `dmg
+ and the server will start (with THP enabled). SCM tmpfs will be mounted with `huge=always` on `dmg

Contributor Author

will update on follow on or repush

Contributor Author

done

@tanabarr (Contributor Author)

CI run no. 21 passed all tests apart from one known OSADrain failure. NLT test was skipped in the unit stage. Rerunning only NLT.

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@tanabarr tanabarr dismissed stale reviews from knard38 and kjacque via 5892623 December 17, 2025 16:25
… NLT"

This reverts commit 5892623.
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@daosbuild3 (Collaborator)

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16313/26/execution/node/1393/log

This reverts commit 3d55554.
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Doc-only: false
Priority: 2
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@tanabarr tanabarr added the forced-landing label Dec 19, 2025
@tanabarr (Contributor Author)

@kjacque @knard38 could I please get reviews again? I've separated out the scm_hugepages_disabled concerns from the PR and will handle those separately.
I have requested forced landing for the unrelated NLT container permissions failure.

@kjacque kjacque left a comment

Nothing I'd block on, just some suggestions for a follow-on.

seenScmClsIdx = idx

if seenScmHugeIdx != -1 && scmConf.Scm.DisableHugepages != seenScmHuge {
	log.Debugf("scm_hugepages_disabled entry %v in %d doesn't match %d",
Contributor

Might be worth using Error here to log the details, since we error out anyway.

Contributor Author

yes, will add in follow-on
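
One way to act on this suggestion is to put the mismatch details into the returned error, so that whatever the caller logs at error level carries them. The sketch below is an assumption-laden illustration and not the code under review: `checkScmHugepagesConsistent` is a hypothetical function, and the per-engine `scm_hugepages_disabled` values are passed in as a plain `[]bool`.

```go
package main

import "fmt"

// checkScmHugepagesConsistent (hypothetical) returns an error naming the
// first engine whose scm_hugepages_disabled value differs from the value
// seen on the preceding engine. The slice index is used as the engine index.
func checkScmHugepagesConsistent(disabled []bool) error {
	seenIdx := -1
	seenVal := false
	for idx, val := range disabled {
		if seenIdx != -1 && val != seenVal {
			return fmt.Errorf(
				"scm_hugepages_disabled entry %v in engine %d doesn't match engine %d",
				val, idx, seenIdx)
		}
		seenIdx, seenVal = idx, val
	}
	return nil
}

func main() {
	fmt.Println(checkScmHugepagesConsistent([]bool{true, true}))
	fmt.Println(checkScmHugepagesConsistent([]bool{false, true}))
}
```

With the details in the error value, a separate debug-level log line at the point of detection is no longer needed.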

defer test.ShowBufferOnFailure(t, buf)

conf := DefaultServer().
	WithAllowTHP(true). // Enable differences between scm_hugepages_disabled.
Contributor

Would be useful to have a case where allow_thp: false?

Contributor Author

@tanabarr tanabarr Dec 19, 2025

will add in other follow-on although as it's the default value it gets tested in multiple other places

@daltonbohning daltonbohning requested a review from a team December 19, 2025 17:18
@daltonbohning daltonbohning merged commit 7faa4d4 into master Dec 19, 2025
40 of 42 checks passed
@daltonbohning daltonbohning deleted the tanabarr/control-no-thp branch December 19, 2025 17:19
tanabarr added a commit that referenced this pull request Jan 26, 2026
#16313)

When the transparent hugepages (THP) feature is enabled on Linux
platforms, SPDK-related hugepage management in DAOS performs
sub-optimally. The resulting problems relate to memory accounting and
fragmentation. To remedy this, refuse to start daos_server if THP is
enabled on the platform and recommend disabling THP by applying kernel
command-line parameters that take effect on reboot.

Features: control
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Labels

forced-landing The PR has known failures or has intentionally reduced testing, but should still be landed.

7 participants