DAOS-18597 ddb: enforce SPDK re-init rules #17838

Merged: daltonbohning merged 8 commits into master from janekmi/DAOS-18597-DDB-limit-VMD-usage on Apr 2, 2026

Conversation

@janekmi janekmi commented Mar 30, 2026

SPDK, in all known applications—including daos_engine—is initialized only once during the lifetime of a process. DDB uses SPDK differently, allowing the user to initialize and re‑initialize SPDK multiple times within the same process.

However, SPDK does not fully support re‑initialization. At the moment, this issue appears only when the SPDK configuration uses the VMD subsystem. When a VMD‑enabled SPDK configuration is in use, two conditions must be respected:

  • The VMD‑enabled configuration must be used during the first SPDK initialization. The VMD subsystem and DPDK will not be initialized on subsequent SPDK re‑initializations.
  • After a VMD‑enabled configuration has been used, the user cannot re‑initialize SPDK—whether with VMD enabled or disabled. The internal state of the VMD subsystem and DPDK becomes unsafe to use, even if the next SPDK configuration does not include VMD.

If these rules are not followed, DDB crashes.
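
A minimal sketch of how the two rules above could be checked before each SPDK initialization. It is not the code from this PR: the enum, its values, and the function `reinit_allowed()` are hypothetical names chosen for illustration, and the history is passed by pointer only to keep the example easy to exercise, whereas the review discussion refers to a global variable.

```c
#include <stdbool.h>

/* Hypothetical record of how SPDK has been initialized so far in this process. */
enum spdk_init_history {
	SPDK_NEVER_INITIALIZED, /* no SPDK initialization has happened yet */
	SPDK_USED_WITHOUT_VMD,  /* initialized at least once, never with VMD */
	SPDK_USED_WITH_VMD      /* a VMD-enabled configuration has been used */
};

/*
 * Decide whether one more SPDK initialization may proceed and, if it may,
 * record it in *history. Returns true when the request respects both rules.
 */
bool
reinit_allowed(enum spdk_init_history *history, bool vmd_requested)
{
	switch (*history) {
	case SPDK_NEVER_INITIALIZED:
		/* First initialization: any configuration is acceptable. */
		*history = vmd_requested ? SPDK_USED_WITH_VMD : SPDK_USED_WITHOUT_VMD;
		return true;
	case SPDK_USED_WITHOUT_VMD:
		/* Rule 1: VMD is only usable in the first initialization. */
		return !vmd_requested;
	case SPDK_USED_WITH_VMD:
		/* Rule 2: once VMD was used, no re-initialization at all. */
		return false;
	}
	return false;
}
```

With this shape, a caller would refuse the open request with an error message when the function returns false, instead of letting SPDK crash later.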

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

Signed-off-by: Jan Michalski <jan-marian.michalski@hpe.com>
@janekmi janekmi requested review from a team as code owners March 30, 2026 17:18
@github-actions

Ticket title is 'ddb dev_list with --vos_path in interactive mode causes segmentation fault'
Status is 'In Progress'
Labels: 'VMD,ddb,scrubbed_2.8,test_2.8'
Job should run at elevated priority (1)
https://daosio.atlassian.net/browse/DAOS-18597

@github-actions github-actions bot added the priority Ticket has high priority (automatically managed) label Mar 30, 2026
Comment thread src/utils/ddb/ddb_commands.c Outdated

	if (opt->db_path != NULL && strnlen(opt->db_path, PATH_MAX) != 0) {
		memset(path_parts.vf_db_path, 0, sizeof(path_parts.vf_db_path));
		strncpy(path_parts.vf_db_path, opt->db_path, sizeof(path_parts.vf_db_path) - 1);
Contributor

Can we move this section into vos_path_parse() above? Then the other similar code paths in the patch can share it, and nobody will miss db_path.

Contributor Author

It is already done here: #17340 by @knard38

I do not want to repeat the effort.
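
The bounded copy quoted at the top of this thread could be shared through a small helper along these lines. This is only a sketch: `copy_db_path()` is a hypothetical name, it is not the helper from #17340, and rejecting over-long input instead of silently truncating it is an assumption of this example.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy src into the fixed-size buffer dst (dst_size bytes, including the
 * terminating NUL). Returns 0 on success, or -1 if src is NULL, empty,
 * or too long to fit.
 */
int
copy_db_path(char *dst, size_t dst_size, const char *src)
{
	const char *end;

	if (dst == NULL || dst_size == 0 || src == NULL)
		return -1;

	/* Look for the terminating NUL within the first dst_size bytes only. */
	end = memchr(src, '\0', dst_size);
	if (end == NULL || end == src)
		return -1; /* too long for dst, or empty */

	/* Same pattern as the quoted snippet: zero the buffer, then copy. */
	memset(dst, 0, dst_size);
	strncpy(dst, src, dst_size - 1);
	return 0;
}
```

Each call site would then reduce to one call plus an error check, so a forgotten db_path becomes harder to miss.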

Comment thread src/utils/ddb/ddb_vmd_wa.c Outdated
int rc;

rc = json_object_object_get_ex(vmd_subsystem, KEY_SUBSYSTEM_CONFIG, &config);
D_ASSERTF(rc == 1, "VMD subsystem does not have a '%s' key.\n", KEY_SUBSYSTEM_CONFIG);
Contributor

Do not assert here, since the input source may be out of our control. The same applies to other places.

Contributor Author

Done.

However, I left two cases where, in my opinion, we have every reason to expect a given json_*() call to succeed, e.g.:

	methods_num = json_object_array_length(config);
	for (int i = 0; i < methods_num; i++) {
		struct json_object *method = json_object_array_get_idx(config, i);
		D_ASSERT(method != NULL);

The number of objects in the array is already established. If they are there, it should be possible to get them one by one.
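
The distinction drawn in this thread, between content that comes from a file and invariants the code has already established, can be sketched as follows. The example deliberately avoids json-c so that it stays self-contained: `struct cfg_entry`, `cfg_get_idx()` and `cfg_lookup()` are hypothetical stand-ins, not functions from this PR or from any library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one entry of a parsed configuration file. */
struct cfg_entry {
	const char *key;
	const char *value;
};

/* Return entry idx, or NULL when idx is out of range. */
const struct cfg_entry *
cfg_get_idx(const struct cfg_entry *entries, size_t count, size_t idx)
{
	return idx < count ? &entries[idx] : NULL;
}

/*
 * Find key among the entries and store its value in *value.
 * Returns 0 on success, or -1 when the key is absent.
 */
int
cfg_lookup(const struct cfg_entry *entries, size_t count, const char *key, const char **value)
{
	for (size_t i = 0; i < count; i++) {
		const struct cfg_entry *entry = cfg_get_idx(entries, count, i);

		/* i < count holds here, so a NULL entry would be a program bug. */
		assert(entry != NULL);
		if (strcmp(entry->key, key) == 0) {
			*value = entry->value;
			return 0;
		}
	}
	/* The file content is external input: report it, do not assert. */
	return -1;
}
```

A missing key is reported to the caller, who can print an error and continue, while the assertion only guards a condition the loop bound already guarantees.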

@Nasf-Fan Nasf-Fan self-requested a review March 31, 2026 06:17
@knard38 knard38 left a comment

If I am correct, one call to vmd_wa_can_proceed() is missing in ddb_main.c.

Are you planning to add some unit tests, or would you rather not bother since it is a short-term workaround?

Comment thread src/utils/ddb/ddb_vmd_wa.c Outdated
is_vmd_enabled(struct json_object *vmd_subsystem)
{
struct json_object *config;
struct json_object *method;
Contributor

NIT, the scope of the variable method can be reduced.

Contributor Author

Done.

Comment thread src/utils/ddb/ddb_vmd_wa.c Outdated
}

static struct json_object *
get_vmd_subsystem(struct json_object *subsystems)
Contributor

NIT, the scope of the variables subsystem and rc can be reduced.

Contributor Author

Done.

Comment thread src/utils/ddb/ddb_main.c
		memset(path_parts.vf_db_path, 0, sizeof(path_parts.vf_db_path));
		strncpy(path_parts.vf_db_path, pa.pa_db_path,
			sizeof(path_parts.vf_db_path) - 1);
	}
Contributor

Should we also run vmd_wa_can_proceed() to update the status of the global variable?
If we are using a command file, several open and close commands could be run.

Contributor Author

@janekmi janekmi Mar 31, 2026

If we are using a command file, several open and close commands could be run.

If I am not mistaken, this is already covered. As far as I can tell, the call sequence goes something like: main() -> parseOpts() -> runFileCmds() -> runCmdStr() -> app.RunCommand() -> ...

So, all open and close commands from a command file should go through ddb_run_open() as normal.

Contributor Author

Should we also run vmd_wa_can_proceed() to update the status of the global variable?

Done.

janekmi added 2 commits March 31, 2026 14:50
+ limit scope for some local variables

Signed-off-by: Jan Michalski <jan-marian.michalski@hpe.com>
Signed-off-by: Jan Michalski <jan-marian.michalski@hpe.com>
janekmi commented Mar 31, 2026

@knard38 wrote:

Are you planning to add some unit tests, or would you rather not bother since it is a short-term workaround?

A lack of unit tests always comes back to bite you, and from experience, short-term workarounds live longer than expected. 😅

Let's see how the review and validation go. I do not want to slow down the release, and I have a few more things to take care of. If we are still working on this review a few days from now, that may give me time to write unit tests, if it makes sense.

@janekmi janekmi requested a review from knard38 March 31, 2026 15:16
@daltonbohning daltonbohning added the approved-to-merge PR has received release branch merge approval label Mar 31, 2026
@daosbuild3
Collaborator

Test stage Unit Test with memcheck completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17838/3/testReport/

char *nvme_conf;
int rc;

D_ASPRINTF(nvme_conf, "%s/%s", db_path, VOS_NVME_CONF);
Contributor

Where will nvme_conf be released?

Contributor Author

Great point. Done.

@Nasf-Fan Nasf-Fan self-requested a review April 1, 2026 01:40
@knard38 knard38 left a comment

If I am correct, there is a possible memory leak of the JSON object, which needs to be fixed.
I also agree with @Nasf-Fan that the string nvme_conf is never freed.

rc = json_object_object_get_ex(root, KEY_SUBSYSTEMS, &subsystems);
if (rc != JSON_TRUE) {
ddb_errorf(ctx, "File %s does not have '%s' key\n", nvme_conf, KEY_SUBSYSTEMS);
Contributor

Is a call to json_object_put(root) missing here?

Contributor Author

You are right! Done. Thank you.
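
Both leaks raised in review (the allocated nvme_conf string and the parsed root object on an early error return) are commonly avoided with a single exit path. The sketch below shows that pattern in plain C under stated assumptions: `conf_has_subsystems()` is a hypothetical function, `malloc()`/`free()` stand in for D_ASPRINTF and its matching release, and a `FILE *` stands in for the JSON root that would be released with json_object_put(). It is not the code from this PR.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return 0 if <db_path>/<conf_name> exists and mentions "subsystems",
 * or -1 otherwise. Every exit after the allocation goes through "out",
 * so neither the path string nor the open file can leak.
 */
int
conf_has_subsystems(const char *db_path, const char *conf_name)
{
	size_t len = strlen(db_path) + strlen(conf_name) + 2; /* '/' and NUL */
	char  *path;
	FILE  *fp = NULL;
	char   line[256];
	int    rc = -1;

	path = malloc(len);
	if (path == NULL)
		return -1; /* nothing to release yet */
	snprintf(path, len, "%s/%s", db_path, conf_name);

	fp = fopen(path, "r");
	if (fp == NULL)
		goto out; /* path still has to be freed */

	while (fgets(line, sizeof(line), fp) != NULL) {
		if (strstr(line, "subsystems") != NULL) {
			rc = 0;
			break;
		}
	}

out:
	if (fp != NULL)
		fclose(fp);
	free(path);
	return rc;
}
```

The error branch that prints "does not have ... key" in the quoted snippet would correspond to a `goto out` here rather than a direct return.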

janekmi added 2 commits April 1, 2026 10:37
Signed-off-by: Jan Michalski <jan-marian.michalski@hpe.com>
…DDB-limit-VMD-usage

Signed-off-by: Jan Michalski <jan-marian.michalski@hpe.com>
@janekmi janekmi requested a review from knard38 April 1, 2026 10:44
Nasf-Fan previously approved these changes Apr 1, 2026
knard38 previously approved these changes Apr 1, 2026
@knard38 knard38 left a comment

LGTM

@daosbuild3
Collaborator

Test stage Unit Test with memcheck completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17838/4/testReport/

@gnailzenh
Collaborator

There is still a failed UT; please check.

Signed-off-by: Jan Michalski <jan-marian.michalski@hpe.com>
@janekmi janekmi dismissed stale reviews from knard38 and Nasf-Fan via dac3ba1 April 1, 2026 13:42
@janekmi janekmi requested review from Nasf-Fan and knard38 April 1, 2026 13:42
Nasf-Fan previously approved these changes Apr 1, 2026
knard38 previously approved these changes Apr 1, 2026
…DDB-limit-VMD-usage

Signed-off-by: Jan Michalski <jan-marian.michalski@hpe.com>
@janekmi janekmi dismissed stale reviews from knard38 and Nasf-Fan via f4e5a04 April 1, 2026 15:03
Signed-off-by: Jan Michalski <jan-marian.michalski@hpe.com>
@janekmi janekmi requested review from Nasf-Fan and knard38 April 1, 2026 15:19
@knard38 knard38 left a comment

LGTM

@janekmi janekmi requested a review from a team April 2, 2026 11:23
@daltonbohning daltonbohning removed the approved-to-merge PR has received release branch merge approval label Apr 2, 2026
@daltonbohning daltonbohning merged commit ed2bc08 into master Apr 2, 2026
42 checks passed
@daltonbohning daltonbohning deleted the janekmi/DAOS-18597-DDB-limit-VMD-usage branch April 2, 2026 14:22
@daltonbohning
Contributor

FYI: master is now 3.0 development, so this will need to be backported to release/2.8 if needed.

janekmi added a commit that referenced this pull request Apr 2, 2026
Signed-off-by: Jan Michalski <jan-marian.michalski@hpe.com>
@janekmi janekmi restored the janekmi/DAOS-18597-DDB-limit-VMD-usage branch April 4, 2026 14:45
@janekmi janekmi deleted the janekmi/DAOS-18597-DDB-limit-VMD-usage branch April 4, 2026 14:46
daltonbohning pushed a commit that referenced this pull request Apr 10, 2026
Signed-off-by: Jan Michalski <jan-marian.michalski@hpe.com>
Co-authored-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com>
Labels

priority Ticket has high priority (automatically managed)

6 participants