Merge release/2.6 into google/2.6 #15908

Closed
jolivier23 wants to merge 20 commits into google/2.6 from jeffolivier/google/2.6

@jolivier23
Contributor

Nasf-Fan and others added 18 commits January 29, 2025 11:41
…5793)

It is a temporary workaround for the collective punch crash at large scale.

Signed-off-by: Fan Yong <fan.yong@hpe.com>
Intercepting io_queue_init() is needed on Ubuntu. There is a compatibility issue between the pil4dfs interception library and the fio libaio engine. In some cases, fio initializes the aio context through io_queue_init() when loading the libaio engine. Although pil4dfs already intercepts io_setup(), the io_setup() call made by io_queue_init() is sometimes not intercepted, which results in an invalid aio context for I/O processing. Add an interception for io_queue_init() to cover this case.

Signed-off-by: Jun Zeng <jun1.zeng@intel.com>
Signed-off-by: Lei Huang <lei.huang@intel.com>
The backport from master missed a couple of pool tests that
needed to be updated for the new JSON output from pool query.
Query results always include a dead_ranks array, even when
it's empty.

Signed-off-by: Michael MacDonald <mjmac@google.com>
…re. (#15696) (#15776)

Summary: Pass the ior_timeout to avoid the test hanging under certain situations.

Signed-off-by: Padmanabhan <ravindran.padmanabhan@intel.com>
…#15258)

Signed-off-by: Ashley Pittman <ashley.m.pittman@intel.com>
…15824)

Signed-off-by: Joseph Moore <joseph.moore@hpe.com>
Whenever stopping an engine process from within the control-plane, use
SIGKILL rather than asking nicely (SIGTERM). This has been requested
to try to avoid situations that could result in data loss.

This change preserves the behaviour where ds_mgmt_drpc_prep_shutdown()
and then ds_pool_disable_exclude() are called during a controlled
shutdown, where dmg system stop is called with the new --full argument.

Notable behavior changes with this PR:
  * Always performs SIGKILL on dmg system stop unless the --full command
option is supplied.
  * Attempts a prepare shutdown to disable exclusions across the cluster
during a “controlled” shutdown, where dmg system stop is called with the
--full option, but this should be regarded as experimental and not
for use in production environments.
  * The force option is a no-op and is retained for backward compatibility
and future use.

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
…15833) (#15837)

This is a workaround for DAOS-16990 and DAOS-17011.

When using the CXI provider, retry HG_Init_opt2() on error cases since
it seems CXI has intermittent issues on initialization. A new
environment variable is added (CRT_CXI_INIT_RETRY) to control the retry
count (default is 3) and to be able to test future SS fixes without
retry.

Signed-off-by: Mohamad Chaarawi <mohamad.chaarawi@hpe.com>
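
The retry behaviour described in the commit above can be sketched roughly as follows. This is a minimal illustration in Python, not the DAOS implementation (which is C): `init_with_retry` and `init_fn` are hypothetical names standing in for the call to HG_Init_opt2(), and the sketch assumes that CRT_CXI_INIT_RETRY counts extra attempts after the first failure and that providers other than CXI are never retried.

```python
import os


def init_with_retry(init_fn, provider, env=os.environ):
    """Call init_fn(), retrying on failure only for the CXI provider.

    Returns (result, attempts). init_fn is a hypothetical stand-in for
    the real initialization call and signals failure with RuntimeError.
    """
    # Default of 3 is taken from the commit message; treating the value
    # as "number of extra attempts" is an assumption of this sketch.
    retries = int(env.get("CRT_CXI_INIT_RETRY", "3")) if provider == "cxi" else 0

    attempts = 0
    while True:
        attempts += 1
        try:
            return init_fn(), attempts
        except RuntimeError:
            # Give up once the initial attempt plus all retries have failed.
            if attempts > retries:
                raise
```

Under these assumptions, setting CRT_CXI_INIT_RETRY to 0 disables the retry, which matches the stated goal of testing future fixes without it.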
Increase the "Unit Test bdev with memcheck on EL 8.8" step timeout to be
in sync with the master branch.

Signed-off-by: Phil Henderson <phillip.henderson@hpe.com>
With this change, three ULTs in pool and container code launched via
ds_pool_thread_collective() are changed to specify a larger ("deep")
stack size of 64KiB rather than the default 16KiB stack size; that is,
the flags parameter is specified as DSS_ULT_DEEP_STACK. The three ULT
function entry points are:
cont_open_one, cont_snap_update_one, and update_vos_prop_on_targets.

Before this change, intermittently in CI testing, shortly after
daos_engine startup, a dmg pool list (pool query on the back end)
would occasionally result in a segmentation fault in an engine, in
these three particular areas of the code. Specifically, the faults
occurred within the ABT thread create, inside ABTI_mem_pool_alloc().

This change is based on a guess that the stack size parameter may have
some effect.

Signed-off-by: Kenneth Cain <kenneth.cain@hpe.com>
Tag second release build for 2.6.3.

Signed-off-by: Phil Henderson <phillip.henderson@hpe.com>
The third argument is of type "void *" in the libc source code, so
"va_arg(arg, int);" retrieves the wrong argument.
Also return ENOTSUP for flock when compatible
mode is not enabled.

Signed-off-by: Lei Huang <lei.huang@hpe.com>
…) (#15859)

Remove the calling of cleanup methods for multiple containers and ior
commands that can be handled by destroying the pool and a single ior
kill command.

Signed-off-by: Phil Henderson <phillip.henderson@hpe.com>
Otherwise, the partially committed DTX entry will be re-committed when
the container is reopened. Accessing the related dangling DTX record(s)
may then trigger an assertion and cause corruption.

Signed-off-by: Fan Yong <fan.yong@hpe.com>
…#15882)

Skip existing partially committed DTX records that were generated by
DAOS-2.6.3-rc{1,2} to avoid repeated DTX commit after engine upgrade.

To be safe, the user/admin must explicitly set the server-side
environment variable "DAOS_SKIP_OLD_PARTIAL_DTX" while upgrading
from DAOS-2.6.3-rc{1,2}.

The environment variable can be ignored when upgrading from earlier versions.

Signed-off-by: Fan Yong <fan.yong@hpe.com>
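
As a rough admin-side illustration of the upgrade rule above, the following Python sketch decides whether the variable needs to be added to the server environment. The helper name `server_env_for_upgrade`, the version-string format, and the value "1" are all assumptions made for this example; the commit message only says the variable must be set explicitly.

```python
# Release candidates named in the commit message; the exact version
# string format used here is an assumption of this sketch.
AFFECTED_VERSIONS = {"2.6.3-rc1", "2.6.3-rc2"}


def server_env_for_upgrade(previous_version, base_env=None):
    """Return a copy of base_env, adding the skip flag only when needed."""
    env = dict(base_env or {})
    if previous_version in AFFECTED_VERSIONS:
        # The value "1" is hypothetical; the source does not specify one.
        env["DAOS_SKIP_OLD_PARTIAL_DTX"] = "1"
    return env
```

For an upgrade from an earlier version the helper leaves the environment unchanged, mirroring the note that the variable can be ignored in that case.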
…15879)

For dfs_readx/writex and array_read/write operations, add a limit for
the number of IODs being passed to DAOS of 16k if the range lengths are
under 16 bytes (best effort checking).

Signed-off-by: Mohamad Chaarawi <mohamad.chaarawi@hpe.com>
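
One way to picture the limit described above is the following Python sketch. It is not the DAOS code: `check_iod_limit` is a hypothetical name, "16k" is assumed to mean 16384, each range is assumed to map to one IOD, and the "best effort" check is simplified to "reject only when every range is shorter than 16 bytes".

```python
MAX_SMALL_IODS = 16 * 1024   # "16k" from the commit message, assumed to be 16384
SMALL_RANGE_BYTES = 16       # ranges shorter than this count as "small"


def check_iod_limit(range_lengths):
    """Return True if a request with these range lengths is acceptable.

    range_lengths is a list of byte lengths, one per range (and, by
    assumption, one per IOD).
    """
    # Requests at or below the limit are always accepted.
    if len(range_lengths) <= MAX_SMALL_IODS:
        return True
    # Over the limit: reject only if all ranges are small (assumed rule).
    return not all(n < SMALL_RANGE_BYTES for n in range_lengths)
```

With these assumptions, a request of 16385 ranges of 8 bytes each is rejected, while the same number of 4096-byte ranges is accepted.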
Tag third release build for 2.6.3.

Signed-off-by: Phil Henderson <phillip.henderson@hpe.com>
Updated the expected journalctl message from "exited with 0" to "killed",
since #15811 changed the default dmg system stop to use --force.

Signed-off-by: Dalton Bohning <dalton.bohning@hpe.com>
@github-actions

Errors are component not formatted correctly, Ticket number prefix incorrect, PR title is malformatted. See https://daosio.atlassian.net/wiki/spaces/DC/pages/11133911069/Commit+Comments, Unable to load ticket data
https://daosio.atlassian.net/browse/Merge

…le/2.6

Revert e1393d8 as part of merge

Change-Id: I7e6c15c07ad7fcb94622ec8d6081624641c44441
Signed-off-by: Jeff Olivier <jeffolivier@google.com>
@jolivier23 force-pushed the jeffolivier/google/2.6 branch from 8921034 to 1dd4576 on February 14, 2025 02:23
Signed-off-by: Jeff Olivier <jeffolivier@google.com>
@daosbuild1
Collaborator

Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15908/3/execution/node/344/log

@daosbuild1
Collaborator

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15908/3/execution/node/319/log

@daosbuild1
Collaborator

Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15908/3/execution/node/345/log

@daosbuild1
Collaborator

Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15908/3/execution/node/316/log

@jolivier23 closed this on Feb 14, 2025
@jolivier23 deleted the jeffolivier/google/2.6 branch on February 14, 2025 16:44