DAOS-15863 container: fix container destroy error #13852

wangshilong · 2024-02-22T13:18:35Z

After stopping the container service, certain use cases
may still retain references to the container's child objects,
such as ongoing IO operations. To prevent errors during container
destruction, ensure that the container is only destroyed after
all dependent operations have completed.

Fix daos_lru_cache_create() to return an error if D_HASH_FT_EPHEMERAL
is passed in.

Required-githooks: true
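
As an illustration of the second part of the description, here is a minimal sketch of a cache constructor that rejects an "ephemeral" feature flag. It is not the DAOS code: `toy_lru_cache_create()`, `TOY_FT_RWLOCK`, `TOY_FT_EPHEMERAL`, and the error values are hypothetical stand-ins, and the sketch assumes (as the review thread describes) that the flag means the table takes no reference of its own when an item is inserted.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical feature bits; stand-ins for the real hash-table flags. */
#define TOY_FT_RWLOCK    (1u << 0)
#define TOY_FT_EPHEMERAL (1u << 1) /* assumed: table holds no reference of its own */

/* Hypothetical error codes. */
#define TOY_ERR_INVAL (-22)
#define TOY_ERR_NOMEM (-12)

struct toy_lru_cache {
	uint32_t feats;
};

/*
 * Create a cache. The release path of this toy LRU assumes "in cache"
 * always counts as one reference, so a flag that breaks that assumption
 * is rejected up front instead of being silently accepted.
 */
int
toy_lru_cache_create(uint32_t feats, struct toy_lru_cache **out)
{
	struct toy_lru_cache *cache;

	if (out == NULL)
		return TOY_ERR_INVAL;
	*out = NULL;

	if (feats & TOY_FT_EPHEMERAL)
		return TOY_ERR_INVAL;

	cache = calloc(1, sizeof(*cache));
	if (cache == NULL)
		return TOY_ERR_NOMEM;

	cache->feats = feats;
	*out = cache;
	return 0;
}
```

Failing at creation time keeps the reference-count invariant in one place, so later code that waits for the last user does not need to know how the cache was configured.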

github-actions · 2024-02-22T13:18:54Z

Bug-tracker data:
Ticket title is 'daos_test/suite.py:DaosCoreTest.test_daos_rebuild_simple - timeout waiting for rebuild'
Status is 'In Progress'
Labels: 'ci_impact,pr_test,scrubbed,triaged'
https://daosio.atlassian.net/browse/DAOS-15124

src/container/srv_target.c

github-actions · 2024-03-18T07:15:42Z

Ticket title is '1-./erasurecode/online_rebuild.py:EcodOnlineRebuild.test_ec_online_rebuild, 1-./ior/hard_rebuild.py:EcodIorHardRebuild.test_ec_ior_hard_online_rebuild tests fail due to container destroy error (-2001).'
Status is 'In Review'
Labels: 'ci_impact,md_on_ssd,triaged,weekly_test'
https://daosio.atlassian.net/browse/DAOS-15863

NiuYawei · 2024-03-21T06:42:17Z

src/container/srv_target.c

-		if (cont->sc_rebuilding)
-			ABT_cond_wait(cont->sc_rebuild_cond, cont->sc_mutex);
+		if (!daos_lru_is_last_user(&cont->sc_list))
+			ABT_cond_wait(cont->sc_fini_cond, cont->sc_mutex);


Do we abort DTX resync and rebuild on container destroy, or just wait for them to finish here?

I think if DTX resync and rebuild hold the container only for a short period, we could just retry the loop in this function a few more times; if they hold the container for a long period, we should tell them to abort.

It doesn't look necessary to introduce sc_fini_cond and hack cont_child_put()?

I think we abort resync, but rebuild is a bit different because it is pool-based rather than container-based. The current logic is that if we find the container is being destroyed, we skip the scan/pull for that container. But it might take a bit longer if the container is already held; we need to wait for the existing operations to finish.

I think the rebuild code needs to be changed to abort the currently iterating jobs (for scan or pull)?

Nasf-Fan · 2024-03-21T06:34:16Z

src/container/srv_target.c

 {
+	bool		 wake_cond = false;
+
+	if (cont->sc_stopping == true && daos_lru_ref_count(&cont->sc_list) == 3)


Why wake up only when the reference count hits 3? If there are more users, will only the second-to-last user (for whom the reference count is 3 at that time) trigger the wakeup? Why not wake up whenever "sc_stopping" is set? That seems safer.

Because cont_child_destroy_one() holds one extra reference, and the cont ref is initialized to 1, a ref count of 2 means there are no other references to this container.

The logic is: first we set sc_stopping on the container, so new lookups fail.
But we still wait for the existing references to be released, which is why we cannot wake up merely because sc_stopping is set. Does that make sense?
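
The arithmetic behind the `== 3` check can be sketched with a toy counter. This is only a model of the reasoning in the reply above, under the assumptions that the cache itself owns one reference, the destroyer owns one more, and the check runs before the decrement; `toy_cont`, `toy_cont_get()`, and `toy_cont_put()` are hypothetical names, not DAOS functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a cached container child. */
struct toy_cont {
	int  ref;      /* starts at 1: the cache's own reference */
	bool stopping; /* set once destroy has begun */
	int  wakeups;  /* how many times the destroyer was signalled */
};

void
toy_cont_get(struct toy_cont *c)
{
	c->ref++;
}

/*
 * Drop one reference. The check is made BEFORE the decrement, so
 * ref == 3 here means: cache (1) + destroyer (1) + this caller (1).
 * After this put only the cache and the destroyer remain (ref == 2),
 * which is the moment the waiting destroyer can proceed.
 */
void
toy_cont_put(struct toy_cont *c)
{
	bool wake = c->stopping && c->ref == 3;

	assert(c->ref > 1);
	c->ref--;
	if (wake)
		c->wakeups++;
}
```

With two in-flight users, the first put takes the count from 4 to 3 and signals nothing; only the second put, which sees 3 before decrementing, signals the destroyer.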

I think we should try to avoid introducing more ad-hoc reference-checking code like this.

I think the '3' means we assume the daos_lru_cache for ds_cont_child is always created without the D_HASH_FT_EPHEMERAL flag (which means no reference is taken when adding to the cache); I think we'd better not make such an assumption.

It looks to me that the current daos_lru code is flawed: daos_lru_ref_release() assumes that being 'in cache' holds one reference, but daos_lru_cache_create() doesn't treat D_HASH_FT_EPHEMERAL as an invalid feature for daos_lru.

I suggest implementing a common 'wait for the last reference in daos_lru_ref_release()' mechanism for daos_lru. For example, introduce two more callbacks in daos_llink_ops, one for 'wakeup' and the other for 'wait'; they could be called in daos_lru_ref_release() accordingly.
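
One way to picture the suggested callback pair is the single-threaded sketch below. It is an assumption-laden model, not the merged implementation: all `toy_*` names are hypothetical, the "wait" callback only records that the destroyer would block (real code would block on a condition variable), and "last user" is assumed to mean that only the cache and the waiter still hold references.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_llink;

/* Hypothetical per-item callbacks supplied by the cache user. */
struct toy_llink_ops {
	void (*wait)(struct toy_llink *l);   /* block until woken */
	void (*wakeup)(struct toy_llink *l); /* wake the blocked waiter */
};

struct toy_llink {
	int                         ref;            /* includes the cache's reference */
	bool                        wait_evict;     /* a destroyer is waiting */
	bool                        waiter_blocked; /* toy stand-in for a blocked ULT */
	const struct toy_llink_ops *ops;
};

static void
toy_wait(struct toy_llink *l)
{
	l->waiter_blocked = true; /* real code would wait on a condition here */
}

static void
toy_wakeup(struct toy_llink *l)
{
	l->waiter_blocked = false;
}

const struct toy_llink_ops toy_ops = { .wait = toy_wait, .wakeup = toy_wakeup };

/* Only the cache and the caller (the waiter) hold references. */
bool
toy_is_last_user(const struct toy_llink *l)
{
	return l->ref <= 2;
}

/* Generic release: the wakeup decision lives in the LRU, not in the caller. */
void
toy_ref_release(struct toy_llink *l)
{
	assert(l->ref > 1);
	l->ref--;
	if (l->wait_evict && toy_is_last_user(l))
		l->ops->wakeup(l);
}

/* Called by the destroyer, which already holds its own reference. */
void
toy_ref_wait_evict(struct toy_llink *l)
{
	l->wait_evict = true;
	if (!toy_is_last_user(l))
		l->ops->wait(l);
}
```

The point of the design is that the reference-count threshold is defined once, next to the code that owns the count, rather than as a literal in each caller.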

BTW, if this ticket isn't urgent, should we use this opportunity to try to incorporate some of the changes we discussed in the I/O meeting? Thanks.

Nasf-Fan · 2024-03-21T06:53:01Z

src/container/srv_target.c

+		 * This might be racy, as dtx_resync() might yield after
+		 * ds_cont_child_lookup(), but before @sc_dtx_resyncing set,
+		 * we use @sc_fini_cond to guarantee all users exit properly.
+		 */


But it seems that some ULTs holding a container reference (such as DTX resync) do not check the sc_stopping flag, so the current destroy will block here unconditionally until the related ULT releases its reference. Is that expected? I would say no.

As I understand it, if the container is to be destroyed, all users holding a reference on the container should stop their related processing. So we need to add some checks in the related holders.

@Nasf-Fan I think DTX does check sc_stopping: it calls ds_cont_child_lookup() internally, and ds_cont_child_lookup() checks sc_stopping and returns an error if it has been set.
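
The behaviour described in that reply, a lookup that refuses to hand out new references once the stopping flag is set, might look roughly like the sketch below. The names `toy_cont_child`, `toy_cont_child_lookup()`, and `TOY_ERR_SHUTDOWN` are hypothetical, and the real lookup does considerably more (cache search, allocation on miss).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_ERR_SHUTDOWN (-1) /* hypothetical error code */

struct toy_cont_child {
	int  ref;
	bool stopping;
};

/*
 * Hand out a new reference unless the container is being stopped.
 * Existing holders are unaffected; only NEW lookups are refused,
 * which is what lets the destroyer eventually see the count drain.
 */
int
toy_cont_child_lookup(struct toy_cont_child *cached, struct toy_cont_child **out)
{
	if (cached->stopping) {
		*out = NULL;
		return TOY_ERR_SHUTDOWN;
	}
	cached->ref++;
	*out = cached;
	return 0;
}
```

Note that this gate does not by itself address the concern raised above: a holder that obtained its reference before the flag was set keeps it until it releases voluntarily.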

After stopping the container service, certain user cases may still retain references to the container's child objects, such as ongoing IO operations. To prevent errors during container destruction, ensure that the container is only destroyed after all dependent operations have completed. Fix daos_lru_cache_create() to return an error if D_HASH_FT_EPHEMERAL is passed in. Required-githooks: true Signed-off-by: Wang Shilong <shilong.wang@intel.com>

src/common/lru.c

Required-githooks: true Signed-off-by: Wang Shilong <shilong.wang@intel.com>

daosbuild1 · 2024-06-22T02:08:29Z

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13852/7/execution/node/1551/log

Nasf-Fan · 2024-06-24T15:23:21Z

src/container/srv_target.c


-		/* If it is the last user, ds_cont_child will be removed from hash & freed. */
-		cont_child_put(tls->dt_cont_cache, cont);
+		daos_lru_ref_wait_evict(tls->dt_cont_cache, &cont->sc_list);


How can we prevent newly arriving IO during the wait? Although we set the "stopping" flag on the cont_child, the related check comes after ds_cont_find_hdl(), which may be triggered by obj_ioc_init(), find the ds_cont_hdl in the cache, and call lru_hop_rec_addref(), triggering your newly added assertion. This is just from code analysis and may be wrong. In any case, it would be better to add a test if possible.

After discussing with shilong, it seems to work well.

Nasf-Fan · 2024-06-25T01:49:12Z

src/container/srv_target.c


-		/* If it is the last user, ds_cont_child will be removed from hash & freed. */
-		cont_child_put(tls->dt_cont_cache, cont);
+		daos_lru_ref_wait_evict(tls->dt_cont_cache, &cont->sc_list);


After discussing with shilong, it seems to work well.


daosbuild1 · 2024-06-28T15:31:22Z

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13852/9/execution/node/1505/log

daosbuild1 · 2024-07-03T14:25:16Z

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13852/10/execution/node/547/log

daosbuild1 · 2024-07-03T21:32:31Z

Test stage Functional Hardware Medium UCX Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13852/10/execution/node/798/log


The spelling error has been fixed.


* DAOS-15863 container: fix container destroy error (#13852) After stopping the container service, certain user cases may still retain references to the container's child objects, such as ongoing IO operations. To prevent errors during container destruction, ensure that the container is only destroyed after all dependent operations have completed. Signed-off-by: Wang Shilong <shilong.wang@intel.com>

jolivier23 previously requested changes Mar 15, 2024


src/container/srv_target.c (outdated, resolved)

wangshilong marked this pull request as ready for review March 21, 2024 03:13

wangshilong requested review from a team as code owners March 21, 2024 03:13

wangshilong requested review from Nasf-Fan and NiuYawei March 21, 2024 03:14

NiuYawei reviewed Mar 21, 2024


Nasf-Fan reviewed Mar 21, 2024


wangshilong requested review from Nasf-Fan and NiuYawei June 13, 2024 01:57

wangshilong changed the title ~~DAOS-15124 container: fix container destroy error~~ DAOS-15863 container: fix container destroy error Jun 13, 2024

wangshilong force-pushed the shilongw/DAOS-15124 branch from c62aeef to 77794c3 Compare June 19, 2024 13:19

wangshilong requested a review from a team as a code owner June 19, 2024 13:19

wangshilong force-pushed the shilongw/DAOS-15124 branch from 77794c3 to 23e0506 Compare June 20, 2024 04:42

NiuYawei reviewed Jun 20, 2024


src/common/lru.c (outdated, resolved)

type fix

52b8bd7

Required-githooks: true Signed-off-by: Wang Shilong <shilong.wang@intel.com>

wangshilong requested a review from NiuYawei June 20, 2024 14:18

NiuYawei approved these changes Jun 21, 2024


wangshilong requested a review from jolivier23 June 24, 2024 05:53

Nasf-Fan reviewed Jun 24, 2024


Nasf-Fan approved these changes Jun 25, 2024


Merge branch 'master' of github.com:daos-stack/daos into shilongw/DAO…

1f48da7

…S-15124

Merge branch 'master' of github.com:daos-stack/daos into shilongw/DAO…

7a67c3c

…S-15124

wangshilong requested a review from a team July 9, 2024 01:22

NiuYawei merged commit 050681c into master Jul 9, 2024

NiuYawei deleted the shilongw/DAOS-15124 branch July 9, 2024 01:47

mjmac mentioned this pull request Aug 7, 2024

Merge upstream/release/2.6 into upstream/google/2.6 #14891

Merged

mjmac mentioned this pull request Nov 13, 2024

mjmac/DAOS 16787 google 2.6 #15498

Closed

mjmac mentioned this pull request Mar 18, 2025

dev/mjmac/b403616653 #16115

Closed
