Skip to content

Conversation

@wangshilong
Copy link
Contributor

@wangshilong wangshilong commented Feb 22, 2024

After stopping the container service, certain user cases
may still retain references to the container's child objects,
such as ongoing IO operations. To prevent errors during container
destruction, ensure that the container is only destroyed after
all dependent operations have completed.

Fix daos_lru_cache_create() to return an error if D_HASH_FT_EPHEMERAL
is passed in.

Required-githooks: true

@github-actions
Copy link

Bug-tracker data:
Ticket title is 'daos_test/suite.py:DaosCoreTest.test_daos_rebuild_simple - timeout waiting for rebuild'
Status is 'In Progress'
Labels: 'ci_impact,pr_test,scrubbed,triaged'
https://daosio.atlassian.net/browse/DAOS-15124

@github-actions
Copy link

github-actions bot commented Mar 18, 2024

Ticket title is '1-./erasurecode/online_rebuild.py:EcodOnlineRebuild.test_ec_online_rebuild, 1-./ior/hard_rebuild.py:EcodIorHardRebuild.test_ec_ior_hard_online_rebuild tests fail due to container destroy error (-2001).'
Status is 'In Review'
Labels: 'ci_impact,md_on_ssd,triaged,weekly_test'
https://daosio.atlassian.net/browse/DAOS-15863

@wangshilong wangshilong marked this pull request as ready for review March 21, 2024 03:13
@wangshilong wangshilong requested review from a team as code owners March 21, 2024 03:13
if (cont->sc_rebuilding)
ABT_cond_wait(cont->sc_rebuild_cond, cont->sc_mutex);
if (!daos_lru_is_last_user(&cont->sc_list))
ABT_cond_wait(cont->sc_fini_cond, cont->sc_mutex);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we abort dtx resync & rebuild on container destroy or just wait for them done here?

I think if the dtx resync & rebuild just hold container for a short period, we could just retry the loop in function for few more times, if they hold container for a long period, we'd inform them to abort.

Looks not necessary to introduce this sc_fini_cond and hack the cont_child_put()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we abort resync, but for rebuild is a bit different because it is not based on container but based on pool. now the logic if we found container is destroying, we will skip that container scan/pulling before. but there that might take a bit longer if container has been held, we need wait existed operation finished..

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the rebuild code needs be changed to abort current iterating jobs (for scan or pull)?

{
bool wake_cond = false;

if (cont->sc_stopping == true && daos_lru_ref_count(&cont->sc_list) == 3)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why wakeup only when hit the last three reference? If there are more users, then only the last 2nd user (which reference is 3 at that time) to trigger the wakeup? Why not wakeup as long as "sc_stopping" is set? that seems more safe.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because in cont_child_destroy_one() it hold one extra refrences, and conf ref is inited as 1. so if ref count is 2 that means there is no other references for this containers.

Logic is like firstly we set sc_stopping for containers, and new lookup will fail.
but we wait existed references finish, that is why we could not wake up as long as sc_stopping is set. makes sense?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we'd try to avoid introducing such ad-hoc reference checking code further.

I think the '3' means we assume the daos_lru_cache for ds_cont_child is always created without D_HASH_FT_EPHEMERAL (no reference increasing when adding to cache) flag, I think we'd better don't make such assumption.

It looks to me current daos_lru code is flawed: in daos_lru_ref_release() it assumes 'in cache' holds one reference, but daos_lru_cache_create() doesn't regard D_HASH_FT_EPHEMERAL as an invalid feature for daos_lru.

I suggest implementing the common 'waiting for the last reference on daos_lru_ref_release()' for daos_lru, for example: introduce two more callbacks in daos_llink_ops, one for 'wakeup' and the other for 'wait', they could be called in daos_lru_ref_release() accordingly.

BTW, if this ticket isn't urgent, should we use this opportunity try to incorporate some the changes for what we discussed on I/O meeting? Thanks.

* This might be racy, as dtx_resync() might yield after
* ds_cont_child_lookup(), but before @sc_dtx_resyncing set,
* we use @sc_fini_cond to guarantee all users exit properly.
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it seems that some ULT (such as DTX resync) holding the container reference does not check sc_stopping flag, as to current destroy will be blocked here until related ULT release the reference without condition. Is it expected? I would say no.

As my understand, if the container to be destroyed, all the users holding reference on the container should stop related process. So we need add some check in related holders.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Nasf-Fan I think DTX will check sc_stopping, it call ds_cont_child_lookup() internally and inside ds_cont_child_lookup() it will check sc_stopping and return error if sc_stopping has been set.

@wangshilong wangshilong requested review from Nasf-Fan and NiuYawei June 13, 2024 01:57
@wangshilong wangshilong changed the title DAOS-15124 container: fix container destroy error DAOS-15863 container: fix container destroy error Jun 13, 2024
@wangshilong wangshilong force-pushed the shilongw/DAOS-15124 branch from c62aeef to 77794c3 Compare June 19, 2024 13:19
@wangshilong wangshilong requested a review from a team as a code owner June 19, 2024 13:19
After stopping the container service, certain user cases
may still retain references to the container's child objects,
such as ongoing IO operations. To prevent errors during container
destruction, ensure that the container is only destroyed after
all dependent operations have completed.

Fix daos_lru_cache_create() to return an error if D_HASH_FT_EPHEMERAL
is passed in.

Required-githooks: true
Signed-off-by: Wang Shilong <shilong.wang@intel.com>
@wangshilong wangshilong force-pushed the shilongw/DAOS-15124 branch from 77794c3 to 23e0506 Compare June 20, 2024 04:42
Required-githooks: true

Signed-off-by: Wang Shilong <shilong.wang@intel.com>
@wangshilong wangshilong requested a review from NiuYawei June 20, 2024 14:18
@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13852/7/execution/node/1551/log

@wangshilong wangshilong requested a review from jolivier23 June 24, 2024 05:53

/* If it is the last user, ds_cont_child will be removed from hash & freed. */
cont_child_put(tls->dt_cont_cache, cont);
daos_lru_ref_wait_evict(tls->dt_cont_cache, &cont->sc_list);
Copy link
Contributor

@Nasf-Fan Nasf-Fan Jun 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can we prevent new coming IO during the wait? Although we set "stopping" flag on the cont_child, but related check is after ds_cont_find_hdl() that may be triggered by obj_ioc_init() and find the ds_cont_hdl in cache and call lru_hop_rec_addref() as to trigger your new added assertion? Just some code analysis, maybe wrong. Anyway it is better to give some test if possible.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After discussed with shilong, it seems work well.


/* If it is the last user, ds_cont_child will be removed from hash & freed. */
cont_child_put(tls->dt_cont_cache, cont);
daos_lru_ref_wait_evict(tls->dt_cont_cache, &cont->sc_list);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After discussed with shilong, it seems work well.

@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13852/9/execution/node/1505/log

@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13852/10/execution/node/547/log

@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium UCX Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13852/10/execution/node/798/log

@wangshilong wangshilong requested a review from a team July 9, 2024 01:22
@NiuYawei NiuYawei dismissed jolivier23’s stale review July 9, 2024 01:47

The spelling error has been fixed.

@NiuYawei NiuYawei merged commit 050681c into master Jul 9, 2024
@NiuYawei NiuYawei deleted the shilongw/DAOS-15124 branch July 9, 2024 01:47
wangshilong pushed a commit that referenced this pull request Jul 10, 2024
After stopping the container service, certain user cases
may still retain references to the container's child objects,
such as ongoing IO operations. To prevent errors during container
destruction, ensure that the container is only destroyed after
all dependent operations have completed.

Fix daos_lru_cache_create() to return an error if D_HASH_FT_EPHEMERAL
is passed in.

Signed-off-by: Wang Shilong <shilong.wang@intel.com>
daltonbohning pushed a commit that referenced this pull request Jul 19, 2024
After stopping the container service, certain user cases
may still retain references to the container's child objects,
such as ongoing IO operations. To prevent errors during container
destruction, ensure that the container is only destroyed after
all dependent operations have completed.

Fix daos_lru_cache_create() to return an error if D_HASH_FT_EPHEMERAL
is passed in.

Signed-off-by: Wang Shilong <shilong.wang@intel.com>
daltonbohning pushed a commit that referenced this pull request Jul 23, 2024
After stopping the container service, certain user cases
may still retain references to the container's child objects,
such as ongoing IO operations. To prevent errors during container
destruction, ensure that the container is only destroyed after
all dependent operations have completed.

Fix daos_lru_cache_create() to return an error if D_HASH_FT_EPHEMERAL
is passed in.

Signed-off-by: Wang Shilong <shilong.wang@intel.com>
grom72 pushed a commit to grom72/daos that referenced this pull request Jul 25, 2024
After stopping the container service, certain user cases
may still retain references to the container's child objects,
such as ongoing IO operations. To prevent errors during container
destruction, ensure that the container is only destroyed after
all dependent operations have completed.

Fix daos_lru_cache_create() to return an error if D_HASH_FT_EPHEMERAL
is passed in.

Signed-off-by: Wang Shilong <shilong.wang@intel.com>
gnailzenh pushed a commit that referenced this pull request Aug 2, 2024
* DAOS-15863 container: fix container destroy error (#13852)

After stopping the container service, certain user cases
may still retain references to the container's child objects,
such as ongoing IO operations. To prevent errors during container
destruction, ensure that the container is only destroyed after
all dependent operations have completed.

Signed-off-by: Wang Shilong <shilong.wang@intel.com>
daltonbohning pushed a commit that referenced this pull request Aug 2, 2024
* DAOS-15863 container: fix container destroy error (#13852)

After stopping the container service, certain user cases
may still retain references to the container's child objects,
such as ongoing IO operations. To prevent errors during container
destruction, ensure that the container is only destroyed after
all dependent operations have completed.

Signed-off-by: Wang Shilong <shilong.wang@intel.com>
@mjmac mjmac mentioned this pull request Mar 18, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

6 participants