-
Notifications
You must be signed in to change notification settings - Fork 340
DAOS-15863 container: fix container destroy error #13852
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Bug-tracker data: |
|
Ticket title is '1-./erasurecode/online_rebuild.py:EcodOnlineRebuild.test_ec_online_rebuild, 1-./ior/hard_rebuild.py:EcodIorHardRebuild.test_ec_ior_hard_online_rebuild tests fail due to container destroy error (-2001).' |
src/container/srv_target.c
Outdated
| if (cont->sc_rebuilding) | ||
| ABT_cond_wait(cont->sc_rebuild_cond, cont->sc_mutex); | ||
| if (!daos_lru_is_last_user(&cont->sc_list)) | ||
| ABT_cond_wait(cont->sc_fini_cond, cont->sc_mutex); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we abort dtx resync & rebuild on container destroy or just wait for them done here?
I think if the dtx resync & rebuild just hold container for a short period, we could just retry the loop in function for few more times, if they hold container for a long period, we'd inform them to abort.
Looks not necessary to introduce this sc_fini_cond and hack the cont_child_put()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we abort resync, but for rebuild is a bit different because it is not based on container but based on pool. now the logic if we found container is destroying, we will skip that container scan/pulling before. but there that might take a bit longer if container has been held, we need wait existed operation finished..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the rebuild code needs be changed to abort current iterating jobs (for scan or pull)?
src/container/srv_target.c
Outdated
| { | ||
| bool wake_cond = false; | ||
|
|
||
| if (cont->sc_stopping == true && daos_lru_ref_count(&cont->sc_list) == 3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why wakeup only when hit the last three reference? If there are more users, then only the last 2nd user (which reference is 3 at that time) to trigger the wakeup? Why not wakeup as long as "sc_stopping" is set? that seems more safe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because in cont_child_destroy_one() it hold one extra refrences, and conf ref is inited as 1. so if ref count is 2 that means there is no other references for this containers.
Logic is like firstly we set sc_stopping for containers, and new lookup will fail.
but we wait existed references finish, that is why we could not wake up as long as sc_stopping is set. makes sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we'd try to avoid introducing such ad-hoc reference checking code further.
I think the '3' means we assume the daos_lru_cache for ds_cont_child is always created without D_HASH_FT_EPHEMERAL (no reference increasing when adding to cache) flag, I think we'd better don't make such assumption.
It looks to me current daos_lru code is flawed: in daos_lru_ref_release() it assumes 'in cache' holds one reference, but daos_lru_cache_create() doesn't regard D_HASH_FT_EPHEMERAL as an invalid feature for daos_lru.
I suggest implementing the common 'waiting for the last reference on daos_lru_ref_release()' for daos_lru, for example: introduce two more callbacks in daos_llink_ops, one for 'wakeup' and the other for 'wait', they could be called in daos_lru_ref_release() accordingly.
BTW, if this ticket isn't urgent, should we use this opportunity try to incorporate some the changes for what we discussed on I/O meeting? Thanks.
src/container/srv_target.c
Outdated
| * This might be racy, as dtx_resync() might yield after | ||
| * ds_cont_child_lookup(), but before @sc_dtx_resyncing set, | ||
| * we use @sc_fini_cond to guarantee all users exit properly. | ||
| */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But it seems that some ULT (such as DTX resync) holding the container reference does not check sc_stopping flag, as to current destroy will be blocked here until related ULT release the reference without condition. Is it expected? I would say no.
As my understand, if the container to be destroyed, all the users holding reference on the container should stop related process. So we need add some check in related holders.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Nasf-Fan I think DTX will check sc_stopping, it call ds_cont_child_lookup() internally and inside ds_cont_child_lookup() it will check sc_stopping and return error if sc_stopping has been set.
c62aeef to
77794c3
Compare
After stopping the container service, certain user cases may still retain references to the container's child objects, such as ongoing IO operations. To prevent errors during container destruction, ensure that the container is only destroyed after all dependent operations have completed. Fix daos_lru_cache_create() to return an error if D_HASH_FT_EPHEMERAL is passed in. Required-githooks: true Signed-off-by: Wang Shilong <shilong.wang@intel.com>
77794c3 to
23e0506
Compare
|
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13852/7/execution/node/1551/log |
|
|
||
| /* If it is the last user, ds_cont_child will be removed from hash & freed. */ | ||
| cont_child_put(tls->dt_cont_cache, cont); | ||
| daos_lru_ref_wait_evict(tls->dt_cont_cache, &cont->sc_list); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How can we prevent new coming IO during the wait? Although we set "stopping" flag on the cont_child, but related check is after ds_cont_find_hdl() that may be triggered by obj_ioc_init() and find the ds_cont_hdl in cache and call lru_hop_rec_addref() as to trigger your new added assertion? Just some code analysis, maybe wrong. Anyway it is better to give some test if possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After discussed with shilong, it seems work well.
|
|
||
| /* If it is the last user, ds_cont_child will be removed from hash & freed. */ | ||
| cont_child_put(tls->dt_cont_cache, cont); | ||
| daos_lru_ref_wait_evict(tls->dt_cont_cache, &cont->sc_list); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After discussed with shilong, it seems work well.
|
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13852/9/execution/node/1505/log |
|
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13852/10/execution/node/547/log |
|
Test stage Functional Hardware Medium UCX Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13852/10/execution/node/798/log |
After stopping the container service, certain user cases may still retain references to the container's child objects, such as ongoing IO operations. To prevent errors during container destruction, ensure that the container is only destroyed after all dependent operations have completed. Fix daos_lru_cache_create() to return an error if D_HASH_FT_EPHEMERAL is passed in. Signed-off-by: Wang Shilong <shilong.wang@intel.com>
After stopping the container service, certain user cases may still retain references to the container's child objects, such as ongoing IO operations. To prevent errors during container destruction, ensure that the container is only destroyed after all dependent operations have completed. Fix daos_lru_cache_create() to return an error if D_HASH_FT_EPHEMERAL is passed in. Signed-off-by: Wang Shilong <shilong.wang@intel.com>
After stopping the container service, certain user cases may still retain references to the container's child objects, such as ongoing IO operations. To prevent errors during container destruction, ensure that the container is only destroyed after all dependent operations have completed. Fix daos_lru_cache_create() to return an error if D_HASH_FT_EPHEMERAL is passed in. Signed-off-by: Wang Shilong <shilong.wang@intel.com>
After stopping the container service, certain user cases may still retain references to the container's child objects, such as ongoing IO operations. To prevent errors during container destruction, ensure that the container is only destroyed after all dependent operations have completed. Fix daos_lru_cache_create() to return an error if D_HASH_FT_EPHEMERAL is passed in. Signed-off-by: Wang Shilong <shilong.wang@intel.com>
* DAOS-15863 container: fix container destroy error (#13852) After stopping the container service, certain user cases may still retain references to the container's child objects, such as ongoing IO operations. To prevent errors during container destruction, ensure that the container is only destroyed after all dependent operations have completed. Signed-off-by: Wang Shilong <shilong.wang@intel.com>
* DAOS-15863 container: fix container destroy error (#13852) After stopping the container service, certain user cases may still retain references to the container's child objects, such as ongoing IO operations. To prevent errors during container destruction, ensure that the container is only destroyed after all dependent operations have completed. Signed-off-by: Wang Shilong <shilong.wang@intel.com>
After stopping the container service, certain user cases
may still retain references to the container's child objects,
such as ongoing IO operations. To prevent errors during container
destruction, ensure that the container is only destroyed after
all dependent operations have completed.
Fix daos_lru_cache_create() to return an error if D_HASH_FT_EPHEMERAL
is passed in.
Required-githooks: true