DAOS-11936 object: fix online extending bugs #10604
daosbuild1 left a comment:
LGTM. No errors found by checkpatch.
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10604/1/execution/node/1083/log
71ac44f to c71b561
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10604/2/execution/node/1130/log
c71b561 to c13bd16
	 * So let's free the existent bulk, and recreate the bulk later.
	 */
	if (obj_auxi->io_retry && obj_auxi->bulks != NULL)
		obj_bulk_fini(obj_auxi);

Just to confirm for this bulk change: why do we need to recreate the bulk handle?
Note that crt_bulk_bind() binds the local context address to the bulk handle (not the remote target address), so I'm not sure why the bulk needs to be recreated here (for rebinding?).

Yes, this is for rebinding. In the OSA online case, the initial I/O might not need to bind the bulk because it is not forwarded, but the retry refreshes the pool map and may then need to forward, which requires binding the bulk.
Conversely, the initial I/O might need to forward while the retry does not.
So we simply delete the bulk, then recreate and re-bind it as needed.
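
The reasoning in this reply can be modeled with a short sketch. This is a simplified Python model, not the DAOS C code: `ObjAuxi`, `Bulk`, `bulk_prep` and the `need_forward` flag are hypothetical names for illustration, and `bound` only stands in for the effect of crt_bulk_bind(). It shows that freeing the previous handles on retry lets each attempt decide afresh whether to bind.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Bulk:
    """Stand-in for a bulk handle; `bound` models the effect of crt_bulk_bind()."""
    buf_id: int
    bound: bool = False


@dataclass
class ObjAuxi:
    """Hypothetical, simplified stand-in for the per-I/O state (obj_auxi)."""
    io_retry: bool = False
    bulks: Optional[List[Bulk]] = None


def bulk_fini(auxi: ObjAuxi) -> None:
    """Models obj_bulk_fini(): drop every existing bulk handle."""
    auxi.bulks = None


def bulk_prep(auxi: ObjAuxi, nr_bufs: int, need_forward: bool) -> None:
    """Create bulk handles for this attempt, binding them only when forwarding.

    On a retry the handles from the previous attempt are freed first, because
    the refreshed pool map may have changed whether forwarding is needed.
    """
    if auxi.io_retry and auxi.bulks is not None:
        bulk_fini(auxi)
    auxi.bulks = [Bulk(buf_id=i, bound=need_forward) for i in range(nr_bufs)]


# First attempt: no forwarding, so the bulks are created unbound.
auxi = ObjAuxi()
bulk_prep(auxi, nr_bufs=2, need_forward=False)
first_attempt = [b.bound for b in auxi.bulks]

# Retry after a pool map refresh: forwarding is now required.
auxi.io_retry = True
bulk_prep(auxi, nr_bufs=2, need_forward=True)
retry_attempt = [b.bound for b in auxi.bulks]
```

Recreating the handles unconditionally on retry covers both directions described above (unbound to bound, and bound to unbound) without tracking which transition occurred.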
	new_shards[grp_idx].po_rebuilding = 1;

-	if (f_shard->fs_status == PO_COMP_ST_UP)
+	if (f_shard->fs_status == PO_COMP_ST_UP || f_shard->fs_status == PO_COMP_ST_NEW)

Just to confirm: a reintegrating target should be in the PO_COMP_ST_UP state, right? (See update_one_tgt, case POOL_REINT.)
Is it possible to get a NEW shard here?

Oh, it can be NEW for extending targets.
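
The status check discussed in this thread can be illustrated with a small Python model. This is a sketch under assumptions: the excerpt does not show what the branch does, so the function name `shard_is_reintegrating` is hypothetical and follows the commit message's statement that EXTEND targets should also have reintegrating status. The enum values are illustrative rather than DAOS's numeric values, and `UPIN` is included only as a contrasting state that the thread does not mention.

```python
from enum import Enum, auto


class CompStatus(Enum):
    """Illustrative subset of pool component states; values are not DAOS's."""
    NEW = auto()   # models PO_COMP_ST_NEW: target being added by an extend
    UP = auto()    # models PO_COMP_ST_UP: target being reintegrated
    UPIN = auto()  # assumed contrasting state: target fully in service


def shard_is_reintegrating(fs_status: CompStatus) -> bool:
    """Mirror the widened check: UP (reintegration) or NEW (extension)."""
    return fs_status in (CompStatus.UP, CompStatus.NEW)


results = {status.name: shard_is_reintegrating(status) for status in CompStatus}
```

Before the change only the UP case would have matched, so a shard on an extending target in the NEW state would have been skipped by this check.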
    if app_name == "ior":
        self.run_ior_thread("Read", oclass, test_seq)

I'm not familiar with this test, but do we need something like this?
    else:
        self.run_mdtest_thread()

Oh, I think that read is there for verification. I am not sure whether there is a separate verification step for mdtest. @rpadma2

Presently, mdtest doesn't run separate write and read phases; it just runs on the same container, reading and writing together. We can enhance it in the future if needed. Only IOR is currently run as separate write and read steps.

We should run mdtest with read & stat to verify the tree. Let's create another ticket for this.
Test stage Functional Hardware Large completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-10604/3/testReport/(root)/
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10604/3/execution/node/1083/log
c13bd16 to fbeb466
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10604/4/execution/node/915/log
fbeb466 to 220316b
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10604/6/execution/node/916/log
1. Only check non-extending shards in obj_grp_valid_shard_get().
2. Recreate the bulk during retry, in case the bulk needs to be created and re-bound once the layout is refreshed.
3. allow_version is only valid if it is > 0.
4. EXTEND targets should also have reintegrating status.
5. Set pool rebuild version for the client between servers.

Required-githooks: true
Signed-off-by: Di Wang <di.wang@intel.com>
220316b to 3a0f9b2
Only check non-extending shards in obj_grp_valid_shard_get().
Recreate the bulk during retry, in case the bulk needs to be created and re-bound once the layout is refreshed.
Required-githooks: true
Signed-off-by: Di Wang <di.wang@intel.com>