Skip to content

DAOS-11936 object: fix online extending bugs#10604

Merged
johannlombardi merged 1 commit intomasterfrom
wangdi/daos_11936
Oct 25, 2022
Merged

DAOS-11936 object: fix online extending bugs#10604
johannlombardi merged 1 commit intomasterfrom
wangdi/daos_11936

Conversation

@wangdi1
Copy link
Contributor

@wangdi1 wangdi1 commented Oct 19, 2022

  1. Only checking non-extending shards in obj_grp_valid_shard_get().

  2. Recreate the bulk during retry, in case the bulk needs to be created and re-binded, once refreshing the layout.

Required-githooks: true

Signed-off-by: Di Wang di.wang@intel.com

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Commit messages follows the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate watchers.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficent testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket.
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

@github-actions
Copy link

github-actions bot commented Oct 19, 2022

Bug-tracker data:
Ticket title is 'ERROR: dfs_write(0x5591d7e8e000, 40000) failed (22): Invalid argument. during online extend test.'
Status is 'In Review'
Job should run at elevated priority (3)
https://daosio.atlassian.net/browse/DAOS-11936

@github-actions github-actions bot added priority Ticket has high priority (automatically managed) release-2.4 PR is eventually targeted for 2.4 labels Oct 19, 2022
Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10604/1/execution/node/1083/log

Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

@jolivier23 jolivier23 changed the title DAOS-11936 objects: fix online extending bugs DAOS-11936 object: fix online extending bugs Oct 20, 2022
jolivier23
jolivier23 previously approved these changes Oct 20, 2022
@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10604/2/execution/node/1130/log

Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

* So let's free the existent bulk, and recreate the bulk later.
*/
if (obj_auxi->io_retry && obj_auxi->bulks != NULL)
obj_bulk_fini(obj_auxi);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just confirm for this bulk change, why need to recreate the bulk handle?
note that for crt_bulk_bind() is it to bind local context address to the bulk handle (not remote target address), so not sure why need to recreate (for rebind?) here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, this is for rebinding. for osa online case. it is possible that the initial I/O will not need bind the bulk because it does not forward, but retry will refresh the pool map, then need forward (need bulk the bind).

On the other hand, the initial I/O might need forward, but then retry might not need forward.

So we might just delete the bulk and recreate and re-bind the bulk as needed.

new_shards[grp_idx].po_rebuilding = 1;

if (f_shard->fs_status == PO_COMP_ST_UP)
if (f_shard->fs_status == PO_COMP_ST_UP || f_shard->fs_status == PO_COMP_ST_NEW)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just confirm, a reintegrating tgt, it should be in PO_COMP_ST_UP state? (see update_one_tgt, case POOL_REINT)
Is it possible to get the NEW shard here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, it could be NEW for extending targets.

Comment on lines +145 to +146
if app_name == "ior":
self.run_ior_thread("Read", oclass, test_seq)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar with this test, but do we need something like this?

else:
    self.run_mdtest_thread()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, I think that is for verification as I thought. I am not sure if there are separate verification for mdtest? @rpadma2

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presently, mdtest doesn't do separate write/read. It just runs mdtest on the same container read/write together. That's all... In future, we can enhance it (if needed). IOR is only used as separate write and read now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should run mdtest with read & stat to verify the tree. Let's create another ticket for this.

@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Large completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-10604/3/testReport/(root)/

@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10604/3/execution/node/1083/log

Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10604/4/execution/node/915/log

Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10604/6/execution/node/916/log

1. Only checking non-extending shards in obj_grp_valid_shard_get().

2. Recreate the bulk during retry, in case the bulk needs to be
created and re-binded, once refreshing the layout.

3. allow_version only be valid if it is > 0.

4. EXTEND target should also be reintegrating status.

5. Set pool rebuild version for the client between servers.

Required-githooks: true

Signed-off-by: Di Wang <di.wang@intel.com>
Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

Comment on lines +145 to +146
if app_name == "ior":
self.run_ior_thread("Read", oclass, test_seq)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should run mdtest with read & stat to verify the tree. Let's create another ticket for this.

@johannlombardi johannlombardi merged commit 431340f into master Oct 25, 2022
@johannlombardi johannlombardi deleted the wangdi/daos_11936 branch October 25, 2022 14:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

priority Ticket has high priority (automatically managed) release-2.4 PR is eventually targeted for 2.4

Development

Successfully merging this pull request may close these issues.

7 participants