DAOS-11092 bio: restrict inflight ios #9667

Merged
jolivier23 merged 1 commit into master from niu/DAOS-11092 on Aug 5, 2022

@NiuYawei NiuYawei commented Jul 12, 2022

Stop issuing new NVMe I/O when the per-xstream inflight IOs reaches
a threshold.
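
A minimal sketch of the throttling idea in C, assuming a single-threaded per-xstream counter. Every name here (`struct xs_io_ctx`, `XS_INFLIGHT_MAX`, `xs_submit_io()`, `xs_poll_completions()`, `xs_drain()`) and the threshold value are hypothetical; the PR does not say how the limit is implemented or what it defaults to. Completion polling is simulated inline, whereas a real DAOS xstream would presumably yield so its NVMe poller can reap completions.

```c
#include <assert.h>

/* Hypothetical threshold: the real limit and its name are not given in the PR. */
#define XS_INFLIGHT_MAX 4

/* Hypothetical per-xstream I/O context. Each xstream owns its counter and is
 * the only one touching it, so plain (non-atomic) integers suffice. */
struct xs_io_ctx {
	unsigned int inflight;   /* submitted but not yet completed */
	unsigned int completed;  /* completions reaped so far */
	unsigned int peak;       /* highest inflight count observed */
};

/* Stand-in for polling the NVMe completion queue: pretend that one
 * outstanding I/O finishes per poll. */
static void
xs_poll_completions(struct xs_io_ctx *ctx)
{
	if (ctx->inflight > 0) {
		ctx->inflight--;
		ctx->completed++;
	}
}

/* Issue one I/O, but first reap completions while at the threshold, so the
 * xstream never has more than XS_INFLIGHT_MAX I/Os outstanding. */
static void
xs_submit_io(struct xs_io_ctx *ctx)
{
	while (ctx->inflight >= XS_INFLIGHT_MAX)
		xs_poll_completions(ctx);

	assert(ctx->inflight < XS_INFLIGHT_MAX);
	ctx->inflight++;          /* real code would hand the I/O to SPDK here */
	if (ctx->inflight > ctx->peak)
		ctx->peak = ctx->inflight;
}

/* Drain all outstanding I/Os, e.g. before the xstream shuts down. */
static void
xs_drain(struct xs_io_ctx *ctx)
{
	while (ctx->inflight > 0)
		xs_poll_completions(ctx);
}
```

Because each xstream owns its counter, no atomics or locks are needed; blocking at submission time turns the threshold into back-pressure on the caller instead of an ever-growing device queue.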

Confine the number of batched blob unmap calls in bio_blob_unmap_sgl()
to work around the mysterious segfault. The default max unmap count is
set to 32, and it can be configured through DAOS_SPDK_MAX_UNMAP_CNT.
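
The unmap cap can likewise be sketched in C. `DAOS_SPDK_MAX_UNMAP_CNT` and the default of 32 come from the description above; the helper names (`unmap_max_cnt_parse()`, `unmap_max_cnt_from_env()`, `unmap_in_batches()`) and the validation rules (fall back to the default on an unset, non-numeric, out-of-range, or zero value) are assumptions, and issuing an unmap is only counted here rather than calling SPDK.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Default taken from the PR description. */
#define UNMAP_CNT_DEFAULT 32UL

/* Parse a DAOS_SPDK_MAX_UNMAP_CNT value. Falling back to the default on a
 * missing, non-numeric, out-of-range, or zero value is an assumption. */
static unsigned long
unmap_max_cnt_parse(const char *val)
{
	char *end;
	unsigned long cnt;

	/* Require a leading digit: rejects NULL, "", whitespace and "-5". */
	if (val == NULL || *val < '0' || *val > '9')
		return UNMAP_CNT_DEFAULT;

	errno = 0;
	cnt = strtoul(val, &end, 10);
	if (errno != 0 || *end != '\0' || cnt == 0)
		return UNMAP_CNT_DEFAULT;
	return cnt;
}

/* Hypothetical wrapper reading the environment variable named in the PR. */
static unsigned long
unmap_max_cnt_from_env(void)
{
	return unmap_max_cnt_parse(getenv("DAOS_SPDK_MAX_UNMAP_CNT"));
}

/* Issue `nr` unmaps, waiting for the outstanding batch whenever `max_cnt`
 * are in flight. Issuing and waiting are simulated by counters; returns the
 * number of batches and stores the total issued in *issued. */
static unsigned long
unmap_in_batches(unsigned long nr, unsigned long max_cnt,
		 unsigned long *issued)
{
	unsigned long batches = 0;
	unsigned long inflight = 0;

	assert(max_cnt > 0);
	*issued = 0;
	for (unsigned long i = 0; i < nr; i++) {
		inflight++;             /* real code: one SPDK blob unmap call */
		(*issued)++;
		if (inflight == max_cnt || i == nr - 1) {
			/* real code: wait for this batch to complete */
			inflight = 0;
			batches++;
		}
	}
	return batches;
}
```

Reading the variable once at startup and caching the result would avoid repeated `getenv()` calls; either way, the cap bounds how many unmap calls are batched at once, which is what the commit message describes.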

Test-tag: pr ec_aggregation_default ec_aggregation_time ec_online_rebuild_mdtest

Required-githooks: true

Signed-off-by: Niu Yawei <yawei.niu@intel.com>

@github-actions github-actions bot commented Jul 12, 2022

Bug-tracker data:
Ticket title is 'Segfault in libbio during EcodOnlineRebuildMdtest'
Status is 'In Review'
Labels: 'triaged'
Job should run at elevated priority (3)
https://daosio.atlassian.net/browse/DAOS-11092

@daosbuild1 daosbuild1 left a comment

LGTM. No errors found by checkpatch.

@daosbuild1

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-9667/3/display/redirect

@daosbuild1

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-9667/4/display/redirect

@NiuYawei NiuYawei changed the title from "DAOS-11092 debug: debug patch" to "DAOS-11092 vea: use smaller MAX_FLUSH_FRAGS" Jul 14, 2022
@NiuYawei NiuYawei marked this pull request as ready for review July 14, 2022 10:13
@NiuYawei

See my last comment in DAOS-11092. I'm not sure why this PR fixes the segfault; I still suspect a stack overrun from EC rebuild/aggregation. Let's investigate it further once Argobots is upgraded.

tanabarr previously approved these changes Jul 14, 2022
liuxuezhao previously approved these changes Jul 14, 2022
@NiuYawei NiuYawei dismissed stale reviews from liuxuezhao and tanabarr via de5f8b4 July 15, 2022 01:52
@NiuYawei

Added the missing 'pr' to Test-tag.

@daosbuild1

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-9667/7/execution/node/371/log

@daosbuild1

Test stage Build RPM on Leap 15 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-9667/7/execution/node/367/log

@daosbuild1

Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-9667/7/execution/node/410/log

@daosbuild1

Test stage Build on CentOS 7 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-9667/7/execution/node/386/log

@NiuYawei NiuYawei requested a review from liuxuezhao July 15, 2022 06:47
@NiuYawei NiuYawei removed the request for review from rpadma2 July 15, 2022 08:29
@daosbuild1

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-9667/10/execution/node/1368/log

@NiuYawei NiuYawei requested a review from a team as a code owner July 16, 2022 01:02
@github-actions github-actions bot added the priority and release-2.4 labels Jul 16, 2022
@NiuYawei NiuYawei changed the title from "DAOS-11092 vea: use smaller MAX_FLUSH_FRAGS" to "DAOS-11092 bio: restrict inflight ios" Aug 1, 2022
@NiuYawei NiuYawei commented Aug 5, 2022

Ping reviewers: this patch is ready for review now.

@jolivier23 jolivier23 merged commit 41ec716 into master Aug 5, 2022
@jolivier23 jolivier23 deleted the niu/DAOS-11092 branch August 5, 2022 19:47
knard38 pushed a commit that referenced this pull request Aug 17, 2022
Stop issuing new NVMe I/O when the per-xstream inflight IOs reaches
a threshold.

Confine the number of batched blob unmap calls in bio_blob_unmap_sgl()
to work around the mysterious segfault. The default max unmap count is
set to 32, and it can be configured through DAOS_SPDK_MAX_UNMAP_CNT.

Signed-off-by: Niu Yawei <yawei.niu@intel.com>

Labels

priority: Ticket has high priority (automatically managed)
release-2.4: PR is eventually targeted for 2.4

5 participants