osd/ReplicatedBackend: reset thread heartbeat after every omap entry in deep-scrub #15823

Merged

merged 1 commit into ceph:master from jdurgin:wip-omap-tp-heartbeat on Jul 4, 2017

Conversation

@jdurgin (Member) commented Jun 21, 2017

osd/ReplicatedBackend: reset thread heartbeat after every omap entry in deep-scrub

Doing this every 100 entries could be after 100MB of reads. There's
little cost to reset this, so remove the option for configuring it.

This reduces the likelihood of crashing the osd due to too many omap
values on an object.

Fixes: http://tracker.ceph.com/issues/20375
Signed-off-by: Josh Durgin <jdurgin@redhat.com>
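
For context, the shape of the change is simple: the heartbeat reset moves from a counter-gated branch into the body of the per-entry omap loop in deep-scrub. The sketch below is a simplified, self-contained illustration rather than the actual ReplicatedBackend code; `TPHandle` here is a stand-in for the OSD thread-pool handle, and `deep_scrub_reset_every` is a hypothetical name for the removed option.

```cpp
#include <string>
#include <utility>
#include <vector>

// Stand-in for the OSD thread-pool handle: resetting it tells the heartbeat
// watchdog that this thread is still making progress.
struct TPHandle {
  void reset_tp_timeout() { /* refresh the per-thread heartbeat deadline */ }
};

// Sketch of the deep-scrub omap walk. The container and names are
// placeholders, not the real Ceph identifiers.
void deep_scrub_omap(const std::vector<std::pair<std::string, std::string>> &omap,
                     TPHandle &handle) {
  // Before: reset only every N keys (configurable, e.g. deep_scrub_reset_every).
  // With large values, 100 keys can mean ~100MB of reads between resets, long
  // enough to trip the heartbeat timeout when the backing store stalls
  // (e.g. during a leveldb compaction):
  //
  //   if (++keys % deep_scrub_reset_every == 0)
  //     handle.reset_tp_timeout();
  //
  // After: reset on every entry; the call is cheap next to the read itself.
  for (const auto &kv : omap) {
    handle.reset_tp_timeout();
    // ... hash kv.first and kv.second into the scrub digest ...
    (void)kv;
  }
}
```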

@gregsfortytwo (Member) commented Jun 22, 2017

It's not hugely expensive, but it does have to get the current time on every invocation, and isn't the common case for that iter->next() call just reading one short string out of memory? Even 100 1MB calls shouldn't take the many tens of seconds a timeout requires, right?

If nothing else, maybe we should just turn down the default config; I think we'll want to undo this change on fast NVRAM storage.

@gregsfortytwo (Member) commented Jun 22, 2017

jenkins retest this please

@jdurgin (Member) commented Jun 22, 2017

@gregsfortytwo leveldb compaction can cause large latency spikes, such that hitting timeouts like this is possible with 100 reads. time(2) is only second-accurate and inexpensive anyway - 1M calls take <0.01s.
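
As a sanity check on that figure (not part of the PR), a minimal standalone benchmark along these lines shows the order of magnitude; the exact number depends on the machine and on whether time(2) is served from the vDSO, but it is nowhere near the per-key I/O cost.

```cpp
#include <chrono>
#include <cstdio>
#include <ctime>

// Rough micro-benchmark: call time(2) a million times and report the total
// wall-clock cost. Treat the result as an order-of-magnitude check only.
int main() {
  constexpr int N = 1'000'000;
  volatile time_t sink = 0;  // keep the calls from being optimized away

  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < N; ++i)
    sink = time(nullptr);
  auto end = std::chrono::steady_clock::now();

  std::chrono::duration<double> elapsed = end - start;
  std::printf("%d time() calls took %.4f s (last value %lld)\n",
              N, elapsed.count(), static_cast<long long>(sink));
  return 0;
}
```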

@liewegas liewegas added the needs-qa label Jun 27, 2017

@liewegas commented:

could also just change the tunable to 1? either way is fine with me.

@tchaikov (Contributor) commented Jun 30, 2017

retest this please

@tchaikov tchaikov merged commit a5471e7 into ceph:master Jul 4, 2017

4 of 5 checks passed:

- arm64 make check: failed
- Signed-off-by: all commits in this PR are signed
- Unmodified Submodules: submodules for project are unmodified
- default: Build finished
- make check: make check succeeded

@jdurgin jdurgin deleted the jdurgin:wip-omap-tp-heartbeat branch Jul 18, 2017
