osd/PG: re-write of _update_calc_stats and improve pg degraded state #19850

Merged
dzafman merged 13 commits into ceph:master from dzafman:wip-calc-stats on Jan 16, 2018

3 participants
@dzafman
Member

dzafman commented Jan 9, 2018

@dzafman dzafman requested review from liewegas and jdurgin Jan 9, 2018

@dzafman

Member

dzafman commented Jan 9, 2018

Replacing pull request from #19327

@dzafman

Member

dzafman commented Jan 9, 2018

@liewegas @jdurgin Need to figure out the rados suite failures. Also, there are intermittent (timing?) test failures in the new qa/standalone/osd/osd-*-stats.sh tests.

@liewegas

Member

liewegas commented Jan 9, 2018

@dzafman I just noticed the error injection is broken on bluestore in master. See #19866

@dzafman

Member

dzafman commented Jan 12, 2018

Test results

dzafman-2018-01-11_21:21:49-rados-wip-calc-stats-distro-basic-smithi (--filter standalone -N 10)
2063205 qa/standalone/mon/misc.sh TEST_mon_features failed to get quorum

dzafman-2018-01-11_21:37:33-rados:thrash-wip-calc-stats-distro-basic-smithi
2063344 Bug #22656 scrub mismatch on bytes
2063430 ceph_test_rados core dump (need to get stack trace from core dump)
2063422 Clock sync
2063432 Clock sync
2063457 Clock sync

@dzafman

Member

dzafman commented Jan 14, 2018

Tests passed 38 runs of each of osd-backfill-stats.sh and osd-recovery-stats.sh using run-standalone.sh
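
Repeating a test many times, as with the 38 runs above, is a common way to shake out intermittent, timing-dependent failures. A minimal sketch of such a loop follows. The `count_failures` helper is hypothetical, and the actual run-standalone.sh invocation is not shown in this thread, so the command is left to the caller.

```python
import subprocess
import sys


def count_failures(cmd, runs):
    """Run cmd `runs` times and return how many runs exited non-zero."""
    failures = 0
    for _ in range(runs):
        # check=False: a failing run is counted rather than raised.
        result = subprocess.run(cmd, check=False)
        if result.returncode != 0:
            failures += 1
    return failures
```

Calling `count_failures(cmd, 38)` with the real test command as `cmd` would report how many of the 38 runs failed; zero corresponds to the result reported here.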

dzafman added some commits Oct 26, 2017

osd: cleanup: Fix log message
Signed-off-by: David Zafman <dzafman@redhat.com>
osd: cleanup: Remove unused const vars
Signed-off-by: David Zafman <dzafman@redhat.com>
osd: Rewrite _update_calc_stats() to make it cleaner and more accurate
Signed-off-by: David Zafman <dzafman@redhat.com>
osd: Base pg degraded state on num_degraded_objects
Signed-off-by: David Zafman <dzafman@redhat.com>
osd: Handling when recovery sources have missing
Signed-off-by: David Zafman <dzafman@redhat.com>
osd: Improve pg degraded state setting based on _update_calc_stats() degraded count
Signed-off-by: David Zafman <dzafman@redhat.com>
test: Verify stat calculations during recovery
Signed-off-by: David Zafman <dzafman@redhat.com>
osd: Improve the way insufficient targets is handled to be compatible with EC
Signed-off-by: David Zafman <dzafman@redhat.com>
test: Verify stat calculations during backfill
Signed-off-by: David Zafman <dzafman@redhat.com>
qa: Ignore degraded PGs when injecting random eio errors
Signed-off-by: David Zafman <dzafman@redhat.com>
osd: Don't start recovery for missing until active pg state set
I was seeing recovery hang when it is started before _activate_committed().
The state machine passes into "Active", but this transitions to the activating
pg state and, once committed, into the "active" pg state.

Signed-off-by: David Zafman <dzafman@redhat.com>
ceph-helpers.sh: Add flush_pg_stats() to wait_for_clean() to make it reliable

osd-scrub-repair.sh: Fixes for omap keys landing on different OSDs due to flush

Signed-off-by: David Zafman <dzafman@redhat.com>
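
The idea behind the ceph-helpers.sh change can be sketched as follows: flush PG stats first so that the poll does not act on stale numbers, then poll until no PG is unclean. This is a Python illustration, not the shell code from the commit. The `flush_stats` and `count_unclean` callables are hypothetical stand-ins for whatever the helper actually runs, and flushing once before the loop is an assumption.

```python
import time


def wait_for_clean(flush_stats, count_unclean, timeout=300.0, interval=1.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Return True once count_unclean() reaches 0, False on timeout."""
    # Flush first so the poll below is not satisfied by stale stats.
    flush_stats()
    deadline = clock() + timeout
    while True:
        if count_unclean() == 0:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

Passing `sleep` and `clock` as parameters keeps the sketch easy to exercise without real delays.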
@dzafman

Member

dzafman commented Jan 15, 2018

@liewegas You approved the previous version, but there have been slight changes, including 64047e1 and aeba36a.

Merge comment should include:

Fixes: http://tracker.ceph.com/issues/20059

tests: recovery-unfound-found test needs to account for correct misplaced calculations

The test expected HEALTH_OK while in a state with misplaced objects, which is therefore HEALTH_WARN.

Signed-off-by: David Zafman <dzafman@redhat.com>
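
The relationship that this commit and the earlier "Base pg degraded state on num_degraded_objects" commit rely on can be illustrated with a small sketch: a PG is flagged degraded when its degraded-object count is non-zero, and misplaced objects alone are enough to keep health at HEALTH_WARN. The `PGStats` class, its field names, and the `health` function are hypothetical simplifications, not Ceph's data structures, and real health checks cover far more than these two counters.

```python
from dataclasses import dataclass


@dataclass
class PGStats:
    # Hypothetical per-PG counters, named for illustration only.
    num_objects_degraded: int = 0
    num_objects_misplaced: int = 0


def is_degraded(stats):
    """Degraded state follows the degraded-object count."""
    return stats.num_objects_degraded > 0


def health(all_stats):
    """Return HEALTH_WARN if any PG has degraded or misplaced objects."""
    for s in all_stats:
        if s.num_objects_degraded > 0 or s.num_objects_misplaced > 0:
            return "HEALTH_WARN"
    return "HEALTH_OK"
```

Under these assumptions, a PG with only misplaced objects is not degraded, yet a test that waits for HEALTH_OK would still see HEALTH_WARN.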
@dzafman

Member

dzafman commented Jan 16, 2018

dzafman-2018-01-15_15:34:02-rados-wip-zafman-testing-distro-basic-smithi
2076708 Error ETIMEDOUT: crush test failed with -110: timed out during smoke test (5 seconds) doing ceph osd pool create fast_eviction 1 1
2076717 rbd failures
2076765 clocks not sync'ed
2076870 test.sh cal_raw_used_size != raw_used_size
2076953 test.sh cal_raw_used_size != raw_used_size
2076985 Need test case adjustment 1 osd out marked misplaced

Job 2076985 fixed by 9f103f0

@dzafman dzafman merged commit 7ccb7b7 into ceph:master Jan 16, 2018

5 checks passed

Docs: build check OK - docs built
Signed-off-by all commits in this PR are signed
Unmodified Submodules submodules for project are unmodified
make check make check succeeded
make check (arm64) make check succeeded
@dzafman

Member

dzafman commented Jan 16, 2018

PASSED with correction 9f103f0
dzafman-2018-01-16_10:53:54-rados-wip-zafman-testing-distro-basic-smithi/2079354

@dzafman dzafman deleted the dzafman:wip-calc-stats branch Mar 26, 2018
