Fix test-erasure-eio and osd-scrub-repair races (17830) #11926

Merged
merged 9 commits into from Nov 17, 2016

Conversation

dzafman
Contributor

@dzafman dzafman commented Nov 12, 2016

I want Jenkins to run this. And probably have @tchaikov cherry-pick the test commit to #11914 and just close this pull request.

@tchaikov
Contributor

tchaikov commented Nov 14, 2016

@dzafman, i will pull your PR in once i figure out why the jenkins "make check" still fails and fix it.

dmick added a commit that referenced this pull request Nov 15, 2016
After Sage's suggestion of earlier today, in now-closed
PR #11970, which can't be resurrected just now

Kefu is trying to reproduce osd-scrub-repair failures

David is working on test-erasure-eio in #11926

Signed-off-by: Dan Mick <dan.mick@redhat.com>
Signed-off-by: David Zafman <dzafman@redhat.com>
Reduce size of log on timeout by doing a backoff so that
we don't log 3000 loops at 1/10 second sleeps.

Signed-off-by: David Zafman <dzafman@redhat.com>
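
The backoff described in this commit message could look roughly like the sketch below. It is not the code from this pull request: the function name wait_with_backoff, its arguments, and the 0.1 to 5 second delay range are assumptions chosen for illustration, and it relies on a sleep that accepts fractional seconds.

```shell
# Hypothetical helper (not from this PR): run a command until it succeeds,
# doubling the sleep after each failure so a long timeout logs few loops.
wait_with_backoff() {
    local max_total=$1   # give up once about this many seconds were slept
    shift
    local delay=1        # current sleep, in tenths of a second
    local max_delay=50   # cap each sleep at 5 seconds
    local slept=0        # total time slept so far, in tenths of a second

    while ! "$@" ; do
        if [ $slept -ge $((max_total * 10)) ] ; then
            return 1
        fi
        # Convert tenths of a second into the S.T form that sleep expects
        sleep $(printf '%d.%d' $((delay / 10)) $((delay % 10)))
        slept=$((slept + delay))
        delay=$((delay * 2))
        [ $delay -gt $max_delay ] && delay=$max_delay
    done
    return 0
}
```

With these assumed values the sleeps grow as 0.1, 0.2, 0.4, 0.8, 1.6, 3.2 and then stay at 5 seconds, so a wait of several minutes takes a few dozen iterations instead of thousands.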
Tests use objectstore_tool() which stops and starts OSDs,
but may assume consistency of object locations.

Signed-off-by: David Zafman <dzafman@redhat.com>
Signed-off-by: David Zafman <dzafman@redhat.com>
@dzafman
Contributor Author

dzafman commented Nov 16, 2016

@tchaikov My changes in osd-scrub-repair.sh might fix the races.

@dzafman
Contributor Author

dzafman commented Nov 16, 2016

Jenkins passed, but let's try another attempt. I have a commit that fixes compiler warnings.

@dzafman dzafman changed the title DNM: Wip test 17830 Fix test-erasure-eio and osd-scrub-repair races (17830) Nov 16, 2016
@dzafman
Contributor Author

dzafman commented Nov 16, 2016

@athanatos If you want to move these tests to teuthology we don't have to put this pull request in. It seems to fix the races I know about, but I can't be sure there aren't others.

@athanatos
Contributor

@dzafman Meh, if it's reliable, it's not a particularly high priority to move it to teuthology.

@athanatos
Contributor

Looks like jenkins failed though.

Signed-off-by: David Zafman <dzafman@redhat.com>
Signed-off-by: David Zafman <dzafman@redhat.com>
Caused by: af720cc

Signed-off-by: David Zafman <dzafman@redhat.com>
@dzafman
Contributor Author

dzafman commented Nov 17, 2016

I fixed something else in osd-scrub-repair. Let's try it again.

retest this please

Add comments about uniqueness of port number required

Signed-off-by: David Zafman <dzafman@redhat.com>
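
The uniqueness requirement could also be checked mechanically. The sketch below is an assumption for illustration, not part of this pull request: the function name find_duplicate_mon_ports is hypothetical, and it assumes every test script sets its monitor address in the form CEPH_MON="127.0.0.1:PORT".

```shell
# Hypothetical check (not from this PR): print each CEPH_MON port that is
# assigned more than once anywhere under the given directory.
find_duplicate_mon_ports() {
    local dir=$1
    # -r recurse, -h omit file names, -o print only the matching text
    grep -rhoE 'CEPH_MON="127\.0\.0\.1:[0-9]+"' "$dir" |
        sed -E 's/.*:([0-9]+)"$/\1/' |   # keep only the port number
        sort |
        uniq -d                          # keep only ports seen more than once
}
```

An empty result means every port is used once. Any port printed is shared by at least two assignments, which is the situation the new comment warns about.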
This failed because the test blew through 60 requests in 8 seconds before
the scrub repair even started on Jenkins.

Signed-off-by: David Zafman <dzafman@redhat.com>
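
A loop bounded by iteration count can run out of attempts long before the intended time has passed, which is the failure described above. One alternative, sketched here under assumptions and not taken from this pull request, is to bound the wait by wall-clock time. The name wait_until and its arguments are hypothetical.

```shell
# Hypothetical helper (not from this PR): retry a command once per second
# until it succeeds or the given number of seconds has elapsed.
wait_until() {
    local timeout=$1   # maximum time to wait, in seconds
    shift
    local deadline=$(( $(date +%s) + timeout ))

    until "$@" ; do
        # Stop as soon as the deadline has passed, however few or many
        # attempts were made
        [ "$(date +%s)" -lt "$deadline" ] || return 1
        sleep 1
    done
}

# Possible use with the check from the test, wrapped in a function
# (pg_has_two_missing is an illustrative name):
#   pg_has_two_missing() {
#       [ "$(ceph pg $pg list_missing | egrep "MOBJ0|MOBJ1" | wc -l)" -eq 2 ]
#   }
#   wait_until 60 pg_has_two_missing || return 1
```

Because the limit is a deadline, fast iterations only mean more checks within the same 60 seconds.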
@dzafman
Contributor Author

dzafman commented Nov 17, 2016

@athanatos @tchaikov Seemingly unrelated unittest_journal failed.

@@ -26,7 +26,7 @@ function run() {
     local dir=$1
     shift

-    export CEPH_MON="127.0.0.1:17109"
+    export CEPH_MON="127.0.0.1:17110" # git grep '\<17110\>' : there must be only one
Contributor

oh my! good catch!

@@ -355,6 +355,7 @@ function TEST_list_missing_erasure_coded() {
[ $i -lt 60 ] || return 1
matches=$(ceph pg $pg list_missing | egrep "MOBJ0|MOBJ1" | wc -l)
[ $matches -eq 2 ] && break
sleep 1
Contributor

Could use repair $pg || return 1 here, since repair() already does wait_for_scrub(). I will post another PR to do this cleanup.
