qa/tasks: assert on pg status with a timeout #14608

Merged
3 commits merged from tchaikov:wip-19594 into ceph:master on Apr 20, 2017

Conversation

3 participants
@tchaikov
Contributor

tchaikov commented Apr 18, 2017

Fixes: http://tracker.ceph.com/issues/19594
Signed-off-by: Kefu Chai kchai@redhat.com

@tchaikov tchaikov requested a review from liewegas Apr 18, 2017


@tchaikov tchaikov requested a review from jdurgin Apr 18, 2017

@jcsp

Contributor

jcsp commented Apr 18, 2017

Hmm, why did this work when the command was going to the mon, but needs a sleep now that it's in the mgr? The mon had the same stats period, didn't it?

@tchaikov

Contributor

tchaikov commented Apr 18, 2017

good question. osd_mon_report_interval_min is also 5. i have no idea why we didn't have this problem before.

@liewegas

Member

liewegas commented Apr 18, 2017

I think this is a band-aid. The underlying problem is the stats delay (mon stats also lag a bit). There are weird implicit timing assumptions in the tests: once one condition is true (e.g., wait_for_healthy), the pg stats are assumed to reflect that. That used to hold because both conditions were coming out of the mon, but now the mon and mgr info is disconnected, and we get occasional failures due to that.

We also have the problem that an ill-timed mgr failover may kick the pg stats back in time a bit too.

I'm worried that we need to sort these cases out individually by fixing the tests. I'm also concerned that this function is called from lots of places, and making it sleep 10s every time is going to be a problem...

@tchaikov

Contributor

tchaikov commented Apr 18, 2017

so shall we decrease the interval for qa tests?

@tchaikov tchaikov changed the title from qa/tasks/ceph_manger: wait for (2 * mgr_stats_period) before 'pg dump' to [DNM] qa/tasks/ceph_manger: wait for (2 * mgr_stats_period) before 'pg dump' Apr 18, 2017

@tchaikov

Contributor

tchaikov commented Apr 18, 2017

we need a "pg flush" command to force all OSDs to send their pg stats to mgr right away.

@liewegas

Member

liewegas commented Apr 18, 2017

Two ideas:

  1. do something like the osd tell flush_pg_stats thing so that we block until the stats go to the mgr
  2. accept that pg stats are "eventually consistent" and wrap every test/assertion about them in a loop with a timeout.
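Option 2 can be sketched as a small poll-with-timeout helper. The function name, the back-off intervals, and the usage shown below are illustrative assumptions, not the actual qa/tasks API:

```python
import time

def wait_until_true(predicate, timeout=30, sleeps=(1, 2, 5)):
    """Poll `predicate` until it returns True, or raise AssertionError
    after `timeout` seconds.  `sleeps` is a sequence of back-off
    intervals; the last one is reused once the sequence is exhausted."""
    start = time.monotonic()
    attempt = 0
    while not predicate():
        if time.monotonic() - start >= timeout:
            raise AssertionError('condition not met within %s seconds' % timeout)
        # back off before re-reading the (eventually consistent) pg stats
        time.sleep(sleeps[min(attempt, len(sleeps) - 1)])
        attempt += 1
```

A pg-status assertion would then become something like `wait_until_true(lambda: all_pgs_active_clean(), timeout=120)`, where `all_pgs_active_clean()` is a hypothetical wrapper around 'pg dump'.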
@tchaikov

Contributor

tchaikov commented Apr 18, 2017

i vote for option 2 =)


@tchaikov tchaikov changed the title from [DNM] qa/tasks/ceph_manger: wait for (2 * mgr_stats_period) before 'pg dump' to qa/tasks: assert on pg status with a timeout Apr 19, 2017

@tchaikov

Contributor

tchaikov commented Apr 19, 2017

@jcsp i think we failed to schedule a scrub because the "last_scrub_stamp" returned by the first and the second "pg dump" calls in CephManager.do_pg_scrub() are different, due to the different ctimes of pg-create messages before and after luminous. i included c52d035 in this PR to address this.

def primary(stats):
    assert stats is not None
    return int(stats['acting'][0])
return self.with_pg(pool, pgnum, primary)

@tchaikov

tchaikov Apr 19, 2017

Contributor

this change will conflict with #14559

@tchaikov

Contributor

tchaikov commented Apr 19, 2017

@liewegas fixed and repushed.

@liewegas

yes! this is so much better :)

i'd maybe add a few more sleep values in the decorator just for good measure...

tchaikov added some commits Apr 19, 2017

qa/tasks/ceph_manager: add a "wait_for_pg_stats()" decorator
and accompany it with two helpers to access the pg stats in a more
natural way

Signed-off-by: Kefu Chai <kchai@redhat.com>
qa/tasks: update tests with helper to wait for pg-stats
and remove unused helpers

Fixes: http://tracker.ceph.com/issues/19594
Signed-off-by: Kefu Chai <kchai@redhat.com>
mon/OSDMonitor: use inc.modified as the ctime for MOSDPGCreate
so the last_scrub_time of the newly created pg is consistent

Fixes: http://tracker.ceph.com/issues/19594
Signed-off-by: Kefu Chai <kchai@redhat.com>
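The decorator described in the first commit might look roughly like the following. The name and the idea of a list of sleep intervals mirror the commit message and review comments, but the body is a sketch, not the merged code:

```python
import functools
import time

def wait_for_pg_stats(func, sleeps=(1, 2, 5, 10)):
    """Retry an assertion-style check, sleeping between attempts so the
    mgr has time to receive fresh pg stats (sketch; intervals are
    illustrative)."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        for delay in sleeps:
            try:
                return func(*args, **kwargs)
            except AssertionError:
                time.sleep(delay)
        # one final attempt; let the AssertionError propagate this time
        return func(*args, **kwargs)
    return wrapper
```

A CephManager method that asserts on 'pg dump' output could then be decorated with `@wait_for_pg_stats` and keep its plain assert statements.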
@tchaikov

Contributor

tchaikov commented Apr 20, 2017

changelog

  • dropped the change to get_pg_primary(), which would otherwise conflict with #14559
  • added more sleep values in the wait_for_pg_stats() decorator
@tchaikov tchaikov merged commit ee653ba into ceph:master Apr 20, 2017

3 checks passed

  • Signed-off-by: all commits in this PR are signed
  • Unmodified Submodules: submodules for project are unmodified
  • default: Build finished.

@tchaikov tchaikov deleted the tchaikov:wip-19594 branch Apr 20, 2017
