pybind/mgr/progress: introduce 5 second sleep interval #41907
qa/tasks/mgr/test_progress.py
@@ -330,28 +346,29 @@ def test_osd_cannot_recover(self):
        # First do some failures that will result in a normal rebalance
        # (Assumption: we're in a test environment that is configured
        # not to require replicas be on different hosts, like teuthology)
        self._set_no_recovery_backfill(True)
nit: also, these calls are normally paired to guard the test. What about:
    from contextlib import contextmanager

    @contextmanager
    def recovery_backfill_disabled(self):
        # raw_cluster_cmd takes the command words as separate arguments
        self.mgr_cluster.mon_manager.raw_cluster_cmd(
            'osd', 'set', 'nobackfill')
        self.mgr_cluster.mon_manager.raw_cluster_cmd(
            'osd', 'set', 'norecover')
        yield
        self.mgr_cluster.mon_manager.raw_cluster_cmd(
            'osd', 'unset', 'nobackfill')
        self.mgr_cluster.mon_manager.raw_cluster_cmd(
            'osd', 'unset', 'norecover')
    # ...
    with self.recovery_backfill_disabled():
        # test
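The paired set/unset pattern suggested above can be tried outside teuthology with a small self-contained sketch. `FakeMonManager` is a hypothetical stand-in that only records commands, and the `try`/`finally` is an addition that is not part of the suggestion; it assumes the flags should be cleared even when the guarded test body raises.

```python
from contextlib import contextmanager


class FakeMonManager:
    """Hypothetical stand-in for the real mon manager; it only records commands."""

    def __init__(self):
        self.commands = []

    def raw_cluster_cmd(self, *args):
        self.commands.append(" ".join(args))


@contextmanager
def recovery_backfill_disabled(mon_manager):
    # Set both flags before the guarded block runs.
    mon_manager.raw_cluster_cmd("osd", "set", "nobackfill")
    mon_manager.raw_cluster_cmd("osd", "set", "norecover")
    try:
        yield
    finally:
        # Always unset, even when the guarded block raises.
        mon_manager.raw_cluster_cmd("osd", "unset", "nobackfill")
        mon_manager.raw_cluster_cmd("osd", "unset", "norecover")


mon = FakeMonManager()
try:
    with recovery_backfill_disabled(mon):
        raise RuntimeError("simulated test failure")
except RuntimeError:
    pass
```

After the simulated failure, `mon.commands` holds the two `set` commands followed by the two `unset` commands, so a failing test would not leave the cluster flags behind.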
2021-06-18T13:11:01.333 INFO:tasks.cephfs_test_runner:======================================================================
2021-06-18T13:11:01.333 INFO:tasks.cephfs_test_runner:ERROR: test_osd_came_back (tasks.mgr.test_progress.TestProgress)
2021-06-18T13:11:01.334 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2021-06-18T13:11:01.334 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2021-06-18T13:11:01.334 INFO:tasks.cephfs_test_runner: File "/home/teuthworker/src/git.ceph.com_ceph-c_25c891fcbe5987f64bb6ba26ef6bd1dafdd5725f/qa/tasks/mgr/test_progress.py", line 316, in test_osd_came_back
2021-06-18T13:11:01.335 INFO:tasks.cephfs_test_runner: ev1 = self._simulate_failure()
2021-06-18T13:11:01.335 INFO:tasks.cephfs_test_runner: File "/home/teuthworker/src/git.ceph.com_ceph-c_25c891fcbe5987f64bb6ba26ef6bd1dafdd5725f/qa/tasks/mgr/test_progress.py", line 205, in _simulate_failure
2021-06-18T13:11:01.335 INFO:tasks.cephfs_test_runner: period=1)
2021-06-18T13:11:01.336 INFO:tasks.cephfs_test_runner: File "/home/teuthworker/src/git.ceph.com_ceph-c_25c891fcbe5987f64bb6ba26ef6bd1dafdd5725f/qa/tasks/ceph_test_case.py", line 185, in wait_until_equal
2021-06-18T13:11:01.336 INFO:tasks.cephfs_test_runner: elapsed, expect_val, val
2021-06-18T13:11:01.337 INFO:tasks.cephfs_test_runner:tasks.ceph_test_case.TestTimeoutError: Timed out after 30 seconds waiting for 1 (currently 0)
not sure if this is related.
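For readers unfamiliar with the helper in the traceback, here is a simplified sketch of how a polling wait can produce that message. It is not the code in qa/tasks/ceph_test_case.py; `PollTimeoutError` and the injectable `sleep` parameter are hypothetical names added so the example runs instantly.

```python
import time


class PollTimeoutError(RuntimeError):
    """Hypothetical stand-in for the TestTimeoutError seen in the log."""


def wait_until_equal(get_fn, expect_val, timeout, period=5, sleep=time.sleep):
    """Poll get_fn() every `period` seconds until it returns expect_val."""
    elapsed = 0
    while True:
        val = get_fn()
        if val == expect_val:
            return elapsed
        if elapsed >= timeout:
            raise PollTimeoutError(
                "Timed out after {0} seconds waiting for {1} (currently {2})".format(
                    elapsed, expect_val, val))
        sleep(period)
        elapsed += period


# Usage: an event count that never leaves 0, polled once per simulated second.
try:
    wait_until_equal(lambda: 0, expect_val=1, timeout=30, period=1,
                     sleep=lambda _: None)  # no real sleeping in the demo
except PollTimeoutError as e:
    message = str(e)
```

With these inputs the sketch raises after 30 simulated seconds with the same wording as the log line, which only shows that the polled value stayed at 0 for the whole window; it does not say why.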
All green, so I think it should be ready to merge.
It would be nice to split the test changes into smaller commits, each with its own explanation, instead of folding them into this big commit.
nit: why we need to ignore `OSDMAP_FLAGS` could also use an explanation.
Currently the progress module checks pg stats and the osdmap every time it is notified by the cluster. This is expensive in large clusters with many pools and OSDs, so change it to check both pg stats and the osdmap only once every 5 seconds.

In _osd_in_out(), `is_relocated` is now calculated as old_osds != new_osds, so that it does not matter whether the difference between the OSD sets is positive or negative.

Signed-off-by: Kamoltat <ksirivad@redhat.com>
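To illustrate the `is_relocated` change, here is a minimal sketch comparing a direction-sensitive check with the symmetric `old_osds != new_osds` comparison. The one-sided variant is an assumption about what an earlier check might have looked like, not code quoted from the module, and both function names are hypothetical.

```python
def relocated_one_sided(old_osds, new_osds):
    # Hypothetical earlier-style check: only notices OSDs that left the set.
    return len(set(old_osds) - set(new_osds)) > 0


def relocated_symmetric(old_osds, new_osds):
    # The comparison described in the commit message: any change counts.
    return set(old_osds) != set(new_osds)


examples = {
    "osd removed": ([0, 1, 2], [0, 1]),
    "osd added": ([0, 1], [0, 1, 2]),
    "unchanged": ([0, 1, 2], [0, 1, 2]),
}

results = {
    name: (relocated_one_sided(old, new), relocated_symmetric(old, new))
    for name, (old, new) in examples.items()
}
```

Only the symmetric comparison reports the "osd added" case as relocated, which is the direction-independence the commit message refers to.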
Change some of the teuthology tests to make them more deterministic, using `ceph osd set norecover` and `ceph osd set nobackfill` when marking OSDs in or out. This delays recovery and ensures the test cases get a chance to check that events actually pop up in the progress module.

Remove test_osd_cannot_recover from tasks/mgr/test_progress.py: it is no longer a relevant test case, since recovery is triggered regardless of whether the PG has moved.

Ignore `OSDMAP_FLAGS` in teuthology: setting norecover and nobackfill to delay recovery raises a health warning, which would otherwise fail the teuthology test.

Signed-off-by: Kamoltat <ksirivad@redhat.com>
@jdurgin
Currently the progress module checks pg stats and the osdmap
every time it is notified by the cluster. This is expensive in
large clusters with many pools and OSDs, so change it to check
both pg stats and the osdmap only once every 5 seconds.
Also change some of the teuthology tests to make them more
deterministic, using `ceph osd set norecover` and
`ceph osd set nobackfill` when marking OSDs in or out. This
delays recovery and ensures the test cases get a chance to check
that events actually pop up in the progress module.
Signed-off-by: Kamoltat <ksirivad@redhat.com>
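One way to picture the 5-second interval is a checker that treats notifications as cheap "something changed" marks and runs the expensive scan at most once per interval. This is only a sketch of the general idea; the class, its method names, and the injectable clock are hypothetical, and the module's actual mechanism may differ (for example, a sleeping loop in a background thread).

```python
import time


class ThrottledChecker:
    """Coalesce frequent notifications into at most one check per interval."""

    def __init__(self, check_fn, interval=5.0, clock=time.monotonic):
        self._check_fn = check_fn  # the expensive work, e.g. scanning pg stats
        self._interval = interval
        self._clock = clock
        self._last_check = None
        self._dirty = False

    def notify(self):
        # Cheap: just remember that something changed.
        self._dirty = True

    def tick(self):
        """Called periodically; runs the check only if it is due and needed."""
        now = self._clock()
        due = self._last_check is None or now - self._last_check >= self._interval
        if self._dirty and due:
            self._dirty = False
            self._last_check = now
            self._check_fn()
            return True
        return False


# Usage with a fake clock: one notification per simulated second for 12 seconds.
fake_now = [0.0]
calls = []
checker = ThrottledChecker(lambda: calls.append(fake_now[0]),
                           clock=lambda: fake_now[0])
for t in range(12):
    fake_now[0] = float(t)
    checker.notify()
    checker.tick()
```

Twelve notifications result in only three checks, at simulated times 0, 5, and 10. The trade-off is that an event can take up to one interval to appear, which is worth keeping in mind for tests that poll with short timeouts.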