pybind/mgr/progress: introduce 5 second sleep interval #41907

Merged
merged 2 commits into from Jul 21, 2021

Conversation

kamoltat
Member

Currently, the progress module only checks pg stats
and the osdmap when it is notified by the cluster.
However, this is expensive in large clusters
with many pools and osds. We
change it to check both pg stats and the osdmap
only once every 5 seconds.
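
The idea of replacing per-notification checks with a fixed interval can be sketched roughly as follows. This is a minimal illustration, not the progress module's actual code: the class name `ThrottledChecker`, the `check_fn` callback, and the dirty-flag approach are all assumptions made for the example.

```python
import threading


class ThrottledChecker:
    """Coalesce notifications and handle them at most once per interval.

    Hypothetical sketch: names and structure are not taken from the PR.
    """

    def __init__(self, check_fn, interval=5.0):
        self._check_fn = check_fn    # stands in for the expensive pg stats / osdmap check
        self._interval = interval    # seconds to sleep between checks
        self._dirty = False
        self._lock = threading.Lock()
        self._stop = threading.Event()

    def notify(self):
        # Cheap: only record that something changed.
        with self._lock:
            self._dirty = True

    def tick(self):
        # Run the expensive check only if a notification arrived since the last tick.
        with self._lock:
            dirty, self._dirty = self._dirty, False
        if dirty:
            self._check_fn()
        return dirty

    def serve(self):
        # Event.wait() returns False on timeout and True once shutdown() is called.
        while not self._stop.wait(self._interval):
            self.tick()

    def shutdown(self):
        self._stop.set()
```

With this shape, a burst of notifications between two ticks results in a single check, which is where the savings on a large cluster would come from.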

Also change some of the tests in teuthology to make
them more deterministic, using:

ceph osd set norecover and
ceph osd set nobackfill when marking osds in
or out. This delays the recovery and makes
sure the test cases get the chance to check
that events are actually popping up in
the progress module.

Signed-off-by: Kamoltat <ksirivad@redhat.com>

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug


@kamoltat kamoltat force-pushed the wip-ksirivad-progress-time-interval branch 2 times, most recently from e7258fa to fb51a52 Compare June 17, 2021 03:58
@@ -330,28 +346,29 @@ def test_osd_cannot_recover(self):
# First do some failures that will result in a normal rebalance
# (Assumption: we're in a test environment that is configured
# not to require replicas be on different hosts, like teuthology)
self._set_no_recovery_backfill(True)
Contributor


nit: also, these calls are normally paired for guarding the test. What about:

import contextlib

@contextlib.contextmanager
def recovery_backfill_disabled(self):
    # set both flags before the test body runs
    self.mgr_cluster.mon_manager.raw_cluster_cmd(
        'osd', 'set', 'nobackfill')
    self.mgr_cluster.mon_manager.raw_cluster_cmd(
        'osd', 'set', 'norecover')
    yield
    # clear both flags once the test body is done
    self.mgr_cluster.mon_manager.raw_cluster_cmd(
        'osd', 'unset', 'nobackfill')
    self.mgr_cluster.mon_manager.raw_cluster_cmd(
        'osd', 'unset', 'norecover')
# ...
with self.recovery_backfill_disabled():
    ...  # test
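
A possible variant of the suggestion above wraps the `yield` in `try`/`finally`, so the flags are cleared even when the test body raises. This is only a sketch: `FakeMonManager` is a hypothetical stand-in that records commands rather than talking to a cluster, and the helper is written as a free function for the sake of a self-contained example.

```python
import contextlib


class FakeMonManager:
    """Hypothetical stand-in that records commands instead of running them."""

    def __init__(self):
        self.commands = []

    def raw_cluster_cmd(self, *args):
        self.commands.append(" ".join(args))


@contextlib.contextmanager
def recovery_backfill_disabled(mon_manager):
    # Set both flags before handing control to the test body.
    for flag in ("nobackfill", "norecover"):
        mon_manager.raw_cluster_cmd("osd", "set", flag)
    try:
        yield
    finally:
        # Runs even if the test body raises, so the flags do not leak
        # into later tests.
        for flag in ("nobackfill", "norecover"):
            mon_manager.raw_cluster_cmd("osd", "unset", flag)
```

Usage would look like `with recovery_backfill_disabled(mon_manager): ...`, with the set and unset calls always issued as a pair.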

Contributor

@tchaikov tchaikov left a comment


2021-06-18T13:11:01.333 INFO:tasks.cephfs_test_runner:======================================================================
2021-06-18T13:11:01.333 INFO:tasks.cephfs_test_runner:ERROR: test_osd_came_back (tasks.mgr.test_progress.TestProgress)
2021-06-18T13:11:01.334 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2021-06-18T13:11:01.334 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2021-06-18T13:11:01.334 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_25c891fcbe5987f64bb6ba26ef6bd1dafdd5725f/qa/tasks/mgr/test_progress.py", line 316, in test_osd_came_back
2021-06-18T13:11:01.335 INFO:tasks.cephfs_test_runner:    ev1 = self._simulate_failure()
2021-06-18T13:11:01.335 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_25c891fcbe5987f64bb6ba26ef6bd1dafdd5725f/qa/tasks/mgr/test_progress.py", line 205, in _simulate_failure
2021-06-18T13:11:01.335 INFO:tasks.cephfs_test_runner:    period=1)
2021-06-18T13:11:01.336 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_25c891fcbe5987f64bb6ba26ef6bd1dafdd5725f/qa/tasks/ceph_test_case.py", line 185, in wait_until_equal
2021-06-18T13:11:01.336 INFO:tasks.cephfs_test_runner:    elapsed, expect_val, val
2021-06-18T13:11:01.337 INFO:tasks.cephfs_test_runner:tasks.ceph_test_case.TestTimeoutError: Timed out after 30 seconds waiting for 1 (currently 0)

not sure if this is related.

@kamoltat kamoltat force-pushed the wip-ksirivad-progress-time-interval branch 3 times, most recently from 15d0d33 to 2d433ab Compare July 9, 2021 22:44
@kamoltat kamoltat requested a review from a team as a code owner July 9, 2021 22:44
@kamoltat kamoltat force-pushed the wip-ksirivad-progress-time-interval branch from 2d433ab to 2fcc938 Compare July 11, 2021 22:39
@kamoltat
Member Author

kamoltat commented Jul 12, 2021

Member

@neha-ojha neha-ojha left a comment


It would be nice to separate out the test changes into smaller commits and add corresponding explanation in them instead of adding them to this big commit.

nit, why we need to ignore "OSDMAP_FLAGS" could also use an explanation

Currently, the progress module only checks pg stats
and the osdmap when it is notified by the cluster.
However, this is expensive in large clusters
with many pools and osds. We
change it to check both pg stats and the osdmap
only once every 5 seconds.

In the function _osd_in_out() we now calculate
`is_relocated` as old_osds != new_osds, so that
it does not matter whether the difference between
the osds is positive or negative.

Signed-off-by: Kamoltat <ksirivad@redhat.com>
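
The `is_relocated` change described in the commit message above can be illustrated with a small comparison. This is a sketch under assumptions: it treats `old_osds` and `new_osds` as sets of OSD ids, and `removed_only` is a hypothetical directional check included only for contrast, not the code that the PR replaced.

```python
def removed_only(old_osds, new_osds):
    # Hypothetical directional check: only notices OSDs that disappeared.
    return bool(set(old_osds) - set(new_osds))


def is_relocated(old_osds, new_osds):
    # Direction-agnostic check: any change in membership counts.
    return set(old_osds) != set(new_osds)


old = {0, 1, 2}
grown = {0, 1, 2, 3}   # an OSD was added
shrunk = {0, 1}        # an OSD was removed

print(removed_only(old, grown), is_relocated(old, grown))
print(removed_only(old, shrunk), is_relocated(old, shrunk))
```

The directional check misses the case where the set only grows, while the inequality reports a change in both cases.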
Change some of the tests in teuthology to make
them more deterministic, using:

`ceph osd set norecover` and
`ceph osd set nobackfill` when marking osds in
or out. This delays the recovery and makes
sure the test cases get the chance to check
that events are actually popping up in
the progress module.

Took out test_osd_cannot_recover from
tasks/mgr/test_progress.py; it is no longer
a relevant test case, since recovery will get
triggered regardless of whether the pg is unmoved.

Ignore `OSDMAP_FLAGS` in teuthology
because we are using norecover and nobackfill
to delay the recovery process. Setting these flags
creates a health warning, which would otherwise
fail the teuthology test.

Signed-off-by: Kamoltat <ksirivad@redhat.com>
@kamoltat kamoltat force-pushed the wip-ksirivad-progress-time-interval branch from 2fcc938 to 5f33f2f Compare July 13, 2021 19:34
@kamoltat
Member Author

@jdurgin
All the failures and dead jobs are unrelated, so I think this should be good to go:
https://pulpito.ceph.com/ksirivad-2021-07-21_04:36:45-rados-wip-ksirivad-progress-time-interval-distro-basic-smithi/

@jdurgin
Member

jdurgin commented Aug 27, 2021

4 participants