Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nautilus: core: maybe_remove_pg_upmap can be super inefficient for large clusters #28756

merged 8 commits into from Jul 30, 2019


Copy link

commented Jun 26, 2019

xiexingguo added 8 commits Jun 1, 2019
test/osd: add performance test case for maybe_remove_pg_upmap
Tom Byrne reported that maybe_remove_pg_upmap might become
super inefficient for large clusters with balancer on.
To identify and resolve the problem, we need to add some good
measurements first.

Signed-off-by: xie xingguo <>
(cherry picked from commit c0ce22b)
osd/OSDMap: maybe_remove_pg_upmaps - s/pg_to_raw_up/pg_to_raw_upmap/
The upmap results are directly applied after calling
_pg_to_raw_osds, which means it basically has nothing to do
with the up/down status.
In other words, if a pg_upmap/pg_upmap_items remapped a pg
into some down osds and is now causing collided result,
we should still be able to detect and cancel that.

Signed-off-by: xie xingguo <>
(cherry picked from commit d9ed406)
osd/OSDMap: maybe_remove_pg_upmaps - avoid do_crush twice
which is extremely time-consuming.
Half of the amount of time of calling maybe_remove_pg_upmaps
has been saved by applying this patch as a result..

Was: maybe_remove_pg_upmaps (~10000 pg_upmap_items) latency:104s
Now: maybe_remove_pg_upmaps (~10000 pg_upmap_items) latency:56s

Signed-off-by: xie xingguo <>
(cherry picked from commit 02e5499)

- nautilus has "vector" where master has "std::vector"
osd: maybe_remove_pg_upmaps -> clean_pg_upmaps
It should always be the preferred option to kill the unnecessary
or duplicated code, which is good for maintenance.
Also I've noticed there is already a clean_temps helper, so re-naming
maybe_remove_pg_upmaps to clean_pg_upmaps to at least keep pace with
that sounds to be a natural choice for me..

Signed-off-by: xie xingguo <>
(cherry picked from commit a37106f)
osd/OSDMap: split clean_pg_upmaps into smaller helpers
- it's good to read.
- the updating pending_inc part should be made independent
  since it is going to be racy while running in parallel.

Signed-off-by: xie xingguo <>
(cherry picked from commit 4d5cf1e)
mon/OSDMonitor: do clean_pg_upmaps the parallel way if necessary
There could definitely be some certain cases we could reliably
skip this kind of checking, but there is no easy way to separate
those out.
However, this is clearly the general way to do the massive pg
upmap clean-up job more efficiently and hence should make sense
in all cases.

Signed-off-by: xie xingguo <>
(cherry picked from commit c395f45)
test: add parallel clean_pg_upmaps test
With parallel clean_pg_upmaps feature on, the total time cost
of the performance test which now can utilize up to 8 threads for
parallel upmap validating decreased from:

maybe_remove_pg_upmaps (~10000 pg_upmap_items) latency:104s


clean_pg_upmaps (~10000 pg_upmap_items) latency:7s

Note that by default the mon uses only 4 worker threads for
CPU intensive background work, you could further increase
the "mon_cpu_threads" option value if you decided the
time-consuming of clean_pg_upmaps still matters.

Signed-off-by: xie xingguo <>
(cherry picked from commit a45112a)
osd/OSDMapMapping: make ParallelPGMapper can accept input pgs
The existing "prime temp" machinism is a good inspiration
for cluster with a large number of pgs that need to do various
calculations quickly.
I am planning to do the upmap tidy-up work the same way, hence
the need for an alternate way of specifying pgs to process other
than taking directly from the map.

Signed-off-by: xie xingguo <>
(cherry picked from commit f6fd4a3)

@smithfarm smithfarm self-assigned this Jun 26, 2019

@smithfarm smithfarm added this to the nautilus milestone Jun 26, 2019

@smithfarm smithfarm requested review from xiexingguo and liewegas Jun 26, 2019

@smithfarm smithfarm changed the title nautilus: maybe_remove_pg_upmap can be super inefficient for large clusters nautilus: core: maybe_remove_pg_upmap can be super inefficient for large clusters Jun 26, 2019

Copy link

left a comment



This comment has been minimized.

Copy link

commented Jul 24, 2019

@yuriw yuriw merged commit b09a987 into ceph:nautilus Jul 30, 2019

4 checks passed

Docs: build check OK - docs built
Signed-off-by all commits in this PR are signed
Unmodified Submodules submodules for project are unmodified
make check make check succeeded

@smithfarm smithfarm deleted the smithfarm:wip-40231-nautilus branch Aug 21, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
None yet
3 participants
You can’t perform that action at this time.