The second phase of data compaction is not reflected by _active_tasks #980

Closed
vitaly-goot opened this issue Nov 10, 2017 · 7 comments

Comments

@vitaly-goot
Contributor

vgoot [2:20 PM]
Question about data compaction:
From my understanding of the code, data compaction appears to have two phases (phase one -> sort; phase two -> merge).
It seems that _active_tasks reflects the progress of phase one only. For shards with a large number of documents (e.g. 20M+), the merge phase can take a substantial amount of time.
Is it possible to add the progress of phase two to _active_tasks?

jan [12:18]
@vgoot best to open an issue on GitHub apache/couchdb
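
For context, a minimal sketch of how the reported compaction progress can be pulled out of an `_active_tasks` response. The field names (`type`, `database`, `progress`, `changes_done`, `total_changes`) follow the documented response shape; the helper name `compaction_progress` and the sample payload are made up for illustration. The point from the question above is that this number only covers the first phase.

```python
from typing import Dict, Iterable


def compaction_progress(tasks: Iterable[dict]) -> Dict[str, int]:
    """Map database name -> reported progress (percent) for compaction tasks."""
    return {
        task["database"]: task.get("progress", 0)
        for task in tasks
        if task.get("type") == "database_compaction"
    }


# Invented payload shaped like the JSON list returned by GET /_active_tasks.
sample_tasks = [
    {
        "type": "database_compaction",
        "database": "exampledb",
        "changes_done": 20000000,
        "total_changes": 20000000,
        "progress": 100,
    },
    {"type": "replication", "database": "otherdb", "progress": 40},
]

print(compaction_progress(sample_tasks))
```

In this sample the task already reports 100, which is what the issue describes: the figure says nothing about how far the merge has got.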

@kocolosk
Member

Hi @vitaly-goot, yes, I hear you on this one. I'll note that there is some major refactoring of compaction underway in #806, which should significantly reduce the amount of time spent in the merge phase (right @davisp / @nickva?). But I'm guessing there's still an opportunity to add instrumentation after that PR lands.

@davisp
Member

davisp commented Nov 14, 2017

Yep, #806 is held up waiting for PSE to land. Also, I'm pretty sure I had a PR somewhere that added progress reporting during the second phase, but I'm not able to find it just now. The plan is definitely to add that as part of the work on revamping the compactor.

@vitaly-goot
Contributor Author

Submitted my workaround for this issue (see #1006).
I am testing it on the older CouchDB-1.6 (which we are running), and it is looking good.
I rebased all relevant 1.6 changes onto your recent 2.1 code line (but never tested the result).

This change will not report the progress of the merge phase per se, but it is useful enough for us, since we want to learn about disk burn rate and disk overhead for better capacity planning.

To get proper emsort progress reporting, it seems that emsort:get_state() needs to be changed to return additional information (other than Root). In fact, two counters could be surfaced: an added counter and a merged counter. That state is saved on disk by couch_db_updater (via #comp_header.meta_state). It seems that the format of that state can be changed safely, since these are all temporary compaction files.
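
To make the two-counter idea concrete, here is a toy Python model, not the actual couch_emsort code. The class `SortState` and its methods `add_chain` and `merge_chains` are hypothetical names; only the idea of returning counters alongside Root from a `get_state` call comes from the comment above.

```python
from dataclasses import dataclass
from typing import Any, Tuple


@dataclass
class SortState:
    """Toy stand-in for the sorter state; not the real emsort record."""

    root: Any = None  # stand-in for the Root that get_state() returns today
    added: int = 0    # chains written during the sort phase
    merged: int = 0   # chains consumed during the merge phase

    def add_chain(self, new_root: Any) -> None:
        # Sort phase: one more chain has been written out.
        self.root = new_root
        self.added += 1

    def merge_chains(self, count: int, new_root: Any) -> None:
        # Merge phase: `count` chains have been folded into a new root.
        self.root = new_root
        self.merged += count

    def get_state(self) -> Tuple[Any, int, int]:
        # Return the counters together with the root so they can be persisted.
        return (self.root, self.added, self.merged)


state = SortState()
state.add_chain("root-1")
state.add_chain("root-2")
state.merge_chains(2, "root-3")
print(state.get_state())
```

Persisting the tuple from `get_state()` rather than the root alone is what would let a restarted compaction resume its progress reporting, assuming the on-disk state carries both counters.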

@vitaly-goot
Contributor Author

Adding merge phase reporting.

couch_emsort changes:
A new counter keeps track of the number of merged paths (emsort.bb_chains). That counter is committed to the compact meta file during the sort phase. The format of the compact.meta file changed (to hold that counter), but the new code should be able to work with both disk representations. In the merge phase, that counter is used to report the ratio of bb_chains merged vs. added.
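
A rough Python sketch of the two pieces described above: reading either on-disk representation, and turning the counters into a merge-phase percentage. The function names `decode_meta_state` and `merge_progress`, and the choice to model the old format as a bare value and the new one as a `(root, bb_chains)` pair, are assumptions for illustration; the real change is in Erlang and its term layout may differ.

```python
from typing import Any, Optional, Tuple


def decode_meta_state(state: Any) -> Tuple[Any, Optional[int]]:
    """Accept the assumed new format (root, bb_chains) or an old bare root."""
    if isinstance(state, tuple) and len(state) == 2 and isinstance(state[1], int):
        root, bb_chains = state
        return root, bb_chains
    # Old representation: no counter was stored, so the total is unknown.
    return state, None


def merge_progress(merged: int, bb_chains: Optional[int]) -> Optional[int]:
    """Percentage of chains merged so far, or None when the total is unknown."""
    if not bb_chains:
        return None
    return min(100, merged * 100 // bb_chains)


root, total = decode_meta_state(("root", 8))
print(merge_progress(2, total))
```

Returning None for the old format means a compaction started by older code simply reports no merge progress instead of a misleading number.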

@wohali wohali added the dbcore label Jan 16, 2018
@janl
Member

janl commented Jul 14, 2018

@davisp does PSE resolve this?

@wohali wohali removed the dbcore label Jun 25, 2020
@nickva
Contributor

nickva commented Nov 4, 2021

This was implemented in 123bf82 by @davisp. Thanks, @vitaly-goot, for the initial idea and proof of concept.

@nickva nickva closed this as completed Nov 4, 2021
@vitaly-goot
Contributor Author

vitaly-goot commented Nov 4, 2021 via email
