move remonitor code into DOWN message #3144

tonysun83 · 2020-09-10T18:38:22Z

Smoosh monitors the compactor pid to determine when a compaction job
finishes, and uses this for its idea of concurrency. However, this isn't
accurate when the compaction job has to re-spawn to catch up on
intervening changes: the same logical compaction job continues with
another pid and smoosh is not aware of it. In such cases, a smoosh channel with
concurrency one can start arbitrarily many additional database compaction jobs.

To solve this problem, we added a check to see if a compaction PID exists for
a db in start_compact. But this is the wrong place for it, because that check
runs for the shard that comes off the queue next. That is a different shard from
the one whose compactor just exited, so the following can still occur:

1. Enqueue a bunch of stuff into channel with concurrency 1
2. Begin highest priority job, Shard1, in channel
3. Compaction finishes, discovers compaction file is behind main file
4. Smoosh-monitored PID for Shard1 exits, a new one starts to finish the job
5. Smoosh receives the 'DOWN' message, begins the next highest priority job,
   Shard2
6. Channel concurrency is now 2, not 1

This change moves the check into the 'DOWN' message so that we can check for
that specific shard. If the compaction PID exists then it means a new process
was spawned and we just monitor that one and add it back to the queue. The
length of the queue does not change and therefore we won’t spawn new
compaction jobs.
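
A minimal sketch of the idea, not the actual smoosh_channel code: the module name, the simplified #state{} record and the LookupFun argument are all hypothetical stand-ins. LookupFun plays the role of asking the database whether a compactor pid currently exists for the shard. The only point is that the 'DOWN' handler looks up the shard that owned the dead pid and frees the slot only when no new compactor exists for it.

```erlang
-module(remonitor_sketch).
-export([handle_down/3, demo/0]).

%% Hypothetical, simplified channel state (not smoosh's real record).
-record(state, {
    active = [] :: [{binary(), pid()}],  % {ShardName, CompactorPid} being monitored
    waiting = [] :: [binary()]           % queued shard names, highest priority first
}).

%% Called when a 'DOWN' message arrives for Pid.
%% LookupFun(Shard) is an assumed helper returning the shard's current
%% compactor pid, or `nil` when no compaction is running.
handle_down(Pid, LookupFun, #state{active = Active} = State) ->
    case lists:keyfind(Pid, 2, Active) of
        false ->
            %% Not a pid this channel was tracking.
            {ignore, State};
        {Shard, Pid} ->
            Rest = lists:keydelete(Pid, 2, Active),
            case LookupFun(Shard) of
                nil ->
                    %% Compaction really finished: the slot is free and the
                    %% caller may start the next waiting job.
                    {slot_free, State#state{active = Rest}};
                NewPid when is_pid(NewPid) ->
                    %% Compaction re-spawned for the same shard: monitor the
                    %% new pid and keep the slot occupied.
                    _Ref = erlang:monitor(process, NewPid),
                    {remonitor, State#state{active = [{Shard, NewPid} | Rest]}}
            end
    end.

%% Walks through the Shard1/Shard2 scenario with concurrency 1.
demo() ->
    Old = spawn(fun() -> ok end),
    New = spawn(fun() -> receive stop -> ok end end),
    S0 = #state{active = [{<<"shard1">>, Old}], waiting = [<<"shard2">>]},
    %% Old compactor exits, but a new one exists for shard1: no slot freed.
    {remonitor, S1} = handle_down(Old, fun(<<"shard1">>) -> New end, S0),
    [{<<"shard1">>, New}] = S1#state.active,
    [<<"shard2">>] = S1#state.waiting,
    %% New compactor exits and nothing replaced it: the slot is free.
    {slot_free, S2} = handle_down(New, fun(_) -> nil end, S1),
    [] = S2#state.active,
    New ! stop,
    ok.
```

In this sketch the number of active entries stays at one across the re-spawn, so shard2 is only started after the `slot_free` result.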

Overview

Testing recommendations

Related Issues or Pull Requests

Checklist

Code is written and works correctly
Changes are covered by tests
Any new configurable parameters are documented in rel/overlay/etc/default.ini
A PR for documentation changes has been made in https://github.com/apache/couchdb-documentation

davisp

Overall this looks good. Had one style and one logic bit to change.

src/smoosh/src/smoosh_channel.erl

davisp · 2020-09-10T18:50:55Z

-        end;
+        Ref = erlang:monitor(process, DbPid),
+        DbPid ! {'$gen_call', {self(), Ref}, start_compact},
+        State#state{starting=[{Ref, couch_db:name(Db)}|State#state.starting]};


I don't think this is correct. A compaction could be running due to manual intervention or perhaps if smoosh crashed and left a compaction running. I'd just change the comment to be something like "Compaction is already running, so monitor existing compaction pid".

There isn't a comment here. Are you talking about line 295:

couch_log:notice("Db ~s continuing compaction", [smoosh_utils:stringify(DbName)])

to

couch_log:notice("Compaction is already running for ~p, so monitor existing compaction pid ~p", [smoosh_utils:stringify(DbName), CPID])

I changed this below

Also, this section is a revert to what we originally had: https://github.com/cloudant/smoosh/pull/54/files#diff-7ff50b91998e5bf2a1f4cf4a8250f607L236-L238

No, I'm saying the comment on lines 284/285 from before the patch is a comment about "database still compaction...". I think it'd be better to just change that comment rather than changing the whole thing back.

Oh, I see what you mean: basically leave the initial check in as well.

davisp

+1

Smoosh monitors the compactor pid to determine when a compaction job finishes, and uses this for its idea of concurrency. However, this isn't accurate in the case where the compaction job has to re-spawn to catch up on intervening changes, since the same logical compaction job continues with another pid and smoosh is not aware. In such cases, a smoosh channel with concurrency one can start arbitrarily many additional database compaction jobs.

To solve this problem, we added a check to see if a compaction PID exists for a db in `start_compact`. But we need to add another check, because this check is only for the shard that comes off the queue. So the following can still occur:

1. Enqueue a bunch of stuff into channel with concurrency 1
2. Begin highest priority job, Shard1, in channel
3. Compaction finishes, discovers compaction file is behind main file
4. Smoosh-monitored PID for Shard1 exits, a new one starts to finish the job
5. Smoosh receives the 'DOWN' message, begins the next highest priority job, Shard2
6. Channel concurrency is now 2, not 1

This change adds another check into the 'DOWN' message so that it checks for that specific shard. If the compaction PID exists then it means a new process was spawned and we just monitor that one and add it back to the queue. The length of the queue does not change and therefore we won’t spawn new compaction jobs.

3.x porting - add remonitor code to DOWN message (#3144)

tonysun83 force-pushed the re-monitor-compaction-pid branch from 1492947 to 727accf Compare September 10, 2020 18:39

davisp requested changes Sep 10, 2020

davisp approved these changes Sep 10, 2020

tonysun83 force-pushed the re-monitor-compaction-pid branch from 188af27 to 1952852 Compare September 10, 2020 20:04

tonysun83 merged commit a94e693 into master Sep 10, 2020

tonysun83 deleted the re-monitor-compaction-pid branch September 10, 2020 20:35

jiangphcn mentioned this pull request Nov 6, 2020

3.x porting - add remonitor code to DOWN message (#3144) #3250

Merged

4 tasks

jiangphcn added a commit that referenced this pull request Nov 7, 2020

Merge pull request #3250 from apache/3.x-re-monitor-compaction-pid

c563243

3.x porting - add remonitor code to DOWN message (#3144)

schneuwlym mentioned this pull request May 19, 2021

Compaction dies constantly after a certain amount of documents #3571

Closed
