-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
move remonitor code into DOWN message #3144
Conversation
1492947
to
727accf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this look good. Had one style and one logic bit to change.
src/smoosh/src/smoosh_channel.erl
Outdated
end; | ||
Ref = erlang:monitor(process, DbPid), | ||
DbPid ! {'$gen_call', {self(), Ref}, start_compact}, | ||
State#state{starting=[{Ref, couch_db:name(Db)}|State#state.starting]}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is correct. A compaction could be running due to manual intervention or perhaps if smoosh crashed and left a compaction running. I'd just change the comment to be something like "Compaction is already running, so monitor existing compaction pid".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there isn't a comment here. are you talking about line 295 :
couch_log:notice("Db ~s continuing compaction",
[smoosh_utils:stringify(DbName)])
to
couch_log:notice("Compaction is already running for ~p, so monitor existing compaction pid ~p",
[smoosh_utils:stringify(DbName), CPID])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed this below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, this section is a revert to what we originally had: https://github.com/cloudant/smoosh/pull/54/files#diff-7ff50b91998e5bf2a1f4cf4a8250f607L236-L238
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I'm saying the comment on lines 284/285 from before the patch are a comment about "database still compaction...". I think it'd be better to just change that comment rather than changing the whole thing back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh I see what you mean, basically leave the initial check in as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
Smoosh monitors the compactor pid to determine when the compaction jobs finishes, and uses this for its idea of concurrency. However, this isn't accurate in the case where the compaction job has to re-spawn to catch up on intervening changes since the same logical compaction job continues with another pid and smoosh is not aware. In such cases, a smoosh channel with concurrency one can start arbitrarily many additional database compaction jobs. To solve this problem, we added a check to see if a compaction PID exists for a db in `start_compact`. But wee need to add another check because this check is only for shard that comes off the queue. So the following can still occur: 1. Enqueue a bunch of stuff into channel with concurrency 1 2. Begin highest priority job, Shard1, in channel 3. Compaction finishes, discovers compaction file is behind main file 4. Smoosh-monitored PID for Shard1 exits, a new one starts to finish the job 5. Smoosh receives the 'DOWN' message, begins the next highest priority job, Shard2 6. Channel concurrency is now 2, not 1 This change adds another check into the 'DOWN' message so that it checks for that specific shard. If the compaction PID exists then it means a new process was spawned and we just monitor that one and add it back to the queue. The length of the queue does not change and therefore we won’t spawn new compaction jobs.
188af27
to
1952852
Compare
Smoosh monitors the compactor pid to determine when the compaction jobs finishes, and uses this for its idea of concurrency. However, this isn't accurate in the case where the compaction job has to re-spawn to catch up on intervening changes since the same logical compaction job continues with another pid and smoosh is not aware. In such cases, a smoosh channel with concurrency one can start arbitrarily many additional database compaction jobs. To solve this problem, we added a check to see if a compaction PID exists for a db in `start_compact`. But wee need to add another check because this check is only for shard that comes off the queue. So the following can still occur: 1. Enqueue a bunch of stuff into channel with concurrency 1 2. Begin highest priority job, Shard1, in channel 3. Compaction finishes, discovers compaction file is behind main file 4. Smoosh-monitored PID for Shard1 exits, a new one starts to finish the job 5. Smoosh receives the 'DOWN' message, begins the next highest priority job, Shard2 6. Channel concurrency is now 2, not 1 This change adds another check into the 'DOWN' message so that it checks for that specific shard. If the compaction PID exists then it means a new process was spawned and we just monitor that one and add it back to the queue. The length of the queue does not change and therefore we won’t spawn new compaction jobs.
Smoosh monitors the compactor pid to determine when the compaction jobs finishes, and uses this for its idea of concurrency. However, this isn't accurate in the case where the compaction job has to re-spawn to catch up on intervening changes since the same logical compaction job continues with another pid and smoosh is not aware. In such cases, a smoosh channel with concurrency one can start arbitrarily many additional database compaction jobs. To solve this problem, we added a check to see if a compaction PID exists for a db in `start_compact`. But wee need to add another check because this check is only for shard that comes off the queue. So the following can still occur: 1. Enqueue a bunch of stuff into channel with concurrency 1 2. Begin highest priority job, Shard1, in channel 3. Compaction finishes, discovers compaction file is behind main file 4. Smoosh-monitored PID for Shard1 exits, a new one starts to finish the job 5. Smoosh receives the 'DOWN' message, begins the next highest priority job, Shard2 6. Channel concurrency is now 2, not 1 This change adds another check into the 'DOWN' message so that it checks for that specific shard. If the compaction PID exists then it means a new process was spawned and we just monitor that one and add it back to the queue. The length of the queue does not change and therefore we won’t spawn new compaction jobs.
3.x porting - add remonitor code to DOWN message (#3144)
Smoosh monitors the compactor pid to determine when the compaction jobs
finishes, and uses this for its idea of concurrency. However, this isn't
accurate in the case where the compaction job has to re-spawn to catch up on
intervening changes since the same logical compaction job continues with
another pid and smoosh is not aware. In such cases, a smoosh channel with
concurrency one can start arbitrarily many additional database compaction jobs.
To solve this problem, we added a check to see if a compaction PID exists for
a db in
start_compact
. But this is the wrong approach because it’s for ashard that comes off the queue. So it’s a different shard and the following
can occur:
Shard2
This change moves the check into the 'DOWN' message so that we can check for
that specific shard. If the compaction PID exists then it means a new process
was spawned and we just monitor that one and add it back to the queue. The
length of the queue does not change and therefore we won’t spawn new
compaction jobs.
Overview
Testing recommendations
Related Issues or Pull Requests
Checklist
rel/overlay/etc/default.ini