Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consumer stall after RequeueWithoutBackoff #128

Merged
merged 2 commits into from
Mar 26, 2015

Conversation

jehiah
Copy link
Member

@jehiah jehiah commented Mar 24, 2015

If a consumer is in backoff and attempts to recover with RDY 1 and then a message, but response with RequeueWithoutBackoff it will stall and not exit backoff. (failing test attached)

cc: @georgicodes

@jehiah jehiah added the bug label Mar 24, 2015
@mreiferson
Copy link
Member

🔥

@jehiah
Copy link
Member Author

jehiah commented Mar 25, 2015

RFR @mreiferson

@mreiferson
Copy link
Member

I'm not following the need for jehiah@f0be171

@jehiah
Copy link
Member Author

jehiah commented Mar 25, 2015

I moved the lock so that it covered both the r.inBackoffBlock() check on line 648 and the updating of backoffDuration which is set on line 692. Without that spanning both i suspect there is a race condition in startStopContinueBackoff. (though to be honest i just noticed this reading the code; there could be other factors that keep that from not being an issue).

As i read it, that potential race could allow simultaneous finishes on more than one message/connection (this is fed by per-connection goroutines) to update rdy and backoff unexpectedly.

@mreiferson
Copy link
Member

@jehiah that's atomic though? The only thing it needs to protect is backoffCounter... Let's remove that commit?

@jehiah
Copy link
Member Author

jehiah commented Mar 25, 2015

Yes, backoffCounter updates need to be protected, but that update also needs to also include the check and update for inBackoffBlock or it will be possible to have it executed multiple times (sequentially) in the same backoff time period.

ie: pathalogical current case is you have 10 messages on different connections req to trigger backoff; they all pass the inBackoffBlock and then sequentially (because of the current lock) each update backoffCounter. We want only one of those to proceed past the inBackoffBlock check and update the counter. The first message that passes should prevent any other message (ie: by the lock also encompassing the update to backoffDuration).

@mreiferson
Copy link
Member

Ahh, I thought you meant a "data" race.

Gonna read through again, thanks for the info...

@mreiferson
Copy link
Member

OK - I think you're right, nice catch.

While I was reading the code I cleaned up some other related things (pushed here).

My suggestion is - can't we use atomic ops for backoffCounter? This would mean that backoffMtx is more clearly only used to serialize execution of startStopContinueBackoff.

@mreiferson
Copy link
Member

(I'm going to give that a try)

@mreiferson mreiferson force-pushed the backoff_requeue_no_backoff_128 branch from 31cb33f to ba79f30 Compare March 25, 2015 18:07
@mreiferson
Copy link
Member

@jehiah check those

@jehiah
Copy link
Member Author

jehiah commented Mar 26, 2015

@mreiferson changes look good. two more small tweaks pushed. If you are good i'll squash this down.

@mreiferson
Copy link
Member

👍 🔨

@mreiferson
Copy link
Member

P.S. bump go versions in travis too?

@twmb
Copy link
Contributor

twmb commented Mar 26, 2015

Does backoff with a MaxBackoffDuration of 0 not race with resume?

The time.AfterFunc in backoff will immediately try to call resume, randomly picking a connection to updateRDY(1), while backoff is still finishing trying to updateRDY(0) for all connections.

I might be reading it wrong, though; I've not yet finished reading all of conn/consumer/delegates to see how it all fits together.

@mreiferson
Copy link
Member

@twmb nice catch, I think you're right.

@jehiah we need to reorder those operations (RDY 0 should be sent before we set the timeout for resume() and we shouldn't ever send RDY 0 or timeout if the calculated duration is 0).

@twmb
Copy link
Contributor

twmb commented Mar 26, 2015

Not sending RDY 0 if backoff duration is 0 is another way of addressing #129.

* expand locking to cover updating backoffDuration
* use atomic ops for backoffCounter
* set RDY 0 before setting backoff timer (for cases when timer is 0)
* clarify/consolidated backoff/resume functions
@jehiah jehiah force-pushed the backoff_requeue_no_backoff_128 branch from a44589f to b592d58 Compare March 26, 2015 16:26
@jehiah jehiah force-pushed the backoff_requeue_no_backoff_128 branch from 3516a9b to 8923105 Compare March 26, 2015 16:48
mreiferson added a commit that referenced this pull request Mar 26, 2015
Consumer stall after RequeueWithoutBackoff
@mreiferson mreiferson merged commit da20c0d into nsqio:master Mar 26, 2015
@mreiferson mreiferson deleted the backoff_requeue_no_backoff_128 branch March 26, 2015 16:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants