MB-44944 intersect scan hangs on timeout
The issue is a race condition between the timeout stop and the first child terminating.
If the first child terminates and the timeout is then received, we wait for all children,
even though we have already detected that the first child has gone, so we wait one time too many.
We had already started to deal with missed signals in 6.6.1, but it turns out there's more.
The issue can be traced back to at least Vulcan.

Change-Id: Ic5fe08ad05f047cc7daaaded9835482f47b75386
Reviewed-on: http://review.couchbase.org/c/query/+/148493
Reviewed-by: Sitaram Vemulapalli <sitaram.vemulapalli@couchbase.com>
Reviewed-by: Bingjie Miao <bingjie.miao@couchbase.com>
Well-Formed: Build Bot <build@couchbase.com>
Tested-by: Marco Greco <marco.greco@couchbase.com>
Marco Greco committed Mar 16, 2021
1 parent ba6227e commit 023958a
Showing 1 changed file with 7 additions and 6 deletions.
13 changes: 7 additions & 6 deletions execution/base.go
@@ -762,20 +762,21 @@ func (this *base) childrenWaitNoStop(ops ...Operator) {
 	for _, o := range ops {
 		b := o.getBase()
 		b.activeCond.L.Lock()
-		state := b.opState
-		b.activeCond.L.Unlock()
-		switch state {
-		case _RUNNING, _STOPPING, _HALTING, _COMPLETED, _STOPPED, _HALTED:
+		switch b.opState {
+		case _RUNNING, _STOPPING, _HALTING:
 			// signal reliably sent
-			this.ValueExchange().retrieveChildNoStop()
+			b.activeCond.Wait()
+		case _COMPLETED, _STOPPED, _HALTED:
+			// signal reliably sent, but already stopped
 		case _CREATED, _PAUSED, _KILLED, _PANICKED:
 			// signal reliably not sent
 		default:
 
 			// we are waiting after we've sent a stop but before we have terminated
 			// flag bad states
-			assert(false, fmt.Sprintf("child has unexpected state %v", state))
+			assert(false, fmt.Sprintf("child has unexpected state %v", b.opState))
 		}
+		b.activeCond.L.Unlock()
 	}
 	this.switchPhase(_EXECTIME)
 }
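
The race described in the commit message can be illustrated outside the query engine. The following is a minimal, standalone sketch, not the Couchbase implementation: the `child` type, its `finish` and `waitDone` methods, and the two-value `opState` are hypothetical names introduced only for this example. It shows the general pattern of inspecting a child's state while holding the condition's lock and waiting only if the child is still running, so a child that terminated before the wait began is never waited for. The sketch uses the usual `for` loop around `sync.Cond.Wait`, which is an assumption of this example rather than something taken from the change.

```go
package main

import (
	"fmt"
	"sync"
)

// opState is a hypothetical, reduced state set for this sketch.
type opState int

const (
	running opState = iota
	completed
)

// child is a hypothetical stand-in for an operator with a state
// guarded by a condition variable.
type child struct {
	cond  *sync.Cond
	state opState
}

func newChild() *child {
	return &child{cond: sync.NewCond(&sync.Mutex{}), state: running}
}

// finish marks the child as terminated and wakes any waiter.
func (c *child) finish() {
	c.cond.L.Lock()
	c.state = completed
	c.cond.L.Unlock()
	c.cond.Broadcast()
}

// waitDone blocks until the child has terminated. The state is read
// under the lock, so a termination that happened earlier is seen and
// no wait takes place. It reports whether it had to wait at all.
func (c *child) waitDone() bool {
	c.cond.L.Lock()
	defer c.cond.L.Unlock()
	waited := false
	for c.state == running {
		waited = true
		c.cond.Wait()
	}
	return waited
}

func main() {
	early := newChild()
	early.finish() // terminates before the parent starts waiting

	late := newChild()
	go late.finish() // may terminate before or during the wait

	fmt.Println(early.waitDone()) // prints "false"
	late.waitDone()               // returns in either interleaving
	fmt.Println("all children done")
}
```

Because the check and the wait happen under the same lock, there is no window in which the child can terminate unnoticed between reading its state and starting to wait.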