gremlin-go: fix deadlock on (*DriverRemoteConnection).Close#1734
Conversation
|
I'm adding here a reproducer just in case it is useful to review the PR: main.go package main
import (
"log"
"sync"
gremlingo "github.com/apache/tinkerpop/gremlin-go/v3/driver"
)
const (
gremlinServerEndpoint = "ws://localhost:8182/gremlin"
nWorkers = 25
)
func main() {
var wg sync.WaitGroup
for i := 0; i < nWorkers; i++ {
wg.Add(1)
go func(id int) {
worker(id)
wg.Done()
}(i)
}
log.Print("waiting")
wg.Wait()
log.Print("done")
}
func worker(id int) {
conn, err := gremlingo.NewDriverRemoteConnection(gremlinServerEndpoint, func(settings *gremlingo.DriverRemoteConnectionSettings) {
settings.LogVerbosity = gremlingo.Debug
})
if err != nil {
log.Printf("[%v] could not create remote connection: %v", id, err)
return
}
defer conn.Close()
g := gremlingo.Traversal_().WithRemote(conn)
result, err := g.
Inject(1).
SideEffect(&gremlingo.Lambda{"Thread.sleep(20_000)", "gremlin-groovy"}).
Next()
if err != nil {
log.Printf("[%v] error: %v", id, err)
return
}
log.Printf("[%v] result: %v", id, result.GetInterface())
}Steps:
|
Codecov Report
@@ Coverage Diff @@
## 3.5-dev #1734 +/- ##
===========================================
+ Coverage 63.51% 63.56% +0.05%
===========================================
Files 23 23
Lines 3601 3606 +5
===========================================
+ Hits 2287 2292 +5
Misses 1136 1136
Partials 178 178
Continue to review full report at Codecov.
|
|
@simonz-bq @lyndonbauto I think you are the authors of this part of the code. Is there anything I can do to help moving this forward? Thanks! |
|
@jroimartin hey so sorry I missed this! It's way past my bedtime but I skimmed the code and it looks good to me. 1 request, could you take the code you reproduced the issue with and make it a test in our testing framework? After that I will merge it. If for you reason the test can't be added let me know Thanks a lot for fixing this :) |
Fix a deadlock when a remote connection is closed and, at the same time, the
connection fails. The following call graphs illustrate the issue:
goroutine run by main:
(*DriverRemoteConnection).Close
(*Client).Close
(*loadBalancingPool).close
(*connection).close
(*gremlinServerWSProtocol).close
(*WaitGroup).Wait
gremlin-go's readLoop goroutine:
(*gremlinServerWSProtocol).readLoop
readErrorHandler
(*connection).errorCallback-fm
(*connection).errorCallback
(*gremlinServerWSProtocol).close
(*WaitGroup).Wait
(*WaitGroup).Done
protocol.closed is only modified by (*gremlinServerWSProtocol).close. If it is
true it means that (*WaitGroup).Wait has already been called, so it should not
be called again.
|
LGTM. Can go in once the test is written. I could be wrong but it shouldn't be too difficult, as to replicate the code path a unit test can cover it. |
|
Thanks for the review! The thing is that I need to force a failure in the connection to trigger the deadlock, so I was thinking on adding an integration test based on the following code. I rewrote the reproducer to exemplify the idea, so you can play with it if you want. package main
import (
"context"
"fmt"
"log"
"net/http"
"net/http/httptest"
"net/http/httputil"
"net/url"
"sync"
"time"
gremlingo "github.com/apache/tinkerpop/gremlin-go/v3/driver"
)
const gremlinServerEndpoint = "http://localhost:8182/"
func main() {
gsURL, err := url.Parse(gremlinServerEndpoint)
if err != nil {
log.Fatalf("invalid URL: %v", err)
}
rp := httputil.NewSingleHostReverseProxy(gsURL)
ctx, cancel := context.WithTimeout(context.Background(), 1500*time.Millisecond)
defer cancel()
rpServer := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
rp.ServeHTTP(w, r.WithContext(ctx))
}))
defer rpServer.Close()
gremlinRpEndpoint := fmt.Sprintf("ws://%v/gremlin", rpServer.Listener.Addr().String())
var wg sync.WaitGroup
for i := 0; i < 25; i++ {
wg.Add(1)
go func(id int) {
defer wg.Done()
conn, err := gremlingo.NewDriverRemoteConnection(gremlinRpEndpoint, func(settings *gremlingo.DriverRemoteConnectionSettings) {
settings.LogVerbosity = gremlingo.Debug
})
if err != nil {
log.Printf("[%v] could not create remote connection: %v", id, err)
return
}
defer conn.Close()
g := gremlingo.Traversal_().WithRemote(conn)
result, err := g.
Inject(1).
SideEffect(&gremlingo.Lambda{"Thread.sleep(3000)", "gremlin-groovy"}).
Next()
if err != nil {
log.Printf("[%v] error: %v", id, err)
return
}
log.Printf("[%v] result: %v", id, result.GetInterface())
}(i)
}
done := make(chan bool)
go func() {
log.Printf("waiting")
wg.Wait()
log.Printf("done")
done <- true
}()
select {
case <-time.After(5000 * time.Millisecond):
log.Fatal("timeout")
case <-done:
}
}It basically creates a reverse proxy pointing to the gremlin-server. Its HTTP handler makes the requests to timeout after some time, causing the connection error I need. Then a timer is used to identify the deadlock. Is it an overkill? Given that the issue involves |
|
@jroimartin I'm not a committer to have much weight in my opinion, but from my point of view in reality the server can malfunction in a myriad of ways, many of which are not replicable to test via code since the malfunctions are not triggerable via code. In said cases, we use our discretion as developers to see the potential code paths and do our best to ensure that it is robust. So, in this case, what I am asking for and believe is sufficient, is a unit test that would fail after a timeout caused by the deadlock you described. In this unit test, all we need to do is check the path in your newly added code |
c85fffb to
d8f91f8
Compare
|
Unit test added in d8f91f8. Is this what you had in mind? |
|
LGTM |
|
Was there a Jira ticket for this? |
@mikepersonick no, there wasn't. |
Further analysis shows that apache#1734 is not the right fix. A deadlock can still happen in (*gremlinServerWSProtocol).readLoop if errorCallback is called before the connection protocol is closed and, because of this, (*gremlinServerWSProtocol).close calls (*gremlinServerWSProtocol).wg.Wait. This commit adds the parameter "wait" to protocol.close. So, the call to (*gremlinServerWSProtocol).wg.Wait can be avoided when called from within an error callback. Also, the test that was introduced has been updated. Additionally, it fixes a data race due to concurrent use of connection.state. The following stack trace shows an instance of this problem. Read at 0x00c007c41340 by goroutine 686: github.com/apache/tinkerpop/gremlin-go/v3/driver.(*connection).close() /home/n/src/tinkerpop/.wt/3.5-dev/gremlin-go/driver/connection.go:65 +0x3b github.com/apache/tinkerpop/gremlin-go/v3/driver.(*loadBalancingPool).close() /home/n/src/tinkerpop/.wt/3.5-dev/gremlin-go/driver/connectionPool.go:57 +0x16f github.com/apache/tinkerpop/gremlin-go/v3/driver.(*Client).Close() /home/n/src/tinkerpop/.wt/3.5-dev/gremlin-go/driver/client.go:144 +0x2d7 github.com/apache/tinkerpop/gremlin-go/v3/driver.(*DriverRemoteConnection).Close() /home/n/src/tinkerpop/.wt/3.5-dev/gremlin-go/driver/driverRemoteConnection.go:159 +0x404 main.worker.func2() /home/n/tmp/tinkerpop-repros/gremlin-go-data-race/main.go:50 +0x39 runtime.deferreturn() /home/n/goroot/src/runtime/panic.go:436 +0x32 main.main.func1() /home/n/tmp/tinkerpop-repros/gremlin-go-data-race/main.go:33 +0x35 main.main.func2() /home/n/tmp/tinkerpop-repros/gremlin-go-data-race/main.go:35 +0x47 Previous write at 0x00c007c41340 by goroutine 3613: github.com/apache/tinkerpop/gremlin-go/v3/driver.(*connection).errorCallback() /home/n/src/tinkerpop/.wt/3.5-dev/gremlin-go/driver/connection.go:57 +0x66 github.com/apache/tinkerpop/gremlin-go/v3/driver.(*connection).errorCallback-fm() <autogenerated>:1 +0x39 github.com/apache/tinkerpop/gremlin-go/v3/driver.readErrorHandler() /home/n/src/tinkerpop/.wt/3.5-dev/gremlin-go/driver/protocol.go:96 +0x112 github.com/apache/tinkerpop/gremlin-go/v3/driver.(*gremlinServerWSProtocol).readLoop() /home/n/src/tinkerpop/.wt/3.5-dev/gremlin-go/driver/protocol.go:69 +0x615 github.com/apache/tinkerpop/gremlin-go/v3/driver.newGremlinServerWSProtocol.func1() /home/n/src/tinkerpop/.wt/3.5-dev/gremlin-go/driver/protocol.go:202 +0x58 Goroutine 686 (running) created at: main.main() /home/n/tmp/tinkerpop-repros/gremlin-go-data-race/main.go:32 +0x77 Goroutine 3613 (running) created at: github.com/apache/tinkerpop/gremlin-go/v3/driver.newGremlinServerWSProtocol() /home/n/src/tinkerpop/.wt/3.5-dev/gremlin-go/driver/protocol.go:202 +0x431 github.com/apache/tinkerpop/gremlin-go/v3/driver.createConnection() /home/n/src/tinkerpop/.wt/3.5-dev/gremlin-go/driver/connection.go:107 +0x266 github.com/apache/tinkerpop/gremlin-go/v3/driver.newLoadBalancingPool.func1() /home/n/src/tinkerpop/.wt/3.5-dev/gremlin-go/driver/connectionPool.go:150 +0x109 Tested against commit 29346f5 (3.5-dev).
Further analysis shows that apache#1734 is not the right fix. A deadlock can still happen in (*gremlinServerWSProtocol).readLoop if errorCallback is called before the connection protocol is closed and, because of this, (*gremlinServerWSProtocol).close calls (*gremlinServerWSProtocol).wg.Wait. This commit adds the parameter "wait" to protocol.close. So, the call to (*gremlinServerWSProtocol).wg.Wait can be avoided when called from within an error callback. Also, the test that was introduced has been updated.
Further analysis shows that apache#1734 is not the right fix. A deadlock can still happen in (*gremlinServerWSProtocol).readLoop if errorCallback is called before the connection protocol is closed and, because of this, (*gremlinServerWSProtocol).close calls (*gremlinServerWSProtocol).wg.Wait. This commit adds the parameter "wait" to protocol.close. So, the call to (*gremlinServerWSProtocol).wg.Wait can be skipped when called from within an error callback. Also, the test that was introduced has been updated.
Further analysis shows that apache#1734 is not the right fix. A deadlock can still happen in (*gremlinServerWSProtocol).readLoop if errorCallback is called before the connection protocol is closed and, because of this, (*gremlinServerWSProtocol).close calls (*gremlinServerWSProtocol).wg.Wait. This commit adds the parameter "wait" to protocol.close. So, the call to (*gremlinServerWSProtocol).wg.Wait can be skipped when called from within an error callback. Also, the test that was introduced has been updated.
Fix a deadlock when a remote connection is closed and, at the same time, the connection fails. The following call graphs illustrate the issue:
goroutine run by main:
gremlin-go's readLoop goroutine:
protocol.closedis only modified by(*gremlinServerWSProtocol).close. If it is true it means that(*WaitGroup).Waithas already been called, so it should not be called again.