Skip to content

gremlin-go: fix deadlock on (*DriverRemoteConnection).Close#1734

Merged
mikepersonick merged 3 commits into
apache:3.5-devfrom
jroimartin:fix-deadlock-conn-close
Jul 7, 2022
Merged

gremlin-go: fix deadlock on (*DriverRemoteConnection).Close#1734
mikepersonick merged 3 commits into
apache:3.5-devfrom
jroimartin:fix-deadlock-conn-close

Conversation

@jroimartin
Copy link
Copy Markdown
Contributor

Fix a deadlock when a remote connection is closed and, at the same time, the connection fails. The following call graphs illustrate the issue:

goroutine run by main:

(*DriverRemoteConnection).Close
  (*Client).Close
    (*loadBalancingPool).close
      (*connection).close
        (*gremlinServerWSProtocol).close
          (*WaitGroup).Wait

gremlin-go's readLoop goroutine:

(*gremlinServerWSProtocol).readLoop
  readErrorHandler
    (*connection).errorCallback-fm
      (*connection).errorCallback
        (*gremlinServerWSProtocol).close
          (*WaitGroup).Wait
  (*WaitGroup).Done

protocol.closed is only modified by (*gremlinServerWSProtocol).close. If it is true it means that (*WaitGroup).Wait has already been called, so it should not be called again.

@jroimartin
Copy link
Copy Markdown
Contributor Author

jroimartin commented Jun 25, 2022

I'm adding here a reproducer just in case it is useful to review the PR:

main.go

package main

import (
	"log"
	"sync"

	gremlingo "github.com/apache/tinkerpop/gremlin-go/v3/driver"
)

const (
	gremlinServerEndpoint = "ws://localhost:8182/gremlin"
	nWorkers              = 25
)

func main() {
	var wg sync.WaitGroup
	for i := 0; i < nWorkers; i++ {
		wg.Add(1)
		go func(id int) {
			worker(id)
			wg.Done()
		}(i)
	}
	log.Print("waiting")
	wg.Wait()
	log.Print("done")
}

func worker(id int) {
	conn, err := gremlingo.NewDriverRemoteConnection(gremlinServerEndpoint, func(settings *gremlingo.DriverRemoteConnectionSettings) {
		settings.LogVerbosity = gremlingo.Debug
	})
	if err != nil {
		log.Printf("[%v] could not create remote connection: %v", id, err)
		return
	}
	defer conn.Close()

	g := gremlingo.Traversal_().WithRemote(conn)

	result, err := g.
		Inject(1).
		SideEffect(&gremlingo.Lambda{"Thread.sleep(20_000)", "gremlin-groovy"}).
		Next()
	if err != nil {
		log.Printf("[%v] error: %v", id, err)
		return
	}

	log.Printf("[%v] result: %v", id, result.GetInterface())
}

Steps:

  1. Run gremlin-server:
docker run --rm -ti -p 8182:8182 --name gremlin-server tinkerpop/gremlin-server:3.5.4-SNAPSHOT
  1. Run reproducer:
go run main.go
  1. Stop gremlin-server with C-c.
  2. The reproducer should hang.

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Jun 27, 2022

Codecov Report

Merging #1734 (d8f91f8) into 3.5-dev (71f3ee0) will increase coverage by 0.05%.
The diff coverage is 100.00%.

@@             Coverage Diff             @@
##           3.5-dev    #1734      +/-   ##
===========================================
+ Coverage    63.51%   63.56%   +0.05%     
===========================================
  Files           23       23              
  Lines         3601     3606       +5     
===========================================
+ Hits          2287     2292       +5     
  Misses        1136     1136              
  Partials       178      178              
Impacted Files Coverage Δ
gremlin-go/driver/protocol.go 64.61% <100.00%> (+1.41%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 40508a4...d8f91f8. Read the comment docs.

@jroimartin
Copy link
Copy Markdown
Contributor Author

@simonz-bq @lyndonbauto I think you are the authors of this part of the code. Is there anything I can do to help moving this forward? Thanks!

@lyndonbauto
Copy link
Copy Markdown
Contributor

@jroimartin hey so sorry I missed this! It's way past my bedtime but I skimmed the code and it looks good to me.

1 request, could you take the code you reproduced the issue with and make it a test in our testing framework? After that I will merge it. If for you reason the test can't be added let me know

Thanks a lot for fixing this :)

Fix a deadlock when a remote connection is closed and, at the same time, the
connection fails. The following call graphs illustrate the issue:

goroutine run by main:

(*DriverRemoteConnection).Close
  (*Client).Close
    (*loadBalancingPool).close
      (*connection).close
        (*gremlinServerWSProtocol).close
          (*WaitGroup).Wait

gremlin-go's readLoop goroutine:

(*gremlinServerWSProtocol).readLoop
  readErrorHandler
    (*connection).errorCallback-fm
      (*connection).errorCallback
        (*gremlinServerWSProtocol).close
          (*WaitGroup).Wait
  (*WaitGroup).Done

protocol.closed is only modified by (*gremlinServerWSProtocol).close. If it is
true it means that (*WaitGroup).Wait has already been called, so it should not
be called again.
@simonz-bq
Copy link
Copy Markdown
Contributor

LGTM. Can go in once the test is written.

I could be wrong but it shouldn't be too difficult, as to replicate the code path a unit test can cover it.

@jroimartin
Copy link
Copy Markdown
Contributor Author

jroimartin commented Jun 30, 2022

Thanks for the review!

The thing is that I need to force a failure in the connection to trigger the deadlock, so I was thinking on adding an integration test based on the following code. I rewrote the reproducer to exemplify the idea, so you can play with it if you want.

package main

import (
	"context"
	"fmt"
	"log"
	"net/http"
	"net/http/httptest"
	"net/http/httputil"
	"net/url"
	"sync"
	"time"

	gremlingo "github.com/apache/tinkerpop/gremlin-go/v3/driver"
)

const gremlinServerEndpoint = "http://localhost:8182/"

func main() {
	gsURL, err := url.Parse(gremlinServerEndpoint)
	if err != nil {
		log.Fatalf("invalid URL: %v", err)
	}
	rp := httputil.NewSingleHostReverseProxy(gsURL)

	ctx, cancel := context.WithTimeout(context.Background(), 1500*time.Millisecond)
	defer cancel()

	rpServer := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		rp.ServeHTTP(w, r.WithContext(ctx))
	}))
	defer rpServer.Close()

	gremlinRpEndpoint := fmt.Sprintf("ws://%v/gremlin", rpServer.Listener.Addr().String())

	var wg sync.WaitGroup
	for i := 0; i < 25; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()

			conn, err := gremlingo.NewDriverRemoteConnection(gremlinRpEndpoint, func(settings *gremlingo.DriverRemoteConnectionSettings) {
				settings.LogVerbosity = gremlingo.Debug
			})
			if err != nil {
				log.Printf("[%v] could not create remote connection: %v", id, err)
				return
			}
			defer conn.Close()

			g := gremlingo.Traversal_().WithRemote(conn)

			result, err := g.
				Inject(1).
				SideEffect(&gremlingo.Lambda{"Thread.sleep(3000)", "gremlin-groovy"}).
				Next()
			if err != nil {
				log.Printf("[%v] error: %v", id, err)
				return
			}
			log.Printf("[%v] result: %v", id, result.GetInterface())
		}(i)
	}

	done := make(chan bool)

	go func() {
		log.Printf("waiting")
		wg.Wait()
		log.Printf("done")
		done <- true
	}()

	select {
	case <-time.After(5000 * time.Millisecond):
		log.Fatal("timeout")
	case <-done:
	}
}

It basically creates a reverse proxy pointing to the gremlin-server. Its HTTP handler makes the requests to timeout after some time, causing the connection error I need. Then a timer is used to identify the deadlock. Is it an overkill?

Given that the issue involves (*gremlinServerWSProtocol).readLoop, (*connection).errorCallbackand (*gremlinServerWSProtocol).close, I'm not sure if a unit test fits well here. @simonz-bq you are more familiar with the code, what do you think?

@simonz-bq
Copy link
Copy Markdown
Contributor

@jroimartin I'm not a committer to have much weight in my opinion, but from my point of view in reality the server can malfunction in a myriad of ways, many of which are not replicable to test via code since the malfunctions are not triggerable via code. In said cases, we use our discretion as developers to see the potential code paths and do our best to ensure that it is robust.

So, in this case, what I am asking for and believe is sufficient, is a unit test that would fail after a timeout caused by the deadlock you described. In this unit test, all we need to do is check the path in your newly added code if protocol.closed { protocol.mutex.Unlock() return nil } to make sure it's hit and returns instead of messing with the waitgroup which caused the deadlock.

@jroimartin jroimartin force-pushed the fix-deadlock-conn-close branch from c85fffb to d8f91f8 Compare June 30, 2022 23:05
@jroimartin
Copy link
Copy Markdown
Contributor Author

Unit test added in d8f91f8. Is this what you had in mind?

@simonz-bq
Copy link
Copy Markdown
Contributor

LGTM

@mikepersonick mikepersonick merged commit 8e5c93b into apache:3.5-dev Jul 7, 2022
@mikepersonick
Copy link
Copy Markdown
Member

Was there a Jira ticket for this?

@jroimartin
Copy link
Copy Markdown
Contributor Author

Was there a Jira ticket for this?

@mikepersonick no, there wasn't.

@jroimartin jroimartin deleted the fix-deadlock-conn-close branch July 7, 2022 21:47
jroimartin added a commit to jroimartin/tinkerpop that referenced this pull request Jul 11, 2022
Further analysis shows that apache#1734 is not the right fix. A
deadlock can still happen in (*gremlinServerWSProtocol).readLoop if
errorCallback is called before the connection protocol is closed and, because
of this, (*gremlinServerWSProtocol).close calls
(*gremlinServerWSProtocol).wg.Wait.

This commit adds the parameter "wait" to protocol.close. So, the call to
(*gremlinServerWSProtocol).wg.Wait can be avoided when called from within an
error callback. Also, the test that was introduced has been updated.

Additionally, it fixes a data race due to concurrent use of connection.state.
The following stack trace shows an instance of this problem.

Read at 0x00c007c41340 by goroutine 686:
  github.com/apache/tinkerpop/gremlin-go/v3/driver.(*connection).close()
      /home/n/src/tinkerpop/.wt/3.5-dev/gremlin-go/driver/connection.go:65 +0x3b
  github.com/apache/tinkerpop/gremlin-go/v3/driver.(*loadBalancingPool).close()
      /home/n/src/tinkerpop/.wt/3.5-dev/gremlin-go/driver/connectionPool.go:57 +0x16f
  github.com/apache/tinkerpop/gremlin-go/v3/driver.(*Client).Close()
      /home/n/src/tinkerpop/.wt/3.5-dev/gremlin-go/driver/client.go:144 +0x2d7
  github.com/apache/tinkerpop/gremlin-go/v3/driver.(*DriverRemoteConnection).Close()
      /home/n/src/tinkerpop/.wt/3.5-dev/gremlin-go/driver/driverRemoteConnection.go:159 +0x404
  main.worker.func2()
      /home/n/tmp/tinkerpop-repros/gremlin-go-data-race/main.go:50 +0x39
  runtime.deferreturn()
      /home/n/goroot/src/runtime/panic.go:436 +0x32
  main.main.func1()
      /home/n/tmp/tinkerpop-repros/gremlin-go-data-race/main.go:33 +0x35
  main.main.func2()
      /home/n/tmp/tinkerpop-repros/gremlin-go-data-race/main.go:35 +0x47

Previous write at 0x00c007c41340 by goroutine 3613:
  github.com/apache/tinkerpop/gremlin-go/v3/driver.(*connection).errorCallback()
      /home/n/src/tinkerpop/.wt/3.5-dev/gremlin-go/driver/connection.go:57 +0x66
  github.com/apache/tinkerpop/gremlin-go/v3/driver.(*connection).errorCallback-fm()
      <autogenerated>:1 +0x39
  github.com/apache/tinkerpop/gremlin-go/v3/driver.readErrorHandler()
      /home/n/src/tinkerpop/.wt/3.5-dev/gremlin-go/driver/protocol.go:96 +0x112
  github.com/apache/tinkerpop/gremlin-go/v3/driver.(*gremlinServerWSProtocol).readLoop()
      /home/n/src/tinkerpop/.wt/3.5-dev/gremlin-go/driver/protocol.go:69 +0x615
  github.com/apache/tinkerpop/gremlin-go/v3/driver.newGremlinServerWSProtocol.func1()
      /home/n/src/tinkerpop/.wt/3.5-dev/gremlin-go/driver/protocol.go:202 +0x58

Goroutine 686 (running) created at:
  main.main()
      /home/n/tmp/tinkerpop-repros/gremlin-go-data-race/main.go:32 +0x77

Goroutine 3613 (running) created at:
  github.com/apache/tinkerpop/gremlin-go/v3/driver.newGremlinServerWSProtocol()
      /home/n/src/tinkerpop/.wt/3.5-dev/gremlin-go/driver/protocol.go:202 +0x431
  github.com/apache/tinkerpop/gremlin-go/v3/driver.createConnection()
      /home/n/src/tinkerpop/.wt/3.5-dev/gremlin-go/driver/connection.go:107 +0x266
  github.com/apache/tinkerpop/gremlin-go/v3/driver.newLoadBalancingPool.func1()
      /home/n/src/tinkerpop/.wt/3.5-dev/gremlin-go/driver/connectionPool.go:150 +0x109

Tested against commit 29346f5 (3.5-dev).
jroimartin added a commit to jroimartin/tinkerpop that referenced this pull request Jul 12, 2022
Further analysis shows that apache#1734 is not the right fix. A
deadlock can still happen in (*gremlinServerWSProtocol).readLoop if
errorCallback is called before the connection protocol is closed and, because
of this, (*gremlinServerWSProtocol).close calls
(*gremlinServerWSProtocol).wg.Wait.

This commit adds the parameter "wait" to protocol.close. So, the call to
(*gremlinServerWSProtocol).wg.Wait can be avoided when called from within an
error callback. Also, the test that was introduced has been updated.
jroimartin added a commit to jroimartin/tinkerpop that referenced this pull request Jul 12, 2022
Further analysis shows that apache#1734 is not the right fix. A
deadlock can still happen in (*gremlinServerWSProtocol).readLoop if
errorCallback is called before the connection protocol is closed and, because
of this, (*gremlinServerWSProtocol).close calls
(*gremlinServerWSProtocol).wg.Wait.

This commit adds the parameter "wait" to protocol.close. So, the call to
(*gremlinServerWSProtocol).wg.Wait can be skipped when called from within an
error callback. Also, the test that was introduced has been updated.
jroimartin added a commit to jroimartin/tinkerpop that referenced this pull request Jul 12, 2022
Further analysis shows that apache#1734 is not the right fix. A
deadlock can still happen in (*gremlinServerWSProtocol).readLoop if
errorCallback is called before the connection protocol is closed and, because
of this, (*gremlinServerWSProtocol).close calls
(*gremlinServerWSProtocol).wg.Wait.

This commit adds the parameter "wait" to protocol.close. So, the call to
(*gremlinServerWSProtocol).wg.Wait can be skipped when called from within an
error callback. Also, the test that was introduced has been updated.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants