Intermittently seeing "sync: WaitGroup is reused before previous Wait has returned" in transport.go #103

Closed
caldempsey opened this issue Dec 3, 2019 · 7 comments · Fixed by #140

@caldempsey

caldempsey commented Dec 3, 2019

I've intermittently seen the panic "sync: WaitGroup is reused before previous Wait has returned" raised from a goroutine at github.com/getsentry/sentry-go/transport.go:201. I'm not sure how to debug this or reproduce the conditions, and I can't find anything in the documentation that would help me investigate after the panic. The stack trace leads me to...

github.com/getsentry/sentry-go.(*HTTPTransport).Flush.func1
	vendor/github.com/getsentry/sentry-go/transport.go:201 +0x31
created by github.com/getsentry/sentry-go.(*HTTPTransport).Flush
	vendor/github.com/getsentry/sentry-go/transport.go:200 +0x7c

Any advice on tools I can use to break this open when it does occur would be appreciated!

@caldempsey caldempsey changed the title sync: WaitGroup is reused before previous Wait has returned Intermittently seeing "sync: WaitGroup is reused before previous Wait has returned" in transport.go Dec 3, 2019
@rhcarvalho
Contributor

Hi @mmacheerpuppy! Thanks for reporting the issue!


As a quick reference, this is the code reported to be faulty:

sentry-go/transport.go

Lines 199 to 217 in 54d5222

// Flush notifies when all the buffered events have been sent by returning `true`
// or `false` if timeout was reached.
func (t *HTTPTransport) Flush(timeout time.Duration) bool {
	c := make(chan struct{})
	go func() {
		t.wg.Wait()
		close(c)
	}()
	select {
	case <-c:
		Logger.Println("Buffer flushed successfully.")
		return true
	case <-time.After(timeout):
		Logger.Println("Buffer flushing reached the timeout.")
		return false
	}
}

Indeed, that usage of sync.WaitGroup does not conform to a well-behaved pattern: typically a single goroutine owns the WaitGroup and is the only one that calls its Wait method.


@mmacheerpuppy I imagine your code might be calling Flush concurrently from multiple goroutines, and that reveals the problem. On your side, you can check whether Flush is being called from multiple places concurrently; that is certainly unnecessary. In general, you only need to call sentry.Flush from the main goroutine before your program terminates, in order to wait for any in-flight events to be sent, as in the sketch below.
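
For reference, a minimal sketch of that pattern, along the lines of the README example (the DSN below is a placeholder):

package main

import (
	"errors"
	"time"

	"github.com/getsentry/sentry-go"
)

func main() {
	if err := sentry.Init(sentry.ClientOptions{
		Dsn: "https://publicKey@example.ingest.sentry.io/1", // placeholder DSN
	}); err != nil {
		panic(err)
	}
	// Flush once, from the main goroutine, right before the program exits.
	defer sentry.Flush(2 * time.Second)

	// Capture events anywhere in the program; no per-event Flush is needed.
	sentry.CaptureException(errors.New("something went wrong"))
}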

It's a bit late here in my time zone, so I'll defer writing a test to expose the problem and a fix until later. Thanks again for reporting the issue!

@rhcarvalho rhcarvalho self-assigned this Dec 3, 2019
@caldempsey
Author

caldempsey commented Dec 3, 2019

@rhcarvalho Oh thanks, that explains a lot. I have used Flush liberally as a de facto way to ensure all errors have been sent from the application (I admit I originally took it as an await statement). I suspected something was a bit fishy with the usage of sync.WaitGroup, but honestly I'm a little new to it myself (coming from a Scala/Elixir background), so I'm not aware of all the conditions that actually trigger a panic. I appreciate you responding so late and really appreciate your time; I'm also in an EU time zone.

Thank you so much ♥

@BonnieMilian

Thank you @rhcarvalho for the explanation. I had the same issue; I was calling Flush for each report.
I will try what you suggest, thanks :D

panic: sync: WaitGroup is reused before previous Wait has returned
goroutine 52062 [running]:
sync.(*WaitGroup).Wait(0xc00006a458)
	/usr/local/go/src/sync/waitgroup.go:132 +0xae
github.com/getsentry/sentry-go.(*HTTPTransport).Flush.func1(0xc00006a420, 0xc0022013e0)
	/go/pkg/mod/github.com/getsentry/sentry-go@v0.3.0/transport.go:201 +0x31
created by github.com/getsentry/sentry-go.(*HTTPTransport).Flush
	/go/pkg/mod/github.com/getsentry/sentry-go@v0.3.0/transport.go:200 +0x7c

@rhcarvalho rhcarvalho added this to To do in Sentry SDK for Go Dec 5, 2019
@rhcarvalho rhcarvalho pinned this issue Dec 17, 2019
@rhcarvalho rhcarvalho added this to the 0.5.0 milestone Dec 28, 2019
@rhcarvalho
Contributor

This test can be used to reproduce the issue:

// testWriter adapts *testing.T so that SDK Logger output is interleaved with the test log.
type testWriter testing.T

func (t *testWriter) Write(p []byte) (int, error) {
	t.Logf("%s", p)
	return len(p), nil
}

func TestHTTPTransportFlush(t *testing.T) {
	var counter uint64
	ts := httptest.NewTLSServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		t.Logf("[SERVER] received event: #%d", atomic.AddUint64(&counter, 1))
	}))
	defer ts.Close()

	Logger.SetOutput((*testWriter)(t))

	tr := NewHTTPTransport()
	tr.Configure(ClientOptions{
		Dsn:        fmt.Sprintf("https://user@%s/42", ts.Listener.Addr()),
		HTTPClient: ts.Client(),
	})

	// Local WaitGroup for the test goroutines below, separate from the
	// transport's internal WaitGroup that the panic originates from.
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		i := i
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 2; j++ {
				t.Logf("tr.SendEvent #%d from goroutine #%d", j, i)
				tr.SendEvent(NewEvent())
				ok := tr.Flush(100 * time.Millisecond)
				if !ok {
					t.Error("Flush() timed out")
				}
			}
		}()
	}
	wg.Wait()
}

Example run:

$ go test -failfast -run TestHTTPTransportFlush -count=100 -v
=== RUN   TestHTTPTransportFlush
--- PASS: TestHTTPTransportFlush (0.00s)
    transport_test.go:189: tr.SendEvent #0 from goroutine #1
    transport_test.go:189: tr.SendEvent #0 from goroutine #0
    transport_test.go:163: [Sentry] 2020/01/14 22:07:51 Sending  event [] to 127.0.0.1 project: 42
    transport_test.go:163: [Sentry] 2020/01/14 22:07:51 Sending  event [] to 127.0.0.1 project: 42
    transport_test.go:170: [SERVER] received event: #1
    transport_test.go:170: [SERVER] received event: #2
    transport_test.go:163: [Sentry] 2020/01/14 22:07:51 Buffer flushed successfully.
    transport_test.go:189: tr.SendEvent #1 from goroutine #0
    transport_test.go:163: [Sentry] 2020/01/14 22:07:51 Buffer flushed successfully.
    transport_test.go:189: tr.SendEvent #1 from goroutine #1
    transport_test.go:163: [Sentry] 2020/01/14 22:07:51 Sending  event [] to 127.0.0.1 project: 42
    transport_test.go:163: [Sentry] 2020/01/14 22:07:51 Sending  event [] to 127.0.0.1 project: 42
    transport_test.go:170: [SERVER] received event: #3
    transport_test.go:170: [SERVER] received event: #4
    transport_test.go:163: [Sentry] 2020/01/14 22:07:51 Buffer flushed successfully.
    transport_test.go:163: [Sentry] 2020/01/14 22:07:51 Buffer flushed successfully.
=== RUN   TestHTTPTransportFlush
panic: sync: WaitGroup is reused before previous Wait has returned

goroutine 15 [running]:
sync.(*WaitGroup).Wait(0xc0000204b8)
        /usr/lib/golang/src/sync/waitgroup.go:132 +0xad
github.com/getsentry/sentry-go.(*HTTPTransport).Flush.func1(0xc000020480, 0xc00002c8a0)
        .../sentry-go/transport.go:205 +0x31
created by github.com/getsentry/sentry-go.(*HTTPTransport).Flush
        .../sentry-go/transport.go:204 +0x7c
exit status 2
FAIL    github.com/getsentry/sentry-go  0.012s

Without instrumenting HTTPTransport.Flush to enforce a specific order of events, the test is flaky by nature. What's more, the panic is caused by an implementation detail. Since I'd rather test behavior, and I certainly do not want flaky tests, I am inclined not to add such a test together with the fix.

Posting it here for transparency and future reference.

@rhcarvalho
Contributor

Running the test above with -race also reveals a data race involving calls to wg.Add and wg.Wait:

==================
WARNING: DATA RACE
Read at 0x00c00038c2c8 by goroutine 62:
  sync.(*WaitGroup).Add()
      /usr/lib/golang/src/internal/race/race.go:37 +0x18d
  github.com/getsentry/sentry-go.(*HTTPTransport).SendEvent()
      .../sentry-go/transport.go:184 +0x554
  github.com/getsentry/sentry-go.TestHTTPTransportFlush.func2()
      .../sentry-go/transport_test.go:190 +0x26c

Previous write at 0x00c00038c2c8 by goroutine 63:
  sync.(*WaitGroup).Wait()
      /usr/lib/golang/src/internal/race/race.go:41 +0xee
  github.com/getsentry/sentry-go.(*HTTPTransport).Flush.func1()
      .../sentry-go/transport.go:208 +0x3e

Goroutine 62 (running) created at:
  github.com/getsentry/sentry-go.TestHTTPTransportFlush()
      .../sentry-go/transport_test.go:186 +0x40e
  testing.tRunner()
      /usr/lib/golang/src/testing/testing.go:909 +0x199

Goroutine 63 (finished) created at:
  github.com/getsentry/sentry-go.(*HTTPTransport).Flush()
      .../sentry-go/transport.go:206 +0x8c
  github.com/getsentry/sentry-go.TestHTTPTransportFlush.func2()
      .../sentry-go/transport_test.go:191 +0x286
==================

@rhcarvalho
Contributor

A small fix to prevent the panic "sync: WaitGroup is reused before previous Wait has returned" is to synchronize calls to wg.Wait() with a semaphore:

diff --git a/transport.go b/transport.go
index d0f39d9..90e7ec3 100644
--- a/transport.go
+++ b/transport.go
@@ -102,10 +102,12 @@ type HTTPTransport struct {
 	client    *http.Client
 	transport *http.Transport
 
-	buffer        chan *http.Request
+	buffer         chan *http.Request
+	wg             sync.WaitGroup // counter of buffered requests
+	flushSemaphore chan struct{}  // limit concurrent calls to Flush
+
 	disabledUntil time.Time
 
-	wg    sync.WaitGroup
 	start sync.Once
 
 	// Size of the transport buffer. Defaults to 30.
@@ -130,9 +132,10 @@ func (t *HTTPTransport) Configure(options ClientOptions) {
 		Logger.Printf("%v\n", err)
 		return
 	}
-
 	t.dsn = dsn
+
 	t.buffer = make(chan *http.Request, t.BufferSize)
+	t.flushSemaphore = make(chan struct{}, 1)
 
 	if options.HTTPTransport != nil {
 		t.transport = options.HTTPTransport
@@ -202,8 +205,10 @@ func (t *HTTPTransport) Flush(timeout time.Duration) bool {
 	c := make(chan struct{})
 
 	go func() {
+		t.flushSemaphore <- struct{}{}
 		t.wg.Wait()
 		close(c)
+		<-t.flushSemaphore
 	}()
 
 	select {

However, that doesn't fix the race between wg.Add and wg.Wait.

This type of race was discussed here: https://groups.google.com/d/msg/golang-nuts/W5fol0e4qt8/_XEPBGMQkNwJ.

[...] The issue is that there is no happens-before relation between Waitgroup.Add and Waitgroup.Wait, so [if] your code does an Add after a Wait, the race detector will tell you. You need to establish a synchronization to guarantee the Add does not pass the Wait.
-rob
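
To make the missing happens-before concrete, here is a minimal standalone sketch (not code from the SDK) of the pattern the race detector complains about: one goroutine calls Add while another may already be inside Wait, with nothing ordering the two calls.

package main

import "sync"

func main() {
	var wg sync.WaitGroup

	// Like SendEvent: a goroutine adds to the WaitGroup when work is queued.
	go func() {
		wg.Add(1) // nothing guarantees this happens before the Wait below starts
		defer wg.Done()
		// ... do some work ...
	}()

	// Like Flush: another goroutine (here, main) waits for the counter to
	// reach zero. Because Add may run concurrently with Wait, `go run -race`
	// can report a race on the WaitGroup, and reusing the WaitGroup across
	// rounds can trigger the "reused before previous Wait has returned" panic.
	wg.Wait()

	// The fix is to establish the ordering explicitly, e.g. call wg.Add(1)
	// before spawning the goroutine, or restructure so that a single
	// goroutine owns all of the synchronization.
}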

Sentry SDK for Go automation moved this from To do to Done Jan 22, 2020
rhcarvalho added a commit that referenced this issue Jan 22, 2020
* test: HTTPTransport

The ConcurrentSendAndFlush test reveals a data race in the old
HTTPTransport implementation.

* fix: HTTPTransport.Flush panic and data race

Rewrite HTTPTransport internals to remove a data race and occasional
panics.

Prior to this, HTTPTransport used a `sync.WaitGroup` to track how many
in-flight requests existed, and `Flush` waited until the observed number
of in-flight requests reached zero.

Unsynchronized access to the WaitGroup led to panics (reuse before
wg.Wait returns) and data races (undefined order of wg.Add and wg.Wait
calls).

The new implementation changes the `Flush` behavior to wait until the
current in-flight requests are processed, in line with other SDKs and the
Unified API.

Fixes #103.
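
For readers curious how Flush can wait for in-flight requests without sharing a WaitGroup across goroutines, here is one possible shape of such a design (a sketch with illustrative names, not the actual sentry-go implementation): the single worker goroutine that drains the buffer also processes flush barriers, so everything enqueued before a barrier has been handled by the time the barrier's channel is closed.

package sketch

import (
	"net/http"
	"time"
)

// queueItem and transport are illustrative names, not sentry-go types.
type queueItem struct {
	request *http.Request
	flushed chan struct{} // non-nil marks a flush barrier
}

type transport struct {
	buffer chan queueItem
}

// worker is the single goroutine that owns the buffer and all synchronization.
func (t *transport) worker() {
	for item := range t.buffer {
		if item.flushed != nil {
			// Every request enqueued before this barrier has been processed.
			close(item.flushed)
			continue
		}
		// ... send item.request over HTTP ...
		_ = item.request
	}
}

// Flush enqueues a barrier and waits for the worker to reach it, or times out.
func (t *transport) Flush(timeout time.Duration) bool {
	flushed := make(chan struct{})
	t.buffer <- queueItem{flushed: flushed}
	select {
	case <-flushed:
		return true
	case <-time.After(timeout):
		return false
	}
}

Because only the worker goroutine touches the barrier channels, concurrent Flush calls simply enqueue separate barriers and there is no shared WaitGroup left to misuse.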
@rhcarvalho
Contributor

If anyone runs into this, please update to v0.5.0 or newer (go get -u github.com/getsentry/sentry-go), and please have a look at the README.md or example/basic/main.go for an example of an idiomatic way of using sentry.Flush.

@rhcarvalho rhcarvalho unpinned this issue Feb 4, 2020