[Issue 662] Fix race in connection.go waitUntilReady() #663

bschofield · 2021-11-08T16:45:44Z

Fixes #662

Motivation

As described in #662, there appears to be a potential race condition in connection.go function waitUntilReady(): the cond.Broadcast() can occur before the cond.Wait().

[EDIT:] To be explicit, this is the issue:

[goroutine A] calls c.getState() and sees that it is not set to connectionReady
[goroutine B] changes the state to connectionReady
[goroutine B] sends a cond.Broadcast(), which goes nowhere because no goroutine is waiting.
[goroutine A] calls cond.Wait(), which never completes

Modifications

Function waitUntilReady() was previously holding the global c.Lock() on the connection. From my reading of the code, this mutex is intended to protect the cnx variable. I think that the use of c.Lock() in waitUntilReady() was probably just a typo.

Instead, I think the author probably intended to grab the lock associated with the sync.Cond, i.e. c.cond.L.Lock(). This looks like the correct thing to do when using a sync.Cond. I think there should be a corresponding lock around the cond.Broadcast() also. See https://stackoverflow.com/questions/36857167/how-to-correctly-use-sync-cond/42772799#42772799 for more details.

Verifying this change

I am unsure if this change is covered by existing tests. It fixes a rare race condition, so I think it would be quite difficult to test for.

I have deployed this change on my own production system, and it doesn't obviously break anything. I will report back if I see any issues that might be related to it.

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): no
The public API: no
The schema: no
The default values of configurations: no
The wire protocol: no

Documentation

Does this pull request introduce a new feature? no
If yes, how is the feature documented? not applicable

bschofield · 2021-11-08T18:17:33Z

@cckellogg @BewareMyPower I think this relates to your discussion in https://github.com/apache/pulsar-client-go/pull/631/files#r723350028, would appreciate it if you could take a look.

You correctly removed the use of c.Lock(), but I think c.cond.L.Lock() should be used instead. And although it is absolutely true that the sync.Cond documentation says you aren't required to hold the c.cond.L.Lock() when broadcasting, if you don't then it looks like you can end up with the race observed in #662. Thanks in advance!

BewareMyPower · 2021-11-09T01:34:36Z

pulsar-client-go/pulsar/internal/connection.go

Line 212 in fe3b7c4

cnx.cond = sync.NewCond(cnx)

And see https://pkg.go.dev/sync@go1.17.1#NewCond

NewCond returns a new Cond with Locker l.

// NewCond returns a new Cond with Locker l.
func NewCond(l Locker) *Cond {
    return &Cond{L: l}
}

So I think c.Lock() is the same as c.cond.L.Lock().

Back to your issue, how could you ensure c.cond.Wait() must be called before c.cond.BroadCast() is called? And I think it's not necessary. Please see my code snippet.

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	m := sync.Mutex{}
	c := sync.NewCond(&m)
	flag := false

	defer m.Unlock()
	m.Lock()

	go func() {
		// No locks here
		fmt.Println("Before Broadcast")
		flag = true
		c.Broadcast()
		fmt.Println("After Broadcast")
	}()

	// Sleep for 2 seconds to make sure c.Broadcast() is called before c.Wait()
	time.Sleep(time.Duration(2) * time.Second)
	fmt.Println("Before Wait")
	for !flag {
		c.Wait()
	}
	fmt.Printf("After Wait, flag is %v\n", flag)
}

I called time.Sleep to make sure c.Wait() is called after c.Broadcast(). And the output is

Before Broadcast
After Broadcast
Before Wait
After Wait, flag is true

BewareMyPower · 2021-11-09T01:42:08Z

Sorry I gave a wrong example. The c.Wait() is not called because flag is already true. It's right if my example was modified from

	for !flag {
		c.Wait()
	}

to

	for {
		c.Wait()
		if flag {
			break
		}
	}

BewareMyPower · 2021-11-09T02:05:53Z

I agree that we should call c.Broadcast() after c.Wait(), but the code here is not like the example in https://stackoverflow.com/questions/36857167/how-to-correctly-use-sync-cond/42772799#42772799.

    m := &sync.Mutex{}
    c := sync.NewCond(m)
    m.Lock()
    go func() {
        m.Lock() // Wait for c.Wait()
        c.Broadcast()
        m.Unlock()
    }()
    c.Wait() // Unlocks m, waits, then locks m again
    m.Unlock()

Because in that example, the m.Lock() has already been called before the goroutine is executed. So c.Broadcast() won't be called before c.Wait() releases the mutex internally. However, for the code here, how could you ensure that when changeState is called, the lock has already been acquired in waitUntilReady?

BewareMyPower · 2021-11-09T02:17:56Z

The current waitUntilReady implementation is

func (c *connection) waitUntilReady() error {
	c.Lock()
	defer c.Unlock()
	c.cond.L.Lock()
	defer c.cond.L.Unlock()

	for c.getState() != connectionReady {
		c.log.Debugf("Wait until connection is ready state=%s", c.getState().String())
                 // 1. `c.getState()` is called again
		if c.getState() == connectionClosed {
			return errors.New("connection error")
		}
                // 2. Even if the 2nd call of `c.getState()` returns connectionReady, `c.cond.Wait()` will still be called
		c.cond.Wait()
	}
	return nil
}

See my comments for my concern. getState() and setState() are both atomic. But if we the combination of two getState() calls is not atomic.

Therefore, I think a better solution is

func (c *connection) waitUntilReady() error {
	for {
		switch state := c.getState(); state {
		case connectionReady:
			return nil
		case connectionClosed:
			return errors.New("connection error")
		default:
			c.log.Debugf("Wait until connection is ready state=%s", state.String())
			// Here we use c.cond.L.Lock() because we might not use c.Lock as the cond's internal Locker in future
			c.cond.L.Lock()
			defer c.cond.L.Unlock()
			c.cond.Wait()
		}
	}
}

cckellogg · 2021-11-09T03:57:22Z

pulsar/internal/connection.go

@@ -327,8 +327,8 @@ func (c *connection) doHandshake() bool {
 }

 func (c *connection) waitUntilReady() error {
-	c.Lock()
-	defer c.Unlock()
+	c.cond.L.Lock()


Curious what's the difference between doing c.Lock() vs c.cond.L.Lock()? Same for the changeState function? From looking at the code they should share the same lock?

They're two different mutexes. See #663 (comment) for an explanation.

I agree that waitUntilReady() and changeState() should share the same lock.

Nope, I was wrong! See #663 (comment) which proves that c.Lock() and c.cond.L.Lock() are the same thing.

bschofield · 2021-11-09T08:44:28Z

Thanks for taking a look at this, @BewareMyPower and @cckellogg!

So I think c.Lock() is the same as c.cond.L.Lock().

I'm pretty sure that's not the case. If we take a look at the definition of the connection type:

type connection struct {
	sync.Mutex
	cond              *sync.Cond
[...]

...then we can see that connection struct is built out of a sync.Mutex using struct composition. That means that calls to c.Lock() are calling the sync.Mutex which is part of the connection struct itself.

If we next take a look at the definition of sync.Cond we see:

type Cond struct {
	noCopy noCopy

	// L is held while observing or changing the condition
	L Locker
[...]

(where Locker is go's interface type for a sync.Mutex). So these are two separate locks: they are not the same.

Another way we can convince ourselves of this is by noting that it is forbidden to copy a sync.Mutex, you have to take pointers to it. Since the struct composition on connection does not use a pointer type, there would be no way to alias the mutex inside cond to the mutex on connection. They must be different things.

bschofield · 2021-11-09T09:01:22Z

@BewareMyPower:

However, for the code here, how could you ensure that when changeState is called, the lock has already been acquired in waitUntilReady?

It doesn't matter whether changeState() is called first, or waitUntilReady() is called first. Using the lock prevents the race condition that can occur in the gap between the test in for c.getState() != connectionReady and the call to c.cond.Wait() occurring.

If changeState gets called before waitUntilReady, then the broadcast happens first and all is well.

if waitUntilReady gets called before changeState, then waitUntilReady will block on the cond.Wait(), whilst still holding the lock. However, cond.Wait() call has a clever internal release of the lock which I think should still allow the changeState to happen:

func (c *Cond) Wait() {
	c.checker.check()
	t := runtime_notifyListAdd(&c.notify)
	c.L.Unlock()
	runtime_notifyListWait(&c.notify, t)
	c.L.Lock()
}

(note that the c.L here is what we are calling c.cond.L).

bschofield · 2021-11-09T09:05:49Z

@BewareMyPower: I tried this modification of your code, which I think is exactly the one you suggested:

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	m := sync.Mutex{}
	c := sync.NewCond(&m)
	flag := false

	defer m.Unlock()
	m.Lock()

	go func() {
		// No locks here
		fmt.Println("Before Broadcast")
		flag = true
		c.Broadcast()
		fmt.Println("After Broadcast")
	}()

	// Sleep for 2 seconds to make sure c.Broadcast() is called before c.Wait()
	time.Sleep(time.Duration(2) * time.Second)
	fmt.Println("Before Wait")
	for {
		c.Wait()
		if flag {
			break
		}
	}
	fmt.Printf("After Wait, flag is %v\n", flag)
}

When I run that, this happens:

Before Broadcast
After Broadcast
Before Wait
fatal error: all goroutines are asleep - deadlock!

goroutine 1 [sync.Cond.Wait]:
sync.runtime_notifyListWait(0xc0000b8050, 0x0)
	/home/ben/sdk/go1.17.1/src/runtime/sema.go:513 +0x13d
sync.(*Cond).Wait(0x4b4360)
	/home/ben/sdk/go1.17.1/src/sync/cond.go:56 +0x8c
main.main()
	/home/ben/tmp/foo.go:29 +0x191

So, I think that test case (which you very kindly provided) confirms that we agree that it really is necessary to call cond.Broadcast() after cond.Wait()?

bschofield · 2021-11-09T09:25:26Z

@BewareMyPower:

Therefore, I think a better solution is [...]

With that solution, the bug that I am experiencing in production could still occur:

[goroutine A] calls c.getState() and sees that it is not set to connectionReady
[goroutine B] changes the state to connectionReady
[goroutine B] sends a cond.Broadcast(), which goes nowhere because no goroutine is waiting.
[goroutine A] calls cond.Wait(), which never completes

To avoid this, it is necessary to hold the lock before calling c.getState().

BewareMyPower · 2021-11-09T09:26:37Z

we agree that it really is necessary to call cond.Broadcast() after cond.Wait()?

Yeah, I agree.

If changeState gets called before waitUntilReady, then the broadcast happens first and all is well.

if waitUntilReady gets called before changeState, then waitUntilReady will block on the cond.Wait(), whilst still holding the lock.

It makes sense. We should make the combination of setState and c.Broadcast() an atomic operation.

however you should avoid calling defer inside the for loop because a new defer call will be pushed onto the stack for each loop invocation.

Thanks for pointing it out. I'm not so familiar with Go, just being used to C++ object's lifetime rule, which is a little different.

The only question I'm still confused is, could you explain following code and its output?

package main

import (
	"fmt"
	"sync"
)

type connection struct {
	sync.Mutex
	cond *sync.Cond
}

func NewConnection() *connection {
	cnx := &connection{}
	cnx.cond = sync.NewCond(cnx)
	return cnx
}

func main() {
	cnx := NewConnection()
	fmt.Printf("%v\n%v\n", cnx, cnx.cond.L)

	cnx.cond.L.Lock()
	cnx.Lock()
	cnx.Unlock()
	cnx.cond.L.Unlock()
}

&{{0 0} 0xc000104040}
&{{0 0} 0xc000104040}
fatal error: all goroutines are asleep - deadlock!

goroutine 1 [semacquire]:
sync.runtime_SemacquireMutex(0xc000010244, 0x0, 0x1)
	/usr/local/go/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc000010240)
	/usr/local/go/src/sync/mutex.go:138 +0x105
sync.(*Mutex).Lock(...)
	/usr/local/go/src/sync/mutex.go:81
main.main()
	main.go:24 +0x175

It looks like cnx.Lock() and cnx.cond.L.Lock() are the same.

BewareMyPower · 2021-11-09T09:30:15Z

@BewareMyPower:

Therefore, I think a better solution is [...]

With that solution, the bug that I am experiencing in production could still occur:

[goroutine A] calls c.getState() and sees that it is not set to connectionReady [goroutine B] changes the state to connectionReady [goroutine B] sends a cond.Broadcast(), which goes nowhere because no goroutine is waiting. [goroutine A] calls cond.Wait(), which never completes

To avoid this, it is necessary to hold the lock before calling c.getState().

It's right. I also realized this problem just now. Here we should lock the whole changeState. I'm just wondering whether there should be another lock for this. But c.Lock() might be used in more places in future. (If c.Lock() and c.cond.L.Lock() are the same)

bschofield · 2021-11-09T09:35:33Z

The only question I'm still confused is, could you explain following code and its output?

Ag! I'm so sorry @BewareMyPower, you're 100% correct: c.Lock() and c.cond.L.Lock() absolutely are the same lock. I had totally missed that sync.NewCond took a Locker as an argument. Thanks for clarifying that, and my apologies.

I'm just wondering whether there should be another lock for this.

I think that's up to you guys -- it depends what the ultimate intent of the lock is. Is it to protect against all state changes in the connection, or just some of them?

BewareMyPower · 2021-11-09T09:39:06Z

I think that's up to you guys -- it depends what the ultimate intent of the lock is. Is it to protect against all state changes in the connection, or just some of them.

Yeah, it won't block this PR. Just say it because I've seen the similar case in C++ client that a common mutex is used everywhere.

For this PR, I think you can change c.cond.L. back to c., then it will LGTM.

…ocker as an argument. Add comments to make it clear why this lock is required, to hopefully avoid accidental removal in future.

bschofield · 2021-11-09T09:49:33Z

OK, I made that change and added a comment so hopefully nobody removes the lock in future. Thanks again for looking at this, I really appreciate it!

Fix issue 662: race in connection.go waitUntilReady().

e189cf4

BewareMyPower requested review from merlimat and cckellogg November 9, 2021 02:10

BewareMyPower assigned bschofield Nov 9, 2021

BewareMyPower added the type/bug label Nov 9, 2021

BewareMyPower requested review from freeznet and wolfstudy November 9, 2021 02:24

cckellogg reviewed Nov 9, 2021

View reviewed changes

ben added 2 commits November 9, 2021 09:47

c.Lock() is the same as c.cond.Lock(), because sync.NewCond takes a L…

defbae0

…ocker as an argument. Add comments to make it clear why this lock is required, to hopefully avoid accidental removal in future.

Clarify comment.

c78f71d

BewareMyPower approved these changes Nov 9, 2021

View reviewed changes

cckellogg approved these changes Nov 9, 2021

View reviewed changes

cckellogg merged commit 774769d into apache:master Nov 11, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Issue 662] Fix race in connection.go waitUntilReady() #663

[Issue 662] Fix race in connection.go waitUntilReady() #663

bschofield commented Nov 8, 2021 •

edited

Loading

bschofield commented Nov 8, 2021 •

edited

Loading

BewareMyPower commented Nov 9, 2021 •

edited

Loading

BewareMyPower commented Nov 9, 2021

BewareMyPower commented Nov 9, 2021

BewareMyPower commented Nov 9, 2021

cckellogg Nov 9, 2021

bschofield Nov 9, 2021

bschofield Nov 9, 2021 •

edited

Loading

bschofield Nov 9, 2021

bschofield commented Nov 9, 2021

bschofield commented Nov 9, 2021 •

edited

Loading

bschofield commented Nov 9, 2021

bschofield commented Nov 9, 2021

BewareMyPower commented Nov 9, 2021 •

edited

Loading

BewareMyPower commented Nov 9, 2021

bschofield commented Nov 9, 2021 •

edited

Loading

BewareMyPower commented Nov 9, 2021 •

edited

Loading

bschofield commented Nov 9, 2021

[Issue 662] Fix race in connection.go waitUntilReady() #663

[Issue 662] Fix race in connection.go waitUntilReady() #663

Conversation

bschofield commented Nov 8, 2021 • edited Loading

Motivation

Modifications

Verifying this change

Does this pull request potentially affect one of the following parts:

Documentation

bschofield commented Nov 8, 2021 • edited Loading

BewareMyPower commented Nov 9, 2021 • edited Loading

BewareMyPower commented Nov 9, 2021

BewareMyPower commented Nov 9, 2021

BewareMyPower commented Nov 9, 2021

cckellogg Nov 9, 2021

Choose a reason for hiding this comment

bschofield Nov 9, 2021

Choose a reason for hiding this comment

bschofield Nov 9, 2021 • edited Loading

Choose a reason for hiding this comment

bschofield Nov 9, 2021

Choose a reason for hiding this comment

bschofield commented Nov 9, 2021

bschofield commented Nov 9, 2021 • edited Loading

bschofield commented Nov 9, 2021

bschofield commented Nov 9, 2021

BewareMyPower commented Nov 9, 2021 • edited Loading

BewareMyPower commented Nov 9, 2021

bschofield commented Nov 9, 2021 • edited Loading

BewareMyPower commented Nov 9, 2021 • edited Loading

bschofield commented Nov 9, 2021

bschofield commented Nov 8, 2021 •

edited

Loading

bschofield commented Nov 8, 2021 •

edited

Loading

BewareMyPower commented Nov 9, 2021 •

edited

Loading

bschofield Nov 9, 2021 •

edited

Loading

bschofield commented Nov 9, 2021 •

edited

Loading

BewareMyPower commented Nov 9, 2021 •

edited

Loading

bschofield commented Nov 9, 2021 •

edited

Loading

BewareMyPower commented Nov 9, 2021 •

edited

Loading