Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Issue 662] Fix race in connection.go waitUntilReady() #663

Merged
merged 3 commits into from Nov 11, 2021

Conversation

bschofield
Copy link
Contributor

@bschofield bschofield commented Nov 8, 2021

Fixes #662

Motivation

As described in #662, there appears to be a potential race condition in connection.go function waitUntilReady(): the cond.Broadcast() can occur before the cond.Wait().

[EDIT:] To be explicit, this is the issue:

[goroutine A] calls c.getState() and sees that it is not set to connectionReady
[goroutine B] changes the state to connectionReady
[goroutine B] sends a cond.Broadcast(), which goes nowhere because no goroutine is waiting.
[goroutine A] calls cond.Wait(), which never completes

Modifications

Function waitUntilReady() was previously holding the global c.Lock() on the connection. From my reading of the code, this mutex is intended to protect the cnx variable. I think that the use of c.Lock() in waitUntilReady() was probably just a typo.

Instead, I think the author probably intended to grab the lock associated with the sync.Cond, i.e. c.cond.L.Lock(). This looks like the correct thing to do when using a sync.Cond. I think there should be a corresponding lock around the cond.Broadcast() also. See https://stackoverflow.com/questions/36857167/how-to-correctly-use-sync-cond/42772799#42772799 for more details.

Verifying this change

I am unsure if this change is covered by existing tests. It fixes a rare race condition, so I think it would be quite difficult to test for.

I have deployed this change on my own production system, and it doesn't obviously break anything. I will report back if I see any issues that might be related to it.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API: no
  • The schema: no
  • The default values of configurations: no
  • The wire protocol: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

@bschofield
Copy link
Contributor Author

bschofield commented Nov 8, 2021

@cckellogg @BewareMyPower I think this relates to your discussion in https://github.com/apache/pulsar-client-go/pull/631/files#r723350028, would appreciate it if you could take a look.

You correctly removed the use of c.Lock(), but I think c.cond.L.Lock() should be used instead. And although it is absolutely true that the sync.Cond documentation says you aren't required to hold the c.cond.L.Lock() when broadcasting, if you don't then it looks like you can end up with the race observed in #662. Thanks in advance!

@BewareMyPower
Copy link
Contributor

BewareMyPower commented Nov 9, 2021

cnx.cond = sync.NewCond(cnx)

And see https://pkg.go.dev/sync@go1.17.1#NewCond

NewCond returns a new Cond with Locker l.

// NewCond returns a new Cond with Locker l.
func NewCond(l Locker) *Cond {
    return &Cond{L: l}
}

So I think c.Lock() is the same as c.cond.L.Lock().

Back to your issue, how could you ensure c.cond.Wait() must be called before c.cond.BroadCast() is called? And I think it's not necessary. Please see my code snippet.

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	m := sync.Mutex{}
	c := sync.NewCond(&m)
	flag := false

	defer m.Unlock()
	m.Lock()

	go func() {
		// No locks here
		fmt.Println("Before Broadcast")
		flag = true
		c.Broadcast()
		fmt.Println("After Broadcast")
	}()

	// Sleep for 2 seconds to make sure c.Broadcast() is called before c.Wait()
	time.Sleep(time.Duration(2) * time.Second)
	fmt.Println("Before Wait")
	for !flag {
		c.Wait()
	}
	fmt.Printf("After Wait, flag is %v\n", flag)
}

I called time.Sleep to make sure c.Wait() is called after c.Broadcast(). And the output is

Before Broadcast
After Broadcast
Before Wait
After Wait, flag is true

@BewareMyPower
Copy link
Contributor

Sorry I gave a wrong example. The c.Wait() is not called because flag is already true. It's right if my example was modified from

	for !flag {
		c.Wait()
	}

to

	for {
		c.Wait()
		if flag {
			break
		}
	}

@BewareMyPower
Copy link
Contributor

I agree that we should call c.Broadcast() after c.Wait(), but the code here is not like the example in https://stackoverflow.com/questions/36857167/how-to-correctly-use-sync-cond/42772799#42772799.

    m := &sync.Mutex{}
    c := sync.NewCond(m)
    m.Lock()
    go func() {
        m.Lock() // Wait for c.Wait()
        c.Broadcast()
        m.Unlock()
    }()
    c.Wait() // Unlocks m, waits, then locks m again
    m.Unlock()

Because in that example, the m.Lock() has already been called before the goroutine is executed. So c.Broadcast() won't be called before c.Wait() releases the mutex internally. However, for the code here, how could you ensure that when changeState is called, the lock has already been acquired in waitUntilReady?

@BewareMyPower
Copy link
Contributor

The current waitUntilReady implementation is

func (c *connection) waitUntilReady() error {
	c.Lock()
	defer c.Unlock()
	c.cond.L.Lock()
	defer c.cond.L.Unlock()

	for c.getState() != connectionReady {
		c.log.Debugf("Wait until connection is ready state=%s", c.getState().String())
                 // 1. `c.getState()` is called again
		if c.getState() == connectionClosed {
			return errors.New("connection error")
		}
                // 2. Even if the 2nd call of `c.getState()` returns connectionReady, `c.cond.Wait()` will still be called
		c.cond.Wait()
	}
	return nil
}

See my comments for my concern. getState() and setState() are both atomic. But if we the combination of two getState() calls is not atomic.

Therefore, I think a better solution is

func (c *connection) waitUntilReady() error {
	for {
		switch state := c.getState(); state {
		case connectionReady:
			return nil
		case connectionClosed:
			return errors.New("connection error")
		default:
			c.log.Debugf("Wait until connection is ready state=%s", state.String())
			// Here we use c.cond.L.Lock() because we might not use c.Lock as the cond's internal Locker in future
			c.cond.L.Lock()
			defer c.cond.L.Unlock()
			c.cond.Wait()
		}
	}
}

@@ -327,8 +327,8 @@ func (c *connection) doHandshake() bool {
}

func (c *connection) waitUntilReady() error {
c.Lock()
defer c.Unlock()
c.cond.L.Lock()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious what's the difference between doing c.Lock() vs c.cond.L.Lock()? Same for the changeState function? From looking at the code they should share the same lock?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They're two different mutexes. See #663 (comment) for an explanation.

Copy link
Contributor Author

@bschofield bschofield Nov 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that waitUntilReady() and changeState() should share the same lock.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, I was wrong! See #663 (comment) which proves that c.Lock() and c.cond.L.Lock() are the same thing.

@bschofield
Copy link
Contributor Author

Thanks for taking a look at this, @BewareMyPower and @cckellogg!

So I think c.Lock() is the same as c.cond.L.Lock().

I'm pretty sure that's not the case. If we take a look at the definition of the connection type:

type connection struct {
	sync.Mutex
	cond              *sync.Cond
[...]

...then we can see that connection struct is built out of a sync.Mutex using struct composition. That means that calls to c.Lock() are calling the sync.Mutex which is part of the connection struct itself.

If we next take a look at the definition of sync.Cond we see:

type Cond struct {
	noCopy noCopy

	// L is held while observing or changing the condition
	L Locker
[...]

(where Locker is go's interface type for a sync.Mutex). So these are two separate locks: they are not the same.

Another way we can convince ourselves of this is by noting that it is forbidden to copy a sync.Mutex, you have to take pointers to it. Since the struct composition on connection does not use a pointer type, there would be no way to alias the mutex inside cond to the mutex on connection. They must be different things.

@bschofield
Copy link
Contributor Author

bschofield commented Nov 9, 2021

@BewareMyPower:

However, for the code here, how could you ensure that when changeState is called, the lock has already been acquired in waitUntilReady?

It doesn't matter whether changeState() is called first, or waitUntilReady() is called first. Using the lock prevents the race condition that can occur in the gap between the test in for c.getState() != connectionReady and the call to c.cond.Wait() occurring.

If changeState gets called before waitUntilReady, then the broadcast happens first and all is well.

if waitUntilReady gets called before changeState, then waitUntilReady will block on the cond.Wait(), whilst still holding the lock. However, cond.Wait() call has a clever internal release of the lock which I think should still allow the changeState to happen:

func (c *Cond) Wait() {
	c.checker.check()
	t := runtime_notifyListAdd(&c.notify)
	c.L.Unlock()
	runtime_notifyListWait(&c.notify, t)
	c.L.Lock()
}

(note that the c.L here is what we are calling c.cond.L).

@bschofield
Copy link
Contributor Author

@BewareMyPower: I tried this modification of your code, which I think is exactly the one you suggested:

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	m := sync.Mutex{}
	c := sync.NewCond(&m)
	flag := false

	defer m.Unlock()
	m.Lock()

	go func() {
		// No locks here
		fmt.Println("Before Broadcast")
		flag = true
		c.Broadcast()
		fmt.Println("After Broadcast")
	}()

	// Sleep for 2 seconds to make sure c.Broadcast() is called before c.Wait()
	time.Sleep(time.Duration(2) * time.Second)
	fmt.Println("Before Wait")
	for {
		c.Wait()
		if flag {
			break
		}
	}
	fmt.Printf("After Wait, flag is %v\n", flag)
}

When I run that, this happens:

Before Broadcast
After Broadcast
Before Wait
fatal error: all goroutines are asleep - deadlock!

goroutine 1 [sync.Cond.Wait]:
sync.runtime_notifyListWait(0xc0000b8050, 0x0)
	/home/ben/sdk/go1.17.1/src/runtime/sema.go:513 +0x13d
sync.(*Cond).Wait(0x4b4360)
	/home/ben/sdk/go1.17.1/src/sync/cond.go:56 +0x8c
main.main()
	/home/ben/tmp/foo.go:29 +0x191

So, I think that test case (which you very kindly provided) confirms that we agree that it really is necessary to call cond.Broadcast() after cond.Wait()?

@bschofield
Copy link
Contributor Author

@BewareMyPower:

Therefore, I think a better solution is [...]

With that solution, the bug that I am experiencing in production could still occur:

[goroutine A] calls c.getState() and sees that it is not set to connectionReady
[goroutine B] changes the state to connectionReady
[goroutine B] sends a cond.Broadcast(), which goes nowhere because no goroutine is waiting.
[goroutine A] calls cond.Wait(), which never completes

To avoid this, it is necessary to hold the lock before calling c.getState().

@BewareMyPower
Copy link
Contributor

BewareMyPower commented Nov 9, 2021

we agree that it really is necessary to call cond.Broadcast() after cond.Wait()?

Yeah, I agree.

If changeState gets called before waitUntilReady, then the broadcast happens first and all is well.

if waitUntilReady gets called before changeState, then waitUntilReady will block on the cond.Wait(), whilst still holding the lock.

It makes sense. We should make the combination of setState and c.Broadcast() an atomic operation.

however you should avoid calling defer inside the for loop because a new defer call will be pushed onto the stack for each loop invocation.

Thanks for pointing it out. I'm not so familiar with Go, just being used to C++ object's lifetime rule, which is a little different.


The only question I'm still confused is, could you explain following code and its output?

package main

import (
	"fmt"
	"sync"
)

type connection struct {
	sync.Mutex
	cond *sync.Cond
}

func NewConnection() *connection {
	cnx := &connection{}
	cnx.cond = sync.NewCond(cnx)
	return cnx
}

func main() {
	cnx := NewConnection()
	fmt.Printf("%v\n%v\n", cnx, cnx.cond.L)

	cnx.cond.L.Lock()
	cnx.Lock()
	cnx.Unlock()
	cnx.cond.L.Unlock()
}
&{{0 0} 0xc000104040}
&{{0 0} 0xc000104040}
fatal error: all goroutines are asleep - deadlock!

goroutine 1 [semacquire]:
sync.runtime_SemacquireMutex(0xc000010244, 0x0, 0x1)
	/usr/local/go/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc000010240)
	/usr/local/go/src/sync/mutex.go:138 +0x105
sync.(*Mutex).Lock(...)
	/usr/local/go/src/sync/mutex.go:81
main.main()
	main.go:24 +0x175

It looks like cnx.Lock() and cnx.cond.L.Lock() are the same.

@BewareMyPower
Copy link
Contributor

@BewareMyPower:

Therefore, I think a better solution is [...]

With that solution, the bug that I am experiencing in production could still occur:

[goroutine A] calls c.getState() and sees that it is not set to connectionReady [goroutine B] changes the state to connectionReady [goroutine B] sends a cond.Broadcast(), which goes nowhere because no goroutine is waiting. [goroutine A] calls cond.Wait(), which never completes

To avoid this, it is necessary to hold the lock before calling c.getState().

It's right. I also realized this problem just now. Here we should lock the whole changeState. I'm just wondering whether there should be another lock for this. But c.Lock() might be used in more places in future. (If c.Lock() and c.cond.L.Lock() are the same)

@bschofield
Copy link
Contributor Author

bschofield commented Nov 9, 2021

The only question I'm still confused is, could you explain following code and its output?

Ag! I'm so sorry @BewareMyPower, you're 100% correct: c.Lock() and c.cond.L.Lock() absolutely are the same lock. I had totally missed that sync.NewCond took a Locker as an argument. Thanks for clarifying that, and my apologies.

I'm just wondering whether there should be another lock for this.

I think that's up to you guys -- it depends what the ultimate intent of the lock is. Is it to protect against all state changes in the connection, or just some of them?

@BewareMyPower
Copy link
Contributor

BewareMyPower commented Nov 9, 2021

I think that's up to you guys -- it depends what the ultimate intent of the lock is. Is it to protect against all state changes in the connection, or just some of them.

Yeah, it won't block this PR. Just say it because I've seen the similar case in C++ client that a common mutex is used everywhere.

For this PR, I think you can change c.cond.L. back to c., then it will LGTM.

ben added 2 commits November 9, 2021 09:47
…ocker as an argument. Add comments to make it clear why this lock is required, to hopefully avoid accidental removal in future.
@bschofield
Copy link
Contributor Author

OK, I made that change and added a comment so hopefully nobody removes the lock in future. Thanks again for looking at this, I really appreciate it!

@cckellogg cckellogg merged commit 774769d into apache:master Nov 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

newPartitionProducer() occasionally hangs, waiting for a ready connection
3 participants