Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix data race while accessing connection in partitionConsumer #535

Merged
merged 1 commit into from
Jun 24, 2021

Conversation

dferstay
Copy link
Contributor

@dferstay dferstay commented Jun 8, 2021

The partitionConsumer maintains a few internal go-routines, two of which
access the underlying internal.Connection. The main runEvenstLoop()
go-routine reads the connection field while a separate go-routine is used
to detect connnection loss, initiate reconnection, and sets the connection.

Previously, access to the conn field was not synchronized.

Now, the conn field is read and written atomically; avoiding race
conditions.

Signed-off-by: Daniel Ferstay dferstay@splunk.com

Motivation

While attempting to submit a separate PR (#534) I found that the pulsar/reader_test consistently failed with the following data race. This change in this PR is an attempt to fix it.

2021-06-07T22:33:43.1825587Z ==================
2021-06-07T22:33:43.1825992Z WARNING: DATA RACE
2021-06-07T22:33:43.1826513Z Read at 0x00c0003de4a8 by goroutine 463:
2021-06-07T22:33:43.1828038Z   github.com/apache/pulsar-client-go/pulsar.(*partitionConsumer).requestGetLastMessageID()
2021-06-07T22:33:43.1829418Z       /pulsar-client-go/pulsar/consumer_partition.go:279 +0x27e
2021-06-07T22:33:43.1830873Z   github.com/apache/pulsar-client-go/pulsar.(*partitionConsumer).internalGetLastMessageID()
2021-06-07T22:33:43.1832427Z       /pulsar-client-go/pulsar/consumer_partition.go:270 +0xea
2021-06-07T22:33:43.1833799Z   github.com/apache/pulsar-client-go/pulsar.(*partitionConsumer).runEventsLoop()
2021-06-07T22:33:43.1835074Z       /pulsar-client-go/pulsar/consumer_partition.go:806 +0x2cb
2021-06-07T22:33:43.1835559Z 
2021-06-07T22:33:43.1836251Z Previous write at 0x00c0003de4a8 by goroutine 294:
2021-06-07T22:33:43.1837949Z time="2021-06-07T22:33:41Z" level=info msg="[Connected consumer]" consumerID=2 name= subscription=reader-kcnmq topic=my-topic-971826719
2021-06-07T22:33:43.1839441Z   github.com/apache/pulsar-client-go/pulsar.(*partitionConsumer).grabConn()
2021-06-07T22:33:43.1841513Z       /pulsar-client-go/pulsar/consumer_partition.go:974 +0x1875
2021-06-07T22:33:43.1842783Z   github.com/apache/pulsar-client-go/pulsar.(*partitionConsumer).reconnectToBroker()
2021-06-07T22:33:43.1844507Z       /pulsar-client-go/pulsar/consumer_partition.go:887 +0x2db
2021-06-07T22:33:43.1845873Z   github.com/apache/pulsar-client-go/pulsar.(*partitionConsumer).runEventsLoop.func2()
2021-06-07T22:33:43.1847183Z       /pulsar-client-go/pulsar/consumer_partition.go:791 +0xbe
2021-06-07T22:33:43.1847678Z 
2021-06-07T22:33:43.1848159Z Goroutine 463 (running) created at:
2021-06-07T22:33:43.1849312Z   github.com/apache/pulsar-client-go/pulsar.newPartitionConsumer()
2021-06-07T22:33:43.1850983Z       /pulsar-client-go/pulsar/consumer_partition.go:208 +0xf46
2021-06-07T22:33:43.1852092Z   github.com/apache/pulsar-client-go/pulsar.newReader()
2021-06-07T22:33:43.1853173Z       /pulsar-client-go/pulsar/reader_impl.go:105 +0x8ab
2021-06-07T22:33:43.1854294Z   github.com/apache/pulsar-client-go/pulsar.(*client).CreateReader()
2021-06-07T22:33:43.1855391Z       /pulsar-client-go/pulsar/client_impl.go:170 +0xcb
2021-06-07T22:33:43.1857422Z   github.com/apache/pulsar-client-go/pulsar.TestReaderLatestInclusiveHasNext()
2021-06-07T22:33:43.1858855Z       /pulsar-client-go/pulsar/reader_test.go:587 +0x946
2021-06-07T22:33:43.1859491Z   testing.tRunner()
2021-06-07T22:33:43.1860118Z       /usr/local/go/src/testing/testing.go:909 +0x199
2021-06-07T22:33:43.1860498Z 
2021-06-07T22:33:43.1860966Z Goroutine 294 (running) created at:
2021-06-07T22:33:43.1862630Z   github.com/apache/pulsar-client-go/pulsar.(*partitionConsumer).runEventsLoop()
2021-06-07T22:33:43.1864091Z       /pulsar-client-go/pulsar/consumer_partition.go:784 +0x174
2021-06-07T22:33:43.1865119Z ==================

Modifications

Store the internal.Connection managed for the partitionConsumer in an atomic.Value: https://golang.org/pkg/sync/atomic/#Value

Verifying this change

This change is already covered by existing tests that use partitionConsumer instances, such as pulsar/reader_test.

@merlimat merlimat requested a review from cckellogg June 8, 2021 00:11
pc.log.Info("Connected consumer")
pc.conn.AddConsumeHandler(pc.consumerID, pc)
pc._getConn().AddConsumeHandler(pc.consumerID, pc)
Copy link
Contributor

@cckellogg cckellogg Jun 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the behavior if _getConn() ever returns nil? Is that possible? Same for all the other places.

Copy link
Contributor Author

@dferstay dferstay Jun 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cckellogg ,

Good question; _getConn() should never return nil.

An invariant in this code is that the partitionConsumer.conn field must be set and is never nil.
The grabConn() method sets the partitionConsumer.conn field; grabConn() is called from the newPartitionConsumer factory method which will fail construction of the partitionConsumer if grabConn() returns an error.

The above said, it is probably better for the cast in _getConn() to be unchecked and let the code panic() if the invariant is broken rather than returning nil causing a nil pointer de-reference further down the line. Thoughts?

Copy link
Contributor Author

@dferstay dferstay Jun 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made the cast unchecked and added a comment to explain why.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was change that moved the broker reconnect out of the events go routine (https://github.com/apache/pulsar-client-go/pull/376/files). This is now causing the data race issue.

The question is if there is a reconnection what should happen with the pending events in the event channel? Right now they are processed using a stale/closed connection. Even with this fix it's possible events will try to get processed using a stale connection. Maybe that is ok but for me it makes the code difficult to follow and reason about. Ideally, we could come up with a cleaner way so we are never unintentionally using a stale/closed connection.

Thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cckellogg ,
I'm thinking that since PR 376 drains the connection incomingRequestsCh on close it should be possible remove the extra go-routine in the partitionConsumer and select from the connectionClosedCh in the partitonConsumer.runEventsLoop. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cckellogg ,
I've attempted the above with the following commit:
17335cd

If this change is accepted I'll clean up this branch and update the PR description.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In 5dfcc13 we also short-circuit broker re-connection attempts on consumer close.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cckellogg ,

I've reverted the behaviour of this PR back to the first approach, making access to the connection in the partitionConsumer atomic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latest CI failure is going to be addressed by #544

@dferstay dferstay force-pushed the fix-data-race-in-reader-test branch from d3c9ab3 to 18b976a Compare June 8, 2021 17:50
dferstay pushed a commit to dferstay/pulsar-client-go that referenced this pull request Jun 10, 2021
The partitionConsumer maintains a few internal go-routines, two of which
access the underlying internal.Connection.  The main runEvenstLoop()
go-routine reads the connection field while a separate go-routine is used
to detect connection loss, initiate re-connection, and set the connection.

The go-routine that initiates re-connection on connection loss was added
in the following PR in order to address a deadlock:
apache#535

The above PR also includes the following changes:
* connection drains and fails the incomingRequestsCh when the conneciton is
  closed.
* partitionConsumer uses a separate channel to communicate connection loss
  to the re-connection go-routine.

With the above it in place it is possible for the partitionConsumer to
handle the connection loss in the main runEventsLoop(), allowing us to use
a single go-routine to manage the connection and resolve the data race.

Signed-off-by: Daniel Ferstay <dferstay@splunk.com>
@dferstay dferstay force-pushed the fix-data-race-in-reader-test branch from 5dfcc13 to 18b976a Compare June 19, 2021 21:18
@dferstay dferstay force-pushed the fix-data-race-in-reader-test branch from 18b976a to 6339d74 Compare June 21, 2021 17:24
The partitionConsumer maintains a few internal go-routines, two of which
access the underlying internal.Connection.  The main runEvenstLoop()
go-routine reads the connection field while a separate go-routine is used
to detect connnection loss, initiate reconnection, and sets the connection.

Previously, access to the conn field was not synchronized.

Now, the conn field is read and written atomically; resolving the data race.

Signed-off-by: Daniel Ferstay <dferstay@splunk.com>
@dferstay dferstay force-pushed the fix-data-race-in-reader-test branch from 6339d74 to dadae1d Compare June 22, 2021 16:39
@dferstay dferstay requested a review from cckellogg June 22, 2021 23:41
@merlimat merlimat added this to the 0.6.0 milestone Jun 24, 2021
@merlimat merlimat merged commit 96ce2de into apache:master Jun 24, 2021
wolfstudy added a commit that referenced this pull request Jan 12, 2022
Signed-off-by: xiaolongran <rxl@apache.org>

### Motivation

In #700, we use a separate go rutine to handle the logic of reconnect, so here you may encounter the same data race problem as #535

### Modifications

Now, the conn field is read and written atomically; avoiding race conditions.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants