
SendAsync() can stall with large BatchingMaxPublishDelay #437

Closed
bschofield opened this issue Jan 8, 2021 · 4 comments


bschofield commented Jan 8, 2021

Observed behavior

When using SendAsync() together with a large BatchingMaxPublishDelay, such that batch flushing is driven mainly by BatchingMaxMessages, send stalls can occur.

I don't fully understand the cause of these, but increasing MaxPendingMessages seems to make them go away.

It may be relevant that I am producing to a topic with several partitions.

I originally came across this problem after recently moving from the cgo client: with the same configuration, the pure Go client was exhibiting much worse throughput for high-rate sends, maxing out at ~6k records/sec where the cgo client had been giving me ~50k/sec.

Steps to reproduce

Create a producer with a large BatchingMaxPublishDelay and the other values default, e.g.

pulsar.ProducerOptions{
  Topic:                   topic,
  CompressionType:         pulsar.ZLib,
  BatchingMaxPublishDelay: 100 * time.Second,
}

Enable debug logging and produce to a partitioned topic with a reasonable number of partitions (in my case, six), using SendAsync() with a callback function set (the callback does nothing except crash on error); a rough sketch of the test loop is below. Note that the debug log will frequently stall after a "Received send request" message and pause until a flush initiated by the max publish delay occurs.
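For concreteness, the test program looks roughly like the sketch below. The broker URL, topic name, and message count are placeholders rather than my exact test code:

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/apache/pulsar-client-go/pulsar"
)

func main() {
	client, err := pulsar.NewClient(pulsar.ClientOptions{
		URL: "pulsar://localhost:6650", // placeholder broker URL
	})
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	producer, err := client.CreateProducer(pulsar.ProducerOptions{
		Topic:                   "persistent://public/default/my-topic", // partitioned topic (six partitions in my test)
		CompressionType:         pulsar.ZLib,
		BatchingMaxPublishDelay: 100 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer producer.Close()

	for i := 0; i < 1000000; i++ {
		producer.SendAsync(context.Background(), &pulsar.ProducerMessage{
			Payload: []byte(fmt.Sprintf("msg-%d", i)),
		}, func(_ pulsar.MessageID, _ *pulsar.ProducerMessage, err error) {
			// The callback does nothing except crash on error.
			if err != nil {
				log.Fatal(err)
			}
		})
	}

	if err := producer.Flush(); err != nil {
		log.Fatal(err)
	}
}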

Increase MaxPendingMessages to 2000 and try again. The stalls now go away.

Increase MaxPendingMessages to 6000 and try again. The stalls are still gone, and throughput appears, if anything, a little better than with 2000.
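In terms of ProducerOptions, that is roughly (only a sketch of the fields I changed):

pulsar.ProducerOptions{
  Topic:                   topic,
  CompressionType:         pulsar.ZLib,
  BatchingMaxPublishDelay: 100 * time.Second,
  MaxPendingMessages:      6000, // 2000 also removes the stalls; 6000 seems marginally faster
}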

System configuration

Pulsar version: 2.6.1
Client version: 71cc54f (current master)


bschofield commented Jan 8, 2021

Profiling the goroutines in my test program while a send is stalled gives the result shown below.

[Screenshot from 2021-01-08 15-33-43: goroutine profile captured during the stall]

Note that partitionProducer.internalSendAsync() is blocked on semaphore acquisition. Since I don't have the DisableBlockIfQueueFull option set, I think this must be at line 586 of producer_partition.go:

p.publishSemaphore.Acquire()

In the creation of partitionProducer at line 108, publishSemaphore is initialized with a size from maxPendingMessages:

publishSemaphore: internal.NewSemaphore(int32(maxPendingMessages)),

I am unclear why this semaphore size is inadequate, but this seems like it might be the source of the issue.

Is there perhaps a case where the semaphore can fill up before the message-count-based batch flush occurs, so that the corresponding release in ReceivedSendReceipt() never fires?
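To make that hypothesis concrete, here is a toy model of the pattern I have in mind. It uses deliberately tiny constants and is not the real client code or its defaults: each send takes a permit, permits are only returned once a flush produces receipts, and if the count-based flush threshold is never reached before the permits run out, the next send blocks.

package main

import "fmt"

func main() {
	const maxPendingMessages = 3  // stands in for MaxPendingMessages
	const batchingMaxMessages = 5 // stands in for BatchingMaxMessages

	permits := make(chan struct{}, maxPendingMessages) // the publish semaphore
	var batch [][]byte

	flush := func() {
		fmt.Printf("flushing %d messages\n", len(batch))
		for range batch {
			<-permits // a receipt arrives, so a permit is released
		}
		batch = batch[:0]
	}

	for i := 0; i < 10; i++ {
		permits <- struct{}{} // deadlocks at i == 3: no flush has happened, so no permit was ever released
		batch = append(batch, []byte("msg"))
		if len(batch) >= batchingMaxMessages {
			flush() // never reached: the permits run out first
		}
	}
}

In the real producer the release would happen in ReceivedSendReceipt(), and the timer-based flush eventually unblocks things, which would match the stalls I see lasting until BatchingMaxPublishDelay expires.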


bschofield commented Jan 8, 2021

Currently, in batch_builder.go, batchContainer.Add() returns false if the current batch container has too many bytes, but not if the batch container has too many messages.

If we change that, i.e. edit

} else if bc.hasSpace(payload) {
	// The current batch is full. Producer has to call Flush() to
	return false
}

to

} else if bc.IsFull() || bc.hasSpace(payload) {
	// The current batch is full. Producer has to call Flush() to
	return false
}

...then the problem also seems to go away, without changing MaxPendingMessages.

It could be that increasing MaxPendingMessages is just a band-aid: it lets enough messages accumulate for the batch containers to hit the byte-size limit (and therefore flush), and the real bug is here.

However, I'm not confident enough that this is the correct fix to send a PR for it. Would appreciate @merlimat or @wolfstudy taking a look if you have time.


bschofield commented Jan 11, 2021

I worked around this bug by setting BatchingMaxMessages = math.MaxInt32, so that batching is driven only by the space limit and the max delay (which I set to 1 second).

This seems to give me throughput comparable to, or better than, the cgo client.
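Concretely, the options I ended up with look roughly like this (topic and compression as before; math needs to be imported):

pulsar.ProducerOptions{
  Topic:                   topic,
  CompressionType:         pulsar.ZLib,
  BatchingMaxMessages:     math.MaxInt32, // effectively disables count-based flushing
  BatchingMaxPublishDelay: 1 * time.Second,
}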

@bschofield

I believe this is probably fixed by #528, and so I am closing.
