
SendAsync() can stall with large BatchingMaxPublishDelay #437

Closed
bschofield opened this issue Jan 8, 2021 · 4 comments


bschofield commented Jan 8, 2021

Observed behavior

When using SendAsync() together with a large BatchingMaxPublishDelay, such that batch flushing is driven mainly by BatchingMaxMessages, send stalls can occur.

I don't fully understand the cause of these, but increasing MaxPendingMessages seems to make them go away.

It may be relevant that I am producing to a topic with several partitions.

I originally came across this problem after recently moving from the cgo client: with the same configuration, the pure Go client was exhibiting much worse throughput for high-rate sends, maxing out at ~6k records/sec where the cgo client had been giving me ~50k/sec.

Steps to reproduce

Create a producer with a large BatchingMaxPublishDelay and the other values default, e.g.

pulsar.ProducerOptions{
  Topic:                   topic,
  CompressionType:         pulsar.ZLib,
  BatchingMaxPublishDelay: 100 * time.Second,
}

Enable debug logging and produce to a partitioned topic with a reasonable number of partitions (in my case, six), using SendAsync() with a callback function set (the callback does nothing except crash on error); a rough sketch of the test loop is below. Note that the debug log will frequently stall after a "Received send request" message and pause until a flush initiated by the max publish delay occurs.
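For concreteness, the test program looks roughly like the sketch below. The broker URL, topic name, and message count are placeholders rather than my exact test code:

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/apache/pulsar-client-go/pulsar"
)

func main() {
	client, err := pulsar.NewClient(pulsar.ClientOptions{
		URL: "pulsar://localhost:6650", // placeholder broker URL
	})
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	producer, err := client.CreateProducer(pulsar.ProducerOptions{
		Topic:                   "persistent://public/default/my-topic", // partitioned topic (six partitions in my test)
		CompressionType:         pulsar.ZLib,
		BatchingMaxPublishDelay: 100 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer producer.Close()

	for i := 0; i < 1000000; i++ {
		producer.SendAsync(context.Background(), &pulsar.ProducerMessage{
			Payload: []byte(fmt.Sprintf("msg-%d", i)),
		}, func(_ pulsar.MessageID, _ *pulsar.ProducerMessage, err error) {
			// The callback does nothing except crash on error.
			if err != nil {
				log.Fatal(err)
			}
		})
	}

	if err := producer.Flush(); err != nil {
		log.Fatal(err)
	}
}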

Increase MaxPendingMessages to 2000 and try again. The stalls now go away.

Increase MaxPendingMessages to 6000 and try again. The stalls are still gone, and throughput appears, if anything, a little better than with 2000.
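In terms of ProducerOptions, that is roughly (only a sketch of the fields I changed):

pulsar.ProducerOptions{
  Topic:                   topic,
  CompressionType:         pulsar.ZLib,
  BatchingMaxPublishDelay: 100 * time.Second,
  MaxPendingMessages:      6000, // 2000 also removes the stalls; 6000 seems marginally faster
}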

System configuration

Pulsar version: 2.6.1
Client version: 71cc54f (current master)


bschofield commented Jan 8, 2021

Profiling the goroutines in my test program while a send is stalled gives the result shown below.

[Screenshot from 2021-01-08 15-33-43: goroutine profile captured during the stall]

Note that partitionProducer.internalSendAsync() is blocked on semaphore acquisition. Since I don't have the DisableBlockIfQueueFull option set, I think this must be at line 586 of producer_partition.go:

p.publishSemaphore.Acquire()

In the creation of partitionProducer at line 108, publishSemaphore is initialized with a size from maxPendingMessages:

publishSemaphore: internal.NewSemaphore(int32(maxPendingMessages)),

I am unclear why this semaphore size is inadequate, but this seems like it might be the source of the issue.

Is there perhaps a case where the semaphore can fill up before the message-count-based batch flush occurs, so that the corresponding release in ReceivedSendReceipt() never fires?
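To make that hypothesis concrete, here is a toy model of the pattern I have in mind. It uses deliberately tiny constants and is not the real client code or its defaults: each send takes a permit, permits are only returned once a flush produces receipts, and if the count-based flush threshold is never reached before the permits run out, the next send blocks.

package main

import "fmt"

func main() {
	const maxPendingMessages = 3  // stands in for MaxPendingMessages
	const batchingMaxMessages = 5 // stands in for BatchingMaxMessages

	permits := make(chan struct{}, maxPendingMessages) // the publish semaphore
	var batch [][]byte

	flush := func() {
		fmt.Printf("flushing %d messages\n", len(batch))
		for range batch {
			<-permits // a receipt arrives, so a permit is released
		}
		batch = batch[:0]
	}

	for i := 0; i < 10; i++ {
		permits <- struct{}{} // deadlocks at i == 3: no flush has happened, so no permit was ever released
		batch = append(batch, []byte("msg"))
		if len(batch) >= batchingMaxMessages {
			flush() // never reached: the permits run out first
		}
	}
}

In the real producer the release would happen in ReceivedSendReceipt(), and the timer-based flush eventually unblocks things, which would match the stalls I see lasting until BatchingMaxPublishDelay expires.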


bschofield commented Jan 8, 2021

Currently, in batch_builder.go, batchContainer.Add() returns false if the current batch container has too many bytes, but not if the batch container has too many messages.

If we change that, i.e. edit

} else if bc.hasSpace(payload) {
	// The current batch is full. Producer has to call Flush() to
	return false
}

to

} else if bc.IsFull() || bc.hasSpace(payload) {
	// The current batch is full. Producer has to call Flush() to
	return false
}

...then the problem also seems to go away, without changing MaxPendingMessages.

It could be that increasing MaxPendingMessages is just a band-aid: it lets enough messages accumulate for the batch containers to hit the byte-size limit (and therefore flush), and the real bug is here.

However, I'm not confident enough that this is the correct fix to send a PR for it. Would appreciate @merlimat or @wolfstudy taking a look if you have time.


bschofield commented Jan 11, 2021

I worked around this bug by setting BatchingMaxMessages = math.MaxInt32, so that batching is driven only by the space limit and the max delay (which I set to 1 second).

This seems to give me throughput comparable to, or better than, the cgo client.
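Concretely, the options I ended up with look roughly like this (topic and compression as before; math needs to be imported):

pulsar.ProducerOptions{
  Topic:                   topic,
  CompressionType:         pulsar.ZLib,
  BatchingMaxMessages:     math.MaxInt32, // effectively disables count-based flushing
  BatchingMaxPublishDelay: 1 * time.Second,
}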

@bschofield

I believe this is probably fixed by #528, and so I am closing.
