
ARTEMIS-2336 Use zero copy to replicate journal/page/large message file (AGAIN) #2845

Closed

Conversation

franz1981 (Contributor) commented Sep 22, 2019

I've opened this PR for discussion: I would like to re-introduce ARTEMIS-2336, but I've allowed WildFly, or any user that doesn't want to (or can't) use zero copy, to keep using the existing Artemis code.

I've also opened netty/netty#9592 to "enhance" ChunkedNioFile in order to fix a bug in our implementation; in the meantime I've "shadowed" my Netty solution directly into AbsoluteChunkedNioFile so as not to rely on any specific Netty version.

For InVM connections, this PR could reuse the same optimization submitted in #2844 while reading the file (RandomAccessFile) in ReplicationSyncFileMessage.
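
To make the intent concrete, here is a minimal, hypothetical sketch using standard Netty types (illustrative names such as chooseFileTransfer; not the PR's actual code, which uses NonClosingDefaultFileRegion and AbsoluteChunkedNioFile): pick zero copy when no SslHandler is in the pipeline, and fall back to a chunked, ByteBuf-based transfer otherwise.

import io.netty.channel.Channel;
import io.netty.channel.DefaultFileRegion;
import io.netty.handler.ssl.SslHandler;
import io.netty.handler.stream.ChunkedNioFile;

import java.io.IOException;
import java.nio.channels.FileChannel;

final class FileTransferSelector {

   // Illustrative only: mirrors the idea behind the PR, not its exact code.
   static Object chooseFileTransfer(Channel channel, FileChannel file,
                                    long offset, int dataSize) throws IOException {
      final boolean zeroCopyAllowed = channel.pipeline().get(SslHandler.class) == null;
      if (zeroCopyAllowed) {
         // A FileRegion lets the transport use sendfile(): no copy into user space.
         return new DefaultFileRegion(file, offset, dataSize);
      }
      // With TLS (or when zero copy is disabled) the bytes must be read into
      // ByteBufs anyway, so stream them through ChunkedWriteHandler instead.
      return new ChunkedNioFile(file, offset, dataSize, 64 * 1024);
   }
}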

franz1981 (Contributor, Author) commented Sep 22, 2019

@ehsavoie @clebertsuconic FYI this version no longer fails the WildFly tests tracked at https://issues.apache.org/jira/browse/ARTEMIS-2496

@@ -151,6 +151,11 @@ public ChannelImpl(final CoreRemotingConnection connection,
}

this.interceptors = interceptors;
//zero copy transfer is initialized only for replication channels
franz1981 (Contributor, Author):

@clebertsuconic This one is ugly I know :)
We can find a better way together ;)

Contributor:

Yeah, this looks like a hack to me! When is sendFiles used without it being the replica channel?

franz1981 (Contributor, Author) Sep 22, 2019:

Right now it isn't, and I've added an assert to make the test suite fail when it's not properly configured, but we could always configure the Netty pipeline upfront (right after the SSL handler) to handle it without affecting normal usage, if that's clearer.

 private Object getFileObject(FileChannel fileChannel, long offset, int dataSize) {
-   if (channel.pipeline().get(SslHandler.class) == null) {
+   if (USE_FILE_REGION && channel.pipeline().get(SslHandler.class) == null) {
       return new NonClosingDefaultFileRegion(fileChannel, offset, dataSize);
franz1981 (Contributor, Author):

channel.pipeline().get(SslHandler.class) could be costly: @wy96f is there any chance we could evaluate this beforehand?

Contributor:

I didn't notice this. I'll run the load generator to verify. Great work!

franz1981 (Contributor, Author) Sep 23, 2019:

It is very subtle, because most of the cost comes from checking the volatile, and that prevents further loads/stores from being moved ahead of it... anyway, I'm more interested in seeing how -Dio.netty.file.region=false compares with master and with the zero copy version :P

Contributor:

Sure thing, I'm setting up the server.

franz1981 (Contributor, Author):

I'm still getting other failures on WildFly, though nothing related to the file replication itself (that seems fixed); nonetheless, they happen only on this PR :(

franz1981 (Contributor, Author) Sep 25, 2019:

I've continued the investigation and I see that we are failing here: https://github.com/netty/netty/blob/21b7e29ea7c211bb2b889bae3a0c6c5d9f60fb01/handler/src/main/java/io/netty/handler/stream/ChunkedWriteHandler.java#L320

And we no longer get notified on ChunkedWriteHandler::channelWritabilityChanged

It means to me that either:

  1. The receiver side has stopped reading data, i.e. effectively the channel is not writeable anymore, or
  2. The sender side (on Netty) has some bug while marking progress and/or propagating writability changes

franz1981 (Contributor, Author):

I've found the issue: it was on our side. I'm going to update this PR with the change ;)

Contributor:

> I've continued the investigation and I see that we are failing here: https://github.com/netty/netty/blob/21b7e29ea7c211bb2b889bae3a0c6c5d9f60fb01/handler/src/main/java/io/netty/handler/stream/ChunkedWriteHandler.java#L320
> And we no longer get notified on ChunkedWriteHandler::channelWritabilityChanged
> It means to me that either:
> 1. The receiver side has stopped reading data, i.e. effectively the channel is not writeable anymore, or
> 2. The sender side (on Netty) has some bug while marking progress and/or propagating writability changes

That line would not be called, as doFlush is running in the Netty executor and the channel is still not writable after writing the message.
It should not be (1), because netstat showed no backlog. I found that channelWritabilityChanged was never called during replication.

franz1981 (Contributor, Author):

We were missing this: 438c0fb#diff-917ee65648802d0c63faa660eb9f8debR68
to propagate the channel becoming writable again and resume sending.
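
For readers following along, the gist of such a fix is making sure the writability event is not swallowed by a handler in the pipeline; the sketch below is illustrative only (hypothetical handler name, not commit 438c0fb itself):

import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

class WritabilityPropagatingHandler extends ChannelInboundHandlerAdapter {

   @Override
   public void channelWritabilityChanged(ChannelHandlerContext ctx) throws Exception {
      // A handler that overrides this method but does not forward the event would
      // prevent anyone downstream from learning that the channel is writable again.
      // Application code could also signal a latch/condition here to wake up a
      // sender blocked on writability.
      ctx.fireChannelWritabilityChanged();
   }
}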

franz1981 (Contributor, Author):

I'm finishing running the tests, but I believe you can try running your load generator as well now (fingers crossed!)

franz1981 (Contributor, Author) commented Sep 22, 2019

@wy96f I remember you provided some numbers for ARTEMIS-2336 using one of your load generators: it would be nice to compare the results of this PR using -Dio.netty.file.region=true|false against master, to verify that -Dio.netty.file.region=false is not worse than master.

clebertsuconic (Contributor):

Let's wait for the release I'm doing on Monday before we start considering this.
We will need to test it within WildFly to make sure it won't cause an issue there.

franz1981 (Contributor, Author) commented Sep 22, 2019

@clebertsuconic yes, agreed, and I would like to run a soak test + wait for @wy96f's results as well +1
The WildFly tests seem pretty stable with this, but there are many of them still to be tried.

wy96f (Contributor) left a comment

I see FileRegion occupies 0 bytes: https://github.com/netty/netty/blob/ff7a9fa091a8bf2e10020f83fc4df1c44098bbbb/transport/src/main/java/io/netty/channel/DefaultMessageSizeEstimator.java#L45. So only non-file packets account for the pending write bytes in the channel, and flow control is not working very well in this case; will this have a negative impact?


if (packetsSent % flowControlSize == 0) {
flushReplicationStream(action);
raf = new RandomAccessFile(file.getJavaFile(), "r");
Contributor:

If we send a zero-size file, we need to close the raf ourselves, as ReplicationSyncFileMessage::release will not be called, correct?

franz1981 (Contributor, Author):

Good catch: let me take a look!

franz1981 (Contributor, Author) Sep 25, 2019:

I see that:

               if (syncFileMessage.getFileId() != -1 && syncFileMessage.getDataSize() > 0) {
                  replicatingChannel.send(syncFileMessage, syncFileMessage.getFileChannel(),
                                          syncFileMessage.getOffset(), syncFileMessage.getDataSize(),
                                          lastChunk ? (Channel.Callback) success -> syncFileMessage.release() : null);

So we never send a syncFileMessage with file size == 0, and when we do send it with

                  replicatingChannel.send(syncFileMessage);

we don't release it on completion. It looks like a leak, but it isn't, because we will likely have released it (or it is about to be released) upon completion of the previous call.
I don't think we ever send a 0-byte file over the wire AFAIK, but if it happened we wouldn't be releasing it correctly: we should add

                  replicatingChannel.send(syncFileMessage, lastChunk ? (Channel.Callback) success -> syncFileMessage.release() : null);

On the normal path (file::size > 0) that would cause syncFileMessage.release to be called twice ATM (we mark lastChunk == true for the last 2 sent packets), but RandomAccessFile::close should be idempotent, so... no harm, and it would make the zero-sized file case work correctly :) wdyt?

Contributor:

AFAIK a 0-byte file will be created in some cases:

  1. At the beginning of startReplication, a 0-byte page file is created. If the backup fails and connects to the live again, that 0-byte page file will be sent.
  2. When all pages are consumed, PageCursorProviderImpl::cleanupComplete is called, generating a 0-byte page file, which might be sent to the backup later.

That works, but we'd be adding a new send method only used here? How about not opening the raf, and doing new ReplicationSyncFileMessage(content, pageStore, id, null, null, offset, toSend) when the file size is 0, so we don't have to take care of releasing it?
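
A rough sketch of that suggestion (hypothetical helper name, not the PR's code): only open the file when there is actually something to send, so the zero-length case never creates a resource that needs releasing.

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;

final class SyncFileHelper {

   // Hypothetical helper: returns null for an empty file, so the caller can build
   // the ReplicationSyncFileMessage with a null FileChannel and never has to
   // worry about releasing/closing anything for 0-byte files.
   static FileChannel openForSyncIfNotEmpty(File file) throws IOException {
      if (file.length() == 0) {
         return null;
      }
      return new RandomAccessFile(file, "r").getChannel();
   }
}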

franz1981 (Contributor, Author):

> How about not opening the raf, and doing new ReplicationSyncFileMessage(content, pageStore, id, null, null, offset, toSend) when the file size is 0, so we don't have to take care of releasing it?

I like it: it is effectively wasted effort to open/close it for nothing...
I will handle the two things in two separate commits:

  1. address the 0-length transfers
  2. simplify flow control: given that it is concurrent, backpressure doesn't need to be super precise and shouldn't slow down the system for no reason

franz1981 (Contributor, Author) Sep 25, 2019:

Still searching for a reproducer to test 9853633

franz1981 (Contributor, Author) commented Sep 25, 2019

@wy96f

> So only non-file packets account for the pending write bytes in the channel, and flow control is not working very well in this case; will this have a negative impact?

I see that DefaultMessageSizeEstimator doesn't handle FileRegion, nor is ChannelOutboundBuffer::decrementPendingOutboundBytes called after sendFile succeeds: it means that our "lazy" backpressure is quite limited while using FileRegions.
For ChunkedNioFile it is a different story, but equally interesting: DefaultMessageSizeEstimator is not able to recognize ChunkedNioFile, but ChannelOutboundBuffer::decrementPendingOutboundBytes handles it correctly, because ChunkedWriteHandler transparently handles the files as ByteBufs.
I think it is time to simplify blockUntilWritable to make it lazier, i.e. to just rely on Netty's isWritable instead of trying to calculate the pending bytes exactly :)
I'm adding a commit to address that :)
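
A minimal sketch of what such a lazier blockUntilWritable could look like (illustrative names, not the actual Artemis code), assuming the writability event is propagated as discussed above:

import io.netty.channel.Channel;

import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

final class LazyWritabilityGate {

   private final ReentrantLock lock = new ReentrantLock();
   private final Condition writable = lock.newCondition();
   private final Channel channel;

   LazyWritabilityGate(Channel channel) {
      this.channel = channel;
   }

   // To be called from channelWritabilityChanged when the channel turns writable again.
   void onWritabilityChanged() {
      lock.lock();
      try {
         writable.signalAll();
      } finally {
         lock.unlock();
      }
   }

   // Instead of counting pending bytes ourselves, just trust Netty's water marks.
   boolean blockUntilWritable(long timeout, TimeUnit unit) throws InterruptedException {
      long remainingNanos = unit.toNanos(timeout);
      lock.lock();
      try {
         while (!channel.isWritable() && channel.isActive() && remainingNanos > 0) {
            remainingNanos = writable.awaitNanos(remainingNanos);
         }
         return channel.isWritable();
      } finally {
         lock.unlock();
      }
   }
}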

* <a href="http://en.wikipedia.org/wiki/Zero-copy">zero-copy file transfer</a>
* such as {@code sendfile()}, you might want to use {@link FileRegion} instead.
*/
class AbsoluteChunkedNioFile implements ChunkedInput<ByteBuf> {
Contributor:

I see your upstream fix to Netty was merged. As I understand it, this class is a shadow of that fix, as you said in the PR description. Since it's merged, and they have fairly frequent releases, let's wait for them to release and then remove this class before merging.

wy96f (Contributor) commented Sep 26, 2019

@franz1981 I've found a new problem with -Dio.netty.file.region=false. I generated 48GB of files with the load generator. With -Dio.netty.file.region=true and with master, the log (something like 2019-09-26 11:02:49,348 DEBUG [org.apache.activemq.artemis.core.replication.ReplicationManager] sending 1048576 bytes on file xxxx) showed it took about 7 minutes to transfer the files, after which the synchronization-done message was sent. However, with -Dio.netty.file.region=false, the log showed it took only about 40 seconds to "transfer" the files before the sync-done message was sent. In fact, flow control wasn't working, and those 40 seconds were spent building up PendingWrite entries in the queue rather than transferring files. This leads to sync-done message timeouts.

franz1981 (Contributor, Author) commented Sep 26, 2019

@wy96f
It's strange that the back-pressure (writability) propagation fix isn't working: did you use the latest version, which includes that fix?
Flow control on Netty for chunked NIO files should increase/decrease the outbound pending bytes just like "normal" ByteBuf writes... can you check whether the writability changes are correctly propagated to ChunkedWriteHandler?
I'm on vacation from today, so I won't have access to my computer for about a month: I'll do my best to help when I return, but we are close to fixing this :)

franz1981 (Contributor, Author):

Maybe in ActiveMQChannelHandler there are other events that are not correctly propagated through the pipeline and that would wake up the chunk writer?

wy96f (Contributor) commented Sep 26, 2019

@franz1981 Have fun on vacation :)

There is no problem with writability propagation; it works very well. When I set initial-replication-sync-timeout to a big value (e.g. 7 minutes), all of the queued-up messages were sent and replication succeeded.

The packet sending process with -Dio.netty.file.region=true or master is:

  1. channel.writeAndFlush (Artemis thread)
  2. add the ByteBuf to the outbound buffer -- increase the size; flush it -- decrease the size (Netty thread)

The message sending process with -Dio.netty.file.region=false is:

  1. channel.writeAndFlush (Artemis thread)
  2. add the message to the queue in ChunkedWriteHandler (Netty thread)
  3. if the channel is writable, add the ByteBuf to the outbound buffer -- increase the size and flush it -- decrease the size (Netty thread)
  4. if the channel transitions from unwritable to writable, run step 3 again (Netty thread)

With -Dio.netty.file.region=false, a message is first put into the queue and only moved to the outbound buffer when the channel is writable, so the size in the outbound buffer is limited to the high water mark (default 128k). When the flush proceeds and the size drops to the low water mark (default 32k), the channel becomes writable again, over and over. I guess flow control often sees the channel as writable (while there are actually lots of messages queued up in ChunkedWriteHandler's queue), so it isn't limiting well. In the end, the sync-done message is not delivered in time because too many messages are queued up.

With -Dio.netty.file.region=true or master, the size in the outbound buffer keeps growing (the Netty thread is running all the time). When it exceeds 128k, the channel is not writable and flow control triggers, so the broker (Artemis thread) will not send packets until the data in the outbound buffer has been flushed. So there won't be much data queued up. Make sense?
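
For reference, the writability threshold described above comes from Netty's write-buffer water marks; here is a hedged example of raising them (the values are illustrative only, not a recommendation from this thread):

import io.netty.bootstrap.Bootstrap;
import io.netty.channel.ChannelOption;
import io.netty.channel.WriteBufferWaterMark;

final class WaterMarkExample {

   static void raiseWaterMarks(Bootstrap bootstrap) {
      // The channel reports "not writable" once the pending outbound bytes exceed
      // the high mark, and "writable" again once they drop below the low mark
      // (the thread above quotes defaults of 32k low / 128k high).
      bootstrap.option(ChannelOption.WRITE_BUFFER_WATER_MARK,
                       new WriteBufferWaterMark(1024 * 1024, 2 * 1024 * 1024));
   }
}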

franz1981 (Contributor, Author) commented Sep 26, 2019

@wy96f
Good analysis :)
Yep, so the simplest possible solution I see is to tune highWaterMark differently (> 1 MB) to allow the chunk writer queue to keep being drained.

Both ChunkedInput and FileRegion are missing (on Netty) a correct size estimation in ChannelOutboundBuffer, and this lets senders push many of them in bursts: what makes them behave differently is that FileRegion is back-pressured only by TCP, while ChunkedInputs start getting back-pressured by Netty itself in ChunkedWriteHandler, given that every ByteBuf read from the file is accounted in ChannelOutboundBuffer, thus preventing subsequent pending writes from proceeding because of the small high-water-mark limit (compared to the chunk size).

wy96f (Contributor) commented Sep 27, 2019

@franz1981 Hi, I ran tests with writeBufferHighWaterMark=2MB, 10MB, 100MB, 200MB, and replication still failed (shocked).
After some analysis, I think the results might be reasonable. Whatever writeBufferHighWaterMark value we tune to, the total time with the channel writable is similar (with a big value it takes long to saturate the channel; with a small value each round is shorter, but it takes more rounds to saturate it). Considering that the Netty thread has to read the chunked file to add the ByteBuf to the outbound buffer (whose size is then accounted), while the Artemis thread just puts the packets into the Netty executor (which is much faster), packets will inevitably build up in the chunk writer queue.

franz1981 (Contributor, Author) commented Sep 27, 2019

@wy96f thanks for trying! It sounds strange to me: I was thinking the reason it takes longer was the transfer being continuously stopped/awakened and sending short chunks to the network (i.e. more syscalls with less data). I have a strong feeling that sendfile sends a 1 MB chunk directly without using the TCP buffer at all... if that's the case, it means we should increase the chunk size (1 MB, or at least 100K) and the TCP buffer accordingly (which is very small by default, AFAIK).
Did you observe the network being saturated in both cases?
If you use async-profiler you can check what the kernel does and where most of the cost is in both cases...

wy96f (Contributor) commented Sep 27, 2019

@franz1981

2019-09-27 13:47:01,943 DEBUG [org.apache.activemq.artemis.core.replication.ReplicationManager] sending 1048576 bytes on file 000002541.page
2019-09-27 13:47:01,943 DEBUG [org.apache.activemq.artemis.core.replication.ReplicationManager] sending 1048576 bytes on file 000002541.page
2019-09-27 13:47:01,943 DEBUG [org.apache.activemq.artemis.core.replication.ReplicationManager] sending 1048496 bytes on file 000002541.page
2019-09-27 13:47:01,945 DEBUG [org.apache.activemq.artemis.core.replication.ReplicationManager] sending 0 bytes on file 000002541.page



[artemis@windqpstdb05 bin]$ sar -n DEV 1
Linux 2.6.32-279.19.1.el6_sn.12.x86_64 (windqpstdb05) 	09/27/2019 	_x86_64_	(8 CPU)

01:47:37 PM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s
01:47:38 PM        lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00
01:47:38 PM      eth0   4548.00   2576.00    294.39 108552.23      0.00      0.00      0.00

01:47:38 PM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s
01:47:39 PM        lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00
01:47:39 PM      eth0   4520.00   2528.00    292.59 108009.43      0.00      0.00      0.00

01:47:39 PM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s
01:47:40 PM        lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00
01:47:40 PM      eth0   4497.00   2588.00    291.05 107394.71      0.00      0.00      0.00

01:47:40 PM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s
01:47:41 PM        lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00
01:47:41 PM      eth0   4483.00   2561.00    290.18 106670.38      0.00      0.00      0.00

01:47:41 PM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s
01:47:42 PM        lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00
01:47:42 PM      eth0   4494.00   2584.00    290.85 107486.57      0.00      0.00      0.00

Yes, I saw the network was saturated (initial-replication-sync-timeout was set to 300000, otherwise replication failed due to the timeout).
Note that the log shows the last page being sent at 13:47:01,945, while sar shows queued-up data still transferring and saturating the network at 01:47:37 PM.

BTW, I used this in broker.xml:
<acceptor name="artemis">tcp://10.244.201.200:61616?;tcpSendBufferSize=1048576;tcpReceiveBufferSize=1048576;protocols=CORE,AMQP,STOMP,HORNETQ,MQTT,OPENWIRE;useEpoll=true;amqpCredits=1000;amqpLowCredits=300</acceptor>

franz1981 (Contributor, Author) commented Sep 27, 2019

If both cases (with/without file region) saturate the network, why would the latter take more time? The total amount of data sent should be the same...
Have you tried master as well?
I'm starting to think that the pipelining, where data is read in a non-Netty thread while a separate Netty thread sends it across the network, is beneficial to the overall throughput, because we can do something (i.e. read the file) while Netty takes care of sending data. But that would mean that without file regions I expect the network not to be saturated 100% of the time: sometimes it waits for the ChunkedInput to finish reading data before saturating it again. sar shows averages over sampling intervals, so I believe we can't spot those dips...
I don't know whether your acceptor configuration of the TCP buffers is taking effect; we'd need to inspect the logs...
And profiling could be helpful as well...

wy96f (Contributor) commented Sep 27, 2019

@franz1981

> If both cases (with/without file region) saturate the network, why would the latter take more time? The total amount of data sent should be the same...

They take the same time (~7 mins). As I said:

> With -Dio.netty.file.region=true and with master, the log (something like 2019-09-26 11:02:49,348 DEBUG [org.apache.activemq.artemis.core.replication.ReplicationManager] sending 1048576 bytes on file xxxx) showed it took about 7 minutes to transfer the files, after which the synchronization-done message was sent. However, with -Dio.netty.file.region=false, the log showed it took only about 40 seconds to "transfer" the files before the sync-done message was sent.
> Yes, I saw the network was saturated (initial-replication-sync-timeout was set to 300000, otherwise replication failed due to the timeout). Note that the log shows the last page being sent at 13:47:01,945, while sar shows queued-up data still transferring and saturating the network at 01:47:37 PM.

The saturation lasted for ~7 mins.

> Have you tried master as well?

The same.

wy96f (Contributor) commented Sep 27, 2019

@franz1981
profiling without file region:
https://filebin.net/r9o4bupoym9zxwk9/netty_false.svg?t=ld5f197t
Note I profiled after 40s (after the log showed the last page sent), so most of the samples are about Netty.

profiling with file region:
https://filebin.net/r9o4bupoym9zxwk9/netty_true.svg?t=mo3tho33

profiling master:
https://filebin.net/r9o4bupoym9zxwk9/profiler_master.svg?t=mo3tho33

franz1981 (Contributor, Author):

Sorry, I had missed that the file region=true version was also failing on the timeout: tbh I'm not sure it makes sense to have such a small timeout in that case, given that it depends on disk speed, network bandwidth and the total transferred size...
Just one piece of advice: try collecting samples with -t. I wasn't expecting blockUntilWritable to be that costly...

wy96f (Contributor) commented Sep 27, 2019

> tbh I'm not sure it makes sense to have such a small timeout in that case, given that it depends on disk speed, network bandwidth and the total transferred size...

Do you mean initial-replication-sync-timeout? If so, it's OK with the default (30000) on master and with file region=true, but not with file region=false, so I set it to 3000000 for profiling.

> Just one piece of advice: try collecting samples with -t. I wasn't expecting blockUntilWritable to be that costly...

./profiler.sh -d 60 -t -f xx.svg pid, like this, profiling master and file region=false/true? For file region=false, should I collect samples from the beginning or after 40s (after the log showed the last page sent)?

franz1981 (Contributor, Author):

To make the profiling traces comparable in number of samples, you should limit the duration and make them equal, e.g. -d 10 (seconds).
It's OK to do it after 40 seconds, depending on which part of the catch-up we are interested in...
Re the timeout, I need to re-read your previous answers: you said that both approaches (file region/chunked files) max out the network and take the same time, but only the latter fails on the replication timeout?

wy96f (Contributor) commented Sep 27, 2019

> Re the timeout, I need to re-read your previous answers: you said that both approaches (file region/chunked files) max out the network and take the same time, but only the latter fails on the replication timeout?

Yes. With file region, the sync-done msg was sent and received by the backup 7 mins later. With chunked file, the sync-done msg was sent after 40 seconds but received by the backup 7 mins later, because of the packets queued up on the live server side.

Samples collected with -t: see https://filebin.net/1lyef9lv6rkew9ks

franz1981 (Contributor, Author):

@wy96f probably the Netty size estimator and/or the pending-writes accounting is not working as I expected with file regions...

franz1981 (Contributor, Author):

@wy96f I'll look into this more closely when I'm back, but please take a look yourself if you have any ideas.
From what I've understood we can:

  • fix https://github.com/netty/netty/blob/ff7a9fa091a8bf2e10020f83fc4df1c44098bbbb/transport/src/main/java/io/netty/channel/DefaultMessageSizeEstimator.java#L45 in order to account for chunked files correctly and flow control them while being enqueued into the chunk writer

  • change how the sync-done timeout is being calculated and consider the total time to replicate (from the first file region/chunk sent) instead of the time from when the sync done is enqueued in Netty

wdyt?

wy96f (Contributor) commented Oct 14, 2019

@franz1981

> fix https://github.com/netty/netty/blob/ff7a9fa091a8bf2e10020f83fc4df1c44098bbbb/transport/src/main/java/io/netty/channel/DefaultMessageSizeEstimator.java#L45 in order to account for chunked files correctly and flow control them while being enqueued into the chunk writer

Not sure. I think Netty is just limiting the data held in memory, hence not accounting for chunked files and file regions?

> change how the sync-done timeout is being calculated and consider the total time to replicate (from the first file region/chunk sent) instead of the time from when the sync done is enqueued in Netty

We block all journal-related operations during the sync-done interval to make sure the replica has received all data, so we'd better not change it.

Maybe we can:

  • use the original implementation when the broker doesn't support zero copy. IMO the chunked-file way is more or less the same as the original one. In some situations the original version might even be better, because it reads file data in a non-Netty thread, which can be beneficial when the Netty threads are busy taking care of newly sent packets.

  • as another possible alternative, use a rate limiter like Guava's, instead of Netty's flow control, to work around the problem?

franz1981 (Contributor, Author):

@wy96f sadly the rate limiter algorithm in Guava is not very good for fast-paced networks and small token sizes, despite what the docs say about it :)

Netty considers a FileRegion sent to the outbound buffer as 0-sized AFAIK, and the chunked-file one as just "unknown"; the point is that the latter accumulates and won't back-pressure the channel, because chunks are sent through a background accumulator (the additional outbound handler in the pipeline), i.e. the solution of providing a custom estimator makes sense IMHO, but I need my computer at hand to try it.

My point on the existing sync-done is that it is not actually correct, because it assumes the previous files have been sent before the sync-done is sent, which is not the case anyway, because Netty is asynchronous... the chunked-file case just makes it more evident.
My 2 cents :)
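
A hedged sketch of what such a custom estimator could look like (FileAwareSizeEstimator is a hypothetical name; whether charging the full file size to the outbound buffer is the right trade-off is exactly what is being debated here):

import io.netty.channel.DefaultMessageSizeEstimator;
import io.netty.channel.FileRegion;
import io.netty.channel.MessageSizeEstimator;
import io.netty.handler.stream.ChunkedInput;

// Hypothetical estimator, sketched for discussion: it makes file payloads visible
// to the outbound buffer accounting so the water marks "see" them.
public final class FileAwareSizeEstimator implements MessageSizeEstimator {

   private static final MessageSizeEstimator.Handle FALLBACK =
      DefaultMessageSizeEstimator.DEFAULT.newHandle();

   public static final FileAwareSizeEstimator INSTANCE = new FileAwareSizeEstimator();

   private final MessageSizeEstimator.Handle handle = msg -> {
      final long size;
      if (msg instanceof FileRegion) {
         size = ((FileRegion) msg).count();
      } else if (msg instanceof ChunkedInput) {
         size = ((ChunkedInput<?>) msg).length(); // -1 when the length is unknown
      } else {
         return FALLBACK.size(msg);
      }
      if (size < 0) {
         return 0;
      }
      return size > Integer.MAX_VALUE ? Integer.MAX_VALUE : (int) size;
   };

   @Override
   public MessageSizeEstimator.Handle newHandle() {
      return handle;
   }
}
// It would be installed per bootstrap with something like:
// bootstrap.option(ChannelOption.MESSAGE_SIZE_ESTIMATOR, FileAwareSizeEstimator.INSTANCE);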

wy96f (Contributor) commented Oct 14, 2019

@franz1981 OK, I see your point of view. Maybe we can try a custom estimator :)

franz1981 (Contributor, Author) commented Oct 28, 2019

@wy96f
One quick question: in #2845 (comment) I see that you said:

> with -Dio.netty.file.region=true or master

But actually master and -Dio.netty.file.region=true are very different!
master doesn't have the zero-copy feature enabled and its ByteBufs are correctly estimated, while -Dio.netty.file.region=true on this PR uses custom FileRegions that don't seem to be correctly estimated, according to https://github.com/netty/netty/blob/ff7a9fa091a8bf2e10020f83fc4df1c44098bbbb/transport/src/main/java/io/netty/channel/DefaultMessageSizeEstimator.java#L45.
In theory -Dio.netty.file.region=true should time out due to the wrong estimation (like -Dio.netty.file.region=false): have you tested -Dio.netty.file.region=true on this PR with your long replication backlog test?

wy96f (Contributor) commented Oct 29, 2019

I was dubious about it too, but after some analysis I somewhat understood it.
In fact, the size estimator is limiting non-file packets in the file region=true case.
The log showed flow control being triggered roughly every 450 sends. This makes sense, as each non-file packet occupies ~300 bytes, and their total size just exceeds the high water mark (default 128k). Flow control then waited ~4 seconds for 450 non-file and 450 file packets to be drained (~450 * 1MB chunk size = ~450MB transferred in ~4 seconds, i.e. ~100MB/s).
I uploaded a file containing a small section of artemis.log, see https://filebin.net/emt7r3h7yg30zq8e/artemis_section.log?t=r9f67qpl

franz1981 (Contributor, Author) commented Oct 29, 2019

OK, now it is clearer, thanks!
So ByteBufs start to accumulate because of the time FileRegions take to send.
I will take another look at the file region=false case; accounting for the sizes correctly with a custom msg estimator could probably make the back-pressure trigger earlier, but I'm not sure...

In the worst case I could just use something similar to what we were using before, maybe cleaning it up a little more given that we are touching these bits.

franz1981 (Contributor, Author):

@wy96f I've tried to understand what's happening with ChunkedNioFile, and it probably needs to be fixed on the Netty side (see netty/netty#3413 (comment)); let's see.

franz1981 (Contributor, Author) commented Nov 4, 2019

@wy96f I've added an additional commit to control the backlog of unflushed Netty writes, to improve the precision of the initial replication timeout performed right after "sending" all the replicated files: given how Netty works (the previous comment has the relevant info), this is probably the safer way to fail fast on a slow network or a dead connection during replication. Let me know whether your test becomes too slow because of this (with both zero copy and without) and whether it now behaves correctly (without changing the default timeout on initial replication).
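
A rough sketch of the idea behind that commit (hypothetical names, not the actual change in the replication sender): flush periodically and let the sender wait for the channel to drain before handing Netty more file data, so the sync-done packet cannot sit behind an unbounded backlog.

import io.netty.channel.Channel;

// Illustrative only: cap the bytes handed to Netty between flushes.
final class BoundedWriteBacklog {

   private final Channel channel;
   private final long maxUnflushedBytes;
   private long unflushedBytes;

   BoundedWriteBacklog(Channel channel, long maxUnflushedBytes) {
      this.channel = channel;
      this.maxUnflushedBytes = maxUnflushedBytes;
   }

   void write(Object chunk, long chunkBytes) {
      channel.write(chunk, channel.voidPromise());
      unflushedBytes += chunkBytes;
      if (unflushedBytes >= maxUnflushedBytes) {
         channel.flush();
         unflushedBytes = 0;
         // At this point the real code would block (with a timeout) until the
         // channel reports isWritable() again, as sketched earlier in the thread.
      }
   }
}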

wy96f (Contributor) commented Nov 6, 2019

@franz1981 I tested with my load generator. The PR worked fine and showed no sign of slowing down :)

franz1981 (Contributor, Author):

@clebertsuconic I believe this is in good shape for a review, according to the test results. :)

michaelandrepearce (Contributor):

@franz1981 I think you are good to merge this; it's been open a long time, and it seems no one has any opposition.

franz1981 (Contributor, Author):

@michaelandrepearce Please wait a bit longer: I have an open issue on Netty related to ChunkedWriteHandler that, once merged, could simplify this one. Sadly I haven't had much time to work on the Netty side of it, and that has blocked this one as a consequence :(

michaelandrepearce (Contributor):

@franz1981 What's happening with this? I want to start clearing down old/stagnant PRs, a bit of spring cleaning so to say, so we can focus on things that are actively being worked on / relevant.

franz1981 (Contributor, Author):

Closing this because it has been stale for a long time and the community hasn't reacted to or asked for it again.

franz1981 closed this Sep 7, 2021