ARTEMIS-2336 Use zero copy to replicate journal/page/large message file (AGAIN) #2845
Conversation
@ehsavoie @clebertsuconic FYI this version no longer fails the WildFly tests on https://issues.apache.org/jira/browse/ARTEMIS-2496
@@ -151,6 +151,11 @@ public ChannelImpl(final CoreRemotingConnection connection,
   }

   this.interceptors = interceptors;
   //zero copy transfer is initialized only for replication channels
@clebertsuconic This one is ugly I know :)
We can find a better way together ;)
Yeah... this looks like a hack to me! When is sendFiles used without it being the replication channel?
Right now it isn't, and I've added an assert to make the test suite fail when it is not properly configured, but we could always configure the Netty pipeline upfront (right after the SSL handler) to handle it without affecting normal usage, if that's clearer.
...g/apache/activemq/artemis/core/protocol/core/impl/wireformat/ReplicationSyncFileMessage.java
-      if (channel.pipeline().get(SslHandler.class) == null) {
+   private Object getFileObject(FileChannel fileChannel, long offset, int dataSize) {
+      if (USE_FILE_REGION && channel.pipeline().get(SslHandler.class) == null) {
          return new NonClosingDefaultFileRegion(fileChannel, offset, dataSize);
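For context, the full decision this diff introduces might look like the sketch below. USE_FILE_REGION and NonClosingDefaultFileRegion come from this PR, while the chunked fallback is shown with Netty's stock ChunkedNioFile (the PR actually uses its own AbsoluteChunkedNioFile); the class name and chunk size are illustrative, not the exact PR code.

```java
import io.netty.channel.Channel;
import io.netty.handler.ssl.SslHandler;
import io.netty.handler.stream.ChunkedNioFile;

import java.nio.channels.FileChannel;

final class FilePayloadSelector {

   // JVM system property letting users opt out of zero-copy transfers
   private static final boolean USE_FILE_REGION =
      Boolean.parseBoolean(System.getProperty("io.netty.file.region", "true"));

   private final Channel channel;

   FilePayloadSelector(Channel channel) {
      this.channel = channel;
   }

   Object getFileObject(FileChannel fileChannel, long offset, int dataSize) throws Exception {
      if (USE_FILE_REGION && channel.pipeline().get(SslHandler.class) == null) {
         // zero copy: the kernel can sendfile() the bytes straight from the page cache
         return new NonClosingDefaultFileRegion(fileChannel, offset, dataSize);
      }
      // SSL (or io.netty.file.region=false) needs the bytes in user space,
      // so stream the file as ordinary buffer chunks instead
      return new ChunkedNioFile(fileChannel, offset, dataSize, 8 * 1024);
   }
}
```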
channel.pipeline().get(SslHandler.class)
could be costly: @wy96f do we have any chance to evaluate this beforehand?
I didn't notice this. I'll run the load generator to verify. Great work!
It is very subtle, because most of the cost comes from checking the volatile, which prevents further loads/stores from being moved upfront... anyway, I'm more interested in seeing how -Dio.netty.file.region=false compares with master and the zero-copy version :P
Sure thing, I'm setting up the server.
I'm still getting other failures on WildFly, but nothing related to the file replication itself (that seems fixed); nonetheless it's happening only on this PR :(
I've proceeded with the investigation and I see that we are failing here: https://github.com/netty/netty/blob/21b7e29ea7c211bb2b889bae3a0c6c5d9f60fb01/handler/src/main/java/io/netty/handler/stream/ChunkedWriteHandler.java#L320
And we won't get notified anymore on ChunkedWriteHandler::channelWritabilityChanged.
It means to me that either:
1. the receiver side has stopped reading data, i.e. effectively the channel is not writable anymore, or
2. the sender side (on Netty) has some bug while marking progress and/or propagating writability changes.
I've found the issue: it was on our side. I'm going to update this PR with the change ;)
> I've proceeded with the investigation and I see that we are failing here: https://github.com/netty/netty/blob/21b7e29ea7c211bb2b889bae3a0c6c5d9f60fb01/handler/src/main/java/io/netty/handler/stream/ChunkedWriteHandler.java#L320
> And we won't get notified anymore on ChunkedWriteHandler::channelWritabilityChanged.
> It means to me that either:
> 1. the receiver side has stopped reading data, i.e. effectively the channel is not writable anymore, or
> 2. the sender side (on Netty) has some bug while marking progress and/or propagating writability changes.
https://github.com/netty/netty/blob/21b7e29ea7c211bb2b889bae3a0c6c5d9f60fb01/handler/src/main/java/io/netty/handler/stream/ChunkedWriteHandler.java#L320 would not be called, as doFlush is running in the Netty executor and the channel is still not writable after writing the message. It should not be (1), because netstat showed no backlog. I found channelWritabilityChanged was never called during replication.
We were missing this 438c0fb#diff-917ee65648802d0c63faa660eb9f8debR68 to propagate a writable channel and resume sending.
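A minimal sketch of the kind of propagation being described (this is a guess at the shape of the fix, not the actual commit; the handler name and the resumeSending() hook are hypothetical):

```java
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

// Any handler that intercepts writability changes must keep firing the event down
// the pipeline, otherwise ChunkedWriteHandler never wakes up to resume flushing the
// chunks it has queued once the socket drains.
class WritabilityPropagatingHandler extends ChannelInboundHandlerAdapter {

   @Override
   public void channelWritabilityChanged(ChannelHandlerContext ctx) throws Exception {
      if (ctx.channel().isWritable()) {
         resumeSending();   // hypothetical application-level "resume" hook
      }
      ctx.fireChannelWritabilityChanged();   // the propagation that must not be dropped
   }

   private void resumeSending() {
      // placeholder: wake up whatever producer was paused on backpressure
   }
}
```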
I'm finishing running the tests, but I believe you can try to run your load generator as well now (fingers crossed!)
@wy96f I remember you provided some numbers for ARTEMIS-2336 using one of your load generators: it would be nice to compare those results with this PR.
Let's wait for the release I'm doing on Monday before we start considering this?
@clebertsuconic yes, agreed, and I would like to run a soak test + wait for @wy96f's results as well +1
Force-pushed from e031d0f to 438c0fb
I see FileRegion occupies 0 bytes: https://github.com/netty/netty/blob/ff7a9fa091a8bf2e10020f83fc4df1c44098bbbb/transport/src/main/java/io/netty/channel/DefaultMessageSizeEstimator.java#L45. So only non-file packets account for the pending write bytes in the channel, and flowControl is not working very well in this case; will this have a negative impact?
if (packetsSent % flowControlSize == 0) {
   flushReplicationStream(action);
raf = new RandomAccessFile(file.getJavaFile(), "r");
If we send a 0 size file, we need to close raf as ReplicationSyncFileMessage::release will not be called, correct?
Good catch: let me take a look!
I see that:
if (syncFileMessage.getFileId() != -1 && syncFileMessage.getDataSize() > 0) {
   replicatingChannel.send(syncFileMessage, syncFileMessage.getFileChannel(),
                           syncFileMessage.getOffset(), syncFileMessage.getDataSize(),
                           lastChunk ? (Channel.Callback) success -> syncFileMessage.release() : null);
So we never send a syncFileMessage with file size == 0, and when we do send it with
replicatingChannel.send(syncFileMessage);
we don't release it when completed. It seems like a leak, but it isn't, because we have likely passed the callback on the previous iteration, so it should already have been released (or will be released upon completion of that previous call).
I don't think we ever send 0-byte files on the wire AFAIK, but if it were to happen we're not releasing it correctly: we should add
replicatingChannel.send(syncFileMessage, lastChunk ? (Channel.Callback) success -> syncFileMessage.release() : null);
On the normal path (file::size > 0) this would cause syncFileMessage.release to be called twice ATM (we mark lastChunk == true for the last 2 sent packets), but RandomAccessFile::close should be idempotent, so... no harm, and it would make zero-sized files work correctly :) wdyt?
AFAIK 0-byte files will be created in some cases:
- At the beginning of startReplication, a 0-byte page file will be created. If the backup fails and connects to the live again, the 0-byte page file will be sent.
- When all pages are consumed, PageCursorProviderImpl::cleanupComplete will be called, generating a 0-byte page file, which might be sent to the backup later.

That works, but we'd be adding a new send method only used here? How about not opening raf and using new ReplicationSyncFileMessage(content, pageStore, id, null, null, offset, toSend) when the file size is 0, so we don't have to take care of releasing it?
> How about not opening raf and new ReplicationSyncFileMessage(content, pageStore, id, null, null, offset, toSend) when file size is 0 so we don't take care to release it?

I like it; effectively it's wasted effort to open/close it for nothing...
I will manage the 2 things in 2 separate commits:
- one to address the 0-length transfers (sketched below)
- one to simplify flow control: given that backpressure is concurrent it doesn't need to be super-precise, slowing down the system for no reason
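A minimal sketch of that first commit, assuming the constructor signature quoted earlier in the thread (file, content, pageStore, id, offset and toSend are taken from those snippets and assumed to be in scope):

```java
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;

// Only open the RandomAccessFile when there is actually data to transfer;
// for a 0-byte file the message carries null raf/channel, so there is nothing
// left to release and no special send path is needed.
RandomAccessFile raf = null;
FileChannel fileChannel = null;
if (toSend > 0) {
   raf = new RandomAccessFile(file.getJavaFile(), "r");
   fileChannel = raf.getChannel();
}
ReplicationSyncFileMessage syncFileMessage =
   new ReplicationSyncFileMessage(content, pageStore, id, raf, fileChannel, offset, toSend);
```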
Still searching for a reproducer to test 9853633
I see that
 * <a href="http://en.wikipedia.org/wiki/Zero-copy">zero-copy file transfer</a>
 * such as {@code sendfile()}, you might want to use {@link FileRegion} instead.
 */
class AbsoluteChunkedNioFile implements ChunkedInput<ByteBuf> {
I see your upstream fix to Netty was merged. As I understand it, this is a shadow of that, as you put in the PR comment. As that's merged, and they have a fairly frequent release cadence, let's wait for them to release and then remove this class before merging.
@franz1981 I found a new problem with -Dio.netty.file.region=false. I generated 48GB of files with the load generator. In the case of -Dio.netty.file.region=true and master, the log (something like this …
@wy96f Maybe on ActiveMQChannelHandler there are other events that are not correctly propagated through the pipeline and that would wake up the chunk writer?
@franz1981 Have fun on vacation :) There is no problem with writability propagation; it works very well. When I set initial-replication-sync-timeout to a big value (e.g. 7 minutes), all of the queued-up messages were sent and replication succeeded. The packet sending process with -Dio.netty.file.region=true or master is:
The message sending process with -Dio.netty.file.region=false is:
For -Dio.netty.file.region=false, a given message will first be put into the queue and then put in … For -Dio.netty.file.region=true or master, the size in …
@wy96f Both ChunkedInput and FileRegion are missing (on Netty) a correct size estimation on ChannelOutboundBuffer, and this implies senders can push many of them in bursts: what makes them behave differently is that FileRegion is backpressured only by TCP, while ChunkedInputs start getting backpressured by Netty itself in ChunkedWriteHandler, given that any read ByteBuf is accounted into ChannelOutboundBuffer, thus preventing subsequent pending writes from proceeding due to the small high watermark limit (compared to the chunk size).
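For reference, this is how the write-buffer water marks tested in the next comment are configured (values illustrative): once the pending bytes that Netty accounts for exceed the high mark, Channel.isWritable() flips to false until they drain below the low mark, but anything estimated as 0 bytes (such as a FileRegion) never counts toward it.

```java
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.ChannelOption;
import io.netty.channel.WriteBufferWaterMark;

final class ReplicationBootstrapTuning {
   // low water mark 512 KiB, high water mark 2 MiB (illustrative values)
   static void applyWaterMarks(Bootstrap bootstrap) {
      bootstrap.option(ChannelOption.WRITE_BUFFER_WATER_MARK,
                       new WriteBufferWaterMark(512 * 1024, 2 * 1024 * 1024));
   }
}
```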
@franz1981 Hi, I made tests with writeBufferHighWaterMark = 2MB, 10MB, 100MB, 200MB; the replication still failed (shocked).
@wy96f thanks for trying! It sounds strange to me: I was thinking the reason it was taking longer was that it was continuously being stopped/awakened and sending short chunks to the network (i.e. more syscalls with less data). I have a strong feeling that sendfile sends a 1 MB chunk directly without using the TCP buffer at all... if that's the case, it means we should increase the chunkSize (1 MB or at least 100K) and the TCP buffer accordingly (which is very small by default, AFAIK).
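A sketch of the socket-buffer side of that tuning (values and class name illustrative; the chunk size itself would be raised where the chunked input is created):

```java
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.ChannelOption;

final class SocketBufferTuning {
   // a 1 MiB kernel send buffer on the live (sender) side; the backup's acceptor
   // would raise SO_RCVBUF similarly
   static void applySendBuffer(Bootstrap bootstrap) {
      bootstrap.option(ChannelOption.SO_SNDBUF, 1024 * 1024);
   }
}
```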
Yes, I saw the network was saturated (initial-replication-sync-timeout was set to 300000, otherwise replication failed due to timeout). BTW,
If both cases (with/without file region) are saturating the network, why does the latter take more time? The total amount of data sent should be the same...
They're taking the same time (~7 mins). As I said,
The same.
@franz1981 profiling with file region: … profiling master: …
Sorry, I had missed that the file region=true version was failing on timeout as well: tbh I'm not sure it makes sense to have such a small timeout in that case, given that it depends on disk speed/network bandwidth and total transferred size...
Do you mean initial-replication-sync-timeout? If so, it's OK with the default (30000) in master and with file region=true, but not with file region=false, so I set it to 3000000 for profiling.
To make the profiling traces comparable in number of samples you would need to limit the duration and make them equal, e.g. -d 10 (seconds).
Yes. With file region, the sync done msg was sent and received by the backup 7 mins later. With chunked file, the sync done msg was sent 40 seconds later and received by the backup 7 mins later because of packets queued up on the live server side. Collecting samples with -t; see https://filebin.net/1lyef9lv6rkew9ks
@wy96f probably the size estimator of Netty and/or the pending writes count is not working as I was expecting with file regions...
@wy96f I'll take a better look into this when I'm back, but please take a look yourself if you have any ideas. wdyt?
Not sure. I think Netty is just limiting data in memory, so it is not accounting for chunked file and file region? We block all journal-related operations during the sync done interval to make sure the replica has received all data, so we'd better not change it. Maybe we can:
@wy96f sadly the rate limiter algorithm in Guava is not very good for fast-paced networks and small token sizes, despite what the doc says about it :) Netty considers a FileRegion sent to the outbound buffer as 0-sized AFAIK, and the one based on chunked files as just "unknown", but the point is that the latter will be accumulated and won't backpressure the channel, because they are sent through a background accumulator (the additional outbound handler in the pipeline), i.e. the solution to provide a custom estimator makes sense IMHO, but I need my computer at hand to try it. My point on the existing sync done is that it is not actually correct, because it assumes the previous files have been sent before the sync done goes out, while that's not the case anyway, because Netty is asynchronous... the chunked file case just makes it more evident.
@franz1981 OK, I see your point of view. Maybe we can try with a custom estimator :)
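A minimal sketch of such an estimator, assuming we want FileRegion bytes to count toward the channel's pending-write accounting (the class name and fallback size are illustrative); it would be installed with bootstrap.option(ChannelOption.MESSAGE_SIZE_ESTIMATOR, FileRegionAwareSizeEstimator.INSTANCE):

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.ByteBufHolder;
import io.netty.channel.FileRegion;
import io.netty.channel.MessageSizeEstimator;

public final class FileRegionAwareSizeEstimator implements MessageSizeEstimator {

   public static final FileRegionAwareSizeEstimator INSTANCE = new FileRegionAwareSizeEstimator();

   private static final Handle HANDLE = msg -> {
      if (msg instanceof ByteBuf) {
         return ((ByteBuf) msg).readableBytes();
      }
      if (msg instanceof ByteBufHolder) {
         return ((ByteBufHolder) msg).content().readableBytes();
      }
      if (msg instanceof FileRegion) {
         // unlike DefaultMessageSizeEstimator (which returns 0 here), account the
         // whole region so it contributes to the write-buffer water marks
         return (int) Math.min(Integer.MAX_VALUE, ((FileRegion) msg).count());
      }
      return 8; // fallback guess for unknown message types
   };

   private FileRegionAwareSizeEstimator() {
   }

   @Override
   public Handle newHandle() {
      return HANDLE;
   }
}
```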
@wy96f
But actually master and …
I was dubious about it too, but after some analysis I somewhat understood it.
OK, now it is clearer, thanks! In the worst case I could just use something similar to what we were using before, maybe cleaning it up a little more given that we are touching these bits.
@wy96f I've tried to understand what's happening with ChunkedNioFile and probably it needs to be fixed Netty-side (see netty/netty#3413 (comment)), let's see.
This reverts commit 70c2200
Force-pushed from 1cedc9a to 923fccb
@wy96f I've added an additional commit to control the backlog of unflushed Netty writes, to improve the precision of the initial replication timeout performed right after "sending" all the replicated files: due to how Netty works (the previous comment has the relevant info about it), this is probably the safer way to fail fast on slow networks/dead connections during replication. Let me know if your test gets too slow because of this (with both zero copy and not) and whether it seems correct now (without changing the default timeout on initial replication).
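Roughly the idea, as a sketch (not the exact commit; the helper name, parameters, and timeout are illustrative): flush periodically and wait for Netty to confirm the last write actually completed, so a stalled connection surfaces as a timeout instead of an ever-growing local queue.

```java
import io.netty.channel.Channel;
import io.netty.channel.ChannelFuture;

final class BoundedBacklogSender {
   static void send(Channel channel, Object chunk,
                    long packetsSent, int flowControlSize, long flushTimeoutMillis) {
      ChannelFuture lastWrite = channel.writeAndFlush(chunk);
      // every flowControlSize packets, block until the last write has really completed
      if (packetsSent % flowControlSize == 0 &&
          !lastWrite.awaitUninterruptibly(flushTimeoutMillis)) {
         throw new IllegalStateException("timed out flushing the replication backlog");
      }
   }
}
```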
@franz1981 I tested with my load generator. The PR worked fine and showed no sign of slowing down :)
@clebertsuconic I believe this is getting into good shape for a review, according to the test results :)
@franz1981 I think you are good to merge this; it's been open a long time, and it seems no one has any opposition.
@michaelandrepearce Please wait a bit more: I have an open issue on Netty related to
@franz1981 What's occurring with this? Wanting to start clearing down old/stagnant PRs, a bit of spring cleaning so to say, so we focus on stuff that's actively being worked on / relevant.
Closing this because it has been stale for a long time and the community didn't react/ask for this again.
I've opened this PR for discussion: I would like to re-introduce ARTEMIS-2336, but I've allowed WildFly, or any user that doesn't want to (or can't) use zero copy, to keep using the existing Artemis code.
I've also opened netty/netty#9592 to "enhance" ChunkedNioFile in order to solve a bug in our implementation: in the meantime I've "shadowed" my solution for Netty directly into AbsoluteChunkedNioFile so as not to rely on any specific Netty version. For InVM connections this PR could make use of the same optimization sent in #2844 while reading the file (RandomAccessFile) in ReplicationSyncFileMessage.