New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-29971][CORE] Fix buffer leaks in TransportFrameDecoder/TransportCipher
#26609
Conversation
a8420ee
to
c766150
Compare
Test build #114150 has finished for PR 26609 at commit
|
bf1299b
to
d35f381
Compare
Test build #114151 has finished for PR 26609 at commit
|
Test build #114155 has finished for PR 26609 at commit
|
Test build #114159 has finished for PR 26609 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ByteArrayReadableChannel
is kind of a weird class, I'd rather make the (only) caller do the right thing instead.
common/network-common/src/main/java/org/apache/spark/network/util/ByteArrayReadableChannel.java
Outdated
Show resolved
Hide resolved
common/network-common/src/main/java/org/apache/spark/network/util/ByteArrayReadableChannel.java
Outdated
Show resolved
Hide resolved
common/network-common/src/main/java/org/apache/spark/network/util/TransportFrameDecoder.java
Outdated
Show resolved
Hide resolved
TransportFrameDecoder/TransportCipher
Thank you for making a PR, @normanmaurer ! |
@vanzin also related to your comment of |
As I mentioned that's a weird class. It's meant to be fed the buffer, and the caller is expected to consume all the data right away, as If you're really so worried about all this you could make this class a private static class inside |
@vanzin like mentioned above I am not sure it is safe to assume it always consume all data as an |
@dongjoon-hyun @vanzin I just added another commit to add a test case related to your handling of |
Test build #114238 has finished for PR 26609 at commit
|
@vanzin @dongjoon-hyun can you let me know how I can ignore the linter for the two "InternalError" instances that are thrown by the testcode ? |
I'm not sure I follow you. If there's an internal error the whole channel is pretty much gone, nothing gets transferred anymore. In the caller, you just do a Again, I understand you're thinking of this as generic code anyone can use, but it's not. This is internal Spark code that serves a specific purpose, behaves in a very particular way (it actually only exists to work around a bug in the commons-crypto library) and is not used anywhere else. In other words, the buffer copying code you're adding is basically dead code. If it's ever triggered it's actually a bug in the existing code, so instead of adding that code you should be adding a check and throwing an exception, as I suggested. |
@vanzin I am not talking about the changes in |
Still not following you.
(You need to release the buffer in |
@vanzin yes we could also just remove all the |
common/network-common/src/test/java/org/apache/spark/network/crypto/TransportCipherTest.java
Outdated
Show resolved
Hide resolved
That sounds good to me. |
@vanzin done.. |
Test build #114251 has finished for PR 26609 at commit
|
common/network-common/src/main/java/org/apache/spark/network/util/ByteArrayReadableChannel.java
Outdated
Show resolved
Hide resolved
common/network-common/src/main/java/org/apache/spark/network/util/ByteArrayReadableChannel.java
Show resolved
Hide resolved
common/network-common/src/main/java/org/apache/spark/network/util/ByteArrayReadableChannel.java
Outdated
Show resolved
Hide resolved
common/network-common/src/test/java/org/apache/spark/network/crypto/TransportCipherTest.java
Outdated
Show resolved
Hide resolved
...network-common/src/test/java/org/apache/spark/network/util/ByteArrayReadableChannelTest.java
Outdated
Show resolved
Hide resolved
common/network-common/src/test/java/org/apache/spark/network/crypto/TransportCipherTest.java
Outdated
Show resolved
Hide resolved
common/network-common/src/test/java/org/apache/spark/network/crypto/TransportCipherTest.java
Outdated
Show resolved
Hide resolved
@vanzin can you have a look again ? |
common/network-common/src/test/java/org/apache/spark/network/crypto/TransportCipherTest.java
Outdated
Show resolved
Hide resolved
Oh, you changed to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other than the style issues the linter picked up, looks ok.
common/network-common/src/test/java/org/apache/spark/network/crypto/TransportCipherTest.java
Outdated
Show resolved
Hide resolved
common/network-common/src/main/java/org/apache/spark/network/crypto/TransportCipher.java
Outdated
Show resolved
Hide resolved
@vanzin I hope I fixed all now. |
Test build #114252 has finished for PR 26609 at commit
|
Test build #114254 has finished for PR 26609 at commit
|
Merging to master / 2.4. I'll fix the import order in the new test during merge. |
Actually, while merging I noticed something else... could you name the new test class (While there, the import order convention is to have java/javax imports first.) |
So, this is still under review status for those comments, right? @vanzin |
Yes, as you can see by the fact that it's still open... |
…coder` and `TransportCipher` Motivation: We need to carefully manage the ownership / lifecycle of `ByteBuf` instances so we don't leak any of these. We did not correctly do this in all cases: - when end up in invalid cipher state. - when partial data was received and the channel is closed before the full frame is decoded Modifications: - Release message in finally block to ensure it is also released in case of cipher is not valid - Move closing / releasing logic to `handlerRemoved(...)` so we are guarenteed that is always called. - Add tests for `TransportCipher` - Correctly release `frameBuf` it is not null when the handler is removed (and so also when the channel becomes inactive) Result: Fixes netty/netty#9784.
d8d2bbe
to
fb79d0a
Compare
@vanzin @dongjoon-hyun addressed all + squashed and adjusted commit message to reflect the final patch. |
Test build #114281 has finished for PR 26609 at commit
|
Test build #114280 has finished for PR 26609 at commit
|
retest this please |
TransportFrameDecoder/TransportCipher
TransportFrameDecoder/TransportCipher
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM.
Test build #114306 has finished for PR 26609 at commit
|
import java.nio.channels.ReadableByteChannel; | ||
|
||
import io.netty.buffer.ByteBuf; | ||
|
||
public class ByteArrayReadableChannel implements ReadableByteChannel { | ||
private ByteBuf data; | ||
private boolean closed; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure, does this need to be volatile? I'm not sure if there are thread safety concerns here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
streams / channels are not thread-safe.
int totalRead = 0; | ||
while (data.readableBytes() > 0 && dst.remaining() > 0) { | ||
int bytesToRead = Math.min(data.readableBytes(), dst.remaining()); | ||
dst.put(data.readSlice(bytesToRead).nioBuffer()); | ||
totalRead += bytesToRead; | ||
} | ||
|
||
if (data.readableBytes() == 0) { | ||
data.release(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just checking this is no longer needed because of the new cleanup?
Norman, it's not clear whether you wanted your last commit message to be the PR description? Github does not do that automatically. |
I think the PR description is fine
… Am 22.11.2019 um 20:16 schrieb Marcelo Vanzin ***@***.***>:
Norman, it's not clear whether you wanted your last commit message to be the PR description? Github does not do that automatically.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or unsubscribe.
|
Merging to master / 2.4. |
2.4 does not have SPARK-26674 so the cherry-pick was not clean (and it's not a super trivial fix). So need to either cherry-pick that change too, or post a fixed PR for 2.4, if you think this is important in that branch. |
@vanzin let me have a look ... imho this fix is important as it could result in memory leaks |
Thank you, @vanzin and @normanmaurer . |
…ortCipher` - Correctly release `ByteBuf` in `TransportCipher` in all cases - Move closing / releasing logic to `handlerRemoved(...)` so we are guaranteed that is always called. We need to carefully manage the ownership / lifecycle of `ByteBuf` instances so we don't leak any of these. We did not correctly do this in all cases: - when end up in invalid cipher state. - when partial data was received and the channel is closed before the full frame is decoded Fixes netty/netty#9784. No. Pass the newly added UTs. Closes apache#26609 from normanmaurer/leaks_2_4. Authored-by: Norman Maurer <norman_maurer@apple.com>
@dongjoon-hyun @vanzin done #26660 |
What changes were proposed in this pull request?
ByteBuf
inTransportCipher
in all caseshandlerRemoved(...)
so we are guaranteed that is always called.frameBuf
it is not null when the handler is removed (and so also when the channel becomes inactive)Why are the changes needed?
We need to carefully manage the ownership / lifecycle of
ByteBuf
instances so we don't leak any of these. We did not correctly do this in all cases:Fixes netty/netty#9784.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Pass the newly added UTs.