
[SPARK-29971][CORE] Fix buffer leaks in TransportFrameDecoder/TransportCipher #26609

Closed

Conversation

@normanmaurer (Contributor) commented Nov 20, 2019

What changes were proposed in this pull request?

  • Correctly release the ByteBuf in TransportCipher in all cases
  • Move the closing / releasing logic to handlerRemoved(...) so we are guaranteed that it is always called (see the sketch below)
  • Correctly release frameBuf if it is not null when the handler is removed (and therefore also when the channel becomes inactive)
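A minimal sketch of the handlerRemoved(...) cleanup pattern described above, using the Netty APIs this PR is built on; the class name is illustrative and this is not the actual Spark code:

```java
import io.netty.buffer.ByteBuf;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

// Illustrative frame decoder that buffers a partial frame and releases it in
// handlerRemoved(...), which Netty calls both when the handler is removed from
// the pipeline and when the channel is torn down.
public class LeakSafeFrameDecoder extends ChannelInboundHandlerAdapter {
  private ByteBuf frameBuf; // partially received frame, may be null

  @Override
  public void handlerRemoved(ChannelHandlerContext ctx) throws Exception {
    if (frameBuf != null) {
      frameBuf.release(); // release exactly once, even if no full frame ever arrived
      frameBuf = null;
    }
    super.handlerRemoved(ctx);
  }
}
```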

Why are the changes needed?

We need to carefully manage the ownership / lifecycle of ByteBuf instances so we don't leak any of these. We did not correctly do this in all cases:

  • when we end up in an invalid cipher state
  • when partial data was received and the channel is closed before the full frame is decoded

Fixes netty/netty#9784.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Pass the newly added UTs.

@SparkQA commented Nov 20, 2019

Test build #114150 has finished for PR 26609 at commit a8420ee.

  • This patch fails Java style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@normanmaurer force-pushed the fix_leaks branch 2 times, most recently from bf1299b to d35f381, on November 20, 2019 09:29
@SparkQA commented Nov 20, 2019

Test build #114151 has finished for PR 26609 at commit c766150.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 20, 2019

Test build #114155 has finished for PR 26609 at commit bf1299b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 20, 2019

Test build #114159 has finished for PR 26609 at commit d35f381.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor) left a comment

ByteArrayReadableChannel is kind of a weird class, I'd rather make the (only) caller do the right thing instead.

@dongjoon-hyun changed the title from "[SPARK-29971] Fix multiple possible buffer leaks in `TransportFrameDe…" to "[SPARK-29971] Fix multiple possible buffer leaks in TransportFrameDecoder/TransportCipher" on Nov 20, 2019
@dongjoon-hyun (Member) commented Nov 20, 2019

Thank you for making a PR, @normanmaurer!
I updated the PR description according to Apache Spark community style.

@normanmaurer (Contributor, Author)

@vanzin also, related to your comment about ByteArrayReadableChannel: imho you transfer ownership of the ByteBuf to ByteArrayReadableChannel, so it should be its responsibility to ensure we release it.

@vanzin (Contributor) commented Nov 20, 2019

you transfer ownership of the ByteBuf to ByteArrayReadableChannel

As I mentioned that's a weird class. It's meant to be fed the buffer, and the caller is expected to consume all the data right away, as TransportCipher does. So technically TransportCipher never loses ownership of that buffer, and this class is just an optimization to avoid instantiating a new one every time data arrives.

If you're really so worried about all this, you could make this class a private static class inside TransportCipher, which is the only place where it's used.

@normanmaurer (Contributor, Author)

@vanzin as mentioned above, I am not sure it is safe to assume it always consumes all the data, as an InternalError may be thrown (you even guard against it). So I would suggest it is better to be safe than sorry. If you are really concerned about the extra logic in feedData(...), I would also be happy to let it throw if we already hold a reference to a ByteBuf; just let me know if you would prefer this and I will change the code. That said, I am not sure why it is such a big deal to add the code to feedData as I did, since I don't think it is overly complex. But I am happy to adapt if needed :)
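A hedged sketch of that "just let it throw" alternative (illustrative names, not the code that ended up in the PR): feedData(...) rejects a second buffer instead of copying it, so a violation of the single-buffer contract fails loudly rather than leaking.

```java
import io.netty.buffer.ByteBuf;

// Illustrative variant of ByteArrayReadableChannel with a strict feedData(...)
// contract: the previous buffer must be fully consumed before new data arrives.
class StrictByteArrayReadableChannel {
  private ByteBuf data;

  public void feedData(ByteBuf buf) {
    if (data != null && data.isReadable()) {
      buf.release(); // don't leak the rejected buffer
      throw new IllegalStateException("previous buffer was not fully consumed");
    }
    data = buf;
  }
}
```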

@normanmaurer (Contributor, Author)

@dongjoon-hyun @vanzin I just added another commit with a test case related to your handling of InternalError, which would have leaked two buffers before the changes in here.

@SparkQA commented Nov 21, 2019

Test build #114238 has finished for PR 26609 at commit b06ec10.

  • This patch fails Java style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@normanmaurer (Contributor, Author)

@vanzin @dongjoon-hyun can you let me know how I can make the linter ignore the two InternalError instances that are thrown by the test code?

@vanzin (Contributor) commented Nov 21, 2019

I am not sure it is safe to assume it always consume all data as an InternalError

I'm not sure I follow you. If there's an internal error the whole channel is pretty much gone, nothing gets transferred anymore.

In the caller, you just do a try...finally and release the buffer in the finally. In the absence of errors, the contract is that all data in the buffer fed to the feedData method is read by TransportCipher.

Again, I understand you're thinking of this as generic code anyone can use, but it's not. This is internal Spark code that serves a specific purpose, behaves in a very particular way (it actually only exists to work around a bug in the commons-crypto library) and is not used anywhere else.

In other words, the buffer copying code you're adding is basically dead code. If it's ever triggered it's actually a bug in the existing code, so instead of adding that code you should be adding a check and throwing an exception, as I suggested.

@normanmaurer (Contributor, Author)

@vanzin I am not talking about the changes in feedData(...) here (which I think we can also just let throw an exception, as stated above). I am saying that if we do not release the buffer in close(), we may leak, as there is no guarantee that we consume the whole buffer (and so release it) if an InternalError was thrown.

@vanzin (Contributor) commented Nov 21, 2019

Still not following you.

feedData is called once in channelRead, and the contract in the code is that all that buffer's data must be read in that same call to channelRead. So you can release the buffer in channelRead just fine as far as I can see. close doesn't need to do anything.

(You need to release the buffer in channelRead in any case, don't you? Since close is not called for each call to channelRead).
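A minimal sketch of the try...finally ownership pattern described in the two comments above; the handler and helper names are illustrative, not the exact Spark implementation:

```java
import io.netty.buffer.ByteBuf;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

// The handler owns the incoming ByteBuf for the duration of channelRead(...)
// and releases it in a finally block, whether decryption succeeds or throws.
class DecryptionHandlerSketch extends ChannelInboundHandlerAdapter {
  @Override
  public void channelRead(ChannelHandlerContext ctx, Object msg) throws Exception {
    ByteBuf buf = (ByteBuf) msg;
    try {
      // Hypothetical helper: feeds the buffer to the cipher stream and is
      // expected to drain every readable byte before returning.
      decryptAndFireRead(ctx, buf);
    } finally {
      buf.release(); // always runs, even if an InternalError is thrown
    }
  }

  private void decryptAndFireRead(ChannelHandlerContext ctx, ByteBuf buf) {
    // omitted: wrap buf in a ReadableByteChannel and forward decrypted frames
  }
}
```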

@normanmaurer (Contributor, Author)

@vanzin yes, we could also just remove all the release magic in ByteArrayReadableChannel.read(...) and close(...) and handle it all in channelRead(...); that would be fine with me as well. When I started the patch, my assumption was that ByteArrayReadableChannel might be used directly by users, so we should not do this. If that is not a problem then sure, let's do it. Please let me know.

@vanzin (Contributor) commented Nov 21, 2019

remove all the release magic in ByteArrayReadableChannel.read(...) and close(...) and handle all in channelRead(...)

That sounds good to me.

@normanmaurer (Contributor, Author)

@vanzin done.

@SparkQA commented Nov 21, 2019

Test build #114251 has finished for PR 26609 at commit 34440e8.

  • This patch fails Java style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@normanmaurer (Contributor, Author)

@vanzin can you have a look again?

@dongjoon-hyun (Member)

Oh, you changed it to use a mock. Then please ignore my outdated comments~

@vanzin (Contributor) left a comment

Other than the style issues the linter picked up, looks ok.

@normanmaurer (Contributor, Author)

@vanzin I hope I have fixed everything now.

@SparkQA commented Nov 21, 2019

Test build #114252 has finished for PR 26609 at commit b4b1b76.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 21, 2019

Test build #114254 has finished for PR 26609 at commit c745649.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor) commented Nov 22, 2019

Merging to master / 2.4. I'll fix the import order in the new test during merge.

@vanzin (Contributor) commented Nov 22, 2019

Actually, while merging I noticed something else... could you name the new test class TransportCipherSuite to match all the other test suites?

(While there, the import order convention is to have java/javax imports first.)

@dongjoon-hyun (Member)

So, this is still under review because of those comments, right? @vanzin

@vanzin (Contributor) commented Nov 22, 2019

Yes, as you can see by the fact that it's still open...

…coder` and `TransportCipher`

Motivation:

We need to carefully manage the ownership / lifecycle of `ByteBuf` instances so we don't leak any of these. We did not correctly do this in all cases:
 - when we end up in an invalid cipher state
 - when partial data was received and the channel is closed before the full frame is decoded

Modifications:

- Release the message in a finally block to ensure it is also released when the cipher is not valid
- Move the closing / releasing logic to `handlerRemoved(...)` so we are guaranteed that it is always called
- Add tests for `TransportCipher`
- Correctly release `frameBuf` if it is not null when the handler is removed (and so also when the channel becomes inactive)

Result:

Fixes netty/netty#9784.
@normanmaurer (Contributor, Author)

@vanzin @dongjoon-hyun addressed everything, squashed, and adjusted the commit message to reflect the final patch.

@SparkQA commented Nov 22, 2019

Test build #114281 has finished for PR 26609 at commit fb79d0a.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 22, 2019

Test build #114280 has finished for PR 26609 at commit d8d2bbe.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public class TransportCipherSuite

@vanzin (Contributor) commented Nov 22, 2019

retest this please

@dongjoon-hyun changed the title from "[SPARK-29971] Fix multiple possible buffer leaks in TransportFrameDecoder/TransportCipher" to "[SPARK-29971][CORE] Fix buffer leaks in TransportFrameDecoder/TransportCipher" on Nov 22, 2019
@dongjoon-hyun (Member) left a comment

+1, LGTM.

@SparkQA commented Nov 22, 2019

Test build #114306 has finished for PR 26609 at commit fb79d0a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

import java.nio.channels.ReadableByteChannel;

import io.netty.buffer.ByteBuf;

public class ByteArrayReadableChannel implements ReadableByteChannel {
  private ByteBuf data;
  private boolean closed;
Review comment (Member):

Does this need to be volatile? I'm not sure whether there are thread safety concerns here.

Reply (Contributor):

streams / channels are not thread-safe.

int totalRead = 0;
while (data.readableBytes() > 0 && dst.remaining() > 0) {
  int bytesToRead = Math.min(data.readableBytes(), dst.remaining());
  dst.put(data.readSlice(bytesToRead).nioBuffer());
  totalRead += bytesToRead;
}

if (data.readableBytes() == 0) {
  data.release();
Review comment (Member):

Just checking: is this no longer needed because of the new cleanup?

@vanzin (Contributor) commented Nov 22, 2019

Norman, it's not clear whether you wanted your last commit message to be the PR description. GitHub does not do that automatically.

@normanmaurer (Contributor, Author) commented Nov 22, 2019 via email

@vanzin (Contributor) commented Nov 22, 2019

Merging to master / 2.4.

@vanzin closed this in f28eab2 on Nov 22, 2019
@vanzin (Contributor) commented Nov 22, 2019

2.4 does not have SPARK-26674, so the cherry-pick was not clean (and it's not a super trivial fix). So we need to either cherry-pick that change too, or post a fixed PR for 2.4, if you think this is important in that branch.

@normanmaurer (Contributor, Author)

@vanzin let me have a look... imho this fix is important, as it could result in memory leaks.

@dongjoon-hyun (Member)

Thank you, @vanzin and @normanmaurer.
Yes, please create a PR against branch-2.4 for the branch-2.4 memory leak, @normanmaurer.

normanmaurer added a commit to normanmaurer/spark that referenced this pull request Nov 25, 2019
…ortCipher`

- Correctly release `ByteBuf` in `TransportCipher` in all cases
- Move closing / releasing logic to `handlerRemoved(...)` so we are guaranteed that it is always called.

We need to carefully manage the ownership / lifecycle of `ByteBuf` instances so we don't leak any of these. We did not correctly do this in all cases:
 - when we end up in an invalid cipher state
 - when partial data was received and the channel is closed before the full frame is decoded

Fixes netty/netty#9784.

No.

Pass the newly added UTs.

Closes apache#26609 from normanmaurer/leaks_2_4.

Authored-by: Norman Maurer <norman_maurer@apple.com>
@normanmaurer (Contributor, Author)

@dongjoon-hyun @vanzin done #26660

@normanmaurer deleted the fix_leaks branch on November 25, 2019 08:26