Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-4393] Fix memory leak in ConnectionManager ACK timeout TimerTasks; use HashedWheelTimer #3259

Closed

Conversation

JoshRosen
Copy link
Contributor

This patch is intended to fix a subtle memory leak in ConnectionManager's ACK timeout TimerTasks: in the old code, each TimerTask held a reference to the message being sent and a cancelled TimerTask won't necessarily be garbage-collected until it's scheduled to run, so this caused huge buildups of messages that weren't garbage collected until their timeouts expired, leading to OOMs.

This patch addresses this problem by capturing only the message ID in the TimerTask instead of the whole message, and by keeping a WeakReference to the promise in the TimerTask. I've also modified this code to use Netty's HashedWheelTimer, whose performance characteristics should be better for this use-case.

Thanks to @cristianopris for narrowing down this issue!

@JoshRosen
Copy link
Contributor Author

/cc @andrewor14 and @rxin for review.

@SparkQA
Copy link

SparkQA commented Nov 14, 2014

Test build #23342 has started for PR 3259 at commit 3200c33.

  • This patch merges cleanly.

@@ -913,8 +918,10 @@ private[nio] class ConnectionManager(
}
}

val timoutTaskHandle = ackTimeoutMonitor.newTimeout(timeoutTask, ackTimeout, TimeUnit.SECONDS)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

timout?

@vanzin
Copy link
Contributor

vanzin commented Nov 14, 2014

LGTM aside from typo.

@SparkQA
Copy link

SparkQA commented Nov 14, 2014

Test build #23342 has finished for PR 3259 at commit 3200c33.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23342/
Test PASSed.

@rxin
Copy link
Contributor

rxin commented Nov 14, 2014

LGTM too.

@cristianopris
Copy link

Hi Josh, thanks for looking into this.

I'm not sure this fully fixes the issue though. In my heap dump it looks like the memory leak is caused by the TimerTask holding a reference to the Promise, which would hold a reference to the message when it's set succesfully. A suggested fix would be to override cancel in the TimerTask to set the promise reference to null if the task is cancelled (and the run() method will not actually get to run).

Edit: Oh I see you also changed to HashedWheelTimer which is much more prompt in removing canceled tasks, so this must be fine. Could be worth confirming this though..

Thanks

@JoshRosen
Copy link
Contributor Author

Ah, good catch regarding the promise field. I don't like the idea of relying on HashedWheelTimer in order to fix this issue; it would be better to prevent the capture of a message reference, since that's going to be a less brittle fix. I'll see if there's a clean way to refactor this so that TimerTask doesn't hold a promise reference.

@JoshRosen
Copy link
Contributor Author

@cristianopris I've updated this so that the TimerTask keeps a WeakReference to promise, which I think should address that source of memory leaks. Thanks to @andrewor14 for discussing this with me; based on our chat, I've added several clarifying comments.

@SparkQA
Copy link

SparkQA commented Nov 14, 2014

Test build #23390 has started for PR 3259 at commit 2a2e92d.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Nov 14, 2014

Test build #23390 has finished for PR 3259 at commit 2a2e92d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23390/
Test PASSed.

@andrewor14
Copy link
Contributor

Latest changes LGTM

val e = new IOException("sendMessageReliably failed because ack " +
s"was not received within $ackTimeout sec")
if (!promise.tryFailure(e)) {
logWarning("Ignore error because promise is completed", e)
Option(promiseReference.get) match {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not

val p = promiseReference.get
if (p == null) {
  ...
} else {
  ...
}

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was actually on the fence about this, but your comment tips me towards the == null camp since it removes a level of nesting / indentation.

@SparkQA
Copy link

SparkQA commented Nov 15, 2014

Test build #23420 has started for PR 3259 at commit afcc8d6.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Nov 15, 2014

Test build #23420 has finished for PR 3259 at commit afcc8d6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23420/
Test PASSed.

@rxin
Copy link
Contributor

rxin commented Nov 16, 2014

Merging in master & branch-1.2. Thanks!

@asfgit asfgit closed this in 7850e0c Nov 16, 2014
asfgit pushed a commit that referenced this pull request Nov 16, 2014
…sks; use HashedWheelTimer

This patch is intended to fix a subtle memory leak in ConnectionManager's ACK timeout TimerTasks: in the old code, each TimerTask held a reference to the message being sent and a cancelled TimerTask won't necessarily be garbage-collected until it's scheduled to run, so this caused huge buildups of messages that weren't garbage collected until their timeouts expired, leading to OOMs.

This patch addresses this problem by capturing only the message ID in the TimerTask instead of the whole message, and by keeping a WeakReference to the promise in the TimerTask.  I've also modified this code to use Netty's HashedWheelTimer, whose performance characteristics should be better for this use-case.

Thanks to cristianopris for narrowing down this issue!

Author: Josh Rosen <joshrosen@databricks.com>

Closes #3259 from JoshRosen/connection-manager-timeout-bugfix and squashes the following commits:

afcc8d6 [Josh Rosen] Address rxin's review feedback.
2a2e92d [Josh Rosen] Keep only WeakReference to promise in TimerTask;
0f0913b [Josh Rosen] Spelling fix: timout => timeout
3200c33 [Josh Rosen] Use Netty HashedWheelTimer
f847dd4 [Josh Rosen] Don't capture entire message in ACK timeout task.

(cherry picked from commit 7850e0c)
Signed-off-by: Reynold Xin <rxin@databricks.com>
@sarutak
Copy link
Member

sarutak commented Nov 17, 2014

We also need to fix this issue for branch-1.1 right?

@rxin
Copy link
Contributor

rxin commented Nov 17, 2014

I don't think this merges cleanly into branch-1.1. Can one of you submit a pull request for that branch? Thanks.

@sarutak
Copy link
Member

sarutak commented Nov 17, 2014

I opend #3321 for branch-1.1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
8 participants