Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-4393] Fix memory leak in ConnectionManager ACK timeout TimerTasks; use HashedWheelTimer (For branch-1.1) #3321

Closed

Conversation

sarutak
Copy link
Member

@sarutak sarutak commented Nov 17, 2014

This patch is intended to fix a subtle memory leak in ConnectionManager's ACK timeout TimerTasks: in the old code, each TimerTask held a reference to the message being sent and a cancelled TimerTask won't necessarily be garbage-collected until it's scheduled to run, so this caused huge buildups of messages that weren't garbage collected until their timeouts expired, leading to OOMs.

This patch addresses this problem by capturing only the message ID in the TimerTask instead of the whole message, and by keeping a WeakReference to the promise in the TimerTask. I've also modified this code to use Netty's HashedWheelTimer, whose performance characteristics should be better for this use-case.

@SparkQA
Copy link

SparkQA commented Nov 17, 2014

QA tests have started for PR 3321 at commit 786af91.

  • This patch merges cleanly.

@rxin
Copy link
Contributor

rxin commented Nov 17, 2014

@JoshRosen can you take a look since you wrote the original patch?

@SparkQA
Copy link

SparkQA commented Nov 17, 2014

QA tests have finished for PR 3321 at commit 786af91.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23489/
Test PASSed.

@JoshRosen
Copy link
Contributor

This looks good to me. I tried doing a cherry-pick myself and it looks like the original merge conflicts were caused by the ConnectionManager being moved to the nio subpackage in 1.2 and due to 73bf3f2 / #2593 and a couple of other commits that weren't present in branch-1.1, which caused the tryFailure / trySuccess code to be missing from the diff context.

@JoshRosen
Copy link
Contributor

(Edit: earlier comment said branch-1.2; I meant branch-1.1).

@andrewor14
Copy link
Contributor

Yeah it would be good to get this in before we cut another RC for 1.1.1

@JoshRosen
Copy link
Contributor

Alright, this looks good, so I'm going to merge it into branch-1.1. Thanks @sarutak!

@JoshRosen
Copy link
Contributor

I've merged this, but GitHub won't automatically close PRs made against non-master branches. Here's the commit in the Apache Git (the mirroring hasn't picked it up yet): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=91b5fa82477e5fd43712fdf067d92a31d4037a83

Do you mind closing this now that I've merged it? Thanks again!

@sarutak
Copy link
Member Author

sarutak commented Nov 18, 2014

Thanks for notification. I close this PR.

@sarutak sarutak closed this Nov 18, 2014
@sarutak sarutak deleted the connection-manager-timeout-bugfix branch April 11, 2015 05:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants