
[SPARK-3731] [PySpark] fix memory leak in PythonRDD #2668

Closed
wants to merge 1 commit into apache:master from davies:leak

Conversation

davies
Contributor

@davies davies commented Oct 6, 2014

PythonRDD executes parent.getOrCompute() in a separate thread; that thread should release the memory it reserved for shuffle and unrolling in a finally block.
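
The shape of the fix is small enough to sketch. The snippet below is an illustration rather than the verbatim ae98be2 patch: it assumes the 1.1-era per-thread accounting methods `releaseMemoryForThisThread()` on ShuffleMemoryManager and `releaseUnrollMemoryForThisThread()` on MemoryStore, and hypothetical surrounding names (`env`, `writeIteratorToStream`, `dataOut`).

```scala
// Sketch of the writer-thread fix (illustrative, not the verbatim patch).
// Computing the parent partition may reserve shuffle/unroll memory, and
// that reservation is keyed to the current (short-lived writer) thread,
// so it must be released in a finally block before the thread exits.
override def run(): Unit = {
  try {
    val iter = firstParent.iterator(split, context) // may reserve memory
    writeIteratorToStream(iter, dataOut)            // hypothetical helper
  } finally {
    // Assumed 1.1-era per-thread APIs; without these calls the
    // reservations outlive the thread and are never freed.
    env.shuffleMemoryManager.releaseMemoryForThisThread()
    env.blockManager.memoryStore.releaseUnrollMemoryForThisThread()
  }
}
```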

@SparkQA

SparkQA commented Oct 6, 2014

QA tests have started for PR 2668 at commit ae98be2.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 6, 2014

QA tests have finished for PR 2668 at commit ae98be2.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CacheTableCommand(tableName: String, plan: Option[LogicalPlan], isLazy: Boolean)
    • case class UncacheTableCommand(tableName: String) extends Command
    • case class CacheTableCommand(
    • case class UncacheTableCommand(tableName: String) extends LeafNode with Command
    • case class DescribeCommand(child: SparkPlan, output: Seq[Attribute])(

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21319/

@JoshRosen
Contributor

LGTM, so I'll merge this (and backport for 1.1.1). Thanks!

@pwendell
Contributor

pwendell commented Oct 7, 2014

Awesome, thanks for looking at this @davies

@asfgit asfgit closed this in bc87cc4 Oct 7, 2014
asfgit pushed a commit that referenced this pull request Oct 7, 2014
PythonRDD executes parent.getOrCompute() in a separate thread; that thread should release the memory it reserved for shuffle and unrolling in a finally block.

Author: Davies Liu <davies.liu@gmail.com>

Closes #2668 from davies/leak and squashes the following commits:

ae98be2 [Davies Liu] fix memory leak in PythonRDD

(cherry picked from commit bc87cc4)
Signed-off-by: Josh Rosen <joshrosen@apache.org>

Conflicts:
	core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
asfgit pushed a commit that referenced this pull request Jul 29, 2015
… on a per-task, not per-thread, basis

Spark's ShuffleMemoryManager and MemoryStore track memory on a per-thread basis, which causes problems in the handful of cases where tasks use multiple threads. In PythonRDD, RRDD, ScriptTransformation, and PipedRDD we consume the input iterator in a separate thread in order to write it to an external process. As a result, these RDDs' input iterators are consumed in a different thread than the one that created them, which can confuse our memory allocation tracking: if allocations happen in one thread but deallocations in another, memory may be leaked, or we may get errors complaining that more memory was allocated than was freed.
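
That failure mode is easy to reproduce with a toy per-thread ledger. The `PerThreadLeakDemo` object below is illustrative only, not a Spark class:

```scala
import scala.collection.mutable

// Toy stand-in for per-thread memory accounting (not a Spark class).
object PerThreadLeakDemo {
  private val reserved = mutable.Map[Long, Long]()

  def reserve(bytes: Long): Unit = synchronized {
    val tid = Thread.currentThread().getId
    reserved(tid) = reserved.getOrElse(tid, 0L) + bytes
  }

  def releaseAllForCurrentThread(): Long = synchronized {
    reserved.remove(Thread.currentThread().getId).getOrElse(0L)
  }

  def main(args: Array[String]): Unit = {
    reserve(1024) // the task's main thread reserves while creating the iterator
    val writer = new Thread(new Runnable {
      override def run(): Unit = {
        // The writer thread consumes the iterator and tries to clean up,
        // but the reservation is keyed to the main thread's id, so
        // nothing is freed here.
        println(s"writer freed ${releaseAllForCurrentThread()} bytes") // 0
      }
    })
    writer.start()
    writer.join()
    println(s"leaked: ${synchronized(reserved.values.sum)} bytes") // 1024
  }
}
```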

I think that the right way to fix this is to change our accounting to be performed on a per-task instead of per-thread basis.  Note that the current per-thread tracking has caused problems in the past; SPARK-3731 (#2668) fixes a memory leak in PythonRDD that was caused by this issue (that fix is no longer necessary as of this patch).
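
A minimal sketch of the per-task scheme, with illustrative names rather than the actual classes this patch introduces: reservations are keyed by the task attempt ID, which every thread working on the task shares, and the task frees everything once during cleanup.

```scala
import scala.collection.mutable

// Illustrative per-task ledger (names are not the actual post-#7734 API).
// Keying by task attempt id instead of Thread.currentThread().getId means
// a writer thread with the TaskContext propagated to it charges and frees
// the same entry as the task's main thread.
class PerTaskMemoryAccounting {
  private val reservedByTask = mutable.Map[Long, Long]()

  def reserve(taskAttemptId: Long, bytes: Long): Unit = synchronized {
    reservedByTask(taskAttemptId) =
      reservedByTask.getOrElse(taskAttemptId, 0L) + bytes
  }

  def release(taskAttemptId: Long, bytes: Long): Unit = synchronized {
    val remaining = reservedByTask.getOrElse(taskAttemptId, 0L) - bytes
    if (remaining > 0) reservedByTask(taskAttemptId) = remaining
    else reservedByTask.remove(taskAttemptId)
  }

  // Called once from task cleanup, from any thread:
  def releaseAllForTask(taskAttemptId: Long): Long = synchronized {
    reservedByTask.remove(taskAttemptId).getOrElse(0L)
  }
}
```

This is also why commit b4b1702 below propagates the TaskContext to writer threads (they need it to look up the task attempt ID), and why fa78ee8 moves the cleanup into Task so that the one-shot release runs while the TaskContext is still defined.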

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7734 from JoshRosen/memory-tracking-fixes and squashes the following commits:

b4b1702 [Josh Rosen] Propagate TaskContext to writer threads.
57c9b4e [Josh Rosen] Merge remote-tracking branch 'origin/master' into memory-tracking-fixes
ed25d3b [Josh Rosen] Address minor PR review comments
44f6497 [Josh Rosen] Fix long line.
7b0f04b [Josh Rosen] Fix ShuffleMemoryManagerSuite
f57f3f2 [Josh Rosen] More thread -> task changes
fa78ee8 [Josh Rosen] Move Executor's cleanup into Task so that TaskContext is defined when cleanup is performed
5e2f01e [Josh Rosen] Fix capitalization
1b0083b [Josh Rosen] Roll back fix in PySpark, which is no longer necessary
2e1e0f8 [Josh Rosen] Use TaskAttemptIds to track shuffle memory
c9e8e54 [Josh Rosen] Use TaskAttemptIds to track unroll memory