[SPARK-25062][SQL] Clean up BlockLocations in InMemoryFileIndex #22603
Conversation
@@ -315,7 +315,12 @@ object InMemoryFileIndex extends Logging {
       // which is very slow on some file system (RawLocalFileSystem, which is launch a
       // subprocess and parse the stdout).
       try {
-        val locations = fs.getFileBlockLocations(f, 0, f.getLen)
+        val locations = fs.getFileBlockLocations(f, 0, f.getLen).map(
+          loc => if (loc.getClass == classOf[BlockLocation]) {
`loc.isInstanceOf[BlockLocation]`? Or even better, what about using pattern matching?
Thanks @mgaido91, but `loc` is always an instance of `BlockLocation` (it might be a subclass such as `HdfsBlockLocation`), so `isInstanceOf[BlockLocation]` or pattern matching would always return true. I want to test that the class of `loc` is exactly `BlockLocation`, and if it is, we don't need to convert it.
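The point under discussion can be illustrated with a small self-contained sketch (stand-in classes here, not the real Hadoop ones):

```scala
// Stand-in hierarchy mirroring BlockLocation / HdfsBlockLocation.
class BlockLocation
class HdfsBlockLocation extends BlockLocation

val loc: BlockLocation = new HdfsBlockLocation
// isInstanceOf (and pattern matching) succeed for subclasses too...
val isInstance = loc.isInstanceOf[BlockLocation]          // true
// ...whereas comparing getClass checks the exact runtime class.
val isExactClass = loc.getClass == classOf[BlockLocation] // false
```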
Ah right, sorry @peter-toth. Thanks. Anyway, please move `loc` to the previous line and use curly braces for `map`; I think that is the most widely used syntax in the codebase. Thanks.
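A generic sketch of the requested style (a hypothetical example, not the PR's actual code):

```scala
// Curly braces for a multi-line `map`, with the lambda parameter
// moved up onto the same line as `map {`:
val lengths = Seq("spark", "sql").map { word =>
  word.length
}
```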
Change-Id: I57c862ca076015f36aaee1da02c7fce80d740890
ok to test
Test build #96856 has finished for PR 22603 at commit
@@ -315,7 +315,13 @@ object InMemoryFileIndex extends Logging {
       // which is very slow on some file system (RawLocalFileSystem, which is launch a
       // subprocess and parse the stdout).
       try {
-        val locations = fs.getFileBlockLocations(f, 0, f.getLen)
+        val locations = fs.getFileBlockLocations(f, 0, f.getLen).map { loc =>
Hi, @peter-toth. Could you add a one-line comment to explain this conversion?
val inMemoryFileIndex = new InMemoryFileIndex(
  spark, Seq(new Path(file.getCanonicalPath)), Map.empty, None) {
  def leafFileStatuses = leafFiles.map(_._2)
nit: `def leafFileStatuses = leafFiles.values`?
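The two forms are equivalent on a map; a minimal illustration, with a plain Scala `Map` standing in for the actual `leafFiles` structure:

```scala
// `map(_._2)` extracts the second element of every (key, value) tuple,
// which is exactly what `.values` already provides, more directly.
val leafFiles = Map("part-0" -> 100L, "part-1" -> 200L)
val viaMap    = leafFiles.map(_._2).toSeq.sorted
val viaValues = leafFiles.values.toSeq.sorted
```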
@@ -248,6 +248,25 @@ class FileIndexSuite extends SharedSQLContext {
       assert(spark.read.parquet(path.getAbsolutePath).schema.exists(_.name == colToUnescape))
     }
   }
+
+  test("SPARK-25062 - InMemoryCache stores only simple BlockLocations") {
`InMemoryCache` -> `InMemoryFileIndex`? And, `simple BlockLocations` may look unclear later.
Thanks @dongjoon-hyun for the review. I've fixed your findings.
Test build #96932 has finished for PR 22603 at commit
Could you review this, @cloud-fan, @gatorsmile, @HyukjinKwon?
class SpecialBlockLocationFileSystem extends RawLocalFileSystem {

  class SpecialBlockLocation(
      names: Array[String],
4 spaces indentation
      length: Long) extends BlockLocation(names, hosts, offset, length)

  override def getFileBlockLocations(
      file: FileStatus,
ditto
LGTM
@peter-toth, could you address @cloud-fan's comments?
Change-Id: Ifc1a90ade3938cdaf049d2c0c874f1840f6fcc28
Thanks @cloud-fan for the review. I've fixed your findings.
Test build #97065 has finished for PR 22603 at commit
Congratulations on your first contribution, @peter-toth. And thank you, @cloud-fan and @mgaido91. Merged to master.
@peter-toth, what is your Apache JIRA user id? I need to assign you to the resolved SPARK-25062, but I cannot find your id and user name.
Thanks @dongjoon-hyun,
## What changes were proposed in this pull request?

`InMemoryFileIndex` contains a cache of `LocatedFileStatus` objects. Each `LocatedFileStatus` object can contain several `BlockLocation`s or some subclass of it. Filling up this cache by listing files happens recursively either on the driver or on the executors, depending on the parallel discovery threshold (`spark.sql.sources.parallelPartitionDiscovery.threshold`). If the listing happens on the executors, block location objects are converted to simple `BlockLocation` objects to satisfy serialization requirements. If it happens on the driver, there is no conversion, and depending on the file system a `BlockLocation` object can be a subclass like `HdfsBlockLocation` and consume more memory. This PR adds the conversion to the latter case as well, which decreases memory consumption.

## How was this patch tested?

Added unit test.

Closes apache#22603 from peter-toth/SPARK-25062.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
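The core of the change can be sketched with stand-in classes (the real code uses `org.apache.hadoop.fs.BlockLocation`; the field names and constructor signatures below are simplified assumptions, not the Hadoop API):

```scala
// Simplified stand-ins for the Hadoop classes.
class BlockLocation(val names: Array[String], val hosts: Array[String],
                    val offset: Long, val length: Long)
class HdfsBlockLocation(names: Array[String], hosts: Array[String],
                        offset: Long, length: Long, val extra: String)
  extends BlockLocation(names, hosts, offset, length)

// If the location is already a plain BlockLocation, cache it as-is;
// otherwise copy its fields into a new plain BlockLocation so the cached
// object does not retain subclass state (and stays small).
def toPlainBlockLocation(loc: BlockLocation): BlockLocation =
  if (loc.getClass == classOf[BlockLocation]) {
    loc
  } else {
    new BlockLocation(loc.names, loc.hosts, loc.offset, loc.length)
  }
```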