
[SPARK-20408][SQL] Get the glob path in parallel to reduce resolve relation time #17702

Closed
wants to merge 6 commits

Conversation

xuanyuanking
Member

@xuanyuanking xuanyuanking commented Apr 20, 2017

What changes were proposed in this pull request?

This PR changes glob path resolution to run in parallel, which makes resolving complex wildcard paths faster. The main changes in detail:
1. Add a new function getGlobbedPaths in DataSource that returns all paths represented by the wildcard, following the logic of InMemoryFileIndex.bulkListLeafFiles and reusing its config.
2. Add a new function expandGlobPath in SparkHadoopUtil to expand the first directory represented by the wildcard.
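As a rough illustration of the proposed shape (a minimal sketch; the names follow the PR description, but the body is an assumption, not the actual patch):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch only: expand just the FIRST glob component of a pattern, e.g.
// /a/*/b/* -> /a/x/b/*, /a/y/b/*, leaving later components to be resolved
// in parallel afterwards.
object GlobSketch {
  private val globChars = "{}[]*?\\"

  def expandGlobPath(fs: FileSystem, pattern: Path): Seq[String] = {
    val segments = pattern.toString.split("/")
    val firstGlob = segments.indexWhere(_.exists(globChars.contains(_)))
    if (firstGlob < 0) {
      // No glob component at all: return the pattern unchanged.
      Seq(pattern.toString)
    } else {
      // Glob only the prefix up to and including the first glob component.
      val prefix = segments.take(firstGlob + 1).mkString("/")
      val suffix = segments.drop(firstGlob + 1).mkString("/")
      Option(fs.globStatus(new Path(prefix))).getOrElse(Array.empty).map { status =>
        if (suffix.isEmpty) status.getPath.toString
        else status.getPath.toString + "/" + suffix
      }.toSeq
    }
  }
}
```

Note that splitting the path string on "/" is exactly the simplification a reviewer later objects to (Windows paths, escape characters), so treat this purely as an illustration of the idea.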

How was this patch tested?

Existing UT.

@SparkQA

SparkQA commented Apr 20, 2017

Test build #75984 has finished for PR 17702 at commit b27ef4f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xuanyuanking
Member Author

@marmbrus Can you take a look at this? Thanks :)

@xuanyuanking
Member Author

cc @zsxwing @tdas, can you review this? I found related code of yours earlier. Thanks :)

@xuanyuanking
Member Author

@HyukjinKwon Can you help me find an appropriate reviewer for this?

@HyukjinKwon
Member

Not sure. Probably, @cloud-fan or @gatorsmile ?

@xuanyuanking
Member Author

ping @cloud-fan and @gatorsmile, could you take a look at this? Thanks :)

@@ -146,6 +146,11 @@ object SQLConf {
.longConf
.createWithDefault(Long.MaxValue)

val GLOB_PATH_IN_PARALLEL = buildConf("spark.sql.globPathInParallel")
.doc("When true, resolve the glob path in parallel, the strategy same with ")
Member

?

Member Author

Sorry for the cut-off; I added a patch to fix this.

@gatorsmile
Member

Can you show us the performance difference?

@cloud-fan
Contributor

Yea, I'm also wondering how useful it is.

@xuanyuanking
Member Author

xuanyuanking commented May 2, 2017

Thanks for your review. @gatorsmile @cloud-fan

Can you show us the performance difference?

No problem, I reproduced our online case offline as described below.

Test env:

| item | detail |
|---|---|
| Spark version | current master 2.2.0-SNAPSHOT |
| Hadoop version | 2.7.2 |
| HDFS | 8 servers (128G, 20 cores + 20 hyper-threading) |
| Test case | spark.read.text("/app/dc/test_for_ls/*/*/*/*").count() |

The first level below test_for_ls contains 96 directories, each of which has 1000 directories at the next level; the third and fourth levels have only 1 directory and 1 file each.

Without parallel resolve:

(screenshot)

With parallel resolve:

(screenshot)

Discussion:

  1. More complex scenarios with deeper directory levels benefit more; in this local test the change made resolution about 100% (2x) faster.
  2. The spark.sql.globPathInParallel config only parallelizes the resolution of the glob path.
  3. If the driver and the cluster are in different geographical regions, this improvement yields at least a 5x speedup in our scenario, because the resolving work becomes a parallel job running on the cluster.

@SparkQA

SparkQA commented May 2, 2017

Test build #76378 has finished for PR 17702 at commit 2dcf96f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xuanyuanking
Member Author

@gatorsmile @cloud-fan, do we need any other performance tests?

@xuanyuanking
Member Author

ping @cloud-fan

@SparkQA

SparkQA commented Jun 1, 2017

Test build #77634 has finished for PR 17702 at commit 897d316.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

The logic looks very similar to InMemoryFileIndex.bulkListLeafFiles, is it possible to consolidate them?

@xuanyuanking
Member Author

@cloud-fan Thanks for your reply!
It's possible to consolidate them, but maybe it's not necessary? I could consolidate them by replacing the logic in getGlobbedPaths listed below with InMemoryFileIndex.bulkListLeafFiles:

+      val expanded = sparkSession.sparkContext
+        .parallelize(paths, paths.length)
+        .map { pathString =>
+          SparkHadoopUtil.get.globPathIfNecessary(new Path(pathString)).map(_.toString)
+        }.collect()
+      expanded.flatMap(paths => paths.map(new Path(_))).toSeq

and also changing the interface of bulkListLeafFiles by adding a default parameter that takes the function globPathIfNecessary, because the glob path can be nested many levels deep.
Maybe the current logic of adding a new parallel job is better than consolidating; what do you think?

@gatorsmile
Member

retest this please

@SparkQA

SparkQA commented Jun 16, 2017

Test build #78137 has finished for PR 17702 at commit 897d316.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -389,6 +389,23 @@ case class DataSource(
}

/**
* Return all paths represented by the wildcard string.
*/
private def getGlobbedPaths(qualified: Path): Seq[Path] = {
Contributor

at least we should follow InMemoryFileIndex.bulkListLeafFiles and "pick the listing strategy adaptively depending on the number of paths to list"

Member Author

You are right.
I'll fix this and also limit the max parallelism in the next patch, reusing the config from InMemoryFileIndex.bulkListLeafFiles.

@SparkQA

SparkQA commented Jun 16, 2017

Test build #78156 has started for PR 17702 at commit a3a3509.

@xuanyuanking
Member Author

xuanyuanking commented Jun 16, 2017

The test failure may be caused by the environment? The process was terminated by signal 9 according to the Jenkins log.

@xuanyuanking
Member Author

retest this please

@SparkQA

SparkQA commented Jun 16, 2017

Test build #78171 has finished for PR 17702 at commit a3a3509.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 30, 2017

Test build #82351 has finished for PR 17702 at commit 0ee6943.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xuanyuanking
Member Author

retest this please

1 similar comment
@xuanyuanking
Member Author

retest this please

@SparkQA

SparkQA commented Oct 2, 2017

Test build #82378 has finished for PR 17702 at commit d019656.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Contributor

cc @zsxwing

* Return all paths represented by the wildcard string.
* Follow [[InMemoryFileIndex]].bulkListLeafFile and reuse the conf.
*/
private def getGlobbedPaths(
Member

Could you move this method to object DataSource?

Member Author

Done in next commit.

@@ -246,6 +246,18 @@ class SparkHadoopUtil extends Logging {
if (isGlobPath(pattern)) globPath(fs, pattern) else Seq(pattern)
}

def expandGlobPath(fs: FileSystem, pattern: Path): Seq[String] = {
Member

Please add unit tests for this method.

Member Author

Added a UT in SparkHadoopUtilSuite.scala.
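For reference, a hypothetical shape for such a test (illustrative only; the real test lives in SparkHadoopUtilSuite.scala and may differ, and this assumes expandGlobPath expands only the first glob component, as the PR description states):

```scala
import java.nio.file.Files
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Illustrative test sketch: expanding the first wildcard of <base>/*/f
// should produce one concrete path per matching first-level directory.
val base = Files.createTempDirectory("expand-glob-test")
Files.createDirectories(base.resolve("d1"))
Files.createDirectories(base.resolve("d2"))

val fs = FileSystem.getLocal(new Configuration())
val expanded = SparkHadoopUtil.get.expandGlobPath(fs, new Path(base.toString + "/*/f"))
// One entry per matched directory; the trailing "/f" component is kept as-is.
assert(expanded.length == 2)
```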

@SparkQA

SparkQA commented Nov 13, 2017

Test build #83783 has finished for PR 17702 at commit 660b95a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 14, 2017

Test build #83819 has finished for PR 17702 at commit ec9c1c1.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xuanyuanking
Member Author

retest this please

@SparkQA

SparkQA commented Nov 14, 2017

Test build #83822 has finished for PR 17702 at commit ec9c1c1.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xuanyuanking
Member Author

retest this please

@SparkQA

SparkQA commented Nov 14, 2017

Test build #83837 has finished for PR 17702 at commit ec9c1c1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xuanyuanking
Member Author

@zsxwing Thanks for your comments, ready for review.

@xuanyuanking
Member Author

gentle ping @zsxwing

val parallelPartitionDiscoveryParallelism =
sparkSession.sessionState.conf.parallelPartitionDiscoveryParallelism
val numParallelism = Math.min(paths.size, parallelPartitionDiscoveryParallelism)
val expanded = sparkSession.sparkContext
Contributor

Why do this using a Spark job, instead of just a local thread pool?

I see this is the same thing done by InMemoryFileIndex, but it feels unnecessarily expensive.

Member Author

@vanzin Thanks for your reply.

Why do this using a Spark job, instead of just a local thread pool?

Since the DFS is generally deployed together with the NodeManagers for better data locality, when we use client mode with the driver in a different region from the cluster, using a Spark job solves the problem of cross-region interaction in our scenario.

Contributor

NMs are part of YARN, not HDFS.

This code will talk to HDFS's NameNodes; there is generally only one of those you'll be talking to. Latency here is not an issue, throughput is, so I still don't see why this shouldn't be done with a local thread pool.

Member Author

Yep, I meant that YARN and HDFS are always deployed in the same region, but we can't control where the driver runs, because in client mode (e.g. Spark SQL or the shell) it's our customer's machine.
For example, we deploy YARN and HDFS in Beijing, CN, while a user runs Spark SQL from Shanghai, CN.
Maybe this scenario shouldn't be considered in this patch? What's your opinion, @vanzin?

Contributor

Where the driver is deployed is not of concern; first because, as I said, this is not about latency, but parallelizing multiple calls to the NN. Second, because if your driver is in a different network, it will have the same latency when talking to the executors as if it would talk directly to the NN, so you're not fixing anything, you're just adding an extra hop (= more latency).

Member Author

Thanks for your suggestion and the detailed explanation; I'll reimplement this with a local thread pool.
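A local thread pool version might look roughly like this (a hedged sketch, not the actual commit; the function name, signature, and pool size are assumptions):

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration
import org.apache.hadoop.fs.{FileSystem, Path}

// Resolve several glob patterns concurrently against the NameNode using a
// driver-local thread pool instead of launching a Spark job.
def globInParallel(fs: FileSystem, paths: Seq[Path], numThreads: Int): Seq[Path] = {
  val pool = Executors.newFixedThreadPool(numThreads)
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
  try {
    val futures = paths.map { p =>
      // globStatus may return null for a non-existent, non-glob path.
      Future { Option(fs.globStatus(p)).map(_.map(_.getPath).toSeq).getOrElse(Seq.empty) }
    }
    Await.result(Future.sequence(futures), Duration.Inf).flatten
  } finally {
    pool.shutdown()
  }
}
```

This keeps the parallel NameNode calls (the throughput concern raised above) without the extra driver-to-executor hop.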

Member Author

Sorry for the late reply; finished in the next commit.

@SparkQA

SparkQA commented Jan 23, 2018

Test build #86517 has finished for PR 17702 at commit dc373ae.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xuanyuanking
Member Author

retest this please

@SparkQA

SparkQA commented Jan 24, 2018

Test build #86558 has finished for PR 17702 at commit dc373ae.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xuanyuanking
Member Author

retest this please

@SparkQA

SparkQA commented Jan 24, 2018

Test build #86573 has finished for PR 17702 at commit dc373ae.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xuanyuanking
Member Author

retest this please

@SparkQA

SparkQA commented Jan 24, 2018

Test build #86583 has finished for PR 17702 at commit dc373ae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xuanyuanking
Member Author

ping @vanzin

@@ -252,6 +252,18 @@ class SparkHadoopUtil extends Logging {
if (isGlobPath(pattern)) globPath(fs, pattern) else Seq(pattern)
}

def expandGlobPath(fs: FileSystem, pattern: Path): Seq[String] = {
val arr = pattern.toString.split("/")
Contributor

we should not parse the path string ourselves, it's too risky, we may miss some special cases like windows path, escape character, etc. Let's take a look at org.apache.hadoop.fs.Globber and see if we can reuse some parser API there.

Member Author

Thanks for your reply; I agree with you.

@cloud-fan
Contributor

This approach only works if the first level glob pattern matches a lot of directories, e.g. /my_path/*/*. Otherwise, we can't apply it, e.g. /my_path/{ab, cd}/*.

My proposal: think about how glob works:

  1. Split the path into parts, e.g. /a/*/* -> a, *, *.
  2. For each path part, expand it if it's a glob pattern, then flatMap the expanded results and expand the next path part; repeat until the last path part.

Step by step, we first expand /a/*/* to /a/b1/*; /a/b2/*, and then to /a/b1/c1; /a/b1/c2; /a/b2/c1; /a/b2/c2. Theoretically, we can add a check at each step: if the current to-be-expanded list is above a threshold, do the next expansion in parallel.

Maybe we should just fork the Hadoop Globber and improve it to run in parallel.
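The level-by-level idea can be sketched abstractly (an illustration only, not Globber itself; listMatches is a hypothetical stand-in for matching one glob component against a directory listing):

```scala
// Sketch of step-by-step glob expansion: expand one path component at a
// time, switching to parallel expansion once the frontier of to-be-expanded
// paths grows beyond a threshold.
// (.par is built into Scala 2.11/2.12; Scala 2.13 needs the
// scala-parallel-collections module.)
def expandStepByStep(
    parts: Seq[String],
    listMatches: (String, String) => Seq[String], // (parent, component) => matching child paths
    threshold: Int = 32): Seq[String] = {
  parts.foldLeft(Seq("")) { (frontier, component) =>
    if (frontier.size > threshold) {
      // Above the threshold, expand the next level in parallel.
      frontier.par.flatMap(p => listMatches(p, component)).seq
    } else {
      frontier.flatMap(p => listMatches(p, component))
    }
  }
}
```

With parts = Seq("a", "*", "*") this walks the frontier from "" to /a, then /a/b1 and /a/b2, then all four /a/bN/cN paths, mirroring the expansion steps described above.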

@xuanyuanking
Member Author

This approach only works if the first level glob pattern matches a lot of directories.

Yep. Actually, in our internal usage we leave the problem to users: they should use the first wildcard to match most of the files.

Maybe we should just fork the Hadoop Globber and improve it to run in parallel.

Thanks for your detailed explanation and guidance; I'll reconsider this and open another PR.
