[SPARK-20408][SQL] Get the glob path in parallel to reduce resolve relation time #17702
Conversation
Test build #75984 has finished for PR 17702 at commit
@marmbrus Can you take a look at this? Thanks :)
@HyukjinKwon Can you help me find an appropriate reviewer for this?
Not sure. Probably @cloud-fan or @gatorsmile?
ping @cloud-fan and @gatorsmile, could you have a look at this? Thanks :)
@@ -146,6 +146,11 @@ object SQLConf {
  .longConf
  .createWithDefault(Long.MaxValue)

val GLOB_PATH_IN_PARALLEL = buildConf("spark.sql.globPathInParallel")
  .doc("When true, resolve the glob path in parallel, the strategy same with ")
?
Sorry for the cut-off, I'll add a patch to fix this.
Can you show us the performance difference?
Yea, I'm also wondering how useful it is.
Thanks for your review, @gatorsmile @cloud-fan.
No problem, I reproduced our online case offline as below.
Test env:
Without parallel resolve:
With parallel resolve:
Discussion:
Test build #76378 has finished for PR 17702 at commit
@gatorsmile @cloud-fan, do we need other performance tests?
ping @cloud-fan
Test build #77634 has finished for PR 17702 at commit
The logic looks very similar to
@cloud-fan Thanks for your reply!
, and also change the interface of
retest this please
Test build #78137 has finished for PR 17702 at commit
@@ -389,6 +389,23 @@ case class DataSource(
}

/**
 * Return all paths represented by the wildcard string.
 */
private def getGlobbedPaths(qualified: Path): Seq[Path] = {
At least we should follow InMemoryFileIndex.bulkListLeafFiles and pick the listing strategy adaptively depending on the number of paths to list.
You are right.
I'll fix this and also limit the max parallelism in the next patch, reusing the config from InMemoryFileIndex.bulkListLeafFiles.
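The adaptive strategy being discussed can be sketched with plain Scala futures. Everything here is hypothetical stand-in code, not the patch itself: `expandAdaptively`, the `threshold` parameter (standing in for the parallel-discovery threshold conf), and the `expand` callback (standing in for the real Hadoop listing call).

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Hypothetical sketch: expand each pattern either on the caller thread or on a
// bounded pool, depending on how many patterns there are, mirroring the way
// InMemoryFileIndex.bulkListLeafFiles picks its listing strategy adaptively.
def expandAdaptively(
    patterns: Seq[String],
    threshold: Int,
    maxParallelism: Int)(expand: String => Seq[String]): Seq[String] = {
  if (patterns.size <= threshold) {
    patterns.flatMap(expand)  // few patterns: sequential is cheaper than a pool
  } else {
    val pool = Executors.newFixedThreadPool(math.min(patterns.size, maxParallelism))
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      // Future.sequence preserves input order while the calls run concurrently.
      Await.result(Future.sequence(patterns.map(p => Future(expand(p)))), Duration.Inf).flatten
    } finally {
      pool.shutdown()
    }
  }
}
```

The point of the threshold is exactly what the review asks for: a handful of paths never pays the pool setup cost, while many paths fan out concurrently.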
Test build #78156 has started for PR 17702 at commit
The test failure may be caused by the env?
retest this please
Test build #78171 has finished for PR 17702 at commit
Force-pushed from a3a3509 to 0ee6943.
Test build #82351 has finished for PR 17702 at commit
retest this please
Test build #82378 has finished for PR 17702 at commit
cc @zsxwing
* Return all paths represented by the wildcard string.
* Follow [[InMemoryFileIndex]].bulkListLeafFiles and reuse the conf.
*/
private def getGlobbedPaths(
Could you move this method to object DataSource?
Done in next commit.
@@ -246,6 +246,18 @@ class SparkHadoopUtil extends Logging {
  if (isGlobPath(pattern)) globPath(fs, pattern) else Seq(pattern)
}

def expandGlobPath(fs: FileSystem, pattern: Path): Seq[String] = {
Please add unit tests for this method.
Added a UT in SparkHadoopUtilSuite.scala.
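The behavior such a unit test would pin down, that only the first wildcard level is expanded and the rest of the pattern is re-appended to each match, can be illustrated with a pure stand-in. `expandFirstGlob` and the in-memory lister below are hypothetical; the real `expandGlobPath` takes a Hadoop `FileSystem` and `Path`.

```scala
// Hypothetical pure model of the expansion: split the pattern on "/", find the
// first component containing a glob character, expand that prefix via the
// lister, and glue the untouched tail back onto every match.
def expandFirstGlob(pattern: String, listMatches: String => Seq[String]): Seq[String] = {
  val parts = pattern.split("/")
  val idx = parts.indexWhere(p => p.exists("*?[]{}".contains(_)))
  if (idx < 0) {
    Seq(pattern)  // no wildcard at all: nothing to expand
  } else {
    val tail = parts.drop(idx + 1).mkString("/")
    listMatches(parts.take(idx + 1).mkString("/"))
      .map(m => if (tail.isEmpty) m else s"$m/$tail")
  }
}
```

A test would then feed a fake lister and assert that the tail glob (e.g. `logs/*.gz`) survives untouched on every expanded prefix.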
Test build #83783 has finished for PR 17702 at commit
Test build #83819 has finished for PR 17702 at commit
retest this please
Test build #83822 has finished for PR 17702 at commit
retest this please
Test build #83837 has finished for PR 17702 at commit
@zsxwing Thanks for your comments, ready for review.
gentle ping @zsxwing
val parallelPartitionDiscoveryParallelism =
  sparkSession.sessionState.conf.parallelPartitionDiscoveryParallelism
val numParallelism = Math.min(paths.size, parallelPartitionDiscoveryParallelism)
val expanded = sparkSession.sparkContext
Why do this using a Spark job, instead of just a local thread pool?
I see this is the same thing done by InMemoryFileIndex, but it feels unnecessarily expensive.
@vanzin Thanks for your reply.
Why do this using a Spark job, instead of just a local thread pool?
As the DFS is generally deployed together with the NodeManagers for better data locality, while in client mode the driver may be in a different region from the cluster, using a Spark job resolves the cross-region interaction problem in our scenario.
NMs are part of YARN, not HDFS.
This code will talk to HDFS's NameNodes; there is generally only one of those you'll be talking to. Latency here is not an issue, throughput is, so I still don't see why this shouldn't be done with a local thread pool.
Yep, I mean YARN and HDFS are always deployed in the same region, but we can't control the driver because it's our customer's machine in client mode, e.g. spark-sql or the shell.
For example, we deploy YARN and HDFS in Beijing, CN, while users run Spark SQL from Shanghai, CN.
Maybe this scenario shouldn't be considered in this patch? What's your opinion, @vanzin?
Where the driver is deployed is not of concern; first because, as I said, this is not about latency, but parallelizing multiple calls to the NN. Second, because if your driver is in a different network, it will have the same latency when talking to the executors as if it would talk directly to the NN, so you're not fixing anything, you're just adding an extra hop (= more latency).
Thanks for your suggestion and the detailed explanation, I'll reimplement this with a local thread pool.
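A minimal sketch of the thread-pool variant, using a plain `ExecutorService` instead of a Spark job. `globInParallel`, the pool size, and the `glob` callback are stand-ins (the real code would call Hadoop's `FileSystem.globStatus`), not the actual follow-up commit.

```scala
import java.util.concurrent.{Callable, Executors}
import scala.jdk.CollectionConverters._

// Hypothetical: run each per-pattern glob on a small local pool. This keeps the
// NameNode round trips concurrent without paying for Spark job scheduling or an
// extra driver-to-executor network hop.
def globInParallel(patterns: Seq[String], poolSize: Int)(
    glob: String => Seq[String]): Seq[String] = {
  val pool = Executors.newFixedThreadPool(math.min(math.max(patterns.size, 1), poolSize))
  try {
    val tasks = patterns.map { p =>
      new Callable[Seq[String]] { override def call(): Seq[String] = glob(p) }
    }
    // invokeAll blocks until every task completes and preserves input order.
    pool.invokeAll(tasks.asJava).asScala.toSeq.flatMap(_.get())
  } finally {
    pool.shutdown()
  }
}
```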
Sorry for the late reply, finished in the next commit.
Force-pushed from ec9c1c1 to dc373ae.
Test build #86517 has finished for PR 17702 at commit
retest this please
Test build #86558 has finished for PR 17702 at commit
retest this please
Test build #86573 has finished for PR 17702 at commit
retest this please
Test build #86583 has finished for PR 17702 at commit
ping @vanzin
@@ -252,6 +252,18 @@ class SparkHadoopUtil extends Logging {
  if (isGlobPath(pattern)) globPath(fs, pattern) else Seq(pattern)
}

def expandGlobPath(fs: FileSystem, pattern: Path): Seq[String] = {
  val arr = pattern.toString.split("/")
We should not parse the path string ourselves; it's too risky, as we may miss some special cases like Windows paths, escape characters, etc. Let's take a look at org.apache.hadoop.fs.Globber and see if we can reuse some parser API there.
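One concrete pitfall with hand-splitting: in Hadoop's glob syntax a backslash-escaped glob character is a literal, so a naive "does this component contain a glob char" check misfires. The two checks below are illustrative stand-ins (Hadoop's own Globber handles this and more, including Windows drive letters):

```scala
// Naive check: any glob char anywhere flags the component as a wildcard.
def naiveHasGlob(component: String): Boolean =
  component.exists("*?[]{}".contains(_))

// Escape-aware check: a backslash makes the following character literal.
def hasUnescapedGlob(component: String): Boolean = {
  var i = 0
  while (i < component.length) {
    component(i) match {
      case '\\' => i += 2                            // skip the escaped character
      case c if "*?[]{}".contains(c) => return true  // a real wildcard
      case _ => i += 1
    }
  }
  false
}
```

For a component like `report\*2017` the naive check reports a wildcard, while the escape-aware check correctly treats the asterisk as literal; this is exactly the class of bug reusing Globber's parsing would avoid.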
Thanks for your reply, I agree with you.
This approach only works if the first-level glob pattern matches a lot of directories, e.g.
My proposal: think about how glob works.
Step by step, we first expand
Maybe we should just fork the Hadoop
Yep, actually in our internal usage we leave the problem to the user, who should use the first wildcard to cover most of the files.
Thanks for your detailed explanation and guidance, I'll reconsider this and open another PR.
What changes were proposed in this pull request?
This PR changes the resolution of glob paths to run in parallel, which resolves complex wildcard paths more quickly. The main changes in detail:
1. Add a new function getGlobbedPaths in DataSource that returns all paths represented by the wildcard, following the logic of InMemoryFileIndex.bulkListLeafFiles and reusing its config.
2. Add a new function expandGlobPath in SparkHadoopUtil to expand the first dir represented by the wildcard.
How was this patch tested?
Existing UT.