
[SPARK-17159][Streaming] optimise check for new files in FileInputDStream #17745

Conversation

steveloughran
Contributor

What changes were proposed in this pull request?

Changes to FileInputDStream to eliminate multiple getFileStatus() calls when scanning directories for new files.

This is a minor optimisation when working with filesystems, but significant when working with object stores, as it eliminates multiple HTTP requests per source file when scanning the store. The current cost is 1-3 requests to probe whether a path is a directory, plus one more to read a file's timestamp. The new patch obtains the file status once and retains it through all subsequent operations, so it never needs to be re-evaluated.

This optimisation saves roughly 3 HTTP requests per source directory and 1 per file, for every directory in the scan list and every file in the scanned directories, irrespective of the age of the directories. At 100+ ms per HEAD request against S3, the speedup is significant even when the scanned directories contain few files.
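As a rough illustration of the arithmetic above, here is a hypothetical cost model in Scala; the per-directory and per-file request counts are the approximations from this description, and all names are invented for the sketch, not Spark or Hadoop APIs.

```scala
// Illustrative cost model only; the request counts are the rough
// figures quoted above, not measured values.
case class ScanCost(dirs: Int, filesPerDir: Int) {
  // Old path: ~3 probe requests per directory plus 1 HEAD per file.
  val oldRequests: Int = dirs * 3 + dirs * filesPerDir
  // New path: roughly one LIST per directory; statuses are reused.
  val newRequests: Int = dirs
}

object ScanCostDemo extends App {
  val c = ScanCost(dirs = 10, filesPerDir = 100)
  // At ~100 ms per request, the old path spends minutes per scan
  // where the new path spends about a second.
  println(s"old=${c.oldRequests} requests, new=${c.newRequests} requests")
}
```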

Before

  1. Two separate list operations, globStatus() to find directories, then listStatus() to scan for new files under directories.
  2. The path filter in the globStatus() operation calls getFileStatus() to probe whether each path is a directory.
  3. getFileStatus() is also used in the listStatus() call to check the timestamp.

Against an object store, getFileStatus() can cost 1-4 HTTPS requests per call (HEAD path, HEAD path + "/", LIST path).

As both list operations return an array or iterator of FileStatus objects, these extra status probes are superfluous: the filtering can take place after the listing has returned.
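The before/after distinction can be sketched as follows, using a stand-in status class rather than Hadoop's FileStatus; all names here are illustrative, not the actual FileInputDStream code.

```scala
// Stand-in for org.apache.hadoop.fs.FileStatus; illustrative only.
case class FStatus(path: String, isDirectory: Boolean, modTime: Long)

object GlobFilterSketch {
  // Before (sketch): the filter re-probes each globbed path with an
  // extra status call, even though the glob had that information.
  def dirsWithReprobe(globbed: Seq[String],
                      getStatus: String => FStatus): Seq[String] =
    globbed.filter(p => getStatus(p).isDirectory) // 1 extra probe per path

  // After (sketch): filter the status objects the glob returned,
  // with no further remote calls.
  def dirsFromStatuses(globbed: Seq[FStatus]): Seq[String] =
    globbed.filter(_.isDirectory).map(_.path)
}
```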

After

  1. The output of globStatus() is filtered to select only directories.
  2. The output of listStatus() is filtered by timestamp.
  3. The special failure case of globStatus() returning no path is handled in the warning text by saying "No Directory to scan" and omitting the full stack trace.
  4. The fileToModTime map is superfluous, and so has been deleted.
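Step 2 above might be sketched like this, again with a stand-in status class; the window bounds minTime/maxTime are hypothetical parameters standing in for the stream's remember window.

```scala
// Stand-in for org.apache.hadoop.fs.FileStatus; illustrative only.
case class FStatus(path: String, isDirectory: Boolean, modTime: Long)

object NewFileSketch {
  // Select new files straight from the listed statuses: the mod time
  // is already in each status, so no separate fileToModTime cache
  // (and no extra getFileStatus call) is needed.
  def newFiles(listed: Seq[FStatus],
               minTime: Long, maxTime: Long): Seq[String] =
    listed.filterNot(_.isDirectory)
      .filter(s => s.modTime >= minTime && s.modTime < maxTime)
      .map(_.path)
}
```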

How was this patch tested?

  1. There is a new test in org.apache.spark.streaming.InputStreamsSuite
  2. I have object store integration tests in an external repository, which have been used to verify functionality and that the number of HTTP requests is reduced when invoked against S3A endpoints.

…ile status requests when querying files. This is a minor optimisation when working with filesystems, but significant when working with object stores.

Change-Id: I269d98902f615818941c88de93a124c65453756e
@SparkQA

SparkQA commented Apr 24, 2017

Test build #76106 has finished for PR 17745 at commit f3ffe1d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@steveloughran
Contributor Author

Due to lack of support/interest, moved to https://github.com/hortonworks-spark/cloud-integration

```scala
// Before: a PathFilter that re-probes each path with getFileStatus()
val directoryFilter = new PathFilter {
  override def accept(path: Path): Boolean = fs.getFileStatus(path).isDirectory
}
val directories = fs.globStatus(directoryPath, directoryFilter).map(_.getPath)

// After: filter the FileStatus entries the glob itself returned
val directories = Option(fs.globStatus(directoryPath)).getOrElse(Array.empty[FileStatus])
```
Member

In this approach, we might be fetching a very large list of files and then filtering out the directories. If the fetched list is too large, that could be a problem.

Member

So, looking at the code of globStatus, it does filter at the end, so doing something like the above might be OK.

Also, globStatus does a listStatus() per child directory, or a getFileStatus() when the input pattern is not a glob; each call to listStatus() makes 3+ HTTP calls and each call to getFileStatus() makes 2 HTTP calls.

Contributor Author

globStatus is flawed; the key limit is that it does a tree walk. It needs to be replaced with an object-store-specific listing. See HADOOP-13371.

The issue with implementing an S3A flat-list-and-filter is that if the wildcard is a few entries up from the child path and there are lots of children, e.g.

s3a://bucket/data/year=201?/month=*/day=*/

then if there are many files under the year/month/day entries, all of them get listed during the filter.

What I think would need to be done is to make the FS configurable to limit the depth at which it switches to bulk listing; here I could say "depth=2", so year=201? would be matched via globbing, while month=* and day=* would be handled by a flat list and filter.

Or maybe just start with making the whole thing optional, and let the caller deal with it.
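The depth idea above might look something like this hypothetical helper; no such option exists in Hadoop, and this just splits a pattern at a configured depth.

```scala
object GlobDepthSketch {
  // Split a glob pattern: the first `depth` components would be
  // expanded by globStatus, the remainder by a flat list-and-filter.
  // Hypothetical helper; not part of any Hadoop or Spark API.
  def splitAtDepth(pattern: String, depth: Int): (String, String) = {
    val parts = pattern.split('/').filter(_.nonEmpty)
    val globPart = parts.take(depth).mkString("/", "/", "")
    val listPart = parts.drop(depth).mkString("/")
    (globPart, listPart)
  }
}
```

For example, with depth = 2 the pattern /data/year=201?/month=*/day=* would split into /data/year=201? (globbed) and month=*/day=* (flat-listed and filtered).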

Anyway, options here

  • Fix the Hadoop-side call: nice and broadly useful.
  • See if Spark can be moved off the globStatus call. That will change matching, but if you provide a new "cloudstore" connector, that could be done, couldn't it?

Member

Yes, having an object-store-specific version of glob will be broadly helpful. In the meantime, this patch seems to save a lot of HTTP requests.

Contributor Author

Still a lot; I think we can do a new one.

The latest version of this code is here; I think it's time to set up a module in Bahir for this.

@ScrapCodes
Member

Can you please reopen this? I would like to discuss whether we can merge it into Spark itself.

@ScrapCodes
Member

It appears there are more people using object stores now than ever. For those who are attached to older versions of Spark Streaming, having this would be good.

Hi @steveloughran, are you planning to work on it, or shall I take it forward from here?
I am contemplating what can be done. So far the plan is to temporarily maintain it as an experimental component in Apache Bahir for as long as it is not merged into mainstream Spark. If you are willing to maintain the component, then please send a pull request to Bahir with just this patch applied.

@steveloughran
Contributor Author

The patch is in the spark cloud integration module; you can take it and try to get it into ASF Spark, provided you also add some credit to me in the patch.

ScrapCodes pushed a commit to ScrapCodes/spark that referenced this pull request Sep 5, 2018
…Object store.

Based on apache#17745. Original work by Steve Loughran.

This is a minimal patch of changes to FileInputDStream to reduce File status requests when querying files.

This is a minor optimisation when working with filesystems, but significant when working with object stores.

Change-Id: I269d98902f615818941c88de93a124c65453756e
asfgit pushed a commit that referenced this pull request Oct 5, 2018
…g against Object store.

## What changes were proposed in this pull request?

Original work by Steve Loughran.
Based on #17745.

This is a minimal patch of changes to FileInputDStream to reduce File status requests when querying files. Each call to file status is 3+ http calls to object store. This patch eliminates the need for it, by using FileStatus objects.

This is a minor optimisation when working with filesystems, but significant when working with object stores.

## How was this patch tested?

Tests included. Existing tests pass.

Closes #22339 from ScrapCodes/PR_17745.

Lead-authored-by: Prashant Sharma <prashant@apache.org>
Co-authored-by: Steve Loughran <stevel@hortonworks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…g against Object store.

zzcclp added a commit to zzcclp/spark that referenced this pull request Sep 20, 2019
…reaming against Object store. apache#22339
