[SPARK-17159][STREAM] Significant speed up for running spark streaming against Object store. #22339

ScrapCodes · 2018-09-05T09:05:01Z

What changes were proposed in this pull request?

Original work by Steve Loughran.
Based on #17745.

This is a minimal patch to FileInputDStream that reduces the number of file status requests made when querying files. Each file status call costs 3+ HTTP calls against an object store. This patch eliminates the need for those calls by using FileStatus objects.

This is a minor optimisation when working with filesystems, but significant when working with object stores.
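
A minimal sketch of the idea, not the patch itself: a directory listing already carries each entry's metadata, so filtering on the listed entries avoids a separate status lookup per path. `Entry`, `FakeStore`, and the request counter are hypothetical stand-ins for Hadoop's `FileStatus` and `FileSystem`, and the one-request-per-lookup cost is a simplifying assumption.

```scala
// Hypothetical stand-ins for FileStatus / FileSystem; not the Hadoop API.
final case class Entry(path: String, isDirectory: Boolean, modTime: Long)

final class FakeStore(entries: Seq[Entry]) {
  var requests: Int = 0 // counts simulated remote calls

  // One remote call returns every entry together with its metadata.
  def list(): Seq[Entry] = { requests += 1; entries }

  // One remote call per path (assumed; a real store may need several).
  def status(path: String): Entry = {
    requests += 1
    entries.find(_.path == path).get
  }
}

object ListingDemo {
  val sample: Seq[Entry] = Seq(
    Entry("in/a", isDirectory = false, modTime = 100L),
    Entry("in/b", isDirectory = false, modTime = 200L),
    Entry("in/sub", isDirectory = true, modTime = 150L))

  // Before: list, keep only the paths, then look each one up again.
  def perPathLookups(store: FakeStore): Seq[String] =
    store.list().map(_.path).filter(p => !store.status(p).isDirectory)

  // After: filter directly on the metadata returned by the listing.
  def reuseListing(store: FakeStore): Seq[String] =
    store.list().filter(!_.isDirectory).map(_.path)
}
```

In this toy model both functions return the same paths, but the first issues one request per entry on top of the listing, while the second issues only the listing request.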

How was this patch tested?

Tests included. Existing tests pass.

…Object store. Based on apache#17745. Original work by Steve Loughran. This is a minimal patch of changes to FileInputDStream to reduce File status requests when querying files. This is a minor optimisation when working with filesystems, but significant when working with object stores. Change-Id: I269d98902f615818941c88de93a124c65453756e

SparkQA · 2018-09-05T10:31:31Z

Test build #95706 has finished for PR 22339 at commit 2fba9af.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2018-09-06T04:04:19Z

Hi, @ScrapCodes. Could you do the following?

Update the title to [SPARK-17159][SS]...
Remove Please review http://spark.apache.org/contributing.html .... from PR description
Share the numbers because the PR title has Significant speed up

dongjoon-hyun · 2018-09-13T16:54:32Z

Retest this please.

SparkQA · 2018-09-13T18:26:46Z

Test build #96047 has finished for PR 22339 at commit 2fba9af.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

ScrapCodes · 2018-09-26T07:29:18Z

For numbers: while testing against an object store with 50 files/dirs, it took 130 REST requests for 2 batches to complete without this patch, and 56 REST requests with it. So the number of REST calls is reduced, and this translates to a speedup. How much speedup depends on the number of files, but for the particular test I ran, it was 2x.

ScrapCodes · 2018-09-28T06:29:54Z

Hi @srowen, would you like to take a look? Is there anything I can do, if this patch is missing something? I have tested it thoroughly against an object store.

srowen

If @steveloughran is into it, I think this is OK. I see why it's faster.

srowen · 2018-09-28T14:15:00Z

streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala

      val newFiles = directories.flatMap(dir =>
-        fs.listStatus(dir, newFileFilter).map(_.getPath.toString))
+        fs.listStatus(dir)
+            .filter(isNewFile(_, currentTime, modTimeIgnoreThreshold))


Nit: I think the indent is too deep here?

srowen · 2018-09-28T14:15:18Z

streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala

      val timeTaken = clock.getTimeMillis() - lastNewFileFindingTime
-      logInfo("Finding new files took " + timeTaken + " ms")
-      logDebug("# cached file times = " + fileToModTime.size)
+      logInfo(s"Finding new files took $timeTaken ms")


I wonder if this should be a debug statement. I don't feel strongly about it.

It was originally @ info, so if it had filled up logs too much there'd have been complaints. What's important is that the time to scan is printed, either @ info or debug, so someone can see what's happening. What probably does need logging @ warn is when the time to scan is greater than the window, or just getting close to it.
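
One way to express that suggestion, as a sketch only: choose the log level from the ratio of scan time to batch interval. `scanLogLevel`, the `Level` type, and the 80% "getting close" threshold are assumptions for illustration; none of them appear in the patch.

```scala
object ScanLogging {
  sealed trait Level
  case object Info extends Level
  case object Warn extends Level

  // warnFraction is a hypothetical threshold for "getting close" to the window.
  def scanLogLevel(
      timeTakenMs: Long,
      batchIntervalMs: Long,
      warnFraction: Double = 0.8): Level =
    if (timeTakenMs >= batchIntervalMs * warnFraction) Warn else Info

  // Builds the message that would be logged at the chosen level.
  def message(timeTakenMs: Long, batchIntervalMs: Long): String =
    scanLogLevel(timeTakenMs, batchIntervalMs) match {
      case Warn =>
        s"Finding new files took $timeTakenMs ms, close to or over the " +
          s"$batchIntervalMs ms batch interval"
      case Info =>
        s"Finding new files took $timeTakenMs ms"
    }
}
```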

srowen · 2018-09-28T14:15:34Z

streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala

-        override def accept(path: Path): Boolean = fs.getFileStatus(path).isDirectory
-      }
-      val directories = fs.globStatus(directoryPath, directoryFilter).map(_.getPath)
+      val directories = Option(fs.globStatus(directoryPath)).getOrElse(Array.empty[FileStatus])


I guess the .getOrElse could come at the end, but it hardly matters.
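
For context, `Option(x)` evaluates to `None` when `x` is `null`, so wrapping a call that may return `null` guards the traversal that follows. A small sketch, with a hypothetical `maybeNullGlob` standing in for `fs.globStatus`, showing both placements of `getOrElse`:

```scala
object NullSafeGlob {
  // Hypothetical stand-in for a call that may return null instead of an empty array.
  def maybeNullGlob(found: Boolean): Array[String] =
    if (found) Array("dir1", "dir2") else null

  // getOrElse applied immediately, as in the diff above.
  def early(found: Boolean): Seq[String] =
    Option(maybeNullGlob(found)).getOrElse(Array.empty[String]).toSeq.map(_.toUpperCase)

  // getOrElse applied at the end, after mapping inside the Option.
  def late(found: Boolean): Seq[String] =
    Option(maybeNullGlob(found)).map(_.toSeq.map(_.toUpperCase)).getOrElse(Seq.empty[String])
}
```

Both forms yield the same result; the difference is only where the empty default enters the chain.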

steveloughran · 2018-09-28T17:24:53Z

Why the speedups? They come from that glob filter calling getFileStatus() on every entry, which is 1-3 HTTP requests and a few hundred millis per call, when instead that can be handled later. As a result, the more files you have in a path, the more time the scan takes, until eventually the scan time > window interval, at which point your code is dead.

The other stuff is simply associated optimisations.
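
As a rough worked model of that failure mode (assumed numbers, not measurements): if each entry costs a fixed number of requests at a fixed latency, scan time grows linearly with the entry count and eventually exceeds the window. `requestsPerEntry`, `latencyMs`, and the window length are illustrative parameters, not values from the patch.

```scala
object ScanCostModel {
  // Sequential scan time under the assumed per-entry cost.
  def scanTimeMs(entries: Int, requestsPerEntry: Int, latencyMs: Long): Long =
    entries.toLong * requestsPerEntry * latencyMs

  // Largest entry count whose scan still fits inside the window.
  def maxEntries(windowMs: Long, requestsPerEntry: Int, latencyMs: Long): Long =
    windowMs / (requestsPerEntry * latencyMs)
}
```

For instance, assuming 2 requests per entry at 100 ms each and a 10-second window, `maxEntries(10000L, 2, 100L)` gives 50 entries; one more entry pushes `scanTimeMs` past the window.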

Now, I'm obviously happy with this, especially as I seem to be getting credit. And it will help speed up working with any store. But I need to warn people: it is not sufficient.

The key problem here is: files uploaded by S3 multipart upload get a timestamp of when the upload began, not when it finished, yet only become visible at the end of the upload. If a caller starts an upload in window t, and doesn't complete it until window t+1, then it may get missed.

There's not much which can be done here, except documenting the risk.
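
A simplified model of that race, as a sketch: a scanner that selects files by reported timestamp within the current window never picks up a file whose timestamp falls in window t but which only becomes visible in window t+1. `Upload`, its fields, and the window-based selection rule are hypothetical simplifications, not the actual `FileInputDStream` logic.

```scala
object MultipartRace {
  // startedAt is the timestamp the store reports; completedAt is when the file appears.
  final case class Upload(path: String, startedAt: Long, completedAt: Long)

  // Files visible at scan time `now` whose reported timestamp lies in (windowStart, now].
  def scan(uploads: Seq[Upload], windowStart: Long, now: Long): Seq[String] =
    uploads
      .filter(u => u.completedAt <= now)                            // visible yet?
      .filter(u => u.startedAt > windowStart && u.startedAt <= now) // timestamp in window?
      .map(_.path)
}
```

With an upload that starts at time 5 and completes at time 15, a scan over (0, 10] cannot see it yet, and a scan over (10, 20] sees it but rejects its timestamp, so in this model it is never selected.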

What is a good solution? It'd be to use the cloud infrastructure provider's own event notification mechanism and subscribe to changes in a store. AWS, Azure and GCS all offer something like this.

There's a home for the S3 one of those in spark-kinesis, perhaps. I've not got free time to work on it, I'm afraid, but if someone starts coding it, list me on the PR and I'll take a look.

srowen · 2018-09-28T17:33:44Z

Yeah I agree, I was saying I do think it will speed things up. If it's a non-trivial win it's worthwhile even if it isn't the last optimization here. Is there any downside to this?

steveloughran · 2018-09-28T18:27:11Z

No, no cost penalties. Slightly lower namenode load too. If you had many, many Spark Streaming clients scanning directories, HDFS ops teams would eventually get upset. This will postpone the day.

srowen

@ScrapCodes looks good to me except perhaps the tiny style comment above, and possibly the log statement question

dongjoon-hyun · 2018-10-02T06:42:21Z

Retest this please.

SparkQA · 2018-10-02T07:05:02Z

Test build #96843 has finished for PR 22339 at commit 2fba9af.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-10-03T04:36:50Z

Test build #96886 has finished for PR 22339 at commit dab9bf3.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-10-03T05:47:30Z

Test build #96885 has finished for PR 22339 at commit 542872c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-10-03T06:07:25Z

Test build #96887 has finished for PR 22339 at commit d91c815.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2018-10-05T01:22:27Z

Merged to master

ScrapCodes · 2018-10-05T06:27:53Z

Thank you @srowen and @steveloughran.

…g against Object store.

## What changes were proposed in this pull request?

Original work by Steve Loughran.
Based on apache#17745.

This is a minimal patch of changes to FileInputDStream to reduce File status requests when querying files. Each call to file status is 3+ http calls to object store. This patch eliminates the need for it, by using FileStatus objects.

This is a minor optimisation when working with filesystems, but significant when working with object stores.

## How was this patch tested?

Tests included. Existing tests pass.

Closes apache#22339 from ScrapCodes/PR_17745.

Lead-authored-by: Prashant Sharma <prashant@apache.org>
Co-authored-by: Steve Loughran <stevel@hortonworks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>


### What changes were proposed in this pull request?

The PR aims to delete `TimeStampedHashMap` and its UT.

### Why are the changes needed?

During PR #43578, we found that the class `TimeStampedHashMap` is no longer in use. Based on the suggestion in #43578 (comment), we have removed it.

- The class `TimeStampedHashMap` was first introduced in b18d708#diff-77b12178a7036c71135074c6ddf7d659e5a69906264d5e3061087e4352e304ed.
- After #22339, the class `TimeStampedHashMap` is only used in the UT `TimeStampedHashMapSuite`. So, after Spark 3.0, this data structure has not been used by any production code of Spark.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43633 from panbingkun/remove_TimeStampedHashMap.

Authored-by: panbingkun <pbk1982@gmail.com>
Signed-off-by: yangjie01 <yangjie01@baidu.com>

ScrapCodes changed the title ~~SPARK-17159 Significant speed up for running spark streaming against Object store.~~ [SPARK-17159][SS] Significant speed up for running spark streaming against Object store. Sep 6, 2018

ScrapCodes changed the title ~~[SPARK-17159][SS] Significant speed up for running spark streaming against Object store.~~ [SPARK-17159][STREAM] Significant speed up for running spark streaming against Object store. Sep 7, 2018

srowen approved these changes Sep 28, 2018


srowen reviewed Oct 1, 2018


ScrapCodes added 2 commits October 3, 2018 09:48

Fixed indents.

542872c

more code feedback.

dab9bf3

fixed compile errors.

d91c815

srowen approved these changes Oct 3, 2018


asfgit closed this in 3ae4f07 Oct 5, 2018

ScrapCodes deleted the PR_17745 branch October 5, 2018 05:49

LuciferYang mentioned this pull request Nov 1, 2023

[SPARK-45688][SPARK-45693][CORE] Clean up the deprecated API usage related to MapOps & Fix method += in trait Growable is deprecated #43578

Closed

panbingkun mentioned this pull request Nov 2, 2023

[SPARK-45767][CORE] Delete TimeStampedHashMap and its UT #43633

Closed
