[SPARK-30900][SS] FileStreamSource: Avoid reading compact metadata log twice if the query restarts from compact batch #27649
Conversation
@@ -122,8 +123,35 @@ class FileStreamSourceLog(
  }
  batches
}
def restore(): Array[FileEntry] = {
To avoid touching the existing semantics of allFiles(), I simply added a new method to cover the new semantics. I'll override allFiles() instead if that's preferred.
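To make the discussed shape concrete, here is a minimal, self-contained sketch of the idea: restore() returns the same result as allFiles() but additionally warms a cache keyed by the latest compact batch id. The names FileEntry and fileEntryCache mirror the PR; the class name, plain Scala collections, and the compact-batch arithmetic are stand-ins, not the actual Spark implementation.

```scala
import scala.collection.mutable

// Stand-in for Spark's FileEntry (path, timestamp, batchId).
case class FileEntry(path: String, timestamp: Long, batchId: Long)

class FileStreamSourceLogSketch(compactInterval: Int) {
  // Insertion-ordered batch log and a simple map standing in for fileEntryCache.
  private val log = mutable.LinkedHashMap.empty[Long, Array[FileEntry]]
  val fileEntryCache = mutable.Map.empty[Long, Array[FileEntry]]

  def add(batchId: Long, entries: Array[FileEntry]): Unit = log(batchId) = entries

  // Existing semantics: return every logged entry.
  def allFiles(): Array[FileEntry] = log.values.flatten.toArray

  // With compact interval N, batches N-1, 2N-1, ... are compact batches.
  private def isCompactBatch(batchId: Long): Boolean =
    (batchId + 1) % compactInterval == 0

  // New method: same result as allFiles(), but caches the latest compact
  // batch's entries so a restarted query doesn't re-read that file.
  def restore(): Array[FileEntry] = {
    val files = allFiles()
    log.keys.filter(isCompactBatch).lastOption.foreach { compactBatchId =>
      fileEntryCache(compactBatchId) = log(compactBatchId)
    }
    files
  }
}
```

Keeping restore() separate (rather than overriding allFiles()) preserves the old method's contract for callers that don't care about the cache.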
Test build #118720 has finished for PR 27649 at commit
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSourceLog.scala Lines 92 to 95 in fc4e56a
We describe reading the latest compacted log file as "quite time-consuming", and this patch addresses another case of reading that file, so it would bring an actual benefit.
val allFiles = metadata.allFiles()

// batch 4 is a compact batch which logs would be cached in fileEntryCache
fileEntryCache.containsKey(4)
Don't we need some kind of assertion here?
lol I have no idea why I missed assert here. Nice finding. Will add.
Test build #120409 has finished for PR 27649 at commit
val allFiles = metadata.allFiles()

// batch 4 is a compact batch which logs would be cached in fileEntryCache
assert(fileEntryCache.containsKey(4L))
I think it would be good to test that only batch 4 was added:

(0 to 3).foreach { batchId =>
  assert(!fileEntryCache.containsKey(batchId))
}
// restore() will restore the logs for the latest compact batch into file entry cache
assert(metadata2.restore() === allFiles)
assert(fileEntryCache2.containsKey(4L))
Similar here.
files.lastOption.foreach { lastEntry =>
  val latestBatchId = lastEntry.batchId
  val latestCompactedBatchId = getAllValidBatches(latestBatchId, compactInterval)(0)
  if (latestCompactedBatchId > 0 &&
Seems like it's not working when one sets spark.sql.streaming.fileSource.log.compactInterval to 1.
That's just there to prune the case where it may not help much, but yeah, let's make it simple. It won't hurt either way.
(5 to 5 + FileStreamSourceLog.PREV_NUM_BATCHES_TO_READ_IN_RESTORE).foreach { batchId =>
  metadata2.add(batchId, createEntries(batchId, 100))
}
val allFiles2 = metadata2.allFiles()
This can be inlined, right?
Not sure I understand what "inline" means here.
I've only seen this val used in one place.
// if the latest batch is too far from latest compact batch, because it's unlikely Spark
// will request the batch for the start point.
assert(metadata2.restore() === allFiles2)
assert(!fileEntryCache3.containsKey(4L))
Similar here.
val latestBatchId = lastEntry.batchId
val latestCompactedBatchId = getAllValidBatches(latestBatchId, compactInterval)(0)
if (latestCompactedBatchId > 0 &&
    (latestBatchId - latestCompactedBatchId) < PREV_NUM_BATCHES_TO_READ_IN_RESTORE) {
Maybe a comment would be good explaining why this heuristic is useful.
I thought I forgot to explain it, but it looks like I already did:

// It doesn't know about offset / commit metadata in checkpoint so doesn't know which exactly
// batch to start from, but in practice, only couple of latest batches are candidates to
// be started. We leverage the fact to skip calculation if possible.

"only couple of latest batches" is the threshold; I heuristically took 2 here.
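The heuristic under discussion can be sketched as below. The constant name mirrors the PR snippet, the value 2 is the "couple of latest batches" threshold mentioned here, and the object and method names are illustrative only.

```scala
object RestoreHeuristic {
  // The "couple of latest batches" threshold; the PR heuristically picks 2.
  val PREV_NUM_BATCHES_TO_READ_IN_RESTORE = 2

  // Warm the file entry cache only when the latest batch is close enough to
  // the latest compact batch, since restarts almost always resume near the
  // tail of the log. Mirrors the condition quoted from the diff above.
  def shouldCacheCompactBatch(latestBatchId: Long, latestCompactedBatchId: Long): Boolean =
    latestCompactedBatchId > 0 &&
      (latestBatchId - latestCompactedBatchId) < PREV_NUM_BATCHES_TO_READ_IN_RESTORE
}
```

With this threshold, a restart one batch past the latest compact batch warms the cache, while a restart five batches past it skips the extra work.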
Thanks for reviewing. I believe I've addressed the review comments; please take a look again. Thanks!
Test build #120466 has finished for PR 27649 at commit
Seems unrelated.
retest this please
LGTM.
Test build #120478 has finished for PR 27649 at commit
Test build #120480 has finished for PR 27649 at commit
retest this, please
Test build #121230 has finished for PR 27649 at commit
retest this, please
Test build #122130 has finished for PR 27649 at commit
retest this, please
Test build #122151 has finished for PR 27649 at commit
retest this, please
Test build #123142 has finished for PR 27649 at commit
retest this, please
Test build #123210 has finished for PR 27649 at commit
Test build #123220 has finished for PR 27649 at commit
retest this, please
Test build #123988 has finished for PR 27649 at commit
retest this, please
Test build #124481 has finished for PR 27649 at commit
retest this, please
Test build #125727 has finished for PR 27649 at commit
retest this, please
Test build #127535 has finished for PR 27649 at commit
retest this, please
Test build #127550 has finished for PR 27649 at commit
retest this, please
Test build #128676 has finished for PR 27649 at commit
cc. @viirya @xuanyuanking as well, to widen the pool of potential reviewers.
retest this, please
Kubernetes integration test starting
Kubernetes integration test status success
Test build #130221 has finished for PR 27649 at commit
retest this, please
cc. @tdas @zsxwing @gaborgsomogyi @viirya @xuanyuanking Just a final reminder: I'll merge this early next week if there are no further comments, per the feedback from the dev@ mailing list.
Test build #131891 has finished for PR 27649 at commit
There has been no feedback so far. I'll merge once Jenkins is happy with the new build.
retest this, please
I see the Jenkins migration is happening. I'll kick off the GitHub Actions build instead.
GA passed. Merging to master.
What changes were proposed in this pull request?
This patch addresses the case where the compact metadata file is read twice in FileStreamSource when a query restarts.
When restarting a query, there is a case where the query starts from a compaction batch and that batch has a source metadata file to read. One such case is when the previous run succeeded in reading from the inputs but did not finalize the batch for various reasons.
The patch finds the latest compaction batch when restoring from the metadata log and puts the entries for that batch into the file entry cache, which avoids reading the compact batch file twice.
FileStreamSourceLog doesn't know about the offset / commit metadata in the checkpoint, so it doesn't know exactly which batch to start from; in practice, though, only a couple of the latest batches are candidates to be started from when a query restarts. This patch leverages that fact to skip the calculation when possible.
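For reference, the "latest compact batch for a given batch" arithmetic that getAllValidBatches(latestBatchId, compactInterval)(0) stands for in the snippets above can be sketched as follows. This is an assumed reconstruction of that semantics, not the actual Spark helper: with compact interval N, batches N-1, 2N-1, ... are compact batches, and -1 means no compact batch exists yet.

```scala
object CompactBatchMath {
  // Returns the id of the most recent compact batch at or before latestBatchId,
  // or -1 if no compact batch has happened yet. Assumed semantics only.
  def latestCompactBatchId(latestBatchId: Long, compactInterval: Int): Long =
    (latestBatchId + 1) / compactInterval * compactInterval - 1
}
```

For example, with a compact interval of 5, batch 4 is the latest compact batch for any latest batch id from 4 through 8, which is exactly the window the restore-time heuristic reasons about.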
Why are the changes needed?
Spark incurs the unnecessary cost of reading the compact metadata file twice in some cases, which may not be negligible when the query has processed a huge number of files so far.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
New UT.