[Spark-18187] [SQL] CompactibleFileStreamLog should not use "compactInterval" directly with user setting. #15852

Closed
wants to merge 24 commits

Conversation

Contributor

@tcondie tcondie commented Nov 11, 2016

What changes were proposed in this pull request?

CompactibleFileStreamLog relies on "compactInterval" to detect a compaction batch. If "compactInterval" is reset by the user, CompactibleFileStreamLog will return wrong answers, resulting in data loss. This PR provides a way to check the validity of "compactInterval" and to calculate an appropriate value.
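
For context, compaction batches are detected with a simple modulo check on the batch id, so a changed interval silently shifts which batches are treated as compactions. A minimal sketch of that check (simplified from CompactibleFileStreamLog; the data-loss walkthrough in the comments is illustrative):

```scala
// Batch ids are 0-based, so with interval 10 the compaction batches are 9, 19, 29, ...
def isCompactionBatch(batchId: Long, compactInterval: Int): Boolean =
  (batchId + 1) % compactInterval == 0

// If batch 9 was compacted with interval 10 and the stream restarts with
// interval 4, the log now expects compactions at batches 3, 7, 11, ... and no
// longer treats 9.compact as a compaction file, so entries folded into it can
// be lost.
```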

How was this patch tested?

When restarting a stream, we change 'spark.sql.streaming.fileSource.log.compactInterval' to a value different from the former one.

The primary solution to this issue was given by @uncleGen.
Added extensions include an additional metadata field in OffsetSeq and changes to the CompactibleFileStreamLog APIs. @zsxwing

Member

srowen commented Nov 11, 2016

(Update the title please; see others for format)


SparkQA commented Nov 11, 2016

Test build #68530 has finished for PR 15852 at commit 6901eac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class OffsetSeq(offsets: Seq[Option[Offset]], metadata: Option[String] = None)


SparkQA commented Nov 11, 2016

Test build #68533 has finished for PR 15852 at commit 96d2dfe.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tcondie tcondie changed the title Spark 18187 Spark-18187 [SQL] CompactibleFileStreamLog should not use "compactInterval" directly with user setting. Nov 11, 2016

SparkQA commented Nov 11, 2016

Test build #68538 has finished for PR 15852 at commit d216acf.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Nov 11, 2016

Test build #68539 has finished for PR 15852 at commit 4f8ff16.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Nov 12, 2016

Test build #68541 has finished for PR 15852 at commit 7a71be8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

lw-lin commented Nov 14, 2016

@uncleGen @tcondie thanks for working on this.

My major concern is that this approach might disallow changing the compactInterval once there are at least two compact files. Should we disallow it? Or, as an alternative, what do you think of the approach taken in #15828?


SparkQA commented Nov 15, 2016

Test build #68639 has finished for PR 15852 at commit efa7022.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// 2. If there are two or more '.compact' files, we use the interval between the batch
// ids carrying the '.compact' suffix as compactInterval. It is unclear whether this
// case will ever happen in the current code, since only the latest '.compact' file is
// retained, i.e., the others are garbage collected.
Contributor

@uncleGen uncleGen Nov 15, 2016

Log garbage collection is controlled by 'spark.sql.streaming.fileSource.log.deletion'. When it is 'false', there may be two or more '.compact' files (for example, with a compact interval of 5 and deletion disabled, both 9.compact and 14.compact remain on disk).

Member

Right. Please update the comment accordingly.


SparkQA commented Nov 15, 2016

Test build #68644 has finished for PR 15852 at commit 24e3617.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// Find the first divisor >= default compact interval
def properDivisors(n: Int, min: Int) =
  (min to n / 2).filter(i => n % i == 0) :+ n

Contributor

@uncleGen uncleGen Nov 15, 2016

'to' => 'until' ?

Member

I would use the following code to avoid materializing the full number sequence:

def properDivisors(n: Int, min: Int) = (min to n/2).view.filter(n % _ == 0) :+ n

interval = properDivisors(latestCompactBatchId + 1, defaultCompactInterval).head
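
A quick illustration of why the lazy view matters (hypothetical values, not from the PR):

```scala
def properDivisors(n: Int, min: Int) =
  (min to n / 2).view.filter(n % _ == 0) :+ n

// n = 100 (latestCompactBatchId + 1), min = 10 (defaultCompactInterval).
// Because the filter is lazy, `.head` only tests candidates until the first
// divisor is found (here 10) instead of scanning all of 10..50.
val interval = properDivisors(100, 10).head // 10
```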

Contributor

LGTM overall. If this is accepted, then I will close #15827.

Member

@zsxwing zsxwing left a comment

The approach looks good to me. I just left some style suggestions.

protected def compactInterval: Int
protected def defaultCompactInterval: Int

protected final lazy val compactInterval: Int = {
Member

nit: please change protected to private since this should not be used by subclasses now.

Contributor Author

FileStreamSourceLog uses compactInterval in multiple places. Please advise?

Member

FileStreamSourceLog uses compactInterval in multiple places. Please advise?

Sorry. My bad. Didn't notice that.

@@ -38,8 +38,9 @@ class FileStreamSourceLog(
import CompactibleFileStreamLog._

// Configurations about metadata compaction
protected override val compactInterval =
protected override def defaultCompactInterval: Int =
Member

nit: def => val

// 2. If there are two or more '.compact' files, we use the interval between the batch
// ids carrying the '.compact' suffix as compactInterval. It is unclear whether this
// case will ever happen in the current code, since only the latest '.compact' file is
// retained, i.e., the others are garbage collected.
Member

Right. Please update the comment accordingly.

val latestCompactBatchId = compactibleBatchIds(0)
val previousCompactBatchId = compactibleBatchIds(1)
interval = (latestCompactBatchId - previousCompactBatchId).toInt
logInfo(s"Compact interval case 2 = $interval")
Member

nit: Please use a better message, like

Set the compact interval to XXX [the previous two batch Ids: XXX, XXX]

Contributor Author

This was debugging info that I was going to remove.

Member

I think it's better to keep this one. It is only output once and is pretty helpful when a bug shows up here.

def verify(execution: StreamExecution)
(batchId: Long, expectedBatches: Int): Boolean = {
def verify(execution: StreamExecution, batchId: Long,
expectedBatches: Int, expectedCompactInterval: Int): Boolean = {
Member

nit: the correct style should be

def verify(
    execution: StreamExecution,
    batchId: Long,
    expectedBatches: Int,
    expectedCompactInterval: Int): Boolean = {

@@ -161,7 +161,8 @@ trait StreamTest extends QueryTest with SharedSQLContext with Timeouts {
/** Starts the stream, resuming if data has already been processed. It must not be running. */
case class StartStream(
trigger: Trigger = ProcessingTime(0),
triggerClock: Clock = new SystemClock)
triggerClock: Clock = new SystemClock,
pairs: mutable.Map[String, String] = mutable.Map.empty)
Member

nit: use Map[String, String] instead since it won't be changed.

Member

nit: it's better to rename pairs to additionalConfs for readability.

verify(currentStream == null, "stream already running")
verify(triggerClock.isInstanceOf[SystemClock]
    || triggerClock.isInstanceOf[StreamManualClock],
  "Use either SystemClock or StreamManualClock to start the stream")
if (triggerClock.isInstanceOf[StreamManualClock]) {
  manualClockExpectedTime = triggerClock.asInstanceOf[StreamManualClock].getTimeMillis()
}

pairs.foreach(pair => spark.conf.set(pair._1, pair._2))
Member

You need to also change the confs back at the end of this method to avoid affecting other tests sharing the same SparkSession.
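
A minimal sketch of that save-and-restore pattern (additionalConfs follows the rename suggested above; resetConfValues matches the variable the PR later introduces, the rest is illustrative):

```scala
// Remember each conf's previous value (None if it was unset) before
// overwriting it, then restore everything when the method finishes so other
// tests sharing the SparkSession are unaffected.
val resetConfValues: Map[String, Option[String]] =
  additionalConfs.map { case (k, _) => k -> spark.conf.getOption(k) }
additionalConfs.foreach { case (k, v) => spark.conf.set(k, v) }
try {
  // ... run the stream actions ...
} finally {
  resetConfValues.foreach {
    case (key, Some(value)) => spark.conf.set(key, value)
    case (key, None) => spark.conf.unset(key)
  }
}
```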

logInfo(s"Compact interval case 2 = $interval")
} else if (compactibleBatchIds.length == 1) {
// Case 3
val latestCompactBatchId = compactibleBatchIds(0).toInt
Member

Could you pull this branch into a method in object CompactibleFileStreamLog so that it's easy to write tests for this complicated logic? And please add tests as well.
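
A sketch of the requested extraction, assembled from the case analysis quoted in this review (the helper name deriveCompactInterval and its exact shape are illustrative, not a quote of the final patch):

```scala
object CompactibleFileStreamLog {
  /**
   * Derive a compact interval that is compatible with both the default
   * interval and the latest compaction batch already present in the log.
   * The result must evenly divide latestCompactBatchId + 1 so that existing
   * '.compact' files remain on compaction boundaries.
   */
  def deriveCompactInterval(defaultInterval: Int, latestCompactBatchId: Int): Int = {
    if (latestCompactBatchId + 1 <= defaultInterval) {
      latestCompactBatchId + 1
    } else if (defaultInterval < (latestCompactBatchId + 1) / 2) {
      // Find the first divisor >= the default compact interval
      def properDivisors(min: Int, n: Int) =
        (min to n / 2).view.filter(i => n % i == 0) :+ n
      properDivisors(defaultInterval, latestCompactBatchId + 1).head
    } else {
      // The default interval is larger than any divisor other than n itself
      latestCompactBatchId + 1
    }
  }
}
```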

// The default compact interval is greater than any divisor other than the latest compact id
interval = latestCompactBatchId + 1
}
logInfo(s"Compact interval case 3 = $interval")
Member

nit: it's better to include all the info in the log, like

Set the compact interval to XXX [latestCompactBatchId: XXX, defaultCompactInterval: XXX]

// Find the first divisor >= default compact interval
def properDivisors(n: Int, min: Int) =
  (min to n / 2).filter(i => n % i == 0) :+ n

Member

I would use the following code to avoid materializing the full number sequence:

def properDivisors(n: Int, min: Int) = (min to n/2).view.filter(n % _ == 0) :+ n

interval = properDivisors(latestCompactBatchId + 1, defaultCompactInterval).head


SparkQA commented Nov 17, 2016

Test build #68795 has finished for PR 15852 at commit 82adb39.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -932,26 +940,28 @@ class FileStreamSourceSuite extends FileStreamSourceTest {
) {
val fileStream = createFileStream("text", src.getCanonicalPath)
val filtered = fileStream.filter($"value" contains "keep")
val updateConf = new mutable.HashMap[String, String]()
Member

nit: Use Map(SQLConf.FILE_SOURCE_LOG_COMPACT_INTERVAL.key -> "5")

} else if (defaultInterval < (latestCompactBatchId + 1) / 2) {
  // Find the first divisor >= default compact interval
  def properDivisors(min: Int, n: Int) =
    (min to n / 2).filter(i => n % i == 0) :+ n
Member

nit: Use (min to n/2).view.filter(i => n % i == 0) :+ n so that evaluation stops once the first matching element is found.

Member

@zsxwing zsxwing left a comment

Looks good overall. Just some nits.

})
} finally {
  // Roll back previous configuration values
  resetConfValues.foreach {
Member

The reset logic should be at the end of the method. Otherwise, it will change confs while a query is running.

import org.apache.spark.SparkFunSuite
import org.apache.spark.sql.test.SharedSQLContext

class CompactibleFileStreamLogSuite extends SparkFunSuite with SharedSQLContext {
Member

nit: SharedSQLContext is not needed. Please remove it to avoid creating a SQLContext.
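
With the derivation logic in the companion object, the suite can extend SparkFunSuite alone. A minimal sketch (assuming the deriveCompactInterval helper sketched above; the test values are illustrative):

```scala
import org.apache.spark.SparkFunSuite

class CompactibleFileStreamLogSuite extends SparkFunSuite {

  import CompactibleFileStreamLog._

  test("deriveCompactInterval") {
    // Fresh log: latestCompactBatchId + 1 (= 4) does not exceed the default,
    // so 4 itself is a valid interval.
    assert(deriveCompactInterval(defaultInterval = 10, latestCompactBatchId = 3) === 4)
    // Compactions already exist at interval 5 (batches 4 and 9): the smallest
    // divisor of 10 that is >= the new default of 3 is 5.
    assert(deriveCompactInterval(defaultInterval = 3, latestCompactBatchId = 9) === 5)
  }
}
```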


properDivisors(defaultInterval, latestCompactBatchId + 1).head
}
else {
Member

nit: move else { to the above line.


SparkQA commented Nov 18, 2016

Test build #68807 has finished for PR 15852 at commit 50207b5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Nov 18, 2016

Test build #68808 has finished for PR 15852 at commit 6211faa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Nov 18, 2016

Test build #68809 has finished for PR 15852 at commit 13abd7d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Nov 18, 2016

Test build #68816 has finished for PR 15852 at commit dbd8b67.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Nov 18, 2016

Test build #68827 has finished for PR 15852 at commit 6537e7c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tcondie tcondie changed the title Spark-18187 [SQL] CompactibleFileStreamLog should not use "compactInterval" directly with user setting. [Spark-18187] [SQL] CompactibleFileStreamLog should not use "compactInterval" directly with user setting. Nov 18, 2016
Member

zsxwing commented Nov 18, 2016

LGTM. Thanks! Merging to master and 2.1.

asfgit pushed a commit that referenced this pull request Nov 18, 2016
…terval" direcly with user setting.

## What changes were proposed in this pull request?
CompactibleFileStreamLog relies on "compactInterval" to detect a compaction batch. If "compactInterval" is reset by the user, CompactibleFileStreamLog will return wrong answers, resulting in data loss. This PR provides a way to check the validity of "compactInterval" and to calculate an appropriate value.

## How was this patch tested?
When restarting a stream, we change 'spark.sql.streaming.fileSource.log.compactInterval' to a value different from the former one.

The primary solution to this issue was given by uncleGen.
Added extensions include an additional metadata field in OffsetSeq and changes to the CompactibleFileStreamLog APIs. zsxwing

Author: Tyson Condie <tcondie@gmail.com>
Author: genmao.ygm <genmao.ygm@genmaoygmdeMacBook-Air.local>

Closes #15852 from tcondie/spark-18187.

(cherry picked from commit 51baca2)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
@asfgit asfgit closed this in 51baca2 Nov 18, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…terval" direcly with user setting.

## What changes were proposed in this pull request?
CompactibleFileStreamLog relies on "compactInterval" to detect a compaction batch. If "compactInterval" is reset by the user, CompactibleFileStreamLog will return wrong answers, resulting in data loss. This PR provides a way to check the validity of "compactInterval" and to calculate an appropriate value.

## How was this patch tested?
When restarting a stream, we change 'spark.sql.streaming.fileSource.log.compactInterval' to a value different from the former one.

The primary solution to this issue was given by uncleGen.
Added extensions include an additional metadata field in OffsetSeq and changes to the CompactibleFileStreamLog APIs. zsxwing

Author: Tyson Condie <tcondie@gmail.com>
Author: genmao.ygm <genmao.ygm@genmaoygmdeMacBook-Air.local>

Closes apache#15852 from tcondie/spark-18187.