
[SPARK-23092][SQL] Migrate MemoryStream to DataSourceV2 APIs #20445

Closed
wants to merge 20 commits

Conversation

@tdas (Contributor) commented Jan 31, 2018

What changes were proposed in this pull request?

This PR migrates the MemoryStream to DataSourceV2 APIs.

One additional change is to the keys reported in StreamingQueryProgress.durationMs: "getOffset" and "getBatch" are replaced with "setOffsetRange" and "getEndOffset", since those are the phases that make sense to track now. Unit tests were changed accordingly.
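As an illustration of that rename, here is a minimal sketch. Only the key names come from this PR; the timing values are invented for the example.

```scala
// Illustrative only: the key names are from this PR, the timings are made up.
// Before the migration, StreamingQueryProgress.durationMs reported:
val oldDurationMs = Map("getOffset" -> 12L, "getBatch" -> 34L, "addBatch" -> 250L)

// After the migration, the same phases are reported under the new names:
val newDurationMs = Map("setOffsetRange" -> 12L, "getEndOffset" -> 34L, "addBatch" -> 250L)

// Keys unrelated to the source API (e.g. "addBatch") are unchanged.
assert((oldDurationMs.keySet -- Set("getOffset", "getBatch")) ==
       (newDurationMs.keySet -- Set("setOffsetRange", "getEndOffset")))
```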

How was this patch tested?

Existing unit tests, plus a few updated unit tests.

*/
def fullOutput: Seq[AttributeReference]
@tdas (Contributor, Author) commented Jan 31, 2018

@cloud-fan This fixes the bug I spoke to you offline about.
The target of this PR is only master, not 2.3.x. So if you want to have this fix in 2.3.0, please make a separate PR accordingly.


If this PR has to be merged to the 2.3.0 branch, does it require additional changes?

@SparkQA commented Jan 31, 2018

Test build #86857 has finished for PR 20445 at commit 5adf1fe.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 31, 2018

Test build #86855 has finished for PR 20445 at commit e66d809.

  • This patch fails from timeout after a configured wait of `300m`.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 1, 2018

Test build #86950 has finished for PR 20445 at commit 478ad17.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas (Contributor, Author) commented Feb 1, 2018

This PR is currently blocked by a DataSourceV2ScanExec bug, which is being fixed in PR #20387.

Optional.empty())

(s, Some(s.getEndOffset))
reportTimeTaken("setOffsetRange") {
A contributor commented:

I agree that the old metric names don't make much sense anymore, but I worry about changing external-facing behavior as part of an API migration.

@@ -89,7 +96,7 @@ case class MemoryStream[A : Encoder](id: Int, sqlContext: SQLContext)

def addData(data: TraversableOnce[A]): Offset = {
val encoded = data.toVector.map(d => encoder.toRow(d).copy())
val plan = new LocalRelation(schema.toAttributes, encoded, isStreaming = true)
val plan = new LocalRelation(attributes, encoded, isStreaming = false)
val ds = Dataset[A](sqlContext.sparkSession, plan)
logDebug(s"Adding ds: $ds")
A contributor commented:
Do we still need to store the batches as datasets, now that we're just collect()ing them back out in createDataReaderFactories()?

@tdas (Contributor, Author) replied:

Good point.
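The simplification being discussed can be sketched without any Spark classes. The names here (`SimpleMemoryStream`, `Row`) are hypothetical stand-ins, not the actual Spark types: the idea is to buffer the already-encoded rows instead of whole Datasets, and hand slices of the buffer to the reader factories directly.

```scala
// Hypothetical stand-in for an encoded row (not Spark's UnsafeRow).
final case class Row(values: Seq[Any])

// Sketch: buffer encoded rows per batch rather than Datasets that would
// later be collect()ed back out.
final class SimpleMemoryStream[A](encode: A => Row) {
  private var batches = Vector.empty[Seq[Row]]

  def addData(data: Seq[A]): Int = synchronized {
    batches :+= data.map(encode)
    batches.size - 1 // offset of the batch just added
  }

  // Rows for a batch range can be handed straight to reader factories.
  def rowsInRange(start: Int, end: Int): Seq[Row] = synchronized {
    batches.slice(start, end).flatten
  }
}
```

For example, `addData(Seq(1, 2))` returns the offset of the new batch, and `rowsInRange(0, 1)` returns that batch's encoded rows without going through a Dataset.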

@SparkQA commented Feb 1, 2018

Test build #86951 has finished for PR 20445 at commit 6389d80.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 2, 2018

Test build #86960 has finished for PR 20445 at commit 3f50f33.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 3, 2018

Test build #87010 has finished for PR 20445 at commit c713048.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class MemoryStreamDataReaderFactory(records: Array[UnsafeRow])

@SparkQA commented Feb 3, 2018

Test build #87018 has finished for PR 20445 at commit 1204755.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas (Contributor, Author) commented Feb 6, 2018

jenkins retest this please

@SparkQA commented Feb 6, 2018

Test build #87122 has finished for PR 20445 at commit 1204755.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas (Contributor, Author) commented Feb 6, 2018

jenkins retest this please.

@SparkQA commented Feb 7, 2018

Test build #87133 has finished for PR 20445 at commit 1204755.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 7, 2018

Test build #87144 has finished for PR 20445 at commit f0ce5df.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas (Contributor, Author) commented Feb 7, 2018

Jenkins retest this please

ForeachSinkSuite.Process(value = 4),
ForeachSinkSuite.Close(None)
)
val events = ForeachSinkSuite.allEvents()
@tdas (Contributor, Author):

This test assumed that the output would arrive in a specific order after repartitioning, which isn't guaranteed. So I rewrote the test to verify the output in an order-independent way.
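A sketch of that order-independent check, using simplified stand-ins for the suite's event types (this is not the actual test code):

```scala
// Simplified stand-ins for ForeachSinkSuite's Process/Close events.
sealed trait Event
final case class Process(value: Int) extends Event
final case class Close(error: Option[String]) extends Event

// After a repartition, per-partition event order is nondeterministic, so
// compare the collected events as an unordered collection, not a sequence.
def assertSameEvents(actual: Seq[Event], expected: Seq[Event]): Unit = {
  assert(actual.map(_.toString).sorted == expected.map(_.toString).sorted,
         s"got $actual, expected $expected in any order")
}
```

With this helper, `Seq(Process(4), Close(None))` and `Seq(Close(None), Process(4))` compare equal, while a genuine mismatch in events still fails.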

@tdas (Contributor, Author) commented Feb 7, 2018

Retest this please

@@ -149,18 +149,12 @@ case class MemoryStream[A : Encoder](id: Int, sqlContext: SQLContext)
}

private def generateDebugString(
blocks: Iterable[Array[UnsafeRow]],
blocks: Seq[UnsafeRow],
A contributor commented:

nit: it's probably more "rows" than "blocks" now

@tdas (Contributor, Author) replied:

Right! I thought of changing it but forgot. My bad.

@jose-torres (Contributor):

LGTM pending passing run of that HiveDDLSuite test

@SparkQA commented Feb 7, 2018

Test build #87174 has finished for PR 20445 at commit f0ce5df.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas (Contributor, Author) commented Feb 7, 2018

seems like an unrelated flaky test ^

@SparkQA commented Feb 7, 2018

Test build #87176 has finished for PR 20445 at commit c3508e9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas (Contributor, Author) commented Feb 7, 2018

Merging to master.

@asfgit asfgit closed this in 30295bf Feb 7, 2018
robert3005 pushed a commit to palantir/spark that referenced this pull request Feb 12, 2018

Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: Burak Yavuz <brkyvz@gmail.com>

Closes apache#20445 from tdas/SPARK-23092.
otterc pushed a commit to linkedin/spark that referenced this pull request Mar 22, 2023

Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: Burak Yavuz <brkyvz@gmail.com>

Closes apache#20445 from tdas/SPARK-23092.

Ref: LIHADOOP-48531

RB=1832973
G=superfriends-reviewers
R=latang,yezhou,zolin,mshen,fli
A=
Labels: none yet
Projects: none yet
5 participants