[SPARK-23097][SQL][SS] Migrate text socket source to V2 #20382
Conversation
@@ -56,7 +58,7 @@ trait ConsoleWriter extends Logging {
       println("-------------------------------------------")
       // scalastyle:off println
       spark
-        .createDataFrame(spark.sparkContext.parallelize(rows), schema)
+        .createDataFrame(rows.toList.asJava, schema)
Change here to avoid triggering new distributed job.
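A rough sketch of the difference being pointed out, with names taken from the diff above: the old form schedules a distributed job just to print rows to the console, while the new form builds a local relation (`numRowsToShow` and `isTruncated` are assumed surrounding parameters, not part of this diff).

```scala
// Before: parallelize() creates an RDD, so show() schedules a distributed
// job for every micro-batch just to print a handful of rows.
// spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)

// After: passing the rows as a local java.util.List builds a LocalRelation,
// which show() can evaluate without scheduling any job.
import scala.collection.JavaConverters._

spark
  .createDataFrame(rows.toList.asJava, schema)
  .show(numRowsToShow, isTruncated)
```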
this fix should go into 2.3 branch. thanks for catching this.
OK, I will create a separate PR for this small fix.
Test build #86581 has finished for PR 20382 at commit
@jose-torres can you please help to review, thanks!
I think we shouldn't remove the old source, contrary to what I did with the console sink. We should add a conf to disable the V2 implementation on a per-source basis, which we can use to (a) fall back if some user finds the new implementation problematic and (b) run tests with the conf to make sure that the V1 execution path still works.
I'll write a PR to handle that.
}

class TextSocketSourceProviderV2 extends DataSourceV2
  with MicroBatchReadSupport with DataSourceRegister with Logging {
The intent is for the V2 and V1 source to live in the same register, so existing queries can start using the V2 source with no change needed. This also allows the V2 implementation to be validated by passing all the old tests.
RateSourceV2 is a bad example; it only exists because I didn't have time to write a fully compatible rate source. I'll work on fixing it.
@jose-torres, do you mean that instead of creating a new V2 socket source, we should modify the current V1 socket source to make it work with V2? Am I understanding that correctly?
The idea is that the existing TextSocketSourceProvider will have the MicroBatchReadSupport implementation here, in addition to the StreamSourceProvider implementation it already has.
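A minimal sketch of the shape being described: one provider class, registered under the existing short name, exposing both the V1 and V2 entry points. Method bodies are elided and the exact signatures are assumed from the surrounding discussion, not confirmed against this PR.

```scala
// Sketch: a single provider registered as "socket" that serves both the
// V1 (StreamSourceProvider) and V2 (MicroBatchReadSupport) code paths, so
// existing queries keep working unchanged.
class TextSocketSourceProvider extends DataSourceV2
  with MicroBatchReadSupport with StreamSourceProvider with DataSourceRegister {

  override def shortName(): String = "socket"

  // V2 path: used by the new micro-batch execution engine.
  override def createMicroBatchReader(
      schema: Optional[StructType],
      checkpointLocation: String,
      options: DataSourceOptions): MicroBatchReader = ???

  // V1 path: kept alongside, so the old execution pipeline still resolves
  // this provider. (sourceSchema and other V1 methods elided.)
  override def createSource(
      sqlContext: SQLContext,
      metadataPath: String,
      schema: Option[StructType],
      providerName: String,
      parameters: Map[String, String]): Source = ???
}
```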
I see, thanks for the clarification. Let me change it.
private val host = options.get(HOST).get()
private val port = options.get(PORT).get().toInt
private val includeTimestamp = options.getBoolean(INCLUDE_TIMESTAMP, false)
private val numPartitions = options.getInt(NUM_PARTITIONS, 1)
To match the old parallelize behavior, the default number of partitions should be sparkContext.defaultParallelism.
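A sketch of the suggested default, assuming a `SparkSession` is reachable from the reader (how it is obtained here is an assumption, not taken from the PR):

```scala
// Fall back to the cluster's default parallelism, matching the old
// sparkContext.parallelize(rows) behavior, instead of hardcoding 1.
private val numPartitions = options.getInt(
  NUM_PARTITIONS,
  SparkSession.getActiveSession.get.sparkContext.defaultParallelism)
```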
private var lastOffsetCommitted: Long = -1L

override def setOffsetRange(start: Optional[Offset], end: Optional[Offset]): Unit = {
  if (!initialized) {
Is it possible to initialize in the constructor?
This is what I want to bring out. Originally I initialized this in the constructor like the old socket source. But I found that the MicroBatchReader will be created in two different places as two objects. So initializing in the constructor would create two socket threads and connections. This is different from the V1 source: in V1 we only created the source once, but with V2 the MicroBatchReader is created as two objects in two different places (one for the schema), which means such side-effect actions in the constructor will run twice. Ideally we should only create this MicroBatchReader once.
I don't think this will solve that problem, since each reader will just have its own initialize bit.
In general, I think it's fine if we do a bit of extra work. V1 sources do have to support being created multiple times (in e.g. restart scenarios), and the lifecycles of the two V2 readers being created here don't overlap. (We should be closing the tempReader created in DataStreamReader, though.)
override def commit(end: Offset): Unit = synchronized {
  val newOffset = end.asInstanceOf[TextSocketStreamOffset]
  val offsetDiff = (newOffset.offset - lastOffsetCommitted).toInt
nit: conversion to int is unnecessary
Test build #86640 has finished for PR 20382 at commit
It's unfortunate that the socket tests don't actually run streams end to end, but I think that's orthogonal to this PR. Can you run one of the programming guide examples using the socket source (e.g. org.apache.spark.examples.sql.streaming.StructuredSessionization) to make sure it works after this PR? If it does, LGTM
Jenkins, retest this please.
Hi @jose-torres, thanks for your review. I tried both the example you mentioned and a simple spark-shell command; I think it works, but the path will always go to V2.
Right, that makes sense. LGTM
Test build #86671 has finished for PR 20382 at commit
Test build #86677 has finished for PR 20382 at commit
Try(params.getOrElse("includeTimestamp", "false").toBoolean) match {
  case Success(bool) => bool

class TextSocketSourceProvider extends DataSourceV2
  with MicroBatchReadSupport with StreamSourceProvider with DataSourceRegister with Logging {
Why do we still need StreamSourceProvider?
If I don't misunderstand @jose-torres's intention, basically he wanted this socket source to also work in the V1 code path.
Aah, I see the earlier comments.
TD and I discussed this offline. It should be fine to remove the V1 StreamSourceProvider implementation, because:
- this isn't a production-quality source, so users shouldn't need to fall back to it
- this source won't be particularly useful at exercising the V1 execution pipeline once we transition all sources to V2
OK, I will update the patch accordingly.
import org.apache.spark.internal.Logging

trait TextSocketReader extends Logging {
Please add docs!! This is a base interface used by two source implementations.
Also rename this so that it's clear that this is a base class and not an actual Reader (i.e. not a subclass of the DataSourceV2 readers). Maybe TextSocketReaderBase.
override def toString: String = s"TextSocketSource[host: $host, port: $port]"
}

case class TextSocketOffset(offset: Long) extends V2Offset {
I would wait for my PR #20445 to go in where I migrate LongOffset to use OffsetV2
I am holding off further comments on this PR until the major change of eliminating the v1 Source is done. That would cause significant refactoring (including the fact that the common trait won't be needed). BTW, I strongly suggest moving the socket code to execution.streaming.sources, like the other v2 sources.
Sure, will wait for the others to be merged, thanks @tdas.
#20445 will be merged in a few hours. Please go ahead and update your PR with the refactoring that was suggested (mainly, no v1 version).
Sure, I will do it.
9ceb3be to fdc9b9c
Test build #87199 has finished for PR 20382 at commit
Test build #87202 has finished for PR 20382 at commit
jenkins test this please
Test build #87203 has finished for PR 20382 at commit
Hi @tdas, would you please help to review again, thanks!
overall looks good, just a few comments.
org.apache.spark.sql.execution.streaming.RateSourceProvider
org.apache.spark.sql.execution.streaming.sources.TextSocketSourceProvider
can you add a redirection in the DataSource.backwardCompatibilityMap for this?
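Something along these lines in DataSource.backwardCompatibilityMap, so checkpoints or configs that reference the old fully-qualified class name still resolve after the package move (the exact shape of the map entry is assumed, not taken from this PR):

```scala
// Hypothetical entry: redirect the old class name (pre-move) to the new
// class under execution.streaming.sources.
"org.apache.spark.sql.execution.streaming.TextSocketSourceProvider" ->
  classOf[org.apache.spark.sql.execution.streaming.sources.TextSocketSourceProvider]
    .getCanonicalName
```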
* A source that reads text lines through a TCP socket, designed only for tutorials and debugging.
* This source will *not* work in production applications due to multiple reasons, including no
* support for fault recovery and keeping all of the text read in memory forever.
* A MicroBatchReader that reads text lines through a TCP socket, designed only for tutorials and
nit: tutorials -> testing (i know it was like that, but lets fix it since we are changing it anyway)
Tutorials is correct here; see e.g. StructuredSessionization.scala
* A MicroBatchReader that reads text lines through a TCP socket, designed only for tutorials and
* debugging. This MicroBatchReader will *not* work in production applications due to multiple
* reasons, including no support for fault recovery and keeping all of the text read in memory
* forever.
this does not keep it forever, so remove this reason; just keep "no support for fault recovery".
}

override def readSchema(): StructType = {
  val includeTimestamp = options.getBoolean("includeTimestamp", false)
supernit: is there need for a variable here?
override def schema: StructType = if (includeTimestamp) TextSocketSource.SCHEMA_TIMESTAMP
  else TextSocketSource.SCHEMA_REGULAR
override def setOffsetRange(
    start: Optional[Offset],
nit: won't this fit on a single line?
@@ -164,54 +213,43 @@ class TextSocketSource(host: String, port: Int, includeTimestamp: Boolean, sqlCo
    }
  }

-  override def toString: String = s"TextSocketSource[host: $host, port: $port]"
+  override def toString: String = s"TextSocketMicroBatchReader[host: $host, port: $port]"
This shows up in the StreamingQueryProgressEvent as description, so it may be better to have it as "TextSocket[..."
    schema: Optional[StructType],
    checkpointLocation: String,
    options: DataSourceOptions): MicroBatchReader = {
  checkParameters(options.asMap().asScala.toMap)
why not check it as DataSourceOptions (which is known to be case-insensitive) rather than a map which raises questions about case sensitivity?
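The point being that DataSourceOptions normalizes key case internally, so validating against it directly sidesteps the case-sensitivity question. A sketch of the suggested shape (the method name, constants, and error messages are assumptions for illustration):

```scala
// DataSourceOptions lower-cases option keys, so get() lookups here are
// case-insensitive by construction; no conversion to a Scala Map needed.
private def checkParameters(params: DataSourceOptions): Unit = {
  if (!params.get(HOST).isPresent) {
    throw new AnalysisException("Set a host to read from with option(\"host\", ...).")
  }
  if (!params.get(PORT).isPresent) {
    throw new AnalysisException("Set a port to read from with option(\"port\", ...).")
  }
}
```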
@@ -177,11 +177,14 @@ final class DataStreamReader private[sql](sparkSession: SparkSession) extends Lo
    Optional.ofNullable(userSpecifiedSchema.orNull),
    Utils.createTempDir(namePrefix = s"temporaryReader").getCanonicalPath,
    options)
  val schema = tempReader.readSchema()
  // Stop tempReader to avoid side-affect thing
nit: side-affect -> side-effect.
good catch.
  val schema = tempReader.readSchema()
  // Stop tempReader to avoid side-affect thing
  tempReader.stop()
i feel like this needs a try finally approach as well.
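A sketch of the try/finally shape being suggested, so the temporary reader is stopped even if schema inference throws (names follow the diff above):

```scala
// Ensure the temporary reader's resources (socket thread, connection) are
// released even when readSchema() fails.
val schema = try {
  tempReader.readSchema()
} finally {
  // Stop tempReader to avoid side effects such as a lingering socket thread.
  tempReader.stop()
}
```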
@@ -0,0 +1,246 @@
/*
why does this show up as a new file? was this not a "git mv"? something went wrong, i would prefer that i can see a simple diff. Not much should change in the tests.
Sorry @tdas, I did it with a simple "mv", not "git mv". This doesn't change a lot, just adapts the tests to the data source v2 API.
068c050 to 647c5cd
Test build #87371 has finished for PR 20382 at commit
Test build #87372 has finished for PR 20382 at commit
Test build #87370 has finished for PR 20382 at commit
Change-Id: I22a5cef90b269b29e6dbb442aba77aa3c1f3e2c4
Jenkins, retest this please.
Test build #87664 has finished for PR 20382 at commit
Jenkins, retest this please.
Test build #87667 has finished for PR 20382 at commit
  StopStream
)

assert(!batch2Stamp.before(batch1Stamp))
there is a slim chance that batch2Stamp will be the same as batch1Stamp; maybe worth adding a sleep(10) to ensure this.
You should also check batch1Stamp against a timestamp taken directly before the query; otherwise the test may pass even if the query generated batch1Stamp = -1 and batch2Stamp = -2.
Hi @tdas, what's the meaning of "you should also check batch1Stamp with a timestamp taken directly before the query"? I'm not entirely sure what specifically you are pointing to.
val timestamp = System.currentTimeMillis
testStream(...)(
// get batch1stamp
)
// assert batch1stamp >= timestamp
I see. Will update it.
intercept[IOException] {
  batchReader = provider.createMicroBatchReader(
    Optional.empty(), "", new DataSourceOptions(parameters.asJava))
}
assert on the message.
In my local test, the assert message is "Can't assign requested address", but on Jenkins it is "Connection refused". The difference might be due to different OS/native methods.
I think it would be better not to check the message given the different outputs; even if we change it to match the Jenkins output, it would still fail on my local Mac.
That's fine.
@GuardedBy("this")
-  protected var currentOffset: LongOffset = new LongOffset(-1)
+  private[sources] var currentOffset: LongOffset = LongOffset(-1L)
This does not make sense; you are directly accessing something that should only be accessed while synchronized on `this`.
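One way to keep the field guarded while still exposing it to tests, sketched under the assumption that the wider visibility exists only for test access (accessor name is illustrative, not from the PR):

```scala
@GuardedBy("this")
private var currentOffset: LongOffset = LongOffset(-1L)

// Expose a synchronized accessor for tests instead of widening the field's
// visibility and letting callers read it without holding the lock.
private[sources] def getCurrentOffset(): LongOffset = synchronized {
  currentOffset.copy()
}
```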
@jerryshao please address the above comment, then we are good to merge!
Sure, I will do it today.
Test build #87819 has finished for PR 20382 at commit
Relevant test failed. Please make sure that there is no flakiness in the tests.
Test build #87825 has finished for PR 20382 at commit
Test build #87831 has finished for PR 20382 at commit
LGTM. Merging to master.
This PR moves structured streaming text socket source to V2. Questions: do we need to remove old "socket" source? Unit test and manual verification. Author: jerryshao <sshao@hortonworks.com> Closes apache#20382 from jerryshao/SPARK-23097. Ref: LIHADOOP-48531
What changes were proposed in this pull request?
This PR moves structured streaming text socket source to V2.
Questions: do we need to remove old "socket" source?
How was this patch tested?
Unit test and manual verification.