
[SPARK-23097][SQL][SS] Migrate text socket source to V2 #20382

Closed
wants to merge 14 commits

Conversation

jerryshao
Contributor

What changes were proposed in this pull request?

This PR migrates the Structured Streaming text socket source to the V2 API.

Question: do we need to remove the old "socket" source?

How was this patch tested?

Unit test and manual verification.

@@ -56,7 +58,7 @@ trait ConsoleWriter extends Logging {
println("-------------------------------------------")
// scalastyle:off println
spark
.createDataFrame(spark.sparkContext.parallelize(rows), schema)
.createDataFrame(rows.toList.asJava, schema)
Contributor Author

Changed here to avoid triggering a new distributed job.
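The difference can be sketched with a minimal, self-contained example (not the actual ConsoleWriter code; session setup is an assumption):

```scala
import scala.collection.JavaConverters._

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object LocalCreateDataFrameSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("sketch").getOrCreate()
    val schema = StructType(StructField("value", StringType) :: Nil)
    val rows = Seq(Row("a"), Row("b"))

    // Old path: parallelize() wraps the rows in an RDD, so printing them
    // schedules a distributed job just to collect them back.
    val viaRdd = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)

    // New path: a java.util.List produces a local plan, so show()/collect()
    // can materialize the rows without launching a job.
    val viaList = spark.createDataFrame(rows.toList.asJava, schema)

    viaList.show()
    spark.stop()
  }
}
```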

Contributor

This fix should go into the 2.3 branch. Thanks for catching this.

Contributor Author

OK, I will create a separate PR for this small fix.

@SparkQA

SparkQA commented Jan 24, 2018

Test build #86581 has finished for PR 20382 at commit 8f3b548.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jerryshao
Contributor Author

@jose-torres can you please help to review, thanks!

Contributor

@jose-torres jose-torres left a comment

I think we shouldn't remove the old source, contrary to what I did with the console sink. We should add a conf to disable the V2 implementation on a per-source basis, which we can use to (a) fall back if some user finds the new implementation problematic and (b) run tests with the conf to make sure that the V1 execution path still works.

I'll write a PR to handle that.

}

class TextSocketSourceProviderV2 extends DataSourceV2
with MicroBatchReadSupport with DataSourceRegister with Logging {
Contributor

@jose-torres jose-torres Jan 24, 2018

The intent is for the V2 and V1 source to live in the same register, so existing queries can start using the V2 source with no change needed. This also allows the V2 implementation to be validated by passing all the old tests.

RateSourceV2 is a bad example; it only exists because I didn't have time to write a fully compatible rate source. I'll work on fixing it.

Contributor Author

@jose-torres, do you mean that instead of creating a new V2 socket source, I should modify the current V1 socket source to make it work with V2? Am I understanding that correctly?

Contributor

The idea is that the existing TextSocketSourceProvider will have the MicroBatchReadSupport implementation here, in addition to the StreamSourceProvider implementation it already has.
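In code, that dual-interface shape might look like the following sketch (interface names per the Spark 2.3-era V2 API; import paths are approximate for that era, and the elided bodies are assumptions):

```scala
import java.util.Optional

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.execution.streaming.Source
import org.apache.spark.sql.sources.{DataSourceRegister, StreamSourceProvider}
import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, MicroBatchReadSupport}
import org.apache.spark.sql.sources.v2.reader.streaming.MicroBatchReader
import org.apache.spark.sql.types.StructType

// One provider registered under the short name "socket". Existing queries
// keep resolving the same entry, and the engine can pick either code path.
class TextSocketSourceProvider extends DataSourceV2
  with MicroBatchReadSupport       // new V2 micro-batch path
  with StreamSourceProvider        // existing V1 path
  with DataSourceRegister {

  override def shortName(): String = "socket"

  // V2 entry point.
  override def createMicroBatchReader(
      schema: Optional[StructType],
      checkpointLocation: String,
      options: DataSourceOptions): MicroBatchReader = ???

  // V1 entry points, unchanged from the existing provider.
  override def sourceSchema(
      sqlContext: SQLContext,
      schema: Option[StructType],
      providerName: String,
      parameters: Map[String, String]): (String, StructType) = ???

  override def createSource(
      sqlContext: SQLContext,
      metadataPath: String,
      schema: Option[StructType],
      providerName: String,
      parameters: Map[String, String]): Source = ???
}
```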

Contributor Author

I see, thanks for the clarification. Let me change it.

private val host = options.get(HOST).get()
private val port = options.get(PORT).get().toInt
private val includeTimestamp = options.getBoolean(INCLUDE_TIMESTAMP, false)
private val numPartitions = options.getInt(NUM_PARTITIONS, 1)
Contributor

To match the old parallelize behavior, the default number of partitions should be sparkContext.defaultParallelism.
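That suggestion would look roughly like the following (field names from the diff above; the active-session lookup is an assumption):

```scala
// Match parallelize()'s old behavior: when the user gives no partition
// count, fall back to the cluster's default parallelism rather than 1.
private val numPartitions = options.getInt(
  NUM_PARTITIONS, SparkSession.getActiveSession.get.sparkContext.defaultParallelism)
```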

private var lastOffsetCommitted: Long = -1L

override def setOffsetRange(start: Optional[Offset], end: Optional[Offset]): Unit = {
if (!initialized) {
Contributor

Is it possible to initialize in the constructor?

Contributor Author

This is what I wanted to bring up. Originally I initialized this in the constructor, like the old socket source. But I found that the MicroBatchReader gets created in two different places, producing two objects, so initializing in the constructor would create two socket threads and connections. This differs from the V1 source: there the source was created only once, whereas with the V2 MicroBatchReader we create two objects in two different places (one just for the schema), which means any side-effecting work in the constructor runs twice. Ideally we should create this MicroBatchReader only once.

Contributor

I don't think this will solve that problem, since each reader will just have its own initialize bit.

In general, I think it's fine if we do a bit of extra work. V1 sources do have to support being created multiple times (in e.g. restart scenarios), and the lifecycles of the two V2 readers being created here don't overlap. (We should be closing the tempReader created in DataStreamReader, though.)


override def commit(end: Offset): Unit = synchronized {
val newOffset = end.asInstanceOf[TextSocketStreamOffset]
val offsetDiff = (newOffset.offset - lastOffsetCommitted).toInt
Contributor

nit: conversion to int is unnecessary

@SparkQA

SparkQA commented Jan 25, 2018

Test build #86640 has finished for PR 20382 at commit 56c60f3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jose-torres
Contributor

jose-torres commented Jan 25, 2018

It's unfortunate that the socket tests don't actually run streams end to end, but I think that's orthogonal to this PR.

Can you run one of the programming guide examples using socket source (e.g. org.apache.spark.examples.sql.streaming.StructuredSessionization) to make sure it works after this PR? If it does, LGTM

@jerryshao
Contributor Author

Jenkins, retest this please.

@jerryshao
Contributor Author

Hi @jose-torres, thanks for your review. I tried both the example you mentioned and a simple spark-shell command; it works, but the path always goes to the V2 MicroBatchReader (we still need your PR to fall back to the V1 Source).

@jose-torres
Contributor

Right, that makes sense. LGTM

@SparkQA

SparkQA commented Jan 26, 2018

Test build #86671 has finished for PR 20382 at commit 56c60f3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 26, 2018

Test build #86677 has finished for PR 20382 at commit 9ceb3be.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jerryshao
Contributor Author

@zsxwing @tdas would you please help to review, thanks!

Try(params.getOrElse("includeTimestamp", "false").toBoolean) match {
case Success(bool) => bool
class TextSocketSourceProvider extends DataSourceV2
with MicroBatchReadSupport with StreamSourceProvider with DataSourceRegister with Logging {
Contributor

Why do we still need StreamSourceProvider?

Contributor Author

If I'm not misunderstanding @jose-torres's intention, he basically wanted this socket source to also work in the V1 code path.

Contributor

Aah, I see the earlier comments.

Contributor

TD and I discussed this offline. It should be fine to remove the V1 StreamSourceProvider implementation, because:

  • this isn't a production-quality source, so users shouldn't need to fall back to it
  • this source won't be particularly useful at exercising the V1 execution pipeline once we transition all sources to V2

Contributor Author

OK, I will update the patch accordingly.


import org.apache.spark.internal.Logging

trait TextSocketReader extends Logging {
Contributor

@tdas tdas Jan 31, 2018

Please add docs!! This is a base interface used by two source implementations.
Also rename it so that it's clear this is a base class and not an actual Reader (i.e. not a subclass of the DataSourceV2 readers). Maybe TextSocketReaderBase.

override def toString: String = s"TextSocketSource[host: $host, port: $port]"
}

case class TextSocketOffset(offset: Long) extends V2Offset {
Contributor

I would wait for my PR #20445 to go in where I migrate LongOffset to use OffsetV2

@tdas
Contributor

tdas commented Jan 31, 2018

I am holding off further comments on this PR until the major change of eliminating the v1 Source is done. That will cause significant refactoring (including the fact that the common trait won't be needed).

BTW, I strongly suggest moving the socket code to execution.streaming.sources, like other v2 sources.

@jerryshao
Contributor Author

Sure, I will wait for the others to be merged. Thanks @tdas.

@tdas
Contributor

tdas commented Feb 7, 2018

#20445 will be merged in a few hours. please go ahead and update your PR with the refactoring that was suggested (mainly, no v1 version).

@jerryshao
Contributor Author

Sure, I will do it.

@SparkQA

SparkQA commented Feb 8, 2018

Test build #87199 has finished for PR 20382 at commit fdc9b9c.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class TextSocketMicroBatchReader(options: DataSourceOptions) extends MicroBatchReader with Logging

@SparkQA

SparkQA commented Feb 8, 2018

Test build #87202 has finished for PR 20382 at commit 874c91c.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas
Contributor

tdas commented Feb 8, 2018

jenkins test this please

@SparkQA

SparkQA commented Feb 8, 2018

Test build #87203 has finished for PR 20382 at commit 874c91c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jerryshao
Contributor Author

Hi @tdas , would you please help to review again, thanks!

Copy link
Contributor

@tdas tdas left a comment


overall looks good, just a few comments.

org.apache.spark.sql.execution.streaming.RateSourceProvider
org.apache.spark.sql.execution.streaming.sources.TextSocketSourceProvider
Contributor

can you add a redirection in the DataSource.backwardCompatibilityMap for this?
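The requested redirection is, roughly, an entry mapping the old fully-qualified class name to the new one (a sketch of the map's shape, not the exact DataSource source):

```scala
// Checkpoints and user code that reference the old class name keep
// resolving after the provider moves packages.
private val backwardCompatibilityMap: Map[String, String] = Map(
  "org.apache.spark.sql.execution.streaming.TextSocketSourceProvider" ->
    "org.apache.spark.sql.execution.streaming.sources.TextSocketSourceProvider"
)
```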

* A source that reads text lines through a TCP socket, designed only for tutorials and debugging.
* This source will *not* work in production applications due to multiple reasons, including no
* support for fault recovery and keeping all of the text read in memory forever.
* A MicroBatchReader that reads text lines through a TCP socket, designed only for tutorials and
Contributor

nit: tutorials -> testing (i know it was like that, but lets fix it since we are changing it anyway)

Contributor

Tutorials is correct here; see e.g. StructuredSessionization.scala

* A MicroBatchReader that reads text lines through a TCP socket, designed only for tutorials and
* debugging. This MicroBatchReader will *not* work in production applications due to multiple
* reasons, including no support for fault recovery and keeping all of the text read in memory
* forever.
Contributor

This does not keep it forever, so remove that reason; just keep "no support for fault recovery".

}

override def readSchema(): StructType = {
val includeTimestamp = options.getBoolean("includeTimestamp", false)
Contributor

supernit: is there need for a variable here?

override def schema: StructType = if (includeTimestamp) TextSocketSource.SCHEMA_TIMESTAMP
else TextSocketSource.SCHEMA_REGULAR
override def setOffsetRange(
start: Optional[Offset],
Contributor

nit: won't this fit on a single line?

@@ -164,54 +213,43 @@ class TextSocketSource(host: String, port: Int, includeTimestamp: Boolean, sqlCo
}
}

override def toString: String = s"TextSocketSource[host: $host, port: $port]"
override def toString: String = s"TextSocketMicroBatchReader[host: $host, port: $port]"
Contributor

This shows up in the StreamingQueryProgressEvent as description, so it may be better to have it as "TextSocket[..."

schema: Optional[StructType],
checkpointLocation: String,
options: DataSourceOptions): MicroBatchReader = {
checkParameters(options.asMap().asScala.toMap)
Contributor

why not check it as DataSourceOptions (which is known to be case-insensitive) rather than a map which raises questions about case sensitivity?
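A sketch of checking through DataSourceOptions directly (validation details and messages are assumptions; DataSourceOptions lower-cases its keys, so lookups are case-insensitive):

```scala
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.sources.v2.DataSourceOptions

// "Host"/"HOST"/"host" all resolve to the same entry, so there is no
// question of key casing as there would be with a plain Scala Map.
private def checkParameters(options: DataSourceOptions): Unit = {
  if (!options.get("host").isPresent) {
    throw new AnalysisException("Set a host to read from with option(\"host\", ...)")
  }
  if (!options.get("port").isPresent) {
    throw new AnalysisException("Set a port to read from with option(\"port\", ...)")
  }
}
```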

@@ -177,11 +177,14 @@ final class DataStreamReader private[sql](sparkSession: SparkSession) extends Lo
Optional.ofNullable(userSpecifiedSchema.orNull),
Utils.createTempDir(namePrefix = s"temporaryReader").getCanonicalPath,
options)
val schema = tempReader.readSchema()
// Stop tempReader to avoid side-affect thing
Contributor

nit: side-affect -> side-effect.

good catch.

@@ -177,11 +177,14 @@ final class DataStreamReader private[sql](sparkSession: SparkSession) extends Lo
Optional.ofNullable(userSpecifiedSchema.orNull),
Utils.createTempDir(namePrefix = s"temporaryReader").getCanonicalPath,
options)
val schema = tempReader.readSchema()
// Stop tempReader to avoid side-affect thing
tempReader.stop()
Contributor

I feel like this needs a try-finally approach as well.
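The suggested shape, sketched against the lines above (the receiver name ds is an assumption; the surrounding variables are from the diff):

```scala
val tempReader = ds.createMicroBatchReader(
  Optional.ofNullable(userSpecifiedSchema.orNull),
  Utils.createTempDir(namePrefix = s"temporaryReader").getCanonicalPath,
  options)
// Guarantee stop() runs even if readSchema() throws, so the probe
// reader never leaks its socket thread or other side effects.
val schema = try {
  tempReader.readSchema()
} finally {
  tempReader.stop()
}
```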

@@ -0,0 +1,246 @@
/*
Contributor

Why does this show up as a new file? Was this not a "git mv"? Something went wrong; I would prefer to see a simple diff. Not much should change in the tests.

Contributor Author

Sorry @tdas, I did it with a plain "mv", not "git mv". The file doesn't change a lot, just enough to fit the Data Source V2 API.

@SparkQA

SparkQA commented Feb 13, 2018

Test build #87371 has finished for PR 20382 at commit 647c5cd.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 13, 2018

Test build #87372 has finished for PR 20382 at commit f3fc90c.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 13, 2018

Test build #87370 has finished for PR 20382 at commit 068c050.

  • This patch fails due to an unknown error code, -9.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@jerryshao
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Feb 26, 2018

Test build #87664 has finished for PR 20382 at commit fd890ad.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jerryshao
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Feb 26, 2018

Test build #87667 has finished for PR 20382 at commit fd890ad.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

StopStream
)

assert(!batch2Stamp.before(batch1Stamp))
Contributor

There is a slim chance that batch2Stamp will be the same as batch1Stamp; it may be worth adding a sleep(10) to ensure they differ.
You should also check batch1Stamp against a timestamp taken directly before the query; otherwise the test may pass even if the query generated batch1Stamp = -1 and batch2Stamp = -2.

Contributor Author

Hi @tdas, what is the meaning of "you should also check batch1stamp with timestamp taken directly before the query"? I'm not quite sure what specifically you are pointing to.

Contributor

val timestamp = System.currentTimeMillis
testStream(...)(
// get batch1stamp
)
// assert batch1stamp >= timestamp

Contributor Author

I see. Will update it.

intercept[IOException] {
batchReader = provider.createMicroBatchReader(
Optional.empty(), "", new DataSourceOptions(parameters.asJava))
}
Contributor

assert on the message.

Contributor Author

In my local test, the exception message is "Can't assign requested address", but on Jenkins it is "Connection refused". The difference might be due to a different OS/native implementation.

I think it would be better not to check the message, given the differing outputs. Even if we match the Jenkins message, the test would still fail on my local Mac.

Contributor

That's fine.


@GuardedBy("this")
protected var currentOffset: LongOffset = new LongOffset(-1)
private[sources] var currentOffset: LongOffset = LongOffset(-1L)
Contributor

@tdas tdas Feb 28, 2018

This does not make sense: you are directly accessing something that should only be accessed while synchronized on this.
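One way to keep the guarantee while still exposing the offset to tests is a synchronized accessor instead of widening the field's visibility (a sketch; the accessor name is hypothetical):

```scala
@GuardedBy("this")
private var currentOffset: LongOffset = LongOffset(-1L)

// Tests read the offset through this method, so every access to the
// field happens while holding the monitor named by the annotation.
private[sources] def getCurrentOffset(): LongOffset = synchronized {
  currentOffset.copy()
}
```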

@tdas
Contributor

tdas commented Mar 1, 2018

@jerryshao please address the above comment, then we are good to merge!

@jerryshao
Contributor Author

Sure, I will do it today.

@SparkQA

SparkQA commented Mar 1, 2018

Test build #87819 has finished for PR 20382 at commit 1073be4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas
Contributor

tdas commented Mar 1, 2018

A relevant test failed. Please make sure there is no flakiness in the tests.

@SparkQA

SparkQA commented Mar 1, 2018

Test build #87825 has finished for PR 20382 at commit 6d38bed.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 1, 2018

Test build #87831 has finished for PR 20382 at commit 762f1da.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas
Contributor

tdas commented Mar 2, 2018

LGTM. Merging to master.

@asfgit asfgit closed this in 707e650 Mar 2, 2018
otterc pushed a commit to linkedin/spark that referenced this pull request Mar 22, 2023

Closes apache#20382 from jerryshao/SPARK-23097.

Ref: LIHADOOP-48531