
[SPARK-20979][SS]Add RateSource to generate values for tests and benchmark #18199

Closed · wants to merge 11 commits into master from zsxwing:rate

Conversation

zsxwing (Member) commented Jun 5, 2017

What changes were proposed in this pull request?

This PR adds RateSource for Structured Streaming so that users can easily generate data for tests and benchmarks.

This source generates incrementing long values with timestamps. Each generated row has two columns: a timestamp column for the generation time and an auto-incrementing long column starting at 0L.

It supports the following options:

  • rowsPerSecond (e.g. 100, default: 1): How many rows should be generated per second.
  • rampUpTime (e.g. 5s, default: 0s): How long to ramp up before the generation speed reaches rowsPerSecond. Granularities finer than seconds are truncated to integer seconds.
  • numPartitions (e.g. 10, default: Spark's default parallelism): The number of partitions for the generated rows. The source will try its best to reach rowsPerSecond, but the query may be resource constrained; numPartitions can be tweaked to help reach the desired speed.

Here is a simple example that prints 10 rows per second:

    spark.readStream
      .format("rate")
      .option("rowsPerSecond", "10")
      .load()
      .writeStream
      .format("console")
      .start()
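
The resulting streaming DataFrame can be inspected like any other. A minimal sketch, assuming a SparkSession named spark; the column names shown are inferred from the description above:

    val df = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "10")
      .load()
    df.printSchema()
    // root
    //  |-- timestamp: timestamp (nullable = true)
    //  |-- value: long (nullable = true)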

The idea came from @marmbrus and he did the initial work.

How was this patch tested?

The added tests.

SparkQA commented Jun 5, 2017

Test build #77730 has started for PR 18199 at commit 5b45b1b.

zsxwing (Member Author) commented Jun 5, 2017

retest this please

SparkQA commented Jun 5, 2017

Test build #77749 has finished for PR 18199 at commit 5b45b1b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

brkyvz (Contributor) left a comment:

It would be great to document how this source works. It would also be great to say that we try our best to reach tuplesPerSecond, but we may be resource constrained, and numPartitions can be tweaked to help reach the desired rate.

    val tuplesPerSecond = params.get("tuplesPerSecond").map(_.toLong).getOrElse(1L)
    if (tuplesPerSecond <= 0) {
      throw new IllegalArgumentException(
        s"Invalid value '${params("tuplesPerSecond")}' for option 'tuplesPerSecond', " +
Contributor:

nit: Invalid value '${params("tuplesPerSecond")}'. The option 'tuplesPerSecond' must be positive?

"must be positive")
}

    val rampUpTimeSeconds = params.get("rampUpTimeSeconds").map(_.toLong).getOrElse(0L)
Contributor:

I wonder if we should take this value as a duration string? e.g. option("rampUpTime", "5s")
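
A minimal sketch of what parsing such a duration string could look like; parseDuration is a hypothetical helper for illustration, not code from this PR:

    import java.util.concurrent.TimeUnit

    // Hypothetical: accept "500ms", "5s", "2m", or a bare number of seconds.
    // Sub-second values truncate to integer seconds, matching the option docs.
    def parseDuration(s: String): Long = s.trim match {
      case d if d.endsWith("ms") => TimeUnit.MILLISECONDS.toSeconds(d.dropRight(2).trim.toLong)
      case d if d.endsWith("s")  => d.dropRight(1).trim.toLong
      case d if d.endsWith("m")  => TimeUnit.MINUTES.toSeconds(d.dropRight(1).trim.toLong)
      case d                     => d.toLong
    }

    parseDuration("5s")    // 5
    parseDuration("2m")    // 120
    parseDuration("500ms") // 0 (truncated)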

    private val maxSeconds = Long.MaxValue / tuplesPerSecond

    if (rampUpTimeSeconds > maxSeconds) {
      throw new ArithmeticException("integer overflow. Max offset with tuplesPerSecond " +
Contributor:

nit: "Integer" should be capitalized. Also, it may be better to write $tuplesPerSecond tuplesPerSecond instead of tuplesPerSecond $tuplesPerSecond.

s"$tuplesPerSecond is $maxSeconds, but 'rampUpTimeSeconds' is $rampUpTimeSeconds.")
}

    private val startTimeMs = {
Contributor:

Do we need to go to this complexity for this source?

Member Author:

It's better to add versioning at the beginning; just a lesson learned from the Kafka source.
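
A minimal sketch of the versioning idea, purely illustrative (the helper names and metadata layout here are assumptions, not the PR's exact code):

    // Persist a version alongside the source's metadata (e.g. startTimeMs) so the
    // on-disk format can evolve without breaking old checkpoints.
    val VERSION = 1
    def serialize(startTimeMs: Long): String = s"v$VERSION\n$startTimeMs"
    def deserialize(s: String): Long = s.split("\n") match {
      case Array(v, t) if v == s"v$VERSION" => t.toLong
      case _ => throw new IllegalStateException(s"Unsupported metadata format: $s")
    }

    deserialize(serialize(1496620800000L))  // 1496620800000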

    override def getBatch(start: Option[Offset], end: Offset): DataFrame = {
      val startSeconds = start.flatMap(LongOffset.convert(_).map(_.offset)).getOrElse(0L)
      val endSeconds = LongOffset.convert(end).map(_.offset).getOrElse(0L)
      assert(startSeconds <= endSeconds)
Contributor:

nit: A meaningful assertion message would be useful

      val endSeconds = LongOffset.convert(end).map(_.offset).getOrElse(0L)
      assert(startSeconds <= endSeconds)
      if (endSeconds > maxSeconds) {
        throw new ArithmeticException("integer overflow. Max offset with " +
Contributor:

ditto

brkyvz (Contributor) commented Jun 5, 2017

You basically read my mind for the formulation!


    val clock = if (useManualClock) new ManualClock else new SystemClock

    private val maxSeconds = Long.MaxValue / tuplesPerSecond
zsxwing (Member Author), Jun 5, 2017:

This will be <= the real maximum allowed seconds because it doesn't take rampUpTimeSeconds into consideration. I didn't find a simple way to detect overflow quickly with rampUpTimeSeconds.

However, this should be fine because users will rarely hit this problem. The overflow detection is just to avoid surprising people, because range returns an empty RDD if overflow happens (see the code below).

    scala> sc.range(Long.MaxValue, Long.MinValue, 1).count()
    res0: Long = 0


    val localStartTimeMs = startTimeMs + TimeUnit.SECONDS.toMillis(startSeconds)
    val relativeMsPerValue =
      TimeUnit.SECONDS.toMillis(endSeconds - startSeconds) / (rangeEnd - rangeStart)
Contributor:

Integer division bug! This can easily return 0, right?

Member Author:

@brkyvz did you mean rangeEnd - rangeStart == 0? It's handled above.

Contributor:

No, I mean endSeconds - startSeconds => 2000 and rangeEnd - rangeStart => 50,000, so
relativeMsPerValue = 0.

Member Author:

I assume you meant TimeUnit.SECONDS.toMillis(endSeconds - startSeconds) => 2000. Yeah, in that case it's fine: it will return multiple rows with the same timestamp.

Contributor:

I guess that may be okay, but it won't create a uniform distribution of event timestamps in that case. Not sure if that's a requirement.

Member Author:

OK. I changed it to make the event timestamps uniformly distributed. The code becomes more complicated though.
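
For illustration, a minimal sketch of integer-only spreading like the one discussed above (the names msOffset, valueSizePerMs, and remainderValue are illustrative, not necessarily the PR's exact code). With 1500 values over 1000 ms:

    val numValues = 1500L
    val durationMs = 1000L
    val valueSizePerMs = numValues / durationMs  // 1: every ms gets at least one value
    val remainderValue = numValues % durationMs  // 500: this many extra values to spread

    // Millisecond offset of the i-th value (0-based) within the interval: the first
    // (valueSizePerMs + 1) * remainderValue values are packed (valueSizePerMs + 1)
    // per ms; the rest go valueSizePerMs per ms.
    def msOffset(i: Long): Long =
      if (i < (valueSizePerMs + 1) * remainderValue) i / (valueSizePerMs + 1)
      else remainderValue + (i - (valueSizePerMs + 1) * remainderValue) / valueSizePerMs

    msOffset(0)     // 0
    msOffset(999)   // 499: the first 1000 values share 500 ms, two per ms
    msOffset(1000)  // 500: the remaining 500 values get one ms each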

SparkQA commented Jun 5, 2017

Test build #77757 has finished for PR 18199 at commit 3a95b55.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jun 5, 2017

Test build #77760 has finished for PR 18199 at commit ad32a7f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -199,13 +199,52 @@ class RateStreamSource(
    }

    val localStartTimeMs = startTimeMs + TimeUnit.SECONDS.toMillis(startSeconds)
    val relativeMsPerValue =
      TimeUnit.SECONDS.toMillis(endSeconds - startSeconds) / (rangeEnd - rangeStart)
Contributor:

I thought you would only change it to TimeUnit.SECONDS.toMillis(endSeconds - startSeconds).toDouble. I wasn't expecting this much change!

Member Author:

Just to avoid floating point inaccuracy :)
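
The classic illustration of the inaccuracy being traded away here:

    scala> 0.1 + 0.2 == 0.3
    res0: Boolean = false

    scala> 0.1 + 0.2
    res1: Double = 0.30000000000000004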

Contributor:

I guess that's acceptable. Right now the code got overcomplicated :(

SparkQA commented Jun 6, 2017

Test build #77762 has finished for PR 18199 at commit 240c27b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jun 9, 2017

Test build #77846 has finished for PR 18199 at commit 240c27b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

    // The following condition is the same as
    // "relativeValue < (valueSizePerMs + 1) * remainderValue", just rewritten to
    // avoid overflow.
    if (relativeValue - remainderValue < valueSizePerMs * remainderValue) {
Contributor:

Can we add parentheses around relativeValue - remainderValue?

Contributor:

Also around valueSizePerMs * remainderValue, i.e.
(relativeValue - remainderValue) < (valueSizePerMs * remainderValue).
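
For reference, the equivalence behind the rewrite is a single rearrangement step:

    relativeValue < (valueSizePerMs + 1) * remainderValue
    relativeValue < valueSizePerMs * remainderValue + remainderValue
    relativeValue - remainderValue < valueSizePerMs * remainderValue

And a small sketch of the overflow the rewrite avoids; the values are chosen for illustration only:

    val remainderValue = 2L
    val valueSizePerMs = Long.MaxValue / 2   // 4611686018427387903
    valueSizePerMs * remainderValue          // Long.MaxValue - 1: still fine
    (valueSizePerMs + 1) * remainderValue    // silently wraps to Long.MinValue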

      .as[(java.sql.Timestamp, Long)]
      .map(v => (v._1.getTime, v._2))
    val expectedAnswer =
      (0 until 1000).map(v => (v / 2, v)) ++ // Two values share the same timestamp.
Contributor:

why is this?

Member Author:

Because there are only 1000 millisecond timestamps in one second but we have 1500 values.

      val VERSION = 1
    }

    class RateStreamSource(
Contributor:

Should we add an InterfaceStability.Evolving annotation? I don't know where we use those; just in case we change the naming, etc.

Member Author:

I don't think so. The class won't appear in the public Scaladoc/Javadoc, so users cannot see this tag anywhere unless they jump to this file.


zsxwing (Member Author) commented Jun 9, 2017

@brkyvz I changed it to use double to simplify the code.

SparkQA commented Jun 9, 2017

Test build #77858 has finished for PR 18199 at commit d5e7492.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

test("valueAtSecond") {
import RateStreamSource._

assert(valueAtSecond(seconds = 0, rowsPerSecond = 5, rampUpTimeSeconds = 2) === 0)
Contributor:

It would be nice to add one test where rampUpTimeSeconds = 0.
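
For context, a simplified sketch of what such a ramp-up computation can look like. This is an illustration under assumptions, not the PR's exact code; the real implementation also guards against overflow:

    // Speed grows linearly by speedDeltaPerSecond until it reaches rowsPerSecond;
    // the result is the cumulative value emitted through second `seconds`.
    def valueAtSecond(seconds: Long, rowsPerSecond: Long, rampUpTimeSeconds: Long): Long = {
      val speedDeltaPerSecond = rowsPerSecond / (rampUpTimeSeconds + 1)
      if (seconds <= rampUpTimeSeconds) {
        // Arithmetic series: delta * (1 + 2 + ... + seconds)
        speedDeltaPerSecond * seconds * (seconds + 1) / 2
      } else {
        valueAtSecond(rampUpTimeSeconds, rowsPerSecond, rampUpTimeSeconds) +
          (seconds - rampUpTimeSeconds) * rowsPerSecond
      }
    }

    valueAtSecond(0, 5, 2)  // 0, matching the assertion above
    valueAtSecond(2, 5, 2)  // delta = 5 / 3 = 1, so 1 * 2 * 3 / 2 = 3
    valueAtSecond(3, 5, 2)  // 3 + 5 = 8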

brkyvz (Contributor) commented Jun 9, 2017

Left one last comment. Otherwise LGTM

SparkQA commented Jun 12, 2017

Test build #77901 has finished for PR 18199 at commit 1d8454d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

brkyvz (Contributor) left a comment:

Just noticed that we don't have a nice toString method for this source. It can be added in a follow-up.

zsxwing (Member Author) commented Jun 12, 2017

> Just noticed that we don't have a nice toString method for this source. It can be added in a follow-up.

Let me just do it now, since it's pretty easy.
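
A minimal sketch of the kind of toString that was requested; the class and field names here are assumptions based on the options above, not the exact committed code:

    // Sketch: describe the source by its effective options.
    case class RateSourceDescription(
        rowsPerSecond: Long,
        rampUpTimeSeconds: Long,
        numPartitions: Int) {
      override def toString: String =
        s"RateSource[rowsPerSecond=$rowsPerSecond, rampUpTimeSeconds=$rampUpTimeSeconds, " +
          s"numPartitions=$numPartitions]"
    }

    RateSourceDescription(10, 0, 2).toString
    // RateSource[rowsPerSecond=10, rampUpTimeSeconds=0, numPartitions=2]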

SparkQA commented Jun 12, 2017

Test build #77942 has finished for PR 18199 at commit a2fa0b7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jun 12, 2017

Test build #77943 has finished for PR 18199 at commit 53a65fb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

zsxwing (Member Author) commented Jun 12, 2017

Thanks! Merging to master.

asfgit closed this in 74a432d Jun 12, 2017
zsxwing deleted the rate branch June 12, 2017 22:03
asfgit pushed a commit that referenced this pull request Jun 13, 2017
[SPARK-20979][SS]Add RateSource to generate values for tests and benchmark

## What changes were proposed in this pull request?

This PR adds RateSource for Structured Streaming so that users can easily generate data for tests and benchmarks.

This source generates incrementing long values with timestamps. Each generated row has two columns: a timestamp column for the generation time and an auto-incrementing long column starting at 0L.

It supports the following options:
- `rowsPerSecond` (e.g. 100, default: 1): How many rows should be generated per second.
- `rampUpTime` (e.g. 5s, default: 0s): How long to ramp up before the generation speed reaches `rowsPerSecond`. Granularities finer than seconds are truncated to integer seconds.
- `numPartitions` (e.g. 10, default: Spark's default parallelism): The number of partitions for the generated rows. The source will try its best to reach `rowsPerSecond`, but the query may be resource constrained; `numPartitions` can be tweaked to help reach the desired speed.

Here is a simple example that prints 10 rows per second:
```
    spark.readStream
      .format("rate")
      .option("rowsPerSecond", "10")
      .load()
      .writeStream
      .format("console")
      .start()
```

The idea came from marmbrus and he did the initial work.

## How was this patch tested?

The added tests.

Author: Shixiong Zhu <shixiong@databricks.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #18199 from zsxwing/rate.
zsxwing (Member Author) commented Jun 13, 2017

I also merged this to branch-2.2 since this is a separate, self-contained feature.

dataknocker pushed a commit to dataknocker/spark that referenced this pull request Jun 16, 2017
[SPARK-20979][SS]Add RateSource to generate values for tests and benchmark