[SPARK-18516][SQL] Split state and progress in streaming #15954

marmbrus · 2016-11-21T05:20:42Z

This PR separates the status of a `StreamingQuery` into two separate APIs:

- `status` - describes the status of a `StreamingQuery` at this moment, including which phase of processing is currently happening and whether data is available.
- `recentProgress` - an array of statistics about the most recent microbatches that have executed.

Each entry in `recentProgress` contains the following information:

{
  "id" : "2be8670a-fce1-4859-a530-748f29553bb6",
  "name" : "query-29",
  "timestamp" : 1479705392724,
  "inputRowsPerSecond" : 230.76923076923077,
  "processedRowsPerSecond" : 10.869565217391303,
  "durationMs" : {
    "triggerExecution" : 276,
    "queryPlanning" : 3,
    "getBatch" : 5,
    "getOffset" : 3,
    "addBatch" : 234,
    "walCommit" : 30
  },
  "currentWatermark" : 0,
  "stateOperators" : [ ],
  "sources" : [ {
    "description" : "KafkaSource[Subscribe[topic-14]]",
    "startOffset" : {
      "topic-14" : {
        "2" : 0,
        "4" : 1,
        "1" : 0,
        "3" : 0,
        "0" : 0
      }
    },
    "endOffset" : {
      "topic-14" : {
        "2" : 1,
        "4" : 2,
        "1" : 0,
        "3" : 0,
        "0" : 1
      }
    },
    "numRecords" : 3,
    "inputRowsPerSecond" : 230.76923076923077,
    "processedRowsPerSecond" : 10.869565217391303
  } ]
}
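To make the relationship between the offset fields and `numRecords` concrete, here is a minimal Python sketch that parses an abbreviated copy of the JSON above and derives the per-partition record counts. The helper name `records_per_partition` is made up for illustration, and the nested topic/partition layout is assumed to be specific to the Kafka source shown here; other sources may report offsets in a different shape.

```python
import json

# Abbreviated copy of the progress JSON shown above (single Kafka source).
progress_json = """
{
  "id": "2be8670a-fce1-4859-a530-748f29553bb6",
  "name": "query-29",
  "sources": [{
    "description": "KafkaSource[Subscribe[topic-14]]",
    "startOffset": {"topic-14": {"2": 0, "4": 1, "1": 0, "3": 0, "0": 0}},
    "endOffset": {"topic-14": {"2": 1, "4": 2, "1": 0, "3": 0, "0": 1}},
    "numRecords": 3
  }]
}
"""


def records_per_partition(source):
    """Return {(topic, partition): endOffset - startOffset} for one source entry.

    Hypothetical helper: assumes the Kafka-style nested offset layout above.
    """
    deltas = {}
    for topic, end_parts in source["endOffset"].items():
        start_parts = source["startOffset"].get(topic, {})
        for partition, end in end_parts.items():
            deltas[(topic, partition)] = end - start_parts.get(partition, 0)
    return deltas


progress = json.loads(progress_json)
source = progress["sources"][0]
deltas = records_per_partition(source)
total = sum(deltas.values())
print(deltas, total)
```

For this example the per-partition differences add up to the reported `numRecords`.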

Additionally, to make it possible to correlate progress updates across restarts, we change the `id` field from an integer that is unique within the JVM to a UUID that is globally unique.
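As a rough illustration of why a globally unique id helps, the following Python sketch groups progress updates by `id`, assuming the id really is stable across restarts as the description intends. The `run` field and all timestamps except the first are invented for the example.

```python
import uuid
from collections import defaultdict


def group_by_query_id(progress_updates):
    """Group progress dicts by their parsed UUID id."""
    grouped = defaultdict(list)
    for update in progress_updates:
        grouped[uuid.UUID(update["id"])].append(update)
    return dict(grouped)


query_id = "2be8670a-fce1-4859-a530-748f29553bb6"
updates = [
    # "run" is a hypothetical label, not a field of the real progress object.
    {"id": query_id, "run": "before restart", "timestamp": 1479705392724},
    {"id": query_id, "run": "after restart", "timestamp": 1479705999000},
    {"id": str(uuid.uuid4()), "run": "other query", "timestamp": 1479705400000},
]

grouped = group_by_query_id(updates)
same_query = grouped[uuid.UUID(query_id)]
print(len(grouped), [u["run"] for u in same_query])
```

With a JVM-local integer counter, the two runs of the same query could not be matched this way, because the counter starts over in each process.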

marmbrus · 2016-11-21T05:21:21Z

/cc @tdas

SparkQA · 2016-11-21T05:24:36Z

Test build #68916 has finished for PR 15954 at commit 213081a.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-11-21T07:29:41Z

Test build #68923 has finished for PR 15954 at commit f4357d1.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-11-21T17:39:22Z

Test build #68944 has finished for PR 15954 at commit 6eb1396.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-11-21T19:52:16Z

Test build #68945 has finished for PR 15954 at commit f1bd871.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

tdas

Round 1 of comments. I am still looking.

tdas · 2016-11-22T00:54:37Z

sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQuery.scala

   */
-  @deprecated("use status.sourceStatuses", "2.0.2")
-  def sourceStatuses: Array[SourceStatus]
+  def recentProgress: Array[StreamingQueryProgress]


Shouldn't this be recentProgresses?

Hmmm, yeah, maybe. It's not clear to me that progress is inherently singular, and progresses is kind of a mouthful. It is maybe nice for Arrays to always be plural, though.

Well, in the name StreamingQueryProgress we have effectively defined that "progress" means data from one trigger (i.e. singular). So I think it's better for it to be progresses.

tdas · 2016-11-22T00:56:39Z

sql/core/src/main/scala/org/apache/spark/sql/streaming/SourceProgress.scala

+
+/**
+ * :: Experimental ::
+ * Reports metrics on data being read from a given streaming source.


This should say that this information relates to a trigger in which progress was made in processing data from sources.

Sure, we can copy the docs from the main class: Each event relates to processing done for a single trigger of the streaming query.

tdas · 2016-11-22T00:56:48Z

sql/core/src/main/scala/org/apache/spark/sql/streaming/SourceProgress.scala

+ * @param description Description of the source.
+ * @param startOffset The starting offset for data being read.
+ * @param endOffset The ending offset for data being read.
+ * @param numRecords The number of records read from this source.


Is this the number of records read from this source since the beginning, or in the last trigger?

I think if we update the docs as you suggest above this will be clear.

tdas · 2016-11-22T02:00:22Z

sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryProgress.scala

+ * Holds statistics about state that is being stored for a given streaming query.
+ */
+@Experimental
+class StateOperator private[sql](


StateOperator -> StateOperatorProgress

tdas · 2016-11-22T02:01:01Z

sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryProgress.scala

+ */
+@Experimental
+class StateOperator private[sql](
+    val numEntries: Long,


numEntries -> numTotal would make it more consistent with numUpdated

Also, this needs docs to make it clear that numUpdated is relative to the last progress.

+1 to docs. I think numTotal is less clear. Total of what? It is a count of the number of entries that the state store is holding.

Same question then applies to numUpdated as well.
How about numRowsTotal, numRowsUpdated?
Then in future we could add sizeBytesTotal, sizeBytesUpdated, etc.

Those sound good.
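To illustrate the distinction the two proposed names are meant to capture, here is a toy Python sketch: `numRowsTotal` is the number of entries the store currently holds, while `numRowsUpdated` only covers the trigger that just finished. `ToyStateStore` is entirely hypothetical, and counting distinct updated keys (rather than individual writes) is an assumption made for the example, not something the thread specifies.

```python
class ToyStateStore:
    """Hypothetical in-memory stand-in for a state store."""

    def __init__(self):
        self._rows = {}
        self._updated_keys = set()

    def put(self, key, value):
        self._rows[key] = value
        self._updated_keys.add(key)

    def finish_trigger(self):
        """Report counts for the trigger that just ended, then reset the per-trigger part."""
        progress = {
            "numRowsTotal": len(self._rows),             # everything currently held
            "numRowsUpdated": len(self._updated_keys),   # only this trigger
        }
        self._updated_keys.clear()
        return progress


store = ToyStateStore()
store.put("a", 1)
store.put("b", 2)
first = store.finish_trigger()

store.put("a", 3)
second = store.finish_trigger()
print(first, second)
```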

SparkQA · 2016-11-22T03:46:05Z

Test build #68972 has finished for PR 15954 at commit 59a9139.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-11-29T02:20:47Z

Test build #69276 has finished for PR 15954 at commit 247ada6.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class AdvanceManualClock(timeToAdd: Long) extends StreamAction

SparkQA · 2016-11-29T02:34:38Z

Test build #69275 has finished for PR 15954 at commit c64632c.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-11-29T03:12:54Z

Test build #69278 has finished for PR 15954 at commit 8ac76c6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-11-29T04:12:41Z

Test build #69295 has started for PR 15954 at commit b41a662.

SparkQA · 2016-11-29T04:33:58Z

Test build #69293 has finished for PR 15954 at commit d6200d1.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-11-29T04:57:33Z

Test build #69288 has finished for PR 15954 at commit f6d60df.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-11-29T05:16:42Z

Test build #69290 has finished for PR 15954 at commit 32ff04e.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-11-29T07:42:34Z

Test build #69308 has started for PR 15954 at commit d92f4bf.

marmbrus

Looks pretty good! Thanks for your help getting the tests back in shape. I think the only important question to answer before we can merge is the contents of QueryTerminatedEvent.

marmbrus · 2016-11-29T19:00:58Z

python/pyspark/sql/streaming.py

+        Returns the most recent :class:`StreamingQueryProgress` update of this streaming query.
+        :return: a map
+        """
+        return json.loads(self._jsq.lastProgress().toString())


I'd use `json` as above instead of relying on the fact that `toString` happens to return JSON.
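The concern can be shown with a small plain-Python sketch. `FakeProgress` is a hypothetical stand-in for the JVM progress object: it exposes an explicit `json()` accessor, while its string form is free to be anything. Parsing the string form only works for as long as it happens to be JSON.

```python
import json


class FakeProgress:
    """Hypothetical stand-in for a progress object with an explicit JSON accessor."""

    def __init__(self, fields):
        self._fields = fields

    def json(self):
        return json.dumps(self._fields)

    def __str__(self):
        # A string form that is NOT JSON, as a toString is allowed to be.
        return "FakeProgress(name=%s)" % self._fields.get("name")


progress = FakeProgress({"name": "query-29", "timestamp": 1479705392724})

# Robust: parse the explicit JSON representation.
parsed = json.loads(progress.json())

# Fragile: parsing the string form fails as soon as it is not JSON.
try:
    json.loads(str(progress))
    str_is_json = True
except ValueError:
    str_is_json = False

print(parsed["name"], str_is_json)
```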

marmbrus · 2016-11-29T19:05:38Z

sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQuery.scala

-   * all queries that have been started in the current process.
-   * @since 2.0.0
+   * Returns the unique id of this query.  An id is tied to the checkpoint location and will
+   * be the same across restarts of a given streaming query.


We should fix the TODO earlier, or remove this promise for now.

marmbrus · 2016-11-29T19:06:27Z

sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQuery.scala

@@ -51,7 +53,7 @@ trait StreamingQuery {
  def sparkSession: SparkSession

  /**
-   * Whether the query is currently active or not
+   * Returns `true` if this query is actively running.
   * @since 2.0.0


Nit: other places have a blank line before @since

marmbrus · 2016-11-29T19:08:54Z

sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryListener.scala

   */
  @Experimental
  class QueryTerminatedEvent private[sql](
-      val queryStatus: StreamingQueryStatus,
-      val exception: Option[String]) extends Event
+    val lastProgress: StreamingQueryProgress,


What is this if no progress is ever made? null? I would consider leaving this just the id, because otherwise if the query dies before progress is made, now you can't get the id at all.
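A minimal Python sketch of the alternative suggested here, where the termination event carries only the query id and an optional exception. The dataclass mirrors the shape under discussion but is not the Scala class itself, and the exception text is invented for the example.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class QueryTerminatedEvent:
    """Illustrative shape only: id plus optional exception, no progress object."""
    id: uuid.UUID
    exception: Optional[str] = None


def describe(event):
    """Build a log line; works even if the query never reported progress."""
    if event.exception is None:
        return "query %s stopped cleanly" % event.id
    return "query %s failed: %s" % (event.id, event.exception)


early_failure = QueryTerminatedEvent(
    id=uuid.UUID("2be8670a-fce1-4859-a530-748f29553bb6"),
    exception="source unavailable",  # hypothetical message
)
message = describe(early_failure)
print(message)
```

Because the id is a field of the event itself, a listener can still identify a query that died before producing any progress update.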

marmbrus · 2016-11-29T19:10:16Z

sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryManager.scala

+
+object StreamingQueryManager {
+  private val _nextId = new AtomicLong(0)
+  def nextId: Long = _nextId.getAndIncrement()


marmbrus · 2016-11-29T19:10:22Z

sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryManager.scala

@@ -279,3 +287,8 @@ class StreamingQueryManager private[sql] (sparkSession: SparkSession) {
    }
  }
 }
+
+object StreamingQueryManager {


Can you tell me why this was moved from StreamExecution?

I made it private, but I feel this could have stayed in object StreamExecution.

marmbrus · 2016-11-29T19:10:44Z

sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryProgress.scala

+
+/**
+ * :: Experimental ::
+ * Statistics about updates made to a stateful operators in a [[StreamingQuery]] in a trigger.


nit: during a trigger?

marmbrus · 2016-11-29T19:12:05Z

sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryProgress.scala

+ * Statistics about updates made to a stateful operators in a [[StreamingQuery]] in a trigger.
+ */
+@Experimental
+class StateOperatorProgress private[sql](


We should move SourceProgress here or put StateOperatorProgress in its own file. We might also consider putting them all in org.apache.spark.sql.streaming.progress, but there might not be time.

marmbrus · 2016-11-29T19:12:31Z

sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryProgress.scala

+  val id: UUID,
+  val name: String,
+  val timestamp: Long,
+  val batchId: Long, // TODO: epoch?


We probably will not do this TODO.

marmbrus · 2016-11-29T19:14:19Z

sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala

-
-  class QueryStatusCollector extends StreamingQueryListener {
+  /** Collects events from the StreamingQueryListener for testing */
+  class EventCollector extends StreamingQueryListener {


I don't think this needs to be an inner class of StreamTest. This file is pretty long/complicated as is.

Ah yes, it was being used in multiple files at some point, so I put it here, but it is not anymore. I will put it only in StreamingQueryListener.

SparkQA · 2016-11-29T22:12:25Z

Test build #69352 has finished for PR 15954 at commit aa8af9c.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-11-29T22:27:36Z

Test build #69353 has finished for PR 15954 at commit c11d2e5.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

brkyvz · 2016-11-29T22:32:46Z

sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQuery.scala

   */
-  @deprecated("use status.sourceStatuses", "2.0.2")
-  def sourceStatuses: Array[SourceStatus]
+  def recentProgresses: Array[StreamingQueryProgress]


Are these for the last n triggers? Or are they the last n instantaneous progress updates, e.g. finished reading from a source?

Last n triggers.
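One way to picture "last n triggers" is a bounded buffer that receives one entry per completed trigger and silently drops the oldest. The Python sketch below assumes that model; `ProgressBuffer`, its method names, and the value of n are hypothetical and say nothing about how Spark stores the entries internally.

```python
from collections import deque


class ProgressBuffer:
    """Hypothetical bounded buffer holding one entry per finished trigger."""

    def __init__(self, max_entries):
        # deque with maxlen discards the oldest entry once full.
        self._entries = deque(maxlen=max_entries)

    def on_trigger_finished(self, progress):
        self._entries.append(progress)

    def recent_progresses(self):
        return list(self._entries)

    def last_progress(self):
        return self._entries[-1] if self._entries else None


buffer = ProgressBuffer(max_entries=3)
for batch_id in range(5):
    buffer.on_trigger_finished({"batchId": batch_id})

recent = [p["batchId"] for p in buffer.recent_progresses()]
print(recent, buffer.last_progress())
```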

brkyvz · 2016-11-29T22:46:05Z

sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryManager.scala

+   *
+   * @since 2.1.0
+   */
+  def get(id: String): StreamingQuery = get(UUID.fromString(id))


With this, I guess we can't provide APIs for get(name: String).

I think that's okay. A globally unique ID is a better identifier.
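The string overload in the diff simply parses the string into a UUID and delegates. A rough Python analogue, using a hypothetical `QueryRegistry` that represents queries as plain dicts, looks like this:

```python
import uuid


class QueryRegistry:
    """Hypothetical registry of active queries keyed by UUID."""

    def __init__(self):
        self._active = {}

    def register(self, query):
        self._active[query["id"]] = query

    def get(self, query_id):
        # Accept either a UUID or its string form; a malformed string raises ValueError.
        if isinstance(query_id, str):
            query_id = uuid.UUID(query_id)
        return self._active.get(query_id)


registry = QueryRegistry()
qid = uuid.UUID("2be8670a-fce1-4859-a530-748f29553bb6")
registry.register({"id": qid, "name": "query-29"})

found = registry.get("2be8670a-fce1-4859-a530-748f29553bb6")
print(found["name"])
```

Since any string argument is interpreted as an id, a lookup by name cannot share the same signature, which is the trade-off raised in the comment above.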

SparkQA · 2016-11-29T23:35:43Z

Test build #69346 has finished for PR 15954 at commit d9d8f82.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

brkyvz · 2016-11-30T00:31:24Z

sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQuery.scala

@@ -33,25 +35,27 @@ trait StreamingQuery {
   * Returns the name of the query. This name is unique across all active queries. This can be


is this doc still true?

yeah. it is.

SparkQA · 2016-11-30T01:19:20Z

Test build #69355 has finished for PR 15954 at commit 69d9b4a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

marmbrus · 2016-11-30T01:23:24Z

LGTM, merging to master and 2.1

Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #15954 from marmbrus/queryProgress.

(cherry picked from commit c3d08e2)
Signed-off-by: Michael Armbrust <michael@databricks.com>

…erver

## What changes were proposed in this pull request?

As `queryStatus` in StreamingQueryListener events was removed in #15954, parsing 2.0.2 structured streaming logs will throw the following error:

```
[info] com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: Unrecognized field "queryStatus" (class org.apache.spark.sql.streaming.StreamingQueryListener$QueryTerminatedEvent), not marked as ignorable (2 known properties: "id", "exception"])
[info] at [Source: {"Event":"org.apache.spark.sql.streaming.StreamingQueryListener$QueryTerminatedEvent","queryStatus":{"name":"query-1","id":1,"timestamp":1480491532753,"inputRate":0.0,"processingRate":0.0,"latency":null,"sourceStatuses":[{"description":"FileStreamSource[file:/Users/zsx/stream]","offsetDesc":"#0","inputRate":0.0,"processingRate":0.0,"triggerDetails":{"latency.getOffset.source":"1","triggerId":"1"}}],"sinkStatus":{"description":"FileSink[/Users/zsx/stream2]","offsetDesc":"[#0]"},"triggerDetails":{}},"exception":null}; line: 1, column: 521] (through reference chain: org.apache.spark.sql.streaming.QueryTerminatedEvent["queryStatus"])
[info] at com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException.from(UnrecognizedPropertyException.java:51)
[info] at com.fasterxml.jackson.databind.DeserializationContext.reportUnknownProperty(DeserializationContext.java:839)
[info] at com.fasterxml.jackson.databind.deser.std.StdDeserializer.handleUnknownProperty(StdDeserializer.java:1045)
[info] at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.handleUnknownProperty(BeanDeserializerBase.java:1352)
[info] at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.handleUnknownProperties(BeanDeserializerBase.java:1306)
[info] at com.fasterxml.jackson.databind.deser.BeanDeserializer._deserializeUsingPropertyBased(BeanDeserializer.java:453)
[info] at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.deserializeFromObjectUsingNonDefault(BeanDeserializerBase.java:1099)
...
```

This PR just ignores such errors and adds a test to make sure we can read 2.0.2 logs.

## How was this patch tested?

`query-event-logs-version-2.0.2.txt` has all types of events generated by Structured Streaming in Spark 2.0.2. `testQuietly("ReplayListenerBus should ignore broken event jsons generated in 2.0.2")` verified we can load them without any error.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #16085 from zsxwing/SPARK-18655.

(cherry picked from commit c4979f6)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
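The "ignore such errors" approach described in the commit message above can be sketched in plain Python as a replay loop that skips any event it cannot interpret instead of aborting. This is only an illustration of the idea, not the ReplayListenerBus implementation: `replay`, `KNOWN_FIELDS`, and the shortened event names are assumptions made for the example.

```python
import json

# Fields the (hypothetical) current reader understands.
KNOWN_FIELDS = {"Event", "id", "exception"}


def replay(lines):
    """Return (events, skipped): parsed events plus a count of lines that were ignored."""
    events, skipped = [], 0
    for line in lines:
        try:
            record = json.loads(line)
        except ValueError:
            skipped += 1          # not valid JSON at all
            continue
        if set(record) - KNOWN_FIELDS:
            skipped += 1          # carries a field this reader no longer knows
            continue
        events.append(record)
    return events, skipped


old_style = ('{"Event": "QueryTerminatedEvent", '
             '"queryStatus": {"name": "query-1", "id": 1}, "exception": null}')
new_style = ('{"Event": "QueryTerminatedEvent", '
             '"id": "2be8670a-fce1-4859-a530-748f29553bb6", "exception": null}')

events, skipped = replay([old_style, new_style, "not json"])
print(len(events), skipped)
```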


marmbrus added 3 commits November 20, 2016 18:31

[SPARK-18516] Split state and progress in streaming

ae8a37c

Merge remote-tracking branch 'origin/master' into queryProgress

527c8d6

fix python

213081a

fix python style

f4357d1

avoid getOrDefault

6eb1396

MIMA

f1bd871

marmbrus added 2 commits November 21, 2016 16:38

Merge remote-tracking branch 'origin/master' into queryProgress

17b13fb

drop test

59a9139

tdas reviewed Nov 22, 2016

tdas added 6 commits November 28, 2016 07:47

Added tests

f5e1d09

More changes

82ecffe

More test fixes

c64632c

Fixed test

247ada6

More fixes

8ac76c6

Renamed recentProgress to recentProgresses

27e3b73

Handle Double.Nan in json

f6d60df

tdas added 3 commits November 28, 2016 19:16

Tests for python APIs

32ff04e

Added progress to termination event

d6200d1

Minor changes

b41a662

Merge remote-tracking branch 'apache-github/master' into queryProgress

d92f4bf

Added SinkProgress

8bd49b7

marmbrus commented Nov 29, 2016

Removed progress from termination event

d9d8f82

marmbrus changed the title ~~[WIP][SPARK-18516][SQL] Split state and progress in streaming~~ [SPARK-18516][SQL] Split state and progress in streaming Nov 29, 2016

tdas added 2 commits November 29, 2016 13:43

Removed unnecessary files, and addressed comments

aa8af9c

Addressed comment

c11d2e5

Fixed mima

69d9b4a

brkyvz reviewed Nov 29, 2016

brkyvz reviewed Nov 30, 2016

asfgit closed this in c3d08e2 Nov 30, 2016

zsxwing mentioned this pull request Nov 30, 2016

[SPARK-18655][SS]Ignore Structured Streaming 2.0.2 logs in history server #16085

Closed
