
[SPARK-24565][SS] Add API in Structured Streaming for exposing output rows of each microbatch as a DataFrame #21571

Closed
wants to merge 10 commits

Conversation

tdas
Contributor

@tdas tdas commented Jun 14, 2018

What changes were proposed in this pull request?

Currently, the micro-batches in the MicroBatchExecution are not exposed to the user through any public API. This was because we did not want to expose the micro-batches, so that all the APIs we expose can eventually be supported in the Continuous engine. But now that we have a better sense of building a ContinuousExecution, I am considering adding APIs that will run only on the MicroBatchExecution. I have quite a few use cases where exposing the micro-batch output as a DataFrame is useful.

  • Pass the output rows of each batch to a library that is designed only for batch jobs (for example, many ML libraries need to collect() while learning).
  • Reuse batch data sources for output whose streaming versions do not exist (e.g. the Redshift data source).
  • Write the output rows to multiple places by writing twice for each batch. This is not the most elegant approach for multiple-output streaming queries, but it is likely better than running two streaming queries that process the same data twice.

The proposal is to add a method foreachBatch(f: Dataset[T] => Unit) to Scala/Java/Python DataStreamWriter.
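For illustration, here is a minimal PySpark sketch of how such an API could be used; the rate source, output paths, and checkpoint location below are placeholders, not part of this PR:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("foreachBatchSketch").getOrCreate()

# Any streaming DataFrame works here; the rate source is just a stand-in.
stream_df = spark.readStream.format("rate").load()

def write_batch(batch_df, batch_id):
    # batch_df is a regular (non-streaming) DataFrame, so batch-only APIs
    # such as collect() and batch data sources can be used on it.
    batch_df.persist()
    batch_df.write.mode("append").parquet("/tmp/out1")  # placeholder path
    batch_df.write.mode("append").parquet("/tmp/out2")  # second sink, reusing the same batch
    batch_df.unpersist()

query = (stream_df.writeStream
         .foreachBatch(write_batch)
         .option("checkpointLocation", "/tmp/checkpoint")  # placeholder path
         .start())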

How was this patch tested?

New unit tests.

# "pyspark.sql.readwriter",
# "pyspark.sql.streaming",
# "pyspark.sql.udf",
# "pyspark.sql.window",
Contributor Author

temporary change for faster local testing. will remove before finalizing.

@@ -854,6 +856,20 @@ def trigger(self, processingTime=None, once=None, continuous=None):
self._jwrite = self._jwrite.trigger(jTrigger)
return self

def foreachBatch(self, func):
Contributor Author

add docs

@@ -269,6 +269,7 @@ def test_struct_field_type_name(self):
struct_field = StructField("a", IntegerType())
self.assertRaises(TypeError, struct_field.typeName)

'''
Contributor Author

temporary change for faster local testing. will remove before finalizing.

@SparkQA

SparkQA commented Jun 14, 2018

Test build #91873 has finished for PR 21571 at commit 3b7b20d.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class GBTClassificationModel(TreeEnsembleModel, JavaClassificationModel, JavaMLWritable,
  • class PrefixSpan(JavaParams):
  • public class MaskExpressionsUtils
  • case class ArrayRemove(left: Expression, right: Expression)
  • trait MaskLike
  • trait MaskLikeWithN extends MaskLike
  • case class Mask(child: Expression, upper: String, lower: String, digit: String)
  • case class MaskFirstN(
  • case class MaskLastN(
  • case class MaskShowFirstN(
  • case class MaskShowLastN(
  • case class MaskHash(child: Expression)
  • abstract class FileFormatDataWriter(
  • class EmptyDirectoryDataWriter(
  • class SingleDirectoryDataWriter(
  • class DynamicPartitionDataWriter(
  • class WriteJobDescription(
  • case class WriteTaskResult(commitMsg: TaskCommitMessage, summary: ExecutedWriteSummary)
  • case class ExecutedWriteSummary(

@SparkQA

SparkQA commented Jun 15, 2018

Test build #91875 has finished for PR 21571 at commit 985a4fe.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 15, 2018

Test build #91897 has finished for PR 21571 at commit 687402c.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas
Contributor Author

tdas commented Jun 15, 2018

@tdas tdas changed the title [WIP][SPARK-24565][SS] Add API in Structured Streaming for exposing output rows of each microbatch as a DataFrame to [SPARK-24565][SS] Add API in Structured Streaming for exposing output rows of each microbatch as a DataFrame Jun 15, 2018
@@ -145,3 +145,26 @@ def do_server_auth(conn, auth_secret):
if reply != "ok":
conn.close()
raise Exception("Unexpected reply from iterator server.")


def ensure_callback_server_started(gw):
Contributor Author

This was copied verbatim from python streaming/context.py
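For reference, the copied helper roughly does the following (a reconstruction from the streaming context code, not necessarily the exact body in this PR):

from py4j.java_gateway import JavaObject

def ensure_callback_server_started(gw):
    # Start the py4j callback server if it is not already running, so that
    # the JVM can call back into Python (needed for foreachBatch / foreach).
    if "_callback_server" not in gw.__dict__ or gw._callback_server is None:
        gw.callback_server_parameters.eager_load = True
        gw.callback_server_parameters.daemonize = True
        gw.callback_server_parameters.daemonize_connections = True
        gw.callback_server_parameters.port = 0  # let the OS pick a free port
        gw.start_callback_server(gw.callback_server_parameters)
        # gateway with real port
        cbport = gw._callback_server.server_socket.getsockname()[1]
        gw._callback_server.port = cbport
        gw._python_proxy_port = gw._callback_server.port
        # get the GatewayServer object in the JVM by ID and point its
        # callback client at the real port
        jgws = JavaObject("GATEWAY_SERVER", gw._gateway_client)
        jgws.resetCallbackClient(jgws.getCallbackClient().getAddress(), gw._python_proxy_port)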

# "pyspark.sql.readwriter",
# "pyspark.sql.streaming",
# "pyspark.sql.udf",
# "pyspark.sql.window",
Contributor Author

temp changes only to speed up local testing. will revert after first round of review.

Member

@zsxwing zsxwing left a comment

Overall looks good. Left some minor comments.

* @since 2.4.0
*/
@InterfaceStability.Evolving
def foreachBatch(function: (Dataset[T], Long) => Unit): DataStreamWriter[T] = {
Member

It's unclear that only one of foreachBatch and foreach can be set. Reading the doc, a user may think they can set both. Maybe we should disallow this case?

Contributor Author

Good point.

Contributor Author

@tdas tdas Jun 17, 2018

Well... that is an existing problem, because one can write the following confusing code:

df.writeStream.format("kafka").foreach(...).start()

This will execute the foreach, but it looks confusing nonetheless. In fact, you can also do

df.writeStream.format("kafka").format("bla").format("random")....

This is a general existing problem that should be addressed in a different PR.


wrapped_func = ForeachBatchFunction(self._spark, func)
gw.jvm.PythonForeachBatchHelper.callForeachBatch(self._jwrite, wrapped_func)
ensure_callback_server_started(gw)
Member

This should be above; otherwise there is a race where the streaming query calls this Python func before the callback server is started.

Contributor Author

This is not possible, because the callback from the JVM ForeachBatch sink to Python is made ONLY after the query is started. And the query cannot be started until this foreachBatch() method finishes and start() is called.

Member

Sorry if I'm mistaken, but can't we still put this above? It looks weird to ensure the callback server at the end.
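To make the suggestion concrete, the reordering being discussed would look roughly like this inside the Python foreachBatch method (a sketch reordering the fragment above; the helper's import location and the gateway lookup are assumptions):

from pyspark.java_gateway import ensure_callback_server_started  # assumed location of the helper

gw = self._spark._sc._gateway  # assumed way to reach the py4j gateway
ensure_callback_server_started(gw)  # ensure the callback server first, as suggested above
wrapped_func = ForeachBatchFunction(self._spark, func)
gw.jvm.PythonForeachBatchHelper.callForeachBatch(self._jwrite, wrapped_func)
return self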

@@ -62,6 +62,7 @@ def deco(*a, **kw):
try:
return f(*a, **kw)
except py4j.protocol.Py4JJavaError as e:
print(str(e))
Member

nit: remove this

@SparkQA

SparkQA commented Jun 16, 2018

Test build #91940 has finished for PR 21571 at commit 0763a44.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait MemorySinkBase extends BaseStreamingSink with Logging
  • class MemorySink(val schema: StructType, outputMode: OutputMode, options: DataSourceOptions)
  • class MemoryWriter(
  • class MemoryStreamWriter(

@SparkQA

SparkQA commented Jun 16, 2018

Test build #91938 has finished for PR 21571 at commit e8073ea.

  • This patch fails PySpark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 17, 2018

Test build #91982 has started for PR 21571 at commit 6f9fdf4.

@tdas
Contributor Author

tdas commented Jun 17, 2018

Jenkins retest this please

@SparkQA

SparkQA commented Jun 17, 2018

Test build #91994 has finished for PR 21571 at commit 6f9fdf4.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas
Contributor Author

tdas commented Jun 17, 2018

jenkins retest this please

@SparkQA

SparkQA commented Jun 18, 2018

Test build #92005 has finished for PR 21571 at commit 9062fb9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 18, 2018

Test build #92018 has finished for PR 21571 at commit 5b4252a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing
Member

zsxwing commented Jun 19, 2018

LGTM. Merging to master.

@asfgit asfgit closed this in 2cb9763 Jun 19, 2018
@HyukjinKwon
Member

HyukjinKwon commented Jun 20, 2018

Seems fine to me. Left one question and a few nits, but not a big deal.

q = None
collected = dict()

def collectBatch(batch_df, batch_id):
Member

collectBatch -> collect_batch per PEP 8.
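For context, the test helper under discussion, with the suggested name, might look roughly like this (a sketch; the rate source here is a stand-in for whatever the test actually reads):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
stream_df = spark.readStream.format("rate").load()  # stand-in for the test's source

collected = dict()

def collect_batch(batch_df, batch_id):
    # Materialize each micro-batch so the test can assert on its contents.
    collected[batch_id] = batch_df.collect()

q = stream_df.writeStream.foreachBatch(collect_batch).start()
q.processAllAvailable()
q.stop()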

# gateway with real port
gw._python_proxy_port = gw._callback_server.port
# get the GatewayServer object in JVM by ID
jgws = JavaObject("GATEWAY_SERVER", gw._gateway_client)
Member

Nit: we could remove this import in this file though.

@sidtandon2014

@tdas @SparkQA @zsxwing @HyukjinKwon,

I have a few questions related to batchId:

  1. If I stop the job and start it again, what will the batchId be (let's assume the last batchId was N)? Is the batchId dependent on the offset and partitionId?
  2. If I stop the job (or some error happens) and start it again (assume I am processing the Nth batch), will the Nth batch have more data this time (considering a continuous stream), or will the additional data belong to batch N+1?

otterc pushed a commit to linkedin/spark that referenced this pull request Mar 22, 2023
…tput rows of each microbatch as a DataFrame


Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes apache#21571 from tdas/foreachBatch.

Ref: LIHADOOP-48531

RB=1854649
G=superfriends-reviewers
R=latang,yezhou,zolin,fli,mshen
A=