
[SPARK-24565][SS] Add API in Structured Streaming for exposing output rows of each microbatch as a DataFrame #21571

Closed
wants to merge 10 commits

Conversation

tdas
Contributor

@tdas tdas commented Jun 14, 2018

What changes were proposed in this pull request?

Currently, the micro-batches in the MicroBatchExecution are not exposed to the user through any public API. This was because we did not want to expose the micro-batches, so that all the APIs we expose can eventually be supported in the Continuous engine. But now that we have a better sense of building a ContinuousExecution, I am considering adding APIs that will run only on the MicroBatchExecution. I have quite a few use cases where exposing the micro-batch output as a DataFrame is useful.

  • Pass the output rows of each batch to a library that is designed only for batch jobs (for example, many ML libraries need to collect() while learning).
  • Reuse batch data sources for output whose streaming versions do not exist (e.g. the Redshift data source).
  • Write the output rows to multiple places by writing twice for each batch. This is not the most elegant approach for multiple-output streaming queries, but it is likely better than running two streaming queries that process the same data twice.

The proposal is to add a method foreachBatch(f: Dataset[T] => Unit) to Scala/Java/Python DataStreamWriter.
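For illustration, here is a minimal PySpark sketch of how such an API could be used; the rate source, output paths, and checkpoint location below are placeholders, not part of this PR:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("foreachBatchSketch").getOrCreate()

# Any streaming DataFrame works here; the rate source is just a stand-in.
stream_df = spark.readStream.format("rate").load()

def write_batch(batch_df, batch_id):
    # batch_df is a regular (non-streaming) DataFrame, so batch-only APIs
    # such as collect() and batch data sources can be used on it.
    batch_df.persist()
    batch_df.write.mode("append").parquet("/tmp/out1")  # placeholder path
    batch_df.write.mode("append").parquet("/tmp/out2")  # second sink, reusing the same batch
    batch_df.unpersist()

query = (stream_df.writeStream
         .foreachBatch(write_batch)
         .option("checkpointLocation", "/tmp/checkpoint")  # placeholder path
         .start())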

How was this patch tested?

New unit tests.

# "pyspark.sql.readwriter",
# "pyspark.sql.streaming",
# "pyspark.sql.udf",
# "pyspark.sql.window",
Contributor Author

temporary change for faster local testing. will remove before finalizing.

@@ -854,6 +856,20 @@ def trigger(self, processingTime=None, once=None, continuous=None):
self._jwrite = self._jwrite.trigger(jTrigger)
return self

def foreachBatch(self, func):
Contributor Author

add docs

@@ -269,6 +269,7 @@ def test_struct_field_type_name(self):
struct_field = StructField("a", IntegerType())
self.assertRaises(TypeError, struct_field.typeName)

'''
Contributor Author

temporary change for faster local testing. will remove before finalizing.

@SparkQA

SparkQA commented Jun 14, 2018

Test build #91873 has finished for PR 21571 at commit 3b7b20d.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class GBTClassificationModel(TreeEnsembleModel, JavaClassificationModel, JavaMLWritable,
  • class PrefixSpan(JavaParams):
  • public class MaskExpressionsUtils
  • case class ArrayRemove(left: Expression, right: Expression)
  • trait MaskLike
  • trait MaskLikeWithN extends MaskLike
  • case class Mask(child: Expression, upper: String, lower: String, digit: String)
  • case class MaskFirstN(
  • case class MaskLastN(
  • case class MaskShowFirstN(
  • case class MaskShowLastN(
  • case class MaskHash(child: Expression)
  • abstract class FileFormatDataWriter(
  • class EmptyDirectoryDataWriter(
  • class SingleDirectoryDataWriter(
  • class DynamicPartitionDataWriter(
  • class WriteJobDescription(
  • case class WriteTaskResult(commitMsg: TaskCommitMessage, summary: ExecutedWriteSummary)
  • case class ExecutedWriteSummary(

@SparkQA

SparkQA commented Jun 15, 2018

Test build #91875 has finished for PR 21571 at commit 985a4fe.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 15, 2018

Test build #91897 has finished for PR 21571 at commit 687402c.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas
Contributor Author

tdas commented Jun 15, 2018

@tdas tdas changed the title [WIP][SPARK-24565][SS] Add API in Structured Streaming for exposing output rows of each microbatch as a DataFrame to [SPARK-24565][SS] Add API in Structured Streaming for exposing output rows of each microbatch as a DataFrame Jun 15, 2018
@@ -145,3 +145,26 @@ def do_server_auth(conn, auth_secret):
if reply != "ok":
conn.close()
raise Exception("Unexpected reply from iterator server.")


def ensure_callback_server_started(gw):
Contributor Author

This was copied verbatim from python streaming/context.py
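For reference, the copied helper roughly does the following (a reconstruction from the streaming context code, not necessarily the exact body in this PR):

from py4j.java_gateway import JavaObject

def ensure_callback_server_started(gw):
    # Start the py4j callback server if it is not already running, so that
    # the JVM can call back into Python (needed for foreachBatch / foreach).
    if "_callback_server" not in gw.__dict__ or gw._callback_server is None:
        gw.callback_server_parameters.eager_load = True
        gw.callback_server_parameters.daemonize = True
        gw.callback_server_parameters.daemonize_connections = True
        gw.callback_server_parameters.port = 0  # let the OS pick a free port
        gw.start_callback_server(gw.callback_server_parameters)
        # gateway with real port
        cbport = gw._callback_server.server_socket.getsockname()[1]
        gw._callback_server.port = cbport
        gw._python_proxy_port = gw._callback_server.port
        # get the GatewayServer object in the JVM by ID and point its
        # callback client at the real port
        jgws = JavaObject("GATEWAY_SERVER", gw._gateway_client)
        jgws.resetCallbackClient(jgws.getCallbackClient().getAddress(), gw._python_proxy_port)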

# "pyspark.sql.readwriter",
# "pyspark.sql.streaming",
# "pyspark.sql.udf",
# "pyspark.sql.window",
Contributor Author

temp changes only to speed up local testing. will revert after first round of review.

Member

@zsxwing zsxwing left a comment

Overall looks good. Left some minor comments.

* @since 2.4.0
*/
@InterfaceStability.Evolving
def foreachBatch(function: (Dataset[T], Long) => Unit): DataStreamWriter[T] = {
Member

It's unclear that only one of foreachBatch and foreach can be set. Reading the doc, a user may think they can set both. Maybe we should disallow this case?

Contributor Author

Good point.

Contributor Author

@tdas tdas Jun 17, 2018

Well... that is an existing problem, because one can write the following confusing code:

df.writeStream.format("kafka").foreach(...).start()

This will execute the foreach, but it looks confusing nonetheless. In fact, you can also do

df.writeStream.format("kafka").format("bla").format("random")....

This is a general existing problem that should be addressed in a different PR.


wrapped_func = ForeachBatchFunction(self._spark, func)
gw.jvm.PythonForeachBatchHelper.callForeachBatch(self._jwrite, wrapped_func)
ensure_callback_server_started(gw)
Member

This should be above; otherwise there is a race where the streaming query calls this Python func before the callback server is started.

Contributor Author

This is not possible, because the callback from the JVM ForeachBatch sink to Python is made ONLY after the query is started. And the query cannot be started until this foreachBatch() method finishes and start() is called.

Member

Sorry if I'm mistaken, but can't we still put this above? It looks weird to ensure the callback server at the end.
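To make the suggestion concrete, the reordering being discussed would look roughly like this inside the Python foreachBatch method (a sketch reordering the fragment above; the helper's import location and the gateway lookup are assumptions):

from pyspark.java_gateway import ensure_callback_server_started  # assumed location of the helper

gw = self._spark._sc._gateway  # assumed way to reach the py4j gateway
ensure_callback_server_started(gw)  # ensure the callback server first, as suggested above
wrapped_func = ForeachBatchFunction(self._spark, func)
gw.jvm.PythonForeachBatchHelper.callForeachBatch(self._jwrite, wrapped_func)
return self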

@@ -62,6 +62,7 @@ def deco(*a, **kw):
try:
return f(*a, **kw)
except py4j.protocol.Py4JJavaError as e:
print(str(e))
Member

nit: remove this

@SparkQA

SparkQA commented Jun 16, 2018

Test build #91940 has finished for PR 21571 at commit 0763a44.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait MemorySinkBase extends BaseStreamingSink with Logging
  • class MemorySink(val schema: StructType, outputMode: OutputMode, options: DataSourceOptions)
  • class MemoryWriter(
  • class MemoryStreamWriter(

@SparkQA

SparkQA commented Jun 16, 2018

Test build #91938 has finished for PR 21571 at commit e8073ea.

  • This patch fails PySpark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 17, 2018

Test build #91982 has started for PR 21571 at commit 6f9fdf4.

@tdas
Contributor Author

tdas commented Jun 17, 2018

Jenkins retest this please

@SparkQA

SparkQA commented Jun 17, 2018

Test build #91994 has finished for PR 21571 at commit 6f9fdf4.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas
Contributor Author

tdas commented Jun 17, 2018

jenkins retest this please

@SparkQA

SparkQA commented Jun 18, 2018

Test build #92005 has finished for PR 21571 at commit 9062fb9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 18, 2018

Test build #92018 has finished for PR 21571 at commit 5b4252a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing
Member

zsxwing commented Jun 19, 2018

LGTM. Merging to master.

@asfgit asfgit closed this in 2cb9763 Jun 19, 2018
@HyukjinKwon
Member

HyukjinKwon commented Jun 20, 2018

Seems fine to me. Left one question and a few nits, but not a big deal.

q = None
collected = dict()

def collectBatch(batch_df, batch_id):
Member

collectBatch -> collect_batch per PEP 8.
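For context, the test helper under discussion, with the suggested name, might look roughly like this (a sketch; the rate source here is a stand-in for whatever the test actually reads):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
stream_df = spark.readStream.format("rate").load()  # stand-in for the test's source

collected = dict()

def collect_batch(batch_df, batch_id):
    # Materialize each micro-batch so the test can assert on its contents.
    collected[batch_id] = batch_df.collect()

q = stream_df.writeStream.foreachBatch(collect_batch).start()
q.processAllAvailable()
q.stop()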

# gateway with real port
gw._python_proxy_port = gw._callback_server.port
# get the GatewayServer object in JVM by ID
jgws = JavaObject("GATEWAY_SERVER", gw._gateway_client)
Member

Nit: we could remove this import in this file though.

@sidtandon2014

@tdas @SparkQA @zsxwing @HyukjinKwon,

I have a few questions related to batchId:

  1. If I stop the job and start it again, what will the batchId be (let's assume the last batchId was N)? Is the batchId dependent on the offset and partitionId?
  2. If I stop the job (or some error happens) and start it again (assume I am processing the Nth batch), will the Nth batch have more data this time (considering a continuous stream), or will the additional data belong to batch N+1?

otterc pushed a commit to linkedin/spark that referenced this pull request Mar 22, 2023
…tput rows of each microbatch as a DataFrame


Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes apache#21571 from tdas/foreachBatch.

Ref: LIHADOOP-48531

RB=1854649
G=superfriends-reviewers
R=latang,yezhou,zolin,fli,mshen
A=