[SPARK-34947][SQL] Streaming write to a V2 table should invalidate its associated cache #32039
sunchao wants to merge 4 commits into apache:master
Conversation
  WriteToDataSourceV2Exec(writer, planLater(query)) :: Nil
case WriteToDataSourceV2(relationOpt, writer, query) =>
  val refreshCacheFunc: () => Unit = relationOpt match {
    case Some(r) => refreshCache(r)
I'm not sure refreshing the cache is the best choice in the context of streaming - perhaps we should invalidate it instead?

If you have a concern, why don't we add a config for that? That would give us control over whether to invalidate or to refresh.

I think refreshing the cache every second (assuming the streaming trigger is one second) doesn't make sense. Invalidating seems more reasonable.

Hmm, I can't think of a use case where we would want to refresh the cache of a table written by streaming.

Thanks all! Let me switch to invalidating the cache. IMO a config isn't necessary since people may never want the refresh behavior (it could get very expensive).
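A minimal sketch of the two options discussed in this thread, assuming Spark's internal CacheManager API (recacheByPlan re-runs and re-caches the plan, uncacheQuery simply drops the entry); the helper name and the direct use of sharedState.cacheManager are illustrative, not part of this PR:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Illustrative helper: contrast refreshing vs invalidating a cached relation.
def refreshOrInvalidate(spark: SparkSession, relation: LogicalPlan, invalidate: Boolean): Unit = {
  val cacheManager = spark.sharedState.cacheManager
  if (invalidate) {
    // Drop the cache entry (and dependent entries); cheap to do once per micro-batch.
    cacheManager.uncacheQuery(spark, relation, cascade = true)
  } else {
    // Recompute and re-cache the relation; potentially very expensive per trigger.
    cacheManager.recacheByPlan(spark, relation)
  }
}
```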
I think the test failures are not related. cc @HeartSaVioR @cloud-fan and @aokolnychyi
  override protected def run(): Seq[InternalRow] = {
-   writeWithV2(batchWrite)
+   val writtenRows = writeWithV2(batchWrite)
+   refreshCache()
+   writtenRows
Instead of refreshing/invalidating the table per trigger, why don't we just invalidate the cache before we start the streaming query that writes to the table?

Yes, that would work too and would require fewer code changes. I went this way to be consistent with the other V2 write commands. Also, in the future we may introduce DataStreamWriterV2, which could pass a write node with UnresolvedRelation to the analyzer and be converted to an execution plan, and this approach may fit better in that case.
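For reference, a hypothetical sketch of the alternative raised here, i.e. dropping the cache entry once before the streaming query starts rather than on every trigger; the helper name and the use of the public Catalog API are assumptions for illustration only:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: invalidate the target table's cache entry up front,
// then start the streaming query that writes into it.
def uncacheBeforeStart(spark: SparkSession, tableName: String): Unit = {
  if (spark.catalog.isCached(tableName)) {
    spark.catalog.uncacheTable(tableName)
  }
  // ...then build and start the StreamingQuery that writes into tableName.
}
```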
(Resolved review comments on sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala and ...core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala.)
    extraOptions: Map[String, String],
-   plan: WriteToStream)
+   plan: WriteToStream,
+   catalogAndIdent: Option[(CatalogPlugin, Identifier)] = None)
Shall we put this info in WriteToStream? It's very weird to see catalogAndIdent as a parameter of MicroBatchExecution.

Good point. WriteToStream is a better place for this information.
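A rough sketch of the agreed direction, with illustrative class, field, and helper names (this is not the final API): the logical streaming-write node carries the optional (catalog, identifier) pair, and the execution side derives a cache-invalidation callback from it instead of taking the pair as a constructor parameter.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.connector.catalog.{CatalogPlugin, Identifier}

// Illustrative stand-in for the logical streaming-write node carrying the target info.
case class StreamWriteTarget(
    queryName: String,
    catalogAndIdent: Option[(CatalogPlugin, Identifier)] = None)

// Derive a per-batch invalidation callback from the plan-level info.
def invalidationFunc(spark: SparkSession, target: StreamWriteTarget): () => Unit = () => {
  target.catalogAndIdent.foreach { case (catalog, ident) =>
    // Qualified name like "testcat.ns.tbl"; identifier quoting is omitted for brevity.
    val qualified = (catalog.name +: ident.namespace :+ ident.name).mkString(".")
    spark.sql(s"UNCACHE TABLE IF EXISTS $qualified")
  }
}
```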
Thanks, merging to master!
What changes were proposed in this pull request?
Populate the table catalog and identifier from DataStreamWriter to WriteToMicroBatchDataSource so that we can invalidate the cache for tables that are updated by a streaming write.

This is somewhat related to SPARK-27484 and SPARK-34183 (#31700), as ideally we may want to replace WriteToMicroBatchDataSource and WriteToDataSourceV2 with logical write nodes and feed them to the analyzer. That will potentially change the code path involved in this PR.

Why are the changes needed?
Currently WriteToDataSourceV2 doesn't have cache invalidation logic, so when the target table of a micro-batch streaming job is cached, the cache entry won't be removed when the table is updated.

Does this PR introduce any user-facing change?
Yes. Now when a DSv2 table that supports streaming write is updated by a streaming job, its cache will also be invalidated.
How was this patch tested?
Added a new UT.
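For illustration only (this is not the test added by the PR), a sketch of the user-visible behavior, assuming a DSv2 catalog registered as testcat whose tables support streaming writes (similar to the in-memory catalog used in Spark's connector test suites), a provider named foo, and a SparkSession in scope as spark:

```scala
// Create and cache a V2 table in the hypothetical "testcat" catalog.
spark.sql("CREATE TABLE testcat.ns.t (id BIGINT) USING foo")
spark.sql("CACHE TABLE testcat.ns.t")

// Stream rows from the built-in rate source into the cached table.
val query = spark.readStream
  .format("rate")
  .load()
  .selectExpr("value AS id")
  .writeStream
  .option("checkpointLocation", "/tmp/streaming-ckpt")  // illustrative path
  .toTable("testcat.ns.t")

query.processAllAvailable()
query.stop()

// With this PR, each committed micro-batch invalidates the cache entry for
// testcat.ns.t, so this read reflects the newly written rows rather than stale data.
spark.table("testcat.ns.t").show()
```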