
[SPARK-34947][SQL] Streaming write to a V2 table should invalidate its associated cache #32039

Closed
sunchao wants to merge 4 commits into apache:master from sunchao:streaming-cache

Conversation


@sunchao sunchao commented Apr 2, 2021

What changes were proposed in this pull request?

Populate the table catalog and identifier from DataStreamWriter into WriteToMicroBatchDataSource so that we can invalidate the cache for tables that are updated by a streaming write.
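
For illustration, a trimmed-down sketch of the idea (the class name and field list here are placeholders, not the actual node definition; the real WriteToMicroBatchDataSource in Spark carries more fields):

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.connector.catalog.{CatalogPlugin, Identifier}

// Sketch: the streaming write node additionally carries the (catalog, identifier)
// of the target table, when the sink is a catalog table, so that the physical
// write can later look up and drop the table's cache entry.
case class WriteToMicroBatchDataSourceSketch(
    query: LogicalPlan,                                          // the micro-batch query to write
    catalogAndIdent: Option[(CatalogPlugin, Identifier)] = None) // None for path/non-catalog sinks
```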

This is somewhat related to SPARK-27484 and SPARK-34183 (#31700): ideally we may want to replace WriteToMicroBatchDataSource and WriteToDataSourceV2 with logical write nodes and feed them to the analyzer, which would potentially change the code path involved in this PR.

Why are the changes needed?

Currently WriteToDataSourceV2 doesn't have any cache invalidation logic, so when the target table of a micro-batch streaming job is cached, the cache entry is not removed when the table is updated.

Does this PR introduce any user-facing change?

Yes. Now when a DSv2 table that supports streaming write is updated by a streaming job, its cache is also invalidated.

How was this patch tested?

Added a new unit test.
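
Roughly the shape of the scenario being exercised (a hedged sketch only; testcat.ns.t, the rate source, and the checkpoint path are placeholders, not the actual test code):

```scala
import org.apache.spark.sql.SparkSession

// Sketch of the behavior under test: cache a DSv2 table, run a streaming write
// into it, and expect the cache entry to be invalidated afterwards.
val spark = SparkSession.builder().master("local[2]").getOrCreate()

val table = "testcat.ns.t"         // placeholder: a V2 table that supports streaming write
spark.table(table).cache()         // cache the table before the streaming job starts
assert(spark.catalog.isCached(table))

val query = spark.readStream
  .format("rate")                  // placeholder source
  .load()
  .selectExpr("CAST(value AS INT) AS id")
  .writeStream
  .option("checkpointLocation", "/tmp/ckpt")   // placeholder path
  .toTable(table)                  // streaming write into the cached table (Spark 3.1+)

query.processAllAvailable()
query.stop()

// After the streaming write, the stale cache entry should be gone.
assert(!spark.catalog.isCached(table))
```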


SparkQA commented Apr 2, 2021

Test build #136869 has finished for PR 32039 at commit 0891cd1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class WriteToMicroBatchDataSource(

@sunchao sunchao marked this pull request as draft April 2, 2021 22:30

SparkQA commented Apr 2, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41447/


SparkQA commented Apr 6, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41529/


SparkQA commented Apr 6, 2021

Test build #136953 has finished for PR 32039 at commit bf8d519.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class WriteToDataSourceV2(
  • case class WriteToMicroBatchDataSource(


SparkQA commented Apr 6, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41539/


SparkQA commented Apr 6, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41539/


SparkQA commented Apr 6, 2021

Test build #136962 has finished for PR 32039 at commit 801d35a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class WriteToDataSourceV2(
  • case class WriteToMicroBatchDataSource(

-      WriteToDataSourceV2Exec(writer, planLater(query)) :: Nil
+    case WriteToDataSourceV2(relationOpt, writer, query) =>
+      val refreshCacheFunc: () => Unit = relationOpt match {
+        case Some(r) => refreshCache(r)

sunchao (Member, Author):

I'm not sure whether refreshing the cache is the best choice in the context of streaming; perhaps we should invalidate it instead?

Member:

If you have a concern, why don't we add a config for that? It would give us control over whether to invalidate or to refresh.

Contributor:

I think refreshing the cache on every trigger (e.g. once per second if the trigger interval is one second) doesn't make sense. Invalidating seems more reasonable.

Member:

Hmm, I cannot think of a use case where we would want to refresh the cache of a table written by streaming.

sunchao (Member, Author):

Thanks all! Let me switch to invalidating the cache. IMO a config isn't necessary, since people may never want the refresh behavior (it could get very expensive).
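
For reference, the two options discussed here boil down to roughly the following (a sketch against Spark's internal CacheManager API of that era, callable only from within Spark's own sql package; recacheByPlan eagerly recomputes the cached data, while uncacheQuery simply drops the entry):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation

// "Invalidate": drop the cache entry; a later read re-caches lazily. Cheap to do per trigger.
def invalidateFunc(session: SparkSession, r: DataSourceV2Relation): () => Unit =
  () => session.sharedState.cacheManager.uncacheQuery(session, r, cascade = true)

// "Refresh": eagerly recompute the cached data after every micro-batch. Potentially very expensive.
def refreshFunc(session: SparkSession, r: DataSourceV2Relation): () => Unit =
  () => session.sharedState.cacheManager.recacheByPlan(session, r)
```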

@sunchao sunchao marked this pull request as ready for review April 6, 2021 23:34

sunchao commented Apr 7, 2021

I think the test failures are not related. cc @HeartSaVioR @cloud-fan and @aokolnychyi

Comment on lines 258 to +261
  override protected def run(): Seq[InternalRow] = {
-   writeWithV2(batchWrite)
+   val writtenRows = writeWithV2(batchWrite)
+   refreshCache()
+   writtenRows

Member:

Instead of refreshing/invalidating the cache on every trigger, why don't we just invalidate the cache before we start the streaming query that writes to the table?

sunchao (Member, Author):

Yes, that should work too and would require fewer code changes. I went this way to be consistent with the other V2 write commands. Also, in the future we may introduce DataStreamWriterV2, which could pass a write node with an UnresolvedRelation to the analyzer and have it converted to an execution plan; this approach may fit better in that case.
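
For comparison, the alternative suggested above would look roughly like this from the caller's side (a sketch; the helper and tableName are hypothetical, and the table is assumed to exist):

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.streaming.{DataStreamWriter, StreamingQuery}

// Sketch of the alternative: invalidate the cache once, right before starting the
// streaming query that writes into the table, instead of after every micro-batch.
def startStreamingWrite(
    spark: SparkSession,
    tableName: String,
    writer: DataStreamWriter[Row]): StreamingQuery = {
  spark.catalog.uncacheTable(tableName)   // drop the soon-to-be-stale cache entry up front
  writer.toTable(tableName)               // then start the streaming write (Spark 3.1+)
}
```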


SparkQA commented Apr 9, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41726/


SparkQA commented Apr 9, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41726/

@sunchao sunchao changed the title [SPARK-34947][SQL] Streaming write to a V2 table should refresh its associated cache [SPARK-34947][SQL] Streaming write to a V2 table should invalidate its associated cache Apr 9, 2021

SparkQA commented Apr 9, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41730/


SparkQA commented Apr 9, 2021

Test build #137147 has finished for PR 32039 at commit 779a8d2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Apr 9, 2021

Test build #137151 has finished for PR 32039 at commit 239fafd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

    extraOptions: Map[String, String],
-   plan: WriteToStream)
+   plan: WriteToStream,
+   catalogAndIdent: Option[(CatalogPlugin, Identifier)] = None)

Contributor:

Shall we put this info in WriteToStream? It's very weird to see catalogAndIdent as a parameter of MicroBatchExecution.

sunchao (Member, Author):

Good point. WriteToStream is a better place for this information.
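
A trimmed sketch of what that might look like (illustrative only; the name and field list are placeholders, and the actual WriteToStream node carries additional fields):

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.connector.catalog.{CatalogPlugin, Identifier, Table}

// Sketch: carry the target's (catalog, identifier) on the WriteToStream node itself,
// so MicroBatchExecution can read it from its plan instead of taking an extra parameter.
case class WriteToStreamSketch(
    name: String,
    sink: Table,
    inputQuery: LogicalPlan,
    catalogAndIdent: Option[(CatalogPlugin, Identifier)] = None)
```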


SparkQA commented Apr 12, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41808/


SparkQA commented Apr 12, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41808/


SparkQA commented Apr 12, 2021

Test build #137228 has finished for PR 32039 at commit a884b47.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR (Contributor) left a comment:

+1 Looks OK to me.

@cloud-fan (Contributor):

thanks, merging to master!

@cloud-fan cloud-fan closed this in 1a67089 Apr 13, 2021