Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-19765][SPARK-18549][SQL] UNCACHE TABLE should un-cache all cached plans that refer to this table #17097

Closed
wants to merge 2 commits into from

Conversation

cloud-fan
Copy link
Contributor

@cloud-fan cloud-fan commented Feb 28, 2017

What changes were proposed in this pull request?

When un-cache a table, we should not only remove the cache entry for this table, but also un-cache any other cached plans that refer to this table.

This PR also includes some refactors:

  1. use java.util.LinkedList to store the cache entries, so that it's safer to remove elements while iterating
  2. rename invalidateCache to recacheByPlan, which is more obvious about what it does.

How was this patch tested?

new regression test

@cloud-fan
Copy link
Contributor Author

@windpiger
Copy link
Contributor

great job~

@SparkQA
Copy link

SparkQA commented Feb 28, 2017

Test build #73581 has finished for PR 17097 at commit 6b57fd4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 28, 2017

Test build #73586 has finished for PR 17097 at commit 97cbfe2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 2, 2017

Test build #73736 has started for PR 17097 at commit e881f29.

@cloud-fan
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Mar 2, 2017

Test build #73741 has finished for PR 17097 at commit e881f29.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

retest this please


/**
* Tries to remove the cache entry of the given plan, no operation, if it's already uncached.
* Note that all other caches that refer to this plan will be re-cached.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a behavior change.

Will uncache be a very expensive operation if we recache all the related cached plans?

@SparkQA
Copy link

SparkQA commented Mar 5, 2017

Test build #73939 has finished for PR 17097 at commit e881f29.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan cloud-fan changed the title [SPARK-19765][SQL] UNCACHE TABLE should re-cache all cached plans that refer to this table [SPARK-19765][SQL] UNCACHE TABLE should un-cache all cached plans that refer to this table Mar 5, 2017
@SparkQA
Copy link

SparkQA commented Mar 6, 2017

Test build #73940 has finished for PR 17097 at commit 6748017.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 6, 2017

Test build #73943 has finished for PR 17097 at commit 33124d5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 6, 2017

Test build #73946 has finished for PR 17097 at commit 067ddf5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

sparkSession.sharedState.cacheManager.invalidateCache(logicalRelation)
// Re-cache all cached plans(including this relation itself, if it's cached) that refer to this
// data source relation.
sparkSession.sharedState.cacheManager.recacheByPlan(sparkSession, logicalRelation)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the current design, users need to re-cache the queries by themselves.

After this change, insertion could be super slow. Each insert could trigger the recache of many involved cached data, each of which could be very complex and expensive. That is a trade-off. Although we keep the data correctness/consistence, we could sacrifice the performance/user experience.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't change the behavior, just rename invalidateCache to recacheByPlan.

I'll open a new discussion about whether we should do recache after insertion or not.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uh... My fault... invalidateCache is misleading. Renaming is good!

@gatorsmile
Copy link
Member

SPARK-18549: Failed to Uncache a View that References a Dropped Table is also resolved by this PR. Could you also add a test case for that scenario?

* No operation, if it's already uncached.
* Un-cache all the cache entries that refer to the given plan.
*/
def uncacheQuery(query: Dataset[_], blocking: Boolean = true): Unit = writeLock {
Copy link
Member

@gatorsmile gatorsmile Mar 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DropTableCommand, TruncateTableCommand, AlterTableRenameCommand, UncacheTableCommand, RefreshTable and InsertIntoHiveTable will be affected by the behavior changes of this function, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, but I don't want to test all of them, just pick a typical one and prove the uncacheQuery works correctly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is another scenario that after alter datasource table location, we can't uncache the table,
JIRA

@cloud-fan cloud-fan changed the title [SPARK-19765][SQL] UNCACHE TABLE should un-cache all cached plans that refer to this table [SPARK-19765][SPARK-18549][SQL] UNCACHE TABLE should un-cache all cached plans that refer to this table Mar 7, 2017
@cloud-fan
Copy link
Contributor Author

@gatorsmile I have put the JIRA number in this PR title, without adding a test, because the new behavior this PR introduced can obviously fix that bug.

I don't want to add a lot of end tests to prove a single functionality.

@gatorsmile
Copy link
Member

Different from materialized views, currently, our caches are always used for query rewriting. Thus, uncaching all the invalid data makes sense.

Will finish the review today. Thanks!

@gatorsmile
Copy link
Member

LGTM

@gatorsmile
Copy link
Member

Thanks! Merging to master.

@gatorsmile
Copy link
Member

Should we backport it to the previous releases?

@asfgit asfgit closed this in c05baab Mar 7, 2017
@cloud-fan
Copy link
Contributor Author

yea, we should, but I probably don't have time to do this. Can you take over? thanks!

@gatorsmile
Copy link
Member

Sure. Will do it.

asfgit pushed a commit that referenced this pull request Mar 17, 2017
…L] Backport Three Cache-related PRs to Spark 2.1

### What changes were proposed in this pull request?

Backport a few cache related PRs:

---
[[SPARK-19093][SQL] Cached tables are not used in SubqueryExpression](#16493)

Consider the plans inside subquery expressions while looking up cache manager to make
use of cached data. Currently CacheManager.useCachedData does not consider the
subquery expressions in the plan.

---
[[SPARK-19736][SQL] refreshByPath should clear all cached plans with the specified path](#17064)

Catalog.refreshByPath can refresh the cache entry and the associated metadata for all dataframes (if any), that contain the given data source path.

However, CacheManager.invalidateCachedPath doesn't clear all cached plans with the specified path. It causes some strange behaviors reported in SPARK-15678.

---
[[SPARK-19765][SPARK-18549][SQL] UNCACHE TABLE should un-cache all cached plans that refer to this table](#17097)

When un-cache a table, we should not only remove the cache entry for this table, but also un-cache any other cached plans that refer to this table. The following commands trigger the table uncache: `DropTableCommand`, `TruncateTableCommand`, `AlterTableRenameCommand`, `UncacheTableCommand`, `RefreshTable` and `InsertIntoHiveTable`

This PR also includes some refactors:
- use java.util.LinkedList to store the cache entries, so that it's safer to remove elements while iterating
- rename invalidateCache to recacheByPlan, which is more obvious about what it does.

### How was this patch tested?
N/A

Author: Xiao Li <gatorsmile@gmail.com>

Closes #17319 from gatorsmile/backport-17097.
@meteorchenwu
Copy link

What a bad change it is!
Now it can not support for this scene any more.
https://issues.apache.org/jira/browse/SPARK-21579

@gatorsmile
Copy link
Member

@meteorchenwu Correctness is more important for us. The cached plan will be reused when we build any other plans. Thus, users might see the out-of-dated results.

To achieve what you want, it requires introducing new concept, like materialized views, which will not be used by plan matching.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants