[SPARK-54812][SQL][4.1] Make executable commands not execute on resultDf.cache() #54064
What changes were proposed in this pull request?
Backport of #53572 to the 4.1 branch.
Follow-up to #51032. That PR changed V2WriteCommand so that it no longer executes eagerly on df.cache(). However, a number of other commands still do.
This patch skips CacheManager for every Command, because commands are already executed eagerly when sql("COMMAND") is first called.
Why are the changes needed?
To prevent a command with side effects from being executed again when a user calls df.cache() on the result of that command. Many such re-executions are dangerous because the command runs a second time without the user expecting it (df.cache() triggers another action on the table).
Does this PR introduce any user-facing change?
Yes. If a user created a resultDf from a command and then called resultDf.cache(), the command used to be re-run. Now the call is a no-op for the command. Most of the time this is beneficial, because re-running the command would result in an error or, worse, data corruption. However, in a small number of cases, such as SHOW TABLES or SHOW NAMESPACES, it affects the contents of resultDf, which is no longer refreshed when resultDf.cache() is called.
Note: In most cases there is no visible user-facing change. Many command plan nodes, for example DescribeTableExec, already hold an in-memory reference to the Table object and keep returning the old result even when executed repeatedly. SHOW XXX command plans, however, do not keep their results in memory, so their output is affected.
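The behavior change described above can be illustrated with a small toy model. This is only a sketch in plain Python, not Spark code: the `Command` and `ResultDf` classes and the `skip_commands` flag are hypothetical stand-ins invented for illustration, and the real change lives in Spark's CacheManager handling of Command plans.

```python
class Command:
    """Hypothetical stand-in for a side-effecting command plan (not a Spark class)."""

    def __init__(self, name):
        self.name = name
        self.run_count = 0  # counts how many times the side effect happened

    def run(self):
        self.run_count += 1
        return [f"{self.name}: run {self.run_count}"]


class ResultDf:
    """Hypothetical stand-in for the DataFrame returned by sql("COMMAND")."""

    def __init__(self, plan):
        self.plan = plan
        # Commands are executed eagerly when the result is first created.
        self.rows = plan.run()

    def cache(self, skip_commands=True):
        # skip_commands=True models the behavior after this patch:
        # caching a command result does not execute the command again.
        if skip_commands and isinstance(self.plan, Command):
            return self
        # skip_commands=False models the old behavior: the plan is re-run.
        self.rows = self.plan.run()
        return self


insert = Command("INSERT INTO t VALUES (1)")
df = ResultDf(insert)             # eager execution: the command runs once

df.cache()                        # patched behavior: no second execution
runs_after_fix = insert.run_count

df.cache(skip_commands=False)     # old behavior: the command runs again
runs_before_fix = insert.run_count
```

In this model, a command whose output depends on current state (the analogue of SHOW TABLES) would keep its original rows after `cache()` under the patched behavior, which is the user-visible difference noted above.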
How was this patch tested?
Existing unit tests, plus new unit tests.
Was this patch authored or co-authored using generative AI tooling?
No