Conversation

@szehon-ho
Member

What changes were proposed in this pull request?

Backport of #53572 to the 4.1 branch.

Follow-up of #51032. That PR changed V2WriteCommand so that it no longer executes eagerly on df.cache(). However, a number of other commands still do:

val df = sql("CREATE TABLE...")
df.cache()  // executes again, fails with TableAlreadyExistsException

This patch skips the CacheManager for all Command plans, because commands are already executed eagerly when sql("COMMAND") is first called:

val df = sql("SHOW TABLES.")
sql("CREATE TABLE foo")
df.cache()  // executes again and df now includes foo
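
For reference, a minimal sketch of the guard this describes, assuming the check compares the analyzed plan against the Command trait; the exact method and its placement inside CacheManager may differ from the actual patch:

```
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.catalyst.plans.logical.Command

// Illustrative only: before registering a plan with the cache, bail out if it
// is a Command. Commands were already executed eagerly by sql("..."), so
// re-executing them on df.cache() would repeat their side effects.
def shouldCache(df: Dataset[_]): Boolean =
  !df.queryExecution.analyzed.isInstanceOf[Command]
```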

Why are the changes needed?

To prevent commands with side effects from being executed again when a user calls df.cache() on the result of the command. Many of these are dangerous to run a second time, and the user does not expect df.cache() to trigger another action on the table.

Does this PR introduce any user-facing change?

If a user created a resultDf from a command and then ran resultDf.cache(), it used to re-run the command. Now it is a no-op. Most of the time this is beneficial, since re-running the command would result in an error or, worse, data corruption. However, in a few cases, such as SHOW TABLES or SHOW NAMESPACES, it affects the contents of resultDf, which will no longer refresh when resultDf.cache() is called.

Note: In most cases there is no visible user-facing change, because command plan nodes such as DescribeTableExec already hold an in-memory reference to the Table object and keep the old result even when executed repeatedly. However, SHOW XXX command plans do not cache their results in memory, so they do show an effect.
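
To make the SHOW TABLES difference concrete, a hypothetical session (table name and schema are illustrative):

```
val df = spark.sql("SHOW TABLES")
spark.sql("CREATE TABLE foo (id INT) USING parquet")
df.cache()
df.show()
// Before this patch: cache() re-executed SHOW TABLES, so the output included `foo`.
// After this patch: cache() is a no-op for commands, so the output still reflects
// the catalog contents from when SHOW TABLES was first run.
```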

How was this patch tested?

Existing unit tests, plus new unit tests.
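
A sketch of the kind of regression test this adds (the test name and helpers such as withTable follow the usual Spark SQL test conventions but are illustrative here):

```
test("cache() on the result of a command does not re-execute it") {
  withTable("t") {
    val df = sql("CREATE TABLE t (id INT) USING parquet")
    // Before this change, cache() re-ran CREATE TABLE and failed with
    // TableAlreadyExistsException; now it is a no-op.
    df.cache()
  }
}
```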

Was this patch authored or co-authored using generative AI tooling?

No


Closes apache#53572 from szehon-ho/cache_safety.

Authored-by: Szehon Ho <szehon.apache@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@github-actions

JIRA Issue Information

=== Bug SPARK-54812 ===
Summary: Make executable commands not execute on resultDf.cache()
Assignee: Szehon Ho
Status: Resolved
Affected: ["4.1.0"]


This comment was automatically generated by GitHub Actions

@szehon-ho
Member Author

@pan3793 is this what you meant?
