Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-34197][SQL] SessionCatalog.refreshTable() should not invalidate the relation cache for temporary views #31265

Closed

Conversation

MaxGekk
Copy link
Member

@MaxGekk MaxGekk commented Jan 20, 2021

What changes were proposed in this pull request?

Check the name passed to SessionCatalog.refreshTable, and if it belongs to a temporary view, do not invalidate the relation cache.

Why are the changes needed?

When SessionCatalog.refreshTable refreshes a temporary or global temporary view, it should not invalidate an entry in the relation cache associated to a table with the same name.

Does this PR introduce any user-facing change?

Should not. The change might improve performance slightly.

How was this patch tested?

By running new UT:

$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *SessionCatalogSuite"

@github-actions github-actions bot added the SQL label Jan 20, 2021
@SparkQA
Copy link

SparkQA commented Jan 20, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38878/

@SparkQA
Copy link

SparkQA commented Jan 20, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38878/

@SparkQA
Copy link

SparkQA commented Jan 21, 2021

Test build #134292 has finished for PR 31265 at commit a73cb59.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk MaxGekk changed the title [WIP][SQL] Don't invalidate the relation cache for temp and global views [SPARK-34197][SQL] SessionCatalog.refreshTable() should not invalidate the relation cache for temporary views Jan 21, 2021
@MaxGekk
Copy link
Member Author

MaxGekk commented Jan 21, 2021

@cloud-fan @dongjoon-hyun @HyukjinKwon @sunchao Could you take a look at the small fix.

Copy link
Member

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @MaxGekk for pinging. Overall looks good to me. BTW do you know whether the refreshTable method is used in places where there could be both temp view and table with the same name and we explicitly do not want to refresh the temp view but just the table relation cache?

@@ -994,15 +994,17 @@ class SessionCatalog(
// Go through temporary views and invalidate them.
// If the database is defined, this may be a global temporary view.
// If the database is not defined, there is a good chance this is a temp view.
if (name.database.isEmpty) {
tempViews.get(tableName).foreach(_.refresh())
val isTempView = if (name.database.isEmpty) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the code structure here looks pretty similar to isTemporaryTable, and it would be nice if we can do:

    lookupTemporaryTable(name).map(_.refresh()).getOrElse {
      // Also invalidate the table relation cache.
      val qualifiedTableName = QualifiedTableName(dbName, tableName)
      tableRelationCache.invalidate(qualifiedTableName)
    }

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did that.

@SparkQA
Copy link

SparkQA commented Jan 21, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38929/

@SparkQA
Copy link

SparkQA commented Jan 21, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38929/

@SparkQA
Copy link

SparkQA commented Jan 22, 2021

Test build #134342 has finished for PR 31265 at commit 3b49a35.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Copy link
Member Author

MaxGekk commented Jan 22, 2021

BTW do you know whether the refreshTable method is used in places where there could be both temp view and table with the same name and we explicitly do not want to refresh the temp view but just the table relation cache?

Theoretically, I could imagine the situation when an user run any v1 command for a table, for instance ALTER TABLE tbl DROP PARTITION, and tbl is resolved to a v1 table:

case class AlterTableDropPartitionCommand(
    tableName: TableIdentifier,
...

tableName refers to tbl. While running the command, the user creates a temp view tbl in a parallel thread somewhere in the middle of run():

override def run(sparkSession: SparkSession): Seq[Row] = {
    val table = catalog.getTableMetadata(tableName)
    ...
    // USER CREATES A TEMP VIEW WITH THE SAME NAME
    ...
    sparkSession.catalog.refreshTable(table.identifier.quotedString)

And at the end of run(), we refresh the view instead of the table. So we wanted to refresh the table but refreshed the view.

@SparkQA
Copy link

SparkQA commented Jan 22, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38946/

@SparkQA
Copy link

SparkQA commented Jan 22, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38946/

@SparkQA
Copy link

SparkQA commented Jan 22, 2021

Test build #134360 has finished for PR 31265 at commit dcae0ef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -877,21 +877,27 @@ class SessionCatalog(
isTemporaryTable(nameParts.asTableIdentifier)
}

private def lookupTempView(name: TableIdentifier): Option[LogicalPlan] = {
val table = formatTableName(name.table)
val dbName = formatDatabaseName(name.database.getOrElse(currentDb))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't need to check the current database, as we can't set global_tmp as the current database.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous code didn't check the current database as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan This is "previous" code:

val dbName = formatDatabaseName(name.database.getOrElse(currentDb))

or you are speaking about another previous code?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, do you propose to call database.get?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, let's follow that "previous" code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did that. ... The function name isTemporaryTable looks weird, so, I would replace it by:

def isTempView(name: TableIdentifier): Boolean

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the PR #31295 which renames isTemporaryTable().

@SparkQA
Copy link

SparkQA commented Jan 22, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38961/

@SparkQA
Copy link

SparkQA commented Jan 22, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38961/

@SparkQA
Copy link

SparkQA commented Jan 22, 2021

Test build #134374 has finished for PR 31265 at commit 5c15355.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

…-refresh-table

# Conflicts:
#	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
@SparkQA
Copy link

SparkQA commented Jan 23, 2021

Test build #134402 has finished for PR 31265 at commit 4c3350e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class BitwiseGet(left: Expression, right: Expression)
  • new RuntimeException(s\"class$`

@MaxGekk
Copy link
Member Author

MaxGekk commented Jan 24, 2021

Any objections for the changes?

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 6fe5a8a Jan 25, 2021
skestle pushed a commit to skestle/spark that referenced this pull request Feb 3, 2021
…ate the relation cache for temporary views

### What changes were proposed in this pull request?
Check the name passed to `SessionCatalog.refreshTable`, and if it belongs to a temporary view, do not invalidate the relation cache.

### Why are the changes needed?
When `SessionCatalog.refreshTable` refreshes a temporary or global temporary view, it should not invalidate an entry in the relation cache associated to a table with the same name.

### Does this PR introduce _any_ user-facing change?
Should not. The change might improve performance slightly.

### How was this patch tested?
By running new UT:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *SessionCatalogSuite"
```

Closes apache#31265 from MaxGekk/fix-session-catalog-refresh-table.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
5 participants