
[SPARK-30494][SQL] Fix cached data leakage during replacing an existing view #27185

Closed
wants to merge 9 commits

Conversation

LantaoJin
Contributor

@LantaoJin LantaoJin commented Jan 13, 2020

What changes were proposed in this pull request?

The cached RDD for the plan "select 1" stays in memory forever until the session closes. This cached data cannot be used, since the view temp1 has been replaced by another plan. It's a memory leak.

We can reproduce it with the commands below:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0-SNAPSHOT
      /_/

Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_201)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.sql("create or replace temporary view temp1 as select 1")
scala> spark.sql("cache table temp1")
scala> spark.sql("create or replace temporary view temp1 as select 1, 2")
scala> spark.sql("cache table temp1")
scala> assert(spark.sharedState.cacheManager.lookupCachedData(sql("select 1, 2")).isDefined)
scala> assert(spark.sharedState.cacheManager.lookupCachedData(sql("select 1")).isDefined)
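For contrast, here is a sketch of the intended behavior once the leak is fixed (run in the same spark-shell session; without the patch the final assert fails because the stale entry is still cached):

```scala
spark.sql("create or replace temporary view temp1 as select 1")
spark.sql("cache table temp1")
// replacing the view should also uncache the old plan "select 1"
spark.sql("create or replace temporary view temp1 as select 1, 2")
spark.sql("cache table temp1")
assert(spark.sharedState.cacheManager.lookupCachedData(spark.sql("select 1, 2")).isDefined)
// with the fix, the stale cached data is gone
assert(spark.sharedState.cacheManager.lookupCachedData(spark.sql("select 1")).isEmpty)
```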

Why are the changes needed?

Fix the memory leak, especially for long-running mode.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Add a unit test.

@@ -125,6 +132,9 @@ case class CreateViewCommand(
val viewIdent = tableMetadata.identifier
checkCyclicViewReference(analyzedPlan, Seq(viewIdent), viewIdent)

// uncache the cached data before replacing an existing view
sparkSession.catalog.uncacheTable(viewIdent.quotedString)
Contributor Author

@LantaoJin LantaoJin Jan 13, 2020


FYI. There is a condition } else if (replace) { in line 130.

@SparkQA

SparkQA commented Jan 13, 2020

Test build #116607 has finished for PR 27185 at commit 48c8b37.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@LantaoJin
Contributor Author

retest it please

@LantaoJin
Contributor Author

Retest this please

@SparkQA

SparkQA commented Jan 13, 2020

Test build #116629 has finished for PR 27185 at commit 48c8b37.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@LantaoJin
Contributor Author

@@ -108,9 +108,16 @@ case class CreateViewCommand(
verifyTemporaryObjectsNotExists(catalog)

if (viewType == LocalTempView) {
if (replace && catalog.getTempView(name.table).isDefined) {
sparkSession.catalog.uncacheTable(name.quotedString)
Member


How about showing an INFO log for this case instead of implicitly uncaching?

Contributor Author


I prefer adding an INFO log, but not removing the uncacheTable here, because after replacing there is no way to drop the old cached data.
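A minimal sketch of what the combined logging-plus-uncache could look like in CreateViewCommand (the exact log message and wording are assumptions, not the merged code):

```scala
if (replace && catalog.getTempView(name.table).isDefined) {
  // log before implicitly dropping the old cached data, so the behavior is visible
  logInfo(s"Try to uncache ${name.quotedString} before replacing.")
  sparkSession.catalog.uncacheTable(name.quotedString)
}
```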

Member


I didn't mean that, and it's OK to just add logging.

Contributor Author


Yes. Added.

@LantaoJin LantaoJin changed the title [SPARK-30494][SQL] Avoid duplicated cached RDD when create or replace an existing view [SPARK-30494][SQL] Avoid duplicated cached RDD when replace an existing view Jan 14, 2020
@SparkQA

SparkQA commented Jan 14, 2020

Test build #116668 has finished for PR 27185 at commit 69ab94f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@LantaoJin
Contributor Author

@srowen
Member

srowen commented Jan 31, 2020

Seems reasonable to me?

@cloud-fan
Contributor

I think it makes sense, but we should follow how similar things are done in DROP TABLE.

@LantaoJin
Contributor Author

LantaoJin commented Feb 1, 2020

I think it makes sense, but we should follow how similar things are done in DROP TABLE.

I think they are similar to DROP TABLE. In DropTableCommand:

      try {
        sparkSession.sharedState.cacheManager.uncacheQuery(
          sparkSession.table(tableName), cascade = !isTempView)
      } catch {
        case NonFatal(e) => log.warn(e.toString, e)
      }

I added sparkSession.catalog.uncacheTable() in views.scala; uncacheTable() has similar logic:

sparkSession.catalog.uncacheTable(name.quotedString)

  override def uncacheTable(tableName: String): Unit = {
    val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
    val cascade = !sessionCatalog.isTemporaryTable(tableIdent)
    sparkSession.sharedState.cacheManager.uncacheQuery(sparkSession.table(tableName), cascade)
  }

Can I just use try..catch to wrap the uncacheTable()?
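If the same defensive pattern as DropTableCommand were applied here, it might look like the following sketch (whether and where this wrapping actually belongs was left to the reviewers; the placement is an assumption):

```scala
import scala.util.control.NonFatal

// swallow non-fatal failures from uncaching, as DropTableCommand does,
// so replacing the view still succeeds even if uncache fails
try {
  sparkSession.catalog.uncacheTable(name.quotedString)
} catch {
  case NonFatal(e) => log.warn(e.toString, e)
}
```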

@SparkQA

SparkQA commented Feb 1, 2020

Test build #117717 has finished for PR 27185 at commit 8a2a9f4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

cloud-fan commented Feb 3, 2020

Shall we further generalize it? Currently we uncache tables in several commands like DROP TABLE, TRUNCATE TABLE, etc., and now we find more missing places like CREATE VIEW.

Instead of un-caching tables in the commands, I feel it's better to do it in low-level basic operations like SessionCatalog.dropTable, createTempView, etc.

@LantaoJin
Contributor Author

LantaoJin commented Feb 3, 2020

To do that, SessionCatalog needs a reference to SparkSession. We would need to pass SparkSession to the constructor arguments of SessionCatalog. Is that okay?

@LantaoJin
Contributor Author

LantaoJin commented Feb 3, 2020

To do that, SessionCatalog needs a reference to SparkSession. We would need to pass SparkSession to the constructor arguments of SessionCatalog. Is that okay?

Oh, I checked the code. Catalog and CacheManager are both in the sql-core package, but SessionCatalog is in the sql-catalyst package, so we cannot reference them there.

@cloud-fan
Contributor

Ah, I see. CacheManager is in sql/core, not catalyst. Then can we create a util function to make sure we uncache tables in a consistent way?

@LantaoJin
Contributor Author

I think def uncacheTable(tableName: String): Unit in Catalog.scala is already the general function to uncache a table. But it still cannot prevent callers from using other ways on the invoking side.

def uncacheTable(tableName: String): Unit

Can we move sql/core/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala to catalyst package?

@cloud-fan
Contributor

The entry point of Catalog is SparkSession#catalog. It doesn't help if we only move Catalog to catalyst.

@LantaoJin
Contributor Author

LantaoJin commented Feb 3, 2020

The entry point of Catalog is SparkSession#catalog. It doesn't help if we only move Catalog to catalyst.

Yes. Another way is adding an uncacheTable method to the ExternalCatalog interface and passing cacheManager to the HiveExternalCatalog constructor arguments. But InMemoryCatalog in the catalyst package cannot use cacheManager, so the unit test would only work when using HiveExternalCatalog.

But I am not sure whether it is worth touching the ExternalCatalog interface.

@LantaoJin
Contributor Author

retest please

@maropu
Member

maropu commented Mar 14, 2020

retest this please

@SparkQA

SparkQA commented Mar 14, 2020

Test build #119788 has finished for PR 27185 at commit 8a2a9f4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 21, 2020

Test build #120127 has finished for PR 27185 at commit 57c3c33.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 21, 2020

Test build #120128 has finished for PR 27185 at commit 8839bdb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -21,7 +21,6 @@ import java.util.Locale

import org.apache.spark.sql.{Dataset, Row, SparkSession}
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.analysis.NoSuchTableException
Member


Shall we revert this change in this file from this PR?
This kind of irrelevant change makes the backport difficult.

Contributor Author


Sure.

@dongjoon-hyun
Member

Also, @LantaoJin . Please fix the PR description. For example, the following doesn't work.

beeline> create or replace temporary view temp1 as select 1
beeline> cache table tempView
beeline> create or replace temporary view temp1 as select 1, 2
beeline> cache table tempView

assert(spark.sharedState.cacheManager.lookupCachedData(sql("select 1")).isEmpty)
sql("cache table tempView")
assert(spark.sharedState.cacheManager.lookupCachedData(sql("select 1, 2")).isDefined)
assert(spark.sharedState.cacheManager.lookupCachedData(sql("select 1")).isEmpty)
Member


This seems to be redundant due to line 1133.

Contributor Author


Removed

assert(spark.sharedState.cacheManager.lookupCachedData(sql("select 1")).isEmpty)
sql("cache table global_temp.tempGlobalTempView")
assert(spark.sharedState.cacheManager.lookupCachedData(sql("select 1, 2")).isDefined)
assert(spark.sharedState.cacheManager.lookupCachedData(sql("select 1")).isEmpty)
Member


This seems to be redundant due to line 1145.

// |
// + Project[1 AS 1]
spark.sharedState.cacheManager.uncacheQuery(spark.table("view1"), cascade = false)
assert(spark.sharedState.cacheManager.isEmpty)
Member


This looks like an invalid test scope because this is testing uncacheQuery at line 1164 instead of create or replace view. Shall we revise this whole test case?

Contributor Author

@LantaoJin LantaoJin Mar 22, 2020


Lines 1164 and 1165 are there to make sure our patch is working: no cached data leaks for a persisted view. Without this patch, the assert at 1165 would fail. So I don't think it's an invalid test.

@@ -1122,4 +1122,47 @@ class CachedTableSuite extends QueryTest with SQLTestUtils
assert(!spark.catalog.isCached("t1"))
}
}

test("SPARK-30494 avoid duplicated cached RDD when replace an existing view") {
Member


This can be misleading because the cached RDD is not duplicated. The two cached RDDs are independent.

Contributor Author


OK. How about "Fix the leak of cached data when replace an existing view"?

Member

@dongjoon-hyun dongjoon-hyun left a comment


Hi, @LantaoJin . Could you revise the PR once more? Thanks!

@LantaoJin
Contributor Author

Ah, you meant the beeline>. Sorry, I tested via the Thrift server; I will change it to spark.

Also, @LantaoJin . Please fix the PR description. For example, the following doesn't work.

beeline> create or replace temporary view temp1 as select 1
beeline> cache table tempView
beeline> create or replace temporary view temp1 as select 1, 2
beeline> cache table tempView

Ah I get it.

@LantaoJin LantaoJin changed the title [SPARK-30494][SQL] Avoid duplicated cached RDD when replace an existing view [SPARK-30494][SQL] Fix the leak of cached data when replace an existing view Mar 22, 2020
@SparkQA

SparkQA commented Mar 22, 2020

Test build #120160 has finished for PR 27185 at commit e59fc98.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-30494][SQL] Fix the leak of cached data when replace an existing view [SPARK-30494][SQL] Fix the leak of cached data when replace an existing temp view Mar 23, 2020
@dongjoon-hyun dongjoon-hyun changed the title [SPARK-30494][SQL] Fix the leak of cached data when replace an existing temp view [SPARK-30494][SQL] Fix the leak of cached data when replace an existing view Mar 23, 2020
@dongjoon-hyun dongjoon-hyun changed the title [SPARK-30494][SQL] Fix the leak of cached data when replace an existing view [SPARK-30494][SQL] Fix cached data leakage during replacing an existing view Mar 23, 2020
Member

@dongjoon-hyun dongjoon-hyun left a comment


+1, LGTM. Thank you, @LantaoJin and all!
Merged to master/3.0.

dongjoon-hyun pushed a commit that referenced this pull request Mar 23, 2020
…ng view

Closes #27185 from LantaoJin/SPARK-30494.

Authored-by: LantaoJin <jinlantao@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 929b794)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@dongjoon-hyun
Member

Hi, @LantaoJin . Could you make a backport PR against branch-2.4?

LantaoJin added a commit to LantaoJin/spark that referenced this pull request Mar 24, 2020
…ng view

Closes apache#27185 from LantaoJin/SPARK-30494.

Authored-by: LantaoJin <jinlantao@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 929b794)
dongjoon-hyun pushed a commit that referenced this pull request Mar 24, 2020
…xisting view

This is a backport of #27185 to branch-2.4.

Closes #28000 from LantaoJin/SPARK-30494_2.4.

Lead-authored-by: lajin <lajin@ebay.com>
Co-authored-by: LantaoJin <jinlantao@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
…ng view

Closes apache#27185 from LantaoJin/SPARK-30494.

Authored-by: LantaoJin <jinlantao@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>