[SPARK-23195] [SQL] Keep the Hint of Cached Data #20368

gatorsmile · 2018-01-23T20:50:48Z

What changes were proposed in this pull request?

The broadcast hint on a plan is lost once the plan is cached. This PR corrects that.

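  import org.apache.spark.sql.functions.broadcast

  val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", "value")
  val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", "value")
  // Cache df2 with a broadcast hint attached, then materialize the cache.
  broadcast(df2).cache()
  df2.collect()
  // The join should still honor the broadcast hint on the cached df2.
  val df3 = df1.join(df2, Seq("key"), "inner")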

How was this patch tested?

Added a test.

gatorsmile · 2018-01-23T20:52:04Z

sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala

+        val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", "value")
+        val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", "value")
+        broadcast(df2).cache()
+        if (materialized) df2.collect()


PR #19864 happened to fix this issue for the case where the cached plan has not been materialized yet. It does not fix the case where the cached plan is already materialized.

gatorsmile · 2018-01-23T20:53:19Z

cc @sameeragarwal @cloud-fan @jiangxb1987 @zsxwing

jaceklaskowski · 2018-01-23T21:00:29Z

sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala

@@ -77,7 +77,7 @@ case class InMemoryRelation(
      // Underlying columnar RDD hasn't been materialized, use the stats from the plan to cache
      statsOfPlanToCache
    } else {
-      Statistics(sizeInBytes = batchStats.value.longValue)
+      Statistics(sizeInBytes = batchStats.value.longValue, hints = statsOfPlanToCache.hints)


Why not simply use statsOfPlanToCache.copy(sizeInBytes = batchStats.value.longValue)?

That would be misleading and conceptually wrong. Once the data is materialized, the statistics should be filled in with the actual values from the materialized results.
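To make the two options in this exchange concrete, here is a minimal sketch in plain Scala. The `Statistics` and `HintInfo` classes below are simplified, hypothetical stand-ins (not Spark's real classes), and the numbers are made up. It shows one reading of the objection: `copy` carries every estimated field of the plan's statistics over, while building a fresh value keeps only the hints.

```scala
// Simplified, hypothetical stand-ins for illustration only;
// Spark's real Statistics and HintInfo classes differ.
final case class HintInfo(broadcast: Boolean = false)

final case class Statistics(
    sizeInBytes: BigInt,
    rowCount: Option[BigInt] = None,
    hints: HintInfo = HintInfo())

object StatsSketch {
  // Estimated statistics of the plan being cached, carrying a broadcast hint.
  val statsOfPlanToCache: Statistics = Statistics(
    sizeInBytes = BigInt(1000),
    rowCount = Some(BigInt(10)),
    hints = HintInfo(broadcast = true))

  // Stand-in for batchStats.value.longValue after materialization (made-up value).
  val materializedSize: Long = 64L

  // Option A (suggested in review): copy everything and override only the size.
  // The estimated rowCount is carried over alongside the measured size.
  val copied: Statistics =
    statsOfPlanToCache.copy(sizeInBytes = BigInt(materializedSize))

  // Option B (what the diff does): build fresh statistics and keep only the hints.
  val rebuilt: Statistics =
    Statistics(sizeInBytes = BigInt(materializedSize), hints = statsOfPlanToCache.hints)
}
```

In this sketch both options keep the broadcast hint, but only Option B avoids mixing a measured size with an estimated row count.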

jaceklaskowski · 2018-01-23T21:01:42Z

sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala

+  test("broadcast hint is retained in a cached plan") {
+    Seq(true, false).foreach { materialized =>
+      withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
+        val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", "value")


Is the spark.createDataFrame(...) wrapper really required? I thought Seq((1, "4"), (2, "2")).toDF("key", "value") would work just fine.

That should not matter.

jaceklaskowski · 2018-01-23T21:03:41Z

sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala

+        val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", "value")
+        broadcast(df2).cache()
+        if (materialized) df2.collect()
+        val df3 = df1.join(df2, Seq("key"), "inner")


val df3 = df1.join(df2, "key")? inner is implied, isn't it? (I'm proposing the change because this and other tests could easily serve as learning material for Spark SQL's API.)

That should not matter.

I tend to agree that tests are also examples for Spark users, so we should pick the recommended usages.

All the other cases create DataFrames like this. Anyway, I changed all of them in the new PR.

jiangxb1987

LGTM

sameeragarwal · 2018-01-23T21:18:09Z

LGTM

CodingCat · 2018-01-23T22:48:56Z

sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala

@@ -63,7 +63,7 @@ case class InMemoryRelation(
    tableName: Option[String])(
    @transient var _cachedColumnBuffers: RDD[CachedBatch] = null,
    val batchStats: LongAccumulator = child.sqlContext.sparkContext.longAccumulator,
-    statsOfPlanToCache: Statistics = null)
+    statsOfPlanToCache: Statistics)


Leaving no default value is fine; we do not actually need any default value.

Setting null by default is risky, because we might hit a NullPointerException.
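The risk can be sketched in plain Scala as follows. The classes here are hypothetical, simplified stand-ins rather than Spark's `InMemoryRelation`; they only mirror the shape of a second parameter list with and without a `null` default.

```scala
import scala.util.Try

// Hypothetical, simplified stand-ins for illustration; not Spark's classes.
final case class Statistics(sizeInBytes: BigInt)

// With a null default, callers can omit the argument, and the failure
// only surfaces later, when the field is dereferenced.
final case class RelationWithDefault(name: String)(
    val statsOfPlanToCache: Statistics = null) {
  def sizeInBytes: BigInt = statsOfPlanToCache.sizeInBytes
}

// Without a default, omitting the argument is a compile-time error.
final case class RelationRequired(name: String)(
    val statsOfPlanToCache: Statistics) {
  def sizeInBytes: BigInt = statsOfPlanToCache.sizeInBytes
}

object NullDefaultSketch {
  // The caller forgot to pass statistics; this compiles without complaint.
  val forgotten: RelationWithDefault = RelationWithDefault("t")()

  // Dereferencing the null field throws NullPointerException at runtime.
  val failure: Try[BigInt] = Try(forgotten.sizeInBytes)

  // The required variant forces the caller to supply a value up front.
  val ok: BigInt = RelationRequired("t")(Statistics(BigInt(64))).sizeInBytes
}
```

Dropping the default moves the mistake from a runtime failure to a compile error at the call site.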

SparkQA · 2018-01-24T00:10:45Z

Test build #86545 has finished for PR 20368 at commit 21e5321.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2018-01-24T00:17:42Z

Thanks! Merged to master/2.3

Author: gatorsmile <gatorsmile@gmail.com>

Closes #20368 from gatorsmile/cachedBroadcastHint.

(cherry picked from commit 44cc4da)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>

dongjoon-hyun · 2018-01-24T05:59:52Z

Hi, All.

This seems to break both master and branch-2.3. It might be related to the sister PR, SPARK-23192.
Could you take a look?

dongjoon-hyun · 2018-01-24T06:01:56Z

As a result, this seems to block SparkPullRequestBuilder, too.

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/SparkPullRequestBuilder/86557/

gatorsmile · 2018-01-24T06:32:52Z

I am reverting this PR.

dongjoon-hyun · 2018-01-24T06:38:07Z

Oh, thank you so much for the fast recovery, @gatorsmile.

gatorsmile · 2018-01-24T06:40:30Z

Thanks! Reverted it from master/2.3.


jiangxb1987 approved these changes Jan 23, 2018


asfgit closed this in 44cc4da Jan 24, 2018
