[SPARK-25121][SQL] Supports multi-part table names for broadcast hint resolution #22198

maropu · 2018-08-23T09:53:15Z

What changes were proposed in this pull request?

This PR fixes the code to respect the database name in broadcast table hint resolution.
Currently, Spark ignores the database name in multi-part names:

scala> sql("CREATE DATABASE testDb")
scala> spark.range(10).write.saveAsTable("testDb.t")

// without this patch
scala> spark.range(10).join(spark.table("testDb.t"), "id").hint("broadcast", "testDb.t").explain
== Physical Plan ==
*(2) Project [id#24L]
+- *(2) BroadcastHashJoin [id#24L], [id#26L], Inner, BuildLeft
   :- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
   :  +- *(1) Range (0, 10, step=1, splits=4)
   +- *(2) Project [id#26L]
      +- *(2) Filter isnotnull(id#26L)
         +- *(2) FileScan parquet testdb.t[id#26L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/maropu/Repositories/spark/spark-2.3.1-bin-hadoop2.7/spark-warehouse..., PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>

// with this patch
scala> spark.range(10).join(spark.table("testDb.t"), "id").hint("broadcast", "testDb.t").explain
== Physical Plan ==
*(2) Project [id#3L]
+- *(2) BroadcastHashJoin [id#3L], [id#5L], Inner, BuildRight
   :- *(2) Range (0, 10, step=1, splits=4)
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true]))
      +- *(1) Project [id#5L]
         +- *(1) Filter isnotnull(id#5L)
            +- *(1) FileScan parquet testdb.t[id#5L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/maropu/Repositories/spark/spark-master/spark-warehouse/testdb.db/t], PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>
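
The difference between the two plans can be illustrated with a minimal, Spark-free sketch. `HintMatchSketch`, `Ident`, `matchesByTableOnly`, and `matchesMultiPart` are hypothetical names, and the matching logic is an assumption about the behaviour described above, not the actual `ResolveHints` code.

```scala
object HintMatchSketch {
  // Hypothetical stand-in for a relation identifier: optional database plus table name.
  final case class Ident(database: Option[String], table: String)

  // Assumed pre-patch behaviour: the whole hint string is compared with the table name only,
  // so a hint such as "testDb.t" never matches the relation `t`.
  def matchesByTableOnly(hintName: String, rel: Ident): Boolean =
    hintName.equalsIgnoreCase(rel.table)

  // Assumed post-patch behaviour: a bare name still matches on the table name,
  // while a qualified name is compared part by part.
  def matchesMultiPart(hintName: String, rel: Ident): Boolean =
    hintName.split('.').toList match {
      case table :: Nil => table.equalsIgnoreCase(rel.table)
      case parts =>
        val relParts = rel.database.toList :+ rel.table
        parts.corresponds(relParts)((a, b) => a.equalsIgnoreCase(b))
    }
}
```

With this sketch, the hint name `testDb.t` is dropped by `matchesByTableOnly` but accepted by `matchesMultiPart` for the relation `testdb.t`.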

How was this patch tested?

Added tests in DataFrameJoinSuite.

maropu · 2018-08-23T09:55:57Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala

@@ -191,6 +195,39 @@ class DataFrameJoinSuite extends QueryTest with SharedSQLContext {
    assert(plan2.collect { case p: BroadcastHashJoinExec => p }.size == 1)
  }

+  test("SPARK-25121 Supports multi-part names for broadcast hint resolution") {


Would it be better to move the three tests below into DataFrameHintSuite?

test("broadcast join hint using broadcast function")

test("broadcast join hint using Dataset.hint")

test("SPARK-25121 Supports multi-part names for broadcast hint resolution")

ResolveHintsSuite is the smallest one for this. Can we add the following test to ResolveHintsSuite?

test("Supports multi-part table names for broadcast hint resolution") {
  checkAnalysis(
    UnresolvedHint("MAPJOIN", Seq("default.table", "default.table2"),
      table("table").join(table("table2"))),
    Join(ResolvedHint(testRelation, HintInfo(broadcast = true)),
      ResolvedHint(testRelation2, HintInfo(broadcast = true)), Inner, None),
    caseSensitive = false)
}

SparkQA · 2018-08-23T13:39:09Z

Test build #95148 has finished for PR 22198 at commit d2be692.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2018-08-25T02:46:48Z

@dilipbiswal @gatorsmile ping

gatorsmile · 2018-08-25T06:15:44Z

cc @dongjoon-hyun Try to review this PR?

dilipbiswal · 2018-08-25T07:49:14Z

@maropu @gatorsmile @dongjoon-hyun I do have a question on the semantics.

use hint;
explain extended SELECT /*+ BROADCASTJOIN(hint.s2) */ * FROM s1, s2 where s1.c1 = s2.c1;

In this case, aren't we supposed to apply the hint, even though s2 is not explicitly qualified with the database in the FROM clause? Here is the optimized plan I see:

*(5) SortMergeJoin [c1#30], [c1#32], Inner
:- *(2) Sort [c1#30 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(c1#30, 200)
:     +- *(1) Filter isnotnull(c1#30)
:        +- HiveTableScan [c1#30, c2#31], HiveTableRelation `hint`.`s1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#30, c2#31]
+- *(4) Sort [c1#32 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(c1#32, 200)
      +- *(3) Filter isnotnull(c1#32)
         +- HiveTableScan [c1#32, c2#33], HiveTableRelation `hint`.`s2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#32, c2#33]

maropu · 2018-08-25T12:08:16Z

Aha, I see. IMO we need to apply the hint in this case, too. I'll fix it.

SparkQA · 2018-08-25T18:37:08Z

Test build #95252 has finished for PR 22198 at commit b5f9584.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class ResolveBroadcastHints(conf: SQLConf, catalog: SessionCatalog) extends Rule[LogicalPlan]

dongjoon-hyun · 2018-08-25T20:12:26Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala

@@ -191,6 +195,39 @@ class DataFrameJoinSuite extends QueryTest with SharedSQLContext {
    assert(plan2.collect { case p: BroadcastHashJoinExec => p }.size == 1)
  }

+  test("SPARK-25121 Supports multi-part names for broadcast hint resolution") {


ResolveHintsSuite is the smallest one for this. Can we add the following test to ResolveHintsSuite?

test("Supports multi-part table names for broadcast hint resolution") {
  checkAnalysis(
    UnresolvedHint("MAPJOIN", Seq("default.table", "default.table2"),
      table("table").join(table("table2"))),
    Join(ResolvedHint(testRelation, HintInfo(broadcast = true)),
      ResolvedHint(testRelation2, HintInfo(broadcast = true)), Inner, None),
    caseSensitive = false)
}

dongjoon-hyun · 2018-08-25T20:17:02Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala

@@ -47,20 +49,39 @@ object ResolveHints {
   *
   * This rule must happen before common table expressions.
   */
-  class ResolveBroadcastHints(conf: SQLConf) extends Rule[LogicalPlan] {
+  class ResolveBroadcastHints(conf: SQLConf, catalog: SessionCatalog) extends Rule[LogicalPlan] {


Accordingly, we can use String instead of SessionCatalog.

-  class ResolveBroadcastHints(conf: SQLConf, catalog: SessionCatalog) extends Rule[LogicalPlan] {
+  class ResolveBroadcastHints(conf: SQLConf, currentDatabase: String) extends Rule[LogicalPlan] {

I think we can't use String there because currentDatabase might be updated by others?

Can we use getCurrentDatabase: () => String instead?

Ya, right, please ignore this. We need the catalog to look up global_temp, too.
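
The concern about passing a plain String can be shown with a small sketch. `FakeCatalog`, `RuleWithString`, and `RuleWithThunk` are hypothetical stand-ins, not Spark classes; the sketch only assumes that the current database can change after the rule is constructed.

```scala
object CurrentDbSketch {
  // Hypothetical, minimal stand-in for a session catalog with a mutable current database.
  final class FakeCatalog(private var currentDb: String) {
    def getCurrentDatabase: String = currentDb
    def setCurrentDatabase(db: String): Unit = { currentDb = db }
  }

  // Captures the database name once, at construction time.
  final class RuleWithString(currentDatabase: String) {
    def qualify(table: String): String = s"$currentDatabase.$table"
  }

  // Looks the database name up every time the rule is applied.
  final class RuleWithThunk(getCurrentDatabase: () => String) {
    def qualify(table: String): String = s"${getCurrentDatabase()}.$table"
  }
}
```

A rule built from the String keeps qualifying names with the old database after a switch, while the `() => String` variant follows the catalog. As the thread concludes, the catalog itself is still needed for the global_temp lookup, so this only illustrates the staleness argument.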

dongjoon-hyun · 2018-08-25T23:39:00Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala

+        tableIdent: IdentifierWithDatabase): Boolean = {
+      val identifierList =
+        tableIdent.database.getOrElse(catalog.getCurrentDatabase) :: tableIdent.identifier :: Nil
+      namePartsWithDatabase(nameParts).corresponds(identifierList)(resolver)


This logic will make a regression (plan1 in the below) in case of global temporary view. Please add the following test case into GlobalTempViewSuite and revise the logic to handle both cases correctly.

test("broadcast hint on global temp view") {
  import org.apache.spark.sql.catalyst.plans.logical.{ResolvedHint, Join}

  withGlobalTempView("v1") {
    spark.range(10).createGlobalTempView("v1")
    withTempView("v2") {
      spark.range(10).createTempView("v2")

      Seq(
        "SELECT /*+ MAPJOIN(v1) */ * FROM global_temp.v1, v2 WHERE v1.id = v2.id",
        "SELECT /*+ MAPJOIN(global_temp.v1) */ * FROM global_temp.v1, v2 WHERE v1.id = v2.id"
      ).foreach { statement =>
        val plan = sql(statement).queryExecution.optimizedPlan
        assert(plan.asInstanceOf[Join].left.isInstanceOf[ResolvedHint])
        assert(!plan.asInstanceOf[Join].right.isInstanceOf[ResolvedHint])
      }
    }
  }
}
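
The regression can be reproduced without Spark in a rough sketch. The body of `namePartsWithDatabase` is not shown in the diff, so the version below is an assumption (qualify a bare name with the current database); `Ident`, `resolver`, and `matched` are likewise hypothetical stand-ins for the code under review.

```scala
object GlobalTempRegressionSketch {
  // Hypothetical stand-in for IdentifierWithDatabase.
  final case class Ident(database: Option[String], identifier: String)

  // Assumed behaviour of the helper named in the diff:
  // a bare hint name is qualified with the current database.
  def namePartsWithDatabase(nameParts: Seq[String], currentDb: String): Seq[String] =
    if (nameParts.size == 1) currentDb +: nameParts else nameParts

  // Case-insensitive name comparison, standing in for the analyzer's resolver.
  val resolver: (String, String) => Boolean = (a, b) => a.equalsIgnoreCase(b)

  // Mirrors the reviewed logic: fill in the current database, then compare part by part.
  def matched(nameParts: Seq[String], ident: Ident, currentDb: String): Boolean = {
    val identifierList = ident.database.getOrElse(currentDb) :: ident.identifier :: Nil
    namePartsWithDatabase(nameParts, currentDb).corresponds(identifierList)(resolver)
  }
}
```

Under these assumptions, `MAPJOIN(v1)` is expanded to `default.v1` and no longer matches `global_temp.v1`, which is the first statement in the test above.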

@dongjoon-hyun a little confused about the name resolution here;

"SELECT /*+ MAPJOIN(v1) */ * FROM global_temp.v1, v2 WHERE v1.id = v2.id",

MAPJOIN(v1) implicitly means global_temp.v1?
For example;

"SELECT /*+ MAPJOIN(v1) */ * FROM default.v1, global_temp.v1 WHERE default.v1.id = global_temp.v1.id",

In this case, what's the MAPJOIN(v1) behaviour?

1. Apply no hint (current behaviour)
2. Apply a hint to default.v1 only
3. Apply a hint to both

WDYT?

First of all, the above two test cases in test("broadcast hint on global temp view") should work as before. In general, global_temp.v1 should be referenced with the global_temp. prefix. However, before this PR, we could not put a database name in a hint, so we allowed an exceptional case: hints on a global temporary view without the global_temp. prefix.

For the case you mentioned, I'd like to interpret MAPJOIN(v1) as default.v1 only, because that is Spark's behavior outside this hint syntax. Please add a test case for this, too.

@cloud-fan and @gatorsmile, could you give us some advice, too? Is this okay with you?

BTW, @maropu, in addition:

The current behavior of the master branch (Spark 2.4) is to apply the hint to both relations.

The legacy behavior of Spark 2.3.1 is to raise an AnalysisException for that query.

So, I think it's a good change to make this consistent in Spark 2.4.

scala> sql("set spark.sql.autoBroadcastJoinThreshold=-1")
scala> sql("set spark.sql.crossJoin.enabled=true")
scala> sql("drop view v1")
scala> sql("create view v1 as select 'view' id").show
scala> sql("create global temporary view v1 as select 'global_temp_view' id").show
scala> sql("SELECT /*+ MAPJOIN(v1) */ * FROM v1, global_temp.v1 WHERE default.v1.id = global_temp.v1.id").explain(true)
org.apache.spark.sql.AnalysisException: cannot resolve '`default.v1.id`' given input columns: [v1.id, v1.id]; line 1 pos 58;

oh, yes. I'll refine the pr. thanks.

dongjoon-hyun · 2018-08-26T00:25:40Z

@maropu, I opened a PR against your branch (maropu#2) with two additional test cases. You can review and merge it into this PR. GlobalTempViewSuite.scala will start to fail after merging.

maropu · 2018-08-26T04:07:38Z

Thanks, @dongjoon-hyun! I'll check and merge that.

xuanyuanking · 2018-08-26T11:06:45Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala

+          assert(plan.collect { case p: BroadcastHashJoinExec => p }.size == 0)
+
+          // Uses multi-part table names for broadcast hints
+          def checkIfHintApplied(tableName: String, hintTableName: String): Unit = {


hintTableName is never used in this func?

yea, I'll fix.

SparkQA · 2018-08-27T05:57:58Z

Test build #95275 has finished for PR 22198 at commit 5277563.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-08-27T07:05:02Z

Test build #95279 has finished for PR 22198 at commit bd3f502.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2018-08-27T07:22:16Z

retest this please

SparkQA · 2018-08-27T07:46:53Z

Test build #95281 has finished for PR 22198 at commit bd3f502.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2018-08-27T07:47:28Z

retest this please

SparkQA · 2018-08-27T11:45:14Z

Test build #95284 has finished for PR 22198 at commit bd3f502.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2018-08-27T19:27:03Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala

+        nameParts: Seq[String],
+        tableIdent: IdentifierWithDatabase): Boolean = {
+      tableIdent.database match {
+        case Some(db) if catalog.globalTempViewManager.database == formatDatabaseName(db) =>


Could you use resolver here like the following and remove formatDatabaseName in line 65~67? Since it's a SessionCatalog function, let's avoid duplication.

-        case Some(db) if catalog.globalTempViewManager.database == formatDatabaseName(db) =>
+        case Some(db) if resolver(catalog.globalTempViewManager.database, db) =>

Also, we need a case-sensitive test. I made another PR to you for that, maropu#3.
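
A short sketch of why the resolver-based guard needs a case-sensitive test as well. The `Resolver` alias below has the same shape as the resolver used in the diff (a function over two names), but `caseSensitive`, `caseInsensitive`, and `isGlobalTempDb` are local, hypothetical definitions rather than Spark's own.

```scala
object ResolverSketch {
  // A resolver decides whether two names refer to the same thing.
  type Resolver = (String, String) => Boolean

  val caseSensitive: Resolver = (a, b) => a == b
  val caseInsensitive: Resolver = (a, b) => a.equalsIgnoreCase(b)

  // Hypothetical check mirroring the suggested guard:
  // is `db` the global temp view database under the given resolver?
  def isGlobalTempDb(globalTempDb: String, db: String, resolver: Resolver): Boolean =
    resolver(globalTempDb, db)
}
```

With the case-insensitive resolver, `GLOBAL_TEMP` is treated as the global temp database; with the case-sensitive one it is not, so both settings deserve a test.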

SparkQA · 2018-08-28T04:15:14Z

Test build #95322 has finished for PR 22198 at commit 83387f6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2018-08-28T06:08:26Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala

+        case _ =>
+          val db = tableIdent.database.getOrElse(catalog.getCurrentDatabase)
+          val identifierList = db :: tableIdent.identifier :: Nil
+          namePartsWithDatabase(nameParts, catalog.getCurrentDatabase)


This part will break the temporary view case. In the following case, no table should be broadcast. Also, could you add more test cases? We need to test tables, global temporary views, temporary views, and views. It seems that we still miss some cases, like the following.

scala> :paste
// Entering paste mode (ctrl-D to finish)

sql("set spark.sql.autoBroadcastJoinThreshold=-1")
spark.range(10).write.mode("overwrite").saveAsTable("t")
sql("create temporary view tv as select * from t")
sql("select /*+ mapjoin(default.tv) */ * from t, tv where t.id = tv.id").explain
sql("select * from default.tv")

// Exiting paste mode, now interpreting.

== Physical Plan ==
*(2) BroadcastHashJoin [id#7L], [id#12L], Inner, BuildRight
:- *(2) Project [id#7L]
:  +- *(2) Filter isnotnull(id#7L)
:     +- *(2) FileScan parquet default.t[id#7L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/dongjoon/PR-22198/spark-warehouse/t], PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true]))
   +- *(1) Project [id#12L]
      +- *(1) Filter isnotnull(id#12L)
         +- *(1) FileScan parquet default.t[id#12L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/dongjoon/PR-22198/spark-warehouse/t], PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>
org.apache.spark.sql.AnalysisException: Table or view not found: `default`.`tv`; line 1 pos 14;
'Project [*]
+- 'UnresolvedRelation `default`.`tv`

ok, I'll try.

dongjoon-hyun · 2018-08-28T17:13:10Z

@maropu, @dilipbiswal, and @gatorsmile:

The complexity arises because this PR duplicates the existing name resolution logic. Although we could move matchedTableIdentifier to SessionCatalog, I think we had better clarify the purpose of this PR again.

From your PR description,

Currently, spark ignores a database name in multi-part names;

Originally, ResolveBroadcastHints was designed to be executed in the first batch, before ResolveRelation. So, it only compares table names: /*+ MAPJOIN(t) */ will broadcast both testDb.t and testDb2.t. We will keep this behavior, won't we?

For /*+ MAPJOIN(testDb.t) */,

1. What we wanted at the beginning seems to be only matching testDb.t to testDb.t.
2. After @dilipbiswal's comment, this PR aims at real resolution, matching /*+ MAPJOIN(testDb.t) */ to t. However, t depends on the session (currentDatabase, temporary views, and global temporary views).

To sum up, until now we have been moving toward (2), but is (2) really required for SPARK-25121? If we choose (1), it will be simpler and consistent with the original design choice (matching based on unresolved strings).
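
The two options can be contrasted in a small sketch. All names here (`LiteralMatchSketch`, `literalMatch`, `resolvedMatch`) are hypothetical, and the logic is one reading of options (1) and (2) in this comment, not code from the PR. Option (2) is simplified to current-database qualification only; temporary views and global temporary views are left out.

```scala
object LiteralMatchSketch {
  private def sameParts(a: Seq[String], b: Seq[String]): Boolean =
    a.corresponds(b)((x, y) => x.equalsIgnoreCase(y))

  // Option (1): work on the strings as written in the query.
  // A bare hint name matches on the table name alone; a qualified one must match literally.
  def literalMatch(hintParts: Seq[String], queryParts: Seq[String]): Boolean =
    if (hintParts.size == 1) queryParts.lastOption.exists(_.equalsIgnoreCase(hintParts.head))
    else sameParts(hintParts, queryParts)

  // Option (2), simplified: qualify both sides with session state before comparing.
  def resolvedMatch(hintParts: Seq[String], queryParts: Seq[String], currentDb: String): Boolean = {
    def qualify(parts: Seq[String]): Seq[String] =
      if (parts.size == 1) currentDb +: parts else parts
    sameParts(qualify(hintParts), qualify(queryParts))
  }
}
```

Under option (1), `MAPJOIN(testDb.t)` does not match a relation written as `t`; under option (2) it does, but only when the current database is `testDb`, which is the session dependence noted above.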

dilipbiswal · 2018-08-28T22:44:10Z

@dongjoon-hyun Thanks for nicely summarizing. Actually, I was not clear on the semantics when I asked the question :-) and was wondering whether we should resolve it like a table identifier or just match it like a string. Do we know how other databases that support hints handle this? I am actually fine if we go with option 1.

maropu · 2018-08-29T00:31:45Z

Thanks for the summary. I like the simpler one, too. Let me describe this in more detail to check my understanding; IIUC we have two cases: (1) the fully-qualified case /*+ MAPJOIN(testDb.t) */ and (2) the non-qualified case /*+ MAPJOIN(t) */.

(1) No ambiguity; as @dongjoon-hyun said, we just map testDb.t exactly to testDb.t.
(2) This is the case @dongjoon-hyun worries about, right?

Since I think most users hit case (2) (they probably don't add database names there in most cases), IMHO it is important to support this syntax for usability. Based on that, my proposal is that we handle /*+ MAPJOIN(t) */ as /*+ MAPJOIN(*.t) */ to pick up all the matching tables and views. Then, we print a warning message for users like this: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L265
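
A rough sketch of this proposal, assuming a two-part name at most. `WildcardHintSketch`, `Rel`, `toPattern`, and `applyHint` are hypothetical names, and the warning text is invented for illustration; the proposal itself does not specify it.

```scala
object WildcardHintSketch {
  final case class Rel(database: String, table: String)

  // Treat a bare name `t` as the pattern `*.t`; keep qualified names unchanged.
  def toPattern(hintName: String): (String, String) = hintName.split('.') match {
    case Array(table) => ("*", table)
    case Array(db, table) => (db, table)
    case _ => throw new IllegalArgumentException(s"unsupported hint name: $hintName")
  }

  def matches(hintName: String, rel: Rel): Boolean = {
    val (db, table) = toPattern(hintName)
    (db == "*" || db.equalsIgnoreCase(rel.database)) && table.equalsIgnoreCase(rel.table)
  }

  // Returns the hinted relations, plus a warning when a bare name hit several relations.
  def applyHint(hintName: String, rels: Seq[Rel]): (Seq[Rel], Option[String]) = {
    val hit = rels.filter(rel => matches(hintName, rel))
    val warning =
      if (!hintName.contains(".") && hit.size > 1)
        Some(s"Hint name '$hintName' matched ${hit.size} relations")
      else None
    (hit, warning)
  }
}
```

Here `applyHint("t", ...)` selects every relation named `t` and reports a warning, while `applyHint("testDb.t", ...)` selects only the qualified relation and stays silent.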

SparkQA · 2018-08-29T04:14:46Z

Test build #95395 has finished for PR 22198 at commit cc0cd4f.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2018-08-29T17:06:10Z

Ur, @maropu. What I was worried about was case (1). For testDb.t, there is no problem if this PR matches testDb.t literally. However, this PR tries to resolve every relation t, which could be a temporary view, a view, or a table. That's why this PR duplicates logic. Previously, this layer (hint resolution / handling / cleanup) did not aim to do that.

maropu · 2018-08-30T01:01:07Z

Aha, I see. It is simple to match identifiers literally. So, let me wait for other developers' comments. cc: @gatorsmile

SparkQA · 2018-08-30T04:31:51Z

Test build #95448 has finished for PR 22198 at commit 9a6da27.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2018-08-30T06:26:45Z

Thanks for understanding, @maropu. Yes. We need to build consensus.

@gatorsmile and @cloud-fan, could you give us some directional advice for this PR? Basically, we are wondering whether we need to provide the same name resolution at this hint layer. Please see the summary comment above.

SparkQA · 2019-02-13T18:08:27Z

Test build #102295 has finished for PR 22198 at commit b6b9f65.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2019-02-14T02:03:59Z

Could you check @cloud-fan @dongjoon-hyun ?

SparkQA · 2019-12-03T22:53:05Z

Test build #114805 has finished for PR 22198 at commit b6b9f65.

This patch fails build dependency tests.
This patch does not merge cleanly.
This patch adds no public classes.

Remove obsolete review comment.

github-actions · 2020-03-16T00:14:51Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

cloud-fan · 2020-03-16T07:35:41Z

does this problem still exist?

maropu · 2020-03-16T07:58:11Z

I'll check again.

maropu · 2020-03-17T00:21:05Z

Since this issue still exists, I'll open a new PR for this issue.

… resolution

### What changes were proposed in this pull request?

This pr fixed code to respect a database name for broadcast table hint resolution.
Currently, spark ignores a database name in multi-part names;

```
scala> sql("CREATE DATABASE testDb")
scala> spark.range(10).write.saveAsTable("testDb.t")

// without this patch
scala> spark.range(10).join(spark.table("testDb.t"), "id").hint("broadcast", "testDb.t").explain
== Physical Plan ==
*(2) Project [id#24L]
+- *(2) BroadcastHashJoin [id#24L], [id#26L], Inner, BuildLeft
   :- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
   :  +- *(1) Range (0, 10, step=1, splits=4)
   +- *(2) Project [id#26L]
      +- *(2) Filter isnotnull(id#26L)
         +- *(2) FileScan parquet testdb.t[id#26L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/maropu/Repositories/spark/spark-2.3.1-bin-hadoop2.7/spark-warehouse..., PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>

// with this patch
scala> spark.range(10).join(spark.table("testDb.t"), "id").hint("broadcast", "testDb.t").explain
== Physical Plan ==
*(2) Project [id#3L]
+- *(2) BroadcastHashJoin [id#3L], [id#5L], Inner, BuildRight
   :- *(2) Range (0, 10, step=1, splits=4)
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true]))
      +- *(1) Project [id#5L]
         +- *(1) Filter isnotnull(id#5L)
            +- *(1) FileScan parquet testdb.t[id#5L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/maropu/Repositories/spark/spark-master/spark-warehouse/testdb.db/t], PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>
```

This PR comes from #22198

### Why are the changes needed?

For better usability.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added unit tests.

Closes #27935 from maropu/SPARK-25121-2.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

… resolution (cherry picked from commit ca499e9)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>


maropu commented Aug 23, 2018

dongjoon-hyun previously requested changes Aug 25, 2018

dongjoon-hyun reviewed Aug 25, 2018

xuanyuanking reviewed Aug 26, 2018

dongjoon-hyun reviewed Aug 27, 2018

dongjoon-hyun reviewed Aug 28, 2018

maropu and others added 11 commits February 13, 2019 16:20

Fix

24de799

Fix

a6e4e40

Add test cases

f021770

Fix

d434ba7

Fix

c138b81

Fix

6a202f2

fix

545148b

Fix

bc29a11

Fix

59e60d4

Fix

5b2b272

Fix

b6b9f65

maropu force-pushed the SPARK-25121 branch from 7368dcc to b6b9f65 Compare February 13, 2019 14:03

dongjoon-hyun added the SQL label Jun 14, 2019

github-actions bot added the Stale label Mar 16, 2020

github-actions bot closed this Mar 17, 2020

maropu mentioned this pull request Mar 17, 2020

[SPARK-25121][SQL] Supports multi-part relation names for join strategy hint resolution #27935

Closed

gatorsmile changed the title ~~[SPARK-25121][SQL] Supports multi-part table names for broadcast hint resolution~~ [SPARK-25121][SQL] Supports multi-part table names for hint resolution Mar 25, 2020

gatorsmile changed the title ~~[SPARK-25121][SQL] Supports multi-part table names for hint resolution~~ [SPARK-25121][SQL] Supports multi-part table names for join hint resolution Mar 25, 2020

gatorsmile changed the title ~~[SPARK-25121][SQL] Supports multi-part table names for join hint resolution~~ [SPARK-25121][SQL] Supports multi-part table names for broadcast hint resolution Mar 25, 2020
