
[SPARK-21043][SQL] Add unionByName in Dataset #18300

Closed
wants to merge 11 commits

Conversation

@maropu (Member) commented Jun 14, 2017

What changes were proposed in this pull request?

This PR adds unionByName to Dataset.
Example usage:

val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
val df2 = Seq((4, 5, 6)).toDF("col1", "col2", "col0")
df1.unionByName(df2).show

// output:
// +----+----+----+
// |col0|col1|col2|
// +----+----+----+
// |   1|   2|   3|
// |   6|   4|   5|
// +----+----+----+

How was this patch tested?

Added tests in DataFrameSuite.
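For illustration, a test along these lines would exercise the by-name reordering shown above (a sketch based on the example in the PR description, not the actual suite code; checkAnswer is the standard QueryTest helper):

test("SPARK-21043: union by name") {
  val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
  val df2 = Seq((4, 5, 6)).toDF("col1", "col2", "col0")
  // df2's columns are reordered to (col0, col1, col2) before the union
  checkAnswer(df1.unionByName(df2), Row(1, 2, 3) :: Row(6, 4, 5) :: Nil)
}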

SparkQA commented Jun 14, 2017

Test build #78040 has finished for PR 18300 at commit de9af43.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jun 14, 2017

Test build #78041 has finished for PR 18300 at commit 97ea33d.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jun 14, 2017

Test build #78044 has finished for PR 18300 at commit 3b04902.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jun 14, 2017

Test build #78046 has finished for PR 18300 at commit 8783524.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member) left a comment

You can use sparkSession.sessionState.conf.resolver to compare the column names. The goal of this PR is to build a Project by column-name comparison. Could we simplify the implementation by using a for loop with find + resolver?
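A minimal sketch of what the suggested find + resolver loop could look like (illustrative only; the names and surrounding Dataset code here are assumptions, not the final implementation):

val resolver = sparkSession.sessionState.conf.resolver
// Reorder other's output to match this Dataset's column names
val rightProjectList = logicalPlan.output.map { lattr =>
  other.logicalPlan.output.find(rattr => resolver(lattr.name, rattr.name)).getOrElse {
    throw new AnalysisException(s"""Cannot resolve column name "${lattr.name}"""")
  }
}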

* followed by a [[distinct]].
*
* The difference between this function and [[union]] is that this function
* resolves columns by name:
Member:

Nit: by name -> by name (not by position)

@@ -1764,6 +1764,68 @@ class Dataset[T] private[sql](
}

/**
* Returns a new Dataset containing union of rows in this Dataset and another Dataset.
*
* To do a SQL-style set union (that does deduplication of elements), use this function
Member:

Also add a comment: This is different from both UNION ALL and UNION DISTINCT in SQL.
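To illustrate the suggested comment (reusing df1/df2 from the PR description): SQL UNION ALL and UNION DISTINCT both match columns by position, while unionByName matches by name; like union, it keeps duplicate rows unless distinct is applied:

df1.unionByName(df2)            // matches columns by name, keeps duplicate rows
df1.unionByName(df2).distinct() // by-name matching plus UNION DISTINCT-style deduplication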

@maropu (Member Author) commented Jun 14, 2017

ok, I'll update soon.

s"""Cannot resolve column name "${lattr.name}" among """ +
s"""(${rightOutputAttrs.map(_.name).mkString(", ")})""")
}
}
Member Author:

@gatorsmile How about this implementation?

SparkQA commented Jun 15, 2017

Test build #78073 has finished for PR 18300 at commit b0fd2ac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jun 15, 2017

Test build #78076 has finished for PR 18300 at commit 5b41430.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val resolver = sparkSession.sessionState.analyzer.resolver
val rightProjectList = mutable.ArrayBuffer.empty[Attribute]
val rightOutputAttrs = right.output
for (lattr <- left.output) {
Member:

Since left and right always have the same number of columns (after the assertAnalyzed() at L1796), we do not need an ArrayBuffer; we can use map to build the Project for right. For example,

  left.output.map { lattr =>

Member Author:

Aha, ok.

for (lattr <- left.output) {
// To handle duplicate names, we first compute diff between `rightOutputAttrs` and
// already-found attrs in `rightProjectList`.
rightOutputAttrs.diff(rightProjectList).find { rattr => resolver(lattr.name, rattr.name)}
Member:

Inside the map, we can find matching column names by using filter + resolver:

  • If the number of found columns is more than one, throw an error for duplicate names.
  • If the number is zero, throw an error.
  • If the number is one, return the right-side attribute.

@maropu (Member Author) commented Jun 15, 2017

With that logic, it seems we cannot catch duplicate column names on the left side.
How about checking column-name duplication first, then building the right project list?


    // Check column name duplication in both sides first
    val leftOutputAttrs = left.output
    val rightOutputAttrs = right.output
    val caseSensitiveAnalysis = sparkSession.sessionState.conf.caseSensitiveAnalysis
    SchemaUtils.checkColumnNameDuplication(
      leftOutputAttrs.map(_.name), "left column names", caseSensitiveAnalysis)
    SchemaUtils.checkColumnNameDuplication(
      rightOutputAttrs.map(_.name), "right column names", caseSensitiveAnalysis)

    // Then, build a project list for `other` based on `logicalPlan` output names
    val resolver = sparkSession.sessionState.analyzer.resolver
    val rightProjectList = left.output.map { lattr =>
      val foundAttrs = rightOutputAttrs.filter { rattr => resolver(lattr.name, rattr.name) }
      assert(foundAttrs.size <= 1)
      foundAttrs.headOption.getOrElse {
        throw new AnalysisException(s"""Cannot resolve column name "${lattr.name}" among """ +
          s"""(${rightOutputAttrs.map(_.name).mkString(", ")})""")
      }
    }

(I used SchemaUtils here, implemented in #17758:
https://github.com/apache/spark/pull/17758/files#diff-dc9b15e4af298799d788b59d2baf96a9R29)
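Roughly what such a duplication check does (a sketch under the assumption that names are normalized by case sensitivity; not the actual SchemaUtils code from #17758):

def checkColumnNameDuplication(
    names: Seq[String], context: String, caseSensitive: Boolean): Unit = {
  val normalized = if (caseSensitive) names else names.map(_.toLowerCase)
  val dups = normalized.groupBy(identity).collect { case (name, vs) if vs.size > 1 => name }
  if (dups.nonEmpty) {
    // Error message shape mirrors the one asserted in the tests later in this thread
    throw new AnalysisException(s"Found duplicate column(s) in $context: ${dups.mkString(", ")}")
  }
}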

Member:

I am fine with this. If so, we just need to use find to get the first matched column.

Member Author:

okay, I'll update after #17758 is finished.

*/
def unionByName(other: Dataset[T]): Dataset[T] = withSetOperator {
// Creates a `Union` node and resolves it first to reorder output attributes in `other` by name
val unionPlan = sparkSession.sessionState.executePlan(Union(logicalPlan, other.logicalPlan))
@viirya (Member) commented Jun 19, 2017

Is this always resolvable? If the columns don't have the same data type, the Union may not be resolved.
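For example, a pair of columns with no common coercible type (hypothetical snippet; the error text is paraphrased):

val df1 = Seq(Tuple1(Array(1))).toDF("a") // a: array<int>
val df2 = Seq(Tuple1("x")).toDF("a")      // a: string
df1.union(df2)
// throws AnalysisException: Union can only be performed on tables with
// compatible column types (paraphrased)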

@maropu (Member Author) commented Jun 19, 2017

In that case, I think we wouldn't pass unionPlan.assertAnalyzed() below?

Member:

Yeah, so we don't plan to support it?

Member Author:

I think (as you already know) TypeCoercion in the Analyzer resolves compatible types for that case, e.g.: https://github.com/apache/spark/pull/18300/files#diff-5d2ebf4e9ca5a990136b276859769289R122. Did you have other cases in mind?
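A small example of the coercion being referred to (illustrative; schema output trimmed):

val intDf = Seq(1).toDF("a")      // a: int
val doubleDf = Seq(2.5).toDF("a") // a: double
intDf.union(doubleDf).printSchema()
// root
//  |-- a: double (nullable = false)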

@viirya (Member) commented Jun 19, 2017

hmm, I mean a case like this:

val df1 = Seq((1, "2", 3.4)).toDF("a", "b", "c")
val df2 = Seq((6.7, 4, "5")).toDF("c", "a", "b")

And the result should be Row(1, "2", 3.4) :: Row(4, "5", 6.7).

That's what I guess unionByName should do?

Forcibly widening the types looks a bit weird to me, because after the union the schema is different from the original datasets.

Or maybe I'm missing the purpose of this API?

Member Author:

Aha, I see. This is a bug, so I'll look into it. Thanks!
The goal here is just to union by name while keeping the union semantics.

@maropu (Member Author) commented Jun 19, 2017

@viirya How about this?

scala> val df1 = Seq((1, "2", 3.4)).toDF("a", "b", "c")
scala> val df2 = Seq((1, "3", 6.7)).toDF("a", "b", "c")
scala> df1.union(df2).printSchema
root
 |-- a: integer (nullable = false)
 |-- b: string (nullable = true)
 |-- c: double (nullable = false)

scala> df1.union(df2).show
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  2|3.4|
|  1|  3|6.7|
+---+---+---+

scala> val df1 = Seq((1, "2", 3.4)).toDF("a", "b", "c")
scala> val df2 = Seq((6.7, 4, "5")).toDF("c", "a", "b")
scala> df1.unionByName(df2).printSchema
root
 |-- a: integer (nullable = false)
 |-- b: string (nullable = true)
 |-- c: double (nullable = false)

scala> df1.unionByName(df2).show
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  2|3.4|
|  1|  3|6.7|
+---+---+---+

@maropu (Member Author) commented Jun 19, 2017

oh, the current implementation does not work well..., so I need to think about this more.

SparkQA commented Jun 19, 2017

Test build #78252 has finished for PR 18300 at commit ed26881.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val unionDf = df1.unionByName(df2.unionByName(df3))
checkAnswer(unionDf,
Row(1, "a", 3.0) :: Row(2, "bc", 1.2) :: Row(3, "def", 1.2) :: Nil
)
Member:

Hi, @maropu.
To be clearer, could you add more test cases requiring type coercion here?

Member Author:

yea, sure.

SparkQA commented Jun 21, 2017

Test build #78385 has finished for PR 18300 at commit 7e94f4e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

def unionByName(other: Dataset[T]): Dataset[T] = withSetOperator {
// Resolves children first to reorder output attributes in `other` by name
val leftPlan = sparkSession.sessionState.executePlan(logicalPlan)
val rightPlan = sparkSession.sessionState.executePlan(other.logicalPlan)
@viirya (Member) commented Jun 22, 2017

I think a Dataset already guarantees its plan is analyzed and passes the checks? Do we need to resolve the plans again?

Member Author:

yeah, it seems we don't need to. Removed. Thanks!

Member:

logicalPlan and other.logicalPlan are already analyzed plans. It looks like you just access the analyzed plans below, so we can simply use logicalPlan and other.logicalPlan?

// SchemaUtils.checkColumnNameDuplication(
// rightOutputAttrs.map(_.name),
// "in the right attributes",
// sparkSession.sessionState.conf.caseSensitiveAnalysis)
Member:

Why is the above commented out?

Member Author:

The function to check name duplication is discussed in #17758. I'm planning to use that function to check for duplication and then do the union-by-name. See the discussion: #18300 (comment)

SparkQA commented Jun 22, 2017

Test build #78425 has finished for PR 18300 at commit b17a14e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member) commented Jun 22, 2017

retest this please.

SparkQA commented Jun 22, 2017

Test build #78438 has finished for PR 18300 at commit b17a14e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) commented:

Retest this please

SparkQA commented Jul 5, 2017

Test build #79234 has finished for PR 18300 at commit b17a14e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 6, 2017

Test build #79268 has finished for PR 18300 at commit c43a968.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Row(1, "a", 3.0) :: Row(2, "bc", 1.2) :: Row(3, "def", 1.2) :: Nil
)

// Failure cases
Member:

Could we split the test case test("union by name") into multiple ones?

Member Author:

ok

@gatorsmile (Member) commented:

LGTM, waiting for #17758.

SparkQA commented Jul 10, 2017

Test build #79457 has finished for PR 18300 at commit bae9ff0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) left a comment

+1, LGTM, too.

df1.unionByName(df2)
}.getMessage
assert(errMsg.contains("Found duplicate column(s) in the left attributes:"))
df1 = Seq((1, 1)).toDF("c0", "c1")
Member:

Nit: indents.

Member Author:

Updated

SparkQA commented Jul 11, 2017

Test build #79482 has finished for PR 18300 at commit 2c59dfd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member) commented:

Thanks! Merging to master.

@asfgit closed this in a2bec6c on Jul 11, 2017