
[SPARK-21043][SQL] Add unionByName in Dataset #18300

Closed
wants to merge 11 commits

Conversation

@maropu (Member) commented Jun 14, 2017

What changes were proposed in this pull request?

This PR adds unionByName to Dataset.
Example usage:

val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
val df2 = Seq((4, 5, 6)).toDF("col1", "col2", "col0")
df1.unionByName(df2).show

// output:
// +----+----+----+
// |col0|col1|col2|
// +----+----+----+
// |   1|   2|   3|
// |   6|   4|   5|
// +----+----+----+

How was this patch tested?

Added tests in DataFrameSuite.
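For illustration, a test along these lines would exercise the by-name reordering shown above (a sketch based on the example in the PR description, not the actual suite code; checkAnswer is the standard QueryTest helper):

test("SPARK-21043: union by name") {
  val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
  val df2 = Seq((4, 5, 6)).toDF("col1", "col2", "col0")
  // df2's columns are reordered to (col0, col1, col2) before the union
  checkAnswer(df1.unionByName(df2), Row(1, 2, 3) :: Row(6, 4, 5) :: Nil)
}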

SparkQA commented Jun 14, 2017

Test build #78040 has finished for PR 18300 at commit de9af43.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jun 14, 2017

Test build #78041 has finished for PR 18300 at commit 97ea33d.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jun 14, 2017

Test build #78044 has finished for PR 18300 at commit 3b04902.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jun 14, 2017

Test build #78046 has finished for PR 18300 at commit 8783524.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member) left a comment

You can use sparkSession.sessionState.conf.resolver to compare the column names. The goal of this PR is to build a Project by column-name comparison. Could we simplify the implementation by using a for loop with find + resolver?
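A minimal sketch of what the suggested find + resolver loop could look like (illustrative only; the names and surrounding Dataset code here are assumptions, not the final implementation):

val resolver = sparkSession.sessionState.conf.resolver
// Reorder other's output to match this Dataset's column names
val rightProjectList = logicalPlan.output.map { lattr =>
  other.logicalPlan.output.find(rattr => resolver(lattr.name, rattr.name)).getOrElse {
    throw new AnalysisException(s"""Cannot resolve column name "${lattr.name}"""")
  }
}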

* followed by a [[distinct]].
*
* The difference between this function and [[union]] is that this function
* resolves columns by name:
Member:

Nit: by name -> by name (not by position)

@@ -1764,6 +1764,68 @@ class Dataset[T] private[sql](
}

/**
* Returns a new Dataset containing union of rows in this Dataset and another Dataset.
*
* To do a SQL-style set union (that does deduplication of elements), use this function
Member:

Also add a comment: This is different from both UNION ALL and UNION DISTINCT in SQL.
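To illustrate the suggested comment (reusing df1/df2 from the PR description): SQL UNION ALL and UNION DISTINCT both match columns by position, while unionByName matches by name; like union, it keeps duplicate rows unless distinct is applied:

df1.unionByName(df2)            // matches columns by name, keeps duplicate rows
df1.unionByName(df2).distinct() // by-name matching plus UNION DISTINCT-style deduplication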

@maropu (Member Author) commented Jun 14, 2017

ok, I'll update soon.

s"""Cannot resolve column name "${lattr.name}" among """ +
s"""(${rightOutputAttrs.map(_.name).mkString(", ")})""")
}
}
Member Author:

@gatorsmile How about this implementation?

SparkQA commented Jun 15, 2017

Test build #78073 has finished for PR 18300 at commit b0fd2ac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jun 15, 2017

Test build #78076 has finished for PR 18300 at commit 5b41430.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val resolver = sparkSession.sessionState.analyzer.resolver
val rightProjectList = mutable.ArrayBuffer.empty[Attribute]
val rightOutputAttrs = right.output
for (lattr <- left.output) {
Member:

Since left and right always have the same number of columns (after the assertAnalyzed() at L1796), we do not need an ArrayBuffer; we can use map to build the Project for right. For example,

  left.output.map { lattr =>

Member Author:

Aha, ok.

for (lattr <- left.output) {
// To handle duplicate names, we first compute diff between `rightOutputAttrs` and
// already-found attrs in `rightProjectList`.
rightOutputAttrs.diff(rightProjectList).find { rattr => resolver(lattr.name, rattr.name)}
Member:

Inside the map, we can find matching column names by using filter + resolver:

  • If the number of found columns is more than one, throw an error for duplicate names.
  • If the number is zero, throw an error.
  • If the number is one, return the right-side attribute.

@maropu (Member Author) commented Jun 15, 2017

With that logic, it seems we cannot catch duplicate column names on the left side.
How about checking column-name duplication first, then building the right project list?


    // Check column name duplication in both sides first
    val leftOutputAttrs = left.output
    val rightOutputAttrs = right.output
    val caseSensitiveAnalysis = sparkSession.sessionState.conf.caseSensitiveAnalysis
    SchemaUtils.checkColumnNameDuplication(
      leftOutputAttrs.map(_.name), "left column names", caseSensitiveAnalysis)
    SchemaUtils.checkColumnNameDuplication(
      rightOutputAttrs.map(_.name), "right column names", caseSensitiveAnalysis)

    // Then, build a project list for `other` based on `logicalPlan` output names
    val resolver = sparkSession.sessionState.analyzer.resolver
    val rightProjectList = left.output.map { lattr =>
      val foundAttrs = rightOutputAttrs.filter { rattr => resolver(lattr.name, rattr.name) }
      assert(foundAttrs.size <= 1)
      foundAttrs.headOption.getOrElse {
        throw new AnalysisException(s"""Cannot resolve column name "${lattr.name}" among """ +
          s"""(${rightOutputAttrs.map(_.name).mkString(", ")})""")
      }
    }

(I used SchemaUtils here, implemented in #17758:
https://github.com/apache/spark/pull/17758/files#diff-dc9b15e4af298799d788b59d2baf96a9R29)
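Roughly what such a duplication check does (a sketch under the assumption that names are normalized by case sensitivity; not the actual SchemaUtils code from #17758):

def checkColumnNameDuplication(
    names: Seq[String], context: String, caseSensitive: Boolean): Unit = {
  val normalized = if (caseSensitive) names else names.map(_.toLowerCase)
  val dups = normalized.groupBy(identity).collect { case (name, vs) if vs.size > 1 => name }
  if (dups.nonEmpty) {
    // Error message shape mirrors the one asserted in the tests later in this thread
    throw new AnalysisException(s"Found duplicate column(s) in $context: ${dups.mkString(", ")}")
  }
}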

Member:

I am fine with this. If so, we just need to use find to get the first matched column.

Member Author:

okay, I'll update after #17758 is finished.

*/
def unionByName(other: Dataset[T]): Dataset[T] = withSetOperator {
// Creates a `Union` node and resolves it first to reorder output attributes in `other` by name
val unionPlan = sparkSession.sessionState.executePlan(Union(logicalPlan, other.logicalPlan))
@viirya (Member) commented Jun 19, 2017

Is this always resolvable? If the columns don't have the same data type, the Union may not be resolved.
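For example, a pair of columns with no common coercible type (hypothetical snippet; the error text is paraphrased):

val df1 = Seq(Tuple1(Array(1))).toDF("a") // a: array<int>
val df2 = Seq(Tuple1("x")).toDF("a")      // a: string
df1.union(df2)
// throws AnalysisException: Union can only be performed on tables with
// compatible column types (paraphrased)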

@maropu (Member Author) commented Jun 19, 2017

In that case, I think we wouldn't pass unionPlan.assertAnalyzed() below?

Member:

Yeah, so we don't plan to support it?

Member Author:

I think (as you already know) TypeCoercion in the Analyzer resolves compatible types for that case, e.g.: https://github.com/apache/spark/pull/18300/files#diff-5d2ebf4e9ca5a990136b276859769289R122. Did you have other cases in mind?
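A small example of the coercion being referred to (illustrative; schema output trimmed):

val intDf = Seq(1).toDF("a")      // a: int
val doubleDf = Seq(2.5).toDF("a") // a: double
intDf.union(doubleDf).printSchema()
// root
//  |-- a: double (nullable = false)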

@viirya (Member) commented Jun 19, 2017

hmm, I mean a case like this:

val df1 = Seq((1, "2", 3.4)).toDF("a", "b", "c")
val df2 = Seq((6.7, 4, "5")).toDF("c", "a", "b")

And the result should be Row(1, "2", 3.4) :: Row(4, "5", 6.7).

That's what I guess unionByName should do?

Forcibly widening the types looks a bit weird to me, because after the union the schema is different from the original datasets.

Or maybe I'm missing the purpose of this API?

Member Author:

Aha, I see. This is a bug, so I'll look into it. Thanks!
The goal here is just to union by name while keeping the union semantics.

@maropu (Member Author) commented Jun 19, 2017

@viirya How about this?

scala> val df1 = Seq((1, "2", 3.4)).toDF("a", "b", "c")
scala> val df2 = Seq((1, "3", 6.7)).toDF("a", "b", "c")
scala> df1.union(df2).printSchema
root
 |-- a: integer (nullable = false)
 |-- b: string (nullable = true)
 |-- c: double (nullable = false)

scala> df1.union(df2).show
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  2|3.4|
|  1|  3|6.7|
+---+---+---+

scala> val df1 = Seq((1, "2", 3.4)).toDF("a", "b", "c")
scala> val df2 = Seq((6.7, 4, "5")).toDF("c", "a", "b")
scala> df1.unionByName(df2).printSchema
root
 |-- a: integer (nullable = false)
 |-- b: string (nullable = true)
 |-- c: double (nullable = false)

scala> df1.unionByName(df2).show
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  2|3.4|
|  1|  3|6.7|
+---+---+---+

@maropu (Member Author) commented Jun 19, 2017

oh, the current implementation does not work well..., so I need to think about this more.

SparkQA commented Jun 19, 2017

Test build #78252 has finished for PR 18300 at commit ed26881.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val unionDf = df1.unionByName(df2.unionByName(df3))
checkAnswer(unionDf,
Row(1, "a", 3.0) :: Row(2, "bc", 1.2) :: Row(3, "def", 1.2) :: Nil
)
Member:

Hi, @maropu.
To be clearer, could you add more test cases requiring type coercion here?

Member Author:

yea, sure.

SparkQA commented Jun 21, 2017

Test build #78385 has finished for PR 18300 at commit 7e94f4e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

def unionByName(other: Dataset[T]): Dataset[T] = withSetOperator {
// Resolves children first to reorder output attributes in `other` by name
val leftPlan = sparkSession.sessionState.executePlan(logicalPlan)
val rightPlan = sparkSession.sessionState.executePlan(other.logicalPlan)
@viirya (Member) commented Jun 22, 2017

I think a Dataset already guarantees its plan is analyzed and passes the checks? Do we need to resolve the plans again?

Member Author:

yeah, it seems we don't need to. Removed. Thanks!

Member:

logicalPlan and other.logicalPlan are already analyzed plans. It looks like you just access the analyzed plans below, so we can simply use logicalPlan and other.logicalPlan?

// SchemaUtils.checkColumnNameDuplication(
// rightOutputAttrs.map(_.name),
// "in the right attributes",
// sparkSession.sessionState.conf.caseSensitiveAnalysis)
Member:

Why is the above commented out?

Member Author:

The function to check name duplication is discussed in #17758. I'm planning to use that function to check for duplication and then do the union-by-name. See the discussion: #18300 (comment)

SparkQA commented Jun 22, 2017

Test build #78425 has finished for PR 18300 at commit b17a14e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member) commented Jun 22, 2017

retest this please.

SparkQA commented Jun 22, 2017

Test build #78438 has finished for PR 18300 at commit b17a14e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) commented:

Retest this please

SparkQA commented Jul 5, 2017

Test build #79234 has finished for PR 18300 at commit b17a14e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 6, 2017

Test build #79268 has finished for PR 18300 at commit c43a968.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Row(1, "a", 3.0) :: Row(2, "bc", 1.2) :: Row(3, "def", 1.2) :: Nil
)

// Failure cases
Member:

Could we split the test case test("union by name") into multiple ones?

Member Author:

ok

@gatorsmile (Member) commented:

LGTM, waiting for #17758.

SparkQA commented Jul 10, 2017

Test build #79457 has finished for PR 18300 at commit bae9ff0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) left a comment

+1, LGTM, too.

df1.unionByName(df2)
}.getMessage
assert(errMsg.contains("Found duplicate column(s) in the left attributes:"))
df1 = Seq((1, 1)).toDF("c0", "c1")
Member:

Nit: indents.

Member Author:

Updated

SparkQA commented Jul 11, 2017

Test build #79482 has finished for PR 18300 at commit 2c59dfd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member) commented:

Thanks! Merging to master.

@asfgit closed this in a2bec6c on Jul 11, 2017