[SPARK-16289][SQL] Implement posexplode table generating function #13971

dongjoon-hyun · 2016-06-29T10:18:45Z

What changes were proposed in this pull request?

This PR implements the posexplode table-generating function. Currently, the master branch raises the following exception for a map argument, which differs from Hive's behavior.

Before

scala> sql("select posexplode(map('a', 1, 'b', 2))").show
org.apache.spark.sql.AnalysisException: No handler for Hive UDF ... posexplode() takes an array as a parameter; line 1 pos 7

After

scala> sql("select posexplode(map('a', 1, 'b', 2))").show
+---+---+-----+
|pos|key|value|
+---+---+-----+
|  0|  a|    1|
|  1|  b|    2|
+---+---+-----+

For an array argument, the result after this change is the same as before.

scala> sql("select posexplode(array(1, 2, 3))").show
+---+---+
|pos|col|
+---+---+
|  0|  1|
|  1|  2|
|  2|  3|
+---+---+
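
The two output shapes above can be modeled with plain Scala collections. This is only a sketch of the semantics shown in the tables, not Spark's implementation; `PosExplodeSketch`, `posexplodeArray`, and `posexplodeMap` are hypothetical names introduced for illustration, and map entries are passed as an ordered `Seq` so that positions are deterministic.

```scala
object PosExplodeSketch {
  // Array input: one output row per element, shaped as (pos, col).
  def posexplodeArray[A](xs: Seq[A]): Seq[(Int, A)] =
    xs.zipWithIndex.map { case (x, i) => (i, x) }

  // Map input: one output row per entry, shaped as (pos, key, value).
  // Entries come in as an ordered Seq (an assumption of this sketch).
  def posexplodeMap[K, V](entries: Seq[(K, V)]): Seq[(Int, K, V)] =
    entries.zipWithIndex.map { case ((k, v), i) => (i, k, v) }

  def main(args: Array[String]): Unit = {
    posexplodeArray(Seq(1, 2, 3)).foreach(println)
    posexplodeMap(Seq("a" -> 1, "b" -> 2)).foreach(println)
  }
}
```

An empty input yields no rows in either case, which is the edge case requested later in the review.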

How was this patch tested?

Pass the Jenkins tests with newly added testcases.

SparkQA · 2016-06-29T12:26:49Z

Test build #61461 has finished for PR 13971 at commit 584eb9e.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- abstract class ExplodeBase(child: Expression, position: Boolean)
- case class Explode(child: Expression)
- case class PosExplode(child: Expression)

dongjoon-hyun · 2016-06-29T18:50:08Z

cc @rxin and @cloud-fan .

rxin · 2016-06-29T19:03:55Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala

+ */
+// scalastyle:off line.size.limit
+@ExpressionDescription(
+  usage = "_FUNC_(a) - Separates the elements of array a into multiple rows, or the elements of map a into multiple rows and columns.")


an example would be useful.

rxin · 2016-06-29T19:06:27Z

Do we have unit tests for explode expression? (not end-to-end tests)

if not, do you mind looking into it?

dongjoon-hyun · 2016-06-29T19:35:09Z

Hi, @rxin . Thank you for the review. I updated the following:

Add function descriptions for explode and posexplode.
Add examples in comments.
Change indentation to make the fields clearer.

For the explode and posexplode expressions, it seems that we don't have expression-level unit tests because they are generators.

rxin · 2016-06-29T20:01:20Z

Can we create a suite for unit testing generators?

dongjoon-hyun · 2016-06-29T20:10:33Z

If you want that for explode and posexplode only, sure!

In general, a GeneratorTestSuite would need to cover not only explode and posexplode, but also UserDefinedGenerator and HiveGenericUDTF.

rxin · 2016-06-29T20:15:19Z

Yea let's start with that, and we can add more in the future. I'd also add it for the other ones you are implementing, e.g. inline, in those prs.

dongjoon-hyun · 2016-06-29T20:23:14Z

Sure. Thank you for fast feedback! :)

dongjoon-hyun · 2016-06-29T21:05:02Z

Now, GeneratorSuite is added.

rxin · 2016-06-29T21:12:31Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/GeneratorSuite.scala

+  private final val int_array = Seq(1, 2, 3)
+  private final val str_array = Seq("a", "b", "c")
+
+  test("explode") {


can you add a test case for empty input?

dongjoon-hyun · 2016-06-29T21:46:33Z

Now, the following are updated:

Create sql/GeneratorSuite.scala and move the test cases from ColumnExpressionSuite.scala.
Remove redundant Serializable and braces.
Fix a typo in PosExplode example.

SparkQA · 2016-06-29T21:51:42Z

Test build #61492 has finished for PR 13971 at commit fcfccee.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2016-06-29T21:57:58Z

Could you do me a favor?

There is a tiny fix. Could you take a look at #13730 ?

rxin · 2016-06-29T21:58:16Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/GeneratorSuite.scala

+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.unsafe.types.UTF8String
+
+class GeneratorSuite extends SparkFunSuite with ExpressionEvalHelper {


Maybe name this one GeneratorExpressionSuite.

rxin · 2016-06-29T22:02:19Z

This looks pretty good. Let's fix the remaining minor issues and merge it.

dongjoon-hyun · 2016-06-29T22:50:41Z

I tried to add to Python/R.
But, currently R explode is a little misleading. So, I just committed Python first.
For R, I will clean up explode and posexplode later.

rxin · 2016-06-29T23:16:46Z

LGTM pending Jenkins.

dongjoon-hyun · 2016-06-29T23:19:22Z

Oops. I added R, too, with exactly the same semantics as the current explode in R.
Yep, please wait for two hours again.

SparkQA · 2016-06-29T23:27:33Z

Test build #61498 has finished for PR 13971 at commit e255873.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class GeneratorSuite extends SparkFunSuite with ExpressionEvalHelper

dongjoon-hyun · 2016-06-29T23:27:36Z

@rxin Thank you for intensively reviewing this PR.
I will bring my other PRs (adding new SQL functions) to the same level of quality!

SparkQA · 2016-06-29T23:48:18Z

Test build #61500 has finished for PR 13971 at commit 1cf723a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-06-29T23:59:21Z

Test build #61501 has finished for PR 13971 at commit c5dee49.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Explode(child: Expression) extends ExplodeBase(child, position = false)
- case class PosExplode(child: Expression) extends ExplodeBase(child, position = true)
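
The class signatures reported above suggest a shared base class parameterized by a `position` flag. The following is a simplified sketch of that pattern under stated assumptions: it uses `Seq[Any]` in place of Spark's `Expression`, and the names `ExplodeBaseSketch`, `ExplodeSketch`, and `PosExplodeSketch` are hypothetical, chosen to avoid implying this is the merged code.

```scala
object ExplodeHierarchySketch {
  // Shared logic lives in the base class; `position` decides whether
  // each output row is prefixed with the element's index.
  abstract class ExplodeBaseSketch(child: Seq[Any], position: Boolean) {
    def eval(): Seq[Seq[Any]] =
      if (position) child.zipWithIndex.map { case (v, i) => Seq(i, v) }
      else child.map(v => Seq(v))
  }

  // The two concrete generators differ only in the flag they pass up.
  case class ExplodeSketch(child: Seq[Any])
    extends ExplodeBaseSketch(child, position = false)

  case class PosExplodeSketch(child: Seq[Any])
    extends ExplodeBaseSketch(child, position = true)
}
```

Keeping the flag in the base class means both generators share one evaluation path, so a fix to the row-building logic applies to both.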

SparkQA · 2016-06-30T00:19:38Z

Test build #61504 has finished for PR 13971 at commit 153e8eb.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class GeneratorExpressionSuite extends SparkFunSuite with ExpressionEvalHelper
- class GeneratorFunctionSuite extends QueryTest with SharedSQLContext

cloud-fan · 2016-06-30T00:55:55Z

sql/core/src/test/scala/org/apache/spark/sql/GeneratorFunctionSuite.scala

+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.test.SharedSQLContext
+
+class GeneratorFunctionSuite extends QueryTest with SharedSQLContext {


why do we put expression level unit test in sql core module instead of catalyst?

oh sorry I just realized it's not expression level unit test, but end-to-end test

cloud-fan · 2016-06-30T01:03:58Z

LGTM except the unit test, @rxin do we need expression level unit test for it?

rxin · 2016-06-30T01:06:38Z

He added it, didn't he?

SparkQA · 2016-06-30T01:08:31Z

Test build #61509 has finished for PR 13971 at commit 0266052.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2016-06-30T01:16:31Z

...lyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/GeneratorExpressionSuite.scala

+
+class GeneratorExpressionSuite extends SparkFunSuite with ExpressionEvalHelper {
+  private def checkTuple(actual: ExplodeBase, expected: Seq[InternalRow]): Unit = {
+    assert(actual.eval(null).toSeq === expected)


We have checkEvaluation for this purpose, how about using that?

Oh, thank you for the review too, @cloud-fan .
Do we have an example of using checkEvaluation to check a generator, i.e., multiple InternalRows?
I thought checkEvaluation was only for a single row, e.g., values, arrays, maps.

And how do we check the zero-row case? At Line 39,
https://github.com/apache/spark/pull/13971/files#diff-6715134a4e95980149a7600ecb71674cR41

checkEvaluation takes Any as expected result, so I don't think checkEvaluation is only used for a single row.
Have you tried to pass a Seq[Row] to checkEvaluation? If it doesn't work, is it possible to improve checkEvaluation so that it can work for this case? thanks

Sure, @cloud-fan . In fact, I tried everything you suggested in many ways because I trust you. :)

As evidence, let me write the results of the simplest case.

checkEvaluation(Explode(CreateArray(Seq.empty)), Seq.empty[Row])
checkEvaluation(Explode(CreateArray(Seq.empty)), Seq.empty[InternalRow])
checkEvaluation(Explode(CreateArray(Seq.empty)), Seq.empty)

All of the above return the following.

Incorrect evaluation (codegen off): explode(array()), actual: InternalRow;(), expected: []

Here is the body of checkEvaluation. The comments describe the limitations I found.

// 1. This makes `Seq[Any]` into `GenericArrayData` generally.
val catalystValue = CatalystTypeConverters.convertToCatalyst(expected)
checkEvaluationWithoutCodegen(expression, catalystValue, inputRow)
// 2. Here, `val actual = plan(inputRow).get(0, expression.dataType)` is called to try casting to `expression.dataType`.
checkEvaluationWithGeneratedMutableProjection(expression, catalystValue, inputRow)
if (GenerateUnsafeProjection.canSupport(expression.dataType)) {
  // 3. Here, `val unsafeRow = plan(inputRow)` with one row assumption.
  checkEvalutionWithUnsafeProjection(expression, catalystValue, inputRow)
}
// 4. Here, `checkResult` fails at `result == expected`.
checkEvaluationWithOptimization(expression, catalystValue, inputRow)

In short, every step of checkEvaluation seems to depend heavily on the single-row assumption. If we want to change this, we should do it in a separate issue since it's not trivial.

If I haven't misunderstood, it's definitely a valuable issue to investigate further. If we can upgrade checkEvaluation later, we can unify the test cases of this PR with checkEvaluation.

Let's not change it for now. We also don't want test code to become so complicated that it is no longer obvious what's going on.

Yep. Thank you. I'll investigate it later.
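
The distinction discussed in this thread can be illustrated with a small sketch: a check that only looks at one result row cannot express "zero rows" or "several rows", whereas a check that materializes every generated row can. Everything below is hypothetical (`RowGenerator`, `PosExplodeOf`, `firstRowOnly`, `checkRows`) and uses plain Scala collections instead of Spark's `InternalRow` or `ExpressionEvalHelper`.

```scala
object GeneratorCheckSketch {
  // Hypothetical stand-in for a generator: zero or more output rows.
  trait RowGenerator {
    def eval(): Iterator[Seq[Any]]
  }

  // Emits (pos, value) rows, one per input element.
  final case class PosExplodeOf(values: Seq[Any]) extends RowGenerator {
    def eval(): Iterator[Seq[Any]] =
      values.iterator.zipWithIndex.map { case (v, i) => Seq(i, v) }
  }

  // Single-row style: inspects at most the first row, so any
  // remaining rows are never compared.
  def firstRowOnly(g: RowGenerator): Option[Seq[Any]] = {
    val it = g.eval()
    if (it.hasNext) Some(it.next()) else None
  }

  // Multi-row style: materializes all rows and compares them,
  // which also covers the empty-input case.
  def checkRows(g: RowGenerator, expected: Seq[Seq[Any]]): Boolean =
    g.eval().toList == expected
}
```

With this shape, an empty input is simply checked against an empty expected sequence, which is the case the single-row path had trouble expressing.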

SparkQA · 2016-06-30T01:20:35Z

Test build #61512 has finished for PR 13971 at commit 5f3a951.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2016-06-30T10:21:26Z

If I understand correctly, the remaining issue is checkEvaluation.
I'm sorry for this, but I'm still not sure how to use checkEvaluation for the generators.

rxin · 2016-06-30T19:03:04Z

Merging in master. Thanks!

dongjoon-hyun · 2016-06-30T19:09:25Z

Thank you for review and merging, @rxin and @cloud-fan .

cloud-fan · 2016-07-03T18:02:39Z

python/pyspark/sql/functions.py

@@ -1637,6 +1637,27 @@ def explode(col):
    return Column(jc)


+@since(2.1)
+def posexplode(col):


cc @rxin , is posexplode a special hive fallback function that we need to register? other ones don't get registered in functions

For this one, I thought the reason is explode is already registered. posexplode is a pair of that.

yea this one is probably fine.

i wouldn't register the other ones.

Thank you for reconfirming!

This PR implements `posexplode` table generating function. Currently, master branch raises the following exception for `map` argument. It's different from Hive.

**Before**
```scala
scala> sql("select posexplode(map('a', 1, 'b', 2))").show
org.apache.spark.sql.AnalysisException: No handler for Hive UDF ... posexplode() takes an array as a parameter; line 1 pos 7
```

**After**
```scala
scala> sql("select posexplode(map('a', 1, 'b', 2))").show
+---+---+-----+
|pos|key|value|
+---+---+-----+
|  0|  a|    1|
|  1|  b|    2|
+---+---+-----+
```

For `array` argument, `after` is the same with `before`.
```
scala> sql("select posexplode(array(1, 2, 3))").show
+---+---+
|pos|col|
+---+---+
|  0|  1|
|  1|  2|
|  2|  3|
+---+---+
```

Pass the Jenkins tests with newly added testcases.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13971 from dongjoon-hyun/SPARK-16289.

(cherry picked from commit 46395db)
Signed-off-by: Reynold Xin <rxin@databricks.com>


[SPARK-16289][SQL] Implement posexplode table generating function

584eb9e

rxin reviewed Jun 29, 2016

Add function descs, examples in comments, and indentation.

fcfccee

Add GeneratorSuite.

e255873

rxin reviewed Jun 29, 2016

dongjoon-hyun added 2 commits June 29, 2016 14:22

Add empty array cases.

1cf723a

Make sql.GeneratorSuite and move the existing testcases.

c5dee49

rxin reviewed Jun 29, 2016

dongjoon-hyun added 2 commits June 29, 2016 15:05

Rename.

153e8eb

Add posexplode to python

0266052

Add posexplode to R.

5f3a951

cloud-fan reviewed Jun 30, 2016

asfgit closed this in 46395db Jun 30, 2016

cloud-fan reviewed Jul 3, 2016

dongjoon-hyun deleted the SPARK-16289 branch July 20, 2016 07:41
