Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-16289][SQL] Implement posexplode table generating function #13971

Closed
wants to merge 8 commits into from
Closed

[SPARK-16289][SQL] Implement posexplode table generating function #13971

wants to merge 8 commits into from

Conversation

dongjoon-hyun
Copy link
Member

What changes were proposed in this pull request?

This PR implements posexplode table generating function. Currently, master branch raises the following exception for map argument. It's different from Hive.

Before

scala> sql("select posexplode(map('a', 1, 'b', 2))").show
org.apache.spark.sql.AnalysisException: No handler for Hive UDF ... posexplode() takes an array as a parameter; line 1 pos 7

After

scala> sql("select posexplode(map('a', 1, 'b', 2))").show
+---+---+-----+
|pos|key|value|
+---+---+-----+
|  0|  a|    1|
|  1|  b|    2|
+---+---+-----+

For array argument, after is the same with before.

scala> sql("select posexplode(array(1, 2, 3))").show
+---+---+
|pos|col|
+---+---+
|  0|  1|
|  1|  2|
|  2|  3|
+---+---+

How was this patch tested?

Pass the Jenkins tests with newly added testcases.

@SparkQA
Copy link

SparkQA commented Jun 29, 2016

Test build #61461 has finished for PR 13971 at commit 584eb9e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class ExplodeBase(child: Expression, position: Boolean)
    • case class Explode(child: Expression)
    • case class PosExplode(child: Expression)

@dongjoon-hyun
Copy link
Member Author

cc @rxin and @cloud-fan .

*/
// scalastyle:off line.size.limit
@ExpressionDescription(
usage = "_FUNC_(a) - Separates the elements of array a into multiple rows, or the elements of map a into multiple rows and columns.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

an example would be useful.

@rxin
Copy link
Contributor

rxin commented Jun 29, 2016

Do we have unit tests for explode expression? (not end-to-end tests)

if not, do you mind looking into it?

@dongjoon-hyun
Copy link
Member Author

Hi, @rxin . Thank you for review. I updated the followings.

  • Add function descriptions for explode and posexplode.
  • Add examples in comments.
  • Change indentation to make the fields clearer.

For the explode and posexplode expressions, it seems that we don't have unit tests in expression level because they are generators.

@rxin
Copy link
Contributor

rxin commented Jun 29, 2016

Can we create a suite for unit testing generators?

@dongjoon-hyun
Copy link
Member Author

dongjoon-hyun commented Jun 29, 2016

If you want that for explode and posexplode only, sure!

In general, GeneratorTestSuite seems to have not only explode and posexplode, but also UserDefinedGenerator and HiveGenericUDTF.

@rxin
Copy link
Contributor

rxin commented Jun 29, 2016

Yea let's start with that, and we can add more in the future. I'd also add it for the other ones you are implementing, e.g. inline, in those prs.

@dongjoon-hyun
Copy link
Member Author

Sure. Thank you for fast feedback! :)

@dongjoon-hyun
Copy link
Member Author

Now, GeneratorSuite is added.

private final val int_array = Seq(1, 2, 3)
private final val str_array = Seq("a", "b", "c")

test("explode") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a test case for empty input?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep. Done.

@dongjoon-hyun
Copy link
Member Author

Now, the followings are updated.

  • Make sql/GeneratorSuite.scala and moves the testcases from ColumnExpressionSuite.scala.
  • Remove redundant Serializable and braces.
  • Fix a typo in PosExplode example.

@SparkQA
Copy link

SparkQA commented Jun 29, 2016

Test build #61492 has finished for PR 13971 at commit fcfccee.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

Could you do me a favor?

There is a tiny fix. Could you take a look at #13730 ?

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.unsafe.types.UTF8String

class GeneratorSuite extends SparkFunSuite with ExpressionEvalHelper {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thsi one maybe GeneratorExpressionSuite

@rxin
Copy link
Contributor

rxin commented Jun 29, 2016

This looks pretty good. Let's fix the remaining minor issues and merge it.

@dongjoon-hyun
Copy link
Member Author

dongjoon-hyun commented Jun 29, 2016

I tried to add to Python/R.
But, currently R explode is a little misleading. So, I just committed Python first.
For R, I will clean up explode and posexplode later.

@rxin
Copy link
Contributor

rxin commented Jun 29, 2016

LGTM pending Jenkins.

@dongjoon-hyun
Copy link
Member Author

Oops. I added R, too. The exactly same semantic of current explode in R.
Yep, please wait for two hours again.

@SparkQA
Copy link

SparkQA commented Jun 29, 2016

Test build #61498 has finished for PR 13971 at commit e255873.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class GeneratorSuite extends SparkFunSuite with ExpressionEvalHelper

@dongjoon-hyun
Copy link
Member Author

@rxin Thank you for intensive reviewing this PR.
I will improve another PRs (adding new SQL functions) with the same level of quality!

@SparkQA
Copy link

SparkQA commented Jun 29, 2016

Test build #61500 has finished for PR 13971 at commit 1cf723a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 29, 2016

Test build #61501 has finished for PR 13971 at commit c5dee49.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Explode(child: Expression) extends ExplodeBase(child, position = false)
    • case class PosExplode(child: Expression) extends ExplodeBase(child, position = true)

@SparkQA
Copy link

SparkQA commented Jun 30, 2016

Test build #61504 has finished for PR 13971 at commit 153e8eb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class GeneratorExpressionSuite extends SparkFunSuite with ExpressionEvalHelper
    • class GeneratorFunctionSuite extends QueryTest with SharedSQLContext

import org.apache.spark.sql.functions._
import org.apache.spark.sql.test.SharedSQLContext

class GeneratorFunctionSuite extends QueryTest with SharedSQLContext {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we put expression level unit test in sql core module instead of catalyst?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh sorry I just realized it's not expression level unit test, but end-to-end test

@cloud-fan
Copy link
Contributor

LGTM except the unit test, @rxin do we need expression level unit test for it?

@rxin
Copy link
Contributor

rxin commented Jun 30, 2016

He added it, didn't he?

@SparkQA
Copy link

SparkQA commented Jun 30, 2016

Test build #61509 has finished for PR 13971 at commit 0266052.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


class GeneratorExpressionSuite extends SparkFunSuite with ExpressionEvalHelper {
private def checkTuple(actual: ExplodeBase, expected: Seq[InternalRow]): Unit = {
assert(actual.eval(null).toSeq === expected)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have checkEvaluation for this purpose, how about using that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, thank you for review, @cloud-fan , too.
Do we have an example of checkEvaluation to check the generator, multiple InternalRows?
I just thought checkEvaluation is just for a single row, e.g., values, arrays, maps.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checkEvaluation takes Any as expected result, so I don't think checkEvaluation is only used for a single row.
Have you tried to pass a Seq[Row] to checkEvaluation? If it doesn't work, is it possible to improve checkEvaluation so that it can work for this case? thanks

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. @cloud-fan . In fact, I try everything you told me in many ways because I trust you. :)

Copy link
Member Author

@dongjoon-hyun dongjoon-hyun Jun 30, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a evidence, let me write the results of the most simplest case.

checkEvaluation(Explode(CreateArray(Seq.empty)), Seq.empty[Row])
checkEvaluation(Explode(CreateArray(Seq.empty)), Seq.empty[InternalRow])
checkEvaluation(Explode(CreateArray(Seq.empty)), Seq.empty)

All the above returns the followings.

Incorrect evaluation (codegen off): explode(array()), actual: InternalRow;(), expected: []

Here is the body of checkEvaluation. The following comments are the limitation I found.

// 1. This makes `Seq[Any]` into `GenericArrayData` generally.
val catalystValue = CatalystTypeConverters.convertToCatalyst(expected)
checkEvaluationWithoutCodegen(expression, catalystValue, inputRow)

// 2. Here, `val actual = plan(inputRow).get(0, expression.dataType)` is called to try casting to `expression.dataType`.
checkEvaluationWithGeneratedMutableProjection(expression, catalystValue, inputRow)

if (GenerateUnsafeProjection.canSupport(expression.dataType)) {
      // 3. Here, `val unsafeRow = plan(inputRow)` with one row assumption.
      checkEvalutionWithUnsafeProjection(expression, catalystValue, inputRow)
}

// 4. Here, `checkResult` fails at `result == expected`.
checkEvaluationWithOptimization(expression, catalystValue, inputRow)

In short, every steps of the checkEvaluation seem to depend on the single row assumption heavily. If we wan to change this. We should do in a separate issue since it's not trivial.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I didn't misunderstand, it's definitely valuable issue to investigate more. If we can upgrade checkEvaluation later, we can unify the testcases of this PR with checkEvaluation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not change it for now. We also don't want test code to become so complicated that is is no longer obvious what's going on.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep. Thank you. I'll investigate it later.

@SparkQA
Copy link

SparkQA commented Jun 30, 2016

Test build #61512 has finished for PR 13971 at commit 5f3a951.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

If I understand correctly, the remaining issue is checkEvaluation.
I'm sorry for this, but I'm still not sure how to use checkEvaluation for the generators.

@rxin
Copy link
Contributor

rxin commented Jun 30, 2016

Merging in master. Thanks!

@asfgit asfgit closed this in 46395db Jun 30, 2016
@dongjoon-hyun
Copy link
Member Author

Thank you for review and merging, @rxin and @cloud-fan .

@@ -1637,6 +1637,27 @@ def explode(col):
return Column(jc)


@since(2.1)
def posexplode(col):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @rxin , is posexplode a special hive fallback function that we need to register? other ones don't get registered in functions

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this one, I thought the reason is explode is already registered. posexplode is a pair of that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea this one is probably fine.

i wouldn't register the other ones.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for reconfirming!

asfgit pushed a commit that referenced this pull request Jul 8, 2016
This PR implements `posexplode` table generating function. Currently, master branch raises the following exception for `map` argument. It's different from Hive.

**Before**
```scala
scala> sql("select posexplode(map('a', 1, 'b', 2))").show
org.apache.spark.sql.AnalysisException: No handler for Hive UDF ... posexplode() takes an array as a parameter; line 1 pos 7
```

**After**
```scala
scala> sql("select posexplode(map('a', 1, 'b', 2))").show
+---+---+-----+
|pos|key|value|
+---+---+-----+
|  0|  a|    1|
|  1|  b|    2|
+---+---+-----+
```

For `array` argument, `after` is the same with `before`.
```
scala> sql("select posexplode(array(1, 2, 3))").show
+---+---+
|pos|col|
+---+---+
|  0|  1|
|  1|  2|
|  2|  3|
+---+---+
```

Pass the Jenkins tests with newly added testcases.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13971 from dongjoon-hyun/SPARK-16289.

(cherry picked from commit 46395db)
Signed-off-by: Reynold Xin <rxin@databricks.com>
@dongjoon-hyun dongjoon-hyun deleted the SPARK-16289 branch July 20, 2016 07:41
dongjoon-hyun pushed a commit that referenced this pull request Apr 18, 2024
### What changes were proposed in this pull request?
The pr aims to upgrade `netty` from `4.1.108.Final` to `4.1.109.Final`.

### Why are the changes needed?
https://netty.io/news/2024/04/15/4-1-109-Final.html
This version has brought some bug fixes and improvements, such as:
- Fix DefaultChannelId#asLongText NPE ([#13971](netty/netty#13971))
- Rewrite ZstdDecoder to remove the need of allocate a huge byte[] internally ([#13928](netty/netty#13928))
- Don't send a RST frame when closing the stream in a write future while processing inbound frames ([#13973](netty/netty#13973))

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46112 from panbingkun/netty_for_spark4.

Authored-by: panbingkun <panbingkun@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants