[SPARK-23927][SQL] Add "sequence" expression #21155

Closed
wants to merge 6 commits into from

Contributor

@wajda wajda commented Apr 25, 2018

What changes were proposed in this pull request?

The PR adds the SQL function sequence.
https://issues.apache.org/jira/browse/SPARK-23927

The behavior of the function is based on Presto's.
Ref: https://prestodb.io/docs/current/functions/array.html

  • sequence(start, stop) → array<bigint>
    Generate a sequence of integers from start to stop, incrementing by 1 if start is less than or equal to stop, otherwise -1.
  • sequence(start, stop, step) → array<bigint>
    Generate a sequence of integers from start to stop, incrementing by step.
  • sequence(start_date, stop_date) → array<date>
    Generate a sequence of dates from start_date to stop_date, incrementing by interval 1 day if start_date is less than or equal to stop_date, otherwise - interval 1 day.
  • sequence(start_date, stop_date, step_interval) → array<date>
    Generate a sequence of dates from start_date to stop_date, incrementing by step_interval. The type of step_interval is CalendarInterval.
  • sequence(start_timestamp, stop_timestamp) → array<timestamp>
    Generate a sequence of timestamps from start_timestamp to stop_timestamp, incrementing by interval 1 day if start_timestamp is less than or equal to stop_timestamp, otherwise - interval 1 day.
  • sequence(start_timestamp, stop_timestamp, step_interval) → array<timestamp>
    Generate a sequence of timestamps from start_timestamp to stop_timestamp, incrementing by step_interval. The type of step_interval is CalendarInterval.
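
The integer variants listed above can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: the object name `SequenceSketch` and its helpers are hypothetical, and it only models the documented semantics (inclusive bounds, default step of +1 or -1) on plain `Long` values.

```scala
object SequenceSketch {
  /** Default step: +1 when start <= stop, otherwise -1. */
  def defaultStep(start: Long, stop: Long): Long = if (start <= stop) 1L else -1L

  /** Inclusive sequence from start to stop, advancing by step. */
  def sequence(start: Long, stop: Long, step: Long): Vector[Long] = {
    // The step must point from start toward stop; any step is accepted when start == stop.
    val towardsStop = start == stop || (step != 0 && (stop > start) == (step > 0))
    require(towardsStop, s"Illegal sequence boundaries: $start to $stop by $step")
    // Simplification: assumes (stop - start) does not overflow a Long.
    val len = if (start == stop) 1L else 1L + (stop - start) / step
    require(len <= Int.MaxValue, s"Too long sequence: $len")
    // Each element is computed from its index instead of by repeated addition.
    Vector.tabulate(len.toInt)(i => start + step * i)
  }

  /** Two-argument form: the step is derived from the bounds. */
  def sequence(start: Long, stop: Long): Vector[Long] =
    sequence(start, stop, defaultStep(start, stop))
}
```

For example, `SequenceSketch.sequence(1L, 10L, 3L)` yields 1, 4, 7, 10, while `SequenceSketch.sequence(5L, 1L)` counts down from 5 to 1.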

How was this patch tested?

Added unit tests.

@wajda wajda force-pushed the feature/array-api-sequence branch from 5cc7d87 to 8fb427c Compare April 25, 2018 13:45
@gatorsmile
Member

ok to test

@gatorsmile
Member

cc @ueshin

@SparkQA

SparkQA commented Apr 25, 2018

Test build #89851 has finished for PR 21155 at commit 8fb427c.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class UnresolvedLiteral(child: Expression, literalResolver: DataType => Literal)
  • trait SequenceImpl
  • case class Sequence(left: Expression,
  • class IntegralSequenceImpl[T: ClassTag](implicit num: Integral[T]) extends SequenceImpl
  • abstract class TemporalSequence[T: ClassTag](dt: IntegralType, scale: Long, fromLong: Long => T)

val stepMonths = ctx.freshName("stepMonths")
val stepMicros = ctx.freshName("stepMicros")

lazy val stepScaled = ctx.freshName("stepScaled")
Member

Is there any reason to use lazy here?

Contributor Author

No reason. It's a leftover, sorry.
Fixed.

@SparkQA

SparkQA commented Apr 25, 2018

Test build #89854 has finished for PR 21155 at commit 88fde1c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 25, 2018

Test build #89852 has finished for PR 21155 at commit 0b4d868.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 26, 2018

Test build #89860 has finished for PR 21155 at commit a7c0ccd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@ueshin ueshin left a comment

Could you annotate the PR title as [SPARK-23927][SQL]?
I left some comments, but since this is kind of large, let me have some more time to review thoroughly.
Thanks!

@@ -536,6 +536,15 @@ object TypeCoercion {
case None => c
}

case s @ Sequence(_, _, _, timeZoneId) if !haveSameType(s.children) =>
val types = s.children.map(_.dataType)
findWiderCommonType(types) match {
Member

What if step is interval?

Member

Maybe we should use ExpectsInputTypes or ImplicitCastInputTypes instead of TypeCoercion.

Contributor Author

Good catch, thank you. I've added a unit test and fixed coercion when the step is an interval.
Regarding ImplicitCastInputTypes, I tried it, but it isn't flexible enough for my case. To support actual type coercion for both integral and temporal sequences I had to resort to TypeCoercion. Please check the fixed version. Is there a better way to do it?
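
The coercion rule discussed in this thread (widen start, stop and a non-interval step to a common type, but leave an interval step alone) can be sketched on a toy type model. Everything here is hypothetical and self-contained: the `DataType` objects only borrow Spark's names for readability, and `findWiderCommonType` is a simplified stand-in that handles integral types only.

```scala
object CoercionSketch {
  sealed trait DataType
  case object ByteType extends DataType
  case object ShortType extends DataType
  case object IntegerType extends DataType
  case object LongType extends DataType
  case object DateType extends DataType
  case object CalendarIntervalType extends DataType

  // Widening order for the integral types in this toy model.
  private val rank: Map[DataType, Int] =
    Map(ByteType -> 0, ShortType -> 1, IntegerType -> 2, LongType -> 3)

  /** Widest integral type among the inputs, or None if any input is not integral. */
  def findWiderCommonType(types: Seq[DataType]): Option[DataType] =
    if (types.nonEmpty && types.forall(t => rank.contains(t))) Some(types.maxBy(t => rank(t)))
    else None

  /** Coerce every child except an interval step to the common wider type. */
  def coerce(children: Seq[DataType]): Seq[DataType] = {
    val coercible = children.filter(_ != CalendarIntervalType)
    findWiderCommonType(coercible) match {
      case Some(wider) => children.map(t => if (t == CalendarIntervalType) t else wider)
      case None => children // nothing to widen, e.g. date bounds with an interval step
    }
  }
}
```

With this model, `coerce(Seq(IntegerType, LongType, ShortType))` widens all three children to `LongType`, while date bounds with an interval step pass through unchanged.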

middle: Expression,
right: Expression,
timeZoneId: Option[String] = None)
extends TernaryExpression with TimeZoneAwareExpression {
Member

nit: style

case class Sequence(
    left: Expression,
    middle: Expression,
    right: Expression,
    timeZoneId: Option[String] = None)
  extends TernaryExpression with TimeZoneAwareExpression {
...

And should we use start, stop, step instead of left, middle, right respectively for readability?

Member

How about making step: Option[Expression] so we don't need UnresolvedLiteral, I guess. Similar usage is in ArrayJoin expression.

Contributor Author

Changed to Option[Expression], removed UnresolvedLiteral. Thanks for the suggestion.
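
As a rough illustration of the `Option`-based design agreed on here, the sketch below models an optional step with auxiliary constructors and resolves the default only when it is needed. `Lit` and `SequenceNode` are hypothetical stand-ins, not the `Expression` or `Sequence` classes from this PR.

```scala
object OptionalStepSketch {
  // Hypothetical stand-in for an expression node; it only holds a literal value.
  final case class Lit(value: Long)

  final case class SequenceNode(start: Lit, stop: Lit, stepOpt: Option[Lit]) {
    // Two-argument form: no step given.
    def this(start: Lit, stop: Lit) = this(start, stop, None)
    // Three-argument form: explicit step.
    def this(start: Lit, stop: Lit, step: Lit) = this(start, stop, Some(step))

    /** Explicit step if present, otherwise +1 or -1 depending on the bounds. */
    def step: Long =
      stepOpt.map(_.value).getOrElse(if (start.value <= stop.value) 1L else -1L)
  }
}
```

Keeping the absent step as `None` means no placeholder literal has to be created before the operand types are known, which is the motivation given for dropping UnresolvedLiteral.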


import Sequence._

def this(arg0: Expression, arg1: Expression) =
Member

start, stop?

def this(arg0: Expression, arg1: Expression) =
this(arg0, arg1, Sequence.defaultStepExpression(arg0, arg1), None)

def this(arg0: Expression, arg1: Expression, arg2: Expression) =
Member

start, stop, step?

@@ -1059,3 +1063,316 @@ case class Flatten(child: Expression) extends UnaryExpression {

override def prettyName: String = "flatten"
}

@ExpressionDescription(
usage = """
Member

Maybe we need to fix indent in the usage, arguments, examples.

object Sequence {

private trait SequenceImpl {
def eval(input1: Any, input2: Any, input3: Any): Any
Member

start, stop, step?


def genCode(start: String, stop: String, step: String)
(arr: String, elemType: String)
(implicit ctx: CodegenContext): String
Member

nit: style

def genCode
    (start: String, stop: String, step: String)
    (arr: String, elemType: String)
    (implicit ctx: CodegenContext): String


override def genCode(start: String, stop: String, step: String)
(arr: String, elemType: String)
(implicit ctx: CodegenContext): String = {
Member

ditto.

| $i--;
| $arr[$i] = ($elemType) ($start + $step * $i);
| }
""".stripMargin
Member

nit: indent

}

private class TemporalSequenceImpl[T: ClassTag]
(dt: IntegralType, scale: Long, fromLong: Long => T, timeZone: TimeZone)
Member

nit: indent

@wajda wajda changed the title SPARK-23927: Add "sequence" expression [SPARK-23927][SQL] Add "sequence" expression May 4, 2018
@wajda
Contributor Author

wajda commented May 4, 2018

Thank you for the comments. I'll review and make changes shortly.

@wajda
Contributor Author

wajda commented May 4, 2018

Regarding start, stop, step vs. arg1, arg2, arg3, left, middle, right and so on: originally I used those to avoid hiding variables from the outer scope, but after the final refactoring it doesn't seem to be an issue anymore. I'll rename those, but I would keep the names as they are in the def eval(input1: Any, input2: Any, input3: Any) method implementations. The reason is to distinguish the arguments of type Any from their cast references, as below:

 override def eval(input1: Any, input2: Any, input3: Any): Array[T] = {
      val start = input1.asInstanceOf[T]
      val stop = input2.asInstanceOf[T]
      val step = input3.asInstanceOf[T]

@wajda wajda force-pushed the feature/array-api-sequence branch from a7c0ccd to d15dad8 Compare May 4, 2018 18:20
@SparkQA

SparkQA commented May 4, 2018

Test build #90207 has finished for PR 21155 at commit d15dad8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wajda wajda force-pushed the feature/array-api-sequence branch from d15dad8 to 8786192 Compare May 7, 2018 16:34
override lazy val dataType: ArrayType = ArrayType(start.dataType, containsNull = false)

override def checkInputDataTypes(): TypeCheckResult = {
val Seq(startType, stopType) = Seq(start, stop).map(_.dataType)
Member

How about val StartType = start.dataType? Then, replace stopType with stop.dataType at line 1540.

@SparkQA

SparkQA commented May 7, 2018

Test build #90330 has finished for PR 21155 at commit 383f180.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 7, 2018

Test build #90329 has finished for PR 21155 at commit 8786192.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Seq(elemType, elemType) ++
stepOpt.map(_ => elemType match {
case DateType | TimestampType => CalendarIntervalType
case _: IntegralType => elemType
Member

We might need to override checkInputDataTypes here to check if elemType is one of DateType, TimestampType or IntegralType. And also the validity of type of step expression.

Contributor Author

Hm, isn't that exactly what I had before? Just yesterday I followed @ueshin's suggestion and replaced checkInputDataTypes with the ExpectsInputTypes trait + inputTypes.

Contributor Author

@wajda wajda May 10, 2018

Both implementations basically do the same thing - check the validity of the element and step types. Unless I am missing something. There are unit tests for mixed types cases.

Member

@viirya viirya May 10, 2018

The default checkInputDataTypes just checks whether the data types of the children match inputTypes. It looks like we don't have any limit here for now. So even if an invalid data type is given, checkInputDataTypes won't complain, because inputTypes simply picks up the children's data types.

Member

Seems there are no test cases with invalid data types.

Contributor Author

It would fail with match error, but I see your point.
Reverted to checkInputDataTypes, improved type check conditions, added tests for invalid data types.

Member

I'm wondering what was wrong with checkInputDataTypes, but I can't see the difference because the commit was just overwritten.
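
The kind of explicit check requested in this thread can be sketched as follows. This is a simplified model, not the checkInputDataTypes that ended up in the PR: the type objects are hypothetical stand-ins named after Spark's, and the result is a plain `Either` instead of Spark's TypeCheckResult.

```scala
object TypeCheckSketch {
  sealed trait DataType
  sealed trait IntegralType extends DataType
  case object IntegerType extends IntegralType
  case object LongType extends IntegralType
  case object DateType extends DataType
  case object TimestampType extends DataType
  case object CalendarIntervalType extends DataType
  case object StringType extends DataType

  /** Right(()) when the type combination is supported, Left(message) otherwise. */
  def checkInputDataTypes(
      start: DataType,
      stop: DataType,
      stepOpt: Option[DataType]): Either[String, Unit] = {
    // Element type must be integral, date or timestamp, and both bounds must agree.
    val elemOk = start match {
      case DateType | TimestampType | (_: IntegralType) => start == stop
      case _ => false
    }
    // Temporal sequences step by an interval; integral ones step by the element type.
    val stepOk = (start, stepOpt) match {
      case (_, None) => true
      case (DateType | TimestampType, Some(stepType)) => stepType == CalendarIntervalType
      case (_: IntegralType, Some(stepType)) => stepType == start
      case _ => false
    }
    if (elemOk && stepOk) Right(())
    else Left(s"unsupported argument types: $start, $stop, $stepOpt")
  }
}
```

Unlike a check that derives the expected types from the children themselves, this rejects a string bound or an integer step on a date sequence with a message instead of failing later with a match error.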

@SparkQA

SparkQA commented May 10, 2018

Test build #90428 has finished for PR 21155 at commit 22bde31.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 10, 2018

Test build #90456 has finished for PR 21155 at commit 0a0a1e6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wajda wajda force-pushed the feature/array-api-sequence branch from 0a0a1e6 to 3ef9566 Compare May 21, 2018 12:25
@SparkQA

SparkQA commented May 21, 2018

Test build #90892 has finished for PR 21155 at commit 1b74d6f.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wajda wajda force-pushed the feature/array-api-sequence branch 2 times, most recently from 4b8af57 to 2ccb9bb Compare May 21, 2018 16:59
@SparkQA

SparkQA commented May 21, 2018

Test build #90899 has finished for PR 21155 at commit 4b8af57.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Sequence(

@SparkQA

SparkQA commented May 21, 2018

Test build #90893 has finished for PR 21155 at commit 3ef9566.

  • This patch fails SparkR unit tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
  • case class Sequence(

@SparkQA

SparkQA commented May 21, 2018

Test build #90909 has finished for PR 21155 at commit 2ccb9bb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Sequence(

@wajda
Contributor Author

wajda commented May 23, 2018

Any other comments on this one?

Member

@ueshin ueshin left a comment

I'm sorry for the super delay.
I left some comments. Thanks.


def coercibleChildren: Seq[Expression] = children.filter(_.dataType != CalendarIntervalType)

def castChildrenTo(widerType: DataType): Expression = Sequence(
Member

We can use withNewChildren instead?

Contributor Author

@wajda wajda Jun 22, 2018

What is the purpose? The copy constructor works best in this case, as it's type safe and does exactly what I need.
I tried to use withNewChildren like this:

def castChildrenTo(widerType: DataType): Expression = withNewChildren(Seq(
    Cast(start, widerType),
    Cast(stop, widerType)) ++
    stepOpt.map(step => if (step.dataType != CalendarIntervalType) Cast(step, widerType) else step))

but it doesn't work as expected for some reason. I didn't grasp all the magic it does yet, but do we really need to complicate things? Why is a copy constructor a bad choice in this case?

Member

I just wanted to confirm if we can use withNewChildren or not, and it's okay to use the copy constructor if it's hard to use the method.

| final int $stepSign = $stopMicros > $startMicros ? +1 : -1;
| final long $exclusiveItem = $stopMicros + $stepSign;
| final java.util.TimeZone $genTimeZone =
| java.util.TimeZone.getTimeZone("${timeZone.getID}");
Member

Should we use ctx.addReferenceObj("timeZone", timeZone) instead, to follow the implementations of the datetime expressions? Calling TimeZone.getTimeZone() for every row would be slow.

Contributor Author

Good catch. Done.
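
The performance point raised here can be shown outside of code generation with plain Scala. This is only an analogy for the generated Java, under the assumption that the zone is the same for all rows: the `TimeZone` is looked up once and the same instance is reused for each value. The object and method names are hypothetical.

```scala
import java.util.TimeZone

object TimeZoneReuseSketch {
  /** Offsets from UTC in milliseconds for each instant; the zone is resolved once. */
  def offsets(zoneId: String, instantsMillis: Seq[Long]): Seq[Int] = {
    // Resolved once, outside the per-row work.
    val timeZone = TimeZone.getTimeZone(zoneId)
    // The same instance is reused for every row.
    instantsMillis.map(ms => timeZone.getOffset(ms))
  }
}
```

Passing an already constructed object into the generated code serves the same purpose: the lookup by ID does not have to be repeated per row.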


val len = if (start == stop) 1L else 1L + (stop.toLong - start.toLong) / step.toLong

require(len <= Int.MaxValue, s"Too long sequence: $len. Should be <= ${Int.MaxValue}")
Member

ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH instead of Int.MaxValue?

Contributor Author

done
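
A minimal sketch of the length computation and the tighter bound discussed in this thread. The constant `MaxArrayLength` is an assumption: it is meant to mirror ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH, which to my understanding is Int.MaxValue - 15, but it is defined locally so that the example stays self-contained. The sketch also assumes the bounds were already validated, so step is nonzero whenever start differs from stop.

```scala
object SequenceLengthSketch {
  // Assumed to mirror ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH (Int.MaxValue - 15).
  val MaxArrayLength: Int = Int.MaxValue - 15

  /** Number of elements in the inclusive sequence, or an error if it cannot fit in an array. */
  def sequenceLength(start: Long, stop: Long, step: Long): Int = {
    // Computed as a Long so that the check below can see lengths beyond the Int range.
    val len = if (start == stop) 1L else 1L + (stop - start) / step
    require(len <= MaxArrayLength, s"Too long sequence: $len. Should be <= $MaxArrayLength")
    len.toInt
  }
}
```

Checking against a bound slightly below Int.MaxValue rejects lengths that a JVM may refuse to allocate even though they fit in an Int.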

s"""
|if (!($step > 0 && $start <= $stop||
| $step < 0 && $start >= $stop||
| $step == 0 && $start == $stop)) {
Member

Could you add more parentheses to make the boundaries between && and || clear? Combining && and || without parentheses may cause unexpected behavior.

( ... && ... ) || ( ... && ... ) || ...

Contributor Author

done
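
For reference, && binds tighter than || in both Scala and Java, so the parenthesized and unparenthesized forms of this condition evaluate the same way; the parentheses only make the grouping explicit for the reader. The hypothetical sketch below writes both forms over plain `Int` values and compares them on a small grid.

```scala
object PrecedenceSketch {
  // Without parentheses: relies on && binding tighter than ||.
  def implicitGrouping(start: Int, stop: Int, step: Int): Boolean =
    step > 0 && start <= stop || step < 0 && start >= stop || step == 0 && start == stop

  // With parentheses: the same grouping, stated explicitly.
  def explicitGrouping(start: Int, stop: Int, step: Int): Boolean =
    (step > 0 && start <= stop) ||
      (step < 0 && start >= stop) ||
      (step == 0 && start == stop)

  /** True when both forms agree on every combination in a small grid of values. */
  def agreeOnGrid: Boolean = {
    val values = -2 to 2
    values.forall(a => values.forall(b => values.forall(c =>
      implicitGrouping(a, b, c) == explicitGrouping(a, b, c))))
  }
}
```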

require(
step > num.zero && start <= stop
|| step < num.zero && start >= stop
|| step == 0 && start == stop,
Member

ditto.

} else {
ev.copy(code =
s"""
|boolean ${ev.isNull} = false;
Member

Maybe we don't need this?

Contributor Author

removed

@@ -308,6 +313,292 @@ class CollectionExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper
ArrayMax(Literal.create(Seq(1.123, 0.1234, 1.121), ArrayType(DoubleType))), 1.123)
}

test("Sequence") {
Member

Maybe we need additional words in the test title?

Contributor Author

renamed

}

test("Sequence on DST boundaries") {
val timeZone = TimeZone.getTimeZone("CET")
Member

Could you use a long name? A 3-letter name might not work in some environments.

Contributor Author

renamed
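
Some background on why the ID matters: `java.util.TimeZone.getTimeZone` does not fail on an ID it cannot understand, it returns the GMT zone instead, so a test using an unrecognized ID would silently run in the wrong zone. The sketch below is a hypothetical helper that checks an ID against the IDs the running JVM knows; the zone names in the usage note are illustrative and are not necessarily the ones used in the PR.

```scala
import java.util.TimeZone

object TimeZoneIdSketch {
  /** True when the running JVM lists the ID among its available time zone IDs. */
  def isKnownZone(id: String): Boolean =
    TimeZone.getAvailableIDs().contains(id)

  /** ID of the zone actually returned; "GMT" when the requested ID is not understood. */
  def resolvedId(id: String): String =
    TimeZone.getTimeZone(id).getID
}
```

A region-based ID such as "Europe/Amsterdam" is unambiguous, whereas the JDK documentation describes three-letter IDs as deprecated. Calling `TimeZoneIdSketch.resolvedId("Not/AZone")` returns "GMT" rather than raising an error.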

@HyukjinKwon
Member

ok to test

@SparkQA

SparkQA commented Jun 9, 2018

Test build #91617 has finished for PR 21155 at commit 2ccb9bb.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Sequence(

@wajda wajda force-pushed the feature/array-api-sequence branch from 2ccb9bb to c54857a Compare June 22, 2018 16:21
@SparkQA

SparkQA commented Jun 22, 2018

Test build #92217 has finished for PR 21155 at commit c54857a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@ueshin ueshin left a comment

Could you split the current updates into some PRs to add sequence function and refactor other functions?
We usually do one thing in one PR unless we have a special reason. Thanks!

def castChildrenTo(widerType: DataType): Expression = Sequence(
Cast(start, widerType),
Cast(stop, widerType),
stepOpt.map(step => if (step.dataType != CalendarIntervalType) Cast(step, widerType) else step),
Member

We can try to cast to CalendarIntervalType if widerType is DateType or TimestampType and step.dataType is StringType? We might want to use StringType for the interval.

Contributor Author

@wajda wajda Jun 25, 2018

How can it be useful? Do you have a use case example?
When using SQL notation, an interval step is automatically parsed as the correct type, so I never get it as a String in this method. Unless I am mistaken, the only way for the step to be of String type is if the Sequence expression is created programmatically with a string-typed expression passed as the 3rd argument. But in this case having strong typing is rather a good thing, isn't it?

Member

Makes sense. Thanks!

| "Illegal sequence boundaries: " + $start + " to " + $stop + " by " + $step);
|}
|long $longLen = $stop == $start ? 1L : 1L + ((long) $stop - $start) / $step;
|if ($longLen > Integer.MAX_VALUE) {
Member

MAX_ROUNDED_ARRAY_LENGTH instead of Integer.MAX_VALUE?

@wajda wajda force-pushed the feature/array-api-sequence branch from c54857a to def678b Compare June 25, 2018 08:11
@SparkQA

SparkQA commented Jun 25, 2018

Test build #92295 has finished for PR 21155 at commit def678b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wajda wajda force-pushed the feature/array-api-sequence branch from def678b to 140225d Compare June 26, 2018 06:38
@SparkQA

SparkQA commented Jun 26, 2018

Test build #92326 has finished for PR 21155 at commit 140225d.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wajda
Contributor Author

wajda commented Jun 26, 2018

please retest

@ueshin
Member

ueshin commented Jun 26, 2018

Jenkins, retest this please.

Member

@ueshin ueshin left a comment

LGTM except for one comment.

require(
(step > num.zero && start <= stop)
|| (step < num.zero && start >= stop)
|| (step == 0 && start == stop),
Member

num.zero instead of 0?

Contributor Author

Fixed. Although I wonder what difference it makes besides readability. Doesn't equals(0) work consistently on every numeric type in Scala? Or is it related to autoboxing somehow? It's not very clear to me in this context. Thanks.
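
One way to read the suggestion: `num.zero` is the zero of the `Integral[T]` type class itself, so every comparison stays within T and goes through the same `Integral` instance, without depending on how a generic value compares to an `Int` literal. The sketch below is a hypothetical, standalone version of the bounds check written that way; it uses the `Ordering` methods (`gt`, `lt`, `lteq`, `gteq`, `equiv`) that `Integral` inherits rather than the code from this PR.

```scala
object IntegralZeroSketch {
  /** Bounds check written once for any Integral[T], comparing only against num.zero. */
  def validBounds[T](start: T, stop: T, step: T)(implicit num: Integral[T]): Boolean = {
    val zero = num.zero
    (num.gt(step, zero) && num.lteq(start, stop)) ||
      (num.lt(step, zero) && num.gteq(start, stop)) ||
      (num.equiv(step, zero) && num.equiv(start, stop))
  }
}
```

The same function then works for `Byte`, `Short`, `Int` and `Long` alike, for example `IntegralZeroSketch.validBounds(5L, 1L, -1L)`.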

@SparkQA

SparkQA commented Jun 26, 2018

Test build #92331 has finished for PR 21155 at commit 140225d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 26, 2018

Test build #92334 has finished for PR 21155 at commit d6de8cb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin
Member

ueshin commented Jun 27, 2018

Thanks! merging to master.

@asfgit asfgit closed this in 2669b4d Jun 27, 2018
asfgit pushed a commit that referenced this pull request Jun 27, 2018
## What changes were proposed in this pull request?

This PR is a follow-up to #21155.
#21155 removed an import that was unnecessary at that time, but the import became necessary because of another PR.

## How was this patch tested?

Existing tests.

Author: Takuya UESHIN <ueshin@databricks.com>

Closes #21646 from ueshin/issues/SPARK-23927/fup1.
7 participants