[SPARK-10868] monotonicallyIncreasingId() supports offset for indexing #14568

Closed · wants to merge 30 commits

Conversation

@tedyu (Contributor) commented Aug 9, 2016

What changes were proposed in this pull request?

This PR adds an offset parameter to monotonicallyIncreasingId().

How was this patch tested?

Existing tests.

@tedyu changed the title from "SPARK-10868 monotonicallyIncreasingId() supports offset for indexing" to "[SPARK-10868] monotonicallyIncreasingId() supports offset for indexing" on Aug 9, 2016
@SparkQA commented Aug 9, 2016

Test build #63458 has finished for PR 14568 at commit 4a4e247.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class MonotonicallyIncreasingID(offset: Long = 0) extends LeafExpression with Nondeterministic

@hvanhovell (Contributor) commented

Add a test?

@@ -40,13 +40,14 @@ import org.apache.spark.sql.types.{DataType, LongType}
     represent the record number within each partition. The assumption is that the data frame has
     less than 1 billion partitions, and each partition has less than 8 billion records.""",
   extended = "> SELECT _FUNC_();\n 0")
-case class MonotonicallyIncreasingID() extends LeafExpression with Nondeterministic {
+case class MonotonicallyIncreasingID(offset: Long = 0) extends LeafExpression
Contributor commented on the diff:

Check if this still works in SQL. We might have to change offset into a literal expression. See HyperLogLogPlusPlus or NTile for examples of this.

Contributor Author replied:

case class HyperLogLogPlusPlus(
    child: Expression,
    relativeSD: Double = 0.05,
    mutableAggBufferOffset: Int = 0,
    inputAggBufferOffset: Int = 0)

The change seems to be in line with the HyperLogLogPlusPlus constructor.


@SparkQA commented Aug 9, 2016

Test build #63462 has finished for PR 14568 at commit ea77d78.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

  * @group normal_funcs
- * @since 1.6.0
+ * @since 2.0.1
Contributor commented on the diff:

Let's make this 2.1.0. Could you also add a little bit of documentation on the offset? One line suffices.

@hvanhovell (Contributor) commented

A high-level question: is it a problem when the offset is larger than 1 << 33? I can't really think of one.

@rxin (Contributor) commented Aug 9, 2016

So I'm still confused about what offset does, even after reading the code. Can you please write clearer documentation? An example would help.

Also, let's update the Python API and add an integration test for SQL.

@SparkQA commented Aug 9, 2016

Test build #63459 has finished for PR 14568 at commit 46ea8a4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class MonotonicallyIncreasingID(offset: Long = 0) extends LeafExpression

@SparkQA commented Aug 9, 2016

Test build #63472 has finished for PR 14568 at commit 8198d9c.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 9, 2016

Test build #63466 has finished for PR 14568 at commit 81e342d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 9, 2016

Test build #63469 has finished for PR 14568 at commit f78d6aa.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 9, 2016

Test build #63470 has finished for PR 14568 at commit 1107182.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 9, 2016

Test build #63473 has finished for PR 14568 at commit b713f23.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 10, 2016

Test build #63475 has finished for PR 14568 at commit 4d0cd3a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tedyu (Contributor Author) commented Aug 10, 2016

[info] - monotonically_increasing_id_with_offset *** FAILED *** (14 milliseconds)
[info]   org.apache.spark.sql.AnalysisException: Invalid number of arguments for function monotonically_increasing_id; line 1 pos 0
[info]   at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$$anonfun$5.apply(FunctionRegistry.scala:457)
[info]   at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$$anonfun$5.apply(FunctionRegistry.scala:443)

I wonder why 'monotonically_increasing_id(offset: Long): Column' wasn't considered a match.

@hvanhovell (Contributor) commented

monotonically_increasing_id(5) gets parsed into the expression UnresolvedFunction("monotonically_increasing_id", Seq(Literal(5))). Note that the offset is turned into an Expression as well.

The FunctionRegistry can only resolve Expression-based constructors, which MonotonicallyIncreasingID does not provide. Solution: add an Expression-based constructor, as sketched below.
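For illustration, a minimal sketch of such a constructor (it assumes the companion-object parseExpression helper quoted later in this thread; the expression's abstract members are omitted):

import org.apache.spark.sql.catalyst.expressions.{Expression, LeafExpression, Nondeterministic}

case class MonotonicallyIncreasingID(offset: Long = 0) extends LeafExpression
  with Nondeterministic {

  // Expression-based auxiliary constructor: lets the FunctionRegistry
  // resolve the SQL call monotonically_increasing_id(5), whose argument
  // arrives as Literal(5) rather than as a Long.
  def this(offset: Expression) =
    this(MonotonicallyIncreasingID.parseExpression(offset))

  // ... eval/doGenCode and the rest of the expression are unchanged ...
}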

@SparkQA commented Aug 10, 2016

Test build #63537 has finished for PR 14568 at commit 58e044f.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 15, 2016

Test build #63766 has finished for PR 14568 at commit 57ab6fa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tedyu (Contributor Author) commented Aug 15, 2016

@rxin Can you take a look at the Python API one more time?

object MonotonicallyIncreasingID {
  private def parseExpression(expr: Expression): Long = expr match {
    case IntegerLiteral(i) => i.toLong
    case NonNullLiteral(l: Long, LongType) => l.toString.toLong
Contributor commented on the diff:

just return l?
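For illustration, a sketch of the helper with that simplification applied (the fallback error branch is an assumption, not part of the diff):

import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.catalyst.expressions.{Expression, IntegerLiteral, NonNullLiteral}
import org.apache.spark.sql.types.LongType

object MonotonicallyIncreasingID {
  private def parseExpression(expr: Expression): Long = expr match {
    case IntegerLiteral(i) => i.toLong
    // The literal already holds a Long, so return it directly instead of
    // round-tripping through a String.
    case NonNullLiteral(l: Long, LongType) => l
    case other =>
      throw new AnalysisException(
        s"The offset of monotonically_increasing_id must be an integer literal, found $other.")
  }
}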

@hvanhovell (Contributor) commented

@tedyu the Scala code is shaping up nicely.

I do have a question regarding usage. How will this be used? The thing is that monotonically_increasing_id returns an id based on the number of rows in a partition and the partition id. If I have read the JIRA correctly, someone wants to use this for id generation across multiple batches. How would a user be able to provide a sensible offset? Could you create an example?

@SparkQA commented Aug 15, 2016

Test build #63791 has finished for PR 14568 at commit 5bdb3ab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tedyu (Contributor Author) commented Aug 16, 2016

@hvanhovell
As Martin said in the JIRA:

  • Add the index column to A' - this time starting at 200, as there are already entries with ids from 0 to 199 (here, monotonicallyIncreasingId(200) is required).
  • Union A and A'.

Does the above example work for you? A sketch of that flow follows.
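For illustration, a sketch of that flow using the offset parameter proposed in this PR (dfA and dfAPrime are hypothetical DataFrames with matching schemas):

import org.apache.spark.sql.functions.monotonically_increasing_id

// dfA already carries ids 0..199 in its "id" column; index the new
// batch starting at 200 (per this PR's proposed overload), then append.
val indexed  = dfAPrime.withColumn("id", monotonically_increasing_id(200))
val combined = dfA.union(indexed)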

@tedyu (Contributor Author) commented Aug 16, 2016

@hvanhovell
What do you think of the above reply?

@hvanhovell (Contributor) commented

Not super. I will try to explain why.

monotonically_increasing_id constructs an id based on an increasing record number (the lower 33 bits) and the partition id (the upper 31 bits). The example given is not very realistic because it only seems to contain one partition (id=0). In that case the maximum id of the previous run would be usable as the seed.
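For illustration, a minimal sketch of the layout just described (not the actual Spark source):

// Partition id in the upper 31 bits, per-partition record number in the
// lower 33 bits; the offset proposed here would simply be added on top.
def makeId(partitionId: Int, recordNumber: Long, offset: Long = 0L): Long =
  (partitionId.toLong << 33) + recordNumber + offset

makeId(0, 4L)  // 4
makeId(1, 0L)  // 8589934592, matching the jump in the output below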

If you have more than one partition, an issue emerges. For instance, range(0, 9, 1, 3).select(monotonically_increasing_id()).show yields the following ids:

+-----------------------------+
|monotonically_increasing_id()|
+-----------------------------+
|                            0|
|                            1|
|                            2|
|                            3|
|                            4|
|                   8589934592|
|                   8589934593|
|                   8589934594|
|                   8589934595|
|                   8589934596|
+-----------------------------+

Which offset would you use for the next run in this case? Let's explore the options we have:

  • Taking the maximum id as the next offset. This will probably lead to collisions between ids.
  • Calculating the maximum of the lower 33 bits. This relies heavily on the exact implementation of monotonically_increasing_id.

What would you recommend an end user to do?

@tedyu (Contributor Author) commented Aug 16, 2016

With:
spark.range(0, 9, 1, 3).select(monotonically_increasing_id()).show

I got:

+-----------------------------+
|monotonically_increasing_id()|
+-----------------------------+
|                            0|
|                            1|
|                            2|
|                   8589934592|
|                   8589934593|
|                   8589934594|
|                  17179869184|
|                  17179869185|
|                  17179869186|
+-----------------------------+

The next offset could be 3: the maximum of the lower 33 bits across all partitions is 2, so the next run would start at 3.

@tedyu (Contributor Author) commented Aug 17, 2016

The addition of offset support allows users to concatenate rows from different datasets without their generated ids colliding.

@@ -426,6 +426,29 @@ def monotonically_increasing_id():
     return Column(sc._jvm.functions.monotonically_increasing_id())


+@since(2.1)
+def monotonically_increasing_id_w_offset(offset):
Contributor commented on the diff:

Could you do this by adding a default parameter to the monotonically_increasing_id method and removing this one?

Contributor Author replied:

I was planning to do that, but the @since() annotation becomes confusing.
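For illustration, a Python sketch of the combined signature being suggested (the offset-taking JVM overload is the one this PR proposes, and a versionchanged note is one possible way around the @since concern):

from pyspark import SparkContext, since
from pyspark.sql.column import Column


@since(1.6)
def monotonically_increasing_id(offset=0):
    """Returns monotonically increasing 64-bit integers.

    .. versionchanged:: 2.1 added the optional ``offset`` parameter.
    """
    sc = SparkContext._active_spark_context
    if offset == 0:
        return Column(sc._jvm.functions.monotonically_increasing_id())
    # Assumes the Scala-side overload proposed by this PR.
    return Column(sc._jvm.functions.monotonically_increasing_id(offset))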

@hvanhovell (Contributor) commented

Ok, so the maximum of the lower 33 bits would be the starting offset for the next batch. That is not super easy for an end user to do. Let's say a user inserts data from table a into table b; using this would require code like this:

insert into b
select *,
       monotonically_increasing_id((select max(id & 8589934591) from b)) as id
from   a

Other than that, the change looks pretty good. I left one last comment regarding the Python code.

@SparkQA commented Aug 17, 2016

Test build #63929 has finished for PR 14568 at commit 9b34358.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 17, 2016

Test build #63928 has finished for PR 14568 at commit 0167b02.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tedyu (Contributor Author) commented Aug 17, 2016

@hvanhovell
Let me know if there is more I should do for this enhancement.
Thanks

@rxin (Contributor) commented Aug 17, 2016

Is this actually going to solve the use case asked about in SPARK-10868?

The problem with monotonicallyIncreasingId is that it is actually not consecutive -- e.g. even if you have 200 records in this batch of updates, the max id won't be 200. It will be some very large number, since the upper few bits are determined by the partition id (for example, with 4 partitions of 50 rows each, the max id is (3L << 33) + 49 = 25769803825).

@tedyu (Contributor Author) commented Aug 17, 2016

As Herman commented above, obtaining the lower 33 bits of the id column would allow ids generated from two (or more) executions to form a contiguous range.

@rxin (Contributor) commented Aug 17, 2016

Yea but you can't do that more than once.

@tedyu (Contributor Author) commented Aug 17, 2016

Can you elaborate?

1st run: ids 1 to 99 are generated.

2nd run: poll the id column and obtain 99. Specify 100 as the offset for monotonically_increasing_id(). Ids 100 to 199 are generated.

3rd run: poll the id column and obtain 199. Specify 200 as the offset for monotonically_increasing_id(). Ids 200 to 299 are generated.

@rxin (Contributor) commented Aug 17, 2016

That won't work when there is more than one partition.

@tedyu (Contributor Author) commented Aug 17, 2016

I don't think so.
Using (id & 8589934591) would obtain the numbers 99 and 199 in my example.

@rxin (Contributor) commented Aug 17, 2016

scala> spark.range(10).selectExpr("monotonically_increasing_id() & 8589934591L").show()
+--------------------------------------------+
|(monotonically_increasing_id() & 8589934591)|
+--------------------------------------------+
|                                           0|
|                                           0|
|                                           0|
|                                           0|
|                                           1|
|                                           0|
|                                           0|
|                                           0|
|                                           0|
|                                           1|
+--------------------------------------------+

@hvanhovell (Contributor) commented

The thing is that this (id & 8589934591L) is difficult/strange for an end user to work with; they should not really have to think about such a detail.

@rxin (Contributor) commented Aug 17, 2016

@tedyu I closed the original JIRA. Can you close the pull request?

Let me know if it is unclear why this doesn't really solve the problem.

@hvanhovell (Contributor) commented

@tedyu could you close this one?

@tedyu closed this Aug 31, 2016