[SPARK-10868] monotonicallyIncreasingId() supports offset for indexing #14568
Conversation
Test build #63458 has finished for PR 14568 at commit.

Add a test?
```diff
@@ -40,13 +40,14 @@ import org.apache.spark.sql.types.{DataType, LongType}
    represent the record number within each partition. The assumption is that the data frame has
    less than 1 billion partitions, and each partition has less than 8 billion records.""",
    extended = "> SELECT _FUNC_();\n 0")
-case class MonotonicallyIncreasingID() extends LeafExpression with Nondeterministic {
+case class MonotonicallyIncreasingID(offset: Long = 0) extends LeafExpression
```
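To make the docstring concrete, here is a hypothetical pure-Python model (not Spark source) of how the generated ID composes the partition index (upper 31 bits) with the per-partition record number (lower 33 bits), and how the proposed `offset` might shift the sequence:

```python
# Illustrative model only: Spark's MonotonicallyIncreasingID is a Catalyst
# expression; this sketch just reproduces the documented bit layout.
def monotonic_id(partition_index, row_in_partition, offset=0):
    # partition index in the upper 31 bits, row count in the lower 33 bits
    return (partition_index << 33) + row_in_partition + offset

print([monotonic_id(0, r) for r in range(3)])  # partition 0: 0, 1, 2
print(monotonic_id(1, 0))                      # partition 1 starts at 2**33
print(monotonic_id(0, 0, offset=100))          # offset shifts the sequence
```

This is why IDs are monotonically increasing and unique, but not consecutive across partitions.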
Check if this still works in SQL. We might have to change offset into a literal expression. See HyperLogLogPlusPlus or NTile for examples of this.
```scala
case class HyperLogLogPlusPlus(
    child: Expression,
    relativeSD: Double = 0.05,
    mutableAggBufferOffset: Int = 0,
    inputAggBufferOffset: Int = 0)
```

The change seems to be in line with the `HyperLogLogPlusPlus` constructor.
```diff
  * @group normal_funcs
- * @since 1.6.0
+ * @since 2.0.1
```
Let's make this 2.1.0. Could you also add a little bit of documentation on the offset? One line suffices.

A high-level question: is it a problem when the offset is larger than
So I'm still confused about what offset does, even after reading the code. Can you please write clearer documentation? An example would help. Also, let's update the Python API as well, and add an integration test for SQL.
I wonder why `monotonically_increasing_id(offset: Long): Column` wasn't considered a match.
@rxin
```scala
object MonotonicallyIncreasingID {
  private def parseExpression(expr: Expression): Long = expr match {
    case IntegerLiteral(i) => i.toLong
    case NonNullLiteral(l: Long, LongType) => l.toString.toLong
```
Just return `l`?
@tedyu the Scala code is shaping up nicely. I do have a question regarding usage. How will this be used? The thing is that the
@hvanhovell Is the above sample fine by you?

@hvanhovell
Not super. I will try to explain why.

If you have more than one partition, an issue emerges. Which offset for the next run would you use in this case? Let's explore the options we have: what would you recommend an end user to do?
With that, I got: the next offset could be 3.
The addition of offset support allows users to concatenate rows from different datasets.
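The intended single-partition workflow can be sketched in plain Python (an illustrative model, not Spark code): each batch passes the next free ID as the offset, so consecutive runs stitch into one contiguous range.

```python
# Sketch of the stitching idea with one partition: IDs are just
# offset, offset+1, ..., offset+n-1 for a batch of n rows.
def ids_for_batch(n_rows, offset=0):
    return [offset + r for r in range(n_rows)]

first = ids_for_batch(3)                           # [0, 1, 2]
second = ids_for_batch(3, offset=max(first) + 1)   # [3, 4, 5]
print(first + second)                              # one contiguous range
```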
```diff
@@ -426,6 +426,29 @@ def monotonically_increasing_id():
     return Column(sc._jvm.functions.monotonically_increasing_id())


+@since(2.1)
+def monotonically_increasing_id_w_offset(offset):
```
Could you do this by adding a default parameter to the `monotonically_increasing_id` method and removing this one?
I was planning to do that, but the `@since()` annotation becomes confusing.
Ok, so the maximum of the lower 33 bits would be the starting offset for the next batch. That is not super easy for an end user to do. Let's say a user inserts data from table a into table b; using this would require code like:

```sql
insert into b
select *,
       monotonically_increasing_id((select max(id & 8589934591) from b)) as id
from a
```

Other than that, the change looks pretty good. I left one last comment regarding the Python code.
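The magic constant in that SQL is just the lower-33-bit mask: `8589934591` equals `2**33 - 1`, so `id & 8589934591` strips the partition bits and recovers the per-partition record counter. A quick check in plain Python:

```python
# 8589934591 is the mask for the lower 33 bits of the generated ID.
MASK = (1 << 33) - 1
print(MASK)                   # 8589934591

# Masking an ID from partition 7, row 42, recovers the row counter:
sample_id = (7 << 33) + 42
print(sample_id & MASK)       # 42
```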
@hvanhovell
Is this actually going to solve the use case asked in SPARK-10868? The problem with monotonicallyIncreasingId is that it is actually not consecutive: even if you have 200 records in this batch of updates, the max id won't be 200. It will be some very large number, since the upper few bits are determined by the partition id.
As Herman commented above, obtaining the lower 33 bits of the id column would allow IDs generated from two (or more) executions to form a contiguous range.
Yeah, but you can't do that more than once.
Can you elaborate?

1st run: IDs 1 to 99 are generated.
2nd run: poll the Id column and obtain 99. Specify 100 as the offset for monotonically_increasing_id(). IDs 100 to 199 are generated.
3rd run: poll the Id column and obtain 199. Specify 200 as the offset for monotonically_increasing_id(). IDs 200 to 299 are generated.
That won't work when there is more than one partition.
I don't think so.
@tedyu I closed the original JIRA. Can you close the pull request? Let me know if it is unclear why this doesn't really solve the problem.

@tedyu could you close this one?
What changes were proposed in this pull request?
This PR adds an `offset` parameter to `monotonicallyIncreasingId()`.
How was this patch tested?
Existing tests.