[SPARK-45051][CONNECT] Use UUIDv7 by default for operation IDs to make operations chronologically sortable#42772
dillitz wants to merge 3 commits into apache:master
I am not sure I understand the use case here. Why exactly do we need them to be sortable? And is this a must-have? One problem I see is that you rely on the client to generate a proper v7 UUID. We do not control the client; it is an open protocol, so a new implementation can just provide a v4 UUID or generate an improper v7. There is also the matter of time drift between client and server: how will this affect the generated UUIDs?
This was a request from @jdesjean. I believe he can give you a bit more background on this. From what I understand, this would allow us to present a more comprehensible history of executed operations to the user.
I agree with you that, in the current state, we cannot rely on the ID being in the v7 format, but that is also not the goal of this PR. We just want to change the default format from v4 to v7, since it has nicer properties while fulfilling the same requirements.
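To illustrate why the format cannot be relied on, here is a minimal sketch of how a receiver could inspect the version of a client-supplied ID. The helper name `operation_id_version` is hypothetical and does not come from this PR; it only uses Python's standard `uuid` module.

```python
import uuid


def operation_id_version(operation_id):
    """Return the UUID version of a client-supplied ID, or None if it cannot be determined.

    Hypothetical helper for illustration only; not part of Spark Connect.
    """
    try:
        # .version is None when the variant bits are not the RFC 4122 variant.
        return uuid.UUID(operation_id).version
    except ValueError:
        # Not parseable as a UUID at all.
        return None
```

A client that follows the new default would report version 7 here, while an older or third-party client could legitimately report version 4, so any ordering logic has to tolerate both.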
When the operation ID is used as a primary key, UUIDv7 gives us the nice property that the order will roughly match the server start time order for the query. While no one should rely on this property exclusively, having the records roughly ordered improves sorting performance.
We maintain backwards compatibility, where older clients can connect to a newer server. These older clients will not provide such UUIDs.
... although the currently existing (Spark 3.4) clients never generate an operationId on the client side, so we can get away with adding an assertion that the client-side ID is UUIDv7 in
We agreed that the benefits of adding this are not big enough, because we cannot rely on the operation ID being UUIDv7 and need to sort by startDate anyway. Closing this PR.
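The conclusion can be sketched as follows: when IDs of mixed versions coexist, ordering by ID alone is not chronological, so the start time has to be the sort key. The `OperationRecord` type and its `start_time_ms` field are hypothetical stand-ins for whatever record holds the operation ID and startDate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperationRecord:
    # Hypothetical record type for illustration; not a Spark Connect class.
    operation_id: str
    start_time_ms: int  # stands in for the startDate mentioned above


def chronological(records):
    """Sort by start time; the ID is only a tiebreaker, never the primary key."""
    return sorted(records, key=lambda r: (r.start_time_ms, r.operation_id))


# A v4 ID from an older client and a v7 ID from a newer one.
older_client = OperationRecord("f47ac10b-58cc-4372-a567-0e02b2c3d479", 1_000)
newer_client = OperationRecord("018a5b2e-7c3f-7abc-8def-0123456789ab", 2_000)
```

Sorting these two records by `operation_id` would put the later operation first, whereas `chronological` keeps them in start order.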
What changes were proposed in this pull request?
This changes the default operation ID format from UUIDv4 to UUIDv7. This is done on the server and on the client side (both Scala and Python), since either side can set this ID first.
Why are the changes needed?
Spark Connect currently uses UUIDv4 for operation IDs. Using UUIDv7 instead allows us to sort operations by ID to obtain a chronological order, while keeping the collision-free properties we require from this ID.
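As a rough illustration of why UUIDv7 sorts chronologically, here is a minimal sketch that builds a UUIDv7-shaped value by hand using the RFC 9562 layout: a 48-bit Unix millisecond timestamp in the most significant bits, followed by the version nibble, the variant bits, and random fill. The function name `uuid7_sketch` is hypothetical and this is not the code from this PR; the actual patch may generate IDs differently.

```python
import os
import time
import uuid


def uuid7_sketch(unix_ms=None):
    """Build a UUIDv7-shaped value (illustrative sketch, not the PR's implementation)."""
    if unix_ms is None:
        unix_ms = time.time_ns() // 1_000_000

    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits; 74 are used
    rand_a = (rand >> 62) & 0xFFF                 # 12 random bits
    rand_b = rand & ((1 << 62) - 1)               # 62 random bits

    value = (unix_ms & ((1 << 48) - 1)) << 80     # bits 127..80: timestamp
    value |= 0x7 << 76                            # bits 79..76: version 7
    value |= rand_a << 64                         # bits 75..64: random
    value |= 0b10 << 62                           # bits 63..62: RFC variant
    value |= rand_b                               # bits 61..0: random
    return uuid.UUID(int=value)
```

Because the timestamp occupies the leading bits, two IDs generated at different milliseconds compare in timestamp order, both as integers and as their canonical hex strings. IDs generated within the same millisecond are ordered only by their random bits, which is one reason the ordering is approximate.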
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing tests.
Was this patch authored or co-authored using generative AI tooling?
No.