[SPARK-45051][CONNECT] Use UUIDv7 by default for operation IDs to make operations chronologically sortable #42772

Closed
dillitz wants to merge 3 commits into apache:master from dillitz:SPARK-45051-uuid7-operation-id

@dillitz
Contributor

@dillitz dillitz commented Sep 1, 2023

What changes were proposed in this pull request?

This changes the default operation ID format from UUIDv4 to UUIDv7. The change is made on the server and on the client side (both Scala and Python), since either side can be the first to set this ID.

Why are the changes needed?

Spark Connect currently uses UUIDv4 for operation IDs. Using UUIDv7 instead allows us to sort operations by ID to obtain a chronological order, while keeping the collision-free properties we require from this ID.
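To illustrate why a UUIDv7 sorts chronologically, here is a minimal Python sketch of the layout: a 48-bit Unix millisecond timestamp in the most significant bits, followed by the version and variant bits and random data. The function name `uuid7_sketch` and the example timestamps are assumptions made for illustration; this is not the code from this PR.

```python
import os
import time
import uuid


def uuid7_sketch(unix_ms=None):
    """Build a UUIDv7-style value (hypothetical helper, not the PR's code)."""
    if unix_ms is None:
        unix_ms = time.time_ns() // 1_000_000
    # 48-bit millisecond timestamp in the top bits, 80 random bits below it.
    value = (unix_ms & ((1 << 48) - 1)) << 80
    value |= int.from_bytes(os.urandom(10), "big")
    # Overwrite the 4 version bits with 7.
    value = (value & ~(0xF << 76)) | (0x7 << 76)
    # Overwrite the 2 variant bits with binary 10 (RFC variant).
    value = (value & ~(0x3 << 62)) | (0x2 << 62)
    return uuid.UUID(int=value)


# Illustrative timestamps, one millisecond apart.
ids = [uuid7_sketch(unix_ms=t) for t in (1_693_526_400_000,
                                         1_693_526_400_001,
                                         1_693_526_400_002)]

# Sorting by ID reproduces creation order because the timestamp leads.
print(sorted(ids) == ids)
```

IDs created within the same millisecond differ only in their random bits, so in this sketch their relative order is arbitrary; the ordering is chronological only at millisecond granularity.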

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@hvanhovell
Contributor

I am not sure I understand the use case here. Why exactly do we need them to be sortable? And is this a must-have?

One of the problems I see here is that you rely on the client to generate a proper v7 UUID. We do not control the client. It is an open protocol, so a new implementation can just provide a v4 UUID or generate an improper v7. There is also the matter of time drift between client and server: how will this affect the generated UUIDs?
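The time drift concern can be made concrete with a small Python sketch. It assumes a UUIDv7-style layout with the millisecond timestamp in the top 48 bits; the helpers `uuid7_at` and `embedded_ms`, the timestamps, and the 5-second skew are all hypothetical values chosen for illustration.

```python
import os
import uuid


def uuid7_at(unix_ms):
    """Hypothetical helper: UUIDv7-style ID for an explicit timestamp."""
    value = (unix_ms << 80) | int.from_bytes(os.urandom(10), "big")
    value = (value & ~(0xF << 76)) | (0x7 << 76)  # version 7
    value = (value & ~(0x3 << 62)) | (0x2 << 62)  # RFC variant
    return uuid.UUID(int=value)


def embedded_ms(u):
    """Read back the millisecond timestamp stored in the top 48 bits."""
    return u.int >> 80


server_now = 1_693_526_400_000  # illustrative server clock reading

# Operation 1: ID generated on the server.
first = uuid7_at(server_now)

# Operation 2: started one second later, but the ID is generated by a
# client whose clock runs five seconds behind the server.
second = uuid7_at(server_now + 1_000 - 5_000)

# The later operation sorts before the earlier one.
print(second < first)
```

Under these assumptions, ordering by ID alone would place the second operation first, which is why the ID order can only be treated as approximate.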

@dillitz
Contributor Author

dillitz commented Sep 1, 2023

> I am not sure I understand the use case here. Why exactly do we need them to be sortable? And is this a must-have?

This was a request from @jdesjean. I believe he can give you a bit more background on this. From what I understand, this would allow us to present a more comprehensible history of executed operations to the user.

> One of the problems I see here is that you rely on the client to generate a proper v7 UUID. We do not control the client. It is an open protocol, so a new implementation can just provide a v4 UUID or generate an improper v7. There is also the matter of time drift between client and server: how will this affect the generated UUIDs?

I agree with you that, in its current state, we cannot rely on the ID being in the v7 format, but that is also not the goal of this PR. We just want to change the default format from v4 to v7, since it has nicer properties while fulfilling the same requirements.

@dillitz dillitz changed the title [SPARK-45051][CONNECT] Use UUIDv7 for operation IDs to make operations chronologically sortable [SPARK-45051][CONNECT] Use UUIDv7 by default for operation IDs to make operations chronologically sortable Sep 1, 2023
@jdesjean
Contributor

jdesjean commented Sep 1, 2023

> I am not sure I understand the use case here. Why exactly do we need them to be sortable? And is this a must-have?

> One of the problems I see here is that you rely on the client to generate a proper v7 UUID. We do not control the client. It is an open protocol, so a new implementation can just provide a v4 UUID or generate an improper v7. There is also the matter of time drift between client and server: how will this affect the generated UUIDs?

When the operation ID is used as a primary key, UUIDv7 gives us the nice property that the ID order will roughly match the server start time order of the queries. While no one should rely on this property exclusively, having the records roughly ordered improves sorting performance.
Additionally, for most lookups sorted by server start time, adding the operation ID to the ORDER BY clause (i.e. start_time, operation_id) is one way to obtain a consistent ordering when start times are duplicated. Roughly ordered records again help performance.
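The tie-breaking idea above can be sketched in Python as a sort on the pair (start_time, operation_id). The record type `OperationRecord`, its field names, and the ID strings are made up for this example and do not come from Spark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperationRecord:
    # Hypothetical record shape, for illustration only.
    start_time_ms: int
    operation_id: str


records = [
    OperationRecord(1_693_526_400_000, "018a4f2e-0c00-7000-8000-00000000000b"),
    OperationRecord(1_693_526_399_000, "018a4f2e-0818-7000-8000-00000000000c"),
    OperationRecord(1_693_526_400_000, "018a4f2e-0c00-7000-8000-00000000000a"),
]


def sort_key(record):
    # Start time is the primary key; the operation ID only breaks ties.
    return (record.start_time_ms, record.operation_id)


ordered = sorted(records, key=sort_key)

# The result does not depend on the input order, even with equal start times.
print(sorted(reversed(records), key=sort_key) == ordered)
```

The ordering is deterministic for any unique ID format; a time-ordered ID only affects how close the input already is to sorted order.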

@dillitz dillitz force-pushed the SPARK-45051-uuid7-operation-id branch from 2f50870 to 58486fd Compare September 2, 2023 10:13
@juliuszsompolski
Contributor

We maintain backwards compatibility, where older clients can connect to a newer server. These older clients will not provide such UUIDs.
What will happen then? Does it break any critical assumptions?

@juliuszsompolski
Contributor

... although the currently existing (Spark 3.4) clients never generate the operationId on the client side, so we can get away with adding an assertion in ExecuteHolder.operationId that the client-side ID is a UUIDv7.
But then this has to go into Spark 3.5.
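As a rough illustration of such a check, the Python sketch below reads the version field of a client-supplied ID using the standard `uuid` module. The function names and the sample ID are hypothetical, and this is not the ExecuteHolder code, which lives in the Scala server.

```python
import uuid


def client_id_version(operation_id):
    """Return the UUID version of a client-supplied ID string, or None if it does not parse."""
    try:
        return uuid.UUID(operation_id).version
    except ValueError:
        return None


def is_uuid7(operation_id):
    # Hypothetical check a server could apply to a client-provided ID.
    return client_id_version(operation_id) == 7


print(is_uuid7("018a4f2e-0c00-7000-8000-000000000001"))  # made-up v7-shaped ID
print(is_uuid7(str(uuid.uuid4())))                       # a v4 ID fails the check
```

A check like this only confirms the version bits. It cannot tell whether the embedded timestamp is accurate, so it would not address clock drift or a client that sets the version field on otherwise arbitrary data.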

@dillitz
Contributor Author

dillitz commented Sep 5, 2023

We agreed that the benefits of adding this are not big enough, because we cannot rely on the operation ID being a UUIDv7 and need to sort by startDate anyway. Closing this PR.

@dillitz dillitz closed this Sep 5, 2023