[SPARK-45051][CONNECT] Use UUIDv7 by default for operation IDs to make operations chronologically sortable by dillitz · Pull Request #42772 · apache/spark

dillitz · 2023-09-01T12:36:57Z

What changes were proposed in this pull request?

This changes the default operation ID format from UUIDv4 to UUIDv7. The change is made on both the server and the client side (Scala and Python), since either side can set this ID first.

Why are the changes needed?

Spark Connect currently uses UUIDv4 for operation IDs. Using UUIDv7 instead lets us sort operations by ID to obtain a roughly chronological order, while keeping the collision resistance we require from this ID.
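To illustrate why UUIDv7 values sort chronologically, here is a minimal Python sketch of the layout defined in RFC 9562: a 48-bit Unix millisecond timestamp in the most significant bits, followed by the version, variant, and random bits. The function name `uuid7_sketch` is hypothetical, and this is not the code from this PR, which may rely on a library instead.

```python
import os
import time
import uuid


def uuid7_sketch(unix_ms=None):
    """Build a UUIDv7-style value (illustrative sketch, not the PR's code)."""
    if unix_ms is None:
        unix_ms = time.time_ns() // 1_000_000

    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits
    rand_a = (rand >> 62) & 0xFFF                 # 12 bits after the version
    rand_b = rand & ((1 << 62) - 1)               # 62 bits after the variant

    value = (unix_ms & ((1 << 48) - 1)) << 80     # timestamp: bits 127..80
    value |= 0x7 << 76                            # version 7: bits 79..76
    value |= rand_a << 64                         # rand_a:    bits 75..64
    value |= 0b10 << 62                           # variant:   bits 63..62
    value |= rand_b                               # rand_b:    bits 61..0
    return uuid.UUID(int=value)


# IDs created at later timestamps compare greater, also as strings.
ids = [str(uuid7_sketch(ms)) for ms in (3000, 1000, 2000)]
print(sorted(ids))
```

Because the timestamp occupies the leading bits, sorting the canonical string form matches sorting by creation time at millisecond granularity. In this sketch, two IDs generated within the same millisecond are ordered by their random bits only.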

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests.

Was this patch authored or co-authored using generative AI tooling?

No.

hvanhovell · 2023-09-01T14:15:37Z

I am not sure I understand the use case here. Why exactly do we need them to be sortable? And is this a must-have?

One of the problems I see here is that you rely on the client to generate a proper v7 UUID. We do not control the client; it is an open protocol, so a new implementation can just provide a v4 UUID or generate an improper v7. There is also the matter of time drift between client and server: how will this affect the generated UUIDs?

dillitz · 2023-09-01T14:40:11Z

> I am not sure I understand the use case here. Why exactly do we need them to be sortable? And is this a must-have?

This was a request from @jdesjean. I believe he can give you a bit more background on this. From what I understand, this would allow us to present a more comprehensible history of executed operations to the user.

> One of the problems I see here is that you rely on the client to generate a proper v7 UUID. We do not control the client; it is an open protocol, so a new implementation can just provide a v4 UUID or generate an improper v7. There is also the matter of time drift between client and server: how will this affect the generated UUIDs?

I agree with you that, in its current state, we cannot rely on the ID being in the v7 format, but that is also not the goal of this PR. We just want to change the default format from v4 to v7, since it has nicer properties while fulfilling the same requirements.

jdesjean · 2023-09-01T16:09:50Z

> I am not sure I understand the use case here. Why exactly do we need them to be sortable? And is this a must-have?

> One of the problems I see here is that you rely on the client to generate a proper v7 UUID. We do not control the client; it is an open protocol, so a new implementation can just provide a v4 UUID or generate an improper v7. There is also the matter of time drift between client and server: how will this affect the generated UUIDs?

When the operation ID is used as a primary key, UUIDv7 gives us the nice property that the ID order will roughly match the order of the queries' server start times. While no one should rely on this property exclusively, having the records roughly ordered improves sorting performance.
Additionally, for most lookups that sort by server start time, adding the operation ID to the ORDER BY clause (i.e. start_time, operation_id) is one way to obtain a consistent ordering when start times are duplicated. Roughly ordered records again help improve the performance.
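The tie-breaking idea can be sketched in Python as a sort on the pair (start time, operation ID). The `OperationRecord` type, its field names, and the ID strings below are hypothetical and chosen only for illustration; they are not taken from the Spark code base.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperationRecord:
    start_time_ms: int   # server start time (hypothetical field name)
    operation_id: str    # canonical UUID string


def history_order(records):
    """Sort by start time, using the operation ID to break ties."""
    return sorted(records, key=lambda r: (r.start_time_ms, r.operation_id))


records = [
    OperationRecord(1000, "018a5f3e-0000-7000-8000-000000000002"),
    OperationRecord(1000, "018a5f3e-0000-7000-8000-000000000001"),
    OperationRecord(900, "018a5f3d-0000-7000-8000-000000000009"),
]
for record in history_order(records):
    print(record.start_time_ms, record.operation_id)
```

The secondary key makes the result deterministic regardless of input order, whatever UUID version the IDs use. UUIDv7 only adds that the input is already close to sorted.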

juliuszsompolski · 2023-09-05T17:12:14Z

We maintain backwards compatibility, so older clients can connect to a newer server. These older clients will not provide such UUIDs.
What will happen then? Does it break any critical assumptions?

juliuszsompolski · 2023-09-05T17:18:21Z

... although the currently existing (Spark 3.4) clients never generate an operationId on the client side, so we can get away with adding an assertion in ExecuteHolder.operationId that the client-side ID is a UUIDv7.
But then this has to go into Spark 3.5.
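A format check of the kind suggested above could look roughly like the following Python sketch. The function name `is_uuid7` is hypothetical, and the actual check would live in the Scala server code (ExecuteHolder), which is not shown in this thread.

```python
import uuid


def is_uuid7(operation_id: str) -> bool:
    """Return True if the string parses as an RFC-variant UUID with version 7."""
    try:
        parsed = uuid.UUID(operation_id)
    except ValueError:
        return False
    return parsed.variant == uuid.RFC_4122 and parsed.version == 7


print(is_uuid7(str(uuid.uuid4())))                       # a v4 ID is rejected
print(is_uuid7("018a5f3e-1c2d-7abc-8def-0123456789ab"))  # illustrative v7-shaped ID
```

Such a check only validates the version and variant bits. It cannot tell whether the embedded timestamp is accurate, so a client with a drifting clock or an improper generator would still pass.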

dillitz · 2023-09-05T17:37:06Z

We agreed that the benefits of adding this are not big enough, because we cannot rely on the operation ID being a UUIDv7 and need to sort by startDate anyway. Closing this PR.

github-actions bot added SQL BUILD INFRA PYTHON CONNECT labels Sep 1, 2023

dillitz force-pushed the SPARK-45051-uuid7-operation-id branch from ced98e5 to 8b0a8a1 on September 1, 2023 13:17

dillitz changed the title ~~[SPARK-45051][CONNECT] Use UUIDv7 for operation IDs to make operations chronologically sortable~~ [SPARK-45051][CONNECT] Use UUIDv7 by default for operation IDs to make operations chronologically sortable Sep 1, 2023

dillitz added 3 commits September 2, 2023 12:12

- use uuidv7 for operation id (dd6f29a)
- remove spaces (59fcfb4)
- remove from spark-deps-hadoop-3-hive-2.3 (58486fd)

dillitz force-pushed the SPARK-45051-uuid7-operation-id branch from 2f50870 to 58486fd on September 2, 2023 10:13

dillitz closed this Sep 5, 2023

dillitz wants to merge 3 commits into apache:master from dillitz:SPARK-45051-uuid7-operation-id