Include runtime version in Dapr placement data to ease with upgrades #6838

ItalyPaleAle · 2023-08-22T16:15:44Z

As we continue to make improvements to actors (especially reminders) and workflow, we are sometimes faced with the need to make changes that, while non-breaking (in the sense that don't alter the behavior observed by end users), are not backwards-compatible.

For example, assume that we wanted to change the format used to store reminders in the actor state store (today it's JSON, and we could use a more efficient format like protobuf). This change would make it impossible for older versions of the Dapr sidecar to be able to parse the reminders data, and would cause issues in clusters with heterogeneous Dapr versions. In fact, we can make Dapr "vNext" be able to read data created from "vCurrent", but there's no way for us to make "vCurrent" read data created by "vNext".

This scenario, where "vCurrent" and "vNext" run side-by-side, is very common while performing rolling upgrades, as there are always sidecars running alongside older versions. It generally is short-lived, as upgrades are eventually completed in minutes/hours, but if we don't manage this situation we will observe a variety of issues.

Because of this problem, it's really hard to make changes to actors/reminders, and workflow too once it reaches beta in the next release, as we'll have to manage clusters with heterogeneous versions.

A possible solution to this problem could be making each Dapr sidecar report its version to the Placement service and use that to ease with upgrades.

Every time a sidecar registers itself with Placement, it includes a constant indicating the "Actor API level". For example, sidecar A is running 1.12.0 and reports API level "10". Sidecar B is running 1.13.0 and reports API level "20". (API level "0" is reserved for Dapr <= 1.11 which does not report the version)
- API levels can be incremented whenever needed. It may not be necessary to increment them on every Dapr release
- Major Dapr releases increment API levels by adding 10, so we can make hotfixes if needed
When Placement disseminates its state, it also includes the minimum API version observed among all sidecars. In this example, it would be "10". This instructs all sidecars to behave in "compatibility mode", so when they store data they must store it in the format that can be read by 1.12 (API level "10").
Once all sidecars are updated, Placement will report that minimum version is "20". Sidecars can now begin storing data in the updated format and can be confident no older sidecar will try to read that.
At that point, Placement will refuse to accept connections from sidecars running API versions < 20 This way, there's no risk that an older sidecar will try to access actor data using an old format.

Pros:

Allows us to make changes that are needed to improve actors and reminders, including their performance, without worrying about having an inconsistent state when older sidecars are running.
Can be used by actors, actor reminders, and workflow.

Cons:

Once all sidecars are updated to a certain version, it is not possible to run an older version of a Dapr sidecar in the cluster and have that use actors (other building blocks will work, but not actors and workflow).
- We could also add a flag to the Dapr Placement where we enforce a different minimum version. For example, we can set a CLI flag (passed via Helm option on K8s) to always enforce minimum of "10", even if all deployed sidecars are running API level "20". This will make Dapr sidecars always run in compatibility mode with 1.12.x.

yaron2 · 2023-08-22T17:55:56Z

How does this work with edge builds and/or custom builds?

ItalyPaleAle · 2023-08-22T17:57:29Z

How does this work with edge builds and/or custom builds?

Good question, and I am not sure I have a good answer to this. Do you have any suggestion?

yaron2 · 2023-08-22T18:02:48Z

Maybe selecting a default behavior which is non-compatibility mode for edge builds while having users explicitly set the Dapr version that the custom build is based on.

ItalyPaleAle · 2023-08-22T18:05:09Z

I think that makes sense.

Here's a possible decision matrix:

Edge builds: there are 2 kinds of edge builds:
- Nightly ones built from the CI. In here, we can make sure the version is explicitly set to "edge". Placement will then consider "edge" as always one version higher than whatever versioned is specified. For example, if there are sidecars running 1.11.4 and "edge", the min version will be "1.11.4" (so, compat mode is enabled). If all sidecars are "edge", then the min version is "edge" (non-compat mode)
- Apps running directly from source, such as with go run in the repo. Audience here should be Dapr developers, and here we can default to non-compatibility mode.
Custom builds
- With the mandatory note that custom builds are not something we officially support, so users are a bit on their own here...
- Users will have to explicitly set a version in custom builds. We should encourage users to put the version based on the upstream version they use. But they are free to do as they please.
- If they don't pass a semver-compatible version, we'll use the same logic as the nightly builds.

Implements the first bullet point of dapr#6838 (and only the first bullet point) - The Dapr runtime reports its version to the Placement service when it connects - This data is currently unused and ignored by the Placement service - In Dapr 1.13 we can use that data to perform decisions as described in dapr#6838 - Having the data submitted by 1.12 already will be helpful to ensure smoother upgrades Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>

ItalyPaleAle · 2023-08-22T22:09:58Z

@yaron2 @mukundansundar I've found a better solution, which involves defining a constant "Actor API version" set in the actors package. That constnat will be set to "10" in Dapr 1.12, and if a future version makes changes that require new API levels, we can increment that to for example "20".

This should solve the problems highlighted above:

It's hardcoded in the code for a specific Dapr version
No more need to ensure that the build always includes the correct version in the buildinfo package
Works with "edge" builds and custom builds too, and there's no possibility to make mistakes
Does not need to handle the complexity of parsing and comparing semver

Another (lesser) benefit is that the API level can be sent as an int over the wire so it's a bit less data on each message

I've updated the first message to include the details and the PR

JoshVanL · 2023-08-24T21:33:07Z

Question: why are we versioning by 10, 20 etc, rather than 1, 2, or 1.0, 2.0? I think this is because we want to store the version as an int but want a precision between major versions? In any case, it’s just surprising when first reading.

mukundansundar · 2023-08-31T02:38:58Z

Once all sidecars are updated to a certain version, it is not possible to run an older version of a Dapr sidecar in the cluster and have that use actors (other building blocks will work, but not actors and workflow).
We could also add a flag to the Dapr Placement where we enforce a different minimum version. For example, we can set a CLI flag (passed via Helm option on K8s) to always enforce minimum of "10", even if all deployed sidecars are running API level "20". This will make Dapr sidecars always run in compatibility mode with 1.12.x.

In this case, won't this essentially prevent rollbacks from happening?
Consider scenarios where events arrive in sporadic intervals. If after an upgrade the API level all changes to 20 and then it is observed that in some other component/runtime block there is a block that necessitates a rollback, then this will prevent actors from being used?

Probably a reconciliation needs to happen between what min version is represented by Placement vs Sidecar?

ItalyPaleAle · 2023-09-05T14:35:26Z

Question: why are we versioning by 10, 20 etc, rather than 1, 2, or 1.0, 2.0? I think this is because we want to store the version as an int but want a precision between major versions? In any case, it’s just surprising when first reading.

The version is meant to be decoupled from the Dapr version. So, 1.12 will have version "10", and then we don't need to update the version on each Dapr version; only when we make significant changes that may not be fully backwards-compatible. So, while 1.13 could have version 20, we may not need to change it until Dapr 1.16 then, for example.

The reason for skipping 10 is to be able to have hotfixes where we make changes that require increasing the version. So if 1.13 comes out with version "20", and we realize we need to change something in 1.12.4, we can use version "11" there. I believe this will be very unlikely, but it's best to be safe.

As for using ints... Exactly the reason you mentioned: they are easier to manage. With floats, checking for equality is always a bit iffy. We could use strings that represent a semver version, but that's more complex and less efficient over the wire.

In this case, won't this essentially prevent rollbacks from happening?
Consider scenarios where events arrive in sporadic intervals. If after an upgrade the API level all changes to 20 and then it is observed that in some other component/runtime block there is a block that necessitates a rollback, then this will prevent actors from being used?

Yes, that could be a problem. If a rollback is needed:

First, ideally users should not update the entire cluster and then check for issues. As long as there's one app on the older version, the minimum version is unchanged
If a rollback is needed, then users could restart the Placement service with a new "hardcoded" minimum flag. We should offer a Helm option for this.

* Report Dapr runtime version to Placement Implements the first bullet point of #6838 (and only the first bullet point) - The Dapr runtime reports its version to the Placement service when it connects - This data is currently unused and ignored by the Placement service - In Dapr 1.13 we can use that data to perform decisions as described in #6838 - Having the data submitted by 1.12 already will be helpful to ensure smoother upgrades Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com> * Switched to use API level instead of Dapr version Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com> --------- Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com> Co-authored-by: Dapr Bot <56698301+dapr-bot@users.noreply.github.com> Co-authored-by: Mukundan Sundararajan <65565396+mukundansundar@users.noreply.github.com>

dapr-bot · 2023-11-05T19:13:25Z

This issue has been automatically marked as stale because it has not had activity in the last 60 days. It will be closed in the next 7 days unless it is tagged (pinned, good first issue, help wanted or triaged/resolved) or other activity occurs. Thank you for your contributions.

Follow-up from dapr#7134 Part of dapr#6838 Actor runtime can use this to change its behavior. Will allow completing dapr#7129 Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>

Follow-up from #7134 Part of #6838 Actor runtime can use this to change its behavior. Will allow completing #7129 Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>

Follow-up from dapr#7134 Part of dapr#6838 Actor runtime can use this to change its behavior. Will allow completing dapr#7129 Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com> Signed-off-by: Cassandra Coyle <cassie@diagrid.io>

ItalyPaleAle self-assigned this Aug 22, 2023

ItalyPaleAle mentioned this issue Aug 22, 2023

Report Dapr Actor API level (runtime version) to Placement #6839

Merged

ItalyPaleAle added this to the v1.13 milestone Sep 6, 2023

ItalyPaleAle mentioned this issue Sep 6, 2023

Allow (middleware) components to define internal actors #6538

Open

This was referenced Oct 31, 2023

Serialize reminders as protobuf in state store #7129

Merged

Placement: disseminate min Actor API version + reject nodes with version lower than minium #7134

Merged

dapr-bot added the stale Issues and PRs without response label Nov 5, 2023

ItalyPaleAle added pinned and removed stale Issues and PRs without response labels Nov 5, 2023

yaron2 closed this as completed in #7134 Nov 13, 2023

ItalyPaleAle mentioned this issue Nov 15, 2023

Store actor API level in actors object #7196

Merged

ItalyPaleAle mentioned this issue Jan 16, 2024

Calculate vnodes on daprd instead of the placement service #7382

Merged

7 tasks

ItalyPaleAle mentioned this issue Jan 31, 2024

Placement does not allow older versions of daprd to join #7471

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Include runtime version in Dapr placement data to ease with upgrades #6838

Include runtime version in Dapr placement data to ease with upgrades #6838

ItalyPaleAle commented Aug 22, 2023 •

edited

yaron2 commented Aug 22, 2023

ItalyPaleAle commented Aug 22, 2023

yaron2 commented Aug 22, 2023 •

edited

ItalyPaleAle commented Aug 22, 2023 •

edited

ItalyPaleAle commented Aug 22, 2023 •

edited

JoshVanL commented Aug 24, 2023

mukundansundar commented Aug 31, 2023

ItalyPaleAle commented Sep 5, 2023

dapr-bot commented Nov 5, 2023

Include runtime version in Dapr placement data to ease with upgrades #6838

Include runtime version in Dapr placement data to ease with upgrades #6838

Comments

ItalyPaleAle commented Aug 22, 2023 • edited

yaron2 commented Aug 22, 2023

ItalyPaleAle commented Aug 22, 2023

yaron2 commented Aug 22, 2023 • edited

ItalyPaleAle commented Aug 22, 2023 • edited

ItalyPaleAle commented Aug 22, 2023 • edited

JoshVanL commented Aug 24, 2023

mukundansundar commented Aug 31, 2023

ItalyPaleAle commented Sep 5, 2023

dapr-bot commented Nov 5, 2023

ItalyPaleAle commented Aug 22, 2023 •

edited

yaron2 commented Aug 22, 2023 •

edited

ItalyPaleAle commented Aug 22, 2023 •

edited

ItalyPaleAle commented Aug 22, 2023 •

edited