Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Include runtime version in Dapr placement data to ease with upgrades #6838

Closed
ItalyPaleAle opened this issue Aug 22, 2023 · 9 comments · Fixed by #7134
Closed

Include runtime version in Dapr placement data to ease with upgrades #6838

ItalyPaleAle opened this issue Aug 22, 2023 · 9 comments · Fixed by #7134
Assignees
Labels
Milestone

Comments

@ItalyPaleAle
Copy link
Contributor

ItalyPaleAle commented Aug 22, 2023

As we continue to make improvements to actors (especially reminders) and workflow, we are sometimes faced with the need to make changes that, while non-breaking (in the sense that don't alter the behavior observed by end users), are not backwards-compatible.

For example, assume that we wanted to change the format used to store reminders in the actor state store (today it's JSON, and we could use a more efficient format like protobuf). This change would make it impossible for older versions of the Dapr sidecar to be able to parse the reminders data, and would cause issues in clusters with heterogeneous Dapr versions. In fact, we can make Dapr "vNext" be able to read data created from "vCurrent", but there's no way for us to make "vCurrent" read data created by "vNext".

This scenario, where "vCurrent" and "vNext" run side-by-side, is very common while performing rolling upgrades, as there are always sidecars running alongside older versions. It generally is short-lived, as upgrades are eventually completed in minutes/hours, but if we don't manage this situation we will observe a variety of issues.

Because of this problem, it's really hard to make changes to actors/reminders, and workflow too once it reaches beta in the next release, as we'll have to manage clusters with heterogeneous versions.

A possible solution to this problem could be making each Dapr sidecar report its version to the Placement service and use that to ease with upgrades.

  1. Every time a sidecar registers itself with Placement, it includes a constant indicating the "Actor API level". For example, sidecar A is running 1.12.0 and reports API level "10". Sidecar B is running 1.13.0 and reports API level "20". (API level "0" is reserved for Dapr <= 1.11 which does not report the version)
    • API levels can be incremented whenever needed. It may not be necessary to increment them on every Dapr release
    • Major Dapr releases increment API levels by adding 10, so we can make hotfixes if needed
  2. When Placement disseminates its state, it also includes the minimum API version observed among all sidecars. In this example, it would be "10". This instructs all sidecars to behave in "compatibility mode", so when they store data they must store it in the format that can be read by 1.12 (API level "10").
  3. Once all sidecars are updated, Placement will report that minimum version is "20". Sidecars can now begin storing data in the updated format and can be confident no older sidecar will try to read that.
  4. At that point, Placement will refuse to accept connections from sidecars running API versions < 20 This way, there's no risk that an older sidecar will try to access actor data using an old format.

Pros:

  • Allows us to make changes that are needed to improve actors and reminders, including their performance, without worrying about having an inconsistent state when older sidecars are running.
  • Can be used by actors, actor reminders, and workflow.

Cons:

  • Once all sidecars are updated to a certain version, it is not possible to run an older version of a Dapr sidecar in the cluster and have that use actors (other building blocks will work, but not actors and workflow).
    • We could also add a flag to the Dapr Placement where we enforce a different minimum version. For example, we can set a CLI flag (passed via Helm option on K8s) to always enforce minimum of "10", even if all deployed sidecars are running API level "20". This will make Dapr sidecars always run in compatibility mode with 1.12.x.
@yaron2
Copy link
Member

yaron2 commented Aug 22, 2023

How does this work with edge builds and/or custom builds?

@ItalyPaleAle
Copy link
Contributor Author

How does this work with edge builds and/or custom builds?

Good question, and I am not sure I have a good answer to this. Do you have any suggestion?

@yaron2
Copy link
Member

yaron2 commented Aug 22, 2023

Maybe selecting a default behavior which is non-compatibility mode for edge builds while having users explicitly set the Dapr version that the custom build is based on.

@ItalyPaleAle
Copy link
Contributor Author

ItalyPaleAle commented Aug 22, 2023

I think that makes sense.

Here's a possible decision matrix:

  1. Edge builds: there are 2 kinds of edge builds:
    • Nightly ones built from the CI. In here, we can make sure the version is explicitly set to "edge". Placement will then consider "edge" as always one version higher than whatever versioned is specified. For example, if there are sidecars running 1.11.4 and "edge", the min version will be "1.11.4" (so, compat mode is enabled). If all sidecars are "edge", then the min version is "edge" (non-compat mode)
    • Apps running directly from source, such as with go run in the repo. Audience here should be Dapr developers, and here we can default to non-compatibility mode.
  2. Custom builds
    • With the mandatory note that custom builds are not something we officially support, so users are a bit on their own here...
    • Users will have to explicitly set a version in custom builds. We should encourage users to put the version based on the upstream version they use. But they are free to do as they please.
    • If they don't pass a semver-compatible version, we'll use the same logic as the nightly builds.

@ItalyPaleAle ItalyPaleAle self-assigned this Aug 22, 2023
ItalyPaleAle added a commit to ItalyPaleAle/dapr that referenced this issue Aug 22, 2023
Implements the first bullet point of dapr#6838 (and only the first bullet point)

- The Dapr runtime reports its version to the Placement service when it connects
- This data is currently unused and ignored by the Placement service
- In Dapr 1.13 we can use that data to perform decisions as described in dapr#6838
- Having the data submitted by 1.12 already will be helpful to ensure smoother upgrades

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
@ItalyPaleAle
Copy link
Contributor Author

ItalyPaleAle commented Aug 22, 2023

@yaron2 @mukundansundar I've found a better solution, which involves defining a constant "Actor API version" set in the actors package. That constnat will be set to "10" in Dapr 1.12, and if a future version makes changes that require new API levels, we can increment that to for example "20".

This should solve the problems highlighted above:

  • It's hardcoded in the code for a specific Dapr version
  • No more need to ensure that the build always includes the correct version in the buildinfo package
  • Works with "edge" builds and custom builds too, and there's no possibility to make mistakes
  • Does not need to handle the complexity of parsing and comparing semver

Another (lesser) benefit is that the API level can be sent as an int over the wire so it's a bit less data on each message

I've updated the first message to include the details and the PR

@JoshVanL
Copy link
Contributor

Question: why are we versioning by 10, 20 etc, rather than 1, 2, or 1.0, 2.0? I think this is because we want to store the version as an int but want a precision between major versions? In any case, it’s just surprising when first reading.

@mukundansundar
Copy link
Contributor

Once all sidecars are updated to a certain version, it is not possible to run an older version of a Dapr sidecar in the cluster and have that use actors (other building blocks will work, but not actors and workflow).
We could also add a flag to the Dapr Placement where we enforce a different minimum version. For example, we can set a CLI flag (passed via Helm option on K8s) to always enforce minimum of "10", even if all deployed sidecars are running API level "20". This will make Dapr sidecars always run in compatibility mode with 1.12.x.

In this case, won't this essentially prevent rollbacks from happening?
Consider scenarios where events arrive in sporadic intervals. If after an upgrade the API level all changes to 20 and then it is observed that in some other component/runtime block there is a block that necessitates a rollback, then this will prevent actors from being used?

Probably a reconciliation needs to happen between what min version is represented by Placement vs Sidecar?

@ItalyPaleAle
Copy link
Contributor Author

Question: why are we versioning by 10, 20 etc, rather than 1, 2, or 1.0, 2.0? I think this is because we want to store the version as an int but want a precision between major versions? In any case, it’s just surprising when first reading.

The version is meant to be decoupled from the Dapr version. So, 1.12 will have version "10", and then we don't need to update the version on each Dapr version; only when we make significant changes that may not be fully backwards-compatible. So, while 1.13 could have version 20, we may not need to change it until Dapr 1.16 then, for example.

The reason for skipping 10 is to be able to have hotfixes where we make changes that require increasing the version. So if 1.13 comes out with version "20", and we realize we need to change something in 1.12.4, we can use version "11" there. I believe this will be very unlikely, but it's best to be safe.

As for using ints... Exactly the reason you mentioned: they are easier to manage. With floats, checking for equality is always a bit iffy. We could use strings that represent a semver version, but that's more complex and less efficient over the wire.

In this case, won't this essentially prevent rollbacks from happening?
Consider scenarios where events arrive in sporadic intervals. If after an upgrade the API level all changes to 20 and then it is observed that in some other component/runtime block there is a block that necessitates a rollback, then this will prevent actors from being used?

Yes, that could be a problem. If a rollback is needed:

  1. First, ideally users should not update the entire cluster and then check for issues. As long as there's one app on the older version, the minimum version is unchanged
  2. If a rollback is needed, then users could restart the Placement service with a new "hardcoded" minimum flag. We should offer a Helm option for this.

mukundansundar added a commit that referenced this issue Sep 5, 2023
* Report Dapr runtime version to Placement

Implements the first bullet point of #6838 (and only the first bullet point)

- The Dapr runtime reports its version to the Placement service when it connects
- This data is currently unused and ignored by the Placement service
- In Dapr 1.13 we can use that data to perform decisions as described in #6838
- Having the data submitted by 1.12 already will be helpful to ensure smoother upgrades

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>

* Switched to use API level instead of Dapr version

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>

---------

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
Co-authored-by: Dapr Bot <56698301+dapr-bot@users.noreply.github.com>
Co-authored-by: Mukundan Sundararajan <65565396+mukundansundar@users.noreply.github.com>
@ItalyPaleAle ItalyPaleAle added this to the v1.13 milestone Sep 6, 2023
@dapr-bot
Copy link
Collaborator

dapr-bot commented Nov 5, 2023

This issue has been automatically marked as stale because it has not had activity in the last 60 days. It will be closed in the next 7 days unless it is tagged (pinned, good first issue, help wanted or triaged/resolved) or other activity occurs. Thank you for your contributions.

@dapr-bot dapr-bot added the stale Issues and PRs without response label Nov 5, 2023
@ItalyPaleAle ItalyPaleAle added pinned and removed stale Issues and PRs without response labels Nov 5, 2023
ItalyPaleAle added a commit to ItalyPaleAle/dapr that referenced this issue Nov 15, 2023
Follow-up from dapr#7134
Part of dapr#6838

Actor runtime can use this to change its behavior. Will allow completing dapr#7129

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
yaron2 pushed a commit that referenced this issue Nov 17, 2023
Follow-up from #7134
Part of #6838

Actor runtime can use this to change its behavior. Will allow completing #7129

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
cicoyle pushed a commit to cicoyle/dapr that referenced this issue May 24, 2024
Follow-up from dapr#7134
Part of dapr#6838

Actor runtime can use this to change its behavior. Will allow completing dapr#7129

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
Signed-off-by: Cassandra Coyle <cassie@diagrid.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
5 participants