Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update documentation for how to perform rollbacks #3878

Merged
merged 3 commits into from Mar 8, 2024

Conversation

stacimc
Copy link
Contributor

@stacimc stacimc commented Mar 5, 2024

Fixes

I noticed that some documentation was out of date or slightly unclear while performing a deploy recently.

Description

This PR updates some documentation around rollbacks following the changes to the release process in https://github.com/WordPress/openverse/pull/3755/files. It also tries to clarify some ambiguity I found confusing (in general, referring to "the workflow" without specifying deployment vs release-app).

Testing Instructions

Read through the documentation and confirm that the updates are accurate. I've left a few questions in comments about things I was not 100% sure of, so please correct me!

Checklist

  • My pull request has a descriptive title (not a vague title likeUpdate index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin

Developer Certificate of Origin
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@stacimc stacimc added 🟩 priority: low Low priority and doesn't need to be rushed 📄 aspect: text Concerns the textual material in the repository 🧰 goal: internal improvement Improvement that benefits maintainers, not users 🧱 stack: documentation Related to Sphinx documentation labels Mar 5, 2024
@stacimc stacimc self-assigned this Mar 5, 2024
repository using the tag of the latest stable version. You can find the
release version number in the [changelogs](/changelogs/index), and then the
tag to pass to the workflow is the version number prefixed with "rel-", for
example "rel-2023.07.03.17.52.00".
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this accurate (here and in the other applications)? Is the appropriate tag the rel- tag, or the tag obtained from the version endpoint?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the rel- tag. We could probably drop that if we wanted to, one day, and just use the date without the rel- prefix. That is probably less confusing than needing to swap the application slug for rel-? Instead, we'd just drop the application slug?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No strong feelings either way -- I think we see/use the rel- tags plenty when deploying catalog/ingestion-server so it's not especially confusing as is. I just wanted to double-check :)

rollback any environment for any service. Only members of the
The same staging and production deployment workflows in the
WordPress/openverse-infrastructure repository are used directly to rollback any
environment for any service. **Do not attempt to perform a rollback using the
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, is this accurate? My understanding is that rollbacks should be done only by running the deployment workflow from the infra repo, with the latest stable version.

Would triggering a new release at https://github.com/WordPress/openverse/releases/new with the stable version result in creating duplicate tags?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is correct. Really we can frame it like this: to deploy a specific, already released version, run the deployment workflow in the infrastructure repository with the version tag to deploy. This applies to any (literally, all possible cases) scenario where we need to trigger a dpeloyment to an existing release version, whether a rollback or re-deployment (e.g., to get new environment variables) or whatever.

Publishing a release creates the release and the release app workflow is just an incidental part of publishing (it would be good if we just renamed it to "Tag stable image and publish changelog" (TSIAPC 😰) or something, as it is not in any way independent of publishing). Publishing is only for creating new a new release off of HEAD on main and has no other purpose.

The deployment workflows are general purpose deployment workflows utilised by the "TSIAPC" workflow (release-app) as an incidental part of publishing a release, but not a strictly necessary one. We could, for example, remove automated deployments for the API and frontend from the "TSIAPC" workflow (release-app), to completely separate the ideas and prevent any possible confusion between the two. Publishing is publishing. Deploying is deploying. There is no necessary crossover between the two, which is why they are distinct workflows with no direct relationship to each other, aside from that publishing happens to trigger the deployment workflows.

This might sound pedantic, but it's really important to completely separate these two workflows conceptually, and for everyone creating releases and/or deploying the application in any capacity understands the separation between them. Publishing triggers deployment, but is not part of deployment. Deployment is also not part of publishing a release, it just happens to occur after publishing a release because that's basically why we publish releases, to immediately deploy to production.

It really might be best to just separate the two and not have publishing trigger a deployment, to emphasise the separation of the two... but, as it is literally what we want to do every time we publish a release, I think we would quickly find it tedious that the frontend and API no longer automatically deploy to production when we create a new release.

I'm also thinking there is a confusion of the GitHub terms and what would be the ideal, precise words to use: Publishing implies making the thing public, which in our case really means deploying it to production (because we run apps, not libraries). But, ideally, the process would be called "Create a stable version with a changelog, then deploy it", which emphasises the two completely separate parts. It's just that GitHub call's it "publishing a release", wihch overloads a whole lot of ideas that for us are actually separate. If we think of publishing as a combination of creating the release and deployment, it's still kind of confusing.

Anyway, the concepts are currently muddled, and the words GitHub uses and that we've titled our workflow files with are not helping maintain the separation. If removing the automated deployments from the workflow triggered by publishing a release in GitHub would help make that clearer, then it's probably worth the tediousness, as mistakes here are potentially more confusing than the tediousness.

@stacimc stacimc marked this pull request as ready for review March 5, 2024 22:39
@stacimc stacimc requested a review from a team as a code owner March 5, 2024 22:39
Copy link

github-actions bot commented Mar 5, 2024

Full-stack documentation: https://docs.openverse.org/_preview/3878

Please note that GitHub pages takes a little time to deploy newly pushed code, if the links above don't work or you see old versions, wait 5 minutes and try again.

You can check the GitHub pages deployment action list to see the current status of the deployments.

Changed files 🔄:

Copy link
Contributor

@sarayourfriend sarayourfriend left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM, I don't see a need for changes now.

But. I think that if we remove the automated production deployment from the workflow that runs after publishing a release in GitHub, that all of this documentation would be far simpler. Rather than needing to add any caveats, we would always have a simple instruction:

  • Deploy the application to the version you wish to be deployed

That is actually already true in all cases. As I said in my lengthy comment, it just happens that the workflow triggered by publishing a release in GitHub does this for you. If we remove that though, it would completely separate the idea of publishing a release as "a method of deployment". There is only one method of deployment for any app, and any deployment to any version always uses that single method, regardless of whether it is automatically or manually triggered. If we switch it to always be manually triggered, then while it adds a step to pushing a version to production, it separates the question of "how do I deploy a version" only one possible answer: deploy it using the deployment workflow.

This would basically remove all caveats about deployment, aside from zero-downtime concerns, which are already extensively covered in a separate document. The documentation outline would go like this:

## Pushing a version to production

1. Publish the drafted release in GitHub for the app you want to run the latest code for in production
2. Copy the changelog title, replace the application slug with `rel-`, and then run the deployment workflow for the application from the infrastructure repository with that new tag.

## Rollback

Run the deployment workflow with the version you wish to roll the application back to.

## Redeploy

Run the deployment workflow with the version tag reported by the `/version` endpoint of the application you wish to redeploy.

It would be even simpler if we did away with the annoying task of changing the tag prefix from the app slug to rel-.

What do you think, Staci? Would that clarify the ideas? It adds a step to updating production to the latest code, but prevents any possibility of confusion between tagging an image with a "stable release" tag, and the deployment of the app. I think we could delete a fair bit of documentation and remove basically all the caveats that aren't just referncing zero-downtime.

repository using the tag of the latest stable version. You can find the
release version number in the [changelogs](/changelogs/index), and then the
tag to pass to the workflow is the version number prefixed with "rel-", for
example "rel-2023.07.03.17.52.00".
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the rel- tag. We could probably drop that if we wanted to, one day, and just use the date without the rel- prefix. That is probably less confusing than needing to swap the application slug for rel-? Instead, we'd just drop the application slug?

rollback any environment for any service. Only members of the
The same staging and production deployment workflows in the
WordPress/openverse-infrastructure repository are used directly to rollback any
environment for any service. **Do not attempt to perform a rollback using the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is correct. Really we can frame it like this: to deploy a specific, already released version, run the deployment workflow in the infrastructure repository with the version tag to deploy. This applies to any (literally, all possible cases) scenario where we need to trigger a dpeloyment to an existing release version, whether a rollback or re-deployment (e.g., to get new environment variables) or whatever.

Publishing a release creates the release and the release app workflow is just an incidental part of publishing (it would be good if we just renamed it to "Tag stable image and publish changelog" (TSIAPC 😰) or something, as it is not in any way independent of publishing). Publishing is only for creating new a new release off of HEAD on main and has no other purpose.

The deployment workflows are general purpose deployment workflows utilised by the "TSIAPC" workflow (release-app) as an incidental part of publishing a release, but not a strictly necessary one. We could, for example, remove automated deployments for the API and frontend from the "TSIAPC" workflow (release-app), to completely separate the ideas and prevent any possible confusion between the two. Publishing is publishing. Deploying is deploying. There is no necessary crossover between the two, which is why they are distinct workflows with no direct relationship to each other, aside from that publishing happens to trigger the deployment workflows.

This might sound pedantic, but it's really important to completely separate these two workflows conceptually, and for everyone creating releases and/or deploying the application in any capacity understands the separation between them. Publishing triggers deployment, but is not part of deployment. Deployment is also not part of publishing a release, it just happens to occur after publishing a release because that's basically why we publish releases, to immediately deploy to production.

It really might be best to just separate the two and not have publishing trigger a deployment, to emphasise the separation of the two... but, as it is literally what we want to do every time we publish a release, I think we would quickly find it tedious that the frontend and API no longer automatically deploy to production when we create a new release.

I'm also thinking there is a confusion of the GitHub terms and what would be the ideal, precise words to use: Publishing implies making the thing public, which in our case really means deploying it to production (because we run apps, not libraries). But, ideally, the process would be called "Create a stable version with a changelog, then deploy it", which emphasises the two completely separate parts. It's just that GitHub call's it "publishing a release", wihch overloads a whole lot of ideas that for us are actually separate. If we think of publishing as a combination of creating the release and deployment, it's still kind of confusing.

Anyway, the concepts are currently muddled, and the words GitHub uses and that we've titled our workflow files with are not helping maintain the separation. If removing the automated deployments from the workflow triggered by publishing a release in GitHub would help make that clearer, then it's probably worth the tediousness, as mistakes here are potentially more confusing than the tediousness.

documentation/general/deployment.md Show resolved Hide resolved
@stacimc
Copy link
Contributor Author

stacimc commented Mar 6, 2024

@sarayourfriend Thank you so much for responding so thoroughly!

What do you think, Staci? Would that clarify the ideas? It adds a step to updating production to the latest code, but prevents any possibility of confusion between tagging an image with a "stable release" tag, and the deployment of the app. I think we could delete a fair bit of documentation and remove basically all the caveats that aren't just referncing zero-downtime.

I'm not opposed to separating them out, but I don't know that I would remove much documentation afterward. For me the confusion in the docs was related entirely to ambiguity in phrasing "the process" or "the workflow" etc.

Colloquially we use the phrase "deploying" a service to mean the process of publishing the release and then actually deploying it, and I think we would still do so even if that work required two steps instead of one. When the documentation for a rollback uses language like "repeat the deployment process", to me that really sounds like it's saying to repeat all the steps. The phrases "workflow", "deployment workflow", and "deployment process" all could be interpreted as referring to (1) the actual deployment - <env> <service> workflows in the infra repo, (2) the release-app workflows in the monorepo, or (3) the general process of deploying at a higher level (since you're right, 'publishing', 'process', and 'workflow' are all unfortunately loaded terms).

For me all the confusion is immediately cleared up just by specificity. I read "repeat the deployment process" to mean that I should be publishing a new release somehow, and was immediately confused because that doesn't seem right! If we specify that we mean to use Run Workflow in the infra repo to run just the deployment workflow that immediately makes sense.

@sarayourfriend
Copy link
Contributor

Colloquially we use the phrase "deploying" a service to mean the process of publishing the release and then actually deploying it

I read "repeat the deployment process" to mean that I should be publishing a new release somehow

I see, I didn't understand that that was the point of ambiguity, but I see now that my suggestion wouldn't address that, basically at all. Ideally we would not use deployment as a catch-all, but you're right that we do, and there's not much we can really do to just change the language. Thanks for clarifying where the confusion was coming from, and sorry I misunderstood it!

Copy link
Member

@zackkrida zackkrida left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this helps quite a bit. I've left one non-blocking suggestion.

stacimc and others added 2 commits March 8, 2024 09:40
…ease.

Co-authored-by: zack <6351754+zackkrida@users.noreply.github.com>
@stacimc stacimc merged commit 9460b14 into main Mar 8, 2024
37 checks passed
@stacimc stacimc deleted the update/rollback-documentation branch March 8, 2024 18:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
📄 aspect: text Concerns the textual material in the repository 🧰 goal: internal improvement Improvement that benefits maintainers, not users 🟩 priority: low Low priority and doesn't need to be rushed 🧱 stack: documentation Related to Sphinx documentation
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

None yet

3 participants