---
status: accepted
implementation: done
status_last_reviewed: 2024-03-04
---

# RFC 128: Continuous Deployment

## Summary

In RFC 70 we committed to fully automatic deployments. The following provides a framework for how we will actually achieve this for all of our existing apps.

Decision: in order to safely enable automatic deployments for each app, we will specify minimum criteria that must first be met. See the criteria described below.

An audit of our apps against these criteria can be found in Appendix A: Compatibility.

## Problem

Our current approach to shipping software has several issues:

### 1. We have a constant backlog of deployments

Merging a PR for an application repo results in a new "release" tag. We have a Release app to monitor these. Every new release tag is automatically deployed to our Integration environment.

This is where the automation ends. In order for a release (or series of releases) to be deployed to the GOV.UK Production environment, a developer with production access must:

  • Manually check the release is "OK" in Integration.
  • Manually deploy the release to Staging.
  • Manually check the release is "OK" in Staging.
  • Manually deploy the release to Production.

Having these tedious, manual steps makes it easy to not deploy a change. This is especially true for new joiners, who can only merge PRs and cannot deploy them beyond Integration.

Full stats and supporting code can be found here. We expect these will improve slightly over the coming months, due to the reduction in changes generated by Dependabot [1].

Graph of deploy lag (in days)

A backlog of deployments also makes diagnostics harder:

  • ⚠ We can mistakenly debug an issue against the undeployed version of the code.
  • ⚠ We may struggle to isolate a breaking change, if many were deployed together.

### 2. We freely merge PRs with breaking changes

We tend to merge PRs and only later worry if they break the app. This can cause the codebase to become undeployable, and interrupt other developers testing changes on Integration.

Example: https://github.com/alphagov/email-alert-api/commit/8e633fc7d89d7f13bc0e5bdcab14ed3dea968289#diff-c1fd91cb1911a0512578b99f657554526f3e1421decdb9e908712beab57e10f9R20

Fix: alphagov/email-alert-api#1346

I couldn't find a good example of myself doing this, but I've definitely done it too. We will never write unbreakable code, yet we have too little caution around merging our changes.

### 3. We do not have a working definition of "OK"

It's easy for each developer to have their own standards, and to rely on their own judgement when checking that a change is safe to deploy. In practice, there is no agreed list of checks a developer must perform before deploying.

We often defer to automated tests to tell us when something is broken, but it's unclear whether this is really justified: there is no minimum standard for testing of GOV.UK apps.

## Proposal

The first issue (we have a constant backlog of deployments) has an obvious solution: make deployments automatic. This has already begun, as part of a trial of Continuous Deployment and a subsequent extension to include 5 more apps.

It is still possible to deploy an app manually. The new pipeline is just an automation of the manual deployment steps. Rollback is unchanged, and freezing deployments is supported [1].

However, we have been unable to make further progress due to the other two issues above. We are concerned that enabling automatic deployments en masse would lead to an increase in defects reaching production.

In order to proceed, we need to agree a clear and actionable set of criteria for enabling automatic deployments, so that we will only do so when we can show this is safe, as well as secure.

### Security checks

We need to ensure that only people with production access are able to deploy code to the live site. This is equivalent to being a member of the "GOV.UK Production" GitHub team. As part of the trial of CD, we restricted merge access on the affected repos to people in this team [1] [2] [3]. We will continue to use this approach.

As well as restricting who can merge, we also have restrictions around which version of the code can be deployed automatically: only the latest release. This protects against someone with access to the Integration environment triggering a potentially malicious deployment of historical code, or code that has not been reviewed. The following checks in the Deploy_App_Downstream Jenkins job enforce this restriction:

  • The tag to deploy must be the highest release_123-type tag.
  • The tagged commit must match the latest commit on the master branch.
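The two checks above can be sketched in Ruby, assuming tags of the form `release_<n>`. The helper names and data structures below are illustrative, not the actual Jenkins job code:

```ruby
# Illustrative model of the Deploy_App_Downstream checks. In Jenkins these
# run against git; here tags and commits are passed in as plain data.

def highest_release_tag(tags)
  # Pick the release_<n> tag with the largest numeric suffix.
  tags.grep(/\Arelease_\d+\z/).max_by { |tag| tag[/\d+/].to_i }
end

def deployable?(tag, all_tags, commit_for_tag, master_head)
  # Check 1: the tag to deploy must be the highest release_<n> tag.
  return false unless tag == highest_release_tag(all_tags)

  # Check 2: the tagged commit must match the latest commit on master.
  commit_for_tag[tag] == master_head
end
```

So `release_122` is rejected even if it points at a reviewed commit, because a newer `release_123` exists.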

However, the pipeline could still be abused:

  • By manually altering the job definition. In this scenario, someone with Integration access (but not Production access) could deploy any version of the code to the Staging environment. (But no further: only people with production access can trigger the job in Staging Jenkins [1], which also performs the above checks.)

  • By repeatedly invoking the Deploy_App_Downstream job with the latest release, flooding the Production environment with deployments. We must accept that risks like this are inevitable, given the current state of our infrastructure: allowing general SSH access to our Integration environment means part of the CD chain is exposed.

### Safety checks

The person who merges a PR must be confident that it is safe to do so. As part of the trial of CD, we added PR templates to each affected app, to explicitly warn developers about the automatic deployment.

However, we also want to ensure that a faulty PR merged in haste can be caught before it impacts the live site. There are a number of checks we can perform to help ensure this, without aiming for perfection.

We won't check alerts. At the time of writing, we have no evidence that checking Icinga alerts has any advantage over the checks discussed below. We can investigate such a check if we find evidence [1].

We will not consider manual and scheduled tasks. We do not expect enabling automatic deployments to lead to any increase in defects that impact production, for these parts of a codebase.

#### Check: Internal repo tests pass (i.e. features are tested in detail)

For the purpose of automating deployments:

  • We should always have tests for the majority of the functionality of an app. Since this is not easy to determine across languages [1], we will use Ruby code coverage as the main indicator of thorough testing.

  • We will require some evidence of JavaScript testing, but we will not go so far as to check the coverage. This is because of its relatively low footprint in our codebases, the difficulty of measuring coverage for it [1], and the fact that we often cover it implicitly with Ruby tests, as part of our UI approach of progressive enhancement. An exception to the need for coverage is Static, for which JavaScript constitutes 70% of the codebase.

An app has "enough" of these tests when:

  • It has at least one JavaScript test, if it makes use of the language.
  • Its code coverage exceeds 95% at the point when CD is enabled.
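As a sketch of the 95% gate, assuming per-file line data in the shape Ruby's stdlib Coverage library reports (`nil` for non-executable lines, otherwise a hit count); the file name is made up:

```ruby
def coverage_percent(lines_by_file)
  # nil entries are non-executable lines (comments, `end`, blank lines).
  executable = lines_by_file.values.flatten.compact
  return 0.0 if executable.empty?

  100.0 * executable.count { |hits| hits > 0 } / executable.size
end

def enough_coverage?(lines_by_file, threshold: 95.0)
  coverage_percent(lines_by_file) >= threshold
end

report = { "app/models/subscription.rb" => [1, 5, nil, 0, 2] }
coverage_percent(report) # 3 of the 4 executable lines ran => 75.0
```

In practice SimpleCov can enforce this directly, via `SimpleCov.minimum_coverage 95` in the test helper, which fails the build when coverage drops below the threshold.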

#### Check: API contract tests pass (i.e. adapters work with their APIs)

For the purpose of automating deployments:

  • We only need to add contract tests when it's hard to be confident about an API change. If an API endpoint only has a single "consumer" app, then it's easy to gain confidence by manually testing the pair still work together.

  • We should still consider adding contract tests for APIs that are likely to be reused internally, as we may not notice this has happened. Collections [1] and Imminence [1] should both have contract tests for their APIs.

  • We only need to add contract tests for the APIs we consume internally. Some of our APIs may be public, but this is a separate, product-level commitment. Our strategy for testing public APIs is out-of-scope for this RFC.

With this in mind, an API app has "enough" of these tests when:

  • Each endpoint with multiple, internal consumer apps has at least one contract test.

Example: Asset Manager is missing contract tests because multiple apps use it to create [1] [2] assets, among other APIs. Conversely, no Support API endpoints are used by more than one GOV.UK app [1].
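The criterion can be expressed as a simple audit, sketched below with hypothetical endpoint names; the helper is illustrative, not an existing tool:

```ruby
# Endpoints with more than one internal consumer app need at least one
# contract test; endpoints with a single consumer do not.
def endpoints_missing_contract_tests(consumers_by_endpoint, tested_endpoints)
  consumers_by_endpoint
    .select { |_endpoint, consumers| consumers.uniq.size > 1 }
    .keys
    .reject { |endpoint| tested_endpoints.include?(endpoint) }
end

consumers = {
  "POST /assets"  => %w[whitehall publisher], # multiple consumers
  "GET /feedback" => %w[support],             # single consumer: manual testing is enough
}
endpoints_missing_contract_tests(consumers, []) # => ["POST /assets"]
```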

#### Check: App is healthy (i.e. it can run in a production environment)

For the purpose of automating deployments:

  • We need to test that an app can respond to a request for each of the machine classes it runs on [1] [2]. This is purely to prove that an instance of the app is running [1], and may be covered by other tests.

  • We need to test that an app can connect to any systems it uses for data storage and retrieval, such as a database. This usually has production-specific config [1] [2] [3], which could be faulty.

  • We do not need to test connectivity to other GOV.UK apps, since this should be handled by our internal API adapters and our SSO library, with little or no configuration inside the repo.

With this in mind, an app has "enough" of these tests when:

  • It has a test that checks for a successful response to an arbitrary request.

  • It has a test that checks for connectivity to any read/write dependencies:

    • Databases e.g. MongoDB, Postgres, MySQL, Redis
    • File systems e.g. AWS S3
    • Caching systems e.g. Memcached

Example: Travel Advice Publisher has enough of these tests because there is one to check for a "200 OK" response for its /healthcheck endpoint, which tests for connectivity to Redis and MongoDB [1] [2].
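The pattern can be sketched with stubbed dependency checks; the lambdas stand in for real client calls such as a Redis PING:

```ruby
# A /healthcheck-style responder: 200 only when every read/write
# dependency is reachable, 500 (with the failing names) otherwise.
def healthcheck(dependency_checks)
  failing = dependency_checks.reject { |_name, check| check.call }.keys
  if failing.empty?
    [200, { "status" => "ok" }]
  else
    [500, { "status" => "critical", "failing" => failing }]
  end
end

checks = {
  "redis"   => -> { true }, # e.g. Redis.new.ping == "PONG"
  "mongodb" => -> { true }, # e.g. a cheap query against the database
}
healthcheck(checks).first # => 200
```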

Services are special. We cannot make a request to a service, as it has no web interface. Instead, we will update the deployment script for each service app to check the process is running after the app is restarted.
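The post-restart check could be as small as a signal-0 probe of the service's pid (a sketch; the real deployment script would read the pid from wherever the init system records it):

```ruby
# Signal 0 delivers nothing but still checks the process exists.
def process_running?(pid)
  Process.kill(0, pid)
  true
rescue Errno::ESRCH
  false # no such process: the restart failed
rescue Errno::EPERM
  true  # exists, but owned by another user
end

process_running?(Process.pid) # the current process, trivially running
```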

#### Check: Smoke tests pass (i.e. double-check critical features work)

For the purpose of automating deployments, we do not strictly need Smoke tests for arbitrary features of an app: in theory, they are replaced by the other checks described here. However, this assumes perfect testing.

We can think of our Smoke tests as a "backup" or "secondary" layer of testing for features we have previously deemed to be critical. We will therefore check all existing Smoke tests pass, where they are relevant to an app.

## Rejected approaches

The following are alternatives we considered to address the safety concerns with enabling automatic deployments for our apps. We did not consider any alternative approaches to address security concerns.

### Check: interactions between apps work

A previous draft of this RFC would have made extensive use of Smoke tests and sandboxed "end-to-end" (E2E) tests to check apps can successfully communicate with real instances of other apps (as opposed to test stubs). The need for sandboxed E2E tests arises when the tests have some associated state [1].

This approach was rejected because:

  • The API calls made by each app are stubbed at the request level, so there is little benefit in re-testing the associated functionality if our API adapters are guaranteed by contract tests.

  • The environment we have for sandboxed E2E tests does not reflect our real environments [1] [2] [3] [4] [5]. It's unclear if something that works in this environment will also work in a real one.

  • The environment we have for sandboxed E2E tests is notoriously slow, brittle and complex [1] [2] [3] [4] [5]. We should invest in more granular forms of testing instead of this one [1] [2].

### Check: detailed features of GOV.UK work

A previous draft of this RFC would have required a detailed audit of the functionality of each app, to ensure we had Smoke or sandboxed E2E tests for each of its features, for some definition of "feature".

This approach was rejected because:

  • There is no clear way to demarcate features of an app, let alone GOV.UK as a whole. Given the sheer volume of functionality in our apps, it's not practical to audit or have realistic tests for all of it.

  • While we have existing tests for specific features of GOV.UK [1] [2], there is no thorough, positive logic to justify their existence or extension [1] [2]. This is a weak foundation to try and build upon.

## Consequences

### No deployment delay

We will no longer have the safety blanket of our changes languishing in Integration or Staging. If we merge buggy code, then that code could reach production faster. We must accept that. On the positive side, we can also ship new features and fixes more quickly. While some bugs may make it to production, having a shorter delay between merge and deploy means it's more likely that the person who introduced the bug will be present to help fix it.

The safety checks we run for each app should always be seen as a last resort to halt a deployment. We are ultimately responsible for our changes, and will need to adapt to this new "CD" way of working.

### Automatic deployments

We will enable automatic deployments for the apps listed here, but only when they meet the safety criteria of this RFC. To help with this, a rough audit of "compatibility" can be found in Appendix A: Compatibility.

  • We can immediately enable automatic deployments for apps marked with a ✅.
  • We will undertake work to address all the issues (⚠) for apps marked with a ❌.

We will disable CD for Publishing API and Collections. These were only enabled on a trial basis, and require non-trivial work to meet the new standard. We should avoid setting a false precedent by leaving CD enabled.

The following steps must be taken as part of enabling automatic deployments:

  • Restrict merge access to "GOV.UK Production" [example].

  • Tag relevant Smoke tests to run for the app [example].

  • Add a note in the Release app to warn about the change.

  • Add a PR template to warn about the change [example].

    • Backfill the PR template as a comment on existing PRs.
  • Enable automatic deployments in a Puppet PR [example].

    • Comment on how the app meets the criteria of this RFC.

We can be flexible with coverage. We can defer adding internal repo tests if the code has no impact on the running app e.g. rake tasks [1]. But we should not defer removing any unused code.

### No E2E tests

In establishing the safety criteria for automatic deployments, we had to consider our sandboxed end-to-end (E2E) tests. As explained in the rejected approaches, we chose to avoid them entirely. This brings their existence into question.

These tests add limited value to GOV.UK. They reduce our agility due to the time they take to run on every PR. They often fail for reasons unrelated to the features they are testing. We invest little effort in improving or extending them.

Once automatic deployments are enabled for all of the supported, non-frontend apps, we will have improved our other testing to the extent that the E2E tests will offer few additional safety guarantees, compared to the cost of maintaining them. We will therefore disable and delete these tests and their associated infrastructure [1] [2].

### Healthy deployments

The original implementation of the CD pipeline resulted in a new set of /healthcheck tests in Smokey, duplicating the Icinga alerts that already check these endpoints on each machine. It's still impractical to use Icinga as part of the CD pipeline: the alerts are checked asynchronously and their names are unstable. Some duplication is therefore inevitable.

An alternative to individual Smokey tests is to make calling the /healthcheck endpoint part of the "Deploy_App" job, such that it would fail if the healthcheck fails. We will investigate this approach as a way to DRY up the Smokey tests, in parallel with enabling automatic deployments for our apps.
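A hedged sketch of that idea: after the deploy step, poll the endpoint until it returns 200 or the attempts run out, and fail the job otherwise. The fetcher is injected so that the real job could use `Net::HTTP` (or `curl`); here it is stubbed:

```ruby
# Fail the deploy unless /healthcheck returns 200 within the allowed polls.
# A real job would pass a non-zero delay, e.g. delay: 5 (seconds).
def deploy_healthy?(fetch_status, attempts: 10, delay: 0)
  attempts.times do
    return true if fetch_status.call == 200
    sleep delay
  end
  false
end

# Stub: the app answers 500 while restarting, then comes up healthy.
responses = [500, 500, 200]
deploy_healthy?(-> { responses.shift }) # => true, healthy on the third poll
```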

### Explainable Smoke tests

We will add documentation to Smokey to explain that the tests it contains should be limited to features we have deemed to be critical to the users of GOV.UK, and therefore exist:

  • As a way to test for problems due to changes in infrastructure.
  • As a "backup" layer of testing for changes to individual apps.

### Better long-term testing

In order to meet the safety criteria in this RFC, we will need to improve the tests we have around many of our apps. Once we have enabled automatic deployments, there is a risk the tests will degrade over time.

To help prevent this, we will write a new manual in GOV.UK Developer Docs, which will specify the safety criteria in this RFC as the new minimum level of testing for GOV.UK apps. This means that, in addition to meeting these criteria for the purpose of enabling automatic deployments, we will commit to upholding them on an ongoing basis. It is out-of-scope for this RFC to resolve how we will uphold this commitment once we have made it.