Assorted updates relating to Replatforming.
- Replatforming team is now Platform Engineering and there's also a
  Platform Security and Reliability team.
- Improve the guidance on TLS certificates, which was very out of date.
- Lots of minor fixes and clarifications, including removing
  inappropriate/unhelpful levels of detail that had become outdated.
sengi committed Jul 5, 2023
1 parent 3285e60 commit ba9ffde
Showing 9 changed files with 142 additions and 252 deletions.
12 changes: 8 additions & 4 deletions source/manual/2nd-line.html.md
@@ -130,10 +130,10 @@ This will help inform developers, Technical 2nd Line tech lead(s), and the GOV.U

Follow these Slack channels while working on Technical 2nd Line:

- `#govuk-2ndline-tech` - the main channel for people on Technical 2nd Line
- `#govuk-deploy` - automatically posted to every time a Staging/Production deploy is done; people also manually post here when putting branches on Integration for testing
- `#govuk-developers` - this is a general channel for developers and can be a good place to ask questions if you are struggling
- `#govuk-platform-engineering` - this is the channel for the Platform Engineering team, where the SREs are currently working. However, you should use #govuk-2ndline-tech to contact the RE interruptible person about urgent GOV.UK infrastructure issues.
- [#govuk-2ndline-tech] - the main channel for people on Technical 2nd Line
- [#govuk-deploy] - automatically posted to every time a Staging/Production deploy is done; people also manually post here when putting branches on Integration for testing
- [#govuk-developers] - this is a general channel for developers and can be a good place to ask questions if you are struggling
- [#govuk-platform-engineering] - the Platform Engineering team looks after the GOV.UK Kubernetes clusters and base images

[Technical 2nd Line dashboard]: https://govuk-2ndline-dashboard.herokuapp.com/
[GOV.UK Technical 2nd Line Trello board]: https://trello.com/b/M7UzqXpk/govuk-2nd-line
@@ -150,3 +150,7 @@ Follow these Slack channels while working on Technical 2nd Line:
[Ongoing issues, useful Info & unexplained events]: https://trello.com/c/TwquoCfW/316-readme
[Missing documentation]: https://trello.com/c/owAK2OjY/1009-please-use-this-column-to-record-any-missing-documentation-you-notice-and-were-not-able-to-add-during-your-shift
[gds-vpn]: https://docs.google.com/document/d/1O1LmLByDLlKU4F1-3chwS8qddd2WjYQgMaaEgTfK5To/edit
[#govuk-2ndline-tech]: https://gds.slack.com/channels/govuk-2ndline-tech
[#govuk-deploy]: https://gds.slack.com/channels/govuk-deploy
[#govuk-developers]: https://gds.slack.com/channels/govuk-developers
[#govuk-platform-engineering]: https://gds.slack.com/channels/govuk-platform-engineering
30 changes: 17 additions & 13 deletions source/manual/alerts/renew-tls-certificate.html.md
@@ -14,24 +14,28 @@ These checks look at the validity of the TLS certificates for:
* www.integration.publishing.service.gov.uk at the edge (Fastly)
* \*.publishing.service.gov.uk, \*.staging.publishing.service.gov.uk and \*.integration.publishing.service.gov.uk at the origin (our servers), depending on the environment Icinga is running in

You'll start seeing an alert 30 days before the relevant certificate is due to
expire.
The alert fires 30 days before certificate expiry.

See [renew a TLS certificate for GOV.UK](/manual/renew-a-tls-certificate.html)
for details of how to renew the relevant certificate. This is normally done by
GOV.UK Platform Engineering.
See [renew a TLS certificate for GOV.UK](/manual/renew-a-tls-certificate.html).

## Production www.gov.uk certificate

The TLS certificate for www.gov.uk is managed by Fastly. They will open a support
ticket when the certificate is due for renewal. This ticket will be picked up by
GOV.UK Platform Engineering, who will co-ordinate with Fastly to renew the
certificate.
The TLS certificate for www.gov.uk is managed by Fastly. If any additional
verification of domain ownership is needed for renewal (for example if Fastly
chooses a different outsourcing partner for its certification authority),
Fastly will open a support ticket with us. This ticket will go to 2nd-line Tech
Support, who should co-ordinate with Fastly to ensure that the certificate is
renewed.

## Production, staging and integration wildcard certificates

The wildcard TLS certificates for production, staging and integration are
managed by GOV.UK Platform Engineering. Once the alert appears, they will work to
renew the relevant certificate and make it live. For staging and integration,
the certificates are also provided to Fastly to enable TLS for our staging and
integration CDN environments.
automatically renewed by AWS. Renewal should require no human intervention
provided the DNS validation records remain in place.

## Staging and integration www certificates

The certificates for www.staging.publishing.service.gov.uk and
www.integration.publishing.service.gov.uk are automatically issued by Fastly.
Renewal should require no human intervention provided the DNS validation
records remain in place.
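
The 30-day alert window described in this file can be illustrated with a small sketch. This is not the Icinga check itself: the function names, the `ALERT_THRESHOLD_DAYS` constant and the choice to alert at exactly 30 days are assumptions made for illustration. It only shows how a certificate's `notAfter` timestamp could be compared against the threshold.

```python
import ssl
from datetime import datetime, timezone

# Assumption: the alert fires once expiry is 30 days away or less.
ALERT_THRESHOLD_DAYS = 30


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Return the number of days between `now` and a certificate's expiry.

    `not_after` uses the format found in the 'notAfter' field returned by
    ssl.SSLSocket.getpeercert(), for example 'Aug  4 12:00:00 2023 GMT'.
    """
    expiry_epoch = ssl.cert_time_to_seconds(not_after)
    return (expiry_epoch - now.timestamp()) / 86400


def should_alert(not_after: str, now: datetime,
                 threshold_days: int = ALERT_THRESHOLD_DAYS) -> bool:
    """Hypothetical helper: True when the certificate is inside the alert window."""
    return days_until_expiry(not_after, now) <= threshold_days


if __name__ == "__main__":
    now = datetime(2023, 7, 5, 12, 0, 0, tzinfo=timezone.utc)
    print(should_alert("Aug  4 12:00:00 2023 GMT", now))
```

For a live endpoint, the `notAfter` value would come from the certificate presented by the server. How the real check obtains it is not covered by this commit.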
38 changes: 18 additions & 20 deletions source/manual/ask-for-help.html.md
@@ -11,7 +11,6 @@ The GOV.UK Technical 2nd Line team (#govuk-2ndline-tech):
- monitors the GOV.UK hosting platform and applications, and works to fix any issues
- calls on experienced members of other teams to assist in incidents
- deploys changes on behalf of teams that don’t have sufficient access
- supports the software and processes that deploy code
- triages technical issues and recommends when to escalate to a site reliability engineer

The GOV.UK developer community (#govuk-developers):
@@ -22,25 +21,24 @@ The GOV.UK developer community (#govuk-developers):

The GOV.UK Platform Security and Reliability team (#govuk-platform-security-reliability-team):

- works on long term fixes to the platform
- owns the infrastructure, although doesn't necessarily have the expertise to fix issues
- works on long-term improvements to the reliability and security of GOV.UK
- manages some access control automation such as govuk-user-reviewer
- manages some AWS infrastructure that supports multiple teams (together with the Platform Engineering team)

The GOV.UK Platform Engineering team (#govuk-platform-engineering):

- supports the infrastructure used to run and make changes to GOV.UK
- handles updates to `*.gov.uk` DNS (excluding `*.publishing.service.gov.uk`)
- obtains and renews TLS certificates

The GDS Reliability Engineering team (#reliability-eng):

- maintains centrally-provided services such as Logit and Concourse

If you and your colleagues can’t resolve a technical issue, problem or question, you should escalate it through, in order:

1. The Technical Lead on the team
2. The Lead Developer on the programme
3. The Lead Architect

If Technical 2nd Line instructs you to escalate something to GOV.UK Platform Engineering, raise a ticket on Zendesk and assign it to the `3rd Line--GDS Reliability Engineering` queue. You should also raise a ticket if the issue is related to an ongoing incident for tracking purposes, but you can speak to the team directly to get it more immediate attention.

If you speak to GOV.UK Platform Engineering about a process only they know about, they will work with you to document the process for all of GOV.UK.
- manages the Kubernetes clusters and base images on which GOV.UK applications run
- works on long-term improvements to the efficiency and reliability of GOV.UK
- supports CI/CD (build, rollout, release) automation
- can offer advice on monitoring and alerting
- can offer design reviews and advice to help build your application for
reliability, robustness and low maintenance (especially at the early stages of
the software lifecycle)
- can offer advice and assistance with changes such as migrating from one
database to another as safely and efficiently as possible

If you and your colleagues can’t resolve a technical issue, problem or question, you should try talking with (in this order):

1. Your tech lead (TL)
1. [#govuk-tech-leads](https://gds.slack.com/channels/govuk-tech-leads)
1. The [senior tech team](https://groups.google.com/a/digital.cabinet-office.gov.uk/g/govuk-senior-tech-members/members)
178 changes: 26 additions & 152 deletions source/manual/common-aws-tasks-for-2nd-line-support.html.md
@@ -10,7 +10,7 @@ parent: "/manual.html"
This document details some of the tasks that GOV.UK Technical 2nd Line may
carry out regarding AWS.

## Logging into AWS
## Logging into the AWS web console

Once you've [set up AWS access](/manual/get-started.html#9-access-aws-for-the-first-time), you can log into the AWS console for the relevant environment by running:

@@ -24,7 +24,7 @@ See [these notes](https://github.com/alphagov/govuk-aws-data/blob/main/data/infr
gds aws govuk-integration-readonly -l
```

It will then ask you to supply your AWS vault password, followed by your 2FA code.
It will then ask you to supply your aws-vault password, followed by your 2FA code.

## Getting help with AWS

@@ -39,38 +39,33 @@ We pay AWS for Premium Support. You are strongly encouraged to contact AWS to
help you solve problems when using AWS products. Contacting AWS Support is a
very common procedure.

See our documentation on [how to escalate to AWS support][].

[how to escalate to AWS support]: https://docs.publishing.service.gov.uk/manual/how-to-escalate-to-AWS-support.html
See our documentation on [how to escalate to AWS
support](/manual/how-to-escalate-to-AWS-support.html).

### Internal support

Usually, Technical 2nd Line should be able to investigate issues related to AWS.

During working hours, one of the Site Reliability Engineers on the Platform Engineering
team may be able to provide advice or expert knowledge. Outside of working hours,
you should escalate to AWS support if the engineers on call can't resolve an
issue themselves.
Usually, 2nd-line Tech Support should be able to investigate issues related to AWS.

If you are experiencing an incident, refer to the
[So, you're having an incident][] documentation.
During working hours, one of the Site Reliability Engineers (SREs) on the
Platform Engineering or Platform Security and Reliability teams may be able to
provide advice or expert knowledge. Outside office hours, you should escalate
to AWS Support if the engineers on call can't resolve an issue themselves.

[So, you're having an incident]: https://docs.publishing.service.gov.uk/manual/incident-what-to-do.html
If you are experiencing an incident, refer to the [So, you're having an
incident](/manual/incident-what-to-do.html) documentation.

## Troubleshooting

### How to view ALB metrics

You can see metrics for load balancers in CloudWatch. See the AWS documentation
on [load balancer CloudWatch metrics][] for more detail.
on [load balancer CloudWatch metrics] for more detail.

[load balancer CloudWatch metrics]: https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-cloudwatch-metrics.html

### How to query Athena logs

See the documentation on [how to query CDN logs][].

[how to query cdn logs]: https://docs.publishing.service.gov.uk/manual/query-cdn-logs.html
See [how to query CDN logs](/manual/query-cdn-logs.html).

### How to identify AWS managed DB performance issues

@@ -79,134 +74,13 @@ worth having a look at the CloudWatch metrics for the service. This might tell
you which resource is the limiting factor impacting performance.

It is also worth looking at AWS's troubleshooting documentation, such as
the [DocumentDB documentation][].
the [DocumentDB documentation].

[DocumentDB documentation]: https://docs.aws.amazon.com/documentdb/latest/developerguide/user_diagnostics.html

## How to analyse and find out the performance limiting factor of an EC2 instance

EC2 instances have limits on CPU, memory, network, and storage, which can
impact application performance. To identify the machine resource that is the
performance limiting factor, visit the [machine metrics dashboard][] (in the
relevant environment!), which shows CPU, memory, disk, and TCP stats.
This should help you find bottlenecks.

You can also see the [EC2 troubleshooting documentation][] and contact AWS Support.

[machine metrics dashboard]: https://grafana.blue.production.govuk.digital/dashboard/file/machine.json?refresh=1m&orgId=1
[EC2 troubleshooting documentation]: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-troubleshoot.html

## Detaching an instance from an Auto Scaling Group

Log into the AWS console for the relevant environment:

```
gds aws govuk-integration-poweruser -l
gds aws govuk-staging-poweruser -l
gds aws govuk-production-poweruser -l
```

First, make a note of the instance ID in case you need to investigate the logs later
following an incident. If you forget to do this, follow the instructions below on
how to find a detached instance.

Refer to the AWS documentation for steps on [how to detach an instance from an ASG][].

If a machine is unhealthy, you may want to detach an instance from its Auto
Scaling Group (ASG). Detaching the instance stops it receiving requests.

Avoid removing an instance from an ALB Target Group if the target group's
instances are managed by an ASG. Instead, simply remove the instance from the
ASG. Removing an instance from an associated ASG will remove the instance
from all target groups automatically (conversely, adding an instance to an
ASG will add the instance to associated target groups).

When detaching an instance from an ASG, you can choose whether to replace that
instance in the ASG via a checkbox in the detach confirmation window: 'Add a
new instance to the Auto Scaling group to balance the load'. You will need to
select this option, otherwise an error will likely be thrown because, by
default, the ASG's minimum capacity is equal to the number of instances
running (desired capacity).

_**Note**: check if the ASG has just a single instance. If so, removing the
instance from the ASG may cause downtime for a service._

The detached instance will stick around, so make sure to
terminate it when you no longer need it. See the section below for further details.

[how to detach an instance from an ASG]: https://docs.aws.amazon.com/autoscaling/ec2/userguide/detach-instance-asg.html#detach-instance-console

## How to find a detached instance

If you made a note of the instance ID before you detached it, you should be able to find it in
the console. If you forgot this step, you can work out which instance is detached by comparing
the list under 'instances' (which displays all instances for the application) with the list in
the application's ASG (which displays only attached instances).

## Logging into a detached instance

When you've found the detached instance in the AWS console, you can click the instance ID to get the summary. This includes the private IP. Copy this, and use it with the `gds govuk connect` command instead of an app/machine type, for example:

`gds govuk connect ssh -e integration ip-10-1-4-96.eu-west-1.compute.internal`

## Terminating an instance

Refer to the AWS documentation for steps on [how to terminate an instance][].

If you do not want a detached instance to serve traffic again in future, you
should terminate it. Detaching an instance from an ASG does not terminate the
instance automatically. Once connections have drained from a detached instance,
you can terminate the instance via the AWS EC2 user interface.

[how to terminate an instance]: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/terminating-instances.html#terminating-instances-console

Note that when you've detached and terminated an instance, you may get Sentry alerts from Puppet about healthchecks for the affected application(s) not
returning OK, because Puppet still expects the old instance to be healthy. These should disappear when Puppet next runs (every 30 minutes) and picks up the
new instance.

## How to scale up vertically

You can increase the resources for EC2 instances by modifying the instance type in govuk-aws-data.

We typically make this change very gradually. The process usually looks like this:

1. Merge a PR changing the instance type (example: https://github.com/alphagov/govuk-aws-data/pull/827)
2. Apply the Terraform change to all environments
3. In the AWS UI, cycle the running instances gradually, to avoid downtime.

Cycling an instance might involve removing the smaller instances one by one,
waiting for connections to the instance to drain, and then killing the
instance. This will prompt AWS to bring up the new bigger instances
automatically.

Be careful here: [provisioning a machine](#how-are-instances-provisioned) is
slow, flaky and requires manual intervention (for example, sometimes we need to
manually run puppet on a new instance, and we must deploy apps to new instances
from Jenkins). Check that the new instances are healthy before removing the
healthy old instances.

## How to scale horizontally

You can increase the number of EC2 instances running in an ASG via a config
change in govuk-aws and govuk-aws-data.

In an emergency, you can also [manually bump up the number of instances in the
ASG via the AWS console][], but you're encouraged to use Terraform to do this,
since you'll need to make the change in Terraform anyway.

Here's an example set of PRs:

* https://github.com/alphagov/govuk-aws-data/pull/805
* https://github.com/alphagov/govuk-aws/pull/1385

You will need to apply Terraform for this to have an effect.

[manually bump up the number of instances in the ASG via the AWS console]: https://docs.publishing.service.gov.uk/manual/auto-scaling-groups.html#removing-a-specific-instance

## How to restore an AWS managed DB from a backup

View the documentation on [how to backup and restore in AWS RDS][].
View the documentation on [how to backup and restore in AWS RDS].

[how to backup and restore in AWS RDS]: https://docs.publishing.service.gov.uk/manual/howto-backup-and-restore-in-aws-rds.html

@@ -237,28 +111,28 @@ old persistent disk

### How do we do DNS?

See the documentation on [how GOV.UK does DNS][].
GOV.UK is effectively a DNS registrar for some third-level domain names, for
example service.gov.uk.

GOV.UK manages the DNS for other domain names used by third parties, for example
service.gov.uk. At the moment this will remain with the SREs until further
decisions have been made.
See [how GOV.UK does DNS](/manual/dns.html).

[how GOV.UK does DNS]: https://docs.publishing.service.gov.uk/manual/dns.html
## How are EC2 instances (legacy infrastructure) provisioned?

## How are instances provisioned?
**As of Mar 2023, only Crawler, CKAN (for data.gov.uk) and Licensing still use
the legacy EC2 infrastructure.**

We typically use Terraform config in [govuk-aws][] to provision infrastructure,
such as EC2 instances.
The legacy EC2 infrastructure is configured via Terraform code in the
[govuk-aws] repo.

There are a few exceptions to this, such as ad-hoc instances started from
Concourse via the AWS CLI - these are mainly for data science projects.

We use [userdata scripts][] to run commands on our instances at launch. These
We use [userdata scripts] to run commands on our instances at launch. These
scripts install various core bits of software needed by a particular instance
and then typically use [govuk-puppet][] to provision our instances.
and then typically use [govuk-puppet] to provision our instances.

Finally, new instances send Jenkins their Fully Qualified Domain Name (FQDN)
and puppet class. Jenkins automatically [deploys apps][] to newly provisioned
and puppet class. Jenkins automatically [deploys apps] to newly provisioned
instances.

[govuk-aws]: https://github.com/alphagov/govuk-aws