---
title: GOV.UK S3 Mirror
weight: 10
last_reviewed_on: 2022-09-14
review_in: 6 months
---
# GOV.UK S3 Mirror
The GOV.UK S3 Mirror project in GCP (Google Cloud Platform) copies data from
some AWS S3 buckets into GCP Cloud Storage buckets.
The GOV.UK S3 Mirror allows you to:
- access data from the GOV.UK programme by using GCP instead of AWS
- trigger workflows in GCP projects when new data becomes available from GOV.UK
## Data sources
The GOV.UK S3 Mirror has 1 data source:
- the [`govuk-integration-database-backups` S3
bucket](https://s3.console.aws.amazon.com/s3/object/govuk-data-infrastructure-integration),
which contains nightly backups of databases, such as the Content Store, the
Publishing API, and the Support API
## Data products
The GOV.UK S3 Mirror has 1 data product:
- the [`govuk-s3-mirror_govuk-integration-database-backups` Cloud Storage
bucket](https://console.cloud.google.com/storage/browser/govuk-s3-mirror_govuk-integration-database-backups)
Currently, the bucket contains backup files of four databases:
- the production Content Store
- the draft Content Store
- the Publishing API
- the Support API
More backups could be included from this S3 bucket, and from other S3 buckets,
if they are required.
## How to access the data
Members of the following Google Groups have read access to the
[`govuk-s3-mirror_govuk-integration-database-backups` Cloud Storage
bucket](https://console.cloud.google.com/storage/browser/govuk-s3-mirror_govuk-integration-database-backups)
through the role
[`roles/storage.objectViewer`](https://cloud.google.com/storage/docs/access-control/iam-roles#standard-roles):
- `group:data-engineering@digital.cabinet-office.gov.uk`
This is configured in the file
[`terraform/transfer.tf`](https://github.com/alphagov/govuk-s3-mirror/tree/main/terraform/transfer.tf).
Individual access can be given by adding an email address to that file. For
example, `"user:firstname.lastname@digital.cabinet-office.gov.uk"`.
A service account can be given access by adding its email address to that file.
For example, `"serviceAccount:accountname@projectname.iam.gserviceaccount.com"`.
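As an illustration of the member formats above, the following is a minimal sketch of how read access might be expressed in Terraform. The resource type `google_storage_bucket_iam_binding` and the resource name `object_viewers` are assumptions; the actual structure of `terraform/transfer.tf` may differ, so check the file before editing it.

```terraform
# Hypothetical sketch: the real configuration in terraform/transfer.tf may
# use a different resource type or layout.
resource "google_storage_bucket_iam_binding" "object_viewers" {
  bucket = google_storage_bucket.govuk-integration-database-backups.name
  role   = "roles/storage.objectViewer"

  members = [
    # A Google Group
    "group:data-engineering@digital.cabinet-office.gov.uk",
    # An individual
    "user:firstname.lastname@digital.cabinet-office.gov.uk",
    # A service account
    "serviceAccount:accountname@projectname.iam.gserviceaccount.com",
  ]
}
```

Each member string begins with its type (`group:`, `user:` or `serviceAccount:`), followed by the email address.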
## Notifications of new files
The [`govuk-s3-mirror_govuk-integration-database-backups` Cloud Storage
bucket](https://console.cloud.google.com/storage/browser/govuk-s3-mirror_govuk-integration-database-backups)
can publish a notification to a Pub/Sub topic in any project when a new file is
created. That Pub/Sub topic can then trigger other actions. For an example
implementation, see the [govuk-knowledge-graph-gcp
repository](https://github.com/alphagov/govuk-knowledge-graph-gcp).
To create a notification from the bucket, add a Terraform configuration such as
the following to the file `terraform/pubsub.tf` in the `govuk-s3-mirror`
repository.
```terraform
# Reference the topic that is in your project
resource "google_pubsub_topic" "a_name" {
  project                    = "your-project-name"
  name                       = "your-topic-name"
  message_retention_duration = "604800s" # 604800 seconds is 7 days
  message_storage_policy {
    allowed_persistence_regions = [
      var.region,
    ]
  }
}

# Notify the topic from the bucket
resource "google_storage_notification" "another_name" {
  bucket         = google_storage_bucket.govuk-integration-database-backups.name
  payload_format = "JSON_API_V1"
  topic          = google_pubsub_topic.a_name.id
  event_types    = ["OBJECT_FINALIZE"]
}
```
Create the topic in your own project as follows.
```terraform
resource "google_pubsub_topic" "some_name" {
  name                       = "your-topic-name"
  message_retention_duration = "604800s" # 604800 seconds is 7 days
  message_storage_policy {
    allowed_persistence_regions = [
      var.region,
    ]
  }
}

# Allow the govuk-s3-mirror project's bucket to publish to this topic
data "google_iam_policy" "some_other_name" {
  binding {
    role = "roles/pubsub.publisher"
    members = [
      "serviceAccount:service-384988117066@gs-project-accounts.iam.gserviceaccount.com"
    ]
  }
}

resource "google_pubsub_topic_iam_policy" "yet_another_name" {
  topic       = google_pubsub_topic.some_name.name
  policy_data = data.google_iam_policy.some_other_name.policy_data
}
```
To trigger other resources from the Pub/Sub topic, such as Cloud Functions or
Workflows, you must use Eventarc as an intermediary between the Pub/Sub topic
and those resources.
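A trigger of this kind could be sketched as follows for a Workflows destination. This is an illustration, not the configuration used by any existing project: the trigger name and the two variables are hypothetical, and the sketch assumes that the topic `google_pubsub_topic.some_name` and `var.region` are defined in the same Terraform configuration.

```terraform
# Hypothetical inputs: replace with resources from your own project.
variable "workflow_id" {
  description = "ID of an existing Workflows workflow to run (hypothetical)"
  type        = string
}

variable "trigger_service_account_email" {
  description = "Service account that Eventarc uses to invoke the workflow (hypothetical)"
  type        = string
}

# Run the workflow whenever a message is published to the topic
resource "google_eventarc_trigger" "new_backup_file" {
  name     = "new-backup-file" # hypothetical name
  location = var.region

  # Match the event type emitted when a Pub/Sub message is published
  matching_criteria {
    attribute = "type"
    value     = "google.cloud.pubsub.topic.v1.messagePublished"
  }

  # Use the existing topic rather than letting Eventarc create one
  transport {
    pubsub {
      topic = google_pubsub_topic.some_name.id
    }
  }

  destination {
    workflow = var.workflow_id
  }

  service_account = var.trigger_service_account_email
}
```

The service account needs permission to invoke the workflow, for example through the role `roles/workflows.invoker`.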
## Cloud infrastructure
The GOV.UK S3 Mirror runs entirely in GCP. Terraform is used to deploy the
infrastructure. The Terraform configuration is in a [GitHub
repository](https://github.com/alphagov/govuk-s3-mirror). The repository is the
most accurate and up-to-date representation of the infrastructure. This section
gives an overview.
### Storage Transfer Service
The [GCP Storage Transfer
Service](https://cloud.google.com/storage-transfer-service) is used to copy data
from the [`govuk-integration-database-backups` S3
bucket](https://s3.console.aws.amazon.com/s3/object/govuk-data-infrastructure-integration)
into the [`govuk-s3-mirror_govuk-integration-database-backups` Cloud Storage
bucket](https://console.cloud.google.com/storage/browser/govuk-s3-mirror_govuk-integration-database-backups). Every hour, on the hour, the service:
- compares the contents of each bucket
- copies any files that are in the S3 bucket but not in the Cloud Storage bucket
into the Cloud Storage bucket
- deletes any files from the Cloud Storage bucket that don't exist in the S3
bucket
This means that, for the files that are selected for mirroring, the Cloud
Storage bucket contains exactly the same contents as the S3 bucket. GOV.UK
stores only a few recent database backup files in the S3 bucket, so only those
files are available in the Cloud Storage bucket.
Not all the contents of the S3 bucket are currently copied to the Cloud Storage
bucket, because not all are required. If more are required, then they can be
included by editing the `include_prefixes` field in the file
[`terraform/transfer.tf`](https://github.com/alphagov/govuk-s3-mirror/blob/main/terraform/transfer.tf)
in the [GitHub repository](https://github.com/alphagov/govuk-s3-mirror).
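The behaviour described in this section maps onto a `google_storage_transfer_job` resource roughly as in the following sketch. It is not a copy of `terraform/transfer.tf`: the resource name, the prefixes, the start date and the credential variables are all hypothetical, and the way the real job authenticates to AWS may differ.

```terraform
# Hypothetical AWS credentials, supplied as sensitive variables
variable "aws_access_key_id" {
  type      = string
  sensitive = true
}

variable "aws_secret_access_key" {
  type      = string
  sensitive = true
}

resource "google_storage_transfer_job" "example_mirror" {
  description = "Mirror selected prefixes of an S3 bucket (illustrative)"

  transfer_spec {
    # Only objects whose names begin with one of these prefixes are copied
    object_conditions {
      include_prefixes = [
        "example-prefix-one/", # hypothetical
        "example-prefix-two/", # hypothetical
      ]
    }

    transfer_options {
      # Delete objects from the Cloud Storage bucket that are not in the S3 bucket
      delete_objects_unique_in_sink = true
    }

    aws_s3_data_source {
      bucket_name = "govuk-integration-database-backups"
      aws_access_key {
        access_key_id     = var.aws_access_key_id
        secret_access_key = var.aws_secret_access_key
      }
    }

    gcs_data_sink {
      bucket_name = google_storage_bucket.govuk-integration-database-backups.name
    }
  }

  schedule {
    # Placeholder start date (hypothetical)
    schedule_start_date {
      year  = 2024
      month = 1
      day   = 1
    }
    # Start on the hour, then repeat every hour
    start_time_of_day {
      hours   = 0
      minutes = 0
      seconds = 0
      nanos   = 0
    }
    repeat_interval = "3600s"
  }
}
```

To mirror more of the S3 bucket, add another entry to `include_prefixes`.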
### Cloud Storage
The main bucket is the [`govuk-s3-mirror_govuk-integration-database-backups`
Cloud Storage
bucket](https://console.cloud.google.com/storage/browser/govuk-s3-mirror_govuk-integration-database-backups).
The contents of this bucket are kept the same as the contents of the S3 bucket.
The name of the Cloud Storage bucket is the name of the S3 bucket, prefixed with
`govuk-s3-mirror_`.
If mirrors of other S3 buckets are required, then each one should be given a
corresponding Cloud Storage bucket, following the same naming convention.
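As a sketch of that convention, assuming it is the prefix `govuk-s3-mirror_` followed by the S3 bucket name (as the existing bucket name suggests), a new mirror bucket might be declared as follows. The variable, its default value and the resource name are hypothetical, and `var.region` is assumed to be defined elsewhere in the configuration.

```terraform
variable "s3_bucket_name" {
  description = "Name of the S3 bucket to mirror (hypothetical)"
  type        = string
  default     = "another-s3-bucket"
}

# Cloud Storage bucket named after the S3 bucket that it mirrors
resource "google_storage_bucket" "another_mirror" {
  name     = "govuk-s3-mirror_${var.s3_bucket_name}"
  location = var.region
}
```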