-
Notifications
You must be signed in to change notification settings - Fork 2
/
index.html.md.erb
93 lines (57 loc) · 4.53 KB
/
index.html.md.erb
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
---
title: Use BigQuery
weight: 3
last_reviewed_on: 2024-05-22
review_in: 6 months
---
# Use BigQuery
<span style="color:red">This page is a work in progress.</span>
Google BigQuery (often shortened to 'BQ') is a cloud-based SQL data warehouse.
## Access
When requesting access please state what data you need access to.
If in doubt try the #data-engineering Slack channel.
## Understanding tables, datasets, and projects
In BigQuery, data is stored in tables, which are grouped together inside datasets.
Datasets are organised into projects.
For example, our GOV.UK Universal Analytics data is sent to the `govuk-bigquery-analytics.87773428.ga_sessions_intraday_YYYYMMDD` table three times a day. Here,
- `govuk-bigquery-analytics` is the project ID
- `87773428` is the dataset name
- `ga_sessions_intraday_YYYYMMDD` is the table name
There are different types of tables that can be used to store data in BigQuery.
Our [raw Google Analytics data](/data-sources/ga/ga4-bq/#location), for example, is stored in [sharded tables](https://cloud.google.com/bigquery/docs/partitioned-tables#dt_partition_shard).
These table names end with the suffix `YYYYMMDD`, representing the date in year-month-day format.
[Partitioned tables](https://cloud.google.com/bigquery/docs/partitioned-tables) are in use in other datasets.
## Quotas
Several projects have quotas set up to limit the amount of data that can be queried.
The aim of this is twofold. Firstly, this limits our exposure to unlimited costs from queries.
Secondly, this provides a safety net for those working with the data, as you can only wrack up so much in query costs.
The aim is not to be a hindrance to the need to use the data we store so please contact the #data-engineering Slack channel to discuss your needs if the quotas are getting in the way of work.
Specific quotas can be found detailed under the project name on the [GCP page](https://docs.data-community.publishing.service.gov.uk/gcp/).
### How did I query more data than permitted by my quota?
This is a very good question that we are still investigating.
Our current thinking is that the dynamic concurrent query queue could be loaded with queries before the quota is breached.
We think that the quota check occurs before a query is moved to Active.
[Google documentation on query queues](https://cloud.google.com/bigquery/docs/query-queues)
### Why is Looker Studio so data hungry?
Another good question that is under investigation.
Currently Looker Studio does not cache results from BQ, meaning the query has to be run every time.
Caching results is due for general availability release by the end of 2023 (see the [Google documentation](https://cloud.google.com/bigquery/docs/visualize-looker-studio.md#looker-studio-integration)).
In addition to not caching results it seems that when creating a new data connection, Looker Studio connects to a sharded set of tables for the first time it runs a series of queries that reference all shards.
The queries are for a small number fields each (4 or 5) and then query all shards and then include a `LIMIT 100` and this is done multiple times untli all fields are done.
This is essentially a `SELECT * FROM * LIMIT 100`.
This does not appear to be the case when using an existing data connection. In these cases, Looker Studio respects the table_suffix and default date range within the report.
Where possible, use shared pre-existing data connections in Looker Studio.
## Roles
A role is a set of permissions. Users should only have the specific role or permissions they need to use the Google Cloud Platform.
Contact the Data Engineering community on Slack to ask for a role, permission or service account.
Common permissions we use include:
- BigQuery Data Viewer
- BigQuery Job User
- BigQuery Read Session User
BigQuery Data Viewer allows the user to access and view data. When granted at the project level, this means the user can see all the data held within a project.
This does not allow the user to query the data from within that project - a user granted only Data Viewer permissions to a given project would need to query the data from a different project.
This permission can also be applied at dataset level.
BigQuery Job User grants the user the ability to run queries from the project in question.
The data being queried could be held in a different project but the query cost is allocated to the querying project.
BigQuery Read Session User permissions are required if the Storage Read API is used when querying.
More information on roles and IAM permissions can be found in the Google Cloud documentation.