Skip to content

Latest commit

 

History

History
251 lines (207 loc) · 39.4 KB

bigquery.md

File metadata and controls

251 lines (207 loc) · 39.4 KB

BigQuery

Setting up the BigQuery destination connector involves setting up the data loading method (BigQuery Standard method and Google Cloud Storage bucket) and configuring the BigQuery destination connector using the Airbyte UI.

This page guides you through setting up the BigQuery destination connector.

Prerequisites

Connector modes

While setting up the connector, you can configure it in the following modes:

  • BigQuery: Produces a normalized output by storing the JSON blob data in _airbyte_raw_* tables and then transforming and normalizing the data into separate tables, potentially exploding nested streams into their own tables if basic normalization is configured.
  • BigQuery (Denormalized): Leverages BigQuery capabilities with Structured and Repeated fields to produce a single "big" table per stream. Airbyte does not support normalization for this option at this time.

Setup guide

Step 1: Set up a data loading method

Although you can load data using BigQuery's INSERTS, we highly recommend using a Google Cloud Storage bucket not only for performance and cost but reliability since larger datasets are prone to more failures when using standard inserts.

(Recommended) Using a Google Cloud Storage bucket

To use a Google Cloud Storage bucket:

  1. Create a Cloud Storage bucket with the Protection Tools set to none or Object versioning. Make sure the bucket does not have a retention policy.
  2. Create an HMAC key and access ID.
  3. Grant the Storage Object Admin role to the Google Cloud Service Account.
  4. Make sure your Cloud Storage bucket is accessible from the machine running Airbyte. The easiest way to verify if Airbyte is able to connect to your bucket is via the check connection tool in the UI.

Your bucket must be encrypted using a Google-managed encryption key (this is the default setting when creating a new bucket). We currently do not support buckets using customer-managed encryption keys (CMEK). You can view this setting under the "Configuration" tab of your GCS bucket, in the Encryption type row.

Using INSERT

You can use BigQuery's INSERT statement to upload data directly from your source to BigQuery. While this is faster to set up initially, we strongly recommend not using this option for anything other than a quick demo. Due to the Google BigQuery SDK client limitations, using INSERT is 10x slower than using a Google Cloud Storage bucket, and you may see some failures for big datasets and slow sources (For example, if reading from a source takes more than 10-12 hours). For more details, refer to #3549

Step 2: Set up the BigQuery connector

  1. Log into your Airbyte Cloud or Airbyte Open Source account.

  2. Click Destinations and then click + New destination.

  3. On the Set up the destination page, select BigQuery or BigQuery (denormalized typed struct) from the Destination type dropdown depending on whether you want to set up the connector in BigQuery or BigQuery (Denormalized) mode.

  4. Enter the name for the BigQuery connector.

  5. For Project ID, enter your Google Cloud project ID.

  6. For Dataset Location, select the location of your BigQuery dataset. :::warning You cannot change the location later. :::

  7. For Default Dataset ID, enter the BigQuery Dataset ID.

  8. For Loading Method, select Standard Inserts or GCS Staging. :::tip We recommend using the GCS Staging option. :::

  9. For Service Account Key JSON (Required for cloud, optional for open-source), enter the Google Cloud Service Account Key in JSON format.

  10. For Transformation Query Run Type (Optional), select interactive to have BigQuery run interactive query jobs or batch to have BigQuery run batch queries.

    :::note Interactive queries are executed as soon as possible and count towards daily concurrent quotas and limits, while batch queries are executed as soon as idle resources are available in the BigQuery shared resource pool. If BigQuery hasn't started the query within 24 hours, BigQuery changes the job priority to interactive. Batch queries don't count towards your concurrent rate limit, making it easier to start many queries at once. :::

  11. For Google BigQuery Client Chunk Size (Optional), use the default value of 15 MiB. Later, if you see networking or memory management problems with the sync (specifically on the destination), try decreasing the chunk size. In that case, the sync will be slower but more likely to succeed.

Supported sync modes

The BigQuery destination connector supports the following sync modes:

  • Full Refresh Sync
  • Incremental - Append Sync
  • Incremental - Append + Deduped

Output schema

Airbyte outputs each stream into its own table in BigQuery. Each table contains three columns:

  • _airbyte_ab_id: A UUID assigned by Airbyte to each event that is processed. The column type in BigQuery is String.
  • _airbyte_emitted_at: A timestamp representing when the event was pulled from the data source. The column type in BigQuery is Timestamp.
  • _airbyte_data: A JSON blob representing the event data. The column type in BigQuery is String.

The output tables in BigQuery are partitioned and clustered by the Time-unit column _airbyte_emitted_at at a daily granularity. Partitions boundaries are based on UTC time. This is useful to limit the number of partitions scanned when querying these partitioned tables, by using a predicate filter (a WHERE clause). Filters on the partitioning column are used to prune the partitions and reduce the query cost. (The parameter Require partition filter is not enabled by Airbyte, but you may toggle it by updating the produced tables.)

BigQuery Naming Conventions

Follow BigQuery Datasets Naming conventions.

Airbyte converts any invalid characters into _ characters when writing data. However, since datasets that begin with _ are hidden on the BigQuery Explorer panel, Airbyte prepends the namespace with n for converted namespaces.

Data type map

Airbyte type BigQuery type BigQuery denormalized type
DATE DATE DATE
STRING (BASE64) STRING STRING
NUMBER FLOAT NUMBER
OBJECT STRING RECORD
STRING STRING STRING
BOOLEAN BOOLEAN BOOLEAN
INTEGER INTEGER INTEGER
STRING (BIG_NUMBER) STRING STRING
STRING (BIG_INTEGER) STRING STRING
ARRAY REPEATED REPEATED
STRING (TIMESTAMP_WITH_TIMEZONE) TIMESTAMP DATETIME
STRING (TIMESTAMP_WITHOUT_TIMEZONE) TIMESTAMP DATETIME

Troubleshooting permission issues

The service account does not have the proper permissions.

  • Make sure the BigQuery service account has BigQuery User and BigQuery Data Editor roles or equivalent permissions as those two roles.
  • If the GCS staging mode is selected, ensure the BigQuery service account has the right permissions to the GCS bucket and path or the Cloud Storage Admin role, which includes a superset of the required permissions.

The HMAC key is wrong.

  • Make sure the HMAC key is created for the BigQuery service account, and the service account has permission to access the GCS bucket and path.

Tutorials

Now that you have set up the BigQuery destination connector, check out the following BigQuery tutorials:

Changelog

Version Date Pull Request Subject
2.0.18 2023-09-26 [#30775](#30775 Increase async block size
2.0.17 2023-09-26 #30696 Attempt unsafe typing operations with an exception clause
2.0.16 2023-09-22 #30697 Improve resiliency to unclean exit during schema change
2.0.15 2023-09-21 #30640 Handle streams with identical name and namespace
2.0.14 2023-09-20 #30069 Staging destination async
2.0.13 2023-09-19 #30592 Internal code changes
2.0.12 2023-09-19 #30319 Improved testing
2.0.11 2023-09-18 #30551 GCS Staging is first loading method option
2.0.10 2023-09-15 #30491 Improve error message display
2.0.9 2023-09-14 #30439 Fix a transient error
2.0.8 2023-09-12 #30364 Add log message
2.0.7 2023-08-29 #29878 Internal code changes
2.0.6 2023-09-05 #29917 Improve performance by changing metadata error array construction from ARRAY_CONCAT to ARRAY_AGG
2.0.5 2023-08-31 #30020 Run typing and deduping tasks in parallel
2.0.4 2023-09-05 #30117 Type and Dedupe at sync start and then every 6 hours
2.0.3 2023-09-01 #30056 Internal refactor, no behavior change
2.0.2 2023-09-01 #30120 Improve performance on very wide streams by skipping SAFE_CAST on strings
2.0.1 2023-08-29 #29972 Publish a new version to supersede old v2.0.0
2.0.0 2023-08-27 #29783 Destinations V2
1.10.2 2023-08-24 #29805 Destinations v2: Don't soft reset in migration
1.10.1 2023-08-23 #29774 Destinations v2: Don't soft reset overwrite syncs
1.10.0 2023-08-21 #29636 Destinations v2: Several Critical Bug Fixes (cursorless dedup, improved floating-point handling, improved special characters handling; improved error handling)
1.9.1 2023-08-21 #28687 Under the hood: Add dependency on Java CDK v0.0.1.
1.9.0 2023-08-17 #29560 Destinations v2: throw an error on disallowed column name prefixes
1.8.1 2023-08-17 #29522 Migration BugFix - ensure raw dataset created
1.8.0 2023-08-17 #29498 Fix checkpointing logic in GCS staging mode
1.7.8 2023-08-15 #29461 Migration BugFix - ensure migration happens before table creation for GCS staging.
1.7.7 2023-08-11 #29381 Destinations v2: Add support for streams with no columns
1.7.6 2023-08-04 #28894 Destinations v2: Add v1 -> v2 migration Logic
1.7.5 2023-08-04 #29106 Destinations v2: handle unusual CDC deletion edge case
1.7.4 2023-08-04 #29089 Destinations v2: improve special character handling in column names
1.7.3 2023-08-03 #28890 Internal code updates; improved testing
1.7.2 2023-08-02 #28976 Fix composite PK handling in v1 mode
1.7.1 2023-08-02 #28959 Destinations v2: Fix CDC syncs in non-dedup mode
1.7.0 2023-08-01 #28894 Destinations v2: Open up early access program opt-in
1.6.0 2023-07-26 #28723 Destinations v2: Change raw table dataset and naming convention
1.5.8 2023-07-25 #28721 Destinations v2: Handle cursor change across syncs
1.5.7 2023-07-24 #28625 Destinations v2: Limit Clustering Columns to 4
1.5.6 2023-07-21 #28580 Destinations v2: Create dataset in user-specified location
1.5.5 2023-07-20 #28490 Destinations v2: Fix schema change detection in OVERWRITE mode when existing table is empty; other code refactoring
1.5.4 2023-07-17 #28382 Destinations v2: Schema Change Detection
1.5.3 2023-07-14 #28345 Increment patch to trigger a rebuild
1.5.2 2023-07-05 #27936 Internal scaffolding change for future development
1.5.1 2023-06-30 #27891 Revert bugged update
1.5.0 2023-06-27 #27781 License Update: Elv2
1.4.6 2023-06-28 #27268 Internal scaffolding change for future development
1.4.5 2023-06-21 #27555 Reduce image size
1.4.4 2023-05-25 #26585 Small tweak in logs for clarity
1.4.3 2023-05-17 #26213 Fix bug in parsing file buffer config count
1.4.2 2023-05-10 #25925 Testing update. Normalization tests are now done in the destination container.
1.4.1 2023-05-11 #25993 Internal library update
1.4.0 2023-04-29 #25570 Internal library update. Bumping version to stay in sync with BigQuery-denormalized.
1.3.4 2023-04-28 #25588 Internal scaffolding change for future development
1.3.3 2023-04-27 #25346 Internal code cleanup
1.3.1 2023-04-20 #25097 Internal scaffolding change for future development
1.3.0 2023-04-19 #25287 Add parameter to configure the number of file buffers when GCS is used as the loading method
1.2.20 2023-04-12 #25122 Add additional data centers
1.2.19 2023-03-29 #24671 Fail faster in certain error cases
1.2.18 2023-03-23 #24447 Set the Service Account Key JSON field to always_show: true so that it isn't collapsed into an optional fields section
1.2.17 2023-03-17 #23788 S3-Parquet: added handler to process null values in arrays
1.2.16 2023-03-10 #23931 Added support for periodic buffer flush
1.2.15 2023-03-10 #23466 Changed S3 Avro type from Int to Long
1.2.14 2023-02-08 #22497 Fixed table already exists error
1.2.13 2023-01-26 #20631 Added support for destination checkpointing with staging
1.2.12 2023-01-18 #21087 Wrap Authentication Errors as Config Exceptions
1.2.11 2023-01-18 #21144 Added explicit error message if sync fails due to a config issue
1.2.9 2022-12-14 #20501 Report GCS staging failures that occur during connection check
1.2.8 2022-11-22 #19489 Added non-billable projects handle to check connection stage
1.2.7 2022-11-11 #19358 Fixed check method to capture mismatch dataset location
1.2.6 2022-11-10 #18554 Improve check connection method to handle more errors
1.2.5 2022-10-19 #18162 Improve error logs
1.2.4 2022-09-26 #16890 Add user-agent header
1.2.3 2022-09-22 #17054 Respect stream namespace
1.2.1 2022-09-14 #15668 (bugged, do not use) Wrap logs in AirbyteLogMessage
1.2.0 2022-09-09 #14023 (bugged, do not use) Cover arrays only if they are nested
1.1.16 2022-09-01 #16243 Fix Json to Avro conversion when there is field name clash from combined restrictions (anyOf, oneOf, allOf fields)
1.1.15 2022-08-22 #15787 Throw exception if job failed
1.1.14 2022-08-03 #14784 Enabling Application Default Credentials
1.1.13 2022-08-02 #14801 Fix multiple log bindings
1.1.12 2022-08-02 #15180 Fix standard loading mode
1.1.11 2022-06-24 #14114 Remove "additionalProperties": false from specs for connectors with staging
1.1.10 2022-06-16 #13852 Updated stacktrace format for any trace message errors
1.1.9 2022-06-17 #13753 Deprecate and remove PART_SIZE_MB fields from connectors based on StreamTransferManager
1.1.8 2022-06-07 #13579 Always check GCS bucket for GCS loading method to catch invalid HMAC keys.
1.1.7 2022-06-07 #13424 Reordered fields for specification.
1.1.6 2022-05-15 #12768 Clarify that the service account key json field is required on cloud.
1.1.5 2022-05-12 #12805 Updated to latest base-java to emit AirbyteTraceMessage on error.
1.1.4 2022-05-04 #12578 In JSON to Avro conversion, log JSON field values that do not follow Avro schema for debugging.
1.1.3 2022-05-02 #12528 Update Dataset location field description
1.1.2 2022-04-29 #12477 Dataset location is a required field
1.1.1 2022-04-15 #12068 Fixed bug with GCS bucket conditional binding
1.1.0 2022-04-06 #11776 Use serialized buffering strategy to reduce memory consumption.
1.0.2 2022-03-30 #11620 Updated spec
1.0.1 2022-03-24 #11350 Improve check performance
1.0.0 2022-03-18 #11238 Updated spec and documentation
0.6.12 2022-03-18 #10793 Fix namespace with invalid characters
0.6.11 2022-03-03 #10755 Make sure to kill children threads and stop JVM
0.6.8 2022-02-14 #10256 Add -XX:+ExitOnOutOfMemoryError JVM option
0.6.6 2022-02-01 #9959 Fix null pointer exception from buffered stream consumer.
0.6.6 2022-01-29 #9745 Integrate with Sentry.
0.6.5 2022-01-18 #9573 BigQuery Destination : update description for some input fields
0.6.4 2022-01-17 #8383 Support dataset-id prefixed by project-id
0.6.3 2022-01-12 #9415 BigQuery Destination : Fix GCS processing of Facebook data
0.6.2 2022-01-10 #9121 Fixed check method for GCS mode to verify if all roles assigned to user
0.6.1 2021-12-22 #9039 Added part_size configuration to UI for GCS staging
0.6.0 2021-12-17 #8788 BigQuery/BiqQuery denorm Destinations : Add possibility to use different types of GCS files
0.5.1 2021-12-16 #8816 Update dataset locations
0.5.0 2021-10-26 #7240 Output partitioned/clustered tables
0.4.1 2021-10-04 #6733 Support dataset starting with numbers
0.4.0 2021-08-26 #5296 Added GCS Staging uploading option
0.3.12 2021-08-03 #3549 Add optional arg to make a possibility to change the BigQuery client's chunk\buffer size
0.3.11 2021-07-30 #5125 Enable additionalPropertities in spec.json
0.3.10 2021-07-28 #3549 Add extended logs and made JobId filled with region and projectId
0.3.9 2021-07-28 #5026 Add sanitized json fields in raw tables to handle quotes in column names
0.3.6 2021-06-18 #3947 Service account credentials are now optional.
0.3.4 2021-06-07 #3277 Add dataset location option