Remove duplicate S3 regions from destination specs (#36846)
evantahler committed Apr 5, 2024
1 parent 9d53013 commit 4342182
Showing 9 changed files with 38 additions and 27 deletions.
destination-databricks metadata.yaml

@@ -2,7 +2,7 @@ data:
   connectorSubtype: database
   connectorType: destination
   definitionId: 072d5540-f236-4294-ba7c-ade8fd918496
-  dockerImageTag: 1.1.1
+  dockerImageTag: 1.1.2
   dockerRepository: airbyte/destination-databricks
   githubIssueLabel: destination-databricks
   icon: databricks.svg
destination-databricks spec.json

@@ -156,7 +156,6 @@
   "me-central-1",
   "me-south-1",
   "sa-east-1",
-  "sa-east-1",
   "us-east-1",
   "us-east-2",
   "us-gov-east-1",
destination-iceberg metadata.yaml

@@ -2,7 +2,7 @@ data:
   connectorSubtype: database
   connectorType: destination
   definitionId: df65a8f3-9908-451b-aa9b-445462803560
-  dockerImageTag: 0.1.5
+  dockerImageTag: 0.1.6
   dockerRepository: airbyte/destination-iceberg
   githubIssueLabel: destination-iceberg
   license: MIT
destination-iceberg spec.json

@@ -241,7 +241,6 @@
   "me-central-1",
   "me-south-1",
   "sa-east-1",
-  "sa-east-1",
   "us-east-1",
   "us-east-2",
   "us-gov-east-1",
destination-redshift metadata.yaml

@@ -5,7 +5,7 @@ data:
   connectorSubtype: database
   connectorType: destination
   definitionId: f7a7d195-377f-cf5b-70a5-be6b819019dc
-  dockerImageTag: 2.4.0
+  dockerImageTag: 2.4.1
   dockerRepository: airbyte/destination-redshift
   documentationUrl: https://docs.airbyte.com/integrations/destinations/redshift
   githubIssueLabel: destination-redshift
destination-redshift spec.json

@@ -138,7 +138,6 @@
   "me-central-1",
   "me-south-1",
   "sa-east-1",
-  "sa-east-1",
   "us-east-1",
   "us-east-2",
   "us-gov-east-1",
1 change: 1 addition & 0 deletions docs/integrations/destinations/databricks.md
@@ -345,6 +345,7 @@ Delta Lake tables are created. You may want to consult the tutorial on

 | Version | Date | Pull Request | Subject |
 | :------ | :--------- | :------------------------------------------------------------------------------------------------------------------ | :----------------------------------------------------------------------------------------------------------------------- |
+| 1.1.2 | 2024-04-04 | [#36846](https://github.com/airbytehq/airbyte/pull/36846) | (incompatible with CDK, do not use) Remove duplicate S3 Region |
 | 1.1.1 | 2024-01-03 | [#33924](https://github.com/airbytehq/airbyte/pull/33924) | (incompatible with CDK, do not use) Add new ap-southeast-3 AWS region |
 | 1.1.0 | 2023-06-02 | [\#26942](https://github.com/airbytehq/airbyte/pull/26942) | Support schema evolution |
 | 1.0.2 | 2023-04-20 | [\#25366](https://github.com/airbytehq/airbyte/pull/25366) | Fix default catalog to be `hive_metastore` |
1 change: 1 addition & 0 deletions docs/integrations/destinations/iceberg.md
@@ -61,6 +61,7 @@ specify the target size of compacted Iceberg data file.

 | Version | Date | Pull Request | Subject |
 | :------ | :--------- | :-------------------------------------------------------- | :--------------------------------------------------------- |
+| 0.1.6 | 2024-04-04 | [#36846](https://github.com/airbytehq/airbyte/pull/36846) | Remove duplicate S3 Region |
 | 0.1.5 | 2024-01-03 | [#33924](https://github.com/airbytehq/airbyte/pull/33924) | Add new ap-southeast-3 AWS region |
 | 0.1.4 | 2023-07-20 | [28506](https://github.com/airbytehq/airbyte/pull/28506) | Support server-managed storage config |
 | 0.1.3 | 2023-07-12 | [28158](https://github.com/airbytehq/airbyte/pull/28158) | Bump Iceberg library to 1.3.0 and add REST catalog support |
54 changes: 33 additions & 21 deletions docs/integrations/destinations/redshift.md
@@ -29,7 +29,8 @@ For INSERT strategy:
 2. COPY: Replicates data by first uploading data to an S3 bucket and issuing a COPY command. This is
    the recommended loading approach described by Redshift
    [best practices](https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-single-copy-command.html).
-   Requires an S3 bucket and credentials. Data is copied into S3 as multiple files with a manifest file.
+   Requires an S3 bucket and credentials. Data is copied into S3 as multiple files with a manifest
+   file.

 Airbyte automatically picks an approach depending on the given configuration - if S3 configuration
 is present, Airbyte will use the COPY strategy and vice versa.
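
As an illustration of the COPY-with-manifest pattern described above, a hand-written equivalent might look like the following sketch. The table name, bucket, path, and IAM role are hypothetical; Airbyte generates and issues the real command internally:

```sql
-- Illustrative only: load the staged files listed in a manifest into a raw table.
COPY airbyte_internal.my_stream_raw
FROM 's3://my-staging-bucket/airbyte/my_stream/manifest.json'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-load-role'
MANIFEST            -- treat the S3 object as a manifest listing the data files
CSV GZIP            -- staged files are compressed CSV in this sketch
TIMEFORMAT 'auto';
```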
@@ -69,11 +70,14 @@ Optional parameters:
   (`ab_id`, `data`, `emitted_at`). Normally these files are deleted after the `COPY` command
   completes; if you want to keep them for other purposes, set `purge_staging_data` to `false`.
 - **File Buffer Count**
-  - Number of file buffers allocated for writing data. Increasing this number is beneficial for connections using Change Data Capture (CDC) and up to the number of streams within a connection. Increasing the number of file buffers past the maximum number of streams has deteriorating effects.
+  - Number of file buffers allocated for writing data. Increasing this number is beneficial for
+    connections using Change Data Capture (CDC), up to the number of streams within a connection.
+    Increasing the number of file buffers past the maximum number of streams degrades performance.

-NOTE: S3 staging does not use the SSH Tunnel option for copying data, if configured. SSH Tunnel supports the SQL
-connection only. S3 is secured through public HTTPS access only. Subsequent typing and deduping queries on final table
-are executed over using provided SSH Tunnel configuration.
+NOTE: If an SSH Tunnel is configured, S3 staging does not use it for copying data; SSH Tunnel
+supports the SQL connection only. S3 is secured through public HTTPS access only. Subsequent typing
+and deduping queries on the final table are executed using the provided SSH Tunnel configuration.

 ## Step 1: Set up Redshift

@@ -92,14 +96,16 @@ are executed over using provided SSH Tunnel configuration.
   staging S3 bucket \(for the COPY strategy\).

 ### Permissions in Redshift
-Airbyte writes data into two schemas, whichever schema you want your data to land in, e.g. `my_schema`
-and a "Raw Data" schema that Airbyte uses to improve ELT reliability. By default, this raw data schema
-is `airbyte_internal` but this can be overridden in the Redshift Destination's advanced settings.
-Airbyte also needs to query Redshift's
+
+Airbyte writes data into two schemas: whichever schema you want your data to land in (e.g.
+`my_schema`) and a "Raw Data" schema that Airbyte uses to improve ELT reliability. By default, this
+raw data schema is `airbyte_internal`, but this can be overridden in the Redshift Destination's
+advanced settings. Airbyte also needs to query Redshift's
 [SVV_TABLE_INFO](https://docs.aws.amazon.com/redshift/latest/dg/r_SVV_TABLE_INFO.html) table for
 metadata about the tables Airbyte manages.

 To ensure the `airbyte_user` has the correct permissions to:
+
 - create schemas in your database
 - grant usage to any existing schemas you want Airbyte to use
 - grant select to the `svv_table_info` table
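
A minimal sketch of those grants follows, assuming a hypothetical database `my_database`, schema `my_schema`, and a user named `airbyte_user`; adapt the names to your setup:

```sql
-- Illustrative grants matching the list above.
GRANT CREATE ON DATABASE my_database TO airbyte_user;  -- lets Airbyte create schemas
GRANT USAGE ON SCHEMA my_schema TO airbyte_user;       -- lets Airbyte use an existing schema
GRANT SELECT ON TABLE svv_table_info TO airbyte_user;  -- lets Airbyte read table metadata
```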
@@ -187,14 +193,19 @@ characters.
 ### Data Size Limitations

 Redshift specifies a maximum limit of 16MB (and 65535 bytes for any VARCHAR fields within the JSON
-record) to store the raw JSON record data. Thus, when a row is too big to fit, the destination connector will
-do one of the following.
-1. Null the value if the varchar size > 65535, The corresponding key information is added to `_airbyte_meta`.
-2. Null the whole record while trying to preserve the Primary Keys and cursor field declared as part of your stream configuration, if the total record size is > 16MB.
-   * For DEDUPE sync mode, if we do not find Primary key(s), we fail the sync.
-   * For OVERWRITE and APPEND mode, syncs will succeed with empty records emitted, if we fail to find Primary key(s).
+record) to store the raw JSON record data. Thus, when a row is too big to fit, the destination
+connector will do one of the following:
+
+1. Null the value if the varchar size > 65535. The corresponding key information is added to
+   `_airbyte_meta`.
+2. Null the whole record while trying to preserve the Primary Keys and cursor field declared as part
+   of your stream configuration, if the total record size is > 16MB.
+   - For DEDUPE sync mode, if we do not find Primary key(s), we fail the sync.
+   - For OVERWRITE and APPEND mode, syncs will succeed with empty records emitted, if we fail to
+     find Primary key(s).

-See AWS docs for [SUPER](https://docs.aws.amazon.com/redshift/latest/dg/r_SUPER_type.html) and [SUPER limitations](https://docs.aws.amazon.com/redshift/latest/dg/limitations-super.html).
+See AWS docs for [SUPER](https://docs.aws.amazon.com/redshift/latest/dg/r_SUPER_type.html) and
+[SUPER limitations](https://docs.aws.amazon.com/redshift/latest/dg/limitations-super.html).
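
For example, records whose values were nulled for size reasons can be surfaced by inspecting `_airbyte_meta`. This is a sketch that assumes `_airbyte_meta` is a SUPER column carrying a `changes` array; the table name is hypothetical:

```sql
-- Illustrative query: find rows whose values were altered during loading.
SELECT _airbyte_raw_id, _airbyte_meta
FROM my_schema.my_stream
WHERE _airbyte_meta.changes IS NOT NULL;  -- SUPER dot navigation into the meta blob
```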

 ### Encryption

@@ -208,15 +219,15 @@ Each stream will be output into its own raw table in Redshift. Each table will c
   Redshift is `VARCHAR`.
 - `_airbyte_extracted_at`: a timestamp representing when the event was pulled from the data source.
   The column type in Redshift is `TIMESTAMP WITH TIME ZONE`.
-- `_airbyte_loaded_at`: a timestamp representing when the row was processed into final table.
-  The column type in Redshift is `TIMESTAMP WITH TIME ZONE`.
+- `_airbyte_loaded_at`: a timestamp representing when the row was processed into the final table. The
+  column type in Redshift is `TIMESTAMP WITH TIME ZONE`.
 - `_airbyte_data`: a JSON blob representing the event data. The column type in Redshift is
   `SUPER`.
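
Put together, a raw table therefore has roughly this shape. This is a sketch of the layout described above, not the exact DDL the connector runs, and the table name is hypothetical:

```sql
-- Approximate shape of an Airbyte raw table in Redshift.
CREATE TABLE IF NOT EXISTS airbyte_internal.my_schema_raw__stream_users (
    _airbyte_raw_id       VARCHAR,      -- record identifier
    _airbyte_extracted_at TIMESTAMPTZ,  -- when the event left the source
    _airbyte_loaded_at    TIMESTAMPTZ,  -- when the row reached the final table
    _airbyte_data         SUPER         -- the record payload as semi-structured data
);
```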

 ## Data type map

 | Airbyte type | Redshift type |
-|:------------------------------------|:---------------------------------------|
+| :---------------------------------- | :------------------------------------- |
 | STRING | VARCHAR |
 | STRING (BASE64) | VARCHAR |
 | STRING (BIG_NUMBER) | VARCHAR |
@@ -235,7 +246,8 @@ Each stream will be output into its own raw table in Redshift. Each table will c
 ## Changelog

 | Version | Date | Pull Request | Subject |
-|:--------|:-----------|:-----------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| :------ | :--------- | :--------------------------------------------------------- | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| 2.4.1 | 2024-04-04 | [#36846](https://github.com/airbytehq/airbyte/pull/36846) | Remove duplicate S3 Region |
 | 2.4.0 | 2024-03-21 | [\#36589](https://github.com/airbytehq/airbyte/pull/36589) | Adapt to Kotlin cdk 0.28.19 |
 | 2.3.2 | 2024-03-21 | [\#36374](https://github.com/airbytehq/airbyte/pull/36374) | Suppress Jooq DataAccessException error message in logs |
 | 2.3.1 | 2024-03-18 | [\#36255](https://github.com/airbytehq/airbyte/pull/36255) | Mark as Certified-GA |
@@ -297,7 +309,7 @@ Each stream will be output into its own raw table in Redshift. Each table will c
 | 0.3.55 | 2023-01-26 | [\#20631](https://github.com/airbytehq/airbyte/pull/20631) | Added support for destination checkpointing with staging |
 | 0.3.54 | 2023-01-18 | [\#21087](https://github.com/airbytehq/airbyte/pull/21087) | Wrap Authentication Errors as Config Exceptions |
 | 0.3.53 | 2023-01-03 | [\#17273](https://github.com/airbytehq/airbyte/pull/17273) | Flatten JSON arrays to fix maximum size check for SUPER field |
-| 0.3.52 | 2022-12-30 | [\#20879](https://github.com/airbytehq/airbyte/pull/20879) | Added configurable parameter for number of file buffers (⛔ this version has a bug and will not work; use `0.3.56` instead) |
+| 0.3.52 | 2022-12-30 | [\#20879](https://github.com/airbytehq/airbyte/pull/20879) | Added configurable parameter for number of file buffers (⛔ this version has a bug and will not work; use `0.3.56` instead) |
 | 0.3.51 | 2022-10-26 | [\#18434](https://github.com/airbytehq/airbyte/pull/18434) | Fix empty S3 bucket path handling |
 | 0.3.50 | 2022-09-14 | [\#15668](https://github.com/airbytehq/airbyte/pull/15668) | Wrap logs in AirbyteLogMessage |
 | 0.3.49 | 2022-09-01 | [\#16243](https://github.com/airbytehq/airbyte/pull/16243) | Fix Json to Avro conversion when there is field name clash from combined restrictions (`anyOf`, `oneOf`, `allOf` fields) |
