Connector builder: Incremental sync documentation #25238
# Authentication

Authentication allows the connector to check whether it has sufficient permission to fetch data. The authentication feature provides a secure way to configure authentication using a variety of methods.

The credentials themselves (e.g. username and password) are _not_ specified as part of the connector; instead, they are part of the configuration that is specified by the end user when setting up a source based on the connector. During development, it's possible to provide testing credentials in the "Testing values" menu, but those are not saved along with the connector. Credentials that are part of the source configuration are stored securely in your Airbyte instance, while the connector configuration is saved in the regular database.

In the "Authentication" section on the "Global Configuration" page in the connector builder, the authentication method can be specified. This configuration is shared for all streams - it's not possible to use different authentication methods for different streams in the same connector. In case your API uses multiple or custom authentication methods, you can use the [low-code CDK](/connector-development/config-based/low-code-cdk-overview) or the [Python CDK](/connector-development/cdk-python/).

If your API doesn't need authentication, leave it set to "No auth". This means the connector will be able to make requests to the API without providing any credentials, which might be the case for some public open APIs or private APIs only available in local networks.
## Authentication methods

Check the documentation of the API you want to integrate for the authentication method it uses. The following methods are supported in the connector builder:
* [Basic HTTP](#basic-http)
* [Bearer Token](#bearer-token)
* [API Key](#api-key)
* [OAuth](#oauth)

Select the matching authentication method for your API and check the sections below for more information about the individual methods.
### Basic HTTP

If requests are authenticated using the Basic HTTP authentication method, the documentation page will likely contain one of the following keywords:
- "Basic Auth"
- "Basic HTTP"
- "Authorization: Basic"
- "Base64"

The Basic HTTP authentication method is a standard and doesn't require any further configuration. Username and password are set via "Testing values" in the connector builder and by the end user when configuring this connector as a Source.

#### Example

The [Greenhouse API](https://developers.greenhouse.io/harvest.html#introduction) is an API using basic authentication.

Sometimes only a username and no password is required, as for the [Chargebee API](https://apidocs.chargebee.com/docs/api/auth?prod_cat_ver=2) - in these cases, simply leave the password input empty.

In the basic authentication scheme, the supplied username and password are concatenated with a colon `:` and encoded using the base64 algorithm. For username `user` and password `passwd`, the base64-encoding of `user:passwd` is `dXNlcjpwYXNzd2Q=`.

When fetching records, this string is sent as part of the `Authorization` header:
```
curl -X GET \
  -H "Authorization: Basic dXNlcjpwYXNzd2Q=" \
  https://harvest.greenhouse.io/v1/<stream path>
```
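The encoding step can be reproduced in a few lines of Python - shown here only to illustrate the scheme; the connector builder performs it for you:

```python
import base64

# Basic HTTP auth: concatenate username and password with ":" and base64-encode
username, password = "user", "passwd"
encoded = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")

print(f"Authorization: Basic {encoded}")  # Authorization: Basic dXNlcjpwYXNzd2Q=
```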
### Bearer Token

If requests are authenticated using Bearer authentication, the documentation will probably mention "bearer token" or "token authentication". In this scheme, the `Authorization` header of the HTTP request is set to `Bearer <token>`.

Like Basic HTTP authentication, it does not require further configuration. The bearer token can be set via "Testing values" in the connector builder and by the end user when configuring this connector as a Source.

#### Example

The [Sendgrid API](https://docs.sendgrid.com/api-reference/how-to-use-the-sendgrid-v3-api/authentication) and the [Square API](https://developer.squareup.com/docs/build-basics/access-tokens) support Bearer authentication.

When fetching records, the token is sent along as the `Authorization` header:
```
curl -X GET \
  -H "Authorization: Bearer <bearer token>" \
  https://api.sendgrid.com/<stream path>
```
### API Key

The API key authentication method is similar to Bearer authentication, but allows you to configure which HTTP header the API key is sent in. The HTTP header name is part of the connector definition, while the API key itself can be set via "Testing values" in the connector builder as well as when configuring this connector as a Source.

This form of authentication is often called "(custom) header authentication". It only supports setting the token in an HTTP header; for other cases, see the ["Other authentication methods" section](#access-token-as-query-or-body-parameter).

#### Example

The [CoinAPI.io API](https://docs.coinapi.io/market-data/rest-api#authorization) uses API key authentication via the `X-CoinAPI-Key` header.

When fetching records, the API key is included in the request using the configured header:
```
curl -X GET \
  -H "X-CoinAPI-Key: <api-key>" \
  https://rest.coinapi.io/v1/<stream path>
```
### OAuth

The OAuth authentication method implements authentication using an [OAuth2.0 flow with a refresh token grant type](https://oauth.net/2/grant-types/refresh-token/).

In this scheme, the OAuth endpoint of an API is called with a long-lived refresh token that's provided by the end user when configuring this connector as a Source. The refresh token is used to obtain a short-lived access token that's used to make the requests actually extracting records. If the access token expires, the connector automatically requests a new one.

The connector needs to be configured with the endpoint to call to obtain access tokens with the refresh token. The OAuth client id/secret and the refresh token are provided via "Testing values" in the connector builder as well as when configuring this connector as a Source.

Depending on how exactly the refresh endpoint is implemented, additional configuration might be necessary to specify how to request an access token with the right permissions (configuring OAuth scopes and grant type) and how to extract the access token and the expiry date out of the response (configuring the expiry date format and property name as well as the access token property name):
* Scopes - the [OAuth scopes](https://oauth.net/2/scope/) the access token will have access to. If not specified, no scopes are sent along with the refresh token request
* Grant type - the OAuth grant type to request. This should be set to the string mapping to the [refresh token grant type](https://oauth.net/2/grant-types/refresh-token/). If not specified, it's set to `refresh_token`, which is the right value in most cases
* Token expiry property name - the name of the property in the response that contains token expiry information. If not specified, it's set to `expires_in`
* Token expiry property date format - if not specified, the expiry property is interpreted as the number of seconds the access token will be valid
* Access token property name - the name of the property in the response that contains the access token used to do requests. If not specified, it's set to `access_token`

If the API uses a short-lived refresh token that expires after a short amount of time and needs to be refreshed as well, or if other grant types like PKCE are required, it's not possible to use the connector builder with OAuth authentication - check out the [compatibility guide](/connector-development/config-based/connector-builder-compatibility#oauth) for more information.

Keep in mind that the OAuth authentication method does not implement a single-click authentication experience for the end user configuring the connector - it will still be necessary to obtain the client id, client secret and refresh token from the API and manually enter them into the configuration form.
#### Example

The [Square API](https://developer.squareup.com/docs/build-basics/access-tokens#get-an-oauth-access-token) supports OAuth.

In this case, the authentication method has to be configured like this:
* "Token refresh endpoint" is `https://connect.squareup.com/oauth2/token`
* "Token expiry property name" is `expires_at`

When running a sync, the connector first sends the client id, client secret and refresh token to the token refresh endpoint:
```
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{"client_id": "<client id>", "client_secret": "<client secret>", "refresh_token": "<refresh token>", "grant_type": "refresh_token" }' \
  <token refresh endpoint>
```

The response is a JSON object containing an `access_token` property and an `expires_at` property:
```
{"access_token":"<access-token>", "expires_at": "2023-12-12T00:00:00"}
```

The `expires_at` date tells the connector how long the access token can be used - if this point in time has passed, a new access token is requested automatically.

When fetching records, the access token is sent along as part of the `Authorization` header:
```
curl -X GET \
  -H "Authorization: Bearer <access-token>" \
  https://connect.squareup.com/v2/<stream path>
```
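The refresh-and-cache behavior described above can be sketched in Python. This is an illustrative model, not the builder's actual implementation; `fetch_token_response` stands in for the HTTP call to the token refresh endpoint:

```python
from datetime import datetime, timedelta

class AccessTokenCache:
    """Illustrative sketch: request a new access token only when the cached one expired."""

    def __init__(self, fetch_token_response, expiry_property="expires_in",
                 access_token_property="access_token"):
        self._fetch = fetch_token_response    # callable returning the token endpoint's JSON
        self._expiry_property = expiry_property
        self._access_token_property = access_token_property
        self._token = None
        self._expires_at = datetime.min

    def get_token(self):
        if self._token is None or datetime.utcnow() >= self._expires_at:
            response = self._fetch()
            self._token = response[self._access_token_property]
            # Default behavior: the expiry property is the token lifetime in seconds
            self._expires_at = datetime.utcnow() + timedelta(
                seconds=response[self._expiry_property]
            )
        return self._token
```

Every record request asks the cache for a token first, so a sync transparently survives access-token expiry as long as the refresh token stays valid.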
### Other authentication methods

If your API is not using one of the natively supported authentication methods, it's still possible to build an Airbyte connector as described below.

#### Access token as query or body parameter

Some APIs require the access token to be included in a different part of the request (for example as a request parameter). The [Breezometer API](https://docs.breezometer.com/api-documentation/introduction/#authentication) uses this kind of authentication. In these cases, it's also possible to configure authentication manually:
* Add a user input as a secret field on the "User inputs" page (e.g. named `api_key`)
* On the stream page, add a new "Request parameter"
* As the key, configure the name of the query parameter the API requires (e.g. `key`)
* As the value, configure a placeholder for the created user input (e.g. `{{ config['api_key'] }}`)

The same approach can be used to add the token to the request body.
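Placeholders like `{{ config['api_key'] }}` are resolved against the user-provided configuration at request time. A simplified model of this substitution (the real builder uses a Jinja-based interpolation engine with more features):

```python
import re

def interpolate(template: str, config: dict) -> str:
    """Simplified model: replace {{ config['key'] }} placeholders with config values."""
    pattern = r"\{\{\s*config\['(\w+)'\]\s*\}\}"
    return re.sub(pattern, lambda m: str(config[m.group(1)]), template)

# The request parameter value is resolved against the source configuration
params = {"key": interpolate("{{ config['api_key'] }}", {"api_key": "my-secret"})}
print(params)  # {'key': 'my-secret'}
```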
#### Custom authentication methods

Some APIs require complex custom authentication schemes involving signing requests or doing multiple requests to authenticate. In these cases, it's required to use the [low-code CDK](/connector-development/config-based/low-code-cdk-overview) or the [Python CDK](/connector-development/cdk-python/).
## Reference

For detailed documentation of the underlying low-code components, see here TODO
# Incremental sync

An incremental sync is a sync which pulls only the data that has changed since the previous sync (as opposed to all the data available in the data source).

This is especially important if there is a large number of records to sync and/or the API has tight request limits which make a full sync of all records on a regular schedule too expensive or too slow.

Incremental syncs are usually implemented using a cursor value (like a timestamp) that delineates which data was pulled and which data is new. A very common cursor value is an `updated_at` timestamp. This cursor means that records whose `updated_at` value is less than or equal to that cursor value have been synced already, and that the next sync should only export records whose `updated_at` value is greater than the cursor value.

To use incremental syncs, the API endpoint needs to fulfill the following requirements:
* Records contain a date/time field that defines when the record was last updated (the "cursor field")
* It's possible to filter/request records by the cursor field
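The cursor semantics can be illustrated with a small sketch (field name and datetime format are hypothetical placeholders; this is not the builder's implementation):

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%SZ"

def is_new(record: dict, cursor_value: str, cursor_field: str = "updated_at") -> bool:
    """A record still needs syncing if its cursor field is greater than the saved cursor."""
    return datetime.strptime(record[cursor_field], FMT) > datetime.strptime(cursor_value, FMT)

cursor = "2023-04-15T07:30:58Z"
print(is_new({"updated_at": "2023-04-16T00:00:00Z"}, cursor))  # True  - changed after the cursor
print(is_new({"updated_at": "2023-04-14T00:00:00Z"}, cursor))  # False - synced already
```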
To learn more about the different modes of incremental syncs, check out the [Incremental Sync - Append](/understanding-airbyte/connections/incremental-append/) and [Incremental Sync - Deduped History](/understanding-airbyte/connections/incremental-deduped-history) pages.
## Configuration

To configure incremental syncs for a stream in the connector builder, you have to specify how the records specify the **"last changed" / "updated at" timestamp**, the **initial time range** to fetch records for, and **how to request records from a certain time range**.

In the builder UI, these things are specified like this:
* The "Cursor field" is the property in the record that defines the date and time when the record got changed. It's used to decide which records have been synced already and which records are "new"
* The "Datetime format" specifies the [format](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes) the cursor field uses to specify date and time
* The "Cursor granularity" is the smallest time unit supported by the API to specify the time range to request records for, expressed as an [ISO 8601 duration](https://en.wikipedia.org/wiki/ISO_8601#Durations) (e.g. `PT1S` for one second)
* The "Start datetime" is the initial start date of the time range to fetch records for. When doing incremental syncs, subsequent syncs overwrite this date with the cursor value of the last record that got synced so far. In most cases, it is defined by the end user when configuring a Source using your connector
* The "End datetime" is the end date of the time range to fetch records for. In most cases, it's set to the current date and time when the sync is started, to sync all changes that happened so far
* The "Inject start/end time into outgoing HTTP request" option defines how to request records that got changed in the time range to sync. In most cases, the start and end time are added as a request parameter or body parameter
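The "Datetime format" uses Python's strftime/strptime format codes. For instance, a quick check of the format used in the example below:

```python
from datetime import datetime

# "%Y-%m-%dT%H:%M:%SZ" matches timestamps like "2022-10-21T14:06:14Z"
fmt = "%Y-%m-%dT%H:%M:%SZ"

parsed = datetime.strptime("2022-10-21T14:06:14Z", fmt)
print(parsed)                # 2022-10-21 14:06:14
print(parsed.strftime(fmt))  # 2022-10-21T14:06:14Z
```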
## Example

The [API of The Guardian](https://open-platform.theguardian.com/documentation/search) has a `/search` endpoint that allows extracting a list of articles.

The `/search` endpoint has a `from-date` and a `to-date` query parameter which can be used to only request data for a certain time range.

Content records have the following form:
```
{
  "id": "world/2022/oct/21/russia-ukraine-war-latest-what-we-know-on-day-240-of-the-invasion",
  "type": "article",
  "sectionId": "world",
  "sectionName": "World news",
  "webPublicationDate": "2022-10-21T14:06:14Z",
  "webTitle": "Russia-Ukraine war latest: what we know on day 240 of the invasion",
  // ...
}
```
As this fulfills the requirements for incremental syncs, we can configure the "Incremental sync" section in the following way:
* "Cursor field" is set to `webPublicationDate`
* "Datetime format" is set to `%Y-%m-%dT%H:%M:%SZ`
* "Cursor granularity" is set to `PT1S` as this API can handle date/time values on the second level
* "Start datetime" is set to "config value"
* "End datetime" is set to "now" to fetch all articles up to the current date
* "Inject start time into outgoing HTTP request" is set to `request_parameter` with "Field" set to `from-date`
* "Inject end time into outgoing HTTP request" is set to `request_parameter` with "Field" set to `to-date`

This API orders records by default from new to old, which is not optimal for a reliable sync, as the last encountered cursor value will be the most recent date even if some older records did not get synced (for example if a sync fails halfway through). It's better to start with the oldest records and work your way up, to make sure that all older records are synced already once a certain date is encountered on a record. In this case, the API can be configured to behave like this by setting an additional parameter:
* At the bottom of the stream configuration page, add a new "Request parameter"
* Set the key to `order-by`
* Set the value to `oldest`

Setting the start date in the "Testing values" to a week in the past (`2023-04-09T00:00:00Z` at the time of writing) results in the following request:
```
curl 'https://content.guardianapis.com/search?order-by=oldest&from-date=2023-04-09T00:00:00Z&to-date=2023-04-15T10:18:08Z'
```
The last encountered date will be saved as part of the connection - when the next sync is running, it picks up from the last record. Let's assume the last encountered article looked like this:
```
{
  "id": "business/live/2023/apr/15/uk-bosses-more-optimistic-energy-prices-fall-ai-spending-boom-economics-business-live",
  "type": "liveblog",
  "sectionId": "business",
  "sectionName": "Business",
  "webPublicationDate": "2023-04-15T07:30:58Z",
}
```

Then, when a sync is triggered for the same connection the next day, the following request is made:
```
curl 'https://content.guardianapis.com/search?order-by=oldest&from-date=2023-04-15T07:30:58Z&to-date=2023-04-16T10:18:08Z'
```

The `from-date` is set to the cutoff date of the articles synced already and the `to-date` is set to the new "now".
## Advanced settings

The description above is sufficient for a lot of APIs. However, there are some more subtle configurations which sometimes become relevant.

### Step

When incremental syncs are enabled, the connector does not fetch all records since the cutoff date at once - instead, it splits up the time range between the cutoff date and the desired end date into intervals, based on the "Step" configuration expressed as an [ISO 8601 duration](https://en.wikipedia.org/wiki/ISO_8601#Durations) (by default, one calendar month).

For example, if the "Step" is set to 10 days (`P10D`) for the Guardian articles stream described above and a longer time range is synced, then the following requests will be performed:
```
curl 'https://content.guardianapis.com/search?order-by=oldest&from-date=2023-01-01T00:00:00Z&to-date=2023-01-10T00:00:00Z'
curl 'https://content.guardianapis.com/search?order-by=oldest&from-date=2023-01-10T00:00:00Z&to-date=2023-01-20T00:00:00Z'
curl 'https://content.guardianapis.com/search?order-by=oldest&from-date=2023-01-20T00:00:00Z&to-date=2023-01-30T00:00:00Z'
...
```
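The splitting into intervals can be sketched as follows - a simplified model that uses a fixed `timedelta` step, whereas the builder accepts full ISO 8601 durations such as `P10D` or `P1M`:

```python
from datetime import datetime, timedelta

def split_into_intervals(start: datetime, end: datetime, step: timedelta):
    """Split [start, end] into consecutive windows of at most `step` length."""
    intervals = []
    while start < end:
        chunk_end = min(start + step, end)
        intervals.append((start, chunk_end))
        start = chunk_end
    return intervals

windows = split_into_intervals(
    datetime(2023, 1, 1), datetime(2023, 1, 25), timedelta(days=10)
)
for window_start, window_end in windows:
    # Each window becomes one request with from-date/to-date set accordingly
    print(window_start.date(), "->", window_end.date())
```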
After an interval is processed, the cursor value of the last record will be saved as part of the connection as the new cutoff date.

In most cases, the default step size is fine, but there are two reasons to change it:
* **More frequent checkpointing** - saving the cutoff date more often prevents resyncing a lot of records when a sync fails: if the "Step" size is a day, then at most one day's worth of data needs to be resynced if a sync fails halfway through. However, a smaller step size might cause more requests to the API and more load on the system
* **The API requires the connector to fetch data in pre-specified chunks** - for example, the [Exchange Rates API](https://exchangeratesapi.io/documentation/) makes the date to fetch data for part of the URL path and only allows fetching data for a single day at a time
### Lookback window

The "Lookback window" specifies a duration that is subtracted from the last cutoff date before starting to sync.

Some APIs update records over time but do not allow filtering or searching by modification date, only by creation date. For example, the API of The Guardian might change the title of an article after it got published, but the `webPublicationDate` still shows the original date the article was published initially.

In these cases, there are two options:
* **Do not use incremental sync** and always sync the full set of records to always have a consistent state - depending on the amount of data, this might not be feasible, and it prevents keeping a history of changes to individual records
* **Configure the "Lookback window"** to not sync exclusively new records, but to resync some portion of records before the cutoff date to catch changes that were made to existing records, trading off data consistency against the number of synced records. In the case of the API of The Guardian, this strategy will likely work well, because news articles tend to only be updated for a few days after the initial release date, so it should be able to catch most updates without having to resync all articles
Reiterating the example from above, with a "Lookback window" of 2 days configured, let's assume the last encountered article looked like this:
```
{
  "id": "business/live/2023/apr/15/uk-bosses-more-optimistic-energy-prices-fall-ai-spending-boom-economics-business-live",
  "type": "liveblog",
  "sectionId": "business",
  "sectionName": "Business",
  "webPublicationDate": "2023-04-15T07:30:58Z",
}
```

Then, when a sync is triggered for the same connection the next day, the following request is made:
```
curl 'https://content.guardianapis.com/search?order-by=oldest&from-date=2023-04-13T07:30:58Z&to-date=2023-04-16T10:18:08Z'
```
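The start date of this request is simply the saved cutoff minus the lookback window, which can be verified with a quick calculation (dates taken from the example above):

```python
from datetime import datetime, timedelta

cutoff = datetime(2023, 4, 15, 7, 30, 58)  # webPublicationDate of the last synced article
lookback = timedelta(days=2)               # configured "Lookback window"

start = cutoff - lookback
print(start.strftime("%Y-%m-%dT%H:%M:%SZ"))  # 2023-04-13T07:30:58Z
```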