Connector builder: Adjust documentation for incremental sync changes (#27206)

* adjust documentation

* update tutorial based on incremental sync changes

---------

Co-authored-by: lmossman <lake@airbyte.io>
Joe Reuter and lmossman committed Jun 13, 2023
1 parent ad1a992 commit 69c3a73
Showing 2 changed files with 13 additions and 12 deletions.
@@ -48,7 +48,6 @@ Content records have the following form:
As this fulfills the requirements for incremental syncs, we can configure the "Incremental sync" section in the following way:
* "Cursor field" is set to `webPublicationDate`
* "Datetime format" is set to `%Y-%m-%dT%H:%M:%SZ`
* "Cursor granularity is set to `PT1S` as this API can handle date/time values on the second level
* "Start datetime" is set to "user input" to allow the user of the connector configuring a Source to specify the time to start syncing
* "End datetime" is set to "now" to fetch all articles up to the current date
* "Inject start time into outgoing HTTP request" is set to `request_parameter` with "Field" set to `from-date`
@@ -92,11 +91,13 @@ In some cases, it's helpful to reference the start and end date of the interval

The description above is sufficient for a lot of APIs. However, there are some subtler configurations that sometimes become relevant.

- ### Step
+ ### Split up interval

When incremental syncs are enabled and "Step" is set, the connector is not fetching all records since the cutoff date at once - instead it's splitting up the time range between the cutoff date and the desired end date into intervals based on the "Step" configuration expressed as [ISO 8601 duration](https://en.wikipedia.org/wiki/ISO_8601#Durations).
When incremental syncs are enabled and "Split up interval" is set, the connector is not fetching all records since the cutoff date at once - instead it's splitting up the time range between the cutoff date and the desired end date into intervals based on the "Step" configuration expressed as [ISO 8601 duration](https://en.wikipedia.org/wiki/ISO_8601#Durations).

For example if the "Step" is set to 10 days (`P10D`) for the Guardian articles stream described above and a longer time range, then the following requests will be performed:
The "cursor granularity" also needs to be set to an ISO 8601 duration - it represents the smallest possible time unit the API supports to filter records by. It's used to ensure the start of a interval does not overlap with the end of the previous one.

For example if the "Step" is set to 10 days (`P10D`) and the "Cursor granularity" set to second (`PT1S`) for the Guardian articles stream described above and a longer time range, then the following requests will be performed:
<pre>
curl 'https://content.guardianapis.com/search?order-by=oldest&from-date=<b>2023-01-01T00:00:00Z</b>&to-date=<b>2023-01-10T00:00:00Z</b>'{`\n`}
curl 'https://content.guardianapis.com/search?order-by=oldest&from-date=<b>2023-01-10T00:00:00Z</b>&to-date=<b>2023-01-20T00:00:00Z</b>'{`\n`}
@@ -106,7 +107,7 @@ curl 'https://content.guardianapis.com/search?order-by=oldest&from-date=<b>2023-

After an interval is processed, the cursor value of the last record will be saved as part of the connection as the new cutoff date.
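
The splitting behavior can be illustrated with a short Python sketch. This is not the CDK's actual implementation - it simply assumes, as described above, that each window spans at most one "Step" and that the next window starts one cursor-granularity unit after the previous one ends:

```python
from datetime import datetime, timedelta, timezone

def split_time_range(start, end, step, granularity):
    """Split [start, end] into windows of at most `step`, separated by `granularity`."""
    windows = []
    while start <= end:
        window_end = min(start + step - granularity, end)
        windows.append((start, window_end))
        start = window_end + granularity  # next window starts one granularity unit later
    return windows

# A step of ten days and a cursor granularity of one second, as in the example above.
for window_start, window_end in split_time_range(
    datetime(2023, 1, 1, tzinfo=timezone.utc),
    datetime(2023, 1, 31, tzinfo=timezone.utc),
    step=timedelta(days=10),
    granularity=timedelta(seconds=1),
):
    print(window_start.strftime("%Y-%m-%dT%H:%M:%SZ"), "->", window_end.strftime("%Y-%m-%dT%H:%M:%SZ"))
```

The exact boundaries the connector produces may differ slightly, but the overall shape - consecutive, non-overlapping windows advancing by one step - is the same.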

- This value is optional and if left unset, the connector will not split up the time range at all but will instead just request all records for the entire target time range. This configuration works for all connectors, but there are two reasons to change it:
+ If left unset, the connector will not split up the time range at all but will instead just request all records for the entire target time range. This configuration works for all connectors, but there are two reasons to change it:
* **To protect a connection against intermittent failures** - if the "Step" size is a day, the cutoff date is saved after all records associated with that day are processed. If a sync fails halfway through because the API, the Airbyte system, the destination, or the network between these components has a failure, then at most one day's worth of data needs to be resynced. However, a smaller step size might cause more requests to the API and more load on the system. The optimal step size depends on the expected amount of data and the load characteristics of the API, but for a lot of applications the default of one month is a good starting point.
* **The API requires the connector to fetch data in pre-specified chunks** - for example, the [Exchange Rates API](https://exchangeratesapi.io/documentation/) makes the date to fetch data for part of the URL path and only allows fetching data for a single day at a time

14 changes: 7 additions & 7 deletions docs/connector-development/connector-builder-ui/tutorial.mdx
@@ -143,7 +143,7 @@ The record should update to use USD as the base currency:

### Adding incremental reads

- <div style={{ position: "relative", paddingBottom: "59.66850828729282%", height: 0 }}><iframe src="https://www.loom.com/embed/223d28508682464481a433396cab2a3a" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen style={{position: "absolute", top: 0, left: 0, width: "100%", height: "100%"}}></iframe></div>
+ <div style={{ position: "relative", paddingBottom: "59.66850828729282%", height: 0 }}><iframe src="https://www.loom.com/embed/d52259513b664119a842809a4fd13c15" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen style={{position: "absolute", top: 0, left: 0, width: "100%", height: "100%"}}></iframe></div>

We now have a working implementation of a connector reading the latest exchange rates for a given currency.
In this section, we'll update the source to read historical data instead of only reading the latest exchange rates.
@@ -153,14 +153,15 @@ According to the API documentation, we can read the exchange rate for a specific
To configure your connector to request every day individually, follow these steps:
* At the top of the form, change the "Path URL" input to `/exchangerates_data/{{ stream_interval.start_time }}` to [inject](/connector-development/config-based/understanding-the-yaml-file/reference#variables) the date to fetch data for into the path of the request
* Enable "Incremental sync" for the Rates stream
* Set the "Cursor field" to `date` - this is the property in our records to check what date got synced last
* Set the "Datetime format" to `%Y-%m-%d` to match the format of the date in the record returned from the API
* Set the "Cursor granularity" to `P1D` to tell the connector the API only supports daily increments
* Set the "Cursor Field" to `date` - this is the property in our records to check what date got synced last
* Set the "Cursor Field Datetime Format" to `%Y-%m-%d` to match the format of the date in the record returned from the API
* Leave the start time set to "User input" so the end user can specify the desired start time for syncing data
* Leave the end time set to "Now" to always sync exchange rates up to the current date
* In a lot of cases the start and end date are injected into the request body or request parameters. However, in the case of the exchange rate API the date needs to be added to the path of the request, so disable the "Inject start/end time into outgoing HTTP request" options
* Open the "Advanced" section and set "Step" to `P1D` to configure the connector to do one separate request per day by partitioning the dataset into daily intervals
* Set a start date (like `2023-03-03`) in the "Testing values" menu
* Open the "Advanced" section and enable "Split up interval" so that the connector will partition the dataset into chunks
* Set "Step" to `P1D` to configure the connector to do one separate request per day
* Set the "Cursor granularity" to `P1D` to tell the connector the API only supports daily increments
* Set a start date (like `2023-06-11`) in the "Testing values" menu
* Hit the "Test" button to trigger a new test read

Now, you should see a dropdown above the records view that lets you step through the daily exchange rates along with the requests performed to fetch this data. Note that in the connector builder at most 5 partitions are requested, to speed up testing. During a proper sync, the full time range between your configured start date and the current day will be processed.
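
As a rough illustration of what this partitioning does (not taken from the builder itself - the values below simply reuse the settings from the steps above), each daily interval resolves `{{ stream_interval.start_time }}` to a different date in the Path URL:

```python
from datetime import date, timedelta

start = date(2023, 6, 11)   # the start date from the "Testing values" menu
step = timedelta(days=1)    # "Step" = P1D
end = date.today()          # "End datetime" = "Now"

day = start
while day <= end:
    # {{ stream_interval.start_time }} resolves to the interval's start, formatted as %Y-%m-%d
    print(f"/exchangerates_data/{day.strftime('%Y-%m-%d')}")
    day += step
```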
@@ -190,7 +191,6 @@ Congratulations! You just completed the following steps:
* Configured a production-ready connector to extract currency exchange data from an HTTP-based API:
* Configurable API key, start date and base currency
* Incremental sync to keep the number of requests small
* Schema declaration to enable normalization in the destination
* Tested whether the connector works correctly in the builder
* Made the working connector available to configure sources in the workspace
* Set up a connection using the published connector and synced data from the Exchange Rates API
