Commit
Better Heartbeating External Documentation. (#35932)
In airbytehq/airbyte-platform-internal@96baf5b (Better Destination Heartbeat Error Messages, airbyte-platform-internal#11595), we improved our heartbeat error messages and pointed users to this external document. Here, we improve the external documentation to help users understand what is happening and what they can do.
Showing 1 changed file with 34 additions and 11 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,34 +1,57 @@ | ||
# Heartbeats | ||
|
||
During a data synchronization, many things can go wrong, and sometimes the fix is just to restart the synchronization. | ||
Airbyte aims to make this restart as automated as possible and uses a heartbeating mechanism to do that. | ||
This is performed on 2 different components: the source and the destination. They have different logic, which is | ||
explained below. | ||
Many transient issues can occur when moving data, especially for long jobs. Often the fix is a simple restart. | ||
|
||
## Source | ||
Airbyte aims to make restarts as automated as possible and uses a heartbeating mechanism to do so. This is performed on 2 different components: the source and the destination. | ||
|
||
### Heartbeating logic | ||
Heartbeat errors are expected to be transient and should automatically resolve. If they do not, it is likely a sign of a more serious issue. | ||
|
||
## Known Causes | ||
|
||
Possible reasons for this issue: | ||
1. Certain API sources take an unknown amount of time to generate asynchronous responses (e.g., Salesforce, Facebook, Amplitude). No workaround currently exists. | ||
2. Certain API sources can be rate-limited for a time period longer than their configured threshold. Although Airbyte tries its best to handle this on a per-connector basis, rate limits are not always predictable. | ||
3. Database sources can be slow to respond to a query. This can be due to a variety of reasons, including the size of the database, the complexity of the query, and the number of other queries being made to the database at the same time. | ||
   1. The most common reason we see is using an un-indexed column as a cursor column in an incremental sync, or a dramatically under-provisioned database. | ||
4. Destinations can be slow to respond to write requests. | ||
   1. The most common reason we see here is insufficient destination resources relative to data volumes. | ||
|
||
In general, | ||
* **Database source and destination errors are extremely rare**. Any such errors likely indicate a real problem and need to be investigated. | ||
* **API source errors are uncommon but not unexpected**. This is especially true if an API source generates asynchronous responses or has rate limits. | ||
|
||
## Airbyte Cloud | ||
Airbyte Cloud has the same heartbeat monitoring and alerting as Airbyte Open Source. | ||
|
||
If these issues show up on Airbyte Cloud, | ||
1. Please read [Known Causes](#known-causes). In many cases, the issue is with the source, the destination, or the connection setup, and not with Airbyte. | ||
2. Reach out to Airbyte Support for help. | ||
|
||
## Technical Details | ||
|
||
### Source | ||
#### Heartbeating logic | ||
|
||
The platform considers both `RECORD` and `STATE` messages emitted by the source as source heartbeats. | ||
The Airbyte platform has a process which monitors when the last beat was sent, and if the time since then reaches a threshold, | ||
the synchronization attempt is failed. The failure is attributed to the source, with the message | ||
`The source is unresponsive`. Internally, the error has a heartbeat timeout type, which is not displayed in the UI. | ||
|
||
### Configuration | ||
#### Configuration | ||
|
||
The heartbeat can be configured using the file `flags.yaml` through 2 entries: | ||
* `heartbeat-max-seconds-between-messages`: this configures the maximum time allowed between 2 messages. | ||
* `heartbeat-max-seconds-between-messages`: this configures the maximum time allowed between 2 messages. | ||
The default is 3 hours. | ||
* `heartbeat.failSync`: If true, a sync fails when a missed heartbeat is detected. | ||
If false, no sync fails because of a missed heartbeat. The default value is true. | ||
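
The source heartbeat logic and its two settings can be illustrated with a minimal Python sketch. This is not Airbyte's implementation: the class `SourceHeartbeatMonitor`, its method names, and the injectable `clock` are hypothetical. Only the `RECORD`/`STATE` message types, the 3-hour default, and the fail-or-continue behavior come from the text above.

```python
import time
from typing import Callable


class SourceHeartbeatMonitor:
    """Hypothetical sketch of a source heartbeat check (not Airbyte code)."""

    def __init__(
        self,
        max_seconds_between_messages: float = 3 * 60 * 60,  # 3-hour default from the doc
        fail_sync: bool = True,  # mirrors the documented heartbeat.failSync flag
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_seconds_between_messages = max_seconds_between_messages
        self.fail_sync = fail_sync
        self._clock = clock
        self._last_beat = clock()

    def on_message(self, message_type: str) -> None:
        # Only RECORD and STATE messages count as heartbeats.
        if message_type in ("RECORD", "STATE"):
            self._last_beat = self._clock()

    def is_beating(self) -> bool:
        # True while the last beat is within the allowed window.
        elapsed = self._clock() - self._last_beat
        return elapsed <= self.max_seconds_between_messages

    def should_fail_sync(self) -> bool:
        # A missed heartbeat fails the sync only when fail_sync is enabled.
        return self.fail_sync and not self.is_beating()
```

A monitoring loop would call `on_message` for each message read from the source and poll `should_fail_sync` periodically. Injecting the clock keeps the sketch testable without waiting for real time to pass.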
|
||
## Destination | ||
### Destination | ||
|
||
### Heartbeating logic | ||
#### Heartbeating logic | ||
|
||
Adding a heartbeat to the destination similar to the one at the source is not straightforward since there isn't a constant stream of messages from the destination to the platform. Instead, we have implemented something that is more akin to a timeout. The platform monitors whether there has been a call to the destination that has taken more than a specified amount of time. If such a delay occurs, the platform considers the destination to have timed out. | ||
|
||
### Configuration | ||
#### Configuration | ||
The timeout can be configured using the file `flags.yaml` through 2 entries: | ||
* `destination-timeout-max-seconds`: If the platform detects a call to the destination exceeding the duration specified in this entry, it will consider the destination to have timed out. The default timeout value is 24 hours. | ||
* `destination-timeout.failSync`: If enabled (true by default), a detected destination timeout will cause the platform to fail the sync. If not, the platform will log a message and allow the sync to continue. When the platform fails a sync due to a destination timeout, the UI will display the message: `The destination is unresponsive`. |
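
The destination timeout differs from the source heartbeat in that it measures the duration of an in-flight call rather than the gap between messages. The following Python sketch shows that difference under stated assumptions: the class `DestinationTimeoutMonitor`, its methods, and the `"ok"`/`"fail"`/`"log"` return values are hypothetical and do not reflect Airbyte's code. Only the 24-hour default and the fail-or-log behavior come from the text above.

```python
import time
from typing import Callable, Optional


class DestinationTimeoutMonitor:
    """Hypothetical sketch of a destination timeout check (not Airbyte code)."""

    def __init__(
        self,
        max_seconds: float = 24 * 60 * 60,  # 24-hour default from the doc
        fail_sync: bool = True,  # mirrors the documented destination-timeout.failSync flag
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_seconds = max_seconds
        self.fail_sync = fail_sync
        self._clock = clock
        self._call_started_at: Optional[float] = None

    def start_call(self) -> None:
        # Record when a call to the destination begins.
        self._call_started_at = self._clock()

    def end_call(self) -> None:
        # The call returned, so nothing is in flight anymore.
        self._call_started_at = None

    def has_timed_out(self) -> bool:
        # Only an in-flight call can time out.
        if self._call_started_at is None:
            return False
        return self._clock() - self._call_started_at > self.max_seconds

    def check(self) -> str:
        # "fail": fail the sync; "log": log and continue; "ok": no timeout.
        if not self.has_timed_out():
            return "ok"
        return "fail" if self.fail_sync else "log"
```

Wrapping each destination call between `start_call` and `end_call` and polling `check` from a separate monitor captures the "more akin to a timeout" behavior: an idle destination never trips the check, while a single call that runs too long does.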