Better Heartbeating External Documentation. (#35932)
In,
- airbytehq/airbyte-platform-internal@96baf5b
- Better Destination Heartbeat Error Messages airbyte-platform-internal#11595

we improve our heartbeat error messages and point users to this external document.

Here, we improve external documentation to help users understand what is happening and what they can do.
davinchia committed Mar 8, 2024
1 parent a4dca3b commit e66ec11
Showing 1 changed file with 34 additions and 11 deletions.
45 changes: 34 additions & 11 deletions docs/understanding-airbyte/heartbeats.md
@@ -1,34 +1,57 @@
# Heartbeats

During a data synchronization, many things can go wrong and sometimes the fix is just to restart the synchronization.
Airbyte aims to make this restart as automated as possible and uses a heartbeating mechanism in order to do that.
This is performed on 2 different components: the source and the destination. They have different logics which will be
explained below.
Many transient issues can occur when moving data, especially for long jobs. Often the fix is a simple restart.

## Source
Airbyte aims to make restarts as automated as possible and uses a heartbeating mechanism to do so. This is performed on 2 different components: the source and the destination.

### Heartbeating logic
Heartbeat errors are expected to be transient and should automatically resolve. If they do not, it is likely a sign of a more serious issue.

## Known Causes

Possible reasons for this issue:
1. Certain API sources take an unknown amount of time to generate asynchronous responses (e.g., Salesforce, Facebook, Amplitude). No workaround currently exists.
2. Certain API sources can be rate-limited for a time period longer than their configured threshold. Although Airbyte tries its best to handle this on a per-connector basis, rate limits are not always predictable.
3. Database sources can be slow to respond to a query. This can be due to a variety of reasons, including the size of the database, the complexity of the query, and the number of other queries being made to the database at the same time.
   1. The most common reason we see is using an un-indexed column as a cursor column in an incremental sync, or a dramatically under-provisioned database.
4. Destinations can be slow to respond to write requests.
   1. The most common reason we see here is destination resource availability vis-a-vis data volumes.

In general,
* **Database source and destination errors are extremely rare**. Any that occur likely indicate a real problem and should be investigated.
* **API source errors are uncommon but not unexpected**. This is especially true if an API source generates asynchronous responses or has rate limits.

## Airbyte Cloud
Airbyte Cloud has the same heartbeat monitoring and alerting as Airbyte Open Source.

If these issues show up on Airbyte Cloud,
1. Please read [Known Causes](#known-causes). In many cases, the issue is with the source, the destination, or the connection setup, and not with Airbyte.
2. Reach out to Airbyte Support for help.

## Technical Details

### Source
#### Heartbeating logic

The platform considers both `RECORD` and `STATE` messages emitted by the source as source heartbeats.
The Airbyte platform has a process which monitors when the last beat was sent and, if the elapsed time reaches a threshold,
fails the synchronization attempt. The failure is attributed to the source, with the message
`The source is unresponsive`. Internally, the error has a heartbeat timeout type, which is not displayed in the UI.
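
The monitoring described above can be sketched roughly as follows. This is a minimal illustration, not Airbyte's actual implementation: the class `SourceHeartbeatMonitor`, its method names, and the injectable `clock` are all hypothetical, and the strict "greater than" comparison is an assumption.

```python
import time
from typing import Callable

# Message types the platform counts as source heartbeats, per the text above.
HEARTBEAT_TYPES = {"RECORD", "STATE"}


class SourceHeartbeatMonitor:
    """Hypothetical sketch of a source heartbeat monitor (not Airbyte's code)."""

    def __init__(
        self,
        max_seconds_between_messages: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold = max_seconds_between_messages
        self._clock = clock
        # Assumption: the timer starts when monitoring starts.
        self._last_beat = clock()

    def on_message(self, message_type: str) -> None:
        # Only RECORD and STATE messages reset the timer; others are ignored.
        if message_type in HEARTBEAT_TYPES:
            self._last_beat = self._clock()

    def is_unresponsive(self) -> bool:
        # True once the gap since the last beat exceeds the threshold.
        return self._clock() - self._last_beat > self._threshold
```

Passing a fake `clock` makes the behavior easy to check without waiting: a `LOG` message does not reset the timer, while a `RECORD` or `STATE` message does.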

### Configuration
#### Configuration

The heartbeat can be configured using the file `flags.yaml` through 2 entries:
* `heartbeat-max-seconds-between-messages`: this configures the maximum time allowed between 2 messages.
* `heartbeat-max-seconds-between-messages`: this configures the maximum time allowed between 2 messages.
The default is 3 hours.
* `heartbeat.failSync`: Setting this to true makes a sync fail if a missed heartbeat is detected.
If false, no sync will be failed because of a missed heartbeat. The default value is true.
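
To illustrate how the two entries interact, here is a small sketch. The names `SourceHeartbeatConfig` and `should_fail_sync` are hypothetical and only model the documented defaults (3 hours, fail on a missed heartbeat); they are not how the platform reads `flags.yaml`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceHeartbeatConfig:
    """Hypothetical in-memory view of the two documented entries."""

    # Models heartbeat-max-seconds-between-messages; documented default is 3 hours.
    max_seconds_between_messages: int = 3 * 60 * 60
    # Models heartbeat.failSync; documented default is true.
    fail_sync: bool = True


def should_fail_sync(
    seconds_since_last_message: float,
    config: SourceHeartbeatConfig = SourceHeartbeatConfig(),
) -> bool:
    # Assumption: a heartbeat is "missed" once the gap exceeds the maximum.
    missed = seconds_since_last_message > config.max_seconds_between_messages
    # With fail_sync disabled, a missed heartbeat never fails the sync.
    return missed and config.fail_sync
```

With the defaults, the maximum gap is 10800 seconds, so a source that has been silent for just over 3 hours would fail the sync, while the same silence with `fail_sync=False` would not.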

## Destination
### Destination

### Heartbeating logic
#### Heartbeating logic

Adding a heartbeat to the destination similar to the one at the source is not straightforward since there isn't a constant stream of messages from the destination to the platform. Instead, we have implemented something that is more akin to a timeout. The platform monitors whether there has been a call to the destination that has taken more than a specified amount of time. If such a delay occurs, the platform considers the destination to have timed out.
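
The difference from the source heartbeat can be sketched as follows: only the duration of an in-flight call counts, not the idle time between calls. This is an illustrative sketch under that reading of the paragraph above; `DestinationTimeoutMonitor` and its methods are hypothetical names, not the platform's API.

```python
import time
from typing import Callable, Optional


class DestinationTimeoutMonitor:
    """Hypothetical sketch of a destination call timeout (not Airbyte's code)."""

    def __init__(
        self,
        timeout_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._timeout = timeout_seconds
        self._clock = clock
        # None means no call to the destination is currently in flight.
        self._call_started_at: Optional[float] = None

    def start_call(self) -> None:
        # Record when the platform began a call to the destination.
        self._call_started_at = self._clock()

    def end_call(self) -> None:
        # The call returned, so there is nothing left to time out.
        self._call_started_at = None

    def has_timed_out(self) -> bool:
        # Idle periods between calls never count toward the timeout.
        if self._call_started_at is None:
            return False
        return self._clock() - self._call_started_at > self._timeout
```

Because the timer only runs while a call is outstanding, a destination that receives no data for a long time is not flagged, which is what distinguishes this from a heartbeat.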

### Configuration
#### Configuration
The timeout can be configured using the file `flags.yaml` through 2 entries:
* `destination-timeout-max-seconds`: If the platform detects a call to the destination exceeding the duration specified in this entry, it will consider the destination to have timed out. The default timeout value is 24 hours.
* `destination-timeout.failSync`: If enabled (true by default), a detected destination timeout will cause the platform to fail the sync. If not, the platform will log a message and allow the sync to continue. When the platform fails a sync due to a destination timeout, the UI will display the message: `The destination is unresponsive`.
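
The two outcomes described above (fail the sync, or log and continue) can be sketched like this. The function `handle_destination_timeout`, the exception `DestinationTimeoutError`, and the log wording are hypothetical; only the defaults (24 hours, fail enabled) and the UI message come from the text.

```python
import logging

logger = logging.getLogger("destination_timeout_sketch")


class DestinationTimeoutError(RuntimeError):
    """Hypothetical error raised when a sync is failed due to a timeout."""


def handle_destination_timeout(
    call_duration_seconds: float,
    timeout_seconds: float = 24 * 60 * 60,  # documented default: 24 hours
    fail_sync: bool = True,  # documented default: enabled
) -> bool:
    """Return True if a timeout was detected but the sync may continue."""
    if call_duration_seconds <= timeout_seconds:
        return False  # no timeout detected
    if fail_sync:
        # Message shown in the UI, per the text above.
        raise DestinationTimeoutError("The destination is unresponsive")
    # With failSync disabled, the timeout is only logged.
    logger.warning(
        "Destination call exceeded %s seconds; continuing sync", timeout_seconds
    )
    return True
```

Raising an exception is just one way to model "fail the sync" in a sketch; the platform's actual failure path is not described in this document.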
