Auto-disable failing connections #9715

cgardens · 2022-01-22T23:30:56Z

Tell us about the problem you're trying to solve

If a connection is failing, Airbyte should automatically stop retrying at some point to avoid wasting resources. We will start with the following rule: Disable a sync if it has 14 days of straight failures OR 100 failures in a row, whichever comes first. The goal with the rule is to handle syncs with a long period.

So a constantly failing 5-minute sync will disable itself after about 8 hours (100 failures), while a daily sync will disable itself after 14 days. Syncs with a period longer than 14 days will get disabled after 1 bad run.
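The arithmetic behind these estimates can be sketched as follows. This is only a rough model, assuming exactly one failed run per sync period; the function name `failed_runs_until_disabled` and the constant names are hypothetical and not part of Airbyte.

```python
import math
from datetime import timedelta

# Thresholds taken from the proposed rule.
MAX_CONSECUTIVE_FAILURES = 100
FAILURE_WINDOW = timedelta(days=14)


def failed_runs_until_disabled(sync_period: timedelta) -> int:
    """Estimate how many failed runs occur before a sync is disabled.

    Assumes the sync fails exactly once per period (a simplification).
    """
    # Number of runs needed to span the 14-day window, at least one.
    runs_in_window = max(1, math.ceil(FAILURE_WINDOW / sync_period))
    # Whichever threshold is reached first wins.
    return min(MAX_CONSECUTIVE_FAILURES, runs_in_window)


for period in (timedelta(minutes=5), timedelta(days=1), timedelta(days=30)):
    runs = failed_runs_until_disabled(period)
    print(f"period={period}, failed runs={runs}, elapsed={runs * period}")
```

Under this model the 100-failure limit dominates for frequent syncs, and the 14-day window dominates for daily or slower ones.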

While this is something that we will want to make configurable, we are still not sure exactly what the right thresholds will be, or whether we have even chosen the right metrics to use as triggers. We should therefore hold off on making the thresholds configurable via environment variable until we are more certain. For now, if we want, we can put the feature behind a flag and allow OSS users to simply turn off auto-disable altogether.

Acceptance Criteria

Syncs are disabled if they have 14 days of straight failures OR 100 failures in a row (whichever comes first)
OSS users can disable this functionality via flag
Users should be notified when a sync is auto-disabled

jrhizor · 2022-01-24T23:36:36Z

We should notify users when this occurs.

ChristopheDuong · 2022-02-04T10:26:15Z

See this metabase question:
https://airbyte.metabaseapp.com/question/82-cloud-connection-to-watch-for-failures?connection_status=active

ChristopheDuong · 2022-02-08T16:00:04Z

The parameters should probably be tweaked...

This workspace: https://cloud.airbyte.io/workspaces/b264bf09-60e0-4a81-9cec-2b64cd613792/connections/5b3f66da-8739-405c-9407-cb2a39b03806

is not showing up based on Charles's thresholds since it has:

99 failures in a row (missing 1 more): when the connection reaches this point, the amount of resources wasted is already pretty high.
No 14 days of straight failures: even though the connection is configured to run every 3h, it is not being triggered regularly as expected, so there have been days with no runs at all. It might be due to a scheduling bug. In any case, maybe the criterion should instead be: no successful runs, only failed runs, in the past 14 days.

davinchia · 2022-02-15T14:58:41Z

Are those failure limits too high? They might still result in high warehouse costs for folks with normalisation turned on.

To clarify, can we define 14 days of straight failures? Does this refer to a job failing after running for 14 days?

terencecho · 2022-03-05T01:25:59Z

I interpreted 14 days of straight failures as: there have been no successful jobs in the past 14 days (with a minimum of one job attempt), meaning there could be cancelled jobs in that 14-day period.

I have a draft PR up that only does this check after a job failure, so in that implementation, there shouldn't be a job in a currently running state.

terencecho · 2022-03-10T22:33:34Z

@cgardens I have this set for only jobs of configType SYNC. Benoit mentioned we might want to include RESET_CONNECTION. Do you have any thoughts around which configTypes we should include?

cgardens · 2022-03-11T17:00:44Z

That makes sense to me to include it.

terencecho · 2022-03-14T22:24:31Z

When a new connection is set up and fails its first job, this will trigger the connection to be auto-disabled since all the jobs in the last 14 days (so only that first job) will be failures. This behavior seems not ideal for users, since when they fix the connection, they will also need to remember to enable it.

One option was to make sure that the connection is 14 days old before doing that check, but this could lead to many failures in that 14-day period. We can still do the 100 consecutive failures check in this case though.

@cgardens do you have any opinions on what behavior might be better for our users?

cgardens · 2022-03-14T23:04:55Z

hahaha. yeah the way i was thinking about that is there is condition 1: disable due to 14 days of failures OR condition 2: disable due to 100 consecutive failures. condition 1 can't trigger until there are at least 14 days of jobs.
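A minimal sketch of the two conditions as discussed in this thread, not the actual Airbyte implementation. The `Job` and `Status` types and the `should_auto_disable` function are hypothetical. It also assumes that cancelled jobs neither count as failures nor break a failure streak, which the thread does not settle.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum

MAX_CONSECUTIVE_FAILURES = 100
FAILURE_WINDOW = timedelta(days=14)


class Status(Enum):
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    CANCELLED = "cancelled"


@dataclass
class Job:
    finished_at: datetime
    status: Status


def should_auto_disable(jobs: list[Job], now: datetime) -> bool:
    """Return True if either condition from the thread holds."""
    if not jobs:
        return False
    ordered = sorted(jobs, key=lambda j: j.finished_at)

    # Condition 2: 100 failures in a row, counted back from the latest job.
    # Assumption: cancelled jobs are skipped rather than ending the streak.
    streak = 0
    for job in reversed(ordered):
        if job.status is Status.SUCCEEDED:
            break
        if job.status is Status.FAILED:
            streak += 1
    if streak >= MAX_CONSECUTIVE_FAILURES:
        return True

    # Condition 1: no successful job in the past 14 days, with at least one
    # failure, and only once the job history spans the full 14-day window.
    window_start = now - FAILURE_WINDOW
    has_full_history = ordered[0].finished_at <= window_start
    recent = [j for j in ordered if j.finished_at > window_start]
    has_failure = any(j.status is Status.FAILED for j in recent)
    has_success = any(j.status is Status.SUCCEEDED for j in recent)
    return has_full_history and has_failure and not has_success
```

With the history guard in place, a brand-new connection whose first job fails is not disabled, while the 100-consecutive-failures check still applies to it.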

terencecho · 2022-03-18T23:26:37Z

Update for this issue. This feature can be turned on by setting the AUTO_DISABLE_FAILING_CONNECTIONS feature flag to true. It's currently not turned on anywhere. We're still looking into making it possible to notify users when a connection has been auto-disabled. You can follow along with that issue here. I'm assuming once we have notifications set up, we can turn it on for workspaces that make sense to have them.

cgardens added the type/enhancement, area/platform, and 2022-q1-platform labels Jan 22, 2022

pmossman assigned terencecho Feb 17, 2022

terencecho mentioned this issue Mar 5, 2022

Add Disable Failing Connections feature #10877

Closed

3 tasks

terencecho mentioned this issue Mar 7, 2022

Notify users of auto-disabled connections #10929

Closed

terencecho mentioned this issue Mar 14, 2022

Add Auto-Disable Failing Connections feature #11099

Merged

3 tasks

pmossman closed this as completed Mar 24, 2022

terencecho mentioned this issue Apr 1, 2022

Send Customer.io notifications and warnings for connections being auto-disabled #11670

Merged

1 task

bleonard added team/compose team/platform-move labels Apr 15, 2022

This was referenced Feb 22, 2023

Docs: 14 days or 100 failed syncs #23338

Closed

OSS docs: 14 days or 100 failed syncs before connection is disabled #23364

Closed
