Auto-disable failing connections #9715

Closed
cgardens opened this issue Jan 22, 2022 · 10 comments
Assignees
Labels
area/platform issues related to the platform team/compose team/platform-move type/enhancement New feature or request

Comments

@cgardens
Contributor

cgardens commented Jan 22, 2022

Tell us about the problem you're trying to solve

If a connection is failing, Airbyte should automatically stop retrying at some point to avoid wasting resources. We will start with the following rule: disable a sync if it has 14 days of straight failures OR 100 failures in a row, whichever comes first. The goal of the time-based half of the rule is to handle syncs with a long period.

So a constantly failing 5-minute sync will disable itself after about 8 hours (100 runs at 5 minutes each), while a daily sync will disable itself after 14 days. Syncs with a period longer than 14 days will get disabled after 1 bad run.
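To make the arithmetic concrete, here is a rough sketch of how many back-to-back failed runs it takes to trip the rule for a given sync period. The function name and the rounding are assumptions for illustration only; it also assumes the sync runs exactly on schedule, which may not hold in practice.

```python
from datetime import timedelta

MAX_CONSECUTIVE_FAILURES = 100            # threshold from the proposed rule
MAX_FAILURE_WINDOW = timedelta(days=14)   # threshold from the proposed rule


def failures_before_disable(sync_period: timedelta) -> int:
    """Rough count of consecutive failed runs before a sync gets disabled.

    Hypothetical helper: assumes every run fails and runs are evenly spaced.
    """
    # Number of runs that fit in the 14-day window. A sync slower than
    # 14 days still needs at least one (failed) run to be disabled.
    runs_in_window = max(1, MAX_FAILURE_WINDOW // sync_period)
    return min(MAX_CONSECUTIVE_FAILURES, runs_in_window)


for label, period in [
    ("5 minutes", timedelta(minutes=5)),
    ("daily", timedelta(days=1)),
    ("30 days", timedelta(days=30)),
]:
    print(f"{label}: {failures_before_disable(period)} failed runs")
```

Under these assumptions the 5-minute sync hits the 100-failure cap (100 × 5 minutes = 500 minutes of failing runs), the daily sync hits the 14-day window first, and the 30-day sync is disabled by a single failure.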

We will eventually want to make this configurable. But because we are still not sure what the right thresholds are, or whether we have even chosen the right metrics to use as triggers, we should hold off on making the thresholds configurable via environment variable until we are more certain. For now, we can put the feature behind a flag and allow OSS users to simply turn off auto-disable altogether if they want.

Acceptance Criteria

  • Syncs are disabled if they have 14 days of straight failures OR 100 failures in a row (whichever comes first)
  • OSS users can disable this functionality via flag
  • Users should be notified when a sync is auto-disabled
@cgardens cgardens added type/enhancement New feature or request area/platform issues related to the platform 2022-q1-platform labels Jan 22, 2022
@jrhizor
Contributor

jrhizor commented Jan 24, 2022

We should notify users when this occurs.

@ChristopheDuong
Contributor

ChristopheDuong commented Feb 4, 2022

@ChristopheDuong
Contributor

ChristopheDuong commented Feb 8, 2022

The parameters should probably be tweaked...

This workspace: https://cloud.airbyte.io/workspaces/b264bf09-60e0-4a81-9cec-2b64cd613792/connections/5b3f66da-8739-405c-9407-cb2a39b03806

is not flagged by Charles's thresholds because it has:

  • 99 failures in a row (1 short of the limit): by the time a connection reaches this point, the amount of resources wasted is already pretty high.
  • No 14 days of straight failures: even though the connection is configured to run every 3h, it is not being triggered regularly as expected, so there have been days with no runs at all. It might be due to some scheduling bug? Anyway, maybe the criterion should instead be: in the past 14 days there were no successful runs, only failed runs.
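For the "99 failures in a row" case, a minimal sketch of counting the trailing failure streak is below. The status strings and the function name are hypothetical, and treating any non-failed job (including a cancelled one) as breaking the streak is an assumption, not something this thread settles.

```python
from typing import Sequence


def trailing_failures(statuses: Sequence[str]) -> int:
    """Count failed jobs at the end of a history ordered oldest to newest.

    Assumption: any status other than "failed" ends the streak.
    """
    streak = 0
    for status in reversed(statuses):
        if status != "failed":
            break
        streak += 1
    return streak


# One old success followed by 99 failures, as in the workspace above.
history = ["succeeded"] + ["failed"] * 99
print(trailing_failures(history))          # 99
print(trailing_failures(history) >= 100)   # False: one failure short
```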

@davinchia
Contributor

Are those failure limits too high? They might still result in high warehouse costs for folks with normalisation turned on.

To clarify, can we define 14 days of straight failures? Does this refer to a job failing after running for 14 days?

@terencecho
Contributor

I interpreted 14 days of straight failures as: there have been no successful jobs in the past 14 days (with a minimum of one job attempt), meaning there could be cancelled jobs in that 14-day period.

I have a draft PR up that only does this check after a job failure, so in that implementation, there shouldn't be a job in a currently running state.
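A minimal sketch of that interpretation follows. It is not the code from the draft PR; the `Job` shape, the status strings, and the function name are assumptions made for illustration. Cancelled jobs are ignored, and "minimum of one job attempt" is read here as "at least one failed job in the window".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable


@dataclass
class Job:
    status: str          # hypothetical values: "succeeded", "failed", "cancelled"
    ended_at: datetime


def only_failures_in_window(
    jobs: Iterable[Job], now: datetime, window: timedelta = timedelta(days=14)
) -> bool:
    """True if the window holds at least one failure and no success."""
    recent = [job for job in jobs if now - job.ended_at <= window]
    has_failure = any(job.status == "failed" for job in recent)
    has_success = any(job.status == "succeeded" for job in recent)
    # Cancelled jobs count neither for nor against the connection.
    return has_failure and not has_success


now = datetime(2022, 2, 15)
jobs = [
    Job("succeeded", now - timedelta(days=20)),  # outside the window
    Job("cancelled", now - timedelta(days=5)),
    Job("failed", now - timedelta(days=1)),
]
print(only_failures_in_window(jobs, now))  # True
```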

@terencecho
Contributor

@cgardens I have this set for only jobs of configType SYNC. Benoit mentioned we might want to include RESET_CONNECTION. Do you have any thoughts around which configTypes we should include?

@cgardens
Contributor Author

That makes sense to me to include it.

@terencecho
Contributor

When a new connection is set up and fails its first job, this will trigger the connection to be auto-disabled, since all the jobs in the last 14 days (so only that first job) will be failures. This behavior does not seem ideal for users: when they fix the connection, they will also need to remember to re-enable it.

One option is to make sure that the connection is at least 14 days old before doing that check, but this could allow many failures in that 14-day period. We can still do the 100 consecutive failures check in this case though.

@cgardens do you have any opinions on what behavior might be better for our users?

@cgardens
Contributor Author

hahaha. yeah, the way i was thinking about it, there is condition 1: disable due to 14 days of failures, OR condition 2: disable due to 100 consecutive failures. condition 1 can't trigger until there are at least 14 days of jobs.
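Putting the two conditions together, one possible sketch is below. It assumes "at least 14 days of jobs" means the oldest relevant job ended at least 14 days ago, and it only looks at the SYNC and RESET_CONNECTION config types discussed above. The `Job` shape, status strings, and function name are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

WINDOW = timedelta(days=14)
MAX_STREAK = 100
CHECKED_CONFIG_TYPES = {"SYNC", "RESET_CONNECTION"}


@dataclass
class Job:
    config_type: str
    status: str          # hypothetical values: "succeeded", "failed", "cancelled"
    ended_at: datetime


def should_auto_disable(jobs: List[Job], now: datetime) -> bool:
    relevant = sorted(
        (j for j in jobs if j.config_type in CHECKED_CONFIG_TYPES),
        key=lambda j: j.ended_at,
    )
    if not relevant:
        return False

    # Condition 2: 100 consecutive failures, regardless of connection age.
    streak = 0
    for job in reversed(relevant):
        if job.status != "failed":
            break
        streak += 1
    if streak >= MAX_STREAK:
        return True

    # Condition 1: only failures in the last 14 days, and only once the
    # job history itself spans at least 14 days (assumed reading).
    has_full_window = now - relevant[0].ended_at >= WINDOW
    recent = [j for j in relevant if now - j.ended_at <= WINDOW]
    failed = any(j.status == "failed" for j in recent)
    succeeded = any(j.status == "succeeded" for j in recent)
    return has_full_window and failed and not succeeded


now = datetime(2022, 2, 15)
# A brand-new connection whose first job failed is left enabled.
print(should_auto_disable([Job("SYNC", "failed", now)], now))  # False
```

With this guard a new connection that fails its first job stays enabled, while a fast-running new connection can still be disabled by the 100-failure condition.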

@terencecho
Contributor

Update for this issue. This feature can be turned on by setting the AUTO_DISABLE_FAILING_CONNECTIONS feature flag to true. It's currently not turned on anywhere. We're still looking into making it possible to notify users when a connection has been auto-disabled. You can follow along with that issue here. I'm assuming once we have notifications set up, we can turn it on for workspaces that make sense to have them.
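As a rough illustration of gating on the flag, the sketch below assumes the flag is exposed to the process as an environment variable with the same name and that only the string "true" enables it. The helper name and the parsing are assumptions, not Airbyte's actual feature-flag code.

```python
import os
from typing import Mapping

FLAG_NAME = "AUTO_DISABLE_FAILING_CONNECTIONS"  # flag name from the comment above


def auto_disable_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Hypothetical helper: the feature is off unless the flag is 'true'."""
    return env.get(FLAG_NAME, "false").strip().lower() == "true"


print(auto_disable_enabled({FLAG_NAME: "true"}))  # True
print(auto_disable_enabled({}))                   # False
```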

7 participants