Persist Job Failure summaries for failed and cancelled attempts #9527
Conversation
looks great to me.
i left one sort of open question about how we should track retriability. i don't think we need to block on it as it should be easy enough to add in after the fact.
type: boolean
cancelledBy:
  description: The user who cancelled the attempt.
  type: string # TODO how do we represent users?
it should refer to whatever uuid we use in the user table in the cloud database i think.
stacktrace:
  description: Raw stacktrace associated with the failure.
  type: string
retryable:
@edgao was sharing some feelings on this field that made a lot of sense to me (and that were missing in the original spec). Parker, I also think I'm contradicting something I said to you last week, so sorry for the change of mind. The idea is that we should have the concepts of: quarantined for the user to fix, quarantined for airbyte to fix, retry, and unknown. Would love to figure out how to add that to this struct somehow. @edgao wdyt?
Could `failureType` be directly replaced with the error class? And then remove the `retryable` entry. Something along these lines:
- unknown -> unknown
- permissionError -> userError
- systemRestart -> retry (?)
- zombie -> airbyteError (?)
- manualCancellation -> notAnError
I'm kind of torn on whether to keep the more granular `failureType` - on the one hand, it could be interesting to display stats like "in the last 30 days, 60% of your errors were permissions-related" (maybe this could nudge users to have better processes around managing access, or encourage us to build certain features, etc). But this enum feels like it could easily become a very long list that contains every error under the sun, up to and including `targetWidgetDidNotExistBecauseSourceWhizbangWasImproperlyCalibrated`.
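The mapping proposed above can be sketched in a few lines of Java (a hypothetical illustration only — the class and method names here are not from this PR):

```java
// Hypothetical sketch of collapsing the existing failureType values onto the
// proposed high-level error classes. All names are illustrative.
class ErrorClassMapping {

  enum ErrorClass { UNKNOWN, USER_ERROR, RETRY, AIRBYTE_ERROR, NOT_AN_ERROR }

  static ErrorClass classify(final String failureType) {
    switch (failureType) {
      case "permissionError": return ErrorClass.USER_ERROR;     // user must fix access
      case "systemRestart": return ErrorClass.RETRY;            // transient, safe to retry
      case "zombie": return ErrorClass.AIRBYTE_ERROR;           // platform-side problem
      case "manualCancellation": return ErrorClass.NOT_AN_ERROR;
      default: return ErrorClass.UNKNOWN;
    }
  }

  public static void main(String[] args) {
    System.out.println(classify("permissionError")); // USER_ERROR
  }
}
```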
I think this comment thread from the original tech spec is very relevant: https://docs.google.com/document/d/1vy2iWmiGK5gptLCJVmzYckQj24zFuwF6C_Q-YIlP0is/edit?disco=AAAATDVxx6o .
No worries about contradicting anything! It makes sense that we'd want to iterate on what we want to do with the schema now that we can see how it'll actually interact with and exist in the codebase. I think replacing `retryable` (boolean) with an enum that captures those different resolution paths is a great call.
`failureType` definitely feels hard to keep reasonably scoped. @edgao when you say "error class", do you mean like the actual `Exception.class` value of the original exception?
cool. enum makes sense to me!
sorry, error class as in unknown/userError/retry/etc. Wanted to avoid overloading the term `failureType` in my comment but am bad at naming >.>
That comment thread does look relevant! Gonna put thoughts here b/c gdoc comments aren't great for longform discussion IMO.
user intervention
@cgardens does "quarantined for TS" exist in this usecase? To my understanding, we (airbyte) don't really get involved in actually configuring anything, and everything is just configured by the user, so there's no real TS-equivalent. But maybe I'm missing something there.
more granular failure reasons
I strongly believe the translation from "granular failure reason" (e.g. `permissionError`) to "high-level failure handling strategy" (`userError` + user-friendly message) should live as close to the source of the error (`destination-s3`) as possible. Then the UI/orchestrator can just check that high-level failure type, rather than needing to know about all the different granular failure reasons.
@cgardens what sort of usecases are you envisioning by having the more granular failure reason? I'm seeing it mostly as useful for a human analyst to look at.
What if we changed `failureType` to a string here, but in each individual component it's actually an enum? So e.g. the worker might emit one of `zombie|idk|whatever`, and destination-s3 could emit one of `missingPermissions|networkTimeout|bucketDoesNotExist|etc`.
Then it would take some more effort during analysis to map different granular reasons together (e.g. if source-shopify emits a `badOauth` failure, but source-google-ads emits `invalidOauth`, we'd need to tie those together somehow), but at least we'd capture it in a somewhat programmatically-friendly way. They could probably also be standardized to some extent (e.g. any blob store connector could have `bucketDoesNotExist`, any oauth connector could have `invalidClientId`, etc).
(I'm obviously thinking about this in a very connector-centric way, but I think it extends to platform-ish areas fairly intuitively)
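The split being proposed — granular reason decided inside the connector, high-level class consumed by the platform — can be pictured roughly like this (a sketch; the `FailureReason` shape, field names, and error classes here are illustrative stand-ins, not the PR's generated models):

```java
// Hypothetical sketch: each component emits its own granular failureType
// string but always attaches a shared high-level error class, so the
// orchestrator/UI never needs to know about connector-specific reasons.
class FailureReasonSketch {

  enum ErrorClass { UNKNOWN, USER_ERROR, RETRY, AIRBYTE_ERROR, NOT_AN_ERROR }

  static class FailureReason {
    final String origin;          // e.g. "source", "destination"
    final String failureType;     // granular, component-specific string
    final ErrorClass errorClass;  // high-level handling strategy

    FailureReason(String origin, String failureType, ErrorClass errorClass) {
      this.origin = origin;
      this.failureType = failureType;
      this.errorClass = errorClass;
    }
  }

  // destination-s3 decides both the granular reason and the high-level
  // strategy, as close to the source of the error as possible.
  static FailureReason destinationS3MissingPermissions() {
    return new FailureReason("destination", "missingPermissions", ErrorClass.USER_ERROR);
  }

  public static void main(String[] args) {
    FailureReason reason = destinationS3MissingPermissions();
    // The platform only branches on the high-level class.
    System.out.println(reason.errorClass == ErrorClass.USER_ERROR);
  }
}
```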
This can relate to some discussion we had around the new scheduler. The plan was to have two types of exceptions thrown by activities: `RetryableException` and `NonRetryableException`. We didn't have time to properly define what is retryable and what is not. The new scheduler is supposed to be designed to handle the scenario you describe, and to provide the tools to list the quarantined syncs as well as the signal and query methods to return the cause and unstick a sync in the long term.
@edgao I think you raise some great points that we'll definitely want to consider as we iterate on error/failure handling throughout the platform. I think we'll eventually want to "incentivize" connectors to conform to some standardized form of error/failure reporting, perhaps even defining classes of failures with the expected platform handling in a future version of the protocol.
For now, I think the goal of this PR should primarily be to:
- make it possible to differentiate between source and destination failures today
- create a baseline schema for capturing failure metadata that can serve as a starting point for further iteration
I think the `failureSource` solves for goal #1, and we can solve for #2 by supporting an optional `failureType` string field like you suggested, with the expectation that we'll want to iterate towards some sort of standardization before we start deriving much value from it.
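Sketching what that might look like in the failure-reason schema (illustrative only — the field name follows the later rename to `failureOrigin` noted in this PR's commits, but the enum values and layout here are assumptions, not the merged definition):

```yaml
FailureReason:
  type: object
  properties:
    failureOrigin:
      description: Which component the failure originated from (goal 1).
      type: string
      enum:
        - source
        - destination
        - unknown
    failureType:
      description: Optional, more granular failure type (goal 2); free-form string pending standardization.
      type: string
```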
cool, that makes sense to me 👍
import java.util.stream.Collectors;
import org.apache.commons.lang3.exception.ExceptionUtils;

public class FailureHelper {
i like this class. keeps all of the code in the complicated parts of the code path easier to read!
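For readers without the diff in front of them, a helper along these lines centralizes construction of the failure metadata (a sketch using only the JDK — the real class uses `org.apache.commons.lang3.exception.ExceptionUtils`, and the method and model names below are assumptions):

```java
import java.io.PrintWriter;
import java.io.StringWriter;

// Hypothetical sketch of a FailureHelper-style utility: one place that turns
// a Throwable into the failure metadata the workflow persists.
class FailureHelperSketch {

  // Illustrative stand-in for the generated FailureReason model.
  static class FailureReason {
    final String origin;
    final String message;
    final String stacktrace;

    FailureReason(String origin, String message, String stacktrace) {
      this.origin = origin;
      this.message = message;
      this.stacktrace = stacktrace;
    }
  }

  static FailureReason sourceFailure(Throwable t) {
    return genericFailure("source", t);
  }

  static FailureReason destinationFailure(Throwable t) {
    return genericFailure("destination", t);
  }

  private static FailureReason genericFailure(String origin, Throwable t) {
    // Capture the raw stacktrace as a string, as the schema's
    // "stacktrace" field suggests.
    StringWriter sw = new StringWriter();
    t.printStackTrace(new PrintWriter(sw));
    return new FailureReason(origin, t.getMessage(), sw.toString());
  }
}
```

Keeping this in one utility is what makes the call sites in the workflow code easy to read, as noted above.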
- permissionError
- systemRestart
- zombie
- manualCancellation
this isn't actually an error, right? Is it just included here because it's surfaced via an exception?
Yeah there was a bit of back and forth on this in the tech spec. It might make sense to leave it out of the failure summary and instead just have a top-level `cancelledBy` column on the attempts table for audit-trail purposes.
@cgardens now that you see it in code, how do you feel about it? I personally think it feels a bit unintuitive as-is. I think this is what you were getting at in the tech spec thread: in order for this approach to make sense, we would want to use a `failed` status for cancelled jobs rather than having an entirely-separate `cancelled` status.
Parker, I agree with what you suggested. Let's do it.
Minor cosmetic comments.
@@ -129,10 +136,25 @@ public void run(final ConnectionUpdaterInput connectionUpdaterInput) throws Retr
StandardSyncSummary standardSyncSummary = standardSyncOutput.get().getStandardSyncSummary();

if (standardSyncSummary != null && standardSyncSummary.getStatus() == ReplicationStatus.FAILED) {
  failures.addAll(standardSyncOutput.get().getFailures());
  partialSuccess = standardSyncOutput.get().getStandardSyncSummary().getTotalStats().getRecordsCommitted() > 0;
Nit: You can use the variable `standardSyncSummary` instead of `standardSyncOutput.get().getStandardSyncSummary()` here.
if (childWorkflowFailure.getCause() instanceof CanceledFailure) {
  // do nothing, cancellation handled by cancellationScope
} else if (childWorkflowFailure.getCause() instanceof ActivityFailure) {
Nit: it can be refactored as:
else {
  switch (childWorkflowFailure.getCause()) {
    case ActivityFailure a: ...
    default: ...
  }
  throw childWorkflowFailure;
}
@cgardens I know you approved a while ago, could you give this another review before I consider it ready-to-merge? A few things have changed a bit since your initial review, namely:
* add FailureHelper
* add jobPersistence method for writing failure summary
* record source/destination failures and include them in ReplicationOutput and StandardSyncOutput
* handle failures in ConnectionManagerWorkflow, persist them when failing/cancelling an attempt
* rename attempt to attempt_id in FailureHelper
* test that ConnectionManagerWorkflow correctly records failures
* only set failures on ReplicationOutput if a failure actually occurred
* test that source or destination failure results in correct failureReason
* remove cancellation from failure summaries
* formatting, cleanup
* remove failureSummaryForCancellation
* rename failureSource -> failureOrigin, delete retryable, clarify failureType enum values
* actually persist attemptFailureSummary now that column exists
* use attemptNumber instead of attemptId where appropriate
* small fixes
* formatting
* use maybeAttemptId instead of connectionUpdaterInput.getAttemptNumber
* missed rename from failureSource to failureOrigin
What
This PR introduces a schema for capturing information about failures that occur during sync jobs.
How
The `ConnectionManagerWorkflow` collects failure information and persists it to the `Attempts` table when an attempt fails or is cancelled. The `DefaultReplicationWorker` now records failures thrown from within the source and/or destination, and includes failure information in the `ReplicationOutput`. This failure information is then written to `StandardSyncOutput` in the replication activity, so that the `ConnectionManagerWorkflow` can access and persist it.
The `ConnectionManagerWorkflow` also catches all `childWorkflowExceptions` from the sync workflow. If the exception originated from an activity (for example, normalization), then the failure source is identified from the `ActivityFailure`. Otherwise, the failure source is recorded as 'unknown'.
Recommended reading order
I recommend reading commit by commit.
Misc
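The failure-source dispatch described in the How section above can be pictured roughly like this (Temporal's `io.temporal.failure.CanceledFailure` and `io.temporal.failure.ActivityFailure` are stubbed out with minimal stand-ins here, so this is a sketch of the control flow, not the actual workflow code):

```java
// Sketch of the childWorkflowFailure dispatch: cancellations are ignored
// (handled by the cancellation scope), activity failures are mapped to a
// failure origin from the activity type, everything else is 'unknown'.
class FailureOriginDispatch {

  // Minimal stand-ins for Temporal's failure types.
  static class CanceledFailure extends RuntimeException {}

  static class ActivityFailure extends RuntimeException {
    final String activityType;
    ActivityFailure(String activityType) { this.activityType = activityType; }
  }

  static String originOf(Throwable cause) {
    if (cause instanceof CanceledFailure) {
      // cancellation is not recorded as a failure
      return "cancellation";
    } else if (cause instanceof ActivityFailure) {
      // e.g. the normalization activity failing yields a normalization origin
      return ((ActivityFailure) cause).activityType;
    } else {
      return "unknown";
    }
  }
}
```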