Persist Job Failure summaries for failed and cancelled attempts #9527

pmossman · 2022-01-15T02:42:17Z

What

This PR introduces a schema for capturing information about failures that occur during sync jobs.

How

The ConnectionManagerWorkflow collects failure information and persists it to the Attempts table when an attempt fails or is cancelled. The DefaultReplicationWorker now records failures thrown from within the source and/or destination, and includes failure information in the ReplicationOutput. This failure information is then written to StandardSyncOutput in the replication activity, so that the ConnectionManagerWorkflow can access and persist it.
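To make the data flow above concrete, here is a minimal sketch of the worker-side half: run both sides, record which one threw, and carry the failures on the output. Every name in it (ReplicationFailureSketch, Origin, Failure, Output, replicate) is a simplified stand-in invented for illustration, not the actual DefaultReplicationWorker or ReplicationOutput API, and the two sides run sequentially here purely to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

final class ReplicationFailureSketch {

  /** Which side of the sync threw (hypothetical stand-in for the failure origin). */
  enum Origin { SOURCE, DESTINATION }

  /** Simplified stand-in for a single recorded failure. */
  record Failure(Origin origin, String message) {}

  /** Simplified stand-in for ReplicationOutput: the result plus any failures. */
  record Output(boolean succeeded, List<Failure> failures) {}

  /** Runs both sides and records which one threw instead of losing that information. */
  static Output replicate(Runnable source, Runnable destination) {
    List<Failure> failures = new ArrayList<>();
    runAndRecord(source, Origin.SOURCE, failures);
    runAndRecord(destination, Origin.DESTINATION, failures);
    return new Output(failures.isEmpty(), List.copyOf(failures));
  }

  private static void runAndRecord(Runnable step, Origin origin, List<Failure> failures) {
    try {
      step.run();
    } catch (RuntimeException e) {
      // Keep the origin next to the message so a caller can persist both.
      failures.add(new Failure(origin, e.getMessage()));
    }
  }

  public static void main(String[] args) {
    Output output = replicate(
        () -> { throw new IllegalStateException("source crashed"); },
        () -> { });
    System.out.println(output);
  }
}
```

A caller in the position of the workflow would then read the failures from the output and persist them with the attempt.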

The ConnectionManagerWorkflow also catches all childWorkflowExceptions from the sync workflow. If the exception originated from an activity (for example, normalization), then the failure source is identified from the ActivityFailure. Otherwise, the failure source is recorded as 'unknown'.
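As a rough sketch of that classification rule: inspect the cause of the child workflow exception, use the activity information if there is any, and fall back to 'unknown' otherwise. The exception classes below are hypothetical stand-ins defined only for this example; they are not Temporal's ActivityFailure or ChildWorkflowFailure, and the activity name used as the failure source is an assumption.

```java
final class FailureSourceSketch {

  /** Hypothetical stand-in for an activity failure; carries the failing activity's name. */
  static final class ActivityFailed extends RuntimeException {
    final String activityName;

    ActivityFailed(String activityName, Throwable cause) {
      super("activity failed: " + activityName, cause);
      this.activityName = activityName;
    }
  }

  /** Hypothetical stand-in for the exception a failed child workflow surfaces to its parent. */
  static final class ChildWorkflowFailed extends RuntimeException {
    ChildWorkflowFailed(Throwable cause) {
      super("child workflow failed", cause);
    }
  }

  /** Returns the failing activity's name if the cause is an activity failure, else "unknown". */
  static String failureSource(ChildWorkflowFailed failure) {
    Throwable cause = failure.getCause();
    if (cause instanceof ActivityFailed) {
      return ((ActivityFailed) cause).activityName;
    }
    return "unknown";
  }

  public static void main(String[] args) {
    ChildWorkflowFailed failure =
        new ChildWorkflowFailed(new ActivityFailed("normalization", new RuntimeException("dbt failed")));
    System.out.println(failureSource(failure));
  }
}
```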

Misc

WIP: I still need to write tests, but I wanted to get this up for feedback ASAP.
This PR doesn't include failure recording in the old scheduler. Should we add failure-recording logic to the old schedulerApp code as well? I have a local branch with some progress on the old scheduler, but depending on the timeline for switching over to the new one, it may not be worth pursuing.
I haven't written a migration to add the new column yet, so for now it just logs what it would have persisted.
This is branched off of Benoit's WIP branch; I'll periodically rebase onto his to keep things up to date.

cgardens

looks great to me.

i left one sort of open question about how we should track retriability. i don't think we need to block on it as it should be easy enough to add in after the fact.

cgardens · 2022-01-17T03:03:09Z

airbyte-config/models/src/main/resources/types/AttemptFailureSummary.yaml

+    type: boolean
+  cancelledBy:
+    description: The user who cancelled the attempt.
+    type: string # TODO how do we represent users?


it should refer to whatever uuid we use in the user table in the cloud database i think.

cgardens · 2022-01-17T03:05:51Z

airbyte-config/models/src/main/resources/types/FailureReason.yaml

+  stacktrace:
+    description: Raw stacktrace associated with the failure.
+    type: string
+  retryable:


@edgao was sharing some feelings on this field that made a lot of sense to me (and that were missing from the original spec). Parker, I also think I'm contradicting something I said to you last week, so sorry for the change of mind. The idea is that we should have the concepts of: quarantine for the user to fix it, quarantine for airbyte to fix it, retry, and unknown. would love to figure out how to add that to this struct somehow. @edgao wdyt?

Could failureType be directly replaced with the error class? And then remove the retryable entry. Something along these lines:

unknown -> unknown

permissionError -> userError

systemRestart -> retry (?)

zombie -> airbyteError (?)

manualCancellation -> notAnError

I'm kind of torn on whether to keep the more granular failureType - on the one hand, it could be interesting to display stats like "in the last 30 days, 60% of your errors were permissions-related" (maybe this could nudge users to have better processes around managing access, or encourage us to build certain features, etc). But this enum feels like it could easily become a very long list that contains every error under the sun, up to and including targetWidgetDidNotExistBecauseSourceWhizbangWasImproperlyCalibrated.
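For illustration, the mapping proposed above could be expressed by attaching the error class to each granular value. This is only a sketch of the idea as written in this comment; the enum and field names are made up for the example, and the two entries marked "(?)" above are just as tentative here.

```java
final class ErrorClassSketch {

  /** The coarse classes proposed in this comment (names are illustrative). */
  enum ErrorClass { UNKNOWN, USER_ERROR, RETRY, AIRBYTE_ERROR, NOT_AN_ERROR }

  /** The granular values from the draft schema, each tagged with its proposed class. */
  enum FailureType {
    UNKNOWN(ErrorClass.UNKNOWN),
    PERMISSION_ERROR(ErrorClass.USER_ERROR),
    SYSTEM_RESTART(ErrorClass.RETRY),          // tentative: marked "(?)" above
    ZOMBIE(ErrorClass.AIRBYTE_ERROR),          // tentative: marked "(?)" above
    MANUAL_CANCELLATION(ErrorClass.NOT_AN_ERROR);

    final ErrorClass errorClass;

    FailureType(ErrorClass errorClass) {
      this.errorClass = errorClass;
    }
  }

  public static void main(String[] args) {
    for (FailureType type : FailureType.values()) {
      System.out.println(type + " -> " + type.errorClass);
    }
  }
}
```

Written this way, anything that only cares about handling can read errorClass and ignore how long the granular list gets.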

I think this comment thread from the original tech spec is very relevant: https://docs.google.com/document/d/1vy2iWmiGK5gptLCJVmzYckQj24zFuwF6C_Q-YIlP0is/edit?disco=AAAATDVxx6o .

No worries about contradicting anything! It makes sense that we'd want to iterate on what we want to do with the schema now that we can see how it'll actually interact with and exist in the codebase. I think replacing retryable (boolean) with an enum that captures those different resolution paths is a great call.

failureType definitely feels hard to keep reasonably scoped. @edgao when you say "error class", do you mean like the actual Exception.class value of the original exception?

cool. enum makes sense to me!

sorry, error class as in unknown/userError/retry/etc. Wanted to avoid overloading the term failureType in my comment but am bad at naming >.>

That comment thread does look relevant! Gonna put thoughts here b/c gdoc comments aren't great for longform discussion IMO.

user intervention

@cgardens does "quarantined for TS" exist in this usecase? To my understanding, we (airbyte) don't really get involved in actually configuring anything, and everything is just configured by the user, so there's no real TS-equivalent. But maybe I'm missing something there.

more granular failure reasons

I strongly believe the translation from "granular failure reason" (e.g. permissionError) to "high-level failure handling strategy" (userError + user-friendly message) should live as close to the source of the error (destination-s3) as possible. Then the UI/orchestrator can just check that high-level failure type, rather than needing to know about all the different granular failure reasons.

@cgardens what sort of usecases are you envisioning by having the more granular failure reason? I'm seeing it mostly as useful for a human analyst to look at.

What if we changed failureType to a string here, but in each individual component is actually an enum? So e.g. the worker might emit one of zombie|idk|whatever, and destination-s3 could emit one of missingPermissions|networkTimeout|bucketDoesNotExist|etc.

Then it would take some more effort during analysis to map different granular reasons together (e.g. if source-shopify emits a badOauth failure, but source-google-ads emits invalidOauth, we'd need to tie those together somehow), but at least we'd capture it in a somewhat programmatically-friendly way. They could probably also be standardized to some extent (e.g. any blob store connector could have bucketDoesNotExist, any oauth connector could have invalidClientId, etc).

(I'm obviously thinking about this in a very connector-centric way. but I think it extends to platform-ish areas fairly intuitively)
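A small sketch of what that could look like: each component keeps its own enum, the shared schema only sees a string, and an analysis step ties equivalent reasons together. All names here are hypothetical (the enums, the FailureReason record, the CANONICAL map), and treating invalidOauth as the canonical spelling of badOauth is an arbitrary choice for the example.

```java
import java.util.Map;

final class GranularFailureTypeSketch {

  /** Component-local enums: each component owns its own list of granular reasons. */
  enum S3Failure { missingPermissions, networkTimeout, bucketDoesNotExist }

  enum WorkerFailure { zombie, unknown }

  /** Shared schema: failureType is just a string, so no central enum has to grow. */
  record FailureReason(String origin, String failureType) {}

  /** Converts any component enum value into the shared string form. */
  static FailureReason from(String origin, Enum<?> componentFailure) {
    return new FailureReason(origin, componentFailure.name());
  }

  /** Analysis-time mapping that ties equivalent reasons from different connectors together. */
  static final Map<String, String> CANONICAL = Map.of("badOauth", "invalidOauth");

  static String canonical(String failureType) {
    return CANONICAL.getOrDefault(failureType, failureType);
  }

  public static void main(String[] args) {
    System.out.println(from("destination", S3Failure.bucketDoesNotExist));
    System.out.println(canonical("badOauth"));
  }
}
```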

This relates to some discussion we had around the new scheduler. The plan was to have two types of exceptions thrown by activities: RetryableException and NonRetryableException. We didn't have time to properly define what is retryable and what is not. The new scheduler is supposed to be designed to handle the scenario you describe and, in the long term, to provide the tools to list quarantined syncs, as well as the signal and query methods to return the cause and unstick a sync.
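A rough sketch of how the two exception types could drive handling, assuming a plain retry loop rather than anything Temporal-specific. The exception classes are defined locally for the example and runWithRetries is a hypothetical helper; what actually counts as retryable is exactly the part that is still undefined.

```java
import java.util.function.Supplier;

final class RetryabilitySketch {

  /** Locally defined for the example: the caller may try again. */
  static class RetryableException extends RuntimeException {
    RetryableException(String message) {
      super(message);
    }
  }

  /** Locally defined for the example: trying again will not help. */
  static class NonRetryableException extends RuntimeException {
    NonRetryableException(String message, Throwable cause) {
      super(message, cause);
    }
  }

  /** Retries only on RetryableException; anything else propagates immediately. */
  static <T> T runWithRetries(Supplier<T> activity, int maxAttempts) {
    RetryableException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return activity.get();
      } catch (RetryableException e) {
        last = e; // remember the most recent failure and try again
      }
    }
    throw new NonRetryableException("gave up after " + maxAttempts + " attempts", last);
  }

  public static void main(String[] args) {
    int[] calls = {0};
    String result = runWithRetries(() -> {
      if (++calls[0] < 3) {
        throw new RetryableException("flaky");
      }
      return "ok";
    }, 5);
    System.out.println(result + " after " + calls[0] + " calls");
  }
}
```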

@edgao I think you raise some great points that we'll definitely want to consider as we iterate on error/failure handling throughout the platform. I think we'll eventually want to "incentivize" connectors to conform to some standardized form of error/failure reporting, perhaps even defining classes of failures with the expected platform handling in a future version of the protocol.

For now, I think the goal of this PR should primarily be to:

make it possible to differentiate between source and destination failures today

create a baseline schema for capturing failure metadata that can serve as a starting point for further iteration

I think the failureSource solves for goal #1, and we can solve for #2 by supporting an optional failureType string field like you suggested, with the expectation that we'll want to iterate towards some sort of standardization before we start deriving much value from it.

cool, that makes sense to me 👍

cgardens · 2022-01-17T03:07:04Z

airbyte-workers/src/main/java/io/airbyte/workers/helper/FailureHelper.java

+import java.util.stream.Collectors;
+import org.apache.commons.lang3.exception.ExceptionUtils;
+
+public class FailureHelper {


i like this class. keeps all of the code in the complicated parts of the code path easier to read!

edgao · 2022-01-17T18:27:38Z

airbyte-config/models/src/main/resources/types/FailureReason.yaml

+  stacktrace:
+    description: Raw stacktrace associated with the failure.
+    type: string
+  retryable:


Could failureType be directly replaced with the error class? And then remove the retryable entry. Something along these lines:

unknown -> unknown

permissionError -> userError

systemRestart -> retry (?)

zombie -> airbyteError (?)

manualCancellation -> notAnError

I'm kind of torn on whether to keep the more granular failureType - on the one hand, it could be interesting to display stats like "in the last 30 days, 60% of your errors were permissions-related" (maybe this could nudge users to have better processes around managing access, or encourage us to build certain features, etc). But this enum feels like it could easily become a very long list that contains every error under the sun, up to and including targetWidgetDidNotExistBecauseSourceWhizbangWasImproperlyCalibrated.

edgao · 2022-01-17T18:28:27Z

airbyte-config/models/src/main/resources/types/FailureReason.yaml

+      - permissionError
+      - systemRestart
+      - zombie
+      - manualCancellation


this isn't actually an error, right? Is it just included here because it's surfaced via an exception?

Yeah there was a bit of back and forth on this in the tech spec. It might make sense to leave it out of the failure summary and instead just have a top-level cancelledBy column on the attempts table for audit-trail purposes.

@cgardens now that you see it in code, how do you feel about it? I personally think it feels a bit unintuitive as-is. I think this is what you were getting at in the tech spec thread, like in order for this approach to make sense, we would want to use a failed status for cancelled jobs rather than having an entirely-separate cancelled status.

Parker, I agree with what you suggested. Let's do it.

benmoriceau

Minor cosmetics comments.

benmoriceau · 2022-01-18T21:39:38Z

...kers/src/main/java/io/airbyte/workers/temporal/scheduling/ConnectionManagerWorkflowImpl.java

@@ -129,10 +136,25 @@ public void run(final ConnectionUpdaterInput connectionUpdaterInput) throws Retr
              StandardSyncSummary standardSyncSummary = standardSyncOutput.get().getStandardSyncSummary();

              if (standardSyncSummary != null && standardSyncSummary.getStatus() == ReplicationStatus.FAILED) {
+                failures.addAll(standardSyncOutput.get().getFailures());
+                partialSuccess = standardSyncOutput.get().getStandardSyncSummary().getTotalStats().getRecordsCommitted() > 0;


Nit: You can use the variable standardSyncSummary instead of standardSyncOutput.get().getStandardSyncSummary() here.

benmoriceau · 2022-01-18T21:47:11Z

...kers/src/main/java/io/airbyte/workers/temporal/scheduling/ConnectionManagerWorkflowImpl.java

+              if (childWorkflowFailure.getCause() instanceof CanceledFailure) {
+                // do nothing, cancellation handled by cancellationScope
+
+              } else if (childWorkflowFailure.getCause() instanceof ActivityFailure) {


Nit: this can be refactored into:
else {
  switch (childWorkflowFailure.getCause()) {
    case ActivityFailure a: ...
    default: ...
  }
  throw childWorkflowFailure;
}

From: https://docs.oracle.com/en/java/javase/17/language/pattern-matching-switch-expressions-and-statements.html
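One caveat on the suggestion: pattern matching for switch was still a preview feature in Java 17 (it needs --enable-preview there) and only became final in Java 21, whereas pattern matching for instanceof has been final since Java 16. A sketch of the same dispatch using instanceof patterns, with stub exception classes defined only for this example (they are not Temporal's CanceledFailure or ActivityFailure):

```java
final class CauseDispatchSketch {

  /** Stub defined for this example only. */
  static final class StubCanceledFailure extends RuntimeException {}

  /** Stub defined for this example only; carries a made-up activity name. */
  static final class StubActivityFailure extends RuntimeException {
    final String activityName;

    StubActivityFailure(String activityName) {
      this.activityName = activityName;
    }
  }

  /** Dispatches on the cause type; the pattern variable removes the explicit cast. */
  static String describe(RuntimeException childWorkflowFailure) {
    Throwable cause = childWorkflowFailure.getCause();
    if (cause instanceof StubCanceledFailure) {
      return "cancelled";
    } else if (cause instanceof StubActivityFailure activityFailure) {
      return "activity:" + activityFailure.activityName;
    } else {
      return "unknown";
    }
  }

  public static void main(String[] args) {
    RuntimeException failure =
        new RuntimeException("child failed", new StubActivityFailure("normalization"));
    System.out.println(describe(failure));
  }
}
```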

pmossman · 2022-01-21T18:30:12Z

@cgardens I know you approved a while ago, could you give this another review before I consider it ready-to-merge? A few things have changed a bit since your initial review, namely:

Removed the 'retryable' boolean in favor of 'failureType', which is now an enum with the values userError, systemError, transient, and unknown. I expect this to evolve over time, and we currently aren't populating this enum at all, but I'm hoping it serves as a baseline that we can start to populate once we begin programmatically handling well-known failures. My thinking is that we can eventually surface something like a recommended resolution path to the user: userError -> the user needs to intervene (fix their config, etc.); systemError -> Airbyte needs to intervene, no action for the user to take; transient -> recommend a retry to the user, since it might work.
Renamed failureSource to failureOrigin because source is overloaded
Added tests, particularly in the ConnectionManagerWorkflow to ensure that we at least have test coverage for the failureOrigin part of the schema, as that is definitely the most immediately-valuable feature of this project right now
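To illustrate the resolution-path idea, here is a minimal sketch of the failureType enum carrying a recommendation, with an unset failureType treated as unknown since nothing populates it yet. The class, record, and method names and the recommendation strings are assumptions made for this example, not the generated model classes.

```java
final class FailureTypeSketch {

  /** The four values listed above, each with an illustrative recommendation string. */
  enum FailureType {
    USER_ERROR("User needs to intervene, for example by fixing their configuration."),
    SYSTEM_ERROR("Airbyte needs to intervene; no action for the user to take."),
    TRANSIENT("Retry the sync; it might work."),
    UNKNOWN("No recommendation available.");

    final String resolutionPath;

    FailureType(String resolutionPath) {
      this.resolutionPath = resolutionPath;
    }
  }

  /** Simplified stand-in for the failure schema; failureType may be null for now. */
  record FailureReason(String failureOrigin, FailureType failureType, String stacktrace) {}

  /** Falls back to UNKNOWN when failureType has not been populated. */
  static String recommendation(FailureReason reason) {
    FailureType type = reason.failureType() == null ? FailureType.UNKNOWN : reason.failureType();
    return type.resolutionPath;
  }

  public static void main(String[] args) {
    System.out.println(recommendation(new FailureReason("source", FailureType.TRANSIENT, "...")));
    System.out.println(recommendation(new FailureReason("destination", null, "...")));
  }
}
```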

* add FailureHelper
* add jobPersistence method for writing failure summary
* record source/destination failures and include them in ReplicationOutput and StandardSyncOutput
* handle failures in ConnectionManagerWorkflow, persist them when failing/cancelling an attempt
* rename attempt to attempt_id in FailureHelper
* test that ConnectionManagerWorkflow correctly records failures
* only set failures on ReplicationOutput if a failure actually occurred
* test that source or destination failure results in correct failureReason
* remove cancellation from failure summaries
* formatting, cleanup
* remove failureSummaryForCancellation
* rename failureSource -> failureOrigin, delete retryable, clarify failureType enum values
* actually persist attemptFailureSummary now that column exists
* use attemptNumber instead of attemptId where appropriate
* small fixes
* formatting
* use maybeAttemptId instead of connectionUpdaterInput.getAttemptNumber
* missed rename from failureSource to failureOrigin

pmossman requested review from benmoriceau and cgardens January 15, 2022 02:42

github-actions bot added the area/platform, area/scheduler, and area/worker labels Jan 15, 2022

cgardens approved these changes Jan 17, 2022

edgao reviewed Jan 17, 2022

benmoriceau reviewed Jan 18, 2022

benmoriceau force-pushed the bmoric/acceptance-test-update branch 12 times, most recently from a0f20f6 to 69b2ab0 Compare January 20, 2022 01:16

Base automatically changed from bmoric/acceptance-test-update to master January 20, 2022 02:16

pmossman force-pushed the parker/job-failure-off-of-benoit-branch branch from f25084f to 7bf3c4c Compare January 20, 2022 20:44

pmossman temporarily deployed to more-secrets January 20, 2022 20:46 Inactive

pmossman temporarily deployed to more-secrets January 20, 2022 21:15 Inactive

pmossman force-pushed the parker/job-failure-off-of-benoit-branch branch from 11d56a9 to 2ad5395 Compare January 20, 2022 23:23

pmossman temporarily deployed to more-secrets January 20, 2022 23:25 Inactive

pmossman temporarily deployed to more-secrets January 21, 2022 00:55 Inactive

pmossman force-pushed the parker/job-failure-off-of-benoit-branch branch from 42d36db to 4dd8d6f Compare January 21, 2022 18:16

github-actions bot added the area/server label Jan 21, 2022

pmossman temporarily deployed to more-secrets January 21, 2022 18:18 Inactive

pmossman force-pushed the parker/job-failure-off-of-benoit-branch branch from 4dd8d6f to 18b9518 Compare January 21, 2022 18:20

pmossman temporarily deployed to more-secrets January 21, 2022 18:22 Inactive

pmossman requested a review from cgardens January 21, 2022 18:30

pmossman force-pushed the parker/job-failure-off-of-benoit-branch branch from 18b9518 to 1f40d42 Compare January 22, 2022 00:49

pmossman temporarily deployed to more-secrets January 22, 2022 00:51 Inactive

pmossman temporarily deployed to more-secrets January 22, 2022 00:58 Inactive

pmossman requested a review from benmoriceau January 24, 2022 22:20

pmossman force-pushed the parker/job-failure-off-of-benoit-branch branch from b0308f1 to 67efb4d Compare January 24, 2022 22:51

pmossman temporarily deployed to more-secrets January 24, 2022 22:53 Inactive

pmossman temporarily deployed to more-secrets January 24, 2022 22:56 Inactive

pmossman temporarily deployed to more-secrets January 24, 2022 23:39 Inactive

pmossman force-pushed the parker/job-failure-off-of-benoit-branch branch from fc68d25 to 7a893e1 Compare January 25, 2022 21:41

pmossman temporarily deployed to more-secrets January 25, 2022 21:43 Inactive

benmoriceau approved these changes Jan 25, 2022

pmossman temporarily deployed to more-secrets January 25, 2022 22:08 Inactive

pmossman changed the title ~~[WIP] Persist Job Failure summaries for failed and cancelled attempts~~ Persist Job Failure summaries for failed and cancelled attempts Jan 25, 2022

pmossman merged commit 805c8d9 into master Jan 25, 2022

pmossman deleted the parker/job-failure-off-of-benoit-branch branch January 25, 2022 22:51

octavia-squidington-iii mentioned this pull request Jan 26, 2022

Bump Airbyte version from 0.35.10-alpha to 0.35.11-alpha #9820

Merged
