Persist Job Failure summaries for failed and cancelled attempts #9527

Merged
merged 1 commit into from
Jan 25, 2022

@pmossman pmossman commented Jan 15, 2022

What

This PR introduces a schema for capturing information about failures that occur during sync jobs.

How

The ConnectionManagerWorkflow collects failure information and persists it to the Attempts table when an attempt fails or is cancelled. The DefaultReplicationWorker now records failures thrown from within the source and/or destination, and includes failure information in the ReplicationOutput. This failure information is then written to StandardSyncOutput in the replication activity, so that the ConnectionManagerWorkflow can access and persist it.

The ConnectionManagerWorkflow also catches all childWorkflowExceptions from the sync workflow. If the exception originated from an activity (for example, normalization), then the failure source is identified from the ActivityFailure. Otherwise, the failure source is recorded as 'unknown'.
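
A minimal sketch of the classification rule described above, under stated assumptions: `FailureSourceSketch`, `ActivityFailureStub`, `FailureSource`, and the activity name `"Normalize"` are hypothetical stand-ins, not the classes in this PR or Temporal's actual failure types. It only illustrates "identify the failure source from the activity failure, otherwise record it as unknown".

```java
import java.util.Map;

// Hypothetical stand-ins: the real workflow inspects Temporal's failure types.
final class FailureSourceSketch {

  enum FailureSource { SOURCE, DESTINATION, NORMALIZATION, UNKNOWN }

  // Stand-in for an activity failure that records which activity raised it.
  static final class ActivityFailureStub extends RuntimeException {
    final String activityType;

    ActivityFailureStub(String activityType) {
      super("activity failed: " + activityType);
      this.activityType = activityType;
    }
  }

  // Assumed mapping from activity name to failure source (illustrative only).
  private static final Map<String, FailureSource> SOURCE_BY_ACTIVITY =
      Map.of("Normalize", FailureSource.NORMALIZATION);

  // If the child workflow failed inside an activity, derive the source from
  // that activity; every other kind of failure is recorded as UNKNOWN.
  static FailureSource sourceOf(Throwable childWorkflowFailure) {
    Throwable cause = childWorkflowFailure.getCause();
    if (cause instanceof ActivityFailureStub) {
      String activityType = ((ActivityFailureStub) cause).activityType;
      return SOURCE_BY_ACTIVITY.getOrDefault(activityType, FailureSource.UNKNOWN);
    }
    return FailureSource.UNKNOWN;
  }

  public static void main(String[] args) {
    Throwable fromActivity =
        new RuntimeException("sync failed", new ActivityFailureStub("Normalize"));
    Throwable other = new RuntimeException("sync failed", new IllegalStateException());
    System.out.println(sourceOf(fromActivity));
    System.out.println(sourceOf(other));
  }
}
```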

Recommended reading order

I recommend reading commit by commit

Misc

  • WIP: I still need to write tests, but I wanted to get this up for feedback ASAP.
  • This PR doesn't include failure recording in the old scheduler. Should we add failure-recording logic to the old schedulerApp code as well? I have a local branch with some progress on the old scheduler, but depending on the timeline for switching over to the new one, it may not be worth pursuing.
  • I haven't written a migration to add the new column, so for now it just logs what it would have persisted.
  • This is branched off of Benoit's WIP branch. I'll periodically rebase onto his to keep things up to date.

@github-actions github-actions bot added area/platform issues related to the platform area/scheduler area/worker Related to worker labels Jan 15, 2022
Contributor

@cgardens cgardens left a comment

looks great to me.

i left one sort of open question about how we should track retriability. i don't think we need to block on it as it should be easy enough to add in after the fact.

type: boolean
cancelledBy:
description: The user who cancelled the attempt.
type: string # TODO how do we represent users?
Contributor

it should refer to whatever uuid we use in the user table in the cloud database i think.

stacktrace:
description: Raw stacktrace associated with the failure.
type: string
retryable:
Contributor

@cgardens cgardens Jan 17, 2022

@edgao was sharing some feelings on this field that made a lot of sense to me (and that were missing from the original spec). Parker, I also think I'm contradicting something I said to you last week, so sorry for the change of mind. We should have the concepts of: quarantine for the user to fix it, quarantine for Airbyte to fix it, retry, and unknown. I would love to figure out how to add that to this struct somehow. @edgao wdyt?

Contributor

Could failureType be directly replaced with the error class? And then remove the retryable entry. Something along these lines:

  • unknown -> unknown
  • permissionError -> userError
  • systemRestart -> retry (?)
  • zombie -> airbyteError (?)
  • manualCancellation -> notAnError

I'm kind of torn on whether to keep the more granular failureType - on the one hand, it could be interesting to display stats like "in the last 30 days, 60% of your errors were permissions-related" (maybe this could nudge users to have better processes around managing access, or encourage us to build certain features, etc). But this enum feels like it could easily become a very long list that contains every error under the sun, up to and including targetWidgetDidNotExistBecauseSourceWhizbangWasImproperlyCalibrated.
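
The proposed mapping could be sketched roughly as follows. This is only an illustration: `ErrorClassSketch`, `GranularFailure`, `ErrorClass`, and `classify` are hypothetical names, and the two entries marked "(?)" in the list above remain tentative here as well.

```java
import java.util.EnumMap;
import java.util.Map;

final class ErrorClassSketch {

  // Granular reasons, as listed in the schema under review.
  enum GranularFailure { UNKNOWN, PERMISSION_ERROR, SYSTEM_RESTART, ZOMBIE, MANUAL_CANCELLATION }

  // Coarse handling strategy proposed in the comment above.
  enum ErrorClass { UNKNOWN, USER_ERROR, RETRY, AIRBYTE_ERROR, NOT_AN_ERROR }

  private static final Map<GranularFailure, ErrorClass> MAPPING =
      new EnumMap<>(GranularFailure.class);

  static {
    MAPPING.put(GranularFailure.UNKNOWN, ErrorClass.UNKNOWN);
    MAPPING.put(GranularFailure.PERMISSION_ERROR, ErrorClass.USER_ERROR);
    MAPPING.put(GranularFailure.SYSTEM_RESTART, ErrorClass.RETRY); // tentative "(?)"
    MAPPING.put(GranularFailure.ZOMBIE, ErrorClass.AIRBYTE_ERROR); // tentative "(?)"
    MAPPING.put(GranularFailure.MANUAL_CANCELLATION, ErrorClass.NOT_AN_ERROR);
  }

  // Anything without an explicit mapping falls back to UNKNOWN.
  static ErrorClass classify(GranularFailure failure) {
    return MAPPING.getOrDefault(failure, ErrorClass.UNKNOWN);
  }

  public static void main(String[] args) {
    for (GranularFailure failure : GranularFailure.values()) {
      System.out.println(failure + " -> " + classify(failure));
    }
  }
}
```

With a table like this, the UI or orchestrator would only branch on the five coarse values, no matter how long the granular list grows.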

Contributor Author

I think this comment thread from the original tech spec is very relevant: https://docs.google.com/document/d/1vy2iWmiGK5gptLCJVmzYckQj24zFuwF6C_Q-YIlP0is/edit?disco=AAAATDVxx6o .

No worries about contradicting anything! It makes sense that we'd want to iterate on what we want to do with the schema now that we can see how it'll actually interact with and exist in the codebase. I think replacing retryable (boolean) with an enum that captures those different resolution paths is a great call.

failureType definitely feels hard to keep reasonably scoped. @edgao when you say "error class", do you mean like the actual Exception.class value of the original exception?

Contributor

cool. enum makes sense to me!

Contributor

sorry, error class as in unknown/userError/retry/etc. Wanted to avoid overloading the term failureType in my comment but am bad at naming >.>

That comment thread does look relevant! Gonna put thoughts here b/c gdoc comments aren't great for longform discussion IMO.

user intervention

@cgardens does "quarantined for TS" exist in this usecase? To my understanding, we (airbyte) don't really get involved in actually configuring anything, and everything is just configured by the user, so there's no real TS-equivalent. But maybe I'm missing something there.

more granular failure reasons

I strongly believe the translation from "granular failure reason" (e.g. permissionError) to "high-level failure handling strategy" (userError + user-friendly message) should live as close to the source of the error (destination-s3) as possible. Then the UI/orchestrator can just check that high-level failure type, rather than needing to know about all the different granular failure reasons.

@cgardens what sort of usecases are you envisioning by having the more granular failure reason? I'm seeing it mostly as useful for a human analyst to look at.

What if we changed failureType to a string here, but in each individual component is actually an enum? So e.g. the worker might emit one of zombie|idk|whatever, and destination-s3 could emit one of missingPermissions|networkTimeout|bucketDoesNotExist|etc.

Then it would take some more effort during analysis to map different granular reasons together (e.g. if source-shopify emits a badOauth failure, but source-google-ads emits invalidOauth, we'd need to tie those together somehow), but at least we'd capture it in a somewhat programmatically-friendly way. They could probably also be standardized to some extent (e.g. any blob store connector could have bucketDoesNotExist, any oauth connector could have invalidClientId, etc).

(I'm obviously thinking about this in a very connector-centric way. but I think it extends to platform-ish areas fairly intuitively)
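
One way to picture the "string in the shared schema, enum inside each component" idea is the sketch below. Everything in it is an assumption for illustration: the class and method names are hypothetical, the enum values come from the examples in this comment, and the alias table is just one possible way to tie `badOauth` and `invalidOauth` together at analysis time.

```java
import java.util.Locale;
import java.util.Map;

final class ComponentFailureTypeSketch {

  // Each component owns its own enum (values taken from the examples above).
  enum WorkerFailure { ZOMBIE, UNKNOWN }

  enum S3DestinationFailure { MISSING_PERMISSIONS, NETWORK_TIMEOUT, BUCKET_DOES_NOT_EXIST }

  // The shared schema only sees a lowerCamelCase string such as "bucketDoesNotExist".
  static String toFailureType(Enum<?> value) {
    String[] parts = value.name().toLowerCase(Locale.ROOT).split("_");
    StringBuilder out = new StringBuilder(parts[0]);
    for (int i = 1; i < parts.length; i++) {
      out.append(Character.toUpperCase(parts[i].charAt(0))).append(parts[i].substring(1));
    }
    return out.toString();
  }

  // Analysis-time aliases that tie equivalent connector-specific names together.
  private static final Map<String, String> CANONICAL = Map.of("badOauth", "invalidOauth");

  // Names without an alias are returned unchanged.
  static String canonicalize(String failureType) {
    return CANONICAL.getOrDefault(failureType, failureType);
  }

  public static void main(String[] args) {
    System.out.println(toFailureType(S3DestinationFailure.BUCKET_DOES_NOT_EXIST));
    System.out.println(toFailureType(WorkerFailure.ZOMBIE));
    System.out.println(canonicalize("badOauth"));
  }
}
```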

Contributor

@benmoriceau benmoriceau Jan 19, 2022

This relates to some discussions we had around the new scheduler. The plan was to have two types of exceptions thrown by activities: RetryableException and NonRetryableException. We didn't have time to properly define what is retryable and what is not. The new scheduler is supposed to be designed to handle the scenario you describe, and to provide the tools to list quarantined syncs, as well as the signal and query methods to return the cause and unstick a sync in the long term.

Contributor Author

@edgao I think you raise some great points that we'll definitely want to consider as we iterate on error/failure handling throughout the platform. I think we'll eventually want to "incentivize" connectors to conform to some standardized form of error/failure reporting, perhaps even defining classes of failures with the expected platform handling in a future version of the protocol.

For now, I think the goal of this PR should primarily be to:

  1. make it possible to differentiate between source and destination failures today
  2. create a baseline schema for capturing failure metadata that can serve as a starting point for further iteration

I think the failureSource solves for goal #1, and we can solve for #2 by supporting an optional failureType string field like you suggested, with the expectation that we'll want to iterate towards some sort of standardization before we start deriving much value from it.
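
As a rough illustration of that baseline, a failure record with a required failureSource and an optional, free-form failureType might look like the sketch below. The class shape, accessor names, and the `isSourceFailure` helper are assumptions made for this example, not the schema generated in this PR.

```java
import java.util.Objects;
import java.util.Optional;

final class FailureReasonSketch {

  enum FailureSource { SOURCE, DESTINATION, UNKNOWN }

  static final class FailureReason {
    private final FailureSource failureSource; // goal 1: always present
    private final String failureType;          // goal 2: optional, free-form for now
    private final String stacktrace;

    FailureReason(FailureSource failureSource, String failureType, String stacktrace) {
      this.failureSource = Objects.requireNonNull(failureSource);
      this.failureType = failureType;
      this.stacktrace = stacktrace;
    }

    FailureSource getFailureSource() {
      return failureSource;
    }

    // Empty until components start reporting a failure type.
    Optional<String> getFailureType() {
      return Optional.ofNullable(failureType);
    }

    String getStacktrace() {
      return stacktrace;
    }
  }

  // Goal 1 in practice: tell source failures apart from everything else.
  static boolean isSourceFailure(FailureReason reason) {
    return reason.getFailureSource() == FailureSource.SOURCE;
  }

  public static void main(String[] args) {
    FailureReason reason = new FailureReason(FailureSource.SOURCE, null, "java.lang.RuntimeException: ...");
    System.out.println(isSourceFailure(reason));
    System.out.println(reason.getFailureType().isPresent());
  }
}
```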

Contributor

cool, that makes sense to me 👍

import java.util.stream.Collectors;
import org.apache.commons.lang3.exception.ExceptionUtils;

public class FailureHelper {
Contributor

i like this class. keeps all of the code in the complicated parts of the code path easier to read!

- permissionError
- systemRestart
- zombie
- manualCancellation
Contributor

this isn't actually an error, right? Is it just included here because it's surfaced via an exception?

Contributor Author

Yeah there was a bit of back and forth on this in the tech spec. It might make sense to leave it out of the failure summary and instead just have a top-level cancelledBy column on the attempts table for audit-trail purposes.

@cgardens now that you see it in code, how do you feel about it? I personally think it feels a bit unintuitive as-is. I think this is what you were getting at in the tech spec thread, like in order for this approach to make sense, we would want to use a failed status for cancelled jobs rather than having an entirely-separate cancelled status.

Contributor

Parker, I agree with what you suggested. Let's do it.

Contributor

@benmoriceau benmoriceau left a comment

Minor cosmetic comments.

@@ -129,10 +136,25 @@ public void run(final ConnectionUpdaterInput connectionUpdaterInput) throws Retr
StandardSyncSummary standardSyncSummary = standardSyncOutput.get().getStandardSyncSummary();

if (standardSyncSummary != null && standardSyncSummary.getStatus() == ReplicationStatus.FAILED) {
failures.addAll(standardSyncOutput.get().getFailures());
partialSuccess = standardSyncOutput.get().getStandardSyncSummary().getTotalStats().getRecordsCommitted() > 0;
Contributor

Nit: You can use the variable standardSyncSummary instead of standardSyncOutput.get().getStandardSyncSummary() here.

if (childWorkflowFailure.getCause() instanceof CanceledFailure) {
// do nothing, cancellation handled by cancellationScope

} else if (childWorkflowFailure.getCause() instanceof ActivityFailure) {
Contributor

Nit: it can be refactored into:
else {
  switch (childWorkflowFailure.getCause()) {
    case ActivityFailure a: ...
    default: ...
  }
  throw childWorkflowFailure;
}

From: https://docs.oracle.com/en/java/javase/17/language/pattern-matching-switch-expressions-and-statements.html
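
For context, pattern matching for `switch` was still a preview feature in Java 17 (it needs `--enable-preview`) and only became final in Java 21. A sketch of the same branching using `instanceof` patterns, which have been final since Java 16, is shown below. The exception classes are hypothetical stand-ins for Temporal's `CanceledFailure` and `ActivityFailure`, and `describe` is an illustrative helper, not code from this PR.

```java
final class ChildFailureHandlingSketch {

  // Hypothetical stand-ins for the Temporal failure types named in the diff.
  static final class CanceledFailureStub extends RuntimeException {}

  static final class ActivityFailureStub extends RuntimeException {
    ActivityFailureStub(String message) {
      super(message);
    }
  }

  // Branch on the cause of the child workflow failure without explicit casts.
  static String describe(RuntimeException childWorkflowFailure) {
    Throwable cause = childWorkflowFailure.getCause();
    if (cause instanceof CanceledFailureStub) {
      // Cancellation is handled elsewhere (the cancellation scope in the real code).
      return "cancelled";
    } else if (cause instanceof ActivityFailureStub activityFailure) {
      // The pattern variable is already typed as ActivityFailureStub here.
      return "activity failure: " + activityFailure.getMessage();
    } else {
      return "unknown failure";
    }
  }

  public static void main(String[] args) {
    System.out.println(describe(new RuntimeException(new CanceledFailureStub())));
    System.out.println(describe(new RuntimeException(new ActivityFailureStub("normalization"))));
    System.out.println(describe(new RuntimeException("no cause")));
  }
}
```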

@benmoriceau benmoriceau force-pushed the bmoric/acceptance-test-update branch 12 times, most recently from a0f20f6 to 69b2ab0 Compare January 20, 2022 01:16
Base automatically changed from bmoric/acceptance-test-update to master January 20, 2022 02:16
@pmossman pmossman force-pushed the parker/job-failure-off-of-benoit-branch branch from f25084f to 7bf3c4c Compare January 20, 2022 20:44
@pmossman pmossman temporarily deployed to more-secrets January 20, 2022 20:46 Inactive
@pmossman pmossman temporarily deployed to more-secrets January 20, 2022 21:15 Inactive
@pmossman pmossman force-pushed the parker/job-failure-off-of-benoit-branch branch from 11d56a9 to 2ad5395 Compare January 20, 2022 23:23
@pmossman pmossman temporarily deployed to more-secrets January 20, 2022 23:25 Inactive
@pmossman pmossman temporarily deployed to more-secrets January 21, 2022 00:55 Inactive
@pmossman pmossman force-pushed the parker/job-failure-off-of-benoit-branch branch from 42d36db to 4dd8d6f Compare January 21, 2022 18:16
@pmossman pmossman temporarily deployed to more-secrets January 21, 2022 18:18 Inactive
@pmossman pmossman force-pushed the parker/job-failure-off-of-benoit-branch branch from 4dd8d6f to 18b9518 Compare January 21, 2022 18:20
@pmossman pmossman temporarily deployed to more-secrets January 21, 2022 18:22 Inactive
@pmossman
Contributor Author

@cgardens I know you approved a while ago, could you give this another review before I consider it ready-to-merge? A few things have changed a bit since your initial review, namely:

  1. Removed the 'retryable' boolean in favor of 'failureType', which is now an enum with values userError, systemError, transient, and unknown. I expect this will evolve over time, and currently we aren't populating this enum at all, but I'm hoping it serves as a baseline that we can start to populate when we begin programmatically handling well-known failures. My thinking was that we can eventually surface something like a recommended resolution path to the user: userError -> the user needs to intervene/fix their config/etc.; systemError -> Airbyte needs to intervene, no action for the user to take; transient -> recommend a retry to the user, since it might work.

  2. Renamed failureSource to failureOrigin because source is overloaded

  3. Added tests, particularly in the ConnectionManagerWorkflow to ensure that we at least have test coverage for the failureOrigin part of the schema, as that is definitely the most immediately-valuable feature of this project right now
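
To make item 1 concrete, here is a minimal sketch of how the four failureType values could carry a recommended resolution path. The enum layout, the `fromWireName` helper, and the wording of each recommendation are assumptions for illustration; in particular, the text for `unknown` is not specified anywhere in this PR.

```java
import java.util.Arrays;

final class FailureTypeSketch {

  enum FailureType {
    USER_ERROR("userError", "User needs to intervene, for example by fixing their config."),
    SYSTEM_ERROR("systemError", "Airbyte needs to intervene; no action for the user to take."),
    TRANSIENT("transient", "A retry is recommended, since it might work."),
    // Assumption: the PR does not say what to recommend for unknown failures.
    UNKNOWN("unknown", "No recommendation available.");

    final String wireName;
    final String recommendation;

    FailureType(String wireName, String recommendation) {
      this.wireName = wireName;
      this.recommendation = recommendation;
    }

    // Unrecognized or missing values fall back to UNKNOWN rather than throwing.
    static FailureType fromWireName(String name) {
      return Arrays.stream(values())
          .filter(type -> type.wireName.equals(name))
          .findFirst()
          .orElse(UNKNOWN);
    }
  }

  public static void main(String[] args) {
    for (FailureType type : FailureType.values()) {
      System.out.println(type.wireName + ": " + type.recommendation);
    }
  }
}
```

Falling back to `UNKNOWN` keeps readers of older or unpopulated rows from failing while the enum is still evolving.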

@pmossman pmossman force-pushed the parker/job-failure-off-of-benoit-branch branch from 18b9518 to 1f40d42 Compare January 22, 2022 00:49
@pmossman pmossman temporarily deployed to more-secrets January 22, 2022 00:51 Inactive
@pmossman pmossman temporarily deployed to more-secrets January 22, 2022 00:58 Inactive
@pmossman pmossman force-pushed the parker/job-failure-off-of-benoit-branch branch from b0308f1 to 67efb4d Compare January 24, 2022 22:51
@pmossman pmossman temporarily deployed to more-secrets January 24, 2022 22:53 Inactive
@pmossman pmossman temporarily deployed to more-secrets January 24, 2022 22:56 Inactive
@pmossman pmossman temporarily deployed to more-secrets January 24, 2022 23:39 Inactive
* add FailureHelper

* add jobPersistence method for writing failure summary

* record source/destination failures and include them in ReplicationOutput and StandardSyncOutput

* handle failures in ConnectionManagerWorkflow, persist them when failing/cancelling an attempt

* rename attempt to attempt_id in FailureHelper

* test that ConnectionManagerWorkflow correctly records failures

* only set failures on ReplicationOutput if a failure actually occurred

* test that source or destination failure results in correct failureReason

* remove cancellation from failure summaries

* formatting, cleanup

* remove failureSummaryForCancellation

* rename failureSource -> failureOrigin, delete retryable, clarify failureType enum values

* actually persist attemptFailureSummary now that column exists

* use attemptNumber instead of attemptId where appropriate

* small fixes

* formatting

* use maybeAttemptId instead of connectionUpdaterInput.getAttemptNumber

* missed rename from failureSource to failureOrigin
@pmossman pmossman force-pushed the parker/job-failure-off-of-benoit-branch branch from fc68d25 to 7a893e1 Compare January 25, 2022 21:41
@pmossman pmossman temporarily deployed to more-secrets January 25, 2022 21:43 Inactive
@pmossman pmossman temporarily deployed to more-secrets January 25, 2022 22:08 Inactive
@pmossman pmossman changed the title [WIP] Persist Job Failure summaries for failed and cancelled attempts Persist Job Failure summaries for failed and cancelled attempts Jan 25, 2022
@pmossman pmossman merged commit 805c8d9 into master Jan 25, 2022
@pmossman pmossman deleted the parker/job-failure-off-of-benoit-branch branch January 25, 2022 22:51
Labels
area/platform issues related to the platform area/scheduler area/server area/worker Related to worker

4 participants