Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BEAM-10166] Improve execution time errors with more concise error messages from DoFns #12170

Closed

Conversation

codeBehindMe
Copy link
Contributor

@codeBehindMe codeBehindMe commented Jul 3, 2020

The Go SDK uses errors returned by DoFns to signal failures to process bundles, and terminate bundle processing. However, if the preceding DoFn uses emitters, rather than error returns, the code has no choice to panic to avoid user code handling or ignoring the cross DoFn error (which could cause dataloss or other correctness problems).

All bundle executions are wrapped in callNoPanic to prevent worker termination on such panics, and orderly terminate just the affected bundle instead.callNoPanic uses Go's built in recover mechanism to get the error and provide a stack trace.

This is message is not necessarily helpful to the user and can lead to degraded user experience when debugging such errors.

In this patch, we introduce a new type of error; a doFnError which has a more concise representation of the DoFn which failed, the error which the DoFn failed with and other meta information.


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

Post-Commit Tests Status (on master branch)

Lang SDK Dataflow Flink Samza Spark Twister2
Go Build Status --- Build Status --- Build Status ---
Java Build Status Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status Build Status
Build Status
Build Status
Build Status
Python Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
--- Build Status ---
XLang --- --- Build Status --- --- Build Status

Pre-Commit Tests Status (on master branch)

--- Java Python Go Website
Non-portable Build Status Build Status
Build Status
Build Status Build Status
Portable --- Build Status --- ---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

@probot-autolabeler probot-autolabeler bot added the go label Jul 3, 2020
@codeBehindMe
Copy link
Contributor Author

R: @lostluck

@lostluck lostluck self-requested a review July 6, 2020 20:28
@lostluck lostluck changed the title [BEAM-10166] Improve execution timer errors with more concise error messages from DoFns [BEAM-10166] Improve execution time errors with more concise error messages from DoFns Jul 6, 2020
pid string
}
func (e *doFnError) Error() string {
return fmt.Sprintf("%v caused error %v, uid is %v pid is %v", e.doFn, e.err, e.uid, e.pid)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The wiki syntax in JIRA really butchered my intended suggestion of DoFn[<uid>;<pid>]<doFn> returned error:<error>

Which places the short/known length things before the potentially long and rife with newlines error. It puts the context first, and avoids it getting lost or truncated behind a very long error string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an excellent suggestion.
ff46c6a

@lostluck
Copy link
Contributor

lostluck commented Jul 6, 2020

This is a good first step! Next up is adding it for use in user error returns in exec/pardo.go and similar.

If you did that work already, it's probable that the commit wasn't pushed to your branch.

Thanks for your patience in my getting to this. I took a long weekend for the US (and Canadian) holiday, and avoided all pretense of work.

In BEAM-10166, it was highlighted that when DoFns with emitters use
panics to signal failures, the error message returned to the user
is not helpful.

In this patch, we introduce a new error type called the doFnError
which can be used to generate more concise error messages for
DoFns that use emitters.
As a partt of BEAM-10166, a new type of error doFnError was introduced to
make errors generated from DoFns during bundle execution more digestable
to the user.

In this patch, we implement the catching of this error in the callNoPanic
function, which wraps bundle execution. WEe explicitly check the type of
the error generated to see if its of type doFnError and return that error.
Where it is not the case, the default behaviour of returning the stack
trace is perserved.
@codeBehindMe codeBehindMe force-pushed the improvement/BEAM-10166-exec-err branch from bf36c9e to ff46c6a Compare July 23, 2020 08:54
In the utils package, a new error type DoFnError was introduced to make the
signalling of DoFN errors more concise.

In this patch, we implement this error to be created when a ParDo failure hook
is created.
@lostluck
Copy link
Contributor

lostluck commented Aug 5, 2020

This looks good to me! Thank you for the contribution!

Would you mind adding a test to validate this behaviour?
exec/util_test.go doesn't currently exist, so it would need to be added.

There's no need to test everything in exec/util.go, just callNoPanic.

You can test three cases:

  • one where the passed in func(context.Context) error doesn't panic, and the error is just returned as is (they should compare as equal for a simple error)
  • another test where the the passed in function panics, which should return an error that contains "panic: " in it.
  • finally, one where the error passed to panic is wrapped in your new doFnError error type.

@codeBehindMe
Copy link
Contributor Author

codeBehindMe commented Aug 5, 2020

thanks @lostluck , I will implement this today / tomorrow.
Apologies, personal commitments have been getting in the way recently.

@lostluck
Copy link
Contributor

lostluck commented Aug 7, 2020

Definitely no need to apologize. Life >> code.

@codeBehindMe
Copy link
Contributor Author

So i've been banging my head against the wall with the following error. Am I missing something obvious here?
image

@lostluck
Copy link
Contributor

Your image is missing the whole error, but I suspect that the build doesn't know where that package is coming from.

go get github.cmo/google/go-cmp/cmp/cmpopts should solve the problem locally (and repeat until complete).

An alternative that should work is if you go to the beam root (.../github.com/apache/beam/) and run the gogradle vendor command. ./gradlew :sdks:go:goVendor This requires java and such to be installed unfortunately IIRC. This does almost the same as the above but puts everything into a vendor directory so it only impacts the Go SDK.

Unfortunately I can't recommend the best solution since it's not guaranteed to work with the way the Go SDK is currently implemented, which should get everything for you in one command. There's a small bit of logistics that need to be accomplished and feature work finished before we fix that though.

@codeBehindMe
Copy link
Contributor Author

Thanks @lostluck let me try my luck! =D

@stale
Copy link

stale bot commented Nov 7, 2020

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@beam.apache.org list. Thank you for your contributions.

@stale stale bot added the stale label Nov 7, 2020
@stale
Copy link

stale bot commented Nov 16, 2020

This pull request has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@stale stale bot closed this Nov 16, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants