[BEAM-6284] Improve error message on waitUntilFinish. by Ardagan · Pull Request #8629 · apache/beam

Ardagan · 2019-05-20T20:51:43Z

Allow for infinite wait.

Seems that [BEAM-6284] is relevant to this issue, even though logs are not available any more.

Ardagan · 2019-05-20T21:18:30Z

@kennknowles @akedin

Ardagan · 2019-05-20T22:43:37Z

@amaliujia

amaliujia · 2019-05-21T16:39:46Z

I am not familiar with this piece of code. Maybe also ask in dev@ to see who is also able to review this change?

Ardagan · 2019-05-21T17:51:22Z

run java postcommit

akedin

(Still looking) I think this is a good thing to refactor. A few comments:

akedin · 2019-05-21T18:22:02Z

...-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowPipelineJob.java

-    if (terminalState != null) {
-      return terminalState;
-    }
+  State getStateWithRetriesNoThrow(BackOff attempts, Sleeper sleeper) {


When reading ..NoThrow I had to look up what happens instead of throwing. Would something like ..OrUnknown convey the behavior better?

akedin · 2019-05-21T18:48:24Z

...-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowPipelineJob.java

  }

+  BackOff getBackoff(Duration duration, FluentBackoff factory) {
+    if (duration.equals(Duration.ZERO) || duration.isLongerThan(Duration.ZERO)) {


I think this logic is incorrect and a bit confusing:

- factory is always MESSAGES_BACKOFF_FACTORY, from what I can see. I suggest inlining it instead of passing it as a parameter; otherwise it's unclear here what it is. It doesn't matter at the call sites either: the logic there doesn't care how we get the backoff or which factory we use;
- the duration in withMaxCumulativeBackoff cannot be zero;
- it's unclear that duration is a total cumulative timeout, not some other parameter of the backoff configuration;
- the logic is supposed to read (if I understand it right): "use default backoff config, plus set the max duration if it's positive";

I suggest rewriting this along the lines of:

    FluentBackoff backoffConfig =
        maxDuration.isLongerThan(Duration.ZERO)
            ? MESSAGES_BACKOFF_FACTORY.withMaxCumulativeBackoff(maxDuration)
            : MESSAGES_BACKOFF_FACTORY;
    return BackOffAdapter.toGcpBackOff(backoffConfig.backoff());

Or even

    maxDuration = maxDuration.isLongerThan(Duration.ZERO) ? maxDuration : DEFAULT_MAX_BACKOFF;
    return BackOffAdapter.toGcpBackOff(
        MESSAGES_BACKOFF_FACTORY.withMaxCumulativeBackoff(maxDuration).backoff());
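To illustrate the "use default backoff config, plus set the max duration if it's positive" rule in isolation, here is a minimal self-contained sketch. BackoffConfigSketch, DEFAULT, DEFAULT_MAX_CUMULATIVE and forTimeout are hypothetical stand-ins invented for this example; they are not Beam's FluentBackoff or MESSAGES_BACKOFF_FACTORY. The only behavior taken from the discussion is that the cumulative cap must be positive and that a non-positive timeout means "keep the default".

```java
import java.time.Duration;

/** Hypothetical stand-in for a fluent backoff config; not Beam's FluentBackoff. */
final class BackoffConfigSketch {
  // Assumed default cap, chosen arbitrarily for this example.
  static final Duration DEFAULT_MAX_CUMULATIVE = Duration.ofDays(1000);
  static final BackoffConfigSketch DEFAULT = new BackoffConfigSketch(DEFAULT_MAX_CUMULATIVE);

  final Duration maxCumulativeBackoff;

  private BackoffConfigSketch(Duration maxCumulativeBackoff) {
    this.maxCumulativeBackoff = maxCumulativeBackoff;
  }

  /** Returns a copy with a new cap; rejects zero and negative values. */
  BackoffConfigSketch withMaxCumulativeBackoff(Duration cap) {
    if (cap.isZero() || cap.isNegative()) {
      throw new IllegalArgumentException("cap must be positive: " + cap);
    }
    return new BackoffConfigSketch(cap);
  }

  /** Default config, plus the caller's total timeout only when it is positive. */
  static BackoffConfigSketch forTimeout(Duration maxDuration) {
    boolean positive = !maxDuration.isZero() && !maxDuration.isNegative();
    return positive ? DEFAULT.withMaxCumulativeBackoff(maxDuration) : DEFAULT;
  }
}
```

With this shape, a caller can pass a zero or negative duration to mean "no explicit timeout" without ever handing a non-positive cap to the builder.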

akedin · 2019-05-21T19:16:48Z

...-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowPipelineJob.java

-            backoff = BackOff.STOP_BACKOFF;
-          }
+      // We can stop if the job is done.
+      if (state.isTerminal()) {


I suggest refactoring this a bit further. Right now it's hard to find the body of the if:

    if (state.isTerminal()) {
      logTerminalState(state);
      return state;
    }

Yeah, I was thinking of that, but kept it here since, as I saw it, it was one of the main parts of this function.
I will refactor it out.

akedin · 2019-05-21T19:26:16Z

...-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowPipelineJob.java

-        }
+      exception = processJobMessages(messageHandler, monitor);
+
+      if (exception != null) {


Previously we would reset the backoff before continue. Is it right not to do this anymore?

The previous code is actually too layered, which is why I refactored it so heavily. backoff.reset() is under the if(!hasError) condition.

We reset the backoff only when there was no error, i.e. when the status was not UNKNOWN. That caused us to fail upon reaching the max retry count while receiving the UNKNOWN state, rather than upon reaching the timeout.

My change separates two cases:
a) If we fail to get the status, we do not reset the backoff and will fail due to exceptions when attempting to get the job status.
b) If we actually receive the UNKNOWN status, we wait until the timeout.

Let me try to summarize the main flow to see if I understand it correctly:

Previous Flow

- get job state:
  - get non-UNKNOWN state -> reset backoff -> continue loop if not terminal;
    - will time out at max duration or get a terminal state; correct behavior;
  - get IOException, same as:
  - get UNKNOWN state -> continue loop unconditionally;
    - does not reset backoff;
    - can exceed number of allowed attempts fast, not waiting for max allowed duration;

New Flow

- get job state:
  - get non-UNKNOWN or UNKNOWN state -> reset backoff -> continue loop if not terminal;
    - can only receive UNKNOWN explicitly;
    - will time out at max duration or get a terminal state; correct behavior;
  - get IOException -> continue loop unconditionally:
    - can exceed number of attempts instead of waiting for max allowed time;

In this case the logic seems right. I would probably try to organize the body of the loop to emphasize the flow though, something along the lines of:

    Optional<State> state = tryGetState();
    if (!state.isPresent() || !tryProcessJobMessages()) {
      continue;
    }
    if (state.get().isTerminal()) {
      return state.get();
    }
    resetAttemptsCount();

Hope this makes sense
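The "New Flow" summarized above can be sketched as a small standalone loop. This is only an illustration under stated assumptions: PollLoopSketch, StateSource, waitForTerminal and the State enum here are hypothetical and are not the DataflowPipelineJob code; maxPolls stands in for the cumulative timeout and maxConsecutiveFailures for the retry budget on IOException. Sleeping between polls and message processing are left out.

```java
import java.io.IOException;
import java.util.Optional;

/** Hypothetical polling loop; not the actual DataflowPipelineJob implementation. */
final class PollLoopSketch {
  enum State {
    UNKNOWN, RUNNING, DONE;

    boolean isTerminal() {
      return this == DONE;
    }
  }

  /** Hypothetical source of job state that may fail transiently. */
  interface StateSource {
    State fetch() throws IOException;
  }

  /**
   * Polls until a terminal state is seen. Returns empty when either the failure
   * budget or the poll budget (a stand-in for the total timeout) is exhausted.
   */
  static Optional<State> waitForTerminal(
      StateSource source, int maxConsecutiveFailures, int maxPolls) {
    int failures = 0;
    for (int poll = 0; poll < maxPolls; poll++) {
      State state;
      try {
        state = source.fetch();
      } catch (IOException e) {
        // Failed fetch: consume the retry budget and do not reset it.
        if (++failures >= maxConsecutiveFailures) {
          return Optional.empty();
        }
        continue;
      }
      if (state.isTerminal()) {
        return Optional.of(state);
      }
      // Any successfully fetched state, including UNKNOWN, resets the budget.
      failures = 0;
    }
    return Optional.empty();
  }
}
```

The point of the sketch is the asymmetry: an explicitly received UNKNOWN keeps the loop alive until the overall budget runs out, while repeated IOExceptions end it early.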

You are right, and your code example makes sense. What kept me from getting to that shape is that I want to propagate the exception outside of the loop. Unfortunately, Java cannot pass method arguments by reference, so there's no clean way to return either a state or an exception short of defining an explicit class, and that would be a bit of overkill in this case.
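For reference, the kind of explicit class mentioned here might look roughly like the sketch below. It is not part of this PR: StateOrError, its factory methods, and the State enum are hypothetical names made up for illustration, and the class is shown only to make the trade-off visible.

```java
import java.io.IOException;
import java.util.Objects;

/** Hypothetical holder for "a state or the exception that prevented getting one". */
final class StateOrError {
  enum State { UNKNOWN, RUNNING, DONE }

  private final State state;       // null when the fetch failed
  private final IOException error; // null when the fetch succeeded

  private StateOrError(State state, IOException error) {
    this.state = state;
    this.error = error;
  }

  static StateOrError of(State state) {
    return new StateOrError(Objects.requireNonNull(state), null);
  }

  static StateOrError failed(IOException error) {
    return new StateOrError(null, Objects.requireNonNull(error));
  }

  boolean hasError() {
    return error != null;
  }

  /** Returns the fetched state, or UNKNOWN when the fetch failed. */
  State stateOrUnknown() {
    return hasError() ? State.UNKNOWN : state;
  }

  /** Returns the captured exception, or null when the fetch succeeded. */
  IOException error() {
    return error;
  }
}
```

A loop could keep the last StateOrError it saw and rethrow error() after exiting, at the cost of one more type to maintain.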

Ardagan · 2019-05-21T22:44:34Z

UPD:
Confirmed that State.UNKNOWN is not supposed to be terminal on API side.

akedin

LGTM

Ardagan · 2019-05-23T17:56:17Z

run java postcommit

Ardagan · 2019-05-23T17:58:44Z

Run Dataflow ValidatesRunner

akedin · 2019-05-23T20:08:44Z

run java postcommit

Ardagan · 2019-05-23T22:42:53Z

Run Dataflow ValidatesRunner

amaliujia · 2019-05-24T15:49:39Z

Is there a reason that this PR's commits were not squashed?

akedin · 2019-05-24T15:56:26Z

I squashed it when merging: ea32ab9

amaliujia · 2019-05-24T16:03:05Z

I see. Squash and merge will not update this PR directly.

Improve error message on waitUntilFinish.

b3b5355

Allow for infinite wait.

akedin reviewed May 21, 2019

Ardagan changed the title ~~Improve error message on waitUntilFinish.~~ [BEAM-6284] Improve error message on waitUntilFinish. May 21, 2019

Ardagan changed the title ~~[BEAM-6284] Improve error message on waitUntilFinish.~~ [BEAM-6284][DoNotMerge] Improve error message on waitUntilFinish. May 21, 2019

Ardagan changed the title ~~[BEAM-6284][DoNotMerge] Improve error message on waitUntilFinish.~~ [BEAM-6284] Improve error message on waitUntilFinish. May 21, 2019

Mikhail Gryzykhin added 4 commits May 22, 2019 15:19

Address PR comments

107d852

spotlessApply

7ecd484

Improve method naming

10b3d8c

Code cleanup

9756ef2

akedin approved these changes May 23, 2019

akedin merged commit ea32ab9 into apache:master May 24, 2019
