Skip to content

[BEAM-6284] Improve error message on waitUntilFinish.#8629

Merged
akedin merged 5 commits intoapache:masterfrom
Ardagan:WUF
May 24, 2019
Merged

[BEAM-6284] Improve error message on waitUntilFinish.#8629
akedin merged 5 commits intoapache:masterfrom
Ardagan:WUF

Conversation

@Ardagan
Copy link
Contributor

@Ardagan Ardagan commented May 20, 2019

Allow for infinite wait.

Seems that [BEAM-6284] is relevant to this issue, even though logs are not available any more.


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

Post-Commit Tests Status (on master branch)

Lang SDK Apex Dataflow Flink Gearpump Samza Spark
Go Build Status --- --- --- --- --- ---
Java Build Status Build Status Build Status Build Status
Build Status
Build Status
Build Status Build Status Build Status
Python Build Status
Build Status
--- Build Status
Build Status
Build Status --- --- ---

Pre-Commit Tests Status (on master branch)

--- Java Python Go Website
Non-portable Build Status Build Status Build Status Build Status
Portable --- Build Status --- ---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

Allow for infinite wait.
@Ardagan
Copy link
Contributor Author

Ardagan commented May 20, 2019

@kennknowles @akedin

@Ardagan
Copy link
Contributor Author

Ardagan commented May 20, 2019

@amaliujia

@amaliujia
Copy link
Contributor

I am not familiar with this piece of code. Maybe also ask in dev@ to see who is also able to review this change?

@Ardagan
Copy link
Contributor Author

Ardagan commented May 21, 2019

run java postcommit

Copy link
Contributor

@akedin akedin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Still Looking) I think this is a good thing to refactor, few comments:

if (terminalState != null) {
return terminalState;
}
State getStateWithRetriesNoThrow(BackOff attempts, Sleeper sleeper) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When reading ..NoThrow I had to lookup what happens instead of throwing. Would something like ..OrUnknown convey the behavior better?

}

BackOff getBackoff(Duration duration, FluentBackoff factory) {
if (duration.equals(Duration.ZERO) || duration.isLongerThan(Duration.ZERO)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this logic is incorrect and a bit confusing:

  • factory is always MESSAGES_BACKOFF_FACTORY, from what I can see. I suggest just inlining it instead of passing as a parameter, otherwise here it's unclear what it is. At the call sites it doesn't matter as well, the logic there doesn't care how we get the backoff and what factory we use;
  • the duration in withMaxCumulativeBackoff cannot be zero;
  • it's unclear that duration is a total cumulative timeout, not some other parameter of the backoff configuration;
  • the logic is supposed to read (if I understand it right): "use default backoff config, plus set the max duration if it's positive";

I suggest rewriting this along the lines of:

FluentBackoff backoffConfig = maxDuration.isLongerThan(Duration.ZERO)
  ? MESSAGES_BACKOFF_FACTORY.withMaxCumulativeBackoff(maxDuration)
  : MESSAGES_BACKOFF_FACTORY;

return BackOffAdapter.toGcpBackOff(backoffConfig.backoff());

Or even

maxDuration = maxDuration.isLongerThan(Duration.ZERO) ? maxDuration : DEFAULT_MAX_BACKOFF;
return BackOffAdapter.toGcpBackOff(
     MESSAGES_BACKOFF_FACTORY.withMaxCumulativeBackoff(maxDuration).backoff());

backoff = BackOff.STOP_BACKOFF;
}
// We can stop if the job is done.
if (state.isTerminal()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest refactoring this a bit further. Right now it's hard to find the body of the if:

if (state.isTerminal()) {
  logTerminalState(state);
  return state;
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, was thinking of that, but kept it here since it was one of the main parts of this function as I seen it.
Will refactor it out.

}
exception = processJobMessages(messageHandler, monitor);

if (exception != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

previously we would reset backoff before continue, is it the right thing to not do this anymore?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previous code is actually too layered and that's why I did refactored it a lot. backoff.reset() is under if(!hasError) condition.

We did reset backoff only in case if there was no error, ie status is not UNKNOWN. That caused us to fail upon reaching max retry count while receiving UNKNOWN state, not upon reaching timeout.

My change separates two cases:
a) If we fail to get status, we do not reset backoff and will fail due to exceptions upon attempt to get job status
b) If we actually receive UNKNOWN status, we will wait until timeout.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me try to summarize the main flow to see if I understand it correctly:

Previous Flow

  • get job state:
    • get non-UNKNOWN state -> reset backoff -> continue loop if not terminal;
      • will timeout at max duration or get a terminal state; correct behavior;
    • get IOException, same as:
    • get UNKNOWN state -> continue loop unconditionally;
      • does not not reset backoff;
      • can exceed number of allowed attempts fast, not waiting for max allowed duration;

New Flow

  • get job state:
    • get non-UNKNOWN or UNKNOWN state -> reset backoff -> continue loop if not terminal;
      • can only receive UNKNOWN explicitly;
      • will timeout at max duration or get a terminal state; correct behavior;
    • get IOException -> continue loop unconditionally:
      • can exceed number of attempts instead of waiting for max allowed time;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case the logic seems right. I would probably try to organize the body of the loop to emphasize the flow though, something along the lines of:

Optional<State> state = tryGetState(); 
if (!state.isPresent() || !tryProcessJobMessages()) {
  continue;
}

if (state.get().isTerminal()) {
  return state.get();
}

resetAttemptsCount();

Hope this makes sense

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right and your code example makes sense. The problem that didn't let me to get that state is that I want to propagate exception to outside of the loop. Unfortunately, java can not pass method arguments by reference, so there's no clear way to return state or exception except defining explicit class. And that would be a bit of an overkill in this case.

@Ardagan Ardagan changed the title Improve error message on waitUntilFinish. [BEAM-6284] Improve error message on waitUntilFinish. May 21, 2019
@Ardagan Ardagan changed the title [BEAM-6284] Improve error message on waitUntilFinish. [BEAM-6284][DoNotMerge] Improve error message on waitUntilFinish. May 21, 2019
@Ardagan Ardagan changed the title [BEAM-6284][DoNotMerge] Improve error message on waitUntilFinish. [BEAM-6284] Improve error message on waitUntilFinish. May 21, 2019
@Ardagan
Copy link
Contributor Author

Ardagan commented May 21, 2019

UPD:
Confirmed that State.UNKNOWN is not supposed to be terminal on API side.

Copy link
Contributor

@akedin akedin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Ardagan
Copy link
Contributor Author

Ardagan commented May 23, 2019

run java postcommit

@Ardagan
Copy link
Contributor Author

Ardagan commented May 23, 2019

Run Dataflow ValidatesRunner

@akedin
Copy link
Contributor

akedin commented May 23, 2019

run java postcommit

@Ardagan
Copy link
Contributor Author

Ardagan commented May 23, 2019

Run Dataflow ValidatesRunner

@akedin akedin merged commit ea32ab9 into apache:master May 24, 2019
@amaliujia
Copy link
Contributor

Is there a reason that this PR's commits were not squashed?

@akedin
Copy link
Contributor

akedin commented May 24, 2019

I squashed it when merging: ea32ab9

@amaliujia
Copy link
Contributor

I see. Squash and merge will not update this PR directly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants