Skip to content

Conversation

@rkhachatryan
Copy link
Contributor

@rkhachatryan rkhachatryan commented Feb 25, 2023

If a checkpoint fails for some reason, PartiallyFinishedSourcesITCase might hang (and indirectly affect other tests).

There are two issues in case of a checkpoint failure:

  1. FAIL command might be dispatched to the source task that's already finished execution
  2. waiting for the failover times out, but it then waits indefinitely to obtain job status result

Mostly happens in 1.15, because in later versions have increased timeouts/attempts to upload state.

This PR fixes the above root causes of test hanging.

@flinkbot
Copy link
Collaborator

flinkbot commented Feb 25, 2023

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

@rkhachatryan
Copy link
Contributor Author

@gaoyunhaii would you be able to take a look at this PR?

@dmvk dmvk self-requested a review March 2, 2023 08:09
Copy link
Member

@dmvk dmvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the patch Roman; this reads very sane! (though I'm not an expert in this part of Flink).

Apart from inline questions:

  1. The first commit should be rephrased

Prevent timeouts when waiting for a status of a …

->

Prevent timeouts when waiting for a result of a …

--

  1. For the second commit
Otherwise, any command with SINGLE_SUBTASK scope might be dispatched to a finished source.
This will result in a timeout while waiting for this command to be executed.

Can you please point me to where precisely the timeout is supposed to happen? As far as I can tell, on the source side, we'd only add it to the queue (there will most likely be something on the executor side that I'm missing).

@rkhachatryan
Copy link
Contributor Author

rkhachatryan commented Mar 2, 2023

Thanks for the review @dmvk!

I've updated both commit messages and marked scheduledCommands as volatile.

For the second commit

Otherwise, any command with SINGLE_SUBTASK scope might be dispatched to a finished source.
This will result in a timeout while waiting for this command to be executed.

Can you please point me to where precisely the timeout is supposed to happen? As far as I can tell, on the source side, > we'd only add it to the queue (there will most likely be something on the executor side that I'm missing).

Sure, the timeout happens in TestJobExecutor.waitForFailover.

Copy link
Member

@dmvk dmvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, good job 👍 Can you, please squash the hotfixes into the original two commits before merging?

- Only obtain execution exception if the job is in globally terminal state

- [hotfix] Unsubscribe finished TestEventSources from test commands.  Otherwise,
  any command with SINGLE_SUBTASK scope might be dispatched to a finished source.
  This will result in TestJobExecutor.waitForFailover timing out while waiting for the
  command to be executed and ACKed.

- [hotfix] Mark TestEventSource.scheduledCommands volatile

- [hotfix] Make sure to process all commands in TestEventSource
@rkhachatryan
Copy link
Contributor Author

Thanks for the review @dmvk!
I've squashed the commits, will merge the PR once the build succeeds.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants