[FLINK-5985] Report no task states for stateless tasks on checkpointing #3523

StefanRRichter
Contributor

This PR fixes [FLINK-5985]. The fix acknowledges null instead of an empty SubtaskState to CheckpointCoordinator#acknowledgeCheckpoint(...), so that no TaskState is registered in the checkpoint under the JobVertexID of a stateless task.
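
As a rough illustration of the idea described above, here is a minimal, self-contained sketch. It does not use Flink's real classes: `SubtaskStateSketch`, `CoordinatorSketch`, `StatelessAckSketch`, and `toAcknowledgedState` are hypothetical stand-ins invented for this example, and the actual signatures in Flink differ.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for Flink's SubtaskState; it only tracks state handle names. */
final class SubtaskStateSketch {
    private final List<String> stateHandles;

    SubtaskStateSketch(List<String> stateHandles) {
        this.stateHandles = stateHandles;
    }

    boolean hasState() {
        return !stateHandles.isEmpty();
    }
}

/** Hypothetical stand-in for the coordinator's per-vertex bookkeeping of task states. */
final class CoordinatorSketch {
    private final Map<String, SubtaskStateSketch> taskStates = new HashMap<>();

    /** A null state is a valid acknowledgment; it simply registers nothing for the vertex. */
    void acknowledgeCheckpoint(String jobVertexId, SubtaskStateSketch state) {
        if (state != null) {
            taskStates.put(jobVertexId, state);
        }
    }

    boolean hasTaskState(String jobVertexId) {
        return taskStates.containsKey(jobVertexId);
    }
}

final class StatelessAckSketch {
    /** Task side: translate an empty (or missing) state into null before acknowledging. */
    static SubtaskStateSketch toAcknowledgedState(SubtaskStateSketch state) {
        return (state != null && state.hasState()) ? state : null;
    }

    public static void main(String[] args) {
        CoordinatorSketch coordinator = new CoordinatorSketch();
        SubtaskStateSketch empty = new SubtaskStateSketch(Collections.emptyList());
        SubtaskStateSketch nonEmpty = new SubtaskStateSketch(Collections.singletonList("handle-1"));

        coordinator.acknowledgeCheckpoint("stateless-vertex", toAcknowledgedState(empty));
        coordinator.acknowledgeCheckpoint("stateful-vertex", toAcknowledgedState(nonEmpty));

        System.out.println(coordinator.hasTaskState("stateless-vertex"));
        System.out.println(coordinator.hasTaskState("stateful-vertex"));
    }
}
```

In this toy model the stateless vertex leaves no entry behind, which is the property the PR description is after.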

@StefanRRichter
Contributor Author

CC @gyfora @uce

@gyfora
Contributor

gyfora commented Mar 13, 2017

The changes look reasonable :)

@gyfora
Contributor

gyfora commented Mar 13, 2017

I could only try the backported version on the topology that caused the problem initially (it is running 1.2.0).

@StefanRRichter
Contributor Author

@gyfora if the effort is reasonable, it would be great to try this out on your topology. As soon as you give your +1, I could merge this change :-)

@gyfora
Contributor

gyfora commented Mar 14, 2017

I'm gonna try to cherry-pick this onto 1.2 and run it today.

@StefanRRichter
Contributor Author

Great, thanks!

@gyfora
Contributor

gyfora commented Mar 14, 2017

There seem to have been some changes in the StreamTask and some tests, so I couldn't rebase this cleanly. Do you have a minute to take a look and maybe push a branch with the backport, please? That would help me a lot.

@StefanRRichter
Contributor Author

Sure, I just quickly prepared a backport here:

https://github.com/StefanRRichter/flink/tree/FLINK-5985-backport-to-1.2

@gyfora
Contributor

gyfora commented Mar 14, 2017

Hm, it doesn't seem to work on the first try. Here is what I did: I updated the client with the new jar built from your backport branch, redeployed the job from a savepoint (to get the new Flink version), took a new savepoint, and tried to redeploy with the changed topology.

I still seem to get the same error.

Is it possible that the previous checkpoints have an effect on this? In any case I will double check tomorrow morning and try to do the test again.

@gyfora
Contributor

gyfora commented Mar 14, 2017

It also doesn't seem to work when starting from a clean state and then doing a savepoint redeploy with the changed topology, so maybe I am really screwing something up.

@gyfora
Contributor

gyfora commented Mar 15, 2017

@StefanRRichter It seems to work correctly locally. I am trying to figure out what went wrong with my YARN tests, but this shouldn't block you.

@gyfora
Contributor

gyfora commented Mar 15, 2017

Ah, the reason is probably that I didn't change my job jar, and this relies on changes in the RocksDB backend.

@StefanRRichter
Contributor Author

Ok, then the mystery is finally solved :-) Thanks again for reporting this problem and your additional testing efforts!

@gyfora
Contributor

gyfora commented Mar 15, 2017

On second thought, shouldn't the older RocksDB backend version still work? (I guess that still returns a Done future with a null value.)

@StefanRRichter StefanRRichter changed the title [FLINK-5985] Report no task states for stateless tasks in checkpointing [FLINK-5985] Report no task states for stateless tasks on checkpointing Mar 15, 2017
@StefanRRichter
Contributor Author

Yes, it should still work, because the changes to RocksDBKeyedStateBackend are purely cosmetic and do not change any functionality.

@StefanRRichter
Contributor Author

Hm, one potential pitfall that I see is operator chaining, in case your stateless operators are chained together with stateful ones. But then again, you said it works locally?

Contributor

@StephanEwen StephanEwen left a comment

I think this fix is good.
The StreamTask could use some cleanup to make tests easier (less whiteboxing needed). We should do that in a separate refactoring, either as a followup to this, or as a preparation for a modified version of this.

What do you think?

* happens by translating an empty {@link SubtaskState} into reporting 'null' to #acknowledgeCheckpoint.
*/
@Test
public void testEmptySubtaskStateLeadsToStatelessAcknowledgment() throws Exception {
Contributor

I think the fact that this test requires extensive whiteboxing means we should move the whole CheckpointOperation to a separate class and make it work independent of StreamTask.

Contributor Author

Yes, we could separate the classes. CheckpointOperation is already a static inner class anyway. I would suggest doing this in a follow-up.

@StephanEwen
Contributor

I think that verifying the possibility to reconfigure a job with respect to stateless operators warrants an ITCase. Can we extend the SavepointITCase for that?

@StefanRRichter
Contributor Author

@StephanEwen, I have added the IT case. Please have another look.

@StephanEwen
Contributor

Can we slightly adapt the test to target more the typical use case:

  • Original job has some stateless ops (no uid), and some stateful ones (with uid)
  • Create a modified job that has the same stateful ones (same uids) but different stateless ones

Otherwise this looks good.
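
The scenario in the two bullets above can be modeled with a small, self-contained sketch. This is a toy model, not the SavepointITCase: `SavepointRestoreSketch`, `canRestore`, and all IDs are hypothetical, and it assumes only that a restore fails when the savepoint holds a state entry that cannot be mapped to any operator in the new job.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class SavepointRestoreSketch {
    /** Restore succeeds only if every state entry in the savepoint maps to an operator of the new job. */
    static boolean canRestore(Set<String> savepointStateIds, Set<String> newJobOperatorIds) {
        return newJobOperatorIds.containsAll(savepointStateIds);
    }

    static Set<String> ids(String... values) {
        return new HashSet<>(Arrays.asList(values));
    }

    public static void main(String[] args) {
        // Modified job: same stateful operator (same uid), but a different stateless
        // operator whose generated ID (hypothetical value) no longer matches the old one.
        Set<String> modifiedJob = ids("uid-stateful-counter", "generated-id-new-map");

        // If the stateless operator registered an (empty) state entry, the savepoint
        // refers to an ID that no longer exists in the modified job.
        Set<String> savepointWithEmptyEntry = ids("uid-stateful-counter", "generated-id-old-map");

        // If stateless operators register nothing, only the stateful uid is present.
        Set<String> savepointWithoutEmptyEntry = ids("uid-stateful-counter");

        System.out.println(canRestore(savepointWithEmptyEntry, modifiedJob));
        System.out.println(canRestore(savepointWithoutEmptyEntry, modifiedJob));
    }
}
```

The design point is that only operators with explicit uids contribute entries that must match across job versions, so stateless operators can be swapped freely.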

@StefanRRichter StefanRRichter force-pushed the no-taskstate-for-stateless branch 6 times, most recently from 4cb4900 to 6382484 Compare March 17, 2017 10:16
@StefanRRichter
Contributor Author

Thanks for the review @StephanEwen. I updated the test as suggested. Merging this now.

4 participants