New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Flink 6537] Fixes and improvements for incremental checkpoints in RocksDB #3870

Closed
wants to merge 5 commits into
base: master
from

Conversation

Projects
None yet
4 participants
@StefanRRichter
Contributor

StefanRRichter commented May 11, 2017

This PR bundles several fixes and improvements for incremental checkpoints in RocksDB.

In particular, this addresses:

  • [FLINK-6535] : JobID should not be part of the registration key to the SharedStateRegistry
  • [FLINK-6533] : Duplicated registration of new shared state when checkpoint confirmations are still pending
  • [FLINK-6527] : OperatorSubtaskState has empty implementations of (un)/registerSharedStates
  • [FLINK-6504] : Lack of synchronization on materializedSstFiles in RocksDBKEyedStateBackend
  • [FLINK-6545] : Make incremental checkpoints externalizable
  • [FLINK-6534] : SharedStateRegistry is disposing state handles from main thread

Extended test coverage will be provided as part of [FLINK-6540].

Some of the main changes are in the way the SharedStateRegistry works. It is now able to detect and resolve duplicate state registrations and to serve previously registered state by key. This way, we can avoid resending already registered state handles in the RPC, and can just send their registration keys instead.

@StefanRRichter

This comment has been minimized.

Show comment
Hide comment
@StefanRRichter
Contributor

StefanRRichter commented May 11, 2017

@gyfora

This comment has been minimized.

Show comment
Hide comment
@gyfora

gyfora May 11, 2017

Contributor

Hi,
I am sending this because I will be away for the weekend so won't have time to test further until monday. I pulled these changes and Stephan's classloader fix but I still get some errors. (Possible you are working on other fixes, if so just ignore me :))

These logs might help:
JM:
https://gist.github.com/gyfora/1b54c10601e7482009c6adb804dbcfbf
TM:
https://gist.github.com/gyfora/ee632bfcd0115a746bc55440bd8815e7

Seems like the SavepointV2Serializer can't handle the Incremental rocks handle because it's not a KeyGroupsStateHandle.

You can see both the savepoint error (first error) and a regular incremental checkpoint error (second)

Contributor

gyfora commented May 11, 2017

Hi,
I am sending this because I will be away for the weekend so won't have time to test further until monday. I pulled these changes and Stephan's classloader fix but I still get some errors. (Possible you are working on other fixes, if so just ignore me :))

These logs might help:
JM:
https://gist.github.com/gyfora/1b54c10601e7482009c6adb804dbcfbf
TM:
https://gist.github.com/gyfora/ee632bfcd0115a746bc55440bd8815e7

Seems like the SavepointV2Serializer can't handle the Incremental rocks handle because it's not a KeyGroupsStateHandle.

You can see both the savepoint error (first error) and a regular incremental checkpoint error (second)

@StefanRRichter

This comment has been minimized.

Show comment
Hide comment
@StefanRRichter

StefanRRichter May 11, 2017

Contributor

@gyfora Yes, this is still a limitation, because the incremental checkpoints are currently not (yet) externalizable. The missing piece is, exactly as you said, familiarizing the SavepointSerializer with the new state handle classes from the incremental checkpoints.

I created a jira to track this: https://issues.apache.org/jira/browse/FLINK-6545

Contributor

StefanRRichter commented May 11, 2017

@gyfora Yes, this is still a limitation, because the incremental checkpoints are currently not (yet) externalizable. The missing piece is, exactly as you said, familiarizing the SavepointSerializer with the new state handle classes from the incremental checkpoints.

I created a jira to track this: https://issues.apache.org/jira/browse/FLINK-6545

@shixiaogang

@StefanRRichter I think these changes well address the problems existed in incremental checkpoint in RocksDBKeyedStateBackend. Thanks a lot for your work.

I made some inline comments. Among them, the usage of asynchronous executors in SharedStateRegistry is most concerned. I think we should avoid the usage of asynchronous threads as much as possible. I prefer to utilize existing asynchronous executor to discard unreferenced shared states. Kindly let me know if you have any idea of it.

@StefanRRichter

This comment has been minimized.

Show comment
Hide comment
@StefanRRichter

StefanRRichter May 12, 2017

Contributor

Updated and enhanced.

@shixiaogang The usage of direct executor was just a placeholder until I had fully implemented [FLINK-6534]. It does now use the async IO executor from the completed checkpoint store.

Contributor

StefanRRichter commented May 12, 2017

Updated and enhanced.

@shixiaogang The usage of direct executor was just a placeholder until I had fully implemented [FLINK-6534]. It does now use the async IO executor from the completed checkpoint store.

@rmetzger

This comment has been minimized.

Show comment
Hide comment
@rmetzger

rmetzger May 14, 2017

Contributor

+1 to merge

Contributor

rmetzger commented May 14, 2017

+1 to merge

@StefanRRichter

This comment has been minimized.

Show comment
Hide comment
@StefanRRichter

StefanRRichter May 14, 2017

Contributor

Thanks @rmetzger ! Merged.

Contributor

StefanRRichter commented May 14, 2017

Thanks @rmetzger ! Merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment