[FLINK-4000] Checkpoint dictionaries null after taskmgr failures #2061

Closed
wants to merge 3 commits into from

Conversation

rekhajoshm
Contributor

Fix for an exception during job restart after a TaskManager failure: restoreState fails at that point because the checkpoint dictionaries can be null.

@asavartsov
Contributor

This kind of check is likely useless and probably won't fix the issue. My debugging shows that the list idsProcessedButNotAcknowledged is null on recovery, not the checkpoints themselves. This list is initialized in the open method, but somehow open doesn't get called in this scenario.
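
To illustrate the failure mode described here, a minimal sketch follows. It is not the Flink source: `SketchSource`, the `String` ID type, and the `restoreState(List<String>)` signature are hypothetical stand-ins, and only the field name `idsProcessedButNotAcknowledged` and the method names `open` and `restoreState` come from this thread. It assumes the list is created only in `open`, so a restore that runs first hits a null field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a source whose ID list is created only in open().
class SketchSource {
    // Not initialized at declaration, so it stays null until open() runs.
    private List<String> idsProcessedButNotAcknowledged;

    void open() {
        idsProcessedButNotAcknowledged = new ArrayList<>();
    }

    // Assumed restore hook: re-adds the IDs recovered from a checkpoint.
    void restoreState(List<String> restoredIds) {
        // Throws NullPointerException if open() has not been called yet.
        idsProcessedButNotAcknowledged.addAll(restoredIds);
    }

    // Returns true when restoring before open() fails with a NullPointerException.
    static boolean restoreBeforeOpenFails() {
        SketchSource source = new SketchSource();
        try {
            source.restoreState(Arrays.asList("id-1"));
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }
}
```

Under these assumptions, a null check on the restored checkpoints alone would not help, because the null value is the list field rather than the restored data.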

@rekhajoshm
Contributor Author

I agree, @asavartsov. That was a quick look, and I was still working to reproduce the issue. Does the updated version make sense? Thank you.

@asavartsov
Contributor

No, it does not make any sense and even makes things worse, sorry.

@rekhajoshm
Contributor Author

@asavartsov OK. Please let me know how you propose to solve this. Thanks.

@asavartsov
Contributor

Take a look at my pull request at #2062

@rekhajoshm
Contributor Author

Aha, in one of my intermediate runs I had just initialized idsProcessedButNotAcknowledged and retained pendingCheckpoints, but in my last run I changed it to call open() :-( . Makes sense, @asavartsov. Closing. Thanks.
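
One reading of the approach mentioned in this comment (initialize `idsProcessedButNotAcknowledged` during restore and keep `pendingCheckpoints`, instead of calling `open()` from the restore path) can be sketched as below. This is an assumption-laden illustration, not the change made in #2062: `RestorableSketchSource`, the `String` ID type, the shape of `pendingCheckpoints`, the null guards in `open()`, and both counter methods are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: restoreState builds its own ID list and open() does not overwrite it.
class RestorableSketchSource {
    // Assumed shape: one list of IDs per pending checkpoint.
    private ArrayDeque<List<String>> pendingCheckpoints;
    private List<String> idsProcessedButNotAcknowledged;

    // Assumed restore hook: keeps the restored checkpoints and rebuilds the ID list from them.
    void restoreState(ArrayDeque<List<String>> restored) {
        pendingCheckpoints = restored;
        idsProcessedButNotAcknowledged = new ArrayList<>();
        for (List<String> checkpointIds : pendingCheckpoints) {
            idsProcessedButNotAcknowledged.addAll(checkpointIds);
        }
    }

    // Creates only what a prior restore has not already set up.
    void open() {
        if (pendingCheckpoints == null) {
            pendingCheckpoints = new ArrayDeque<>();
        }
        if (idsProcessedButNotAcknowledged == null) {
            idsProcessedButNotAcknowledged = new ArrayList<>();
        }
    }

    int pendingCheckpointCount() {
        return pendingCheckpoints.size();
    }

    int unacknowledgedIdCount() {
        return idsProcessedButNotAcknowledged.size();
    }
}
```

With this ordering-independent setup, the restored state survives whether `open()` runs before or after `restoreState`, which is the property the null-on-recovery report calls for.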

3 participants