[FLINK-3713] [clients, runtime] Use user code class loader when disposing savepoints #2083
It seems very clumsy to dispose savepoints like that. If that is currently needed for a fix, then I guess we have to live with it. It would make sense to make the JAR and job ID optional, so that non-custom cases can dispose savepoints the simple way.
Full ack, but the problem is not just RocksDB. It affects any savepoint that references a user class in the state descriptor, including FS snapshots. So if you configure a file backend and use folding or reducing state, you run into the issue as well. What about the following: make it optional and fail with a proper error message in case of a missing user code class. SubtaskState (previously StateForTask) currently only logs the Exception when it fails to dispose the state snapshots. We can change this to rethrow, and then we give a proper error message on failed savepoint disposal.
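To make the "rethrow instead of log" idea concrete, here is a minimal sketch in plain Java. It is not Flink's actual `SubtaskState` code: the `Snapshot` interface, the `disposeAll` method, and the wording of the error message are all hypothetical names chosen for illustration. The sketch tries to discard every snapshot, remembers the first failure, and rethrows it with a message that points the user at the missing job ID or JAR.

```java
import java.util.List;

/** Hypothetical sketch, not Flink's actual SubtaskState implementation. */
final class DisposalSketch {

    /** Stand-in for a state snapshot whose disposal may need user classes. */
    interface Snapshot {
        void discard(ClassLoader userCodeClassLoader) throws Exception;
    }

    /**
     * Discards all snapshots. Instead of only logging failures, the first
     * failure is rethrown (later ones are attached as suppressed exceptions)
     * so that the caller can report a proper error to the user.
     */
    static void disposeAll(List<Snapshot> snapshots, ClassLoader userCodeClassLoader)
            throws Exception {
        Exception firstFailure = null;
        for (Snapshot snapshot : snapshots) {
            try {
                snapshot.discard(userCodeClassLoader);
            } catch (Exception e) {
                if (firstFailure == null) {
                    firstFailure = e;
                } else {
                    firstFailure.addSuppressed(e);
                }
            }
        }
        if (firstFailure != null) {
            // Assumed message wording, for illustration only.
            throw new Exception(
                "Failed to dispose savepoint. If the state references user classes, "
                    + "provide the job ID of a running job or the job JARs.",
                firstFailure);
        }
    }
}
```

Continuing after a failure means the remaining snapshots still get a chance to be discarded, while the caller sees the root cause (for example a `ClassNotFoundException`) as the cause of the rethrown exception.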
Syntax: savepoint [OPTIONS] <Job ID>
"savepoint" action options:
-d,--dispose <savepointPath> Disposes an existing savepoint.
-m,--jobmanager <host:port> Address of the JobManager (master) to which
What happened to the jobmanager option?
I think it should still be there, but let me check again.
Good work @uce :-) I agree that it is currently a bit clumsy to discard savepoints of jobs which are no longer running. From the user perspective it should be as easy as possible. I also think that it would be a good idea to add the RocksDB jar to the flink-dist.jar since all serious users are using it. Furthermore, I think it would be a good idea to not only log possible exceptions in […]. I had some minor comments concerning test coverage and the way the job jars are constructed.
Thanks for the review. I will propagate the errors, make the job ID/JAR arguments optional, and try to simplify parts of the CLI as you suggested. Regarding including the RocksDB jar in dist, I think that should be handled as a separate issue.
I've addressed the comments:
If there are no objections, I would like to merge this.
Disposing savepoints via the JobManager fails for state handles or descriptors that contain user classes (for example custom folding state or RocksDB handles).
With this change, the user has to either provide the job ID of a running job when disposing a savepoint, so that the user code class loader of that job is used, or provide the job JARs.
This version breaks the API as the CLI now requires either a JobID or a JAR. I think this is reasonable, because the current approach only works for a subset of the available state variants.
We can port this back for 1.0.4 and make the JobID or JAR arguments optional. What do you think?
I've tested this with a job running on RocksDB state both while the job was running and after it terminated. This was not working with the current 1.0.3 version.
Ideally, we will get rid of the whole disposal business when we make savepoints properly self-contained. I'm going to open a JIRA issue with a proposal to do so soon.
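The class loading problem described in this PR can be illustrated with plain Java serialization. The sketch below is an assumption-laden illustration, not the code from this change: the class `UserCodeDeserialization` and its methods are hypothetical. It shows the general technique of resolving classes against a given class loader while deserializing, which is what a state handle containing user classes needs in order to be read (and then disposed) outside the job's own class loader.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamClass;
import java.io.Serializable;

/** Hypothetical helper; names are illustrative and not part of Flink. */
final class UserCodeDeserialization {

    /** Serializes a value with standard Java serialization. */
    static byte[] serialize(Serializable value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        }
        return bytes.toByteArray();
    }

    /**
     * Deserializes an object, resolving classes against the given user code
     * class loader first. Without this, classes that only exist in the job
     * JAR would fail with ClassNotFoundException.
     */
    static Object deserialize(byte[] bytes, final ClassLoader userCodeClassLoader)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(bytes)) {
                    @Override
                    protected Class<?> resolveClass(ObjectStreamClass desc)
                            throws IOException, ClassNotFoundException {
                        try {
                            return Class.forName(desc.getName(), false, userCodeClassLoader);
                        } catch (ClassNotFoundException e) {
                            // Fall back to the default lookup, e.g. for primitive types.
                            return super.resolveClass(desc);
                        }
                    }
                }) {
            return in.readObject();
        }
    }
}
```

In this picture, providing the job ID of a running job or the job JARs is what makes a suitable `userCodeClassLoader` available at disposal time.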