[FLINK-8005] [runtime] Set user code class loader before snapshot#4980
GJL wants to merge 6 commits into apache:master
During checkpointing, user code may dynamically load classes from the user code jar. This is a problem if the thread invoking the snapshot callbacks does not have the user code class loader set as its context class loader. This commit makes sure that the correct class loader is set.
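The idea described above can be sketched as a thread factory that sets the context class loader on every thread it creates. This is a minimal illustration only: `ContextClassLoaderThreadFactory` and `ContextClassLoaderDemo` are hypothetical names, not the actual `DispatcherThreadFactory` changed in this PR, and the empty `URLClassLoader` merely stands in for the user code class loader.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;

/** Hypothetical factory (an assumption for illustration, not Flink's class). */
final class ContextClassLoaderThreadFactory implements ThreadFactory {
    private final ClassLoader classLoader;

    ContextClassLoaderThreadFactory(ClassLoader classLoader) {
        this.classLoader = classLoader;
    }

    @Override
    public Thread newThread(Runnable r) {
        Thread t = new Thread(r, "async-call-dispatcher");
        // Callbacks running on this thread resolve classes through the given loader.
        t.setContextClassLoader(classLoader);
        t.setDaemon(true);
        return t;
    }
}

final class ContextClassLoaderDemo {
    /** Runs a callback on a dispatcher thread and returns the context class loader it observed. */
    static ClassLoader observedClassLoader(ClassLoader userCodeClassLoader) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor(
                new ContextClassLoaderThreadFactory(userCodeClassLoader));
        try {
            Callable<ClassLoader> callback = () -> Thread.currentThread().getContextClassLoader();
            return executor.submit(callback).get();
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the user code class loader (assumption: no jars attached).
        ClassLoader userCodeClassLoader =
                new URLClassLoader(new URL[0], ContextClassLoaderDemo.class.getClassLoader());
        System.out.println(observedClassLoader(userCodeClassLoader) == userCodeClassLoader);
    }
}
```

Setting the loader in the factory rather than in each callback means that every call dispatched on that thread sees the same loader without further bookkeeping.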
for (int i = 1; i <= numCalls; i++) {
    task.triggerCheckpointBarrier(i, 156865867234L, CheckpointOptions.forCheckpoint());
}
catch (Exception e) {
The semantics of the existing tests did not change: I removed the try-catch and simplified the assertion from
    if (currentState != ExecutionState.RUNNING && currentState != ExecutionState.FINISHED) {
        fail("Task should be RUNNING or FINISHED, but is " + currentState);
    }
to
    assertThat(currentState, isOneOf(ExecutionState.RUNNING, ExecutionState.FINISHED));
Yep, the diff on GitHub is a bit hard to read but I figured it out. 😅
These changes look good! 👍 I'll wait for Travis and then merge.
}

@Test(timeout = 20000)
public void testStopExecution() throws Exception {
This only tested that stop is invoked. Should be covered by testSetsUserCodeClassLoader now.
}

@Test(expected = RuntimeException.class)
public void testStopExecutionFail() throws Exception {
This is now covered by testThrowExceptionIfStopInvokedWithNotStoppableTask
    fail("Task should be RUNNING or FINISHED, but is " + currentState);
}

task.cancelExecution();
I moved this to the AutoCloseable TaskCleaner to avoid duplication.
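A rough sketch of this pattern, assuming a cleaner that simply runs a cancel action on close. `CancelOnClose` and `CleanerDemo` are hypothetical names for illustration and do not reproduce the `TaskCleaner` from this PR.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical cleaner: runs the given cancel action when the try block exits. */
final class CancelOnClose implements AutoCloseable {
    private final Runnable cancelAction;

    CancelOnClose(Runnable cancelAction) {
        this.cancelAction = cancelAction;
    }

    @Override
    public void close() {
        cancelAction.run();
    }
}

final class CleanerDemo {
    /** Returns how often the cancel action ran, even though the test body fails. */
    static int cancelCountAfterFailingBody() {
        AtomicInteger cancelCalls = new AtomicInteger();
        try (CancelOnClose ignored = new CancelOnClose(cancelCalls::incrementAndGet)) {
            // Stand-in for a failing assertion in a test body.
            throw new IllegalStateException("simulated test failure");
        } catch (IllegalStateException expected) {
            // close() has already run by the time the exception reaches this handler.
        }
        return cancelCalls.get();
    }

    public static void main(String[] args) {
        System.out.println(cancelCountAfterFailingBody());
    }
}
```

With try-with-resources, each test states its cleanup once at the top instead of repeating a cancel call at the end of every test method.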
}

// assert after task is canceled and executing thread is stopped to avoid race conditions
assertThat(classLoaders, hasSize(greaterThanOrEqualTo(3)));
Are we guaranteed that all three calls have been made at this point or could this be flaky due to race conditions?
As @aljoscha suggested, I think that there is no guarantee that the 3 calls have finished by the time we check, right?
I believe you are right. I introduced another latch to counter this.
 * @param classLoader The {@link ClassLoader} to be set as context class loader.
 */
public DispatcherThreadFactory(
        ThreadGroup group,
This is a code style preference rather than an issue, but I would suggest indenting the arguments by a tab to separate them from the body of the method.
@Override
public void abortCheckpointOnBarrier(long checkpointId, Throwable cause) {
    throw new UnsupportedOperationException("Should not be called");
Not sure anymore but I decided to add it again.
@Override
public void triggerCheckpointOnBarrier(CheckpointMetaData checkpointMetaData, CheckpointOptions checkpointOptions, CheckpointMetrics checkpointMetrics) throws Exception {
    throw new UnsupportedOperationException("Should not be called");
Not sure anymore but I decided to add it again.
Throw UnsupportedOperationException when CheckpointsInOrderInvokable#triggerCheckpointOnBarrier() and CheckpointsInOrderInvokable#abortCheckpointOnBarrier() are called.
I think waiting on the stop latch might not be enough (in 100% of cases) because the other two calls are also asynchronous.
As it is now, it should be enough because there is only one thread dispatching the calls: the tasks cannot overtake each other. I could make the test stricter and wait additionally on
Yes, but I think this makes an assumption about the internal implementation. If someone changes that, the test could break or no longer test the right thing.
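One way to make such a test independent of the dispatching order is to wait until every asynchronous call has reported completion before asserting. The sketch below uses a `java.util.concurrent.CountDownLatch` for this; the class name `AwaitAllCallsDemo`, the thread pool, and the recorded strings are assumptions for illustration and are not the latches used in the PR's test.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class AwaitAllCallsDemo {
    /** Dispatches numCalls asynchronous calls and returns only after all have completed. */
    static List<String> recordCalls(int numCalls, int numThreads) throws InterruptedException {
        List<String> recorded = new CopyOnWriteArrayList<>();
        CountDownLatch allCallsDone = new CountDownLatch(numCalls);
        ExecutorService executor = Executors.newFixedThreadPool(numThreads);
        try {
            for (int i = 0; i < numCalls; i++) {
                final int id = i;
                executor.execute(() -> {
                    recorded.add("call-" + id);
                    // Count down only after the side effect is visible.
                    allCallsDone.countDown();
                });
            }
            // Holds regardless of how many threads dispatch the calls or in which order.
            if (!allCallsDone.await(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("calls did not finish in time");
            }
            return recorded;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(recordCalls(3, 3).size());
    }
}
```

Because the assertion runs only after the latch has counted down to zero, the check stays valid even if the single dispatcher thread is later replaced by a pool.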
triggerLatch.trigger();
if (error != null) {
    // exit method prematurely due to error but make sure that the tests can finish
    triggerLatch.trigger();
For all latches, it should also have: if (!latch.isTriggered()) { latch.await(); }
Why is that? I think at this point the latch might not get triggered at all (except here).
Sorry, I was just looking at it in the IDE and missed those lines. This check should come before every call to await on the latch.
I think it doesn't matter because the latch already checks for the flag:
    public void await() throws InterruptedException {
        synchronized (lock) {
            while (!triggered) {
                lock.wait();
            }
        }
    }
    public void trigger() {
        synchronized (lock) {
            triggered = true;
            lock.notifyAll();
        }
    }
Yes, a latch that was already triggered will simply return immediately; there is no need for an additional check.
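To illustrate the point, here is a small self-contained latch modelled on the two methods quoted in this thread. `SimpleOneShotLatch` and `LatchDemo` are hypothetical names, and `isTriggered()` is added here only so the behaviour can be observed.

```java
/** Hypothetical latch modelled on the await()/trigger() code quoted in the review. */
final class SimpleOneShotLatch {
    private final Object lock = new Object();
    private boolean triggered;

    void trigger() {
        synchronized (lock) {
            triggered = true;
            lock.notifyAll();
        }
    }

    void await() throws InterruptedException {
        synchronized (lock) {
            // Skips the wait entirely if the latch was triggered earlier.
            while (!triggered) {
                lock.wait();
            }
        }
    }

    boolean isTriggered() {
        synchronized (lock) {
            return triggered;
        }
    }
}

final class LatchDemo {
    /** Triggers first, then awaits twice; returns once both awaits have come back. */
    static boolean awaitAfterTrigger() throws InterruptedException {
        SimpleOneShotLatch latch = new SimpleOneShotLatch();
        latch.trigger();
        latch.trigger(); // triggering again is harmless
        latch.await();   // does not block: the flag is already set
        latch.await();
        return latch.isTriggered();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(awaitAfterTrigger());
    }
}
```

The `while (!triggered)` loop is what makes a separate `isTriggered()` guard before `await()` redundant, and it also protects against spurious wakeups.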
Thanks, I think this is excellent now. 👌 I'll merge as soon as Travis is green.
I agree! +1 to merge as soon as Travis gives us the green light.
Thanks again for this fix! 👍 Could you please close if GitHub doesn't auto-close?
What is the purpose of the change
During checkpointing, user code may dynamically load classes from the user code
jar. This is a problem if the thread invoking the snapshot callbacks does not
have the user code class loader set as its context class loader. This commit
makes sure that the correct class loader is set.
Brief change log
Task#asyncCallDispatcher
Verifying this change
This change added tests and can be verified as follows:
Does this pull request potentially affect one of the following parts:
@Public(Evolving): (yes / no)
Documentation