Watch connection manager never closed when trying to delete a non-existing POD #9932
Conversation
Signed-off-by: David Festal <dfestal@redhat.com>
Can one of the admins verify this patch?
ci-test
```java
  });
  Boolean deleteSucceeded = podResource.delete();
  if (deleteSucceeded == null || !deleteSucceeded) {
    watch.close();
```
I think that in the case when a pod doesn't exist, the delete future will be completed exceptionally with the message `Websocket connection is closed. But event about removing is not received.` And I would say that this is not an exceptional situation. So maybe `deleteFuture.complete()` would be better than closing the Watch.
@sleshchenko It appears (also through debugging) that this is clearly not the case: when the POD doesn't exist, the `DeleteWatcher` isn't called at all by the `WatchConnectionManager`, and so the `WatchConnectionManager` is never closed.
This is precisely the bug I'm fixing here.
This is why I have to explicitly close the `WatchConnectionManager` when the `delete()` call returns `false` because no such POD exists.
I understand the fixed bug, and it's a nice catch =)
What I mean is that (as far as I understand) if you complete the future, then `onComplete` will be called and the watch will be closed. So completing the future without an exception would be more correct here.
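For illustration, a minimal self-contained sketch of that pattern (the `Watch` interface below is a hypothetical stand-in for the fabric8 `io.fabric8.kubernetes.client.Watch`; names are assumptions, not the PR's code):

```java
import java.util.concurrent.CompletableFuture;

public class CompleteClosesWatchSketch {
  // Hypothetical stand-in for io.fabric8.kubernetes.client.Watch
  interface Watch extends AutoCloseable {
    @Override
    void close();
  }

  public static void main(String[] args) {
    CompletableFuture<Void> deleteFuture = new CompletableFuture<>();
    Watch watch = () -> System.out.println("watch closed");

    // Close the watch once, whenever the future completes (normally or not)...
    deleteFuture.whenComplete((v, e) -> watch.close());

    // ...so the "pod doesn't exist" case is handled by simply completing
    // the future, which closes the watch as a side effect.
    deleteFuture.complete(null); // prints "watch closed"
  }
}
```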
Well, we would still have to close the watch in case an exception occurs in the `podResource.delete()` call and no `DeleteWatcher` method is called. So why not do the same action in the two cases where the deletion didn't occur?
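The argument in code, roughly (a sketch assuming the names from the diff above; `deleteOrCloseWatch` is a hypothetical helper, not the merged code, and the `PodResource<Pod, DoneablePod>` signature assumes the fabric8 client generation used here):

```java
import io.fabric8.kubernetes.api.model.DoneablePod;
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.KubernetesClientException;
import io.fabric8.kubernetes.client.Watch;
import io.fabric8.kubernetes.client.dsl.PodResource;

class DeleteOrCloseWatchSketch {
  // Hypothetical helper illustrating "same action in both non-deletion cases"
  static void deleteOrCloseWatch(PodResource<Pod, DoneablePod> podResource, Watch watch) {
    try {
      Boolean deleteSucceeded = podResource.delete();
      if (deleteSucceeded == null || !deleteSucceeded) {
        // case 1: no such pod, so DeleteWatcher will never fire
        watch.close();
      }
    } catch (KubernetesClientException e) {
      // case 2: delete() failed, so DeleteWatcher will never fire either
      watch.close();
      throw e;
    }
  }
}
```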
ci-test build report:
Signed-off-by: David Festal <dfestal@redhat.com>
as proposed by @sleshchenko Signed-off-by: David Festal <dfestal@redhat.com>
@l0rd I added a test. @sleshchenko I finally changed to use
ci-test
ci-test build report:
Good fix 👍
Please take a look at my minor comments.
```diff
@@ -203,6 +207,33 @@ public void testStopsWaitingServiceAccountEventJustAfterEventReceived() throws E
     verify(serviceAccountResource).watch(any());
   }
 
+  @Test
+  public void testDeleteNonExistingPodBeforeWatch() {
+    System.out.println(
```
Is it just debug info that should be removed before merge?
It's debug info that should have been removed, sure!
fixed
```diff
@@ -203,6 +207,33 @@ public void testStopsWaitingServiceAccountEventJustAfterEventReceived() throws E
     verify(serviceAccountResource).watch(any());
   }
 
+  @Test
+  public void testDeleteNonExistingPodBeforeWatch() {
```
Please add one more test checking that the watch is closed when a `KubernetesClientException` occurs.
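Something along these lines, perhaps (a rough Mockito sketch; the mock names `podResource`, `watch`, `pods` and the constant `POD_NAME` are assumptions, not necessarily the names used in this test class):

```java
@Test
public void testWatchIsClosedWhenKubernetesClientExceptionOccurs() throws Exception {
  // Assumed mocks: podResource hands out the watch, but delete() blows up.
  when(podResource.watch(any())).thenReturn(watch);
  when(podResource.delete()).thenThrow(new KubernetesClientException("test exception"));

  try {
    pods.doDelete(POD_NAME);
    fail("An exception was expected");
  } catch (Exception expected) {
    // delete() failed, so DeleteWatcher will never be notified
  }

  // The essential assertion: the watch must still have been closed.
  verify(watch).close();
}
```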
I added 2 new tests
```java
        | InterruptedException
        | ExecutionException
        | TimeoutException e) {
      e.printStackTrace();
```
I think that in the test you know whether this exception should be thrown or not; if it should not be thrown, then I think it would be better to declare `Exception` in the throws list of the test method.
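That is, roughly (a sketch; only the signature is the point here, and `pods`/`POD_NAME` are assumed names):

```java
// Declaring Exception lets an unexpected failure fail the test directly,
// instead of being swallowed by a catch block with printStackTrace():
@Test
public void testDeleteNonExistingPodBeforeWatch() throws Exception {
  // ... arrange mocks as in the tests above ...
  pods.doDelete(POD_NAME).get(1, TimeUnit.SECONDS); // may throw: that's fine
  // ... verifications ...
}
```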
fixed
```java
      if (toCloseOnException != null) {
        toCloseOnException.close();
      }
      throw e;
```
Maybe it would be better to wrap this exception into `InternalInfrastructureException`.
Well, honestly I prefer re-throwing exactly the same exception, so that there is no behavior change compared to the previous implementation, which used to only catch `KubernetesClientException` exceptions.
I agree that it's the same as it was before. But only `InfrastructureException` is declared in the throws list, and it would be safer to wrap other exceptions in `InternalInfrastructureException`.
`InternalInfrastructureException` is a special kind of exception that should wrap unexpected exceptions, like a `NullPointerException` that happens because of a developer fault.
I would recommend wrapping, but you can leave it as it is since it doesn't change the current approach.
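Concretely, the suggested wrapping would look roughly like this (a fragment sketch; the surrounding catch structure is inferred from the diff above, not copied from the merged code):

```java
    } catch (KubernetesClientException e) {
      if (toCloseOnException != null) {
        toCloseOnException.close();
      }
      throw new KubernetesInfrastructureException(e);
    } catch (RuntimeException e) {
      if (toCloseOnException != null) {
        toCloseOnException.close();
      }
      // Wrap anything unexpected, so that callers still only ever see
      // the declared InfrastructureException hierarchy:
      throw new InternalInfrastructureException(e.getMessage(), e);
    }
```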
```diff
-  private CompletableFuture<Void> doDelete(String name) throws InfrastructureException {
+  @VisibleForTesting
+  CompletableFuture<Void> doDelete(String name) throws InfrastructureException {
+    Watch toCloseOnException = null;
```
I can propose rewriting this method in the following way:

```java
@VisibleForTesting
CompletableFuture<Void> doDelete(String name) {
  final CompletableFuture<Void> deleteFuture = new CompletableFuture<>();
  try {
    final PodResource<Pod, DoneablePod> podResource =
        clientFactory.create(workspaceId).pods().inNamespace(namespace).withName(name);
    Watch watch = podResource.watch(new DeleteWatcher(deleteFuture));
    // watch is opened - register a callback to close it
    deleteFuture.whenComplete((v, e) -> watch.close());
    Boolean deleteSucceeded = podResource.delete();
    if (deleteSucceeded == null || !deleteSucceeded) {
      deleteFuture.complete(null);
    }
  } catch (KubernetesClientException ex) {
    deleteFuture.completeExceptionally(new KubernetesInfrastructureException(ex));
  } catch (Exception e) {
    deleteFuture.completeExceptionally(new InternalInfrastructureException(e.getMessage(), e));
  }
  return deleteFuture;
}
```

Why it can be better: when an exception occurs while removing many pods, removal of all the pods will still be attempted. With the current approach, if an exception occurs, no more pods will be removed. And I think it would be better to try to clean up as many resources as we can.
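For instance, a caller deleting several pods could then attempt every removal and only surface failures afterwards (hypothetical caller sketch; `podNames` and the imports for `List`, `CompletableFuture`, and `Collectors` are assumed):

```java
// Hypothetical caller: with failures captured inside the futures,
// doDelete(...) is invoked for every pod, no matter how many of them fail.
List<CompletableFuture<Void>> deletions =
    podNames.stream().map(this::doDelete).collect(Collectors.toList());
// allOf completes only once every deletion future has completed;
// get() then rethrows the first failure, after all cleanup was attempted.
CompletableFuture.allOf(deletions.toArray(new CompletableFuture[0])).get();
```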
Do you mind if we do this change in another PR?
I'd prefer to keep the behavior as identical as possible to what it was previously, and only fix the bug of the non-closed watch.
@davidfestal Makes sense. I'm OK with that
Signed-off-by: David Festal <dfestal@redhat.com>
[here](#9932 (comment)) Signed-off-by: David Festal <dfestal@redhat.com>
Signed-off-by: David Festal <dfestal@redhat.com>
ci-test
@l0rd Are you OK with the 3 tests I added?
LGTM thanks @davidfestal
ci-test build report:
…sting POD (#9932) Fix the root cause of a recurring 1006 web-socket error. The fixed bug is described / discussed in the following issue: redhat-developer/rh-che#672 Signed-off-by: David Festal <dfestal@redhat.com>
…sting POD (eclipse#9932) Fix the root cause of a recurring 1006 web-socket error. The fixed bug is described / discussed in the following issue: redhat-developer/rh-che#672 Signed-off-by: David Festal <dfestal@redhat.com>
* Watch connection manager never closed when trying to delete a non-existing POD (#9932) Fix the root cause of a recurring 1006 web-socket error. The fixed bug is described / discussed in the following issue: redhat-developer/rh-che#672 Signed-off-by: David Festal <dfestal@redhat.com>
* CHE-5918 Add an ability to interrupt Kubernetes/OpenShift runtime start Signed-off-by: Sergii Leshchenko <sleshche@redhat.com>
* CHE-5918 Add checking of start interruption by KubernetesBootstrapper It is needed to avoid '403 Pod doesn't exists' errors. It happens when start is interrupted when any of machines is on bootstrapping phase. As result connection leak happens TODO Create an issue for fabric8-client
* Improve ExecWatchdog to rethrow exception when it occurred while establishing a WebSocket connection
* CHE-5918 Fix K8s/OS runtime start failing when unrecoverable event occurs
What does this PR do?
This PR fixes a nasty bug that occurs under some conditions when trying to delete a non-existing command POD: the corresponding watch connection manager is never closed, which leads to an infinite loop of regular attempts to reconnect web-sockets to a non-existing POD.
What issues does this PR fix or reference?
This is the root cause of redhat-developer/rh-che#672.