
Watch connection manager never closed when trying to delete a non-existing POD #9932

Merged
merged 6 commits into master from 1006-web-socket-error on Jun 5, 2018

Conversation

davidfestal
Contributor

What does this PR do?

This PR fixes a nasty bug that occurs in some conditions when trying to delete a non-existing command POD: the corresponding watch connection manager is never closed and this leads to an infinite loop of regular attempts to reconnect web-sockets to a non-existing POD.
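
For context, a minimal sketch of the pattern involved (reconstructed from the diff fragments quoted further down; names such as podResource, deleteFuture and DeleteWatcher follow those fragments and are not necessarily the complete Che code):

  // A watch is opened before the delete call so that the DeleteWatcher
  // can complete deleteFuture once the pod-removal event arrives.
  Watch watch = podResource.watch(new DeleteWatcher(deleteFuture));

  // If the pod does not exist, delete() returns false (or null) and the
  // DeleteWatcher never receives any event, so nothing ever closes the watch:
  // the underlying WatchConnectionManager keeps reconnecting its web-socket forever.
  Boolean deleteSucceeded = podResource.delete();
  if (deleteSucceeded == null || !deleteSucceeded) {
    watch.close(); // the fix: explicitly release the watch when no event will come
  }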

What issues does this PR fix or reference?

This is the root cause of redhat-developer/rh-che#672.

Signed-off-by: David Festal <dfestal@redhat.com>
@codenvy-ci

Can one of the admins verify this patch?

@davidfestal davidfestal requested a review from l0rd June 4, 2018 13:00
@davidfestal
Contributor Author

ci-test

@benoitf benoitf added the status/code-review (This issue has a pull request posted for it and is awaiting code review completion by the community.) and kind/bug (Outline of a bug - must adhere to the bug report template.) labels Jun 4, 2018
    });
    Boolean deleteSucceeded = podResource.delete();
    if (deleteSucceeded == null || !deleteSucceeded) {
      watch.close();
Member

I think that in the case when a pod doesn't exist, the delete future will be completed exceptionally with the message "Webscoket connection is closed. But event about removing is not received.". And I would say that this is not an exceptional situation, so maybe deleteFuture.complete() would be better than closing the Watch.

Contributor Author

@sleshchenko It appears (also confirmed through debugging) that this is clearly not the case: when the POD doesn't exist, the DeleteWatcher isn't called at all by the WatchConnectionManager, and so the WatchConnectionManager is never closed.

This is precisely the bug I'm fixing here.

This is why I have to explicitly close the WatchConnectionManager when the delete() call returns false because no such POD exists.

Member

I understand the fixed bug and it's a nice catch =)
As far as I understand, if you complete the future then onComplete will be called and the watch will be closed. And completing the future without an exception would be more correct here.

Contributor Author

well, we would still have to close the watch in case an exception occurs in the podResource.delete() call and no DeleteWatcher method is called. So why not do the same action in the two cases where the deletion didn't occur?
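
In other words, the handling being argued for here looks roughly like this (a sketch reconstructed from the diff fragments quoted in this review, not the exact merged code; as discussed further down, the final version completes the future instead of closing the watch directly):

  Watch toCloseOnException = null;
  try {
    Watch watch = podResource.watch(new DeleteWatcher(deleteFuture));
    toCloseOnException = watch;

    Boolean deleteSucceeded = podResource.delete();
    if (deleteSucceeded == null || !deleteSucceeded) {
      // case 1: the POD doesn't exist, so no watcher event will ever arrive
      watch.close();
    }
    return deleteFuture;
  } catch (KubernetesClientException e) {
    // case 2: delete() itself failed before any watcher event was delivered
    if (toCloseOnException != null) {
      toCloseOnException.close();
    }
    throw e;
  }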

@codenvy-ci

ci-test build report:
Build details
Test report
selenium tests report data
docker image: eclipseche/che-server:9932
https://github.com/orgs/eclipse/teams/eclipse-che-qa please check this report.

Signed-off-by: David Festal <dfestal@redhat.com>
as proposed by @sleshchenko

Signed-off-by: David Festal <dfestal@redhat.com>
@davidfestal
Contributor Author

@l0rd I added a test

@sleshchenko I finally changed to use deleteFuture.complete() as you proposed. Could you please confirm this is what you were expecting?

@davidfestal
Contributor Author

ci-test

@codenvy-ci

ci-test build report:
Build details
Test report
selenium tests report data
docker image: eclipseche/che-server:9932
https://github.com/orgs/eclipse/teams/eclipse-che-qa please check this report.


@sleshchenko sleshchenko left a comment


Good fix 👍
Please take a look at my minor comments.

@@ -203,6 +207,33 @@ public void testStopsWaitingServiceAccountEventJustAfterEventReceived() throws E
    verify(serviceAccountResource).watch(any());
  }

  @Test
  public void testDeleteNonExistingPodBeforeWatch() {
    System.out.println(
Member

Is it just debug info that should be removed before merge?

Contributor Author

It's debug info that should have been removed, sure!

Contributor Author

fixed

@@ -203,6 +207,33 @@ public void testStopsWaitingServiceAccountEventJustAfterEventReceived() throws E
    verify(serviceAccountResource).watch(any());
  }

  @Test
  public void testDeleteNonExistingPodBeforeWatch() {
Member

Please add one more test that the watch is closed when a KubernetesClientException occurs.

Contributor Author

I added 2 new tests
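
For illustration, a test for the KubernetesClientException case could look roughly like this (Mockito/TestNG style, matching the surrounding test class; the mock and field names such as podResource, watch, deployments and POD_NAME are assumptions, not the actual test code):

  @Test
  public void shouldCloseWatchWhenDeleteThrowsKubernetesClientException() throws Exception {
    // podResource and watch are assumed to be Mockito mocks of the fabric8
    // PodResource and Watch interfaces wired into the object under test
    when(podResource.watch(any())).thenReturn(watch);
    when(podResource.delete()).thenThrow(new KubernetesClientException("test failure"));

    try {
      // 'deployments' stands here for the class exposing the @VisibleForTesting doDelete
      deployments.doDelete(POD_NAME);
    } catch (Exception expected) {
      // the delete failure is expected to propagate in one form or another
    }

    // the watch opened before delete() must be closed even when delete() fails
    verify(watch).close();
  }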

        | InterruptedException
        | ExecutionException
        | TimeoutException e) {
      e.printStackTrace();
Member

I think in the test you know whether this exception should be thrown or not; if it should not be thrown, then it would be better to declare Exception in the throws clause of the test method.
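
I.e., roughly this shape (a sketch only):

  // instead of catching InterruptedException/ExecutionException/TimeoutException
  // and calling e.printStackTrace(), let the test framework report the failure:
  @Test
  public void testDeleteNonExistingPodBeforeWatch() throws Exception {
    // the body can then call blocking methods like get(...) directly,
    // without wrapping them in try/catch
  }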

Contributor Author

fixed

      if (toCloseOnException != null) {
        toCloseOnException.close();
      }
      throw e;
Member

Maybe it would be better to wrap this exception into an InternalInfrastructureException.

Contributor Author

well, honestly I prefer re-throwing exactly the same exception, so that there is no behavior change compared to the previous implementation, which used to only catch KubernetesClientException exceptions.

Member

I agree that it's the same as before. But only InfrastructureException is declared in the throws clause, and it would be safer to wrap other exceptions in InternalInfrastructureException.
InternalInfrastructureException is a special kind of exception that should wrap unexpected exceptions, like NullPointerException, which happen because of a developer fault.
I would recommend wrapping, but you can leave it as it is since it doesn't change the current approach.
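
For illustration, the wrapping being recommended would look roughly like this (a sketch only; the merged code kept the plain re-throw discussed above):

  } catch (RuntimeException e) {
    // close the watch opened before delete(), then wrap the unexpected failure
    // so that only InfrastructureException (and its subclasses) escape doDelete
    if (toCloseOnException != null) {
      toCloseOnException.close();
    }
    throw new InternalInfrastructureException(e.getMessage(), e);
  }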

-  private CompletableFuture<Void> doDelete(String name) throws InfrastructureException {
+  @VisibleForTesting
+  CompletableFuture<Void> doDelete(String name) throws InfrastructureException {
+    Watch toCloseOnException = null;
Member

I propose rewriting this method in the following way:

@VisibleForTesting
  CompletableFuture<Void> doDelete(String name) {
    final CompletableFuture<Void> deleteFuture = new CompletableFuture<>();
    try {
      final PodResource<Pod, DoneablePod> podResource =
          clientFactory.create(workspaceId).pods().inNamespace(namespace).withName(name);

      Watch watch = podResource.watch(new DeleteWatcher(deleteFuture));
      //watch is opened - register callback to close it
      deleteFuture.whenComplete((v, e) -> watch.close());

      Boolean deleteSucceeded = podResource.delete();
      if (deleteSucceeded == null || !deleteSucceeded) {
        deleteFuture.complete(null);
      }
    } catch (KubernetesClientException ex) {
      deleteFuture.completeExceptionally(new KubernetesInfrastructureException(ex));
    } catch (Exception e) {
      deleteFuture.completeExceptionally(new InternalInfrastructureException(e.getMessage(), e));
    }

    return deleteFuture;
  }

Why it can be better: when an exception occurs while removing many pods, there will still be an attempt to remove all the other pods. With the current approach, if an exception occurs then no more pods will be removed. And I think it would be better to try to clean up as many resources as we can.
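
A sketch of the calling side this reasoning refers to (a hypothetical caller; podList and the surrounding loop are illustrative, not quoted from the PR):

  // delete every pod, collecting one future per pod; with the future-based
  // approach a failure for one pod is captured in its own future instead of
  // aborting the loop, so the remaining pods are still cleaned up
  List<CompletableFuture<Void>> deleteFutures = new ArrayList<>();
  for (Pod pod : podList.getItems()) {
    deleteFutures.add(doDelete(pod.getMetadata().getName()));
  }
  // wait for all deletions; individual failures only surface here,
  // after every pod has at least been attempted
  CompletableFuture.allOf(deleteFutures.toArray(new CompletableFuture[0])).get();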

Contributor Author

Do you mind if we make this change in another PR?
I'd prefer to keep the behavior as close as possible to what it was previously, and only fix the bug of the non-closed watch.

Member

@davidfestal Makes sense. I'm OK with that

Signed-off-by: David Festal <dfestal@redhat.com>
[here](#9932 (comment))

Signed-off-by: David Festal <dfestal@redhat.com>
Signed-off-by: David Festal <dfestal@redhat.com>
@davidfestal
Contributor Author

ci-test

@davidfestal
Contributor Author

@l0rd Are you OK with the 3 tests I added?


@l0rd l0rd left a comment


LGTM thanks @davidfestal

@codenvy-ci

ci-test build report:
Build details
Test report
selenium tests report data
docker image: eclipseche/che-server:9932
https://github.com/orgs/eclipse/teams/eclipse-che-qa please check this report.

@davidfestal davidfestal merged commit 6227792 into master Jun 5, 2018
@benoitf benoitf removed the status/code-review (This issue has a pull request posted for it and is awaiting code review completion by the community.) label Jun 5, 2018
@benoitf benoitf added this to the 6.7.0 milestone Jun 5, 2018
riuvshin pushed a commit that referenced this pull request Jun 6, 2018
Watch connection manager never closed when trying to delete a non-existing POD (#9932)

Fix the root cause of a recurring 1006 web-socket error.

The fixed bug is described / discussed in the following issue: redhat-developer/rh-che#672
Signed-off-by: David Festal <dfestal@redhat.com>
davidfestal added a commit to davidfestal/che that referenced this pull request Jun 14, 2018
Watch connection manager never closed when trying to delete a non-existing POD (eclipse#9932)

Fix the root cause of a recurring 1006 web-socket error.

The fixed bug is described / discussed in the following issue: redhat-developer/rh-che#672
Signed-off-by: David Festal <dfestal@redhat.com>
riuvshin pushed a commit that referenced this pull request Jun 14, 2018
* Watch connection manager never closed when trying to delete a non-existing POD (#9932)

Fix the root cause of a recurring 1006 web-socket error.

The fixed bug is described / discussed in the following issue: redhat-developer/rh-che#672
Signed-off-by: David Festal <dfestal@redhat.com>

* CHE-5918 Add an ability to interrupt Kubernetes/OpenShift runtime start

Signed-off-by: Sergii Leshchenko <sleshche@redhat.com>

* CHE-5918 Add checking of start interruption by KubernetesBootstrapper

It is needed to avoid '403 Pod doesn't exists' errors.
It happens when the start is interrupted while any of the machines is in the bootstrapping phase.
As a result a connection leak happens. TODO: create an issue for fabric8-client

* Improve ExecWatchdog to rethrow exception when it occurred while establishing a WebSocket connection

* CHE-5918 Fix K8s/OS runtime start failing when unrecoverable event
occurs
@skabashnyuk skabashnyuk deleted the 1006-web-socket-error branch June 19, 2018 10:49
hbhargav pushed a commit to hbhargav/che that referenced this pull request Dec 5, 2018
Watch connection manager never closed when trying to delete a non-existing POD (eclipse#9932)

Fix the root cause of a recurring 1006 web-socket error.

The fixed bug is described / discussed in the following issue: redhat-developer/rh-che#672
Signed-off-by: David Festal <dfestal@redhat.com>
Labels
kind/bug Outline of a bug - must adhere to the bug report template.