Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rh-che #557: Adding property and handler for stopping k8s / openshift workspace if unrecoverable event occurs during workspace startup #9426

Merged
merged 1 commit into from
Apr 19, 2018

Conversation

ibuziuk
Copy link
Member

@ibuziuk ibuziuk commented Apr 13, 2018

What does this PR do?

Adding property and handler for stopping k8s / openshift workspace if unrecoverable event occurs during workspace startup

Demo - https://youtu.be/J_DfNZXidE0

What issues does this PR fix or reference?

redhat-developer/rh-che#557

Adding property che.infra.kubernetes.workspace_unrecoverable_events

if during workspace startup an unrecoverable event occurs (e.g. "Failed Mount" "Failed Scheduling") terminate workspace immediately instead of waiting util timeout

Release Notes

Adding property che.infra.kubernetes.workspace_unrecoverable_events

if during workspace startup an unrecoverable event occurs (e.g. "Failed Mount" "Failed Scheduling") terminate workspace immediately instead of waiting util timeout

Docs PR

eclipse-che/che-docs#395

@codenvy-ci
Copy link

Can one of the admins verify this patch?

2 similar comments
@codenvy-ci
Copy link

Can one of the admins verify this patch?

@codenvy-ci
Copy link

Can one of the admins verify this patch?

@benoitf benoitf added the status/code-review This issue has a pull request posted for it and is awaiting code review completion by the community. label Apr 13, 2018
@garagatyi garagatyi requested a review from akorneta April 16, 2018 10:42
@@ -420,6 +426,10 @@ protected void startMachines() throws InfrastructureException {
// namespace.pods().watch(new AbnormalStopHandler());
// namespace.pods().watchContainers(new MachineLogsPublisher());

if (failFastOnUnrecoverableEvents) {
namespace.pods().watchContainers(new UnrecoverableEventHanler());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it could be useful add MachineLogsPublisher here. WDYT?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sleshchenko can you elaborate on what exactly to do?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be something like the following:

--> start
try {
  namespace.pods().watchContainers(new MachineLogsPublisher());
  if (failFastOnUnrecoverableEvents) {
    namespace.pods().watchContainers(new UnrecoverableEventHanler());
  }

  //doStart here
} catch ....
... 
} finally {
  namespace.pods().stopWatch();
}

In this case clients would receive K8s/OS events text as machine's output, and it would be easier to understand why workspace start fails.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sleshchenko Adding MachineLogsPublisher seems to be unrelated to this issue but rather to #7653 IMO. Could this be addressed in another PR ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ibuziuk For me, it looks like it's easy to add it here (I guess described changes should be enough) and it will be useful (machine logs panel will have all events). It's why I write it here.
But I agree that it is not related to a solved issue and it can be added in another PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sleshchenko ok, sounds relevant - let me add it to this PR

}

private boolean isUnrecoverable(String reason) {
return reason.equals("Failed Mount") || reason.equals("Failed Scheduling");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it makes sense to make these values configurable?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Member Author

@ibuziuk ibuziuk Apr 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sleshchenko do you mean property with comma separated reasons ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think yes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -80,6 +88,9 @@ public String toString() {
+ ", containerName='"
+ containerName
+ '\''
+ ", reason='"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A new field should be used in equals/hashcode methods too

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, good point

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -240,6 +244,8 @@ private void waitMachines(
failure.completeExceptionally(ex);
cancelAll(toCancelFutures);
wrapAndRethrow(ex.getCause());
} finally {
namespace.pods().stopWatch();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you considered moving canceling of watching to internalStart method? I think that there might be a situation in future when newly introduced bug prevents the execution of method waitMachines which would lead to a leak of the watch connections.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good, let me fix

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved to internalStart

String reason = event.getReason();
String message = event.getMessage();
if (isUnrecoverable(reason)) {
LOG.error("Unrecoverable event occured {}, {}", reason, message);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think dev/ops team would be able to address logged problem somehow? Maybe it is worth to add more info to be able to debug the issue or remove logging at all?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@garagatyi what info do you will be relevant here ? req_id / identity id would be added to the logs automatically I assume, no ?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe workspace ID, maybe some pod/container info if any

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have added pod info

@@ -600,6 +610,26 @@ public void accept(ProbeResult probeResult) {
}
}

/** Listens container's events and terminates runtime if unrecoverable event occurs. */
public class UnrecoverableEventHanler implements ContainerEventHandler {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add tests for this case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, sure

@@ -193,6 +195,7 @@ public void shouldStartMachines() throws Exception {
verify(routes).create(any());
verify(services).create(any());

verify(project.pods(), times(1)).watchContainers(any());

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also check for the cancellation of watching?

@@ -137,6 +138,7 @@ public void setup() throws Exception {
new OpenShiftInternalRuntime(
13,
5,
FAIL_ON_UNRECOVERABLE_EVENT,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a case when the value is false?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@garagatyi do you think it is ok to check one infra with false & other with true, or you think there should be both cases handled in both infras ?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since they might have their own bugs we should have tests for both

@@ -180,6 +182,7 @@ public void setup() throws Exception {
new KubernetesInternalRuntime<>(
13,
5,
FAIL_ON_UNRECOVERABLE_EVENT,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a case when the value is false?

@@ -600,6 +610,26 @@ public void accept(ProbeResult probeResult) {
}
}

/** Listens container's events and terminates runtime if unrecoverable event occurs. */

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK we don't have runtime interruption in place for k8s runtime yet. Can you elaborate how this handler stops start of a runtime?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/runtime/workspace/

@ibuziuk ibuziuk changed the title rh-che #557: Adding property and handler for stopping k8s / openshift runtime if unrecoverable event occurs during workspace startup rh-che #557: Adding property and handler for stopping k8s / openshift workspace if unrecoverable event occurs during workspace startup Apr 16, 2018
Copy link
Member

@sleshchenko sleshchenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your changes will make life easier during an investigation of workspace start failing 👍

@@ -347,6 +347,11 @@ che.infra.kubernetes.workspace_start_timeout_min=8
# Defines the timeout in minutes that limits the period for which Kubernetes Ingress become ready
che.infra.kubernetes.ingress_start_timeout_min=5

# If during workspace startup an unrecoverable event defined in 'che.infra.kubernetes.workspace_unrecoverable_events' occurs
# terminate workspace immediately instead of waiting util timeout
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should util be until?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch

String message = event.getMessage();
String podName = event.getPodName();
try {
internalStop(emptyMap());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it doesn't work correctly until k8s runtime start interruption will be implemented.
I guess client will receive two workspace status events:

  • one that workspace failed to start because of Unrecoverable;
  • a second one that start timeout exceeded.

Let me check that. BTW I think we can merge these changes with the described issue and it will be fixed with implemented interruption of k8s runtime start.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've checked. So here is the Che Server logs


  | 2018-04-17 08:02:45,912[io-8080-exec-10]  [INFO ] [o.e.c.a.w.s.WorkspaceRuntimes 236]   - Starting workspace 'che/qwe' with id 'workspacea75oxtgryzdjq5iq' by user 'che'
  /////////////////// Here Client received that workspace failed to start but Che Server doesn't contain the corresponding message ///////////////////////////
  /////////////////// ~3 minutes later reached workspace start timeout ///////////////////////////
  | 2018-04-17 08:05:18,006[aceSharedPool-0]  [WARN ] [.i.k.KubernetesInternalRuntime 177]  - Failed to start Kubernetes runtime of workspace workspacelhu15ifhded6jrxf. Cause: Server 'terminal' in machine 'dev-machine' not available.
  | 2018-04-17 08:05:18,006[ServersChecker]   [ERROR] [.i.k.KubernetesInternalRuntime 335]  - Unable to update status of the machine '%s:%s'. Cause: %s
  | 2018-04-17 08:05:18,066[aceSharedPool-0]  [INFO ] [o.e.c.a.w.s.WorkspaceRuntimes 292]   - Workspace 'che:wksp-g9f7' with id 'workspacelhu15ifhded6jrxf' start failed
  | 2018-04-17 08:05:18,070[aceSharedPool-0]  [WARN ] [o.e.c.a.w.s.WorkspaceManager 423]    - Cannot set error status of the workspace workspacelhu15ifhded6jrxf. Error is: Workspace with id 'workspacelhu15ifhded6jrxf' doesn't exist

@@ -347,6 +347,11 @@ che.infra.kubernetes.workspace_start_timeout_min=8
# Defines the timeout in minutes that limits the period for which Kubernetes Ingress become ready
che.infra.kubernetes.ingress_start_timeout_min=5

# If during workspace startup an unrecoverable event defined in 'che.infra.kubernetes.workspace_unrecoverable_events' occurs
# terminate workspace immediately instead of waiting util timeout
che.infra.kubernetes.workspace_fail_on_unrecoverable_events=true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you describe why it is necessary to have separate configuration property for disabling listening to unrecoverable events. Maybe it makes sense to configure empty che.infra.kubernetes.workspace_unrecoverable_events for disabling instead of having a separate one?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, less config we have the better

@@ -66,6 +66,10 @@
<groupId>javax.inject</groupId>
<artifactId>javax.inject</artifactId>
</dependency>
<dependency>
<groupId>org.codehaus.plexus</groupId>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we don't have CQ for this dependency. Can you use some alternatives?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

@ibuziuk ibuziuk Apr 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@skabashnyuk ok, good to know - changed to com.google.common.base.Strings

@@ -347,6 +347,10 @@ che.infra.kubernetes.workspace_start_timeout_min=8
# Defines the timeout in minutes that limits the period for which Kubernetes Ingress become ready
che.infra.kubernetes.ingress_start_timeout_min=5

# If during workspace startup an unrecoverable event defined in the property occurs,
# terminate workspace immediately instead of waiting until timeout
che.infra.kubernetes.workspace_unrecoverable_events=Failed Mount,Failed Scheduling,Failed to pull image
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For users it may be interesting to have a list (or better a link) of valid kubernetes events.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@l0rd could you please clarify a bit ? is it about changing the property format

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Mario means a link to docs where all the event types are listed.
And 👍 to add it here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I am a Che user and I am editing che.properties I don't know what are the valid values for this new property. You may want to say in the comment that this is a comma separated list of kubernetes events. And it would be great if we could add a link to a list of valid kubernetes events (but I am not sure we can easily find one so don't lose too much time on this)

try {
internalStop(emptyMap());
} catch (InfrastructureException e) {
LOG.error("Unrecoverable event occured {}, {}, {}", reason, message, podName);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that without workspace ID it won't be possible to match the incident with a workspace/user. I think we need WSID her.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, will also add workspace id

@@ -62,6 +70,7 @@ public boolean equals(Object o) {
ContainerEvent that = (ContainerEvent) o;
return Objects.equals(podName, that.podName)
&& Objects.equals(containerName, that.containerName)
&& Objects.equals(reason, that.reason)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also add the reason to the hashcode method? It is common to have them aligned.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, thanks for the pointer

@@ -92,7 +97,11 @@ protected void startMachines() throws InfrastructureException {
}
// TODO https://github.com/eclipse/che/issues/7653
// project.pods().watch(new AbnormalStopHandler());
// project.pods().watchContainers(new MachineLogsPublisher());

project.pods().watchContainers(new MachineLogsPublisher());

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TLDR please, add finally here as well
Since we set watchers here we should also ensure that all the watchers are properly deleted. We can not guarantee that there is no bug which prevents the execution of finally block from k8s infra.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, not following - startMachines is supposed to be called from internalStart where watchers closed / deleted in finally. Could you describe potential bug that could prevent closing watchers currently ?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, my bad. We don't need finally here.

@garagatyi
Copy link

@ibuziuk please, do not push to the branch with force. Each force push make reviewers review whole PR from scratch

@garagatyi
Copy link

BTW looks like you need to address @skabashnyuk 's comment before merging since we can't add new dependency without a CQ

@ibuziuk
Copy link
Member Author

ibuziuk commented Apr 17, 2018

@garagatyi fixed dep CQ problem - changed to com.google.common.base.Strings

@ibuziuk
Copy link
Member Author

ibuziuk commented Apr 17, 2018

@l0rd have added PR for docs eclipse-che/che-docs#395

@garagatyi
Copy link

@ibuziuk is there any other cases that you are going to cover in unit tests?

@ibuziuk
Copy link
Member Author

ibuziuk commented Apr 17, 2018

@garagatyi yeah, working on adding more tests

@@ -347,6 +347,10 @@ che.infra.kubernetes.workspace_start_timeout_min=8
# Defines the timeout in minutes that limits the period for which Kubernetes Ingress become ready
che.infra.kubernetes.ingress_start_timeout_min=5

# If during workspace startup an unrecoverable event defined in the property occurs,
# terminate workspace immediately instead of waiting until timeout
che.infra.kubernetes.workspace_unrecoverable_events=Failed Mount,Failed Scheduling,Failed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Failed to pull image has become Failed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it was actually done deliberate, but I can rework

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes please revert that. Having Failed in a list of kubernetes events is confusing.

Copy link
Member Author

@ibuziuk ibuziuk Apr 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lord ok, just wanted to clarify that reason of the event for "Failed to pull image" is actually just Failed, so I assume that Failed is quite common reason for k8s / openshift events. Failed to pull image is part of the event message

@ibuziuk
Copy link
Member Author

ibuziuk commented Apr 18, 2018

@l0rd @garagatyi changed property value, added tests. Please, review
PR for updating docs to make it consistent with property - eclipse-che/che-docs#397

@@ -92,7 +97,11 @@ protected void startMachines() throws InfrastructureException {
}
// TODO https://github.com/eclipse/che/issues/7653
// project.pods().watch(new AbnormalStopHandler());
// project.pods().watchContainers(new MachineLogsPublisher());

project.pods().watchContainers(new MachineLogsPublisher());

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, my bad. We don't need finally here.

@@ -352,8 +412,9 @@ public void throwsInfrastructureExceptionWhenKubernetesNamespaceCleanupFailed()
@Test
public void testRepublishContainerOutputAsMachineLogEvents() throws Exception {
final MachineLogsPublisher logsPublisher = internalRuntime.new MachineLogsPublisher();
final ContainerEvent out1 = mockContainerEvent("pulling image", "07/07/2007 19:01:22");
final ContainerEvent out2 = mockContainerEvent("image pulled", "07/07/2007 19:08:53");
final ContainerEvent out1 =

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have added logs publishing in the runtime start process and this is very cool. Can you also add a unit test for this case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't it already handled by https://github.com/eclipse/che/pull/9426/files#diff-08f3ea6fea2790aa766edf8d57d37897R275 e.g. handler for publishing events is added during runtime start?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it just checks that some subscribers are watching it. We have a separate test for logs publisher, so maybe we can also check that one of the watchers is logs publisher and another one is unrecoverable events listener? But it is not critical, so feel free to leave it as is if you want.

@@ -366,10 +427,34 @@ public void testRepublishContainerOutputAsMachineLogEvents() throws Exception {
assertTrue(captor.getAllValues().containsAll(machineLogs));
}

@Test
public void testHandleUnrecoverableEvent() throws Exception {
final UnrecoverableEventHanler unrecoverableEventHanler =

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, test unrecoverable events matching/non-matching in the whole flow instead of with a separated component. We don't have any guarantees that handler is called as it is supposed in this test case.
Please cover cases:
non-matching event
matching by the reason field
matching by the message field

It would be nice to verify that stop of the runtime is really called after appearing of the matching event.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@garagatyi was not able to figure out how exactly to test unrecoverable events in starting flow, but have added separate tests for matching by the reason / matching by the message. Also, not sure how exactly verifying that stop of the runtime e.g. internalStop should be tested - currently done by verifying that namespace cleanup was called

Are you ok to merge it in the current state, since I do not really want to face merge conflicts, and we could discuss more tests cases which I would add later in separate PR ?

Copy link
Contributor

@l0rd l0rd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @ibuziuk!

Copy link

@garagatyi garagatyi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I believe that tests that we were discussing are not mandatory for the PR.

@ibuziuk
Copy link
Member Author

ibuziuk commented Apr 19, 2018

@garagatyi ok, thanks I will try to add more tests for those event handlers during the sprint

…/ openshift runtime if unrecoverable event occurs during workspace startup

Signed-off-by: Ilya Buziuk <ibuziuk@redhat.com>
@ibuziuk ibuziuk merged commit ac3b401 into eclipse-che:master Apr 19, 2018
@benoitf benoitf added this to the 6.5.0 milestone Apr 19, 2018
@benoitf benoitf removed the status/code-review This issue has a pull request posted for it and is awaiting code review completion by the community. label Apr 19, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants