Eclipse Che Server doesn't notify user when OOM happens during workspace start #13511

Closed
AndrienkoAleksandr opened this issue Jun 11, 2019 · 3 comments
Labels
kind/task Internal things, technical debt, and to-do tasks to be performed. severity/P1 Has a major impact to usage or development of the system.

Comments

@AndrienkoAleksandr
Contributor

AndrienkoAleksandr commented Jun 11, 2019

Description

Eclipse Che Server doesn't notify the user when an OOM happens during workspace start. We should either fail the start of such workspaces or clearly notify the user about the OOM.

Reproduction Steps

1. Create a workspace from the devfile: https://raw.githubusercontent.com/eclipse/che/caae00e21b2fd316560a406f2e1d8ad98edf1dad/workspace-loader/devfile.yaml
2. Start the workspace.

Video:
issue0001-4313

Actual result: Error instead of IDE:
503 Service Temporarily Unavailable
nginx/1.15.9

Expected result: We should either fail the start of such workspaces or clearly notify the user about the OOM.

OS and version:
Kubuntu 19.04 che-server 7.0.0-RC-2.0-SNAPSHOT

@AndrienkoAleksandr AndrienkoAleksandr added team/platform kind/bug Outline of a bug - must adhere to the bug report template. labels Jun 11, 2019
@AndrienkoAleksandr
Contributor Author

Related #13512

@AndrienkoAleksandr AndrienkoAleksandr changed the title Eclipse che-server don't notify user when workspace started with sidecar container OOM Eclipse Che Server doesn't notify user when OOM happens during workspace start Jun 11, 2019
@skabashnyuk skabashnyuk added the severity/P1 Has a major impact to usage or development of the system. label Jun 12, 2019
@skabashnyuk skabashnyuk added kind/task Internal things, technical debt, and to-do tasks to be performed. and removed kind/bug Outline of a bug - must adhere to the bug report template. labels Jun 15, 2019
@skabashnyuk skabashnyuk added severity/P2 Has a minor but important impact to the usage or development of the system. and removed severity/P1 Has a major impact to usage or development of the system. labels Oct 10, 2019
@skabashnyuk skabashnyuk added this to the Backlog - Platform milestone Oct 10, 2019
@skabashnyuk skabashnyuk added severity/P1 Has a major impact to usage or development of the system. and removed severity/P2 Has a minor but important impact to the usage or development of the system. labels Oct 31, 2019
@sleshchenko
Member

The K8s/OS cluster clearly reports that an OOM happens in the container:

The corresponding event:

```
- apiVersion: v1
  count: 1
  eventTime: null
  firstTimestamp: 2019-10-31T23:20:36Z
  involvedObject:
    apiVersion: v1
    fieldPath: spec.containers{che-machine-execxz9}
    kind: Pod
    name: workspacemqq71m0z7nwrrp1n.ws-loader-dev-5dcf685bf6-g54fj
    namespace: che
    resourceVersion: "154774"
    uid: b5ecf6b3-fc33-11e9-9feb-080027d24256
  kind: Event
  lastTimestamp: 2019-10-31T23:20:36Z
  message: Killing container with id docker://che-machine-execxz9:Need to kill Pod
  metadata:
    creationTimestamp: 2019-10-31T23:20:38Z
    name: workspacemqq71m0z7nwrrp1n.ws-loader-dev-5dcf685bf6-g54fj.15d2dea8e8b51f2f
    namespace: che
    resourceVersion: "157074"
    selfLink: /api/v1/namespaces/che/events/workspacemqq71m0z7nwrrp1n.ws-loader-dev-5dcf685bf6-g54fj.15d2dea8e8b51f2f
    uid: 0cc10564-fc35-11e9-9feb-080027d24256
  reason: Killing
  reportingComponent: ""
  reportingInstance: ""
  source:
    component: kubelet
    host: localhost
  type: Normal
```
The corresponding record in Pod status:

```yaml
status:
  conditions:
    - lastProbeTime: null
      lastTransitionTime: '2019-10-31T23:22:46Z'
      status: 'True'
      type: Initialized
    - lastProbeTime: null
      lastTransitionTime: '2019-10-31T23:24:05Z'
      message: 'containers with unready status: [theia-ide2i6]'
      reason: ContainersNotReady
      status: 'False'
      type: Ready
    - lastProbeTime: null
      lastTransitionTime: null
      message: 'containers with unready status: [theia-ide2i6]'
      reason: ContainersNotReady
      status: 'False'
      type: ContainersReady
    - lastProbeTime: null
      lastTransitionTime: '2019-10-31T23:22:42Z'
      status: 'True'
      type: PodScheduled
  containerStatuses:
    - containerID: >-
        docker://5a4d82ec8eb42453b71e80a3cd5eaa238941725ed82967290462ebd704d58d04
      image: 'docker.io/eclipse/che-theia:next'
      imageID: >-
        docker-pullable://docker.io/eclipse/che-theia@sha256:2db6f1e61f86885fe04ee31cd10a7526b045759cd9bfa48bb826e538c54da9be
      lastState:
        terminated:
          containerID: >-
            docker://339669be92c32f36ab8f89a0f8196e63ca686fdd9381335fff83c5c034c1611a
          exitCode: 137
          finishedAt: '2019-10-31T23:23:25Z'
          reason: OOMKilled
          startedAt: '2019-10-31T23:23:16Z'
      name: theia-ide2i6
      ready: false
      restartCount: 3
      state:
        terminated:
          containerID: >-
            docker://5a4d82ec8eb42453b71e80a3cd5eaa238941725ed82967290462ebd704d58d04
          exitCode: 137
          finishedAt: '2019-10-31T23:24:03Z'
          reason: OOMKilled
          startedAt: '2019-10-31T23:23:58Z'
```
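To illustrate how the OOM is visible in that Pod status, here is a minimal Python sketch that scans `containerStatuses` for a `terminated` state with reason `OOMKilled`. The function name `find_oom_killed` and the trimmed `status` dict are made up for illustration; this is not how Che Server (which is written in Java) inspects pods.

```python
from typing import Any, Dict, List


def find_oom_killed(pod_status: Dict[str, Any]) -> List[str]:
    """Return names of containers whose current or last state is OOMKilled.

    `pod_status` is assumed to be the `status` section of a Pod,
    already parsed into plain dicts and lists.
    """
    killed = []
    for cs in pod_status.get("containerStatuses") or []:
        # Check the current state first, then the previous one.
        for key in ("state", "lastState"):
            terminated = (cs.get(key) or {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                killed.append(cs.get("name", "<unknown>"))
                break  # report each container at most once
    return killed


# Trimmed, hand-copied subset of the Pod status shown above.
status = {
    "containerStatuses": [
        {
            "name": "theia-ide2i6",
            "restartCount": 3,
            "state": {"terminated": {"exitCode": 137, "reason": "OOMKilled"}},
            "lastState": {"terminated": {"exitCode": 137, "reason": "OOMKilled"}},
        }
    ]
}

print(find_oom_killed(status))  # ['theia-ide2i6']
```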

Actually, this error happens after the workspace has started, not during start, and Che Server has technical difficulties watching events after workspace start. The problem with watching events with Fabric8KubernetesClient via WebSockets is described in detail in #7653.

As an alternative, Che Server could periodically call REST API endpoints and check events, instead of using a WebSocket and listening to events in real time. But this needs to be investigated thoroughly.
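A rough Python sketch of that polling idea, under stated assumptions: `fetch_status` is a hypothetical callable standing in for whatever REST call returns the Pod's `status` as a dict, and `poll_for_oom`, `interval_seconds`, and `max_polls` are illustrative names, not part of Che Server or any Kubernetes client.

```python
import time
from typing import Any, Callable, Dict, List, Optional


def poll_for_oom(
    fetch_status: Callable[[], Dict[str, Any]],
    interval_seconds: float = 10.0,
    max_polls: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Poll a Pod status until some container is reported as OOMKilled.

    Returns the names of OOM-killed containers, or an empty list if
    `max_polls` is reached without seeing one.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        status = fetch_status()  # hypothetical REST call, injected by the caller
        killed = [
            cs.get("name", "<unknown>")
            for cs in status.get("containerStatuses") or []
            if ((cs.get("state") or {}).get("terminated") or {}).get("reason")
            == "OOMKilled"
        ]
        if killed:
            return killed
        polls += 1
        sleep(interval_seconds)
    return []


# Fake responses: the container runs on the first poll and is OOM-killed on the second.
responses = iter([
    {"containerStatuses": [{"name": "theia-ide2i6", "state": {"running": {}}}]},
    {"containerStatuses": [
        {"name": "theia-ide2i6", "state": {"terminated": {"reason": "OOMKilled"}}}
    ]},
])
result = poll_for_oom(
    lambda: next(responses), interval_seconds=0, max_polls=5, sleep=lambda s: None
)
print(result)  # ['theia-ide2i6']
```

Polling trades detection latency (up to one interval) for not having to keep a long-lived watch connection open; whether that trade-off is acceptable here is exactly what would need investigating.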

@amisevsk
Contributor

I believe this is resolved by #15449

Screenshot from 2019-12-18 09-08-12
