Skip to content

Don't stop workspace recovery when error is encountered#11446

Merged
amisevsk merged 2 commits intoeclipse-che:masterfrom
amisevsk:fix-workspace-recovery
Oct 3, 2018
Merged

Don't stop workspace recovery when error is encountered#11446
amisevsk merged 2 commits intoeclipse-che:masterfrom
amisevsk:fix-workspace-recovery

Conversation

@amisevsk
Copy link
Copy Markdown
Contributor

@amisevsk amisevsk commented Oct 2, 2018

Fix issue where encountering an exception while recovering workspaces on startup prevents recovery of further workspaces.

What does this PR do?

Fix issue where encountering an exception while recovering workspaces on startup prevents recovery of further workspaces.

What issues does this PR fix or reference?

#11445

Testing

  1. Start Che multiuser on minishift
  2. Start a few workspaces
  3. Scale Che server down to 0
  4. modify postgres database to set owner_id to 0 for some workspace in table che_k8s_runtime
  5. restart Che server
    logs contain something like:
2018-10-02 20:41:56,358[ost-startStop-1]  [ERROR] [o.e.c.a.w.s.WorkspaceRuntimes 544]   - An error occurred while attempted to recover runtimes using infrastructure 'openshift'. Reason: 'Couldn't recover runtime 'workspace3ob15dnpujw9eamy:default'. Error: The user `0` doesn't have the required `use` permission for workspace `workspace3ob15dnpujw9eamy`'
2018-10-02 20:41:57,647[ost-startStop-1]  [INFO ] [o.e.c.a.w.s.WorkspaceRuntimes 590]   - Successfully recovered workspace runtime 'workspacekvgznl4cdef4u8r7'
2018-10-02 20:41:57,739[ost-startStop-1]  [INFO ] [o.e.c.a.w.s.WorkspaceRuntimes 590]   - Successfully recovered workspace runtime 'workspacec7tt9hc681m1uvr7'

and workspaces that can be recovered are. Workspace that had owner_id modified is listed as "not running" in Che, and deployment is not cleaned up, but there's not much we can do there.

Additionally, this will allow logging of all recovery errors, so this could be used to better track recovery bugs.

Fix issue where encountering an exception while recovering workspaces
on startup prevents recovery of further workspaces.

Signed-off-by: Angel Misevski <amisevsk@redhat.com>
@amisevsk amisevsk added the kind/bug Outline of a bug - must adhere to the bug report template. label Oct 2, 2018
@amisevsk amisevsk requested a review from garagatyi as a code owner October 2, 2018 20:49
Copy link
Copy Markdown
Member

@sleshchenko sleshchenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good fix!

You can make this PR even better by adding a unit test that second workspace will be recovered if the first one failed.

return;
} catch (InfrastructureException e) {
LOG.error(
"An error occurred while attempted to recover runtimes using infrastructure '{}'. Reason: '{}'",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Le'ts rephrase this message to make it clear that error occurred during identities fetching.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@benoitf benoitf added the status/code-review This issue has a pull request posted for it and is awaiting code review completion by the community. label Oct 3, 2018
Copy link
Copy Markdown

@garagatyi garagatyi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Please, add tests.

@sleshchenko
Copy link
Copy Markdown
Member

ci-test

@riuvshin
Copy link
Copy Markdown
Contributor

riuvshin commented Oct 3, 2018

Results of automated E2E tests of Eclipse Che Multiuser on OCP:
Build details
Test report
docker image: eclipseche/che-server:11446
https://github.com/orgs/eclipse/teams/eclipse-che-qa please check this report.

@garagatyi
Copy link
Copy Markdown

@riuvshin I cannot express how I happy with just 3-hours long tests! It is so good to receive the feedback on the same day. THANK YOU!

@riuvshin
Copy link
Copy Markdown
Contributor

riuvshin commented Oct 3, 2018

@garagatyi this particular run Took 2 hr 42 min :P I've enabled mvn parallel build which decreased execution time for 20mins

Signed-off-by: Angel Misevski <amisevsk@redhat.com>
@amisevsk
Copy link
Copy Markdown
Contributor Author

amisevsk commented Oct 3, 2018

I've added a test case to check that recovery continues past error.

@amisevsk
Copy link
Copy Markdown
Contributor Author

amisevsk commented Oct 3, 2018

ci-test

@garagatyi
Copy link
Copy Markdown

@amisevsk adding unit tests and changing the message usually doesn't require a new run of citest

@garagatyi
Copy link
Copy Markdown

ci-build

@riuvshin
Copy link
Copy Markdown
Contributor

riuvshin commented Oct 3, 2018

Results of automated E2E tests of Eclipse Che Multiuser on OCP:
Build details
Test report
docker image: eclipseche/che-server:11446
https://github.com/orgs/eclipse/teams/eclipse-che-qa please check this report.

@amisevsk amisevsk merged commit 220f10c into eclipse-che:master Oct 3, 2018
@benoitf benoitf removed the status/code-review This issue has a pull request posted for it and is awaiting code review completion by the community. label Oct 3, 2018
@benoitf benoitf added this to the 6.12.0 milestone Oct 3, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind/bug Outline of a bug - must adhere to the bug report template.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants