[OpenShift OAuth] Idling doesn't have enough permissions to stop the workspace. #15906

Closed
4 of 23 tasks
skabashnyuk opened this issue Feb 3, 2020 · 30 comments
Assignees
Labels
area/che-server kind/bug Outline of a bug - must adhere to the bug report template. severity/P1 Has a major impact to usage or development of the system.
Milestone

Comments

@skabashnyuk
Contributor

skabashnyuk commented Feb 3, 2020

Describe the bug

With OpenShift OAuth enabled, the inactivity timeout reports the workspace as stopped even though it is not actually stopped.

2020-02-03 09:49:28,698[aceSharedPool-1]  [ERROR] [o.e.c.a.w.s.WorkspaceRuntimes 980]   - Error occurred during stopping of runtime 'workspace9e55zuzvmbejm0np:default:beb32f95-b372-4a81-b56d-06e30ceafc80' by user 'activity-checker'. Error: Error(s) occurs while cleaning up the namespace. Failure executing: GET at: https://172.30.0.1/apis/route.openshift.io/v1/namespaces/skabashn-che/routes?labelSelector=che.workspace_id%3Dworkspace9e55zuzvmbejm0np. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. routes.route.openshift.io is forbidden: User "system:serviceaccount:che:che" cannot list resource "routes" in API group "route.openshift.io" in the namespace "skabashn-che". The error may be caused by an expired token or changed password. Update Che server deployment with a new token or password. Failure executing: GET at: https://172.30.0.1/api/v1/namespaces/skabashn-che/services?labelSelector=che.workspace_id%3Dworkspace9e55zuzvmbejm0np. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. services is forbidden: User "system:serviceaccount:che:che" cannot list resource "services" in API group "" in the namespace "skabashn-che". The error may be caused by an expired token or changed password. Update Che server deployment with a new token or password. Failure executing: GET at: https://172.30.0.1/apis/apps/v1/namespaces/skabashn-che/deployments?labelSelector=che.workspace_id%3Dworkspace9e55zuzvmbejm0np. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. deployments.apps is forbidden: User "system:serviceaccount:che:che" cannot list resource "deployments" in API group "apps" in the namespace "skabashn-che". The error may be caused by an expired token or changed password. Update Che server deployment with a new token or password. 
Failure executing: GET at: https://172.30.0.1/api/v1/namespaces/skabashn-che/secrets?labelSelector=che.workspace_id%3Dworkspace9e55zuzvmbejm0np. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. secrets is forbidden: User "system:serviceaccount:che:che" cannot list resource "secrets" in API group "" in the namespace "skabashn-che". The error may be caused by an expired token or changed password. Update Che server deployment with a new token or password. Failure executing: GET at: https://172.30.0.1/api/v1/namespaces/skabashn-che/configmaps?labelSelector=che.workspace_id%3Dworkspace9e55zuzvmbejm0np. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. configmaps is forbidden: User "system:serviceaccount:che:che" cannot list resource "configmaps" in API group "" in the namespace "skabashn-che". The error may be caused by an expired token or changed password. Update Che server deployment with a new token or password.
org.eclipse.che.api.workspace.server.spi.InfrastructureException: Error(s) occurs while cleaning up the namespace. Failure executing: GET at: https://172.30.0.1/apis/route.openshift.io/v1/namespaces/skabashn-che/routes?labelSelector=che.workspace_id%3Dworkspace9e55zuzvmbejm0np. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. routes.route.openshift.io is forbidden: User "system:serviceaccount:che:che" cannot list resource "routes" in API group "route.openshift.io" in the namespace "skabashn-che". The error may be caused by an expired token or changed password. Update Che server deployment with a new token or password. Failure executing: GET at: https://172.30.0.1/api/v1/namespaces/skabashn-che/services?labelSelector=che.workspace_id%3Dworkspace9e55zuzvmbejm0np. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. services is forbidden: User "system:serviceaccount:che:che" cannot list resource "services" in API group "" in the namespace "skabashn-che". The error may be caused by an expired token or changed password. Update Che server deployment with a new token or password. Failure executing: GET at: https://172.30.0.1/apis/apps/v1/namespaces/skabashn-che/deployments?labelSelector=che.workspace_id%3Dworkspace9e55zuzvmbejm0np. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. deployments.apps is forbidden: User "system:serviceaccount:che:che" cannot list resource "deployments" in API group "apps" in the namespace "skabashn-che". The error may be caused by an expired token or changed password. Update Che server deployment with a new token or password. Failure executing: GET at: https://172.30.0.1/api/v1/namespaces/skabashn-che/secrets?labelSelector=che.workspace_id%3Dworkspace9e55zuzvmbejm0np. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. 
secrets is forbidden: User "system:serviceaccount:che:che" cannot list resource "secrets" in API group "" in the namespace "skabashn-che". The error may be caused by an expired token or changed password. Update Che server deployment with a new token or password. Failure executing: GET at: https://172.30.0.1/api/v1/namespaces/skabashn-che/configmaps?labelSelector=che.workspace_id%3Dworkspace9e55zuzvmbejm0np. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. configmaps is forbidden: User "system:serviceaccount:che:che" cannot list resource "configmaps" in API group "" in the namespace "skabashn-che". The error may be caused by an expired token or changed password. Update Che server deployment with a new token or password.
	at org.eclipse.che.workspace.infrastructure.kubernetes.namespace.KubernetesNamespace.doRemove(KubernetesNamespace.java:270)
	at org.eclipse.che.workspace.infrastructure.openshift.project.OpenShiftProject.cleanUp(OpenShiftProject.java:155)
	at org.eclipse.che.workspace.infrastructure.kubernetes.KubernetesInternalRuntime.internalStop(KubernetesInternalRuntime.java:573)
	at org.eclipse.che.api.workspace.server.spi.InternalRuntime.stop(InternalRuntime.java:177)
	at org.eclipse.che.api.workspace.server.WorkspaceRuntimes$StopRuntimeTask.run(WorkspaceRuntimes.java:950)
	at org.eclipse.che.commons.lang.concurrent.CopyThreadLocalRunnable.run(CopyThreadLocalRunnable.java:38)
	at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2020-02-03 09:55:36,842[gine[Catalina]]]  [WARN ] [a.c.w.i.BasicWebSocketEndpoint 129]  - Closing unidentified session
2020-02-03 10:34:48,073[nio-8080-exec-9]  [INFO ] [o.e.c.a.w.s.WorkspaceRuntimes 479]   - Starting workspace 'skabashn/wksp-gtr3' with id 'workspace9e55zuzvmbejm0np' by user 'skabashn'
2020-02-03 10:35:01,581[aceSharedPool-2]  [WARN ] [.i.k.KubernetesInternalRuntime 252]  - Failed to start Kubernetes runtime of workspace workspace9e55zuzvmbejm0np. Cause: Failure executing: POST at: https://172.30.0.1/api/v1/namespaces/skabashn-che/configmaps. Message: configmaps "workspace9e55zuzvmbejm0np-sshconfigmap" already exists. Received status: Status(apiVersion=v1, code=409, details=StatusDetails(causes=[], group=null, kind=configmaps, name=workspace9e55zuzvmbejm0np-sshconfigmap, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=configmaps "workspace9e55zuzvmbejm0np-sshconfigmap" already exists, metadata=ListMeta(_continue=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=AlreadyExists, status=Failure, additionalProperties={}).
2020-02-03 10:35:02,534[aceSharedPool-2]  [INFO ] [o.e.c.a.w.s.WorkspaceRuntimes 911]   - Workspace 'skabashn:wksp-gtr3' with id 'workspace9e55zuzvmbejm0np' start failed
2020-02-03 10:35:16,701[nio-8080-exec-8]  [INFO ] [o.e.c.a.w.s.WorkspaceRuntimes 479]   - Starting workspace 'skabashn/wksp-gtr3' with id 'workspace9e55zuzvmbejm0np' by user 'skabashn'
2020-02-03 10:36:21,076[aceSharedPool-3]  [INFO ] [o.e.c.a.w.s.WorkspaceRuntimes 892]   - Workspace 'skabashn:wksp-gtr3' with id 'workspace9e55zuzvmbejm0np' started by user 'skabashn'
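
The long error above repeats the same "forbidden" fragment once per resource type. As a reading aid, here is a minimal sketch that pulls the denied verb, API group, resource and namespace out of such a message. The regular expression and the function name `missing_permissions` are illustrative assumptions, not part of Che; the sample text is copied from the log.

```python
import re

# Matches fragments such as:
#   User "system:serviceaccount:che:che" cannot list resource "routes"
#   in API group "route.openshift.io" in the namespace "skabashn-che"
FORBIDDEN = re.compile(
    r'User "(?P<user>[^"]+)" cannot (?P<verb>\w+) resource "(?P<resource>[^"]+)" '
    r'in API group "(?P<group>[^"]*)" in the namespace "(?P<namespace>[^"]+)"'
)


def missing_permissions(log_text):
    """Return sorted, de-duplicated (verb, api_group, resource, namespace) tuples."""
    found = {
        (m["verb"], m["group"], m["resource"], m["namespace"])
        for m in FORBIDDEN.finditer(log_text)
    }
    return sorted(found)


sample = (
    'routes.route.openshift.io is forbidden: User "system:serviceaccount:che:che" '
    'cannot list resource "routes" in API group "route.openshift.io" '
    'in the namespace "skabashn-che". '
    'services is forbidden: User "system:serviceaccount:che:che" '
    'cannot list resource "services" in API group "" '
    'in the namespace "skabashn-che".'
)

for verb, group, resource, namespace in missing_permissions(sample):
    # An empty API group is the Kubernetes core group.
    print(f"{verb} {resource} (group={group or 'core'}) in {namespace}")
```

Applied to the full log, this would list routes, services, deployments, secrets and configmaps, all denied for the `che` service account in the user's namespace.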

Che version

  • latest
  • nightly
  • other: please specify

Steps to reproduce

  1. Run Che on OpenShift with OpenShift OAuth enabled. In my case: chectl server:start --platform=crc --installer=operator --os-oauth --tls --self-signed-cert
  2. Start workspace.
  3. Wait 30 minutes.
  4. The Che server logs show that the idler doesn't have enough permissions to stop the workspace.

Expected behavior

Two variants:

  1. Increase permission.
  2. Disable idler for OpenShift OAuth mode

Runtime

  • kubernetes (include output of kubectl version)
  • Openshift (include output of oc version)
  • minikube (include output of minikube version and kubectl version)
  • minishift (include output of minishift version and oc version)
  • docker-desktop + K8S (include output of docker version and kubectl version)
  • other: (please specify)
oc version                                                                                 
Client Version: version.Info{Major:"4", Minor:"1+", GitVersion:"v4.1.0+b4261e0", GitCommit:"b4261e07ed", GitTreeState:"clean", BuildDate:"2019-07-06T03:16:01Z", GoVersion:"go1.12.6", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.6+a8d983c", GitCommit:"a8d983c", GitTreeState:"clean", BuildDate:"2019-12-23T12:16:26Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}

Screenshots

Installation method

  • chectl
  • che-operator
  • minishift-addon
  • I don't know

Environment

  • my computer
    • Windows
    • Linux
    • macOS
  • Cloud
    • Amazon
    • Azure
    • GCE
    • other (please specify)
  • other: please specify

Additional context

N/A

@skabashnyuk skabashnyuk added kind/bug Outline of a bug - must adhere to the bug report template. severity/P2 Has a minor but important impact to the usage or development of the system. area/che-server labels Feb 3, 2020
@skabashnyuk
Contributor Author

I assume the following scenario for this problem.

  1. The idler is not able to stop the workspace. It marks the workspace as stopped, but all K8s components are still there.
  2. The user sees that the workspace is stopped (it is not actually stopped, but there is no red sign on it) and tries to start it again.
  3. The start fails because some K8s objects are in an unexpected state. See Failure executing: POST sshconfigmap or gitconfig configmaps already exists #15904
  4. The second attempt to start finishes successfully.
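
The scenario above can be replayed with a toy model. This is only a sketch under stated assumptions: `FakeNamespace` and `ToyServer` are hypothetical names, objects are plain strings, and the cleanup after a failed start follows the behaviour described later in this thread rather than the real Che code.

```python
class FakeNamespace:
    """Toy stand-in for a user namespace; it only tracks object names."""

    def __init__(self):
        self.objects = set()

    def create(self, name):
        if name in self.objects:
            raise RuntimeError(f'409 AlreadyExists: "{name}"')
        self.objects.add(name)

    def delete_all(self, allowed):
        if not allowed:
            raise PermissionError("403 Forbidden")
        self.objects.clear()


class ToyServer:
    """Hypothetical model of the start / idle-stop flow, not the real Che server."""

    def __init__(self, namespace):
        self.namespace = namespace
        self.status = "STOPPED"

    def start(self, workspace_id):
        try:
            self.namespace.create(f"{workspace_id}-sshconfigmap")
        except RuntimeError:
            # The failed start runs as the user, so its cleanup succeeds.
            self.namespace.delete_all(allowed=True)
            self.status = "STOPPED"
            return False
        self.status = "RUNNING"
        return True

    def idle_stop(self):
        try:
            # The service account is not allowed to touch the user's namespace.
            self.namespace.delete_all(allowed=False)
        except PermissionError:
            pass  # the error is only logged; the objects stay in place
        self.status = "STOPPED"  # the workspace is reported as stopped anyway


ns = FakeNamespace()
server = ToyServer(ns)
first = server.start("ws1")      # step 0: normal start
server.idle_stop()               # step 1: idler fails silently
leftovers = set(ns.objects)      # objects survive the "stop"
second = server.start("ws1")     # step 3: start fails on the leftover configmap
third = server.start("ws1")      # step 4: second attempt succeeds
print(first, leftovers, second, third)
```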

@ibuziuk
Member

ibuziuk commented Feb 3, 2020

@davidfestal is this a task for the deploy team now? cc @tolusha

@davidfestal
Contributor

@ibuziuk I don't think so. To me it's a wsmaster problem, if I understand correctly. It's not especially related to installation afaik. BTW it doesn't seem a new issue to me. Don't know why it pops up again right now though.

@skabashnyuk
Contributor Author

skabashnyuk commented Feb 3, 2020

@davidfestal when you implemented this functionality, what was the plan for the idler? How is it supposed to work upstream?

@ibuziuk ibuziuk added severity/P1 Has a major impact to usage or development of the system. and removed severity/P2 Has a minor but important impact to the usage or development of the system. labels Feb 3, 2020
@ibuziuk
Member

ibuziuk commented Feb 3, 2020

setting P1 since it affects Toolchain team

@davidfestal
Contributor

davidfestal commented Feb 3, 2020

@skabashnyuk I'm not sure what exact part of code you're speaking about, when mentioning my implementation. But I seem to remember that at some point there was a cache to map subjects associated to started workspaces (this cache I had implemented). This used to help retrieving the right subject when idling.
Of course it was long ago and I have no idea whether the codebase changed in this wsmaster area in the meantime.
OTOH we know that rh-che has some specific code to address the Multi-Cluster aspect of things and oso-proxy permission restrictions. So maybe some change in this area had impacts as well.

@skabashnyuk
Contributor Author

@davidfestal I was referring to your code that was added in #9577 and is related to the OSIO team issue #8178.

Can you remind me: is it correct that to get the OpenShift OAuth token we need a Keycloak token that is not available to the idler and we can cache it since it can expire?
Which cache are you talking about?

OTOH we know that rh-che has some specific code to address the Multi-Cluster aspect of things and oso-proxy permission restrictions. So maybe some change in this area had impacts as well.

@ibuziuk do we have someone else from your team who remembers implementation details?

@ibuziuk
Member

ibuziuk commented Feb 3, 2020

@skabashnyuk @davidfestal sorry, folks but I do not understand how come this issue is related to Hosted Che - this is a pure upstream issue with 0 dependency on downstream (Hosted Che)

@davidfestal
Contributor

@skabashnyuk I see the PR you're referring to. But let me remind you that this PR was created before the work on Che 7 had even started. So a number of things outside my control might have changed in the meantime.

Can you remind me: is it correct that to get the OpenShift OAuth token we need a Keycloak token that is not available to the idler and we can cache it since it can expire?

And yes, afair at this point of time this implied using a cache. But Keycloak token expiration was not a problem since the cached subject for a user was systematically updated when a user did any action on the workspace (either Dashboard or Workspace).

Which cache are you talking about?

The only one I can find for now is this PR: https://github.com/eclipse/che/pull/7243/files#diff-a7b479c3e74a881033d3835df47948d6R98
But it was a few months before the PR you mentioned and it seems that related changes were finally not included into the Che 6 multi-tenancy PR: https://github.com/eclipse/che/pull/6441/files#diff-a7b479c3e74a881033d3835df47948d6

Instead of that, the equivalent changes were added into rh-che: https://github.com/redhat-developer/rh-che/commits/3cd8ced08128d5263301cd81a8b6b3c49e29cf93/plugins/ls-bayesian-agent/src/main/java/com/redhat/bayesian/agent/WorkspaceSubjectsRegistry.java

So on the rh-che side I don't see any reason for this not to work.

To summarize, the cache mechanism I had once implemented used to provide (afaict) a consistent solution to support idling in this use-case, at least at this time.
However removing it from the final PR that introduced multi-tenancy in Che 6 was not my choice afair.
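
The cache described in this comment can be pictured as follows. This is a minimal sketch, assuming a per-workspace map that is refreshed on every user action; the class names `Subject` and `SubjectCache` and their methods are hypothetical and do not reproduce the rh-che registry linked above.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class Subject:
    """Hypothetical holder for the identity that last acted on a workspace."""
    user_id: str
    token: str


class SubjectCache:
    """In-memory map from workspace id to the most recently seen subject."""

    def __init__(self):
        self._lock = threading.Lock()
        self._by_workspace = {}

    def on_user_activity(self, workspace_id, subject):
        # Called on every user action, so the stored token is refreshed
        # whenever the user touches the workspace.
        with self._lock:
            self._by_workspace[workspace_id] = subject

    def subject_for(self, workspace_id):
        # Used by the idler, which has no request context of its own.
        with self._lock:
            return self._by_workspace.get(workspace_id)

    def forget(self, workspace_id):
        with self._lock:
            self._by_workspace.pop(workspace_id, None)


cache = SubjectCache()
cache.on_user_activity("ws1", Subject("skabashn", "token-1"))
cache.on_user_activity("ws1", Subject("skabashn", "token-2"))  # later action
idler_subject = cache.subject_for("ws1")
print(idler_subject.token)
```

Because the cache lives only in memory, it is lost on a server restart, and a token can still expire between the last user action and the idle timeout. That gap is what the later comments in this thread discuss.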

@ibuziuk we should probably set up some sort of sync or meeting to understand why this problem popped up again in hosted Che.

@davidfestal
Contributor

davidfestal commented Feb 3, 2020

@skabashnyuk @davidfestal sorry, folks but I do not understand how come this issue is related to Hosted Che - this is a pure upstream issue with 0 dependency on downstream (Hosted Che)

Sorry @ibuziuk according to the first comments in this issue, I initially thought that the problem was occurring in Hosted Che currently, but obviously I was mistaken.

This doesn't change the fact that my reasoning is mainly the following: if this can be fixed by idling a workspace as the OpenShift user that created it (which I believe might be doable, cf. the cache mechanism I had set up long ago), then no additional permission would be required, and this issue would more likely become a platform issue. In general, it seems to me that requiring additional permissions instead of fixing the design itself is usually not the way to go.

@skabashnyuk
Contributor Author

Both #15906 and #15904 (comment) clearly show that neither caching technique (Keycloak or OpenShift OAuth) provides a 100% guarantee that a start or stop (idling) operation will succeed.
I suggest the following approach.

  1. Cache the OpenShift OAuth token at runtime. Do not store it; after a Tomcat restart it would be lost. I chose the OAuth token instead of the Keycloak token because the OpenShift token has a longer timeout by default. In this form the start of the workspace (see Failure executing: POST sshconfigmap or gitconfig configmaps already exists #15904) is more or less guaranteed.
  2. Remove the K8s objects related to the workspace on workspace start, in the same way as we do on workspace stop, to guarantee that no objects are left after the OAuth token expires. (What happens now: if the user presses start, they see an error that some object already exists. At the end of the failed attempt the necessary objects are cleaned up.)
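
The second point, cleaning up before start, can be sketched like this. It is a toy model under stated assumptions: objects are plain dictionaries, and `matches`, `clean_up` and `start_workspace` are hypothetical helpers. The only detail taken from the logs is the `che.workspace_id` label used in the label selector.

```python
def matches(labels, selector):
    """True when every selector key/value pair is present in labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def clean_up(objects, workspace_id):
    """Drop every object labelled with this workspace id; keep the rest."""
    selector = {"che.workspace_id": workspace_id}
    return [obj for obj in objects if not matches(obj["labels"], selector)]


def start_workspace(objects, workspace_id, new_names):
    """Remove leftovers first, then create the workspace objects."""
    remaining = clean_up(objects, workspace_id)
    existing = {obj["name"] for obj in remaining}
    for name in new_names:
        if name in existing:
            raise RuntimeError(f'configmaps "{name}" already exists')
        remaining.append(
            {"name": name, "labels": {"che.workspace_id": workspace_id}}
        )
        existing.add(name)
    return remaining


workspace_id = "workspace9e55zuzvmbejm0np"
leftovers = [
    # Left behind by an idle stop that lacked permissions.
    {"name": f"{workspace_id}-sshconfigmap",
     "labels": {"che.workspace_id": workspace_id}},
    {"name": "unrelated", "labels": {}},
]
result = start_workspace(leftovers, workspace_id, [f"{workspace_id}-sshconfigmap"])
print([obj["name"] for obj in result])
```

With the cleanup step in place, the first start after a failed idle stop no longer hits the "already exists" conflict, because the start itself runs with the user's token.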

There is also a plan B: grant additional permissions to che-server.

Note:

@skabashnyuk
Contributor Author

CC @benoitf @l0rd

@tomgeorge
Contributor

I was able to reproduce this. It seems like the way forward is to change the RBAC rules around allowed actions in developer namespaces. Is there any reason why we cannot do this?

@l0rd
Contributor

l0rd commented Mar 29, 2020

@tomgeorge yes I think we agreed that we should grant Che SA privileges to stop workspaces in users namespaces (@ibuziuk and @davidfestal can confirm).

@skabashnyuk
Contributor Author

@tomgeorge yes I think we agreed that we should grant Che SA privileges to stop workspaces in users namespaces (@ibuziuk and @davidfestal can confirm).

AFAIK it would not work in a multi-cluster environment.

@l0rd
Contributor

l0rd commented Mar 30, 2020

@skabashnyuk there is no official support / documentation for multi-cluster environment. If a user tries to use Che in a multi cluster scenario it will need to use some external tools to simulate a single cluster. hosted-che uses oso-proxy to simulate single cluster.

@skabashnyuk
Contributor Author

@skabashnyuk there is no official support / documentation for multi-cluster environment. If a user tries to use Che in a multi cluster scenario it will need to use some external tools to simulate the single host. hosted-che uses oso-proxy to simulate single cluster.

ok. Does this also mean that Che SA privileges would be enough to start the workspace?

@l0rd
Contributor

l0rd commented Mar 30, 2020

ok. Does this also mean that Che SA privileges would be enough to start the workspace?

Yes, that's the idea: the Che SA should be granted edit privileges that allow object creation/deletion.
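
To make the RBAC idea concrete, here is a sketch that builds a namespaced Role and RoleBinding as plain dictionaries for the resources named in the error log. It is an illustration only and not the content of the eventual fix: the role name `workspace-stop`, the verb list and the helper names are assumptions. The service account `che` in namespace `che` comes from the log (`system:serviceaccount:che:che`).

```python
import json

# (API group, resource) pairs the log reports as forbidden.
DENIED = [
    ("route.openshift.io", "routes"),
    ("", "services"),
    ("apps", "deployments"),
    ("", "secrets"),
    ("", "configmaps"),
]


def stop_role(namespace, name="workspace-stop"):
    """Build a Role manifest; name and verbs are assumptions for illustration."""
    by_group = {}
    for group, resource in DENIED:
        by_group.setdefault(group, []).append(resource)
    rules = [
        {"apiGroups": [group], "resources": sorted(resources),
         "verbs": ["list", "delete"]}
        for group, resources in sorted(by_group.items())
    ]
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": name, "namespace": namespace},
        "rules": rules,
    }


def stop_role_binding(namespace, role_name="workspace-stop",
                      sa_name="che", sa_namespace="che"):
    """Bind the Role to the Che service account in the user's namespace."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": role_name, "namespace": namespace},
        "subjects": [{"kind": "ServiceAccount", "name": sa_name,
                      "namespace": sa_namespace}],
        "roleRef": {"apiGroup": "rbac.authorization.k8s.io",
                    "kind": "Role", "name": role_name},
    }


role = stop_role("skabashn-che")
binding = stop_role_binding("skabashn-che")
print(json.dumps(role, indent=2))
```

Granting full edit privileges, as suggested in the comment, would be broader than this sketch; the narrower rule set here covers only what the stop operation was denied in the log.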

@ibuziuk ibuziuk modified the milestones: 7.11, 7.12 Apr 1, 2020
@ibuziuk ibuziuk assigned tomgeorge and unassigned tomgeorge Apr 2, 2020
ibuziuk added a commit to ibuziuk/che that referenced this issue Apr 21, 2020
…ty in order to have a possibility to disable the 'OpenShiftStopWorkspaceRoleProvisioner'

Signed-off-by: Ilya Buziuk <ibuziuk@redhat.com>
tomgeorge pushed a commit to tomgeorge/che that referenced this issue Apr 21, 2020
che eclipse-che#15906 Adding 'che.workspace.stop.role.enabled' property in order to have a possibility to disable the 'OpenShiftStopWorkspaceRoleProvisioner'
@ibuziuk
Member

ibuziuk commented Apr 22, 2020

PR is merged - #16532
I believe we can close the issue
@alexeykazakov please provide feedback after the operator update to 7.12.0 (should be released this week)

@ibuziuk ibuziuk closed this as completed Apr 22, 2020
@alexeykazakov

Thank you guys for fixing that! Will definitely verify when the operator is available.

ibuziuk added a commit to ibuziuk/che that referenced this issue Apr 23, 2020
…default

Signed-off-by: Ilya Buziuk <ibuziuk@redhat.com>
ibuziuk added a commit that referenced this issue Apr 23, 2020
Signed-off-by: Ilya Buziuk <ibuziuk@redhat.com>
ibuziuk added a commit to ibuziuk/che that referenced this issue Apr 29, 2020
Signed-off-by: Ilya Buziuk <ibuziuk@redhat.com>
@alexeykazakov

@ibuziuk after upgrading the operator to 7.12.0 I still see the same error with not enough permissions to stop workspaces:

https://gist.github.com/alexeykazakov/5e66598e93d8cbed8dcbdd9ff7b23e63

@tomgeorge
Contributor

@alexeykazakov I think che-operator 7.12.0 still uses 7.12.0 of che-server, and this fix went into 7.12.1

@alexeykazakov

Hm.. OK. Ilya mentioned 7.12.0 version though:

please provide feedback after the operator update to 7.12.0 (should be released this week)

Will wait for 7.12.1 then. Any time frame for 7.12.1 release?

@alexeykazakov

Actually setting CHE_WORKSPACE_STOP_ROLE_ENABLED: 'true' in 7.12.0 helped.
So, the fix did go into 7.12.0 then? But I guess that flag is supposed to be used to enable the idling until #16804 is fixed.
Am I correct?
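
The environment variable `CHE_WORKSPACE_STOP_ROLE_ENABLED` used here and the property `che.workspace.stop.role.enabled` from the commit message appear to be two spellings of the same setting. The sketch below shows that mapping and a tolerant way to read the flag. The helper names are hypothetical, the conversion rule is inferred only from these two names, and the `false` default is an assumption based on the flag having to be set explicitly in 7.12.0.

```python
def env_to_property(env_name):
    """Lower-case the name and turn underscores into dots.

    Inferred from the two names in this thread; property names that
    themselves contain underscores would need a different rule.
    """
    return env_name.lower().replace("_", ".")


def stop_role_enabled(environ):
    """Read the flag from an environment-like mapping (assumed default: false)."""
    raw = environ.get("CHE_WORKSPACE_STOP_ROLE_ENABLED", "false")
    return raw.strip().lower() == "true"


print(env_to_property("CHE_WORKSPACE_STOP_ROLE_ENABLED"))
print(stop_role_enabled({"CHE_WORKSPACE_STOP_ROLE_ENABLED": "true"}))
```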

@tomgeorge
Contributor

There was a defect in #16532 which went out in 7.12.0 and #16803 was merged to fix it in 7.12.1

@tomgeorge
Contributor

The fix went into 7.12.0, but it caused an issue in non-multiuser deployments. With CHE_WORKSPACE_STOP_ROLE_ENABLED: true and a single user deployment, you should see a similar stacktrace

@alexeykazakov

What is a single user deployment?

@tomgeorge
Contributor

tomgeorge commented May 1, 2020

Like running chectl server:start without --multiuser. In multiuser it creates a user namespace for the workspaces, and without it, all workspaces are run in the namespace where che server is installed.

Edit: when no identity provider is given

@l0rd
Contributor

l0rd commented May 2, 2020

What is a single user deployment?

No keycloak, no postgres, no authentication. Used for local development because it requires less resources.

@ibuziuk
Member

ibuziuk commented May 4, 2020

So, the fix did go into 7.12.0 then? But I guess that flag is supposed to be used to enable the idling until #16804 is fixed.

@alexeykazakov correct. Idling will work OOTB in 7.13.0

ibuziuk added a commit that referenced this issue May 4, 2020
Signed-off-by: Ilya Buziuk <ibuziuk@redhat.com>
ibuziuk added a commit to ibuziuk/che that referenced this issue May 4, 2020
Signed-off-by: Ilya Buziuk <ibuziuk@redhat.com>
ibuziuk added a commit that referenced this issue May 5, 2020
Signed-off-by: Ilya Buziuk <ibuziuk@redhat.com>
6 participants