Multi tenant workspace cleaning #7243

davidfestal · 2017-11-08T11:16:12Z

What does this PR do?

This PR introduces changes required to support some multi-tenancy use-cases that were not supported until now:

Clean the OpenShift resources associated with a workspace when the workspace is idled.
Clean the OpenShift resources associated with all started workspaces on che-server shutdown.

These 2 use-cases are critical to allow OSIO multi-tenancy to go to production.

What issues does this PR fix or reference?

redhat-developer/rh-che#370

Signed-off-by: David Festal <dfestal@redhat.com>

More precisely, the idea is to keep track of the userId that started a workspace and have the corresponding Subject always available and updated with the last connection information (mainly the last Keycloak token). Signed-off-by: David Festal <dfestal@redhat.com>

... to allow removing Openshift resources when a workspace is stopped (`removeContainer`, `removeImage`) even if the workspace user is not accessible through the current `EnvironmentContext`. Signed-off-by: David Festal <dfestal@redhat.com>

... This mainly consists in cleaning the workspace files *synchronously* instead of doing it in batch at che-server idling. The previous behavior remains unchanged if the che server is not multi-tenant. Signed-off-by: David Festal <dfestal@redhat.com>

... This allows cleaning the workspaces correctly also in the multi-tenant scenario. Signed-off-by: David Festal <dfestal@redhat.com>

codenvy-ci · 2017-11-08T11:19:02Z

Can one of the admins verify this patch?

codenvy-ci · 2017-11-08T11:21:05Z

Can one of the admins verify this patch?

benoitf · 2017-11-08T11:22:00Z

ci-build

codenvy-ci · 2017-11-08T12:37:15Z

Build # 4314 - FAILED

Please check console output at https://ci.codenvycorp.com/job/che-pullrequests-build/4314/ to view the results.

davidfestal · 2017-11-08T13:07:39Z

Any idea why the CI build failed? From the console output it seems that the Maven build was successful but the Jenkins job failed.

benoitf · 2017-11-08T13:21:54Z

benoitf · 2017-11-08T13:30:02Z

wsmaster/che-core-api-system/src/main/java/org/eclipse/che/api/system/server/SystemManager.java

@@ -106,6 +106,7 @@ private void doStopServices() {
  @PreDestroy
  @VisibleForTesting
  void shutdown() throws InterruptedException {
+    LOG.info("Synchronous shutdown requested");


it should be cleanup, no ?

why ? is it a problem to have this log ? it's only called once on shutdown, but might be very useful to check that the post-destroy method is called on SIGTERM, which wasn't the case last week (for an unknown reason, this was fixed when updating to the last master branch).

it's more the level of the logger

well, the log level used in line 91 is also INFO, and the type of event is clearly the same, no ?

It's my view on log events (there are other views ;-) I still think these events are not at INFO level.
People who are interested in a specific part always want to push their events to INFO level, but many times it's more DEBUG info. I'm not sure it's interesting for end users to see that by default.

benoitf · 2017-11-08T14:22:39Z

ci-build

codenvy-ci · 2017-11-08T16:10:49Z

Build success. https://ci.codenvycorp.com/job/che-pullrequests-build/4316/

Signed-off-by: David Festal <dfestal@redhat.com>

benoitf · 2017-11-08T16:43:22Z

dockerfiles/init/modules/openshift/files/scripts/deploy_che.sh

@@ -266,6 +266,8 @@ echo "done!"
 # If command == clean up then delete all openshift objects
 # -------------------------------------------------------------
 if [ "${COMMAND}" == "cleanup" ]; then
+  echo "[CHE] Stopping the Che server..."
+  oc scale --replicas=0 --timeout=3h dc che


as a side note, do we really want to wait 3h there ?

probably not :-) The idea though is to give enough time to stop all the workspaces started on the various external clusters / namespaces associated to users in the multi-tenant scenario.

@l0rd What value would you give to this timeout ?

Fixed in commit cc96961

benoitf · 2017-11-08T16:44:48Z

.../main/java/org/eclipse/che/multiuser/machine/authentication/server/MachineTokenRegistry.java

+   * @return workspace identifier
+   * @throws NotFoundException when no such machine token exists
+   */
+  public String getWorkspaceId(String token) throws NotFoundException {


this method should be tested in multiuser/machine-auth/che-multiuser-machine-authentication/src/test/java/org/eclipse/che/multiuser/machine/authentication/server/MachineTokenRegistryTest.java

Can't find where this method is used

This method is used by the rh-che assembly, in order to get an up-to-date Keycloak token that we can inject into the Bayesian Language server when starting it.

See PR redhat-developer/rh-che#417, and especially this line
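The reverse lookup discussed in this thread (machine token to workspace identifier) can be sketched roughly as follows. This is a simplified illustration, not the actual `MachineTokenRegistry`: the class name `TokenRegistrySketch`, the `register` method, the nested `NotFoundException` stand-in and the single-map storage are all assumptions made for the example; only the `getWorkspaceId(String token)` signature comes from the diff above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical, simplified stand-in for a machine token registry (not the Che class). */
class TokenRegistrySketch {

  /** Stand-in for the NotFoundException named in the diff (assumption for this sketch). */
  static class NotFoundException extends Exception {
    NotFoundException(String message) {
      super(message);
    }
  }

  // token -> workspace id; assumes a token identifies exactly one workspace
  private final Map<String, String> tokenToWorkspace = new ConcurrentHashMap<>();

  /** Hypothetical helper: records which workspace a token was issued for. */
  void register(String token, String workspaceId) {
    tokenToWorkspace.put(token, workspaceId);
  }

  /** Returns the workspace identifier for the token, or throws when the token is unknown. */
  String getWorkspaceId(String token) throws NotFoundException {
    // ConcurrentHashMap rejects null keys, so treat null as "not found" explicitly
    String workspaceId = token == null ? null : tokenToWorkspace.get(token);
    if (workspaceId == null) {
      throw new NotFoundException("No workspace found for the given machine token");
    }
    return workspaceId;
  }
}
```

A caller that only holds a machine token could use such a lookup to find the workspace and, from there, the user whose up-to-date Keycloak token is needed.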

benoitf · 2017-11-08T16:45:38Z

...i-workspace/src/main/java/org/eclipse/che/api/workspace/server/WorkspaceSubjectRegistry.java

+import org.slf4j.Logger;
+
+@Singleton
+public class WorkspaceSubjectRegistry implements EventSubscriber<WorkspaceStatusEvent> {


It is missing unit tests and a class comment explaining what the class is for.

Sure, sorry about that. I'll add it.

Javadoc added in commit 3b91135
Unit tests added in commit 4a4a4f0

benoitf · 2017-11-08T16:46:13Z

...i-workspace/src/main/java/org/eclipse/che/api/workspace/server/WorkspaceSubjectRegistry.java

+  }
+
+  @PostConstruct
+  @VisibleForTesting


I don't see the associated test ?

see commit 4a4a4f0

skabashnyuk · 2017-11-08T18:50:15Z

.../main/java/org/eclipse/che/multiuser/machine/authentication/server/MachineTokenRegistry.java

+   * @return workspace identifier
+   * @throws NotFoundException when no such machine token exists
+   */
+  public String getWorkspaceId(String token) throws NotFoundException {


Can't find where this method is used

skabashnyuk · 2017-11-08T18:53:03Z

...i-workspace/src/main/java/org/eclipse/che/api/workspace/server/WorkspaceSubjectRegistry.java

+import org.eclipse.che.commons.subject.Subject;
+import org.slf4j.Logger;
+
+@Singleton


Contract of this class is unclear. Looks like a cache of information that can be retrieved from other API or managers. Can you clarify a contract?

It's a sort of cache yes, but I'll add Javadoc to the class, as well as tests, which I omitted.

Javadoc added in commit 3b91135

skabashnyuk · 2017-11-08T18:54:24Z

...r-local/src/main/java/org/eclipse/che/api/local/filters/EnvironmentInitializationFilter.java

@@ -45,6 +49,7 @@ public final void doFilter(
    Subject subject = new SubjectImpl("che", "che", "dummy_token", false);
    HttpSession session = httpRequest.getSession();
    session.setAttribute("codenvy_user", subject);
+    workspaceSubjectRegistry.updateSubject(subject);


Why do we need to cache subject?

for the reasons explained here

skabashnyuk · 2017-11-08T18:55:40Z

...enshift-client/src/main/java/org/eclipse/che/plugin/openshift/client/OpenShiftConnector.java

+  private Subject subjectForContainerId(String containerId) {
+    String workspaceId = containerIdToWorkspaceId.get(containerId);
+    if (workspaceId != null) {
+      Subject subject = workspaceSubjectRegistry.getWorkspaceStarter(workspaceId);


Why do we need to get the Subject from the cache instead of EnvironmentContext.getCurrent().getSubject()?

To manage the cases where OpenshiftConnector has to do some action on a workspace, triggered by a batch, non-UI event:

1. Start/stop of the che-server, which will stop all workspaces during the shutdown procedure, which in turn will call OpenshiftConnector.removeImage() and OpenshiftConnector.removeContainer().

2. Stop of the current workspace by the activity-checker due to workspace idling, which will also call OpenshiftConnector.removeImage() and OpenshiftConnector.removeContainer().

In cases 1 and 2, EnvironmentContext.getCurrent().getSubject() doesn't contain a valid subject that can be used to connect to the Openshift cluster/namespace hosting the workspace.

3. Start of the Bayesian language server: in this case, calling EnvironmentContext.getCurrent().getSubject() on the wsmaster during a request coming from the wsagent would return the machine token instead of the Keycloak token in subject.getToken(), which is also wrong.

It would be great if the subject were not cached.
Why can't batch jobs send the required data? Or why isn't the associated subject propagated on removeContainer?

@benoitf in those 3 scenarios, currently (before this PR) the required user information (contained in the subject) is available nowhere in the wsmaster application. And since these 3 scenarios are triggered by some non-interactive event (idling, tomcat shutdown hook, internal event indirectly triggered by the language server registry), we have no way to send or propagate the user information, afaik.

To be clear, on the other hand, workspace stop initiated by the user himself from the Javascript UIs (GWT or Dashboard) was already working great.
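The lookup described in this thread can be sketched as below. It is only an illustration under assumptions: `SubjectResolutionSketch`, its nested `Subject` stand-in and the `track` helper are hypothetical, the `Supplier` stands in for `EnvironmentContext.getCurrent().getSubject()`, and falling back to the context subject when no starter is recorded is a guess, since the diff excerpt above does not show what the real method does in that case.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical sketch of the subject lookup; not the actual OpenShiftConnector code. */
class SubjectResolutionSketch {

  /** Minimal stand-in for Che's Subject: a user id plus the last known token. */
  static final class Subject {
    final String userId;
    final String token;

    Subject(String userId, String token) {
      this.userId = userId;
      this.token = token;
    }
  }

  private final Map<String, String> containerIdToWorkspaceId = new HashMap<>();
  // Stands in for WorkspaceSubjectRegistry.getWorkspaceStarter(workspaceId)
  private final Map<String, Subject> workspaceStarters = new HashMap<>();
  // Stands in for EnvironmentContext.getCurrent().getSubject()
  private final Supplier<Subject> currentContextSubject;

  SubjectResolutionSketch(Supplier<Subject> currentContextSubject) {
    this.currentContextSubject = currentContextSubject;
  }

  /** Hypothetical helper: remembers which workspace a container belongs to and who started it. */
  void track(String containerId, String workspaceId, Subject starter) {
    containerIdToWorkspaceId.put(containerId, workspaceId);
    workspaceStarters.put(workspaceId, starter);
  }

  /** Prefers the recorded workspace starter; otherwise uses the request-bound subject. */
  Subject subjectForContainerId(String containerId) {
    String workspaceId = containerIdToWorkspaceId.get(containerId);
    if (workspaceId != null) {
      Subject starter = workspaceStarters.get(workspaceId);
      if (starter != null) {
        return starter;
      }
    }
    // Assumed fallback: no starter recorded, so rely on the current context
    return currentContextSubject.get();
  }
}
```

With this shape, a shutdown hook or idling check that runs outside any user request still obtains the credentials of the user who started the workspace.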

skabashnyuk · 2017-11-08T18:56:30Z

...i-workspace/src/main/java/org/eclipse/che/api/workspace/server/WorkspaceSubjectRegistry.java

+    if (WorkspaceStatusEvent.EventType.STARTING.equals(event.getEventType())) {
+      Subject subject = EnvironmentContext.getCurrent().getSubject();
+      if (subject == Subject.ANONYMOUS) {
+        LOG.warn("Workspace {} is being started by the 'anonymous' user.", workspaceId);


This case is impossible AFAIK. Shall we throw an error instead?

if you confirm that workspace start always happens as a direct result of a user action in the client-side Javascript application, then I'm not against it.

Done in commit 9d94514

Signed-off-by: David Festal <dfestal@redhat.com>

akorneta · 2017-11-09T10:01:29Z

...i-workspace/src/main/java/org/eclipse/che/api/workspace/server/WorkspaceSubjectRegistry.java

+  private final EventService eventService;
+  private final ReadWriteLock lock = new ReentrantReadWriteLock();
+  private final Map<String, Subject> workspaceOwners = new HashMap<>();
+  private final Multimap<String, String> userIdToWorkspaces =


@davidfestal I propose replacing the Multimap and the Map with the following mappings:

  Map<String, String> workspaceId2UserId;
  Map<String, Subject> userId2Subject;

cases:

1. when the workspace is stopped

  ...
  String userId = workspaceId2UserId.remove(workspaceId);
  if (userId != null) {
    userId2Subject.remove(userId);
  }
  ...

2. when the workspace is started

  ...
  workspaceId2UserId.put(workspaceId, subject.getUserId());
  updateSubject(subject);
  ...
  public void updateSubject(Subject subject) {
    String token = subject != null ? subject.getToken() : null;
    if (token != null && token.startsWith("machine")) {
      return;
    }
    ...
    userId2Subject.put(subject.getUserId(), subject);
    ...
  }

3. get subject by workspace id:

  public Subject getWorkspaceStarter(String workspaceId) {
    ...
    String userId = workspaceId2UserId.get(workspaceId);
    if (userId != null) {
      return userId2Subject.get(userId);
    }
    return null;
    ...
  }

WDYT?

It seems to me that this proposal doesn't support starting 2 workspaces with the same user and then stopping only one of them. The user Subject would be lost even though it is still needed by the second workspace.

Or did I miss something ?

yes, I missed that and the first case when the workspace is stopped should look like this:

  String userId = workspaceId2UserId.remove(workspaceId);
  if (userId != null && !workspaceId2UserId.values().contains(userId)) {
    userId2Subject.remove(userId);
  }

also it seems to me that in this proposal, updateSubject(Subject subject) would allow adding in the userId2Subject map a userId that wasn't previously associated to a workspace. In such a case, it will never be removed from the map and we have a memory leak, no ?

Could you explain what seems like a problem for you in the current implementation ?

Basically, the leak that you described is possible, but I thought that the start of the workspace was the only case where this method is used. It can be solved by adding a check like:

  if (workspaceId2UserId.values().contains(subject.getUserId())) {
    userId2Subject.put(subject.getUserId(), subject);
  }

What I'm trying to improve is the mapping, going from:
1 user to n workspaces and n workspaces to 1 subject
to
n workspaces to 1 user and 1 user to 1 subject.

Do you see what I mean?

Yes, I see, but I'm not sure performance would be better in the updateSubject() method, knowing that you potentially need to iterate the whole list of started workspaces.

And the updateSubject() method should be as fast as possible, since it's called each time a client request comes to the wsmaster.

Another point: with your proposal, getWorkspaceStarter() requires 2 hash map get calls instead of one.

In fact the idea of the current implementation was to optimize performance in those 2 methods that are called much more often than the workspace start or stop methods.

@davidfestal I see, so let it be as it is.
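For readers following this thread, the design that was kept (one index from workspace to subject for single-lookup reads, plus one index from user to workspaces for fast updates and cleanup) can be sketched as below. This is a simplified model, not the merged `WorkspaceSubjectRegistry`: it uses a plain `Map<String, Set<String>>` instead of Guava's `Multimap`, and the method names `onWorkspaceStarting`, `onWorkspaceStopped` and `tracksUser`, as well as the nested `Subject` stand-in, are assumptions for the example. The `"machine"` token prefix check is taken from the snippet quoted earlier in the thread.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Simplified, hypothetical model of the registry discussed above (not the merged class). */
class SubjectRegistrySketch {

  /** Minimal stand-in for Che's Subject. */
  static final class Subject {
    final String userId;
    final String token;

    Subject(String userId, String token) {
      this.userId = userId;
      this.token = token;
    }
  }

  private final ReadWriteLock lock = new ReentrantReadWriteLock();
  // workspace id -> latest subject of the user who started it (one lookup on read)
  private final Map<String, Subject> workspaceOwners = new HashMap<>();
  // user id -> ids of the running workspaces started by that user
  private final Map<String, Set<String>> userIdToWorkspaces = new HashMap<>();

  void onWorkspaceStarting(String workspaceId, Subject starter) {
    lock.writeLock().lock();
    try {
      workspaceOwners.put(workspaceId, starter);
      userIdToWorkspaces.computeIfAbsent(starter.userId, id -> new HashSet<>()).add(workspaceId);
    } finally {
      lock.writeLock().unlock();
    }
  }

  void onWorkspaceStopped(String workspaceId) {
    lock.writeLock().lock();
    try {
      Subject owner = workspaceOwners.remove(workspaceId);
      if (owner == null) {
        return;
      }
      Set<String> workspaces = userIdToWorkspaces.get(owner.userId);
      if (workspaces != null) {
        workspaces.remove(workspaceId);
        // Forget the user only once the last of their workspaces has stopped
        if (workspaces.isEmpty()) {
          userIdToWorkspaces.remove(owner.userId);
        }
      }
    } finally {
      lock.writeLock().unlock();
    }
  }

  /** Called on every request, so it only touches the workspaces of the given user. */
  void updateSubject(Subject subject) {
    if (subject == null || subject.token == null || subject.token.startsWith("machine")) {
      return;
    }
    lock.writeLock().lock();
    try {
      Set<String> workspaces = userIdToWorkspaces.get(subject.userId);
      if (workspaces == null) {
        return; // user has no running workspace: store nothing, so nothing can leak
      }
      for (String workspaceId : workspaces) {
        workspaceOwners.put(workspaceId, subject);
      }
    } finally {
      lock.writeLock().unlock();
    }
  }

  Subject getWorkspaceStarter(String workspaceId) {
    lock.readLock().lock();
    try {
      return workspaceOwners.get(workspaceId);
    } finally {
      lock.readLock().unlock();
    }
  }

  boolean tracksUser(String userId) {
    lock.readLock().lock();
    try {
      return userIdToWorkspaces.containsKey(userId);
    } finally {
      lock.readLock().unlock();
    }
  }
}
```

The trade-off matches the argument made above: the frequent operations (`updateSubject`, `getWorkspaceStarter`) never scan all running workspaces, while the extra bookkeeping is paid on the much rarer start and stop events.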

@skabashnyuk

... as suggested by @skabashnyuk [here](eclipse-che#7243 (comment)) Signed-off-by: David Festal <dfestal@redhat.com>

Signed-off-by: David Festal <dfestal@redhat.com>

benoitf · 2017-11-09T13:21:31Z

ci-build

Signed-off-by: David Festal <dfestal@redhat.com>

davidfestal · 2017-11-09T14:04:29Z

@skabashnyuk @benoitf @akorneta I tried to answer all your comments, but I might have missed some. Could you tell me if you still have pending comments / questions that still didn't get a satisfying answer ?

codenvy-ci · 2017-11-09T15:25:06Z

Build success. https://ci.codenvycorp.com/job/che-pullrequests-build/4320/

Signed-off-by: David Festal <dfestal@redhat.com>

benoitf · 2017-11-10T09:04:53Z

ci-build

codenvy-ci · 2017-11-10T11:09:23Z

Build success. https://ci.codenvycorp.com/job/che-pullrequests-build/4321/

davidfestal added 6 commits November 8, 2017 11:13

Add the ability to find the workspaceId from the machine token

6ae96f2

Signed-off-by: David Festal <dfestal@redhat.com>

Add a log when a synchronous che-server shutdown is requested

f651db8

Signed-off-by: David Festal <dfestal@redhat.com>

First stop the che server before deleting all the resources...

f39184e

... This allows cleaning the workspaces correctly also in the multi-tenant scenario. Signed-off-by: David Festal <dfestal@redhat.com>

davidfestal requested review from benoitf, garagatyi, l0rd, riuvshin, skabashnyuk and vparfonov as code owners November 8, 2017 11:16

davidfestal self-assigned this Nov 8, 2017

benoitf added the status/code-review label Nov 8, 2017

benoitf reviewed Nov 8, 2017

davidfestal mentioned this pull request Nov 8, 2017

Bayesian compatibility and workspace cleaning with multi-tenancy redhat-developer/rh-che#417

Merged

Change log level to DEBUG

15642ff

Signed-off-by: David Festal <dfestal@redhat.com>

benoitf reviewed Nov 8, 2017

l0rd approved these changes Nov 8, 2017

skabashnyuk reviewed Nov 8, 2017

benoitf added the kind/task label Nov 8, 2017

davidfestal added 2 commits November 8, 2017 21:46

Add a test for the newly-created getWorkspaceId() method.

499a947

Signed-off-by: David Festal <dfestal@redhat.com>

Add JavaDoc to the WorkspaceSubjectRegistry class

3b91135

Signed-off-by: David Festal <dfestal@redhat.com>

akorneta reviewed Nov 9, 2017

davidfestal added 2 commits November 9, 2017 13:24

Throw when a workspace is started by the anonymous user...

9d94514

... as suggested by @skabashnyuk [here](eclipse-che#7243 (comment)) Signed-off-by: David Festal <dfestal@redhat.com>

Add unit tests for the WorkspaceSubjectRegistry

4a4a4f0

Signed-off-by: David Festal <dfestal@redhat.com>

reduce the stop timeout to 3 minutes

cc96961

Signed-off-by: David Festal <dfestal@redhat.com>

benoitf approved these changes Nov 9, 2017

skabashnyuk approved these changes Nov 9, 2017

akorneta approved these changes Nov 10, 2017

Also test users are cleaned when all the related workspaces are stopped

e7050f0

Signed-off-by: David Festal <dfestal@redhat.com>

davidfestal force-pushed the multi-tenant-workspace-cleaning branch from 50930e3 to e7050f0 Compare November 10, 2017 08:55

garagatyi approved these changes Nov 10, 2017

vparfonov approved these changes Nov 10, 2017

riuvshin approved these changes Nov 10, 2017

davidfestal merged commit 5041835 into eclipse-che:master Nov 10, 2017

benoitf added this to the 5.21.0 milestone Nov 10, 2017

benoitf removed the status/code-review label Nov 10, 2017

This was referenced Nov 10, 2017

Fix one test and include tests that had been left ignored #7316

Merged

Another attempt to fix the tests that are only failing on master CI #7317

Merged
