Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multi tenant workspace cleaning #7243

Merged

Conversation

davidfestal
Copy link
Contributor

What does this PR do?

This PR introduces changes required to support some multi-tenancy use-cases that were not supported until now:

  • clean openshift resources associated to a workspace when the workspace is idled
  • clean openshift resources associated to all started workspaces on che-server shutdown.

These 2 use-cases are critical to allow OSIO multi-tenancy to go to production.

What issues does this PR fix or reference?

redhat-developer/rh-che#370

Signed-off-by: David Festal <dfestal@redhat.com>
Signed-off-by: David Festal <dfestal@redhat.com>
More precisely, the idea is to keep track of the userId that started a
workspace and have the corresponding Subject always available and
updated with the last connection information (mainly the last Keycloak
token).
 

Signed-off-by: David Festal <dfestal@redhat.com>
... to allow removing Openshift resources when a workspace is stopped
(`removeContainer`, `removeImage`) even if the workspace user is not
accessible through the current `EnvironmentContext`.

Signed-off-by: David Festal <dfestal@redhat.com>
...
This mainly consists in cleaning the workspace files *synchronously*
instead of doing it in batch at che-server idling.
The previous behavior remains unchanged if the che server is not
multi-tenant.

Signed-off-by: David Festal <dfestal@redhat.com>
... This allows cleaning the workspaces correctly also in the
multi-tenant scenario. 

Signed-off-by: David Festal <dfestal@redhat.com>
@codenvy-ci
Copy link

Can one of the admins verify this patch?

@benoitf benoitf added the status/code-review This issue has a pull request posted for it and is awaiting code review completion by the community. label Nov 8, 2017
@codenvy-ci
Copy link

Can one of the admins verify this patch?

@benoitf
Copy link
Contributor

benoitf commented Nov 8, 2017

ci-build

@codenvy-ci
Copy link

Build # 4314 - FAILED

Please check console output at https://ci.codenvycorp.com/job/che-pullrequests-build/4314/ to view the results.

@davidfestal
Copy link
Contributor Author

any idea why the CI build failed ? it seems from the console output that the maven build was successful but the the Jenkins job failed.

@benoitf
Copy link
Contributor

benoitf commented Nov 8, 2017

image

@@ -106,6 +106,7 @@ private void doStopServices() {
@PreDestroy
@VisibleForTesting
void shutdown() throws InterruptedException {
LOG.info("Synchronous shutdown requested");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it should be cleanup, no ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why ? is it a problem to have this log ? it's only called once on shutdown, but might be very useful to check that the post-destroy method is called on SIGTERM, which wasn't the case last week (for an unknown reason, this was fixed when updating to the last master branch).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's more the level of the logger

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, the log level used in line 91 is also INFO, and the type of event is clearly the same, no ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's my view on log events (there are other views ;-) I still think these events are not at INFO level.
People that are interested in a specific part want always to push to INFO level but many times it's more DEBUG info. I'm not sure it's interesting for end-user to see that by default.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed here

@benoitf
Copy link
Contributor

benoitf commented Nov 8, 2017

ci-build

@codenvy-ci
Copy link

Signed-off-by: David Festal <dfestal@redhat.com>
@@ -266,6 +266,8 @@ echo "done!"
# If command == clean up then delete all openshift objects
# -------------------------------------------------------------
if [ "${COMMAND}" == "cleanup" ]; then
echo "[CHE] Stopping the Che server..."
oc scale --replicas=0 --timeout=3h dc che
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as a side note, do we really want to wait 3h there ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably not :-) The idea though is to give enough time to stop all the workspaces started on the various external clusters / namespaces associated to users in the multi-tenant scenario.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@l0rd What value would you give to this timeout ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 minutes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in commit cc96961

* @return workspace identifier
* @throws NotFoundException when no such machine token exists
*/
public String getWorkspaceId(String token) throws NotFoundException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this method should be tested in multiuser/machine-auth/che-multiuser-machine-authentication/src/test/java/org/eclipse/che/multiuser/machine/authentication/server/MachineTokenRegistryTest.java

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't find where this method is used

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is used by the rh-che assembly, in order to get an up-to-date Keycloak token that we can inject into the Bayesian Language server when starting it.

See PR redhat-developer/rh-che#417, and especially this line

import org.slf4j.Logger;

@Singleton
public class WorkspaceSubjectRegistry implements EventSubscriber<WorkspaceStatusEvent> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is missing unit tests and class comment like What is the meaning of this class, etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, sorry about that. I'll add it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Javadoc added in commit 3b91135
Unit tests added in commit 4a4a4f0

}

@PostConstruct
@VisibleForTesting
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see the associated test ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see commit 4a4a4f0

* @return workspace identifier
* @throws NotFoundException when no such machine token exists
*/
public String getWorkspaceId(String token) throws NotFoundException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't find where this method is used

import org.eclipse.che.commons.subject.Subject;
import org.slf4j.Logger;

@Singleton
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Contract of this class is unclear. Looks like a cache of information that can be retrieved from other API or managers. Can you clarify a contract?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a sort of cache yes, but I'll add Javadoc to the class, as well as tests, which I omitted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Javadoc added in commit 3b91135

@@ -45,6 +49,7 @@ public final void doFilter(
Subject subject = new SubjectImpl("che", "che", "dummy_token", false);
HttpSession session = httpRequest.getSession();
session.setAttribute("codenvy_user", subject);
workspaceSubjectRegistry.updateSubject(subject);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to cache subject?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for the reasons explained here

private Subject subjectForContainerId(String containerId) {
String workspaceId = containerIdToWorkspaceId.get(containerId);
if (workspaceId != null) {
Subject subject = workspaceSubjectRegistry.getWorkspaceStarter(workspaceId);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we need to get Subject from cache instead of EnvironmentContext.getCurrent().getSubject();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to manage the cases when OpenshiftConnector has to to some action on a workspace, triggered by a batch, non-UI event:

  1. start/stop of the che-server (which will stop ll workspaces during the shutdown procedure, which in turn will call OpenshiftConnector.removeImage() and OpenshiftConnector.removeContainer().
  2. stop of the current workspace by the activity-checker due to workspace idling, which will also call OpenshiftConnector.removeImage() and OpenshiftConnector.removeContainer().

In cases 1 and 2, the EnvironmentContext.getCurrent().getSubject() doesn't contain a valid subject that can be used to connect the Openshift cluster/namespace hosting the workspace.

  1. Start of the Bayesian language server: in this case, calling EnvironmentContext.getCurrent().getSubject() on the wsmaster during a request coming from the wsagent would return the machine token instead of the Keycloak token in subject.getToken(), which is also wrong.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be great if subject is not cached
why batch jobs can't send the required data ? or why associated subject is not propagated ? on removeContainer ?

Copy link
Contributor Author

@davidfestal davidfestal Nov 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@benoitf in those 3 scenarios, currently (before this PR) the required user information (contained in the subject) is available nowhere in the wsmaster application. And since these 3 scenarios are triggered by some non-interactive event (idling, tomcat shutdown hook, internal event indirectly triggered by the language server registry), we have no way to send or propagate the user information, afaik.

To be clear, on the other hand, workspace stop initiated by the user himself from the Javascript UIs (GWT or Dashboard) was already working great.

if (WorkspaceStatusEvent.EventType.STARTING.equals(event.getEventType())) {
Subject subject = EnvironmentContext.getCurrent().getSubject();
if (subject == Subject.ANONYMOUS) {
LOG.warn("Workspace {} is being started by the 'anonymous' user.", workspaceId);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This case is impossible AFAIK. Shall we throw an error instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you confirm that workspace start always happens as a direct result of a user action in the client-side Javascript application, then I'm not against it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in commit 9d94514

@benoitf benoitf added the kind/task Internal things, technical debt, and to-do tasks to be performed. label Nov 8, 2017
Signed-off-by: David Festal <dfestal@redhat.com>
Signed-off-by: David Festal <dfestal@redhat.com>
private final EventService eventService;
private final ReadWriteLock lock = new ReentrantReadWriteLock();
private final Map<String, Subject> workspaceOwners = new HashMap<>();
private final Multimap<String, String> userIdToWorkspaces =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davidfestal I propose to replace Multimap and Map to next mapping:

Map<String, String> workspaceId2UserId
Map<String, Subject> userId2Subject

cases:

1. when the workspace is stopped

...
String userId = workspaceId2UserId.remove(workspaceId);
if(userId != null) {
    userId2Subject.remove(userId);
}
...

2. when the workspace is started

...
workspaceId2UserId.put(workspaceId, subject.getUserId());
updateSubject(Subject subject);
...

public void updateSubject(Subject subject) {
    String token = subject != null ? subject.getToken() : null;
    if (token != null && token.startsWith("machine")) {      
      return;
    }
    ...
    userId2Subject.put(subject.getUserId(), subject);
    ...
  }

3. get subject by workpsace id:

public Subject getWorkspaceStarter(String workspaceId) {
  ...
  String userId = workspaceId2UserId.get(workspaceId);
  if(userId != null) {
    return userId2Subject.get(userId);     
  }
  return null;
  ...
}

WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me that this proposal doesn't support starting 2 workspaces with the same user and then stopping only one of both. The user Subject would be lost though it is still useful for the second workspace.

Or did I miss something ?

Copy link
Contributor

@akorneta akorneta Nov 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I missed that and the first case when the workspace is stopped should look like this:

String userId = workspaceId2UserId.remove(workspaceId);
if(userId != null && !workspaceId2UserId.values().contains(userId)) {
    userId2Subject.remove(userId);
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also it seems to me that in this proposal, updateSubject(Subject subject) would allow adding in the userId2Subject map a userId that wasn't previously associated to a workspace. In such a case, it will never be removed from the map and we have a memory leak, no ?

Could you explain what seems like a problem for you in the current implementation ?

Copy link
Contributor

@akorneta akorneta Nov 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically leak that you described is possible, but I thought that the start of the workspace it is the only case when this method is used and it can be solved by adding a check like:

if (workspaceId2UserId.values().contains(userId)) {
   userId2Subject.put(subject.getUserId(), subject);
}

What I'm trying to improve it is the mapping of:
1 user to n workspaces and n workspaces to 1 subject
to
n workspaces to 1 user and 1 user to 1 subject.

Do you see what I mean?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I see, but I'm not sure performance would be better in the updateSubject() method, knowing that you potentially need to iterate the whole list of started workspaces.

And the updateSubject() method should be as fast as possible, since it's called each time a client request comes to the wsmaster.

Another point: with your proposal, getWorkspaceStarter() requires 2 hash map get calls instead of one.

In fact the idea of the current implementation was to optimize performance in those 2 methods that are called much more often than the workspace start or stop methods.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davidfestal I see, so let it be as it is.

... as suggested by @skabashnyuk
[here](eclipse-che#7243 (comment))

Signed-off-by: David Festal <dfestal@redhat.com>
Signed-off-by: David Festal <dfestal@redhat.com>
@benoitf
Copy link
Contributor

benoitf commented Nov 9, 2017

ci-build

Signed-off-by: David Festal <dfestal@redhat.com>
@davidfestal
Copy link
Contributor Author

@skabashnyuk @benoitf @akorneta I tried to answer all your comments, but I might have missed some. Could you tell me if you still have pending comments / questions that still didn't get a satisfying answer ?

@codenvy-ci
Copy link

Signed-off-by: David Festal <dfestal@redhat.com>
@davidfestal davidfestal force-pushed the multi-tenant-workspace-cleaning branch from 50930e3 to e7050f0 Compare November 10, 2017 08:55
@benoitf
Copy link
Contributor

benoitf commented Nov 10, 2017

ci-build

@davidfestal davidfestal merged commit 5041835 into eclipse-che:master Nov 10, 2017
@benoitf benoitf added this to the 5.21.0 milestone Nov 10, 2017
@benoitf benoitf removed the status/code-review This issue has a pull request posted for it and is awaiting code review completion by the community. label Nov 10, 2017
@codenvy-ci
Copy link

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/task Internal things, technical debt, and to-do tasks to be performed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants