Rework reconcile loop for workspace controller #36

amisevsk · 2020-04-01T06:40:15Z

What does this PR do?

Reworks the controller reconcile loop pretty thoroughly. For reviewing this PR, the diff will be nearly useless -- apart from a few copied sections (e.g. webhooks, some specs), the state of this PR represents a ground-up rewrite of most of the controller code.

A more detailed commit history is contained in repository https://github.com/amisevsk/che-workspace-operator-rework; see the design doc for a high-level overview.

The vast majority of the additions are because we store PodAdditions (a list of elements to be added to the workspace deployment, e.g. pods) in the status of subcomponents. This results in their full spec being included in the relevant CRDs. To make even looking at this PR slightly less daunting:

$ git diff master --stat pkg/
82 files changed, 4405 insertions(+), 4194 deletions(-)

What issues does this PR fix or reference?

eclipse-che/che#16494
eclipse-che/che#15786
supercedes PR #22

Is it tested? How?

Tested java-mysql and cloud shell with and without webhooks, oauth, on crc.

JPinkney · 2020-04-02T19:55:40Z

Tried this out and ran into a few things:

When running oc get workspace the url given doesn't have http infront anymore, not a big issue I just like being able to click on it in the terminal

NAME          WORKSPACE ID                PHASE      URL
cloud-shell   workspacee0cc88538f2c4e58   Starting   workspacee0cc88538f2c4e58-cloud-shell-proxy-4400.apps-crc.testing

When testing with webhooks enabled and routing as openshift-oauth I'm running into The only workspace creator has exec access. For some reason org.eclipse.che.workspace/creator is set to ''. I feel like I've run into that issue on master before though

Tested with webhooks disabled and everything was working 👍

In terms of the code, it LGTM but I don't have that much experience in go/kubernetes/the operator world yet. I'm going to take another look through it all again tomorrow though

amisevsk · 2020-04-03T05:05:58Z

When testing with webhooks enabled and routing as openshift-oauth I'm running into The only workspace creator has exec access. For some reason org.eclipse.che.workspace/creator is set to ''. I feel like I've run into that issue on master before though

I've seen this before, but I think usually it's due to starting with webhooks disabled and then enabling them (then the webhook would add the annotation but not have a user for it)? I'm not sure exactly what causes it, but could not reproduce with a fresh deployment

I've added https(s) back into the workspace URL :) -- I didn't add it initially because clicking the url never works in my terminal for some reason.

While looking into including http(s) in the URL status, I came across a few issues:

We weren't listing endpoints in the runtime annotation if:
- The endpoint was for a plugin
- The plugin had an alias
- The plugin alias didn't match the container name for that plugin
This means the problem basically only manifested for cloud-shell, where it doesn't matter; in the java-mysql workspace theia is aliased to theia-ide.

The solution for now is to disable aliases for Che plugins and editors; from checking the full Che server, I don't see how aliases are used currently. We might still run into issues with component name <-> container name matching, but I don't have a good solution in mind. Suggestions here are more than welcome.
We currently still rely on ingress.global.domain for correctly setting hostnames, even when running on OpenShift where it's not necessary. I have no idea how I didn't have this issue during testing, but for now it's required to set ingress.global.domain manually (e.g. app-crc.testing for crc). I'll look into fixing this more generally, but the issue is that even with openshift-oauth routing we create ingresses.
The makefile I merged had some minor issues, I've opened PR to address those Fix makefile issues due to rebasing over PR #35 #37 (they're largely from rebasing over Added in cert generation for openshift #35). The changes are duplicated to this PR but will be removed once Fix makefile issues due to rebasing over PR #35 #37 is merged.

Makefile

sleshchenko · 2020-04-03T12:37:57Z

cmd/manager/main.go

 	log.Info("Setting up webhooks")
 	if err := webhook.SetUpWebhooks(mgr, ctx); err != nil {
 		log.Error(err, "unable to register webhooks to the manager")
 		os.Exit(1)
 	}

+	// TODO: Required to filter GVK for metrics, since we add routes and templates to the scheme.
+	// TODO: see: https://github.com/operator-framework/operator-sdk/pull/2606
+	//if err = serveCRMetrics(cfg); err != nil {


Not sure I undestand why it's commented but does it make sense to create a separate issue to uncomment it and eventually make CR metrics working?

It was initially commented to resolve a log spamming issue Josh found while porting webhooks to the forked repo, and I didn't have time to delve deeply enough to figure out the full problem. I agree that it should be removed and an issue created, though.

sleshchenko · 2020-04-03T12:42:40Z

deploy/crds/workspace.che.eclipse.org_workspaces_crd.yaml

@@ -5,24 +5,24 @@ metadata:
 spec:
  additionalPrinterColumns:
  - JSONPath: .status.workspaceId
-    name: Id
+    description: The workspace's unique id
+    name: Workspace ID


it's not related to this PR but still interesting topic to raise:
Does Workspace ID makes sense since workspace is CR?
Do you think we can reuse UID?

Sometime ago I tried to remove it, but workspace id now it's workspace word + cut version of UID, which is used for objects/names generation. But I wonder if it makes sense in general and if we can drop it out.

One reason for leaving it in is that we use workspace ID (which is a just the object's UID, but modified) in various places and as a label for all created objects. We could use UID directly, but parsing the UID to a UUID can error, so it would complicate using a unique ID the way we do. We would also still have to store it to pass to subcomponents, since their UID is different from the main workspace UID.

Another option would be using the CR's name as a base, so everything created for cloud-shell would be prefixed by cloud-shell.

sleshchenko · 2020-04-03T12:52:54Z

pkg/controller/workspacerouting/solvers/openshift_oauth_solver.go

+					},
+				},
+				Spec: routeV1.RouteSpec{
+					Host: hostname,


I wonder if we should specify hostname at all? I know it helps to solve issues with route lengh and uniqueness but I don't think it's possible on the secure openshift instances, like OSIO

Or you think that controller must be able to set host?

The main issue is that hostname must be set on ingresses; it's done here mostly because the route code is a port of the ingress code; we should drop it here, but the PR will need to be reworked to create only routes and not ingresses when running on OpenShift.

Also, I don't have a minishift instance handy for testing, but I wonder how it would work there since we currently rely on using $(minishift ip).nip.io.

amisevsk · 2020-04-03T23:30:44Z

@sleshchenko Commit 22ab54d should fix your issue with command -v kubectl in the makefile, please double check.

pkg/controller/workspacerouting/solvers/oauth_proxy_container.go

pkg/config/property.go

pkg/controller/workspacerouting/solvers/oauth_proxy_container.go

sleshchenko · 2020-04-06T08:10:37Z

pkg/controller/workspace/env/common_env_vars.go

 		},
 		{
 			Name:  "CHE_WORKSPACE_NAMESPACE",
-			Value: wkspCtx.Namespace,
+			Value: namespace,
 		},


As a quick solution for CloudShell with OpenShiftOAuth:

Suggested change

},

},

{

Name: "USE_BEARER_TOKEN",

Value: config.ControllerCfg.GetWebhooksEnabled(),

},

As a bit better solution I wonder if OpenShift OAuth Routing could contribute env var to all containers via PodAdditions... But then we have another question that SA with exec rights is not needed anymore, then who should contribute it?
I'm OK with trying to solve this issue in scope of a dedicated issue.

sleshchenko · 2020-04-06T08:36:53Z

pkg/controller/workspacerouting/solvers/openshift_oauth_solver.go

+	proxyServices := getServicesForEndpoints(proxyPorts, workspaceMeta)
+	for idx := range proxyServices {
+		proxyServices[idx].Annotations = map[string]string{
+			"service.alpha.openshift.io/serving-cert-secret-name": "proxy-tls",


It seems to be just ported from master but since it's easy fix - let's fix running multiple workspaces in the same namespace

Suggested change

"service.alpha.openshift.io/serving-cert-secret-name": "proxy-tls",

"service.alpha.openshift.io/serving-cert-secret-name": "proxy-tls-" + workspaceMeta.WorkspaceId,

P.S. As you can see two workspaces does not work in hard-coded secret-name.

I assume it's not supposed to use the same secret for multiple services, the service name can be set in certificates as DNS name... And it's generated once.

It makes me thinking: what if we mustn't set this annotation for all services, only for openshift-oauth one.
If we really need TLS inside for all endpoint - we must add endpoint name into secret name as well.

P.S. Controller does not seem to update this field after reconciling, not sure it's expected since CR was not modified. I've just update controller. So, we should keep it in mind since it might lead to unexpected inconsistency after update

P.P.S. Well, seems like a bug since deployment is update but not service

Ah good catch, we currently ignore ObjectMeta when checking if a service needs to be updated. I'll fix this.

Looking into it a bit, this is maybe an task for another issue -- we can't naively check that annotations match since some are added by the cluster, and I don't want to hard-code checking service.alpha.openshift.io/serving-cert-secret-name in the generic case without further discussion.

For now, this annotation should not change (unless you rollout a new operator that changes the semantics).

sleshchenko · 2020-04-06T10:53:37Z

pkg/adaptor/plugin.go

+		if err != nil {
+			return nil, nil, err
+		}
+		// TODO: Alias for plugins seems to be ignored in regular Che


Not really. From che.openshift.io:

{ "namespace": "sleshche", "temporary": false, "id": "workspace6wt7edpromg4w9w0", "status": "RUNNING", "runtime": { "machines": { "nodejs": { "attributes": { "component": "nodejs", "memoryRequestBytes": "209715200", "memoryLimitBytes": "1073741824", "source": "recipe", "cpuLimitCores": "2.0", "cpuRequestCores": "0.125" } } }, "commands": [ { "commandLine": "yarn run lint", "name": "lint", "attributes": { "componentAlias": "nodejs", "machineName": "nodejs", "workingDir": "${CHE_PROJECTS_ROOT}/angular-realworld-example-app" }, "type": "exec" } ] }, "devfile": { "metadata": { "name": "wksp-sek5" }, "components": [ { "mountSources": true, "endpoints": [ { "name": "angular", "port": 4200 } ], "memoryLimit": "1Gi", "type": "dockerimage", "alias": "nodejs", "image": "quay.io/eclipse/che-nodejs10-community:nightly" } ], "apiVersion": "1.0.0", "commands": [ { "name": "lint", "actions": [ { "workdir": "${CHE_PROJECTS_ROOT}/angular-realworld-example-app", "type": "exec", "command": "yarn run lint", "component": "nodejs" } ] } ] } }

So, alias is propagated to machine's and commands attributes. Then Theia use them to match command to container.
I'm OK with skipping alias for time being but it would be good to make comment up to date.

Yes I'm fine to update this -- it was on my TODO list for Friday that I didn't get to. I recognize that alias is used for dockerimage components, but I could not find it being used for plugin components. I need to look into it more deeply.

pkg/controller/workspacerouting/solvers/oauth_proxy_container.go

pkg/controller/component/component_controller.go

sleshchenko

Please take a look my proposal/fixes https://github.com/amisevsk/che-workspace-crd-operator/compare/rework-reconcile-loop...sleshchenko:rework-reconcile-loop-2?expand=1

sleshchenko · 2020-04-06T13:35:27Z

pkg/controller/workspace/runtime/runtime_annotation.go

+				servers[endpoint.Name] = v1alpha1.CheWorkspaceServer{
+					Attributes: endpoint.Attributes,
+					Status:     v1alpha1.RunningServerStatus, // TODO: This is just set so the circles are green
+					URL:        fmt.Sprintf("%s://%s", protocol, endpoint.Url),


It leads to url with double protocol like "http://https://workspace383a324431b54d9f-theia-proxy-4400.apps-crc.testing

Something like amisevsk@84dd613 could be done to solve this issue.

This is an artifact of me implementing Josh's request for protocol in status -- I forgot to remove it here. It doesn't work properly anyways, since protocol is always http even for https routes. We should have protocol included in URL everywhere now, so checking is unnecessary.

Ah except when it doesn't:

ws://http://workspace82ba65b9c2b84595-che-machine-exec-4444.apps-crc.testing

So, now we get https URL even when protocol is ws, right?

pkg/controller/workspacerouting/solvers/oauth_proxy_container.go

pkg/controller/workspacerouting/solvers/solver.go

amisevsk · 2020-04-07T05:16:41Z

Tested again briefly on crc, and I suspect there may be a regression somewhere. With the java-mysql workspace I'm getting frequent issues opening a terminal.

sleshchenko · 2020-04-07T12:26:56Z

pkg/controller/workspace/provision/deployment.go

+				},
+			},
+			Strategy: appsv1.DeploymentStrategy{
+				Type: "RollingUpdate",


Maybe it's better to start with Recreate since workspaces will fails to update if PVC is used (at least on production-ready cluster)

sleshchenko · 2020-04-07T12:49:55Z

samples/no-editor.yaml

+apiVersion: workspace.che.eclipse.org/v1alpha1
+kind: Workspace
+metadata:
+  name: cloud-shell


is it really cloud-shell? Will controller propagate a default editor?

My bad, these got added half-accidentally as I threw them together for testing. I'll update the names since they could still be useful for test cases.

sleshchenko · 2020-04-07T13:42:33Z

pkg/apis/workspace/v1alpha1/workspace_types.go

+// Valid workspace Statuses
+const (
+	WorkspaceStatusStarting WorkspacePhase = "Starting"
+	WorkspaceStatusReady    WorkspacePhase = "Ready"


Previously it was Running, right?
I don't have a strong preferences but just reminder that OpenShift Console depends on Running state and we need inform them after PR is merged.

You're right, great catch -- I mixed up the workspace phase and the workspaceRouting phase. Fixed in PR #46

This commit represents an almost-complete overhaul of the main reconcile loop in the workspace controller. A high level overview of changes: - All controllers do full reconciling of all objects they are responsible for; this means deleting a route or the workspace deployment means it will be recreated - The main workspace controller now watches all resources it creates to trigger reconciles - The main reconcile loop is split into phases with subcontrollers; it only progresses based on status of earlier steps (i.e. if components aren't ready, we don't try to create routing) - All service/ingress/route creation is delegated to WorkspaceRouting - The openshift-oauth routingClass results in the openshift oauth-proxy container running in the main workspace deployment - There's a cleaner separation between elements in `pkg/controller` -- no imports across controllers (i.e. WorkspaceRouting imports nothing from Workspace) - All shared structs are extracted to `apis` folder - One service is created for all workspace endpoints (except discoverable endpoints) - Add Component subcontroller that converts devfile components into k8s objects A design doc and more detailed history for these changes is found at https://github.com/amisevsk/che-workspace-operator-rework Signed-off-by: Angel Misevski <amisevsk@redhat.com>

Signed-off-by: Angel Misevski <amisevsk@redhat.com>

It's currently difficult to support aliases for chePlugin and cheEditor components, since the component name no longer matches the container that is created. This means that we cannot match endpoints to plugins when they have an alias that is different from container name. Signed-off-by: Angel Misevski <amisevsk@redhat.com>

Signed-off-by: Sergii Leshchenko <sleshche@redhat.com>

- Name secret used for oauth proxy with workspaceId to prevent name conflicts when running more than one workspace - Add workspace namespace to SAR request with oauth proxy - Use bearer token if webhooks are enabled - Set default webhooks enabled to false Signed-off-by: Angel Misevski <amisevsk@redhat.com>

- Avoid errors when workspace doesn't contain any dockerimage (or plugin/editor) components - Fix double protocol for URLs in runtime annotation - Fix import alias naming Signed-off-by: Angel Misevski <amisevsk@redhat.com>

Make will try to avoid launching a shell to execute simple commands this breaks for shell builtins, such as 'command'. Signed-off-by: Angel Misevski <amisevsk@redhat.com>

Signed-off-by: Angel Misevski <amisevsk@redhat.com>

sleshchenko

Tested cloud-shell with openshift/basic default routing class.
Basically works fine.

There is an issue with mixing http and https when we run Theia workspace but we agreed to address separately. Then we probably can do more precisely review

I faced issues in corner cases: updating default_routing_class when there is already started WS, but it's not critical at all and don't work in the master. We can address it separately as well.

Feel free to merge as soon as you make sure CloudShell works for you as well

BTW Great job! 💪 😎 👍

Add dependency to relevant rules in makefile to print current env var settings, to avoid accidentally reverting changes in e.g. the configmap Signed-off-by: Angel Misevski <amisevsk@redhat.com>

JPinkney

Giving this a +1 as well, I've been working off of this branch for a few days now and everything has been working on my side

amisevsk · 2020-04-07T17:27:27Z

Tested again, and the issues seem to be related to crc being somewhat unresponsive. Cloud shell works with both basic routing and OAuth + webhooks.

I'm merging this PR to unblock things; we can address edge cases in separate issues.

amisevsk · 2020-04-07T20:40:03Z

@sleshchenko I missed your last comments before merging; they are addressed in PR #46

…g of GO flags to match DWO (devfile#36) * use go mod tidy; go mod vendor and go build -mod=vendor -x since we need that downsteam Change-Id: Id3611db0a7aa81db6df97d7d5e9a7048fb30b197 Signed-off-by: nickboldt <nboldt@redhat.com> * revert to use mod download, then copy sources, THEN mod vendor; no need for -x flag Change-Id: Id5598026473544abdf6b082c9aa7fbff617353bb Signed-off-by: nickboldt <nboldt@redhat.com> * seems like upstream doesn't like vendoring so we can keep this to midstream Change-Id: If858e058b712e16b218f546324c46e703115dee8 Signed-off-by: nickboldt <nboldt@redhat.com>

amisevsk requested review from davidfestal, sleshchenko and JPinkney April 1, 2020 06:40

sleshchenko mentioned this pull request Apr 2, 2020

Move openshift-oauth-proxy into main workspace deployment #22

Closed

amisevsk force-pushed the rework-reconcile-loop branch from dde776d to 93fdedb Compare April 3, 2020 05:08

sleshchenko reviewed Apr 3, 2020

View reviewed changes

Makefile Outdated Show resolved Hide resolved

sleshchenko reviewed Apr 3, 2020

View reviewed changes

amisevsk force-pushed the rework-reconcile-loop branch from bfe2779 to 923c1cc Compare April 3, 2020 22:21