
Rework reconcile loop for workspace controller #36

Merged (12 commits) on Apr 7, 2020

Conversation


@amisevsk (Collaborator) commented on Apr 1, 2020:

What does this PR do?

Reworks the controller reconcile loop pretty thoroughly. For reviewing this PR, the diff will be nearly useless -- apart from a few copied sections (e.g. webhooks, some specs), the state of this PR represents a ground-up rewrite of most of the controller code.

A more detailed commit history is contained in the repository https://github.com/amisevsk/che-workspace-operator-rework; see the design doc there for a high-level overview.

The vast majority of the additions are because we store PodAdditions (a list of elements to be added to the workspace deployment, e.g. pods) in the status of subcomponents. This results in their full spec being included in the relevant CRDs. To make even looking at this PR slightly less daunting:

$ git diff master --stat pkg/
82 files changed, 4405 insertions(+), 4194 deletions(-)

What issues does this PR fix or reference?

eclipse-che/che#16494
eclipse-che/che#15786
Supersedes PR #22

Is it tested? How?

Tested java-mysql and cloud shell with and without webhooks, oauth, on crc.


@JPinkney (Contributor) commented on Apr 2, 2020:

Tried this out and ran into a few things:

  • When running oc get workspace, the URL given doesn't have http:// in front anymore. Not a big issue; I just like being able to click on it in the terminal:
NAME          WORKSPACE ID                PHASE      URL
cloud-shell   workspacee0cc88538f2c4e58   Starting   workspacee0cc88538f2c4e58-cloud-shell-proxy-4400.apps-crc.testing
  • When testing with webhooks enabled and routing as openshift-oauth, I'm running into "The only workspace creator has exec access." For some reason org.eclipse.che.workspace/creator is set to ''. I feel like I've run into that issue on master before, though.

Tested with webhooks disabled and everything was working 👍

In terms of the code, it LGTM but I don't have that much experience in go/kubernetes/the operator world yet. I'm going to take another look through it all again tomorrow though


@amisevsk (Collaborator, Author) commented on Apr 3, 2020:

  • When testing with webhooks enabled and routing as openshift-oauth, I'm running into "The only workspace creator has exec access." For some reason org.eclipse.che.workspace/creator is set to ''. I feel like I've run into that issue on master before, though.

I've seen this before, but I think it's usually due to starting with webhooks disabled and then enabling them (the webhook would then add the annotation but not have a user for it). I'm not sure exactly what causes it, but I could not reproduce it with a fresh deployment.

I've added http(s) back into the workspace URL :) -- I didn't add it initially because clicking the URL never works in my terminal for some reason.

While looking into including http(s) in the URL status, I came across a few issues:

  • We weren't listing endpoints in the runtime annotation if:

    • The endpoint was for a plugin
    • The plugin had an alias
    • The plugin alias didn't match the container name for that plugin

    This means the problem basically only manifested for cloud-shell, where it doesn't matter; in the java-mysql workspace theia is aliased to theia-ide.

    The solution for now is to disable aliases for Che plugins and editors; from checking the full Che server, I don't see how aliases are used currently. We might still run into issues with component name <-> container name matching, but I don't have a good solution in mind. Suggestions here are more than welcome.

  • We currently still rely on ingress.global.domain for correctly setting hostnames, even when running on OpenShift where it's not necessary. I have no idea how I didn't hit this issue during testing, but for now it's required to set ingress.global.domain manually (e.g. apps-crc.testing for crc). I'll look into fixing this more generally, but the issue is that even with openshift-oauth routing we create ingresses.

  • The makefile I merged had some minor issues; I've opened PR #37 ("Fix makefile issues due to rebasing over PR #35") to address those -- they're largely from rebasing over PR #35 ("Added in cert generation for openshift"). The changes are duplicated in this PR but will be removed once #37 is merged.
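The alias pitfall from the first bullet can be illustrated with a short sketch (the function and map below are hypothetical, not the controller's actual API): endpoints are recorded against the component name, while the container is created from the alias, so a plain name lookup misses whenever the two differ.

```go
package main

import "fmt"

// containerForComponent resolves the container name for a devfile component.
// Illustrative only: when a component carries an alias, the container is named
// after the alias, so endpoint lookups keyed by component name fail to match.
func containerForComponent(component string, aliases map[string]string) string {
	if alias, ok := aliases[component]; ok {
		return alias
	}
	return component
}

func main() {
	aliases := map[string]string{"theia": "theia-ide"}
	// An endpoint recorded under "theia" will not match the "theia-ide" container
	// unless the alias is resolved first.
	fmt.Println(containerForComponent("theia", aliases)) // theia-ide
	fmt.Println(containerForComponent("mysql", aliases)) // mysql
}
```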

Makefile (outdated; resolved):
log.Info("Setting up webhooks")
if err := webhook.SetUpWebhooks(mgr, ctx); err != nil {
	log.Error(err, "unable to register webhooks to the manager")
	os.Exit(1)
}

// TODO: Required to filter GVK for metrics, since we add routes and templates to the scheme.
// TODO: see: https://github.com/operator-framework/operator-sdk/pull/2606
//if err = serveCRMetrics(cfg); err != nil {
Member commented:

Not sure I understand why it's commented out, but does it make sense to create a separate issue to uncomment it and eventually get CR metrics working?

@amisevsk (Collaborator, Author) replied:

It was initially commented to resolve a log spamming issue Josh found while porting webhooks to the forked repo, and I didn't have time to delve deeply enough to figure out the full problem. I agree that it should be removed and an issue created, though.

@@ -5,24 +5,24 @@ metadata:
spec:
additionalPrinterColumns:
- JSONPath: .status.workspaceId
name: Id
description: The workspace's unique id
name: Workspace ID
Member commented:

It's not related to this PR, but still an interesting topic to raise: does a separate workspace ID make sense, since the workspace is a CR? Do you think we can reuse the UID?

Some time ago I tried to remove it, but the workspace ID is now the word "workspace" plus a shortened version of the UID, and it is used for object name generation. I wonder if it makes sense in general and whether we can drop it.

@amisevsk (Collaborator, Author) replied:

One reason for leaving it in is that we use the workspace ID (which is just the object's UID, modified) in various places and as a label on all created objects. We could use the UID directly, but parsing the UID into a UUID can error, so it would complicate using a unique ID the way we do. We would also still have to store it to pass to subcomponents, since their UID is different from the main workspace UID.

Another option would be using the CR's name as a base, so everything created for cloud-shell would be prefixed by cloud-shell.
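For context, a minimal sketch of the kind of derivation being discussed -- the "workspace" prefix plus a dash-stripped, truncated form of the UID. The exact truncation here is an assumption, not necessarily the controller's implementation:

```go
package main

import (
	"fmt"
	"strings"
)

// workspaceID derives a short workspace ID from a CR's UID by lowercasing,
// stripping dashes, and truncating. Sketch only; the real derivation
// may differ in how much of the UID it keeps.
func workspaceID(uid string) string {
	stripped := strings.ReplaceAll(strings.ToLower(uid), "-", "")
	if len(stripped) > 16 {
		stripped = stripped[:16]
	}
	return "workspace" + stripped
}

func main() {
	// Produces IDs of the shape seen in `oc get workspace` output above.
	fmt.Println(workspaceID("e0cc8853-8f2c-4e58-aaaa-bbbbbbbbbbbb")) // workspacee0cc88538f2c4e58
}
```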

},
},
Spec: routeV1.RouteSpec{
Host: hostname,
Member commented:

I wonder if we should specify the hostname at all? I know it helps solve issues with route length and uniqueness, but I don't think it's possible on secure OpenShift instances, like OSIO. (screenshot omitted)
Or do you think the controller must be able to set the host?

@amisevsk (Collaborator, Author) replied:

The main issue is that the hostname must be set on ingresses; it's done here mostly because the route code is a port of the ingress code. We should drop it here, but the PR will need to be reworked to create only routes (and not ingresses) when running on OpenShift.

Also, I don't have a minishift instance handy for testing, but I wonder how it would work there since we currently rely on using $(minishift ip).nip.io.


@amisevsk (Collaborator, Author) commented on Apr 3, 2020:

@sleshchenko Commit 22ab54d should fix your issue with command -v kubectl in the makefile, please double check.

},
{
Name: "CHE_WORKSPACE_NAMESPACE",
Value: wkspCtx.Namespace,
Value: namespace,
},
Member commented:

As a quick solution for CloudShell with OpenShiftOAuth:

Suggested change
},
},
{
Name: "USE_BEARER_TOKEN",
Value: config.ControllerCfg.GetWebhooksEnabled(),
},

As a slightly better solution, I wonder if OpenShift OAuth routing could contribute the env var to all containers via PodAdditions... But then there's another question: the SA with exec rights is no longer needed, so who should contribute it?
I'm OK with solving this in the scope of a dedicated issue.
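One detail worth noting about the suggestion above: a Kubernetes EnvVar value must be a string, so a boolean config flag needs explicit formatting before it can be assigned. A sketch using a simplified stand-in type (not the real corev1.EnvVar, and `bearerTokenEnv` is a hypothetical helper):

```go
package main

import (
	"fmt"
	"strconv"
)

// envVar is a simplified stand-in for corev1.EnvVar, whose Value field
// is a string in the Kubernetes API.
type envVar struct {
	Name  string
	Value string
}

// bearerTokenEnv formats the webhooks-enabled flag as a string-valued env
// var, since EnvVar values cannot hold a raw bool.
func bearerTokenEnv(webhooksEnabled bool) envVar {
	return envVar{
		Name:  "USE_BEARER_TOKEN",
		Value: strconv.FormatBool(webhooksEnabled),
	}
}

func main() {
	fmt.Println(bearerTokenEnv(true).Value) // true
}
```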

proxyServices := getServicesForEndpoints(proxyPorts, workspaceMeta)
for idx := range proxyServices {
proxyServices[idx].Annotations = map[string]string{
"service.alpha.openshift.io/serving-cert-secret-name": "proxy-tls",
@sleshchenko (Member) commented on Apr 6, 2020:

It seems to be just ported from master, but since it's an easy fix, let's fix running multiple workspaces in the same namespace:

Suggested change
"service.alpha.openshift.io/serving-cert-secret-name": "proxy-tls",
"service.alpha.openshift.io/serving-cert-secret-name": "proxy-tls-" + workspaceMeta.WorkspaceId,

P.S. As you can see, two workspaces do not work with a hard-coded secret name. (screenshot omitted)

I assume the same secret is not supposed to be used for multiple services; the service name can be set in the certificate as a DNS name... and it's generated once. (screenshot omitted)

It makes me think: maybe we shouldn't set this annotation for all services, only for the openshift-oauth one.
If we really need TLS internally for all endpoints, we must add the endpoint name into the secret name as well.

@sleshchenko (Member) commented on Apr 6, 2020:

P.S. The controller does not seem to update this field after reconciling; I'm not sure whether that's expected, since the CR was not modified -- I only updated the controller. We should keep it in mind, since it might lead to unexpected inconsistency after an update. (screenshot omitted)

P.P.S. Well, it seems like a bug, since the deployment is updated but the service is not. (screenshots omitted)

@amisevsk (Collaborator, Author) replied:

Ah good catch, we currently ignore ObjectMeta when checking if a service needs to be updated. I'll fix this.

@amisevsk (Collaborator, Author) replied:

Looking into it a bit, this is maybe a task for another issue -- we can't naively check that annotations match, since some are added by the cluster, and I don't want to hard-code checking service.alpha.openshift.io/serving-cert-secret-name in the generic case without further discussion.

For now, this annotation should not change (unless you roll out a new operator that changes the semantics).
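A sketch of the kind of targeted check being discussed -- comparing only the annotations the controller manages rather than the full map, so cluster-added annotations don't cause spurious diffs. The function name and key list are illustrative, not the controller's actual code:

```go
package main

import "fmt"

// managedAnnotationsMatch compares only the annotation keys the controller
// owns, ignoring annotations added by the cluster (e.g. kubectl's
// last-applied-configuration), which would make a whole-map comparison
// report a difference on every reconcile.
func managedAnnotationsMatch(managedKeys []string, desired, actual map[string]string) bool {
	for _, key := range managedKeys {
		if desired[key] != actual[key] {
			return false
		}
	}
	return true
}

func main() {
	managed := []string{"service.alpha.openshift.io/serving-cert-secret-name"}
	desired := map[string]string{
		"service.alpha.openshift.io/serving-cert-secret-name": "proxy-tls-workspace123",
	}
	actual := map[string]string{
		"service.alpha.openshift.io/serving-cert-secret-name": "proxy-tls",
		"kubectl.kubernetes.io/last-applied-configuration":    "{...}",
	}
	// The managed secret-name annotation drifted, so an update is needed.
	fmt.Println(managedAnnotationsMatch(managed, desired, actual)) // false
}
```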

if err != nil {
return nil, nil, err
}
// TODO: Alias for plugins seems to be ignored in regular Che
Member commented:

Not really. From che.openshift.io:

{
  "namespace": "sleshche",
  "temporary": false,
  "id": "workspace6wt7edpromg4w9w0",
  "status": "RUNNING",
  "runtime": {
    "machines": {
      "nodejs": {
        "attributes": {
          "component": "nodejs",
          "memoryRequestBytes": "209715200",
          "memoryLimitBytes": "1073741824",
          "source": "recipe",
          "cpuLimitCores": "2.0",
          "cpuRequestCores": "0.125"
        }
      }
    },
    "commands": [
      {
        "commandLine": "yarn run lint",
        "name": "lint",
        "attributes": {
          "componentAlias": "nodejs",
          "machineName": "nodejs",
          "workingDir": "${CHE_PROJECTS_ROOT}/angular-realworld-example-app"
        },
        "type": "exec"
      }
    ]
  },
  "devfile": {
    "metadata": {
      "name": "wksp-sek5"
    },
    "components": [
      {
        "mountSources": true,
        "endpoints": [
          {
            "name": "angular",
            "port": 4200
          }
        ],
        "memoryLimit": "1Gi",
        "type": "dockerimage",
        "alias": "nodejs",
        "image": "quay.io/eclipse/che-nodejs10-community:nightly"
      }
    ],
    "apiVersion": "1.0.0",
    "commands": [
      {
        "name": "lint",
        "actions": [
          {
            "workdir": "${CHE_PROJECTS_ROOT}/angular-realworld-example-app",
            "type": "exec",
            "command": "yarn run lint",
            "component": "nodejs"
          }
        ]
      }
    ]
  }
}

So, the alias is propagated to the machine's and the commands' attributes. Theia then uses them to match a command to its container.
I'm OK with skipping the alias for the time being, but it would be good to bring the comment up to date.

@amisevsk (Collaborator, Author) replied:

Yes, I'm fine with updating this -- it was on my TODO list for Friday but I didn't get to it. I recognize that alias is used for dockerimage components, but I could not find it being used for plugin components. I need to look into it more deeply.

@sleshchenko (Member) left a review comment:

servers[endpoint.Name] = v1alpha1.CheWorkspaceServer{
	Attributes: endpoint.Attributes,
	Status:     v1alpha1.RunningServerStatus, // TODO: This is just set so the circles are green
	URL:        fmt.Sprintf("%s://%s", protocol, endpoint.Url),
Member commented:

It leads to a URL with a double protocol, like http://https://workspace383a324431b54d9f-theia-proxy-4400.apps-crc.testing.

Something like amisevsk@84dd613 could be done to solve this issue.

@amisevsk (Collaborator, Author) replied:

This is an artifact of me implementing Josh's request for protocol in status -- I forgot to remove it here. It doesn't work properly anyway, since the protocol is always http even for https routes. We should have the protocol included in the URL everywhere now, so checking is unnecessary.

@amisevsk (Collaborator, Author) replied:

Ah except when it doesn't:

ws://http://workspace82ba65b9c2b84595-che-machine-exec-4444.apps-crc.testing

Member commented:

So, now we get https URL even when protocol is ws, right?
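A defensive normalization along these lines would avoid stacked schemes regardless of what is already stored in the endpoint URL (a sketch of the idea, not the actual fix in amisevsk@84dd613):

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeURL strips any existing scheme(s) from the endpoint URL before
// prefixing the desired protocol, so stacked results like
// "ws://http://host" cannot occur even if a scheme was stored earlier.
func normalizeURL(protocol, url string) string {
	for {
		trimmed := url
		for _, scheme := range []string{"https://", "http://", "wss://", "ws://"} {
			trimmed = strings.TrimPrefix(trimmed, scheme)
		}
		if trimmed == url {
			break // no scheme left to strip
		}
		url = trimmed
	}
	return protocol + "://" + url
}

func main() {
	// Reproduces the che-machine-exec case from the comment above.
	fmt.Println(normalizeURL("ws", "http://workspace82ba65b9c2b84595-che-machine-exec-4444.apps-crc.testing"))
}
```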


@amisevsk (Collaborator, Author) commented on Apr 7, 2020:

Tested again briefly on crc, and I suspect there may be a regression somewhere. With the java-mysql workspace I'm getting frequent issues opening a terminal.

},
},
Strategy: appsv1.DeploymentStrategy{
Type: "RollingUpdate",
Member commented:

Maybe it's better to start with Recreate, since workspaces will fail to update if a PVC is used (at least on a production-ready cluster).

apiVersion: workspace.che.eclipse.org/v1alpha1
kind: Workspace
metadata:
name: cloud-shell
Member commented:

Is it really cloud-shell? Will the controller propagate a default editor?

@amisevsk (Collaborator, Author) replied:

My bad, these got added half-accidentally as I threw them together for testing. I'll update the names since they could still be useful for test cases.

// Valid workspace Statuses
const (
	WorkspaceStatusStarting WorkspacePhase = "Starting"
	WorkspaceStatusReady    WorkspacePhase = "Ready"
Member commented:

Previously it was Running, right?
I don't have a strong preference, but just a reminder that the OpenShift Console depends on the Running state, and we need to inform them after the PR is merged.

@amisevsk (Collaborator, Author) replied:

You're right, great catch -- I mixed up the workspace phase and the workspaceRouting phase. Fixed in PR #46

@amisevsk marked this pull request as ready for review on April 7, 2020, 13:46.
This commit represents an almost-complete overhaul of the main
reconcile loop in the workspace controller.

A high level overview of changes:

- All controllers do full reconciling of all objects they are
  responsible for; this means deleting a route or the workspace
  deployment means it will be recreated
  - The main workspace controller now watches all resources it creates
    to trigger reconciles
- The main reconcile loop is split into phases with subcontrollers; it
  only progresses based on status of earlier steps (i.e. if components
  aren't ready, we don't try to create routing)
- All service/ingress/route creation is delegated to WorkspaceRouting
- The openshift-oauth routingClass results in the openshift oauth-proxy
  container running in the main workspace deployment
- There's a cleaner separation between elements in `pkg/controller` --
  no imports across controllers (i.e. WorkspaceRouting imports nothing
  from Workspace)
  - All shared structs are extracted to `apis` folder
- One service is created for all workspace endpoints (except
  discoverable endpoints)
- Add Component subcontroller that converts devfile components into
  k8s objects

A design doc and more detailed history for these changes is found at
https://github.com/amisevsk/che-workspace-operator-rework

Signed-off-by: Angel Misevski <amisevsk@redhat.com>
Signed-off-by: Angel Misevski <amisevsk@redhat.com>
It's currently difficult to support aliases for chePlugin and cheEditor
components, since the component name no longer matches the container
that is created. This means that we cannot match endpoints to plugins
when they have an alias that is different from container name.

Signed-off-by: Angel Misevski <amisevsk@redhat.com>
JPinkney and others added 8 commits on April 7, 2020:
Signed-off-by: Sergii Leshchenko <sleshche@redhat.com>
Signed-off-by: Sergii Leshchenko <sleshche@redhat.com>
Signed-off-by: Sergii Leshchenko <sleshche@redhat.com>
- Name secret used for oauth proxy with workspaceId to prevent name
conflicts when running more than one workspace
- Add workspace namespace to SAR request with oauth proxy
- Use bearer token if webhooks are enabled
- Set default webhooks enabled to false

Signed-off-by: Angel Misevski <amisevsk@redhat.com>
- Avoid errors when workspace doesn't contain any dockerimage (or
plugin/editor) components
- Fix double protocol for URLs in runtime annotation
- Fix import alias naming

Signed-off-by: Angel Misevski <amisevsk@redhat.com>
Make will try to avoid launching a shell to execute simple commands;
this breaks for shell builtins, such as 'command'.

Signed-off-by: Angel Misevski <amisevsk@redhat.com>
Signed-off-by: Angel Misevski <amisevsk@redhat.com>
@sleshchenko (Member) left a comment:

Tested cloud-shell with the openshift/basic default routing classes.
Basically works fine.

There is an issue with mixing http and https when we run a Theia workspace, but we agreed to address that separately. Then we can probably do a more precise review.

I faced issues in corner cases, such as updating default_routing_class when there is an already-started workspace, but it's not critical at all and doesn't work on master either. We can address it separately as well.

Feel free to merge as soon as you make sure CloudShell works for you as well.

BTW Great job! 💪 😎 👍

Add dependency to relevant rules in makefile to print current env var
settings, to avoid accidentally reverting changes in e.g. the configmap

Signed-off-by: Angel Misevski <amisevsk@redhat.com>
@JPinkney (Contributor) left a comment:

Giving this a +1 as well; I've been working off of this branch for a few days now and everything has been working on my side.


@amisevsk (Collaborator, Author) commented on Apr 7, 2020:

Tested again, and the issues seem to be related to crc being somewhat unresponsive. Cloud shell works with both basic routing and OAuth + webhooks.

I'm merging this PR to unblock things; we can address edge cases in separate issues.


@amisevsk (Collaborator, Author) commented on Apr 7, 2020:

@sleshchenko I missed your last comments before merging; they are addressed in PR #46

JPinkney pushed a commit to JPinkney/che-workspace-operator that referenced this pull request May 13, 2021
…g of GO flags to match DWO (devfile#36)

* use go mod tidy; go mod vendor and go build -mod=vendor -x since we need that downstream

Change-Id: Id3611db0a7aa81db6df97d7d5e9a7048fb30b197
Signed-off-by: nickboldt <nboldt@redhat.com>

* revert to use mod download, then copy sources, THEN mod vendor; no need for -x flag

Change-Id: Id5598026473544abdf6b082c9aa7fbff617353bb
Signed-off-by: nickboldt <nboldt@redhat.com>

* seems like upstream doesn't like vendoring so we can keep this to midstream

Change-Id: If858e058b712e16b218f546324c46e703115dee8
Signed-off-by: nickboldt <nboldt@redhat.com>
@amisevsk deleted the rework-reconcile-loop branch on February 8, 2023.