Rework reconcile loop for workspace controller #36
Conversation
Tried this out and ran into a few things:
Tested with webhooks disabled and everything was working 👍 In terms of the code, it LGTM, but I don't have that much experience in Go/Kubernetes/the operator world yet. I'm going to take another look through it all again tomorrow though.
I've seen this before, but I think usually it's due to starting with webhooks disabled and then enabling them (then the webhook would add the annotation but not have a user for it)? I'm not sure exactly what causes it, but I could not reproduce it with a fresh deployment.
I've added http(s) back into the workspace URL :) -- I didn't add it initially because clicking the URL never works in my terminal for some reason. While looking into including http(s) in the URL status, I came across a few issues:
Force-pushed from dde776d to 93fdedb
log.Info("Setting up webhooks") | ||
if err := webhook.SetUpWebhooks(mgr, ctx); err != nil { | ||
log.Error(err, "unable to register webhooks to the manager") | ||
os.Exit(1) | ||
} | ||
|
// TODO: Required to filter GVK for metrics, since we add routes and templates to the scheme.
// TODO: see: https://github.com/operator-framework/operator-sdk/pull/2606
//if err = serveCRMetrics(cfg); err != nil {
Not sure I understand why it's commented out, but does it make sense to create a separate issue to uncomment it and eventually make CR metrics work?
It was initially commented out to resolve a log-spamming issue Josh found while porting webhooks to the forked repo, and I didn't have time to delve deeply enough to figure out the full problem. I agree that it should be removed and an issue created, though.
@@ -5,24 +5,24 @@ metadata:
 spec:
   additionalPrinterColumns:
   - JSONPath: .status.workspaceId
-    name: Id
+    description: The workspace's unique id
+    name: Workspace ID
It's not related to this PR, but still an interesting topic to raise:
Does a Workspace ID make sense, given that the workspace is a CR? Do you think we can reuse the UID?
Some time ago I tried to remove it; the workspace ID is now the word "workspace" plus a truncated version of the UID, and it's used for generating object names. But I wonder if it makes sense in general and whether we can drop it.
One reason for leaving it in is that we use the workspace ID (which is just the object's UID, modified) in various places and as a label on all created objects. We could use the UID directly, but parsing the UID into a UUID can error, so it would complicate using a unique ID the way we do. We would also still have to store it to pass to subcomponents, since their UID is different from the main workspace UID.
Another option would be using the CR's name as a base, so everything created for cloud-shell would be prefixed by cloud-shell.
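For illustration, a minimal sketch of how such an ID could be derived from the UID (the helper name and truncation length are assumptions, not the operator's actual code):

package workspace

import (
	"strings"

	"k8s.io/apimachinery/pkg/types"
)

// workspaceIdFromUID is a hypothetical helper: it derives a short, label-safe
// workspace ID from the CR's UID by stripping dashes and truncating, avoiding
// the need to parse the UID as a UUID.
func workspaceIdFromUID(uid types.UID) string {
	cleaned := strings.ReplaceAll(string(uid), "-", "")
	if len(cleaned) > 16 {
		cleaned = cleaned[:16]
	}
	return "workspace" + cleaned
}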
	},
},
Spec: routeV1.RouteSpec{
	Host: hostname,
The main issue is that hostname must be set on Ingresses; it's done here mostly because the route code is a port of the ingress code. We should drop it here, but the PR will need to be reworked to create only Routes (and not Ingresses) when running on OpenShift.
Also, I don't have a minishift instance handy for testing, but I wonder how it would work there, since we currently rely on using $(minishift ip).nip.io.
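For context, a rough sketch of a Route that omits Host so the OpenShift router generates one; the helper and its parameters are illustrative, not the code in this PR:

package solvers

import (
	routeV1 "github.com/openshift/api/route/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// getRouteForService is a hypothetical sketch: unlike an Ingress, a Route can
// leave Host empty and let the router fill in <name>-<namespace>.<apps domain>.
func getRouteForService(name, namespace, serviceName string, port int32) routeV1.Route {
	return routeV1.Route{
		ObjectMeta: metav1.ObjectMeta{
			Name:      name,
			Namespace: namespace,
		},
		Spec: routeV1.RouteSpec{
			// Host intentionally omitted: the OpenShift router generates it.
			To: routeV1.RouteTargetReference{
				Kind: "Service",
				Name: serviceName,
			},
			Port: &routeV1.RoutePort{
				TargetPort: intstr.FromInt(int(port)),
			},
		},
	}
}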
Force-pushed from bfe2779 to 923c1cc
@sleshchenko Commit 22ab54d should fix your issue with
 },
 {
 	Name:  "CHE_WORKSPACE_NAMESPACE",
-	Value: wkspCtx.Namespace,
+	Value: namespace,
 },
As a quick solution for CloudShell with OpenShiftOAuth:
},
{
	Name:  "USE_BEARER_TOKEN",
	Value: config.ControllerCfg.GetWebhooksEnabled(),
},
As a slightly better solution, I wonder if the OpenShift OAuth routing could contribute the env var to all containers via PodAdditions... But then we have another question: the SA with exec rights is not needed anymore, so who should contribute it?
I'm OK with trying to solve this in the scope of a dedicated issue.
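As a rough sketch of that idea (the function and parameter names are assumptions; whether PodAdditions exposes its containers this way is exactly the open question above):

package solvers

import corev1 "k8s.io/api/core/v1"

// addBearerTokenEnv is a hypothetical helper: the openshift-oauth routing
// solver could call it on the containers it contributes via PodAdditions so
// every workspace container sees USE_BEARER_TOKEN.
func addBearerTokenEnv(containers []corev1.Container, webhooksEnabled string) []corev1.Container {
	for idx := range containers {
		containers[idx].Env = append(containers[idx].Env, corev1.EnvVar{
			Name:  "USE_BEARER_TOKEN",
			Value: webhooksEnabled,
		})
	}
	return containers
}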
proxyServices := getServicesForEndpoints(proxyPorts, workspaceMeta)
for idx := range proxyServices {
	proxyServices[idx].Annotations = map[string]string{
		"service.alpha.openshift.io/serving-cert-secret-name": "proxy-tls",
It seems to be just ported from master, but since it's an easy fix, let's fix running multiple workspaces in the same namespace:
"service.alpha.openshift.io/serving-cert-secret-name": "proxy-tls", | |
"service.alpha.openshift.io/serving-cert-secret-name": "proxy-tls-" + workspaceMeta.WorkspaceId, |
P.S. As you can see, two workspaces do not work with a hard-coded secret name.
I assume the same secret is not supposed to be used for multiple services; the service name can be set in the certificate as a DNS name... and it's generated once.
It makes me think: maybe we shouldn't set this annotation for all services, only for the openshift-oauth one.
If we really need TLS inside for all endpoints, we must add the endpoint name into the secret name as well.
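A hypothetical sketch of that naming scheme (the helper and parameter names are illustrative):

package solvers

import "fmt"

// servingCertSecretName is a hypothetical helper: the serving-cert secret is
// unique per workspace, and per endpoint if every endpoint needs its own TLS.
func servingCertSecretName(workspaceId, endpointName string) string {
	if endpointName == "" {
		return fmt.Sprintf("proxy-tls-%s", workspaceId)
	}
	return fmt.Sprintf("proxy-tls-%s-%s", workspaceId, endpointName)
}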
P.S. The controller does not seem to update this field after reconciling; not sure whether that's expected, since the CR was not modified -- I've just updated the controller. We should keep it in mind, since it might lead to unexpected inconsistency after an update.
P.P.S. Well, it seems like a bug, since the deployment is updated but not the service.
Ah, good catch -- we currently ignore ObjectMeta when checking if a service needs to be updated. I'll fix this.
Looking into it a bit, this is maybe a task for another issue -- we can't naively check that annotations match, since some are added by the cluster, and I don't want to hard-code checking service.alpha.openshift.io/serving-cert-secret-name in the generic case without further discussion.
For now, this annotation should not change (unless you roll out a new operator that changes the semantics).
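For reference, a minimal sketch of the kind of check that would avoid false diffs (the helper name is hypothetical): compare only the annotations the operator itself sets, and ignore anything the cluster adds.

package workspace

// managedAnnotationsMatch is a hypothetical helper: it checks only the
// annotations the operator manages, so cluster-added annotations (e.g. from
// the service CA operator) don't trigger spurious service updates.
func managedAnnotationsMatch(desired, actual map[string]string) bool {
	for key, wanted := range desired {
		if actual[key] != wanted {
			return false
		}
	}
	return true
}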
if err != nil {
	return nil, nil, err
}
// TODO: Alias for plugins seems to be ignored in regular Che
Not really. From che.openshift.io:
{
"namespace": "sleshche",
"temporary": false,
"id": "workspace6wt7edpromg4w9w0",
"status": "RUNNING",
"runtime": {
"machines": {
"nodejs": {
"attributes": {
"component": "nodejs",
"memoryRequestBytes": "209715200",
"memoryLimitBytes": "1073741824",
"source": "recipe",
"cpuLimitCores": "2.0",
"cpuRequestCores": "0.125"
}
}
},
"commands": [
{
"commandLine": "yarn run lint",
"name": "lint",
"attributes": {
"componentAlias": "nodejs",
"machineName": "nodejs",
"workingDir": "${CHE_PROJECTS_ROOT}/angular-realworld-example-app"
},
"type": "exec"
}
]
},
"devfile": {
"metadata": {
"name": "wksp-sek5"
},
"components": [
{
"mountSources": true,
"endpoints": [
{
"name": "angular",
"port": 4200
}
],
"memoryLimit": "1Gi",
"type": "dockerimage",
"alias": "nodejs",
"image": "quay.io/eclipse/che-nodejs10-community:nightly"
}
],
"apiVersion": "1.0.0",
"commands": [
{
"name": "lint",
"actions": [
{
"workdir": "${CHE_PROJECTS_ROOT}/angular-realworld-example-app",
"type": "exec",
"command": "yarn run lint",
"component": "nodejs"
}
]
}
]
}
}
So, the alias is propagated to the machine's and the commands' attributes. Theia then uses them to match a command to a container.
I'm OK with skipping the alias for the time being, but it would be good to bring the comment up to date.
Yes, I'm fine with updating this -- it was on my TODO list for Friday but I didn't get to it. I recognize that alias is used for dockerimage components, but I could not find it being used for plugin components. I need to look into it more deeply.
Please take a look at my proposal/fixes: https://github.com/amisevsk/che-workspace-crd-operator/compare/rework-reconcile-loop...sleshchenko:rework-reconcile-loop-2?expand=1
servers[endpoint.Name] = v1alpha1.CheWorkspaceServer{
	Attributes: endpoint.Attributes,
	Status:     v1alpha1.RunningServerStatus, // TODO: This is just set so the circles are green
	URL:        fmt.Sprintf("%s://%s", protocol, endpoint.Url),
It leads to a URL with a double protocol, like "http://https://workspace383a324431b54d9f-theia-proxy-4400.apps-crc.testing".
Something like amisevsk@84dd613 could be done to solve this issue.
This is an artifact of me implementing Josh's request for protocol in the status -- I forgot to remove it here. It doesn't work properly anyway, since the protocol is always http even for https routes. We should have the protocol included in the URL everywhere now, so the check is unnecessary.
Ah except when it doesn't:
ws://http://workspace82ba65b9c2b84595-che-machine-exec-4444.apps-crc.testing
So now we get an https URL even when the protocol is ws, right?
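One way to avoid the doubled scheme (a sketch under the assumption that the endpoint URL may or may not already carry a protocol; names are illustrative, and whether the endpoint's own protocol should replace an existing scheme, as in the ws case above, is a separate question):

package solvers

import (
	"fmt"
	"strings"
)

// endpointURL is a hypothetical helper: only prepend the protocol when the
// URL does not already carry a scheme, so we never produce
// "http://https://..." or "ws://http://...".
func endpointURL(protocol, url string) string {
	if strings.Contains(url, "://") {
		return url
	}
	return fmt.Sprintf("%s://%s", protocol, url)
}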
Tested again briefly on
	},
},
Strategy: appsv1.DeploymentStrategy{
	Type: "RollingUpdate",
Maybe it's better to start with Recreate, since workspaces will fail to update if a PVC is used (at least on a production-ready cluster).
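For reference, the change would look roughly like this (a sketch, assuming the deployment spec is built in Go as in the diff above):

package workspace

import appsv1 "k8s.io/api/apps/v1"

// recreateStrategy is a sketch: Recreate avoids the old and new pods competing
// for the same ReadWriteOnce PVC during an update, at the cost of brief downtime.
var recreateStrategy = appsv1.DeploymentStrategy{
	Type: appsv1.RecreateDeploymentStrategyType,
}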
apiVersion: workspace.che.eclipse.org/v1alpha1
kind: Workspace
metadata:
  name: cloud-shell
Is it really cloud-shell? Will the controller propagate a default editor?
My bad, these got added half-accidentally as I threw them together for testing. I'll update the names since they could still be useful for test cases.
// Valid workspace Statuses
const (
	WorkspaceStatusStarting WorkspacePhase = "Starting"
	WorkspaceStatusReady    WorkspacePhase = "Ready"
Previously it was Running, right?
I don't have strong preferences, but just a reminder that the OpenShift Console depends on the Running state, and we need to inform them after the PR is merged.
You're right, great catch -- I mixed up the workspace phase and the workspaceRouting phase. Fixed in PR #46
This commit represents an almost-complete overhaul of the main reconcile loop in the workspace controller. A high-level overview of changes:
- All controllers do full reconciling of all objects they are responsible for; this means deleting a route or the workspace deployment means it will be recreated
- The main workspace controller now watches all resources it creates to trigger reconciles
- The main reconcile loop is split into phases with subcontrollers; it only progresses based on status of earlier steps (i.e. if components aren't ready, we don't try to create routing)
- All service/ingress/route creation is delegated to WorkspaceRouting
- The openshift-oauth routingClass results in the openshift oauth-proxy container running in the main workspace deployment
- There's a cleaner separation between elements in `pkg/controller` -- no imports across controllers (i.e. WorkspaceRouting imports nothing from Workspace)
- All shared structs are extracted to the `apis` folder
- One service is created for all workspace endpoints (except discoverable endpoints)
- Add a Component subcontroller that converts devfile components into k8s objects
A design doc and more detailed history for these changes is found at https://github.com/amisevsk/che-workspace-operator-rework
Signed-off-by: Angel Misevski <amisevsk@redhat.com>
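As a rough illustration of the "watches all resources it creates" point, using the controller-runtime builder API (the actual controller may use lower-level Watch calls, and the Workspace import path below is illustrative):

package workspace

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	workspacev1alpha1 "github.com/che-incubator/che-workspace-operator/pkg/apis/workspace/v1alpha1" // illustrative path
)

// setupWithManager is a sketch: registering Owns() watches means any change to
// (or deletion of) an owned Deployment or Service re-queues the owning
// Workspace for reconciliation, so deleted objects get recreated.
func setupWithManager(mgr ctrl.Manager, r reconcile.Reconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&workspacev1alpha1.Workspace{}).
		Owns(&appsv1.Deployment{}).
		Owns(&corev1.Service{}).
		Complete(r)
}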
Signed-off-by: Angel Misevski <amisevsk@redhat.com>
It's currently difficult to support aliases for chePlugin and cheEditor components, since the component name no longer matches the container that is created. This means that we cannot match endpoints to plugins when they have an alias that is different from container name. Signed-off-by: Angel Misevski <amisevsk@redhat.com>
Signed-off-by: Sergii Leshchenko <sleshche@redhat.com>
Signed-off-by: Sergii Leshchenko <sleshche@redhat.com>
Signed-off-by: Sergii Leshchenko <sleshche@redhat.com>
- Name secret used for oauth proxy with workspaceId to prevent name conflicts when running more than one workspace
- Add workspace namespace to SAR request with oauth proxy
- Use bearer token if webhooks are enabled
- Set default webhooks enabled to false
Signed-off-by: Angel Misevski <amisevsk@redhat.com>
- Avoid errors when workspace doesn't contain any dockerimage (or plugin/editor) components
- Fix double protocol for URLs in runtime annotation
- Fix import alias naming
Signed-off-by: Angel Misevski <amisevsk@redhat.com>
Make will try to avoid launching a shell to execute simple commands; this breaks for shell builtins, such as 'command'. Signed-off-by: Angel Misevski <amisevsk@redhat.com>
Signed-off-by: Angel Misevski <amisevsk@redhat.com>
Force-pushed from d3df36b to 9e6f26b
Tested cloud-shell with the openshift/basic default routing class.
Basically, it works fine.
There is an issue with mixing http and https when we run a Theia workspace, but we agreed to address that separately. Then we can probably review it more precisely.
I faced issues in corner cases, e.g. updating default_routing_class when there is an already-started workspace, but it's not critical at all and doesn't work on master either. We can address it separately as well.
Feel free to merge as soon as you make sure CloudShell works for you as well.
BTW, great job! 💪 😎 👍
Add dependency to relevant rules in makefile to print current env var settings, to avoid accidentally reverting changes in e.g. the configmap Signed-off-by: Angel Misevski <amisevsk@redhat.com>
Giving this a +1 as well; I've been working off of this branch for a few days now and everything has been working on my side.
Tested again, and the issues seem to be related to
I'm merging this PR to unblock things; we can address edge cases in separate issues.
@sleshchenko I missed your last comments before merging; they are addressed in PR #46
…g of GO flags to match DWO (devfile#36)
* use go mod tidy; go mod vendor and go build -mod=vendor -x since we need that downstream
Change-Id: Id3611db0a7aa81db6df97d7d5e9a7048fb30b197
Signed-off-by: nickboldt <nboldt@redhat.com>
* revert to use mod download, then copy sources, THEN mod vendor; no need for -x flag
Change-Id: Id5598026473544abdf6b082c9aa7fbff617353bb
Signed-off-by: nickboldt <nboldt@redhat.com>
* seems like upstream doesn't like vendoring so we can keep this to midstream
Change-Id: If858e058b712e16b218f546324c46e703115dee8
Signed-off-by: nickboldt <nboldt@redhat.com>
What does this PR do?
Reworks the controller reconcile loop pretty thoroughly. For reviewing this PR, the diff will be nearly useless -- apart from a few copied sections (e.g. webhooks, some specs), the state of this PR represents a ground-up rewrite of most of the controller code.
A more detailed commit history is contained in repository https://github.com/amisevsk/che-workspace-operator-rework; see the design doc for a high-level overview.
The vast majority of the additions are because we store PodAdditions (a list of elements to be added to the workspace deployment, e.g. pods) in the status of subcomponents. This results in their full spec being included in the relevant CRDs. To make even looking at this PR slightly less daunting:
What issues does this PR fix or reference?
eclipse-che/che#16494
eclipse-che/che#15786
Supersedes PR #22
Is it tested? How?
Tested java-mysql and cloud shell with and without webhooks, oauth, on crc.