[POC] How to configure monitoring/tracing with che CRD #15136
usecase 1 - already installed non-CRD monitoring stack maintained by cluster admins
We can't do much here as we don't have control over already installed services and we don't know what to expect. We have a few options for how to help admins with configuration, but the usefulness of these is questionable.
Considering that all this is for an already installed monitoring stack maintained by the customer, I find "only" documenting how to configure Prometheus the best option for now. Even if we had access to the monitoring stack namespace, updating the config would be very tricky (they could have set it as a ConfigMap, a Secret, a file in the container, ...) and IMHO not worth the effort. I would suggest listening to customers with this setup and implementing it when a request comes. Until then, there are too many unknowns in possible deployments and we would only be speculating.
cc: @skabashnyuk
Prometheus service discovery config would look like this:
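The actual snippet is not preserved here, but as a rough sketch (the scrape annotation and label names below are illustrative assumptions, not taken from the actual setup):

```yaml
# Hypothetical Prometheus scrape config that discovers Che services
# through the Kubernetes API. Annotation names are illustrative.
scrape_configs:
- job_name: 'che-services'
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  # Keep only endpoints whose service is annotated prometheus.io/scrape=true.
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    action: keep
    regex: "true"
  # Record namespace and service name as labels for querying.
  - source_labels: [__meta_kubernetes_namespace]
    target_label: namespace
  - source_labels: [__meta_kubernetes_service_name]
    target_label: service
```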
usecase 2 - already installed monitoring stack with operators (CR based) maintained by cluster admins
Prometheus installed by
yes, the Prometheus CR has to have defined
TODO: what are the minimal permissions needed for
yes.
this ServiceMonitor CR will find the workspace service. We can further limit that to certain namespaces. Unfortunately, there is no option to define wildcards. Namespaces can only be listed with exact values, so it is useless if we have workspaces in various namespaces: https://github.com/coreos/prometheus-operator/blob/master/example/prometheus-operator-crd/servicemonitor.crd.yaml#L267
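For illustration, a ServiceMonitor of roughly this shape could match workspace services; the label key, port name, and the `team: che` selector label are assumptions:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: che-workspaces
  labels:
    team: che                  # must match the Prometheus CR's serviceMonitorSelector
spec:
  selector:
    matchExpressions:
    - key: che.workspace_id    # hypothetical label present on workspace services
      operator: Exists
  namespaceSelector:
    any: true                  # no wildcards: either 'any' or exact 'matchNames'
  endpoints:
  - port: metrics              # assumed name of the metrics port
```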
Important note: when using Prometheus deployed with prometheus-operator, the minimal permissions for service discovery seem to be:
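For reference, a sketch of a minimal ClusterRole for plain Kubernetes service discovery (the exact resources and verbs may vary by Prometheus version):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-sd-minimal   # hypothetical name
rules:
- apiGroups: [""]
  resources: ["services", "endpoints", "pods"]
  verbs: ["get", "list", "watch"]
```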
Jaeger
jaeger-operator can be installed with
Then this will deploy a simple Jaeger instance.
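With jaeger-operator's defaults, such a simple instance can be described by a nearly empty CR (the name is arbitrary; the deployment strategy defaults to all-in-one):

```yaml
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: simple-jaeger   # allInOne strategy by default
```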
To connect Che to the Jaeger instance, we need to know Jaeger's collector endpoint. For use-cases 1 and 2 we should just enhance che-operator to enable tracing and set the url to the Jaeger collector endpoint. We need to set the env values as documented here: https://www.eclipse.org/che/docs/che-7/tracing-che/#enabling-che-metrics-collections_tracing-che with
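A sketch of the corresponding che-server container env (the values are examples matching the sample config in this thread, not verified defaults):

```yaml
# Illustrative env section of the che-server deployment; the variable
# names follow the jaeger-client conventions used by the Che tracing docs.
- name: CHE_TRACING_ENABLED
  value: "true"
- name: JAEGER_SERVICE_NAME
  value: "che-server"
- name: JAEGER_ENDPOINT
  value: "http://jaeger-collector:14268/api/traces"
- name: JAEGER_SAMPLER_TYPE
  value: "const"
- name: JAEGER_SAMPLER_PARAM
  value: "1"
```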
Grafana
grafana-operator has
There is one open issue. The dashboard JSON has
Summary/Proposal
So here's the summary and proposal of the solution. Update the CheServer CRD with the following options:
With this, we will be able to support all 3 scenarios.
1] Connect to existing Prom/Jaeger/Grafana
che-operator: will properly configure the Che monitoring endpoint and the connection to Jaeger. The rest is on documentation.
Documentation
2] Connect to existing Prom/Jaeger/Grafana CR based
che-operator: Configure same as in 1]. The operator will detect
Documentation
How to configure prometheus-operator and grafana-operator to be able to discover
3] Install and connect Prom/Jaeger/Grafana with operators
che-operator: With
Documentation
Everything should be installed, connected and working. Document only where to find the services.
What's next?
I'd like to take it in parts. I intentionally didn't write many details here, like all the custom resources and exact configurations. Just the overall logic from the che-operator point of view and proposed changes to
Thoughts? cc: @skabashnyuk (please cc anyone who you think should read this)
@sparko that is a great summary/proposal.
General look and feel

```yaml
metrics:
  enable: true # default false
  prometheus:
    enable: true # default false
    cr:
    operator:
      enable: true # default false
      config: TBD
    serviceMonitor:
      enable: true # default false
      cr:
    podMonitor:
      enable: true # default false
      cr:
    alertmanager:
      enable: true # default false
      cr:
    prometheusRule:
      enable: true # default false
      cr:
  grafana:
    enable: true # default false
    cr:
    operator:
      enable: true # default false
      config: TBD
    grafanaDashboard:
      enable: true # default false
      cr:
    grafanaDatasource:
      enable: true # default false
      cr:
tracing:
  enable: true
  jaegerClientConfig:
    serviceName: "che-server"
    endpoint: "http://jaeger-collector:14268/api/traces"
    sampler:
      managerHostPort: "jaeger:5778"
      type: "const"
      param: "1"
    reporter:
      maxQueueSize: "10000"
  jaeger:
    cr:
    operator:
      enable: true # default false
      config: TBD
```

Important parts

```yaml
metrics:
  enable: false
tracing:
  enable: false
  jaegerClientConfig:
    serviceName: "che-server"
    endpoint: "http://jaeger-collector:14268/api/traces"
    sampler:
      managerHostPort: "jaeger:5778"
      type: "const"
      param: "1"
    reporter:
      maxQueueSize: "10000"
```

```yaml
metrics:
  prometheus:
    cr:
    serviceMonitor:
      cr:
    podMonitor:
      cr:
    alertmanager:
      cr:
    prometheusRule:
      cr:
  grafana:
    cr:
    grafanaDashboard:
      cr:
    grafanaDatasource:
      cr:
tracing:
  jaeger:
    cr:
```

```yaml
metrics:
  prometheus:
    operator:
      enable: true # default false
      config: TBD
  grafana:
    operator:
      enable: true # default false
      config: TBD
tracing:
  jaeger:
    operator:
      enable: true # default false
      config: TBD
```

3 scenarios
1 Connect to existing Prom/Jaeger/Grafana

```yaml
metrics:
  enable: true
tracing:
  enable: true
  jaegerClientConfig:
    serviceName: "che-server"
```

??? Do we need more here
2 Connect to existing Prom/Jaeger/Grafana CR based
Same behavior as described here #15136 (comment)
3 Install and connect Prom/Jaeger/Grafana with operators
This config explicitly defines that we want to enable metrics and tracing and install all the operators with default settings. In case the CRDs are already registered, I think we should fail and explain to the user to go with use-case 2.

```yaml
metrics:
  enable: true
  prometheus:
    operator:
      enable: true
  grafana:
    operator:
      enable: true
tracing:
  enable: true
  jaeger:
    operator:
      enable: true
```

WDYT @sparkoo @metlos @sleshchenko @mshaposhnik ?
Yes, this is a really good, very generic approach.
Interesting POV. I must admit I have some doubts about the default value for
+1
Ok, agreed. Then just install the operator, and let's explain in the docs that users have to distinguish between use-case 2 and use-case 3 by themselves.
I have some doubts here. According to https://github.com/jaegertracing/jaeger-client-java/blob/master/jaeger-core/README.md only
Can you please specify more closely what you mean by this?
You can also have more Prometheus instances, so I guess we should make these all arrays? I'm wondering where to set the boundary between a generic and a simplistic solution. If we make it too generic, what value do we bring instead of letting admins provide their own stack? TBH, I don't know which approach I like more. I think that my way is kind of hidden in your structure. So maybe I would define the minimal subset to fulfill the 3 scenarios we've set, try to find the best structure to achieve that, and not close the door to the future. I think that some mixture of both will arise from that. Maybe we should ask someone who is more involved in che-operator?
Sorry for interrupting the discussion about the CRD spec, but here's another important piece of the puzzle.
Programmatically install an operator from OperatorHub
Nice OpenShift documentation is here: https://docs.openshift.com/container-platform/4.2/operators/olm-adding-operators-to-cluster.html#olm-installing-operator-from-operatorhub-using-cli_olm-adding-operators-to-a-cluster
We need to set a few things here:
Example
Let's say we want to install the Prometheus operator to
Now we need to find information about the Prometheus operator:
So resulting
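As a sketch of the OLM objects involved; the namespace, channel, and catalog source names are assumptions that would have to be read from the operator's PackageManifest:

```yaml
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: prometheus-operatorgroup
  namespace: che-monitoring       # hypothetical target namespace
spec:
  targetNamespaces:
  - che-monitoring
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: prometheus
  namespace: che-monitoring
spec:
  name: prometheus                # package name in the catalog
  channel: beta                   # assumed channel, check the PackageManifest
  source: community-operators     # assumed catalog source
  sourceNamespace: openshift-marketplace
```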
Caveats
If we want to support only OpenShift 4.2, we can do it with this. However, it's probably evolving (the OpenShift 4.1 docs look a bit different: https://docs.openshift.com/container-platform/4.1/applications/operators/olm-adding-operators-to-cluster.html#olm-installing-operator-from-operatorhub-using-cli_olm-adding-operators-to-a-cluster), and what about 3.11 and k8s? We might need some backup way to install it without OperatorHub and support only the latest OpenShift version. Even with that, maintenance might be a non-trivial amount of work, depending on how OpenShift evolves.
I mean that these configurations would be equivalent. In both cases the user expressed the intention to have everything except

```yaml
metrics:
  enable: true
  prometheus:
    enable: true
    serviceMonitor:
      enable: true
    alertmanager:
      enable: false # default true
```

```yaml
metrics:
  enable: true
  alertmanager:
    enable: false # default true
```
Changing the content of a CR is an exceptional case when a user wants to override the default behavior and he knows what he is doing. Yes, something like that:

```yaml
metrics:
  prometheus:
    cr:
    - content: |
    - content: |
    serviceMonitor:
      cr:
      - content: |
      - content: |
    podMonitor:
      cr:
      - content: |
    alertmanager:
      cr:
      - content: |
    prometheusRule:
      cr:
      - content: |
  grafana:
    cr:
    - content: |
    grafanaDashboard:
      cr:
      - content: |
    grafanaDatasource:
      cr:
      - content: |
tracing:
  jaeger:
    cr:
    - content: |
```
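For illustration, such a `content` entry could embed a full CR inline; the Prometheus CR below is hypothetical:

```yaml
metrics:
  prometheus:
    cr:
    - content: |
        apiVersion: monitoring.coreos.com/v1
        kind: Prometheus
        metadata:
          name: che-prometheus
        spec:
          serviceMonitorSelector:
            matchLabels:
              team: che   # hypothetical selector label
```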
I'm ok with that. Once
Can we leave it for later then? Simply avoiding all
Yes, sure. We can postpone that until the time we need it. So the minimal config will look like:

```yaml
metrics:
  enable: true # default false
tracing:
  enable: true # default false
  jaegerClientConfig: # optional
    serviceName: "che-server"
    endpoint: "http://jaeger-collector:14268/api/traces"
    sampler:
      managerHostPort: "jaeger:5778"
      type: "const"
      param: "1"
    reporter:
      maxQueueSize: "10000"
```
@skabashnyuk with that, we're not fulfilling the monitoring-with-external-stack scenarios. At minimum, we would need
I'd like to take it in parts. I think the implementation steps I've suggested (#15046 (comment)) are still valid and make sense, as we've basically renamed the properties I've come up with. I'll update the individual tasks to reflect this.
Proposal v2 (with installing 3rd party operators)
Here is the 4-phase plan, further split into several steps, that I think would make sense to follow. Each step is a building block for the next one, but still adds value and makes sense on its own.
Phase 1
Goal: make it possible to monitor Che master and workspaces with external and installed Prometheus & Grafana
Phase 2
Goal: make it possible to trace Che with external and installed Jaeger
Phase 3
Goal: make it possible to fine-tune the stack with customizable CR definitions
Final monitoring/tracing part of the CheCluster CRD
Phase 4
Goal: fine-tune the operators' deployment (Deployment, ServiceAccount, Role, ClusterRole, ...)
Proposal v3
Update of the previous proposal after today's team call. The main idea of this update is that we don't want to install other operators, because then we would have to support them. Here is the 4-phase plan, further split into several steps, that I think would make sense to follow. Each step is a building block for the next one, but still adds value and makes sense on its own.
Phase 1
Goal: make it possible to monitor Che master and workspaces with external and "installed" Prometheus & Grafana. We're leaving the installation/maintenance of prometheus-operator and grafana-operator to the user. We're only creating CRs managed by the operators.
Phase 2
Goal: make it possible to trace Che with external and "installed" Jaeger. We're leaving the installation/maintenance of jaeger-operator to the user. We're only creating the operator's CRs.
Phase 3
Goal: make it possible to fine-tune the stack with customizable CR definitions. By default, we will enable all options that we use and have reasonable default values.
Final monitoring/tracing part of the CheCluster CRD
Phase 4 (can be switched with Phase 3 based on priorities)
Goal: make it possible to install [grafana|prometheus|jaeger]-operator as a development/unsupported option. May be done by chectl behind some flag.
Proposal v3 is final in the scope of this task. Remaining questions will be solved in individual issues. Closing.
Is your task related to a problem? Please describe.
How to configure monitoring/tracing with che CRD
Describe the solution you'd like
The goal of this task is to describe how we envision configuring the different use cases with the Che CRD (https://github.com/eclipse/che-operator/blob/master/deploy/crds/org_v1_che_crd.yaml)
Describe alternatives you've considered
n/a
Additional context
#15046