
[POC] How to configure monitoring/tracing with che CRD #15136

Closed
skabashnyuk opened this issue Nov 6, 2019 · 21 comments
Labels
area/che-operator Issues and PRs related to Eclipse Che Kubernetes Operator kind/task Internal things, technical debt, and to-do tasks to be performed. severity/P1 Has a major impact to usage or development of the system.

Comments

@skabashnyuk
Contributor

skabashnyuk commented Nov 6, 2019

Is your task related to a problem? Please describe.

How to configure monitoring/tracing with che CRD

Describe the solution you'd like

The goal of this task is to describe how we envision configuring different use cases with the Che CRD (https://github.com/eclipse/che-operator/blob/master/deploy/crds/org_v1_che_crd.yaml):

  1. Connect to an existing Prom/Jaeger/Grafana
  2. Connect to an existing Prom/Jaeger/Grafana (CR based)
  3. Install and connect Prom/Jaeger/Grafana (operator based)

Describe alternatives you've considered

n/a

Additional context

#15046

@skabashnyuk skabashnyuk added kind/task Internal things, technical debt, and to-do tasks to be performed. team/platform severity/P1 Has a major impact to usage or development of the system. labels Nov 6, 2019
@skabashnyuk skabashnyuk added this to the Backlog - Platform milestone Nov 6, 2019
@skabashnyuk skabashnyuk modified the milestones: Backlog - Platform, 7.5.0 Nov 7, 2019
@skabashnyuk skabashnyuk added this to In progress in Platform-2019-11-26 Nov 8, 2019
@skabashnyuk skabashnyuk assigned skabashnyuk and sparkoo and unassigned skabashnyuk Nov 8, 2019
@sparkoo
Member

sparkoo commented Nov 11, 2019

usecase 1 - already installed non-crd monitoring stack maintained by cluster admins

We can't do much here as we don't have control over already installed services and we do not know what to expect. We have a few options for helping admins with the configuration, but their usefulness is questionable.

Considering that all this is for an already installed monitoring stack maintained by the customer, I find "only" documenting how to configure Prometheus the best option for now. Even if we had access to the monitoring stack namespace, updating the config would be very tricky (they could have set it as a ConfigMap, a Secret, a file in the container, ...) and IMHO not worth the effort. I would suggest listening to customers with this setup and implementing it when a request comes. Until then, there are too many unknowns in the possible deployments and we would only be speculating.

@sparkoo
Member

sparkoo commented Nov 11, 2019

cc: @skabashnyuk

@sparkoo
Member

sparkoo commented Nov 11, 2019

Prometheus service discovery config would look like this:

- job_name: 'che'
  static_configs:
  - targets: ['che-host:8087']
- job_name: 'che-workspaces'
  kubernetes_sd_configs:
  - role: service
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_annotation_org_eclipse_che_machine_name]
    action: keep
    regex: theia-ide(.*)
  - source_labels: [__meta_kubernetes_service_labelpresent_che_workspace__id]
    action: keep
  - source_labels: [__meta_kubernetes_service_port_name]
    action: keep
    regex: server-3100

@sparkoo
Member

sparkoo commented Nov 11, 2019

usecase 2 - already installed monitoring stack with operators (CR based) maintained by cluster admins

Prometheus installed by prometheus-operator (https://github.com/coreos/prometheus-operator) uses its own way of managing Prometheus configuration with the ServiceMonitor CRD, because of the reasons mentioned in the previous comment (#15136 (comment)). It has its own config reloader that generates and reloads the Prometheus config from CRs. We can help here by creating ServiceMonitors for the discovery of our services. There are a few things that I have to investigate:

  • I had to add the ServiceMonitor CR into the namespace of prometheus-operator. Is it possible to create the ServiceMonitor in our namespace so that prometheus-operator can find it?
    • If not, we would need permissions to create ServiceMonitors in the monitoring namespace
    • Maybe the prometheus-operator will need permissions to our namespace
  • Is it possible to create a ServiceMonitor to discover workspaces? I didn't succeed yet, but I didn't try hard enough.

@sparkoo
Member

sparkoo commented Nov 12, 2019

Is it possible to create the ServiceMonitor in our namespace so that prometheus-operator can find it?

yes, the Prometheus CR has to define serviceMonitorNamespaceSelector. When set to {}, it watches all namespaces for ServiceMonitor CRs. Preferably, we can select the desired namespaces with a label selector (https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/). All our ServiceMonitor CRs can be in one namespace with all the other Che services, so we will have to set a label on that namespace (e.g. app: che) and then cluster admins will have to set on the Prometheus CR:

serviceMonitorNamespaceSelector:
    matchLabels:
      app: che
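
For illustration, a minimal sketch of what the admin-side configuration could look like, assuming the Che namespace is called eclipse-che and the Prometheus CR lives in a separate monitoring namespace (both names hypothetical):

# Assumption: the Che namespace is labeled first, e.g.
#   kubectl label namespace eclipse-che app=che
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: monitoring             # hypothetical monitoring namespace
spec:
  serviceAccountName: prometheus    # hypothetical SA that holds the discovery permissions
  serviceMonitorNamespaceSelector:
    matchLabels:
      app: che                      # pick up namespaces labeled app: che
  serviceMonitorSelector: {}        # select all ServiceMonitors in the matched namespaces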

TODO: what are the minimal permissions needed for ServiceMonitor discovery?

* Is it possible to create a ServiceMonitor to discover workspaces?

yes.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: workspaces-monitoring
spec:
  endpoints:
  - port: server-3100
    interval: 1s
  selector:
    matchExpressions:
    - key: che.workspace_id
      operator: Exists

This ServiceMonitor CR will find the workspace services. We can further limit that to certain namespaces. Unfortunately, there is no option to define wildcards; namespaces can be listed only with exact values, so it is useless if we have workspaces in various namespaces: https://github.com/coreos/prometheus-operator/blob/master/example/prometheus-operator-crd/servicemonitor.crd.yaml#L267
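If I read the CRD right, namespaceSelector also accepts any: true, which selects matching services in every namespace; that could be a workaround when workspaces are spread across namespaces. A sketch (not verified against a running operator):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: workspaces-monitoring
spec:
  namespaceSelector:
    any: true                 # look for matching services in all namespaces
  endpoints:
  - port: server-3100
    interval: 1s
  selector:
    matchExpressions:
    - key: che.workspace_id
      operator: Exists
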
For completeness, here is the ServiceMonitor CR for che-master:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: master-monitoring
spec:
  endpoints:
  - port: metrics
    interval: 1s
  namespaceSelector:
    matchNames:
    - che
  selector:
    matchLabels:
      app: che

@sparkoo
Member

sparkoo commented Nov 12, 2019

Important note: when using Prometheus deployed with prometheus-operator, the minimal permissions for service discovery seem to be:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: monitoring
rules:
- apiGroups: [""]
  resources:
  - endpoints
  - pods
  - services
  verbs: ["list", "watch"]

@sparkoo
Member

sparkoo commented Nov 13, 2019

Jaeger

jaeger-operator can be installed with

oc create project observability
oc create -f https://raw.githubusercontent.com/jaegertracing/jaeger-operator/master/deploy/crds/jaegertracing.io_jaegers_crd.yaml
oc create -f https://raw.githubusercontent.com/jaegertracing/jaeger-operator/master/deploy/service_account.yaml
oc create -f https://raw.githubusercontent.com/jaegertracing/jaeger-operator/master/deploy/role.yaml
oc create -f https://raw.githubusercontent.com/jaegertracing/jaeger-operator/master/deploy/role_binding.yaml
oc create -f https://raw.githubusercontent.com/jaegertracing/jaeger-operator/master/deploy/operator.yaml

Then this will deploy a simple Jaeger instance:

kubectl apply -f - <<EOF
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
EOF

To connect Che to a Jaeger instance, we need to know Jaeger's collector endpoint. For the 1st and 2nd use cases we should just enhance che-operator to enable tracing and set the URL to the Jaeger collector endpoint. We need to set the env values as documented here https://www.eclipse.org/che/docs/che-7/tracing-che/#enabling-che-metrics-collections_tracing-che with JAEGER_ENDPOINT=https://<jaeger-collector-service>.<jaeger-project>:14268/api/traces.
For the 3rd use case, we should install jaeger-operator and the Jaeger CR as described above into the project together with Che and configure the Che env variables in the same manner.
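
A rough sketch of those env variables on the che-server deployment; the JAEGER_* names are the standard Jaeger client variables and CHE_TRACING_ENABLED is the Che-side switch (double-check the exact names against the linked docs; the collector hostname/namespace are placeholders):

env:
- name: CHE_TRACING_ENABLED              # turn tracing on in che-server
  value: "true"
- name: JAEGER_SERVICE_NAME
  value: "che-server"
- name: JAEGER_ENDPOINT                  # <jaeger-name>-collector service in the Jaeger project
  value: "http://jaeger-collector.observability:14268/api/traces"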

@sparkoo
Member

sparkoo commented Nov 14, 2019

Grafana

grafana-operator has a GrafanaDashboard CRD (https://github.com/integr8ly/grafana-operator/blob/master/documentation/dashboards.md). We will create a GrafanaDashboard in the che namespace and let grafana-operator discover it. The dashboard will contain the full dashboard json (https://github.com/eclipse/che/blob/master/deploy/openshift/templates/monitoring/grafana-dashboards.yaml).

There is one open issue: the dashboard json has a datasource field (https://github.com/eclipse/che/blob/master/deploy/openshift/templates/monitoring/grafana-dashboards.yaml#L22). In the 2nd scenario we don't know it, as it's not managed by us. In the current Che templates, we're using some $(datasource) variable, but I'm not sure where it is set. <<- TODO

GrafanaDashboard CR:

apiVersion: integreatly.org/v1alpha1
kind: GrafanaDashboard
metadata:
  labels:
    app: grafana
  name: che-dashboard
  namespace: che
spec:
  name: che-dashboard.json
  json: |
    {
      ...
    }
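
For the operator-installed scenario we would also need a datasource pointing at our Prometheus. A sketch of what a GrafanaDataSource CR could look like, assuming grafana-operator's GrafanaDataSource CRD and a Prometheus reachable via the prometheus-operated service in the che namespace (the service name and URL are assumptions):

apiVersion: integreatly.org/v1alpha1
kind: GrafanaDataSource
metadata:
  name: che-prometheus-datasource
  namespace: che
spec:
  name: che-prometheus.yaml
  datasources:
  - name: che-prometheus           # referenced by the dashboard's datasource field
    type: prometheus
    access: proxy
    url: http://prometheus-operated.che:9090   # assumed service created by prometheus-operator
    isDefault: true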

@sparkoo
Member

sparkoo commented Nov 14, 2019

Summary/Proposal

So here's the summary and proposal of the solution.

Update the CheCluster CRD with the following options:

  • cheMonitoring # bool - enables monitoring endpoint of che server
  • externalMonitoring # bool - use external prometheus and grafana or deploy our own (prometheus-operator, grafana-operator)
  • cheTracing # bool - enable tracing on che server
  • externalTracing # bool - use external jaeger or deploy our own (jaeger-operator)
  • jaegerEndpoint # string - url to jaeger collector service

With this, we will be able to support all 3 scenarios.

1] Connect to existing Prom/Jaeger/Grafana

cheMonitoring: true
externalMonitoring: true
cheTracing: true
externalTracing: true
jaegerEndpoint: <jaeger-collector-servicename>.<jaeger-namespace>:14268/api/traces

che-operator:

will properly configure the Che monitoring endpoint and the connection to Jaeger. The rest is on documentation.

Documentation
  • How to configure Prometheus to scrape the che-master monitoring endpoint and to discover workspace services with Kubernetes service discovery
  • Provide our Grafana dashboard json

2] Connect to existing Prom/Jaeger/Grafana CR based

che-operator:

Configure the same as in 1]. The operator will detect the ServiceMonitor and GrafanaDashboard CRDs, create the proper CRs in the Che namespace, and leave the rest of the work to the already installed prometheus-operator and grafana-operator.

Documentation

How to configure prometheus-operator and grafana-operator to discover ServiceMonitor and GrafanaDashboard CRs in our namespace, including documenting the needed permissions.

3] Install and connect Prom/Jaeger/Grafana with operators

cheMonitoring: true
externalMonitoring: false
cheTracing: true
externalTracing: false
jaegerEndpoint: ''

che-operator:

With externalMonitoring: false, install prometheus-operator and grafana-operator (preferably via OperatorHub if possible) and create the proper CRs to instantiate Prometheus and Grafana instances in our namespace. Create ServiceMonitor and GrafanaDashboard CRs; the operators will discover them. Create a GrafanaDatasource CR to create the Prometheus datasource in Grafana.
With externalTracing: false, install jaeger-operator and create the proper CRs to instantiate a Jaeger instance. Configure the Che env variables to point to the jaeger-collector service.

Documentation

Everything should be installed, connected and working. Document only where to find the services.

What's next?

I'd like to take it in parts. I intentionally didn't write many details here, like all the custom resources and exact configurations, just the overall logic from the che-operator point of view and the proposed changes to the CheCluster CRD. If there are no objections, I will split the task #15137 into more issues and describe the details in the individual tasks. If we agree on this overall picture and will be discussing the technical details in more focused issues, I would close this one.

Thoughts?

cc: @skabashnyuk (please cc anyone who you think should read this)

@skabashnyuk
Contributor Author

skabashnyuk commented Nov 18, 2019

@sparkoo that is a great summary/proposal.
I would like to add a few adjustments to make it more generic.

General look and feel

metrics:
  enable: true (default false)
  prometheus:
    enable: true (default false)
    cr:
    operator:
      enable: true (default false)
      config: TBD
  serviceMonitor:
    enable: true (default false)
    cr:
  podMonitor:
    enable: true (default false)
    cr:
  alertmanager:
    enable: true (default false)
    cr:
  prometheusRule:
    enable: true (default false)
    cr:
  grafana:
    enable: true (default false)
    cr:
    operator:
      enable: true (default false)
      config: TBD
  grafanaDashboard:
    enable: true (default false)
    cr:
  grafanaDatasource:
    enable: true (default false)
    cr:
tracing:
  enable: true
  jaegerClientConfig:
    serviceName: "che-server"
    endpoint: "http://jaeger-collector:14268/api/traces"
    sampler:
      managerHostPort: "jaeger:5778"
      type: "const"
      param: "1"
    reporter:
      maxQueueSize: "10000"
  jaeger:
    cr:
    operator:
      enable: true (default false)
      config: TBD

Important parts

  1. You are able to enable or disable the whole tracing or monitoring in a single field:

metrics:
  enable: false
tracing:
  enable: false

  2. The Jaeger Java client https://github.com/jaegertracing/jaeger-client-java/tree/master/jaeger-core has to be configured on the che master side:

jaegerClientConfig:
  serviceName: "che-server"
  endpoint: "http://jaeger-collector:14268/api/traces"
  sampler:
    managerHostPort: "jaeger:5778"
    type: "const"
    param: "1"
  reporter:
    maxQueueSize: "10000"

  3. There is an ability to configure the CRs of the Prometheus, Grafana, and Jaeger operators:

metrics:
  prometheus:
    cr:
  serviceMonitor:
    cr:
  podMonitor:
    cr:
  alertmanager:
    cr:
  prometheusRule:
    cr:
  grafana:
    cr:
  grafanaDashboard:
    cr:
  grafanaDatasource:
    cr:
tracing:
  jaeger:
    cr:

  4. There is an ability to configure the operators themselves:

metrics:
  prometheus:
    operator:
      enable: true (default false)
      config: TBD
  grafana:
    operator:
      enable: true (default false)
      config: TBD
tracing:
  jaeger:
    operator:
      enable: true (default false)
      config: TBD

3 scenarios

1 Connect to existing Prom/Jaeger/Grafana

metrics:
 enable:true
tracing:
 enable:true
 jaegerClientConfig:
   serviceName:"che-server"
   ??? Do we need more here

2 Connect to existing Prom/Jaeger/Grafana CR based

Same behavior as described here #15136 (comment)

3 Install and connect Prom/Jaeger/Grafana with operators

This config explicitly defines that we want to enable metrics and tracing and install all operators with default settings. In case the CRDs are already registered, I think we should fail and explain to the user to go with use case 2.

metrics:
 enable:true
 prometheus:
   operator:
      enable:true
 grafana:
   operator:
     enable:true 
tracing:
 enable:true
 jaeger:
   operator:
     enable:true

WDYT @sparkoo @metlos @sleshchenko @mshaposhnik ?

@sparkoo
Member

sparkoo commented Nov 18, 2019

Yes, this is really good, a very generic approach.
Just a few issues/questions I can see after a first quick go-through:

  • does serviceMonitor.enable make sense? Isn't simply 'create the CR if defined' enough? What if I set enable: false and a CR is defined?
  • properties like ServiceMonitor or GrafanaDashboard will have to be arrays, as it must be possible to create more of them.
  • 'In case the CRDs are already registered...' - this is almost certain on OpenShift with its default cluster monitoring stack.
  • '??? Do we need more here' - for Jaeger, we definitely need the URL of the collector service

@skabashnyuk
Contributor Author

does serviceMonitor.enable make sense? Isn't simply 'create the CR if defined' enough? What if I set enable: false and a CR is defined?

Interesting POV. I must admit I have some doubts about the default value for *.cr.enable. I was thinking of making it true by default. The basic idea here is to be able to turn off something that is not needed/inappropriate in a given environment. Can we say that if some cr config is omitted, then it's enabled/wanted by default, but you can turn it off with *.cr.enable=false?

properties like ServiceMonitor or GrafanaDashboard will have to be arrays, as it must be possible to create more of them.

+1

'In case the CRDs are already registered...' - this is almost certain on OpenShift with its default cluster monitoring stack.

Ok, agreed. Then just install the operator and let's explain in the docs that users have to distinguish use case 2 and use case 3 by themselves.

'??? Do we need more here' - for Jaeger, we definitely need the URL of the collector service

I have some doubts here. According to https://github.com/jaegertracing/jaeger-client-java/blob/master/jaeger-core/README.md only JAEGER_SERVICE_NAME is mandatory.

@sparkoo
Member

sparkoo commented Nov 18, 2019

Can we say that if some cr config is omitted - then it's enabled/wanted by default?

Can you please closer specify what you mean by this?

properties like ServiceMonitor or GrafanaDashboard will have to be arrays, as it must be possible to create more of them.

+1

You can also have more Prometheus instances, so I guess we should make everything an array?

I'm wondering where to set the boundary between a generic and a simplistic solution. If we make it too generic, what value do we bring over letting admins provide their own stack?
Also, look for example at the devfile registry: devfileRegistryImage, devfileRegistryMemoryLimit, devfileRegistryMemoryRequest, devfileRegistryPullPolicy, devfileRegistryUrl, externalDevfileRegistry. It's not so generic that you can define the deployment of the registry.

TBH, I don't know which approach I like more. I think that my way is kind of hidden in your structure. So maybe I would define the minimal subset to fulfill the 3 scenarios we've set, and try to find the best structure to achieve that without closing the door to the future. I think that some mixture of both will arise from that. Maybe we should ask someone who is more involved in che-operator?

@sparkoo
Member

sparkoo commented Nov 18, 2019

Sorry for interrupting the discussion about the CRD spec, but here's another important piece of the puzzle.

Programmatically install an operator from OperatorHub

Nice OpenShift documentation is here: https://docs.openshift.com/container-platform/4.2/operators/olm-adding-operators-to-cluster.html#olm-installing-operator-from-operatorhub-using-cli_olm-adding-operators-to-a-cluster
All operators in an OpenShift 4.2 cluster are in the namespace openshift-marketplace and can be listed with oc get packagemanifests -n openshift-marketplace. To see the details of one concrete operator, use oc describe packagemanifests <operator_name> -n openshift-marketplace. To install an operator from OperatorHub on an OpenShift 4.2 cluster, we need to create two objects:

OperatorGroup - this must be present in the namespace where we want to install our operator. Only one OperatorGroup is needed per namespace.

apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: <operatorgroup_name>
  namespace: <namespace>
spec:
  targetNamespaces:
  - <namespace>

Subscription - this will start installation of particular operator from Operator Hub:

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: <operator_name>
  namespace: <namespace>
spec:
  channel: <operator_channel>
  name: <operator_name> 
  source: <operator_source>
  sourceNamespace: openshift-marketplace 

We need to set a few things here:
<operator_name> - name of the Subscription object; does not affect functionality
<namespace> - namespace where to install the operator
<operator_channel> - update channel of the operator
<operator_name> - name of the operator we want to install, as listed in openshift-marketplace
<operator_source> - CatalogSource of the installed operator

example

Let's say we want to install Prometheus operator to eclipse-che namespace. We need to have OperatorGroup:

apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: che-operatorgroup
  namespace: eclipse-che
spec:
  targetNamespaces:
  - eclipse-che

Now we need to find information about the prometheus operator:

# try to find `prometheus` name
[~] λ oc get packagemanifests -n openshift-marketplace | grep prometheus
prometheus                                   Community Operators   19d

# find channel
[~] λ oc describe packagemanifests prometheus -n openshift-marketplace | grep "Default Channel"
  Default Channel:  beta

# find catalog source
[~] λ oc describe packagemanifests prometheus -n openshift-marketplace | grep "Catalog Source:"
  Catalog Source:               community-operators

So the resulting Subscription would look like this:

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: che-prometheus
  namespace: eclipse-che
spec:
  channel: beta
  name: prometheus
  source: community-operators
  sourceNamespace: openshift-marketplace
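
Once the Subscription is created, the install can be verified (a sketch; the exact CSV name depends on the resolved operator version):

# the Subscription should resolve to an InstallPlan and a ClusterServiceVersion (CSV)
oc get subscription che-prometheus -n eclipse-che
oc get csv -n eclipse-che
# the CSV phase should eventually become "Succeeded"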

Caveats

If we want to support only OpenShift 4.2, we can do it with this. However, it's probably evolving (OpenShift 4.1 looks a bit different: https://docs.openshift.com/container-platform/4.1/applications/operators/olm-adding-operators-to-cluster.html#olm-installing-operator-from-operatorhub-using-cli_olm-adding-operators-to-a-cluster), and what about 3.11 and k8s? We might need some backup way to install it without OperatorHub and support only the latest OpenShift version. Even with that, maintenance might be a non-trivial amount of work, depending on how OpenShift evolves.

@skabashnyuk
Contributor Author

Can you please closer specify what you mean by this?

I mean that these configurations would be equivalent. In both cases the user expressed the intention to have everything except alertmanager:

metrics:
  enable: true
  prometheus:
    enable: true
  serviceMonitor:
    enable: true
  alertmanager:
    enable: false (default true)

metrics:
  enable: true
  alertmanager:
    enable: false (default true)

You can also have more Prometheus instances, so I guess we should make everything an array?

Changing the content of a CR is an exceptional case when a user wants to override the default behavior and knows what they are doing.

Yes something like that:

metrics:
  prometheus:
    cr:
    - content: |
    - content: |
  serviceMonitor:
    cr:
    - content: |
    - content: |
  podMonitor:
    cr:
    - content: |
  alertmanager:
    cr:
    - content: |
  prometheusRule:
    cr:
    - content: |
  grafana:
    cr:
    - content: |
  grafanaDashboard:
    cr:
    - content: |
  grafanaDatasource:
    cr:
    - content: |
tracing:
  jaeger:
    cr:
    - content: |

@sparkoo
Member

sparkoo commented Nov 20, 2019

Can you please closer specify what you mean by this?

I mean that these configurations would be equivalent. In both cases the user expressed the intention to have everything except alertmanager

I'm ok with that. Once metrics.enable: true, then deploy Prometheus and Grafana, connect them together, configure service discovery of the Che master and workspace services, and create the Grafana dashboards as defined here https://github.com/eclipse/che/blob/master/deploy/openshift/templates/monitoring/grafana-dashboards.yaml
I don't think it makes sense to bother with the alertmanager for now if we're not using it. And I'm not sure about pod monitors either. TBH I'm not sure what the use case for them is.

You can have also more Prometheus instances, so I guess we should make all arrays?

Changing the content of CR is an exceptional case when a user wants to override the default behavior and he knows what is he doing.

Can we leave it for later then? Simply avoid all the cr properties for now. We're not closing the door for the future and we will narrow the scope of this. Basically everything will have just a bool enable + Jaeger will have its jaegerClientConfig.

@skabashnyuk
Contributor Author

Can we leave it for later then? Simply avoid all the cr properties for now. We're not closing the door for the future and we will narrow the scope of this. Basically everything will have just a bool enable + Jaeger will have its jaegerClientConfig.

Yes, sure. We can postpone that until the time we need it. So the minimal config will look like:

metrics:
 enable:true (default false)
tracing:
 enable:true (default false)
 jaegerClientConfig: (optional)
   serviceName:"che-server"
   endpoint: "http://jaeger-collector:14268/api/traces"
   sampler:
      managerHostPort: "jaeger:5778"
      type:"const"
      param:"1"
   reporter:
      maxQueueSize: "10000"

+ we have a plan for the future for how to customize the operators and CRs.
@sparkoo WDYT?

@sparkoo
Member

sparkoo commented Nov 21, 2019

@skabashnyuk with that, we're not fulfilling the monitoring-with-external-stack scenarios. At a minimum, we would need metrics.prometheus.enable and metrics.grafana.enable. I'm kind of ok to skip it for now, but it isn't much extra work, so I'm not sure it's worth leaving it for the future, when we will be somewhere else with our focus.

I'd like to take it in parts. I think that the implementation steps I've suggested (#15046 (comment)) are still valid and make sense, as we've basically renamed the properties I came up with. I'll update the individual tasks to reflect this.

@skabashnyuk skabashnyuk added the area/che-operator Issues and PRs related to Eclipse Che Kubernetes Operator label Nov 21, 2019
@sparkoo
Member

sparkoo commented Nov 21, 2019

Proposal v2 (with installing 3rd party operators)

Here is the 4-phase plan, further split into several steps, that I think would make sense to follow. Each step is a building block for the next one, but still adds value and makes sense on its own.

Phase 1

Goal: make it possible to monitor Che master and workspaces with external and installed Prometheus & Grafana

  1. enable metrics to be able to monitor with external stack
    metrics:
     enable:true # default false
    
  2. When PrometheusServiceMonitor or GrafanaDashboard CRDs defined in the cluster, create CRs
    metrics:
      enable: true
      prometheus-servicemonitor:
    	enable: true # default false
      grafana-dashboard:
    	enable: true # default false
    
  3. Install Prometheus (prometheus-operator, Prometheus CR)
    metrics:
      enable: true
      prometheus-operator:
    	enable: true # default false
      prometheus:
    	enable: true # default false
    
  4. install Grafana (grafana-operator, Grafana CR, GrafanaDatasource CR)
    metrics:
      enable: true
      grafana-operator:
    	enable: true # default false
      grafana:
    	enable: true # default false
      grafana-datasource:
    	enable: true # default false
    
  5. Install Prometheus and Grafana by default if metrics.enable:true. Basically set metrics.*.enable to default true. After this, we get into metrics state proposed here [POC] How to configure monitoring/tracing with che CRD #15136 (comment)

Phase 2

Goal: make it possible to trace Che with external and installed Jaeger

  1. enable tracing to be able to trace with external Jaeger
    tracing:
     enable:true # default false
     jaegerClientConfig:	# optional
       serviceName:"che-server"
       endpoint: "http://jaeger-collector:14268/api/traces"	# endpoint is mandatory for external jaeger
       sampler:
    	  managerHostPort: "jaeger:5778"
    	  type:"const"
    	  param:"1"
       reporter:
    	  maxQueueSize: "10000"
    
  2. install Jaeger (jaeger-operator)
    tracing:
     enable: true
     jaeger-operator:
      enable: true # default false
    
  3. install Jaeger by default if tracing.enable: true. basically set tracing.jaeger-operator.enable to default true. After this, we get into tracing state proposed here [POC] How to configure monitoring/tracing with che CRD #15136 (comment)

Phase 3

Goal: make it possible to fine-tune the stack with customizable CR definitions

final monitoring/tracing part of the CheCluster CRD:
metrics:
  enable:

  prometheus-operator:
	enable:
  prometheus:
	enable:
	cr:
	- content: |
  prometheus-servicemonitor:
	enable:
	cr:
	- content: |
  prometheus-podmonitor:
	enable:
	cr:
	- content: |
  prometheus-alertmanager:
	enable:
	cr:
	- content: |
  prometheus-rule:
	enable:
	cr:
	- content: |

  grafana-operator:
	enable:
  grafana:
	enable:
	cr:
	- content: |
  grafana-dashboard:
	enable:
	cr:
	- content: |
  grafana-datasource:
	enable:
	cr:
	- content: |
tracing:
  enable:
  jaegerClientConfig:
	serviceName:
	endpoint:
	sampler:
		managerHostPort:
		type:
		param:
	reporter:
		maxQueueSize:
  jaeger-operator:
	enable:
  jaeger:
	enable:
	cr:
	- content: |

Phase 4

Goal: fine tune operators deployment (Deployment, ServiceAccount, Role, ClusterRole, ...)

@sparkoo
Member

sparkoo commented Nov 22, 2019

Proposal v3

Update of the previous proposal after today's team call. The main idea of this update is that we don't want to install other operators, because then we would have to support them.

Here is the 4-phase plan, further split into several steps, that I think would make sense to follow. Each step is a building block for the next one, but still adds value and makes sense on its own.

Phase 1

Goal: make it possible to monitor Che master and workspaces with external and "installed" Prometheus & Grafana. We're leaving the installation/maintenance of prometheus-operator and grafana-operator to the user. We're only creating the CRs managed by the operators.

  1. Enable metrics to be able to monitor with an external stack

metrics:
  enable: true # default false

  2. Create Prometheus's ServiceMonitor and Grafana's GrafanaDashboard when the CRDs are defined (the operator can detect the CRDs as sketched after this list).

metrics:
  enable: true
  prometheusServicemonitor:
    enable: true # default false
  grafanaDashboard:
    enable: true # default false

  3. Instantiate Prometheus when the Prometheus CRD is defined. This will create a Prometheus instance only if prometheus-operator is installed.

metrics:
  enable: true
  prometheus:
    enable: true # default false

  4. Instantiate Grafana when the Grafana CRD is defined. This will create a Grafana instance only if grafana-operator is installed.

metrics:
  enable: true
  grafana:
    enable: true # default false
  grafanaDatasource:
    enable: true # default false

  5. Create all CRs by default. Basically set metrics.*.enable to default true. There will still be a "global" metrics.enable with default false.
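
A sketch of how the presence of those CRDs could be checked; the CRD names are the ones the upstream operators register, so treat them as assumptions to verify:

# monitoring CRDs registered by prometheus-operator
kubectl get crd prometheuses.monitoring.coreos.com servicemonitors.monitoring.coreos.com
# dashboard/datasource CRDs registered by grafana-operator (integr8ly)
kubectl get crd grafanas.integreatly.org grafanadashboards.integreatly.org grafanadatasources.integreatly.org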

Phase 2

Goal: make it possible to trace Che with external and "installed" Jaeger. We're leaving the installation/maintenance of jaeger-operator to the user. We're only creating the operator's CRs.

  1. Enable tracing to be able to trace with an external Jaeger

tracing:
  enable: true # default false
  jaegerClientConfig: # optional
    serviceName: "che-server"
    endpoint: "http://jaeger-collector:14268/api/traces" # endpoint is mandatory for external jaeger
    sampler:
      managerHostPort: "jaeger:5778"
      type: "const"
      param: "1"
    reporter:
      maxQueueSize: "10000"

  2. Instantiate Jaeger when the CRD is defined. This will create a Jaeger instance only if jaeger-operator is installed.

tracing:
  enable: true
  jaeger:
    enable: true # default true

Phase 3

Goal: make it possible to fine-tune the stack with customizable CR definitions. By default, we will enable all options that we use and have reasonable default values.

final monitoring/tracing part of the CheCluster CRD:
metrics:
  enable: # false

  prometheus:
    enable: # true
    cr:
    - content: |
  prometheusServicemonitor:
    enable: # true
    cr:
    - content: |
  prometheusPodmonitor:
    enable: # false
    cr:
    - content: |
  prometheusAlertmanager:
    enable: # false
    cr:
    - content: |
  prometheusRule:
    enable: # false
    cr:
    - content: |

  grafana:
    enable: # true
    cr:
    - content: |
  grafanaDashboard:
    enable: # true
    cr:
    - content: |
  grafanaDatasource:
    enable: # true
    cr:
    - content: |


tracing:
  enable: # false

  jaegerClientConfig:
    serviceName:
    endpoint:
    sampler:
      managerHostPort:
      type:
      param:
    reporter:
      maxQueueSize:
  jaeger:
    enable: # true
    cr:
    - content: |
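
For illustration, a hypothetical filled-in example of the cr override mechanism sketched above, embedding one of the ServiceMonitor manifests from earlier in this thread as a string:

metrics:
  enable: true
  prometheusServicemonitor:
    enable: true
    cr:
    - content: |
        apiVersion: monitoring.coreos.com/v1
        kind: ServiceMonitor
        metadata:
          name: workspaces-monitoring
        spec:
          endpoints:
          - port: server-3100
            interval: 10s          # example of a value tuned by the user
          selector:
            matchExpressions:
            - key: che.workspace_id
              operator: Exists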

Phase 4 (can be switched with Phase 3 based on priorities)

Goal: Make it possible to install [grafana|prometheus|jaeger]-operator as a development/unsupported option. May be done by chectl behind some flag.

@sparkoo
Member

sparkoo commented Nov 26, 2019

Proposal v3 is final in the scope of this task. The remaining questions will be solved in individual issues. Closing.
