
[POC] How to configure monitoring/tracing with che CRD #15136

Closed
skabashnyuk opened this issue Nov 6, 2019 · 21 comments
Labels
area/che-operator Issues and PRs related to Eclipse Che Kubernetes Operator kind/task Internal things, technical debt, and to-do tasks to be performed. severity/P1 Has a major impact to usage or development of the system.

Comments

@skabashnyuk
Contributor

skabashnyuk commented Nov 6, 2019

Is your task related to a problem? Please describe.

How to configure monitoring/tracing with che CRD

Describe the solution you'd like

The goal of this task is to describe how we envision configuring different use cases with the Che CRD (https://github.com/eclipse/che-operator/blob/master/deploy/crds/org_v1_che_crd.yaml):

  1. Connect to an existing Prom/Jaeger/Grafana
  2. Connect to an existing Prom/Jaeger/Grafana (CR based)
  3. Install and connect Prom/Jaeger/Grafana (operator based)

Describe alternatives you've considered

n/a

Additional context

#15046

@skabashnyuk skabashnyuk added kind/task Internal things, technical debt, and to-do tasks to be performed. team/platform severity/P1 Has a major impact to usage or development of the system. labels Nov 6, 2019
@skabashnyuk skabashnyuk added this to the Backlog - Platform milestone Nov 6, 2019
@skabashnyuk skabashnyuk modified the milestones: Backlog - Platform, 7.5.0 Nov 7, 2019
@skabashnyuk skabashnyuk added this to In progress in Platform-2019-11-26 Nov 8, 2019
@skabashnyuk skabashnyuk assigned skabashnyuk and sparkoo and unassigned skabashnyuk Nov 8, 2019
@sparkoo
Member

sparkoo commented Nov 11, 2019

usecase 1 - already installed non-crd monitoring stack maintained by cluster admins

We can't do much here as we don't have control over already installed services and we do not know what to expect. We have a few options for helping admins with the configuration, but their usefulness is questionable.

Considering that all this is for an already installed monitoring stack maintained by the customer, I find "only" documenting how to configure Prometheus the best option for now. Even if we had access to the monitoring stack namespace, updating the config would be very tricky (they could have set it as a ConfigMap, a Secret, a file in the container, ...) and IMHO not worth the effort. I would suggest listening to customers with this setup and implementing it when a request comes. Until then, there are too many unknowns in the possible deployments and we would only be speculating.

@sparkoo
Member

sparkoo commented Nov 11, 2019

cc: @skabashnyuk

@sparkoo
Member

sparkoo commented Nov 11, 2019

Prometheus service discovery config would look like this:

- job_name: 'che'
  static_configs:
  - targets: ['che-host:8087']
- job_name: 'che-workspaces'
  kubernetes_sd_configs:
  - role: service
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_annotation_org_eclipse_che_machine_name]
    action: keep
    regex: theia-ide(.*)
  - source_labels: [__meta_kubernetes_service_labelpresent_che_workspace__id]
    action: keep
  - source_labels: [__meta_kubernetes_service_port_name]
    action: keep
    regex: server-3100

@sparkoo
Member

sparkoo commented Nov 11, 2019

usecase 2 - already installed monitoring stack with operators (CR based) maintained by cluster admins

Prometheus installed by prometheus-operator (https://github.com/coreos/prometheus-operator) uses its own way of managing Prometheus configuration with the ServiceMonitor CRD, because of the reasons mentioned in the previous comment (#15136 (comment)). It has its own config reloader that generates and reloads the Prometheus config from CRs. We can help here by creating ServiceMonitors for the discovery of our services. There are a few things that I have to investigate:

  • I had to add the ServiceMonitor CR into the namespace of prometheus-operator. Is it possible to create the ServiceMonitor in our namespace so that prometheus-operator can find it?
    • If not, we would need permissions to create ServiceMonitors in the monitoring namespace
    • Maybe the prometheus-operator will need permissions to our namespace
  • Is it possible to create a ServiceMonitor to discover workspaces? I didn't succeed yet, but I didn't try hard enough.

@sparkoo
Member

sparkoo commented Nov 12, 2019

Is it possible to create the ServiceMonitor in our namespace so that prometheus-operator can find it?

yes, the Prometheus CR has to define serviceMonitorNamespaceSelector. When set to {}, it watches all namespaces for ServiceMonitor CRs. Preferably, we can select the desired namespaces with a label selector (https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/). All our ServiceMonitor CRs can be in one namespace with all the other Che services, so we will have to set a label on that namespace (e.g. app: che) and then cluster admins will have to set on the Prometheus CR:

serviceMonitorNamespaceSelector:
    matchLabels:
      app: che
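
For illustration, a minimal sketch of what the admin-side configuration could look like, assuming the Che namespace is called eclipse-che and the Prometheus CR lives in a separate monitoring namespace (both names hypothetical):

# Assumption: the Che namespace is labeled first, e.g.
#   kubectl label namespace eclipse-che app=che
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: monitoring             # hypothetical monitoring namespace
spec:
  serviceAccountName: prometheus    # hypothetical SA that holds the discovery permissions
  serviceMonitorNamespaceSelector:
    matchLabels:
      app: che                      # pick up namespaces labeled app: che
  serviceMonitorSelector: {}        # select all ServiceMonitors in the matched namespaces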

TODO: what are the minimal permissions needed for ServiceMonitor discovery?

* Is it possible to create a ServiceMonitor to discover workspaces?

yes.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: workspaces-monitoring
spec:
  endpoints:
  - port: server-3100
    interval: 1s
  selector:
    matchExpressions:
    - key: che.workspace_id
      operator: Exists

This ServiceMonitor CR will find the workspace services. We can further limit that to certain namespaces. Unfortunately, there is no option to define wildcards; namespaces can be listed only with exact values, so it is useless if we have workspaces in various namespaces: https://github.com/coreos/prometheus-operator/blob/master/example/prometheus-operator-crd/servicemonitor.crd.yaml#L267
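If I read the CRD right, namespaceSelector also accepts any: true, which selects matching services in every namespace; that could be a workaround when workspaces are spread across namespaces. A sketch (not verified against a running operator):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: workspaces-monitoring
spec:
  namespaceSelector:
    any: true                 # look for matching services in all namespaces
  endpoints:
  - port: server-3100
    interval: 1s
  selector:
    matchExpressions:
    - key: che.workspace_id
      operator: Exists
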
For completeness, here is the ServiceMonitor CR for che-master:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: master-monitoring
spec:
  endpoints:
  - port: metrics
    interval: 1s
  namespaceSelector:
    matchNames:
    - che
  selector:
    matchLabels:
      app: che

@sparkoo
Member

sparkoo commented Nov 12, 2019

Important note: when using Prometheus deployed with prometheus-operator, the minimal permissions for service discovery seem to be:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: monitoring
rules:
- apiGroups: [""]
  resources:
  - endpoints
  - pods
  - services
  verbs: ["list", "watch"]

@sparkoo
Member

sparkoo commented Nov 13, 2019

Jaeger

jaeger-operator can be installed with

oc create project observability
oc create -f https://raw.githubusercontent.com/jaegertracing/jaeger-operator/master/deploy/crds/jaegertracing.io_jaegers_crd.yaml
oc create -f https://raw.githubusercontent.com/jaegertracing/jaeger-operator/master/deploy/service_account.yaml
oc create -f https://raw.githubusercontent.com/jaegertracing/jaeger-operator/master/deploy/role.yaml
oc create -f https://raw.githubusercontent.com/jaegertracing/jaeger-operator/master/deploy/role_binding.yaml
oc create -f https://raw.githubusercontent.com/jaegertracing/jaeger-operator/master/deploy/operator.yaml

Then this will deploy a simple Jaeger instance:

kubectl apply -f - <<EOF
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
EOF

To connect Che to a Jaeger instance, we need to know Jaeger's collector endpoint. For the 1st and 2nd use cases we should just enhance che-operator to enable tracing and set the URL to the Jaeger collector endpoint. We need to set the env values as documented here https://www.eclipse.org/che/docs/che-7/tracing-che/#enabling-che-metrics-collections_tracing-che with JAEGER_ENDPOINT=https://<jaeger-collector-service>.<jaeger-project>:14268/api/traces.
For the 3rd use case, we should install jaeger-operator and the Jaeger CR as described above into the project together with Che and configure the Che env variables in the same manner.
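
A rough sketch of those env variables on the che-server deployment; the JAEGER_* names are the standard Jaeger client variables and CHE_TRACING_ENABLED is the Che-side switch (double-check the exact names against the linked docs; the collector hostname/namespace are placeholders):

env:
- name: CHE_TRACING_ENABLED              # turn tracing on in che-server
  value: "true"
- name: JAEGER_SERVICE_NAME
  value: "che-server"
- name: JAEGER_ENDPOINT                  # <jaeger-name>-collector service in the Jaeger project
  value: "http://jaeger-collector.observability:14268/api/traces"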

@sparkoo
Member

sparkoo commented Nov 14, 2019

Grafana

grafana-operator has a GrafanaDashboard CRD (https://github.com/integr8ly/grafana-operator/blob/master/documentation/dashboards.md). We will create a GrafanaDashboard in the che namespace and let grafana-operator discover it. The dashboard will contain the full dashboard json (https://github.com/eclipse/che/blob/master/deploy/openshift/templates/monitoring/grafana-dashboards.yaml).

There is one open issue: the dashboard json has a datasource field (https://github.com/eclipse/che/blob/master/deploy/openshift/templates/monitoring/grafana-dashboards.yaml#L22). In the 2nd scenario we don't know it, as it's not managed by us. In the current Che templates, we're using some $(datasource) variable, but I'm not sure where it is set. <<- TODO

GrafanaDashboard CR:

apiVersion: integreatly.org/v1alpha1
kind: GrafanaDashboard
metadata:
  labels:
    app: grafana
  name: che-dashboard
  namespace: che
spec:
  name: che-dashboard.json
  json: |
    {
      ...
    }
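
For the operator-installed scenario we would also need a datasource pointing at our Prometheus. A sketch of what a GrafanaDataSource CR could look like, assuming grafana-operator's GrafanaDataSource CRD and a Prometheus reachable via the prometheus-operated service in the che namespace (the service name and URL are assumptions):

apiVersion: integreatly.org/v1alpha1
kind: GrafanaDataSource
metadata:
  name: che-prometheus-datasource
  namespace: che
spec:
  name: che-prometheus.yaml
  datasources:
  - name: che-prometheus           # referenced by the dashboard's datasource field
    type: prometheus
    access: proxy
    url: http://prometheus-operated.che:9090   # assumed service created by prometheus-operator
    isDefault: true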

@sparkoo
Member

sparkoo commented Nov 14, 2019

Summary/Proposal

So here's the summary and proposal of the solution.

Update the CheCluster CRD with the following options:

  • cheMonitoring # bool - enables monitoring endpoint of che server
  • externalMonitoring # bool - use external prometheus and grafana or deploy our own (prometheus-operator, grafana-operator)
  • cheTracing # bool - enable tracing on che server
  • externalTracing # bool - use external jaeger or deploy our own (jaeger-operator)
  • jaegerEndpoint # string - url to jaeger collector service

With this, we will be able to support all 3 scenarios.

1] Connect to existing Prom/Jaeger/Grafana

cheMonitoring: true
externalMonitoring: true
cheTracing: true
externalTracing: true
jaegerEndpoint: <jaeger-collector-servicename>.<jaeger-namespace>:14268/api/traces

che-operator:

will properly configure the Che monitoring endpoint and the connection to Jaeger. The rest is on documentation.

Documentation
  • How to configure Prometheus to scrape the che-master monitoring endpoint and to discover workspace services with Kubernetes service discovery
  • Provide our Grafana dashboard json

2] Connect to existing Prom/Jaeger/Grafana CR based

che-operator:

Configure the same as in 1]. The operator will detect the ServiceMonitor and GrafanaDashboard CRDs, create the proper CRs in the Che namespace, and leave the rest of the work to the already installed prometheus-operator and grafana-operator.

Documentation

How to configure prometheus-operator and grafana-operator to discover ServiceMonitor and GrafanaDashboard CRs in our namespace, including documenting the needed permissions.

3] Install and connect Prom/Jaeger/Grafana with operators

cheMonitoring: true
externalMonitoring: false
cheTracing: true
externalTracing: false
jaegerEndpoint: ''

che-operator:

With externalMonitoring: false, install prometheus-operator and grafana-operator (preferably via OperatorHub if possible) and create the proper CRs to instantiate Prometheus and Grafana instances in our namespace. Create ServiceMonitor and GrafanaDashboard CRs; the operators will discover them. Create a GrafanaDatasource CR to create the Prometheus datasource in Grafana.
With externalTracing: false, install jaeger-operator and create the proper CRs to instantiate a Jaeger instance. Configure the Che env variables to point to the jaeger-collector service.

Documentation

Everything should be installed, connected and working. Document only where to find the services.

What's next?

I'd like to take it in parts. I intentionally didn't write many details here, like all the custom resources and exact configurations, just the overall logic from the che-operator point of view and the proposed changes to the CheCluster CRD. If there are no objections, I will split the task #15137 into more issues and describe the details in the individual tasks. If we agree on this overall picture and will be discussing the technical details in more focused issues, I would close this one.

Thoughts?

cc: @skabashnyuk (please cc anyone who you think should read this)

@skabashnyuk
Contributor Author

skabashnyuk commented Nov 18, 2019

@sparkoo that is a great summary/proposal.
I would like to add a few adjustments to make it more generic.

General look and feel

metrics:
  enable: true (default false)
  prometheus:
    enable: true (default false)
    cr:
    operator:
      enable: true (default false)
      config: TBD
  serviceMonitor:
    enable: true (default false)
    cr:
  podMonitor:
    enable: true (default false)
    cr:
  alertmanager:
    enable: true (default false)
    cr:
  prometheusRule:
    enable: true (default false)
    cr:
  grafana:
    enable: true (default false)
    cr:
    operator:
      enable: true (default false)
      config: TBD
  grafanaDashboard:
    enable: true (default false)
    cr:
  grafanaDatasource:
    enable: true (default false)
    cr:
tracing:
  enable: true
  jaegerClientConfig:
    serviceName: "che-server"
    endpoint: "http://jaeger-collector:14268/api/traces"
    sampler:
      managerHostPort: "jaeger:5778"
      type: "const"
      param: "1"
    reporter:
      maxQueueSize: "10000"
  jaeger:
    cr:
    operator:
      enable: true (default false)
      config: TBD

Important parts

  1. You are able to enable or disable the whole tracing or monitoring in a single field:

metrics:
  enable: false
tracing:
  enable: false

  2. The Jaeger Java client https://github.com/jaegertracing/jaeger-client-java/tree/master/jaeger-core has to be configured on the che master side:

jaegerClientConfig:
  serviceName: "che-server"
  endpoint: "http://jaeger-collector:14268/api/traces"
  sampler:
    managerHostPort: "jaeger:5778"
    type: "const"
    param: "1"
  reporter:
    maxQueueSize: "10000"

  3. There is an ability to configure the CRs of the Prometheus, Grafana, and Jaeger operators:

metrics:
  prometheus:
    cr:
  serviceMonitor:
    cr:
  podMonitor:
    cr:
  alertmanager:
    cr:
  prometheusRule:
    cr:
  grafana:
    cr:
  grafanaDashboard:
    cr:
  grafanaDatasource:
    cr:
tracing:
  jaeger:
    cr:

  4. There is an ability to configure the operators themselves:

metrics:
  prometheus:
    operator:
      enable: true (default false)
      config: TBD
  grafana:
    operator:
      enable: true (default false)
      config: TBD
tracing:
  jaeger:
    operator:
      enable: true (default false)
      config: TBD

3 scenarios

1 Connect to existing Prom/Jaeger/Grafana

metrics:
 enable:true
tracing:
 enable:true
 jaegerClientConfig:
   serviceName:"che-server"
   ??? Do we need more here

2 Connect to existing Prom/Jaeger/Grafana CR based

Same behavior as described here #15136 (comment)

3 Install and connect Prom/Jaeger/Grafana with operators

This config explicitly defines that we want to enable metrics and tracing and install all operators with default settings. In case the CRDs are already registered, I think we should fail and explain to the user to go with use case 2.

metrics:
 enable:true
 prometheus:
   operator:
      enable:true
 grafana:
   operator:
     enable:true 
tracing:
 enable:true
 jaeger:
   operator:
     enable:true

WDYT @sparkoo @metlos @sleshchenko @mshaposhnik ?

@sparkoo
Member

sparkoo commented Nov 18, 2019

Yes, this is really good, a very generic approach.
Just a few issues/questions I can see after a first quick go-through:

  • does serviceMonitor.enable make sense? Isn't simply 'create the CR if defined' enough? What if I set enable: false and a CR is defined?
  • properties like ServiceMonitor or GrafanaDashboard will have to be arrays, as it must be possible to create more of them.
  • 'In case the CRDs are already registered...' - this is almost certain on OpenShift with its default cluster monitoring stack.
  • '??? Do we need more here' - for Jaeger, we definitely need the URL of the collector service

@skabashnyuk
Contributor Author

does serviceMonitor.enable make sense? Isn't simply 'create the CR if defined' enough? What if I set enable: false and a CR is defined?

Interesting POV. I must admit I have some doubts about the default value for *.cr.enable. I was thinking of making it true by default. The basic idea here is to be able to turn off something that is not needed/inappropriate in a given environment. Can we say that if some cr config is omitted, then it's enabled/wanted by default, but you can turn it off with *.cr.enable=false?

properties like ServiceMonitor or GrafanaDashboard will have to be arrays, as it must be possible to create more of them.

+1

'In case the CRDs are already registered...' - this is almost certain on OpenShift with its default cluster monitoring stack.

Ok, agreed. Then just install the operator and let's explain in the docs that users have to distinguish use case 2 and use case 3 by themselves.

'??? Do we need more here' - for Jaeger, we definitely need the URL of the collector service

I have some doubts here. According to https://github.com/jaegertracing/jaeger-client-java/blob/master/jaeger-core/README.md only JAEGER_SERVICE_NAME is mandatory.

@sparkoo
Member

sparkoo commented Nov 18, 2019

Can we say that if some cr config is omitted - then it's enabled/wanted by default?

Can you please closer specify what you mean by this?

properties like ServiceMonitor or GrafanaDashboard will have to be arrays, as it must be possible to create more of them.

+1

You can also have more Prometheus instances, so I guess we should make everything an array?

I'm wondering where to set the boundary between a generic and a simplistic solution. If we make it too generic, what value do we bring over letting admins provide their own stack?
Also, look for example at the devfile registry: devfileRegistryImage, devfileRegistryMemoryLimit, devfileRegistryMemoryRequest, devfileRegistryPullPolicy, devfileRegistryUrl, externalDevfileRegistry. It's not so generic that you can define the deployment of the registry.

TBH, I don't know which approach I like more. I think that my way is kind of hidden in your structure. So maybe I would define the minimal subset to fulfill the 3 scenarios we've set, and try to find the best structure to achieve that without closing the door to the future. I think that some mixture of both will arise from that. Maybe we should ask someone who is more involved in che-operator?

@sparkoo
Member

sparkoo commented Nov 18, 2019

Sorry for interrupting the discussion about the CRD spec, but here's another important piece of the puzzle.

Programmatically install an operator from OperatorHub

Nice OpenShift documentation is here: https://docs.openshift.com/container-platform/4.2/operators/olm-adding-operators-to-cluster.html#olm-installing-operator-from-operatorhub-using-cli_olm-adding-operators-to-a-cluster
All operators in an OpenShift 4.2 cluster are in the namespace openshift-marketplace and can be listed with oc get packagemanifests -n openshift-marketplace. To see the details of one concrete operator, use oc describe packagemanifests <operator_name> -n openshift-marketplace. To install an operator from OperatorHub on an OpenShift 4.2 cluster, we need to create two objects:

OperatorGroup - this must be present in the namespace where we want to install our operator. Only one OperatorGroup is needed per namespace.

apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: <operatorgroup_name>
  namespace: <namespace>
spec:
  targetNamespaces:
  - <namespace>

Subscription - this will start installation of particular operator from Operator Hub:

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: <operator_name>
  namespace: <namespace>
spec:
  channel: <operator_channel>
  name: <operator_name> 
  source: <operator_source>
  sourceNamespace: openshift-marketplace 

We need to set a few things here:
<operator_name> - name of the Subscription object; does not affect functionality
<namespace> - namespace where to install the operator
<operator_channel> - update channel of the operator
<operator_name> - name of the operator we want to install, as listed in openshift-marketplace
<operator_source> - CatalogSource of the installed operator

example

Let's say we want to install Prometheus operator to eclipse-che namespace. We need to have OperatorGroup:

apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: che-operatorgroup
  namespace: eclipse-che
spec:
  targetNamespaces:
  - eclipse-che

Now we need to find information about the prometheus operator:

# try to find `prometheus` name
[~] λ oc get packagemanifests -n openshift-marketplace | grep prometheus
prometheus                                   Community Operators   19d

# find channel
[~] λ oc describe packagemanifests prometheus -n openshift-marketplace | grep "Default Channel"
  Default Channel:  beta

# find catalog source
[~] λ oc describe packagemanifests prometheus -n openshift-marketplace | grep "Catalog Source:"
  Catalog Source:               community-operators

So the resulting Subscription would look like this:

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: che-prometheus
  namespace: eclipse-che
spec:
  channel: beta
  name: prometheus
  source: community-operators
  sourceNamespace: openshift-marketplace
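
Once the Subscription is created, the install can be verified (a sketch; the exact CSV name depends on the resolved operator version):

# the Subscription should resolve to an InstallPlan and a ClusterServiceVersion (CSV)
oc get subscription che-prometheus -n eclipse-che
oc get csv -n eclipse-che
# the CSV phase should eventually become "Succeeded"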

Caveats

If we want to support only OpenShift 4.2, we can do it with this. However, it's probably evolving (OpenShift 4.1 looks a bit different: https://docs.openshift.com/container-platform/4.1/applications/operators/olm-adding-operators-to-cluster.html#olm-installing-operator-from-operatorhub-using-cli_olm-adding-operators-to-a-cluster), and what about 3.11 and k8s? We might need some backup way to install it without OperatorHub and support only the latest OpenShift version. Even with that, maintenance might be a non-trivial amount of work, depending on how OpenShift evolves.

@skabashnyuk
Contributor Author

Can you please closer specify what you mean by this?

I mean that these configurations would be equivalent. In both cases the user expressed the intention to have everything except alertmanager:

metrics:
  enable: true
  prometheus:
    enable: true
  serviceMonitor:
    enable: true
  alertmanager:
    enable: false (default true)

metrics:
  enable: true
  alertmanager:
    enable: false (default true)

You can also have more Prometheus instances, so I guess we should make everything an array?

Changing the content of a CR is an exceptional case when a user wants to override the default behavior and knows what they are doing.

Yes something like that:

metrics:
  prometheus:
    cr:
    - content: |
    - content: |
  serviceMonitor:
    cr:
    - content: |
    - content: |
  podMonitor:
    cr:
    - content: |
  alertmanager:
    cr:
    - content: |
  prometheusRule:
    cr:
    - content: |
  grafana:
    cr:
    - content: |
  grafanaDashboard:
    cr:
    - content: |
  grafanaDatasource:
    cr:
    - content: |
tracing:
  jaeger:
    cr:
    - content: |

@sparkoo
Member

sparkoo commented Nov 20, 2019

Can you please closer specify what you mean by this?

I mean that these configurations would be equivalent. In both cases the user expressed the intention to have everything except alertmanager

I'm ok with that. Once metrics.enable: true, then deploy Prometheus and Grafana, connect them together, configure service discovery of the Che master and workspace services, and create the Grafana dashboards as defined here https://github.com/eclipse/che/blob/master/deploy/openshift/templates/monitoring/grafana-dashboards.yaml
I don't think it makes sense to bother with the alertmanager for now if we're not using it. And I'm not sure about pod monitors either. TBH I'm not sure what the use case for them is.

You can have also more Prometheus instances, so I guess we should make all arrays?

Changing the content of CR is an exceptional case when a user wants to override the default behavior and he knows what is he doing.

Can we leave it for later then? Simply avoid all the cr properties for now. We're not closing the door for the future and we will narrow the scope of this. Basically everything will have just a bool enable + Jaeger will have its jaegerClientConfig.

@skabashnyuk
Contributor Author

Can we leave it for later then? Simply avoid all the cr properties for now. We're not closing the door for the future and we will narrow the scope of this. Basically everything will have just a bool enable + Jaeger will have its jaegerClientConfig.

Yes, sure. We can postpone that until the time we need it. So the minimal config will look like:

metrics:
 enable:true (default false)
tracing:
 enable:true (default false)
 jaegerClientConfig: (optional)
   serviceName:"che-server"
   endpoint: "http://jaeger-collector:14268/api/traces"
   sampler:
      managerHostPort: "jaeger:5778"
      type:"const"
      param:"1"
   reporter:
      maxQueueSize: "10000"

+ we have a plan for the future for how to customize the operators and CRs.
@sparkoo WDYT?

@sparkoo
Member

sparkoo commented Nov 21, 2019

@skabashnyuk with that, we're not fulfilling the monitoring-with-external-stack scenarios. At a minimum, we would need metrics.prometheus.enable and metrics.grafana.enable. I'm kind of ok to skip it for now, but it isn't much extra work, so I'm not sure it's worth leaving it for the future, when we will be somewhere else with our focus.

I'd like to take it in parts. I think that the implementation steps I've suggested (#15046 (comment)) are still valid and make sense, as we've basically renamed the properties I came up with. I'll update the individual tasks to reflect this.

@skabashnyuk skabashnyuk added the area/che-operator Issues and PRs related to Eclipse Che Kubernetes Operator label Nov 21, 2019
@sparkoo
Member

sparkoo commented Nov 21, 2019

Proposal v2 (with installing 3rd party operators)

Here is the 4-phase plan, further split into several steps, that I think would make sense to follow. Each step is a building block for the next one, but still adds value and makes sense on its own.

Phase 1

Goal: make it possible to monitor Che master and workspaces with external and installed Prometheus & Grafana

  1. enable metrics to be able to monitor with external stack
    metrics:
     enable:true # default false
    
  2. When PrometheusServiceMonitor or GrafanaDashboard CRDs defined in the cluster, create CRs
    metrics:
      enable: true
      prometheus-servicemonitor:
    	enable: true # default false
      grafana-dashboard:
    	enable: true # default false
    
  3. Install Prometheus (prometheus-operator, Prometheus CR)
    metrics:
      enable: true
      prometheus-operator:
    	enable: true # default false
      prometheus:
    	enable: true # default false
    
  4. install Grafana (grafana-operator, Grafana CR, GrafanaDatasource CR)
    metrics:
      enable: true
      grafana-operator:
    	enable: true # default false
      grafana:
    	enable: true # default false
      grafana-datasource:
    	enable: true # default false
    
  5. Install Prometheus and Grafana by default if metrics.enable:true. Basically set metrics.*.enable to default true. After this, we get into metrics state proposed here [POC] How to configure monitoring/tracing with che CRD #15136 (comment)

Phase 2

Goal: make it possible to trace Che with external and installed Jaeger

  1. enable tracing to be able to trace with external Jaeger
    tracing:
     enable:true # default false
     jaegerClientConfig:	# optional
       serviceName:"che-server"
       endpoint: "http://jaeger-collector:14268/api/traces"	# endpoint is mandatory for external jaeger
       sampler:
    	  managerHostPort: "jaeger:5778"
    	  type:"const"
    	  param:"1"
       reporter:
    	  maxQueueSize: "10000"
    
  2. install Jaeger (jaeger-operator)
    tracing:
     enable: true
     jaeger-operator:
      enable: true # default false
    
  3. install Jaeger by default if tracing.enable: true. basically set tracing.jaeger-operator.enable to default true. After this, we get into tracing state proposed here [POC] How to configure monitoring/tracing with che CRD #15136 (comment)

Phase 3

Goal: make it possible to fine-tune the stack with customizable CR definitions

final monitoring/tracing part of the CheCluster CRD:
metrics:
  enable:

  prometheus-operator:
	enable:
  prometheus:
	enable:
	cr:
	- content: |
  prometheus-servicemonitor:
	enable:
	cr:
	- content: |
  prometheus-podmonitor:
	enable:
	cr:
	- content: |
  prometheus-alertmanager:
	enable:
	cr:
	- content: |
  prometheus-rule:
	enable:
	cr:
	- content: |

  grafana-operator:
	enable:
  grafana:
	enable:
	cr:
	- content: |
  grafana-dashboard:
	enable:
	cr:
	- content: |
  grafana-datasource:
	enable:
	cr:
	- content: |
tracing:
  enable:
  jaegerClientConfig:
	serviceName:
	endpoint:
	sampler:
		managerHostPort:
		type:
		param:
	reporter:
		maxQueueSize:
  jaeger-operator:
	enable:
  jaeger:
	enable:
	cr:
	- content: |

Phase 4

Goal: fine tune operators deployment (Deployment, ServiceAccount, Role, ClusterRole, ...)

@sparkoo
Member

sparkoo commented Nov 22, 2019

Proposal v3

Update of the previous proposal after today's team call. The main idea of this update is that we don't want to install other operators, because then we would have to support them.

Here is the 4-phase plan, further split into several steps, that I think would make sense to follow. Each step is a building block for the next one, but still adds value and makes sense on its own.

Phase 1

Goal: make it possible to monitor Che master and workspaces with external and "installed" Prometheus & Grafana. We're leaving the installation/maintenance of prometheus-operator and grafana-operator to the user. We're only creating the CRs managed by the operators.

  1. Enable metrics to be able to monitor with an external stack

metrics:
  enable: true # default false

  2. Create Prometheus's ServiceMonitor and Grafana's GrafanaDashboard when the CRDs are defined (the operator can detect the CRDs as sketched after this list).

metrics:
  enable: true
  prometheusServicemonitor:
    enable: true # default false
  grafanaDashboard:
    enable: true # default false

  3. Instantiate Prometheus when the Prometheus CRD is defined. This will create a Prometheus instance only if prometheus-operator is installed.

metrics:
  enable: true
  prometheus:
    enable: true # default false

  4. Instantiate Grafana when the Grafana CRD is defined. This will create a Grafana instance only if grafana-operator is installed.

metrics:
  enable: true
  grafana:
    enable: true # default false
  grafanaDatasource:
    enable: true # default false

  5. Create all CRs by default. Basically set metrics.*.enable to default true. There will still be a "global" metrics.enable with default false.
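
A sketch of how the presence of those CRDs could be checked; the CRD names are the ones the upstream operators register, so treat them as assumptions to verify:

# monitoring CRDs registered by prometheus-operator
kubectl get crd prometheuses.monitoring.coreos.com servicemonitors.monitoring.coreos.com
# dashboard/datasource CRDs registered by grafana-operator (integr8ly)
kubectl get crd grafanas.integreatly.org grafanadashboards.integreatly.org grafanadatasources.integreatly.org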

Phase 2

Goal: make it possible to trace Che with external and "installed" Jaeger. We're leaving the installation/maintenance of jaeger-operator to the user. We're only creating the operator's CRs.

  1. Enable tracing to be able to trace with an external Jaeger

tracing:
  enable: true # default false
  jaegerClientConfig: # optional
    serviceName: "che-server"
    endpoint: "http://jaeger-collector:14268/api/traces" # endpoint is mandatory for external jaeger
    sampler:
      managerHostPort: "jaeger:5778"
      type: "const"
      param: "1"
    reporter:
      maxQueueSize: "10000"

  2. Instantiate Jaeger when the CRD is defined. This will create a Jaeger instance only if jaeger-operator is installed.

tracing:
  enable: true
  jaeger:
    enable: true # default true

Phase 3

Goal: make it possible to fine-tune the stack with customizable CR definitions. By default, we will enable all options that we use and have reasonable default values.

final monitoring/tracing part of the CheCluster CRD:
metrics:
  enable: # false

  prometheus:
    enable: # true
    cr:
    - content: |
  prometheusServicemonitor:
    enable: # true
    cr:
    - content: |
  prometheusPodmonitor:
    enable: # false
    cr:
    - content: |
  prometheusAlertmanager:
    enable: # false
    cr:
    - content: |
  prometheusRule:
    enable: # false
    cr:
    - content: |

  grafana:
    enable: # true
    cr:
    - content: |
  grafanaDashboard:
    enable: # true
    cr:
    - content: |
  grafanaDatasource:
    enable: # true
    cr:
    - content: |


tracing:
  enable: # false

  jaegerClientConfig:
    serviceName:
    endpoint:
    sampler:
      managerHostPort:
      type:
      param:
    reporter:
      maxQueueSize:
  jaeger:
    enable: # true
    cr:
    - content: |
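
For illustration, a hypothetical filled-in example of the cr override mechanism sketched above, embedding one of the ServiceMonitor manifests from earlier in this thread as a string:

metrics:
  enable: true
  prometheusServicemonitor:
    enable: true
    cr:
    - content: |
        apiVersion: monitoring.coreos.com/v1
        kind: ServiceMonitor
        metadata:
          name: workspaces-monitoring
        spec:
          endpoints:
          - port: server-3100
            interval: 10s          # example of a value tuned by the user
          selector:
            matchExpressions:
            - key: che.workspace_id
              operator: Exists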

Phase 4 (can be switched with Phase 3 based on priorities)

Goal: Make it possible to install [grafana|prometheus|jaeger]-operator as a development/unsupported option. May be done by chectl behind some flag.

@sparkoo
Member

sparkoo commented Nov 26, 2019

Proposal v3 is final in the scope of this task. The remaining questions will be solved in individual issues. Closing.
