[MVP] Collecting Che Workspace metrics in single cluster mode with Prometheus #14888

Closed
skabashnyuk opened this issue Oct 15, 2019 · 12 comments
Labels: kind/task (Internal things, technical debt, and to-do tasks to be performed.) · severity/P1 (Has a major impact to usage or development of the system.)

skabashnyuk commented Oct 15, 2019

Is your task related to a problem? Please describe.

As part of #13270 we have a requirement to collect metrics from Che workspaces.

Issue #14245 and PR eclipse-theia/theia#6303 track adding those metrics to Che Theia.

Describe the solution you'd like

To be able to reuse the result of this work on OpenShift 3, we are not going to rely on the Prometheus operator and CRD-based discovery.

There are two modes that we should consider:

  • Prometheus, the Che server, and Che workspaces all in the same namespace
  • Prometheus and Che workspaces in different namespaces

As part of this issue, we need to figure out the permissions that Prometheus requires in each mode to discover new services and collect metrics effectively, without using public routes.

This is an investigation task that doesn't include:

  • changes in chectl/helm/operator
  • documentation updates

The result of this work will be:

  • a set of issues for the different components that we need to implement
  • a known set of permissions that Prometheus requires to do its part of the job.

Describe alternatives you've considered

Prometheus operator and CRD-based discovery.
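For context, CRD-based discovery with the Prometheus operator would mean creating a ServiceMonitor per target, roughly like the sketch below (the selector and port name are illustrative assumptions, not an agreed design). Since the operator and its CRDs are not available on OpenShift 3, we avoid this path.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: che-workspaces
spec:
  # hypothetical selector: match any service carrying a workspace-id label
  selector:
    matchExpressions:
    - key: che.workspace_id
      operator: Exists
  endpoints:
  - port: server-3100   # assumed metrics port name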

Additional context

#13270

@skabashnyuk skabashnyuk added kind/task Internal things, technical debt, and to-do tasks to be performed. team/platform severity/P1 Has a major impact to usage or development of the system. labels Oct 15, 2019
@skabashnyuk skabashnyuk added this to the Backlog - Platform milestone Oct 15, 2019
@skabashnyuk skabashnyuk modified the milestones: Backlog - Platform, 7.4.0 Oct 16, 2019
@sparkoo sparkoo self-assigned this Oct 21, 2019
@sparkoo
Copy link
Member

sparkoo commented Oct 24, 2019

I'm testing this on Minishift, and I was able to discover services across the whole cluster with this setup:

Prometheus config:
global:
  scrape_interval:     1s
  evaluation_interval: 1s
alerting:
  alertmanagers:
  - static_configs:
    - targets:
rule_files:
scrape_configs:
- job_name: 'kubernetes-service-endpoints'
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  # scrape only services annotated with prometheus.io/scrape: 'true'
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  # let the prometheus.io/scheme annotation switch between http and https
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
    action: replace
    target_label: __scheme__
    regex: (https?)
  # let the prometheus.io/path annotation override the default /metrics path
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  # rewrite the target address to the port given in the prometheus.io/port annotation
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    action: replace
    target_label: __address__
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
  # copy all Kubernetes service labels onto the scraped series
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_service_name]
    action: replace
    target_label: kubernetes_name
ClusterRole:
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: prometheus-monitoring
rules:
- apiGroups: [""]
  resources:
  - services
  - pods
  - endpoints
  verbs: ["list", "watch"]
Application Service:
apiVersion: v1
kind: Service
metadata:
  name: test-metrics
  annotations:
    prometheus.io/scrape: 'true'
    prometheus.io/port:   '8887'
spec:
  ports:
  - protocol: TCP
    port: 8887
    targetPort: 8887

So with this, we would need to deploy Prometheus to any project, giving it ClusterRole permissions to list and watch services, pods, and endpoints. All services that should be discoverable by Prometheus must be labeled with prometheus.io/scrape: 'true' and prometheus.io/port: '8887' (the port on which the application exposes its metrics).
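For completeness, binding that ClusterRole to Prometheus's service account could look like the sketch below (the service account name prometheus and the monitoring namespace are assumptions; the actual binding ships with the deployment):

apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: prometheus-monitoring
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus-monitoring
subjects:
- kind: ServiceAccount
  name: prometheus      # assumed service account used by the Prometheus deployment
  namespace: monitoring # assumed namespace where Prometheus runs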


sparkoo commented Oct 24, 2019

cc: @skabashnyuk @l0rd
Did I miss something important? Is service discovery by labels ok like this?
I also want to look at the permissions needed when we run everything in one project. I guess we would need the same permissions, just scoped to one project, but I haven't tested that yet.
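Untested, as noted above, but for the single-project mode the ClusterRole could presumably be replaced by a namespaced Role like this sketch (the che namespace is a placeholder):

apiVersion: rbac.authorization.k8s.io/v1beta1
kind: Role
metadata:
  name: prometheus-monitoring
  namespace: che   # placeholder: the one project holding Prometheus, Che, and workspaces
rules:
- apiGroups: [""]
  resources:
  - services
  - pods
  - endpoints
  verbs: ["list", "watch"]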

skabashnyuk (Contributor Author) commented:

> Is service discovery by labels ok like this?

Is this some kind of standard or best practice, or did you just invent a new one?


sparkoo commented Oct 25, 2019

We can discover services by the parameters listed here: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#service
The only ones suitable for us seem to be labels or annotations. If I understand the differences between them correctly, we should use annotations (which is what I used in the example; I just named it wrong in my comment).


sparkoo commented Oct 25, 2019

I was able to reduce the needed permissions. Now we only need list and watch on the services resource. The update in the Prometheus config is:

kubernetes_sd_configs:
- role: service
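Correspondingly, the ClusterRole from my earlier comment presumably shrinks to just the services resource; a sketch:

apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: prometheus-monitoring
rules:
- apiGroups: [""]
  resources: ["services"]   # pods and endpoints are no longer needed with role: service
  verbs: ["list", "watch"]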


sparkoo commented Oct 25, 2019

I've successfully scraped metrics from Che workspaces with no changes except building my own Theia image with the metrics plugin. Theia exposes a metrics endpoint at :3100/metrics, so we already have a service for that.

Then I configured Prometheus to discover services with the following rules:

  • service annotation org.eclipse.che.machine.name matches regex theia-ide(.*)
  • service label che.workspace_id is defined
  • service port name is server-3100

With these rules, Prometheus is able to find every running workspace service and scrape its metrics.
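A minimal sketch of how those three rules could map to relabel_configs (meta label names follow Prometheus's sanitization of dots to underscores; the job name is illustrative, and this is my reading rather than the exact deployed config):

scrape_configs:
- job_name: 'che-workspaces'   # illustrative name
  kubernetes_sd_configs:
  - role: service
  relabel_configs:
  # keep services whose machine-name annotation matches theia-ide(.*)
  - source_labels: [__meta_kubernetes_service_annotation_org_eclipse_che_machine_name]
    action: keep
    regex: theia-ide(.*)
  # keep services that define the che.workspace_id label
  - source_labels: [__meta_kubernetes_service_label_che_workspace_id]
    action: keep
    regex: .+
  # keep only the service port named server-3100
  - source_labels: [__meta_kubernetes_service_port_name]
    action: keep
    regex: server-3100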

I'll write a full example to test, with the exact Prometheus deployment and devfile, later today.


sparkoo commented Oct 25, 2019

Here's what's needed to monitor all workspaces in the cluster. I've tested it on Minishift.

  1. log in to Minishift as system:admin (we have to create a ClusterRole and ClusterRoleBinding for service discovery)
  2. create a new project: oc new-project monitoring
  3. create a new app from the template; this deploys Prometheus with the needed configuration and permissions: oc new-app https://gist.githubusercontent.com/sparkoo/89fcb8e8c46720900cd89777f76d5347/raw/cac2af4127c7908a3187d96528102005e56a5d94/prometheus_template.yaml
    • the output should show you where you can access Prometheus, or run oc get routes to get the Prometheus URL. Open it in the browser; it should be something like prometheus-monitoring.192.168.42.10.nip.io
  4. now deploy Che to any project as you wish. It can be multi-user or single-user, with workspaces in one or multiple projects...
  5. create a workspace from this minimal devfile: https://gist.github.com/sparkoo/89fcb8e8c46720900cd89777f76d5347#file-devfile-yaml (a very minimal devfile that just uses my che-theia image)
  6. start the workspace
  7. once the workspace service is created, you will see it in Prometheus under Status > Targets. When the workspace is started, it will be marked green with state UP.
  8. now you can see individual metrics and execute queries.

(screenshots: Prometheus targets page and query results)

All resources I've prepared are in this gist: https://gist.github.com/sparkoo/89fcb8e8c46720900cd89777f76d5347
The most interesting are the ConfigMap and ClusterRole from the Prometheus template: https://gist.github.com/sparkoo/89fcb8e8c46720900cd89777f76d5347#file-prometheus_template-yaml

Basically, we only need to configure Prometheus service discovery and ensure that che-theia has the metrics plugin. I'm also not 100% sure that the service discovery conditions (see my previous comment #14888 (comment)) will always be satisfied.

thoughts? questions?

cc: @skabashnyuk @l0rd


sparkoo commented Oct 29, 2019

cc: @skabashnyuk

skabashnyuk (Contributor Author) commented:

@sparkoo could you please test this setup on OpenShift 4 and Kubernetes?


sparkoo commented Oct 29, 2019

@skabashnyuk for k8s (minikube), I've prepared this List for Prometheus: https://gist.github.com/sparkoo/89fcb8e8c46720900cd89777f76d5347#file-prometheus_list-yaml. There are no logic or configuration changes. It works fine; Prometheus discovered the workspace services and is scraping the metrics without any issues.

OpenShift 4 (crc) works with the OpenShift template mentioned in my previous comment #14888 (comment).

skabashnyuk (Contributor Author) commented:

Today we talked about this task with @l0rd, @sparkoo, and @ibuziuk.
We noted good progress on this task and identified directions in which we can move forward.
In some cluster setups, cluster-wide permissions can be quite tricky to get. So we want to:

  • Investigate the possibility of bridging the Theia service from the workspace namespace to the Prometheus namespace
  • Investigate the possibility of notifying Prometheus about newly available services via some API

Also, we want to practice Prometheus federation in a slightly simpler form than required in #14889. Instead of having multiple clusters, we can set up multiple Prometheus instances in individual namespaces, plus a separate Prometheus acting in the master role. This formation can simulate multi-cluster federation capabilities and give us some knowledge and feedback.
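A minimal sketch of what the "master" Prometheus's scrape config could look like in that formation (the job name, match[] selector, and leaf addresses are all illustrative assumptions):

scrape_configs:
- job_name: 'federate'
  honor_labels: true          # keep the labels set by the leaf Prometheus instances
  metrics_path: /federate     # Prometheus's built-in federation endpoint
  params:
    'match[]':
    - '{job="che-workspaces"}'   # pull only workspace metrics from each leaf
  static_configs:
  - targets:                  # one leaf Prometheus service per namespace (hypothetical)
    - prometheus.namespace-a.svc:9090
    - prometheus.namespace-b.svc:9090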


sparkoo commented Nov 1, 2019

The work in the scope of this task is done. Closing.

@sparkoo sparkoo closed this as completed Nov 1, 2019