[MVP] Collecting Che Workspace metrics in single cluster mode with Prometheus #14888

Closed
skabashnyuk opened this issue Oct 15, 2019 · 12 comments
Labels: kind/task (Internal things, technical debt, and to-do tasks to be performed.) · severity/P1 (Has a major impact to usage or development of the system.)

skabashnyuk commented Oct 15, 2019

Is your task related to a problem? Please describe.

As part of #13270 we have a requirement to collect metrics from Che workspaces.

Issue #14245 and PR eclipse-theia/theia#6303 track adding those metrics to Che Theia.

Describe the solution you'd like

To be able to reuse the result of this work on OpenShift 3, we are not going to rely on the Prometheus operator and CRD-based discovery.

There are two modes that we should consider:

  • Prometheus, the Che server, and Che workspaces all in the same namespace
  • Prometheus and Che workspaces in different namespaces

As part of this issue, we need to figure out the permissions that Prometheus requires in each mode to discover new services and collect metrics effectively, without using public routes.

This is an investigation task that doesn't include:

  • changes in chectl/helm/operator
  • documentation updates

The result of this work will be:

  • a set of issues for the different components that we need to implement
  • a known set of permissions that Prometheus requires to do its part of the job.

Describe alternatives you've considered

Prometheus operator and CRD-based discovery.
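For context, CRD-based discovery with the Prometheus operator would mean creating a ServiceMonitor per target, roughly like the sketch below (the selector and port name are illustrative assumptions, not an agreed design). Since the operator and its CRDs are not available on OpenShift 3, we avoid this path.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: che-workspaces
spec:
  # hypothetical selector: match any service carrying a workspace-id label
  selector:
    matchExpressions:
    - key: che.workspace_id
      operator: Exists
  endpoints:
  - port: server-3100   # assumed metrics port name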

Additional context

#13270

@skabashnyuk skabashnyuk added kind/task Internal things, technical debt, and to-do tasks to be performed. team/platform severity/P1 Has a major impact to usage or development of the system. labels Oct 15, 2019
@skabashnyuk skabashnyuk added this to the Backlog - Platform milestone Oct 15, 2019
@skabashnyuk skabashnyuk modified the milestones: Backlog - Platform, 7.4.0 Oct 16, 2019
@sparkoo sparkoo self-assigned this Oct 21, 2019
@sparkoo
Copy link
Member

sparkoo commented Oct 24, 2019

I'm testing this on Minishift, and I was able to discover services across the whole cluster with this setup:

Prometheus config:
global:
  scrape_interval:     1s
  evaluation_interval: 1s
alerting:
  alertmanagers:
  - static_configs:
    - targets:
rule_files:
scrape_configs:
- job_name: 'kubernetes-service-endpoints'
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  # scrape only services annotated with prometheus.io/scrape: 'true'
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  # let the prometheus.io/scheme annotation switch between http and https
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
    action: replace
    target_label: __scheme__
    regex: (https?)
  # let the prometheus.io/path annotation override the default /metrics path
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  # rewrite the target address to the port given in the prometheus.io/port annotation
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    action: replace
    target_label: __address__
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
  # copy all Kubernetes service labels onto the scraped series
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_service_name]
    action: replace
    target_label: kubernetes_name
ClusterRole:
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: prometheus-monitoring
rules:
- apiGroups: [""]
  resources:
  - services
  - pods
  - endpoints
  verbs: ["list", "watch"]
Application Service:
apiVersion: v1
kind: Service
metadata:
  name: test-metrics
  annotations:
    prometheus.io/scrape: 'true'
    prometheus.io/port:   '8887'
spec:
  ports:
  - protocol: TCP
    port: 8887
    targetPort: 8887

So with this, we would need to deploy Prometheus to any project, giving it ClusterRole permissions to list and watch services, pods, and endpoints. All services that should be discoverable by Prometheus must be labeled with prometheus.io/scrape: 'true' and prometheus.io/port: '8887' (the port on which the application exposes its metrics).
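For completeness, binding that ClusterRole to Prometheus's service account could look like the sketch below (the service account name prometheus and the monitoring namespace are assumptions; the actual binding ships with the deployment):

apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: prometheus-monitoring
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus-monitoring
subjects:
- kind: ServiceAccount
  name: prometheus      # assumed service account used by the Prometheus deployment
  namespace: monitoring # assumed namespace where Prometheus runs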


sparkoo commented Oct 24, 2019

cc: @skabashnyuk @l0rd
Did I miss something important? Is service discovery by labels ok like this?
I also want to look at the permissions needed when we run everything in one project. I guess we would need the same permissions, just scoped to one project, but I haven't tested that yet.
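Untested, as noted above, but for the single-project mode the ClusterRole could presumably be replaced by a namespaced Role like this sketch (the che namespace is a placeholder):

apiVersion: rbac.authorization.k8s.io/v1beta1
kind: Role
metadata:
  name: prometheus-monitoring
  namespace: che   # placeholder: the one project holding Prometheus, Che, and workspaces
rules:
- apiGroups: [""]
  resources:
  - services
  - pods
  - endpoints
  verbs: ["list", "watch"]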

skabashnyuk (Contributor Author) commented:

> Is service discovery by labels ok like this?

Is this some kind of standard or best practice, or did you just invent a new one?


sparkoo commented Oct 25, 2019

We can discover services by the parameters listed here: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#service
The only ones suitable for us seem to be labels or annotations. If I understand the differences between them correctly, we should use annotations (which is what I used in the example; I just named it wrong in my comment).


sparkoo commented Oct 25, 2019

I was able to reduce the needed permissions. Now we only need list and watch on the services resource. The update in the Prometheus config is:

kubernetes_sd_configs:
- role: service
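Correspondingly, the ClusterRole from my earlier comment presumably shrinks to just the services resource; a sketch:

apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: prometheus-monitoring
rules:
- apiGroups: [""]
  resources: ["services"]   # pods and endpoints are no longer needed with role: service
  verbs: ["list", "watch"]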


sparkoo commented Oct 25, 2019

I've successfully scraped metrics from Che workspaces with no changes except building my own Theia image with the metrics plugin. Theia exposes a metrics endpoint at :3100/metrics, so we already have a service for that.

Then I configured Prometheus to discover services with the following rules:

  • service annotation org.eclipse.che.machine.name matches regex theia-ide(.*)
  • service label che.workspace_id is defined
  • service port name is server-3100

With these rules, Prometheus is able to find every running workspace service and scrape its metrics.
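A minimal sketch of how those three rules could map to relabel_configs (meta label names follow Prometheus's sanitization of dots to underscores; the job name is illustrative, and this is my reading rather than the exact deployed config):

scrape_configs:
- job_name: 'che-workspaces'   # illustrative name
  kubernetes_sd_configs:
  - role: service
  relabel_configs:
  # keep services whose machine-name annotation matches theia-ide(.*)
  - source_labels: [__meta_kubernetes_service_annotation_org_eclipse_che_machine_name]
    action: keep
    regex: theia-ide(.*)
  # keep services that define the che.workspace_id label
  - source_labels: [__meta_kubernetes_service_label_che_workspace_id]
    action: keep
    regex: .+
  # keep only the service port named server-3100
  - source_labels: [__meta_kubernetes_service_port_name]
    action: keep
    regex: server-3100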

I'll write a full example to test, with the exact Prometheus deployment and devfile, later today.


sparkoo commented Oct 25, 2019

Here's what's needed to monitor all workspaces in the cluster. I've tested it on Minishift.

  1. log in to Minishift as system:admin (we have to create a ClusterRole and ClusterRoleBinding for service discovery)
  2. create a new project: oc new-project monitoring
  3. create a new app from the template; this deploys Prometheus with the needed configuration and permissions: oc new-app https://gist.githubusercontent.com/sparkoo/89fcb8e8c46720900cd89777f76d5347/raw/cac2af4127c7908a3187d96528102005e56a5d94/prometheus_template.yaml
    • the output should show you where you can access Prometheus, or run oc get routes to get the Prometheus URL. Open it in the browser; it should be something like prometheus-monitoring.192.168.42.10.nip.io
  4. now deploy Che to any project as you wish. It can be multi-user or single-user, with workspaces in one or multiple projects...
  5. create a workspace from this minimal devfile: https://gist.github.com/sparkoo/89fcb8e8c46720900cd89777f76d5347#file-devfile-yaml (a very minimal devfile that just uses my che-theia image)
  6. start the workspace
  7. once the workspace service is created, you will see it in Prometheus under Status > Targets. When the workspace is started, it will be marked green with state UP.
  8. now you can see individual metrics and execute queries.

(screenshots: Prometheus targets page and query results)

All resources I've prepared are in this gist: https://gist.github.com/sparkoo/89fcb8e8c46720900cd89777f76d5347
The most interesting are the ConfigMap and ClusterRole from the Prometheus template: https://gist.github.com/sparkoo/89fcb8e8c46720900cd89777f76d5347#file-prometheus_template-yaml

Basically, we only need to configure Prometheus service discovery and ensure that che-theia has the metrics plugin. I'm also not 100% sure that the service discovery conditions (see my previous comment #14888 (comment)) will always be satisfied.

thoughts? questions?

cc: @skabashnyuk @l0rd


sparkoo commented Oct 29, 2019

cc: @skabashnyuk

skabashnyuk (Contributor Author) commented:

@sparkoo could you please test this setup on OpenShift 4 and Kubernetes?


sparkoo commented Oct 29, 2019

@skabashnyuk for k8s (minikube), I've prepared this List for Prometheus: https://gist.github.com/sparkoo/89fcb8e8c46720900cd89777f76d5347#file-prometheus_list-yaml. There are no logic or configuration changes. It works fine; Prometheus discovered the workspace services and is scraping the metrics without any issues.

OpenShift 4 (crc) works with the OpenShift template mentioned in my previous comment #14888 (comment).

skabashnyuk (Contributor Author) commented:

Today we talked about this task with @l0rd, @sparkoo, and @ibuziuk.
We noted good progress on this task and identified directions in which we can move forward.
In some cluster setups, cluster-wide permissions can be quite tricky to get. So we want to:

  • Investigate the possibility of bridging the Theia service from the workspace namespace to the Prometheus namespace
  • Investigate the possibility of notifying Prometheus about newly available services via some API

Also, we want to practice Prometheus federation in a slightly simpler form than required in #14889. Instead of having multiple clusters, we can set up multiple Prometheus instances in individual namespaces, plus a separate Prometheus acting in the master role. This formation can simulate multi-cluster federation capabilities and give us some knowledge and feedback.
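A minimal sketch of what the "master" Prometheus's scrape config could look like in that formation (the job name, match[] selector, and leaf addresses are all illustrative assumptions):

scrape_configs:
- job_name: 'federate'
  honor_labels: true          # keep the labels set by the leaf Prometheus instances
  metrics_path: /federate     # Prometheus's built-in federation endpoint
  params:
    'match[]':
    - '{job="che-workspaces"}'   # pull only workspace metrics from each leaf
  static_configs:
  - targets:                  # one leaf Prometheus service per namespace (hypothetical)
    - prometheus.namespace-a.svc:9090
    - prometheus.namespace-b.svc:9090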


sparkoo commented Nov 1, 2019

The work in the scope of this task is done. Closing.

@sparkoo sparkoo closed this as completed Nov 1, 2019