Skip to content

[CELEBORN-1552] automatically support prometheus to scrape metrics for helm chart#2673

Closed
lianneli wants to merge 1 commit intoapache:mainfrom
lianneli:PR-1552
Closed

[CELEBORN-1552] automatically support prometheus to scrape metrics for helm chart#2673
lianneli wants to merge 1 commit intoapache:mainfrom
lianneli:PR-1552

Conversation

@lianneli
Copy link
Contributor

@lianneli lianneli commented Aug 8, 2024

What changes were proposed in this pull request?

  1. Add Annotations to Master Service and Worker Service for automatically scraping by Prometheus.
  2. Add Ports to Worker Service, since it's empty before that prometheus cannot connect to workers.

Why are the changes needed?

Although master and worker provide http interfaces, We still need add annotations manually for prometheus automatically scraping.

Does this PR introduce any user-facing change?

No. Users will not feel any changes to install and use Celeborn.

How was this patch tested?

test locally and in dev environment.

Before:
image

After:
image

@waitinfuture
Copy link
Contributor

@RexXiong Could you please take a look at this?

{{- range $key, $val := .Values.celeborn }}
{{- if eq $key "celeborn.metrics.enabled" }}
{{- $metricsEnabled = $val -}}
{{- end }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a question about helm chart indentation. Is there any guideline or best practice?

Copy link
Contributor Author

@lianneli lianneli Aug 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@FMX Thanks for you review. The file _helpers.tpl is go template syntax, some complex logic can be coded in this file, I understand. The param celeborn.metrics.enabled, and others, is written in values.yaml as a single key, but the dot . inside key does not conform to the specification which cannot be used directly by .Values.xxx. So, I imitate the way in configMap.yaml, using the range syntax to loop through, and using if to scratch the value. The indentation and format in this file is not significant, and it is processed through variable in template/xx/service.yaml.
image

{{- end }}
{{- end }}

{{/*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this also required for Celeborn Master?

Copy link
Contributor Author

@lianneli lianneli Aug 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this also required for Celeborn Master?

@RexXiong Not required. Something interesting is although the port set in master service is not for metrics, prometheus still can scrape metrics successfully. This is not proper for worker service, since there is not any port set in it. So I only set the prometheus http port in worker service other than master worker. All the changes have tested in our clusters, containing both dev and prod environment.

Of Course, we can add the prometheus http port in master service for understanding more easily. Should I do that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of Course, we can add the prometheus http port in master service for understanding more easily. Should I do that?

No, but I'm still very interested in why Prometheus can still scrape metrics successfully.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of Course, we can add the prometheus http port in master service for understanding more easily. Should I do that?

No, but I'm still very interested in why Prometheus can still scrape metrics successfully.

@RexXiong
Good question, and I made a test.

In k8s, the core role of Service is translating Pod's ip to domain name like "xxx.namespace.svc.cluster.local", whether the port set in Service is a real one. However, Service will not work if there is none port set.

Prometheus would search all the "on working" service, fetch annotations to get real port, then get Pod's ip though K8s Api to finish the metrics scratching.

That's why we need not set another port in Master Service.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds reasonable. Thanks.

Copy link
Contributor

@RexXiong RexXiong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, merge to main(0.6.0) and branch-0.5(0.5.2)

@RexXiong RexXiong closed this in 1ea704f Aug 26, 2024
RexXiong pushed a commit that referenced this pull request Aug 26, 2024
…r helm chart

### What changes were proposed in this pull request?
1. Add Annotations to Master Service and Worker Service for automatically scraping by Prometheus.
2. Add Ports to Worker Service, since it's empty before that prometheus cannot connect to workers.

### Why are the changes needed?
Although master and worker provide http interfaces, We still need add annotations manually for prometheus automatically scraping.

### Does this PR introduce _any_ user-facing change?
No. Users will not feel any changes to install and use Celeborn.

### How was this patch tested?
test locally and in dev environment.

Before:
![image](https://github.com/user-attachments/assets/d924929f-1cd9-4487-afc6-08390fc8dfc2)

After:
![image](https://github.com/user-attachments/assets/145b0727-e66a-4268-af4d-cf0619eb3b14)

Closes #2673 from lianneli/PR-1552.

Authored-by: Lianne Li <lmlianne@outlook.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
(cherry picked from commit 1ea704f)
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
@lianneli lianneli deleted the PR-1552 branch August 26, 2024 08:55
wankunde pushed a commit to wankunde/celeborn that referenced this pull request Oct 11, 2024
…r helm chart

### What changes were proposed in this pull request?
1. Add Annotations to Master Service and Worker Service for automatically scraping by Prometheus.
2. Add Ports to Worker Service, since it's empty before that prometheus cannot connect to workers.

### Why are the changes needed?
Although master and worker provide http interfaces, We still need add annotations manually for prometheus automatically scraping.

### Does this PR introduce _any_ user-facing change?
No. Users will not feel any changes to install and use Celeborn.

### How was this patch tested?
test locally and in dev environment.

Before:
![image](https://github.com/user-attachments/assets/d924929f-1cd9-4487-afc6-08390fc8dfc2)

After:
![image](https://github.com/user-attachments/assets/145b0727-e66a-4268-af4d-cf0619eb3b14)

Closes apache#2673 from lianneli/PR-1552.

Authored-by: Lianne Li <lmlianne@outlook.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants