
change alertmanager-svc-headless from http to grpc port + expose 9094 TCP and UDP #435

Open · wants to merge 14 commits into master

Conversation

@AlexandreRoux (Contributor) commented Feb 16, 2023

What this PR does:
This PR corrects the alertmanager-headless service by exposing the gRPC port instead of the HTTP port.

cortex-alertmanager-headless     ClusterIP   None             <none>        9095/TCP    1d
cortex-distributor-headless      ClusterIP   None             <none>        9095/TCP    1d
cortex-ingester-headless         ClusterIP   None             <none>        9095/TCP    1d
cortex-query-frontend-headless   ClusterIP   None             <none>        9095/TCP    1d
cortex-store-gateway-headless    ClusterIP   None             <none>        9095/TCP    1d

This PR also exposes port 9094 (TCP and UDP) for the gossip cluster in the alertmanager StatefulSet.

# k describe pods/cortex-base-alertmanager-0 -n cortex-base
  alertmanager:
    Ports:         8080/TCP, 7946/TCP, 9095/TCP, 9094/TCP, 9094/UDP
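
For illustration, a minimal sketch of what the changed port definitions could look like (not the exact chart templates; the port names here are illustrative, while .Values.config.server.grpc_listen_port is the value the chart already uses for gRPC):

# alertmanager-svc-headless.yaml (sketch)
  ports:
    - port: {{ .Values.config.server.grpc_listen_port }}
      protocol: TCP
      name: grpc
      targetPort: grpc

# alertmanager-statefulset.yaml (sketch)
          ports:
            - name: grpc
              containerPort: {{ .Values.config.server.grpc_listen_port }}
              protocol: TCP
            - containerPort: 9094
              protocol: TCP
            - containerPort: 9094
              protocol: UDP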

Which issue(s) this PR fixes:
This PR does not fix an issue, but it is related to the conversation in #420:

2# - why is there no grpc port exposed? For example, ingesters do have grpc exposed

Just missing. No real reason I guess. See explanation below.

Checklist

  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Signed-off-by: AlexandreRoux <alexandreroux42@protonmail.com>
Signed-off-by: AlexandreRoux <alexandreroux42@protonmail.com>
Signed-off-by: AlexandreRoux <alexandreroux42@protonmail.com>
@AlexandreRoux changed the title from "change alertmanager-svc-headless from http to grpc port" to "change alertmanager-svc-headless from http to grpc port + expose 9094 TCP and UDP" on Mar 1, 2023
@dpericaxon (Contributor) commented:

Hello, it would be great if we could add some logic for populating the peers list. I think it would look something like this (not tested):

          args:
            - "-target=alertmanager"
            - "-config.file=/etc/cortex/cortex.yaml"
            {{- if gt (int .Values.alertmanager.replicas) 1}}
            {{- $fullName := include "cortex.alertmanagerFullname" . }}
            {{- range $i := until (int .Values.alertmanager.replicas) }}
            - --cluster.peer={{ $fullName }}-{{ $i }}.{{ $fullName }}-headless:9094
            {{- end }}

This looks to be the way the prometheus-community alertmanager chart handles peers. (Source)

@nschad (Collaborator) left a comment:

I'm not really sure how this is supposed to be configured correctly.

Review thread on templates/alertmanager/alertmanager-statefulset.yaml (outdated, resolved)
@nschad (Collaborator) left a comment:

What do you think about the following? @AlexandreRoux

CC: @kd7lxl

@kd7lxl (Collaborator) commented Mar 8, 2023

Hello, it would be great if we could add some logic for populating the peers list. I think it would look something like this (not tested):

          args:
            - "-target=alertmanager"
            - "-config.file=/etc/cortex/cortex.yaml"
            {{- if gt (int .Values.alertmanager.replicas) 1}}
            {{- $fullName := include "cortex.alertmanagerFullname" . }}
            {{- range $i := until (int .Values.alertmanager.replicas) }}
            - --cluster.peer={{ $fullName }}-{{ $i }}.{{ $fullName }}-headless:9094
            {{- end }}

This looks to be the way the prometheus-community alertmanager chart handles peers. (Source)

Is seeding the cluster peers necessary when we have memberlist enabled?

            - name: grpc
              containerPort: {{ .Values.config.server.grpc_listen_port }}
              protocol: TCP
            - containerPort: 9094
@kd7lxl (Collaborator) commented on the diff:

I'm conflicted on this. The magic number 9094 bothers me. It would be nice to make this configurable and only default to 9094, but that's awkward because the port number is embedded at the end of the listen_address string. Something like this (but this is kind of ugly):

Suggested change:
-             - containerPort: 9094
+             - containerPort: {{ (.Values.config.alertmanager.cluster.listen_address | default "0.0.0.0:9094" | split ":").1 }}
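
For reference, a hedged variant of the same idea: Sprig's split returns a dictionary keyed _0, _1, …, so selecting the port part would probably look like this (untested sketch, same assumptions as the suggestion above):

            - containerPort: {{ (.Values.config.alertmanager.cluster.listen_address | default "0.0.0.0:9094" | split ":")._1 }}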

A Collaborator replied:

What about introducing an extra variable with a docstring in values.yaml?
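
As a rough sketch of that alternative (the key name alertmanager.clusterPort below is purely illustrative, not an existing chart value):

# values.yaml
alertmanager:
  # -- Port used for alertmanager gossip clustering. Must match the port in
  # config.alertmanager.cluster.listen_address.
  clusterPort: 9094

# alertmanager-statefulset.yaml
            - containerPort: {{ .Values.alertmanager.clusterPort }}
              protocol: TCP
            - containerPort: {{ .Values.alertmanager.clusterPort }}
              protocol: UDP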

@AlexandreRoux (Contributor, Author) replied:

I haven't managed to figure out a way, other than the above suggestion from @kd7lxl, to make 9094 configurable, since it will always come from config.alertmanager.cluster.listen_address and the port needs to be split off the end of that string.

I will push that for now (to make progress on the PR), but suggestions are welcome.

@AlexandreRoux (Contributor, Author) commented:

Is seeding the cluster peers necessary when we have memberlist enabled?

I believe the gossip memberlist and the gossip cluster are two separate things here. Please correct me if I am wrong, but I asked about that specific question in the Slack channel:
https://cloud-native.slack.com/archives/CCYDASBLP/p1676555646043829

@nschad (Collaborator) commented Mar 15, 2023

Is seeding the cluster peers necessary when we have memberlist enabled?

I believe the gossip memberlist and the gossip cluster are two separate things here. Please correct me if I am wrong, but I asked about that specific question in the Slack channel: https://cloud-native.slack.com/archives/CCYDASBLP/p1676555646043829

Pretty sure you are correct with this assessment. That's probably also why it's giving us so much grief.

@AlexandreRoux (Contributor, Author) commented:

Is seeding the cluster peers necessary when we have memberlist enabled?

I believe the gossip memberlist and the gossip cluster are two separate things here. Please correct me if I am wrong, but I asked about that specific question in the Slack channel: https://cloud-native.slack.com/archives/CCYDASBLP/p1676555646043829

Pretty sure you are correct with this assessment. That's probably also why it's giving us so much grief.

I redeployed my alertmanagers with the peers removed, and the result is as follows:

# values.yaml shipped to helm
config:
  alertmanager:
    cluster:
      listen_address: '0.0.0.0:9094'
alertmanager:
  replicas: 2
 
# /multitenant_alertmanager/status
<h3>Members</h3>
Name | Addr
-- | --
01GVK00A2JF0H9DK4FYXKTCGXR | 172.17.0.13

# k get pods
cortex-alertmanager-0                   2/2     Running   0             6m35s   172.17.0.18   minikube   <none>           <none>
cortex-alertmanager-1                   2/2     Running   0             6m47s   172.17.0.13   minikube   <none>           <none>

So we do indeed need to seed peers (too many "ee" in that phrase), and the two available options are (sketched below):

  • an array of alertmanager pod addresses (what Prometheus recommends)
  • the alertmanager-headless service
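
Roughly, the two options would translate into -alertmanager.cluster.peers values like these (hostnames are illustrative):

# option 1: enumerate each replica via its per-pod DNS record behind the headless service
-alertmanager.cluster.peers=cortex-alertmanager-0.cortex-alertmanager-headless:9094,cortex-alertmanager-1.cortex-alertmanager-headless:9094

# option 2: point every replica at the headless service itself
-alertmanager.cluster.peers=cortex-alertmanager-headless:9094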

To reply to @dpericaxon's comment:

Hello, it would be great if we could add some logic for populating the peers list. I think it would look something like this (not tested)

Isn't this a values.yaml templating problem rather than a pod/args templating problem, since we are passing the peers to the alertmanager via /etc/cortex/cortex.yaml? (A rough sketch of that idea follows the Args block below.)

    Args:
      -target=alertmanager
      -config.file=/etc/cortex/cortex.yaml
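
For illustration only, a sketch of how the peers could be rendered into cortex.yaml instead of pod args, assuming the chart's config template can be extended like this (cortex.alertmanagerFullname is the helper already used above; the rest is hypothetical and untested):

    alertmanager:
      cluster:
        listen_address: '0.0.0.0:9094'
        {{- if gt (int .Values.alertmanager.replicas) 1 }}
        {{- $fullName := include "cortex.alertmanagerFullname" . }}
        {{- $peers := list }}
        {{- range $i := until (int .Values.alertmanager.replicas) }}
        {{- $peers = append $peers (printf "%s-%d.%s-headless:9094" $fullName $i $fullName) }}
        {{- end }}
        peers: {{ join "," $peers }}
        {{- end }}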

@nschad (Collaborator) commented May 12, 2023

Where are we with this?

Signed-off-by: Niclas Schad <niclas.schad@gmail.com>
@nschad (Collaborator) commented May 12, 2023

I've consolidated the proposed changes here #460

Feel free to try it out and let me know if that works.

@AlexandreRoux (Contributor, Author) commented:

I've consolidated the proposed changes here #460

Feel free to try it out and let me know if that works.

I am planning to come back to this issue next week; I will try #460 and keep you posted.

@nschad self-requested a review June 23, 2023 06:34
Signed-off-by: AlexandreRoux <alexandreroux42@protonmail.com>
@AlexandreRoux (Contributor, Author) commented:

I've consolidated the proposed changes here #460
Feel free to try it out and let me know if that works.

I am planning to come back to this issue next week; I will try #460 and keep you posted.

#460 has been implemented with a comma-separated list in 1d830da.

@AlexandreRoux (Contributor, Author) commented Jul 3, 2023

@nschad / @kd7lxl - I thought we were mostly done here, but I noticed a very strange behavior:

Each replica of my alertmanager cluster logs debug messages about failing to reconnect to an old replica (cluster.go):

# k describe pods/cortex-base-alertmanager-0 -n cortex-base
level=debug ts=2023-07-03T16:13:40.509653841Z caller=cluster.go:441 component=cluster msg=reconnect result=failure peer=01H4E5FTYCW0FVZW78X181J77S addr=10.244.1.4:9094 err="1 error occurred:\n\t* Failed to join 10.244.1.4: dial tcp 10.244.1.4:9094: connect: no route to host\n\n"

level=debug ts=2023-07-03T16:15:20.50957654Z caller=cluster.go:339 component=cluster memberlist="2023/07/03 16:15:20 [DEBUG] memberlist: Failed to join 10.244.1.4: dial tcp 10.244.1.4:9094: connect: no route to host\n"

but my current peers are as follows:

# k describe pods/cortex-base-alertmanager-2 -n cortex-base | grep peer
      -alertmanager.cluster.peers=cortex-base-alertmanager-0.cortex-base-alertmanager-headless.cortex-base.svc.cluster.local:9094,cortex-base-alertmanager-1.cortex-base-alertmanager-headless.cortex-base.svc.cluster.local:9094,cortex-base-alertmanager-2.cortex-base-alertmanager-headless.cortex-base.svc.cluster.local:9094

# http://<forwarded-am-service>:<port>/multitenant_alertmanager/status
01H4E5G6WY5AWVHT8AAE1GB31A	10.244.1.5
01H4E85YE1XWTA82HQ24T3733D	10.244.1.8
01H4E81T31RJ87MKW35J84RJSN	10.244.1.7

I don't know how to explain that 😕 (yet). I am also quite confused about port 9094, which should be cluster-related only and not memberlist. I may post that on Slack; let's see.

@nschad (Collaborator) commented Jul 3, 2023

I thought we were mostly done here, but I noticed a very strange behavior: each replica of my alertmanager cluster logs debug messages about failing to reconnect to an old replica (cluster.go) […]

It might be normal for the ring to contain old instances temporarily. Do these logs disappear after a few minutes?

@AlexandreRoux (Contributor, Author) commented:

It might be normal for the ring to contain old instances temporarily. Do these logs disappear after a few minutes?

In my case it seems to last abnormally long. I am curious whether someone else sees the same behavior.
It only shows up in the debug logs though, so maybe it is something I simply had not noticed before. I am currently unsure.

Signed-off-by: AlexandreRoux <alexandreroux42@protonmail.com>
… disable

Signed-off-by: AlexandreRoux <alexandreroux42@protonmail.com>
@AlexandreRoux requested a review from kd7lxl July 4, 2023 15:37
A Collaborator left a review comment:

This would be a breaking change. I would rather not remove it. Also, the configuration below would then make little sense.

The {{- if .Values.alertmanager.statefulSet.enabled -}} check would become obsolete.

The Collaborator added:

Also, currently, if you set statefulSet to false (the default) and deploy cortex, you only get the non-headless alertmanager service, which makes no sense.

Suggestion:

Keep the alertmanager Deployment (and maybe clean that up too if necessary), but don't enable clustering support out of the box. Let's keep it as it is.

Review thread on templates/alertmanager/alertmanager-svc-headless.yaml (outdated, resolved)
Signed-off-by: AlexandreRoux <alexandreroux42@protonmail.com>
Signed-off-by: AlexandreRoux <alexandreroux42@protonmail.com>
@nschad (Collaborator) left a comment:

Can you run helm-docs in the repo root to update the README.md?

stale bot commented Sep 17, 2023

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.
