
change alertmanager-svc-headless from http to grpc port + expose 9094 TCP and UDP #435

Open · wants to merge 14 commits into master

Conversation

@AlexandreRoux (Contributor) commented Feb 16, 2023

What this PR does:
This PR corrects the alertmanager-headless service by exposing the gRPC port instead of the HTTP port.

cortex-alertmanager-headless     ClusterIP   None             <none>        9095/TCP    1d
cortex-distributor-headless      ClusterIP   None             <none>        9095/TCP    1d
cortex-ingester-headless         ClusterIP   None             <none>        9095/TCP    1d
cortex-query-frontend-headless   ClusterIP   None             <none>        9095/TCP    1d
cortex-store-gateway-headless    ClusterIP   None             <none>        9095/TCP    1d

This PR also exposes port 9094 (TCP and UDP) for the gossip cluster in the alertmanager StatefulSet.

# k describe pods/cortex-base-alertmanager-0 -n cortex-base
  alertmanager:
    Ports:         8080/TCP, 7946/TCP, 9095/TCP, 9094/TCP, 9094/UDP
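
For illustration, a minimal sketch of what the changed port definitions could look like (not the exact chart templates; the port names here are illustrative, while .Values.config.server.grpc_listen_port is the value the chart already uses for gRPC):

# alertmanager-svc-headless.yaml (sketch)
  ports:
    - port: {{ .Values.config.server.grpc_listen_port }}
      protocol: TCP
      name: grpc
      targetPort: grpc

# alertmanager-statefulset.yaml (sketch)
          ports:
            - name: grpc
              containerPort: {{ .Values.config.server.grpc_listen_port }}
              protocol: TCP
            - containerPort: 9094
              protocol: TCP
            - containerPort: 9094
              protocol: UDP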

Which issue(s) this PR fixes:
This PR does not fix an issue, but it is related to the conversation in #420:

2# - why is there no grpc port exposed? For example, ingesters do have grpc exposed

Just missing. No real reason I guess. See explanation below.

Checklist

  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Signed-off-by: AlexandreRoux <alexandreroux42@protonmail.com>
Signed-off-by: AlexandreRoux <alexandreroux42@protonmail.com>
Signed-off-by: AlexandreRoux <alexandreroux42@protonmail.com>
@AlexandreRoux changed the title from "change alertmanager-svc-headless from http to grpc port" to "change alertmanager-svc-headless from http to grpc port + expose 9094 TCP and UDP" on Mar 1, 2023
@dpericaxon (Contributor) commented:

Hello, it would be great if we could add some logic for populating the peers list. I think it would look something like this (not tested):

          args:
            - "-target=alertmanager"
            - "-config.file=/etc/cortex/cortex.yaml"
            {{- if gt (int .Values.alertmanager.replicas) 1}}
            {{- $fullName := include "cortex.alertmanagerFullname" . }}
            {{- range $i := until (int .Values.alertmanager.replicas) }}
            - --cluster.peer={{ $fullName }}-{{ $i }}.{{ $fullName }}-headless:9094
            {{- end }}

This looks to be the way the prometheus-community alertmanager chart handles peers. (Source)

@nschad (Collaborator) left a comment:

I'm not really sure how this is supposed to be configured correctly.

Review thread on templates/alertmanager/alertmanager-statefulset.yaml (outdated, resolved)
@nschad (Collaborator) left a comment:

What do you think about the following? @AlexandreRoux

CC: @kd7lxl

@kd7lxl (Collaborator) commented Mar 8, 2023

Hello, it would be great if we could add some logic for populating the peers list. I think it would look something like this (not tested):

          args:
            - "-target=alertmanager"
            - "-config.file=/etc/cortex/cortex.yaml"
            {{- if gt (int .Values.alertmanager.replicas) 1}}
            {{- $fullName := include "cortex.alertmanagerFullname" . }}
            {{- range $i := until (int .Values.alertmanager.replicas) }}
            - --cluster.peer={{ $fullName }}-{{ $i }}.{{ $fullName }}-headless:9094
            {{- end }}

This looks to be the way the prometheus-community alertmanager chart handles peers. (Source)

Is seeding the cluster peers necessary when we have memberlist enabled?

            - name: grpc
              containerPort: {{ .Values.config.server.grpc_listen_port }}
              protocol: TCP
            - containerPort: 9094
@kd7lxl (Collaborator) commented on the diff:

I'm conflicted on this. The magic number 9094 bothers me. It would be nice to make this configurable and only default to 9094, but that's awkward because the port number is embedded at the end of the listen_address string. Something like this (but this is kind of ugly):

Suggested change:
-             - containerPort: 9094
+             - containerPort: {{ (.Values.config.alertmanager.cluster.listen_address | default "0.0.0.0:9094" | split ":").1 }}
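
For reference, a hedged variant of the same idea: Sprig's split returns a dictionary keyed _0, _1, …, so selecting the port part would probably look like this (untested sketch, same assumptions as the suggestion above):

            - containerPort: {{ (.Values.config.alertmanager.cluster.listen_address | default "0.0.0.0:9094" | split ":")._1 }}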

A Collaborator replied:

What about introducing an extra variable with a docstring in values.yaml?
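
As a rough sketch of that alternative (the key name alertmanager.clusterPort below is purely illustrative, not an existing chart value):

# values.yaml
alertmanager:
  # -- Port used for alertmanager gossip clustering. Must match the port in
  # config.alertmanager.cluster.listen_address.
  clusterPort: 9094

# alertmanager-statefulset.yaml
            - containerPort: {{ .Values.alertmanager.clusterPort }}
              protocol: TCP
            - containerPort: {{ .Values.alertmanager.clusterPort }}
              protocol: UDP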

@AlexandreRoux (Contributor, Author) replied:

I haven't managed to figure out a way, other than the above suggestion from @kd7lxl, to make 9094 configurable, since it will always come from config.alertmanager.cluster.listen_address and the port needs to be split off the end of that string.

I will push that for now (to make progress on the PR), but suggestions are welcome.

@AlexandreRoux (Contributor, Author) commented:

Is seeding the cluster peers necessary when we have memberlist enabled?

I believe the gossip memberlist and the gossip cluster are two separate things here. Please correct me if I am wrong, but I asked about that specific question in the Slack channel:
https://cloud-native.slack.com/archives/CCYDASBLP/p1676555646043829

@nschad (Collaborator) commented Mar 15, 2023

Is seeding the cluster peers necessary when we have memberlist enabled?

I believe the gossip memberlist and the gossip cluster are two separate things here. Please correct me if I am wrong, but I asked about that specific question in the Slack channel: https://cloud-native.slack.com/archives/CCYDASBLP/p1676555646043829

Pretty sure you are correct with this assessment. That's probably also why it's giving us so much grief.

@AlexandreRoux (Contributor, Author) commented:

Is seeding the cluster peers necessary when we have memberlist enabled?

I believe the gossip memberlist and the gossip cluster are two separate things here. Please correct me if I am wrong, but I asked about that specific question in the Slack channel: https://cloud-native.slack.com/archives/CCYDASBLP/p1676555646043829

Pretty sure you are correct with this assessment. That's probably also why it's giving us so much grief.

I redeployed my alertmanagers with the peers removed, and the result is as follows:

# values.yaml shipped to helm
config:
  alertmanager:
    cluster:
      listen_address: '0.0.0.0:9094'
alertmanager:
  replicas: 2
 
# /multitenant_alertmanager/status
<h3>Members</h3>
Name | Addr
-- | --
01GVK00A2JF0H9DK4FYXKTCGXR | 172.17.0.13

# k get pods
cortex-alertmanager-0                   2/2     Running   0             6m35s   172.17.0.18   minikube   <none>           <none>
cortex-alertmanager-1                   2/2     Running   0             6m47s   172.17.0.13   minikube   <none>           <none>

So we do indeed need to seed peers (too many "ee" in that phrase), and the two available options are (sketched below):

  • an array of alertmanager pod addresses (what Prometheus recommends)
  • the alertmanager-headless service
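
Roughly, the two options would translate into -alertmanager.cluster.peers values like these (hostnames are illustrative):

# option 1: enumerate each replica via its per-pod DNS record behind the headless service
-alertmanager.cluster.peers=cortex-alertmanager-0.cortex-alertmanager-headless:9094,cortex-alertmanager-1.cortex-alertmanager-headless:9094

# option 2: point every replica at the headless service itself
-alertmanager.cluster.peers=cortex-alertmanager-headless:9094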

To reply to @dpericaxon's comment:

Hello, it would be great if we could add some logic for populating the peers list. I think it would look something like this (not tested)

Isn't this a values.yaml templating problem rather than a pod/args templating problem, since we are passing the peers to the alertmanager via /etc/cortex/cortex.yaml? (A rough sketch of that idea follows the Args block below.)

    Args:
      -target=alertmanager
      -config.file=/etc/cortex/cortex.yaml
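
For illustration only, a sketch of how the peers could be rendered into cortex.yaml instead of pod args, assuming the chart's config template can be extended like this (cortex.alertmanagerFullname is the helper already used above; the rest is hypothetical and untested):

    alertmanager:
      cluster:
        listen_address: '0.0.0.0:9094'
        {{- if gt (int .Values.alertmanager.replicas) 1 }}
        {{- $fullName := include "cortex.alertmanagerFullname" . }}
        {{- $peers := list }}
        {{- range $i := until (int .Values.alertmanager.replicas) }}
        {{- $peers = append $peers (printf "%s-%d.%s-headless:9094" $fullName $i $fullName) }}
        {{- end }}
        peers: {{ join "," $peers }}
        {{- end }}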

@nschad (Collaborator) commented May 12, 2023

Where are we with this?

Signed-off-by: Niclas Schad <niclas.schad@gmail.com>
@nschad (Collaborator) commented May 12, 2023

I've consolidated the proposed changes here #460

Feel free to try it out and let me know if that works.

@AlexandreRoux (Contributor, Author) commented:

I've consolidated the proposed changes here #460

Feel free to try it out and let me know if that works.

I am planning to come back to this issue next week; I will try #460 and keep you posted.

@nschad self-requested a review June 23, 2023 06:34
Signed-off-by: AlexandreRoux <alexandreroux42@protonmail.com>
@AlexandreRoux (Contributor, Author) commented:

I've consolidated the proposed changes here #460
Feel free to try it out and let me know if that works.

I am planning to come back to this issue next week; I will try #460 and keep you posted.

#460 has been implemented with a comma-separated list in 1d830da.

@AlexandreRoux (Contributor, Author) commented Jul 3, 2023

@nschad / @kd7lxl - I thought we were mostly done here, but I noticed a very strange behavior:

Each replica of my alertmanager cluster logs debug messages about failing to reconnect to an old replica (cluster.go):

# k describe pods/cortex-base-alertmanager-0 -n cortex-base
level=debug ts=2023-07-03T16:13:40.509653841Z caller=cluster.go:441 component=cluster msg=reconnect result=failure peer=01H4E5FTYCW0FVZW78X181J77S addr=10.244.1.4:9094 err="1 error occurred:\n\t* Failed to join 10.244.1.4: dial tcp 10.244.1.4:9094: connect: no route to host\n\n"

level=debug ts=2023-07-03T16:15:20.50957654Z caller=cluster.go:339 component=cluster memberlist="2023/07/03 16:15:20 [DEBUG] memberlist: Failed to join 10.244.1.4: dial tcp 10.244.1.4:9094: connect: no route to host\n"

but my current peers are as follows:

# k describe pods/cortex-base-alertmanager-2 -n cortex-base | grep peer
      -alertmanager.cluster.peers=cortex-base-alertmanager-0.cortex-base-alertmanager-headless.cortex-base.svc.cluster.local:9094,cortex-base-alertmanager-1.cortex-base-alertmanager-headless.cortex-base.svc.cluster.local:9094,cortex-base-alertmanager-2.cortex-base-alertmanager-headless.cortex-base.svc.cluster.local:9094

# http://<forwarded-am-service>:<port>/multitenant_alertmanager/status
01H4E5G6WY5AWVHT8AAE1GB31A	10.244.1.5
01H4E85YE1XWTA82HQ24T3733D	10.244.1.8
01H4E81T31RJ87MKW35J84RJSN	10.244.1.7

I don't know how to explain that 😕 (yet). I am also quite confused about port 9094, which should be cluster-related only and not memberlist. I may post that on Slack; let's see.

@nschad (Collaborator) commented Jul 3, 2023

I thought we were mostly done here, but I noticed a very strange behavior: each replica of my alertmanager cluster logs debug messages about failing to reconnect to an old replica (cluster.go) […]

It might be normal for the ring to contain old instances temporarily. Do these logs disappear after a few minutes?

@AlexandreRoux (Contributor, Author) commented:

It might be normal for the ring to contain old instances temporarily. Do these logs disappear after a few minutes?

In my case it seems to last abnormally long. I am curious whether someone else sees the same behavior.
It only shows up in the debug logs though, so maybe it is something I simply had not noticed before. I am currently unsure.

Signed-off-by: AlexandreRoux <alexandreroux42@protonmail.com>
… disable

Signed-off-by: AlexandreRoux <alexandreroux42@protonmail.com>
@AlexandreRoux requested a review from kd7lxl July 4, 2023 15:37
A Collaborator left a review comment:

This would be a breaking change. I would rather not remove it. Also, the configuration below would then make little sense.

The {{- if .Values.alertmanager.statefulSet.enabled -}} check would become obsolete.

The Collaborator added:

Also, currently, if you set statefulSet to false (the default) and deploy cortex, you only get the non-headless alertmanager service, which makes no sense.

Suggestion:

Keep the alertmanager Deployment (and maybe clean that up too if necessary), but don't enable clustering support out of the box. Let's keep it as it is.

Review thread on templates/alertmanager/alertmanager-svc-headless.yaml (outdated, resolved)
Signed-off-by: AlexandreRoux <alexandreroux42@protonmail.com>
Signed-off-by: AlexandreRoux <alexandreroux42@protonmail.com>
@nschad (Collaborator) left a comment:

Can you run helm-docs in the repo root to update the README.md?

stale bot commented Sep 17, 2023

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.
