
East/west connectivity monitoring tool #5514

Open
zm1990s opened this issue Sep 20, 2023 · 21 comments
Labels
kind/feature: Categorizes issue or PR as related to a new feature.
lfx-mentorship: Issues which have been proposed for the LFX Mentorship program.
lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@zm1990s

zm1990s commented Sep 20, 2023

Description
Antrea currently only monitors Controller/Agent status. A healthy Controller/Agent does not mean that East-West connectivity is good, and the metrics provided by Antrea do not reflect Pod-to-Pod connectivity either.
From an application perspective, we need a tool that can detect and report Pod-to-Pod connectivity issues.

Core feature required
A tool (maybe a DaemonSet) that periodically generates East/West traffic and checks whether E/W connectivity is good. If some of the probes fail, alerts or logs should be sent to external monitoring tools.

The detection interval should be adjustable, as with traditional load balancers: for example, send a probe every second and raise an alert after 3 consecutive failures.
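
A minimal sketch of the probe loop and consecutive-failure logic described above, assuming a hypothetical probe target and alert hook (none of these names are existing Antrea APIs):

    package main

    import (
        "fmt"
        "net"
        "time"
    )

    // probe attempts a TCP connection to a peer Pod address; any error counts as a failure.
    func probe(addr string, timeout time.Duration) error {
        conn, err := net.DialTimeout("tcp", addr, timeout)
        if err != nil {
            return err
        }
        return conn.Close()
    }

    // monitor sends a probe every interval and raises an alert after maxFailures consecutive failures.
    func monitor(addr string, interval time.Duration, maxFailures int) {
        failures := 0
        for range time.Tick(interval) {
            if err := probe(addr, 2*time.Second); err != nil {
                failures++
                if failures >= maxFailures {
                    // In a real tool this would emit a K8s Event or notify an external monitoring system.
                    fmt.Printf("ALERT: %d consecutive probe failures to %s: %v\n", failures, addr, err)
                }
            } else {
                failures = 0
            }
        }
    }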

Other related features
Since we're building an E/W monitoring tool, other related Antrea data paths could be monitored too. For example:

  • Pod to Service of type ClusterIP
  • Pod to Service of type NodePort
  • Pod to external network via SNAT
@zm1990s zm1990s added the kind/feature Categorizes issue or PR as related to a new feature. label Sep 20, 2023
@tnqn
Member

tnqn commented Sep 20, 2023

@zm1990s Thanks for the proposal.
Monitoring E/W connectivity should be feasible, but having extra long-running Pods, especially a DaemonSet deployed in the user's cluster for this purpose, may not be wanted by most users. We could probably leverage the state of the memberlist cluster run by antrea-agent as the source of connectivity status. Regarding alerts, K8s Events associated with the unreachable Node may be a way, but we would need to ensure duplicate events don't flood the event system. Using a consistent hash to select one "reporter" Node among the available Nodes may be feasible.
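
A minimal sketch of how such a deterministic "reporter" selection could work, assuming every Agent has the same view of the healthy member list (illustrative only, not existing Antrea code):

    package main

    import (
        "hash/fnv"
        "sort"
    )

    // selectReporter deterministically picks one healthy Node to report an unreachable Node.
    // As long as the remaining Agents agree on the member list, they all compute the same
    // reporter, so only one K8s Event is emitted per unreachable Node.
    func selectReporter(unreachableNode string, healthyNodes []string) string {
        if len(healthyNodes) == 0 {
            return ""
        }
        sort.Strings(healthyNodes) // every Agent must iterate in the same order
        h := fnv.New32a()
        h.Write([]byte(unreachableNode))
        return healthyNodes[int(h.Sum32())%len(healthyNodes)]
    }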

Regarding Pod-to-Service and Pod-to-External monitoring, I'm not sure it would really be helpful and practical to proactively generate traffic. It's not easy even to know whether a particular access is supposed to succeed given the different policy, firewall, and network topology configurations, not to mention that traffic generated on behalf of, or towards, the user's application may not be wanted by many users. In practice I think most such tools are implemented as scripts/playbooks executed out-of-band against the user's own application, according to what they want to monitor.

cc @jianjuns @antoninbas @salv-orlando

@tnqn
Member

tnqn commented Sep 20, 2023

However, it seems monitoring E/W connectivity via memberlist would just be a faster way to get a Node-unreachable notification compared with K8s's native Node status. If users just want that status to be reported faster, they can also simply tune node-monitor-grace-period, so I'm still wondering what value this would really add.

@tnqn
Member

tnqn commented Sep 20, 2023

A tool (like an antctl subcommand) for smoke testing may be the most practical approach in the end.

@antoninbas
Contributor

However, it seems monitoring E/W connectivity via memberlist would just be a faster way to get a Node-unreachable notification compared with K8s's native Node status.

I think that from an Antrea perspective, it would be good to monitor the health of the overlay network (in encap mode) by running a ping mesh across all gateways. Being able to report latency across Nodes would also be quite nice, but I don't think we can do that with memberlist (IIRC, we discussed that in the past). With latency data available, we could even display a heat map in the Antrea UI and update it in real time.
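
A minimal sketch of a single RTT measurement between gateways using an ICMP echo, assuming the golang.org/x/net/icmp package and sufficient privileges (CAP_NET_RAW); this is only an illustration, not the code Antrea ended up shipping:

    package main

    import (
        "fmt"
        "net"
        "os"
        "time"

        "golang.org/x/net/icmp"
        "golang.org/x/net/ipv4"
    )

    // pingOnce sends one ICMP echo request to a peer gateway IP and returns the measured RTT.
    func pingOnce(peerGatewayIP string, timeout time.Duration) (time.Duration, error) {
        conn, err := icmp.ListenPacket("ip4:icmp", "0.0.0.0")
        if err != nil {
            return 0, err
        }
        defer conn.Close()

        msg := icmp.Message{
            Type: ipv4.ICMPTypeEcho,
            Code: 0,
            Body: &icmp.Echo{ID: os.Getpid() & 0xffff, Seq: 1, Data: []byte("antrea-latency-probe")},
        }
        b, err := msg.Marshal(nil)
        if err != nil {
            return 0, err
        }

        start := time.Now()
        if _, err := conn.WriteTo(b, &net.IPAddr{IP: net.ParseIP(peerGatewayIP)}); err != nil {
            return 0, err
        }
        conn.SetReadDeadline(time.Now().Add(timeout))
        reply := make([]byte, 1500)
        n, _, err := conn.ReadFrom(reply)
        if err != nil {
            return 0, err
        }
        rtt := time.Since(start)

        parsed, err := icmp.ParseMessage(1, reply[:n]) // 1 = ICMPv4 protocol number
        if err != nil || parsed.Type != ipv4.ICMPTypeEchoReply {
            return 0, fmt.Errorf("unexpected ICMP reply")
        }
        return rtt, nil
    }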

@jianjuns
Contributor

Agree with what @antoninbas said.

@tnqn
Member

tnqn commented Sep 21, 2023

If latency data is not needed, I think the health of the overlay network matches the health status reported by memberlist in practice. Apart from a misconfiguration where the memberlist port is allowed but the overlay port is not, which could only happen when deploying a cluster and not during routine operation, I can't think of a situation where memberlist reports a Node as healthy but its overlay doesn't work. But if we want to add latency data, I agree memberlist may not achieve it (however, I don't quite remember discussing this; could you share a link if there is one?).

@antoninbas
Contributor

If latency data is not needed, I think the health of the overlay network matches the health status reported by memberlist in practice. Apart from a misconfiguration where the memberlist port is allowed but the overlay port is not, which could only happen when deploying a cluster and not during routine operation, I can't think of a situation where memberlist reports a Node as healthy but its overlay doesn't work.

I think the overlay check (ping between gateways) is a bit more "end-to-end". In addition to port filtering, we could potentially detect issues like a missing route on the host (granted, that has not happened in a while, but we used to have such issues). I was thinking that with the right "probe" (e.g. a TCP data exchange), the health check would also fail in case of a checksum issue (basically any NIC configuration issue that is specific to double encapsulation).
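
A minimal sketch of such a TCP data-exchange probe between gateways, assuming a hypothetical echo listener on the peer side (the port and names are illustrative, not an existing Antrea component):

    package main

    import (
        "bytes"
        "fmt"
        "io"
        "net"
        "time"
    )

    // tcpProbe connects to a peer gateway, sends a payload, and expects it echoed back.
    // Unlike a bare connection check, the data exchange also exercises checksum handling
    // on the encapsulated path.
    func tcpProbe(peerGatewayIP string, port int, timeout time.Duration) error {
        addr := fmt.Sprintf("%s:%d", peerGatewayIP, port)
        conn, err := net.DialTimeout("tcp", addr, timeout)
        if err != nil {
            return fmt.Errorf("connect failed: %w", err)
        }
        defer conn.Close()
        conn.SetDeadline(time.Now().Add(timeout))

        payload := []byte("antrea-overlay-health-probe")
        if _, err := conn.Write(payload); err != nil {
            return fmt.Errorf("write failed: %w", err)
        }
        reply := make([]byte, len(payload))
        if _, err := io.ReadFull(conn, reply); err != nil {
            return fmt.Errorf("read failed: %w", err)
        }
        if !bytes.Equal(payload, reply) {
            return fmt.Errorf("unexpected reply from peer")
        }
        return nil
    }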

But if we want to add latency data, I agree memberlist may not achieve it (However, I don't quite rememember we discussed this, could you share a link if there is one?).

The latency heat map is something that has been on my mind for a while. I remember someone telling me that Weave had something like this, but I can't find a reference to it.
I brought this up very superficially when we added memberlist as a dependency: #2128 (comment)

@zm1990s
Author

zm1990s commented Sep 22, 2023

@tnqn I think this tool should be decoupled from the Antrea Controller/Agent, just like nsx-interworking, so users can decide whether they want to use it or not.

@antoninbas
Contributor

Assigning to @tushartathgur who said he would look into this.
cc @yuntanghsu as well.

github-actions bot commented Dec 28, 2023

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment, or this will be closed in 90 days

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 28, 2023
@antoninbas antoninbas removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 3, 2024
@antoninbas antoninbas added the lfx-mentorship Issues which have been proposed for the LFX Mentorship program label Jan 23, 2024
@antoninbas
Contributor

We have submitted this issue as a project idea for the LFX mentorship program: cncf/mentoring#1129. Ideally, no one should work on this issue until we know whether the proposal is accepted and whether we can match a mentee to work on it.
See https://docs.linuxfoundation.org/lfx/mentorship for more information on the program.

@prakrit55
Contributor

@antoninbas, I am greatly interested in the project. How do I reach you on Slack, or are there other options? I have intermediate knowledge of K8s and Golang, and I am curious how much frontend work would be involved here. Thank you.

@antoninbas
Contributor

@prakrit55 you can reach out to us on Slack (we have the #antrea channel in the K8s slack)

but for this specific issue, please see comment above (#5514 (comment)). If you are interested, you could consider applying for the LFX mentorship program.

@prakrit55
Contributor

prakrit55 commented Jan 24, 2024

@prakrit55 you can reach out to us on Slack (we have the #antrea channel in the K8s slack)

but for this specific issue, please see comment above (#5514 (comment)). If you are interested, you could consider applying for the LFX mentorship program.

Hey, thank you @antoninbas, I found your channel. I would really like to apply for the LFX mentorship for this project in the March-May term.

@btwshivam

@antoninbas The prospect of working collectively on a comprehensive project like this is truly exciting, and I am keen to contribute my skills and enthusiasm to its success. The outlined sub-projects align perfectly with my interests, and they present a great opportunity for learning, growth, and industry exposure.
I look forward to contributing to the project and learning from the experience.

@nate-double-u

Hello, everyone. I'm pleased to see how many folks are interested in participating in the LFX Mentorship Program.

Upstream issues like this are an excellent place to discuss specific technical topics or provide ideas about how you may tackle a problem; however, please post any questions about the LFX program and how to apply on the mentorship discussion forums (and indeed, some of these questions may have already been answered there, or on the Program Guidelines page).

@antoninbas
Contributor

For all the folks who have applied or are considering applying to one of the Antrea projects for the LFX mentorship program, we have published instructions to complete test tasks: #5976. We will review your submissions for these tasks alongside other material (resume, cover letter) when selecting mentees. The deadline for submitting is February 20th 5PM PST.

ImMdsahil added a commit to ImMdsahil/antrea that referenced this issue Feb 12, 2024
Signed-off-by: Md Sahil <contact.mdsahil@gmail.com>
@antoninbas
Contributor

@IRONICBo will work on this as part of the LFX mentorship program

@IRONICBo
Contributor

IRONICBo commented Mar 29, 2024

Monitoring tool API design proposal

The monitoring tool needs a uniform config

Users and administrators need a way to measure and monitor network performance, specifically the latency between nodes, to ensure optimal cluster performance and troubleshoot potential issues.

Watch a singleton CRD

The proposed solution is to introduce a new Custom Resource Definition (CRD) called PingMonitoringToolConfig in Antrea. This CRD will allow users to enable and configure a ping monitoring tool that measures the latency between nodes. The configuration will include parameters such as the ping interval, timeout, and concurrency limit.

The Antrea Agents will listen for changes to this CRD and adjust their monitoring behavior accordingly. When the monitoring feature is enabled via the feature gate and config, the Agent will watch creation/update/deletion events for this CRD and start, stop, or reconfigure the monitoring tool in real time.

Additionally, a singleton pattern will be enforced using a validation webhook to ensure that only one instance of the CRD exists in the cluster.

Use Feature Gate & Config & CRD to start monitoring tool

The solution introduces a new user-facing feature that allows users to enable and configure the ping monitoring tool via a YAML config file. Users can apply this YAML file using kubectl to create or update the PingMonitoringToolConfig resource.

The changes will be automatically picked up by the Antrea agents, and the monitoring behavior will be updated accordingly. This feature provides users with a structured and easy-to-consume API for enabling and configuring the ping mesh feature.

Main design/architecture

The main design involves the following components:

  1. CRD Definition: A new CRD PingMonitoringToolConfig will be defined with fields for enabling the tool, the ping interval, the timeout, and the concurrency limit.
    Here is an example of the CRD definition in Go (with the metav1 import made explicit):
    package v1alpha1

    import (
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // PingMonitoringToolConfigSpec holds the user-configurable parameters of the ping monitoring tool.
    type PingMonitoringToolConfigSpec struct {
        PingInterval        string `json:"pingInterval,omitempty"`        // e.g. "10s"
        PingTimeout         string `json:"pingTimeout,omitempty"`         // e.g. "5s"
        PingConcurrentLimit int    `json:"pingConcurrentLimit,omitempty"` // max concurrent pings per Agent
    }

    // PingMonitoringToolConfig is the singleton object watched by the Antrea Agents.
    type PingMonitoringToolConfig struct {
        metav1.TypeMeta   `json:",inline"`
        metav1.ObjectMeta `json:"metadata,omitempty"`

        Spec PingMonitoringToolConfigSpec `json:"spec,omitempty"`
    }
  2. Singleton Pattern Enforcement: A validation webhook will be implemented to ensure that only one instance of the PingMonitoringToolConfig resource can exist in the cluster. This webhook will reject the creation of additional instances if one already exists.

  3. Agent Behavior: Antrea Agents will listen for changes to the PingMonitoringToolConfig resource and update their monitoring behavior based on the configuration. The Agents will use a Kubernetes client to watch for changes to the resource and adjust their ping interval, timeout, and concurrency limit accordingly (a watch sketch follows the YAML example below).

  4. Monitoring Logic: The ping monitoring tool will use ICMP echo requests to measure the latency between Nodes and provide metrics that can be used for monitoring and troubleshooting. Here is an example of a YAML configuration file for the PingMonitoringToolConfig resource:

apiVersion: networking.antrea.io/v1alpha1
kind: PingMonitoringToolConfig
metadata:
  name: default
spec:
  pingInterval: "10s"
  pingTimeout: "5s"
  pingConcurrentLimit: 10

In this example, the ping monitoring tool is enabled with a ping interval of 10 seconds, a ping timeout of 5 seconds, and a concurrency limit of 10.
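
A minimal sketch of the Agent-side watch described in item 3, using a client-go dynamic informer; the group/version/resource and handler bodies are assumptions for illustration, not the implementation that was eventually adopted:

    package main

    import (
        "time"

        "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
        "k8s.io/apimachinery/pkg/runtime/schema"
        "k8s.io/client-go/dynamic"
        "k8s.io/client-go/dynamic/dynamicinformer"
        "k8s.io/client-go/rest"
        "k8s.io/client-go/tools/cache"
    )

    // watchMonitorConfig watches the singleton PingMonitoringToolConfig and reacts to its lifecycle.
    func watchMonitorConfig(cfg *rest.Config, stopCh <-chan struct{}) error {
        client, err := dynamic.NewForConfig(cfg)
        if err != nil {
            return err
        }
        // Assumed GVR for the proposed CRD.
        gvr := schema.GroupVersionResource{
            Group:    "networking.antrea.io",
            Version:  "v1alpha1",
            Resource: "pingmonitoringtoolconfigs",
        }
        factory := dynamicinformer.NewDynamicSharedInformerFactory(client, 10*time.Minute)
        informer := factory.ForResource(gvr).Informer()
        informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
            AddFunc: func(obj interface{}) {
                // Start the monitor with the parameters from the new object.
                _ = obj.(*unstructured.Unstructured)
            },
            UpdateFunc: func(oldObj, newObj interface{}) {
                // Reconfigure interval/timeout/concurrency on the fly.
            },
            DeleteFunc: func(obj interface{}) {
                // Stop the monitor.
            },
        })
        factory.Start(stopCh)
        return nil
    }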

Alternative solutions

  1. Using a ConfigMap: Instead of a CRD, a ConfigMap could be used to configure the ping monitoring tool. However, this approach lacks the structure and validation capabilities provided by CRDs.

  2. Using an antctl API server: We would need to register an API server and use antctl to update the parameters of the monitoring tool, and we would also need to consider the uniformity and observability of the configuration parameters across the cluster.

This proposal aims to provide a flexible and user-friendly way to monitor node-to-node latency in a Kubernetes cluster, enhancing the observability and manageability of the network performance in Antrea-managed clusters.

@Dyanngg
Contributor

Dyanngg commented Apr 1, 2024

A validation webhook won't be necessary if we simply add an OpenAPI validation rule that constrains the name of the CRD object created. See https://github.com/kubernetes-sigs/network-policy-api/blob/main/apis/v1alpha1/baselineadminnetworkpolicy_types.go#L29 as an example.
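
A minimal sketch of what such a rule could look like as a kubebuilder CEL validation marker on the proposed type, modeled on the linked example (the exact marker text and message are assumptions, not what was merged):

    // +kubebuilder:validation:XValidation:rule="self.metadata.name == 'default'",message="Only one PingMonitoringToolConfig named 'default' is allowed"
    type PingMonitoringToolConfig struct {
        metav1.TypeMeta   `json:",inline"`
        metav1.ObjectMeta `json:"metadata,omitempty"`

        Spec PingMonitoringToolConfigSpec `json:"spec,omitempty"`
    }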

antoninbas pushed a commit that referenced this issue May 31, 2024
We introduce a new feature to measure inter-Node latency in a K8s
cluster running Antrea. The feature is currently Alpha and uses the
NodeLatencyMonitor FeatureGate.

In addition to the FeatureGate, enablement of the feature is controlled
by a new CRD, called NodeLatencyMonitor. This CRD supports at most one
CR instance, which must be named "default". When the CR exists, Antrea
Agents will start "pinging" each other to take latency measurements.

Each Agent only stores the latest measured value (at least at the
moment); we do not store time series data.

We support both IPv4 and IPv6. When an overlay is used by Antrea, the
ping is sent over the tunnel (by using the gateway IP as the
destination).

This change does not add any functionality besides collecting latency
data at each Agent. A follow-up change will take care of reporting the
latency data to the Antrea Controller, so it can be consumed via an
APIService.

For #5514

Signed-off-by: IRONICBo <boironic@gmail.com>
Signed-off-by: Asklv <boironic@gmail.com>
antoninbas pushed a commit that referenced this issue Jun 18, 2024
Follow up to #6120 

See #5514 

Signed-off-by: Asklv <boironic@gmail.com>

github-actions bot commented Jul 1, 2024

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment, or this will be closed in 90 days

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 1, 2024