Community Meetings

Agenda, Minutes & Recordings for Antrea Community Meetings

YouTube Playlist

June 17, 2024

Minutes

Egress support for Antrea networkPolicyOnly mode.
- The feature request is for networkPolicyOnly mode, but the proposal could apply to other modes such as encap and hybrid.
- When using networkPolicyOnly mode, Antrea is chained with another "arbitrary" primary CNI responsible for IPAM / routing; in this experiment, Calico was used as the primary CNI.
- Calico SNAT rules are always enforced first (the Calico agent periodically enforces this), preventing Egress SNAT rules installed by Antrea from taking effect; we have to find a way to disable this behavior. No solution has been found yet.
- We need to ensure a symmetric path for return Egress traffic, which is not straightforward but possible using policy routing and OVS learned flows.
- The experiment used the iptables Calico datapath; it is not clear whether Egress is even possible when using the eBPF datapath.
Should we add an except field for ipBlock in Antrea native policies?
- See issue #6428
- Implementing this requires "expanding" the cidrs manually: the ipBlock cidr needs to be broken up into individual non-contiguous cidrs, and we need to install OVS policy flows for each one.
- We already do this today for K8s NetworkPolicy, as it is supported by this API. So from an implementation perspective, it should not be difficult.
- Consensus is that we should accept this feature request and provide this functionality.
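The CIDR "expansion" described above can be illustrated with a short sketch. This is not Antrea's implementation (which is written in Go); it only shows the idea with Python's standard `ipaddress` module. The function name `expand_ip_block` is hypothetical.

```python
import ipaddress


def expand_ip_block(cidr, excepts):
    """Return the CIDRs that cover `cidr` minus every network in `excepts`.

    Hypothetical helper for illustration only: each returned CIDR would
    correspond to one set of policy flows to install.
    """
    remaining = [ipaddress.ip_network(cidr)]
    for exc in map(ipaddress.ip_network, excepts):
        next_remaining = []
        for net in remaining:
            if exc.version != net.version or not exc.overlaps(net):
                # This exception does not touch the network: keep it whole.
                next_remaining.append(net)
            elif exc.supernet_of(net):
                # The exception covers the whole network: drop it.
                continue
            else:
                # The exception is a strict subnet: split around it.
                next_remaining.extend(net.address_exclude(exc))
        remaining = next_remaining
    return sorted(remaining)


if __name__ == "__main__":
    for network in expand_ip_block("10.0.0.0/24", ["10.0.0.0/26"]):
        print(network)
```

For a `10.0.0.0/24` block with `10.0.0.0/26` excepted, the sketch yields `10.0.0.64/26` and `10.0.0.128/25`, i.e. two separate CIDRs to match on instead of one.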
Should we add support for NotSelf / NotSameLabel matching for Antrea native policies?
- See issue #6424?
- While we acknowledge that this is a valid use case and that it would be convenient for users, there is risk in implementing this and it would be costly to implement with our current system.
- We may come up with a different, more convenient way to efficiently provide namespace / org isolation in the future.

Recording

Antrea Community Meeting 06/17/2024

June 3, 2024

Minutes

Replacing nanoserver (Windows Server 2019) with hpc (HostProcess Container) as the base container image for antrea-agent on Windows.
- See slides
- Many benefits including broader compatibility with the Windows Node OS, image size, build time, build simplicity (image can even be built on Linux).
- The image can only be run as HostProcess Containers; this is not really an issue now that we only support containerd on Windows Nodes.

Recording

Antrea Community Meeting 06/03/2024

May 20, 2024

Minutes

End-of-term presentation from all 3 mentees for this iteration of the LFX mentorship program
- Replace deprecated bincover with golang built-in coverage profiling tool - #4962 - @shikharish
  - see slides
- Node latency monitoring tool - #5514 - IRONICBo
  - see slides and watch demo included in meeting recording
  - we send ICMP echo request messages to other Nodes and process their replies, but we are not in charge of replying to requests (the OS takes care of it)
  - it may be a good idea to spread out the ICMP echo request messages over the configured interval, to avoid bursts
  - we will remove "same Node" latency measurement (Node pings itself) to avoid confusion
- Pre and Post-installation checks with antctl - #6061 #6153 - kanha-gupta
  - watch demo included in meeting recording
  - refer to Kanha's blog post
  - could we run some more advanced tests for specific features (e.g., AntreaIPAM)?
    - at the moment we run tests that should always work, regardless of the configuration in which Antrea is deployed
    - we could check the configuration, and run some feature-specific tests; one suggestion was to validate encryption when WireGuard is enabled
    - we want tests to run quickly (no more than a couple of minutes), and we don't want to duplicate existing e2e tests
  - in the future, we want to enhance pre-installation checks so that they are consistent with the intended user configuration
- A big thanks to all mentees and mentors!
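The suggestion to spread ICMP echo requests over the configured interval, and to skip the "same Node" measurement, could look roughly like the sketch below. It is an illustration only, not the code of the latency monitoring tool; `probe_offsets` and its parameters are hypothetical names.

```python
def probe_offsets(node_names, local_node, interval_seconds):
    """Return {peer_node: delay_in_seconds} with delays evenly spaced in [0, interval).

    The local Node is skipped, so no "same Node" latency is measured.
    """
    peers = sorted(name for name in node_names if name != local_node)
    if not peers:
        return {}
    step = interval_seconds / len(peers)
    return {name: index * step for index, name in enumerate(peers)}


if __name__ == "__main__":
    print(probe_offsets(["node-1", "node-2", "node-3"], "node-1", 60))
```

Each peer would be probed once per interval, at its own offset, instead of all peers being probed in a single burst at the start of the interval.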

Recording

Antrea Community Meeting 05/20/2024

May 6, 2024

Minutes

Technical discussion around possible implementation for proxyAll when kube-proxy is still present.
- See #6232
LFX mentorship term is coming to an end.
- We plan to have mentees do a quick presentation of their work at the next community meeting.
- Some of the 3 mentees may not be able to make it, in which case we will ask them to prepare a short video recording that we will play during the meeting, and mentors will field questions.

Recording

No meeting recording available for this week.

April 22, 2024

Minutes

A proposal for a composable scale-testing framework - see slides
- The framework should be extensible so that we can easily scale test new features in the future; there is ongoing work to use this framework to perform scale testing of the Egress feature.
- The current PR (https://github.com/antrea-io/antrea/pull/5772) includes test scenarios for NetworkPolicy realization and Service realization.
- Framework supports a mix of real worker Nodes and antrea-agent simulators (cluster needs to be created and CNI needs to be installed ahead of time).
- The framework itself is not specific to the Antrea CNI: K8s NetworkPolicy scale testing can be performed on all CNIs which implement the NetworkPolicy API. However, some test cases / steps are Antrea-specific and need to be skipped / disabled for other CNIs.
- The framework can deploy kube-state-metrics + Prometheus + Grafana for metrics collection / visualization.
- Graphs used to show results should use a linear scale, otherwise results are confusing / misleading.

Recording

Antrea Community Meeting 04/22/2024

April 8, 2024

Minutes

Update on some recent CI enhancements - see slides
- CI testbeds have been moved from VMC (VMware Cloud on AWS) to native AWS; the new Jenkins URL is jenkins.antrea.io
- Ongoing work to migrate additional jobs to CAPA / AWS and run more jobs in Kind
- Windows Docker jobs are being removed (starting with Antrea v2.0, we will officially only support containerd for Windows Nodes)
  - updated trigger phrases for Windows CI jobs
- All test images are being transitioned from Harbor to DockerHub (transition has already been completed for user-facing images)
Update on Antrea releases
- We have recently released 1.15.1, 1.14.3, 1.13.4 (last patch release for 1.13.x)
- Upcoming v2.0 release at the end of the month, more code reviews needed!

Recording

Antrea Community Meeting 04/08/2024

March 11, 2024

Minutes

Running concurrent CI jobs on the same VM using different Kind clusters - see slides
- We had a discussion about running Kind-in-Kubernetes instead (similar to what K8s does for CI, with Prow)
- The 2 approaches aim to solve the same issue (better utilization of CI resources), so we don't really need both capabilities
- The current approach (concurrent CI Kind jobs on the same VM) is almost ready; we will investigate the other approach (Kind-in-Kubernetes) after rolling that one out, and see if it provides additional value
- The ability to run multiple Kind test jobs on one machine may be useful for local development as well
- Initial concern that this would be complex to achieve was expressed in the issue, but we may have overestimated the complexity
Update on BGPPolicy API design - see design doc
- Is there an actual use case for BGPFilter and explicitly excluding IPs, as opposed to just using selectors to select resources for which we want to advertise IPs?
- We are not currently looking into learning and installing routes through BGP, only advertising them (the default routing policies will be used).
- Do we really have a use case for different BGP configurations / BGP processes / BGP "virtual routers"?
  - Supported by Cilium, but not by Calico
  - Most common case is probably single BGP configuration with single local ASN, even single BGP peer to which we advertise all the desired IPs
  - If we do need multiple BGP configurations in the future, maybe we should just allow multiple BGPPolicy CRs (with different local ASNs) to select the same Node(s). That would just require changing our validation strategy.
- If we omit BGP filters and support a single BGP configuration per BGPPolicy instance, we are essentially falling back to "proposal 1".
- In the first rollout phase, should we skip selectors and advertise either all IPs of a given type or none of them, based on a boolean toggle?
  - We had a user request to support selecting specific Service IPs and even Pod IPs (at least at the Namespace level)
  - If we choose not to add selectors in "phase 1", we should make sure that we design the API to accommodate selectors in the future (without breaking API backwards-compatibility). For example, an absence of selector would mean "all IPs" (if the boolean is set), while a selector would mean "only IPs which belong to selected resources". If a user does not set the selector, the behavior stays the same.
- We plan on using gobgp, but we will make sure that we abstract away the implementation with an interface in case we want to support an alternative implementation in the future.
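The backwards-compatible selector semantics discussed above (boolean toggle in "phase 1", optional selector later) can be sketched as follows. The `AdvertisementRule` type and its fields are hypothetical and do not reflect the actual BGPPolicy schema; label matching is simplified to exact key / value equality.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class AdvertisementRule:
    """Hypothetical per-IP-type rule, for illustration only."""
    enabled: bool = False
    # None means "no selector set", which keeps the phase 1 behavior.
    selector: Optional[Dict[str, str]] = None


def should_advertise(rule, resource_labels):
    """Decide whether the IPs of a resource with these labels are advertised."""
    if not rule.enabled:
        return False
    if rule.selector is None:
        # No selector: advertise all IPs of this type.
        return True
    # Selector set: only resources matching every label are advertised.
    return all(resource_labels.get(k) == v for k, v in rule.selector.items())


if __name__ == "__main__":
    phase1 = AdvertisementRule(enabled=True)
    later = AdvertisementRule(enabled=True, selector={"env": "prod"})
    print(should_advertise(phase1, {"env": "dev"}))
    print(should_advertise(later, {"env": "dev"}))
```

With this shape, an object written for "phase 1" keeps its meaning after selectors are introduced, since an unset selector is still interpreted as "all IPs".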

Recording

Antrea Community Meeting 03/11/2024

February 26, 2024

Minutes

E2e testing for the Flexible IPAM feature using Kind - see slides
- Is a separate Docker bridge really required for Flexible IPAM testing? Probably not, it was done this way because the e2e test cases assume that the K8s Nodes are all in a certain subnet, but that can be changed. Should be similar to Egress VLAN testing, which uses the default Kind docker bridge.
- Static routes installed manually on the test machine are for return traffic to the Pod primary network interface; typical requirement in noEncap mode. These routes are not specific to the Flexible IPAM feature.
Proposal for BGP support in Antrea - see slides
- We could simplify the API by removing Service & Namespace selectors, unless there is a use case for them. We could have a boolean toggle for each IP type to advertise, with no granular filtering.
- Service IPs (LoadBalancerIPs, ExternalIPs) can be advertised from all Nodes. kube-proxy or AntreaProxy (proxyAll) can then load-balance the traffic to a Service endpoint. With ECMP (BGP multi-path), ingress Service traffic can be load-balanced across a set of Nodes.
- What's the use case for applying multiple BGP policies to the same Node, with different (local) ASNs? Advertise a different set of IP addresses to different peers.
- Need to confirm that gobgp is the right choice for us (performance / feature set).
- There should be no extra config required on our side to enable ECMP for Services.
- For Services with Local ExternalTrafficPolicy, we will only advertise Service IPs from Nodes with at least one local Service endpoint.
- How quick is BGP convergence when an Egress IP is re-assigned to another Node?
- Users will need to use consistent Node selectors for BGPPolicies and Egress ExternalIPPools (same for Services with Local ExternalTrafficPolicy).
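The rule for choosing which Nodes advertise a Service IP, as discussed above, can be sketched as below. This is a simplified illustration and not the proposed implementation; `advertising_nodes` is a hypothetical helper, and the BGPPolicy Node selector is left out.

```python
def advertising_nodes(all_nodes, endpoint_nodes, external_traffic_policy):
    """Return the Nodes that would advertise a Service IP.

    `endpoint_nodes` lists the Nodes hosting at least one endpoint of the
    Service. With the "Local" policy, only those Nodes advertise the IP;
    otherwise every Node does.
    """
    if external_traffic_policy == "Local":
        return sorted(set(all_nodes) & set(endpoint_nodes))
    return sorted(all_nodes)


if __name__ == "__main__":
    nodes = ["node-1", "node-2", "node-3"]
    print(advertising_nodes(nodes, ["node-2"], "Cluster"))
    print(advertising_nodes(nodes, ["node-2"], "Local"))
```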

Recording

Antrea Community Meeting 02/26/2024

January 16, 2024

Minutes

Splitting up Agent and Controller container images to reduce their size - see slides
- The claimed size reduction for the antrea-agent image (300MB) seems a bit too extreme, given that only the antrea-controller binary (~100MB) was removed.
- Need to double check and compare size of unified image vs sizes of dedicated / split images.
- Could we have a shared layer for antctl binary across both images (antrea-agent & antrea-controller)?
- Our antctl binary seems to be excessively large (100MB). Binaries from other projects tend to be smaller even when they provide more features. For example, kubectl is only around 50MB. Maybe we can investigate how to reduce the size of binaries.
Increase minimum version requirement for K8s - see #5879
- Currently we require K8s 1.16, and we were planning to start requiring K8s 1.19.
- If we decide to require a recent K8s version (more recent than K8s 1.19), we probably need to check with users and check which K8s versions are still supported by cloud-managed K8s services.
- Current plan is to be conservative and increase it to K8s 1.19 for the next Antrea release (post v1.15).
Antrea v2.0 release - see #4832
- There is no strong reason to bump up the major version number to v2: no significant breaking change, no massive architectural change.
- There is also no strong reason not to do it; Antrea is not a library.
- We have graduated some features to GA, and we have deprecated some APIs; that could be reason enough.
- There are some remaining API changes we may want to do before 2.0 (e.g., change subnet definition in IPPool CRD), as well as some configuration changes?
- So the current plan is for Antrea v1.15 to be the last 1.X minor release; after that we will move to 2.X.

Recording

Antrea Community Meeting 01/16/2024

January 2, 2024

Minutes

Antrea v1.15 release status update

Recording

Antrea Community Meeting 01/02/2024

December 19, 2023

Minutes

Presentation of updated Antrea ROADMAP - see #5807
- The CNCF gave us some feedback during our annual review that we should update the ROADMAP.
- Multiple items were either already completed or no longer planned; some new items were missing.
- PR will be merged in early Jan, please review / leave comments before then.
VLAN tagging for Egress - see slides
- How many VLAN sub-interfaces can be created for a given parent interface? Not sure, but this is unlikely to be an actual limitation compared to the limit on the number of route tables (~250).
  - we only need one sub-interface per subnet, and we currently limit the number of subnets to 20
- 1-1 mapping between subnet and route table.
- Users must take care of configuring the "physical" Node network with their chosen VLAN IDs.
- Does the current design support subnet overlap between different VLANs?
  - the uplink router(s) may support this (same IP in 2 different VLANs would map to a different SNAT IP)
  - not covered by the current design, maybe not a very realistic use case (maybe we can assume that if one wants to use Egress, there is no other SNAT happening for the traffic, or at least no need to share Egress IPs across VLANs)
- The packet mark logic has not changed compared to the current Egress implementation; a specific packet mark maps to a specific Egress IP, and now also to a specific route table.
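The relationships described above (one route table per subnet, at most 20 subnets, and a packet mark that maps to both an Egress IP and a route table) can be sketched as a small bookkeeping structure. This is an illustration under assumptions: the class, its method names, and the starting table ID are hypothetical and not taken from the Antrea code.

```python
MAX_SUBNETS = 20  # subnet limit mentioned in the minutes


class EgressTables:
    """Hypothetical bookkeeping for Egress subnets, route tables and marks."""

    def __init__(self, first_table_id=101):  # starting ID is an assumption
        self._table_by_subnet = {}
        self._by_mark = {}
        self._next_table = first_table_id

    def table_for_subnet(self, subnet):
        """Return the route table for a subnet, allocating one if needed (1-1 mapping)."""
        if subnet not in self._table_by_subnet:
            if len(self._table_by_subnet) >= MAX_SUBNETS:
                raise ValueError("too many Egress subnets")
            self._table_by_subnet[subnet] = self._next_table
            self._next_table += 1
        return self._table_by_subnet[subnet]

    def assign(self, mark, egress_ip, subnet):
        """Associate a packet mark with an Egress IP and its subnet's route table."""
        self._by_mark[mark] = (egress_ip, self.table_for_subnet(subnet))

    def lookup(self, mark):
        """Return (egress_ip, route_table_id) for a packet mark."""
        return self._by_mark[mark]


if __name__ == "__main__":
    tables = EgressTables()
    tables.assign(1, "10.10.0.5", "10.10.0.0/24")
    tables.assign(2, "10.20.0.5", "10.20.0.0/24")
    print(tables.lookup(1), tables.lookup(2))
```

Two Egress IPs in the same subnet share a route table, while IPs in different subnets (VLANs) resolve to different tables.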
This was our last meeting of the year! Thanks for a great year 2023, and we are looking forward to 2024.

Recording

Antrea Community Meeting 12/19/2023

December 4, 2023

Minutes

Review of open issues for secondary networks support
- #5047: Antrea as secondary CNI with Multus
  - definitely requires code changes to the Antrea Agent, not a use case we had in mind for Antrea
  - still waiting for a compelling use case from users requesting the feature
  - can we reuse the SecondaryNetwork work (in particular, VLAN support for secondary network interfaces) to achieve this? Unclear at the moment. The SecondaryNetwork feature uses a controller-based approach (K8s controller watching for Pod annotation updates) and network provisioning is done asynchronously from the CNI Add call. This is not what "secondary CNIs" used with Multus (e.g. macvlan) usually do.
  - secondary VLAN networks / secondary overlay networks
  - do users expect additional K8s features for secondary networks (NetworkPolicy enforcement, Service load-balancing)? If yes, this starts looking more and more like the upstream effort to support multi-networks for Pods as part of K8s itself.
- #5693: Configuring multiple RX / TX queues for Pod veths
  - feature request is for secondary networks only
  - can achieve better throughput for multiple concurrent connections between Pods (tested with Pods on the same Node) when Pods have access to multiple CPUs
  - no objection to supporting this, but we may not have the cycles to work on this
- #5693: Ability to provision secondary network interfaces (VLAN networks) without an IP address
  - issue already addressed, a patch has been merged
- #5735: Use a Node's primary NIC as the secondary OVS bridge physical interface
  - Jianjun thinks it is doable, similar to the Antrea "bridging mode" which already exists for the primary network interface
  - no one is actively working on this

Recording

Antrea Community Meeting 12/04/2023

November 20, 2023

Minutes

Proposal to drop support for the Docker CE (Moby) container runtime on Windows
- See slides.
- Drop support for rancher/wins installation method == Drop support for Docker?
- Key point is that we no longer test Docker CE support in Windows CI, so we are not really in a position to claim support.
- Proposed action items for Antrea v1.15: deprecate Docker support (documentation change) + remove unused CI scripts.
- No plan at the moment to deprecate running OVS daemons + Antrea Agent as Windows Services.
- In the future, we may only offer the HostProcess containers method.
Proposal to support Egress on Windows
- See slides.
- Proposal wants to add the following:
  - ability to assign Egress IPs to Windows Nodes
  - Linux Pods can egress through Linux / Windows Nodes
  - Windows Pods can egress through Linux / Windows Nodes
- More discussion required for implementation of SNAT on Windows Egress Nodes.
  - differentiate between Egress reply traffic which we need to un-SNAT, and traffic from source Node that requires SNAT (dest IP is the same -> Egress IP)
  - is "learn" flow really needed?
- Demo of PoC implementation.
- One suggestion is to only support Linux Nodes as Egress Nodes.
  - simplified datapath implementation, no functional difference for users
  - K8s clusters will always include Linux Nodes - no such thing as a "Windows-only" cluster
  - unlikely to have an availability zone with only Windows Nodes available for Egress
  - Egressing traffic from Linux Pods through Windows Nodes doesn't seem like a very good idea (potential stability issues)
No time to discuss "secondary network" feature requests, postponed to next meeting.

Recording

Antrea Community Meeting 11/20/2023

November 6, 2023

Minutes

Proposal for Node NetworkPolicies support
- See issue #5671
- See design slides
- It's a popular user request
- Similar to NetworkPolicy enforcement for the ExternalNode case.
  - ExternalNode can have stability issues because the physical interface is moved to the OVS bridge
- We propose to find an alternative solution that will apply to both ExternalNode and Node NetworkPolicies.
- The simplest solution is iptables-based, with the only drawback being increased connection latency with a large number of rules.
  - we expect the number of rules to be on the smaller side for this use case (Node NetworkPolicies)
  - it's not clear that other solutions would not suffer from this issue
- API change for ClusterNetworkPolicy only (new nodeSelector in appliedTo field).
- Node NetworkPolicies for Windows?
  - different approach required for Windows
  - compatibility issues between OVS and Windows Firewall rules
  - for Windows, we always need to move the uplink to OVS anyway (like for ExternalNode)
  - not planned at the moment, but we could enforce Node NetworkPolicies in OVS (single uplink interface only)
- Using ClusterNetworkPolicy vs introducing a new API / CRD?
  - there are use cases for selecting Pods as peers for Node NetworkPolicies
- Dataplane implementation deep-dive and demo.
  - see slides

Recording

Antrea Community Meeting 11/06/2023

October 23, 2023

Minutes

Antrea E2E tests flakiness
- A script collects test failures of Kind E2E tests for the main branch.
- The script runs every week automatically and generates reports for 30-day failures and 90-day failures.
- See this repo for script and generated reports.
- Some flaky tests have already been fixed since this project was started.
- Most test failures come from the TestAntreaPolicy/TestGroupNoK8sNP test group.
  - So far we have not been able to root cause the failures. Quan has been experimenting with different PRs to try and troubleshoot the issue. With PR #5507, which was meant for troubleshooting, the failures don't seem to happen anymore.
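The 30-day and 90-day failure reports mentioned above boil down to counting failures per test within a time window. The sketch below illustrates that idea only; it is not the script from the referenced repo, and the input format (pairs of test name and failure date) is an assumption.

```python
from collections import Counter
from datetime import date, timedelta


def failure_report(failures, today, window_days):
    """Count failures per test within the last `window_days` days.

    `failures` is an iterable of (test_name, failure_date) pairs.
    Returns (test_name, count) pairs, most frequent first.
    """
    cutoff = today - timedelta(days=window_days)
    counts = Counter(
        name for name, when in failures if cutoff <= when <= today
    )
    return counts.most_common()


if __name__ == "__main__":
    sample = [
        ("TestA", date(2023, 10, 20)),
        ("TestA", date(2023, 10, 1)),
        ("TestB", date(2023, 8, 15)),
    ]
    today = date(2023, 10, 23)
    print(failure_report(sample, today, 30))
    print(failure_report(sample, today, 90))
```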
Helping users with migrating from Calico to Antrea - see issue #5578 and slides
- The presenter had some microphone issues, so we apologize for the poor audio quality in the first part of the presentation.
- The idea is to provide tooling to enable CNI migration with minimal workload downtime, and the ability to convert NetworkPolicy CRs when possible.
- Did we have requests from users to provide this?
- We need to have existing Pods switch over from the Calico network to the Antrea network. The current proposal is to kill the sandbox container to force a new CNI ADD invocation. We need to be more granular about which containers we kill, i.e., exclude hostNetwork Pods.
- NetworkPolicy CR migrator:
  - The 2 NetworkPolicy APIs are quite different, so converting correctly may be quite challenging, and there are some resources that we may not be able to convert at all.
  - We need to introduce a "dry-run" mode so that cluster admins can check ahead of time which policies can be converted and which cannot, and make an informed decision about how to proceed.
- How will the migrator be packaged? The plan is to have an antctl subcommand.
- Apart from NetworkPolicy conversion and cleanup of stale Calico resources, the process is generic, so we could reuse it for other CNIs if needed (e.g., Flannel).
- The migrator should be ready for the Antrea v1.15 release.

Recording

Antrea Community Meeting 10/23/2023

October 9, 2023

Minutes

Very short meeting, we discussed pending items for the next Antrea release - Antrea v1.14.

September 25, 2023

Minutes

Several folks are representing Antrea at KubeCon in China
CI improvements using CAPA (Cluster API AWS) - see slides
- New CI test matrix for Antrea e2e tests: 4 most recent K8s versions x 2 most recent Ubuntu LTS versions (for Nodes)
- Run using Jenkins weekly and on-demand
- Some improvements are needed for workload cluster resource deletion (AWS compute instances and LB)
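The test matrix described above (4 most recent K8s versions x 2 most recent Ubuntu LTS versions) gives 8 combinations. A minimal sketch of how such a matrix could be generated is shown below; `build_matrix` is a hypothetical helper, and the version strings are placeholders rather than the versions actually used in CI.

```python
from itertools import product


def build_matrix(k8s_versions, ubuntu_versions, k8s_count=4, ubuntu_count=2):
    """Pair the most recent K8s versions with the most recent Ubuntu LTS versions.

    Both input lists are assumed to be sorted from oldest to newest.
    """
    return list(product(k8s_versions[-k8s_count:], ubuntu_versions[-ubuntu_count:]))


if __name__ == "__main__":
    # Placeholder version lists, for illustration only.
    k8s = ["k8s-a", "k8s-b", "k8s-c", "k8s-d", "k8s-e"]
    ubuntu = ["lts-x", "lts-y", "lts-z"]
    for combination in build_matrix(k8s, ubuntu):
        print(combination)
```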

Recording

Antrea Community Meeting 09/25/2023

September 11, 2023

Minutes

Proposal for new PacketSampling CRD - see #5443
- When capturing traffic, will we capture traffic in both directions (request + reply) for a "connection"?
  - Capturing in both directions is useful to troubleshoot latency issues, retransmissions, etc.
- CRD definition: parameters which are specific to a sampling method should be grouped. Best practice is to use a "oneOf", with a specific CRD field for each case (i.e., each sampling method).
- The format for captured packets is PcapNG. Can tcpdump read these files?
  - Probably, need to double check.
- It's tedious to provide file server connection and authentication details in every CR. Is there a better way?
  - We could consider introducing a new CRD for users to provide connection information; SupportBundle CRs and PacketSampling CRs could then refer to that object, to avoid redundancy for users.
  - Let's wait for user feedback?
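The "oneOf" recommendation for the PacketSampling CRD means that exactly one sampling-method field should be set in a given CR. The sketch below illustrates that check on a plain dictionary; the field names `firstNSampling` and `randomSampling` are hypothetical and are not taken from the proposal.

```python
def validate_one_of(spec, method_fields):
    """Return the single sampling-method field set in `spec`.

    Raises ValueError if zero or more than one of the fields is set.
    """
    present = [field for field in method_fields if spec.get(field) is not None]
    if len(present) != 1:
        raise ValueError(
            f"exactly one of {sorted(method_fields)} must be set, got {present}"
        )
    return present[0]


if __name__ == "__main__":
    # Hypothetical field names, for illustration only.
    fields = ["firstNSampling", "randomSampling"]
    print(validate_one_of({"firstNSampling": {"number": 5}}, fields))
```

Grouping method-specific parameters under one field per method keeps each group self-contained, and a check like this rejects ambiguous objects early.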
Antctl support for VM Agent case - see slides
- Currently not supported; when running antctl on a VM managed by Antrea, it will default to "controller" mode (out-of-cluster), which is incorrect.
- Compared to regular agent mode, we can only support a subset of commands (as many commands are designed for K8s Nodes running Pods); hence we will focus on NetworkPolicy commands.
- Proposal is to introduce a flag to specify the antctl "mode" (optional flag, default behavior stays the same); the flag will be required for the VM case.
- New command (antctl get entityinterface)
  - Proposal to rename command from entityinterface to vminterface for consistency with the mode name
- Can the mode be read from the environment?
- Maybe there should be a way to persist the mode to a local config file (in the home directory?), to avoid having to repeat the mode every time? E.g., antctl set mode vm. It would also persist across shell sessions and reboots.
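The three ways of providing the antctl mode raised above (explicit flag, environment, persisted config file) could be combined with a simple precedence order. The sketch below is only one possible arrangement: the precedence itself, the `ANTCTL_MODE` variable name, and the JSON config format are all assumptions, not decisions recorded in these minutes.

```python
import json
import os
from pathlib import Path


def resolve_mode(flag_value=None, env=None, config_path=None):
    """Resolve the antctl mode: flag first, then environment, then config file.

    Returns None when nothing is set, meaning "keep the default behavior".
    """
    env = os.environ if env is None else env
    if flag_value:
        return flag_value
    if env.get("ANTCTL_MODE"):  # hypothetical variable name
        return env["ANTCTL_MODE"]
    if config_path is not None and Path(config_path).is_file():
        # Hypothetical config format: a JSON object with a "mode" key.
        return json.loads(Path(config_path).read_text()).get("mode")
    return None


if __name__ == "__main__":
    print(resolve_mode(flag_value="vm", env={"ANTCTL_MODE": "agent"}))
    print(resolve_mode(env={"ANTCTL_MODE": "agent"}))
    print(resolve_mode(env={}))
```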

Recording

Antrea Community Meeting 09/11/2023

August 28, 2023

Minutes

Very short meeting, nothing on the agenda
Release cadence for new Antrea minor versions is changing from 8 weeks to 12 weeks
New PacketSampling CRD proposal will be discussed at the next meeting - see #5443

August 14, 2023

Minutes

Feature Gate promotions for Antrea v2: see #5068
- Consensus on promoting AntreaProxy and EndpointSlice from Beta to GA
  - enabled by default for a while and used widely
  - new features which are added to AntreaProxy (e.g., DSR) typically get their own Feature Gate
  - for AntreaProxy, we will add a boolean toggle to the antreaProxy configuration section, for users who still want to disable it (e.g., because they prefer using kube-proxy IPVS mode)
- Not enough confidence for other Feature Gates
  - ServiceExternalIP needs more testing / verification in production scenarios; more user requirements (e.g., VIP sharing) are still pending implementation
  - We still have known issues for L7 NetworkPolicies: users need to disable checksum offload which may impact datapath performance
  - We also want to be conservative for the FlowExporter feature: not enough user feedback at scale, recent modifications to config, new key functionality added recently (e.g., TLS support for Flow Aggregator), improvements to the implementation are being investigated (e.g. using conntrack events in addition to polling), ...
  - ExternalNode: still working through issues in the public cloud, feature is still new
  - SupportBundleCollection (for ExternalNode): only supports SFTP for now, maybe we want to add support for cloud storage (e.g., S3)?
- IPsecCertAuth: maybe we should consider enabling it by default if it doesn't have any scale implications and is reasonably self-contained; let's do some more manual verification and evaluate the risk of promoting it to Beta
- We could create separate issues to track requirements for Feature Gate promotions
A couple of general updates:
- Go v1.21 has been released recently; we are working on migrating all Antrea projects from 1.19 (no longer maintained) to 1.21
- HashiCorp has changed the license for its most popular projects (Vagrant, Terraform, ...) to a "source available, non-open source" license (MPL -> BUSL)
  - The HashiCorp Go libraries that we use (e.g., memberlist) are not affected
  - We do have some Terraform scripts, as well as some Vagrant usage, that could be impacted
  - We are waiting for some guidance from the CNCF - see #617

Recording

Antrea Community Meeting 08/14/2023

July 31, 2023

Minutes

Traceflow "FirstN" sampling
- See slides
- Implementation: PacketIn messages (same as Traceflow) vs OVS native sampling support (IPFIX)?
- How can users retrieve the results of the sampling Traceflow?
  - Cannot use the CRD Status as the results are too large
  - We will use the same methodology as for support bundle: API endpoint to download the result data (maybe in pcap format)
  - User-friendly consumption when using Antctl
- We should have a reasonable upper limit for N, to limit disk usage
- Does the hard timeout of 300s still apply for Traceflow sampling?
- Should we define a new CRD for this functionality?
  - Differs significantly from the existing "liveTraffic" Traceflow, which only captures packet headers
Proposal for several CI improvements
- See slides
- Instead of a manual command to kill "stale" jobs, should this be done automatically when a PR is updated?
  - Need to ensure that stale test clusters are deleted properly to reclaim resources; maybe we need a separate cleanup job for this (like we do for public cloud tests: EKS / AKS / GKE)?
  - For a few PRs (e.g., release PRs), it is better to run the job to completion to avoid repeating tests every time the PR is updated (e.g., change in release notes)
  - Instead, should typing another /test-X command be the trigger to cancel previous jobs for the same PR?
- Normalize Jenkins job names / commands for ipv6: "ipv6-ds" vs "ipv6"

Recording

Antrea Community Meeting 07/31/2023

July 17, 2023

Minutes

Tech talk about XMasq: Bring Cache to the Container Overlay Network
- See slides; see paper
- Using an overlay network adds flexibility but has significant impact on network performance.
- The XMasq solution is not specific to any CNI in particular, but at the moment it cannot work with CNIs which rely on an eBPF dataplane (Cilium, Calico). This is why it has been tested with Antrea.
- XMasq can co-exist with the OVS dataplane; when there is a cache hit in XMasq, the OVS bridge is bypassed.
- Using XMasq means that some datapath features will not be available.
- The cache is only updated for encapsulated packets, so Pod-to-External traffic is not impacted.
- NetworkPolicy implementation is still a work-in-progress.
  - stale entries are not removed from the cache
  - current implementation is "stateless", and does not track individual connections
- The eBPF program does not have a significant impact on latency of the first packet (cache-miss path).
- The current implementation does not support Pod-to-Service traffic (destination IP is ClusterIP), which seems like a big limitation.
- Value of XMasq for primary Pod network vs secondary Pod networks (specialized use cases which don't require as many features but require higher throughput)?
Antrea OVS containerization on Windows (with containerd only)
- Ability to run OVS userspace processes in a container (as Pod)
  - symmetry with Linux
  - easier OVS upgrades
- Dependency on the Windows HostProcess feature, so only available with containerd.
- New container image is based on servercore vs nanoserver (big size difference).

Recording

Antrea Community Meeting 07/17/2023

July 3, 2023

Minutes

Short meeting where we discussed API and Feature Gate promotions for upcoming Antrea releases.
- see open PRs: https://github.com/antrea-io/antrea/pulls?q=is%3Apr+is%3Aopen+label%3Aapi-review

Recording

Antrea Community Meeting 07/03/2023

June 20, 2023

Minutes

Design proposal for Antrea Controller High-Availability (HA)
- See slides
- If the Node running the Antrea Controller goes down, it can take more than 5 minutes for the Controller Pod to be rescheduled to another healthy Node; we can get this down to 40s easily with the appropriate "tolerations" for the Antrea Controller Pod, but that still may be too much for some users.
- Active-active vs Active-standby: Active-active is more complex and not really needed in our case (we have stress tested the Controller with very large clusters, using a single replica).
- Leader election can be tuned to failover to the standby in around 15s.
- No "perfect" solution to route the antrea Service traffic to the active replica; 3 possible solutions with different drawbacks.
  - currently, the preferred solution is "Service without selector" (refer to slides)
- Before implementing Active-standby HA, we should confirm whether a 40s delay is good enough for our users.
- No state synchronization needed across replicas: all the state is persisted to K8s / etcd.
- Different Service definitions for "HA mode" and "single-replica mode" (so different YAML manifests) at first to avoid disruption to users.
  - we want to evaluate the "Service without selector" solution and if it works well, we can use the same approach even for the single-replica case
- Are we aware of other projects using a similar approach for HA? Not really. The K8s apiserver uses a similar approach, but for different reasons.
- We have a somewhat unique architecture in Antrea where the Antrea Controller is used both as API server (serving in-memory data) and controller (running computations, processing state). If the 2 functionalities were split, the API server would use Active-active mode (multiple replicas could serve APIs from a distributed store) and the controller would use Active-standby (no need for traffic routing, just leader election).
- Using the "Pod readiness" approach does not help us fail over faster than 40s, so it's not better than using a single replica and setting tolerationSeconds to 0.

Recording

Antrea Community Meeting 06/20/2023

June 5, 2023

Minutes

New implementation for NetworkPolicy Logging - see slides
- We recently became aware of issue #5018: enabling logging can cause massive drop of user traffic (for Allow NP rules).
- New proposal is to use separate OVS flows for the SendToController action, so that the meter action only applies to these flows.
- Use group(ALL) to make a separate copy of the packet, that will be used for logging purposes.
- If the only action for the packet is sending to controller for logging, we probably do not need to define a group with a single bucket.
- Copy should only happen for the first packet of the connection, subsequent packets bypass that part of the pipeline thanks to conntrack lookup.
- No impact for NetworkPolicy rules that do not have logging enabled.
- SendToController is moved to the end of the pipeline. Will it be a problem in the following scenario: an egress policy rule with Allow action and logging enabled is applied to the packet, then an ingress policy rule with Drop action is applied to it? Will the logging happen correctly in this case?
  - It should work correctly, given that the copy created for logging will go directly to the Output table and will not be processed by Ingress NetworkPolicy tables
- Is creating packet copies an issue for the following scenario: large UDP flow hitting a policy rule with Drop action and logging enabled? Even though the meter will prevent the Agent from having to process too many PacketIn messages, we will still create a copy in the OVS datapath for each packet in the flow.
- This fix should be part of Antrea v1.13
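
The difference between the two designs can be illustrated with a toy model. This is only a sketch of the idea discussed above, not of the actual OVS flows: it counts packets hitting an Allow rule with logging enabled during one meter interval, and the function names and the notion of a per-interval "budget" are assumptions made for the example.

```python
def metered_with_forwarding(num_packets, meter_budget):
    """Old behavior (issue #5018): the meter sits on the same flow that
    forwards the packet, so packets over the budget are dropped entirely."""
    passed = min(num_packets, meter_budget)
    return {"forwarded": passed, "logged": passed}


def metered_copy_only(num_packets, meter_budget):
    """Proposed behavior: the meter only applies to the copy sent to the
    controller, so user traffic is forwarded regardless of the budget."""
    return {"forwarded": num_packets, "logged": min(num_packets, meter_budget)}


if __name__ == "__main__":
    print(metered_with_forwarding(1000, 100))
    print(metered_copy_only(1000, 100))
```

In both cases the Agent receives at most the metered number of PacketIn messages; only the first design lets the meter affect user traffic.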

Recording

Antrea Community Meeting 06/05/2023

May 22, 2023

Minutes

Antrea scale testing for agentless VMs
- Scale testing in the context of Nephe: VMs are onboarded into Antrea using the ExternalEntity API, and are selected by Antrea-native policies
- 1 Namespace, 10K ExternalEntities, 10K policies
- Results: 17 seconds to recompute policies, 1000m CPU, 300MB
- There was a question about whether the methodology used to measure resource consumption (metrics exposed by kubelet) was accurate enough
DSR support for LoadBalancers in Antrea
- user issue: #4956
- slides
- design issue: #5025
- connection will be invalid in conntrack on the ingress Node (no return traffic is observed); but OVS doesn't expose ct_mark and ct_label for invalid connections, which are needed to store connection state (which backend Node was selected)
  - one possible solution is to leverage the OVS learn action; but there is a latency between flow learning and datapath modification, which would create issues if the next packet in the connection is received before the datapath is ready
  - other solutions will be investigated
- network performance metrics could be better with DSR if implemented well
- DSR mode can have an impact on NetworkPolicy enforcement, given that the source client IP is preserved
Enforcing NetworkPolicy logging can have a huge impact on performance
- see #5018
- this was not the original intent when adding an OVS meter to rate-limit PacketIn messages: PacketIn messages should be rate-limited but it should not impact user traffic
- we should address it in the v1.13 release time frame
Node policy support in Antrea
- see #4213
- this is a legitimate use case, but we are not sure what's the best way to implement it
- we could leverage the work done for ExternalNode support and move the Node's transport interface to the OVS bridge
  - one risk is that managing the physical network interface is complicated and can depend on Node OS and hardware
  - a bug in Antrea could impact Node connectivity

Recording

Antrea Community Meeting 05/22/2023

May 8, 2023

Minutes

Using Antrea ProxyAll on Windows to replace kube-proxy user-space
- Kube-proxy user-space has been dropped in K8s v1.26, we now need to rely on AntreaProxy
  - Antrea cannot co-exist with kube-proxy Windows kernel mode (HNSNetwork Extension conflict)
- ProxyAll will be enabled by default for Windows starting with v1.12
- Antrea cannot use the ClusterIP to access the Kubernetes Service (we need the kubeAPIServerOverride option to be set)
- In Antrea v1.10.0, Windows crash (BSoD) was observed when ProxyAll was enabled (has been fixed and back ported)
  - Working theory so far is that there was a traffic loop caused by some missing flows -> high CPU usage and eventually system crash
  - May be applicable to Linux (with less dramatic consequences)
Antrea v1.12 release status update
- See in-flight PRs

Recording

Antrea Community Meeting 05/08/2023

April 24, 2023

Minutes

Antrea v1.11.1 has been released
- Includes some important bug fixes for AntreaProxy
- Quan is in the process of backporting them to v1.10
Update on Antrea UI - see slide
- First release (v0.1.0) will be this week
Proposal for a new exporter for the FlowAggregator to log all connections to a local file - see slides
- Motivated by user issue #3794
- Blocked connections are included in audit logs, is it also the case for the Flow Aggregator?
  - Yes, and they both rely on the same mechanism in the Antrea Agent (OVS PacketIn messages)
  - We have rate-limiting in place, so in both cases, if there are too many PacketIn messages, some will be dropped
- For connections committed to conntrack on Linux, the Flow Exporter polls conntrack every ~~60s~~ 5s.
  - Poll interval should be small enough not to miss connections
- If there is a delay between IP to Pod name translation, could the information be stale / invalid?
  - It's an issue for audit logging if information is missing / incomplete / invalid
  - Yes it's possible, we should see what we can do to avoid that

Recording

Antrea Community Meeting 04/24/2023

April 10, 2023

Minutes

Antrea CI enhancements over the last couple of months - see slides
- Windows testbeds: we currently support both Docker and containerd; we may drop Docker in the future to reduce the number of jobs
Antrea v2.0 plans - see issue #4832
- Serving our production-ready APIs using an Alpha version doesn't send the right message to our users; some of these APIs haven't changed in years
- If you have enhancements to suggest for existing APIs (may or may not be backward compatible), now is the right time to suggest them
- We will follow our documented best practices for API deprecation & removal, and provide tooling for users to easily migrate the stored version of their existing CRs

Recording

Antrea Community Meeting 04/10/2023

March 14, 2023

Minutes

Status update for upcoming release (Antrea v1.11 / Theia v0.5)
We want to enforce new rules for running Windows CI jobs, in order to avoid breaking Windows support in the future
1. as before, if a patch modifies / adds any Windows-specific source file, all Windows tests should pass
2. for every patch, the job that runs e2e tests on a Windows containerd testbed should pass
3. we will add a periodic Jenkins job to run all Windows tests on the main branch
4. we want to improve speed and robustness of Windows CI jobs
The new mandatory Windows job should run automatically as part of /test-all, and we should have a corresponding /skip-* command

Recording

Antrea Community Meeting 03/14/2023

February 27, 2023

Minutes

Proposal for a custom Antrea UI to replace the Antrea Octant plugin - issue #4640
- React-based UI, using the Clarity design system
- Demo of the UI prototype, which supports Traceflow
- Suggestion to develop using a Lens plugin instead, as Lens has become the de facto replacement for Octant and has a better plugin ecosystem
  - Antonin to look into Lens
  - Plugin-based mechanisms are by nature not as flexible / extensible as a custom UI
  - Ideally we would not require users to deploy any other piece of software to access our UI
- Built-in authentication mechanism: password-based login and JWT token for accessing the backend APIs
- Is this also meant to replace our Grafana dashboard for Flow Visibility?
  - It would be good to have a unified UI for everything, but porting all dashboards may be quite a bit of work
  - Team would need to get familiar with React & Javascript libraries for rendering dashboards
CI pipeline to test Antrea with Rancher - slides
- We hope that this new CI pipeline will help make Antrea an officially-supported CNI plugin for Rancher
AntreaProxy enhancements plan - slides
- Some enhancements are needed in AntreaProxy to catch up with latest upstream API changes for Services (e.g., ProxyTerminatingEndpoints)
- We need to pass more upstream conformance tests when proxyAll is enabled and kube-proxy is removed

Recording

Antrea Community Meeting 02/27/2023

February 13, 2023

Minutes

Nothing on the agenda, so very quick meeting.
We briefly discussed the implications of running Antrea on a cluster where SELinux is enabled on the Nodes.

January 30, 2023

Minutes

Throughput Anomaly Detection in Theia - slides
- 3 different algorithms supported to detect anomalies
- Ongoing work to make the results easier to consume
- Plan is to support running TAD in the background on "real-time" data
- Test network data came from an actual Antrea cluster, synthetic anomalies were injected into the data manually
- It should be possible to tune the algorithm(s) to make detection less sensitive
- With default Flow Exporter / Flow Aggregator settings, we do not have many data points (one data point per connection per minute)

Recording

Antrea Community Meeting 01/30/2023

January 17, 2023

Minutes

Secure Wireguard tunnels for traffic between clusters in multi-cluster deployments - slides
- Intra-cluster traffic does not change
- For inter-cluster traffic, Geneve traffic will be encapsulated with Wireguard
  - Why not replace Geneve with Wireguard instead? We want to avoid too many changes to the datapath; the Geneve VNI field is required for Stretched NetworkPolicies
- Traffic that needs to be routed to Wireguard is marked with a special mark in OVS pipeline (when the dest IP matches the Service CIDR for a remote cluster).
  - AI: check if the selected packet mark value is consistent with other mark(s) used in Antrea and update it if necessary to prevent conflicts
- At the moment, we are not considering Wireguard for both intra-cluster encryption and inter-cluster encryption
- Wireguard vs IPsec: an Antrea user requested Wireguard support so this is what we are supporting now; some users may want IPsec instead (for FIPS compliance)
- Need to check if rp_filter needs to be changed for the Wireguard interface
- Only one packet mark is needed, no matter how many other clusters (Wireguard peers) are in the cluster set; Wireguard handles the routing
- We need one OVS flow for each remote cluster (same as when Wireguard is enabled, we just add one action to each flow to set the packet mark)

Recording

Antrea Community Meeting 01/17/2023

January 3, 2023

Minutes

Ofnet enhancement plan
- Priorities are the deadlock bug, logging improvements and switching to a buffered channel
Review of user issues that need to be triaged / addressed (18 open issues for feature requests)
- #4309: Allow multiple Services to share the same LB IP [ServiceExternalIP]
- #4246: Default ExternalIPPool for ServiceExternalIP, so that the feature can be used with controllers which automatically create Services
  - instead of having a global default IPPool, we could support Namespace-level annotations to let users specify a different IPPool for each Namespace
- #4385: Ability to fail-over Egress across multiple subnets (e.g., AZs in the public cloud)
  - for each Egress resource we would have 2 static EgressIPs or 2 ExternalIPPools (primary / backup)
- #3805: In public cloud, the control plane is not typically part of the cluster itself, so it can be difficult to define policies which select traffic from the control plane (the nodeSelector cannot be used).
  - user proposal is to have an endpointSelector, but this seems convoluted and applicable only to this very specific use case
  - the user proposal may not work if the kubernetes control plane service resolves to a load-balancer IP
  - GKE uses apiserver-network-proxy; in that case the source IP for control plane traffic would be the IP of the proxy agent Pod
  - this is not Antrea-specific, but should affect all CNIs
  - Yang will follow up on the issue
- #3794: NetworkPolicy audit logs are missing source and dest Pod namespace and name
  - Valid request, but this information is not all available in the Antrea agent and cannot be included in the local log files
  - We could have centralized logging, but we want to avoid duplication with Flow Aggregator and Theia
  - Some users can be reluctant to deploy Theia just for this information
  - A solution could be to add this functionality to the Flow Aggregator: it already has all the required information, and could generate a centralized log file
  - Antonin will follow up on this issue
- #4213: NetworkPolicy support for Node traffic
  - Not possible in Antrea today, as the Node traffic is not managed by Antrea / OVS (except when FlexibleIPAM is enabled, or for ExternalNode)
  - We still believe that it is an important feature to have; in Calico, this is implemented using iptables
- #3540: upstream NetworkPolicy Status support
  - at the moment, the feature can only be used by CNIs to report whether endPort (port range) is supported
  - Yang will look into this
It has come to our attention that Octant is no longer actively maintained, with the last commit dating to 1 year ago
- We want to find an alternative, but we don't know yet what it will be

Recording

Antrea Community Meeting 01/03/2023

December 19, 2022

Minutes

Short status update for Antrea v1.10
- Major new features for this release are L7 NetworkPolicy support and changes to support bundle collection (new SupportBundleCollection CRD and support for ExternalNode)
Short status update for Theia v0.4
Next meeting is postponed by 24 hours because of EOY holidays

Recording

Antrea Community Meeting 12/19/2022

December 5, 2022

Minutes

L7 NetworkPolicy demo
- Open question on how users can specify that only a specific peer can access the application using L7 NetworkPolicies

Recording

Antrea Community Meeting 12/05/2022

November 21, 2022

Minutes

Update to L7 NetworkPolicy API
- Main change: L7 NetworkPolicy capabilities will be added to existing L3/L4 NetworkPolicy API.
- Rules can be port-agnostic (all traffic is sent to the L7 engine) or port-specific (traffic which doesn't match the specified port will not go through the L7 engine at all).
- The L4 ports field is used to scope the traffic that needs to be sent to the L7 engine.
- Any policy rule after a L7 policy rule will be ignored (applied L7 policy rules are terminal).
Update on networkPolicyOnly mode with multi-cluster
- A tunnel interface needs to be created on each Node for cross-cluster traffic.
- Change of plan for handling reply traffic from gateway to general mode: use a L3Forwarding rule for each cluster Pod instead of relying on CT label. These rules are installed in the gateway's OVS bridge. Assumption is that the number of Pods is not that large in networkPolicyOnly mode. We have chosen this option because it is simpler, with no significant performance difference (performance is better with small number of rules, not measured with large number of rules).
- Some open questions for stretched NetworkPolicy support

Recording

Antrea Community Meeting 11/21/2022

November 7, 2022

Minutes

Review open questions for L7 NetworkPolicy API & implementation
- see slides
- a follow-up discussion is needed to determine the API behavior for "unsupported" protocols
Benchmark results for L7 NetworkPolicy implementation with Suricata
- see slides
Proposal for supporting networkPolicyOnly mode with multi-cluster
- see slides

Recording

Antrea Community Meeting 11/07/2022

October 24, 2022

Minutes

Finer-grained datapath updates in AntreaProxy
- Motivations:
  - Avoid unnecessary datapath updates (OVS flows, Linux routes & ipsets) for some Service changes
  - Better organized code
- Proposal should be revised to account for the fact that some Service Spec fields are immutable (e.g. ClusterIP unless some specific conditions are met)
- Supporting incremental endpoint updates with Openflow (incremental bucket updates) seems more important than optimizing for very infrequent scenarios (e.g., changing the Service Type or the NodePort).
- For OVS flows, we have an in-memory cache, so the benefits of this new approach may be small (could be different for routes / ipsets).
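
As an illustration of what an incremental endpoint update involves, the sketch below computes which endpoints were added and which were removed between two versions of a Service, so that only the corresponding buckets would need to be touched. The function name and the "ip:port" string representation are assumptions for the example; this is not the AntreaProxy implementation.

```python
def diff_endpoints(old, new):
    """Return (added, removed) endpoints between two endpoint lists."""
    old_set, new_set = set(old), set(new)
    added = sorted(new_set - old_set)
    removed = sorted(old_set - new_set)
    return added, removed


if __name__ == "__main__":
    before = ["10.0.0.1:80", "10.0.0.2:80"]
    after = ["10.0.0.2:80", "10.0.0.3:80"]
    added, removed = diff_endpoints(before, after)
    # Only the changed endpoints need a datapath update; 10.0.0.2:80 is untouched.
    print("add buckets for:", added)
    print("remove buckets for:", removed)
```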

Recording

Antrea Community Meeting 10/24/2022

October 10, 2022

Minutes

Antrea L7NetworkPolicy API - see slides
- In the first release, only HTTP will be supported as the protocol.
- At the moment, isolated behavior is per direction AND per protocol (e.g., if there is an HTTP rule and no DNS rule, all DNS traffic is allowed as well as all other non-HTTP traffic).
- If Host is empty in an HTTP rule, any Host name is allowed.
- Applying L7 NetworkPolicies on External Nodes (VMs) should be possible (few requirements), but not planned for the first release.
- Host doesn’t include the port number (open question).
- No support for HTTPS at the moment, which requires decrypting traffic and certificate injection. In theory, we can still support Host-only rules with SNI (host name is in clear text).
- No support for policies such as "drop all L7 traffic that is not HTTP" (implementation doesn't support wildcard rules and a protocol has to be explicitly specified for all rules). Quan to investigate if we can match on "ip" as the protocol for default drop rules.
- Every time Suricata signatures are modified, Suricata has to reload the full set of signatures (one file).
  - For a few hundred rules, this should take < 1s (Quan to verify)
- No third-party library to generate Suricata rules programmatically, rules are plain text and not binary.
- Antrea TrafficControl CRs will be generated for L7NetworkPolicy CRs to ensure that the right traffic gets sent to Suricata.
- Only drop & pass actions in first release.
- Integrating with Suricata requires disabling TX checksum offload in Pods; impact on performance not measured yet
- What's the effect if the NetworkPolicyPeer (for egress rules) uses FQDN? TBD
- First release with support for L7 NetworkPolicies will be Antrea 1.10 (HTTP only)
- Engine selection: Suricata vs other options?
  - Which protocols are the most requested by users? Suricata supports HTTP, FTP, SSH, ... (similar to other engines like Snort)
  - Other solutions based on Envoy have support for gRPC and rich set of matching features for HTTP traffic; Envoy can also easily be extended to support new protocols
  - It should also be possible to support new protocols in Suricata using signatures, even if Suricata doesn't know how to parse the protocol natively
  - The issue with Envoy is that it is actually a proxy which intercepts connections with L4 sockets. Our traffic control implementation works at L2, so it works very well with IDS / IPS engines, but not with Envoy. Other issues with Envoy: not a transparent proxy, performance, ...

Recording

Antrea Community Meeting 10/10/2022

September 26, 2022

Minutes

Stretched NetworkPolicies - see slides
- Why? ACNP replication doesn't support the enforcement of ingress policy rules (only toServices egress rules supported)
- We introduce the notion of label identity for Pods (more scalable than distributing IP addresses), which is generated from the Pod labels
- We use the VNI to store the label identity
- Even in noEncap mode (in a member cluster), cross-cluster traffic would always be encapsulated (including from the origin Node to the gateway Node), so there is always a VNI field
- No flow changes required on gateway Nodes: tunnel id / VNI will stay with packet when forwarding to the new tunnel port
- Should be part of Antrea v1.9
- How long does it take for a stretched policy to take effect? It takes a few seconds for 100 label identity changes
- If the identity tag is not known to the member cluster yet, traffic will be dropped for security reasons
Project Nephe - see slides
- Only "Allow" action supported for policies (no "Drop" / "Deny" rules); this is because some clouds don't support other actions (Azure?). Need to double-check and decide whether to expose these actions for clouds that do support them.
- Cloud VM tags are added as labels to the ExternalEntity CRs, allowing users to select workloads by tags when defining Antrea policies
- Instead of copying VM tags as VirtualMachine CR annotations, maybe we could use labels? There is ongoing work to change how cloud tags are propagated to ExternalEntity labels. The VirtualMachine CRD will also be removed eventually, superseded by the VirtualMachinePolicy CRD.

Recording

Antrea Community Meeting 09/26/2022

September 12, 2022

Minutes

Update on Theia manager
- will be in charge of interacting with Spark operator for NetworkPolicy recommendation
- more generally an entrypoint for controlling all observability features
- boilerplate code merged, now adding NetworkPolicy recommendation CR with corresponding controller code
Grafana new home page work completed for Theia
Multiple PRs have been opened & merged to increase unit test code coverage - #4142
Need to investigate code coverage measurement for Theia (which uses both Golang and Python)
L7 NetworkPolicy
- feature targeted for Antrea v1.10 (ongoing design and PoC)
- license for Suricata is GPL 2.0; not an issue as we just plan to distribute it as a binary as part of our Docker images
- new CRDs will be introduced for L7 NetworkPolicies; there will be no support for Tiers and priorities (not possible with the chosen implementation), just like for K8s NetworkPolicies

Recording

Antrea Community Meeting 09/12/2022

August 29, 2022

Minutes

Nothing on agenda
Update on support for stretched Antrea network policies for multi-cluster
- several PRs in progress
- upcoming demo at a later community meeting
PR for Theia manager is under review
- looking for feature parity with current Theia CLI first - then more features will be introduced to leverage the Theia manager capabilities

Recording

Antrea Community Meeting 08/29/2022

August 15, 2022

Minutes

Customized Grafana homepage
- Motivations: brand it for Antrea / Theia, provide quick insights about the cluster network, provide quick access to dashboards
- What are stopped connections (widget)? Connections that were active during the selected time window but that are no longer active now. "Terminated" may be a better name.
- The amount of data transmitted out of the cluster may be an interesting metric to have as a widget.
- It seems that "Data Transmitted" doesn’t include traffic sent from the server side back to the client, which is confusing. It should include bi-directional traffic.
Antrea v1.8 release: some pending issues
- Last minute issue (security): #4116, new in this release (introduced by namespace-scoped groups)
- Agent cannot reconnect to OVS in some cases: #4092
Multi-cluster Gateway HA support: #3754
- At the moment, user needs to annotate Nodes manually to select gateways: only the most recently selected Node is used as gateway, with no fallback mechanism in case of failure
- We want to make Gateways more robust: active-standby at first (with active-active in the future)
- When active Gateway changes, existing connections may be reset

Recording

Antrea Community Meeting 08/15/2022

August 1, 2022

Minutes

Update on Theia changes for upcoming release (Antrea v1.8)
- extending Theia CLI capabilities
- improving unit test coverage for theia repository (Python Spark jobs, TypeScript Grafana plugins)
- improving e2e test coverage
- clustering support for ClickHouse DB (for horizontal scaling and HA / replication), dropping support for non-clustered deployment
- support for seamless schema upgrades (no data loss)
- minor UI improvements for Grafana dashboards
- is there a plan to drop support for the IPFIX exporter in the Flow Aggregator given that Theia only works with the ClickHouse exporter? Not at the moment, we know of at least one user for the IPFIX exporter (vRNI, a VMware product).
Update on upcoming Antrea v1.8 release (merged & pending PRs)
- support Topology Aware Hints in AntreaProxy (already supported in kube-proxy) - merged
- support for Helm chart installation method - merged
- multicast encap mode support - merged
- need more reviewers for pending PRs: https://github.com/antrea-io/antrea/milestone/21
- planning to freeze code around Wednesday August 10th
- limitations of audit logging support for K8s NetworkPolicies:
  - when traffic is accepted, it can be because of any number of NetworkPolicies
  - when traffic is dropped, it is not because of any specific NetworkPolicy, it is because the Pod is "isolated" (a Pod becomes isolated as long as at least one NetworkPolicy applies to it) - we just display "dropped by K8s NetworkPolicies".
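
The two limitations above follow from how K8s NetworkPolicies are evaluated. The sketch below is a simplified model, assuming ingress rules only and a hypothetical `Policy` type in which sources are plain strings: a Pod is isolated as soon as one policy selects it, an allowed connection may match several policies, and a drop cannot be attributed to any single one.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class Policy:
    name: str
    selects_pod: bool                      # does the policy apply to the destination Pod?
    allowed_sources: Set[str] = field(default_factory=set)


def ingress_verdict(policies: List[Policy], source: str) -> Tuple[str, List[str]]:
    """Return the verdict and the policies it can be attributed to."""
    selecting = [p for p in policies if p.selects_pod]
    if not selecting:
        # No policy applies to the Pod: it is not isolated, all traffic is allowed.
        return "allow", []
    allowing = sorted(p.name for p in selecting if source in p.allowed_sources)
    if allowing:
        # Any number of policies may allow the same connection.
        return "allow", allowing
    # The Pod is isolated and nothing allows the traffic: no specific policy to blame.
    return "drop", []


if __name__ == "__main__":
    policies = [
        Policy("allow-frontend", True, {"frontend"}),
        Policy("allow-web-and-monitoring", True, {"frontend", "monitoring"}),
    ]
    print(ingress_verdict(policies, "frontend"))
    print(ingress_verdict(policies, "db"))
```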
The recording ended abruptly because of a technical error on our side: we did continue the meeting after that for an extra few minutes, but nothing important was said, and there is no recording available.

Recording

Antrea Community Meeting 08/01/2022

July 18, 2022

Minutes

Support bundle for External Nodes (VMs) running the Antrea Agent
- Proposal slides
- We plan to change the API to request support bundles from Agents: it will use a CRD. We will have a unified API for Agents running on K8s Nodes and Agents running on External Nodes / VMs.
- We should be able to request support bundles from both External Nodes and K8s Nodes with one single API request (i.e., one single CR).
- Currently we only plan to support HTTPS for file upload, maybe other protocols in the future (e.g. FTP).
- Secret references in the CRD should not include the uid.
- An internal channel between Agent and Controller can be used to distribute the Secrets to the Agents. In that case, the Agent doesn't need to be granted RBAC permissions to read the Secret.
- It is unclear whether we need an internal channel between Agent and Controller at all for this feature. Primary motivation was to hide Node information (Node name) from other External Nodes, including in the same Namespace. However, it doesn't seem that this is a reasonable motivation since at the moment External Nodes already have access to this information based on existing RBAC. If we remove the internal channel, how do we distribute Secrets?
- Some support bundle requests which select the same Node(s) cannot be processed concurrently. Rather than have complicated admission logic and a dedicated error code for this case, we could simply have a single worker processing requests one-by-one sequentially. Support bundle is not a very frequent or time-sensitive operation.
- Internal channel for communications between Agent and Controller: can we have a unified solution across multiple features (support bundle, policy stats, traceflow, ...)?
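
The "single worker" suggestion for support bundle requests can be sketched as follows. This is a minimal illustration using a queue and one worker thread; the function names and the `collect` callback are hypothetical and do not reflect the Antrea code.

```python
import queue
import threading


def process_sequentially(requests, collect):
    """Process requests one by one, in arrival order, with a single worker."""
    pending = queue.Queue()
    results = []

    def worker():
        while True:
            request = pending.get()
            if request is None:  # sentinel: no more requests
                break
            # Only one request is in flight at any time, so two requests
            # selecting the same Node can never run concurrently.
            results.append(collect(request))

    thread = threading.Thread(target=worker)
    thread.start()
    for request in requests:
        pending.put(request)
    pending.put(None)
    thread.join()
    return results


if __name__ == "__main__":
    print(process_sequentially(["req-1", "req-2"], lambda r: "bundle-for-" + r))
```

Because requests never overlap, no admission logic or dedicated error code is needed for conflicting requests; the trade-off is that a slow request delays the ones queued behind it.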
Helm charts for Antrea available starting with v1.8
- See https://github.com/antrea-io/antrea/blob/main/docs/helm.md
- Antrea Helm charts listed on https://artifacthub.io/

Recording

Antrea Community Meeting 07/18/2022

July 5, 2022

Minutes

Stretched Antrea network policies for multi-cluster
- Proposal slides
- Why single ResourceExport for all label identities in a cluster, but one ResourceImport object for each label identity from the leader cluster?
  - ResourceExport object may grow too big or may need to be exchanged too often. Could there be a performance issue?
  - Decision to have a single object was motivated by a potential race condition when different clusters make changes to the same normalized labels.
  - Maybe there is another more efficient way to avoid that potential race condition. Yang and Grayson will look into this.
- Two choices for carrying label identity information across clusters: Geneve header TLV (up to 32 bits) or tunnel ID / VNI (up to 24 bits)
  - Decision is to use VNI (simpler & no impact on MTU) even though it is limited to 24 bits
  - Why is the MTU change an issue? Concern is that an existing cluster joins a ClusterSet and that we enable the StretchedNetworkPolicy feature. In that case, the MTU would need to be reconfigured for all Pods in the cluster.
- How to map workloads to label identities when using VNI? 2 choices: 12-bit label identity for Namespace and 12-bit label identity for Pod OR one 24-bit label identity which encodes both the Namespace and the Pod labels. Both have pros and cons when it comes to data exchange and number of supported identities.
  - Decision to be made offline
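
To make the two VNI options concrete, the sketch below packs a 12-bit Namespace identity and a 12-bit Pod identity into a 24-bit VNI, and compares the number of identities each option can represent. The bit layout (Namespace identity in the upper 12 bits) and the helper names are assumptions for illustration, since the decision was left open.

```python
from typing import Tuple

VNI_BITS = 24
NS_BITS = 12
POD_BITS = 12


def pack_split_identity(ns_id: int, pod_id: int) -> int:
    """Pack a 12-bit Namespace identity and a 12-bit Pod identity into a VNI."""
    if not 0 <= ns_id < (1 << NS_BITS):
        raise ValueError("Namespace identity does not fit in 12 bits")
    if not 0 <= pod_id < (1 << POD_BITS):
        raise ValueError("Pod identity does not fit in 12 bits")
    return (ns_id << POD_BITS) | pod_id


def unpack_split_identity(vni: int) -> Tuple[int, int]:
    """Recover (Namespace identity, Pod identity) from a 24-bit VNI."""
    if not 0 <= vni < (1 << VNI_BITS):
        raise ValueError("VNI does not fit in 24 bits")
    return vni >> POD_BITS, vni & ((1 << POD_BITS) - 1)


if __name__ == "__main__":
    vni = pack_split_identity(1, 2)
    print(vni, unpack_split_identity(vni))
    # Split option: 4096 Namespace identities and 4096 Pod identities.
    print(1 << NS_BITS, 1 << POD_BITS)
    # Combined option: 16777216 identities, each covering Namespace and Pod labels.
    print(1 << VNI_BITS)
```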

Recording

Antrea Community Meeting 07/05/2022

June 21, 2022

Minutes

@tnqn is proposing to refactor the Kind e2e tests
- See https://github.com/antrea-io/antrea/pull/3922
- Plan is to have one job with all the Alpha+Beta features enabled and one job with all the Alpha+Beta features disabled (in addition to the jobs for different encap modes)
- Reduce redundancy in CI jobs
- Reduce side effects and unpredictability of tests by removing ConfigMap mutations by individual test cases (test cases which depend on a specific feature will only run if the feature is currently enabled in the ConfigMap)
- In the future, we can run additional tests in Kind (e.g., for Multicast) to reduce dependency on Jenkins jobs
Releases of Antrea v1.7.0 and Theia v0.1.0
- Legacy network flow visibility will be removed from the Antrea repository as part of the Antrea v1.8.0 release
Multi-cluster support in Theia
- 2 possible solutions: have a unique flow data store (ClickHouse DB) in each cluster, or a centralized flow data store in one of the clusters
- In both cases, we want a centralized Grafana instance, with the ability to select a specific cluster in the UI (from a list of "connected" clusters)
Next meeting will be on Tuesday July 5th, because of US holiday
- We will discuss support for network policies with multi-cluster

Recording

Antrea Community Meeting 06/21/2022

June 6, 2022

Minutes

Antrea Jenkins CI - current situation and migration plans
- Any security risk associated with using smee to propagate Github webhook payloads to the private Jenkins instance?
  - Smee is developed and run by Github, we don't think it's a big security risk and no plan to replace it at the moment
- Force pushing to a branch and running /test-all does not kill previous ongoing jobs, which creates delays for the new jobs and for other PRs
  - We can investigate a command-based solution to enable users to kill previous jobs, but we have to be careful about doing clean-up properly
- Budget for running Antrea CI jobs in AWS?
  - Currently $200, but we can increase it
  - AWS makes more sense for daily jobs, not for pre-merge jobs
- We plan to open-source the code we use to run CI jobs in the private Jenkins deployment (e.g., IPv6 & Windows CI jobs)

Recording

Antrea Community Meeting 06/06/2022

May 23, 2022

Minutes

Virtual Machine (VM) support in Antrea using ExternalNode
- Proposal slides
- An ExternalNode is a kind of ExternalEntity - for each ExternalNode, we create a corresponding ExternalEntity, with the same labels.
- There will be support for ExternalEntity when we introduce namespaced Groups (as a way to easily select ExternalEntities and group them together, for the sake of defining network policies).
- AntreaAgentInfo will now be created / deleted by the Antrea Controller, but still updated by the Antrea Agent. This reduces RBAC permissions for the Antrea Agent. This also means that if we create a dedicated ServiceAccount per ExternalNode, we just need to grant the ServiceAccount permission to update its own AntreaAgentInfo resource (unlike the create verb, the update verb can be restricted to a specific resourceName).
- 2 "pipelines" for ExternalNodes: IP pipeline focused on policy enforcement (no forwarding) and nonIP pipeline for non-IP packets
- When interfaces are moved to the OVS bridge, the internal ports get the original name of the pNIC to avoid disruptions to routing and other processes.
- No forwarding functions: the OS is in charge of everything; when a packet enters through ethX (internal port), it is immediately marked with the correct egress port, i.e. pnicX (the matching physical port).
- Are OVS restarts handled gracefully (no connectivity loss)? We know that NSX uses veth pairs to connect pNICs to the bridge for this reason, so something worth looking into.
- DHCP should keep working after moving the interface to the bridge: the DHCP client will keep renewing the lease for the IP.
- All ExternalNodes are ExternalEntities but not all ExternalEntities are ExternalNodes. In the cloud, we can support security policies for VMs using cloud-native constructs (e.g. security groups for AWS VPCs), without running the Antrea Agent on the VM. In this case, VM network interfaces map to an ExternalEntity, but there is no ExternalNode.
  - An ExternalNode really means "a compute node on which we run the Antrea Agent and OVS"
  - Note that we actually generate different ExternalEntities for the different network interfaces of an ExternalNode
Last week, there was an Antrea office hour session at Kubecon EU (hosted by Salvatore and Quan)
- Short introductory presentation of the project, recent features, current state of the project

Recording

Antrea Community Meeting 05/23/2022

May 9, 2022

Minutes

Transition from Kustomize to Helm to generate Antrea YAML manifests
- Currently we only use Helm for its templating capabilities, we don't have a chart repo and we are not releasing the chart yet
- Kustomize had too many limitations for our use case (not a templating engine), and we ended up using sed a lot
- No significant difference for the end user; the generated manifests are the same, except that the name of the Antrea ConfigMap no longer includes a hash-generated suffix
GKE supporting NetworkPolicies for Windows Pods with Antrea
Antrea Network Flow Visibility solution is moving to its own Github repository under the name "Theia"
- The visibility solution has very different dependencies from Antrea and uses different technologies (e.g., Python + Spark for network policy recommendations)
- The Flow Aggregator is staying in the main Antrea repo, ELK integration is being removed starting with Antrea v1.7 (replaced by ClickHouse + Grafana)
- Upcoming Theia v0.1 release with new features: visualization for denied flows (flows denied by NetworkPolicies), network policy recommendation CLI
- Theia adds support for Helm to deploy all the different components
- Theia repo getting its own CI
- There is some duplicate documentation between the Antrea repo and the Theia repo, which could create some user confusion
- Antrea live show on flow visibility

Recording

Antrea Community Meeting 05/09/2022

April 25, 2022

Minutes

Certificate-based authentication for IPsec tunnels
- Proposal slides
- The Antrea Agents need access to the root CA certificate to configure the IPsec daemon; rather than add this certificate to a new ConfigMap, it can be added to the PEM-encoded certificate chain in the CSR Status field, after the CSR has been approved and signed (by the Antrea Controller).
- For certificate rotation, adding an extra OVS config option with a unique hash value seems like the best approach to trigger certificate reloading by the ovs-ipsec-monitor.
  - When using unique file names, we need to handle cleaning up stale files.
  - Updating any config option should be sufficient to trigger certificate reloading.
- Consensus is that the functionality for managing certificates (CSR creation, etc) should be located in the antrea-agent container, not in the antrea-ipsec container.
- The CN / SAN name in the certificate must match the remote_name in the IPsec config.
- With the current proposal for RBAC configuration, Agents can create any CSR they want (for any signing authority), including creating CSRs using another Node's name as the CN.
  - Possibility of escalation if a Node is compromised (and the Agent's serviceAccount token is compromised).
  - We could consider having an option so that Agent CSRs are not auto-approved; an admin would need to approve CSRs manually, including in case of Node reboot; the Antrea Agent can update the Node Status as "NotReady" until the CSR is approved (e.g., in case of reboot) to prevent Pods from being scheduled on the Node

Recording

Antrea Community Meeting 04/25/2022

April 11, 2022

Minutes

Proposal for introducing the traffic control capability
- Slides
- Design issue #3324
- Traffic can be mirrored or redirected through an OVS port.
- Users can either create the OVS port themselves, or let Antrea create it (can add an existing device, or create a new tunnel port); this is for added convenience, as creating an OVS port requires access to the ovsdb-server socket.
- Can multiple TrafficDestination CRs use the same port, and let Antrea create the port? Yes, the same port will be used if the name fields are the same. However, the configuration properties have to be the same in all cases, or there will be a conflict / error. This is similar to multiple Pods using the same hostPath volume on a K8s Node.
- Another approach that was considered was having a ConfigMap (could be the antrea-config ConfigMap) to create all required ports upfront. The API-driven approach is probably more user-friendly.
- Initial version is planned for Antrea v1.7, with additional features (e.g., filtering) in Antrea v1.8.
- The TrafficControl feature name may be too general, as tc in Linux supports many additional use cases.
- Adding e2e tests for this feature should not be an issue.
- Open question: stateful vs stateless implementation. Quan believes performance could be better with the stateless implementation, without compromising on supported use cases. Can be revisited when we start implementing filtering support.
- TunnelDestination could be a separate CRD, and multiple TrafficDestination CRs could refer to the same TunnelDestination CR if they want to use the same tunnel.
- Using OVS flow-based tunneling for mirroring is not an option: traffic needs to go out on the overlay (normal Pod traffic forwarding) but also be mirrored to another tunnel. We cannot use flow-based tunneling in both cases, as each packet only has one set of tunnel metadata in OVS.
- Some possible improvements for the API, will discuss them on the Github issue and offline.
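
The port-sharing rule discussed above (same name means same port, but mismatched configuration is a conflict) can be sketched as a small reference-counted registry. This is only an illustration: `PortRegistry`, `PortConflictError` and the shape of the configuration dict are hypothetical and do not come from the proposal.

```python
class PortConflictError(Exception):
    """Raised when two CRs request the same port with different settings."""


class PortRegistry:
    def __init__(self):
        # port name -> (configuration dict, names of the CRs using the port)
        self._ports = {}

    def acquire(self, cr_name, port_name, config):
        """Register a CR as a user of a port; return True if the port is new."""
        if port_name not in self._ports:
            self._ports[port_name] = (dict(config), {cr_name})
            return True
        existing_config, users = self._ports[port_name]
        if existing_config != config:
            raise PortConflictError(
                f"port {port_name!r} already exists with a different configuration"
            )
        users.add(cr_name)
        return False

    def release(self, cr_name, port_name):
        """Drop a CR's reference; return True if the port can now be deleted."""
        _, users = self._ports[port_name]
        users.discard(cr_name)
        if users:
            return False
        del self._ports[port_name]
        return True
```

The return values tell the caller when the port actually has to be created or removed, which is similar in spirit to the hostPath analogy above: the shared resource outlives any single user.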

Recording

Antrea Community Meeting 04/11/2022

March 28, 2022

Minutes

ICMP support in Antrea-native policies
- Feature request issue #3263
- Proposal slides
- CRD webhook conversion applies to served API resources, but stored resources are not modified automatically
- Consensus is to keep the current Ports field and introduce a new Protocols field. If both fields are present, they will be merged. If they conflict, we can fail early. This means we do not need to introduce a new API version.
- We could support arbitrary IP protocol numbers (as integers) as well.
- Demo video
- Implementation is in good shape, but there is an issue when using the Reject action, which needs more investigation.
Multi-cluster datapath connectivity support
- Issue #3502
- Design doc
- Currently the gateway Node (for cross-cluster communications) is chosen manually and there is no failover mechanism.
- Routing to other clusters is based on the destination IP (each cluster advertises its Cluster CIDRs and Service CIDRs)
- Active-standby failover may not be enough, we may need active-active support (multiple active gateway Nodes), as a single gateway Node may become the bottleneck.
- We may want to support overlapping Service CIDRs across clusters, by allocating virtual IPs (in Antrea) for multi-cluster Services.
- Might need to support noEncap mode (in cluster members) too to cover cloud-managed K8s services; at the moment we only support encap mode for cluster members.
- K8s upstream multi-cluster DNS specification recently merged: https://github.com/kubernetes/enhancements/pull/2577
- Using Node private IP or public IP (as reported by K8s API) to create tunnel endpoints? We probably should use the public IP if available, or fall back to private IP if not available (there isn't always a public IP). Private IP may not be routable across clusters (e.g. if member clusters are in different VPCs).
- Will try to include this in Antrea v1.7
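
The "public IP if available, otherwise private IP" fallback discussed above could look like the following sketch. It assumes the input mirrors the Node `status.addresses` list from the K8s API (entries with a `type` such as `ExternalIP` or `InternalIP`); the function name is hypothetical.

```python
def tunnel_endpoint_ip(addresses):
    """Pick the IP to use for a cross-cluster tunnel endpoint.

    `addresses` is a list of dicts with "type" and "address" keys,
    shaped like the Node status.addresses list.
    """
    by_type = {}
    for entry in addresses:
        # Keep the first address reported for each type.
        by_type.setdefault(entry["type"], entry["address"])
    # Prefer the public address, fall back to the private one.
    for preferred in ("ExternalIP", "InternalIP"):
        if preferred in by_type:
            return by_type[preferred]
    return None
```

Returning `None` when neither address type is present leaves it to the caller to decide whether the Node can act as a gateway at all.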

Recording

Antrea Community Meeting 03/28/2022

March 15, 2022

Minutes

"Live Traffic Tracing for Antrea" proposal
- Github issue #3428
- Design doc
- Antrea Traceflow feature, 2 modes:
  - packet is injected by Antrea and traced through the cluster network, or
  - the first "live" (real traffic) matching packet is traced and captured
- This design wants to add more advanced tracing and sampling capabilities compared to Traceflow (e.g. capture multiple packets, ...)
- Packets "marked" for capture will be matched in the OVS pipeline and sent to the Agent. To mark the packets, the best solution seems to be eBPF with TC hook.
- Risk of using / setting the IP ID field to uniquely identify sampled packets?
- Packet dumps (with metadata) will be stored at source and destination Nodes; HTTP API can be used to retrieve sampled packets.
- Live tracing for Service traffic (which goes through DNAT)? Need to check how this will work.
- Traceflow is already overloaded, maybe a new CRD should be used to configure traffic sampling and avoid user confusion
- Do we need such a complicated implementation (capture at both the source and destination Nodes) or can we go with something simpler for sampling (e.g. sample traffic at a specific Node)?
- How much complexity will eBPF be adding to the Antrea codebase?

Recording

Antrea Community Meeting 03/15/2022

February 28, 2022

Minutes

Multicast API design
- Reuse the same IP block field for multicast IP addresses (no need to introduce a new field)
- Need more discussion for how to select IGMP messages: we should probably consider a generic solution that can also work for other protocols we want to support, such as ICMP
Multicast stats demo
- No need for a dedicated API for multicast NetworkPolicy stats (follow-up from last meeting's discussion)
- New APIs / antctl commands:
  - For Pod multicast traffic stats: only locally available in "agent-mode" (from the antrea-agent Pod), no aggregation
  - To query multicast group membership: membership information is aggregated in the Antrea Controller (there can be members across many different Nodes) and API available with APIService
- Proposed API name (multicastgrouppodsmembers) seems redundant, suggestion is to use multicastgroupmembers instead
Antrea NetworkPolicy NodeSelector
- https://github.com/antrea-io/antrea/issues/3023
- https://github.com/antrea-io/antrea/pull/3038
- Design doc
- Which Node IPs are used to enforce NetworkPolicies?
  - For intra-cluster communications, Nodes use the gateway IP assigned by the Antrea Agent
  - Node IPs must include uplink IP and gateway IP
  - The Controller can determine the gateway IP for each Node based on that Node's PodCIDR
  - Other cases we need to cover?
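
As an illustration of how the gateway IP can be derived from a Node's PodCIDR, here is a minimal sketch. It assumes the gateway is assigned the first usable address of the PodCIDR; that assumption is not stated in the minutes and should be checked against the Agent's actual behavior.

```python
import ipaddress


def gateway_ip(pod_cidr):
    """Return the assumed gateway IP for a Node's PodCIDR.

    Assumption: the gateway takes the first address after the
    network address (e.g. 10.10.1.1 for 10.10.1.0/24).
    """
    network = ipaddress.ip_network(pod_cidr)
    return str(network.network_address + 1)
```

For example, `gateway_ip("10.10.1.0/24")` yields `10.10.1.1`, so the Controller would not need the Agent to report the address.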
Antrea v1.6 release
- Largest ongoing PR is the "flexible pipeline" one, which changes how we manage OVS flows for the different Antrea features; should be merged very soon, which will unblock other PRs (multicast, flexible IPAM, ...)

Recording

Antrea Community Meeting 02/28/2022

February 14, 2022

Minutes

NetworkPolicy support for multicast traffic
- Ability to target multicast traffic but also IGMP messages (query / report)
- See https://github.com/antrea-io/antrea/issues/3323
- There are several possible options we should consider for the API design: dedicated types for multicast policies, dedicated types for multicast rules, same types but dedicated fields, ...
- In order to define rules for IGMP messages, should we consider introducing a generic mechanism to target arbitrary IP protocols?
Support for Antrea multicast stats API
- See https://github.com/antrea-io/antrea/issues/3294
- The proposed stats API may need to evolve based on the final design we agree upon for the multicast NetworkPolicy support
The legacy *.antrea.tanzu.vmware.com APIs are being removed in Antrea v1.6 (they were deprecated in favor of *.antrea.io APIs in v1.0)

Recording

Antrea Community Meeting 02/14/2022

January 18, 2022

Minutes

Proposal for Antrea IPAM multi-VLAN support post Antrea v1.5 - slides
- Status of Flexible IPAM:
  - Antrea v1.4: decouple Pod IP allocation from Node assignment (Pod can keep same IP when evicted to another Node), Linux + IPv4, ability to provide IPPool per Namespace
  - Antrea v1.5: ability to provide IPPool per Deployment / StatefulSet, ability to provide IP per Pod
- The plan is to avoid introducing a new feature gate for multi-VLAN support; for Pods using Node IPAM and for the antrea gateway we will keep using trunk ports so no change for those
- With the current design, OVS will route packets across VLANs locally: a Pod in VLAN 100 and a Pod in VLAN 101 can talk to each other locally without the traffic going through an underlay router
  - this may be surprising to users as this is not the typical VLAN isolation behavior
  - an alternative would be to always forward the traffic to the uplink and let the underlay network handle it (whether it's local or remote Pod traffic)
  - this could be configuration-based as well; macvlan offers a similar configuration as well?
- Same VLAN ID can be used in multiple IPPools
- The number of flows in the new VLAN table will be proportional to the number of local Pods
Field names for NetworkPolicy API
- In the current K8s NetworkPolicy API (but also in the Antrea-native API), workloads can be selected in many different ways. For example, you can select workloads with a podSelector, a namespaceSelector or by providing a combination of both selectors. This becomes messy as we add new selectors to the API (e.g. serviceAccount selector). Some selectors are compatible (e.g., podSelector & namespaceSelector) but some are not (e.g., podSelector & serviceAccountSelector).
- At the moment, we use a validating webhook to check that the provided selectors make sense together.
- This is a bad design when updating the API to support new selectors. Users can create policies using the new API version. These policies can be invalid. If an older version of Antrea is still running, the policies will not be rejected and will be stored by the apiserver. The unsupported fields will be ignored by Antrea (silently) leading to an implementation which is not expected by the user. When the Antrea Controller Pod is updated, it will complain that the policies are invalid.
- That is why the upstream NetworkPolicy API cannot be updated with new features in its current version.
- This was a mistake in the original design of the API. In retrospect, there should be more fields in the API spec, each with a verbose name and each corresponding to a specific combination of selectors.
- Should we incorporate these learnings as we evolve our Antrea-native NetworkPolicy APIs? What about existing selectors?
- There could be other API designs achieving the same goals.
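
The kind of check a validating webhook performs on selector combinations can be sketched as follows. The set of allowed combinations is an assumption made for illustration (only the two examples from the discussion above are taken from the minutes), and `validate_peer` is a hypothetical name.

```python
# Assumed set of selector combinations that make sense together.
COMPATIBLE = {
    frozenset(["podSelector"]),
    frozenset(["namespaceSelector"]),
    frozenset(["podSelector", "namespaceSelector"]),
    frozenset(["serviceAccountSelector"]),
}


def validate_peer(peer):
    """Reject a rule peer whose selectors cannot be combined."""
    used = frozenset(name for name, value in peer.items() if value is not None)
    if used not in COMPATIBLE:
        raise ValueError(f"unsupported selector combination: {sorted(used)}")
    return True
```

The weakness discussed above is visible here: an older Controller that does not know about `serviceAccountSelector` never runs this check on that field, so the field is silently dropped instead of being rejected.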

June 29, 2020

Minutes

Release update
- 3 major features coming in this release
  - Default tunnel type update from VXLAN to Geneve: impacts users in case of upgrade, no change in overlay MTU
  - Default gw name change, no impact on Nodes in existing clusters
NodePortLocal proposal presentation by Sudipta Biswas:
- Main goal is to allow external Load Balancers to provide connectivity to Pods, bypassing limitations of NodePort and kube-proxy
- No Pod annotations required, Antrea Agent publishes a CRD that can be consumed by the external Load Balancer (includes Pod to host port mapping)
- How does it relate to externalTrafficPolicy=Local for NodePort services? Still depends on kube-proxy and still has some limitations. Sudipta will get back to us and provide a detailed comparison between the 2.
- Sudipta will create a “proposal” issue in the Antrea repository as a next step.
- A performance comparison with a traditional external Load Balancer using NodePort may be useful, but probably time-consuming
- Session affinity cannot be achieved with NodePort Services, because the external Load Balancer may use a Virtual IP (VIP) as the source IP. The same VIP will be used by many clients, so Session affinity will cause poor load-balancing. With NodePortLocal, the Load Balancer is in charge of controlling Session affinity.
- What’s the plan for testing this as part of the Antrea e2e test suite? Sudipta to come up with a plan.
- Would be good to include diagrams in the design doc to show the different traffic paths for NodePortLocal vs "traditional" use case(s).
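
The Session affinity problem mentioned above can be illustrated with a toy model of source-IP-based affinity. This is a simplification, not how kube-proxy is implemented: the hash function, backend names and addresses are all made up for the example.

```python
import hashlib


def pick_backend(source_ip, backends):
    """Source-IP affinity: the same source IP always maps to the same backend."""
    digest = hashlib.sha256(source_ip.encode()).digest()
    return backends[int.from_bytes(digest[:4], "big") % len(backends)]


backends = ["pod-a", "pod-b", "pod-c"]
clients = ["203.0.113.%d" % i for i in range(1, 51)]

# Client IPs preserved: connections spread over the backends.
direct = {pick_backend(ip, backends) for ip in clients}

# Load Balancer rewrites the source IP to its VIP: every client looks the same.
behind_vip = {pick_backend("192.0.2.10", backends) for _ in clients}
```

With the VIP as the only visible source, `behind_vip` contains a single backend, which is the poor load-balancing described above. With NodePortLocal the Load Balancer sees the real clients and can apply affinity itself.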
Documentation for Antrea ("Antrea the hard way")
- A detailed document about how Antrea works and how it uses OVS, which can be consulted when troubleshooting Antrea operations.
- Architecture document and OVS pipeline document are good places to start
- https://github.com/vmware-tanzu/antrea/issues/883
- Cody will look at some options, and especially with respect to how Antrea compares to "routed" CNIs like Calico
NetworkPolicy v2: Cody, Jay, Abhishek will give an update at the next meeting regarding the discussions happening in sig-network; the Antrea community could provide useful feedback
Zoom protection: maintainers and Cody will review options

Recording

The meeting was "Zoom-bombed" and as a result we had to edit-out 2 minutes of the footage to avoid uploading profanities. We apologize to the attendees and the presenter.

Antrea Community Meeting 06/29/2020

June 15, 2020

Minutes

First release retrospective (may not be a permanent link)
- Feedback around testing: flaky tests are making life harder for contributors
- Consider reducing our release cadence (4 weeks -> 6 weeks)
- We will have more retrospectives in the future (for each release?)
- Let’s add the most "popular" items to the agenda for future meetings
Antrea ClusterNetworkPolicy Agent-side design by Yang
- See slides
- Why float values for priorities? Ensures that the user is always able to insert new rules between 2 existing rules, without updating existing priorities themselves
- Priority zones can have 130 priorities: thanks to these zones we have a more reasonable boundary on how many rules we have to shuffle when inserting a new one
- Easy to adjust the design in the future without impacting the user (antrea-agent restart when doing update will just re-organize flows)
- Typical scale according to Cody: 4-5 tiers; 65K rules should be sufficient but we may want to be able to balance between tiers / zones - as a reminder the limit applies at the Node level, not at the cluster level, so 65K may be more than enough
- When introducing RBAC, multiple rules spread across multiple namespaces may share the same priority values, so some priority zones may become too packed - size of priority zones may need to be dynamic
- Ability to mix and match Antrea-native Namespaced Network Policies with Cluster Network Policies? exact evaluation order yet to be determined
- Shuffling the flows (changing the priority) will not impact existing connections; we may want to ensure that OVS counters are preserved though, which may not be the case for the current implementation (Yang to verify)
- Until we have a UI for ordering policies, user will need to be aware of all existing priority values
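
A minimal sketch of why float priorities make insertion easy: a new rule can take the midpoint between its two neighbors, and no existing priority has to change. The rule names and values below are invented for the example, and the mapping from float priorities to OVS flow priorities and zones is not shown.

```python
def priority_between(lower, upper):
    """Return a priority strictly between two existing rule priorities."""
    if not lower < upper:
        raise ValueError("lower must be strictly less than upper")
    return (lower + upper) / 2.0


# Existing rules keep their priorities; only the new rule gets a value.
rules = {"allow-dns": 1.0, "deny-all": 2.0}
rules["allow-web"] = priority_between(rules["allow-dns"], rules["deny-all"])
ordered = sorted(rules, key=rules.get)
```

Note that floating-point precision is finite, so repeated bisection between the same two neighbors will eventually run out of distinct values; in practice this takes dozens of insertions at the same spot.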
Next meeting
- AVI team may talk about their proposal for ingress NodePort policies

Recording

Antrea Community Meeting 06/15/2020

June 1, 2020

Minutes

Service Cluster IP access ("kube-proxy" functionality) implemented with OVS: presentation and demo by Weiqiang
- Generating ICMP host / port unreachable messages like kube-proxy with iptables for invalid Service IPs / ports ? not implemented at this time but will look into it
- For NodePort support, we will still rely on kube-proxy for now
- Documentation status? there is an available Google doc, will share after the meeting; we should also update the OVS pipeline documentation with new flows and tables as it is a useful document for new contributors
- We use OpenFlow groups for endpoint selection (for now, equal weights for all endpoints but later we can support topology awareness, annotations for weight specification, ...)
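
Endpoint selection with weighted buckets can be sketched as below. This only mimics the idea of a select group: the CRC32 hash and the flow key format are stand-ins chosen for the example, not the selection method OVS uses.

```python
import zlib


def select_endpoint(flow_key, buckets):
    """Pick an endpoint for a flow from a list of (endpoint, weight) buckets."""
    total = sum(weight for _, weight in buckets)
    if total <= 0:
        raise ValueError("at least one bucket needs a positive weight")
    # Hash the flow key so that a given connection always hits the same bucket.
    point = zlib.crc32(flow_key.encode()) % total
    for endpoint, weight in buckets:
        if point < weight:
            return endpoint
        point -= weight
```

With equal weights every endpoint gets the same share of the hash space; supporting topology awareness or weight annotations later would only change the weights passed in.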
Flow-tracing presentation and demonstration by Ran
- Is it possible to map the rule ID in output (CRD status) to specific K8s NetworkPolicy? work in progress
- Support for Service ClusterIP traffic? it is dependent on ClusterIP implementation in OVS
- Traceflow requests install temporary flows in the OVS bridge.
- What happens when multiple traceflow requests are performed concurrently? We can have up to 15 traceflows running at the same time.
  - Need to think about RBAC for traceflow and rate-limiting.
Cody would like to set aside some time (10-15 minutes) in each community meeting to have an open forum for new users to ask any questions they may have about the project

Recording

Antrea Community Meeting 06/01/2020

May 18, 2020

Minutes

Update on ClusterNetworkPolicy proposal
- K8s Network Policies will be considered part of the lower-priority category ("default" category), but the user can also create Antrea Cluster Network Policies within that same category.
- Add "from" field to egress rules and "to" field to ingress rules.
  - As a consequence, the ingress and egress section in the ClusterNetworkPolicy CRD definition will use the same struct type
  - Not P0 feature
  - Not really useful for "in-cluster" policies which are applied to Pods: makes more sense for policies applied to Nodes / external entities
  - At first, CRD validation will ensure that the new fields are always empty, and will reject the object otherwise (before it gets to the Antrea controller).
- Ability to set AppliedTo on a rule basis (and override the policy field): probably a nice-to-have, but no specific use case in mind as of now, so could be added incrementally later
- Refer to design doc for implementation roadmap and feature priority
Discussion on API subgroup naming for Antrea-specific policy CRDs
- All these CRDs could go under a "security" subgroup?
- Could have "monitoring" and "troubleshooting" subgroups for other CRDs
- Let's punt this discussion to a future meeting to give people time to think about this
Flow exporter design
- Should flow exporter be part of the antrea-agent process / container?
  - For simplicity's sake, yes for now; it can be moved to a separate container in the future
- Is there a plan to support flow filtering so that we only export flows specified by the user (e.g. flows from certain Pod)?
  - At the moment plan is to export information about all the flows going through the OVS bridge, but we can extend that in the future.
- What is the performance impact on the node?
  - Plan is to run some benchmarks after we take a first stab at the implementation.
- Flow information access? How do we restrict it in the context of multi tenancy?
  - Srikar will think about this angle and RBAC implications
- Why poll the conntrack module for the implementation?
  - Provides visibility into reverse traffic (counters)
  - Bad performance of OVS IPFix
  - We can embed NetworkPolicy information
- Are there other flows that we are interested in that are not committed to conntrack?
  - At the moment we commit all connections as part of Network Policy implementation.
- With the current proposal we are missing L7 information, maybe something we want to consider in the future.
- K8s context information will be added to the IPFix records (e.g. Pod information) so that the UI can display in terms of K8s objects.
- To limit the amount of traffic, plan is to poll conntrack flows every 5-10s, export every couple of minutes.
- Flow logging in Calico Enterprise: see section labeled "Flow Logs with Workload Metadata" here; is there anything there that we wouldn't be able to support with this proposal? Ability to classify workloads?
- Flow information compression / aggregation: may be worth looking into this to avoid generating too much data
  - Should it be done at the exporter / aggregator / collector?
- Sensitive to port scanning / SYN flood attacks?
  - Only send information about established connections, unanswered SYNs can be exposed as a separate metric
- How does the UI scale with the size of the cluster and the number of connections?
  - Needs to be benchmarked
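
The poll / export split discussed above can be sketched as follows. The intervals come from the discussion (poll every 5-10s, export every couple of minutes); the class, the entry format and the use of plain dicts instead of IPFix records are assumptions made for the example.

```python
POLL_INTERVAL = 5      # seconds, lower end of the 5-10s range
EXPORT_INTERVAL = 120  # seconds, "a couple of minutes"


class FlowExporter:
    def __init__(self):
        self.records = {}         # connection 5-tuple -> latest byte counter
        self.unanswered_syns = 0  # kept as a separate metric, never exported as a flow
        self.last_export = 0

    def poll(self, conntrack_entries):
        """Refresh records from one conntrack dump."""
        self.unanswered_syns = 0
        for entry in conntrack_entries:
            if entry["state"] == "ESTABLISHED":
                self.records[entry["tuple"]] = entry["bytes"]
            elif entry["state"] == "SYN_SENT":
                self.unanswered_syns += 1

    def maybe_export(self, now):
        """Return the records to send if the export interval has elapsed."""
        if now - self.last_export < EXPORT_INTERVAL:
            return None
        self.last_export = now
        exported, self.records = self.records, {}
        return exported
```

Because only established connections become records, a port scan or SYN flood inflates the `unanswered_syns` counter rather than the exported data.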

Recording

Antrea Community Meeting 05/18/2020

May 4, 2020

Minutes

Moshe walked us through his hardware offload proposal
- Topology manager is used to ensure that VF net device and CPU are on the same NUMA node
- OVS supports 2 offload mechanisms (DPDK rte_flow / TC flower): the proposal covers Kernel offload (using TC flower offload)
- Full offload model: either all actions can be offloaded, or it will be handled in software
- Recirculation should be ok: connection tracking + encap / decap
- Right now we can only test Pod-to-Pod traffic, since Antrea still relies on kube-proxy for Service traffic
- This approach should be applicable to other vendors as well. Other vendors besides Mellanox may be able to support full offload, including connection tracking.
- Goal on Mellanox side is to be able to offload 1M+ flows to hardware.
- Mellanox willing to help out with the CI by providing and hosting a testbed that we can integrate in our public CI infrastructure.
- Moshe will include documentation about the requirements (OVS, Multus, etc) and the necessary configuration steps as part of his PR
- You choose which Pods need to be accelerated in the Pod spec, you can have a mix of accelerated and non-accelerated Pods.
- Going to take a while to make sure that Pod-to-Pod traffic is fully supported: need to ensure there are no gaps in upstream OVS / Linux Kernel code (a few months needed?).
Questions on NetworkPolicy proposal from last week
- Ability to sandwich K8s NetworkPolicies between Antrea Cluster/Namespaced NetworkPolicies: common enterprise use case. All K8s NetworkPolicies should be relegated to a Default tier (lowest priority) as one block. Need ability to define relative ordering between K8s NetworkPolicies and Antrea NetworkPolicies in that Default Tier.
- Need to abide by K8s isolated Pod behavior: if a K8s NetworkPolicy selects a namespace, and only allows egress TCP traffic from port 80, all other egress traffic from this namespace should be "denied", and there is no possibility to override that in a lower priority Antrea NetworkPolicy. These lower priority policies can only be used to deny more traffic.
- Can we unify externalEntitySelector and podSelector?
  - Cloud-native metadata can be automatically translated to K8s labels (possibly namespaced to avoid clashes) by code importing inventory.
  - The rationale for having both externalEntitySelector and podSelector was that when Antrea NetworkPolicies are only consumed in the context of K8s, we wanted the fields to be pretty much the same as for K8s NetworkPolicies. We could have a 3rd field endpointSelector to select across all endpoints (external and Pods)?
- Proposed Status field for NetworkPolicy CRDs: these are typically used to reflect current status and not to expose time series values. What is the value of including counters? Can’t we achieve the same thing with Prometheus metrics? Quan is still working on this proposal, we can review at a later time.
  - AI(@abhiraut): schedule an extra meeting this week for further discussions on Antrea NetworkPolicies.

Recording

Antrea Community Meeting 05/04/2020

April 20, 2020

Minutes

Cluster-scope Network Policy proposal by Abhishek: https://docs.google.com/document/d/1l-1P5sNKzUo3Zxf5Qfl6oQCWY8TYPOTdIwe9mfppqLg/edit?usp=sharing
- Motivations:
  - K8s Network Policies are namespace-scoped, so having a cluster-wide policy requires replication
  - Upstream changes are slow, but eventually we would like to have a standardized API instead of relying on an Antrea-specific CRD
  - No notion of policy tiering / priorities for policies that can be created by different roles
  - Ability to select other kind of workloads besides Pods (e.g. Nodes, external entities such as VMs)
- Other open-source CNIs have the same kind of CRD, we have experience at VMware for NSX
- Does the idea of supporting service selectors conflict with service mesh policies (e.g. Istio)?
  - This is still at layer 4 and is meant to complement K8s Network Policies
- How fast are Network Policies enforced?
  - It applies to existing Pods but (at least in the case of Antrea) it only applies to new connections (existing connections are not affected because of how we use conntrack to skip checks for established connections). This is also the case for standard K8s Network Policies: just an API specification and there is no mandate on how they should be implemented, and as far as we know other CNIs (e.g. Calico) implement them in the same way (using conntrack).
- For port lists in rules, we could consider supporting port ranges, for convenience
- The document needs to clarify how rules between different categories (with different priorities) interact
- Add concrete use cases / user stories to document
- More detailed presentation later about Status (plan to expose byte / packet counters as CRD statuses)
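
The point above about policies only affecting new connections can be shown with a toy model in which a set stands in for the conntrack table. The class and its fields are hypothetical; the real datapath does this with OVS flows and conntrack state.

```python
class PolicyEnforcer:
    def __init__(self, allowed_ports):
        self.allowed_ports = set(allowed_ports)
        self.established = set()  # stands in for the conntrack table

    def handle_packet(self, conn, dst_port):
        """Return "allow" or "drop" for a packet of connection `conn`."""
        if conn in self.established:
            # Established connection: the policy check is skipped entirely.
            return "allow"
        if dst_port in self.allowed_ports:
            self.established.add(conn)
            return "allow"
        return "drop"
```

If port 80 is removed from `allowed_ports` after a connection has been accepted, that connection keeps being allowed while a new connection to port 80 is dropped.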

Recording

Antrea Community Meeting 04/20/2020

April 8, 2020

Minutes

Reschedule of the community meeting based on poll results:
- single meeting (no rotation), Monday 9PM PST - Tuesday 4AM GMT
- no conflict with K8s contributor calendar
- AI(Salvatore): update meeting time in README
Using DDlog in the Antrea Controller for NetworkPolicy computation
- see Antonin's slides
- a few things that still need to be figured out:
  - is NetworkPolicy computation the bottleneck, or is it actually the distribution to agents? if it's the latter, then some minor differences in computation time between the DDlog and native implementation are insignificant
  - how much additional complexity do we expect in the future (e.g. with NetworkPolicy tiers)? - more complexity could justify a move to DDlog
  - can DDlog help us support more features, such as connectivity queries?
- DDlog is used internally at VMware for some more complex projects
- additional optimizations can be done (e.g. in the Go <-> DDlog interface), but these represent a large engineering effort that is only justified if we commit to DDlog for Antrea

Recording

Antrea Community Meeting 04/08/2020

March 25, 2020

Minutes

v0.5.0 status update:
- Prometheus PRs still in-review, pushing it back to v0.6.0
- Antrea cleanup PR, no progress to report, pushing it back to v0.6.0
- Update to Go 1.14 - not important for release, keep it open for now as it is a good first issue
v0.6.0
- Windows support: still missing CI pipeline and installation process
- According to Cody, there are some more urgent features (SNAT, IPAM, policy tiering), which are hurting Antrea adoption and should be targeted for the June time frame
  - Issues need to be created for these features
- Missing stability features: "support-bundle" and log collection (e.g. with syslog)
  - Relying on container logs is not ideal (no log persistence when a container restarts)
  - crash-diagnostics is to collect information in a cluster for troubleshooting, it's not a log streaming / collection system
  - Antrea core code should be agnostic to the log collection system (syslog, fluentd, ...), but we should have a reference integration with some popular open-source stack like EFK (ElasticSearch + Fluentd + Kibana), e.g. in the form of a reference operator, with documentation, configuration, etc
  - Support bundle: what the user intended (configuration) + Antrea state snapshot + all logs available
  - Action items for v0.6.0? reference integration with EFK for log collection / analysis + prototype support-bundle with antctl (crash-diagnostics may be out of the picture because of SSH access requirement to each Node) - for both items, more detailed PRD is required.
Antrea support on ARM architectures
- See Antonin's slides
- K8s itself has issues on arm due to lack of testing
  - many issues reported on slow arm devices (e.g. https://github.com/kubernetes/kubeadm/issues/1380), which means it could be difficult to use emulation (qemu) for CI testing
  - use x86 for control-plane node and arm only for workers (cannot use Kind cluster), which would be the typical use case
- According to Cody, this is available in Calico but not widely used
- According to Cody, we should try to tackle this for the end of the year, but not a priority for the summer time frame
- continue the discussion on Github / Slack
Re-scheduling Antrea community meeting
- conflict with Calico monthly community meeting
- more than half of the Antrea active contributors are based in China and they should be able to participate in the meetings
- => let's take it to the Slack channel

Recording

Antrea Community Meeting 03/25/2020

March 11, 2020

Minutes

Review open issues for 0.5.0 release
- #361: this needs to be resolved for 0.5.0, which will be aligned with K8s release 1.18 (March 25th); waiting to hear back from assignee
- Prometheus patches: code reviews are needed for patches #322, #325 and #446
- #494: Namespace deletion issue; Quan has a workaround for upgrading the YAML when API resources have changed
  - we need to review apimachinery guidelines for future upgrades
- #312: publish antctl binaries as part of 0.5.0 release; Antonin will review the available antctl commands and determine whether the binaries should be included in the release (depends on whether some useful commands can be run out of cluster).
  - documentation needed for antctl (#337)!
- => all open issues on target for 0.5.0
Public cloud update: EKS (AWS) support should be ready for 0.5.0
Windows update: lots of progress in feature branch; able to run some e2e tests; need to set up CI; still using OVS CloudBase (no progress to report for upstream)
Website update:
- website is all ready to go
- there was a lot of activity recently around VMware Tanzu, so Cody was waiting for the right time to do the launch to maximize impact
- also working on blog posts for performance + internal VMware Antrea deployment
Antrea community meeting time slot currently conflicts with monthly Calico community meeting
- we are leaning towards changing the time from 9am to 10am PST, but can also consider another day of the week
- need to check for conflicts with other meetings in the K8s space
- no meeting next week

Recording

The hosts forgot to start the recording at the beginning of the meeting, so we only have a very short recording for this meeting.

Antrea Community Meeting 03/11/2020

March 4, 2020

Minutes

Jay presented the ongoing "netpol" (new Network Policy testing framework) work
- DSL to quickly and easily define new test cases (Network Policy definition and expected reachability matrix)
- Runs fast, tests are concise and easy to understand
- Long-term goal is to move everything upstream - how far should we go before then (in terms of framework / improvements / test cases)?
- Stop running it as a Job, run it as a Pod and exit 0/1 for success/failure
- Other upstream tests can benefit from this approach (network e2e tests, other e2e tests?)
- Jay will present at k8s sig-network meeting (03/05)
- Upstream feedback for the KEP:
  - Similar project called illuminatio
  - Make sure that the tests work the same no matter in which order "objects" are created (Pods, Network Policies, Labels, ...) - too many combinations to test them all but maybe we can isolate a few interesting scenarios
- Illuminatio implements some fuzz testing: test Network Policies present in the cluster by generating test cases
- There is a shell script (hack/netpol/test-kind.sh)
- Feature parity with upstream tests? Not yet (missing CIDR test and a few others), but about 80-90% of them; we also keep adding tests upstream which increases the gap :)
- Next steps: keep coming up with ideas and pushing them to hack/ or hardening the current stuff?
  - Solve the scale test problem before we harden the current code - we know this is a required case and we don't know how to solve it yet
  - See https://github.com/vmware-tanzu/antrea/issues/464
- Add new area/ label to Github for netpol issues
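
The idea of pairing a Network Policy definition with an expected reachability matrix can be sketched in Go as follows. This is only an illustration of the concept, not the code under hack/netpol: the `Reachability` type, its methods, the Pod names and the fake `probe` function are all hypothetical.

```go
package main

import "fmt"

// Reachability is a hypothetical truth table: expected[from][to] records
// whether traffic from Pod "from" to Pod "to" should be allowed.
type Reachability struct {
	pods     []string
	expected map[string]map[string]bool
}

// NewReachability builds a matrix where every pair has the same default.
func NewReachability(pods []string, defaultAllow bool) *Reachability {
	r := &Reachability{pods: pods, expected: map[string]map[string]bool{}}
	for _, from := range pods {
		r.expected[from] = map[string]bool{}
		for _, to := range pods {
			r.expected[from][to] = defaultAllow
		}
	}
	return r
}

// Expect sets the expectation for a single (from, to) pair.
func (r *Reachability) Expect(from, to string, allowed bool) {
	r.expected[from][to] = allowed
}

// ExpectAllIngress sets the expectation for all traffic towards "to".
func (r *Reachability) ExpectAllIngress(to string, allowed bool) {
	for _, from := range r.pods {
		r.expected[from][to] = allowed
	}
}

// Diff probes every pair and returns the pairs whose observed
// connectivity differs from the expectation.
func (r *Reachability) Diff(probe func(from, to string) bool) []string {
	var mismatches []string
	for _, from := range r.pods {
		for _, to := range r.pods {
			if probe(from, to) != r.expected[from][to] {
				mismatches = append(mismatches, from+" -> "+to)
			}
		}
	}
	return mismatches
}

func main() {
	pods := []string{"x/a", "x/b", "y/a"}
	r := NewReachability(pods, true)
	// Test case: a policy isolates x/a for ingress, except from x/b and itself.
	r.ExpectAllIngress("x/a", false)
	r.Expect("x/b", "x/a", true)
	r.Expect("x/a", "x/a", true)

	// Fake probe standing in for real Pod-to-Pod connectivity checks.
	probe := func(from, to string) bool { return !(from == "y/a" && to == "x/a") }
	fmt.Println("mismatches:", r.Diff(probe))
}
```

A test case then reduces to a policy definition plus a few `Expect` calls, which is what keeps the tests concise and easy to read.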
New issue for implementing kube-proxy in OVS: https://github.com/vmware-tanzu/antrea/issues/463
- Salvatore and Kobi have been thinking about this already - they will join forces with Quan and others
Antrea community meeting is maintained next week (03/11)

Recording

Antrea Community Meeting 03/04/2020

February 26, 2020

Minutes

NoEncap support merged in for 0.4.0, investigating support for managed K8s services of public clouds
Prometheus patch is still a work in progress; as a community we need to define which metrics are important
- Is it at all possible to display some Prometheus metrics in Octant? Octant probably not suited to display time series.
IPsec regression in 0.4.0
- Suspect that IPsec broke after moving from port-based to flow-based tunnels
- Not captured in CI because of improper cleanup: the agent does not handle tunnel type changes correctly and traffic goes on the un-encrypted tunnel
- Jianjun will work on a fix and we can consider a bug fix release
Review of open issues for 0.5.0
- Let’s try to start using the lifecycle/active label when an issue is actively being worked-on
- Windows support unlikely to be ready for 0.5.0
  - Cody to create an epic to track individual sub-tasks for Windows support, so that we can have a timeline
- Antrea support on Windows depends on OVS CloudBase changes (https://github.com/cloudbase/ovs)
  - Should we push for these changes to be merged upstream?
  - Don’t want to hinder our ability to move forward with Windows support
At Kubecon we will have the opportunity to demonstrate Antrea at the VMware booth: send demo ideas to Cody
Network Policy upstream testing initiative: Jay has an open PR that needs review, question is where to host it until it gets accepted upstream
Antrea lightning talk at Rejekts in Amsterdam: https://cfp.cloud-native.rejekts.io/cloud-native-rejekts-eu-2020/talk/QQZY3D/
There will be a community meeting next week so that Jay can present the upstream Network Policy work

Recording

Antrea Community Meeting 02/26/2020

February 12, 2020

Minutes

Review open issues for 0.4.0 milestone
- #253: ongoing effort to move testbeds to VMC (VMware on AWS) so that the Jenkins UI is publicly accessible
- Named port still on track for release (with all community tests passing)
- #323: @weiqiangt has a fix ready, will be included in release
- #355: fix ready but @antoninbas is investigating why new e2e test (to test the feature) is failing in CI (it is passing on local cluster)
- Prometheus: no new progress
- Cody made some progress on license file generation tool - should be able to open a PR this week
- #347: documentation updates for issue / PR workflow and labels almost ready to merge
- Website has been merged into a branch; some final adjustments are ongoing before it is merged into the main branch
- Compatibility version matrix: some ongoing work
Several small "good first issues" have been opened to try to attract external contributors
Multus integration: use Antrea only for primary IP or for secondary IPs as well? #368 needs more information from submitter.
#374: STT kernel module is not part of upstream kernel (OVS needs to be built from source)
- does STT really provide a performance benefit?
- if no, maybe we should just drop STT "support"; if yes, then we should update documentation with instructions on how to enable STT
#379: ongoing process to improve the upstream network policy tests
2 open PRs for Windows support
- CI system does not include a Windows K8s testbed at the moment
- let's use a "windows" feature branch and submit patches against it; merge the feature branch into main branch once it is complete and we have the ability to test the code in CI
Moving to a bi-weekly cadence for meetings; will plan 0.5.0 at the next meeting

Recording

Antrea Community Meeting 02/12/2020

January 29, 2020

Minutes

Review Cody’s draft for request for Antrea to be included as CNCF sandbox project
- Inspired by Harbor’s proposal
- We can also look at Contour’s proposal, which was presented at the last CNCF SIG Network meeting
- Salvatore has a concern that the document uses terminology specific to Antrea without defining it - Cody plans to add pointer to more detailed ROADMAP.md
Review open issues
- #345: Cody will point to an example -> Open Source License file will be required by some orgs consuming the project and by CNCF
- Antrea cleanup: Antonin will address Jianjun's feedback
- Prometheus: no new progress
  - current changes focus on enabling Antrea to report metrics to Prometheus, we don't have a comprehensive list of metrics we want to expose; Cody can provide feedback
- Publicly-accessible log servers for Jenkins CI
  - if we get accepted into CNCF, we can request some CI resources and move to public cloud
  - VMware IT request still pending to expose a public log server for current Jenkins testbed
- "No-encap" support: high probability that it will be 0.4.0
Come up with a compatibility matrix that we can publish with each release: K8s versions, OSes, cloud providers, ...; Cody will come up with a proposal and this should be published starting with 0.4.0
Bug scrub
Antrea website proof is ready, link will be posted on Antrea Slack channel for feedback
Cody will open some issues requesting additional documentation for some specific deployment modes; would like performance numbers to be available as well

Recording

Antrea Community Meeting 01/29/2020

January 22, 2020

Minutes

Proposed modifications to the development process: https://github.com/McCodeman/antrea/tree/project-management/docs/dev-process
- Lifecycle of issues / PRs, new labels for the Antrea repository, ...
- Motivation:
  - fairly young project but we want to grow fast in the upcoming year; there will be a lot of parallel work - this should give us more visibility into how the project is progressing
  - more formality => better transparency and higher velocity
- Cody will be responsible for making sure that issues / PRs are correctly triaged / labeled
- Code freeze a week before release? not needed yet, will revisit in the future
- Do we want to formalize how people can submit proposals; we currently use Google Docs but nothing has been formalized - seems like an okay place to start and we can revisit later
Review open issues for 0.3.0 release
- no useful antctl command can be run out-of-cluster and no available user-facing documentation: antctl binaries will not be shipped as part of the 0.3.0 release
- IPsec: Jianjun has an open PR (approved) to limit the tunnel type to GRE, will open a new PR for documentation but maybe after the release
- Prometheus: open PR ready for review - pushed out to 0.4.0
Need to review licenses of Antrea dependencies: Cody will look into it
Named Port support update: some open PRs, ongoing conformance testing

Recording

Antrea Community Meeting 01/22/2020

January 15, 2020

Minutes

Review open issues for 0.3.0 release
- leaning towards postponing Prometheus integration and NoEncap mode
- Kobi to update Prometheus issue with design doc / status update
- we tagged a few other bug fixes for 0.3.0 release
- any features graduated to Beta / GA?
  - Octant support was improved by Mengdie since last release, and Tong provided some “third-party” feedback; let’s target Beta status for next release (v0.4.0)
Salvatore’s proposal to have Antrea be a supported networker for OpenShift 4, which some people may ask for
- may want to have a conversation with RedHat later to officially support Antrea in the open source code (like OVN)
- Yasen: what would be the differentiator for Antrea?
- Salvatore to open an issue to track this
Ongoing work by Salvatore on dual-stack support
- large chunks of the IPAM code need to be updated
- also changes to CNI client, Network Policy code
- kube-proxy has to be in IPVS mode
- ongoing process, but slow
Antrea vs other CNIs: Cody has some comparison data
VMware has been running some scale tests for Antrea in the lab, results may be available publicly in the future
Cody will present his Antrea project boards at the next community meeting (January 22nd)

Recording

Antrea Community Meeting 01/15/2020

January 8, 2020

Minutes

Walkthrough of NoEncapsulation proposal by Su (@suwang48404)
Meeting next week (Jan 15) is maintained
- Cody will present project boards for Antrea and his proposal to streamline issue triage
- Review open issues for 0.3.0 release

Recording

Antrea Community Meeting 01/08/2020

December 18, 2019

Minutes

Review of 0.2.0 release issues
- network policy fixes have all been merged -> named port support still missing (not v0.2.0), currently ignoring named ports in Network Policy rules (i.e. traffic not allowed)
- CLI moved to next release (only command supported in current PR is "version")
- more tunnel types, stale CRDs removed (fix) -> no tracking issue, mention it in CHANGELOG
- promoting features: monitoring CRDs to Beta, Network Policy support stays in beta until all conformance tests pass, connectivity stays in beta (many small changes were made to the OVS pipeline since last release)
- target Thursday for the release (make sure we have run all conformance tests)
Plan release 0.3.0
- CLI support with some useful commands
- No-encap mode: priority for some cloud-providers (e.g. AKS) - need to talk about this offline (e.g. mailing list)
- IPsec support
- Any Prometheus integration?
- Delete all artifacts created by Antrea when it is deleted from the cluster
- Named port support?
Tentative release date for 0.3.0 is Jan 22nd
Prometheus integration
- architecture: have the controller be a central collector or have each agent report metrics?
- try to keep it separate from the core Antrea code as much as possible
- use Prometheus for more "static" data and commit to supporting that for 0.3.0, investigate what to do for more dynamic data (use Prometheus, another collector, ...)
- possible steps: 1) define metrics we want to expose & investigate endpoint discovery, 2) define a framework to export highly dynamic metrics, 3) troubleshooting: how to integrate the rest of the work we are planning with Prometheus / other visibility and open-tracing tools
- Kobi (@ksamoray) to drive this
Named port support for 0.3.0: when a named port corresponds to different port values for different pods, significant amount of work in Antrea
Integrating with cloud providers: maybe some changes required to Antrea core, need some additional work for each cloud provider
In the process of getting a proposal for the website ready, will share with the team soon
Postpone having versioned documentation as part of the Github directory structure until 1.0 release
Next meeting after the holidays
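
On the named port item above: the difficulty is that the same port name can map to a different numeric port on each Pod, so one policy rule may have to be expanded into several numeric rules. A minimal Go sketch of that resolution step, using a simplified hypothetical `Pod` type rather than the Kubernetes API types or Antrea's actual implementation:

```go
package main

import "fmt"

// Pod is a simplified, hypothetical stand-in for a Pod spec:
// it maps each container port name to its numeric value.
type Pod struct {
	Name  string
	Ports map[string]int
}

// ResolveNamedPort groups the Pods that define the named port by the
// numeric port it resolves to. Pods that do not define the name are skipped.
func ResolveNamedPort(pods []Pod, name string) map[int][]string {
	groups := map[int][]string{}
	for _, p := range pods {
		if port, ok := p.Ports[name]; ok {
			groups[port] = append(groups[port], p.Name)
		}
	}
	return groups
}

func main() {
	pods := []Pod{
		{Name: "web-1", Ports: map[string]int{"http": 80}},
		{Name: "web-2", Ports: map[string]int{"http": 8080}},
		{Name: "web-3", Ports: map[string]int{"http": 80}},
		{Name: "db-1", Ports: map[string]int{"sql": 5432}},
	}
	groups := ResolveNamedPort(pods, "http")
	// One rule on port name "http" needs one numeric rule per group.
	fmt.Println("numeric rules needed:", len(groups))
	fmt.Println("port 80:", groups[80])
	fmt.Println("port 8080:", groups[8080])
}
```

In this sketch a single rule on `http` splits into two groups of Pods, each needing its own numeric port match.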

Recording

Antrea Community Meeting 12/18/2019

December 11, 2019

Minutes

Walkthrough of the OVS pipeline by Antonin using the contents of PR #206
- some ongoing changes, PR will need to be updated
- PR #200 adds ARP spoof check for gw interface - even though we probably have other problems if an attacker is able to do that
- the policy tables (ingress & egress) use conntrack to accept all established connections regardless of current network policies
  - this means that established connections cannot be broken by updating network policies. Is this the desired behavior? Policy updates can be "bypassed" by keeping a TCP connection open.
  - other CNIs have this same issue, but maybe we can do better by checking policies for every packet with no loss of performance thanks to OVS
  - connection tracking ensures that "reverse" traffic for an authorized connection does not get dropped, regardless of the Pod's ingress policy rules; may be hard to remove the flow for established connections
- we probably do not need to add the ip match to all flows
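
The conntrack behavior discussed above (established connections are accepted before policy rules are evaluated) can be modeled with a small Go sketch. It is a toy model, not the actual OVS pipeline: the `pipeline`, `rule` and `conn` types are hypothetical, and a map stands in for the conntrack table.

```go
package main

import "fmt"

// rule identifies traffic that the current policy allows to be opened.
type rule struct {
	src, dst string
	dstPort  int
}

// conn identifies a single connection.
type conn struct {
	src, dst string
	srcPort  int
	dstPort  int
}

// pipeline is a toy model of the policy tables.
type pipeline struct {
	allowed     map[rule]bool // current network policy
	established map[conn]bool // stand-in for the conntrack table
}

func newPipeline() *pipeline {
	return &pipeline{allowed: map[rule]bool{}, established: map[conn]bool{}}
}

// process returns true if a packet of connection c is forwarded.
func (p *pipeline) process(c conn) bool {
	// Established connections are accepted without consulting the policy.
	if p.established[c] {
		return true
	}
	// New connections are checked against the policy and committed if allowed.
	if p.allowed[rule{c.src, c.dst, c.dstPort}] {
		p.established[c] = true
		return true
	}
	return false
}

func main() {
	p := newPipeline()
	r := rule{src: "client", dst: "server", dstPort: 80}
	oldConn := conn{src: "client", dst: "server", srcPort: 40000, dstPort: 80}
	newConn := conn{src: "client", dst: "server", srcPort: 40001, dstPort: 80}

	p.allowed[r] = true
	fmt.Println("old connection before policy update:", p.process(oldConn))

	// The policy is updated and no longer allows this traffic.
	delete(p.allowed, r)
	fmt.Println("old connection after policy update:", p.process(oldConn))
	fmt.Println("new connection after policy update:", p.process(newConn))
}
```

In this model the old connection keeps being forwarded after the policy update while the new one is dropped, which is the "bypass" concern raised in the discussion.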
Status of v0.2.0 release: https://github.com/vmware-tanzu/antrea/milestone/1
- support for except field for network policies has been merged in
- new issue we may want to address for release: #197; causing some community network policy tests to fail
- need to review CLI PR #208
How do we simplify log collection for bug reports / support requests?
- let's define what we want to collect, then worry about how we can collect it automatically
  - Antrea logs, kube-apiserver logs, kubelet logs, kube-proxy logs / config maps
  - make sure we do not expose secret / sensitive information
  - OVS logs, OVS flows, iptables rules
  - which container runtime is used
  - we can collaboratively build a list in issue #11 before next meeting
How to separate user-facing documentation / dev-facing documentation? Which tools are we planning to use to structure the documentation?
- Cody will look into it
- Keep everything on Github, documentation should be versioned
- Read The Docs / Jekyll?
Last meeting before the holidays will be next Wednesday (12/18) - finalize v0.2.0 release

Recording

Antrea Community Meeting 12/11/2019

December 4, 2019

Minutes

Objectives of the community meetings
- a mix between a developer meeting and a release management meeting
  - discuss issues, review proposals and brainstorm new ideas
  - releases are correctly planned and on track (if not re-assign issues appropriately)
- in the future, may become a "user meeting" as well to discuss users' needs and pain-points
Architecture walkthrough by Jianjun: Antrea components & traffic walk
- see architecture document
- L2 broadcast traffic never leaves the Node, local OVS switch replies to ARP requests for remote gateways
Upcoming documentation on OVS pipeline & network policy computation (with detailed examples)
- can do deep dives at the meeting when docs are available
Release management
- we will use Github milestones to track releases and tag issues appropriately (Jira has no free plan)
- bug fix releases:
  - we released v0.1.1 last week to fix Kind support on Linux
  - no outstanding bugs urgently require a new bug fix release - network policy patches may be hard to cherry-pick into the release branch (conflicts)
Release plan for v0.2.0
- we need to be able to run conformance tests (network e2e tests) and network policy tests
  - adding support for "named port" for network policies is not trivial (code re-org required) so should be a stretch goal for v0.2; some network policy tests will fail without it
- CLI framework with some basic debugging commands
- target date is December 18th
Running conformance tests / network policy tests as part of CI
- Run the full suite to qualify releases; ideally should be automated
- Run a smaller subset for every PR - it seems that running the entire network policy test suite takes 1+ hour
Review of open issues
- #119 AI: update documentation to state that old CNI must be deleted & Pods rescheduled when deploying Antrea
- Kind support is currently broken on macOS, no solution to fix it at the moment

Recording

Antrea Community Meeting 12/04/2019