
BPF MASQ for veth mode and ip-masq-agent #11148

Merged — 13 commits merged into master from pr/brb/bpf-masq-and-agent on Apr 30, 2020

Conversation

@brb (Member) commented Apr 24, 2020

This PR is the same as #10878, but minus the multi-dev NodePort and the tests. I've created it for easier reviewing. Once it's merged, I will open a PR for the rest.

  • Removes BPF MASQ from ipvlan.
  • BPF MASQ for veth mode (ipv4-only).
  • BPF ip-masq-agent.

The follow-up TODO: #11018.

Fix #10229
Fix #9922

Add BPF masquerading for veth mode
Add BPF-based ip-masq-agent

@joestringer I believe you have already reviewed all the changes in this PR.

@brb brb added pending-review sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. release-note/major This PR introduces major new functionality to Cilium. labels Apr 24, 2020
@brb brb added this to the 1.8 milestone Apr 24, 2020
@brb brb requested review from borkmann, joestringer and a team April 24, 2020 20:05
@brb brb requested review from a team as code owners April 24, 2020 20:05
@brb brb requested a review from a team April 24, 2020 20:05
@brb brb added this to In progress in 1.8.0 via automation Apr 24, 2020
@brb (Member Author) commented Apr 24, 2020

test-me-please

@brb (Member Author) commented Apr 24, 2020

test-me-please

@coveralls commented Apr 24, 2020

Coverage increased (+0.005%) to 44.607% when pulling 892cefb on pr/brb/bpf-masq-and-agent into 4e17dfe on master.

@joestringer (Member) left a comment

Just minor nits in addition to previous feedback, LGTM.

Documentation/cmdref/cilium_bpf_ipmasq_list.md (resolved)
Documentation/gettingstarted/ipvlan.rst (resolved)
pkg/datapath/loader/base.go (resolved)
bpf/lib/nodeport.h (resolved)
daemon/cmd/daemon_main.go (resolved)
if option.Config.EnableIPMasqAgent {
    if err := ipmasq.Start(option.Config.IPMasqAgentConfigPath,
        option.Config.IPMasqAgentSyncPeriod); err != nil {
        log.WithError(err).Fatal("ip-masq-agent failed to start")
Member:

I note that the inability to restore will cause daemon crash here. On the flip side, if there's a problem with the config we will instead just put the controller into an error state and still successfully run the daemon. Not sure if that's specifically thought through or just the most convenient way to represent those errors?

I can't think of a specific failure case that would cause this to be a problem, just seems slightly mismatched in the representation of the errors. Probably restore won't fail anyway, it just requires the ability to create the map (incl. the resource limits stuff) + ability to dump a (potentially empty) map.
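
For context, the "resource limits stuff" mentioned above is usually the RLIMIT_MEMLOCK cap that BPF map creation runs into. A minimal sketch of raising it with golang.org/x/sys/unix (an illustration, not the actual Cilium code):

package main

import (
    "log"

    "golang.org/x/sys/unix"
)

// bumpMemlockRlimit lifts RLIMIT_MEMLOCK so that BPF map creation does not
// fail with EPERM on kernels that still account BPF memory against it.
func bumpMemlockRlimit() error {
    limit := unix.Rlimit{Cur: unix.RLIM_INFINITY, Max: unix.RLIM_INFINITY}
    return unix.Setrlimit(unix.RLIMIT_MEMLOCK, &limit)
}

func main() {
    if err := bumpMemlockRlimit(); err != nil {
        log.Fatalf("failed to raise RLIMIT_MEMLOCK: %v", err)
    }
}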

Member Author:

That's intended.

If the restore fails, then it's a good indicator that something is wrong with the system (e.g. the resource limit), and it's OK for cilium-agent not to start. Otherwise, we might risk introducing subtle changes in behavior which could be difficult for a user to debug.

Meanwhile, crashing a cluster network due to a syntax error in an ip-masq-agent config file, or an inability to update the related BPF map, is IMHO too dangerous.

Member:

Overall I think this can come down to two different user approaches: There are some users that would prefer things crash/break as soon as any failure occurs so that they can deal with them. Other users will prefer that things just work, and log a message describing why things are not 100% so that they can follow up later.

In general I think in Cilium if there is a pure user input mismatch (eg daemon config options) we are crashing out early to signal to the user that Cilium is misconfigured and will not operate correctly. I think this is fine, as it is most likely to occur when the user first configures Cilium or first attempts an upgrade, and it's purely functional on a set of provided parameters. This state from the configmap is probably similar "user input" which we should validate clearly, but I'm still mulling it over. The weird part I point out above is that if a user customizes the configmap while Cilium is running and misconfigures it, then Cilium will continue running happily (admittedly with a warning log). Next time Cilium is restarted, it will go into CrashLoopBackOff because of this input validation. If they don't test properly then they may inadvertently set Cilium up to crash out one day, and only find out days/weeks later when restarting Cilium (or worst case when Cilium crashes). At that point a misconfiguration can lead to a wider Cilium outage.

When we get into potential transient errors particularly for areas like this which are not pure connectivity or pure security, it gets a little less clear when Cilium should crash out. Crashing the Cilium daemon can cause DNS connectivity issues for instance if the user runs FQDN policy. So if we're getting away from "this is the first time the user runs Cilium in this setup" then I think we should err on the side of reporting an error through normal mechanisms (eg log a warning; that will also show up in metrics via the 'number of warnings/errors' metric) and continuing with functionality disabled.

Member Author:

If they don't test properly then they may inadvertently set Cilium up to crash out one day, and only find out days/weeks later when restarting Cilium

That's a very valid concern.

I think we should err on the side of reporting an error through normal mechanisms (eg log a warning

Agreed. I've changed restore() to log an error instead of crashing.
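
A minimal sketch of that pattern, with a hypothetical cidrMap interface standing in for the real BPF map wrapper in pkg/ipmasq (illustrative only, not the merged code):

package ipmasqexample

import (
    "log"
    "net"
)

// cidrMap is a hypothetical stand-in for the ip-masq-agent BPF map wrapper.
type cidrMap interface {
    Dump() ([]net.IPNet, error)
}

// restore rebuilds the in-memory CIDR cache from the BPF map. Per the
// discussion above, a dump failure is logged and the agent keeps running
// instead of calling Fatal.
func restore(m cidrMap, cache map[string]net.IPNet) {
    cidrs, err := m.Dump()
    if err != nil {
        log.Printf("ip-masq-agent: failed to restore CIDRs from the BPF map: %v", err)
        return
    }
    for _, c := range cidrs {
        cache[c.String()] = c
    }
}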

pkg/ipmasq/ipmasq.go (resolved)
@brb brb force-pushed the pr/brb/bpf-masq-and-agent branch from 0e9781c to 262fd09 on April 25, 2020 06:17
@brb (Member Author) commented Apr 25, 2020

test-me-please

1 similar comment
@brb (Member Author) commented Apr 25, 2020

test-me-please

@brb (Member Author) commented Apr 26, 2020

Hmm, hitting K8sDatapathConfig ManagedEtcd Check connectivity with managed etcd.

@brb (Member Author) commented Apr 26, 2020

test-me-please

@christarazi (Member) left a comment

Looks good!

Documentation/cmdref/cilium-agent.md (resolved)
Documentation/cmdref/cilium-agent.md (resolved)
# destinations outside of the cluster.
masquerade: true

# bpfMasquerade enables masquerading with BPF instead of iptables
bpfMasquerade: false
Member:

Why aren't we enabling this by default for new installations?

Member Author:

Once we have enough confidence about this feature, I'm happy to enable this by default;-)

Member:

@brb so when will we have confidence about it? Does this mean the feature is in beta? If it's not enabled for new installations will anyone use it?

Member Author:

Let's discuss it during the next meeting.

{{- if .Values.global.egressMasqueradeInterfaces }}
egress-masquerade-interfaces: {{ .Values.global.egressMasqueradeInterfaces }}
{{- end }}
{{- if and .Values.global.ipMasqAgent .Values.global.ipMasqAgent.enabled }}
Member:

This requires a warning somewhere in the docs or in the Helm generation: the ip-masq-agent ConfigMap needs to be in the same namespace where Cilium is running.

Member Author:

Good point, I will mention it in the docs.

daemon/cmd/daemon_main.go (resolved)
daemon/cmd/daemon_main.go (resolved)
daemon/cmd/daemon_main.go (resolved)

// config represents the ip-masq-agent configuration file encoded as YAML
type config struct {
    NonMasqCIDRs []Ipnet `yaml:"nonMasqueradeCIDRs"`
}
Member:

What about JSON?

Member Author:

Will add it in the follow-up!
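
A self-contained sketch of how the config could be decoded, including the JSON variant discussed here (the Ipnet parsing details are illustrative, not necessarily the exact pkg/ipmasq code):

package main

import (
    "encoding/json"
    "fmt"
    "net"
    "strings"

    "gopkg.in/yaml.v2"
)

// Ipnet wraps net.IPNet so it can be decoded from a CIDR string.
type Ipnet net.IPNet

// UnmarshalYAML parses a CIDR string such as "10.0.0.0/8".
func (n *Ipnet) UnmarshalYAML(unmarshal func(interface{}) error) error {
    var s string
    if err := unmarshal(&s); err != nil {
        return err
    }
    return n.parse(s)
}

// UnmarshalJSON is the JSON counterpart discussed as a follow-up item.
func (n *Ipnet) UnmarshalJSON(data []byte) error {
    return n.parse(strings.Trim(string(data), `"`))
}

func (n *Ipnet) parse(s string) error {
    _, ipnet, err := net.ParseCIDR(s)
    if err != nil {
        return err
    }
    *n = Ipnet(*ipnet)
    return nil
}

// config mirrors the struct under review: the ip-masq-agent configuration file.
type config struct {
    NonMasqCIDRs []Ipnet `json:"nonMasqueradeCIDRs" yaml:"nonMasqueradeCIDRs"`
}

func main() {
    yamlCfg := []byte("nonMasqueradeCIDRs:\n- 10.0.0.0/8\n- 192.168.0.0/16\n")
    var c config
    if err := yaml.Unmarshal(yamlCfg, &c); err != nil {
        panic(err)
    }
    fmt.Println(len(c.NonMasqCIDRs), "CIDRs from YAML")

    jsonCfg := []byte(`{"nonMasqueradeCIDRs": ["172.16.0.0/12"]}`)
    if err := json.Unmarshal(jsonCfg, &c); err != nil {
        panic(err)
    }
    fmt.Println(len(c.NonMasqCIDRs), "CIDRs from JSON")
}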

Comment on lines +99 to +106
manager.UpdateController("ip-masq-agent",
    controller.ControllerParams{
        DoFunc: func(ctx context.Context) error {
            return a.Update()
        },
        RunInterval: syncPeriod,
    },
)
Member:

[2] Use something similar to this:

type configDirectoryWatcher struct {
    watcher   *fsnotify.Watcher
    lifecycle clusterLifecycle
    path      string
    stop      chan struct{}
}

func createConfigDirectoryWatcher(path string, lifecycle clusterLifecycle) (*configDirectoryWatcher, error) {
    watcher, err := fsnotify.NewWatcher()
    if err != nil {
        return nil, err
    }

    if err := watcher.Add(path); err != nil {
        watcher.Close()
        return nil, err
    }

    return &configDirectoryWatcher{
        watcher:   watcher,
        path:      path,
        lifecycle: lifecycle,
        stop:      make(chan struct{}),
    }, nil
}

Member Author:

Both you and @joestringer asked me about this (https://github.com/cilium/cilium/pull/10878/files#r409979845). I could implement it in the follow-up, but do you really think it's worth increasing the complexity of ip-masq-agent? I expect the config files to be relatively simple.

Member:

Yes, it's one less flag to have, and we won't wait at least 60 seconds until changes are enforced; they will be applied as soon as kubelet has the files locally.

Member Author:

Is it OK if I do it in the follow-up?

Member Author:

Added to the follow-up items.
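
A minimal sketch of the fsnotify-based alternative discussed in this thread, with a hypothetical updater interface standing in for the agent's Update() (names are illustrative, not the eventual implementation):

package ipmasqwatch

import (
    "log"

    "github.com/fsnotify/fsnotify"
)

// updater is a hypothetical stand-in for the ip-masq-agent's Update method.
type updater interface {
    Update() error
}

// watchConfig resyncs whenever the config path changes, instead of polling
// on a fixed --ip-masq-agent-sync-period.
func watchConfig(path string, u updater) error {
    watcher, err := fsnotify.NewWatcher()
    if err != nil {
        return err
    }
    if err := watcher.Add(path); err != nil {
        watcher.Close()
        return err
    }
    go func() {
        defer watcher.Close()
        for {
            select {
            case event, ok := <-watcher.Events:
                if !ok {
                    return
                }
                // ConfigMap updates surface as create/write/rename events on
                // the mounted file, so resync on any event.
                log.Printf("ip-masq-agent: config event %s, resyncing", event.Op)
                if err := u.Update(); err != nil {
                    log.Printf("ip-masq-agent: update failed: %v", err)
                }
            case err, ok := <-watcher.Errors:
                if !ok {
                    return
                }
                log.Printf("ip-masq-agent: watcher error: %v", err)
            }
        }
    }()
    return nil
}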

Comment on lines +542 to +548
flags.Duration(option.IPMasqAgentSyncPeriod, 60*time.Second, "ip-masq-agent configuration file synchronization period")
option.BindEnv(option.IPMasqAgentSyncPeriod)
Member:

This won't be required if [2] is done instead.

@brb brb force-pushed the pr/brb/bpf-masq-and-agent branch 2 times, most recently from 16e5182 to 3d8d178 on April 27, 2020 13:21
@aanm (Member) left a comment

Assuming all follow-up items are done.

Comment on lines 117 to 115
for cidrStr, cidr := range a.nonMasqCIDRsInMap {
    if _, ok := a.nonMasqCIDRsFromConfig[cidrStr]; !ok {
        log.WithField(logfields.CIDR, cidrStr).Info("Removing CIDR")
        a.ipMasqMap.Delete(cidr)
        delete(a.nonMasqCIDRsInMap, cidrStr)
    }
}

for cidrStr, cidr := range a.nonMasqCIDRsFromConfig {
Member:

Oh, one more point here. Did you consider the ordering between add/delete here?

Recent, similar BPF map update changes on the policy side have shown that the ordering can impact datapath availability; see #10936.

EDIT: To be explicit, I mean to say: If someone updates the configmap, will we cause connectivity issues / misrouted traffic temporarily during the update because of the ordering between deletes and adds?

Member Author:

I changed the order so that adds happen before deletes just to make new CIDRs available a bit quicker.

The ordering won't have any effect on established connections. However, if there is a connection which previously was masqueraded, and after the CIDR update it is no longer subject to masquerading, then the egressing packets from the pod after the update won't be masqueraded, and thus will be dropped. If that's an issue, we can always add a lookup in the BPF (S)NAT map when deciding whether masquerading is needed, but that would come with a performance penalty in the fastpath.
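
A sketch of the adds-before-deletes reconciliation described above, again with a hypothetical map wrapper in place of the real pkg/ipmasq types:

package ipmasqsync

import "net"

// lpmMap is a hypothetical stand-in for the ip-masq-agent BPF LPM map wrapper.
type lpmMap interface {
    Update(cidr net.IPNet) error
    Delete(cidr net.IPNet) error
}

// syncCIDRs reconciles the BPF map with the CIDRs read from the config file.
// New CIDRs are inserted first so they take effect as soon as possible;
// stale CIDRs are removed afterwards.
func syncCIDRs(m lpmMap, fromConfig, inMap map[string]net.IPNet) error {
    for cidrStr, cidr := range fromConfig {
        if _, ok := inMap[cidrStr]; !ok {
            if err := m.Update(cidr); err != nil {
                return err
            }
            inMap[cidrStr] = cidr
        }
    }
    for cidrStr, cidr := range inMap {
        if _, ok := fromConfig[cidrStr]; !ok {
            if err := m.Delete(cidr); err != nil {
                return err
            }
            delete(inMap, cidrStr)
        }
    }
    return nil
}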

@borkmann (Member) left a comment

Just small things on the BPF datapath side; once resolved, LGTM. IPv6 support is missing and should probably be added to the follow-ups. The SNAT engine supports IPv6 and we do it in NodePort SNAT as well, so we should also do it for the masq agent.

brb added 13 commits April 30, 2020 13:24
Currently, it is broken, and it requires a rework. The removal will
help us to refactor the BPF SNAT engine.

Also, this allows us to remove dead SNAT code and to inline the
helper macros.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
Currently, the flag is a no-op except that it disables provisioning of
iptables MASQUERADE rules when set.

The flag is going to be used to control BPF MASQUERADE'ing.

By default, it is set to false, so that users who upgrade to
Cilium v1.8 keep the existing behaviour (= MASQ is done by
iptables).

Signed-off-by: Martynas Pumputis <m@lambda.lt>
This commit enables MASQUERADE'ing packets sent from local endpoints to
outside (WORLD_ID).

The MASQ is enabled only when --enable-bpf-masquerade is set.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
This commit fixes DSR and BPF MASQ when both are enabled:

- For the egress path (local_endpoint -> outside), do not skip (S)NAT if the
  L4 protocol uses DSR.
- For the ingress path, do not skip the reverse SNAT xlation attempt if the
  L4 protocol uses DSR. Currently, the rev-xlation happens inside the
  CILIUM_CALL_IPV4_NODEPORT_NAT tailcall.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
The param controls whether BPF MASQ should be enabled. Currently, it
defaults to "false", and cilium-agent defaults to "false" as well, so
that existing users who upgrade to v1.8 will preserve the existing
behavior (iptables).

Signed-off-by: Martynas Pumputis <m@lambda.lt>
Otherwise, the map creation in the probe (sometimes) fails with EPERM.
See [1] for details re rlimit.

[1]: https://docs.cilium.io/en/v1.7/bpf/#miscellaneous

Signed-off-by: Martynas Pumputis <m@lambda.lt>
This commit introduces the following ip-masq-agent related flags:

* --enable-ip-masq-agent
* --ip-masq-agent-config-path
* --ip-masq-agent-sync-period

Currently, the flags are no-ops.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
The LPM map is going to be used by Cilium's ip-masq-agent to
store CIDRs which should bypass masquerading.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
This commit introduces the "ip-masq-agent" controller which is going to
be used to sync the ipmasq BPF maps with config files updated by
users.

Also, enable ip-masq-agent when --enable-ip-masq-agent is set to true.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
It makes integration via ConfigMap much easier, as a user-provided
ConfigMap will be mounted as a file into the cilium-agent pod, and its
content is encoded as YAML.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
This commit extends the SNAT engine so that a packet sent from a
local endpoint to the outside is NOT masqueraded if the packet's dst IP
addr matches an entry in the ipmasq map.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
The cmd lists all CIDRs in the ip-masq-agent BPF LPM map.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
The BPF ip-masq-agent can be enabled with global.ipMasqAgent.enabled.

When enabled, the ip-masq-agent configmap (created by a user) is mounted
into /etc/config/ip-masq-agent of the cilium-agent pod.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
@brb brb force-pushed the pr/brb/bpf-masq-and-agent branch from 8290499 to 892cefb on April 30, 2020 11:38
@brb (Member Author) commented Apr 30, 2020

@borkmann

What if REMOTE_NODE_ID is not used and the other endpoints also have HOST_ID?

Hmm, good question. I assumed that new deployments will have it enabled. If it's disabled, perhaps it's safe to check for HOST_ID instead, as a packet hitting this check is intended to leave the current node. I've added this item to the follow-up list.

Should this still be defined(hybrid) && !defined(masquerade) given the latter is not enabled by default anyway? Otherwise we'd snat unnecessarily here.

The removal of the check should not cause DSR to be SNAT'd. After my PR is merged, nodeport_nat_ipv4_needed() is called only with dir=NAT_DIR_EGRESS, i.e. for packets which leave a node. In the case of DSR, the helper function is called for the following packets (SNAT is not done for all of them):

  • Forwarded request from a client to a remote node.
  • Reply from a remote endpoint to a middle node (on the remote node).

And it won't be called for the following packet, which would be subject to SNAT:

  • Reply from a local endpoint to a client (because ..._NODEPORT_REVNAT sets MARK_MAGIC_SNAT_DONE, which makes the packet bypass the helper function).

Also, if we missed a case and DSR were SNAT'd, I believe the CI tests would catch it, as we do source IP addr validation.

@brb (Member Author) commented Apr 30, 2020

test-me-please

@borkmann (Member) commented:

@borkmann

What if REMOTE_NODE_ID is not used and the other endpoints also have HOST_ID?

Hmm, good question. I assumed that new deployments will have it enabled. If it's disabled, perhaps it's safe to check for HOST_ID instead, as a packet hitting this check is intended to leave the current node. I've added this item to the follow-up list.

Yeah, I think the HOST_ID check should suffice here since there's a prior check on !(ep->flags & ENDPOINT_F_HOST). So these should only be remote addresses, iiuc.

@brb (Member Author) commented Apr 30, 2020

CI hit the #10763 flake.

@brb (Member Author) commented Apr 30, 2020

test-me-please

@brb (Member Author) commented Apr 30, 2020

CI failed:

[2020-04-30T14:04:40.154Z] echo "docker push cilium/docker-plugin:latest-amd64"
[2020-04-30T14:04:40.154Z] docker push cilium/docker-plugin:latest-amd64
[2020-04-30T14:04:40.154Z] build -f hubble-relay.Dockerfile -t "cilium/hubble-relay:latest" .
[2020-04-30T14:04:40.154Z] /bin/bash: build: command not found
[2020-04-30T14:04:40.154Z] Makefile:290: recipe for target 'docker-hubble-relay-image' failed
[2020-04-30T14:04:40.154Z] make: *** [docker-hubble-relay-image] Error 127

@brb (Member Author) commented Apr 30, 2020

test-me-please

1 similar comment
@qmonnet (Member) commented Apr 30, 2020

test-me-please

@joestringer (Member) commented:

Code has approvals for all codeowners, and GKE check is not required. Merging.

@joestringer joestringer merged commit 462da1d into master Apr 30, 2020
1.8.0 automation moved this from In progress to Merged Apr 30, 2020
@joestringer joestringer deleted the pr/brb/bpf-masq-and-agent branch April 30, 2020 19:45