
Enabling --enable-bpf-masquerade with enable-host-services=false when using kube-proxy prevents connectivity to ClusterIP services from remote hosts #12699

Open
joestringer opened this issue Jul 29, 2020 · 10 comments
Assignees
Labels
  area/kube-proxy: Issues related to kube-proxy (not the kube-proxy-free mode).
  area/loadbalancing: Impacts load-balancing and Kubernetes service implementations.
  feature/snat: Relates to SNAT or Masquerading of traffic.
  kind/bug: This is a bug in the Cilium logic.
  sig/datapath: Impacts bpf/ or low-level forwarding details, including map management and monitor messages.

Comments

@joestringer
Member

joestringer commented Jul 29, 2020

Summary

When Cilium is deployed alongside kube-proxy (i.e. without host-reachable services), hostNetwork pods (and applications running directly on hosts) cannot establish connectivity to pods on remote nodes via a ClusterIP service. Regular pods are not affected by this issue.
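
A quick way to observe the failure by hand (a sketch; echo-b is an assumed service name from the connectivity checks, substitute any ClusterIP service backed by pods on a remote node):

  # Look up the ClusterIP of a service backed by pods on a remote node.
  CLUSTER_IP=$(kubectl get svc echo-b -o jsonpath='{.spec.clusterIP}')

  # From a shell on a node (host network namespace) this hangs when the bug is
  # present, while the same request from a regular pod succeeds.
  curl -sv --connect-timeout 5 "http://${CLUSTER_IP}:80/"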

Environment

  • Cilium v1.8.2
    • Using kube-proxy
  • Kernel 5.4
  • Kubernetes v1.17 via KIND v0.7.0

How to reproduce the issue

  1. Create a 2-node KIND cluster from the Cilium git tree using kind create cluster --config=.github/kind-config.yaml
  2. Load Cilium via these instructions:

     - name: Load local images into kind cluster
       run: |
         kind load docker-image --name chart-testing cilium/cilium:latest
         kind load docker-image --name chart-testing cilium/operator-generic:latest
     - name: Install cilium chart
       run: |
         helm install cilium ./install/kubernetes/cilium \
           --wait \
           --namespace kube-system \
           --set global.nodeinit.enabled=true \
           --set global.kubeProxyReplacement=partial \
           --set global.hostServices.enabled=false \
           --set global.externalIPs.enabled=true \
           --set global.nodePort.enabled=true \
           --set global.hostPort.enabled=true \
           --set config.ipam=kubernetes \
           --set global.pullPolicy=Never

     (I later modified the image manually to v1.8.2, but you can docker pull ... + kind load docker-image ... to preload specific versions.)
  3. Deploy the connectivity check YAML
  4. Edit the host-to-b-multi-node-clusterip deployment via kubectl edit deploy host-to-b-multi-node-clusterip
    • Replace the livenessProbe with a readinessProbe
  5. Observe that the new pod created for this deployment never becomes ready (steps 3-5 are sketched as commands after this list)
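
A sketch of steps 3-5 as commands (the YAML path and the name= label selector are assumptions based on the connectivity-check tree and the generated manifests quoted later in this thread):

  # 3. Deploy the connectivity checks from the Cilium tree (path assumed).
  kubectl apply -f examples/kubernetes/connectivity-check/connectivity-check.yaml

  # 4. Swap livenessProbe -> readinessProbe in the host-to-ClusterIP check.
  kubectl edit deployment host-to-b-multi-node-clusterip

  # 5. Watch the replacement pod; it never reports Ready while the bug is present.
  kubectl get pods -l name=host-to-b-multi-node-clusterip -w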

Symptoms

The source node sees the SYN go out to the remote pod via the overlay:

root@kind-control-plane:/home/cilium# cilium monitor
...
-> overlay flow 0x88ee9aa7 identity 6->0 state new ifindex cilium_vxlan orig-ip 0.0.0.0: 172.17.0.3:28440 -> 10.244.1.175:80 tcp SYN

The destination node (kind-worker) sees the SYN and passes it to the pod, which responds with a SYN-ACK. The reply is passed up to the stack:

root@kind-worker:/home/cilium# cilium monitor
...
-> endpoint 351 flow 0xfd614d2c identity 6->13896 state new ifindex lxccc74c557b238 orig-ip 172.17.0.3: 172.17.0.3:29228 -> 10.244.1.175:80 tcp SYN
-> stack flow 0xeac287bd identity 13896->6 state reply ifindex 0 orig-ip 0.0.0.0: 10.244.1.175:80 -> 172.17.0.3:29228 tcp SYN, ACK

With tcpdump at the destination node (kind-worker), we see that the reply is being SNAT'd to the node's IP (172.17.0.2), so it no longer matches the source host's original connection and the handshake never completes:

# tcpdump -nvei eth0 | grep  172.17.0.3
...
    172.17.0.3.50507 > 10.244.1.175.80: Flags [S], cksum 0xb8e5 (incorrect -> 0x36d2), seq 2626802039, win 64240, options [mss 1460,sackOK,TS val 3387777011 ecr 0,nop,wscale 7], length 0
...
    172.17.0.2.57629 > 172.17.0.3.50507: Flags [S.], cksum 0x5856 (incorrect -> 0x72a4), seq 2606385031, ack 2626802040, win 64308, options [mss 1410,sackOK,TS val 901524270 ecr 3387777011,nop,wscale 7], length 0

Mitigation

I manually modified the cilium-config ConfigMap to set enable-bpf-masquerade: "false", then restarted the Cilium pods, and the connectivity check started working.
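
For reference, the same mitigation can be applied non-interactively along these lines (a sketch assuming the default kube-system install):

  # Turn off BPF-based masquerading in the agent ConfigMap.
  kubectl -n kube-system patch configmap cilium-config --type merge \
    -p '{"data":{"enable-bpf-masquerade":"false"}}'

  # Restart the Cilium agents so they pick up the new setting.
  kubectl -n kube-system rollout restart daemonset/cilium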

@joestringer added the kind/bug, kind/question and sig/datapath labels on Jul 29, 2020
@joestringer
Member Author

@brb Should we be disabling bpf-based masquerade when using kube-proxy to prevent issues like this?

@joestringer changed the title from "Enabling --enable-bpf-masquerade when using kube-proxy prevents connectivity to ClusterIP services from remote hosts when running kube-proxy" to "Enabling --enable-bpf-masquerade with enable-host-services=false when using kube-proxy prevents connectivity to ClusterIP services from remote hosts when running kube-proxy" on Jul 29, 2020
@joestringer
Member Author

There may be some simple answers here, like requiring enable-host-services if enable-bpf-masquerade is enabled.
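
In the meantime, one workaround along those lines for the Helm install above would be to turn host-reachable services on (a sketch against the v1.8 chart values used earlier; other versions use different value names):

  helm upgrade cilium ./install/kubernetes/cilium \
    --namespace kube-system \
    --reuse-values \
    --set global.hostServices.enabled=true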

@brb changed the title from "Enabling --enable-bpf-masquerade with enable-host-services=false when using kube-proxy prevents connectivity to ClusterIP services from remote hosts when running kube-proxy" to "Enabling --enable-bpf-masquerade with enable-host-services=false when using KIND and kube-proxy prevents connectivity to ClusterIP services from remote hosts" on Jul 29, 2020
@brb changed the title from "Enabling --enable-bpf-masquerade with enable-host-services=false when using KIND and kube-proxy prevents connectivity to ClusterIP services from remote hosts" to "Enabling --enable-bpf-masquerade with enable-host-services=false when using kube-proxy prevents connectivity to ClusterIP services from remote hosts" on Jul 29, 2020
@brb self-assigned this on Jul 29, 2020
@brb
Member

brb commented Jul 29, 2020

Does KIND create two interfaces on each node (one with a default route, one for communication between nodes)?

@joestringer
Member Author

Whatever solution we decide on, we should consider updating the KIND install parameters to ensure they're simplified (along with the smoke tests): https://docs.cilium.io/en/latest/gettingstarted/kind

joestringer added a commit to joestringer/cilium that referenced this issue Jul 29, 2020
In the smoke test, we are relying on kube-proxy for service connectivity
so it doesn't make sense to enable BPF masquerading. In fact, this
causes issues for connectivity from a node to a pod on a remote node via
ClusterIP (see related issue).

For the moment, disable BPF masquerading while we figure out the
longer-term solution to that issue.

Related: cilium#12699

Signed-off-by: Joe Stringer <joe@cilium.io>
nathanjsweet pushed a commit that referenced this issue Jul 29, 2020
* connectivity-check: Add 'make clean' support

Factor out the targets for all YAMLs so it can be reused by a new phony
target, 'clean'.

Signed-off-by: Joe Stringer <joe@cilium.io>

* connectivity-check: Introduce cuelang framework

CUE (https://cuelang.org/) is a data constraint language defined as a
superset of JSON which aims to "simplify tasks involving defining and
using data".

In the context of the connectivity check YAMLs, CUE is useful because it
allows us to "evaporate" the boilerplate necessary to define Kubernetes
YAMLs for Deployments, Services and CiliumNetworkPolicies, and lets
developers specify the various permutations of connectivity checks
concisely.

Why should we use it?

* It's more concise: One template definition, multiple reuse. This is
  useful for introducing new connectivity checks, as upcoming commits
  will demonstrate, since the developer doesn't need to perform the
  tedious and error-prone process of copying and modifying the YAMLs to
  implement the various permutations of a check. Furthermore, this helps
  reviewers: they will not have to read through swathes of YAML but can
  instead focus on the diffs in the templating that are introduced and
  compare them to existing data definitions.

* Consolidated constant declarations. When a core change needs to be
  made to something like the readinessProbe for probes that expect a
  success or failure, we can update one definition in the main CUE file
  and all YAMLs will subsequently be generated with this change in mind.
  During the process of preparing these changes, I noticed apparently
  unintentional inconsistencies between various existing YAMLs, where
  some YAMLs were improved with better timeout behaviour or error
  rendering, but other YAMLs were left out.

* The data is more structured. Upcoming commits will introduce simple
  CLI tools that allow matching on different classes of connectivity
  checks to generate the corresponding YAMLs. Previously we have
  depended upon file naming schemes and Makefile globbing magic to
  implement this, which quickly reaches its limits when deciding which
  checks should be selected for a specific scenario.

What are the dangers?

* It's relatively immature. At its current version, v0.2.2, it is subject
  to language changes. Upcoming commits will pin the CLI tool usage to a
  docker container derived from this version to ensure compatibility.
* One more language in the tree to understand, review and interact with.
  Mitigating circumstances: This language comes out of the Golang
  community and as such brings some commonalities; furthermore it is
  beginning to be used in other Kubernetes projects, so there is some
  broader community alignment.
* Its power allows you to hide as much or as little complexity as you
  want. It's tricky to strike a fine balance between explicitly declaring
  (and duplicating) relevant fields in the local file vs. hiding
  convenient templating language in common files. For examples, see
  defaults.cue which automatically derives connectivity check destinations
  based on object name declarations matching regexes of "pod-to-X", and
  applies affinity/anti-affinity via matches on "intra-host" or
  "multi-host".
* All declarations are additive, i.e. there is no ordering based upon the
  layout in code; instead, data dependencies are determined using the
  declarations, and all data is arranged into a lattice to determine the
  evaluation ordering[0]. This can be counter-intuitive to reason about
  for the uninitiated.

The general approach used in this commit was to `cue import` various
existing YAML files to generate JSON equivalents, then iteratively
combine & consolidate the existing definitions using the language
constructs provided by CUE. CUE also provides mechanisms to generate
schemas and autogenerate the structures used here directly from API
definitions (eg from k8s source or the Cilium tree), however this area
was not explored in this PR yet. While this doesn't take advantage of
one major aspect of the language, upcoming commits will demonstrate the
way that these changes were validated without the use of standardized
schemas from the underlying Kubernetes resource definitions (TL;DR:
`kubectl diff ...` with kubectl validation on a live cluster). This was
sufficient to extend the connectivity checks and does not preclude
future exploration of the use of schemas for these definitions.

This commit introduces usage of CUE in a relatively minimal way into the
tree which was useful for my goals of extending the connectivity checks.
If we find that it is useful and powerful, we may consider whether to
extend its usage to other areas of the code (such as for test manifest
generation).

[0] https://cuelang.org/docs/concepts/logic/#the-value-lattice

Signed-off-by: Joe Stringer <joe@cilium.io>

* connectivity-check: Add cue CLI tools

Add some basic tooling around connectivity-check YAML generation:

  $ cue cmd help
  List connectivity-check resources specified in this directory

  Usage:
    cue [-t component=<component>] [-t name=<name>] [-t topology=<topology>] <command>

  Available Commands:
    dump   Generate connectivity-check YAMLs from the cuelang scripts
    ls     List connectivity-check resources specified in this directory

List available connectivity-check components:

  $ cue cmd ls
  KIND                  COMPONENT        TOPOLOGY     NAME
  Service               network-check    any          echo-a
  Service               services-check   any          echo-b
  Service               services-check   any          echo-b-headless
  Service               services-check   any          echo-b-host-headless
  Deployment            network-check    any          echo-a
  Deployment            services-check   any          echo-b
  Deployment            services-check   any          echo-b-host
  Deployment            network-check    any          pod-to-a
  Deployment            network-check    any          pod-to-external-1111
  Deployment            policy-check     any          pod-to-a-allowed-cnp
  Deployment            policy-check     any          pod-to-a-denied-cnp
  Deployment            policy-check     any          pod-to-external-fqdn-allow-google-cnp
  Deployment            services-check   multi-node   pod-to-b-multi-node-clusterip
  Deployment            services-check   multi-node   pod-to-b-multi-node-headless
  Deployment            services-check   intra-node   pod-to-b-intra-node-clusterip
  Deployment            services-check   intra-node   pod-to-b-intra-node-headless
  Deployment            services-check   multi-node   host-to-b-multi-node-clusterip
  Deployment            services-check   multi-node   host-to-b-multi-node-headless
  CiliumNetworkPolicy   policy-check     any          pod-to-a-allowed-cnp
  CiliumNetworkPolicy   policy-check     any          pod-to-a-denied-cnp
  CiliumNetworkPolicy   policy-check     any          pod-to-external-fqdn-allow-google-cnp

These can be filtered by component, topology or name. For example:

  $ cue cmd -t component=network ls
  KIND         COMPONENT       TOPOLOGY    NAME
  Service      network-check   any         echo-a
  Deployment   network-check   any         echo-a
  Deployment   network-check   any         pod-to-a
  Deployment   network-check   any         pod-to-external-1111

Finally, to gather the (filtered) YAMLs for the specified resources:

  $ cue cmd dump | head -n 20
  metadata:
    name: echo-a
    labels:
      name: echo-a
      topology: any
      component: network-check
  spec:
    ports:
      - port: 80
    selector:
      name: echo-a
    type: ClusterIP
  apiVersion: v1
  kind: Service
  ---
  ...

Or with an upcoming commit you can just use the Makefile, which now depends
on the cuelang/cue:v0.2.2 Docker image:

  $ make connectivity-check.yaml
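
Presumably the Makefile wraps an invocation roughly like the following (a sketch; it assumes the image's entrypoint is the cue binary and the mount path is illustrative):

  docker run --rm -v "$(pwd)":/cue -w /cue \
    cuelang/cue:v0.2.2 cmd dump > connectivity-check.yaml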

Signed-off-by: Joe Stringer <joe@cilium.io>

* connectivity-check: Support generating YAMLs via cue

Replace the existing YAML generation from individual YAML declarations
for each service with generating YAMLs from the CUE definitions.

Three new targets will assist in validating the migration from the
existing definitions over to CUE:

 * make generate_all
   * For each object declared in CUE, generate a file corresponding to
     that definition. For most of the existing YAMLs, this will overwrite
     the copy of the YAML in the tree. This allows manual inspection of
     individual YAMLs, though the 'inspect' approach is broadly more
     useful for evaluating the overall diff.
 * make deploy
   * Deploy the hostport connectivity checks YAML into an existing cluster.
 * make inspect
   * Generate the YAML file for all connectivity checks, then use
     kubectl to diff these newly generated definitions against the
     running cluster (assuming it was deployed via make deploy).

This commit contains only the Makefile changes, for easier review and
inspection. Upcoming commits will use these targets to demonstrate that
there is no meaningful change in the generated YAMLs for existing YAMLs
in the tree. In particular, `make inspect` can be used in an iterative
manner by initially deploying the current version of the YAMLs from the
tree, then making changes to the CUE files and inspecting each time a
change is made. When the diff in the cluster represents the changes that
the developer intends to make, the developer can commit the changes to
the CUE files and re-generate the tree versions of the YAMLs.
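
A minimal sketch of that iterative loop (target names from this commit; the .cue file name is illustrative):

  make deploy                    # apply the current tree YAMLs to a cluster
  $EDITOR echo-servers.cue       # edit the CUE definitions
  make inspect                   # kubectl-diff regenerated YAMLs vs. the cluster
  make connectivity-check.yaml   # regenerate the tree YAML once the diff looks right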

Signed-off-by: Joe Stringer <joe@cilium.io>

* connectivity-check: Replace YAMLs with cue-generated YAMLs

Prior commits introduced CUE definitions that are equivalent to these
YAML files, so we can now:

 * Remove the individual declarations which were previously
   source-of-truth for the connectivity checks
 * Update the overall connectivity-check YAMLs to reflect the minor
   changes that the CUE definitions represent.

To validate this, heavy use of `make inspect` was used. As described in
the prior commit message where this was introduced, this allows diffing
the latest CUE-based YAML definitions against a running copy of the
YAMLs in a cluster.

There are a few meaningful changes in this commit which are hard to assess
directly from the git diff, but are easier to see using `make inspect`:

 * All containers are converted to use readinessProbe and not
   livenessProbe.
 * All readiness probes now specify --connect-timeout of 5s.
 * Readiness probes access `/public` or `/private` per the underlying
   container HTTP server paths rather than just accessing `/`.
 * DNS allow policies are converted to consistently allow both TCP and
   UDP-based DNS.
 * Container names are derived from pod names.
 * The new YAMLs declare additional labels for all resources, such as
   'component' and 'topology'.

Signed-off-by: Joe Stringer <joe@cilium.io>

* connectivity-check: Introduce proxy checks

These new checks configure various L7 proxy paths to validate
connectivity via L7 proxies, in the following dimensions:

- Apply policy on egress; ingress; or both (proxy-to-proxy)
- Intra-node / Multi-node
- Allow / Deny

Note that proxy-to-proxy always configures egress allow policy to ensure
that the traffic goes via the proxy and in the drop case the requests
are only rejected at the destination. This is because applying egress
deny at the source would prevent proxy-to-proxy connectivity, meaning
the test would be equivalent to the egress-only reject policy case. This
way, we ensure that the path via the egress proxy to the destination is
tested in the reject case.

These are implemented partially through a new 'echo-c' pod which always
has ingress policy applied to allow GET requests to '/public'. Depending
on whether ingress policy is needed to check the particular permutation,
the new checks may connect to 'echo-a' or 'echo-c'.

These are implemented by adding pods for each permutation of policy
apply point and topology; then by adding allow / deny containers within
that pod to test the allow/deny cases.

The 'connectivity-check-proxy.yaml' includes all of the above.

Finally, the omissions: This commit does not attempt to address
variations in datapath configuration. This includes IPv4 vs. IPv6;
tunnel/direct-routing; endpoint config; kube proxy/free; encryption.
These are left up to the cluster operator configuring Cilium in specific
modes and subsequently deploying these YAMLs.

Signed-off-by: Joe Stringer <joe@cilium.io>

* connectivity-check: Minor naming fixups

Make some of these resource names a bit more consistent.

Signed-off-by: Joe Stringer <joe@cilium.io>

* connectivity-check: Add quarantine label to metadata

This new label will be used during YAML generation to ensure that
resources which we are still working on fixes for are kept in a separate
category apart from the regular connectivity checks, to allow us to
check them in & distribute them without causing CI to instantly fail.

Signed-off-by: Joe Stringer <joe@cilium.io>

* connectivity-check: Add hostport + proxy checks

Introduces checks for egress proxy policy when accessing a hostport on a
remote node. These are added as part of the component=hostport-check to
ensure they are not pulled in when running connectivity checks in
environments without hostport support.

Additionally, these new tests are quarantined for now as they are known
to fail in some environments.

Signed-off-by: Joe Stringer <joe@cilium.io>

* connectivity-check: Expand readme for latest checks

Signed-off-by: Joe Stringer <joe@cilium.io>

* connectivity-check: Re-add liveness probes

It appears that some of these checks require liveness probes rather than
readiness probes to pass on the GitHub Actions smoke-test, so ensure all
containers are checked with both.

Signed-off-by: Joe Stringer <joe@cilium.io>

* smoke-test: Improve state gathering upon failure

Commit bb91571 ("smoke-test: Print pod/deploy state on failure")
attempted to improve the information available during a failure from the
smoke-tests, but only added it to the quick-install test and not the
conformance test. Add the same output also to the conformance test so we
can more easily debug failures there.

Signed-off-by: Joe Stringer <joe@cilium.io>

* smoke-test: Disable bpf masquerading

In the smoke test, we are relying on kube-proxy for service connectivity
so it doesn't make sense to enable BPF masquerading. In fact, this
causes issues for connectivity from a node to a pod on a remote node via
ClusterIP (see related issue).

For the moment, disable BPF masquerading while we figure out the
longer-term solution to that issue.

Related: #12699

Signed-off-by: Joe Stringer <joe@cilium.io>

* docs: Update connectivity-check examples

Signed-off-by: Joe Stringer <joe@cilium.io>
@tgraf removed the kind/question label on Aug 26, 2020
brb added a commit that referenced this issue Aug 26, 2020
Disable BPF-masq when deploying in KIND until
#12699 has been fixed.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
@networkop
Contributor

I've come across a similar-looking issue on a self-managed cluster. The cluster is built (using kubespray) on top of Azure but has the cloud provider disabled. The API-server pods run as static pods in the host OS namespace on 3 controller nodes.
Every time a pod (e.g. the ingress controller) wants to communicate with the API server, it sends a packet to the service's ClusterIP. The packets get picked up by Cilium and DNAT'ed to one of the node IPs:

16   10.0.128.1:443         ClusterIP      1 => 10.0.255.10:6443
                                           2 => 10.0.255.6:6443
                                           3 => 10.0.255.7:6443
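
For reference, a table like the one above can be dumped from a Cilium agent pod (a sketch, assuming the standard k8s-app=cilium label):

  CILIUM_POD=$(kubectl -n kube-system get pods -l k8s-app=cilium -o jsonpath='{.items[0].metadata.name}')
  kubectl -n kube-system exec "$CILIUM_POD" -- cilium service list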

However, when the packet hits the node's stack, its source IP is still the IP of the pod (10.0.7.117):

root@node-1:~# tcpdump -i any host 10.0.7.117
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
13:49:27.862772 IP 10.0.7.117.34906 > 10.0.255.6.6443: Flags [S], seq 2746917916, win 42300, options [mss 1410,sackOK,TS val 3319336610 ecr 0,nop,wscale 8], length 0
13:49:28.891100 IP 10.0.7.117.34906 > 10.0.255.6.6443: Flags [S], seq 2746917916, win 42300, options [mss 1410,sackOK,TS val 3319337638 ecr 0,nop,wscale 8], length 0
13:49:30.906961 IP 10.0.7.117.34906 > 10.0.255.6.6443: Flags [S], seq 2746917916, win 42300, options [mss 1410,sackOK,TS val 3319339654 ecr 0,nop,wscale 8], length 0
13:49:35.163173 IP 10.0.7.117.34906 > 10.0.255.6.6443: Flags [S], seq 2746917916, win 42300, options [mss 1410,sackOK,TS val 3319343910 ecr 0,nop,wscale 8], length 0

So it leaves the VM NIC with its original IP, which is not known to the Azure SDN, and gets dropped.

Here's our cilium-config ConfigMap (sorry, I couldn't find the Helm values):

apiVersion: v1
kind: ConfigMap
metadata:
  name: cilium-config
  namespace: kube-system
data:
  # Identity allocation mode selects how identities are shared between cilium
  # nodes by setting how they are stored. The options are "crd" or "kvstore".
  # - "crd" stores identities in kubernetes as CRDs (custom resource definition).
  #   These can be queried with:
  #     kubectl get ciliumid
  # - "kvstore" stores identities in a kvstore, etcd or consul, that is
  #   configured below. Cilium versions before 1.6 supported only the kvstore
  #   backend. Upgrades from these older cilium versions should continue using
  #   the kvstore by commenting out the identity-allocation-mode below, or
  #   setting it to "kvstore".
  identity-allocation-mode: crd

  # If you want to run cilium in debug mode change this value to true
  debug: "false"

  # Enable IPv4 addressing. If enabled, all endpoints are allocated an IPv4
  # address.
  enable-ipv4: "true"

  # Enable IPv6 addressing. If enabled, all endpoints are allocated an IPv6
  # address.
  enable-ipv6: "false"
  enable-bpf-clock-probe: "true"

  # If you want cilium monitor to aggregate tracing for packets, set this level
  # to "low", "medium", or "maximum". The higher the level, the fewer packets
  # will be seen in monitor output.
  monitor-aggregation: medium

  # The monitor aggregation interval governs the typical time between monitor
  # notification events for each allowed connection.
  #
  # Only effective when monitor aggregation is set to "medium" or higher.
  monitor-aggregation-interval: 5s

  # The monitor aggregation flags determine which TCP flags, upon the
  # first observation, cause monitor notifications to be generated.
  #
  # Only effective when monitor aggregation is set to "medium" or higher.
  monitor-aggregation-flags: all
  # bpf-policy-map-max specifies the maximum number of entries in the endpoint
  # policy map (per endpoint)
  bpf-policy-map-max: "16384"
  # Specifies the ratio (0.0-1.0) of total system memory to use for dynamic
  # sizing of the TCP CT, non-TCP CT, NAT and policy BPF maps.
  bpf-map-dynamic-size-ratio: "0.0025"

  # Pre-allocation of map entries allows per-packet latency to be reduced, at
  # the expense of up-front memory allocation for the entries in the maps. The
  # default value below will minimize memory usage in the default installation;
  # users who are sensitive to latency may consider setting this to "true".
  #
  # This option was introduced in Cilium 1.4. Cilium 1.3 and earlier ignore
  # this option and behave as though it is set to "true".
  #
  # If this value is modified, then during the next Cilium startup the restore
  # of existing endpoints and tracking of ongoing connections may be disrupted.
  # This may lead to policy drops or a change in loadbalancing decisions for a
  # connection for some time. Endpoints may need to be recreated to restore
  # connectivity.
  #
  # If this option is set to "false" during an upgrade from 1.3 or earlier to
  # 1.4 or later, then it may cause one-time disruptions during the upgrade.
  preallocate-bpf-maps: "false"

  # Regular expression matching compatible Istio sidecar istio-proxy
  # container image names
  sidecar-istio-proxy-image: "cilium/istio_proxy"

  # Encapsulation mode for communication between nodes
  # Possible values:
  #   - disabled
  #   - vxlan (default)
  #   - geneve
  tunnel: vxlan

  # Name of the cluster. Only relevant when building a mesh of clusters.
  cluster-name: default

  # wait-bpf-mount makes init container wait until bpf filesystem is mounted
  wait-bpf-mount: "false"

  masquerade: "true"
  # (Michael) workaround for https://github.com/cilium/cilium/issues/12699
  enable-bpf-masquerade: "false"
  enable-xt-socket-fallback: "true"
  install-iptables-rules: "true"
  auto-direct-node-routes: "false"
  kube-proxy-replacement:  "strict"
  enable-health-check-nodeport: "true"
  node-port-bind-protection: "true"
  enable-auto-protect-node-port-range: "true"
  enable-session-affinity: "true"
  enable-endpoint-health-checking: "true"
  enable-well-known-identities: "false"
  enable-remote-node-identity: "true"
  operator-api-serve-addr: "127.0.0.1:9234"
  # Enable Hubble gRPC service.
  enable-hubble: "true"
  # UNIX domain socket for Hubble server to listen to.
  hubble-socket-path:  "/var/run/cilium/hubble.sock"
  # An additional address for Hubble server to listen to (e.g. ":4244").
  hubble-listen-address: ""
  ipam: "kubernetes"
  disable-cnp-status-updates: "true"
  
  # (Michael) workaround for https://github.com/cilium/cilium/issues/10627
  blacklist-conflicting-routes: "false"

borkmann pushed a commit that referenced this issue Aug 26, 2020
Disable BPF-masq when deploying in KIND until
#12699 has been fixed.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
@brb
Member

brb commented Aug 26, 2020

@networkop Thanks for the report. It looks like your issue is unrelated to the one discussed here, as in your case you have enable-host-services=true because of kube-proxy-replacement=strict. So I'd suggest opening a separate issue.

However, when the packet hits the node's stack, its source IP is still the IP of the pod (10.0.7.117)

The BPF-based masquerading happens on TC egress of the native device, so it's expected that in tcpdump output you will see the pod IP addr. If you run tcpdump -i any -n 'host 10.0.255.10 or host 10.0.255.6 or 10.0.255.7', you should be able to see two traces per request - one with the pod IP addr, and one with the native device IP addr (after SNAT).

kaworu pushed a commit that referenced this issue Aug 27, 2020
[ upstream commit 7403251 ]

Disable BPF-masq when deploying in KIND until
#12699 has been fixed.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: Alexandre Perrin <alex@kaworu.ch>
kaworu pushed a commit that referenced this issue Aug 28, 2020
[ upstream commit 8a03d54 ]

(Commit message identical to the connectivity-check series quoted in full above.)

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Alexandre Perrin <alex@kaworu.ch>
@brb
Member

brb commented Aug 28, 2020

I managed to reproduce the original issue in a non-KIND environment. As a temporary mitigation, I'm going to create a PR to disable BPF-based masquerading when the host-reachable services feature is disabled, including full details of why it's currently broken.
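
To check whether a running agent has this combination (BPF masquerading enabled, host-reachable services disabled), something like the following should work (a sketch; the exact status labels vary between versions):

  CILIUM_POD=$(kubectl -n kube-system get pods -l k8s-app=cilium -o jsonpath='{.items[0].metadata.name}')
  kubectl -n kube-system exec "$CILIUM_POD" -- cilium status --verbose | grep -iE 'masquerad|host.*service'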

kkourt pushed a commit that referenced this issue Sep 2, 2020
[ upstream commit 7403251 ]

Disable BPF-masq when deploying in KIND until
#12699 has been fixed.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: Alexandre Perrin <alex@kaworu.ch>
christarazi pushed a commit that referenced this issue Sep 3, 2020
[ upstream commit 8a03d54 ]

* connectivity-check: Add 'make clean' support

Factor out the targets for all YAMLs so it can be reused by a new phony
target, 'clean'.

Signed-off-by: Joe Stringer <joe@cilium.io>

* connectivity-check: Introduce cuelang framework

CUE (https://cuelang.org/) is a data constraint language built defined
as a superset of JSON which aims to "simplify tasks involving defining
and using data".

In context of the connectivity check YAMLs, CUE is useful to allow us to
"evaporate" the boilerplate necessary to define Kubernetes YAMLs for
Deployments, Services and CiliumNetworkPolicies and allow developers to
specify various permutations for connectivity checks concisely.

Why should we use it?

* It's more concise: One template definition, multiple reuse. This is
  useful for introducing new connectivity checks as upcoming commits
  will demonstrate as the developer doesn't need to perform the tedious
  and error-prone process of copying and modifying the YAMLs to
  implement various permutations of a check. Furthermore this helps
  reviewers as they will not have to read through swathes of YAMLs but
  can instead focus on the diffs in the templating that are introduced
  and compare to existing data definitions.

* Consolidate constant declaration. When a core change needs to be made
  to something like the readinessProbe for probes that expect a
  success or failure, we can update one definition in the main CUE file
  and all YAMLs will subsequently be generated with this change in mind.
  During the process of preparing these changes, I noticed
  inconsistencies between various existing YAMLs which appear to just be
  unintentional, where some YAMLs were improved with better timeoute
  behaviour or error rendering, but other YAMLs were left out.

* The data is more structured. Upcoming commits will introduce simple
  CLI tools that allow matching on different classes of connectivity
  checks to generate the corresponding YAMLs. Previously we have
  depended upon file naming schemes and Makefile globbing magic to
  implement this which quickly reaches a limit in which checks should be
  selected for a specific check.

What are the dangers?

* It's relatively immature. At current version v0.2.2 it is subject to
  language changes. Upcoming commits will pin the CLI tool usage to a
  docker container derived from this version to ensure compatibility.
* One more language in the tree to understand, review and interact with.
  Mitigating circumstances: This language comes out of the Golang
  community and as such brings some commonalities; furthermore it is
  beginning to be used in other Kubernetes projects, so there is some
  broader community alignment.
* Its power allows you to hide as much or as little complexity as you
  want. It's tricky to strike a fine balance between explicitly declaring
  (and duplicating) relevant fields in the local file vs. hiding
  convenient templating language in common files. For examples, see
  defaults.cue which automatically derives connectivity check destinations
  based on object name declarations matching regexes of "pod-to-X", and
  applies affinity/anti-affinity via matches on "intra-host" or
  "multi-host".
* All declarations are additive, ie there is no ordering based upon the
  layout in code; instead, data dependencies are determined using the
  declarations, and all data is arranged into a lattice to determine the
  evaluation ordering[0]. This can be counter-intuitive to reason about
  for the uninitiated.

The general approach used in this commit was to `cue import` various
existing YAML files to generate JSON equivalents, then iteratively
combining & consolidating existing definitions using the language
constructs provided by CUE. CUE also provides mechanisms to generate
schemas and autogenerate the structures used here directly from API
definitions (eg from k8s source or Cilium tree), however this area was
not explored in this PR yet. While this doesn't take advantage of one
major aspect of the language, upcoming commits will demonstrate the way
that these changes were validated without the use of standardized
schemas from the underlying Kubernetes resource definitions. (TL;DR:
`kubectl diff ...` with kubectl validation on a live cluster). This was
sufficient to extend the connectivity checks and does not preclude
future explanation of the use of schemas for these definitions.

This commit introduces usage of CUE in a relatively minimal way into the
tree which was useful for my goals of extending the connectivity checks.
If we find that it is useful and powerful, we may consider whether to
extend its usage to other areas of the code (such as for test manifest
generation).

[0] https://cuelang.org/docs/concepts/logic/#the-value-lattice

Signed-off-by: Joe Stringer <joe@cilium.io>

* connectivity-check: Add cue CLI tools

Add some basic tooling around connectivity-check YAML generation:

  $ cue cmd help
  List connectivity-check resources specified in this directory

  Usage:
    cue [-t component=<component>] [-t name=<name>] [-t topology=<topology>] <command>

  Available Commands:
    dump   Generate connectivity-check YAMLs from the cuelang scripts
    ls     List connectivity-check resources specified in this directory

List available connectivity-check components:

  $ cue cmd ls
  KIND                  COMPONENT        TOPOLOGY     NAME
  Service               network-check    any          echo-a
  Service               services-check   any          echo-b
  Service               services-check   any          echo-b-headless
  Service               services-check   any          echo-b-host-headless
  Deployment            network-check    any          echo-a
  Deployment            services-check   any          echo-b
  Deployment            services-check   any          echo-b-host
  Deployment            network-check    any          pod-to-a
  Deployment            network-check    any          pod-to-external-1111
  Deployment            policy-check     any          pod-to-a-allowed-cnp
  Deployment            policy-check     any          pod-to-a-denied-cnp
  Deployment            policy-check     any          pod-to-external-fqdn-allow-google-cnp
  Deployment            services-check   multi-node   pod-to-b-multi-node-clusterip
  Deployment            services-check   multi-node   pod-to-b-multi-node-headless
  Deployment            services-check   intra-node   pod-to-b-intra-node-clusterip
  Deployment            services-check   intra-node   pod-to-b-intra-node-headless
  Deployment            services-check   multi-node   host-to-b-multi-node-clusterip
  Deployment            services-check   multi-node   host-to-b-multi-node-headless
  CiliumNetworkPolicy   policy-check     any          pod-to-a-allowed-cnp
  CiliumNetworkPolicy   policy-check     any          pod-to-a-denied-cnp
  CiliumNetworkPolicy   policy-check     any          pod-to-external-fqdn-allow-google-cnp

These can be filtered by component, topology or name. For example:

  $ cue cmd -t component=network ls
  KIND         COMPONENT       TOPOLOGY    NAME
  Service      network-check   any         echo-a
  Deployment   network-check   any         echo-a
  Deployment   network-check   any         pod-to-a
  Deployment   network-check   any         pod-to-external-1111

Finally, to gather the (filtered) YAMLs for the specified resources:

  $ cue cmd dump | head -n 20
  metadata:
    name: echo-a
    labels:
      name: echo-a
      topology: any
      component: network-check
  spec:
    ports:
      - port: 80
    selector:
      name: echo-a
    type: ClusterIP
  apiVersion: v1
  kind: Service
  ---
  ...

Or with an upcoming commit you can just use the Makefile, which now depends
on the cuelang/cue:v0.2.2 Docker image:

  $ make connectivity-check.yaml

Signed-off-by: Joe Stringer <joe@cilium.io>

* connectivity-check: Support generating YAMLs via cue

Replace the existing YAML generation from individual YAML declarations
for each service with generating YAMLs from the CUE definitions.

Three new targets will assist in validating the migration from the
existing definitions over to CUE:

 * make generate_all
   * For object declared in CUE, generate a file corresponding to that
     definition. For most of the existing YAMLs, this will overwrite the
     copy of the YAML in the tree. This can allow manual inspection of
     individual YAMLs, though the 'inspect' approach is broadly more
     useful for evaluating the overall diff.
 * make deploy
   * Deploy the hostport connectivity checks YAML into an existing cluster.
 * make inspect
   * Generate the YAML file for all connectivity checks, then use
     kubectl to diff these newly generated definitions against the
     running cluster (assuming it was deployed via make deploy).

This commit is purely the makefile changes for easier review &
inspection. Upcoming commits will use these targets to demonstrate that
there is no meaningful change in the generated YAMLs for existing YAMLs
in the tree. In particular, `make inspect` can be used in an iterative
manner by initially deploying the current version of the YAMLs from the
tree, then making changes to the CUE files and inspecting each time a
change is made. When the diff in the cluster represents the changes that
the developer intends to make, the developer can commit the changes to
the CUE files and re-generate the tree versions of the YAMLs.

Signed-off-by: Joe Stringer <joe@cilium.io>

* connectivity-check: Replace YAMLs with cue-generated YAMLs

Prior commits introduced CUE definitions that are equivalent to these
YAML files, so we can now:

 * Remove the individual declarations which were previously
   source-of-truth for the connectivity checks
 * Update the overall connectivity-check YAMLs to reflect the minor
   changes that the CUE definitions represent.

To validate this, heavy use of `make inspect` was used. As described in
the prior commit message where this was introduced, this allows diffing
the latest CUE-based YAML definitions against a running copy of the
YAMLs in a cluster.

There are few meaningful changes in this commit which are hard to assess
directly from the git diff, but are easier using `make inspect`:

 * All containers are converted to use readinessProbe and not
   livenessProbe.
 * All readiness probes now specify --connect-timeout of 5s.
 * Readiness probes access `/public` or `/private` per the underlying
   container HTTP server paths rather than just accessing `/`.
 * DNS allow policies are converted to consistently allow both TCP and
   UDP-based DNS.
 * Container names are derived from pod names.
 * The new YAMLs declare additional labels for all resourcess, such as
   'component' and 'topology'.

Signed-off-by: Joe Stringer <joe@cilium.io>

* connectivity-check: Introduce proxy checks

These new checks configure various L7 proxy paths to validate
connectivity via L7 proxies, in the following dimensions:

- Apply policy on egress; ingress; or both (proxy-to-proxy)
- Intra-node / Multi-node
- Allow / Deny

Note that proxy-to-proxy always configures egress allow policy to ensure
that the traffic goes via the proxy and in the drop case the requests
are only rejected at the destination. This is because applying egress
deny at the source would prevent proxy-to-proxy connectivity, meaning
the test would be equivalent to the egress-only reject policy case. This
way, we ensure that the path via the egress proxy to the destination is
tested in the reject case.

These are implemented partially through a new 'echo-c' pod which always
has ingress policy applied to allow GET requests to '/public'. Depending
on whether ingress policy is needed to check the particular permutation
the new checks may connect to 'echo-a' or 'echo-c'.

These are implemented by adding pods for each permutation of policy
apply point and topology; then by adding allow / deny containers within
that pod to test the allow/deny cases.

The 'connectivity-check-proxy.yaml' includes all of the above.

Finally, the omissions: This commit does not attempt to address
variations in datapath configuration. This includes IPv4 vs. IPv6;
tunnel/direct-routing; endpoint config; kube proxy/free; encryption.
These are left up to the cluster operator configuring Cilium in specific
modes and subsequently deploying these YAMLs.

Signed-off-by: Joe Stringer <joe@cilium.io>

* connectivity-check: Minor naming fixups

Make some of these resource names a bit more consistent.

Signed-off-by: Joe Stringer <joe@cilium.io>

* connectivity-check: Add quarantine label to metadata

This new label will be used during YAML generation to ensure that
resources we are still working on fixing are kept in a separate
category from the regular connectivity checks, allowing us to check
them in and distribute them without causing CI to instantly fail.
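
In practice this is just an extra metadata label on the affected
resources, along the lines of the following (the resource name and exact
label value here are illustrative):

  metadata:
    labels:
      name: some-quarantined-check   # hypothetical resource name
      quarantine: "true"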

Signed-off-by: Joe Stringer <joe@cilium.io>

* connectivity-check: Add hostport + proxy checks

Introduces checks for egress proxy policy when accessing a hostport on a
remote node. These are added as part of the component=hostport-check to
ensure they are not pulled in when running connectivity checks in
environments without hostport support.
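
Using the tag-based filtering provided by the cue tooling in this
directory, that component can then be selected (or skipped) explicitly,
for example (the tag value and output file name here are illustrative):

  $ cue cmd -t component=hostport dump > connectivity-check-hostport.yaml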

Additionally, these new tests are quarantined for now as they are known
to fail in some environments.

Signed-off-by: Joe Stringer <joe@cilium.io>

* connectivity-check: Expand readme for latest checks

Signed-off-by: Joe Stringer <joe@cilium.io>

* connectivity-check: Re-add liveness probes

It appears that some of these checks require liveness probes rather than
readiness probes to pass in the GitHub Actions smoke-test, so ensure all
containers are checked with both.

Signed-off-by: Joe Stringer <joe@cilium.io>

* smoke-test: Improve state gathering upon failure

Commit bb91571 ("smoke-test: Print pod/deploy state on failure")
attempted to improve the information available during a failure from the
smoke-tests, but only added it to the quick-install test and not the
conformance test. Add the same output also to the conformance test so we
can more easily debug failures there.

Signed-off-by: Joe Stringer <joe@cilium.io>

* smoke-test: Disable bpf masquerading

In the smoke test, we are relying on kube-proxy for service connectivity
so it doesn't make sense to enable BPF masquerading. In fact, this
causes issues for connectivity from a node to a pod on a remote node via
ClusterIP (see related issue).

For the moment, disable BPF masquerading while we figure out the
longer-term solution to that issue.
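
For reference, the setting in question is the one named in this issue,
i.e. the equivalent of the following cilium-config ConfigMap entry (how
the smoke test actually flips it, e.g. via a Helm value, is not shown
here):

  enable-bpf-masquerade: "false"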

Related: #12699

Signed-off-by: Joe Stringer <joe@cilium.io>

* docs: Update connectivity-check examples

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Alexandre Perrin <alex@kaworu.ch>
joestringer added a commit that referenced this issue Sep 4, 2020
[ upstream commit 8a03d54 ]

* connectivity-check: Add 'make clean' support

Factor out the list of targets for all YAMLs so it can be reused by a
new phony target, 'clean'.

Signed-off-by: Joe Stringer <joe@cilium.io>

* connectivity-check: Introduce cuelang framework

CUE (https://cuelang.org/) is a data constraint language defined as a
superset of JSON which aims to "simplify tasks involving defining and
using data".

In context of the connectivity check YAMLs, CUE is useful to allow us to
"evaporate" the boilerplate necessary to define Kubernetes YAMLs for
Deployments, Services and CiliumNetworkPolicies and allow developers to
specify various permutations for connectivity checks concisely.

Why should we use it?

* It's more concise: one template definition, multiple uses. This is
  useful for introducing new connectivity checks, as upcoming commits
  will demonstrate, because the developer doesn't need to perform the
  tedious and error-prone process of copying and modifying the YAMLs to
  implement various permutations of a check. Furthermore, this helps
  reviewers, as they will not have to read through swathes of YAMLs but
  can instead focus on the diffs in the templating that are introduced
  and compare them to existing data definitions.

* Consolidate constant declarations. When a core change needs to be made
  to something like the readinessProbe for probes that expect a
  success or failure, we can update one definition in the main CUE file
  and all YAMLs will subsequently be generated with this change in mind.
  During the process of preparing these changes, I noticed
  inconsistencies between various existing YAMLs which appear to just be
  unintentional, where some YAMLs were improved with better timeout
  behaviour or error rendering, but other YAMLs were left out.

* The data is more structured. Upcoming commits will introduce simple
  CLI tools that allow matching on different classes of connectivity
  checks to generate the corresponding YAMLs. Previously we have
  depended upon file naming schemes and Makefile globbing magic to
  implement this, which quickly reaches its limits when deciding which
  checks should be selected for a specific case.

What are the dangers?

* It's relatively immature. At its current version (v0.2.2), it is
  subject to language changes. Upcoming commits will pin the CLI tool
  usage to a Docker container derived from this version to ensure
  compatibility.
* One more language in the tree to understand, review and interact with.
  Mitigating circumstances: This language comes out of the Golang
  community and as such brings some commonalities; furthermore it is
  beginning to be used in other Kubernetes projects, so there is some
  broader community alignment.
* Its power allows you to hide as much or as little complexity as you
  want. It's tricky to strike a fine balance between explicitly declaring
  (and duplicating) relevant fields in the local file vs. hiding
  convenient templating language in common files. For examples, see
  defaults.cue, which automatically derives connectivity check
  destinations based on object name declarations matching regexes of
  "pod-to-X", and applies affinity/anti-affinity via matches on
  "intra-host" or "multi-host" (a rough sketch follows this list).
* All declarations are additive, i.e. there is no ordering based upon the
  layout in code; instead, data dependencies are determined using the
  declarations, and all data is arranged into a lattice to determine the
  evaluation ordering[0]. This can be counter-intuitive to reason about
  for the uninitiated.
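
As a rough sketch of the kind of derivation mentioned above (the field
names and regexes here are illustrative, not the actual contents of
defaults.cue):

  _defaults: [ID=_]: {
      metadata: name: ID

      // Derive the probe destination from the "pod-to-X" naming scheme.
      if ID =~ "^pod-to-a" { _probeTarget: "echo-a" }
      if ID =~ "^pod-to-b" { _probeTarget: "echo-b" }

      // Derive pod (anti-)affinity from the topology encoded in the name.
      if ID =~ "intra-node" { _affinity:     "echo" }
      if ID =~ "multi-node" { _antiAffinity: "echo" }
  }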

The general approach used in this commit was to `cue import` various
existing YAML files to generate JSON equivalents, then iteratively
combine & consolidate the existing definitions using the language
constructs provided by CUE. CUE also provides mechanisms to generate
schemas and autogenerate the structures used here directly from API
definitions (e.g. from the k8s source or the Cilium tree), however this
area was not explored in this PR yet. While this doesn't take advantage
of one major aspect of the language, upcoming commits will demonstrate
the way that these changes were validated without the use of
standardized schemas from the underlying Kubernetes resource definitions
(TL;DR: `kubectl diff ...` with kubectl validation on a live cluster).
This was sufficient to extend the connectivity checks and does not
preclude future exploration of the use of schemas for these definitions.

This commit introduces usage of CUE in a relatively minimal way into the
tree which was useful for my goals of extending the connectivity checks.
If we find that it is useful and powerful, we may consider whether to
extend its usage to other areas of the code (such as for test manifest
generation).

[0] https://cuelang.org/docs/concepts/logic/#the-value-lattice

Signed-off-by: Joe Stringer <joe@cilium.io>

* connectivity-check: Add cue CLI tools

Add some basic tooling around connectivity-check YAML generation:

  $ cue cmd help
  List connectivity-check resources specified in this directory

  Usage:
    cue [-t component=<component>] [-t name=<name>] [-t topology=<topology>] <command>

  Available Commands:
    dump   Generate connectivity-check YAMLs from the cuelang scripts
    ls     List connectivity-check resources specified in this directory

List available connectivity-check components:

  $ cue cmd ls
  KIND                  COMPONENT        TOPOLOGY     NAME
  Service               network-check    any          echo-a
  Service               services-check   any          echo-b
  Service               services-check   any          echo-b-headless
  Service               services-check   any          echo-b-host-headless
  Deployment            network-check    any          echo-a
  Deployment            services-check   any          echo-b
  Deployment            services-check   any          echo-b-host
  Deployment            network-check    any          pod-to-a
  Deployment            network-check    any          pod-to-external-1111
  Deployment            policy-check     any          pod-to-a-allowed-cnp
  Deployment            policy-check     any          pod-to-a-denied-cnp
  Deployment            policy-check     any          pod-to-external-fqdn-allow-google-cnp
  Deployment            services-check   multi-node   pod-to-b-multi-node-clusterip
  Deployment            services-check   multi-node   pod-to-b-multi-node-headless
  Deployment            services-check   intra-node   pod-to-b-intra-node-clusterip
  Deployment            services-check   intra-node   pod-to-b-intra-node-headless
  Deployment            services-check   multi-node   host-to-b-multi-node-clusterip
  Deployment            services-check   multi-node   host-to-b-multi-node-headless
  CiliumNetworkPolicy   policy-check     any          pod-to-a-allowed-cnp
  CiliumNetworkPolicy   policy-check     any          pod-to-a-denied-cnp
  CiliumNetworkPolicy   policy-check     any          pod-to-external-fqdn-allow-google-cnp

These can be filtered by component, topology or name. For example:

  $ cue cmd -t component=network ls
  KIND         COMPONENT       TOPOLOGY    NAME
  Service      network-check   any         echo-a
  Deployment   network-check   any         echo-a
  Deployment   network-check   any         pod-to-a
  Deployment   network-check   any         pod-to-external-1111

Finally, to gather the (filtered) YAMLs for the specified resources:

  $ cue cmd dump | head -n 20
  metadata:
    name: echo-a
    labels:
      name: echo-a
      topology: any
      component: network-check
  spec:
    ports:
      - port: 80
    selector:
      name: echo-a
    type: ClusterIP
  apiVersion: v1
  kind: Service
  ---
  ...

Or with an upcoming commit you can just use the Makefile, which now depends
on the cuelang/cue:v0.2.2 Docker image:

  $ make connectivity-check.yaml

Signed-off-by: Joe Stringer <joe@cilium.io>

@claneys

claneys commented Oct 19, 2022

Hi,

It has been quite a long time since the last update on this issue, but I ran into the same problem, or pretty much the same. As Cilium's configuration has evolved, I assume the enable-bpf-masquerade flag is now enable-ipv4-masquerade and enable-ipv6-masquerade. My setup does not use kube-proxy and has masquerading enabled, and I hit the issue when, for example, I try to resolve a hostname from a Pod with hostNetwork: true.

Here are my pod YAML and Cilium ConfigMap:

apiVersion: v1
kind: Pod
metadata:
  name: dnsutils
  #  namespace: kube-system
spec:
  containers:
  - name: dnsutils
    image: k8s.gcr.io/e2e-test-images/jessie-dnsutils:1.3
    command:
      - sleep
      - "3600"
    imagePullPolicy: IfNotPresent
  restartPolicy: Always
  dnsConfig:
    searches:
      - operators.svc.cluster.local
      - svc.cluster.local
      - cluster.local
  nodeName: gilgamesh.uruk.home
  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet

apiVersion: v1
data:
  agent-not-ready-taint-key: node.cilium.io/agent-not-ready
  arping-refresh-period: 30s
  auto-direct-node-routes: "false"
  bpf-lb-external-clusterip: "false"
  bpf-lb-map-max: "65536"
  bpf-lb-sock: "false"
  bpf-map-dynamic-size-ratio: "0.0025"
  bpf-policy-map-max: "16384"
  bpf-root: /sys/fs/bpf
  cgroup-root: /run/cilium/cgroupv2
  cilium-endpoint-gc-interval: 5m0s
  cluster-id: "0"
  cluster-name: uruk-cluster
  cluster-pool-ipv4-cidr: 10.0.0.0/8
  cluster-pool-ipv4-mask-size: "24"
  custom-cni-conf: "false"
  debug: "false"
  disable-cnp-status-updates: "true"
  disable-endpoint-crd: "false"
  enable-auto-protect-node-port-range: "true"
  enable-bgp-control-plane: "false"
  enable-bpf-clock-probe: "true"
  enable-endpoint-health-checking: "true"
  enable-health-check-nodeport: "true"
  enable-health-checking: "true"
  enable-hubble: "true"
  enable-ipv4: "true"
  enable-ipv4-masquerade: "true"
  enable-ipv6: "false"
  enable-ipv6-masquerade: "true"
  enable-k8s-terminating-endpoint: "true"
  enable-l2-neigh-discovery: "true"
  enable-l7-proxy: "true"
  enable-local-redirect-policy: "false"
  enable-policy: default
  enable-remote-node-identity: "true"
  enable-svc-source-range-check: "true"
  enable-vtep: "false"
  enable-well-known-identities: "false"
  enable-xt-socket-fallback: "true"
  hubble-disable-tls: "false"
  hubble-listen-address: :4244
  hubble-socket-path: /var/run/cilium/hubble.sock
  hubble-tls-cert-file: /var/lib/cilium/tls/hubble/server.crt
  hubble-tls-client-ca-files: /var/lib/cilium/tls/hubble/client-ca.crt
  hubble-tls-key-file: /var/lib/cilium/tls/hubble/server.key
  identity-allocation-mode: crd
  install-iptables-rules: "true"
  install-no-conntrack-iptables-rules: "false"
  ipam: cluster-pool
  kube-proxy-replacement: strict
  kube-proxy-replacement-healthz-bind-address: ""
  monitor-aggregation: medium
  monitor-aggregation-flags: all
  monitor-aggregation-interval: 5s
  node-port-bind-protection: "true"
  nodes-gc-interval: 5m0s
  operator-api-serve-addr: 127.0.0.1:9234
  preallocate-bpf-maps: "false"
  remove-cilium-node-taints: "true"
  set-cilium-is-up-condition: "true"
  sidecar-istio-proxy-image: cilium/istio_proxy
  synchronize-k8s-nodes: "true"
  tofqdns-dns-reject-response-code: refused
  tofqdns-enable-dns-compression: "true"
  tofqdns-endpoint-max-ip-per-hostname: "50"
  tofqdns-idle-connection-grace-period: 0s
  tofqdns-max-deferred-connection-deletes: "10000"
  tofqdns-min-ttl: "3600"
  tofqdns-proxy-response-max-delay: 100ms
  tunnel: vxlan
  unmanaged-pod-watcher-interval: "15"
  vtep-cidr: ""
  vtep-endpoint: ""
  vtep-mac: ""
  vtep-mask: ""
kind: ConfigMap

My iptables rules, in case it matters:

Chain INPUT (policy ACCEPT)
target     prot opt source               destination         
KUBE-FIREWALL  all  --  0.0.0.0/0            0.0.0.0/0           

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination         

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination         
KUBE-FIREWALL  all  --  0.0.0.0/0            0.0.0.0/0           

Chain KUBE-FIREWALL (2 references)
target     prot opt source               destination         
DROP       all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes firewall for dropping marked packets */ mark match 0x8000/0x8000
DROP       all  -- !127.0.0.0/8          127.0.0.0/8          /* block incoming localnet connections */ ! ctstate RELATED,ESTABLISHED,DNAT

Chain KUBE-KUBELET-CANARY (0 references)
target     prot opt source     

@brb
Member

brb commented Oct 25, 2022

@claneys I think it's a different issue, as in your case you have enabled the socket-LB (previously named "host-services") with kube-proxy-replacement: strict.

@julianwiedmann julianwiedmann added area/kube-proxy Issues related to kube-proxy (not the kube-proxy-free mode). area/loadbalancing Impacts load-balancing and Kubernetes service implementations feature/snat Relates to SNAT or Masquerading of traffic labels Sep 7, 2023