CI: K8sEgressGatewayTest tunnel disabled * both egress gw and basic connectivity work #18012

Closed
brb opened this issue Nov 25, 2021 · 4 comments
Labels
area/CI Continuous Integration testing issue or flake ci/flake This is a known failure that occurs in the tree. Please investigate me! feature/egress-gateway Impacts the egress IP gateway feature. kind/bug This is a bug in the Cilium logic. sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale.

brb commented Nov 25, 2021

/home/jenkins/workspace/Cilium-PR-K8s-1.16-net-next/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:527
Expected command: kubectl exec -n kube-system log-gatherer-fxbn5 -- curl --path-as-is -s -D /dev/stderr --fail --connect-timeout 5 --max-time 20 http://10.0.0.183:80 -w "time-> DNS: '%{time_namelookup}(%{remote_ip})', Connect: '%{time_connect}',Transfer '%{time_starttransfer}', total '%{time_total}'" 
To succeed, but it failed:
Exitcode: 28 
Err: exit status 28
Stdout:
 	 time-> DNS: '0.000031()', Connect: '0.000000',Transfer '0.000000', total '5.001200'
Stderr:
 	 command terminated with exit code 28
	 

/home/jenkins/workspace/Cilium-PR-K8s-1.16-net-next/src/github.com/cilium/cilium/test/k8sT/Egress.go:208

https://gofile.io/d/XwghzF
https://jenkins.cilium.io/job/Cilium-PR-K8s-1.16-net-next/2046/testReport/junit/Suite-k8s-1/16/K8sEgressGatewayTest_tunnel_disabled_with_endpointRoutes_enabled_egress_gw_policy_both_egress_gw_and_basic_connectivity_work/

After reading the code, I found that the following call was the one failing (TODO: either fix the function call offset or add By(); otherwise it's difficult to determine exactly which assertion has failed):

res = kubectl.ExecInHostNetNS(context.TODO(), outsideNodeName,
This is the check that verifies whether sending a request from the outside host (which is the dst in the egress gw policy) to a pod works, i.e., whether the request doesn't get SNAT-ed.
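To illustrate why the By() TODO matters, here is a minimal stdlib-only sketch. `stepRunner`, `by` and `expect` are hypothetical stand-ins (not Ginkgo's API and not the actual test code), and the step descriptions are made up; the point is only that a failure message naming the current step removes the guesswork about which assertion failed.

```go
package main

import "fmt"

// stepRunner is a hypothetical stand-in for Ginkgo's By(): it remembers the
// description of the step being executed so that a failure can name it.
type stepRunner struct{ current string }

// by records the description of the step that is about to run.
func (s *stepRunner) by(desc string) { s.current = desc }

// expect returns an error naming the current step when ok is false.
func (s *stepRunner) expect(ok bool, msg string) error {
	if ok {
		return nil
	}
	return fmt.Errorf("step %q failed: %s", s.current, msg)
}

func main() {
	s := &stepRunner{}

	// Illustrative step names only; they are not taken from Egress.go.
	s.by("request from pod to outside host")
	if err := s.expect(true, "unused"); err != nil {
		fmt.Println(err)
	}

	s.by("request from outside host to pod")
	if err := s.expect(false, "curl exit code 28"); err != nil {
		fmt.Println(err)
	}
}
```

With something along these lines, the report would point at the step description rather than only at scopes.go:527 and Egress.go:208.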

Digging into the Hubble logs of the Cilium agent running on k8s1, we can find the flow that failed:

{
  "flow": {
    "time": "2021-11-24T13:37:52.324267368Z",
    "verdict": "FORWARDED",
    "ethernet": {
      "source": "92:cf:43:95:57:7a",
      "destination": "1e:22:d0:61:1d:dc"
    },
    "IP": {
      "source": "192.168.56.13",
      "destination": "10.0.0.183",
      "ipVersion": "IPv4"
    },
    "l4": {
      "TCP": {
        "source_port": 38580,
        "destination_port": 80,
        "flags": {
          "SYN": true
        }
      }
    },
    "source": {
      "identity": 2,
      "labels": [
        "reserved:world"
      ]
    },
    "destination": {
      "ID": 212,
      "identity": 37363,
      "namespace": "202111241337k8segressgatewaytesttunneldisabledwithendpointroute",
      "labels": [
        "k8s:io.cilium.k8s.namespace.labels.ns=cilium-test",
        "k8s:io.cilium.k8s.policy.cluster=default",
        "k8s:io.cilium.k8s.policy.serviceaccount=default",
        "k8s:io.kubernetes.pod.namespace=202111241337k8segressgatewaytesttunneldisabledwithendpointroute",
        "k8s:zgroup=testDS"
      ],
      "pod_name": "testds-q42gz"
    },
    "Type": "L3_L4",
    "node_name": "k8s1",
    "event_type": {
      "type": 4
    },
    "traffic_direction": "INGRESS",
    "trace_observation_point": "TO_ENDPOINT",
    "is_reply": false,
    "Summary": "TCP Flags: SYN"
  },
  "node_name": "k8s1",
  "time": "2021-11-24T13:37:52.324267368Z"
}
{
  "flow": {
    "time": "2021-11-24T13:37:52.324295596Z",
    "verdict": "FORWARDED",
    "ethernet": {
      "source": "1e:22:d0:61:1d:dc",
      "destination": "92:cf:43:95:57:7a"
    },
    "IP": {
      "source": "10.0.0.183",
      "destination": "192.168.56.13",
      "ipVersion": "IPv4"
    },
    "l4": {
      "TCP": {
        "source_port": 80,
        "destination_port": 38580,
        "flags": {
          "SYN": true,
          "ACK": true
        }
      }
    },
    "source": {
      "ID": 212,
      "identity": 37363,
      "namespace": "202111241337k8segressgatewaytesttunneldisabledwithendpointroute",
      "labels": [
        "k8s:io.cilium.k8s.namespace.labels.ns=cilium-test",
        "k8s:io.cilium.k8s.policy.cluster=default",
        "k8s:io.cilium.k8s.policy.serviceaccount=default",
        "k8s:io.kubernetes.pod.namespace=202111241337k8segressgatewaytesttunneldisabledwithendpointroute",
        "k8s:zgroup=testDS"
      ],
      "pod_name": "testds-q42gz"
    },
    "destination": {
      "identity": 2,
      "labels": [
        "reserved:world"
      ]
    },
    "Type": "L3_L4",
    "node_name": "k8s1",
    "event_type": {
      "type": 4,
      "sub_type": 4
    },
    "trace_observation_point": "TO_OVERLAY",
    "interface": {
      "index": 20,
      "name": "cilium_vxlan"
    },
    "Summary": "TCP Flags: SYN, ACK"
  },
  "node_name": "k8s1",
  "time": "2021-11-24T13:37:52.324295596Z"
}
{
  "flow": {
    "time": "2021-11-24T13:37:52.325198441Z",
    "verdict": "FORWARDED",
    "ethernet": {
      "source": "92:cf:43:95:57:7a",
      "destination": "1e:22:d0:61:1d:dc"
    },
    "IP": {
      "source": "192.168.56.13",
      "destination": "10.0.0.183",
      "ipVersion": "IPv4"
    },
    "l4": {
      "TCP": {
        "source_port": 38580,
        "destination_port": 80,
        "flags": {
          "RST": true
        }
      }
    },
    "source": {
      "identity": 6,
      "labels": [
        "reserved:remote-node"
      ]
    },
    "destination": {
      "ID": 212,
      "identity": 37363,
      "namespace": "202111241337k8segressgatewaytesttunneldisabledwithendpointroute",
      "labels": [
        "k8s:io.cilium.k8s.namespace.labels.ns=cilium-test",
        "k8s:io.cilium.k8s.policy.cluster=default",
        "k8s:io.cilium.k8s.policy.serviceaccount=default",
        "k8s:io.kubernetes.pod.namespace=202111241337k8segressgatewaytesttunneldisabledwithendpointroute",
        "k8s:zgroup=testDS"
      ],
      "pod_name": "testds-q42gz"
    },
    "Type": "L3_L4",
    "node_name": "k8s1",
    "event_type": {
      "type": 4
    },
    "traffic_direction": "INGRESS",
    "trace_observation_point": "TO_ENDPOINT",
    "is_reply": false,
    "interface": {
      "index": 24,
      "name": "lxc845639c5bd85"
    },
    "Summary": "TCP Flags: RST"
  },
  "node_name": "k8s1",
  "time": "2021-11-24T13:37:52.325198441Z"
}
{
  "flow": {
    "time": "2021-11-24T13:37:52.325210424Z",
    "verdict": "FORWARDED",
    "ethernet": {
      "source": "92:cf:43:95:57:7a",
      "destination": "1e:22:d0:61:1d:dc"
    },
    "IP": {
      "source": "192.168.56.13",
      "destination": "10.0.0.183",
      "ipVersion": "IPv4"
    },
    "l4": {
      "TCP": {
        "source_port": 38580,
        "destination_port": 80,
        "flags": {
          "RST": true
        }
      }
    },
    "source": {
      "identity": 2,
      "labels": [
        "reserved:world"
      ]
    },
    "destination": {
      "ID": 212,
      "identity": 37363,
      "namespace": "202111241337k8segressgatewaytesttunneldisabledwithendpointroute",
      "labels": [
        "k8s:io.cilium.k8s.namespace.labels.ns=cilium-test",
        "k8s:io.cilium.k8s.policy.cluster=default",
        "k8s:io.cilium.k8s.policy.serviceaccount=default",
        "k8s:io.kubernetes.pod.namespace=202111241337k8segressgatewaytesttunneldisabledwithendpointroute",
        "k8s:zgroup=testDS"
      ],
      "pod_name": "testds-q42gz"
    },
    "Type": "L3_L4",
    "node_name": "k8s1",
    "event_type": {
      "type": 4
    },
    "traffic_direction": "INGRESS",
    "trace_observation_point": "TO_ENDPOINT",
    "is_reply": false,
    "Summary": "TCP Flags: RST"
  },
  "node_name": "k8s1",
  "time": "2021-11-24T13:37:52.325210424Z"
}
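Since this test runs with the tunnel disabled, any flow observed at TO_OVERLAY stands out. Below is a minimal sketch for scanning a dump like the one above; it assumes the flows are concatenated JSON objects in the shape shown, and `flowEvent` and `overlayFlows` are names made up for this example.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"strings"
)

// flowEvent mirrors only the fields of the Hubble JSON output used below.
type flowEvent struct {
	Flow struct {
		Time string `json:"time"`
		IP   struct {
			Source      string `json:"source"`
			Destination string `json:"destination"`
		} `json:"IP"`
		TraceObservationPoint string `json:"trace_observation_point"`
		Interface             struct {
			Name string `json:"name"`
		} `json:"interface"`
		Summary string `json:"Summary"`
	} `json:"flow"`
}

// overlayFlows decodes a stream of concatenated JSON flow objects and returns
// those observed at TO_OVERLAY.
func overlayFlows(dump string) ([]flowEvent, error) {
	dec := json.NewDecoder(strings.NewReader(dump))
	var out []flowEvent
	for {
		var ev flowEvent
		if err := dec.Decode(&ev); errors.Is(err, io.EOF) {
			break
		} else if err != nil {
			return nil, err
		}
		if ev.Flow.TraceObservationPoint == "TO_OVERLAY" {
			out = append(out, ev)
		}
	}
	return out, nil
}

func main() {
	// Two flows from the dump above, abbreviated to the fields of interest.
	const dump = `
{"flow":{"time":"2021-11-24T13:37:52.324267368Z","IP":{"source":"192.168.56.13","destination":"10.0.0.183"},"trace_observation_point":"TO_ENDPOINT","Summary":"TCP Flags: SYN"}}
{"flow":{"time":"2021-11-24T13:37:52.324295596Z","IP":{"source":"10.0.0.183","destination":"192.168.56.13"},"trace_observation_point":"TO_OVERLAY","interface":{"name":"cilium_vxlan"},"Summary":"TCP Flags: SYN, ACK"}}`

	flows, err := overlayFlows(dump)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	for _, f := range flows {
		fmt.Printf("%s %s -> %s via %s (%s)\n",
			f.Flow.Time, f.Flow.IP.Source, f.Flow.IP.Destination,
			f.Flow.Interface.Name, f.Flow.Summary)
	}
}
```

The decoder reads one object at a time, so it should also cope with the pretty-printed multi-line form shown in this issue.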

We can see from the flow log that the reply from the pod to the outside (SYN+ACK) was sent over the tunnel:

   "trace_observation_point": "TO_OVERLAY",
    "interface": {
      "index": 20,
      "name": "cilium_vxlan"
    },

and was apparently SNAT-ed by the egress gw node (k8s2). From the latter's SNAT table dump:

TCP OUT 10.0.0.183:80 -> 192.168.56.13:38580 XLATE_SRC 192.168.56.100:43402 Created=450sec HostLocal=0
TCP IN 192.168.56.13:38580 -> 192.168.56.100:43402 XLATE_DST 10.0.0.183:80 Created=450sec HostLocal=0
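To cross-check a reply flow against such a dump, a small lookup can help. This is only a sketch: it assumes the whitespace-separated layout of the two lines above (`TCP OUT <src> -> <dst> XLATE_SRC <new-src> ...`), and `snatSource` is a made-up helper name.

```go
package main

import (
	"fmt"
	"strings"
)

// snatSource returns the translated source address of the "OUT" entry that
// matches the given original source and destination, if there is one.
func snatSource(dump, src, dst string) (string, bool) {
	for _, line := range strings.Split(dump, "\n") {
		f := strings.Fields(line)
		// Assumed layout: TCP OUT <src> -> <dst> XLATE_SRC <new-src> ...
		if len(f) < 7 || f[1] != "OUT" || f[5] != "XLATE_SRC" {
			continue
		}
		if f[2] == src && f[4] == dst {
			return f[6], true
		}
	}
	return "", false
}

func main() {
	// The two SNAT entries quoted above.
	const dump = `TCP OUT 10.0.0.183:80 -> 192.168.56.13:38580 XLATE_SRC 192.168.56.100:43402 Created=450sec HostLocal=0
TCP IN 192.168.56.13:38580 -> 192.168.56.100:43402 XLATE_DST 10.0.0.183:80 Created=450sec HostLocal=0`

	// Source and destination of the SYN+ACK seen at TO_OVERLAY on k8s1.
	if xlate, ok := snatSource(dump, "10.0.0.183:80", "192.168.56.13:38580"); ok {
		fmt.Println("reply was SNAT-ed to", xlate)
	} else {
		fmt.Println("no matching SNAT entry")
	}
}
```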

The question is why the CT_REPLY check was bypassed. The CT entry from k8s1 (I couldn't find any other entry that could have used port 38580):

TCP IN 192.168.56.13:38580 -> 10.0.0.183:80 expires=16777495 RxPackets=9 RxBytes=546 RxFlagsSeen=0x06 LastRxReport=16777483 TxPackets=3 TxBytes=222 TxFlagsSeen=0x12 LastTxReport=16777483 Flags=0x0013 [ RxClosing TxClosing SeenNonSyn ] RevNAT=0 SourceSecurityID=2 IfIndex=0 
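For reading the RxFlagsSeen / TxFlagsSeen values in that entry, here is a small decoding sketch. It assumes these fields hold the standard TCP header flag bits (FIN=0x01, SYN=0x02, RST=0x04, PSH=0x08, ACK=0x10, URG=0x20), which I have not verified against the datapath code; `ctField` and `decodeTCPFlags` are made-up helper names. Under that assumption, 0x06 is SYN+RST and 0x12 is SYN+ACK, which would match the flows in the Hubble log.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Standard TCP header flag bits, in ascending order.
var tcpFlagNames = []struct {
	bit  uint8
	name string
}{
	{0x01, "FIN"}, {0x02, "SYN"}, {0x04, "RST"},
	{0x08, "PSH"}, {0x10, "ACK"}, {0x20, "URG"},
}

// decodeTCPFlags renders the set bits of v as a comma-separated list.
func decodeTCPFlags(v uint8) string {
	var names []string
	for _, f := range tcpFlagNames {
		if v&f.bit != 0 {
			names = append(names, f.name)
		}
	}
	return strings.Join(names, ",")
}

// ctField extracts a "key=value" field from a CT entry line and parses the
// value as an 8-bit integer (base 0 lets ParseUint accept the 0x prefix).
func ctField(line, key string) (uint8, error) {
	prefix := key + "="
	for _, f := range strings.Fields(line) {
		if strings.HasPrefix(f, prefix) {
			n, err := strconv.ParseUint(strings.TrimPrefix(f, prefix), 0, 8)
			return uint8(n), err
		}
	}
	return 0, fmt.Errorf("field %q not found", key)
}

func main() {
	// The CT entry quoted above, shortened to the fields used here.
	const entry = "TCP IN 192.168.56.13:38580 -> 10.0.0.183:80 RxFlagsSeen=0x06 TxFlagsSeen=0x12"

	for _, key := range []string{"RxFlagsSeen", "TxFlagsSeen"} {
		v, err := ctField(entry, key)
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Printf("%s=%#02x [%s]\n", key, v, decodeTCPFlags(v))
	}
}
```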
@brb brb added kind/bug This is a bug in the Cilium logic. area/CI Continuous Integration testing issue or flake ci/flake This is a known failure that occurs in the tree. Please investigate me! feature/egress-gateway Impacts the egress IP gateway feature. labels Nov 25, 2021
@brb brb assigned jibi, brb and kkourt Nov 25, 2021

brb commented Nov 25, 2021

jibi added a commit that referenced this issue Nov 30, 2021
Temporary increase the Hubble buffer size in order to capture more
flows. This will hopefully help us understand why the
K8sEgressGatewayTest is occasionally failing (#18012)

Signed-off-by: Gilberto Bertin <gilberto@isovalent.com>
borkmann pushed a commit that referenced this issue Nov 30, 2021
Temporary increase the Hubble buffer size in order to capture more
flows. This will hopefully help us understand why the
K8sEgressGatewayTest is occasionally failing (#18012)

Signed-off-by: Gilberto Bertin <gilberto@isovalent.com>
@joestringer joestringer added this to Unassigned in 1.11 CI via automation Dec 1, 2021
@joestringer joestringer moved this from Unassigned to Investigating in 1.11 CI Dec 1, 2021
@brb brb unassigned jibi, kkourt and brb May 5, 2022
@brb brb added the sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. label May 6, 2022

github-actions bot commented Jul 8, 2022

This issue has been automatically marked as stale because it has not
had recent activity. It will be closed if no further activity occurs.

@github-actions github-actions bot added the stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. label Jul 8, 2022
@github-actions github-actions bot removed the stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. label Jul 16, 2022
github-actions bot commented

This issue has been automatically marked as stale because it has not
had recent activity. It will be closed if no further activity occurs.

@github-actions github-actions bot added the stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. label Sep 14, 2022
github-actions bot commented

This issue has not seen any activity since it was marked stale.
Closing.

1.11 CI automation moved this from Investigating to Evaluate to exit quarantine Sep 29, 2022
3 participants