Intermittent report of dropped logs from hubble. #15445

Closed
gtsjamesbond opened this issue Mar 24, 2021 · 29 comments
Labels
kind/bug: This is a bug in the Cilium logic.
kind/community-report: This was reported by a user in the Cilium community, e.g. via Slack.
need-more-info: More information is required to further debug or fix the issue.
needs/triage: This issue requires triaging to establish severity and next steps.
sig/hubble: Impacts hubble server or relay.
sig/policy: Impacts whether traffic is allowed or denied based on user-defined policies.
stale: The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale.

Comments

@gtsjamesbond

Hello,
We've been noticing some unusual behavior with Cilium/Hubble and we're curious whether anyone else has seen anything similar. Essentially, we've written network policies that allow traffic to a particular CIDR block/port, but we're seeing intermittent drop logs from Hubble with drop reason 133 (Policy denied) where we know the traffic should be allowed (and usually is allowed). However, we haven't actually noticed any service interruption due to those drops, so we're unsure whether these are real intermittent Cilium drops or whether Hubble might be falsely reporting them. Has anyone seen anything similar, and any ideas whether the cause might be false reporting by Hubble or real intermittent drops by Cilium? We're running Cilium v1.8.6 on Kubernetes 1.18.

here’s an example of a drop:

"attributes": {
    "source": {
        "namespace": "studio-live",
        "identity": 42150,
        "ID": 37,
        "labels": [
            "k8s:app.kubernetes.io/managed-by=spinnaker",
            "k8s:app.kubernetes.io/name=baas",
            "k8s:app=baas-config",
            "k8s:io.cilium.k8s.namespace.labels.field.cattle.io/projectId=p-v9rth",
            "k8s:io.cilium.k8s.policy.cluster=default",
            "k8s:io.cilium.k8s.policy.serviceaccount=default",
            "k8s:io.kubernetes.pod.namespace=studio-live",
            "k8s:studio-context=live"
        ],
        "pod_name": "baas-config-d84bc885f-8s564"
    },
    "IP": {
        "ipVersion": "IPv4",
        "destination": "10.8.2.59",
        "source": "10.8.1.121"
    },
    "l4": {
        "TCP": {
            "destination_port": 26257,
            "source_port": 48794,
            "flags": {
                "PSH": true,
                "ACK": true
            }
        }
    },
    "destination": {
        "identity": 2,
        "labels": [
            "reserved:world"
        ]
    },
    "node_name": "ip-10-8-1-11.us-east-2.compute.internal",
    "traffic_direction": "EGRESS",
    "ethernet": {
        "destination": "2e:cf:00:b3:44:3c",
        "source": "86:4f:19:2d:9b:ef"
    },
    "Type": "L3_L4",
    "event_type": {
        "sub_type": 133,
        "type": 1
    },
    "verdict": "DROPPED",
    "Summary": "TCP Flags: ACK, PSH",
    "time": "2021-03-19T18:08:08.011207109Z",
    "drop_reason": 133
}

policy:

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  annotations:
    artifact.spinnaker.io/location: studio-live
    artifact.spinnaker.io/name: network-config
    artifact.spinnaker.io/type: kubernetes/CiliumNetworkPolicy.cilium.io
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"cilium.io/v2","kind":"CiliumNetworkPolicy", [...]}
    moniker.spinnaker.io/application: baas
    moniker.spinnaker.io/cluster: CiliumNetworkPolicy.cilium.io network-config
  creationTimestamp: "2021-01-24T15:04:42Z"
  generation: 4
  labels:
    app.kubernetes.io/managed-by: spinnaker
    app.kubernetes.io/name: baas
  managedFields:
  - apiVersion: cilium.io/v2
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:artifact.spinnaker.io/location: {}
          f:artifact.spinnaker.io/name: {}
          f:artifact.spinnaker.io/type: {}
          f:kubectl.kubernetes.io/last-applied-configuration: {}
          f:moniker.spinnaker.io/application: {}
          f:moniker.spinnaker.io/cluster: {}
        f:labels:
          .: {}
          f:app.kubernetes.io/managed-by: {}
          f:app.kubernetes.io/name: {}
      f:spec:
        .: {}
        f:egress: {}
        f:endpointSelector:
          .: {}
          f:matchLabels:
            .: {}
            f:app: {}
    manager: kubectl
    operation: Update
    time: "2021-03-06T23:00:22Z"
  name: network-config
  namespace: studio-live
  resourceVersion: "64708762"
  selfLink: /apis/cilium.io/v2/namespaces/studio-live/ciliumnetworkpolicies/network-config
  uid: af4b7347-921f-436d-ab8f-102933da6bf7
spec:
  egress:
  - toCIDRSet:
    - cidr: 10.0.0.0/8
  - toCIDR:
    - 169.254.169.254/32
    - 184.32.0.0/16
  - toFQDNs:
    - matchName: bondtech-docker-local.jfrog.io
    - matchPattern: baas-*.*.*.rds.amazonaws.com
[...]
  - toPorts:
    - ports:
      - port: "53"
        protocol: ANY
      rules:
        dns:
        - matchPattern: '*'
    - ports:
      - port: "8125"
        protocol: UDP
    - ports:
      - port: "6000"
        protocol: TCP
      - port: "7000"
        protocol: TCP
  - toEndpoints:
    - matchLabels:
        app: baas-auth
    - matchLabels:
        k8s:app: cockroachdb
        k8s:io.kubernetes.pod.namespace: crdb-dev3
[...]
  endpointSelector:
    matchLabels:
      app: baas-config

How to reproduce the issue

  1. Intermittent. This seems to happen randomly, so we do not have a concrete way to reproduce it at will. We asked in the Slack channel, and after initial triage with @pchaigno and @aditighag the recommendation was to gather a sysdump and post a bug report for further review.
    cilium-sysdump-prod2-20210324-012008.zip
    cilium-sysdump-prod3-20210324-012151.zip
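
One way to capture these drop events as structured records for later correlation is to query Hubble from inside a Cilium agent pod. A minimal sketch, assuming Hubble is enabled in the agents; the pod name is a placeholder:

$ kubectl -n kube-system get pods -l k8s-app=cilium        # list the agent pods, pick the one on the affected node
$ kubectl -n kube-system exec cilium-xxxxx -- hubble observe --type drop --verdict DROPPED -o jsonpb

The jsonpb output includes drop_reason and the resolved source/destination identities, the same fields shown in the flow above.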
@pchaigno (Member)

What is the second sysdump (prod3)?

Was 10.8.2.59 associated with a pod at any point? In the FQDN cache, I see that cockroachdb.crdb-prod2.svc.cluster.local. resolved to 10.8.2.59. If it was associated with a pod, was that pod matched by:

      - matchLabels:
          k8s:app: cockroachdb
          k8s:io.kubernetes.pod.namespace: crdb-prod2

Were the policy drops happening when the sysdump was taken?
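
A quick way to check how the agent currently maps that IP is to look at the FQDN cache and the datapath ipcache from inside the Cilium agent pod on the node that reported the drop; a sketch, assuming the agent CLI output is similar in v1.8:

$ cilium fqdn cache list | grep 10.8.2.59     # names that resolved to this IP
$ cilium bpf ipcache get 10.8.2.59            # security identity the datapath maps the IP to

If the ipcache has no entry for the pod IP at the time of the drop, the destination is typically classified as reserved:world, which would match the identity 2 Hubble reports.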

@gtsjamesbond (Author)

Hey @pchaigno
Sorry for the confusion. We have two clusters running Cilium and we were/are seeing the drops across both. The drops we see logged are mainly to CockroachDB, an RDS database, and an external FQDN. 10.8.2.59 was assigned to one of our CockroachDB pods, but I believe it has since been cycled/replaced. The CockroachDB pod should have matched that label. If there is any additional clarification/information that we need to provide, please let me know.

Regards.

@pchaigno (Member)

10.8.2.59 was assigned to one of our CockroachDB pods, but I believe it has since been cycled/replaced.

When was the sysdump collected? Is the following series of events correct?

  1. Drops are happening.
  2. Pod 10.8.2.59 is deleted.
  3. Sysdump is collected.

It would be nice to get a sysdump before the pod is deleted, ideally while the drops are still happening.
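
To catch the drops live on the affected node before the pod is recycled, one option is to watch the agent's drop notifications while the sysdump is being collected; a sketch, with a placeholder agent pod name:

$ kubectl -n kube-system exec cilium-xxxxx -- cilium monitor --type drop

cilium monitor prints the drop reason and the numeric source/destination identities as the datapath sees them at the moment of the drop, which can then be compared against the policy and the ipcache.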

@gtsjamesbond (Author)

Hey @pchaigno,
I believe that sequence is correct. That just happened to be the initial request that we opened our Slack conversation with. We can attempt to capture a more recent failure and sysdump, as we see these frequently enough. I'll post an update with the new information.

Regards.

@aditighag (Member) commented Mar 24, 2021

@gtsjamesbond Just to confirm, there are no real packet drops, only Hubble-reported drops, correct? I think this context was missing from the Slack conversation. I'm not sure we need the sig/policy label on this issue if that's the case.

@gtsjamesbond (Author)

@aditighag This is one thing that we were unsure of: we are getting the hubble reports of drops, BUT the application is behaving as expected (like there are no drops), so this is leading us to believe something is not correct, or we are misinterpreting the meaning of the drop reports. This is one of the things we wanted to verify, namely whether the drop reports that we are seeing are in fact accurate or false reports. If they are false reports, we would like to determine why, to see if there is any way to "tune these out" so that we aren't alerted if we don't need to be.

Regards.

@gtsjamesbond (Author)

@pchaigno @aditighag
Below is a recent drop
{
    "destination": {
        "ID": 549,
        "identity": 20476,
        "labels": [
            "k8s:app.kubernetes.io/managed-by=spinnaker",
            "k8s:app.kubernetes.io/name=baas",
            "k8s:app=baas-auth",
            "k8s:io.cilium.k8s.namespace.labels.field.cattle.io/projectId=p-f78q9",
            "k8s:io.cilium.k8s.policy.cluster=default",
            "k8s:io.cilium.k8s.policy.serviceaccount=default",
            "k8s:io.kubernetes.pod.namespace=studio-sandbox",
            "k8s:studio-context=sandbox"
        ]
    }
}

$ kubectl get pods -A -o wide | grep -i "10.8.4.89"
studio-sandbox   baas-auth-58db55c49-tb2x6
I have attached a current sysdump.
cilium-sysdump-20210324-135815.zip

Regards.

@aditighag (Member)

we are getting the hubble reports of drops, BUT the application is behaving as expected (like there are no drops)

So if pod1 on node1 is talking to pod2 on node2, you see Hubble reported drops on node1, but the packets reach node2/pod2?

@pchaigno (Member)

we are getting the hubble reports of drops, BUT the application is behaving as expected (like there are no drops), so this is leading us to believe something is not correct

If the drops are intermittent, they might not cause connectivity issues. That seems more likely than Hubble getting packet drop notifications out of nowhere 🙂

@gtsjamesbond (Author) commented Mar 24, 2021

@pchaigno I see your point. What is of more interest to our team is why we would get the drop to begin with if the target IP should be covered by one of the rules. Is what we are seeing an artifact of a network issue manifesting, or is the CNP actually causing a drop? We want to make sure that we have the proper interpretation of the drop logs that we are seeing.

We also see this intermittently with a target we have identified using FQDN. Any ideas there?

Regards.

@pchaigno (Member)

why we would get the drop to begin with if the target IP should be covered by one of the rules? Is what we are seeing an artifact of a network issue manifesting? or CNP actually causing a drop?

My guess would be that the drops are caused by some transient issue in the resolution of the security identity. The destination identity reported by Hubble is reserved:world, which is sort of a default for when we can't associate another identity.

We can attempt to capture a more recent failure and dump as we see these frequently enough.

I think that would help understand why the identity resolution failed.
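
One way to test this hypothesis during a drop window is to ask the agent what verdict it would compute for the same flow, once against the real destination pod and once against the world identity; a sketch, assuming cilium policy trace supports these flags in this version (pod names are placeholders):

$ cilium policy trace --src-k8s-pod studio-live:baas-config-xxxxx --dst-k8s-pod crdb-prod2:cockroachdb-xxxxx --dport 26257
$ cilium policy trace --src-k8s-pod studio-live:baas-config-xxxxx --dst-identity 2 --dport 26257

If the first trace is allowed and the second is denied, the drops are consistent with the destination IP temporarily losing its pod identity in the ipcache and being evaluated as reserved:world.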

@gtsjamesbond (Author) commented Mar 25, 2021

@pchaigno here are two drops logged in the last hour to different destinations, and also the sysdump


{
    "destination": {
        "identity": 2,
        "labels": [
            "reserved:world"
        ]
    },
    "destination_names": [
        "cockroachdb.crdb-prod3.svc.cluster.local"
    ],
    "drop_reason": 133,
    "ethernet": {
        "destination": "42:bb:c6:65:d5:80",
        "source": "de:3a:da:05:fb:38"
    },
    "event_type": {
        "sub_type": 133,
        "type": 1
    },
    "IP": {
        "destination": "10.8.3.79",
        "ipVersion": "IPv4",
        "source": "10.8.5.49"
    },
    "l4": {
        "TCP": {
            "destination_port": 26257,
            "flags": {
                "ACK": true,
                "PSH": true
            },
            "source_port": 58578
        }
    },
    "node_name": "ip-10-8-5-103.us-east-2.compute.internal",
    "source": {
        "ID": 786,
        "identity": 12971,
        "labels": [
            "k8s:app.kubernetes.io/managed-by=spinnaker",
            "k8s:app.kubernetes.io/name=baas",
            "k8s:app=baas-config",
            "k8s:io.cilium.k8s.namespace.labels.field.cattle.io/projectId=p-f78q9",
            "k8s:io.cilium.k8s.policy.cluster=default",
            "k8s:io.cilium.k8s.policy.serviceaccount=default",
            "k8s:io.kubernetes.pod.namespace=studio-live",
            "k8s:studio-context=live"
        ],
        "namespace": "studio-live",
        "pod_name": "baas-config-6f6d88ff9b-g8x6s"
    },
    "Summary": "TCP Flags: ACK, PSH",
    "time": "2021-03-25T15:48:36.581529800Z",
    "traffic_direction": "EGRESS",
    "Type": "L3_L4",
    "verdict": "DROPPED"
}

{
    "destination": {
        "identity": 2,
        "labels": [
            "reserved:world"
        ]
    },
    "destination_names": [
        "rabbitmq.studio-live.svc.cluster.local"
    ],
    "destination_service": {
        "name": "rabbitmq",
        "namespace": "studio-live"
    },
    "drop_reason": 133,
    "ethernet": {
        "destination": "32:73:93:80:d3:68",
        "source": "6e:f5:ae:67:4a:2e"
    },
    "event_type": {
        "sub_type": 133,
        "type": 1
    },
    "IP": {
        "destination": "172.20.244.191",
        "ipVersion": "IPv4",
        "source": "10.8.5.18"
    },
    "l4": {
        "TCP": {
            "destination_port": 5672,
            "flags": {
                "SYN": true
            },
            "source_port": 38390
        }
    },
    "node_name": "ip-10-8-5-103.us-east-2.compute.internal",
    "source": {
        "ID": 587,
        "identity": 30337,
        "labels": [
            "k8s:app.kubernetes.io/managed-by=spinnaker",
            "k8s:app.kubernetes.io/name=baas",
            "k8s:app=baas-bsa",
            "k8s:io.cilium.k8s.namespace.labels.field.cattle.io/projectId=p-f78q9",
            "k8s:io.cilium.k8s.policy.cluster=default",
            "k8s:io.cilium.k8s.policy.serviceaccount=default",
            "k8s:io.kubernetes.pod.namespace=studio-live",
            "k8s:studio-context=live"
        ],
        "namespace": "studio-live",
        "pod_name": "baas-bsa-6c5c9965f8-t8qkf"
    },
    "Summary": "TCP Flags: SYN",
    "time": "2021-03-25T15:42:58.661531335Z",
    "traffic_direction": "EGRESS",
    "Type": "L3_L4",
    "verdict": "DROPPED"
}

@pchaigno (Member)

@gtsjamesbond I think you forgot to attach the sysdump. The Hubble log is also ill-formatted.

@gtsjamesbond (Author)

@pchaigno Attaching dump again.
cilium-sysdump-20210325-114456.zip

@gtsjamesbond (Author)

@pchaigno @aditighag , were you able to gain any insight from the latest attached logs uploaded on March 25?

Regards.

@aditighag (Member)

@pchaigno @aditighag , were you able to gain any insight from the latest attached logs uploaded on March 25?

Regards.

Busy with the 1.10 release, I'll try to take a look (most likely) after the release unless someone else beats me to it.

stale bot commented Jun 3, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@errordeveloper (Contributor)

Busy with the 1.10 release, I'll try to take a look (most likely) after the release unless someone else beats me to it.

@aditighag as you showed some interest, I've assigned this to you :)

@othmane399

I have the same behavior on some pods.

@aditighag (Member)

I have the same behavior on some pods.

@othmane399 What Cilium version is this? We've had a couple of fixes around identity propagation delays (#16302) that could potentially be related. Can you check whether you have the fix?

@othmane399

@aditighag I'm on 1.10.3, so basically yes, I have the changes.

@joaoubaldo (Contributor)

I am experiencing something very similar to this. In my case, the drops reported by Hubble are real (see tests below).
Running Cilium 1.10.3 on EKS 1.20.

For context, we use Traefik (as a DaemonSet) to forward traffic to some-service Pods/Endpoints, and from time to time our metrics show an increasing number of HTTP 504s for ingresses.
When these 504s are happening, Traefik logs show messages such as:
'504 Gateway Timeout' caused by: dial tcp 10.210.145.199:3000: i/o timeout
We then confirm that the IP is assigned to a Pod:

$ kubectl get pods -o wide | grep some-service
some-service-6ccccf645d-222k9                     1/1     Running   0          3h12m   10.210.169.205   ip-10-210-184-168.eu-west-1.compute.internal   <none>           <none>
some-service-6ccccf645d-5t9qk                     1/1     Running   0          3h12m   10.210.112.82    ip-10-210-117-158.eu-west-1.compute.internal   <none>           <none>
some-service-6ccccf645d-6l4k6                     1/1     Running   0          137m    10.210.129.206   ip-10-210-128-32.eu-west-1.compute.internal    <none>           <none>
some-service-6ccccf645d-9snjn                     1/1     Running   2          3h12m   10.210.145.199   ip-10-210-147-20.eu-west-1.compute.internal    <none>           <none> <---- curling this from another pod results in DROPPED
some-service-6ccccf645d-bgpk6                     1/1     Running   0          156m    10.210.185.161   ip-10-210-184-168.eu-west-1.compute.internal   <none>           <none>
some-service-6ccccf645d-btr8l                     1/1     Running   0          3h12m   10.210.124.231   ip-10-210-117-158.eu-west-1.compute.internal   <none>           <none>
some-service-6ccccf645d-c9mdv                     1/1     Running   0          154m    10.210.164.170   ip-10-210-184-168.eu-west-1.compute.internal   <none>           <none>
some-service-6ccccf645d-gtlpj                     1/1     Running   0          125m    10.210.135.205   ip-10-210-128-32.eu-west-1.compute.internal    <none>           <none>
some-service-6ccccf645d-kqvxb                     1/1     Running   0          3h12m   10.210.166.35    ip-10-210-184-168.eu-west-1.compute.internal   <none>           <none>
some-service-6ccccf645d-ln4v5                     1/1     Running   0          125m    10.210.171.201   ip-10-210-184-168.eu-west-1.compute.internal   <none>           <none>
some-service-6ccccf645d-rfmf4                     1/1     Running   0          142m    10.210.126.142   ip-10-210-117-158.eu-west-1.compute.internal   <none>           <none>
some-service-6ccccf645d-wvc7s                     1/1     Running   2          3h11m   10.210.140.27    ip-10-210-147-20.eu-west-1.compute.internal    <none>           <none>
some-service-6ccccf645d-xfbtp                     1/1     Running   0          3h12m   10.210.109.77    ip-10-210-117-158.eu-west-1.compute.internal   <none>           <none>
some-service-6ccccf645d-xfmdq                     1/1     Running   0          136m    10.210.132.141   ip-10-210-128-32.eu-west-1.compute.internal    <none>           <none>
some-service-6ccccf645d-z7gf7                     1/1     Running   0          154m    10.210.184.122   ip-10-210-184-168.eu-west-1.compute.internal   <none>           <none>
some-service-6ccccf645d-ztm87                     1/1     Running   0          143m    10.210.116.247   ip-10-210-117-158.eu-west-1.compute.internal   <none>           <none>

Making an HTTP request to 10.210.145.199:3000 from any Traefik Pod fails, with Cilium dropping the packet. hubble observe shows:

Mar  9 14:09:02.709: traefik/traefik-dhzrk:45498 <> 10.210.145.199:3000 Policy denied DROPPED (TCP Flags: SYN)
Mar  9 14:09:02.709: traefik/traefik-dhzrk:45498 <> 10.210.145.199:3000 Policy denied DROPPED (TCP Flags: SYN)
Mar  9 14:09:04.484: traefik/traefik-dhzrk:46534 <> 10.210.145.199:3000 Policy denied DROPPED (TCP Flags: SYN)
Mar  9 14:09:04.484: traefik/traefik-dhzrk:46534 <> 10.210.145.199:3000 Policy denied DROPPED (TCP Flags: SYN)
Mar  9 14:09:05.493: traefik/traefik-dhzrk:46534 <> 10.210.145.199:3000 Policy denied DROPPED (TCP Flags: SYN)
Mar  9 14:09:05.493: traefik/traefik-dhzrk:46534 <> 10.210.145.199:3000 Policy denied DROPPED (TCP Flags: SYN)

However a request to another some-service Pod IP (e.g. 10.210.164.170) succeeds.
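
A comparison that might narrow this down, run against the Cilium agent on the node hosting the Traefik pod (a sketch; the agent pod name is a placeholder), is to dump the ipcache entry for the failing destination and for a working one, and check whether the failing IP still maps to the pod's security identity:

$ kubectl -n kube-system exec cilium-agent-on-traefik-node -- cilium bpf ipcache get 10.210.145.199   # failing pod IP
$ kubectl -n kube-system exec cilium-agent-on-traefik-node -- cilium bpf ipcache get 10.210.164.170   # working pod IP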

The relevant Cilium Endpoint

  {
    "id": 1114,
    "spec": {
      "label-configuration": {},
      "options": {
        "Conntrack": "Enabled",
        "ConntrackAccounting": "Enabled",
        "ConntrackLocal": "Disabled",
        "Debug": "Disabled",
        "DebugLB": "Disabled",
        "DebugPolicy": "Disabled",
        "DropNotification": "Enabled",
        "MonitorAggregationLevel": "Medium",
        "NAT46": "Disabled",
        "PolicyAuditMode": "Disabled",
        "PolicyVerdictNotification": "Enabled",
        "TraceNotification": "Enabled"
      }
    },
    "status": {
      "controllers": [
        {
          "configuration": {
            "error-retry": true,
            "error-retry-base": "2s"
          },
          "name": "endpoint-1114-regeneration-recovery",
          "status": {
            "last-failure-timestamp": "0001-01-01T00:00:00.000Z",
            "last-success-timestamp": "0001-01-01T00:00:00.000Z"
          },
          "uuid": "95f459bf-b865-4323-af63-1566bedabadc"
        },
        {
          "configuration": {
            "error-retry": true,
            "interval": "5m0s"
          },
          "name": "resolve-identity-1114",
          "status": {
            "last-failure-timestamp": "0001-01-01T00:00:00.000Z",
            "last-success-timestamp": "2022-03-09T12:16:18.025Z",
            "success-count": 40
          },
          "uuid": "1c4fa874-3084-47c3-9f78-682d2b2f08fa"
        },
        {
          "configuration": {
            "error-retry": true,
            "interval": "1m0s"
          },
          "name": "sync-policymap-1114",
          "status": {
            "last-failure-timestamp": "0001-01-01T00:00:00.000Z",
            "last-success-timestamp": "2022-03-09T12:18:17.522Z",
            "success-count": 1062
          },
          "uuid": "9c7beb4f-0854-4d85-bce0-64679702ff01"
        },
        {
          "configuration": {
            "error-retry": true,
            "interval": "10s"
          },
          "name": "sync-to-k8s-ciliumendpoint (1114)",
          "status": {
            "last-failure-timestamp": "0001-01-01T00:00:00.000Z",
            "last-success-timestamp": "2022-03-09T12:18:38.699Z",
            "success-count": 1186
          },
          "uuid": "67de00c4-ba2b-4cc8-9c22-a576f26a9353"
        }
      ],
      "external-identifiers": {
        "container-id": "9723d07a45d0b8891c64eb3d715d1f17c4defaef00995ddfe4df4d3e74f58fd3",
        "k8s-namespace": "some-namespace",
        "k8s-pod-name": "some-service-6ccccf645d-9snjn",
        "pod-name": "some-namespace/some-service-6ccccf645d-9snjn"
      },
      "health": {
        "bpf": "OK",
        "connected": true,
        "overallHealth": "OK",
        "policy": "OK"
      },
      "identity": {
        "id": 38850,
        "labels": [
          "k8s:serviceSelector=some-service-selector",
          "k8s:app.kubernetes.io/name=some-service",
          "k8s:io.cilium.k8s.namespace.labels.functional_area=engineering",
          "k8s:io.kubernetes.pod.namespace=some-namespace",
          "k8s:io.cilium.k8s.policy.cluster=default",
          "k8s:io.cilium.k8s.namespace.labels.team=some-namespace",
          "k8s:app.kubernetes.io/managed-by=spinnaker",
          "k8s:io.cilium.k8s.policy.serviceaccount=some-service",
          "k8s:app.kubernetes.io/instance=some-service",
          "k8s:frontedBy=some-service-svc"
        ],
        "labelsSHA256": "3613c13e2efceb441c618ffc4c04ce3044f7737db2ffe6b932d897511ddc58a1"
      },
      "labels": {
        "derived": [
          "k8s:pod-template-hash=6ccccf645d"
        ],
        "realized": {},
        "security-relevant": [
          "k8s:app.kubernetes.io/instance=some-service",
          "k8s:app.kubernetes.io/managed-by=spinnaker",
          "k8s:app.kubernetes.io/name=some-service",
          "k8s:frontedBy=some-service-svc",
          "k8s:io.cilium.k8s.namespace.labels.functional_area=engineering",
          "k8s:io.cilium.k8s.namespace.labels.team=some-namespace",
          "k8s:io.cilium.k8s.policy.cluster=default",
          "k8s:io.cilium.k8s.policy.serviceaccount=some-service",
          "k8s:io.kubernetes.pod.namespace=some-namespace",
          "k8s:serviceSelector=some-service-selector"
        ]
      },
      "log": [
        {
          "code": "OK",
          "message": "Successfully regenerated endpoint program (Reason: one or more identities created or deleted)",
          "state": "ready",
          "timestamp": "2022-03-09T12:17:17Z"
        }
      ],
      "namedPorts": [
        {
          "name": "http",
          "port": 3000,
          "protocol": "TCP"
        }
      ],
      "networking": {
        "addressing": [
          {
            "ipv4": "10.210.145.199"
          }
        ],
        "host-mac": "d2:83:14:7e:b8:5d",
        "interface-index": 483,
        "interface-name": "lxc96c6117f9e34",
        "mac": "6a:66:e9:05:5d:c0"
      },
      "policy": {
        "proxy-statistics": [],
        "realized": {
          "allowed-egress-identities": [
            0
          ],
          "allowed-ingress-identities": [
            0,
            1
          ],
          "build": 28,
          "cidr-policy": {
            "egress": [],
            "ingress": []
          },
          "denied-egress-identities": null,
          "denied-ingress-identities": null,
          "id": 38850,
          "l4": {
            "egress": [],
            "ingress": []
          },
          "policy-enabled": "none",
          "policy-revision": 28
        },
        "spec": {
          "allowed-egress-identities": [
            0
          ],
          "allowed-ingress-identities": [
            1,
            0
          ],
          "build": 28,
          "cidr-policy": {
            "egress": [],
            "ingress": []
          },
          "denied-egress-identities": null,
          "denied-ingress-identities": null,
          "id": 38850,
          "l4": {
            "egress": [],
            "ingress": []
          },
          "policy-enabled": "none",
          "policy-revision": 28
        }
      },
      "realized": {
        "label-configuration": {},
        "options": {
          "Conntrack": "Enabled",
          "ConntrackAccounting": "Enabled",
          "ConntrackLocal": "Disabled",
          "Debug": "Disabled",
          "DebugLB": "Disabled",
          "DebugPolicy": "Disabled",
          "DropNotification": "Enabled",
          "MonitorAggregationLevel": "Medium",
          "NAT46": "Disabled",
          "PolicyAuditMode": "Disabled",
          "PolicyVerdictNotification": "Enabled",
          "TraceNotification": "Enabled"
        }
      },
      "state": "ready"
    }
  }
]

CiliumNetworkPolicy applied to Traefik

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: traefik
  namespace: traefik
spec:
  egress:
  - toEntities:
    - cluster
  - toPorts:
    - ports:
      - port: "443"
    toServices:
    - k8sService:
        namespace: default
        serviceName: kubernetes
  - toPorts:
    - ports:
      - port: "8126"
        protocol: TCP
      - port: "8125"
        protocol: UDP
    toServices:
    - k8sServiceSelector:
        namespace: datadog
        selector:
          matchLabels:
            agent.datadoghq.com/name: datadog
            app.kubernetes.io/instance: agent
  - toFQDNs:
    - matchName: update.traefik.io
    toPorts:
    - ports:
      - port: "443"
  - toEndpoints:
    - matchLabels:
        io.kubernetes.pod.namespace: kube-system
        k8s-app: kube-dns
    toPorts:
    - ports:
      - port: "53"
        protocol: UDP
      rules:
        dns:
        - matchPattern: '*'
  - toEndpoints:
    - matchLabels:
        io.kubernetes.pod.namespace: engineering
    toPorts:
    - ports:
      - port: "8080"
        protocol: TCP
  endpointSelector:
    matchLabels:
      app.kubernetes.io/name: traefik
  ingress:
  - fromEntities:
    - host
    - remote-node
    toPorts:
    - ports:
      - port: "8082"
        protocol: TCP
      - port: "8443"
      - port: "8445"
  - fromEntities:
    - world
    toPorts:
    - ports:
      - port: "8082"
      - port: "8443"
      - port: "8444"
      - port: "8445"

Any assistance on how to troubleshoot this further is very welcome.

Thank you

@pchaigno (Member)

@joaoubaldo Could you please open a new issue with the same information as above plus a full dump of those Hubble drops (-o jsonpb)? A Cilium sysdump of the cluster would also help.

github-actions bot commented Jul 9, 2022

This issue has been automatically marked as stale because it has not
had recent activity. It will be closed if no further activity occurs.

github-actions bot

This issue has been automatically marked as stale because it has not
had recent activity. It will be closed if no further activity occurs.

github-actions bot

This issue has not seen any activity since it was marked stale.
Closing.

@luanphantiki

I have the same issue; it is random, not repeating, and the TCP flags are never SYN. My version:

root@xxx:/home/cilium# cilium version
Client: 1.12.2 c7516b9 2022-09-14T15:25:06+02:00 go version go1.18.6 linux/amd64
Daemon: 1.12.2 c7516b9 2022-09-14T15:25:06+02:00 go version go1.18.6 linux/amd64

@smeeklai

Does anybody know how I can get the same drop traffic log as the author? What do I need to enable?

@joaoubaldo (Contributor)

Does anybody know how I can get the same drop traffic log as the author? What do I need to enable?

@smeeklai you can get that info with something like hubble observe -t drop --output jsonpb
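
The drop flows are only visible if Hubble itself is enabled in the Cilium installation. A minimal sketch of enabling it with Helm, assuming the chart repo is added as cilium and the release is named cilium in kube-system:

$ helm upgrade cilium cilium/cilium -n kube-system --reuse-values --set hubble.enabled=true --set hubble.relay.enabled=true

After the agents restart, hubble observe can be run from inside any Cilium agent pod, or against Hubble Relay with the standalone hubble CLI.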
