network policy doesn't work (even deny all) #1181

Closed
nsxdemo opened this issue Aug 30, 2020 · 6 comments · Fixed by #1186
Labels: kind/bug (Categorizes issue or PR as related to a bug.), priority/important-soon (Must be staffed and worked on either currently, or very soon, ideally in time for the next release.)

Comments

nsxdemo commented Aug 30, 2020

Describe the bug
I defined a simple deny-all NetworkPolicy, but it doesn't work as expected.

To Reproduce
In my environment, I deleted the Antrea add-on, ran kubeadm reset on the nodes, re-initialized the cluster, and re-applied the add-on, but I still see the same issue.

Expected behavior
All traffic between pods should be blocked.

Actual behavior
The NetworkPolicy is not realized by Antrea, and traffic is not blocked.

Versions:
root@an01:/# antctl version
agentVersion: 0.10.0-dev-6be5e2f.clean
antctlVersion: v0.10.0-dev-6be5e2f

root@an01:/# antctl get networkpolicy


kubectl version
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.0", GitCommit:"e19964183377d0ec2052d1f1fa930c4d7575bd50", GitTreeState:"clean", BuildDate:"2020-08-26T14:30:33Z", GoVersion:"go1.15", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.0", GitCommit:"e19964183377d0ec2052d1f1fa930c4d7575bd50", GitTreeState:"clean", BuildDate:"2020-08-26T14:23:04Z", GoVersion:"go1.15", Compiler:"gc", Platform:"linux/amd64"}

Additional context
barany@an01:~$ kubectl get pods --show-labels
NAME                         READY   STATUS    RESTARTS   AGE   LABELS
nginx10                      1/1     Running   0          15m   run=nginx10
web-96d5df5c8-bmfb2          1/1     Running   0          18m   app=web,pod-template-hash=96d5df5c8
web-96d5df5c8-fnk8g          1/1     Running   0          18m   app=web,pod-template-hash=96d5df5c8
web-96d5df5c8-q289m          1/1     Running   0          18m   app=web,pod-template-hash=96d5df5c8
web-app02-75cb584798-h4p9d   1/1     Running   0          18m   app=web-app02,pod-template-hash=75cb584798
web-app02-75cb584798-hwtkt   1/1     Running   0          18m   app=web-app02,pod-template-hash=75cb584798
web-app02-75cb584798-rgskp   1/1     Running   0          18m   app=web-app02,pod-template-hash=75cb584798
barany@an01:~$ kubectl get pods -o wide
NAME                         READY   STATUS    RESTARTS   AGE   IP            NODE   NOMINATED NODE   READINESS GATES
nginx10                      1/1     Running   0          16m   172.19.2.10   an03
web-96d5df5c8-bmfb2          1/1     Running   0          18m   172.19.1.2    an02
web-96d5df5c8-fnk8g          1/1     Running   0          18m   172.19.2.2    an03
web-96d5df5c8-q289m          1/1     Running   0          18m   172.19.2.3    an03
web-app02-75cb584798-h4p9d   1/1     Running   0          18m   172.19.1.7    an02
web-app02-75cb584798-hwtkt   1/1     Running   0          18m   172.19.1.6    an02
web-app02-75cb584798-rgskp   1/1     Running   0          18m   172.19.2.6    an03
barany@an01:~$ kubectl describe networkpolicy
Name: denyall
Namespace: default
Created on: 2020-08-30 04:48:32 +0000 UTC
Labels:
Annotations:
Spec:
PodSelector: (Allowing the specific traffic to all pods in this namespace)
Allowing ingress traffic:
(Selected pods are isolated for ingress connectivity)
Allowing egress traffic:
(Selected pods are isolated for egress connectivity)
Policy Types: Ingress, Egress
barany@an01:~$ kubectl get networkpolicy
NAME      POD-SELECTOR   AGE
denyall                  5m10s
barany@an01:~$ kubectl exec -it web-app02-75cb584798-hwtkt -- curl -i 172.19.2.10
HTTP/1.1 200 OK
Server: nginx/1.19.2
Date: Sun, 30 Aug 2020 04:54:08 GMT
Content-Type: text/html
Content-Length: 612
Last-Modified: Tue, 11 Aug 2020 14:50:35 GMT
Connection: keep-alive
ETag: "5f32b03b-264"
Accept-Ranges: bytes

(response body: the default "Welcome to nginx!" page)

barany@an01:~$
nsxdemo added the kind/bug label on Aug 30, 2020
tnqn (Member) commented Aug 31, 2020

@nsxdemo thanks for reporting. However, I cannot reproduce it with master builds. Could you share the status and the logs of the antrea-controller and antrea-agent Pods? You can use antctl to collect them: https://github.com/vmware-tanzu/antrea/blob/master/docs/antctl.md#collecting-support-information

yktsubo (Contributor) commented Aug 31, 2020

Hi, I'm seeing a similar issue with v0.9.1.
The NetworkPolicy is configured, but there are no OVS flows to block traffic in EgressDefaultTable.

tnqn (Member) commented Aug 31, 2020

@yktsubo could you share your logs? I cannot reproduce it with 0.9.1 either.
This is the deny-all policy I used:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: {}
  policyTypes:
  - Egress

yktsubo (Contributor) commented Aug 31, 2020

The applied policy is:

 $ k -n elearning get netpol default-deny -o yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"networking.k8s.io/v1","kind":"NetworkPolicy","metadata":{"annotations":{},"name":"default-deny","namespace":"elearning"},"spec":{"podSelector":{},"policyTypes":["Ingress","Egress"]}}
  creationTimestamp: "2020-08-28T09:13:54Z"
  generation: 1
  managedFields:
  - apiVersion: networking.k8s.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:kubectl.kubernetes.io/last-applied-configuration: {}
      f:spec:
        f:policyTypes: {}
    manager: kubectl
    operation: Update
    time: "2020-08-28T09:13:54Z"
  name: default-deny
  namespace: elearning
  resourceVersion: "219744"
  selfLink: /apis/networking.k8s.io/v1/namespaces/elearning/networkpolicies/default-deny
  uid: a081e326-d64d-4fe5-a896-988568949e48
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress

Here is the list of pods running in the namespace

$ k -n elearning get pod -o wide
NAME                       READY   STATUS    RESTARTS   AGE     IP           NODE         NOMINATED NODE   READINESS GATES
course-655f9b8b69-rx9rb    1/1     Running   0          37m     172.16.1.8   k8s-node-1   <none>           <none>
debug                      1/1     Running   0          2d20h   172.16.1.6   k8s-node-1   <none>           <none>
rating-575dbf8478-k8zdl    1/1     Running   0          2d22h   172.16.2.5   k8s-node-2   <none>           <none>
student-6458d58469-w68rn   1/1     Running   0          52m     172.16.1.7   k8s-node-1   <none>           <none>
teacher-7c57f94f7b-shjv9   1/1     Running   0          2d22h   172.16.2.9   k8s-node-2   <none>           <none>

What I found is that table 60 doesn't have the correct flows to block traffic on either node-1 or node-2.
But all control-plane connections seem fine.

support-bundles_20200831T125822+0800.tar.gz

yktsubo (Contributor) commented Aug 31, 2020

Here is an example from node-1:

$ kubectl -n kube-system exec -it $(kubectl -n kube-system get pod -o wide | grep k8s-node-1 | grep antrea-agent | cut -d ' '  -f 1) -c antrea-agent -- ovs-ofctl -O OpenFlow13 dump-flows br-int  --no-stats | grep 'table=60'
 cookie=0x1000000000000, table=60, priority=0 actions=goto_table:70

From my understanding, node-1 should block traffic from 172.16.1.8 and 172.16.1.6 to the outside.

tnqn (Member) commented Aug 31, 2020

Thanks @yktsubo for sharing the information and providing a way to reproduce it.
After applying https://github.com/yktsubo/mock-demo several times, I can see that the NetworkPolicy reconcilers were no longer running (no "Reconciling rule" logs), even though the agent was still receiving the correct NetworkPolicy and AppliedToGroup events from the controller. It looked like the agent's network policy workers were stuck on some task.

After dumping the goroutines, there was indeed a deadlock when updating a ClusterNetworkPolicy rule:

sync.runtime_SemacquireMutex(0xc0006cc80c, 0x0, 0x1)
	/usr/local/go/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc0006cc808)
	/usr/local/go/src/sync/mutex.go:138 +0xfc
sync.(*Mutex).Lock(...)
	/usr/local/go/src/sync/mutex.go:81
sync.(*RWMutex).Lock(0xc0006cc808)
	/usr/local/go/src/sync/rwmutex.go:98 +0x97
github.com/vmware-tanzu/antrea/pkg/agent/controller/networkpolicy.(*reconciler).uninstallOFRule(0xc000534ea0, 0x3000000011, 0x0, 0x0)
	/antrea/pkg/agent/controller/networkpolicy/reconciler.go:640 +0x4bf
github.com/vmware-tanzu/antrea/pkg/agent/controller/networkpolicy.(*reconciler).update(0xc000534ea0, 0xc0007e77a0, 0xc0003c8380, 0xc000736098, 0xc000736030, 0x0, 0x0)
	/antrea/pkg/agent/controller/networkpolicy/reconciler.go:589 +0xc43
github.com/vmware-tanzu/antrea/pkg/agent/controller/networkpolicy.(*reconciler).Reconcile(0xc000534ea0, 0xc0003c8380, 0x0, 0x0)
	/antrea/pkg/agent/controller/networkpolicy/reconciler.go:222 +0x35d
github.com/vmware-tanzu/antrea/pkg/agent/controller/networkpolicy.(*Controller).syncRule(0xc000534f00, 0xc0004ba1e0, 0x10, 0x0, 0x0)
	/antrea/pkg/agent/controller/networkpolicy/networkpolicy_controller.go:405 +0x114
github.com/vmware-tanzu/antrea/pkg/agent/controller/networkpolicy.(*Controller).processNextWorkItem(0xc000534f00, 0x407000)
	/antrea/pkg/agent/controller/networkpolicy/networkpolicy_controller.go:359 +0xf4
github.com/vmware-tanzu/antrea/pkg/agent/controller/networkpolicy.(*Controller).worker(0xc000534f00)
	/antrea/pkg/agent/controller/networkpolicy/networkpolicy_controller.go:348 +0x2b
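
(As an aside, goroutine stacks like the one above can be dumped in-process with Go's runtime/pprof; a minimal sketch, not specific to Antrea's tooling:)

package main

import (
	"os"
	"runtime/pprof"
)

// dumpGoroutines writes the stacks of all goroutines to stderr.
// debug=2 uses the same panic-style format as the trace above.
func dumpGoroutines() {
	_ = pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
}

func main() {
	dumpGoroutines()
}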

The mutex is acquired twice if there are any stale OpenFlow rules that need to be deleted for a ClusterNetworkPolicy:
https://github.com/vmware-tanzu/antrea/blob/68fa2210e408723332a95ec2397f52cb38c5f2b5/pkg/agent/controller/networkpolicy/reconciler.go#L207-L213
https://github.com/vmware-tanzu/antrea/blob/68fa2210e408723332a95ec2397f52cb38c5f2b5/pkg/agent/controller/networkpolicy/reconciler.go#L639-L642

Then all workers that were reconciling ClusterNetworkPolicies would block on the lock, so normal NetworkPolicies could not be reconciled either.
The issue is only present after enabling the AntreaPolicy (formerly ClusterNetworkPolicy) feature.
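
For illustration, here is a minimal, self-contained sketch of the deadlock pattern (the type and method names are hypothetical, not Antrea's actual code): update() takes the reconciler's non-reentrant lock and, while still holding it, calls a helper that tries to take the same lock again, so the goroutine blocks on itself and every other worker then queues up behind the held lock.

package main

import "sync"

// reconciler is a stand-in for the agent's NetworkPolicy reconciler.
type reconciler struct {
	mu        sync.RWMutex // non-reentrant lock guarding reconciler state
	staleRule bool
}

// uninstallStaleRule plays the role of uninstallOFRule: it locks on entry.
func (r *reconciler) uninstallStaleRule() {
	r.mu.Lock() // second Lock() by the same goroutine: blocks forever
	defer r.mu.Unlock()
	r.staleRule = false
}

// update plays the role of reconciler.update: it already holds the lock
// when it finds a stale OpenFlow rule and calls the locking helper again.
func (r *reconciler) update() {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.staleRule {
		r.uninstallStaleRule() // deadlock: waits for a lock this goroutine holds
	}
}

func main() {
	r := &reconciler{staleRule: true}
	r.update() // never returns; with nothing else running, Go reports "all goroutines are asleep - deadlock!"
}

In the real agent other goroutines keep running, so the runtime never aborts; the stuck workers simply stop draining the queue, which matches the missing "Reconciling rule" logs. The usual remedy for this pattern (a general note, not necessarily what #1186 does) is to have the inner helper assume the lock is already held, or to release the lock before calling it.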

@Dyanngg could you consider a fix and check if there are other similar issues?

tnqn added the priority/important-soon label on Aug 31, 2020
tnqn added this to the Antrea v0.10.0 release milestone on Aug 31, 2020