Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add the data path filtering for policy verdict logs. #10477

Merged
merged 1 commit into from Mar 27, 2020

Conversation

lzang
Copy link
Contributor

@lzang lzang commented Mar 5, 2020

Generate policy verdict logs only if the endpoint does have a network policy
enforced on the direction of the traffic.

Signed-off-by: Zang Li zangli@google.com


This change is Reviewable

@lzang lzang requested a review from a team March 5, 2020 01:58
@lzang lzang requested review from a team as code owners March 5, 2020 01:58
@maintainer-s-little-helper
Copy link

Release note label not set, please set the appropriate release note.

@coveralls
Copy link

coveralls commented Mar 5, 2020

Coverage Status

Coverage decreased (-0.02%) to 45.523% when pulling 300236c on lzang:master into 0b5e263 on cilium:master.

@joestringer joestringer added sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. pending-review release-note/minor This PR changes functionality that users may find relevant to operating Cilium. labels Mar 5, 2020
bpf/lib/policy_log.h Outdated Show resolved Hide resolved
@pchaigno
Copy link
Member

pchaigno commented Mar 5, 2020

test-me-please

1 similar comment
@lzang
Copy link
Contributor Author

lzang commented Mar 6, 2020

test-me-please

pkg/endpoint/bpf.go Outdated Show resolved Hide resolved
bpf/lib/policy_log.h Outdated Show resolved Hide resolved
@lzang
Copy link
Contributor Author

lzang commented Mar 6, 2020

test-me-please

pkg/endpoint/cache.go Outdated Show resolved Hide resolved
@lzang lzang force-pushed the master branch 2 times, most recently from e1c2a78 to 734fc53 Compare March 9, 2020 01:29
@lzang
Copy link
Contributor Author

lzang commented Mar 9, 2020

test-me-please previous failure https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Validated/17796/

@aanm
Copy link
Member

aanm commented Mar 9, 2020

test-me-please

bpf/lib/static_data.h Outdated Show resolved Hide resolved
bpf/lib/policy_log.h Outdated Show resolved Hide resolved
@joestringer
Copy link
Member

joestringer commented Mar 9, 2020

From here:
https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Validated/17796
Through here:
https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Validated/17796/testReport/junit/Suite-k8s-1/11/K8sDatapathConfig_DirectRouting_Check_direct_connectivity_with_per_endpoint_routes/

Into the zip, into the cilium logs for pod cilium-cllgt, filtering out all level=debug / level=info, I found repeats of this set of complaints:

2020-03-09T02:54:00.2459225Z level=error msg="Command execution failed" cmd="[tc filter replace dev lxc5ea689f5bbdb ingress prio 1 handle 1 bpf da obj 564_next/bpf_lxc.o sec from-container]" error="exit status 1" subsys=datapath-loader
2020-03-09T02:54:00.245943042Z level=warning msg="Log buffer too small to dump verifier log 16777215 bytes (10 tries)!" subsys=datapath-loader
2020-03-09T02:54:00.24595512Z level=warning msg="Error fetching program/map!" subsys=datapath-loader
2020-03-09T02:54:00.245958089Z level=warning msg="Unable to load program" subsys=datapath-loader
2020-03-09T02:54:00.263696359Z level=warning msg="JoinEP: Failed to load program" containerID=70678a65fa datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=564 error="Failed to load tc filter: exit status 1" file-path=564_next/bpf_lxc.o identity=58084 ipv4=10.10.1.179 ipv6= k8sPodName=default/test-k8s2-8d47f6c76-mpxvg subsys=datapath-loader veth=lxc5ea689f5bbdb
2020-03-09T02:54:00.263715579Z level=error msg="Error while rewriting endpoint BPF program" containerID=70678a65fa datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=564 error="Failed to load tc filter: exit status 1" identity=58084 ipv4=10.10.1.179 ipv6= k8sPodName=default/test-k8s2-8d47f6c76-mpxvg subsys=endpoint
2020-03-09T02:54:00.264195958Z level=warning msg="generating BPF for endpoint failed, keeping stale directory." containerID=70678a65fa datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=564 file-path=564_next_fail identity=58084 ipv4=10.10.1.179 ipv6= k8sPodName=default/test-k8s2-8d47f6c76-mpxvg subsys=endpoint
2020-03-09T02:54:00.264422446Z level=warning msg="Regeneration of endpoint failed" bpfCompilation=1.389299726s bpfLoadProg=13.921122438s bpfWaitForELF=1.389554451s bpfWriteELF="211.137µs" buildDuration=15.316466496s containerID=70678a65fa datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=564 error="Failed to load tc filter: exit status 1" identity=58084 ipv4=10.10.1.179 ipv6= k8sPodName=default/test-k8s2-8d47f6c76-mpxvg mapSync="21.086µs" policyCalculation="162.228µs" prepareBuild="332.159µs" proxyConfiguration="4.986µs" proxyPolicyCalculation="132.499µs" proxyWaitForAck=0s reason="updated security labels" subsys=endpoint waitingForCTClean=2.773349ms waitingForLock=883ns
2020-03-09T02:54:00.264678764Z level=error msg="endpoint regeneration failed" containerID=70678a65fa datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=564 error="Failed to load tc filter: exit status 1" identity=58084 ipv4=10.10.1.179 ipv6= k8sPodName=default/test-k8s2-8d47f6c76-mpxvg subsys=endpoint

I'm not sure why the Jenkins CI didn't identify any errors/warnings in the logs, but I saw them there.

This particular case I think we've seen before on the net-next kernel image, do you recall the workaround there?

I've filed issue #10517 to track root-cause and upstream fixes for the underlying issue here.

For the other two failures from that link, they look like policy-related CI flakes. Given the timing they may have been resolved by #10493.

At a glance from the latest failures, I see endpoints in not-ready state, and I was able to find evidence of the above "log buffer too small" issue again.

@aanm aanm added the dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. label Mar 10, 2020
@lzang
Copy link
Contributor Author

lzang commented Mar 10, 2020

We bypassed the issue earlier by removing all the if else statements in the datapath patch. For this one, we cannot bypass the single check for the filtering operation.

@lzang
Copy link
Contributor Author

lzang commented Mar 16, 2020

test-me-please

bpf/lib/static_data.h Outdated Show resolved Hide resolved
@aanm aanm removed the dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. label Mar 17, 2020
@joestringer
Copy link
Member

I filed #10615 for the Travis flake.

@lzang
Copy link
Contributor Author

lzang commented Mar 17, 2020

test-me-please

@lzang
Copy link
Contributor Author

lzang commented Mar 17, 2020

Thanks @joestringer ! For the "log buffer too small" issue, it is possible to put in the mitigation with iproute2 to the CI?

@pchaigno
Copy link
Member

Thanks @joestringer ! For the "log buffer too small" issue, it is possible to put in the mitigation with iproute2 to the CI?

That wouldn't solve the underlying issue, which is a complexity error in the BPF verifier (we're hitting the limit). As a short term solution, we're hoping #10542 will give us some slack there and allow the currently-blocked PRs to pass.

@lzang
Copy link
Contributor Author

lzang commented Mar 18, 2020

That wouldn't solve the underlying issue, which is a complexity error in the BPF verifier (we're hitting the limit). As a short term solution, we're hoping #10542 will give us some slack there and allow the currently-blocked PRs to pass.

Yes, I know it doesn't solve the underlying issue, but I think it shouldn't block CI as it will take longer time for the real fix to get in. We can always remove the workaround later. It's great if #10542 can help and unblock CI.

@pchaigno
Copy link
Member

Hi @lzang 👋 Could you rebase your pull request against master? #10542 was merged yesterday. It updates the LLVM/Clang version we use and should fix the CI issues you're currently hitting.

@pchaigno pchaigno added the dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. label Mar 26, 2020
@lzang
Copy link
Contributor Author

lzang commented Mar 26, 2020

test-me-please

Copy link
Member

@pchaigno pchaigno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neat! The tests are passing 😃

I have a couple small comments below, but we should be okay to merge after that 🎉

bpf/lib/policy_log.h Outdated Show resolved Hide resolved
bpf/lxc_config.h Show resolved Hide resolved
bpf/netdev_config.h Outdated Show resolved Hide resolved
@pchaigno pchaigno removed the dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. label Mar 27, 2020
Generate policy verdict logs only if the endpoint does have a network policy
enforced on the direction of the traffic.

Signed-off-by: Zang Li <zangli@google.com>
@tklauser
Copy link
Member

test-me-please

@joestringer joestringer merged commit 307c3e5 into cilium:master Mar 27, 2020
1.8.0 automation moved this from In progress to Merged Mar 27, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-note/minor This PR changes functionality that users may find relevant to operating Cilium. sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
Projects
No open projects
1.8.0
  
Merged
Development

Successfully merging this pull request may close these issues.

None yet

7 participants