
datapath: Fix wrong rev-NAT xlation due to stale conntrack entry #10984

Merged
merged 3 commits into master from pr/brb/np-update-stale-conntrack on Apr 21, 2020

Conversation

brb
Member

@brb brb commented Apr 15, 2020

See commit msgs.

In addition to the CI tests, I've tested the fix on a user's cluster which was previously hit by #10983 - the fix resolved the problem.

Fix #10983
Also, this might resolve some of the NodePort BPF flakes.

@maintainer-s-little-helper

Please set the appropriate release note label.

@brb
Member Author

brb commented Apr 15, 2020

test-me-please

@coveralls

coveralls commented Apr 15, 2020

Coverage Status

Coverage increased (+0.02%) to 46.838% when pulling 56a2f84 on pr/brb/np-update-stale-conntrack into c7b9f3e on master.

@brb brb force-pushed the pr/brb/np-update-stale-conntrack branch from d18e85d to 48d9d5e Compare April 16, 2020 14:17
@brb brb changed the title WIP: Fix wrong rev-NAT xlation due to stale conntrack entry datapath: Fix wrong rev-NAT xlation due to stale conntrack entry Apr 16, 2020
@brb brb added the release-note/bug This PR fixes an issue in a previous release of Cilium. label Apr 16, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.7.3 Apr 16, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.6.9 Apr 16, 2020
@brb brb added sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. kind/bug This is a bug in the Cilium logic. area/kube-proxy-free labels Apr 16, 2020
@brb brb requested a review from borkmann April 16, 2020 14:18
@brb brb marked this pull request as ready for review April 16, 2020 14:18
@brb brb requested a review from a team April 16, 2020 14:18
@brb brb requested a review from a team as a code owner April 16, 2020 14:18
@brb
Member Author

brb commented Apr 16, 2020

test-me-please

@brb
Member Author

brb commented Apr 16, 2020

test-gke

@@ -541,13 +542,16 @@ static __always_inline int nodeport_lb6(struct __ctx_buff *ctx,

	case CT_ESTABLISHED:
	case CT_REPLY:
		if (ct_state.rev_nat_index != svc->rev_nat_index) {
			goto redo_all;
		}
Member

@borkmann borkmann Apr 16, 2020


Small nit: no extra braces here and below (kernel style) and wrap in unlikely().
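
(For illustration, a minimal sketch of how the hunk above might look with this nit applied - kernel style without the extra braces, and the rare stale-entry case wrapped in unlikely(); this is only a sketch of the suggestion, not the merged patch.)

	case CT_ESTABLISHED:
	case CT_REPLY:
		/* A stale entry from another svc is the uncommon case,
		 * hence the unlikely() branch-prediction hint. */
		if (unlikely(ct_state.rev_nat_index != svc->rev_nat_index))
			goto redo_all;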

Member

@borkmann borkmann Apr 16, 2020


Isn't this racy? E.g. what happens if we had an earlier entry in there with packets still in flight and replies expected to go out from the Pod to the client? They would then go out with the wrong svc tuple?

Member

@borkmann borkmann Apr 16, 2020


OTOH, we have the same logic in https://github.com/cilium/cilium/blob/master/bpf/bpf_lxc.c#L199. Given there can only be one such svc at the same time, it really must have been stale. Could we consolidate this check with the one from bpf_lxc into a small helper and add a comment on why it's okay to redo the full tuple? Also, is there any advantage to explicitly deleting the existing one first? Perhaps we should remove that from bpf_lxc.c so both cases would use the same logic.

Member Author

@brb brb Apr 17, 2020


Isn't this racy?

The stale conntrack entry problem can only happen when a packet is sent from the same IP addr and port. So, before sending a request to the other svc, the client must have waited long enough (TIME_WAIT) for the port to become available again. Therefore, I assume that such races are very unlikely to happen.

Could we consolidate such check with the one from bpf_lxc into a small helper and add a comment on why it's okay to redo the full tuple?

The check in bpf_lxc.c compares rev_nat_index from two ct_states, while the one in nodeport.h compares svc against ct_state. It's probably OK to change the former to check against svc instead of ct_state_new; then the consolidation would make sense.

Also, is there any advantage to explicitly delete the existing one first?

ct_delete{4,6}() also tries to remove related SNAT entries. Perhaps it's fine to keep the SNAT-related entries in the nodeport.h case, as they can be reused by snat_process() later on. In the case of bpf_lxc.c, do you see a case in which SNAT entries would be created?

Member


Not that I'm aware of, so it would make sense to have the same logic in both places, meaning we could likely just remove the ct_delete*().

Member Author


PTAL - I removed the ct_delete*()'s, but I didn't add a "ct_belongs_to_svc()" helper, as IMO it might mislead developers into thinking that passing any CT entry would test whether it belongs to the given svc. That assumption is wrong, as not all entries have rev_nat_index set (e.g. CT_INGRESS).
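
(For context, a hypothetical sketch of the ct_belongs_to_svc() helper that was discussed but deliberately not added; the signature and types here are illustrative assumptions, not actual Cilium code.)

static __always_inline bool
ct_belongs_to_svc(const struct ct_state *ct_state, __u16 svc_rev_nat_index)
{
	/* Only meaningful for CT entries that actually carry a
	 * rev_nat_index; e.g. CT_INGRESS entries do not set it, which
	 * is exactly why this helper was considered misleading. */
	return ct_state->rev_nat_index == svc_rev_nat_index;
}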

@brb
Member Author

brb commented Apr 17, 2020

GKE hit the Kafka panic flake and the "K8sDatapathConfig MonitorAggregation Checks that monitor aggregation restricts notifications" flake.

@brb
Member Author

brb commented Apr 17, 2020

test-gke

@brb brb force-pushed the pr/brb/np-update-stale-conntrack branch from 48d9d5e to 495460b Compare April 17, 2020 08:57
@brb
Member Author

brb commented Apr 17, 2020

test-me-please

1 similar comment
@brb
Member Author

brb commented Apr 17, 2020

test-me-please

@brb brb force-pushed the pr/brb/np-update-stale-conntrack branch from 495460b to ffe32a8 Compare April 17, 2020 12:01
@brb
Member Author

brb commented Apr 17, 2020

test-me-please

@brb
Member Author

brb commented Apr 20, 2020

test-me-please

@brb brb force-pushed the pr/brb/np-update-stale-conntrack branch from d4ee3ef to 86d58ce Compare April 20, 2020 14:21
@brb
Member Author

brb commented Apr 20, 2020

test-me-please

@brb brb force-pushed the pr/brb/np-update-stale-conntrack branch from 86d58ce to 56a2f84 Compare April 20, 2020 17:20
@brb
Member Author

brb commented Apr 20, 2020

test-me-please

brb added 3 commits April 20, 2020 19:21
The issue description: #10983

Signed-off-by: Martynas Pumputis <m@lambda.lt>

If a service endpoint is accessed via two different NodePort services
from the same source IP and port, a reply from the endpoint will be
wrongly rev-DNAT xlated.

E.g.:

0x6a60e2e6: Conntrack lookup 1/2: src=172.29.50.224:33632 dst=172.29.2.21:32532
0x6a60e2e6: Conntrack lookup 2/2: nexthdr=6 flags=4
0x6a60e2e6: CT verdict: New, revnat=0
0x6a60e2e6: Slave service lookup: slave=1, dport=32532
0x6a60e2e6: Conntrack create: proxy-port=0 revnat=8 src-identity=0 lb=0.0.0.0
0x6a60e2e6: Conntrack lookup 1/2: src=172.29.50.224:33632 dst=10.42.1.254:3306
0x6a60e2e6: Conntrack lookup 2/2: nexthdr=6 flags=1
0x6a60e2e6: CT entry found lifetime=501525, revnat=22
0x6a60e2e6: CT verdict: Established, revnat=22
0xa140b3a1: Conntrack lookup 1/2: src=10.42.1.254:3306 dst=172.29.50.224:33632
0xa140b3a1: Conntrack lookup 2/2: nexthdr=6 flags=0
0xa140b3a1: CT entry found lifetime=523408, revnat=22
0xa140b3a1: CT verdict: Reply, revnat=22
0xa140b3a1: Reverse NAT lookup, index=22
0xa140b3a1: Performing reverse NAT, address=192.168.58.0 port=3306

TCP OUT 172.29.2.21:32532 -> 172.29.50.224:33632 service expires=501869
RxPackets=0 RxBytes=28 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=0
TxBytes=0 TxFlagsSeen=0x02 LastTxReport=501808 Flags=0x0000 [ ] RevNAT=8
SourceSecurityID=0
TCP OUT 192.168.58.0:3306 -> 172.29.50.224:33632 service expires=523409
RxPackets=0 RxBytes=28 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=0
TxBytes=0 TxFlagsSeen=0x1f LastTxReport=501808 Flags=0x0012 [ TxClosing
SeenNonSyn ] RevNAT=22 SourceSecurityID=0
TCP OUT 172.29.50.224:33632 -> 10.42.1.254:3306 expires=523409
RxPackets=8 RxBytes=637 RxFlagsSeen=0x1b LastRxReport=501808
TxPackets=13 TxBytes=861 TxFlagsSeen=0x1f LastTxReport=501808
Flags=0x0032 [ TxClosing SeenNonSyn NodePort ] RevNAT=22
SourceSecurityID=2

First, the endpoint was accessed via 172.29.50.224:33632 ->
192.168.58.0:3306, and afterwards via 172.29.50.224:33632 ->
172.29.2.21:32532. In the second flow, the src IP and port of the
reply were wrongly xlated to 192.168.58.0:3306.

Fix this by recreating CT entries when stale (=of a different svc) entries
are detected.

Signed-off-by: Martynas Pumputis <m@lambda.lt>

Previously, when a stale CT entry (of non-CT_SERVICE type) was detected
(from a previous connection to a service with the same IP/port but
a different rev_nat_index), the entry was first removed and only
afterwards recreated.

The removal also included the removal of related SNAT entries. However,
in this case no SNAT entries had been created, so removing the CT +
SNAT entries before the recreation was redundant.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
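
(Putting the three commits together, a condensed sketch of the resulting control flow in the NodePort service-lookup path; this is simplified C pseudocode with arguments elided, not the verbatim patch.)

switch (ret) {
case CT_NEW:
redo_all:
	ct_state_new.rev_nat_index = svc->rev_nat_index;
	/* Create a fresh CT entry pointing at the selected service. */
	ret = ct_create6(/* map, tuple, ctx, CT_EGRESS, &ct_state_new, ... */);
	if (ret < 0)
		return ret;
	break;
case CT_ESTABLISHED:
case CT_REPLY:
	/* An entry left over from an earlier connection with the same
	 * src IP/port but a different service is stale: simply
	 * recreate it. No explicit ct_delete*() is needed, since no
	 * SNAT entries exist for it at this point. */
	if (unlikely(ct_state.rev_nat_index != svc->rev_nat_index))
		goto redo_all;
	break;
}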
@brb
Member Author

brb commented Apr 20, 2020

test-me-please

1 similar comment
@brb
Member Author

brb commented Apr 21, 2020

test-me-please

@borkmann borkmann merged commit c3c6954 into master Apr 21, 2020
1.8.0 automation moved this from In progress to Merged Apr 21, 2020
@borkmann borkmann deleted the pr/brb/np-update-stale-conntrack branch April 21, 2020 07:51
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.7 in 1.7.3 Apr 22, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.7 to Backport done to v1.7 in 1.7.3 Apr 29, 2020
@joestringer
Member

@brb do you consider this PR to be a major bugfix / affecting multiple users, justifying a backport to v1.6? It seems like there are non-trivial conflicts, and without your assistance it will not make its way into a v1.6 release.

https://docs.cilium.io/en/latest/contributing/release/backports/#backport-criteria-for-x-y-1-z-and-x-y-2-z

@brb
Member Author

brb commented Jun 4, 2020

@joestringer I do not expect that many users are accessing the same endpoint via two different services from outside. So, it's OK not to include it in v1.6. If we hear any reports, then I will backport it manually.

@maintainer-s-little-helper maintainer-s-little-helper bot removed this from Needs backport from master in 1.6.9 Jun 4, 2020
Labels
kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium. sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
Projects
1.7.3: Backport done to v1.7
1.8.0: Merged
Development

Successfully merging this pull request may close these issues.

Stale conntrack entry might lead to invalid rev-DNAT xlation for reply from NodePort service
6 participants