datapath: Fix wrong rev-NAT xlation due to stale conntrack entry #10984
Conversation
Please set the appropriate release note label.
test-me-please
Force-pushed from d18e85d to 48d9d5e
test-me-please
test-gke
bpf/lib/nodeport.h (outdated)
@@ -541,13 +542,16 @@ static __always_inline int nodeport_lb6(struct __ctx_buff *ctx,
	case CT_ESTABLISHED:
	case CT_REPLY:
		if (ct_state.rev_nat_index != svc->rev_nat_index) {
			goto redo_all;
		}
Small nit: no extra braces here and below (kernel style) and wrap in unlikely().
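Applied to the hunk above, the suggested style would look roughly like this (a sketch only; surrounding code elided):

```c
	case CT_ESTABLISHED:
	case CT_REPLY:
		/* Kernel style: no braces around a single statement, and the
		 * mismatch is the cold path, hence the unlikely() hint. */
		if (unlikely(ct_state.rev_nat_index != svc->rev_nat_index))
			goto redo_all;
```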
Isn't this racy? E.g. what happens if we had an earlier entry in there where packets are still in flight and replies are expected to go out from the Pod to the client? They would then go out with the wrong svc tuple?
Otoh, we have the same logic in https://github.com/cilium/cilium/blob/master/bpf/bpf_lxc.c#L199. Given there can only be one such svc at the same time, it really must have been stale. Could we consolidate such a check with the one from bpf_lxc into a small helper and add a comment on why it's okay to redo the full tuple? Also, is there any advantage in explicitly deleting the existing one first? Perhaps we should remove that from bpf_lxc.c so both cases would share the same logic.
> Isn't this racy?
The stale conntrack entry problem can only happen when a packet is sent from the same IP addr and port. This means that the client, before sending a request to the other svc, waited long enough (TIME_WAIT) for the port to become available again. Therefore, I assume that such races are very unlikely to happen.
> Could we consolidate such a check with the one from bpf_lxc into a small helper and add a comment on why it's okay to redo the full tuple?
The check in bpf_lxc.c compares the rev_nat_index of two ct_states, while the one in nodeport.h compares svc against ct_state. It's probably OK to change the former to check against svc instead of ct_state_new; then the consolidation would make sense, e.g. along the lines of the sketch below.
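For illustration, a consolidated check might have looked like this (hypothetical: the helper name and placement are assumptions, and per the follow-up comment no such helper was ultimately added):

```c
/* Hypothetical helper, never merged: would return true if the CT entry
 * was created for the service with the given rev-NAT index. */
static __always_inline bool
ct_entry_matches_revnat(const struct ct_state *ct_state,
			__u16 svc_rev_nat_index)
{
	return ct_state->rev_nat_index == svc_rev_nat_index;
}
```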
> Also, is there any advantage in explicitly deleting the existing one first?
ct_delete{4,6}() also tries to remove related SNAT entries. Perhaps it's fine to keep the SNAT-related entries in the case of nodeport.h, as they can be reused by snat_process() later on. In the case of bpf_lxc.c, do you see a case in which SNAT entries would be created?
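For context, a paraphrased sketch of the delete path under discussion (the actual ct_delete4() in bpf/lib/conntrack.h may differ; snat_v4_delete_tuples() is named here on the assumption that it is the SNAT-cleanup entry point):

```c
/* Paraphrased sketch: delete the CT entry itself and, as noted above,
 * also drop any related SNAT entries (ctx is used for debug output in
 * the real implementation). */
static __always_inline int ct_delete4(const void *map,
				      struct ipv4_ct_tuple *tuple,
				      struct __ctx_buff *ctx)
{
	int err;

	err = map_delete_elem(map, tuple);	/* the CT entry */
	if (err < 0)
		return err;
	snat_v4_delete_tuples(tuple);		/* related SNAT state */
	return 0;
}
```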
Not that I'm aware of, so it would make sense to make both paths follow the same logic, meaning we could likely just remove the ct_delete*().
PTAL - I removed the ct_delete*()'s, but I didn't add a "ct_belongs_to_svc()" helper, as IMO it might confuse developers by suggesting that passing any CT entry would test whether it belongs to the given svc. That is a wrong assumption, as not all entries have rev_nat_index set (e.g. CT_INGRESS).
GKE hit the Kafka panic flake and
test-gke
Force-pushed from 48d9d5e to 495460b
test-me-please
Force-pushed from 495460b to ffe32a8
test-me-please
test-me-please
Force-pushed from d4ee3ef to 86d58ce
test-me-please
Force-pushed from 86d58ce to 56a2f84
test-me-please
The issue description: #10983

Signed-off-by: Martynas Pumputis <m@lambda.lt>
If a service endpoint is accessed via two different NodePort services from the same source IP and port, a reply from the endpoint will be wrongly rev-DNAT xlated. E.g.:

0x6a60e2e6: Conntrack lookup 1/2: src=172.29.50.224:33632 dst=172.29.2.21:32532
0x6a60e2e6: Conntrack lookup 2/2: nexthdr=6 flags=4
0x6a60e2e6: CT verdict: New, revnat=0
0x6a60e2e6: Slave service lookup: slave=1, dport=32532
0x6a60e2e6: Conntrack create: proxy-port=0 revnat=8 src-identity=0 lb=0.0.0.0
0x6a60e2e6: Conntrack lookup 1/2: src=172.29.50.224:33632 dst=10.42.1.254:3306
0x6a60e2e6: Conntrack lookup 2/2: nexthdr=6 flags=1
0x6a60e2e6: CT entry found lifetime=501525, revnat=22
0x6a60e2e6: CT verdict: Established, revnat=22
0xa140b3a1: Conntrack lookup 1/2: src=10.42.1.254:3306 dst=172.29.50.224:33632
0xa140b3a1: Conntrack lookup 2/2: nexthdr=6 flags=0
0xa140b3a1: CT entry found lifetime=523408, revnat=22
0xa140b3a1: CT verdict: Reply, revnat=22
0xa140b3a1: Reverse NAT lookup, index=22
0xa140b3a1: Performing reverse NAT, address=192.168.58.0 port=3306

TCP OUT 172.29.2.21:32532 -> 172.29.50.224:33632 service expires=501869 RxPackets=0 RxBytes=28 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=0 TxBytes=0 TxFlagsSeen=0x02 LastTxReport=501808 Flags=0x0000 [ ] RevNAT=8 SourceSecurityID=0
TCP OUT 192.168.58.0:3306 -> 172.29.50.224:33632 service expires=523409 RxPackets=0 RxBytes=28 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=0 TxBytes=0 TxFlagsSeen=0x1f LastTxReport=501808 Flags=0x0012 [ TxClosing SeenNonSyn ] RevNAT=22 SourceSecurityID=0
TCP OUT 172.29.50.224:33632 -> 10.42.1.254:3306 expires=523409 RxPackets=8 RxBytes=637 RxFlagsSeen=0x1b LastRxReport=501808 TxPackets=13 TxBytes=861 TxFlagsSeen=0x1f LastTxReport=501808 Flags=0x0032 [ TxClosing SeenNonSyn NodePort ] RevNAT=22 SourceSecurityID=2

First, the endpoint was accessed via 172.29.50.224:33632 -> 192.168.58.0:3306, and afterwards via 172.29.50.224:33632 -> 172.29.2.21:32532. In the second flow, the src IP and port of the reply were wrongly xlated to 192.168.58.0:3306.

Fix this by recreating CT entries when stale (= of a different svc) entries are detected.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
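Condensed, the resulting logic in nodeport_lb6() takes roughly this shape (abridged sketch; the ct_create6() argument list and the error handling are simplified assumptions):

```c
	switch (ret) {
	case CT_NEW:
redo_all:
		ct_state_new.rev_nat_index = svc->rev_nat_index;
		ret = ct_create6(get_ct_map6(&tuple), &tuple, ctx,
				 CT_EGRESS, &ct_state_new /* , ... */);
		if (IS_ERR(ret))
			return ret;
		break;
	case CT_ESTABLISHED:
	case CT_REPLY:
		/* An entry left over from an earlier connection with the same
		 * saddr:sport but a different service carries a stale
		 * rev_nat_index; redo it so replies get rev-NATed with the
		 * correct service tuple. */
		if (ct_state.rev_nat_index != svc->rev_nat_index)
			goto redo_all;
		break;
	}
```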
Previously, when a stale CT entry (of non-CT_SERVICE type) was detected (from a previous connection to a service with the same IP/port but a different rev_nat_index), the entry was removed and only afterwards recreated. The removal included the removal of related SNAT entries. However, in this case no SNAT entries had been created, so the CT + SNAT removal before the recreation was redundant.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
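The removed step, roughly (paraphrased from the old bpf_lxc.c path; identifiers and the recreate label are assumptions):

```c
	/* Old logic (paraphrased): on a rev_nat_index mismatch, explicitly
	 * delete the stale entry, which also purged related SNAT entries,
	 * before recreating it: */
	if (ct_state.rev_nat_index != ct_state_new.rev_nat_index) {
		ct_delete4(get_ct_map4(&tuple), &tuple, ctx);
		goto ct_recreate;
	}

	/* New logic: no explicit delete; ct_create4() on the recreate path
	 * simply overwrites the stale entry, and no SNAT entries exist for
	 * this case anyway: */
	if (ct_state.rev_nat_index != ct_state_new.rev_nat_index)
		goto ct_recreate;
```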
test-me-please
@brb do you consider this PR to be a major bugfix / affecting multiple users, to justify a backport to v1.6? It seems like there are non-trivial conflicts, and without your assistance it will not make its way into a v1.6 release.
@joestringer I do not expect that many users are accessing the same endpoint via two different services from outside. So it's OK not to include it in v1.6. If we hear any reports, then I will backport it manually.
See commit msgs.
In addition to the CI tests, I've tested the fix on a user's cluster which was previously hit by #10983 - the fix resolved the problem.
Fix #10983
Also, this might resolve some of the NodePort BPF flakes.