Remove conntrack lookups #67

Merged
merged 2 commits into from
Jul 22, 2019

@lmb (Contributor) commented Jun 24, 2019

When a SYN packet is received, Linux creates a so-called request socket.
This is smaller than a full socket and helps conserve resources. Request
sockets (aka reqsk) always have sk_state equal to TCP_NEW_SYN_RECV, and
are part of the inet hash tables. This means they are returned by a call
to inet_lookup_established and friends.

This has been the case since v4.4; it was introduced by commit torvalds/linux@079096f

Fixes #50
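To illustrate why the lookup alone is enough on v4.4+, here is a simplified userspace model. It is not glb-redirect's code: all names are hypothetical, and a linear scan stands in for the kernel's hash table. It shows that a request socket in TCP_NEW_SYN_RECV is found by the established-table lookup, while listeners (kept in a separate table) and closed (unhashed) sockets are not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Only the states this model needs; names mirror the kernel's. */
enum tcp_state { TCP_ESTABLISHED, TCP_CLOSE, TCP_LISTEN, TCP_NEW_SYN_RECV };

struct flow {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
};

struct model_sock {
    struct flow flow;
    enum tcp_state state;
};

static bool flow_eq(const struct flow *a, const struct flow *b)
{
    return a->saddr == b->saddr && a->daddr == b->daddr &&
           a->sport == b->sport && a->dport == b->dport;
}

/* Hypothetical stand-in for inet_lookup_established(). Listeners live in a
 * separate table and closed sockets are unhashed, so neither can match.
 * Since v4.4, request sockets (TCP_NEW_SYN_RECV) are in the established
 * table and therefore do match. */
static const struct model_sock *
model_lookup_established(const struct model_sock *socks, size_t n,
                         const struct flow *f)
{
    assert(socks != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (socks[i].state == TCP_LISTEN || socks[i].state == TCP_CLOSE)
            continue;
        if (flow_eq(&socks[i].flow, f))
            return &socks[i];
    }
    return NULL;
}

/* A segment belongs to this host if the lookup finds any socket, full or
 * request; no conntrack state is consulted. */
static bool belongs_locally(const struct model_sock *socks, size_t n,
                            const struct flow *f)
{
    return model_lookup_established(socks, n, f) != NULL;
}
```

Before v4.4, the TCP_NEW_SYN_RECV entry would not be in the established table at all, so the ACK completing the handshake would miss. That appears to be the gap the conntrack lookups were covering.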

@theojulienne (Contributor)

I'm 👍 on deprecating support for pre-4.4 kernels, and generally getting to a place where conntrack isn't a requirement would be great. Did you end up tracking down the few increments you saw in #50 (comment)?

"I said that the conntrack counters are 0 most of the time: there are occasional blips where 0.001% of traffic is accepted due to conntrack. I need to track down what is happening here, but I'm convinced that we don't need conntrack."

Mostly a curiosity, though: it seems pretty clear that the TCP_NEW_SYN_RECV state makes this work the same way now 🎉

@lmb (Contributor, Author) commented Jun 27, 2019 via email

@theojulienne (Contributor)

I found the same conntrack counters in our setup, so I used a bcc+scapy script to sniff the packets that were being picked up by conntrack:

0       0       swapper/1       process    <TCP  sport=53457 dport=https seq=2076927767 ack=0 dataofs=11L reserved=0L flags=S window=65535 chksum=0x6d2e urgptr=0 options=[('MSS', 1452), ('NOP', None), ('WScale', 5), ('NOP', None), ('NOP', None), ('Timestamp', (2731582313, 0)), ('SAckOK', ''), ('EOL', None)] |>
0       0       swapper/1       process    <TCP  sport=53457 dport=https seq=2076927768 ack=686108104 dataofs=8L reserved=0L flags=A window=4138 chksum=0x63b1 urgptr=0 options=[('NOP', None), ('NOP', None), ('Timestamp', (2731582547, 66837879))] |>
0       0       swapper/1       process    <TCP  sport=53457 dport=https seq=2076927768 ack=686108104 dataofs=8L reserved=0L flags=PA window=4138 chksum=0x1d4a urgptr=0 options=[('NOP', None), ('NOP', None), ('Timestamp', (2731582552, 66837879))] |<Raw  ... |>>
0       0       swapper/1       process    <TCP  sport=53457 dport=https seq=2076928285 ack=686110952 dataofs=8L reserved=0L flags=A window=4049 chksum=0x55b9 urgptr=0 options=[('NOP', None), ('NOP', None), ('Timestamp', (2731582787, 66837939))] |>
0       0       swapper/1       process    <TCP  sport=53457 dport=https seq=2076928285 ack=686111419 dataofs=8L reserved=0L flags=A window=4034 chksum=0x53f5 urgptr=0 options=[('NOP', None), ('NOP', None), ('Timestamp', (2731582787, 66837939))] |>
0       0       swapper/1       process    <TCP  sport=53457 dport=https seq=2076928285 ack=686111419 dataofs=8L reserved=0L flags=PA window=4096 chksum=0x21ec urgptr=0 options=[('NOP', None), ('NOP', None), ('Timestamp', (2731582794, 66837939))] |<Raw  ... |>>
0       0       swapper/1       process    <TCP  sport=53457 dport=https seq=2076928411 ack=686111470 dataofs=8L reserved=0L flags=A window=4094 chksum=0x51de urgptr=0 options=[('NOP', None), ('NOP', None), ('Timestamp', (2731583024, 66838000))] |>
0       0       swapper/1       process    <TCP  sport=53457 dport=https seq=2076928411 ack=686111470 dataofs=8L reserved=0L flags=PA window=4096 chksum=0x2812 urgptr=0 options=[('NOP', None), ('NOP', None), ('Timestamp', (2731583024, 66838000))] |<Raw  ... |>>
0       0       swapper/1       process    <TCP  sport=53457 dport=https seq=2076928523 ack=686111548 dataofs=8L reserved=0L flags=A window=4093 chksum=0x4ffb urgptr=0 options=[('NOP', None), ('NOP', None), ('Timestamp', (2731583259, 66838059))] |>
0       0       swapper/1       process    <TCP  sport=53457 dport=https seq=2076928523 ack=686111579 dataofs=8L reserved=0L flags=A window=4092 chksum=0x4fdd urgptr=0 options=[('NOP', None), ('NOP', None), ('Timestamp', (2731583259, 66838059))] |>
0       0       swapper/1       process    <TCP  sport=53457 dport=https seq=2076928523 ack=686111579 dataofs=8L reserved=0L flags=PA window=4096 chksum=0xa813 urgptr=0 options=[('NOP', None), ('NOP', None), ('Timestamp', (2731583259, 66838059))] |<Raw  ... |>>
0       0       swapper/1       process    <TCP  sport=53457 dport=https seq=2076928554 ack=686111580 dataofs=8L reserved=0L flags=A window=4096 chksum=0x4fb9 urgptr=0 options=[('NOP', None), ('NOP', None), ('Timestamp', (2731583259, 66838059))] |>
19558   19558   python          process    <TCP  sport=53457 dport=https seq=2076928554 ack=686111580 dataofs=8L reserved=0L flags=FA window=4096 chksum=0x4fb7 urgptr=0 options=[('NOP', None), ('NOP', None), ('Timestamp', (2731583260, 66838059))] |>
19558   19558   python          conntrack  <TCP  sport=53457 dport=https seq=2076928554 ack=686111580 dataofs=8L reserved=0L flags=FA window=4096 chksum=0x4fb7 urgptr=0 options=[('NOP', None), ('NOP', None), ('Timestamp', (2731583260, 66838059))] |>

The corresponding full tcpdump:

19:07:24.939874 IP CLIENT.53457 > lb-192-30-253-80-iad.github.com.https: Flags [S], seq 2076927767, win 65535, options [mss 1452,nop,wscale 5,nop,nop,TS val 2731582313 ecr 0,sackOK,eol], length 0
19:07:24.939922 IP lb-192-30-253-80-iad.github.com.https > CLIENT.53457: Flags [S.], seq 686108103, ack 2076927768, win 28480, options [mss 1436,sackOK,TS val 66837879 ecr 2731582313,nop,wscale 10], length 0
19:07:25.175654 IP CLIENT.53457 > lb-192-30-253-80-iad.github.com.https: Flags [.], ack 1, win 4138, options [nop,nop,TS val 2731582547 ecr 66837879], length 0
19:07:25.180416 IP CLIENT.53457 > lb-192-30-253-80-iad.github.com.https: Flags [P.], seq 1:518, ack 1, win 4138, options [nop,nop,TS val 2731582552 ecr 66837879], length 517
19:07:25.182669 IP lb-192-30-253-80-iad.github.com.https > CLIENT.53457: Flags [.], seq 1:2849, ack 518, win 29, options [nop,nop,TS val 66837939 ecr 2731582552], length 2848
19:07:25.182699 IP lb-192-30-253-80-iad.github.com.https > CLIENT.53457: Flags [P.], seq 2849:3316, ack 518, win 29, options [nop,nop,TS val 66837939 ecr 2731582552], length 467
19:07:25.414646 IP CLIENT.53457 > lb-192-30-253-80-iad.github.com.https: Flags [.], ack 2849, win 4049, options [nop,nop,TS val 2731582787 ecr 66837939], length 0
19:07:25.414669 IP CLIENT.53457 > lb-192-30-253-80-iad.github.com.https: Flags [.], ack 3316, win 4034, options [nop,nop,TS val 2731582787 ecr 66837939], length 0
19:07:25.425276 IP CLIENT.53457 > lb-192-30-253-80-iad.github.com.https: Flags [P.], seq 518:644, ack 3316, win 4096, options [nop,nop,TS val 2731582794 ecr 66837939], length 126
19:07:25.425809 IP lb-192-30-253-80-iad.github.com.https > CLIENT.53457: Flags [P.], seq 3316:3367, ack 644, win 29, options [nop,nop,TS val 66838000 ecr 2731582794], length 51
19:07:25.655889 IP CLIENT.53457 > lb-192-30-253-80-iad.github.com.https: Flags [.], ack 3367, win 4094, options [nop,nop,TS val 2731583024 ecr 66838000], length 0
19:07:25.660846 IP CLIENT.53457 > lb-192-30-253-80-iad.github.com.https: Flags [P.], seq 644:756, ack 3367, win 4096, options [nop,nop,TS val 2731583024 ecr 66838000], length 112
19:07:25.661107 IP lb-192-30-253-80-iad.github.com.https > CLIENT.53457: Flags [P.], seq 3367:3445, ack 756, win 29, options [nop,nop,TS val 66838059 ecr 2731583024], length 78
19:07:25.661142 IP lb-192-30-253-80-iad.github.com.https > CLIENT.53457: Flags [P.], seq 3445:3476, ack 756, win 29, options [nop,nop,TS val 66838059 ecr 2731583024], length 31
19:07:25.661168 IP lb-192-30-253-80-iad.github.com.https > CLIENT.53457: Flags [F.], seq 3476, ack 756, win 29, options [nop,nop,TS val 66838059 ecr 2731583024], length 0
19:07:25.894238 IP CLIENT.53457 > lb-192-30-253-80-iad.github.com.https: Flags [.], ack 3445, win 4093, options [nop,nop,TS val 2731583259 ecr 66838059], length 0
19:07:25.894261 IP CLIENT.53457 > lb-192-30-253-80-iad.github.com.https: Flags [.], ack 3476, win 4092, options [nop,nop,TS val 2731583259 ecr 66838059], length 0
19:07:25.894273 IP CLIENT.53457 > lb-192-30-253-80-iad.github.com.https: Flags [P.], seq 756:787, ack 3476, win 4096, options [nop,nop,TS val 2731583259 ecr 66838059], length 31
19:07:25.894305 IP lb-192-30-253-80-iad.github.com.https > CLIENT.53457: Flags [R], seq 686111579, win 0, length 0
19:07:25.894314 IP CLIENT.53457 > lb-192-30-253-80-iad.github.com.https: Flags [.], ack 3477, win 4096, options [nop,nop,TS val 2731583259 ecr 66838059], length 0
19:07:25.894339 IP lb-192-30-253-80-iad.github.com.https > CLIENT.53457: Flags [R], seq 686111580, win 0, length 0
19:07:25.898653 IP CLIENT.53457 > lb-192-30-253-80-iad.github.com.https: Flags [F.], seq 787, ack 3477, win 4096, options [nop,nop,TS val 2731583260 ecr 66838059], length 0
19:07:25.898689 IP lb-192-30-253-80-iad.github.com.https > CLIENT.53457: Flags [R], seq 686111580, win 0, length 0

It seems that at the point of the client's flags=FA packet, conntrack is what decides that processing should continue locally, even though this was a previously valid connection. I wonder whether this depends on connection latency, and on whether the client hard-closes the connection after receiving data before the server gets into this state. I think what might be happening is that after sending an RST the kernel treats the connection as fully destroyed, so if we remove this, we might push the packet down the chain. That may be OK, because the next machine would also send an RST, but I wonder if there is a cleaner way 🤔

@lmb (Contributor, Author) commented Jul 5, 2019

Very interesting! The RST at seq 686111579 seems to be the point where the connection is terminated. According to Wikipedia, Linux will send an RST instead of a FIN if there is unread data in the receive buffer.
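This behavior can be observed from userspace. The sketch below assumes a Linux host where loopback TCP sockets are permitted; the 100 ms sleeps are an arbitrary safety margin, not a guarantee, and the function name is hypothetical. A server closes an accepted socket while the client's bytes are still unread, and the function reports the errno the client's recv() returns afterwards. On Linux this is expected to be ECONNRESET, because the peer answers with RST rather than FIN.

```c
#define _POSIX_C_SOURCE 200809L
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

static void sleep_ms(long ms)
{
    struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);
}

/* Returns the client's recv() errno after the server closes a socket that
 * still holds unread data, 0 if recv() did not fail, or -1 on setup error. */
static int unread_close_client_errno(void)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int lfd, cfd, sfd, err = -1;
    char buf[16];

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(0x7f000001UL); /* 127.0.0.1 */
    addr.sin_port = 0;                          /* kernel picks a free port */

    lfd = socket(AF_INET, SOCK_STREAM, 0);
    if (lfd < 0)
        return -1;
    if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) != 0 ||
        listen(lfd, 1) != 0 ||
        getsockname(lfd, (struct sockaddr *)&addr, &len) != 0)
        goto out_listen;
    assert(len == sizeof(addr));

    cfd = socket(AF_INET, SOCK_STREAM, 0);
    if (cfd < 0)
        goto out_listen;
    if (connect(cfd, (struct sockaddr *)&addr, sizeof(addr)) != 0)
        goto out_client;
    sfd = accept(lfd, NULL, NULL);
    if (sfd < 0)
        goto out_client;

    /* The client sends bytes that the server application never reads. */
    if (send(cfd, "hello", 5, 0) != 5) {
        close(sfd);
        goto out_client;
    }
    sleep_ms(100); /* let loopback queue the data at the server */

    close(sfd);    /* unread data: the kernel sends RST instead of FIN */
    sleep_ms(100); /* let the RST reach the client */

    err = (recv(cfd, buf, sizeof(buf), 0) < 0) ? errno : 0;

out_client:
    close(cfd);
out_listen:
    close(lfd);
    return err;
}
```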

From some digging, this happens in tcp_close, more specifically here: tcp.c:2365. The call to tcp_set_state then unhashes sk. As the comment at the end of that function points out, a socket in TCP_CLOSE is no longer reachable through the established hash table.
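The unhashing step can be modeled in a few lines. This is a hypothetical userspace model, not kernel code: model_set_state() mirrors only the part of tcp_set_state() that unhashes a socket moving to TCP_CLOSE (before the state itself changes), and model_rcv() stands in for the socket lookup that tcp_v4_rcv performs for a later segment on the same 4-tuple.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum tcp_state { TCP_ESTABLISHED, TCP_FIN_WAIT1, TCP_CLOSE };

/* Hypothetical model of one socket and its presence in the established hash. */
struct model_sock {
    enum tcp_state state;
    bool hashed;
};

/* Moving to TCP_CLOSE unhashes the socket first, then updates the state,
 * so a closed socket never sits in the hash table. */
static void model_set_state(struct model_sock *sk, enum tcp_state state)
{
    assert(sk != NULL);
    if (state == TCP_CLOSE)
        sk->hashed = false;
    sk->state = state;
}

/* What a later inbound segment on the same 4-tuple runs into. */
enum verdict { DELIVER_TO_SOCKET, NO_SOCKET_SEND_RST };

static enum verdict model_rcv(const struct model_sock *sk)
{
    return (sk != NULL && sk->hashed) ? DELIVER_TO_SOCKET : NO_SOCKET_SEND_RST;
}
```

Once the socket reaches TCP_CLOSE, the lookup misses. On the local host the segment is then answered with an RST. Without conntrack, glb-redirect would presumably no longer recognize the packet as local, which is the "push it down the chain" case raised earlier.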

At this point, sk no longer exists as far as tcp_v4_rcv is concerned, so further inbound packets from the client generate an RST. This lines up with the RSTs carrying seq 686111580.
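The sequence numbers in the trace can be cross-checked. Unless run with -S, tcpdump prints seq/ack relative to each side's ISN, and RFC 793 specifies that an RST sent in response to a segment carrying an ACK takes its sequence number from that ACK field. The server's ISN is 686108103 (from its SYN-ACK), so the client's ack 3476 and ack 3477 map to 686111579 and 686111580, matching the three RSTs. The helper names below are hypothetical; the no-ACK case of RFC 793 (RST with seq 0) is deliberately not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Convert a tcpdump relative seq/ack to an absolute value. uint32_t
 * arithmetic wraps modulo 2^32, just like the TCP sequence space. */
static uint32_t abs_seq(uint32_t isn, uint32_t relative)
{
    return isn + relative;
}

/* RFC 793 reset generation: when the triggering segment has an ACK,
 * the RST's sequence number is that segment's ACK value. */
static uint32_t rst_seq_for(bool incoming_has_ack, uint32_t incoming_ack)
{
    assert(incoming_has_ack); /* the no-ACK case is outside this sketch */
    return incoming_ack;
}
```

For example, abs_seq(686108103u, 3477u) yields 686111580, the seq of the RSTs sent after the socket was unhashed.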

If I understand correctly, there is not much to do here besides documenting it in the code.

@lmb (Contributor, Author) commented Jul 17, 2019

Updated the branch with a comment based on our investigation.

lmb added 2 commits July 17, 2019 11:37
v4.4 changed the way TCP connection requests are handled, which
means that conntrack lookups are not required anymore.

Make sure we are building against at least 4.4, and remove obsolete
compile-time guards.
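"Building against at least 4.4" can be expressed as one up-front compile-time check instead of scattered version #ifs. This is an assumption about the shape of such a guard, not the PR's actual diff. In a kernel module one would compare LINUX_VERSION_CODE against KERNEL_VERSION(4, 4, 0) from <linux/version.h>; here both are replaced by hypothetical MODEL_* stand-ins (using the classic KERNEL_VERSION packing) so the sketch compiles in userspace.

```c
#include <assert.h>
#include <stdbool.h>

/* Classic KERNEL_VERSION(a, b, c) packing: major, minor, sublevel. */
#define MODEL_KERNEL_VERSION(a, b, c) (((a) << 16) + ((b) << 8) + (c))

/* Hypothetical build target; kernel code would use LINUX_VERSION_CODE. */
#ifndef MODEL_TARGET_VERSION_CODE
#define MODEL_TARGET_VERSION_CODE MODEL_KERNEL_VERSION(4, 19, 0)
#endif

/* Before v4.4, request sockets were not in the established hash table,
 * so the lookup-only logic would be wrong; refuse to build at all. */
#if MODEL_TARGET_VERSION_CODE < MODEL_KERNEL_VERSION(4, 4, 0)
#error "requires Linux >= 4.4 (TCP_NEW_SYN_RECV request sockets in ehash)"
#endif

/* Runtime twin of the guard, handy for checking the packing arithmetic. */
static bool version_supported(unsigned a, unsigned b, unsigned c)
{
    assert(b < 256u && c < 256u); /* each field is packed into 8 bits */
    return MODEL_KERNEL_VERSION(a, b, c) >= MODEL_KERNEL_VERSION(4u, 4u, 0u);
}
```

A single hard floor like this is what lets the per-version fallback paths (here, the conntrack lookups) be deleted rather than kept behind #ifs.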
@lmb (Contributor, Author) commented Jul 22, 2019

Ping @theojulienne

@theojulienne (Contributor) left a comment

👍 Thanks for adding the updated comment; I think this makes sense, and it's worth it to not rely on or require conntrack.


Successfully merging this pull request may close these issues.

glb-redirect: conntrack lookups might not be needed
2 participants