
New details #1

Closed
finestructure opened this issue Nov 26, 2021 · 38 comments

Comments

@finestructure
Owner

finestructure commented Nov 26, 2021

Interesting new details regarding the lockup error.

I wanted to understand if the problem is in Vapor, Fluent, or perhaps NIO so I re-implemented the little example with Hummingbird and its Fluent module: https://github.com/finestructure/hummingbird-lockup-repro

I ported the example with the same API and ran make post, which sends a post request with Rester. I got a crash in the server:

Swift/ContiguousArrayBuffer.swift:580: Fatal error: Index out of range
make: *** [run] Trace/BPT trap: 5

The index out of range is happening in the second precondition on line 77 in HBParser.swift:

    init(_ parser: HBParser, range: Range<Int>) {
        self.buffer = parser.buffer
        self.index = range.startIndex
        self.range = range

        precondition(range.startIndex >= 0 && range.endIndex <= self.buffer.endIndex)
        precondition(self.buffer[range.startIndex] & 0xC0 != 0x80) // check we arent in the middle of a UTF8 character
    }
(lldb) po range
▿ Range(7..<7)
  - lowerBound : 7
  - upperBound : 7

(lldb) po buffer
▿ 7 elements
  - 0 : 47
  - 1 : 116
  - 2 : 111
  - 3 : 100
  - 4 : 111
  - 5 : 115
  - 6 : 63

That buffer decodes to /todos?. I'm not entirely sure where the ? is coming from (it's not in the initial request), but let's leave that aside for now.

Because I couldn't believe Hummingbird would crash on such a simple example I tried to run a post request with the shipping example todos-fluent via Rester and ... it crashed in the same place!

However, the todos-fluent example comes with a Paw file and running the post request from there ... works!

I inspected the difference in the requests between Paw and Rester and started eliminating some of the minor differences but that made no change.

Instead of trying to debug this I swapped Rester out of the repro steps in favour of curl. Now the Hummingbird example worked, and the lock-up was not reproducible. Apart from the weird crash when using Rester, Hummingbird performed just fine, processed all posts, gets, and even dumps in a loop.

On a hunch, I went back to the Vapor repro example and swapped out Rester for curl and ... the lockup does not reproduce anymore. Switching back to Rester does lock it up.

So it appears that the way Rester is making requests

  • crashes Hummingbird
  • causes Vapor to lock up

Now Rester shouldn't be doing anything esoteric: it's a Swift tool triggering web requests using URLSession.dataTask. Did I (having written Rester) somehow manage to create a Swift server nuke?
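
For illustration, here's roughly the kind of request Rester issues: a plain URLSession data task POSTing to the local repro server. This is a sketch, not Rester's actual code; the port and the JSON body are assumptions.

import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking // URLSession lives here on Linux
#endif

// Sketch of a Rester-style request (hypothetical body and port, not Rester's code).
var request = URLRequest(url: URL(string: "http://127.0.0.1:8080/todos")!)
request.httpMethod = "POST"
request.setValue("application/json", forHTTPHeaderField: "Content-Type")
request.httpBody = #"{"title": "Write more tests"}"#.data(using: .utf8)

let task = URLSession.shared.dataTask(with: request) { data, response, error in
    // Rester would decode and validate the response here.
    if let error = error {
        print("request failed:", error)
    } else if let http = response as? HTTPURLResponse {
        print("status:", http.statusCode, "bytes:", data?.count ?? 0)
    }
}
task.resume()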

But even if the problem lies with my implementation in Rester, something similar must be happening in a Vapor server, because remember: we can bring the Swift Package Index server down when running web requests via Safari and dumping the Postgres database via pg_dump.

This example using Rester is purely a local way to reproduce the error.

It's unclear to me what exactly is going on but it certainly looks like there's a common root problem that affects both Vapor and Hummingbird - but in different ways.

Or maybe I'm just holding it very wrong, which is entirely possible 🤣

FYI @fabianfett @adam-fowler

@finestructure
Owner Author

Sorry @adam-fowler, I tagged you into this issue without a lot of context but I suspect you might be interested in the findings. I had a chat last week with Fabian about the original issue I outline at the very end above. It's probably easiest if I give you more details via Discord/Slack somewhere if you're interested - just let me know!

@adam-fowler

Hi @finestructure, I'm just looking into it. Not sure what the original Vapor issue is, but a crash in Hummingbird isn't ideal.

@finestructure
Owner Author

Ah great, let me write up some repro details for the crash in the other repo then.

@adam-fowler

At a guess, the issue is that the URL you are sending is not proper UTF8. My parser shouldn't be so fussy though; it should throw an error instead.

@finestructure
Owner Author

I've updated the Readme over in the other repro project: https://github.com/finestructure/hummingbird-lockup-repro/blob/main/README.md

Thanks a lot for taking a look!

@adam-fowler

Fix on its way. Stupid bug where I was checking whether a byte was a valid starting byte for a UTF8 character, except the byte happened to be at the end index of the string, so accessing it caused a crash.
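
For context, a minimal sketch of the kind of bounds check that avoids the trap; isValidSubstringStart is a hypothetical helper for illustration, not the actual 0.14.4 patch:

// Hypothetical sketch, not the Hummingbird fix: only dereference the start byte
// when the range is non-empty, i.e. startIndex is a valid subscript rather than
// the buffer's end index.
func isValidSubstringStart(_ buffer: [UInt8], _ range: Range<Int>) -> Bool {
    guard range.startIndex >= 0, range.endIndex <= buffer.endIndex else { return false }
    // An empty range (e.g. 7..<7 on a 7-byte buffer) has no byte to inspect.
    guard range.startIndex < range.endIndex else { return true }
    // 0b10xxxxxx marks a UTF8 continuation byte; starting there would split a character.
    return buffer[range.startIndex] & 0xC0 != 0x80
}

let buffer: [UInt8] = Array("/todos?".utf8)
print(isValidSubstringStart(buffer, 7..<7)) // true: the empty range from the repro no longer traps
print(isValidSubstringStart(buffer, 0..<7)) // true: '/' is a valid start byte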

@adam-fowler

Latest version of Hummingbird 0.14.4 fixes this

@adam-fowler

Because I couldn't believe Hummingbird would crash on such a simple example

You are far too trusting

@finestructure
Owner Author

Very interesting: with Adam's fix in Hummingbird 0.14.4 I can now reproduce the lockup issue in both Vapor and Hummingbird if I hit the server with rester requests. The problem doesn't seem to manifest when the server is getting hit with curl requests.

It seems rester is doing something that neither framework can handle, to the point that both lock up.

@finestructure
Owner Author

finestructure commented Nov 27, 2021

One other piece of information: Adam mentioned yesterday that he suspects rester isn't closing its end of the connection. I'll take a closer look into how I implemented the looping behaviour in rester but in order to take any issues there out of the equation I re-ran the test with

while true ; do rester restfiles/get.restfile ; done

instead of

rester --loop 0 restfiles/get.restfile

(same for post). Running it that way, there is no lockup with either framework, same as when the requests are run via curl.

But even if I'm doing something wrong in the way I'm looping in rester, the thing to bear in mind is that it triggers a behaviour in both frameworks that looks a lot like the lockup we see in production when dumping our db with pg_dump while write jobs are underway, a scenario where rester --loop 0 is not in play at all.

Another observation worth noting: The tests for Vapor and Hummingbird don't lock up when running the get request via rester --loop 0, with many, many parallel loops (more than 20) running. It's the post requests that cause the issue.

@adam-fowler

I think the behaviour we are seeing is an issue with SwiftNIO where it does not recover from being flooded with open connections, even after all of them are closed. I'm going to try to construct a simple example using their sample server.

@adam-fowler

@finestructure I wrote a little app that flooded a server with connections, ran it multiple times, and was able to replicate what we were talking about above. I had a chat with Fabian about this, and he spoke with @weissi, who says one of the reasons we are seeing issues here is that macOS by default only allows a maximum of 256 open file descriptors per process, so it is very easy to flood the server and cause it to hang. Linux normally allows a great deal more, so it isn't as much of a problem there.

This may be completely unrelated to your database issue though.

@weissi

weissi commented Dec 2, 2021

@adam-fowler you can easily up that limit using ulimit -n 99999. You'll need to be root to raise it above the hard limit (which IIRC is 1024 on macOS).

You could put a file called /Library/LaunchDaemons/limit.maxfiles.plist with permissions 644 and the following content

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
        "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
  <dict>
    <key>Label</key>
    <string>limit.maxfiles</string>
    <key>ProgramArguments</key>
    <array>
      <string>launchctl</string>
      <string>limit</string>
      <string>maxfiles</string>
      <string>262144</string>
      <string>524288</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
    <key>ServiceIPC</key>
    <false/>
  </dict>
</plist>

to never have this problem again (after a reboot).

And please note that it's the web server's choice to never close those connections. A web server may choose to, say, close connections that have been idle for some time. But honestly, if you're running a server, the maximum file descriptor limit should be set rather high...

@weissi

weissi commented Dec 2, 2021

Furthermore, I think that SwiftNIO tells the web server in the SERVER channel pipeline (configured via serverChannelInitializer, NOT childChannelInitializer) if it couldn't accept an incoming connection because it ran out of fds. So a well-behaved web server may want to check for errors on that Channel and do something with them (assuming that NIO actually behaves correctly here and signals this problem).
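
For illustration, a minimal sketch of what checking for errors on the server Channel could look like. AcceptErrorLogger is a made-up name for this example, and whether accept failures actually surface as errorCaught here is exactly the assumption stated above:

import NIO

// Hypothetical sketch: a handler installed via serverChannelInitializer (not
// childChannelInitializer) that logs errors seen on the server channel, e.g.
// failed accepts when the process runs out of file descriptors.
final class AcceptErrorLogger: ChannelInboundHandler {
    // On the server channel, each inbound read is a freshly accepted child Channel.
    typealias InboundIn = Channel

    func errorCaught(context: ChannelHandlerContext, error: Error) {
        print("server channel error (possibly a failed accept): \(error)")
        context.fireErrorCaught(error)
    }
}

let group = MultiThreadedEventLoopGroup(numberOfThreads: System.coreCount)
let bootstrap = ServerBootstrap(group: group)
    .serverChannelInitializer { channel in
        channel.pipeline.addHandler(AcceptErrorLogger())
    }
    .childChannelInitializer { channel in
        // the application's per-connection handlers would go here
        channel.eventLoop.makeSucceededFuture(())
    }
// bootstrap.bind(host:port:) as usual from here on.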

@adam-fowler

@weissi we were just trying to diagnose why we were seeing this issue. I'm not looking to run a server on macOS, so this shouldn't really be a problem in practice. I guess we should do testing on Linux. I'll look into seeing if I can catch errors on the server channel, although I'm not totally sure what I'd do if I caught one.

@weissi

weissi commented Dec 2, 2021

@adam-fowler you could log and maybe close some open connections that are idle for a while? Also, you'll want to drop connections that are idle for long, otherwise it's quite easy for somebody to consume resources on your server indefinitely.

Additionally, you should have a very high fd limit in prod.

To see if you have that problem, you can check lsof -p SERVER_PID or netstat or /proc/SERVER_PID/fd/* and compare it to ulimit -n for the server. You can also read your limit using the getrlimit call.

@adam-fowler

So I need to keep a record of how long all my open connections have been idle? Do I need an additional scheduled task that closes a connection if it is idle for too long?

@weissi

weissi commented Dec 2, 2021

@adam-fowler you can use NIO's IdleStateHandler which sends you a certain user event if a channel has been idle for too long.

Apart from that you can also use NIO's AcceptBackoffHandler which can help you slow down the rate of accepts in case you get too many connections too quickly. But also be mindful of your listen backlog option in case you see rejected connections.
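
For illustration, a hedged sketch of the IdleStateHandler approach: the idle timeout fires a user event and a small companion handler closes the channel in response. CloseOnIdleHandler is an illustrative name, not an existing NIO API:

import NIO

// Hypothetical sketch: close a connection once IdleStateHandler reports that it
// has been idle for the configured interval, freeing up its file descriptor.
final class CloseOnIdleHandler: ChannelInboundHandler {
    // This handler never unwraps inbound data, so the concrete type doesn't matter.
    typealias InboundIn = ByteBuffer

    func userInboundEventTriggered(context: ChannelHandlerContext, event: Any) {
        if event is IdleStateHandler.IdleStateEvent {
            context.close(promise: nil)
        } else {
            context.fireUserInboundEventTriggered(event)
        }
    }
}

// Installed per accepted connection, e.g. in childChannelInitializer:
// channel.pipeline.addHandlers([
//     IdleStateHandler(allTimeout: .seconds(30)),
//     CloseOnIdleHandler(),
// ])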

@weissi

weissi commented Dec 2, 2021

There isn't one "right" answer here that works in all cases. As almost always, you're dealing with limited resources that you need to use in the ideal way.

Our resources here are (non exhaustive):

  • the number of available file descriptors
    • one option here: raise the available resources (through setrlimit/ulimit)
    • using fewer file descriptors (by dropping idle connections)
    • or by exerting backpressure when accepting connections (by slowing down the rate of reads in the server channel; the connections we don't accept immediately will then end up in the listen backlog)
  • the listen backlog (this is where TCP SYN'd connections go between the kernel seeing them and NIO calling accept on them; i.e. before they get assigned a file descriptor)
    • we can raise the listen backlog by setting setServerChannelOption(ChannelOptions.backlog, HIGH_NUMBER) and presumably there are also a bunch of sysctls etc. to configure your OS
    • we could accept faster to drain the listen backlog quicker (but then we're trading a smaller listen backlog for more file descriptors)
    • or we can leave the listen backlog to fill up and then the kernel will at some point start to reject new connections

The right answer depends on the use case. I think the best thing you can do right now is to set the limits "high" enough but, more importantly, use logging and metrics to make sure the user has the tools to evaluate what's going on.

If you wanted to, you can also build more adaptive strategies with SwiftNIO. For example, you can very easily count the number of currently active connections by having a handler in the server pipeline that counts the channelReads there (each being one accepted connection) and subtracts one for each acceptedChannel.closeFuture.whenComplete { ... } (see the sketch below). So you could say that if you're fewer than, say, 100 away from the maximum fd limit, you start throwing out the longest-idle connections or so.
You can also periodically query the OS and see how far you are from the limits...
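
A hedged sketch of that counting handler follows; ConnectionCountingHandler is an illustrative name, and the print is a stand-in for whatever metric or eviction logic you'd actually want:

import NIO

// Hypothetical sketch: installed on the *server* channel (via
// serverChannelInitializer), channelRead fires once per accepted connection and
// the child's closeFuture decrements the counter again.
final class ConnectionCountingHandler: ChannelInboundHandler {
    typealias InboundIn = Channel

    private var activeConnections = 0

    func channelRead(context: ChannelHandlerContext, data: NIOAny) {
        let acceptedChannel = self.unwrapInboundIn(data)
        self.activeConnections += 1
        print("accepted connection, \(self.activeConnections) currently active")

        // The child channel may close on a different EventLoop, so hop back to
        // the server channel's loop before touching the counter again.
        let serverLoop = context.eventLoop
        acceptedChannel.closeFuture.whenComplete { _ in
            serverLoop.execute { self.activeConnections -= 1 }
        }
        context.fireChannelRead(data)
    }
}

// e.g. ServerBootstrap(group: group).serverChannelInitializer { channel in
//     channel.pipeline.addHandler(ConnectionCountingHandler())
// }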

@finestructure
Owner Author

Sorry for the very late reply and thanks a lot for the follow up and additional detail, Adam and Johannes.

I think that while the repro on macOS may have been useful to start the discussion, there's a good chance it's actually reproducing a similar-looking issue without shedding enough light on what we actually see in our test and prod envs on Linux.

Next week I’ll try to create a trimmed down repro with our setup in a docker-compose file that can be deployed somewhere for testing, or just provide an env to look at. Hopefully that’ll help track this down!

Thanks again for your help and tips everyone 🙂

@weissi

weissi commented Dec 10, 2021

Sounds good. In case you have a repro in prod, the information from the commands below would be extremely useful. All of the commands are non-destructive and can safely be run in production.

They should (all?) work in unprivileged containers too, but if one doesn't, don't worry about it; we probably won't need all the information. Where I put a sudo, running it with sudo would be great, but you can also drop it if you can't access root.

All commands but the netstat command should also only list information about that one process so they shouldn't disclose anything else about your system. The netstat command will list all TCP connections on the system/container.

All commands assume a shell variable $pid set to the pid of your Vapor service. You can just run all the commands in sequence and then copy & paste the output here in one go. These commands are all for Linux; if you need similar ones for macOS, please reach out.

The easiest way to set that is

pid=$(pgrep NameOfMyBinary)

(in container land it's actually probably pid=1 :) )

Open file descriptors

(apt-get update && apt-get -yy install lsof)

lsof -Pnp $pid

(if you haven't got lsof installed and can't install it, then ls -l /proc/$pid/fd will do too)

Active TCP connections and their state

(apt-get update && apt-get -yy install net-tools)

sudo netstat --tcp -peona

What are the epolls waiting for

for fd_file in /proc/$pid/fd/*; do if [[ "$(readlink "$fd_file")" =~ eventpoll ]]; then fd=$(basename "$fd_file"); echo "Epoll $fd"; echo =====; cat "/proc/$pid/fdinfo/$fd"; fi; done

ulimits

ulimit -a

Number of threads and other stats

TERM=dumb top -H -p $pid -n 1

Which syscalls are the threads of the program stuck in?

cat /proc/$pid/task/*/syscall

Almost certainly, this will give us enough information to know where to look next. It should allow us to see if/why things are blocked at the system or NIO level.

Example output

root@1de88b6c5f26:/Users/johannes/devel/swift-nio# lsof -Pnp $pid
COMMAND   PID USER   FD      TYPE DEVICE SIZE/OFF     NODE NAME
NIOHTTP1S  17 root  cwd       DIR  0,122      832  4004924 /Users/johannes/devel/swift-nio
NIOHTTP1S  17 root  rtd       DIR  0,163     4096 11289603 /
NIOHTTP1S  17 root  txt       REG  0,122 11821272 95010424 /Users/johannes/devel/swift-nio/.build/x86_64-unknown-linux-gnu/debug/NIOHTTP1Server
NIOHTTP1S  17 root  mem       REG  254,1          14433441 /lib/x86_64-linux-gnu/librt-2.27.so (path dev=0,163)
NIOHTTP1S  17 root  mem       REG  254,1          10237750 /usr/lib/swift/linux/libicudataswift.so.65.1 (path dev=0,163)
NIOHTTP1S  17 root  mem       REG  254,1          10237756 /usr/lib/swift/linux/libicuucswift.so.65.1 (path dev=0,163)
NIOHTTP1S  17 root  mem       REG  254,1          10237753 /usr/lib/swift/linux/libicui18nswift.so.65.1 (path dev=0,163)
NIOHTTP1S  17 root  mem       REG  254,1          14433394 /lib/x86_64-linux-gnu/libgcc_s.so.1 (path dev=0,163)
NIOHTTP1S  17 root  mem       REG  254,1          14953301 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.25 (path dev=0,163)
NIOHTTP1S  17 root  mem       REG  254,1          14433376 /lib/x86_64-linux-gnu/libc-2.27.so (path dev=0,163)
NIOHTTP1S  17 root  mem       REG  254,1          10237740 /usr/lib/swift/linux/libBlocksRuntime.so (path dev=0,163)
NIOHTTP1S  17 root  mem       REG  254,1          10237747 /usr/lib/swift/linux/libdispatch.so (path dev=0,163)
NIOHTTP1S  17 root  mem       REG  254,1          14433386 /lib/x86_64-linux-gnu/libdl-2.27.so (path dev=0,163)
NIOHTTP1S  17 root  mem       REG  254,1          14433459 /lib/x86_64-linux-gnu/libutil-2.27.so (path dev=0,163)
NIOHTTP1S  17 root  mem       REG  254,1          14433437 /lib/x86_64-linux-gnu/libpthread-2.27.so (path dev=0,163)
NIOHTTP1S  17 root  mem       REG  254,1          14433401 /lib/x86_64-linux-gnu/libm-2.27.so (path dev=0,163)
NIOHTTP1S  17 root  mem       REG  254,1          10237757 /usr/lib/swift/linux/libswiftCore.so (path dev=0,163)
NIOHTTP1S  17 root  mem       REG  254,1          14433358 /lib/x86_64-linux-gnu/ld-2.27.so (path dev=0,163)
NIOHTTP1S  17 root  mem       REG  254,1          10237758 /usr/lib/swift/linux/libswiftDispatch.so (path dev=0,163)
NIOHTTP1S  17 root  mem       REG  254,1          10237759 /usr/lib/swift/linux/libswiftGlibc.so (path dev=0,163)
NIOHTTP1S  17 root  mem       REG  254,1          10237761 /usr/lib/swift/linux/libswiftSwiftOnoneSupport.so (path dev=0,163)
NIOHTTP1S  17 root  mem       REG  254,1          10237762 /usr/lib/swift/linux/libswift_Concurrency.so (path dev=0,163)
NIOHTTP1S  17 root    0u      CHR  136,0      0t0        3 /dev/pts/0
NIOHTTP1S  17 root    1u      CHR  136,0      0t0        3 /dev/pts/0
NIOHTTP1S  17 root    2u      CHR  136,0      0t0        3 /dev/pts/0
NIOHTTP1S  17 root    3u  a_inode   0,12        0    12340 [eventpoll]
NIOHTTP1S  17 root    4u  a_inode   0,12        0    12340 [eventfd]
NIOHTTP1S  17 root    5u  a_inode   0,12        0    12340 [timerfd]
NIOHTTP1S  17 root    6u  a_inode   0,12        0    12340 [eventpoll]
NIOHTTP1S  17 root    7u  a_inode   0,12        0    12340 [eventfd]
NIOHTTP1S  17 root    8u  a_inode   0,12        0    12340 [timerfd]
NIOHTTP1S  17 root    9u  a_inode   0,12        0    12340 [eventpoll]
NIOHTTP1S  17 root   10u  a_inode   0,12        0    12340 [eventfd]
NIOHTTP1S  17 root   11u  a_inode   0,12        0    12340 [timerfd]
NIOHTTP1S  17 root   12u  a_inode   0,12        0    12340 [eventpoll]
NIOHTTP1S  17 root   13u  a_inode   0,12        0    12340 [eventfd]
NIOHTTP1S  17 root   14u  a_inode   0,12        0    12340 [timerfd]
NIOHTTP1S  17 root   15u  a_inode   0,12        0    12340 [eventpoll]
NIOHTTP1S  17 root   16u  a_inode   0,12        0    12340 [eventfd]
NIOHTTP1S  17 root   17u  a_inode   0,12        0    12340 [timerfd]
NIOHTTP1S  17 root   18u  a_inode   0,12        0    12340 [eventpoll]
NIOHTTP1S  17 root   19u  a_inode   0,12        0    12340 [eventfd]
NIOHTTP1S  17 root   20u  a_inode   0,12        0    12340 [timerfd]
NIOHTTP1S  17 root   21u  a_inode   0,12        0    12340 [eventpoll]
NIOHTTP1S  17 root   22u  a_inode   0,12        0    12340 [eventfd]
NIOHTTP1S  17 root   23u  a_inode   0,12        0    12340 [timerfd]
NIOHTTP1S  17 root   24u  a_inode   0,12        0    12340 [eventpoll]
NIOHTTP1S  17 root   25u  a_inode   0,12        0    12340 [eventfd]
NIOHTTP1S  17 root   26u  a_inode   0,12        0    12340 [timerfd]
NIOHTTP1S  17 root   27u     IPv4  35912      0t0      TCP 127.0.0.1:8888 (LISTEN)
root@1de88b6c5f26:/Users/johannes/devel/swift-nio# netstat --tcp -peona
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       User       Inode      PID/Program name     Timer
tcp        0      0 127.0.0.1:8888          0.0.0.0:*               LISTEN      0          35912      17/.build/debug/NIO  off (0.00/0/0)
tcp        0      0 172.17.0.2:38538        91.189.91.39:80         TIME_WAIT   0          0          -                    timewait (36.22/0/0)
tcp        0      0 172.17.0.2:47784        91.189.88.152:80        TIME_WAIT   0          0          -                    timewait (57.13/0/0)
tcp        0      0 172.17.0.2:37380        91.189.88.142:80        TIME_WAIT   0          0          -                    timewait (38.25/0/0)
tcp        0      0 172.17.0.2:37378        91.189.88.142:80        TIME_WAIT   0          0          -                    timewait (36.15/0/0)
root@1de88b6c5f26:/Users/johannes/devel/swift-nio# for fd_file in /proc/$pid/fd/*; do if [[ "$(readlink "$fd_file")" =~ eventpoll ]]; then fd=$(basename "$fd_file"); echo "Epoll $fd"; echo =====; cat "/proc/$pid/fdinfo/$fd"; fi; done
Epoll 12
=====
pos:	0
flags:	02
mnt_id:	13
tfd:       14 events:     2019 data: ffffffff0000000e  pos:0 ino:3034 sdev:c
tfd:       13 events:       19 data: ffffffff0000000d  pos:0 ino:3034 sdev:c
Epoll 15
=====
pos:	0
flags:	02
mnt_id:	13
tfd:       16 events:       19 data: ffffffff00000010  pos:0 ino:3034 sdev:c
tfd:       17 events:     2019 data: ffffffff00000011  pos:0 ino:3034 sdev:c
Epoll 18
=====
pos:	0
flags:	02
mnt_id:	13
tfd:       20 events:     2019 data: ffffffff00000014  pos:0 ino:3034 sdev:c
tfd:       19 events:       19 data: ffffffff00000013  pos:0 ino:3034 sdev:c
Epoll 21
=====
pos:	0
flags:	02
mnt_id:	13
tfd:       23 events:     2019 data: ffffffff00000017  pos:0 ino:3034 sdev:c
tfd:       22 events:       19 data: ffffffff00000016  pos:0 ino:3034 sdev:c
Epoll 24
=====
pos:	0
flags:	02
mnt_id:	13
tfd:       26 events:     2019 data: ffffffff0000001a  pos:0 ino:3034 sdev:c
tfd:       25 events:       19 data: ffffffff00000019  pos:0 ino:3034 sdev:c
Epoll 3
=====
pos:	0
flags:	02
mnt_id:	13
tfd:       27 events:     2019 data: ffffffff0000001b  pos:0 ino:8c48 sdev:8
tfd:        5 events:     2019 data: ffffffff00000005  pos:0 ino:3034 sdev:c
tfd:        4 events:       19 data: ffffffff00000004  pos:0 ino:3034 sdev:c
Epoll 6
=====
pos:	0
flags:	02
mnt_id:	13
tfd:        8 events:     2019 data: ffffffff00000008  pos:0 ino:3034 sdev:c
tfd:        7 events:       19 data: ffffffff00000007  pos:0 ino:3034 sdev:c
Epoll 9
=====
pos:	0
flags:	02
mnt_id:	13
tfd:       10 events:       19 data: ffffffff0000000a  pos:0 ino:3034 sdev:c
tfd:       11 events:     2019 data: ffffffff0000000b  pos:0 ino:3034 sdev:c
root@1de88b6c5f26:/Users/johannes/devel/swift-nio# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 63576
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1048576
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
root@1de88b6c5f26:/Users/johannes/devel/swift-nio# TERM=dumb top -H -p $pid -n 1
top - 13:46:38 up 17 min,  0 users,  load average: 0.10, 0.07, 0.08
Threads:  15 total,   0 running,  15 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.2 us,  0.6 sy,  0.0 ni, 98.1 id,  0.1 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 16395788 total, 12626816 free,   267336 used,  3501636 buff/cache
KiB Swap:  1048572 total,  1048572 free,        0 used. 15439336 avail Mem 

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
17 root 20 0 1112364 11752 10796 S 0.0 0.1 0:00.00 NIOHTTP1Server
18 root 20 0 1112364 11752 10796 S 0.0 0.1 0:00.00 NIO-ELT-0-#0
19 root 20 0 1112364 11752 10796 S 0.0 0.1 0:00.00 NIO-ELT-0-#1
20 root 20 0 1112364 11752 10796 S 0.0 0.1 0:00.00 NIO-ELT-0-#2
21 root 20 0 1112364 11752 10796 S 0.0 0.1 0:00.00 NIO-ELT-0-#3
22 root 20 0 1112364 11752 10796 S 0.0 0.1 0:00.00 NIO-ELT-0-#4
23 root 20 0 1112364 11752 10796 S 0.0 0.1 0:00.00 NIO-ELT-0-#5
24 root 20 0 1112364 11752 10796 S 0.0 0.1 0:00.00 NIO-ELT-0-#6
25 root 20 0 1112364 11752 10796 S 0.0 0.1 0:00.00 NIO-ELT-0-#7
26 root 20 0 1112364 11752 10796 S 0.0 0.1 0:00.00 TP-#0
27 root 20 0 1112364 11752 10796 S 0.0 0.1 0:00.00 TP-#1
28 root 20 0 1112364 11752 10796 S 0.0 0.1 0:00.00 TP-#2
29 root 20 0 1112364 11752 10796 S 0.0 0.1 0:00.00 TP-#3
30 root 20 0 1112364 11752 10796 S 0.0 0.1 0:00.00 TP-#4
31 root 20 0 1112364 11752 10796 S 0.0 0.1 0:00.00 TP-#5

root@1de88b6c5f26:/Users/johannes/devel/swift-nio# cat /proc/$pid/task/*/syscall
202 0x5596c6da47a8 0x80 0x0 0x0 0x0 0x0 0x7ffd14ffc290 0x7f131ac73ad3
232 0x3 0x7f1310000e70 0x40 0xffffffff 0x0 0x7f1317285b00 0x7f1317285a30 0x7f131a12da47
232 0x6 0x7f1308000c70 0x40 0xffffffff 0x0 0x7f1316a84b00 0x7f1316a84a30 0x7f131a12da47
232 0x9 0x7f130c000c70 0x40 0xffffffff 0x0 0x7f1316283b00 0x7f1316283a30 0x7f131a12da47
232 0xc 0x7f1300000c70 0x40 0xffffffff 0x0 0x7f1315a82b00 0x7f1315a82a30 0x7f131a12da47
232 0xf 0x7f1304000c70 0x40 0xffffffff 0x0 0x7f1315281b00 0x7f1315281a30 0x7f131a12da47
232 0x12 0x7f12f8000c70 0x40 0xffffffff 0x0 0x7f1314a80b00 0x7f1314a80a30 0x7f131a12da47
232 0x15 0x7f12f4000c70 0x40 0xffffffff 0x0 0x7f12ffffdb00 0x7f12ffffda30 0x7f131a12da47
232 0x18 0x7f12ec000c70 0x40 0xffffffff 0x0 0x7f12ff7fcb00 0x7f12ff7fca30 0x7f131a12da47
202 0x5596c6da15e0 0x189 0x0 0x0 0x0 0xffffffff 0x7f12feffc9a0 0x7f131ac767c6
202 0x5596c6da15e0 0x189 0x0 0x0 0x0 0xffffffff 0x7f12fe7fb9a0 0x7f131ac767c6
202 0x5596c6da15e0 0x189 0x0 0x0 0x0 0xffffffff 0x7f12fdffa9a0 0x7f131ac767c6
202 0x5596c6da15e0 0x189 0x0 0x0 0x0 0xffffffff 0x7f12fd7f99a0 0x7f131ac767c6
202 0x5596c6da15e0 0x189 0x0 0x0 0x0 0xffffffff 0x7f12fcff89a0 0x7f131ac767c6
202 0x5596c6da15e0 0x189 0x0 0x0 0x0 0xffffffff 0x7f12e3ffe9a0 0x7f131ac767c6

@weissi

weissi commented Dec 10, 2021

I proposed a script called nio-diagnose to be added to SwiftNIO. It does pretty much the above, but automatically.

@finestructure
Owner Author

Happy new year @weissi and @adam-fowler ! Apologies for another delay in getting back to you. I tried unsuccessfully to replicate our setup in a local docker env, and that took a while. It's just not erroring out, at least not when running on my M1 MBP. I'm either not fully capturing what we're doing in the SPI or it's specific to our hosting env running on Linux.

Rather than try and fiddle with this any longer, I've now simply triggered the issue in our staging env and logged some diagnostics as advised. The next to last file (diag-2022-01-03-09.13.39.md) should have captured the error state, while the last was run just after I'd stopped pg_dump. So things might have started clearing up then.

I can't glean much from the content I'm afraid, other than that perhaps some permission tweaking might be necessary to get the syscalls:

grep: /proc/1/task/1/syscall: Operation not permitted
grep: /proc/1/task/10/syscall: Operation not permitted
grep: /proc/1/task/11/syscall: Operation not permitted
...

As you can see, I was root in the container so I suppose the container itself may need to be launched with more permissions?

I hope this is useful! Thanks in advance for taking a look and please let me know if there's any additional info I can provide.

Just as a reminder as to our setup:

  • Running in our app_server container is the process serving the SPI web page (and I've only run nio-diagnose in that container)
  • there are other containers that run periodically to write to the db
  • running pg_dump without the write containers active does not seem to trigger the problem
  • in the error case, even a request to the privacy page which does not trigger a db request (however, it does read a markdown file) gives us a request timeout

This last one is particularly baffling to me. Somehow our combination of readers, writers + pg_dump makes everything grind to a halt.

I'm sure we've misconfigured something but as far as I can recall everything should be pretty much the default setup. So if that's the case maybe there are some defaults that require tweaking.

CleanShot 2022-01-03 at 10 13 46@2x

diag-2022-01-03-09:12:31.md
diag-2022-01-03-09:13:24.md
diag-2022-01-03-09:13:30.md
diag-2022-01-03-09:13:39.md
diag-2022-01-03-09:14:01.md

@weissi

weissi commented Jan 4, 2022

Happy New Year! Alright, so there are 230 file descriptors open, that looks good. Nothing in TIME_WAIT or CLOSE_WAIT which is also good and expected.

All the sockets that I can see are registered with the epoll event mask 2019 (hex), which decomposes into

  • EPOLLRDHUP = 0x2000
  • EPOLLHUP = 0x010
  • EPOLLERR = 0x008
  • EPOLLIN = 0x001

which is what you want to see: NIO is accepting bytes from the network (waiting for epoll_wait to tell NIO that it can call read) but isn't trying to send anything (as in everything that the app submitted to NIO has been sent).

So this all looks 100% normal so far.

Apart from some pretty severe bug in the kernel/NIO/... this pretty much leaves us with 2 options that I can think of right now:

  1. The EventLoops are both blocked in something. To (dis)prove this theory we'd need the /proc/$PID/task/*/syscall output (which requires --cap-add=SYS_PTRACE, i.e. the CAP_SYS_PTRACE capability).
  2. Something in Vapor / your app does some form of queueing or similar that prevents this from working

If you can't add CAP_SYS_PTRACE, then you can also check if the EventLoops are blocked in the application code.

Something like

for loop in theVaporAppsEventLoopGroup.makeIterator() {
   loop.scheduleRepeatedTask(initialDelay: .seconds(1), delay: .seconds(1)) { _ in
       logger.info("EventLoop \(loop) is alive")
   }
}

will log a message for every EventLoop once per second. If you stop seeing anything logged, then the EventLoops are likely blocked.

In my gists, I also found some old code which implements a basic but somewhat configurable EventLoop blockage checker. Vapor should probably have something like this.
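
For illustration, a hedged sketch of a very basic blockage check (not the gist referenced above): submit a no-op task to the loop from outside and warn if it hasn't run within an allowed delay, which suggests the loop is stuck in (or blocked by) user code. warnIfBlocked is a made-up name:

import Dispatch
import NIO

// Hypothetical sketch of a minimal EventLoop responsiveness probe.
func warnIfBlocked(_ loop: EventLoop, allowedDelay: TimeAmount = .milliseconds(100)) {
    let ranTask = DispatchSemaphore(value: 0)
    loop.execute {
        // Runs as soon as the loop gets around to it.
        ranTask.signal()
    }
    DispatchQueue.global().async {
        let deadline = DispatchTime.now() + .nanoseconds(Int(allowedDelay.nanoseconds))
        if ranTask.wait(timeout: deadline) == .timedOut {
            print("warning: an EventLoop did not run a task within \(allowedDelay); it may be blocked")
        }
    }
}

// Driven from outside the loops, e.g. a Dispatch timer calling
// warnIfBlocked(loop) periodically for each loop in the app's EventLoopGroup.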

@finestructure
Owner Author

Thanks a lot for taking a look, @weissi !

I've now managed to add CAP_SYS_PTRACE to the service and regenerated the logs. Here they are; the last couple should have captured the hang:

diag-2022-01-08-14:07:25.md
diag-2022-01-08-14:07:32.md
diag-2022-01-08-14:07:36.md
diag-2022-01-08-14:07:41.md

Hope that helps and please let me know if there's anything else I can do/try, like a screen share or something. It's a test environment, so there's no problem poking it while running :)

@weissi

weissi commented Jan 10, 2022

@finestructure thank you. Okay, so you seem to have 68 threads, out of which 65 are stuck in futex (the syscall used when the kernel has to wait on/notify locks, condition variables, ...) and 3 are in epoll_wait.

You seem to have 2 NIO EventLoop threads (seems good), which are threads 7 and 8; crucially, both of them are in epoll_wait, which is good. I.e. they're parked in the kernel waiting for work to be requested (either by data coming in or by more work requested from your app). So the EventLoop threads don't appear to be stuck but are completely ready to accept and do work.

This means: From a SwiftNIO point of view, there is absolutely nothing that looks wrong that I can see. All the usual suspects look totally fine. Could you run the following commands for me from within the container whilst the problem reproduces? Just paste the full output of

  1. time curl -v http://127.0.0.1:80/this_is_a_route_that_does_not_exist
  2. time curl -X WRONG -v http://127.0.0.1:80
  3. grep -H ^ /proc/1/task/*/comm
  4. lldb --batch -o 'bt all' -o cont -p 1 (don't worry about if you don't have it installed)

The idea here is to "prove" that at least something's responsive. The first command hits a route that doesn't exist (this should respond quickly but is handled by Vapor) and the second one should trigger an error really early on (because WRONG is not an HTTP method), which should lead Vapor to instruct NIO to close the connection. Both should also log errors in Vapor.

Out of interest: Do we know what all these other 65 blocked threads are doing? How many NIOThreadPool threads do you have configured? I reckon you probably don't have 65 of them?

Because what could be happening is that some other system you use all the time (logging or something?) is blocked on one of those threads. The third command might give us more info (it will dump all the thread names; that should be part of nio-diagnose really...). The fourth command just attempts to use lldb to give us all the back traces.


$ grep -e task/7/syscall: -e task/8/syscall: /Users/johannes/Downloads/diag-2022-01-08-*.md 
/Users/johannes/Downloads/diag-2022-01-08-14.07.25.md:/proc/1/task/7/syscall:232 0x3 0x7f5964000d00 0x40 0xffffffff 0x0 0x7f596b338938 0x7f596b338920 0x7f596e3c45ce
/Users/johannes/Downloads/diag-2022-01-08-14.07.25.md:/proc/1/task/8/syscall:232 0x6 0x7f595c000c80 0x40 0xffffffff 0x0 0x7f596ab16938 0x7f596ab16920 0x7f596e3c45ce
/Users/johannes/Downloads/diag-2022-01-08-14.07.32.md:/proc/1/task/7/syscall:232 0x3 0x7f5964000d00 0x40 0xffffffff 0x0 0x7f596b338938 0x7f596b338920 0x7f596e3c45ce
/Users/johannes/Downloads/diag-2022-01-08-14.07.32.md:/proc/1/task/8/syscall:232 0x6 0x7f595c000c80 0x40 0xffffffff 0x0 0x7f596ab16938 0x7f596ab16920 0x7f596e3c45ce
/Users/johannes/Downloads/diag-2022-01-08-14.07.36.md:/proc/1/task/7/syscall:232 0x3 0x7f5964000d00 0x40 0xffffffff 0x0 0x7f596b338938 0x7f596b338920 0x7f596e3c45ce
/Users/johannes/Downloads/diag-2022-01-08-14.07.36.md:/proc/1/task/8/syscall:232 0x6 0x7f595c000c80 0x40 0xffffffff 0x0 0x7f596ab16938 0x7f596ab16920 0x7f596e3c45ce
/Users/johannes/Downloads/diag-2022-01-08-14.07.41.md:/proc/1/task/7/syscall:232 0x3 0x7f5964000d00 0x40 0xffffffff 0x0 0x7f596b338938 0x7f596b338920 0x7f596e3c45ce
/Users/johannes/Downloads/diag-2022-01-08-14.07.41.md:/proc/1/task/8/syscall:232 0x6 0x7f595c000c80 0x40 0xffffffff 0x0 0x7f596ab16938 0x7f596ab16920 0x7f596e3c45ce

the first column (always 232 for threads 7 and 8) is the syscall number, and 232 is epoll_wait, which is how I know:

# grep -e NR_epoll_wait -e NR_futex /usr/include/x86_64-linux-gnu/asm/unistd_64.h
#define __NR_futex 202
#define __NR_epoll_wait_old 215
#define __NR_epoll_wait 232

the other threads are in 202 which is futex, meaning a lock/condition variable or so.

You also seem to have a bunch of NIOThreadPool threads and those are expected to be stuck in futex because they use a pthread_cond condition variable to wait for work which would show up as futex unless they're currently actively running work.

@finestructure
Owner Author

Following up with some additional detail raised during yesterday's debugging session:

  • (why) do error routes require a db connection?
  • (why) does /route_that_does_not_exist require a db connection?
    • it does not
    • it loads a 500 and displays the error page as expected

I tested this by running locally with the db shut down. Everything behaved as expected, i.e. pages that require the db raised 500s and displayed the error page, while Markdown pages loaded from the filesystem as expected.

So these routes should not be affected by any db related timeouts.

It seems that congestion from db-related activity is impacting non-db routes when we run pg_dump.

We were speculating that the write queries might be pushed above a 10s timeout limit due to the pg_dump activity. I'm pretty sure something like this is happening. I remembered that we've recently added query duration logging for our writer jobs and when pulling up the data for yesterday afternoon, I can see distinct spikes in duration which surely correspond to the times I forced the error by running pg_dump. (We should be able to match it with the times in the strace log. Grafana timezone is CET, logs would be UTC.)

CleanShot 2022-01-11 at 07 30 11@2x

Zoomed in view, showing the query run times in more detail:

CleanShot 2022-01-11 at 07 41 04@2x

"ingest" is a process that queries the Github API for package metadata every 5mins in batches of 100 packages.

"analyse" is a process that runs analysis on git repository checkouts for packages just refreshed by "ingest" in batches of 25 with a check frequency of 20s. That's why after every step change in the "ingest" curve, the "analysis" curve spikes up with 4 requests timing around 6-7s.

(Since we're running batch processes, we're using a metrics pusher and it'll keep reporting the last duration until a new duration is in. That's why the lines don't drop down to 0 in between requests.)

@weissi

weissi commented Jan 11, 2022

It seems that congestion with db related activity is impacting non-db routes when we pg_dump.

This seems odd. How could the non-DB routes be affected if they truly don't use the DB?

Could this scenario be possible: the "non-DB" routes attempt to contact the DB but don't require it. So if the DB is shut down, they work really fast because the DB call fails immediately. If the DB is working normally, they obviously work too. But in the case where the DB is very backed up, they still wait for the timeout because they attempt to speak to the DB.

For example if you have something like an access log stored in the DB, something like

app.get("no-db") { req in
    self.database.storeAccessLog()
        .recover { _ in /* this is best-effort, let's ignore the error */ }
        .flatMap {
            return self.renderMarkdownPage()
        }
}

obviously your real code wouldn't look like that. But maybe one of your middlewares does a best-effort DB connection?

That could explain that everything is working when the DB is either working totally fine or not working at all. But in the case where the DB is just super slow, we'd see the non-db route taking a long time because it still attempts to contact the DB.

This should be easily visible with some logging. Paging @fabianfett, is there a way to configure PostgresNIO to log every attempt, so something that'd log before you become a waiter in the connection pool?

@weissi

weissi commented Jan 11, 2022

@finestructure actually, it might be more subtle. Didn't you see only very few requests to /route_that_does_not_exist actually hang? Is it possible that the non-DB routes usually don't contact the DB but every now and then, they do? Maybe some reporting, or when some error happens, or something else?

@finestructure
Owner Author

finestructure commented Jan 11, 2022

Yes, I did see some requests go through in the initial tests. During the last test (when I'd stopped the external checker) that didn't seem to be the case, although that's not something I've rigorously tested. I'll definitely review all routes with respect to db access, add some logging, and re-test. Thanks for taking a look!

@finestructure
Owner Author

I'm not done yet investigating but wanted to share some interim results.

I set the project up to run locally, pointing the db at our staging db, with my network configured to add a 5 second delay via Link Conditioner.

Here's what I found: it definitely looks like a 404 route is contacting the db, or at least something over the network. It's not clear to me where and why that would be, because as far as I can see, with none of our routes nor the ErrorMiddleware handling it, this should be entirely stock Vapor behaviour, if I'm not mistaken. I'll dig into this further. In particular I'll try to differentiate whether it's db or general network access that causes the delay.

(Is there a way to set up an SSH tunnel with a delay? If so I could configure the db over a slow tunnel and separate that out from network traffic. Another thought I had was to set a db request breakpoint somewhere that definitely gets executed for any Postgres request. I'll need to see if I can find a good place.)

What's also clear is that our Markdown page loading does not access the db. This might be an interesting data point, because when the timeouts occur during pg_dump, these pages are affected and time out as well.

Is there any chance that there's a Vapor/NIO internal resource that gets congested if db tasks are getting so slow that requests would run over the 10s timeout? I'm pretty sure you said that everything looks ok with respect to event loops etc but I just wanted to double check if I understand that correctly.

@weissi

weissi commented Jan 12, 2022

Is there any chance that there's a Vapor/NIO internal resource that gets congested if db tasks are getting so slow that requests would run over the 10s timeout? I'm pretty sure you said that everything looks ok with respect to event loops etc but I just wanted to double check if I understand that correctly.

I can only speak for NIO, and there the answer is no. NIO has no idea if a network connection is HTTP or a database, or if some connection doesn't make progress, or anything. To stall NIO, you'd need to stall the EventLoop, but (from the curl -X WRONG test and the nio-diagnose data) we already know that's not the case.

What you could do is send me another strace -ttTf -p 1 whilst you run: for f in $(seq 250); do curl http://127.0.0.1:8080/privacy; done

@finestructure
Owner Author

Ok, I think I can at least clear up one of the contradictions: the privacy page does not lock up when the db is blocking.

Those 250 curl requests complete just fine throughout the whole of the pg_dump, and I now realise why I thought they were blocked. My test script goes round-robin through the URLs when I test externally: home (/), search (/search?query=alamo), and privacy (/privacy). What I didn't notice until now is that when we're blocking, we're just blocking on the home page and never cycling on to the other pages:

CleanShot 2022-01-13 at 14 57 44@2x

I suspect this is what's been happening all along and I just misinterpreted the errors. My choice of blue on black for the request name certainly didn't help 😟 Sorry for the confusion!

@weissi

weissi commented Jan 13, 2022

Ah, that makes a lot of sense! Thanks for reporting back.

@finestructure
Owner Author

It's like a Sudoku: once you've figured out the first piece, things start to fall into place.

The reason that http://localhost:8080/this_is_a_route_that_does_not_exist blocked is actually quite simple: it's not a route that does not exist! It's actually trying to resolve to an author page for a non-existing author, i.e. like https://staging.swiftpackageindex.com/apple

Naturally, that's running a search query. I completely missed that 🤦

Next, I started running queries via an SQL client while the db is "blocked" and found some interesting results. Most queries run fine while the db is "blocked" (and by blocked I mean some pages throwing 500s). It turns out that, of the three trivial queries we run for the home page, one consistently blocks in the SQL client as well when run against the db while a pg_dump has been underway for a while.

When this one query is blocked, other queries run fine. It's unclear why a read would block like this and why it's only that particular query. To be investigated, but that certainly doesn't look like a Vapor issue.

Another odd result is the following: the page https://staging.swiftpackageindex.com/apple is driven by a single SQL query. Accessing this page during a pg_dump will, when the dump's been underway for a while, start throwing 500s. However, the underlying SQL query can still be executed successfully via the db client the whole time.

So there is certainly still something weird going on where a page we're building off an SQL query times out even though there's no evidence of the actual SQL query timing out or taking too long.

I'll need to follow up on this as well. Maybe we're exhausting a connection limit or something.

I think at this point it's best if I close this issue and open another one in the SPI repo to track the details. It's now getting quite SPI specific.

Thanks again for the help, Johannes, and feel free to follow along in the new issue :)

@weissi

weissi commented Jan 14, 2022

However, the underlying SQL query can still be executed successfully via db client the whole time.

That's quite interesting. @fabianfett does PostgresNIO by default create a transaction and if yes, what isolation level?

@finestructure could you try (whilst the Vapor app has the sql queries hanging) in the commandline psql

BEGIN;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT ...; -- your normal query here
COMMIT;

and see if that still executes fast?

pg_dump shouldn't write to the DB, but it certainly wants a consistent view of the DB, so it'll probably use ISOLATION LEVEL SERIALIZABLE or so. Now if some writing request comes in, this may cause other serializable transactions to wait.

@finestructure
Owner Author

It's almost certainly something like that. I forgot to mention that this whole issue disappears if we stop our db writer processes while dumping. That smells like something transaction or table locking related. I'll follow up as you suggest asap!

@finestructure
Owner Author

I've run that check with the transaction level set to SERIALIZABLE now. It makes no difference – the query runs fine with or without it, while the page that uses it times out.

SwiftPackageIndex/SwiftPackageIndex-Server#1489 has some more details. It's baffling!
