New details #1
Comments
Sorry @adam-fowler, I tagged you into this issue without a lot of context, but I suspect you might be interested in the findings. I had a chat last week with Fabian about the original issue I outline at the very end above. It's probably easiest if I give you more details via Discord/Slack somewhere if you're interested - just let me know!
Hi @finestructure, just looking into it now. Not sure what the original Vapor issue is, but a crash in Hummingbird isn't ideal.
Ah great, let me write up some repro details for the crash in the other repo then.
At a guess, the issue is that the URL you are sending is not proper UTF8. But my parser shouldn't be so fussy; it should throw an error instead.
I've updated the README over in the other repro project: https://github.com/finestructure/hummingbird-lockup-repro/blob/main/README.md Thanks a lot for taking a look!
Fix on its way. Stupid bug where I was checking to see if a byte was a valid starting byte for a UTF8 character, except the byte happened to be at the end index of the string, so it would cause a crash when accessed.
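For illustration only - a minimal sketch of that bug class in made-up code, not the actual HBParser implementation:

// Hypothetical sketch of the bug class described above (not HBParser itself).
// `index` can legitimately equal `bytes.endIndex`, so subscripting before
// checking the bounds traps with "index out of range".
func isUTF8StartByte(_ bytes: [UInt8], at index: Int) -> Bool {
    // Buggy version: evaluates bytes[index] even when index == bytes.endIndex.
    // return bytes[index] & 0xC0 != 0x80

    // Fixed version: check the bounds first.
    guard index < bytes.endIndex else { return false }
    return bytes[index] & 0xC0 != 0x80   // i.e. not a UTF-8 continuation byte
}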
The latest version of Hummingbird, 0.14.4, fixes this.
You are far too trusting
Very interesting: with Adam's fix in Hummingbird 0.14.4 I can now reproduce the lockup issue both in Vapor and Hummingbird, if I hit the server with …. It seems …
One other piece of information: Adam mentioned yesterday that he suspects … instead of … (same for …). But even if I'm doing something wrong in the way I'm looping in …

Another observation worth noting: the tests for Vapor and Hummingbird don't lock up when running the …
I think the behaviour we are seeing is an issue with SwiftNIO where it does not recover from being flooded with open connections, even after all of them are closed. I'm going to try and construct a simple example using their sample server.
@finestructure I wrote a little app that flooded a server with connections and ran it multiple times, and was able to replicate what we were talking about above. I had a chat with Fabian about this, and he spoke with @weissi, who says one of the reasons we are seeing issues here is that macOS by default only allows a maximum of 256 connections, so it is very easy to flood the server and cause it to hang. Linux normally allows a great deal more connections, so it isn't as much of a problem there. This may be completely unrelated to your database issue though.
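For reference, a flooder along those lines can be tiny. Below is a hypothetical sketch using NIO's ClientBootstrap - this is not Adam's actual test app, and the host, port, and counts are made up:

import Foundation
import NIO

// Open lots of TCP connections and keep them idle so the server's
// file-descriptor / accepted-connection budget runs out.
let group = MultiThreadedEventLoopGroup(numberOfThreads: 1)
defer { try? group.syncShutdownGracefully() }

let bootstrap = ClientBootstrap(group: group)
var channels: [Channel] = []

for i in 0..<1_000 {
    do {
        // Connect and simply hold on to the channel without ever sending anything.
        channels.append(try bootstrap.connect(host: "127.0.0.1", port: 8080).wait())
    } catch {
        print("connection \(i) failed: \(error)")
        break
    }
}

// Keep the process (and therefore all the idle connections) alive for a while.
Thread.sleep(forTimeInterval: 60)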
@adam-fowler you can easily up that limit using …. You could put a file called … with this content:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
"http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key>
<string>limit.maxfiles</string>
<key>ProgramArguments</key>
<array>
<string>launchctl</string>
<string>limit</string>
<string>maxfiles</string>
<string>262144</string>
<string>524288</string>
</array>
<key>RunAtLoad</key>
<true/>
<key>ServiceIPC</key>
<false/>
</dict>
</plist>

That way you'll never have this problem again (after a reboot). And please note that it's the choice of the web server to never close those connections. A web server may choose to, say, close connections that are idle for some time. But honestly, if you're running a server, the maximum file descriptor limit should be set rather high...
Furthermore, I think that SwiftNIO tells the web server in the SERVER channel pipeline (configured via …) …
@weissi we were just trying to diagnose why we were seeing this issue. I'm not looking to run a server on macOS so this shouldn't really be an issue. I guess we should do testing on Linux. I'll look into seeing if I can catch errors on the server channel, although I'm not totally sure what I'd do if I caught one.
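For what it's worth, a rough sketch of what "catching errors on the server channel" could look like with a plain NIO ServerBootstrap - the handler and logger names are made up, and Hummingbird's real bootstrap code will differ:

import NIO
import Logging

// Log errors that surface on the server (accept) channel, e.g. when accept()
// fails because the process has run out of file descriptors.
final class AcceptErrorLogger: ChannelInboundHandler {
    typealias InboundIn = Channel   // the server channel "reads" accepted child Channels

    private let logger: Logger
    init(logger: Logger) { self.logger = logger }

    func errorCaught(context: ChannelHandlerContext, error: Error) {
        logger.error("error on server channel (accept failed?): \(error)")
        // Deliberately not closing the listening socket here; just record the problem.
        context.fireErrorCaught(error)
    }
}

// Installed via the *server* channel initializer, not the per-connection one:
let bootstrap = ServerBootstrap(group: MultiThreadedEventLoopGroup(numberOfThreads: 2))
    .serverChannelInitializer { channel in
        channel.pipeline.addHandler(AcceptErrorLogger(logger: Logger(label: "accept")))
    }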
@adam-fowler you could log and maybe close some open connections that have been idle for a while? Also, you'll want to drop connections that are idle for long, otherwise it's quite easy for somebody to consume resources on your server indefinitely. Additionally, you should have a very high fd limit in prod. To see if you have that problem, you can check …
So I need to keep a record of how long all my open connections have been idle? Do I need an additional scheduled task that closes a connection if it is idle for too long?
@adam-fowler you can use NIO's …. Apart from that you can also use NIO's …
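The specific NIO helpers named here were lost in the extraction above. One standard NIO building block for this - whether or not it's the one meant - is IdleStateHandler together with a small handler that closes the channel when the idle event fires. A minimal sketch:

import NIO

// Turns IdleStateHandler's idle notification into an actual close.
final class CloseOnIdleHandler: ChannelInboundHandler {
    typealias InboundIn = NIOAny

    func userInboundEventTriggered(context: ChannelHandlerContext, event: Any) {
        if event is IdleStateHandler.IdleStateEvent {
            context.close(promise: nil)   // drop connections that have gone idle
        } else {
            context.fireUserInboundEventTriggered(event)
        }
    }
}

// Added per accepted connection (i.e. in the child channel initializer):
// channel.pipeline.addHandlers([
//     IdleStateHandler(allTimeout: .seconds(30)),   // 30s of no activity at all
//     CloseOnIdleHandler(),
// ])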
There isn't one "right" answer here that works in all cases. As almost always, you're dealing with limited resources that you need to use in the ideal way. Our resources here are (non-exhaustive): …
The right answer depends on the use case. I think the best thing you can do right now is to set the limits "high" enough, but more importantly use logging and metrics to make sure the user has the tools to evaluate what's going on. If you wanted to, with SwiftNIO you can also build more adaptive strategies. For example, you can very easily count the number of currently active connections (by having a handler in the server pipeline that counts the number of …).
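A rough sketch of that counting idea, assuming a plain NIO ServerBootstrap (the handler name is made up): in the server channel's pipeline, every channelRead delivers one freshly accepted child Channel, so counting reads is counting connections.

import NIO

final class ConnectionCountingHandler: ChannelInboundHandler {
    typealias InboundIn = Channel

    // This handler lives on the single server channel, so a plain Int is fine
    // as long as we only touch it from that channel's event loop.
    private var openConnections = 0

    func channelRead(context: ChannelHandlerContext, data: NIOAny) {
        let child = self.unwrapInboundIn(data)
        openConnections += 1
        print("open connections: \(openConnections)")   // or feed a metric / refuse above a threshold

        // Decrement when the accepted connection goes away; hop back to the
        // server channel's event loop because closeFuture completes on the child's.
        let serverLoop = context.eventLoop
        child.closeFuture.whenComplete { _ in
            serverLoop.execute { self.openConnections -= 1 }
        }
        context.fireChannelRead(data)
    }
}

// Installed via ServerBootstrap.serverChannelInitializer, alongside the server channel's other handlers.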
Sorry for the very late reply and thanks a lot for the follow up and additional detail, Adam and Johannes. I think that while the repro on macOS may have been useful to start the discussion, it seems like there's a good chance it's actually reproducing a similar-looking issue without shedding enough light on what we actually see in our test and prod envs on Linux. Next week I'll try to create a trimmed down repro with our setup in a docker-compose file that can be deployed somewhere for testing, or just provide an env to look at. Hopefully that'll help track this down! Thanks again for your help and tips everyone 🙂
Sounds good. In case you have a repro in prod, the information from the commands below would be extremely useful. All of the commands are non-destructive and can safely be run in production. They should (all?) work in unprivileged containers too, but if one doesn't then don't worry about it, we probably won't need all the information. If I put a …

All commands but the … All commands will assume a shell variable …. The easiest way to set that is

pid=$(pgrep NameOfMyBinary)

(in container land it's actually probably …)

Open file descriptors (…)
(if you haven't got …)

Active TCP connections and their state:

sudo netstat --tcp -peona

What are the …
I proposed a script called …
Happy New Year @weissi and @adam-fowler! Apologies for another delay in getting back to you. I tried unsuccessfully to replicate our setup in a local docker env, and that took a while. It's just not erroring out, at least not when running on my M1 MBP. I'm either not fully capturing what we're doing in the SPI or it's specific to our hosting env running on Linux. Rather than try and fiddle with this any longer, I've now simply triggered the issue in our staging env and logged some diagnostics as advised. The next to last file (…) … I can't glean much from the content I'm afraid, other than that perhaps some permission tweaking might be necessary to get the syscalls: …
As you can see, I was … I hope this is useful! Thanks in advance for taking a look and please let me know if there's any additional info I can provide. Just as a reminder as to our setup: …
This last one is particularly baffling to me. Somehow our combination of readers, writers + … I'm sure we've misconfigured something, but as far as I can recall everything should be pretty much the default setup. So if that's the case maybe there are some defaults that require tweaking.

diag-2022-01-03-09:12:31.md
Happy New Year! Alright, so there are 230 file descriptors open, that looks good. Nothing in TIME_WAIT or CLOSE_WAIT, which is also good and expected. All the sockets that I can see are in epoll state 2019, which is …
which is what you want to see: NIO is accepting bytes from the network (waiting for …). So this all looks 100% normal so far. Apart from some pretty severe bug in the kernel/NIO/... this pretty much leaves us with 2 options that I can think of right now: …
If you can't add …, something like

for loop in theVaporAppsEventLoopGroup.makeIterator() {
    loop.scheduleRepeatedTask(initialDelay: .seconds(1), delay: .seconds(1)) { _ in
        logger.info("EventLoop \(loop) is alive")
    }
}

will log a message for every EventLoop once per second. If you stop seeing anything logged, then the EventLoops are likely blocked. In my gists, I also found some old code which implements a basic but somewhat configurable EventLoop blockage checker. Vapor should probably have something like this.
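For reference, a very rough sketch of what such a blockage checker could look like (this is not the gist Johannes mentions; names and intervals are made up): from a watchdog queue, post a no-op task to every EventLoop and complain if it hasn't run within a tolerance.

import Dispatch
import Logging
import NIO

func startEventLoopWatchdog(group: EventLoopGroup, logger: Logger,
                            tolerance: Double = 1.0, interval: Double = 5.0) {
    let queue = DispatchQueue(label: "eventloop-watchdog")
    func check() {
        for loop in group.makeIterator() {
            let ran = DispatchSemaphore(value: 0)
            loop.execute { ran.signal() }   // a healthy loop runs this almost immediately
            queue.asyncAfter(deadline: .now() + tolerance) {
                if ran.wait(timeout: .now()) == .timedOut {
                    logger.warning("EventLoop \(loop) has not run a no-op task within \(tolerance)s - it may be blocked")
                }
            }
        }
        queue.asyncAfter(deadline: .now() + interval) { check() }
    }
    queue.async { check() }
}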
Thanks a lot for taking a look, @weissi! I've now managed to add …

diag-2022-01-08-14:07:25.md

Hope that helps and please let me know if there's anything else I can do/try, like a screen share or something. It's a test environment, so there's no problem poking it while running :)
@finestructure thank you. Okay, so you seem to have 68 threads, out of which 65 are stuck in …. You seem to have 2 NIO EventLoop threads (seems good), which are on the threads …. This means: from a SwiftNIO point of view, there is absolutely nothing that looks wrong that I can see. All the usual suspects look totally fine. Could you run the following commands for me from within the container whilst the problem reproduces? Just paste the full output of …
The idea here is to "prove" that at least something's responsive. The first command hits a route that doesn't exist (this should respond quickly but is handled by Vapor) and the second one should trigger an error really early on (because …). Out of interest: do we know what all these other 65 blocked threads are doing? How many …? Because what could be happening is that some other system that you use all the time (logging or something?) could be blocked on one of those threads. The third command might give us more info (it will dump all the thread names, should be part of …).
the first column (always …) …
the other threads are in …. You also seem to have a bunch of …
Following up with some additional detail raised during yesterday's debugging session:
I tested this by running locally while shutting down the db. Everything behaved as expected, i.e. pages that require the db raised 500s and the error page displayed correctly, while Markdown pages loaded from the filesystem as expected as well. So these routes should not be affected by any db related timeouts. It seems that congestion with db related activity is impacting non-db routes when we …

We were speculating that the write queries might be pushed above a 10s timeout limit due to the …

Zoomed in view, showing the query run times in more detail:

"ingest" is a process that queries the Github API for package metadata every 5mins in batches of 100 packages. "analyse" is a process that runs analysis on git repository checkouts for packages just refreshed by "ingest" in batches of 25 with a check frequency of 20s. That's why after every step change in the "ingest" curve, the "analysis" curve spikes up with 4 requests timing around 6-7s. (Since we're running batch processes, we're using a metrics pusher and it'll keep reporting the last duration until a new duration is in. That's why the lines don't drop down to 0 in between requests.)
This seems odd. How could the non-DB routes be affected if they truly don't use the DB? Could this scenario be possible: the "non-DB" routes attempt to contact the DB but they don't require it. So if the DB is shut down they work really fast because the DB immediately fails. If the DB is working normally, they obviously work too. But in the case where the DB is very backed up, they still wait for the timeout because they attempt to speak to the DB. For example, if you have something like an access log stored in the DB, something like …
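The snippet that originally illustrated this didn't survive the extraction. A hypothetical Vapor/Fluent middleware along those lines could look like the following - the model, table, and middleware names are all made up:

import Vapor
import Fluent

// Hypothetical model for an access-log table (not from the actual project).
final class AccessLogEntry: Model, @unchecked Sendable {
    static let schema = "access_log"
    @ID(key: .id) var id: UUID?
    @Field(key: "path") var path: String
    init() {}
    init(path: String) { self.path = path }
}

// A "best effort" access-log middleware: it tries to write to the DB but does
// not fail the request if the write fails. If the DB is *down*, the write fails
// immediately and the route still feels fast; if the DB is merely very slow,
// every request - even a "non-DB" one - waits for this write or its timeout.
struct AccessLogMiddleware: AsyncMiddleware {
    func respond(to request: Request, chainingTo next: AsyncResponder) async throws -> Response {
        do {
            try await AccessLogEntry(path: request.url.path).create(on: request.db)
        } catch {
            request.logger.report(error: error)   // swallow the failure on purpose
        }
        return try await next.respond(to: request)
    }
}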
obviously your real code wouldn't look like that. But maybe one of your middlewares does a best-effort DB connection? That could explain that everything is working when the DB is either working totally fine or not working at all. But in the case where the DB is just super slow, we'd see the … This should be easily visible with some logging. Paging @fabianfett, is there a way to configure PostgresNIO to log every attempt, i.e. something that'd log before you become a waiter in the connection pool?
@finestructure actually, it might be more subtle. Didn't you see only very few requests to …?
Yes, I did see some requests go through in the initial tests. During the last test (when I’d stopped the external checker) that didn’t seem to be the case, although that's not something I’ve rigorously tested. I’ll def review all routes with respect to db access, add some logging, and will re-test. Thanks for taking a look!
I'm not done investigating yet but wanted to share some interim results. I set the project up to run locally, pointing the db at our staging db, with my network configured to have a 5 second delay via Link Conditioner. Here's what I found: …
So it definitely looks like a 404 route is contacting the db - or at least something over the network. It's not clear to me where and why that would be, because as far as I can see, without any of our routes or the ErrorMiddleware handling it, this should be entirely stock Vapor behaviour, if I'm not mistaken. I'll dig into this further. In particular I'll try to differentiate if it's db or general network access that causes the delay. (Is there a way to set up an SSH tunnel with a delay? If so I could configure the db over a slow tunnel and separate that out from network traffic. Another thought I had was to set a db request breakpoint somewhere that definitely gets executed for any Postgres request. I'll need to see if I can find a good place.)

What's also clear is that our Markdown page loading does not access the db. This might be an interesting data point, because when the timeouts occur during …

Is there any chance that there's a Vapor/NIO internal resource that gets congested if db tasks are getting so slow that requests would run over the 10s timeout? I'm pretty sure you said that everything looks ok with respect to event loops etc, but I just wanted to double check if I understand that correctly.
I can only speak for NIO and there the answer is no. NIO has no idea if a network connection is HTTP or a database, or if some connection doesn't make progress, or anything. To stall NIO, you'd need to stall the EventLoop, but (by doing …) … What you could do is send me another …
Ok, I think I can at least clear up one of the contradictions: the privacy page does not lock up when the db is blocking. Those 250 curl requests complete throughout the whole of … I suspect this is what's been happening all along and I just misinterpreted the errors. My choice of blue on black for the request name certainly didn't help 😟 Sorry for the confusion!
Ah, that makes a lot of sense! Thanks for reporting back.
It's like a Sudoku: once you've figured out the first piece, things start to fall into place. The reason that http://localhost:8080/this_is_a_route_that_does_not_exist blocked is actually quite simple: it's not a route that does not exist! It's actually trying to resolve to an author page for a non-existing author, i.e. like https://staging.swiftpackageindex.com/apple Naturally, that's running a search query. I completely missed that 🤦

Next, I've started running queries via an SQL client while the db is "blocked" and found some interesting results. Most queries run fine while the db is "blocked" (and by blocked I mean some pages throwing 500s). It turns out that out of the three trivial queries we run for the home page, one is consistently blocking also in the SQL client when run against the db while a … When this one query is blocked, other queries run fine. It's unclear why a read would block like this and why it's only that particular query. To be investigated, but that certainly doesn't look like a Vapor issue.

Another odd result is the following: the page https://staging.swiftpackageindex.com/apple is driven by a single SQL query. Accessing this page during a … So there is certainly still something weird going on where a page we're building off an SQL query times out even though there's no evidence of the actual SQL query timing out or taking too long. I'll need to follow up on this as well. Maybe we're exhausting a connection limit or something.

I think at this point it's best if I close this issue and open another one in the SPI repo to track the details. It's now getting quite SPI specific. Thanks again for the help, Johannes, and feel free to follow along in the new issue :)
That's quite interesting. @fabianfett does PostgresNIO by default create a transaction and if yes, what isolation level?

@finestructure could you try (whilst the Vapor app has the SQL queries hanging) in the command line

BEGIN;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT ... -- your normal query here
COMMIT;

and see if that still executes fast?
It's almost certainly something like that. I forgot to mention that this whole issue disappears if we stop our db writer processes while dumping. That smells like something transaction or table locking related. I'll follow up as you suggest asap!
I've run that check with the transaction level set to …. SwiftPackageIndex/SwiftPackageIndex-Server#1489 has some more details. It's baffling!
Interesting new details regarding the lockup error.
I wanted to understand if the problem is in Vapor, Fluent, or perhaps NIO so I re-implemented the little example with Hummingbird and its Fluent module: https://github.com/finestructure/hummingbird-lockup-repro
I ported the example with the same API and ran make post, which sends a post request with Rester. I got a crash in the server: the index out of range is happening in the second precondition on line 77 in HBParser.swift. The string being parsed is /todos? - not entirely sure where the ? is coming from, it's not in the initial request, but let's leave that aside for now.

Because I couldn't believe Hummingbird would crash on such a simple example I tried to run a post request with the shipping example todos-fluent via Rester and ... it crashed in the same place!

However, the todos-fluent example comes with a Paw file, and running the post request from there ... works!

I inspected the difference in the requests between Paw and Rester and started eliminating some of the minor differences, but that made no change.

Instead of trying to debug this I swapped Rester out of the repro steps in favour of curl. Now the Hummingbird example worked, and the lock-up was not reproducible. Apart from the weird crash when using Rester, Hummingbird performed just fine, processed all posts, gets, and even dumps in a loop.

On a hunch, I went back to the Vapor repro example and swapped out Rester for curl and ... the lockup does not reproduce anymore. Switching back to Rester does lock it up.

So it appears that the way Rester is making requests …

Now Rester shouldn't be doing anything esoteric: it's a Swift tool triggering web requests using URLSession.dataTask. Did I (having written Rester) somehow manage to create a Swift server nuke?

But even if the problem lies with my implementation in Rester, something similar must be happening in a Vapor server, because remember: we can bring the Swift Package Index server down when running web requests via Safari and dumping the Postgres database via pg_dump. This example using Rester is purely a local way to reproduce the error.

It's unclear to me what exactly is going on but it certainly looks like there's a common root problem that affects both Vapor and Hummingbird - but in different ways.

Or maybe I'm just holding it very wrong, which is entirely possible 🤣

FYI @fabianfett @adam-fowler