Vaultwarden taking 200 seconds to time out icon requests #2118
One more log chunk (cleaned up to remove interleaved websocket debug entries), captured after deliberately deleting just a couple of icon_cache miss files and enabling debug and extended logging:
I have discovered that this issue manifests when DNS is not working or available from the Docker instance running vaultwarden. In my case, I have Pi-hole set up as a DNS resolver running in another Docker container on the same host, and due to how Docker's network bridging/masquerading works, DNS queries were successfully reaching Pi-hole but were being returned from an unexpected interface (the Docker host bridge interface rather than the external local network interface the query was sent to), which caused the DNS querier in the vaultwarden instance to silently reject the responses. Altering my Pi-hole deployment as per the below link, so that it only binds to port 53 on the external host interface, restored DNS in my vaultwarden container, and by extension "worked around" the above bug because DNS queries now succeed and the icons either come in correctly or time out quickly.

I would still like to know why this pathologically slow timeout occurs when DNS is nonfunctional; ideally vaultwarden's icon requests should time out in the expected time (10 seconds by default) regardless of cause, preventing broken DNS from resulting in a hung web interface for many minutes while all threads are blocked on queries that will never succeed. Of course, in this scenario you can avert the issue by disabling the icon lookups entirely, but this requires additional steps after container deployment (resulting in a poor initial experience that might discourage a new user from continuing with vaultwarden) and will also result in poor behaviour any time a transient DNS failure occurs, with no obvious indication to the user as to why their instance is hanging.
First, glad that you have found the solution. I briefly thought about setting a flag so that if one resolution took much longer, the rest would be skipped afterwards, but that could also cause false positives: one broken authoritative resolver would then break all of the following queries. I actually think the current behaviour is OK, since you had DNS resolving issues, which is why Vaultwarden took a long time resolving. I do not think it is the task of Vaultwarden to mitigate and prepare for broken systems or configurations; that could be endless. Since Vaultwarden seems to work as it should after you resolved your DNS issue, I do not think we should change anything on the Vaultwarden side.
It seems the timeout is currently handled by passing the configured parameter directly into the reqwest client builder, and it appears that reqwest only applies that timeout to the actual server query, not to delays caused by the implicit lookups performed beforehand. I'm not a Rust expert, but it might be possible to control the DNS timeout by using the trust-dns resolver instead of the default getaddrinfo behaviour (this is an option in reqwest from what I can see), or alternatively by running the reqwest query in an async context wrapped by tokio::time's Timeout functionality or similar, so that it can be aborted after a specified delay regardless of the step it's on. Maybe it's also possible to run these queries in one or more threads separate from the actual Rocket ones so they don't block other behaviour while they're happening, which would also permit killing the threads if they run too long (assuming operations like that aren't too expensive in Rust). I don't have enough knowledge in this domain to say how much work would be involved, so I'll leave it up to the experts whether this is worth investigating.

At least from the user's perspective, it's somewhat unexpected that setting a specific icon download timeout doesn't actually limit the query to that value, but instead to some potentially much larger value depending on the particulars of their host environment. It would be a better experience if the icon request were wrapped in such a way that it timed out after the specified number of seconds no matter the cause; it's not really important to the user whether the query fails due to DNS, a missing route, an unresponsive server, or any number of other causes. Since I now have DNS working, my own instance is no longer directly affected by this behaviour, so this is mostly just input/feedback on potential improvements at this point.

Feel free to leave this open if there's interest in investigating further, or close it if the current behaviour will be kept.
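The "wrap the whole request in a timeout" idea above can be sketched even without tokio: run the blocking fetch on a helper thread and wait on a channel with recv_timeout, so the caller gives up after the configured time no matter which internal step (DNS, connect, read) is stuck. This is an illustrative std-only sketch, not Vaultwarden's actual code; with_deadline and the simulated slow fetch are hypothetical names, and a thread::sleep stands in for a hung getaddrinfo.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

/// Run `work` on a helper thread and give up after `deadline`,
/// regardless of which internal step is slow.
fn with_deadline<T: Send + 'static>(
    deadline: Duration,
    work: impl FnOnce() -> T + Send + 'static,
) -> Option<T> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // If the receiver has already timed out, this send just fails silently.
        let _ = tx.send(work());
    });
    rx.recv_timeout(deadline).ok()
}

fn main() {
    let start = Instant::now();
    // Placeholder for a blocking icon fetch stuck in DNS resolution.
    let result = with_deadline(Duration::from_millis(100), || {
        thread::sleep(Duration::from_secs(5)); // simulated hung lookup
        "icon bytes"
    });
    assert!(result.is_none()); // caller gave up at the deadline
    assert!(start.elapsed() < Duration::from_secs(1));
    println!("gave up after {:?}", start.elapsed());
}
```

One caveat worth noting: the helper thread itself keeps running until its blocking call returns, so this bounds the caller's wait but not the resource use, which is why the async-timeout or trust-dns approaches discussed here are still the cleaner fix.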
I do agree on the user experience. Though a misconfigured host or DNS that takes a long time could be a good indicator to start from. And users also want to see those icons; if we cut off too quickly and show no icons at all, that would be a bad user experience too. I did look at trust-dns, though I also saw issues regarding its use on different platforms. Maybe they are solved by now; I haven't checked. There is an async branch of Vaultwarden which uses tokio, and a general timeout could maybe be added there. Regarding threads, that is currently already the case: Rocket starts with a default number of threads, and you can configure that. Though creating 160+ threads, as in your case, could also have been an issue if we just created threads for all requests. Maybe the default is a bit low for your system, but on low-powered devices it is probably better to be on the safe side. Just change
My point about threads was to have a small number (1-4 maybe) in a separate thread pool dedicated specifically to the icon request lookups (which are potentially much longer-running than generating a typical HTTP response for a Bitwarden client), leaving the "standard" pool of Rocket threads for serving web requests from clients as designed. The behaviour noted here is only really an issue because it lets long-running icon queries block normal web client and API requests from the Bitwarden clients; if you had a thread or two slowly churning away on (and failing) icon queries in the background, it would have effectively no impact on the user, especially since they aren't going to receive their icons anyway in this configuration, so it really doesn't matter how long it takes. As long as the threads serving the "web server" portions of vaultwarden within Rocket can also get hung up issuing their own long-running client requests (for icons in this case), it won't really matter how many threads you throw into the main Rocket thread pool: for a large enough vaultwarden database with many entries/URLs requiring icon lookups, it will just delay the problem a bit. Either a wrapped timeout mechanism, as you mention for the async branch, or a separate thread pool for icon requests would do the trick, I think.
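The dedicated-pool idea can be sketched with a few std threads sharing a job channel; a slow or failing fetch only ever ties up one of these workers, never the web-serving threads. IconPool is a hypothetical name, and the real fetch (the blocking reqwest call in Vaultwarden) is replaced here with a trivial placeholder that just reports completion.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// A tiny dedicated pool: icon jobs queue up here instead of
/// occupying the web-serving (Rocket) threads.
struct IconPool {
    jobs: mpsc::Sender<String>,
    done: mpsc::Receiver<String>,
}

impl IconPool {
    fn new(workers: usize) -> Self {
        let (jobs, job_rx) = mpsc::channel::<String>();
        let (done_tx, done) = mpsc::channel::<String>();
        let job_rx = Arc::new(Mutex::new(job_rx));
        for _ in 0..workers {
            let job_rx = Arc::clone(&job_rx);
            let done_tx = done_tx.clone();
            thread::spawn(move || loop {
                // Take the next domain; the lock is held only while receiving.
                let job = { job_rx.lock().unwrap().recv() };
                match job {
                    Ok(domain) => {
                        // Placeholder for the real (possibly slow) icon fetch;
                        // a hung DNS lookup would only block this worker.
                        let _ = done_tx.send(domain);
                    }
                    Err(_) => break, // pool dropped, shut down
                }
            });
        }
        IconPool { jobs, done }
    }

    /// Returns immediately; the fetch happens in the background.
    fn submit(&self, domain: &str) {
        let _ = self.jobs.send(domain.to_string());
    }
}

fn main() {
    let pool = IconPool::new(2); // 1-4 workers, as suggested above
    for d in ["example.com", "github.com", "rust-lang.org"] {
        pool.submit(d); // web-serving threads stay free
    }
    let mut finished: Vec<String> = (0..3).map(|_| pool.done.recv().unwrap()).collect();
    finished.sort();
    assert_eq!(finished.len(), 3);
    println!("fetched: {finished:?}");
}
```

The design point is that submit never blocks on the fetch itself, so even a database with 160+ entries queuing up failing lookups leaves the request-handling threads untouched.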
I do agree on the blocking part regarding fetching icons and the rest of the api/server.
A quick test. It looks like the
Small update: I am now willing to enable this feature, which seems to help a lot. But I am a bit hesitant about adding a separate thread pool for the icon endpoint right now. The main reason is that I do not know when @dani-garcia wants to release the async version as the main version, because that would solve the rest. I probably need to add trust-dns-resolver because we do some resolving ourselves, which would then also improve performance and reduce blocking in the end, but that needs a bit more testing.
My current plan is to make a new release, 1.24, tomorrow or during this week, which will be the last release of the old bitwardenrs images. Then potentially merge the async branch in 1.25 or 1.26. Knowing that, I don't think it makes sense to develop a specific solution for the sync version when it's not a widespread issue and it's gonna be obsolete soon. We can add the trust-dns feature if you see an improvement with it enabled, though.
That's what I thought indeed. I'll see if I have some time tomorrow or the day after to add it and change the custom DNS lookups to also use the trust-dns-resolver.
- Enabled the trust-dns feature, which seems to help a bit when DNS is causing long timeouts. Though in the blocking version it is less visible than on the async branch.
- Updated crates
- Removed some redundant code
- Updated javascript/css libraries

Resolves dani-garcia#2118
Resolves dani-garcia#2119
I have just got a new vaultwarden instance set up to test out, and things are mostly working correctly. However, I'm encountering an issue with icon downloads that substantially impacts application performance.
As far as I understand, the default icon download timeout is supposed to be 10 seconds, but my instance appears to be taking exactly 200 seconds to time out icon requests (a representative log snippet below). Since the timeout is so long, all Rocket threads get blocked and the application hangs until the process completes (which takes a long time on startup or whenever the icon cache misses expire).
I have tried manually setting the ICON_CACHE_TIMEOUT environment variable to something explicit (5 seconds) and this has no effect; requests still take 200 seconds to time out. The database is imported from a Keepass XML export and has about 160 entries. My current docker run command is below:
docker run -d --name vaultwarden -v /storage/store/docker-configs/vw-data:/data/ -p 8124:80 -p 3012:3012 -e WEBSOCKET_ENABLED=true -e ICON_DOWNLOAD_TIMEOUT=5 vaultwarden/server:latest
And I'm running this on a Fedora server using nginx as a reverse proxy to front the vaultwarden instance with Letsencrypt certificates. This part otherwise appears to be working fine, and I'm able to connect to the instance via web interface or mobile app and websockets are even working correctly.
Any idea what could be causing this suspiciously specific and overly long 200-second timeout for icon downloads?
Log snippet (note the chunk of requests at 21:48:08 followed by no activity while all 10 threads block, then another chunk at 21:50:58):