Description
So I've been noticing that every now and then some requests against ntfy.sh take 11-15s (as opposed to <1s). At first I thought it was a problem with the Linux kernel tuning variables (somaxconn, nofile, ...). Then I thought it was nginx. After randomly poking around, I found that the updateStatsAndPrune() code is likely to blame, because it appears to hold the server mutex for a very long time.
Here's what I saw:
curl -sd hi ntfy.sh/mytopic123 > /dev/null 0.00s user 0.00s system 12% cpu 0.055 total
Tue Jun 21 15:04:57 EDT 2022
curl -sd hi ntfy.sh/mytopic123 > /dev/null 0.01s user 0.01s system 23% cpu 0.062 total
Tue Jun 21 15:05:02 EDT 2022
curl -sd hi ntfy.sh/mytopic123 > /dev/null 0.01s user 0.01s system 0% cpu 11.509 total
^^^^^^
This happened even when doing it against localhost:11080 (= not through nginx), meaning DNS and nginx could be ruled out.
I briefly turned on trace logging in ntfy and saw this:
Jun 21 18:51:05 ntfy.sh ntfy[4169942]: DEBUG Manager: Pruning messages older than 2022-06-21 06:51:05
Jun 21 18:51:20 ntfy.sh ntfy[4169942]: INFO Stats: 525254 messages published, 60415 in cache, 1478 topic(s) active, 1267 subscriber(s), 6045 visitor(s), 1733 mails received (4 successful, 1729 failed), 12 mails sent (12 successful, 0 failed)
This corresponds to this block of code:
Lines 1114 to 1150 in 4e29216
Line 1083 in 4e29216
Note the timestamps, 18:51:05 and 18:51:20: that's 15 seconds to run this code, meaning that all POST/PUT requests have to wait on the lock this entire time.
This is likely relatively easy to fix; looking at the code, it is obviously pretty inefficient.