Some publish requests on ntfy.sh take up to 15 seconds #338

Closed
@binwiederhier

I've been noticing that every now and then some requests against ntfy.sh take 11-15s (as opposed to <1s). At first I thought it was a problem with the Linux kernel tuning variables (somaxconn, nofile, ...). Then I thought it was nginx. After randomly poking around, I found that the updateStatsAndPrune() code is likely to blame, because it appears to hold the server mutex for a very long time.

Here's what I saw:

curl -sd hi ntfy.sh/mytopic123 > /dev/null  0.00s user 0.00s system 12% cpu 0.055 total
Tue Jun 21 15:04:57 EDT 2022
curl -sd hi ntfy.sh/mytopic123 > /dev/null  0.01s user 0.01s system 23% cpu 0.062 total
Tue Jun 21 15:05:02 EDT 2022
curl -sd hi ntfy.sh/mytopic123 > /dev/null  0.01s user 0.01s system 0% cpu 11.509 total
                                                                           ^^^^^^

This happened even when doing it against localhost:11080 (= not through nginx), meaning DNS and nginx could be ruled out.

I briefly turned on trace logging in ntfy and saw this:

Jun 21 18:51:05 ntfy.sh ntfy[4169942]: DEBUG Manager: Pruning messages older than 2022-06-21 06:51:05
Jun 21 18:51:20 ntfy.sh ntfy[4169942]: INFO Stats: 525254 messages published, 60415 in cache, 1478 topic(s) active, 1267 subscriber(s), 6045 visitor(s), 1733 mails received (4 successful, 1729 failed), 12 mails sent (12 successful, 0 failed)

This corresponds to this block of code:

ntfy/server/server.go

Lines 1114 to 1150 in 4e29216

log.Debug("Manager: Pruning messages older than %s", olderThan.Format("2006-01-02 15:04:05"))
if err := s.messageCache.Prune(olderThan); err != nil {
	log.Warn("Manager: Error pruning cache: %s", err.Error())
}

// Prune old topics, remove subscriptions without subscribers
var subscribers, messages int
for _, t := range s.topics {
	subs := t.Subscribers()
	msgs, err := s.messageCache.MessageCount(t.ID)
	if err != nil {
		log.Warn("Manager: Cannot get stats for topic %s: %s", t.ID, err.Error())
		continue
	}
	if msgs == 0 && subs == 0 {
		delete(s.topics, t.ID)
		continue
	}
	subscribers += subs
	messages += msgs
}

// Mail stats
var receivedMailTotal, receivedMailSuccess, receivedMailFailure int64
if s.smtpServerBackend != nil {
	receivedMailTotal, receivedMailSuccess, receivedMailFailure = s.smtpServerBackend.Counts()
}
var sentMailTotal, sentMailSuccess, sentMailFailure int64
if s.smtpSender != nil {
	sentMailTotal, sentMailSuccess, sentMailFailure = s.smtpSender.Counts()
}

// Print stats
log.Info("Stats: %d messages published, %d in cache, %d topic(s) active, %d subscriber(s), %d visitor(s), %d mails received (%d successful, %d failed), %d mails sent (%d successful, %d failed)",
	s.messages, messages, len(s.topics), subscribers, len(s.visitors),
	receivedMailTotal, receivedMailSuccess, receivedMailFailure,
	sentMailTotal, sentMailSuccess, sentMailFailure)

updateStatsAndPrune() locks the global server mutex here:

s.mu.Lock()

Note the timestamps, 18:51:05 and 18:51:20: this code takes 15 seconds to run, meaning that all POST/PUT requests have to wait on the lock this entire time.

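This is likely relatively easy to fix. Looking at the code, it is clearly inefficient: the loop calls s.messageCache.MessageCount() once per topic (1478 active topics in the stats line above), apparently all while the mutex is held.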
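
One possible direction, sketched below under stated assumptions, is to hold the lock only long enough to snapshot the topics, do the slow per-topic counting without the lock, and then re-lock briefly to delete stale topics. This is not the actual ntfy fix: the `server` type, `pruneTopics`, and `slowMessageCount` are hypothetical stand-ins, and the cache is modeled as a plain map.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// server is a hypothetical stand-in for the ntfy server; only the
// fields needed for this sketch are modeled.
type server struct {
	mu     sync.Mutex
	topics map[string]int // topic ID -> subscriber count (assumption)
}

// slowMessageCount simulates a slow per-topic cache query.
func slowMessageCount(cache map[string]int, id string) int {
	time.Sleep(time.Millisecond)
	return cache[id]
}

// pruneTopics counts subscribers and messages and removes empty topics,
// holding the mutex only for the snapshot and the final deletes.
func (s *server) pruneTopics(cache map[string]int) (subscribers, messages int) {
	// 1. Short critical section: copy what we need.
	s.mu.Lock()
	snapshot := make(map[string]int, len(s.topics))
	for id, subs := range s.topics {
		snapshot[id] = subs
	}
	s.mu.Unlock()

	// 2. Slow work without the lock; publishers are not blocked here.
	var stale []string
	for id, subs := range snapshot {
		msgs := slowMessageCount(cache, id)
		if msgs == 0 && subs == 0 {
			stale = append(stale, id)
			continue
		}
		subscribers += subs
		messages += msgs
	}

	// 3. Short critical section: delete stale topics, re-checking the
	// subscriber count because it may have changed since the snapshot.
	s.mu.Lock()
	for _, id := range stale {
		if subs, ok := s.topics[id]; ok && subs == 0 {
			delete(s.topics, id)
		}
	}
	s.mu.Unlock()
	return subscribers, messages
}

func main() {
	s := &server{topics: map[string]int{"a": 2, "b": 0, "c": 0}}
	cache := map[string]int{"a": 5, "b": 1}
	subs, msgs := s.pruneTopics(cache)
	fmt.Println(subs, msgs, len(s.topics)) // prints: 2 6 2
}
```

The trade-off is that the stats are computed from a snapshot and can be slightly out of date, and a message published to a topic between steps 2 and 3 is not re-checked in this sketch. A single aggregated query for all topic counts, instead of one query per topic, would be another option to consider.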

Labels: server (relates to the main binary, server or client), 🪲 bug (something isn't working)