
Some publish requests on ntfy.sh take up to 15 seconds #338

Closed
binwiederhier opened this issue Jun 21, 2022 · 3 comments
Labels
🪲 bug: Something isn't working · server: Relates to the main binary (server or client)

Comments

@binwiederhier
Owner

binwiederhier commented Jun 21, 2022

So I've been noticing that, every now and then, some requests against ntfy.sh take 11-15s (as opposed to <1s). At first I thought it was a problem with the Linux kernel tuning variables (somaxconn, nofile, ...). Then I thought it was nginx. After randomly poking around, I found that the updateStatsAndPrune() code is likely to blame, because it appears to lock the server mutex for a very long time.

Here's what I saw:

curl -sd hi ntfy.sh/mytopic123 > /dev/null  0.00s user 0.00s system 12% cpu 0.055 total
Tue Jun 21 15:04:57 EDT 2022
curl -sd hi ntfy.sh/mytopic123 > /dev/null  0.01s user 0.01s system 23% cpu 0.062 total
Tue Jun 21 15:05:02 EDT 2022
curl -sd hi ntfy.sh/mytopic123 > /dev/null  0.01s user 0.01s system 0% cpu 11.509 total
                                                                           ^^^^^^

This happened even when doing it against localhost:11080 (= not through nginx), meaning DNS and nginx could be ruled out.

I briefly turned on trace logging in ntfy and saw this:

Jun 21 18:51:05 ntfy.sh ntfy[4169942]: DEBUG Manager: Pruning messages older than 2022-06-21 06:51:05
Jun 21 18:51:20 ntfy.sh ntfy[4169942]: INFO Stats: 525254 messages published, 60415 in cache, 1478 topic(s) active, 1267 subscriber(s), 6045 visitor(s), 1733 mails received (4 successful, 1729 failed), 12 mails sent (12 successful, 0 failed)

This corresponds to this block of code:

ntfy/server/server.go

Lines 1114 to 1150 in 4e29216

log.Debug("Manager: Pruning messages older than %s", olderThan.Format("2006-01-02 15:04:05"))
if err := s.messageCache.Prune(olderThan); err != nil {
log.Warn("Manager: Error pruning cache: %s", err.Error())
}
// Prune old topics, remove subscriptions without subscribers
var subscribers, messages int
for _, t := range s.topics {
subs := t.Subscribers()
msgs, err := s.messageCache.MessageCount(t.ID)
if err != nil {
log.Warn("Manager: Cannot get stats for topic %s: %s", t.ID, err.Error())
continue
}
if msgs == 0 && subs == 0 {
delete(s.topics, t.ID)
continue
}
subscribers += subs
messages += msgs
}
// Mail stats
var receivedMailTotal, receivedMailSuccess, receivedMailFailure int64
if s.smtpServerBackend != nil {
receivedMailTotal, receivedMailSuccess, receivedMailFailure = s.smtpServerBackend.Counts()
}
var sentMailTotal, sentMailSuccess, sentMailFailure int64
if s.smtpSender != nil {
sentMailTotal, sentMailSuccess, sentMailFailure = s.smtpSender.Counts()
}
// Print stats
log.Info("Stats: %d messages published, %d in cache, %d topic(s) active, %d subscriber(s), %d visitor(s), %d mails received (%d successful, %d failed), %d mails sent (%d successful, %d failed)",
s.messages, messages, len(s.topics), subscribers, len(s.visitors),
receivedMailTotal, receivedMailSuccess, receivedMailFailure,
sentMailTotal, sentMailSuccess, sentMailFailure)
This code runs while holding the global server mutex, which is locked here:

s.mu.Lock()

Note the timestamps, 18:51:05 and 18:51:20: that's 15 seconds to run this code, meaning that all POST/PUT requests have to wait on the lock for that entire time.

This should be relatively easy to fix; looking at the code, it is obviously quite inefficient.
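
One possible shape for a fix (a rough sketch only, not necessarily what the actual commit does): copy the topic map while holding the mutex, run the slow cache queries without the lock, and re-acquire it only briefly to delete empty topics. Field and method names mirror the snippet above; everything else is illustrative.

func (s *Server) updateStatsAndPrune() {
    // Snapshot the topics under the lock; this part is cheap
    s.mu.Lock()
    topics := make(map[string]*topic, len(s.topics))
    for id, t := range s.topics {
        topics[id] = t
    }
    s.mu.Unlock()

    // Do the slow work (cache pruning, per-topic message counts) WITHOUT
    // holding s.mu, so publish/subscribe requests are not blocked
    olderThan := time.Now().Add(-12 * time.Hour) // placeholder; the real code derives this from the cache duration config
    if err := s.messageCache.Prune(olderThan); err != nil {
        log.Warn("Manager: Error pruning cache: %s", err.Error())
    }
    var subscribers, messages int
    emptyTopics := make([]string, 0)
    for id, t := range topics {
        subs := t.Subscribers()
        msgs, err := s.messageCache.MessageCount(id)
        if err != nil {
            continue
        }
        if msgs == 0 && subs == 0 {
            emptyTopics = append(emptyTopics, id)
            continue
        }
        subscribers += subs
        messages += msgs
    }

    // Re-acquire the lock only briefly to delete topics that looked empty;
    // a real fix would re-check that they are still empty at this point
    s.mu.Lock()
    for _, id := range emptyTopics {
        delete(s.topics, id)
    }
    s.mu.Unlock()

    log.Info("Stats: %d messages in cache, %d topic(s) active, %d subscriber(s)", messages, len(topics), subscribers)
}

With a structure like this, the expensive cache queries no longer hold up every publish request.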

binwiederhier added the 🪲 bug and server labels Jun 21, 2022
binwiederhier pushed a commit that referenced this issue Jun 21, 2022
@binwiederhier
Owner Author

In trying to fix this, I encountered a horrible data race that I was not able to figure out for quite some time. It appears to happen when the Go HTTP code reads from the underlying connection while closing the request, which causes a data race with the ResponseWriter.
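
To illustrate the class of race (a minimal, hypothetical standalone example, not ntfy code): a goroutine that keeps writing to and flushing the ResponseWriter after the handler has returned races with net/http finishing the request.

package main

import (
    "net/http"
    "time"
)

func main() {
    http.HandleFunc("/sub", func(w http.ResponseWriter, r *http.Request) {
        // Stand-in for a topic.Publish subscriber callback: it may fire after
        // the handler has already returned
        go func() {
            time.Sleep(10 * time.Millisecond)
            w.Write([]byte("late message\n")) // races with net/http finalizing the response
            if f, ok := w.(http.Flusher); ok {
                f.Flush()
            }
        }()
        // The handler returns immediately; net/http then finishes the request
        // (finishRequest/chunkWriter.close) concurrently with the write above.
        // Running this under the race detector produces a report like the one below.
    })
    http.ListenAndServe(":8080", nil)
}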

Here's what the data race stack looks like (see https://github.com/binwiederhier/ntfy/runs/6994933562?check_suite_focus=true):

go test -v -race -run 'TestClient_Publish_Subscribe' ./client
=== RUN   TestClient_Publish_Subscribe
2022/06/22 13:33:35 INFO Listening on :21493[http], ntfy , log level is INFO
2022/06/22 13:33:36 INFO 127.0.0.1:21493/mytopic Connection exited
==================
WARNING: DATA RACE
Read at 0x00c0002ea0a8 by goroutine 31:
  bufio.(*Writer).Available()
      /usr/local/go/src/bufio/bufio.go:645 +0xcd
  bufio.(*Writer).WriteString()
      /usr/local/go/src/bufio/bufio.go:735 +0x18
  net/http.(*chunkWriter).close()
      /usr/local/go/src/net/http/server.go:407 +0xbd
  net/http.(*response).finishRequest()
      /usr/local/go/src/net/http/server.go:1630 +0xb2
  net/http.(*conn).serve()
      /usr/local/go/src/net/http/server.go:1972 +0xbfa
  net/http.(*Server).Serve.func3()
      /usr/local/go/src/net/http/server.go:3071 +0x58

Previous write at 0x00c0002ea0a8 by goroutine 44:
  bufio.(*Writer).Flush()
      /usr/local/go/src/bufio/bufio.go:640 +0x319
  net/http.(*chunkWriter).flush()
      /usr/local/go/src/net/http/server.go:397 +0x92
  net/http.(*response).Flush()
      /usr/local/go/src/net/http/server.go:1682 +0x78
  heckel.io/ntfy/server.(*Server).handleSubscribeHTTP.func1()
      /home/pheckel/Code/ntfy/server/server.go:815 +0x174
  heckel.io/ntfy/server.(*topic).Publish.func1()
      /home/pheckel/Code/ntfy/server/topic.go:53 +0x3fb

Goroutine 31 (running) created at:
  net/http.(*Server).Serve()
      /usr/local/go/src/net/http/server.go:3071 +0x80c
  net/http.(*Server).ListenAndServe()
      /usr/local/go/src/net/http/server.go:2968 +0xbc
  heckel.io/ntfy/server.(*Server).Run.func1()
      /home/pheckel/Code/ntfy/server/server.go:191 +0x52

Goroutine 44 (finished) created at:
  heckel.io/ntfy/server.(*topic).Publish()
      /home/pheckel/Code/ntfy/server/topic.go:46 +0x116
  heckel.io/ntfy/server.(*Server).handlePublishWithoutResponse()
      /home/pheckel/Code/ntfy/server/server.go:468 +0xa64
  heckel.io/ntfy/server.(*Server).handlePublish()
      /home/pheckel/Code/ntfy/server/server.go:495 +0x77
  heckel.io/ntfy/server.(*Server).handlePublish-fm()
      <autogenerated>:1 +0x6d
  heckel.io/ntfy/server.(*Server).withAuth.func1()
      /home/pheckel/Code/ntfy/server/server.go:1361 +0x273
  heckel.io/ntfy/server.(*Server).limitRequests.func1()
      /home/pheckel/Code/ntfy/server/server.go:1268 +0x18a
  heckel.io/ntfy/server.(*Server).handleInternal()
      /home/pheckel/Code/ntfy/server/server.go:308 +0x25d3
  heckel.io/ntfy/server.(*Server).handle()
      /home/pheckel/Code/ntfy/server/server.go:253 +0x1c4
  heckel.io/ntfy/server.(*Server).handle-fm()
      <autogenerated>:1 +0x57
  net/http.HandlerFunc.ServeHTTP()
      /usr/local/go/src/net/http/server.go:2084 +0x4d
  net/http.(*ServeMux).ServeHTTP()
      /usr/local/go/src/net/http/server.go:2462 +0xc5
  net/http.serverHandler.ServeHTTP()
      /usr/local/go/src/net/http/server.go:2916 +0x896
  net/http.(*conn).serve()
      /usr/local/go/src/net/http/server.go:1966 +0xbaa
  net/http.(*Server).Serve.func3()
      /usr/local/go/src/net/http/server.go:3071 +0x58
==================
    testing.go:1312: race detected during execution of test
--- FAIL: TestClient_Publish_Subscribe (1.48s)
=== CONT  
    testing.go:1312: race detected during execution of test
FAIL
FAIL	heckel.io/ntfy/client	1.532s
FAIL

This stack in particular indicates that something inside the Go stdlib is reading from the underlying connection/socket:

Read at 0x00c0002ea0a8 by goroutine 31:
  bufio.(*Writer).Available()
      /usr/local/go/src/bufio/bufio.go:645 +0xcd
  bufio.(*Writer).WriteString()
      /usr/local/go/src/bufio/bufio.go:735 +0x18
  net/http.(*chunkWriter).close()
      /usr/local/go/src/net/http/server.go:407 +0xbd
  net/http.(*response).finishRequest()
      /usr/local/go/src/net/http/server.go:1630 +0xb2
  net/http.(*conn).serve()
      /usr/local/go/src/net/http/server.go:1972 +0xbfa
  net/http.(*Server).Serve.func3()
      /usr/local/go/src/net/http/server.go:3071 +0x58

The w ResponseWriter is protected by a wlock sync.Mutex, but that lock is obviously not used by the standard library, which is why the race occurs. I have not been able to figure out a proper fix, other than calling wlock.TryLock() (or Lock()) before the HTTP handler function exits. That is a hack, but it's harmless, because the lock is freed and irrelevant anyway once the function exits.
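
Roughly, the workaround looks like this (a simplified, self-contained sketch of the pattern just described, not the actual change in handleSubscribeHTTP): every write to the ResponseWriter goes through wlock, and the handler takes wlock once more right before it returns.

package main

import (
    "net/http"
    "sync"
    "time"
)

func main() {
    http.HandleFunc("/sub", func(w http.ResponseWriter, r *http.Request) {
        var wlock sync.Mutex // guards every write/flush to w
        writeMessage := func(msg string) {
            wlock.Lock()
            defer wlock.Unlock()
            w.Write([]byte(msg + "\n"))
            if f, ok := w.(http.Flusher); ok {
                f.Flush()
            }
        }
        go writeMessage("hi")             // stand-in for a subscriber callback from topic.Publish
        time.Sleep(50 * time.Millisecond) // stand-in for waiting until the subscription ends
        // The "harmless hack": acquire the write lock before the handler exits.
        // Once the subscriber has been removed from the topic (not shown here),
        // no new callbacks can start, so this only waits for a write that is
        // already in flight. The lock is never released, which is fine because
        // it goes out of scope when the handler returns anyway.
        wlock.Lock()
    })
    http.ListenAndServe(":8080", nil)
}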

binwiederhier added a commit that referenced this issue Jun 22, 2022
@binwiederhier
Owner Author

Hopefully fixed; this will be included in the next server release.

@binwiederhier
Owner Author

binwiederhier commented Jun 23, 2022

I wrote this test to see whether it fixed the issue and how much performance has improved.
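
For reference, the shape of such a test might look roughly like the sketch below (a hypothetical reconstruction based on the log output that follows, not the actual server_test.go code; newTestServer, newTestConfig, and publishTestMessage are placeholder helpers):

func TestServer_PublishWhileUpdatingStatsWithLotsOfMessages(t *testing.T) {
    s := newTestServer(t, newTestConfig(t)) // placeholder helpers
    // Fill the cache with 50,000 messages across 50,000 topics
    for i := 0; i < 50000; i++ {
        publishTestMessage(t, s, fmt.Sprintf("topic%d", i), "some message") // placeholder helper
    }
    // Kick off the stats/prune run in the background, like the manager does periodically
    go s.updateStatsAndPrune()
    // Publish one more message while stats are being updated; with the fix in
    // place this should take a few milliseconds instead of ~1 second
    start := time.Now()
    publishTestMessage(t, s, "one-more-topic", "hi")
    if d := time.Since(start); d > 500*time.Millisecond {
        t.Fatalf("publishing while updating stats took %v, expected a few milliseconds", d)
    }
}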

Before (note: Publishing message; took 948ms, ❗):

2022/06/22 20:29:06 Adding 50000 messages
2022/06/22 20:29:07 Done: Adding 50000 messages; took 1.068s
2022/06/22 20:29:07 Updating stats
2022/06/22 20:29:07 Publishing message
2022/06/22 20:29:08 INFO Stats: 0 messages published, 50000 in cache, 50000 topic(s) active, 0 subscriber(s), 0 visitor(s), 0 mails received (0 successful, 0 failed), 0 mails sent (0 successful, 0 failed)
2022/06/22 20:29:08 Done: Updating stats; took 984ms
    server_test.go:1393: 
        	Error Trace:	server_test.go:1393
        	Error:      	Should be true
        	Test:       	TestServer_PublishWhileUpdatingStatsWithLotsOfMessages
2022/06/22 20:29:08 Done: Publishing message; took 948ms
2022/06/22 20:29:08 Done: Waiting for all locks
--- FAIL: TestServer_PublishWhileUpdatingStatsWithLotsOfMessages (2.09s)

After (note: Publishing message; took 3ms ❗):

go test -v  -run 'TestServer_PublishWhileUpdatingStats' ./server
=== RUN   TestServer_PublishWhileUpdatingStatsWithLotsOfMessages
2022/06/22 20:30:34 Adding 50000 messages
2022/06/22 20:30:35 Done: Adding 50000 messages; took 843ms
2022/06/22 20:30:35 Updating stats
2022/06/22 20:30:35 Publishing message
2022/06/22 20:30:35 Done: Publishing message; took 3ms
2022/06/22 20:30:35 INFO Stats: 1 messages published, 50000 in cache, 50000 topic(s) active, 0 subscriber(s), 1 visitor(s), 0 mails received (0 successful, 0 failed), 0 mails sent (0 successful, 0 failed)
2022/06/22 20:30:35 Done: Updating stats; took 70ms
2022/06/22 20:30:35 Done: Waiting for all locks
--- PASS: TestServer_PublishWhileUpdatingStatsWithLotsOfMessages (0.93s)
PASS
ok  	heckel.io/ntfy/server	0.945s
