
Some publish requests on ntfy.sh take up to 15 seconds #338

Closed
binwiederhier opened this issue Jun 21, 2022 · 3 comments
Labels
🪲 bug: Something isn't working · server: Relates to the main binary (server or client)

Comments

@binwiederhier
Owner

binwiederhier commented Jun 21, 2022

So I've been noticing that, every now and then, some requests against ntfy.sh take 11-15s (as opposed to <1s). At first I thought it was a problem with the Linux kernel tuning variables (somaxconn, nofile, ...). Then I thought it was nginx. After randomly poking around, I found that the updateStatsAndPrune() code is likely to blame, because it appears to lock the server mutex for a very long time.

Here's what I saw:

curl -sd hi ntfy.sh/mytopic123 > /dev/null  0.00s user 0.00s system 12% cpu 0.055 total
Tue Jun 21 15:04:57 EDT 2022
curl -sd hi ntfy.sh/mytopic123 > /dev/null  0.01s user 0.01s system 23% cpu 0.062 total
Tue Jun 21 15:05:02 EDT 2022
curl -sd hi ntfy.sh/mytopic123 > /dev/null  0.01s user 0.01s system 0% cpu 11.509 total
                                                                           ^^^^^^

This happened even when doing it against localhost:11080 (= not through nginx), meaning DNS and nginx could be ruled out.

I briefly turned on trace logging in ntfy and saw this:

Jun 21 18:51:05 ntfy.sh ntfy[4169942]: DEBUG Manager: Pruning messages older than 2022-06-21 06:51:05
Jun 21 18:51:20 ntfy.sh ntfy[4169942]: INFO Stats: 525254 messages published, 60415 in cache, 1478 topic(s) active, 1267 subscriber(s), 6045 visitor(s), 1733 mails received (4 successful, 1729 failed), 12 mails sent (12 successful, 0 failed)

This corresponds to this block of code:

ntfy/server/server.go

Lines 1114 to 1150 in 4e29216

log.Debug("Manager: Pruning messages older than %s", olderThan.Format("2006-01-02 15:04:05"))
if err := s.messageCache.Prune(olderThan); err != nil {
log.Warn("Manager: Error pruning cache: %s", err.Error())
}
// Prune old topics, remove subscriptions without subscribers
var subscribers, messages int
for _, t := range s.topics {
subs := t.Subscribers()
msgs, err := s.messageCache.MessageCount(t.ID)
if err != nil {
log.Warn("Manager: Cannot get stats for topic %s: %s", t.ID, err.Error())
continue
}
if msgs == 0 && subs == 0 {
delete(s.topics, t.ID)
continue
}
subscribers += subs
messages += msgs
}
// Mail stats
var receivedMailTotal, receivedMailSuccess, receivedMailFailure int64
if s.smtpServerBackend != nil {
receivedMailTotal, receivedMailSuccess, receivedMailFailure = s.smtpServerBackend.Counts()
}
var sentMailTotal, sentMailSuccess, sentMailFailure int64
if s.smtpSender != nil {
sentMailTotal, sentMailSuccess, sentMailFailure = s.smtpSender.Counts()
}
// Print stats
log.Info("Stats: %d messages published, %d in cache, %d topic(s) active, %d subscriber(s), %d visitor(s), %d mails received (%d successful, %d failed), %d mails sent (%d successful, %d failed)",
s.messages, messages, len(s.topics), subscribers, len(s.visitors),
receivedMailTotal, receivedMailSuccess, receivedMailFailure,
sentMailTotal, sentMailSuccess, sentMailFailure)
This code runs while holding the global server mutex, which is locked here:

s.mu.Lock()

Note the timestamps, 18:51:05 and 18:51:20: that's 15 seconds to run this code, meaning that all POST/PUT requests have to wait on the lock for that entire time.

This should be relatively easy to fix; looking at the code, it is obviously quite inefficient.
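
One possible shape for a fix (a rough sketch only, not necessarily what the actual commit does): copy the topic map while holding the mutex, run the slow cache queries without the lock, and re-acquire it only briefly to delete empty topics. Field and method names mirror the snippet above; everything else is illustrative.

func (s *Server) updateStatsAndPrune() {
    // Snapshot the topics under the lock; this part is cheap
    s.mu.Lock()
    topics := make(map[string]*topic, len(s.topics))
    for id, t := range s.topics {
        topics[id] = t
    }
    s.mu.Unlock()

    // Do the slow work (cache pruning, per-topic message counts) WITHOUT
    // holding s.mu, so publish/subscribe requests are not blocked
    olderThan := time.Now().Add(-12 * time.Hour) // placeholder; the real code derives this from the cache duration config
    if err := s.messageCache.Prune(olderThan); err != nil {
        log.Warn("Manager: Error pruning cache: %s", err.Error())
    }
    var subscribers, messages int
    emptyTopics := make([]string, 0)
    for id, t := range topics {
        subs := t.Subscribers()
        msgs, err := s.messageCache.MessageCount(id)
        if err != nil {
            continue
        }
        if msgs == 0 && subs == 0 {
            emptyTopics = append(emptyTopics, id)
            continue
        }
        subscribers += subs
        messages += msgs
    }

    // Re-acquire the lock only briefly to delete topics that looked empty;
    // a real fix would re-check that they are still empty at this point
    s.mu.Lock()
    for _, id := range emptyTopics {
        delete(s.topics, id)
    }
    s.mu.Unlock()

    log.Info("Stats: %d messages in cache, %d topic(s) active, %d subscriber(s)", messages, len(topics), subscribers)
}

With a structure like this, the expensive cache queries no longer hold up every publish request.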

binwiederhier added the 🪲 bug and server labels Jun 21, 2022
binwiederhier pushed a commit that referenced this issue Jun 21, 2022
@binwiederhier
Owner Author

In trying to fix this, I encountered a horrible data race that I was not able to figure out for quite some time. It appears to happen when the Go HTTP code reads from the underlying connection while closing the request, which causes a data race with the ResponseWriter.
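
To illustrate the class of race (a minimal, hypothetical standalone example, not ntfy code): a goroutine that keeps writing to and flushing the ResponseWriter after the handler has returned races with net/http finishing the request.

package main

import (
    "net/http"
    "time"
)

func main() {
    http.HandleFunc("/sub", func(w http.ResponseWriter, r *http.Request) {
        // Stand-in for a topic.Publish subscriber callback: it may fire after
        // the handler has already returned
        go func() {
            time.Sleep(10 * time.Millisecond)
            w.Write([]byte("late message\n")) // races with net/http finalizing the response
            if f, ok := w.(http.Flusher); ok {
                f.Flush()
            }
        }()
        // The handler returns immediately; net/http then finishes the request
        // (finishRequest/chunkWriter.close) concurrently with the write above.
        // Running this under the race detector produces a report like the one below.
    })
    http.ListenAndServe(":8080", nil)
}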

Here's what the data race stack looks like (see https://github.com/binwiederhier/ntfy/runs/6994933562?check_suite_focus=true):

go test -v -race -run 'TestClient_Publish_Subscribe' ./client
=== RUN   TestClient_Publish_Subscribe
2022/06/22 13:33:35 INFO Listening on :21493[http], ntfy , log level is INFO
2022/06/22 13:33:36 INFO 127.0.0.1:21493/mytopic Connection exited
==================
WARNING: DATA RACE
Read at 0x00c0002ea0a8 by goroutine 31:
  bufio.(*Writer).Available()
      /usr/local/go/src/bufio/bufio.go:645 +0xcd
  bufio.(*Writer).WriteString()
      /usr/local/go/src/bufio/bufio.go:735 +0x18
  net/http.(*chunkWriter).close()
      /usr/local/go/src/net/http/server.go:407 +0xbd
  net/http.(*response).finishRequest()
      /usr/local/go/src/net/http/server.go:1630 +0xb2
  net/http.(*conn).serve()
      /usr/local/go/src/net/http/server.go:1972 +0xbfa
  net/http.(*Server).Serve.func3()
      /usr/local/go/src/net/http/server.go:3071 +0x58

Previous write at 0x00c0002ea0a8 by goroutine 44:
  bufio.(*Writer).Flush()
      /usr/local/go/src/bufio/bufio.go:640 +0x319
  net/http.(*chunkWriter).flush()
      /usr/local/go/src/net/http/server.go:397 +0x92
  net/http.(*response).Flush()
      /usr/local/go/src/net/http/server.go:1682 +0x78
  heckel.io/ntfy/server.(*Server).handleSubscribeHTTP.func1()
      /home/pheckel/Code/ntfy/server/server.go:815 +0x174
  heckel.io/ntfy/server.(*topic).Publish.func1()
      /home/pheckel/Code/ntfy/server/topic.go:53 +0x3fb

Goroutine 31 (running) created at:
  net/http.(*Server).Serve()
      /usr/local/go/src/net/http/server.go:3071 +0x80c
  net/http.(*Server).ListenAndServe()
      /usr/local/go/src/net/http/server.go:2968 +0xbc
  heckel.io/ntfy/server.(*Server).Run.func1()
      /home/pheckel/Code/ntfy/server/server.go:191 +0x52

Goroutine 44 (finished) created at:
  heckel.io/ntfy/server.(*topic).Publish()
      /home/pheckel/Code/ntfy/server/topic.go:46 +0x116
  heckel.io/ntfy/server.(*Server).handlePublishWithoutResponse()
      /home/pheckel/Code/ntfy/server/server.go:468 +0xa64
  heckel.io/ntfy/server.(*Server).handlePublish()
      /home/pheckel/Code/ntfy/server/server.go:495 +0x77
  heckel.io/ntfy/server.(*Server).handlePublish-fm()
      <autogenerated>:1 +0x6d
  heckel.io/ntfy/server.(*Server).withAuth.func1()
      /home/pheckel/Code/ntfy/server/server.go:1361 +0x273
  heckel.io/ntfy/server.(*Server).limitRequests.func1()
      /home/pheckel/Code/ntfy/server/server.go:1268 +0x18a
  heckel.io/ntfy/server.(*Server).handleInternal()
      /home/pheckel/Code/ntfy/server/server.go:308 +0x25d3
  heckel.io/ntfy/server.(*Server).handle()
      /home/pheckel/Code/ntfy/server/server.go:253 +0x1c4
  heckel.io/ntfy/server.(*Server).handle-fm()
      <autogenerated>:1 +0x57
  net/http.HandlerFunc.ServeHTTP()
      /usr/local/go/src/net/http/server.go:2084 +0x4d
  net/http.(*ServeMux).ServeHTTP()
      /usr/local/go/src/net/http/server.go:2462 +0xc5
  net/http.serverHandler.ServeHTTP()
      /usr/local/go/src/net/http/server.go:2916 +0x896
  net/http.(*conn).serve()
      /usr/local/go/src/net/http/server.go:1966 +0xbaa
  net/http.(*Server).Serve.func3()
      /usr/local/go/src/net/http/server.go:3071 +0x58
==================
    testing.go:1312: race detected during execution of test
--- FAIL: TestClient_Publish_Subscribe (1.48s)
=== CONT  
    testing.go:1312: race detected during execution of test
FAIL
FAIL	heckel.io/ntfy/client	1.532s
FAIL

This stack in particular indicates that something inside the Go stdlib is reading from the underlying connection/socket:

Read at 0x00c0002ea0a8 by goroutine 31:
  bufio.(*Writer).Available()
      /usr/local/go/src/bufio/bufio.go:645 +0xcd
  bufio.(*Writer).WriteString()
      /usr/local/go/src/bufio/bufio.go:735 +0x18
  net/http.(*chunkWriter).close()
      /usr/local/go/src/net/http/server.go:407 +0xbd
  net/http.(*response).finishRequest()
      /usr/local/go/src/net/http/server.go:1630 +0xb2
  net/http.(*conn).serve()
      /usr/local/go/src/net/http/server.go:1972 +0xbfa
  net/http.(*Server).Serve.func3()
      /usr/local/go/src/net/http/server.go:3071 +0x58

The w ResponseWriter is protected by a wlock sync.Mutex, but that lock is obviously not used by the standard library, which is why the race occurs. I have not been able to figure out a proper fix, other than calling wlock.TryLock() (or Lock()) before the HTTP handler function exits. That is a hack, but it's harmless, because the lock is freed and irrelevant anyway once the function exits.
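
Roughly, the workaround looks like this (a simplified, self-contained sketch of the pattern just described, not the actual change in handleSubscribeHTTP): every write to the ResponseWriter goes through wlock, and the handler takes wlock once more right before it returns.

package main

import (
    "net/http"
    "sync"
    "time"
)

func main() {
    http.HandleFunc("/sub", func(w http.ResponseWriter, r *http.Request) {
        var wlock sync.Mutex // guards every write/flush to w
        writeMessage := func(msg string) {
            wlock.Lock()
            defer wlock.Unlock()
            w.Write([]byte(msg + "\n"))
            if f, ok := w.(http.Flusher); ok {
                f.Flush()
            }
        }
        go writeMessage("hi")             // stand-in for a subscriber callback from topic.Publish
        time.Sleep(50 * time.Millisecond) // stand-in for waiting until the subscription ends
        // The "harmless hack": acquire the write lock before the handler exits.
        // Once the subscriber has been removed from the topic (not shown here),
        // no new callbacks can start, so this only waits for a write that is
        // already in flight. The lock is never released, which is fine because
        // it goes out of scope when the handler returns anyway.
        wlock.Lock()
    })
    http.ListenAndServe(":8080", nil)
}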

binwiederhier added a commit that referenced this issue Jun 22, 2022
@binwiederhier
Owner Author

Hopefully fixed; this will be included in the next server release.

@binwiederhier
Owner Author

binwiederhier commented Jun 23, 2022

I wrote this test to see whether it fixed the issue and how much performance has improved.
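
For reference, the shape of such a test might look roughly like the sketch below (a hypothetical reconstruction based on the log output that follows, not the actual server_test.go code; newTestServer, newTestConfig, and publishTestMessage are placeholder helpers):

func TestServer_PublishWhileUpdatingStatsWithLotsOfMessages(t *testing.T) {
    s := newTestServer(t, newTestConfig(t)) // placeholder helpers
    // Fill the cache with 50,000 messages across 50,000 topics
    for i := 0; i < 50000; i++ {
        publishTestMessage(t, s, fmt.Sprintf("topic%d", i), "some message") // placeholder helper
    }
    // Kick off the stats/prune run in the background, like the manager does periodically
    go s.updateStatsAndPrune()
    // Publish one more message while stats are being updated; with the fix in
    // place this should take a few milliseconds instead of ~1 second
    start := time.Now()
    publishTestMessage(t, s, "one-more-topic", "hi")
    if d := time.Since(start); d > 500*time.Millisecond {
        t.Fatalf("publishing while updating stats took %v, expected a few milliseconds", d)
    }
}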

Before (note: Publishing message; took 948ms, ❗):

2022/06/22 20:29:06 Adding 50000 messages
2022/06/22 20:29:07 Done: Adding 50000 messages; took 1.068s
2022/06/22 20:29:07 Updating stats
2022/06/22 20:29:07 Publishing message
2022/06/22 20:29:08 INFO Stats: 0 messages published, 50000 in cache, 50000 topic(s) active, 0 subscriber(s), 0 visitor(s), 0 mails received (0 successful, 0 failed), 0 mails sent (0 successful, 0 failed)
2022/06/22 20:29:08 Done: Updating stats; took 984ms
    server_test.go:1393: 
        	Error Trace:	server_test.go:1393
        	Error:      	Should be true
        	Test:       	TestServer_PublishWhileUpdatingStatsWithLotsOfMessages
2022/06/22 20:29:08 Done: Publishing message; took 948ms
2022/06/22 20:29:08 Done: Waiting for all locks
--- FAIL: TestServer_PublishWhileUpdatingStatsWithLotsOfMessages (2.09s)

After (note: Publishing message; took 3ms ❗):

go test -v  -run 'TestServer_PublishWhileUpdatingStats' ./server
=== RUN   TestServer_PublishWhileUpdatingStatsWithLotsOfMessages
2022/06/22 20:30:34 Adding 50000 messages
2022/06/22 20:30:35 Done: Adding 50000 messages; took 843ms
2022/06/22 20:30:35 Updating stats
2022/06/22 20:30:35 Publishing message
2022/06/22 20:30:35 Done: Publishing message; took 3ms
2022/06/22 20:30:35 INFO Stats: 1 messages published, 50000 in cache, 50000 topic(s) active, 0 subscriber(s), 1 visitor(s), 0 mails received (0 successful, 0 failed), 0 mails sent (0 successful, 0 failed)
2022/06/22 20:30:35 Done: Updating stats; took 70ms
2022/06/22 20:30:35 Done: Waiting for all locks
--- PASS: TestServer_PublishWhileUpdatingStatsWithLotsOfMessages (0.93s)
PASS
ok  	heckel.io/ntfy/server	0.945s
