
HTTP 499 for polling iOS clients #677

Closed
binwiederhier opened this issue Mar 21, 2023 · 9 comments
Labels
🪲 bug Something isn't working server Relates to the main binary (server or client)

Comments

binwiederhier (Owner) commented Mar 21, 2023
root@ntfy:/var/log# cat nginx/access.log.1 nginx/access_error.log.1 | grep 20/Mar/2023:14:46: | grep iOS | grep " 200 "|wc -l
44

root@ntfy:/var/log# cat nginx/access.log.1 nginx/access_error.log.1 | grep 20/Mar/2023:14:46: | grep iOS | grep " 499 "|wc -l
423
binwiederhier added the 🪲 bug (Something isn't working) and server (Relates to the main binary (server or client)) labels on Mar 21, 2023
binwiederhier (Owner) commented

(screenshot attached)

binwiederhier (Owner) commented

{
  "time": "2023-03-21T02:26:50.137Z",
  "level": "INFO",
  "message": "Connection closed with HTTP 500 (ntfy error 50001)",
  "error": "write unix /tmp/ntfy.sock->@: write: broken pipe",
  "http_method": "GET",
  "http_path": "/xxx/json?poll=1&since=HtLxH3Qm76CO",
  "tag": "http",
  "visitor_auth_limiter_limit": 0.016666666666666666,
  "visitor_auth_limiter_tokens": 30,
  "visitor_emails": 0,
  "visitor_emails_limit": 10,
  "visitor_emails_remaining": 10,
  "visitor_messages": 14,
  "visitor_messages_limit": 750,
  "visitor_messages_remaining": 736,
  "visitor_request_limiter_limit": 0.2,
  "visitor_request_limiter_tokens": 59.5017925624,
  "visitor_seen": "2023-03-21T02:26:47.628Z"
}

binwiederhier (Owner) commented Mar 23, 2023

This can't possibly be a timeout issue: the GET start and end are only 2 seconds apart. That's still long, but it shouldn't cause a client-side timeout. Sadly, it looks like I cannot wireshark/tshark on Unix sockets, so I'm not sure I can see what's happening on the wire. 🤔

nginx:

root@ntfy:/var/log# tail -n 1000 nginx/access_error.log|grep topicname
x.x.x.x - - [23/Mar/2023:01:27:41 +0000] "GET /topicname/json?poll=1&since=GOhOM5fKP8Df HTTP/2.0" 499 0 "-" "ntfy/1.2 (build 3; iOS 16.3.1)" "TLSv1.3" "TLS_AES_128_GCM_SHA256"

ntfy:

root@ntfy:/var/log# tail -n 1000 ntfy.log|jq 'select(.visitor_ip == "x.x.x.x")'
{
  "time": "2023-03-23T01:27:40.392Z",
  "level": "DEBUG",
  "message": "HTTP stream connection opened",
  "http_method": "GET",
  "http_path": "/topicname/json?poll=1&since=GOhOM5fKP8Df",
  "tag": "subscribe",
  "visitor_auth_limiter_limit": 0.016666666666666666,
  "visitor_auth_limiter_tokens": 30,
  "visitor_emails": 0,
  "visitor_emails_limit": 10,
  "visitor_emails_remaining": 10,
  "visitor_id": "ip:x.x.x.x",
  "visitor_ip": "x.x.x.x",
  "visitor_messages": 281,
  "visitor_messages_limit": 750,
  "visitor_messages_remaining": 469,
  "visitor_request_limiter_limit": 0.2,
  "visitor_request_limiter_tokens": 58.000108232,
  "visitor_seen": "2023-03-23T01:27:40.392Z"
}
{
  "time": "2023-03-23T01:27:40.595Z",
  "level": "DEBUG",
  "message": "HTTP stream connection closed",
  "http_method": "GET",
  "http_path": "/topicname/json?poll=1&since=GOhOM5fKP8Df",
  "tag": "subscribe",
  "visitor_auth_limiter_limit": 0.016666666666666666,
  "visitor_auth_limiter_tokens": 30,
  "visitor_emails": 0,
  "visitor_emails_limit": 10,
  "visitor_emails_remaining": 10,
  "visitor_id": "ip:x.x.x.x",
  "visitor_ip": "x.x.x.x",
  "visitor_messages": 281,
  "visitor_messages_limit": 750,
  "visitor_messages_remaining": 469,
  "visitor_request_limiter_limit": 0.2,
  "visitor_request_limiter_tokens": 58.4954717692,
  "visitor_seen": "2023-03-23T01:27:40.392Z"
}
{
  "time": "2023-03-23T01:27:42.869Z",
  "level": "INFO",
  "message": "Connection closed with HTTP 500 (ntfy error 50001)",
  "error": "write unix /tmp/ntfy.sock->@: write: broken pipe",
  "http_method": "GET",
  "http_path": "/topicname/json?poll=1&since=GOhOM5fKP8Df",
  "tag": "http",
  "visitor_auth_limiter_limit": 0.016666666666666666,
  "visitor_auth_limiter_tokens": 30,
  "visitor_emails": 0,
  "visitor_emails_limit": 10,
  "visitor_emails_remaining": 10,
  "visitor_id": "ip:x.x.x.x",
  "visitor_ip": "x.x.x.x",
  "visitor_messages": 281,
  "visitor_messages_limit": 750,
  "visitor_messages_remaining": 469,
  "visitor_request_limiter_limit": 0.2,
  "visitor_request_limiter_tokens": 58.495496099,
  "visitor_seen": "2023-03-23T01:27:40.392Z"
}
{
  "time": "2023-03-23T01:27:40.392Z",
  "level": "DEBUG",
  "message": "HTTP request finished",
  "http_method": "GET",
  "http_path": "/topicname/json?poll=1&since=GOhOM5fKP8Df",
  "tag": "http",
  "time_taken_ms": 2477,
  "visitor_auth_limiter_limit": 0.016666666666666666,
  "visitor_auth_limiter_tokens": 30,
  "visitor_emails": 0,
  "visitor_emails_limit": 10,
  "visitor_emails_remaining": 10,
  "visitor_id": "ip:x.x.x.x",
  "visitor_ip": "x.x.x.x",
  "visitor_messages": 281,
  "visitor_messages_limit": 750,
  "visitor_messages_remaining": 469,
  "visitor_request_limiter_limit": 0.2,
  "visitor_request_limiter_tokens": 58.4955154748,
  "visitor_seen": "2023-03-23T01:27:40.392Z"
}
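The "write unix /tmp/ntfy.sock->@: write: broken pipe" error is consistent with the client (or nginx, once the client aborts) closing the connection while the backend is still responding. This is not ntfy's actual code, just a minimal, self-contained Go sketch showing the general mechanism: a slow handler can observe the disconnect through the request context, which is canceled the moment the client hangs up.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// pollOutcome spins up a deliberately slow handler and an impatient
// client, and reports what the handler observed. The names here are
// illustrative, not from ntfy.
func pollOutcome() string {
	result := make(chan string, 1)
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// r.Context() is canceled as soon as the client disconnects.
		select {
		case <-r.Context().Done():
			result <- "client hung up"
		case <-time.After(5 * time.Second):
			result <- "request finished"
		}
	}))
	defer srv.Close()

	// An impatient client, like an iOS poller giving up early.
	ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
	defer cancel()
	req, _ := http.NewRequestWithContext(ctx, http.MethodGet, srv.URL, nil)
	http.DefaultClient.Do(req) // fails with a context deadline error
	return <-result
}

func main() {
	fmt.Println(pollOutcome())
}
```

Any write the handler attempts after that point fails with exactly this kind of broken-pipe error.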

binwiederhier (Owner) commented

I couldn't figure out how to intercept the traffic without restarting or messing with nginx. I am seeing evidence that this is not related to iOS at all, though, and is more likely about the thundering herd: it's happening to the web app and Android clients as well on occasion. I am too tired now, but I will look more tomorrow.

binwiederhier (Owner) commented

Hmm. Ok, I can reproduce it on a topic that has a lot of messages, by just doing:

curl "https://ntfy.sh/topicname/json?since=all"
^C (quickly, before the output is downloaded/written)

Not surprising, but I am wondering if the non-iOS ones we're seeing are just these cases...
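The curl-plus-Ctrl-C reproduction above can be sketched in a few lines of Go (again, nothing ntfy-specific; the handler is a stand-in for a `since=all` stream with many messages): the server keeps writing chunks, the client reads a little and hangs up, and a server-side write eventually fails, which is what nginx reports as 499.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// abortedStreamWriteFails streams many small JSON lines while the
// client hangs up mid-download (the Ctrl-C in the curl example), and
// reports whether a server-side write eventually failed.
func abortedStreamWriteFails() bool {
	errCh := make(chan error, 1)
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		flusher := w.(http.Flusher)
		var err error
		for i := 0; i < 500 && err == nil; i++ { // many messages, like since=all
			_, err = fmt.Fprintf(w, "{\"id\":%d}\n", i)
			flusher.Flush()
			time.Sleep(5 * time.Millisecond)
		}
		errCh <- err
	}))
	defer srv.Close()

	resp, err := http.Get(srv.URL)
	if err != nil {
		return false
	}
	buf := make([]byte, 64)
	resp.Body.Read(buf) // read a little of the output...
	resp.Body.Close()   // ...then hang up before the rest arrives
	return <-errCh != nil
}

func main() {
	fmt.Println("server write failed:", abortedStreamWriteFails())
}
```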

binwiederhier (Owner) commented Mar 23, 2023

More data:

As we already know, it happens exactly every 20 minutes, almost to the second:

root@ntfy:/var/log/nginx# cat access_error.log.1 access_error.log | grep " 499 " | awk '{ print $4 }' | awk -F: '{ print $1 ":" $2 ":" $3 }' | sort | cut -c 1-18 | uniq -c| tail -n 20
    442 [23/Mar/2023:12:07
     11 [23/Mar/2023:12:08
      1 [23/Mar/2023:12:09
      2 [23/Mar/2023:12:17
    452 [23/Mar/2023:12:27
     10 [23/Mar/2023:12:28
      1 [23/Mar/2023:12:30
      1 [23/Mar/2023:12:35
      1 [23/Mar/2023:12:36
      1 [23/Mar/2023:12:42
    422 [23/Mar/2023:12:47
     21 [23/Mar/2023:12:48
      1 [23/Mar/2023:12:49
      3 [23/Mar/2023:12:53
      2 [23/Mar/2023:12:56
      2 [23/Mar/2023:13:00
    426 [23/Mar/2023:13:07
     14 [23/Mar/2023:13:08
      2 [23/Mar/2023:13:09
      7 [23/Mar/2023:13:13

root@ntfy:/var/log/nginx# cat access_error.log.1 access_error.log | grep " 499 " | awk '{ print $4 }' | awk -F: '{ print $1 ":" $2 ":" $3 ":" $4 }' | sort | cut -c 1-21 | uniq -c| tail -n 20
      2 [23/Mar/2023:13:00:14
      6 [23/Mar/2023:13:07:54
    140 [23/Mar/2023:13:07:55
    202 [23/Mar/2023:13:07:56
     65 [23/Mar/2023:13:07:57
      4 [23/Mar/2023:13:07:58

It happens pretty much exclusively on iOS (ntfy/1.2), but that could be a false hint, since they all connect at the same time:

root@ntfy:/var/log/nginx# cat access_error.log|grep " 499 "| awk '{ print $12 }' | sort | uniq -c |sort -nr
  16503 "ntfy/1.2
    119 "Mozilla/5.0
     50 "ntfy/1.16.0
      6 "Jetty/9.4.46.v20220331"
      4 "Dalvik/2.1.0
      3 "python-requests/2.28.2"
      2 "ntfy/1.14.0

binwiederhier (Owner) commented Mar 24, 2023

I am not really any closer. I added timing logging to nginx, tuned the listen block to add backlog=5000;, and looked at the time_taken_ms in ntfy.log. See https://www.nginx.com/blog/tuning-nginx/#listen

nginx.conf:

listen 443 ssl http2 backlog=5000;

log_format combined_ssl '$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" '
               'request_time=$request_time '
               'upstream_status=$upstream_status '
               'upstream_response_time=$upstream_response_time '
               'upstream_connect_time=$upstream_connect_time '
               'upstream_header_time=$upstream_header_time';

What I see is that during the thundering herd, ntfy requests in the backend take very long to complete (6-8 seconds), but the nginx timings do not match the ntfy backend at all. It is entirely conceivable that iOS will hang up if requests really take this long.

x.x.x.x - - [24/Mar/2023:00:55:59 +0000] "GET /topicname/json?poll=1&since=xCYUwMFKwjhu HTTP/2.0" 499 0 "-" "ntfy/1.2 (build 3; iOS 15.6.0)" 
  request_time=0.744 
  upstream_status=- 
  upstream_response_time=0.744 
  upstream_connect_time=0.000 
  upstream_header_time=-

root@ntfy:/var/log# tail -n 100000 ntfy.log|grep topicname|grep finished|jq .time_taken_ms
8994

In the example above (which is representative), you can see that the backend request took almost 9 seconds, but nginx reports a response time of 0.7s and a status code of 499. So it could be that the iOS client really did hang up after 0.7s (unlikely), and the backend request just kept going.

binwiederhier (Owner) commented

So: my profiling endeavors show that the 499 spikes are likely caused simply by CPU exhaustion. The bottleneck at this query volume seems to be the SQLite queries themselves, not anything in the code in particular. I think we'd be fine with a bigger box, which is good to know, since the box that runs ntfy.sh is pretty small.

As for concrete next steps, I will likely just disable the iOS polling entirely (for now). This should (hopefully) not have any impact on iOS, since polling is a secondary measure anyway, and almost all of the polls seem to fail right now anyway.

It may even be beneficial, because the duplicate-delivery issues will disappear.

There is no good way to test this IMHO, so I'll just do it and see if iOS users scream. 😬

binwiederhier (Owner) commented

Disabled entirely in bdae48a
