near-infinite loop if no upstreams can be reached & using healthcheck #2

dukehoops opened this Issue Apr 23, 2012 · 10 comments


Synopsis

If using this module together with healthcheck and all upstream nodes are down (cannot connect), any subsequent request to the proxying server causes the handling nginx worker to 'hang' virtually indefinitely in ngx_http_upstream_hash_next_peer's while-loop.

This happens because ngx_http_upstream_hash_next_peer ends up being called with a *tries argument of 4294967295 (i.e. 2^32 - 1, the maximum value of an unsigned 32-bit integer).

Environment

I compiled nginx 1.0.15 (current GA) with this module as well as cep21's healthcheck module, on AWS Linux, with this gcc:

[clabs@lb1 ~]$ gcc --version
gcc (GCC) 4.4.5 20110214 (Red Hat 4.4.5-6)

nginx config

    upstream clp925-cluster {
                hash $remote_addr;
                #hash_again 1;
                server localhost:18081; # no-one's listening on this port
                server localhost:18085; # no-one's listening on this port

                healthcheck_enabled;
                healthcheck_send "GET /ping HTTP/1.0" "Connection: close";
                healthcheck_delay 2000;
                healthcheck_failcount 2;
                healthcheck_timeout 2000;
        }

        server {
                server_name clp925.castanealabs.com;
                listen 80;

                # search app locations
                location / {
                        proxy_pass http://clp925-cluster;
                        add_header "X-ECR-upstream-addr" "$upstream_addr";
                        add_header "X-ECR-upstream-status" "$upstream_status";
                }
        }

Failure

If both upstreams are actually down (i.e. not accepting any connections at all, as opposed to merely returning 404 for /ping, which would also fail the healthcheck), then executing any request against this server yields the following in nginx's error.log:

2012/04/23 17:44:33 [debug] 16665#0: *110 http proxy header:
"GET /ping HTTP/1.0^M
Host: clp925.castanealabs.com^M
X-Real-IP: 75.147.129.117^M
X-Forwarded-For: 75.147.129.117^M
X-Forwarded-Proto: http^M
Connection: close^M
User-Agent: curl/7.24.0 (i686-pc-cygwin) libcurl/7.24.0 OpenSSL/1.0.1 zlib/1.2.5 libidn/1.22 libssh2/1.4.0^M
Accept: */*^M
^M
"
2012/04/23 17:44:33 [debug] 16665#0: *110 http cleanup add: 08716D64
2012/04/23 17:44:33 [debug] 16665#0: *110 http script var: "75.147.129.117"
2012/04/23 17:44:33 [debug] 16665#0: *110 upstream_hash: hashed "75.147.129.117" to 1
2012/04/23 17:44:33 [debug] 16665#0: *110 ECR: executing upstream_hash_next_peer (tries=1)
2012/04/23 17:44:33 [debug] 16665#0: *110 ECR: tries=0
2012/04/23 17:44:33 [debug] 16665#0: *110 upstream_hash: hashed "175.147.129.117" to 0
2012/04/23 17:44:33 [debug] 16665#0: *110 upstream_hash: Starting with 0
2012/04/23 17:44:33 [debug] 16665#0: *110 upstream_hash: get upstream request hash peer try 4294967295
2012/04/23 17:44:33 [debug] 16665#0: *110 upstream_hash: chose peer 0 w/ hash 33846 for tries 4294967295
2012/04/23 17:44:33 [debug] 16665#0: *110 socket 12
2012/04/23 17:44:33 [debug] 16665#0: *110 epoll add connection: fd:12 ev:80000005
2012/04/23 17:44:33 [debug] 16665#0: *110 connect to 127.0.0.1:18081, fd:12 #111
2012/04/23 17:44:33 [debug] 16665#0: *110 http upstream connect: -2
2012/04/23 17:44:33 [debug] 16665#0: *110 event timer add: 12: 3000:3788414729
2012/04/23 17:44:33 [debug] 16665#0: *110 http finalize request: -4, "/ping?" a:1, c:2
2012/04/23 17:44:33 [debug] 16665#0: *110 http request count:2 blk:0
2012/04/23 17:44:33 [debug] 16665#0: timer delta: 5
2012/04/23 17:44:33 [debug] 16665#0: posted events 00000000
2012/04/23 17:44:33 [debug] 16665#0: worker cycle
2012/04/23 17:44:33 [debug] 16665#0: epoll timer: 353
2012/04/23 17:44:33 [debug] 16665#0: epoll: fd:3 ev:0004 d:0878C89D
2012/04/23 17:44:33 [debug] 16665#0: *110 http run request: "/ping?"
2012/04/23 17:44:33 [debug] 16665#0: *110 http upstream check client, write event:1, "/ping"
2012/04/23 17:44:33 [debug] 16665#0: *110 http upstream recv(): -1 (11: Resource temporarily unavailable)
2012/04/23 17:44:33 [debug] 16665#0: epoll: fd:12 ev:001D d:0878C901
2012/04/23 17:44:33 [debug] 16665#0: epoll_wait() error on fd:12 ev:001D
2012/04/23 17:44:33 [debug] 16665#0: *110 http upstream request: "/ping?"
2012/04/23 17:44:33 [debug] 16665#0: *110 http upstream process header
2012/04/23 17:44:33 [error] 16665#0: *110 connect() failed (111: Connection refused) while connecting to upstream, client: 75.147.129.117, server: clp925.castanealabs.com, request: "GET /ping HTTP/1.1", upstream: "http://127.0.0.1:18081/ping", host: "clp925.castanealabs.com"
2012/04/23 17:44:33 [debug] 16665#0: *110 http next upstream, 2
2012/04/23 17:44:33 [debug] 16665#0: *110 upstream_hash: free upstream hash peer try 4294967295
2012/04/23 17:44:33 [debug] 16665#0: *110 ECR: executing upstream_hash_next_peer (tries=4294967295)
2012/04/23 17:44:33 [debug] 16665#0: *110 ECR: tries=4294967294
2012/04/23 17:44:33 [debug] 16665#0: *110 upstream_hash: hashed "275.147.129.117" to 0
2012/04/23 17:44:33 [debug] 16665#0: *110 ECR: tries=4294967293
2012/04/23 17:44:33 [debug] 16665#0: *110 upstream_hash: hashed "375.147.129.117" to 1
2012/04/23 17:44:33 [debug] 16665#0: *110 ECR: tries=4294967292
2012/04/23 17:44:33 [debug] 16665#0: *110 upstream_hash: hashed "475.147.129.117" to 1
2012/04/23 17:44:33 [debug] 16665#0: *110 ECR: tries=4294967291
2012/04/23 17:44:33 [debug] 16665#0: *110 upstream_hash: hashed "575.147.129.117" to 0
2012/04/23 17:44:33 [debug] 16665#0: *110 ECR: tries=4294967290

Possible Root Cause

To produce the log output above, I added two debug statements to ngx_http_upstream_hash_next_peer:

static void ngx_http_upstream_hash_next_peer(ngx_http_upstream_hash_peer_data_t *uhpd,
        ngx_uint_t *tries, ngx_log_t *log) {

ngx_log_debug1(NGX_LOG_DEBUG_HTTP, log, 0, "ECR: executing upstream_hash_next_peer (tries=%ui)", *tries);

    ngx_uint_t current;
    current = uhpd->hash % uhpd->peers->number;
    //  Loop while there is a try left, we're on one we haven't tried, and
    // the current peer isn't marked down
    while ((*tries)-- && (
       (uhpd->tried[ngx_bitvector_index(current)] & ngx_bitvector_bit(current))
        || uhpd->peers->peer[current].down
#if (NGX_HTTP_HEALTHCHECK)
        || ngx_http_healthcheck_is_down(uhpd->peers->peer[current].health_index, log)
#endif
        )) {

ngx_log_debug1(NGX_LOG_DEBUG_HTTP, log, 0, "ECR: tries=%ui", *tries);

       uhpd->current_key.len = ngx_sprintf(uhpd->current_key.data, "%d%V",
           ++uhpd->try_i, &uhpd->original_key) - uhpd->current_key.data;
       uhpd->hash += ngx_http_upstream_hash_crc32(uhpd->current_key.data,
           uhpd->current_key.len);
       current = uhpd->hash % uhpd->peers->number;
       ngx_log_debug2(NGX_LOG_DEBUG_HTTP, log, 0,
           "upstream_hash: hashed \"%V\" to %ui", &uhpd->current_key, current);
   }
}

So it looks like something calls ngx_http_upstream_hash_next_peer passing 4294967295 for the *tries argument. Given that that is 2^32 - 1, perhaps something is doing the equivalent of:

(*tries)--; // when *tries == 0 (and it's an unsigned int), this wraps to 4294967295
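
For reference, a minimal standalone C sketch of that wrap (illustration only, not module code; judging by the 4294967295 in the logs, ngx_uint_t is 32 bits wide on this build):

    #include <stdio.h>

    int main(void) {
        unsigned int tries = 0;    /* stands in for *tries (ngx_uint_t) at zero */

        /* The post-decrement happens even when it makes a loop/if condition
         * false, so the counter does not stay at zero; it wraps. */
        if (tries--) {
            /* never taken: the old value, 0, is false */
        }
        printf("%u\n", tries);     /* prints 4294967295 for a 32-bit unsigned int */
        return 0;
    }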

Oops: there is an infinite loop, but not because of the unsigned int comparison. Will update the issue with more info today.

Updated the description with test output. Also confirmed that if healthcheck is not used, the server correctly returns a 5xx when none of the upstreams are accepting connections.

Owner

evanmiller commented Apr 24, 2012

The while condition is && not ||, so when tries hits zero it always fails (match the parentheses to see what I mean). Does this only happen with the health check stuff? Because the code looks fine to me.

Evan

On Monday, April 23, 2012, Nikita Tovstoles wrote:

In the code block shown in the issue above, as far as I understand, tries is an unsigned int. Consequently, *tries-- will always be non-negative and leads to an infinite loop if the other conditions of the while loop are true (i.e. when all upstream hosts are down).


That makes sense. Yes, this only happens with 'healthcheck_enabled;'. With that directive, though, ngx_http_upstream_hash_next_peer does get called with an erroneous *tries value of 4294967295 (see the log snippet in the issue). So far it doesn't look like the healthcheck module alters 'tries', but I may be missing that...

Also, there is no race-condition possibility, is there? No multiple threads calling ngx_http_upstream_hash_next_peer or modifying 'tries' elsewhere in the module (still coming up to speed on the nginx module architecture)?

thanks,
-nikita


Actually, I spoke too soon: I was able to repro this problem after re-compiling 1.2.0 with just this module + SSL (without healthcheck). The problem occurs when:

  • both upstreams are down
  • 'hash_again 1;' is used.

The entire config is below:

       upstream foobar-cluster {
                hash $remote_addr;
                #hash_again 1;
                server lb1.staging.castanealabs.com:8081;
                server lb1.staging.castanealabs.com:8085;
        }

        server {
                server_name foobar;
                listen 80;
                location / {
                        proxy_pass http://foobar-cluster;
                        add_header "ECR-foobar-upstream" "$upstream_addr";
                }
        }

With 'hash_again' enabled, the very first request against this server causes the worker thread to spin. Here's what the debug log looks like:

2012/04/24 14:46:36 [debug] 12141#0: *3 http upstream request: "/ping?"
2012/04/24 14:46:36 [debug] 12141#0: *3 http upstream process header
2012/04/24 14:46:36 [error] 12141#0: *3 connect() failed (111: Connection refused) while connecting to upstream, client: 127.0.0.1, server: foobar, request: "GET /ping HTTP/1.1", upstream: "http://10.64.94.87:8085/ping", host: "foobar"
2012/04/24 14:46:36 [debug] 12141#0: *3 http next upstream, 2
2012/04/24 14:46:36 [debug] 12141#0: *3 upstream_hash: free upstream hash peer try 1
2012/04/24 14:46:36 [debug] 12141#0: *3 upstream_hash: hashed "1127.0.0.1" to 0
2012/04/24 14:46:36 [debug] 12141#0: *3 upstream_hash: Using 0 because 1 failed
2012/04/24 14:46:36 [debug] 12141#0: *3 close http upstream connection: 9
2012/04/24 14:46:36 [debug] 12141#0: *3 free: 09A4EFE0, unused: 88
2012/04/24 14:46:36 [debug] 12141#0: *3 event timer del: 9: 3864137272
2012/04/24 14:46:36 [debug] 12141#0: *3 reusable connection: 0
2012/04/24 14:46:36 [debug] 12141#0: *3 upstream_hash: get upstream request hash peer try 4294967295
2012/04/24 14:46:36 [debug] 12141#0: *3 upstream_hash: chose peer 0 w/ hash 44220 for tries 4294967295
2012/04/24 14:46:36 [debug] 12141#0: *3 socket 9
2012/04/24 14:46:36 [debug] 12141#0: *3 epoll add connection: fd:9 ev:80000005
2012/04/24 14:46:36 [debug] 12141#0: *3 connect to 10.64.94.87:8081, fd:9 #5
2012/04/24 14:46:36 [debug] 12141#0: *3 http upstream connect: -2
2012/04/24 14:46:36 [debug] 12141#0: *3 posix_memalign: 09A44BA0:128 @16
2012/04/24 14:46:36 [debug] 12141#0: *3 event timer add: 9: 3000:3864137273
2012/04/24 14:46:36 [debug] 12141#0: epoll: stale event B78B6134
2012/04/24 14:46:36 [debug] 12141#0: timer delta: 0
2012/04/24 14:46:36 [debug] 12141#0: posted events 00000000
2012/04/24 14:46:36 [debug] 12141#0: worker cycle
2012/04/24 14:46:36 [debug] 12141#0: epoll timer: 3000
2012/04/24 14:46:36 [debug] 12141#0: epoll: fd:9 ev:001D d:B78B6134
2012/04/24 14:46:36 [debug] 12141#0: epoll_wait() error on fd:9 ev:001D
2012/04/24 14:46:36 [debug] 12141#0: *3 http upstream request: "/ping?"
2012/04/24 14:46:36 [debug] 12141#0: *3 http upstream process header
2012/04/24 14:46:36 [error] 12141#0: *3 connect() failed (111: Connection refused) while connecting to upstream, client: 127.0.0.1, server: foobar, request: "GET /ping HTTP/1.1", upstream: "http://10.64.94.87:8081/ping", host: "foobar"
2012/04/24 14:46:36 [debug] 12141#0: *3 http next upstream, 2
2012/04/24 14:46:36 [debug] 12141#0: *3 upstream_hash: free upstream hash peer try 4294967295
2012/04/24 14:46:36 [debug] 12141#0: *3 upstream_hash: hashed "2127.0.0.1" to 0
2012/04/24 14:46:36 [debug] 12141#0: *3 upstream_hash: hashed "3127.0.0.1" to 0
2012/04/24 14:46:36 [debug] 12141#0: *3 upstream_hash: hashed "4127.0.0.1" to 0
2012/04/24 14:46:36 [debug] 12141#0: *3 upstream_hash: hashed "5127.0.0.1" to 0
2012/04/24 14:46:36 [debug] 12141#0: *3 upstream_hash: hashed "6127.0.0.1" to 1

Owner

evanmiller commented Apr 25, 2012

Try this

f0ae839

Tried it; same problem. I am clearly out of my depth here, but it seems that since ngx_http_upstream_hash_next_peer is at some point invoked with tries == 4294967295 (UINT_MAX), the thread is spinning in that function's while-loop, and so the following if statement will never be reached (or at least not until roughly 4 billion while-loop iterations have run):

    /* In case this one is marked down */
    ngx_http_upstream_hash_next_peer(uhpd, &r->upstream->peer.tries, r->connection->log);
    if ((ngx_int_t)r->upstream->peer.tries == -1) {
        return NGX_ERROR;
    }
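
(As an aside, assuming a typical two's-complement 32-bit build, the cast in that guard does catch the wrapped value, since 4294967295 and -1 share the same bit pattern; the loop above just has to exhaust the counter before control ever reaches it. A tiny standalone illustration, not module code:)

    #include <stdio.h>

    int main(void) {
        unsigned int tries = 4294967295u;   /* the wrapped value seen in the logs */

        /* Same comparison the guard performs, with plain ints standing in for
         * ngx_uint_t/ngx_int_t on a 32-bit build. */
        if ((int)tries == -1) {
            printf("guard would fire\n");   /* it does, once control gets here */
        }
        return 0;
    }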

2012/04/25 10:51:14 [debug] 6239#0: *1 upstream_hash: hashed "35990606127.0.0.1" to 0
2012/04/25 10:51:14 [debug] 6239#0: *1 upstream_hash: hashed "35990607127.0.0.1" to 1
2012/04/25 10:51:14 [debug] 6239#0: *1 upstream_hash: hashed "35990608127.0.0.1" to 1
2012/04/25 10:51:14 [debug] 6239#0: *1 upstream_hash: hashed "35990609127.0.0.1" to 1
2012/04/25 10:51:14 [debug] 6239#0: *1 upstream_hash: hashed "35990610127.0.0.1" to 0
2012/04/25 10:51:14 [debug] 6239#0: *1 upstream_hash: hashed "35990611127.0.0.1" to 1
2012/04/25 10:51:14 [debug] 6239#0: *1 upstream_hash: hashed "35990612127.0.0.1" to 1
2012/04/25 10:51:14 [debug] 6239#0: *1 upstream_hash: hashed "35990613127.0.0.1" to 1
2012/04/25 10:51:14 [debug] 6239#0: *1 upstream_hash: hashed "35990614127.0.0.1" to 1
2012/04/25 10:51:14 [debug] 6239#0: *1 upstream_hash: hashed "35990615127.0.0.1" to 1
2012/04/25 10:51:14 [debug] 6239#0: *1 upstream_hash: hashed "35990616127.0.0.1" to 0

realpdm commented Apr 25, 2012

This seems to fix our problem in preliminary testing. I was reproducing it by adding a null route for every backend server. Before this fix it would immediately lock up and take 100% CPU for each worker. Now it responds fine and continues working once the backends are reachable again.
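
(For anyone reproducing the same way: on Linux a null route can be added roughly like this; the address is a placeholder, substitute each backend's IP, and delete the route afterwards to restore connectivity.)

    # blackhole traffic to one backend so connections to it fail (placeholder IP)
    ip route add blackhole 192.0.2.10/32
    # undo:
    ip route del blackhole 192.0.2.10/32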

@realpdm

If you add the config below (where none of the upstreams exist), restart nginx, and try to hit foobar (making sure nginx is selecting the 'foobar' server block, e.g. as in the curl sketch after the config), do you get a 502 or a spinning worker thread?

       upstream foobar-cluster {
                hash $remote_addr;
                hash_again 1;
                server localhost:8081;
                server localhost:8082;
        }

        server {
                server_name foobar;
                listen 80;
                location / {
                        proxy_pass http://foobar-cluster;
                        add_header "ECR-foobar-upstream" "$upstream_addr";
                }
        }
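
For example, something like this forces nginx to select the foobar server block (assuming nginx is listening on port 80 of the local machine):

    # set the Host header explicitly so the 'foobar' server_name matches
    curl -i -H 'Host: foobar' http://127.0.0.1/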

gurre commented Jun 24, 2013

Is this solved?
