Assertion `!*"refcnt sub-zero"' failed. #591

Closed
trthulhu opened this issue Mar 28, 2016 · 5 comments

@trthulhu

We are using librdkafka and came across an issue that took down our application during stress tests. The stack trace from the core file looks like this:

(gdb) bt 
#0 0x0000003239232625 in raise () from /lib64/libc.so.6 
#1 0x0000003239233e05 in abort () from /lib64/libc.so.6 
#2 0x000000323922b74e in __assert_fail_base () from /lib64/libc.so.6 
#3 0x000000323922b810 in __assert_fail () from /lib64/libc.so.6 
#4 0x00007ff0d408bbb8 in rd_refcnt_sub0 (R=<optimized out>) at rd.h:273 
#5 0x00007ff0d411169a in rd_refcnt_sub0 (R=<optimized out>) at rd.h:97 
#6 rd_kafka_op_destroy (rko=0x7fe85a34a5d0) at rdkafka_op.c:100 
#7 0x00007ff0d40fea28 in rd_kafka_broker_op_serve (rko=<optimized out>, rkb=0x7fe85a333e20) at rdkafka_broker.c:2052 
#8 rd_kafka_broker_serve (rkb=rkb@entry=0x7fe85a333e20, timeout_ms=timeout_ms@entry=10) at rdkafka_broker.c:2072 
#9 0x00007ff0d41002b1 in rd_kafka_broker_ua_idle (rkb=rkb@entry=0x7fe85a333e20, timeout_ms=timeout_ms@entry=0) at rdkafka_broker.c:2136 
#10 0x00007ff0d4100888 in rd_kafka_broker_thread_main (arg=0x7fe85a333e20) at rdkafka_broker.c:3303 
#11 0x00007ff0d41302b9 in _thrd_wrapper_function (aArg=<optimized out>) at tinycthread.c:596 
#12 0x00000032396079d1 in start_thread () from /lib64/libpthread.so.0 
#13 0x00000032392e88fd in clone () from /lib64/libc.so.6

An excerpt of the log:

%3|1458850597.751|FAIL|vertica#consumer-3893| localhost:16959/bootstrap: Connect to ipv4#127.0.0.1:16959 failed: Connection refused 
%3|1458850597.751|ERROR|vertica#consumer-3893| localhost:16959/bootstrap: Connect to ipv4#127.0.0.1:169vertica: rd.h:273: rd_refcnt_sub0: Assertion `!*"refcnt sub-zero"' failed.
59 failed: Connection refused 
%5|1458850720.536|PARTCNT|vertica#consumer-3894| Topic fortest partition count changed from 6 to 7 

The assertion failure is due to an extra decref on an RD_KAFKA_OP_PARTITION_LEAVE op (see the sketch after the dump below):

$5 = (rd_kafka_op_t *) 0x7fe85a34a5d0 
(gdb) p *rko 
$6 = { 
  rko_link = { 
    tqe_next = 0x7fe85a11f460, 
    tqe_prev = 0x7fe85a333ec0 
  }, 
  rko_type = RD_KAFKA_OP_PARTITION_LEAVE, 
  rko_flags = 0, 
  rko_msgq = { 
    rkmq_msgs = { 
      tqh_first = 0x0, 
      tqh_last = 0x0 
    }, 
    rkmq_msg_cnt = { 
      val = 0 
    }, 
    rkmq_msg_bytes = { 
      val = 0 
    } 
  }, 
  rko_replyq = 0x0, 
  rko_intarg = 0, 
  rko_op_cb = 0x0, 
  rko_free_cb = 0x0, 
  rko_rkm = 0x0, 
  rko_rkmessage = { 
    err = RD_KAFKA_RESP_ERR_NO_ERROR, 
    rkt = 0x0, 
    partition = 0, 
    payload = 0x0, 
    len = 0, 
    key = 0x0, 
    key_len = 0, 
    offset = 0, 
    _private = 0x0 
  }, 
  rko_rkbuf = 0x0, 
  rko_metadata = 0x0, 
  rko_rktp = 0x7fe85a2471b0, 
  rko_cgrp = 0x0 
} 
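For context, "refcnt sub-zero" means a reference count was decremented below zero, i.e. some code path released a reference it did not own. The following is an illustration only (not the actual rd.h implementation; the type and function names are made up for the example). It just shows the general shape of a refcount helper that asserts on underflow:

#include <assert.h>

typedef struct { int v; } refcnt_t;          /* simplified: no atomics, no locking */

static void refcnt_sub(refcnt_t *r) {
    int after = --r->v;                      /* decrement the count */
    assert(after >= 0 && "refcnt sub-zero"); /* fires on the one-too-many decref */
}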

A little background: this is a simple consumer; we load in 10-second increments, recreating the rdkafka confs each time without destroying rdkafka entirely (so maybe some caching issue?).
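Roughly, each load cycle looks like the sketch below. This is not our actual code: the broker address, partition, and offset are placeholders, and error handling is omitted.

#include <librdkafka/rdkafka.h>

static void one_cycle(void) {
    char errstr[512];
    rd_kafka_conf_t *conf = rd_kafka_conf_new();            /* fresh conf each cycle */
    rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_CONSUMER, conf,
                                  errstr, sizeof(errstr));  /* rk takes ownership of conf */
    rd_kafka_brokers_add(rk, "localhost:9092");
    rd_kafka_topic_t *rkt = rd_kafka_topic_new(rk, "fortest", NULL);

    rd_kafka_consume_start(rkt, 0, RD_KAFKA_OFFSET_BEGINNING);
    /* ... consume with rd_kafka_consume() for ~10 seconds ... */
    rd_kafka_consume_stop(rkt, 0);

    rd_kafka_topic_destroy(rkt);
    rd_kafka_destroy(rk);
    /* rd_kafka_wait_destroyed() is intentionally not called here. */
}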

Let me know what other information would be valuable.

@edenhill
Contributor

Hey,

Can you elaborate on this a bit: what do you mean by not destroying entirely?

> A little background: this is a simple consumer; we load in 10-second increments, recreating the rdkafka confs each time without destroying rdkafka entirely (so maybe some caching issue?)

Are you using latest master?

Is it reproducible?

Was the broker that is now reporting Connection refused previously up?

@trthulhu
Author

We destroy everything we use, but we do not call rd_kafka_wait_destroyed() at the end of each cycle, as we are time-bound and cannot wait for global state to be destroyed every time.
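For reference, rd_kafka_wait_destroyed() takes a timeout in milliseconds and returns 0 once all librdkafka handles are gone, or -1 if the timeout is reached, so a time-bounded check would look roughly like this (the 100 ms budget is an arbitrary example value):

#include <stdio.h>
#include <librdkafka/rdkafka.h>

static void bounded_cleanup_check(void) {
    /* Wait at most 100 ms for lingering librdkafka objects. */
    if (rd_kafka_wait_destroyed(100) == -1)
        fprintf(stderr, "librdkafka objects still alive after 100 ms\n");
}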

We are using 0.9.0, I'm pretty sure.

It is reproducible, though I'll have to get back to you tomorrow or the following day on the test itself. As far as I can recall, the brokers were never down. Once the engineer who created the test comes back, I can get you the specifics.

@edenhill
Contributor

There have been a lot of fixes since 0.9.0, and I strongly suggest that you try the latest master instead.
I'm looking to roll a new 0.9.1 release very soon.

Also, with >=0.9, wait_destroyed() is no longer needed: rd_kafka_destroy() now blocks until all internals have been cleaned up (this was easier for people to reason about).

@trthulhu
Author

We will try out master and let you know if it solves the issue, thank you. However, we will probably wait until 0.9.1 to officially move over.

@trthulhu
Author

trthulhu commented Apr 4, 2016

Just an update: the issue is hard to reproduce, and we haven't had the resources this past week to look deeper into it. We will hopefully be trying again this week.
