heap invariant compromised in connsched #119

kr · 2012-05-18T01:51:31Z

Function connsched was modifying field tickat while a Conn
was still inside the heap. Wacky hijinks ensued.
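
For readers unfamiliar with this failure mode: a binary heap stays ordered only if an element's key does not change while the element is stored in it. The following is a minimal sketch in C, not beanstalkd's actual code. Only the names Conn and tickat come from this report; Heap, heap_insert, heap_remove, conn_reschedule, and heap_is_valid are hypothetical names used for illustration. It shows one common way to keep the invariant, which may or may not match the fix that was eventually shipped: take the element out, change the key, and put it back.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins; not beanstalkd's real types. */
typedef struct Conn {
    int64_t tickat;   /* heap key: when this conn next needs attention */
    size_t  heap_idx; /* current position in the heap array */
    int     in_heap;  /* nonzero while the conn is stored in the heap */
} Conn;

enum { HEAP_CAP = 64 };

typedef struct Heap {
    Conn  *item[HEAP_CAP];
    size_t len;
} Heap;

static void heap_set(Heap *h, size_t i, Conn *c) {
    h->item[i] = c;
    c->heap_idx = i;
}

/* Move item i toward the root until its parent's key is <= its own. */
static void sift_up(Heap *h, size_t i) {
    while (i > 0) {
        size_t parent = (i - 1) / 2;
        if (h->item[parent]->tickat <= h->item[i]->tickat) break;
        Conn *tmp = h->item[parent];
        heap_set(h, parent, h->item[i]);
        heap_set(h, i, tmp);
        i = parent;
    }
}

/* Move item i toward the leaves until both children have keys >= its own. */
static void sift_down(Heap *h, size_t i) {
    for (;;) {
        size_t l = 2 * i + 1, r = l + 1, min = i;
        if (l < h->len && h->item[l]->tickat < h->item[min]->tickat) min = l;
        if (r < h->len && h->item[r]->tickat < h->item[min]->tickat) min = r;
        if (min == i) break;
        Conn *tmp = h->item[min];
        heap_set(h, min, h->item[i]);
        heap_set(h, i, tmp);
        i = min;
    }
}

/* Returns 1 on success, 0 if the heap is full. */
int heap_insert(Heap *h, Conn *c) {
    if (h->len == HEAP_CAP) return 0;
    heap_set(h, h->len++, c);
    c->in_heap = 1;
    sift_up(h, c->heap_idx);
    return 1;
}

/* Remove c by moving the last item into its slot and restoring order. */
void heap_remove(Heap *h, Conn *c) {
    assert(c->in_heap);
    size_t i = c->heap_idx;
    Conn *last = h->item[--h->len];
    c->in_heap = 0;
    if (last == c) return; /* c occupied the last slot */
    heap_set(h, i, last);
    sift_up(h, i);
    sift_down(h, last->heap_idx);
}

/* Safe reschedule: the key changes only while c is outside the heap. */
int conn_reschedule(Heap *h, Conn *c, int64_t new_tickat) {
    if (c->in_heap) heap_remove(h, c);
    c->tickat = new_tickat;
    return heap_insert(h, c);
}

/* Returns 1 if every parent's key is <= its children's keys, else 0. */
int heap_is_valid(const Heap *h) {
    for (size_t i = 1; i < h->len; i++)
        if (h->item[(i - 1) / 2]->tickat > h->item[i]->tickat) return 0;
    return 1;
}
```

Assigning to tickat directly on a Conn that is still stored skips the sift steps, so a later heap operation can act on a stale position. In this sketch, heap_is_valid is a debugging aid for catching that kind of mistake early.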

Original report follows.

Reported in version 1.5:

We are running beanstalkd on Ubuntu. Yesterday we received alerts
saying the beanstalkd process stopped and our queues began backing up.
This is the only error message I could find. Any idea what it means?
May 13 11:44:39 bns6 kernel: [4658098.511165] beanstalkd[16793] general protection ip:40978b sp:7fff307194c0 error:0 in beanstalkd[400000+12000]

https://groups.google.com/d/topic/beanstalk-talk/3uRiRBonVuE

kr · 2012-05-19T00:34:20Z

@gbarr wrote

We have been seeing this same error. A Google search led me to

http://lists.opensuse.org/opensuse-amd64/2004-01/msg00195.html

which seems to indicate that this is a bad pointer issue, but I have
been unable to track it.

and

we are running a version that was compiled from 4b45d37 which is a few commits after 1.5 was tagged

chadkouse · 2012-05-23T04:14:06Z

We reverted to a previous version of beanstalkd and haven't seen this bug since (over 24 hours). We were seeing it multiple times per day before, so it appears to affect only 1.5.

If I can convince ops we will try 1.6 and see if it exists there.

kr · 2012-05-23T09:02:45Z

@chadkouse if you can reproduce it easily, it would be extremely helpful
if you could send me a core dump or at least a stack trace of the process
when it happens.

I can help if you need instructions on how to do that.

chadkouse · 2012-05-23T16:06:17Z

Yeah, tell me how to do that and I'll see if I can get it done.

kr · 2012-05-23T19:11:54Z

Before starting the beanstalkd process, run ulimit -c unlimited
in the shell where you'll start beanstalkd. That will cause the
system to generate a core dump when the process crashes.

If you run it on Mac OS X, the core file will be in /cores; if you're
on Linux, the core file should be in the directory where you ran
beanstalkd.
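
As an aside, the same limit can be raised from inside a C program. The sketch below is an assumption-laden illustration, not something beanstalkd is known to do: enable_core_dumps is a hypothetical helper that raises the soft core-file size limit to the current hard limit using the POSIX getrlimit and setrlimit calls. Unlike ulimit -c unlimited in a shell, it cannot exceed a hard limit that is already lower.

```c
#include <sys/resource.h>

/* Hypothetical helper: raise the soft RLIMIT_CORE to the hard limit.
 * Returns 0 on success, -1 on failure (errno is set by the failing call). */
int enable_core_dumps(void) {
    struct rlimit rl;
    if (getrlimit(RLIMIT_CORE, &rl) != 0) return -1;
    rl.rlim_cur = rl.rlim_max; /* an unprivileged process may go this high */
    return setrlimit(RLIMIT_CORE, &rl);
}
```

Such a call would need to run early, before any crash could occur. On Linux, where the core file is written also depends on the system's core_pattern setting.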

chadkouse · 2012-05-31T15:22:02Z

Here's a core dump from 1.5 - http://dl.dropbox.com/u/32251821/core

We have put 1.6 on one of our nodes now and will report back if we get a crash there as well (with core dump)

Let me know when you've got that file so I can delete it from my dropbox.

chadkouse · 2012-06-04T14:41:49Z

Hi Keith,

Attached is the core dump file from the 1.6 crash. Hope this helps.

On Thu, May 31, 2012 at 11:34 AM, Chad Kouse chad.kouse@gmail.com wrote:

#119 (comment)

Chad Kouse

On Thursday, May 31, 2012 at 11:34 AM, Chad Kouse wrote:

I'm just using github's issue tracker.
#119 (comment)
Chad Kouse

On Thursday, May 31, 2012 at 11:29 AM, Andrew Fessler wrote:

Can you forward me his info? We'll take the lead on it from here so you
don't need to stay involved.

Andrew Fessler
Director of Operations
andrew@tunewiki.com

On May 31, 2012, at 11:27 AM, Chad Kouse wrote:

Sent thanks

Chad Kouse

On Thursday, May 31, 2012 at 10:59 AM, Tyler Yosick wrote:

Attached is the core dump (16MB) from the 1.5 crash. Chad, could you
forward this to Keith? I don't have his email address.

We now have 1.6 running on bns6. I've used the same start up script for
beanstalk, so should 1.6 crash, I assume it would generate another dump.

Thanks.

On Wed, May 23, 2012 at 4:56 PM, Chad Kouse chad.kouse@gmail.com wrote:

Right. Sounds good

Chad Kouse

On Wednesday, May 23, 2012 at 4:26 PM, Tyler Yosick wrote:

Chad,

Jared and I are planning on throwing 1.5 on bns6 tomorrow, early afternoon,
to get that core dump for Keith. As soon as 1.5 crashes, we'll get the core
dump files to him and then throw 1.6 on bns6 to see what it does.

If 1.6 crashes, I assume we can perform the same procedure to get a core
dump from it?

On Wed, May 23, 2012 at 3:14 PM, Chad Kouse chad.kouse@gmail.com wrote:

Let's see if we can get Keith a core dump.

Chad Kouse

Attachments:

core

Tyler Yosick
TuneWiki Operations

ghost · 2012-06-04T14:46:09Z

Whoops. Looks like replying to one of the emails made me comment as Chad somehow...

Keith: How can I get this 1.6 core dump file to you?

kr · 2012-06-11T02:23:18Z

@chadkouse I have the core file now. Can you also send
the beanstalkd binary that generated this file? Sorry I didn't
ask for that earlier.

@tyosick You can email it to kr@xph.us. Please also send
the beanstalkd binary. Thanks!

chadkouse · 2012-06-11T02:29:18Z

@tyosick works with us so I'll let him send both binaries.

ghost · 2012-06-11T02:35:59Z

@kr, just to confirm, you need both the 1.5 and 1.6 binaries (located in /usr/local/bin/beanstalkd)?

kr · 2012-06-11T04:30:44Z

@tyosick both would be best, though either one would probably be sufficient.
I need to match up the binary with the core file so I can load both of them into
gdb together. Yes, looking at the core file from 1.5, it says it is from
/usr/local/bin/beanstalkd.

$ file core
core: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/local/bin/beanstalkd -b /var/spool/beanstalkd -f 5000'

kr · 2012-07-01T06:10:29Z

Thanks for the core file and binary. Loading them into gdb
gives the following backtrace:

#0  0x00000000004054db in remove_waiting_conn ()
#1  0x000000000040982b in conn_timeout ()
#2  0x000000000040a11a in prottick ()
#3  0x000000000040b873 in srvtick ()
#4  0x000000000040b5e3 in sockmain ()
#5  0x000000000040b7a2 in srvserve ()
#6  0x000000000040d77b in main ()

chadkouse · 2012-07-10T01:32:53Z

I'm not sure what this means - does it mean there's a bug in beanstalkd or a flaw in our setup?

nitinahuja · 2012-07-11T22:50:15Z

Just had a similar crash in my production environment, running version 1.5.

Jul 11 15:16:01 server04 kernel: [32980446.960264] beanstalkd[18214] general protection ip:4095ab sp:7fffcff4b2c0 error:0 in beanstalkd[400000+12000]

Any recommendations or time to fix would be great.

chadkouse · 2012-08-21T04:28:36Z

@kr any recommendations here? Unfortunately, we're stuck on 1.4.6 until we can solve this.

kr · 2012-08-25T02:04:24Z

I have been able to reproduce a crash. Now it's just a matter of time
until it gets fixed. I'm spending time here and there when I get a chance.

You're right in sticking to 1.4.6 until this is fixed.

Can I get you to run a test build of the bug fix when it's ready?

chadkouse · 2012-08-25T02:18:37Z

Yeah we're willing to help test.

nitinahuja · 2012-08-29T22:56:49Z

We've now been seeing this happen about once a day - still using 1.5, not really straightforward to roll back.

My question is: if we have persistence turned on, are we losing any messages when this crash occurs? Knowing this would help us prioritize the need to roll back to 1.4.6.

kr · 2012-08-31T04:13:09Z

@chadkouse @nitinahuja could you please run this build
of beanstalkd and let me know if you see any problems?

https://s3.amazonaws.com/krheroku/beanstalkd

$ ./beanstalkd -v
beanstalkd 1.6+4+g236c669

If it works well I'll make a release.

kr · 2012-08-31T09:21:16Z

@nitinahuja The potential for losing jobs depends on how often
you have beanstalkd call fsync. The safest thing you can do is to
use -f 0, which means to call fsync before acknowledging any
change. (This isn't the default behavior because beanstalkd is
primarily designed for workloads that can tolerate losing jobs and
need more speed.)
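
To make the "fsync before acknowledging" idea concrete, here is a minimal sketch in C. It is not beanstalkd's write-ahead log code; open_log and durable_append are hypothetical helpers, and the file layout is an assumption. The point it illustrates is ordering: the caller would reply to the client only after durable_append returns 0, so an acknowledged change has already been handed to stable storage.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: open (or create) an append-only log file.
 * Returns a file descriptor, or -1 on error. */
int open_log(const char *path) {
    return open(path, O_WRONLY | O_CREAT | O_APPEND, 0600);
}

/* Hypothetical helper: write the whole record, then fsync.
 * Returns 0 only when both the write and the fsync succeeded, -1 otherwise. */
int durable_append(int fd, const void *buf, size_t len) {
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR) continue; /* interrupted, retry */
            return -1;
        }
        p += n;              /* handle short writes */
        len -= (size_t)n;
    }
    return fsync(fd);        /* flush to stable storage before acknowledging */
}
```

Calling fsync on every change costs a disk flush per operation, which is the speed trade-off described above.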

kr · 2012-09-03T19:30:04Z

@chadkouse @nitinahuja did either of you get a chance to test out the recent fixes?

chadkouse · 2012-09-03T23:18:44Z

We are loading it onto our machines tomorrow. Will report back in a couple of days (or sooner if you didn't fix the bug :) )

kr · 2012-09-10T08:07:15Z

@chadkouse can I assume you've had no problems so far?

chadkouse · 2012-09-10T12:55:01Z

Yea, no problems so far

kr · 2012-09-10T18:33:16Z

Excellent. I consider this fixed.

chadkouse · 2012-09-11T01:43:46Z

Awesome, thanks. We loaded it on a few more machines, so we will let you know of any hiccups we come across.

aqibsm · 2012-12-13T10:29:40Z

I am facing the same issue on CentOS 6.3 with beanstalkd 1.8.

Dec 13 13:02:35 vm7-d2 kernel: beanstalkd[8701] general protection ip:40200a sp:7fff6c65fa10 error:0 in beanstalkd[400000+11000]
Dec 13 14:59:17 vm7-d2 kernel: beanstalkd[9993] general protection ip:40200a sp:7fff6c6f2fc0 error:0 in beanstalkd[400000+11000]

Beanstalkd 1.4 was running fine without any issues, so I am reverting to 1.4.

kr · 2012-12-14T08:28:40Z

@aqibsm this is a different bug, though the symptom is the same.
Can you help me track it down? Let's discuss in #160.

This was referenced May 25, 2012

Crashes when receiving empty job #122

Closed

beanstalkd: prot.c:1762 in update_conns: sockwant: Bad file descriptor #123

Closed

kr mentioned this issue Aug 31, 2012

Better attempt to fix issue #134 #136

Merged

This was referenced Sep 4, 2012

segfault in connclose (v1.6) #126

Closed

beanstalkd often breakdown #128

Closed

Segfault with v1.5 #132

Closed

kr closed this as completed Sep 10, 2012

kr mentioned this issue Dec 7, 2012

Problem with Debian 6 stable #158

Closed

kr mentioned this issue Dec 14, 2012

segfault in connsoonestjob at conn.c:166 #160

Closed
