heap invariant compromised in connsched #119

Closed
kr opened this issue May 18, 2012 · 29 comments

@kr
Member

kr commented May 18, 2012

Function connsched was modifying field tickat while a Conn
was still inside the heap. Wacky hijinks ensued.
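
A min-heap stays valid only while every parent's key is no greater than its children's keys, so changing a key in place while the element is still stored in the heap can leave it in the wrong position. The sketch below illustrates the general remove, modify, reinsert pattern that avoids this. It is not beanstalkd's code: only the names `Conn` and `tickat` come from the report, while `Heap`, `heap_idx`, `reschedule`, and the other helpers are simplified stand-ins invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins; not beanstalkd's actual Conn or heap types. */
typedef struct Conn {
    int64_t tickat;   /* heap key: when this conn next needs attention */
    int     heap_idx; /* position in the heap, or -1 when not in it */
} Conn;

enum { HEAP_CAP = 64 };

typedef struct Heap {
    Conn *data[HEAP_CAP];
    int   len;
} Heap;

/* Store c at slot i and record the slot on the conn itself. */
static void heap_set(Heap *h, int i, Conn *c) { h->data[i] = c; c->heap_idx = i; }

static void sift_up(Heap *h, int i) {
    while (i > 0) {
        int p = (i - 1) / 2;
        if (h->data[p]->tickat <= h->data[i]->tickat) break;
        Conn *t = h->data[p];
        heap_set(h, p, h->data[i]);
        heap_set(h, i, t);
        i = p;
    }
}

static void sift_down(Heap *h, int i) {
    for (;;) {
        int l = 2 * i + 1, r = l + 1, m = i;
        if (l < h->len && h->data[l]->tickat < h->data[m]->tickat) m = l;
        if (r < h->len && h->data[r]->tickat < h->data[m]->tickat) m = r;
        if (m == i) return;
        Conn *t = h->data[m];
        heap_set(h, m, h->data[i]);
        heap_set(h, i, t);
        i = m;
    }
}

/* Returns 1 on success, 0 if the heap is full. */
static int heap_insert(Heap *h, Conn *c) {
    if (h->len == HEAP_CAP) return 0;
    heap_set(h, h->len++, c);
    sift_up(h, c->heap_idx);
    return 1;
}

static void heap_remove(Heap *h, Conn *c) {
    int i = c->heap_idx;
    if (i < 0) return;                 /* not in the heap */
    assert(h->data[i] == c);
    Conn *last = h->data[--h->len];
    c->heap_idx = -1;
    if (last == c) return;             /* c occupied the final slot */
    heap_set(h, i, last);              /* fill the hole, then restore order */
    sift_down(h, i);
    sift_up(h, last->heap_idx);
}

/* Returns 1 if every parent key is <= its child's key. */
static int heap_valid(const Heap *h) {
    for (int i = 1; i < h->len; i++)
        if (h->data[(i - 1) / 2]->tickat > h->data[i]->tickat) return 0;
    return 1;
}

/* Safe rescheduling: take the conn out, change the key, put it back. */
static void reschedule(Heap *h, Conn *c, int64_t tickat) {
    heap_remove(h, c);
    c->tickat = tickat;
    heap_insert(h, c);
}
```

Writing `c->tickat = ...` directly on a conn whose `heap_idx` is not -1 is the unsafe variant; routing every key change through something like `reschedule` keeps the ordering intact.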

Original report follows.


Reported in version 1.5:

We are running beanstalkd on Ubuntu. Yesterday we received alerts
saying the beanstalkd process stopped and our queues began backing up.
This is the only error message I could find. Any idea what it means?

May 13 11:44:39 bns6 kernel: [4658098.511165] beanstalkd[16793] general protection ip:40978b sp:7fff307194c0 error:0 in beanstalkd[400000+12000]

https://groups.google.com/d/topic/beanstalk-talk/3uRiRBonVuE

@kr
Member Author

kr commented May 19, 2012

@gbarr wrote

We have been seeing this same error. A Google search led me to

http://lists.opensuse.org/opensuse-amd64/2004-01/msg00195.html

which seems to indicate that this is a bad pointer issue, but I have
been unable to track it.

and

We are running a version that was compiled from 4b45d37, which is a few commits after 1.5 was tagged.

@chadkouse

We reverted to a previous version of beanstalkd and haven't seen this bug since (over 24 hours). We were seeing it multiple times per day before, so it appears to affect only 1.5.

If I can convince ops we will try 1.6 and see if it exists there.

@kr
Member Author

kr commented May 23, 2012

@chadkouse if you can reproduce it easily, it would be extremely helpful
if you could send me a core dump or at least a stack trace of the process
when it happens.

I can help if you need instructions on how to do that.

@chadkouse

Yeah, tell me how to do that and I'll see if I can get it done.


@kr
Member Author

kr commented May 23, 2012

Before starting the beanstalkd process, run ulimit -c unlimited
in the shell where you'll start beanstalkd. That will cause the
system to generate a core dump when the process crashes.

If you run it on Mac OS X, the core file will be in /cores; if you're
on Linux, the core file should be in the directory where you ran
beanstalkd.
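
For reference, a process can also raise its own core-size limit at startup instead of relying on the launching shell. The sketch below assumes a POSIX system and uses the standard `getrlimit`/`setrlimit` calls; `enable_core_dumps` is a hypothetical helper name, not something beanstalkd provides. It raises the soft limit only as far as the hard limit, which is what an unprivileged process is allowed to do.

```c
#include <sys/resource.h>

/* Hypothetical helper: raise the soft core-file size limit to the hard
   limit so a crash can produce a core dump. Returns 0 on success and
   -1 on failure, with errno set by getrlimit or setrlimit. */
static int enable_core_dumps(void) {
    struct rlimit rl;
    if (getrlimit(RLIMIT_CORE, &rl) != 0) return -1;
    rl.rlim_cur = rl.rlim_max;   /* soft limit may not exceed the hard limit */
    return setrlimit(RLIMIT_CORE, &rl);
}
```

If the hard limit itself is 0, this call succeeds but no core file will be written; in that case the limit has to be raised by whatever starts the process.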

@chadkouse

Here's a core dump from 1.5 - http://dl.dropbox.com/u/32251821/core

We have put 1.6 on one of our nodes now and will report back if we get a crash there as well (with core dump)

Let me know when you've got that file so I can delete it from my dropbox.

@chadkouse

Hi Keith,

Attached is the core dump file from the 1.6 crash. Hope this helps.

On Thu, May 31, 2012 at 11:34 AM, Chad Kouse chad.kouse@gmail.com wrote:

#119 (comment)

Chad Kouse

On Thursday, May 31, 2012 at 11:34 AM, Chad Kouse wrote:

I'm just using github's issue tracker.
#119 (comment)
Chad Kouse

On Thursday, May 31, 2012 at 11:29 AM, Andrew Fessler wrote:

Can you forward me his info? We'll take the lead on it from here so you
don't need to stay involved.

Andrew Fessler
Director of Operations
andrew@tunewiki.com

On May 31, 2012, at 11:27 AM, Chad Kouse wrote:

Sent thanks

Chad Kouse

On Thursday, May 31, 2012 at 10:59 AM, Tyler Yosick wrote:

Attached is the core dump (16MB) from the 1.5 crash. Chad, could you
forward this to Keith? I don't have his email address.

We now have 1.6 running on bns6. I've used the same start up script for
beanstalk, so should 1.6 crash, I assume it would generate another dump.

Thanks.

On Wed, May 23, 2012 at 4:56 PM, Chad Kouse chad.kouse@gmail.com wrote:

Right. Sounds good

Chad Kouse

On Wednesday, May 23, 2012 at 4:26 PM, Tyler Yosick wrote:

Chad,

Jared and I are planning on throwing 1.5 on bns6 tomorrow, early afternoon,
to get that core dump for Keith. As soon as 1.5 crashes, we'll get the core
dump files to him and then throw 1.6 on bns6 to see what it does.

If 1.6 crashes, I assume we can perform the same procedure to get a core
dump from it?

On Wed, May 23, 2012 at 3:14 PM, Chad Kouse chad.kouse@gmail.com wrote:

Let's see if we can get Keith a core dump.

Chad Kouse

Forwarded message:

From: Keith Rarick <reply@reply.github.com>
To: chadkouse chad.kouse@gmail.com
Date: Wednesday, May 23, 2012 3:11:55 PM
Subject: Re: [beanstalkd] invalid pointer dereference (#119)

Before starting the beanstalkd process, run ulimit -c unlimited
in the shell where you'll start beanstalkd. That will cause the
system to generate a core dump when the process crashes.

If you run it on Mac OS X, the core file will be in /cores; if you're
on Linux, the core file should be in the directory where you ran
beanstalkd.



Tyler Yosick
TuneWiki Operations


Attachments:

  • core


@ghost

ghost commented Jun 4, 2012

Whoops. Looks like replying to one of the emails made me comment as Chad somehow...

Keith: How can I get this 1.6 core dump file to you?

@kr
Member Author

kr commented Jun 11, 2012

@chadkouse I have the core file now. Can you also send
the beanstalkd binary that generated this file? Sorry I didn't
ask for that earlier.

@tyosick You can email it to kr@xph.us. Please also send
the beanstalkd binary. Thanks!

@chadkouse

@tyosick works with us so I'll let him send both binaries.

@ghost

ghost commented Jun 11, 2012

@kr, just to confirm: you need both the 1.5 and 1.6 binaries (located in /usr/local/bin/beanstalkd)?

@kr
Member Author

kr commented Jun 11, 2012

@tyosick both would be best, though either one would probably be sufficient.
I need to match up the binary with the core file so I can load both of them into
gdb together. Yes, looking at the core file from 1.5, it says it is from
/usr/local/bin/beanstalkd.

$ file core
core: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/local/bin/beanstalkd -b /var/spool/beanstalkd -f 5000'

@kr
Member Author

kr commented Jul 1, 2012

Thanks for the core file and binary. Loading them into gdb
gives the following backtrace:

#0  0x00000000004054db in remove_waiting_conn ()
#1  0x000000000040982b in conn_timeout ()
#2  0x000000000040a11a in prottick ()
#3  0x000000000040b873 in srvtick ()
#4  0x000000000040b5e3 in sockmain ()
#5  0x000000000040b7a2 in srvserve ()
#6  0x000000000040d77b in main ()

@chadkouse

I'm not sure what this means - does it mean there's a bug in beanstalkd or a flaw in our setup?

@nitinahuja

Just had a similar crash in my production environment, running version 1.5.

Jul 11 15:16:01 server04 kernel: [32980446.960264] beanstalkd[18214] general protection ip:4095ab sp:7fffcff4b2c0 error:0 in beanstalkd[400000+12000]

Any recommendations or time to fix would be great.

@chadkouse

@kr, any recommendations here? We're stuck on 1.4.6 until we can solve this, unfortunately.

@kr
Member Author

kr commented Aug 25, 2012

I have been able to reproduce a crash. Now it's just a matter of time
until it gets fixed. I'm spending time here and there when I get a chance.

You're right in sticking to 1.4.6 until this is fixed.

Can I get you to run a test build of the bug fix when it's ready?

@chadkouse

Yeah we're willing to help test.

Chad Kouse


@nitinahuja

We're now seeing this happen about once a day. We are still using 1.5, as rolling back is not really straightforward.

My question is: if we have persistence turned on, are we losing any messages when this crash occurs? Knowing this would help us prioritize rolling back to 1.4.6.

@kr
Member Author

kr commented Aug 31, 2012

@chadkouse @nitinahuja could you please run this build
of beanstalkd and let me know if you see any problems?

https://s3.amazonaws.com/krheroku/beanstalkd

$ ./beanstalkd -v
beanstalkd 1.6+4+g236c669

If it works well I'll make a release.

@kr
Member Author

kr commented Aug 31, 2012

@nitinahuja The potential for losing jobs depends on how often
you have beanstalkd call fsync. The safest thing you can do is to
use -f 0, which means to call fsync before acknowledging any
change. (This isn't the default behavior because beanstalkd is
primarily designed for workloads that can tolerate losing jobs and
need more speed.)
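
The idea of calling fsync before acknowledging a change can be sketched in C as follows. This is only an illustration under assumptions, not beanstalkd's persistence code: `durable_append` is a hypothetical helper, and the file descriptor is assumed to refer to an append-style log file. The caller would send its acknowledgement to the client only after the function returns 0.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write the whole record to fd, then force it to
   stable storage with fsync. Returns 0 only when both steps succeed,
   so the caller can acknowledge the change afterwards. */
static int durable_append(int fd, const void *buf, size_t len) {
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR) continue;  /* interrupted: retry the write */
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return fsync(fd);
}
```

Syncing on every change trades throughput for durability, which matches the trade-off described above; syncing less often is faster but leaves a window of changes that a crash can lose.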

@kr
Member Author

kr commented Sep 3, 2012

@chadkouse @nitinahuja did either of you get a chance to test out the recent fixes?

@chadkouse

We are loading it onto our machines tomorrow. Will report back in a couple of days (or sooner if you didn't fix the bug :) )

Chad Kouse


@kr
Member Author

kr commented Sep 10, 2012

@chadkouse can I assume you've had no problems so far?

@chadkouse

Yea, no problems so far

Chad Kouse


@kr
Member Author

kr commented Sep 10, 2012

Excellent. I consider this fixed.

kr closed this as completed Sep 10, 2012

@chadkouse

Awesome, thanks. We loaded it on a few more machines, so we will let you know of any hiccups we come across.

Chad Kouse


@aqibsm

aqibsm commented Dec 13, 2012

I am facing the same issue on CentOS 6.3 with beanstalkd 1.8.

Dec 13 13:02:35 vm7-d2 kernel: beanstalkd[8701] general protection ip:40200a sp:7fff6c65fa10 error:0 in beanstalkd[400000+11000]
Dec 13 14:59:17 vm7-d2 kernel: beanstalkd[9993] general protection ip:40200a sp:7fff6c6f2fc0 error:0 in beanstalkd[400000+11000]

beanstalkd 1.4 was running fine without any issue, so I am reverting to 1.4.

@kr
Member Author

kr commented Dec 14, 2012

@aqibsm this is a different bug, though the symptom is the same.
Can you help me track it down? Let's discuss in #160.
