New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix fdset memory leak #586

Merged
merged 14 commits into from Feb 10, 2015

Conversation

Projects
None yet
5 participants
@tmm1
Copy link
Contributor

tmm1 commented Feb 10, 2015

fixes leak reported in #585 (comment)

originally introduced in #502

/cc @piki @sodabrew

tmm1 added some commits Feb 10, 2015

epoll/kqueue on older rubies (without rb_wait_for_single_fd) should u…
…se rb_thread_select with regular fdset

epoll/kqueue fds are created early during ruby boot, so it is highly
unlikely that they will ever overflow FD_SETSIZE

if ((ret = rb_thread_fd_select(epfd + 1, &fdreads, NULL, NULL, &tv)) < 1) {
if ((ret = rb_thread_select(epfd + 1, &fdreads, NULL, NULL, &tv)) < 1) {

This comment has been minimized.

@tmm1

tmm1 Feb 10, 2015

Contributor

This epoll (and the below kqueue) path is only for legacy rubies that only shipped with rb_thread_select. Reactors are generally started very early during a ruby process, so it is highly unlikely that an epoll/kqueue fd will ever exceed FD_SETSIZE. It's safe to use a regular stack allocated fd_set here.

This comment has been minimized.

@piki

piki Feb 10, 2015

Contributor

Thanks for all the comments inline. This one is especially helpful, because it explains something that looks at first glance like a reversion.

👍 to all of it.

ext/em.cpp Outdated
@@ -877,7 +883,7 @@ _SelectDataSelect
static VALUE _SelectDataSelect (void *v)
{
SelectData_t *sd = (SelectData_t*)v;
sd->nSockets = rb_fd_select (sd->maxsocket+1, &(sd->fdreads), &(sd->fdwrites), &(sd->fderrors), &(sd->tv));
sd->nSockets = select (sd->maxsocket+1, &(sd->fdreads), &(sd->fdwrites), &(sd->fderrors), &(sd->tv));

This comment has been minimized.

@tmm1

tmm1 Feb 10, 2015

Contributor

This has to use a raw select because it is invoked inside a "thread blocking region". This was a legacy api in ruby 1.9.x which allowed callbacks to be invoked outside ruby's GVL, and thus is not allowed to call back into any ruby apis.

This comment has been minimized.

@tmm1

tmm1 Feb 10, 2015

Contributor

This is failing to compile since we're no longer using raw fd_sets. Switching to select here also reintroduces the original issue where the select reactor breaks on large fds.

Unfortunately it doesn't look like rb_fd_select is safe to call from outside the gvl, because it can invoke ruby_xmalloc which can kick off a ruby GC.

The correct solution here is probably to prefer rb_fd_select over the raw rb_thread_blocking_region and rb_thread_call_without_gvl apis when available, to let the VM do all the blocking GVL stuff for us.

@@ -1045,6 +1051,7 @@ void EventMachine_t::_CleanBadDescriptors()
rb_fd_set(sd, &fds);

int ret = rb_fd_select(sd + 1, &fds, NULL, NULL, &tv);
rb_fd_term(&fds);

This comment has been minimized.

@tmm1

tmm1 Feb 10, 2015

Contributor

This codepath is only exercised when a BADF is encountered via the select() reactor. It is quite rare, so I left the heap allocation in place and am explicitly cleaning up here.

@tmm1 tmm1 force-pushed the fix-fdset-leak branch from e0b095a to 2ef7674 Feb 10, 2015

#define EmSelect rb_thread_fd_select
#else
// ruby 1.9.1 and below
#define EmSelect rb_thread_select

This comment has been minimized.

@tmm1

tmm1 Feb 10, 2015

Contributor

1.9.1 is a special case because it defines rb_fdset_t and rb_thread_select, but not rb_fd_init or rb_thread_fd_select

{
rb_fd_term (&fdreads);
rb_fd_term (&fdwrites);
rb_fd_term (&fderrors);

This comment has been minimized.

@tmm1

tmm1 Feb 10, 2015

Contributor

These rb_fd_term will cleanup heap allocated memory associated with the rb_fdset_t, but I wanted to avoid allocation/deallocation on every reactor tick so I made the entire SelectData_t get allocated once when the reactor is created.

@tmm1

This comment has been minimized.

Copy link
Contributor

tmm1 commented Feb 10, 2015

cc @neongrau this should fix the memory leak you are seeing

@tmm1

This comment has been minimized.

Copy link
Contributor

tmm1 commented Feb 10, 2015

This leak was present on any ruby >1.9.1 when using the select reactor, and only on ruby <2.0.0 when using the epoll or kqueue reactor. Our production epoll+ruby 2.1 instances were not affected which is why I didn't notice the memory leak.

tmm1 added some commits Feb 10, 2015

switch back to rb_thread_call_without_gvl on modern rubies
avoid the extra malloc+memcpy inside rb_thread_fd_select (for
rb_fd_init_copy), since we already have an rb_fdset_t ready to pass into
select().

tmm1 added a commit that referenced this pull request Feb 10, 2015

@tmm1 tmm1 merged commit bc18211 into master Feb 10, 2015

1 check was pending

continuous-integration/travis-ci The Travis CI build is in progress
Details

@tmm1 tmm1 deleted the fix-fdset-leak branch Feb 10, 2015

@SamSaffron

This comment has been minimized.

Copy link
Contributor

SamSaffron commented Feb 16, 2015

Looks like we also saw this leak pretty bad at Discourse.

@tmm1

This comment has been minimized.

Copy link
Contributor

tmm1 commented Feb 16, 2015

We should probably yank the bad releases from rubygems /cc @sodabrew

@SamSaffron

This comment has been minimized.

Copy link
Contributor

SamSaffron commented Feb 16, 2015

@tmm1 on one hand some people say, never ever yank gems, on the other hand I think it is called for here because it is very hard to debug, breaking deployments is actually desirable here.

cc @indirect

@indirect

This comment has been minimized.

Copy link

indirect commented Feb 16, 2015

The current tradeoffs mean I only recommend it for security issues. Yanking will break new deploys or checkouts for every project that depends on the yanked gem.

@SamSaffron

This comment has been minimized.

Copy link
Contributor

SamSaffron commented Feb 16, 2015

@indirect a memory leak that will take out your server at 3AM in the morning while you are sleeping is arguably just as important as a security flaw, perhaps even more.

Its a tough call to make, but I think "security only" should be a policy that should have a bit of movement for cases such as this, especially since it is so hard to diagnose.

@indirect

This comment has been minimized.

Copy link

indirect commented Feb 16, 2015

@SamSaffron well, that's why it's my recommendation rather than my demand. :)

More seriously, it's a tradeoff. Some people are apparently using this version without any problems, and those people will have their ability to develop or deploy arbitrarily broken by a yank. Other people have a potential problem of a memory leak taking down their app servers. I have no idea which group of people is bigger or more important, but my general advice is (and has been) to not yank and ask people to upgrade instead, in order to reduce breakage.

@SamSaffron

This comment has been minimized.

Copy link
Contributor

SamSaffron commented Feb 16, 2015

@indirect I totally get where you are getting at.

The thing that is really hurting our ecosystem is that there is no way of communicating this kind of stuff outside of yanking.

Sure, Aman can tweet about it, but at best it will reach a fraction of the users. The homepage here could be updated but really nobody is going to the readme for news on a regular basis, when you are lugging around 200 dependencies its just not practical to subscribe to every stream of information.

What we really need is a way to be able to simply flag a gem as "problematic", then next "bundle install" can output your gem blaba has a serious memory leak, please upgrade to version 2.3 urgently , it would be super duper awesome and cover this issue just fine. It is also way better than yanking cause it would teach you why?

@indirect

This comment has been minimized.

Copy link

indirect commented Feb 16, 2015

Notifications about gems seem good. Someone should build that. :)

@SamSaffron

This comment has been minimized.

Copy link
Contributor

SamSaffron commented Feb 16, 2015

@indirect Google Summer Of Code !!! 👍 does bundler have a project there, only 4 days left

@sodabrew

This comment has been minimized.

Copy link
Contributor

sodabrew commented Feb 16, 2015

At the moment, there have been 45,324 downloads of 1.0.6 and 27,480 downloads of 1.0.5. If those numbers are still moving significantly in a few days, I would consider a yank.

+1 for adding notices to the Rubygems API, that'd be really neat. 🌱

Arie added a commit to Arie/serveme that referenced this pull request Feb 16, 2015

@SamSaffron

This comment has been minimized.

Copy link
Contributor

SamSaffron commented Mar 31, 2015

update on @sodabrew's number

72183 downloads for 1.0.5
57421 downloads for 1.0.6

I recommend yanking @tmm1 this bleeding should stop. It is better to break builds than have them deal with debugging this.

@sodabrew

This comment has been minimized.

Copy link
Contributor

sodabrew commented Mar 31, 2015

Compared to 350,000 downloads for 1.0.7 after ~45 days. I don't feel too strongly either way. @SamSaffron it sounds like you feel strongly for the yank?

@SamSaffron

This comment has been minimized.

Copy link
Contributor

SamSaffron commented Mar 31, 2015

I think its appropriate for this particular case. your call

@tmm1

This comment has been minimized.

Copy link
Contributor

tmm1 commented Apr 1, 2015

Yea let's yank them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment