Out Of Memory errors #58

Splaktar · 2015-10-07T04:20:36Z

There are a number of reasons that the Hub stops responding to requests (OOM, exceptions, hangs, etc.). The goal isn't to solve all of these problems, because more will likely be introduced in the future through open source contributions and limited automated testing.

We need to set up a system to improve the Hub's high availability (HA), ideally via pm2 and possibly other packages. This is a common problem with Node.js projects and there are many examples and guides for handling it. We just need someone to set it up, test it, and finally work with me on deployment.
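As a starting point, a pm2 process file could look roughly like the sketch below. The process name, entry-point path, and memory threshold are assumptions for illustration; this thread only shows the Hub being started through `grunt serve:dist`, so the real entry point would need to be confirmed.

```javascript
// ecosystem.config.js: hypothetical pm2 process file for the Hub.
// The name, script path, and memory limit below are assumptions.
var config = {
  apps: [
    {
      name: 'gdgx-hub',            // assumed process name
      script: 'server/app.js',     // assumed entry point, not confirmed in this thread
      instances: 1,
      autorestart: true,           // restart the process when it exits or crashes
      max_memory_restart: '300M',  // restart once memory passes this (assumed) limit
      env: { NODE_ENV: 'production' }
    }
  ]
};

module.exports = config;
```

With a file like this, `pm2 start ecosystem.config.js` would launch the app under pm2's supervision. The `max_memory_restart` setting is the relevant part for OOM crashes, since it recycles the process before an allocation failure aborts it.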

On Oct 6th at 6am the Hub started responding to all requests with a 502 error, and the console just logged each request and timed out after 2 seconds of processing. This appears to be different from the previous resource leak issue, which left a clear exception.

I've restarted the Hub and it's back online.

We need to spin up another Hub node and connect it to the load balancer so that if one goes down, we don't lose service. Then we probably also need to enable Stackdriver Monitoring and alerts so that we get emailed when the Health Checks fail for a node under the load balancer. We currently get no such notification.
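For the load balancer and any alerting to detect a hung node, each Hub instance needs an endpoint that a health check can poll. A minimal sketch of such a handler follows; the `/healthz` path, the payload fields, and the function name are all assumptions, not something the Hub is known to expose.

```javascript
// Hypothetical health-check handler. The /healthz path and the payload
// fields are assumptions chosen for illustration.
function handleHealthCheck(req, res) {
  if (req.method !== 'GET' || req.url !== '/healthz') {
    return false; // not a health check; let the normal routes handle it
  }
  var body = JSON.stringify({
    status: 'ok',
    uptimeSeconds: Math.round(process.uptime()),
    rssBytes: process.memoryUsage().rss
  });
  res.writeHead(200, { 'Content-Type': 'application/json' });
  res.end(body);
  return true;
}

// Possible wiring with Node's built-in http module (port is an assumption):
// require('http').createServer(function (req, res) {
//   if (!handleHealthCheck(req, res)) { res.writeHead(404); res.end(); }
// }).listen(3000);
```

A node that hangs the way the Oct 6th incident describes would stop answering this endpoint, which is what lets the health check mark it unhealthy.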

tasomaniac · 2015-10-07T06:14:40Z

What you mentioned is a workaround, right? Would we really need more than one server if we didn't have these kinds of problems? I guess we don't have that many users.

Splaktar · 2015-10-16T03:21:49Z

Happened again tonight:

FATAL ERROR: node::smalloc::Alloc(v8::Handle<v8::Object>, size_t, v8::ExternalArrayType) Out Of Memory
Aborted (core dumped)

@tasomaniac it's not so much a workaround as it is a proper HA configuration.

Splaktar · 2015-10-17T22:44:52Z

This seems related http://stackoverflow.com/questions/31856829/memory-error-in-node-js-nodesmallocalloc

Splaktar · 2015-11-01T23:22:47Z

Happened again a few days ago, but I didn't have time to investigate or collect a stack trace.

tasomaniac · 2015-11-01T23:58:14Z

:(


Splaktar · 2015-11-14T17:44:42Z

FATAL ERROR: node::smalloc::Alloc(v8::Handle<v8::Object>, size_t, v8::ExternalArrayType) Out Of Memory
Aborted (core dumped)
npm ERR! Linux 3.19.0-28-generic
npm ERR! argv "/usr/bin/node" "/usr/bin/npm" "runScript" "startProd"
npm ERR! node v0.12.7
npm ERR! npm  v2.11.3
npm ERR! code ELIFECYCLE
npm ERR! gdgx-hub@0.0.2 startProd: `grunt serve:dist`
npm ERR! Exit status 134
npm ERR! 
npm ERR! Failed at the gdgx-hub@0.0.2 startProd script 'grunt serve:dist'.
npm ERR! This is most likely a problem with the gdgx-hub package,
npm ERR! not with npm itself.
npm ERR! Tell the author that this fails on your system:
npm ERR!     grunt serve:dist
npm ERR! You can get their info via:
npm ERR!     npm owner ls gdgx-hub
npm ERR! There is likely additional logging output above.
npm ERR! Please include the following file with any support request:
npm ERR!     /opt/hub/npm-debug.log

Splaktar · 2015-11-30T10:44:53Z

Here's the latest status when the Hub stopped and started giving 502 status errors:

  System information as of Mon Nov 30 10:41:11 UTC 2015
  System load:  0.0               Processes:           4398
  Usage of /:   24.7% of 9.69GB   Users logged in:     0
  Memory usage: 49%               IP address for eth0: 10.111.216.151
  Swap usage:   0%
  => There are 4318 zombie processes.

4318 zombies do not look good... but resources don't otherwise seem to be bottlenecked (RAM and disk are fine).
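Zombies are child processes that have exited but were never reaped by their parent, so grouping them by parent PID should show which process is leaking them. A rough sketch, assuming the output of `ps -eo ppid=,stat=` on Linux (the function name is hypothetical):

```javascript
// Count zombie (defunct) processes per parent PID.
// Expects the text output of `ps -eo ppid=,stat=`, one process per line.
function countZombiesByParent(psOutput) {
  var counts = {};
  psOutput.split('\n').forEach(function (line) {
    var fields = line.trim().split(/\s+/);
    if (fields.length < 2) {
      return; // skip blank or malformed lines
    }
    var ppid = fields[0];
    var stat = fields[1];
    if (stat.charAt(0) === 'Z') { // a state starting with Z marks a zombie
      counts[ppid] = (counts[ppid] || 0) + 1;
    }
  });
  return counts;
}

// Possible usage on the VM:
// var out = require('child_process').execSync('ps -eo ppid=,stat=').toString();
// console.log(countZombiesByParent(out));
```

If nearly all of the zombies share one parent PID, that parent (for example the grunt or node process) is the one not waiting on its children.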

Splaktar · 2015-12-01T02:29:16Z

OK, I've spun up a second Hub node (small instance, tried micro but ran into ENOMEM errors with grunt).

Now clustering and load balancing seem to be working:

hub:

[1510] worker-2317 just said hi. Replying.
[1510] was master: true, now master: true

hub-backup:

[2317] Risky is up. I'm worker-2317
[2317] Cancel masterResponder
[2317] was master: false, now master: false

Then kill hub:

[2317] worker-1510 has gone down...
[2317] was master: false, now master: true

And the handoff is seamless with no interruption to traffic. I tried a few iterations of this in both directions and it seemed to work great.

This does not solve the fact that the hub instances sometimes run out of memory or otherwise stop responding, but it should reduce the impact. I've started to set up Stackdriver monitoring to alert us when one of them stops responding, but I haven't completed that process yet.

Splaktar · 2015-12-28T00:27:20Z

Still seeing OOM errors bringing the server down:

FATAL ERROR: node::smalloc::Alloc(v8::Handle<v8::Object>, size_t, v8::ExternalArrayType) Out Of Memory
Aborted (core dumped)
npm ERR! Linux 3.19.0-39-generic
npm ERR! argv "/usr/bin/node" "/usr/bin/npm" "runScript" "startProd"
npm ERR! node v0.12.9
npm ERR! npm  v2.14.9
npm ERR! code ELIFECYCLE
npm ERR! gdgx-hub@0.1.0 startProd: `grunt serve:dist`
npm ERR! Exit status 134

The hub-backup also stopped responding to requests. But it did not have any kind of stack trace, crash, or logs. I really want to move this to a managed service as this is far too much trouble atm.
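Until the root cause is found, periodic memory logging might at least show how quickly usage grows before a crash. The sketch below uses only `process.memoryUsage()`; the function names and the 60 second default interval are assumptions. Since the failing allocation is in `node::smalloc` (external array memory rather than the V8 heap), an `rss` value that climbs while `heapUsed` stays flat may point toward Buffer-style allocations, though that is a guess rather than a diagnosis.

```javascript
// Convert a process.memoryUsage() result into a one-line log entry in MB.
function formatMemoryUsage(usage) {
  function toMb(bytes) {
    return (bytes / 1024 / 1024).toFixed(1);
  }
  return 'rss=' + toMb(usage.rss) + 'MB' +
    ' heapUsed=' + toMb(usage.heapUsed) + 'MB' +
    ' heapTotal=' + toMb(usage.heapTotal) + 'MB';
}

// Log memory usage at a fixed interval (default of 60s is an assumption).
function startMemoryLogging(intervalMs, log) {
  var write = log || console.log;
  var timer = setInterval(function () {
    write('[' + new Date().toISOString() + '] ' +
      formatMemoryUsage(process.memoryUsage()));
  }, intervalMs || 60000);
  timer.unref(); // do not keep the process alive just for logging
  return timer;
}
```

Calling `startMemoryLogging()` once at startup would leave a trail in the console log leading up to the next abort.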

Splaktar · 2016-01-19T13:28:34Z

Both VMs locked up last night, so even pm2 wouldn't have helped. We may need to go further and implement Kubernetes to orchestrate the containers and restart them when they fail health checks.

Splaktar · 2017-03-13T01:47:48Z

If we implement #100, then this should be much less of an issue. It's also been many months since these were an issue, though I think that this is due to the resolution of auto restarting in #88.

Splaktar added the bug label Oct 7, 2015

Splaktar added this to the v0.1.0 milestone Oct 7, 2015

Splaktar modified the milestones: v0.2.0, v0.1.0 Nov 1, 2015

Splaktar changed the title ~~Hub stops responding to requests~~ Implement pm2 recovery/restart Dec 30, 2015

Splaktar changed the title ~~Implement pm2 recovery/restart~~ Out Of Memory errors Jun 10, 2016

Splaktar modified the milestones: v0.2.0, v0.2.1 Aug 29, 2016

Splaktar modified the milestones: v0.3.0, v0.2.1 Mar 17, 2017
