Out Of Memory errors #58
Comments
What you mentioned is a workaround, right? Would we really need more than one server if we didn't have these kinds of problems? I guess we don't have that many users.
Happened again tonight:
@tasomaniac it's not so much a workaround as it is a proper HA configuration.
Happened again a few days ago, but I didn't have time to investigate or collect a stack trace.
:( On Mon, Nov 2, 2015, 02:22 Michael Prentice notifications@github.com
Here's the latest status when the Hub stopped and started giving 502 status errors:
4318 zombies does not look good... but the resources don't seem to be otherwise bottlenecked (RAM and disk are fine).
OK, I've spun up a second Hub node (small instance; I tried micro but ran into ENOMEM errors with grunt). Clustering and load balancing now seem to be working:
hub:
hub-backup:
Then kill hub:
And the handoff is seamless with no interruption to traffic. I tried a few iterations of this in both directions and it seemed to work great. This does not solve the fact that the hub instances sometimes run out of memory or otherwise stop responding, but it should reduce the impact. I've started to set up Stackdriver monitoring to alert us when one of them stops responding, but I haven't completed that process yet.
Still seeing OOM errors bringing the server down:
The hub-backup also stopped responding to requests, but it did not leave any kind of stack trace, crash, or logs. I really want to move this to a managed service as this is far too much trouble atm.
Both VMs locked up last night, so even pm2 wouldn't have helped. We may need to go further and implement Kubernetes to orchestrate the containers and restart them when they fail health checks.
There are a number of reasons that the Hub stops responding to requests (OOM, exceptions, hangs, etc.). The goal isn't to solve all of these problems, because more will likely be introduced in the future via open source contributions and limited automated testing.
We need to set up a system to improve the Hub's HA, ideally via pm2 and possibly other packages. This is a common problem with Node.js projects and there are many examples and guides for handling it. We just need someone to set it up, test it, and finally work with me on deployment.
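As a starting point for the pm2 setup, here is a minimal sketch of a pm2 process file. The app name, entry script, instance count, and memory limit are all assumptions for illustration and are not taken from the Hub's actual configuration; only the option names (`instances`, `exec_mode`, `max_memory_restart`, `autorestart`, `restart_delay`) are standard pm2 settings.

```javascript
// ecosystem.config.js: a pm2 process file (sketch; all values are assumptions)
module.exports = {
  apps: [
    {
      name: 'hub',                 // hypothetical process name
      script: './server.js',       // hypothetical entry point; adjust to the real one
      instances: 2,                // run two workers so one can fail without downtime
      exec_mode: 'cluster',        // pm2 cluster mode shares the listening port
      max_memory_restart: '300M',  // restart a worker once it exceeds this much memory
      autorestart: true,           // restart after crashes or unhandled exceptions
      restart_delay: 1000,         // wait 1s between restarts to avoid tight crash loops
    },
  ],
};
```

This would be started with `pm2 start ecosystem.config.js`. Note that pm2 can only restart a process that crashes or grows past the memory limit; it does not by itself detect a process that hangs while still running.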
On Oct 6th at 6am the Hub started responding to all requests with a 502 error and the console just logged the request and timed out processing it at 2 seconds. This appears to be different than the previous issue with a resource leak which left a clear exception.
I've restarted the Hub and it's back online.
We need to spin up another Hub node and connect it to the load balancer so that if one goes down, we don't lose service. Then we probably also need to enable Stackdriver Monitoring and alerts so that we get emailed when the health checks fail for a node under the load balancer. We currently get no such notification.
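For the health checks themselves, a rough sketch of an endpoint the load balancer could poll is below. It is not the Hub's existing code: the `/healthz` path, the 400 MB threshold, and the function names `healthStatus` and `createHealthServer` are hypothetical. The idea is to report 503 once resident memory passes a limit, so that the node is taken out of rotation before it runs out of memory.

```javascript
const http = require('http');

// Hypothetical threshold: report unhealthy once resident memory passes this many bytes.
const DEFAULT_RSS_LIMIT_BYTES = 400 * 1024 * 1024;

// Pure function so the decision logic can be checked without starting a server.
function healthStatus(rssBytes, limitBytes = DEFAULT_RSS_LIMIT_BYTES) {
  const healthy = rssBytes < limitBytes;
  return {
    statusCode: healthy ? 200 : 503,
    body: JSON.stringify({ healthy, rssBytes, limitBytes }),
  };
}

// Builds (but does not start) an HTTP server exposing the hypothetical /healthz path.
function createHealthServer(getRssBytes = () => process.memoryUsage().rss) {
  return http.createServer((req, res) => {
    if (req.url !== '/healthz') {
      res.writeHead(404);
      res.end();
      return;
    }
    const { statusCode, body } = healthStatus(getRssBytes());
    res.writeHead(statusCode, { 'Content-Type': 'application/json' });
    res.end(body);
  });
}

module.exports = { healthStatus, createHealthServer };
```

To try it, call `createHealthServer().listen(8081)` (the port is an arbitrary choice) and point the load balancer's health check at `/healthz`. A check like this only catches memory growth; a fully hung event loop would show up as a health check timeout instead.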