
Major Site Outage #66

Closed
dkfellows opened this issue Mar 29, 2016 · 35 comments

Comments

@dkfellows

The site is very down and has been so for a substantial amount of time. This would have been a non-reportable SNAFU under Discourse, but with NodeBB it is indicative of a serious problem.

[image]

No idea what is wrong.

@LB--

LB-- commented Mar 29, 2016

[image]

@BenLubar
Collaborator

I disabled the notifications list and the cooties seem to have stopped.

And by "disabled" I mean I added this after this line:

                // TDWTF DEBUG 2016-03-29
                return callback(null, []);
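
For anyone reading along, the added return makes the function hand back an empty notifications list and skip whatever work would have followed. Roughly, as a sketch only (the function name, signature, and placement are assumptions, not NodeBB's actual code):

    // Sketch: what the short-circuited loader effectively becomes.
    function getNotifications(uid, callback) {
        // TDWTF DEBUG 2016-03-29
        return callback(null, []);

        // ...the original (expensive) lookups below this point are never reached...
    }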

/cc @julianlam

@julianlam

Git hash plus applicable error stack traces please and thank you 😄

@BenLubar
Collaborator

@julianlam No stack traces, but I added this:

BenLubar/NodeBB@117b8d2

I updated from NodeBB 1783a07 to e99d952 during the cooties but they didn't stop until I disabled that function.

Here are some snippets from IRC:

13:54 < BenLubar> 29/3 18:53 [39] - warn: [socket.io] slow callback - 1044732ms 
                  - uid: [redacted] - ip: [redacted] - event: notifications.get 
                  - params: null - err: null
...
14:57 < BenLubar> 1174710ms for notifications.get

@BenLubar
Collaborator

[image]

@BenLubar
Collaborator

@julianlam

Odd, if it was a crash from that bug, it would've been fixed in more recent commits.

@BenLubar
Collaborator

Ah, I hadn't read the stack traces in that topic very closely. It is indeed a separate bug.

@dkfellows
Author

I do not know if the notifications thing was the problem or merely a symptom, but the continued performance difficulties tend to indicate it was just a symptom. For example, I'm currently seeing extremely long load times for small topics; eventually I either get a 504 Gateway Timeout or the topic loads, and it's apparently arbitrary which happens.

Hunting performance problems can be hard. The only way to do it is to keep on improving the instrumentation you're applying to service calls in the hope of catching the trouble red-handed.
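
As an illustration of the kind of instrumentation meant here, you can wrap a callback-style socket handler so that slow calls get logged with the event name and elapsed time, much like the socket.io "slow callback" warnings quoted above. This is a sketch only; the wrapper name, the handler signature, and the 1000ms threshold are made up for the example and are not NodeBB code:

    // Wrap a callback-style handler so slow invocations get logged.
    function wrapWithTiming(eventName, handler, thresholdMs) {
        return function (socket, params, callback) {
            var start = Date.now();
            handler(socket, params, function (err, result) {
                var elapsed = Date.now() - start;
                if (elapsed > thresholdMs) {
                    console.warn('[socket.io] slow callback - ' + elapsed + 'ms' +
                        ' - event: ' + eventName + ' - err: ' + (err ? err.message : null));
                }
                callback(err, result);
            });
        };
    }

    // e.g. handlers.get = wrapWithTiming('notifications.get', handlers.get, 1000);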

@LB--

LB-- commented Mar 30, 2016

[image]

@DoctaJonez

Do we know if the storm affected all 4 NodeBB instances, or just specific ones? If it affected all 4 instances simultaneously, that would tend to suggest a single point of failure, like MongoDB.

If the slowdown was in the Node backend, I'd expect individual instances to suffer while other instances are still OK.

Do we have any way of profiling the database and the node instances to see what's going on?

It'd be useful to profile the hosts to see if CPU, memory, disk access, or network bandwidth is being saturated.
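
Something along these lines on each host would show whether a resource is being pegged. These are standard Linux tools rather than anything specific to our setup (iostat and sar come from the sysstat package, so this assumes it is installed on the instance):

    # CPU, memory, swap, and run queue, sampled every second
    vmstat 1

    # Per-device disk utilisation, await, and average queue size
    iostat -x 1

    # Per-interface network throughput
    sar -n DEV 1

    # Current memory breakdown
    free -m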

@boomzillawtf

Search may also be a factor. I remember it being very slow yesterday and at least one user reported cootie storms when he started searching.

@julianlam

Is everything on one t2.medium? It may be a better idea to put the database elsewhere if it isn't already.

@llouviere

Timetable on enabling notification dropdown?

Is there an alternative?

@BenLubar
Collaborator

[screenshots taken 2016-03-30 at 11:23:53, 11:22:38, and 11:20:43]

It looks like we need more disk IO operations per second allocated.

@AccaliaDeElementia

ouch, yeah that disk queue length is.... off the charts. ideally you want that sucker solidly under 1.0, but we seem to be north of 10 regularly.... no wonder we have issues!

@AccaliaDeElementia

@llouviere alternative for notifications?

i think i can come up with something...

@LB--

LB-- commented Mar 30, 2016

Forums seem to be completely offline. SSL error over HTTPS and 404 over HTTP. Maintenance, I assume?

@BenLubar
Collaborator

I stopped the AWS instance so I could snapshot the disk. Currently about 50% done. I hope the IP won't change when it comes back up.

@llouviere

We could just use GitHub for our forums.

Pretty good reliability here.

@pauljherring

=-o

@LB--

LB-- commented Mar 30, 2016

[image]

BenLubar reopened this Mar 30, 2016
@julianlam

@BenLubar Any chance you can just use nginx to send Googlebot a 503 temporarily?

http://stackoverflow.com/questions/2786595/what-is-the-correct-http-status-code-to-send-when-a-site-is-down-for-maintenance

Edit: Maybe they'll respond to an HTTP 429

@BenLubar
Collaborator

I've rate limited anyone with a user-agent matching /(bot|spider|slurp|crawler)/i to 1 request every 10 seconds (per IP) in nginx.
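
For the record, that looks roughly like this in the nginx config. The zone name, zone size, and burst value here are assumptions rather than our exact settings, but map + limit_req_zone + limit_req is the standard mechanism, and 1 request every 10 seconds is written as 6r/m:

    # Bot-like user agents get keyed by client IP; everyone else gets an
    # empty key, and requests with an empty key are not rate limited.
    map $http_user_agent $bot_limit_key {
        default                       "";
        ~*(bot|spider|slurp|crawler)  $binary_remote_addr;
    }

    # 6 requests per minute = 1 request every 10 seconds, per IP.
    limit_req_zone $bot_limit_key zone=bots:10m rate=6r/m;

    server {
        location / {
            limit_req zone=bots burst=5;
            # ...proxy_pass to the NodeBB backend...
        }
    }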

@BenLubar
Collaborator

Looks like the rate limiting handled the bots and the better disk handled the large number of humans.

@dkfellows
Author

And another one; this one has been about an hour long so far (and started about 5 minutes before I attempted to use the site).

[image]

@DoctaJonez

Does this need opening as a new issue so it gets noticed? I think it was possibly a bit optimistic closing it in the first place.

The site seems to have been down for quite some time now.

(cc @BenLubar)

@BenLubar
Collaborator

BenLubar commented Apr 4, 2016

I just woke up and the site seems fine right now. Here's a graph:

[screenshot taken 2016-04-04 at 08:09:43]

@AccaliaDeElementia

so.... what happened between 0230 and 0440 then?

@DoctaJonez

> so.... what happened between 0230 and 0440 then?

Could it be backups? Maintenance plans (log file purging, etc.)? Software updates?

@AccaliaDeElementia

could be any of them. i asked because i don't know what it was and want to.

:-P

@dkfellows
Author

It hit again this morning (and is still ongoing as I write this), commencing at about 07:50 GMT. The nginx front-end is sometimes fast to respond and sometimes not, as if there's a resource exhaustion problem, but the back-end is consistently unresponsive.

@dkfellows
Author

Also, I assume you have read http://redis.io/topics/lru-cache about maxmemory tuning? If not, you really should, as Redis defaults to being a memory hog.

@BenLubar
Collaborator

maxmemory has been set to 100mb and maxmemory-policy to allkeys-lru since before this last set of cooties started.
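
For anyone checking their own instance, those are literal redis.conf directives, with the values as quoted above:

    # Cap Redis at 100 MB and evict least-recently-used keys (any key,
    # not just ones with an expiry) once the cap is reached.
    maxmemory 100mb
    maxmemory-policy allkeys-lru

So Redis is capped well below the instance's RAM and shouldn't be the memory hog here.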

@julianlam

@BenLubar output of ss -s? Run mongostat and see if there are items in the mongo write/read queue...
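
Concretely, something like this on the box (both are stock tools, and the 1-second interval is just an example; mongostat ships with MongoDB):

    # Socket summary: counts per TCP state (ESTAB, TIME-WAIT, ...)
    ss -s

    # One sample per second of MongoDB activity; rising numbers in the
    # qr|qw (queued reads/writes) columns mean requests are queueing
    # inside the database.
    mongostat 1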
