
Major Site Outage #66

Closed
dkfellows opened this issue Mar 29, 2016 · 35 comments

Comments

@dkfellows

The site is very down and has been so for a substantial amount of time. This would have been a non-reportable SNAFU under Discourse, but with NodeBB it is indicative of a serious problem.

[image]

No idea what is wrong.

@LB--

LB-- commented Mar 29, 2016

[image]

@BenLubar
Collaborator

I disabled the notifications list and the cooties seem to have stopped.

And by "disabled" I mean I added this after this line:

                // TDWTF DEBUG 2016-03-29
                return callback(null, []);
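
For anyone reading along, the added return makes the function hand back an empty notifications list and skip whatever work would have followed. Roughly, as a sketch only (the function name, signature, and placement are assumptions, not NodeBB's actual code):

    // Sketch: what the short-circuited loader effectively becomes.
    function getNotifications(uid, callback) {
        // TDWTF DEBUG 2016-03-29
        return callback(null, []);

        // ...the original (expensive) lookups below this point are never reached...
    }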

/cc @julianlam

@julianlam

Git hash plus applicable error stack traces please and thank you 😄

@BenLubar
Collaborator

@julianlam No stack traces, but I added this:

BenLubar/NodeBB@117b8d2

I updated from NodeBB 1783a07 to e99d952 during the cooties but they didn't stop until I disabled that function.

Here are some snippets from IRC:

13:54 < BenLubar> 29/3 18:53 [39] - warn: [socket.io] slow callback - 1044732ms 
                  - uid: [redacted] - ip: [redacted] - event: notifications.get 
                  - params: null - err: null
...
14:57 < BenLubar> 1174710ms for notifications.get

@BenLubar
Collaborator

[image]

@BenLubar
Collaborator

@julianlam

Odd, if it was a crash from that bug, it would've been fixed in more recent commits.

@BenLubar
Collaborator

Ah, I hadn't read the stack traces in that topic very closely. It is indeed a separate bug.

@dkfellows
Author

I do not know if the notifications thing was the problem or merely a symptom, but the continued performance difficulties tend to indicate it was just a symptom. For example, I'm currently seeing extremely long load times for small topics; eventually I either get a 504 Gateway Timeout or the topic loads, and it's apparently arbitrary which happens.

Hunting performance problems can be hard. The only way to do it is to keep on improving the instrumentation you're applying to service calls in the hope of catching the trouble red-handed.
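
As an illustration of the kind of instrumentation meant here, you can wrap a callback-style socket handler so that slow calls get logged with the event name and elapsed time, much like the socket.io "slow callback" warnings quoted above. This is a sketch only; the wrapper name, the handler signature, and the 1000ms threshold are made up for the example and are not NodeBB code:

    // Wrap a callback-style handler so slow invocations get logged.
    function wrapWithTiming(eventName, handler, thresholdMs) {
        return function (socket, params, callback) {
            var start = Date.now();
            handler(socket, params, function (err, result) {
                var elapsed = Date.now() - start;
                if (elapsed > thresholdMs) {
                    console.warn('[socket.io] slow callback - ' + elapsed + 'ms' +
                        ' - event: ' + eventName + ' - err: ' + (err ? err.message : null));
                }
                callback(err, result);
            });
        };
    }

    // e.g. handlers.get = wrapWithTiming('notifications.get', handlers.get, 1000);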

@LB--

LB-- commented Mar 30, 2016

[image]

@DoctaJonez

Do we know if the storm affected all 4 NodeBB instances, or just specific ones? If it affected all 4 instances simultaneously, that would tend to suggest a single point of failure, like MongoDB.

If the slowdown was in the Node backend, I'd expect individual instances to suffer while other instances are still OK.

Do we have any way of profiling the database and the node instances to see what's going on?

It'd be useful to profile the hosts to see if CPU, memory, disk access, or network bandwidth is being saturated.
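
Something along these lines on each host would show whether a resource is being pegged. These are standard Linux tools rather than anything specific to our setup (iostat and sar come from the sysstat package, so this assumes it is installed on the instance):

    # CPU, memory, swap, and run queue, sampled every second
    vmstat 1

    # Per-device disk utilisation, await, and average queue size
    iostat -x 1

    # Per-interface network throughput
    sar -n DEV 1

    # Current memory breakdown
    free -m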

@boomzillawtf

Search may also be a factor. I remember it being very slow yesterday and at least one user reported cootie storms when he started searching.

@julianlam

Is everything on one t2.medium? It may be a better idea to put the database elsewhere if it isn't already.

@llouviere

Timetable on enabling notification dropdown?

Is there an alternative?

@BenLubar
Collaborator

[screenshots taken 2016-03-30 at 11:23:53, 11:22:38, and 11:20:43]

It looks like we need more disk IO operations per second allocated.

@AccaliaDeElementia

ouch, yeah that disk queue length is.... off the charts. ideally you want that sucker solidly under 1.0, but we seem to be north of 10 regularly.... no wonder we have issues!

@AccaliaDeElementia

@llouviere alternative for notifications?

i think i can come up with something...

@LB--

LB-- commented Mar 30, 2016

Forums seem to be completely offline. SSL error over HTTPS and 404 over HTTP. Maintenance, I assume?

@BenLubar
Collaborator

I stopped the AWS instance so I could snapshot the disk. Currently about 50% done. I hope the IP won't change when it comes back up.

@llouviere

We could just use GitHub for our forums.

Pretty good reliability here.

@pauljherring

=-o

@LB--

LB-- commented Mar 30, 2016

[image]

BenLubar reopened this Mar 30, 2016
@julianlam

@BenLubar Any chance you can just use nginx to send Googlebot a 503 temporarily?

http://stackoverflow.com/questions/2786595/what-is-the-correct-http-status-code-to-send-when-a-site-is-down-for-maintenance

Edit: Maybe they'll respond to an HTTP 429

@BenLubar
Collaborator

I've rate limited anyone with a user-agent matching /(bot|spider|slurp|crawler)/i to 1 request every 10 seconds (per IP) in nginx.
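
For the record, that looks roughly like this in the nginx config. The zone name, zone size, and burst value here are assumptions rather than our exact settings, but map + limit_req_zone + limit_req is the standard mechanism, and 1 request every 10 seconds is written as 6r/m:

    # Bot-like user agents get keyed by client IP; everyone else gets an
    # empty key, and requests with an empty key are not rate limited.
    map $http_user_agent $bot_limit_key {
        default                       "";
        ~*(bot|spider|slurp|crawler)  $binary_remote_addr;
    }

    # 6 requests per minute = 1 request every 10 seconds, per IP.
    limit_req_zone $bot_limit_key zone=bots:10m rate=6r/m;

    server {
        location / {
            limit_req zone=bots burst=5;
            # ...proxy_pass to the NodeBB backend...
        }
    }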

@BenLubar
Collaborator

Looks like the rate limiting handled the bots and the better disk handled the large number of humans.

@dkfellows
Author

And another one; this one has been about an hour long so far (and started about 5 minutes before I attempted to use the site).

[image]

@DoctaJonez

Does this need opening as a new issue so it gets noticed? I think it was possibly a bit optimistic closing it in the first place.

The site seems to have been down for quite some time now.

(cc @BenLubar)

@BenLubar
Collaborator

BenLubar commented Apr 4, 2016

I just woke up and the site seems fine right now. Here's a graph:

[screenshot taken 2016-04-04 at 08:09:43]

@AccaliaDeElementia

so.... what happened between 0230 and 0440 then?

@DoctaJonez

> so.... what happened between 0230 and 0440 then?

Could it be backups? Maintenance plans (log file purging, etc.)? Software updates?

@AccaliaDeElementia

could be any of them. i asked because i don't know what it was and want to.

:-P

@dkfellows
Author

It hit again this morning (and is still ongoing as I write this), commencing at about 07:50 GMT. The nginx front-end is sometimes fast to respond and sometimes not, as if there's a resource exhaustion problem, but the back-end is consistently unresponsive.

@dkfellows
Author

Also, I assume you have read http://redis.io/topics/lru-cache about maxmemory tuning? If not, you really should, as Redis defaults to being a memory hog.

@BenLubar
Collaborator

maxmemory has been set to 100mb and maxmemory-policy to allkeys-lru since before this last set of cooties started.
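
For anyone checking their own instance, those are literal redis.conf directives, with the values as quoted above:

    # Cap Redis at 100 MB and evict least-recently-used keys (any key,
    # not just ones with an expiry) once the cap is reached.
    maxmemory 100mb
    maxmemory-policy allkeys-lru

So Redis is capped well below the instance's RAM and shouldn't be the memory hog here.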

@julianlam

@BenLubar output of ss -s? Run mongostat and see if there are items in the mongo write/read queue...
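
Concretely, something like this on the box (both are stock tools, and the 1-second interval is just an example; mongostat ships with MongoDB):

    # Socket summary: counts per TCP state (ESTAB, TIME-WAIT, ...)
    ss -s

    # One sample per second of MongoDB activity; rising numbers in the
    # qr|qw (queued reads/writes) columns mean requests are queueing
    # inside the database.
    mongostat 1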
