Major Site Outage #66
I disabled the notifications list and the cooties seem to have stopped. And by "disabled" I mean I added this after this line:
/cc @julianlam
Git hash plus applicable error stack traces please and thank you 😄
@julianlam No stack traces, but I added this:
I updated from NodeBB 1783a07 to e99d952 during the cooties but they didn't stop until I disabled that function. Here are some snippets from IRC:
Odd, if it was a crash from that bug, it would've been fixed in more recent commits.
Ah, I hadn't read the stack traces in that topic very closely. It is indeed a separate bug.
I do not know if the notifications thing was the problem or the symptom, but the continued performance difficulties would tend to indicate that it was merely the symptom. For example, I'm currently seeing extremely long times to load small topics. Eventually I get a
Hunting performance problems can be hard. The only way to do it is to keep on improving the instrumentation you're applying to service calls in the hope of catching the trouble red-handed.
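For what it's worth, here's a minimal sketch of the kind of instrumentation I mean. This is not NodeBB's actual code, and the wrapped method name in the comment is purely illustrative; it just wraps an async service call and logs anything slower than a threshold:

```js
// Minimal sketch, not NodeBB's real instrumentation: wrap an async service
// call and warn whenever an invocation exceeds a threshold (default 1000 ms).
function instrument(name, fn, thresholdMs = 1000) {
    return async function (...args) {
        const start = Date.now();
        try {
            return await fn.apply(this, args);
        } finally {
            const elapsed = Date.now() - start;
            if (elapsed > thresholdMs) {
                console.warn(`[slow call] ${name} took ${elapsed}ms`);
            }
        }
    };
}

// Hypothetical usage (method name is just an example):
// Topics.getTopicData = instrument('Topics.getTopicData', Topics.getTopicData);
```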
Do we know if the storm affected all 4 NodeBB instances, or just specific ones? If it affected all 4 instances simultaneously, that would tend to suggest a single point of failure, like MongoDB. If the slowdown was in the Node backend, I'd expect individual instances to suffer while other instances are still OK. Do we have any way of profiling the database and the Node instances to see what's going on? It'd be useful to profile the hosts to see if CPU, memory, disk access or network bandwidth is being saturated.
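One low-effort starting point for the database side, assuming the forum data is in MongoDB as mentioned above: turn on Mongo's slow-query profiler and look at the worst offenders during a slow spell. The 100 ms threshold here is just an example:

```js
// Run in the mongo shell against the NodeBB database.
// Level 1 records only operations slower than the given threshold (ms).
db.setProfilingLevel(1, 100);

// After a slow spell, inspect the most recent slow operations:
db.system.profile.find().sort({ ts: -1 }).limit(10).pretty();
```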
Search may also be a factor. I remember it being very slow yesterday and at least one user reported cootie storms when he started searching. |
Is everything on one
Any timetable on re-enabling the notification dropdown? Is there an alternative?
ouch, yeah that disk queue length is.... off the charts. ideally you want that sucker solidly under 1.0, we seem to be north of 10 regularly.... no wonder we have issues!
@llouviere alternative for notifications? i think i can come up with something... |
Forums seem to be completely offline. SSL error over HTTPS and 404 over HTTP. Maintenance, I assume? |
I stopped the AWS instance so I could snapshot the disk. Currently about 50% done. I hope the IP won't change when it comes back up. |
We could just use GitHub for our forums. Pretty good reliability here. |
=-o |
I've rate limited anyone with a user-agent matching /(bot|spider|slurp|crawler)/i to 1 request every 10 seconds (per IP) in nginx. |
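For reference, a config along these lines would do that in nginx. This is a sketch, not the exact production config; the zone name and sizes are assumptions. The map emits an empty key for non-bot user agents, which exempts them from the limit:

```nginx
# Sketch only (assumes the usual http context); not the exact config in use.
# Empty key = not rate limited; matching bots are limited per client IP.
map $http_user_agent $bot_limit_key {
    default                      "";
    ~*(bot|spider|slurp|crawler) $binary_remote_addr;
}

# 1 request every 10 seconds per IP, expressed as 6 requests per minute.
limit_req_zone $bot_limit_key zone=bots:10m rate=6r/m;

server {
    location / {
        limit_req zone=bots;
        # ... proxy_pass to the NodeBB backend ...
    }
}
```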
Looks like the rate limiting handled the bots and the better disk handled the large number of humans. |
Does this need opening as a new issue so it gets noticed? I think it was possibly a bit optimistic closing it in the first place. The site seems to have been down for quite some time now. (cc @BenLubar) |
so.... what happened between 0230 and 0440 then? |
Could it be backups? Maintenance plans (log file purging, etc.)? Software updates?
could be any of them. i asked because i don't know what it was and want to. :-P |
It hit again this morning (and is still ongoing as I write this) commencing at about 07:50 GMT. The nginx front-end seems to be sometimes fast to respond and sometimes not, like there's a resource exhaustion problem, but the back-end is solidly not responsive. |
Also, I assume you have read http://redis.io/topics/lru-cache about |
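For context, the eviction settings that page covers amount to something like this in redis.conf. The 2gb value is just a placeholder, not a recommendation for this host:

```
# Sketch of the settings discussed at redis.io/topics/lru-cache.
# Cap Redis memory usage and evict least-recently-used keys at the cap.
maxmemory 2gb
maxmemory-policy allkeys-lru
```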
@BenLubar output of |
The site is very down and has been so for a substantial amount of time. This would have been a non-reportable SNAFU under Discourse, but with NodeBB it is indicative of a serious problem.
No idea what is wrong.