Scalability issues #553

Open · stianr opened this issue Oct 9, 2017 · 40 comments
stianr commented Oct 9, 2017

We've just had our first in-lecture test of Quodl, with around 150 students taking a quiz simultaneously. We've encountered two major issues (or possibly two aspects of one), which affect the viability of using it in large lectures.

  1. Very long transitions between questions - across many different phone types I saw transitions from one question to the next taking 30 seconds or longer. During that time the content on screen was half whited-out and appeared stuck, and for some students the next question didn't load in time for them to answer it.

  2. Getting stuck on the 'Loading...' screen. I don't think this was just slowness; the app seemed genuinely unable to load the required information. Maybe a third of students had this problem at the end of the quiz, where they should have received their scores and badges but were still stuck on that screen by the time we had to move on. When we went to review the questions and answers on the lecture desktop computer, it was stuck on the 'Loading' page for at least 30 seconds, at which point we had to move on.

I'm hoping that it's just something we can fix by buying more capacity from Heroku or something similar that doesn't require working on the codebase (given it works really well for small groups). Would you be able to have a look at this? I'm meant to be using it again tomorrow - I think I'll do that anyway, just to see how consistent a problem this is, but if there's anything I can do that stands a chance of fixing it between now and then, please let me know.

stianr (Author) commented Oct 9, 2017

Just looking on Heroku, it seems that there were 668 timeout errors, so I'm guessing paying for more dynos might solve this? Please could you advise on that?

sohilpandya (Contributor) commented

Hi Stian, I've had a look, but I'm not able to dig into the Heroku logs in much detail: we're on a Hobby dyno, and Heroku doesn't retain metrics for more than the last 24 hours, so I can't review the exact issue that was encountered.

Can we please upgrade from Hobby to Professional dynos?

We can then run it again and track exactly what caused the issues.

If you could also let me know when you are running a quiz in a big lecture, then I can keep my eyes on the Heroku dashboard during the lecture as well.

Thank you for forwarding me the email about the Node security issue - it's been upgraded and will be deployed shortly.

sohilpandya (Contributor) commented

@stianr it seems the Heroku dyno was overloaded when you ran the quiz with that number of students, so upgrading looks like a sensible step.

stianr (Author) commented Oct 12, 2017

If it helps, these are screenshots of the dashboard covering the time it happened. Probably doesn't tell you much but may help...

[screenshots: heroku1, heroku2 - Heroku dashboard covering the incident]

stianr (Author) commented Oct 12, 2017

I've upgraded to professional, so you should have access to the logs going back to the incident on Monday 12.00 UTC.

sohilpandya (Contributor) commented

Thanks for that @stianr, I've just had a quick look; here is a screenshot.

[screenshot: Heroku metrics, 2017-10-12 15:57]

There were 9000 successful requests (WOW, that's a lot! :)) and ~800 4xx or 5xx errors.

Looking into this as we speak.

stianr (Author) commented Oct 13, 2017

Hi @sohilpandya, we are using Quodl this morning with a large group - sometime between 10 and 11.15, depending on how the lecturer chooses to do it.

sohilpandya (Contributor) commented

Thanks Stian. I’ll keep an eye out today between 10 & 11.15.

Thanks for the update.

sohilpandya (Contributor) commented

Hi @stianr,

Please let me know how the quiz went and whether the issues were present.

stianr (Author) commented Oct 13, 2017

Hi @sohilpandya, we've just run it, and there were almost identical problems to before - if anything they were worse this time, as at least two students got only blank screens (though as it was relatively few, that could be a device- or connection-related issue). The main issues were:

  • Very slow crossfade between questions - questions taking 15-30 seconds to appear fully
  • Loading screen permanently presented - this time, even before the quiz started
  • Screen showing the Quodl background but with neither the loading page nor questions appearing
  • Some students got stuck on the 'Waiting to join quiz' page and didn't progress to the first question, but when the second question appeared, they jumped straight to it

My impression was that laptop users didn't have this problem anywhere near as much as mobile users (I didn't see it at all on a laptop). But it doesn't seem to be tied to specific devices: (a) students with decent-spec Samsung S7s encountered the problem; (b) at least one student found it worked fine earlier in the week but couldn't access it today. Some of the students who had the problem said they had logged in quite late, so I'm not sure if it's a queuing issue.

On the upside, it looks like 83 students managed to answer all five questions. I can't see how many students attempted it but couldn't answer any questions though.

Just to reiterate (though I'm sure you're at least as aware of it as me!) it's critical that we find a solution quickly, because we're starting to lose student and lecturer goodwill. We don't have an SLA in place yet, but we will of course pay for any time you spend on this. Keep me posted!

stianr (Author) commented Oct 16, 2017

Looks like you're tackling some of the issues, which is great. Thanks @sohilpandya.

Just a quick timeline from our side, so that you know when we can run 'real' load tests in lectures. Essentially we have three lectures with >100 students in which we can run it:

  1. Monday 12-2
  2. Tuesday 3-5 (my lecture)
  3. Friday 10-12

(We also have a couple of other lecturers who run quizzes at various times, but they're generally <60 students, and haven't reported any problems so far.)

I'll keep on doing the Tuesday quiz unless there's a good reason not to - it's my lecture, so it's easier for me to handle any fallout. I cancelled the one today, and we can play the Friday one by ear - if we have a potential solution by then, we can test it then, if not we can wait until next week. If it's very high risk (i.e. pretty likely to fail), it may make more sense to wait until next Tuesday.

If you come up with any potential solution before tomorrow's lecture, you may as well deploy it, and we can see how it goes. (But of course it may be a bigger job than that.)

sohilpandya (Contributor) commented

Thanks for those timings. We'll be looking into load testing, both locally on staging and on the live version of the site, over the coming week, but I'd like to make the following suggestion for tomorrow's lecture.

As the issues that have been reported are to do with:
a) transitions between questions
b) loading screen

I suggest that we take these out of the production version and run a more basic build without the loading screen and transitions for tomorrow's lecture. This will let us investigate whether these components are what's causing the unresponsiveness of the app.

Please let me know if you are ok with this for your lecture tomorrow.

Thanks

stianr (Author) commented Oct 16, 2017

Yes, that sounds sensible - would it be possible to deploy that version no earlier than, say, 1pm tomorrow, in case other people are using it in the morning with smaller groups? If there are any major unforeseen issues around removing these components we can then revert later in the afternoon.

stianr (Author) commented Oct 16, 2017

Just another small observation. I've looked at the data for how many questions each student answered on all the quizzes we've run that still have data on the server. In this screenshot you can see the distribution of the number of answers:

[screenshot: distribution of the number of answers per student]

The number of people not answering all the questions doesn't seem much different between cases where we know for sure we had problems and smaller classes (where I wasn't there) that haven't reported any problems. So we shouldn't discount the possibility that there are issues in the smaller groups too.

stianr (Author) commented Oct 16, 2017

Second observation: looking at the number of people who answered each question, I don't get the impression that the system becomes progressively overloaded, with people increasingly struggling to see or answer questions as the quiz goes on. If that were happening, we'd expect a gradual drop-off in the number of responses as the questions progress. It's actually pretty flat:

[screenshot: number of responses per question]

There was definitely a loading issue in 1007 - on the projector computer we couldn't review the quiz answers because we got stuck on the 'Loading' screen. But that's the only situation where we're certain it was overload. It's also possible that some students encountered issues caused by something other than load - the device, the connection, or some combination of the two.

Not sure whether this helps, but it's worth keeping all potential explanations open when investigating the issues tomorrow - I'm sure you'll be able to get a much clearer overview.

sohilpandya (Contributor) commented

Hey @stianr,

@Danwhy and I have just had another look at the potential issues. Digging into Heroku, we found that the Postgres database can only handle 20 concurrent connections at a given time. We'd like to upgrade to the next tier, which is $50/month and can handle 120 concurrent connections.

Please let us know as soon as possible if you'd like us to upgrade, as it's not as straightforward as clicking a button.

We understand that $50/month is quite an overhead versus $9/month, but we are also looking into a potentially cheaper alternative.

But for now, we'll have to upgrade this to the next tier on Heroku.
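
As an aside on the application side: the usual way to live within a low connection cap is to route every query through one shared pool whose `max` is kept below the plan's limit, so bursts of requests queue in the pool instead of being rejected by Postgres. A minimal sketch using the node-postgres (`pg`) library - the pool settings and the example query are illustrative assumptions, not the app's actual config:

```js
// Minimal sketch: one shared pool, capped below the plan's connection limit.
// The numbers and the example query are assumptions for illustration.
const { Pool } = require('pg');

const pool = new Pool({
  connectionString: process.env.DATABASE_URL, // set by the Heroku Postgres add-on
  max: 18,                       // stay under the hobby tier's 20-connection cap
  idleTimeoutMillis: 30000,      // return idle clients to the pool
  connectionTimeoutMillis: 5000, // fail fast rather than hang on 'Loading...'
});

// Every query in the app should go through the shared pool:
function getScores(quizId) {
  return pool
    .query('SELECT * FROM scores WHERE quiz_id = $1', [quizId]) // hypothetical query
    .then((result) => result.rows);
}
```

Upgrading the tier raises the ceiling, but a shared pool is what stops 150 simultaneous students from opening 150 separate connections in the first place.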

stianr (Author) commented Oct 17, 2017

Yes please do.

sohilpandya (Contributor) commented

@stianr, can you please transfer ownership of the Heroku app to me so that we can run the upgrade for you?

Please let me know as soon as this is done, so that we can upgrade the database before 1pm.

stianr (Author) commented Oct 17, 2017

@sohilpandya that should be done now - let me know if I need to do anything else.

sohilpandya (Contributor) commented

@stianr, we've upgraded the database, so it should be OK to test in your 3-5pm lecture. We are still investigating the issues, but please keep us updated once the quiz has been run in your lecture.

Thanks

stianr (Author) commented Oct 17, 2017

Great - thanks @sohilpandya. Will keep you posted.

stianr (Author) commented Oct 17, 2017

A success! There were only 76 entries, so there's a chance we weren't loading the system enough to see the problems we've observed before. But only one student reported an issue, which seemed to be around loading time - she was on a mobile network rather than WiFi, so I suspect it was down to a weak or slow connection. It looks like 63/76 students answered all 8 questions (most of those who didn't missed the first question, so they probably joined late), and the number of responses was high across all the questions.

I guess the next test will be to try it with >100 students on Friday, if you think that's a good plan.

stianr (Author) commented Oct 22, 2017

Hi @sohilpandya, I asked the lecturer on Friday how it went, and she replied with:

"When I asked students to raise their hands I would estimate around a third had problems but very hard to say!"

So there are still issues to fix. I'll cancel the quiz for Monday and hope we can get a fix in place by next Friday.

sohilpandya (Contributor) commented

@stianr, thanks for getting back to us.

@Danwhy did a little more digging, and it turns out not all of the information on Heroku was updated after the migration. I'll be making those changes later tonight, and they should fix the issues. 🤞

sohilpandya (Contributor) commented

@stianr, we have now completed the migration to the new database. I'll delete the old database, as it's not worth paying an extra $9/month for a database that is no longer being used.

Our max connections have jumped from 20 to 120, and, based on the stress testing we did earlier last week, we should no longer have problems with a class of up to 300 users on this database.

Please let us know how the testing goes on Friday.

stianr (Author) commented Oct 24, 2017

Brilliant - thanks @sohilpandya and @Danwhy. Running it on ~70 today, but it may end up getting a couple of trials on Friday with >100 students, as there are two lecturers who are interested in using it.

stianr (Author) commented Oct 24, 2017

It worked well today with 72 students, but with a bit of lag at times. For example, when I went to go through the answers with the students, the loading screen was up for 3-5 seconds before the answers to the first question were presented. That in itself is manageable, but if it's likely to slow down further with more students, it may become more of an issue. If there's anything we can do to mitigate any of those potential effects between now and then, it's probably worth doing.

stianr (Author) commented Nov 1, 2017

Two more data points from recent tests. Quodl was used twice on Friday, and I used it yesterday. I've only had feedback from one of Friday's lecturers, a new user, who said: "I think it went reasonably well. Some people could not see their results though and the app indicated a score of 0 with someone who had answered all questions correct." From Heroku it looked like it slowed down a bit right at the end of the quiz, where everyone got their results, which tallies with what the lecturer saw.

Yesterday it worked well - only 60 people, so a relatively light load, but no-one reported any problems. The wait when I went to review the questions was shorter than last time, though still probably around a second.

sohilpandya added a commit that referenced this issue Nov 10, 2017
We found a couple of queries that were taking a significant amount of time, we have optimised these
to make this work we've had to alter the order of constraints on the responses table
and add an index on scores table
related to #553
sohilpandya (Contributor) commented

Hi @stianr,

The slowdown was being caused by some queries taking too long to return results. We've optimised two of them:

  • Getting quiz review details was taking a significant amount of time; we've reduced it by an order of magnitude, so lecturers shouldn't be waiting long when they click the review button.
  • Getting percentage scores for students when receiving feedback has also been optimised.
    • We were also making an unnecessary API call when students landed on the module dashboard; this has been removed and should further speed up the app after a quiz has been run.
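
For anyone following along, the index change mentioned in the commit above looks roughly like the sketch below. The exact table and column names in the commit may differ; `scores (user_id, module_id)` here is an assumption for illustration:

```js
// Hedged sketch of a migration adding an index so score lookups can use an
// index scan instead of scanning the whole table. Column names are assumed.
const { Pool } = require('pg');

async function migrate() {
  const pool = new Pool({ connectionString: process.env.DATABASE_URL });
  await pool.query(
    'CREATE INDEX IF NOT EXISTS scores_user_module_idx ' +
    'ON scores (user_id, module_id)'
  );
  await pool.end(); // close the pool so the script exits cleanly
}

migrate().catch((err) => {
  console.error(err);
  process.exit(1);
});
```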

Danwhy (Collaborator) commented Sep 6, 2018

In the load testing for this sprint, I found that the app could handle up to around 500 concurrent students taking a quiz before any timeouts started to occur, which is consistent with the findings of our last round of load testing.

I would suggest that if there are going to be a significant number of users on the app, and you want to be sure there will be no problems, you temporarily scale up the number of dynos. This can easily be done on the Heroku dashboard, and should double the capacity:

[screenshot: Heroku dashboard dyno scaling, 2018-09-06 16:55]

Heroku costs are based on usage, so if you scale down again when you don't need the extra dyno, you won't be charged the full amount for the month.
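
If we ever want to automate that around lecture times, the same toggle is exposed through the Heroku Platform API. A hedged sketch (Node 18+ for global `fetch`; the app name and the `HEROKU_API_TOKEN` env var are placeholders you'd substitute):

```js
// Sketch: scale the web formation up/down via the Heroku Platform API.
// Equivalent to the dashboard slider or `heroku ps:scale web=2`.
const APP = 'your-app-name'; // placeholder: the real Heroku app name

async function scaleWeb(quantity) {
  const res = await fetch(`https://api.heroku.com/apps/${APP}/formation/web`, {
    method: 'PATCH',
    headers: {
      Accept: 'application/vnd.heroku+json; version=3',
      Authorization: `Bearer ${process.env.HEROKU_API_TOKEN}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ quantity }),
  });
  if (!res.ok) throw new Error(`Scaling failed: ${res.status}`);
  return res.json();
}

// scaleWeb(2); // before a big lecture
// scaleWeb(1); // afterwards - Heroku bills by usage, so scaling down saves money
```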

stianr (Author) commented Sep 6, 2018

Thanks @Danwhy - sounds sensible to upgrade for the start of term. Will do that just before term starts.

iteles removed their assignment Sep 6, 2018
iteles added the enhancement label and removed the bug label Sep 6, 2018
stianr (Author) commented Sep 25, 2018

Just had a fairly modest first use of Quodl in a lecture this term. Looking at the Heroku logs, there seem to be a fair few long (>2s) response times and a high (>1) dyno load for a short period, although the number of requests was relatively small (around 50 students took part, judging from the responses). I wanted to check whether this was predicted by the load testing. There don't seem to have been any timeouts, so it's probably fine, but I wanted to make sure it doesn't change the estimate of 500 concurrent students before timeouts are likely to occur. This was using a single dyno - I haven't upgraded to the second one yet.

[screenshot: Heroku metrics for the session]

stianr (Author) commented Sep 28, 2018

And today we had 215 concurrent users - it seemed to work fine for them. There was quite a long delay at the end of the quiz before they saw their scores - it went very quiet for a while. That may be where a lot of the >10s response times came from.

[screenshot: quodl_200_students - Heroku metrics during the 215-user quiz]

stianr (Author) commented Sep 28, 2018

But I should say I'm very happy to see that there weren't any major issues - it's the biggest test it's had, and at peak moments it must have been handling hundreds of requests per second. (I did upgrade to the second dyno for this, which may have helped, looking at the dyno loads...)

Danwhy (Collaborator) commented Oct 11, 2018

I enabled a logging system for Heroku yesterday, so we can get an idea of what's happening during large tests of the app.

From what I've seen so far, it seems the current bottleneck is the websockets part of the app, where users are pushed live updates during quizzes.

I believe the websockets section of the app hasn't been touched since pretty much the very beginning of the project (https://github.com/cul-2016/quiz/blob/staging/server/start.js), so it makes sense that it wasn't optimised for a high level of usage back then.

It would probably be a good idea to look into re-architecting/upgrading this part of the app as the next step in the scalability improvements.

In a cursory bit of research I came across this article, which suggests that the websocket library we're currently using is not the best choice for a high-load application.
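
To make the discussion concrete, this is the kind of fan-out the quiz does, sketched with a socket.io-style room API (an assumption for illustration - the actual library and event names in server/start.js may differ). Broadcasting once to a per-quiz room keeps the server to one emit per event instead of one per connected student:

```js
// Hedged sketch of room-based broadcasting for live quiz updates.
// Library choice, event names, and room naming are assumptions.
const http = require('http');
const server = http.createServer();
const io = require('socket.io')(server);

io.on('connection', (socket) => {
  // Each student joins the room for the quiz they're taking.
  socket.on('join-quiz', (quizId) => {
    socket.join(`quiz-${quizId}`);
  });
});

// When the lecturer advances the quiz, push the next question to the whole
// room in one call; the library handles the fan-out to every socket.
function pushNextQuestion(quizId, question) {
  io.to(`quiz-${quizId}`).emit('next-question', question);
}

server.listen(process.env.PORT || 3000);
```

If the bottleneck really is the websocket layer, the other usual levers are running the socket server on its own dyno and using a Redis-backed adapter so multiple dynos can share rooms.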

stianr (Author) commented Oct 11, 2018

Thanks for looking into this, Daniel - that makes sense, and I agree it's worth looking at for scaling further. At the moment, with up to 250 students, the only point where it lags is when a quiz finishes, which is much less of a problem from a UX perspective than questions taking a long time to load. So I think it's okay for the sort of scale we're looking at this term.
