
Multiple tests seems to add a lot of overhead (or wait_for_idle not working). #56

Closed
soulgalore opened this issue Sep 18, 2017 · 24 comments


@soulgalore

Hi Pat,
I've been trying out wptagent, and it seems like tests take a long time to run when we push them one after another.

We have a case where we test 9 URLs: we push them to the server with 1 run per URL, and once we get the results back we run 9 more URLs, and then 9 more after that. They all run on the same (AWS) Linux agent, testing both Chrome and Firefox (and emulated mobile). The first batch of 9 URLs takes about 6 minutes, but the following batches usually start to drag, so the 27 URLs can take 1 to 1.5 hours to run. They should be finished in 18 minutes or so (with a couple of minutes of overhead at most, right?).

I suspect it could be the wait_for_idle function (does the timeout really work?). When I was demoing wptagent for my teammates, I ran into a problem where the agent could never start because the system wasn't idle, but that was OK since I was running on my local machine. Right now we use the AWS images. I haven't checked the logs yet, but I guess I can see if I can log in to the machine.

Maybe this is something you have already seen on the WebPageTest.org Linux agents?

Best
Peter


pmeenan commented Sep 18, 2017

What instance sizes are you using? If it is anything in the t* line then I could see the CPU bursting being a problem, but if it is a c4.large I'd be really surprised, and I haven't seen that happen. I can adjust the idle threshold to make it a bit more generous, but logs would really help. Maybe something where I measure the CPU utilization before launching the browser to get an "idle" floor for the system to reach after the browser has launched.
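The kind of check being described could look something like the sketch below. All names and thresholds here are hypothetical, not wptagent's actual code; in a real agent the CPU reading would come from something like psutil.cpu_percent():

```python
import time

def wait_for_idle(get_cpu_pct, target_pct=20.0, timeout=30.0, interval=0.5):
    """Wait until CPU utilization drops to target_pct, or give up after timeout.

    get_cpu_pct is a callable returning the current CPU utilization (0-100);
    in a real agent it might wrap psutil.cpu_percent(). The timeout is what
    guarantees a busy machine still starts the test eventually.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_cpu_pct() <= target_pct:
            return True   # idle enough to launch the browser
        time.sleep(interval)
    return False          # never went idle; caller proceeds anyway
```

The key property is the hard deadline: even if the machine never reaches the idle floor, the loop terminates and the test can start.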

With wptagent if you run a test with debug=1 (if the server has been updated in the last few months) the agent will upload the full debug log (downloadable to the left of the waterfall). If you visit the home page with ?debug=1 it should also set a hidden form field to pass the option through to the test if you are running them manually.
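For reference, passing the debug flag through the HTTP API is just one extra runtest.php parameter. A minimal sketch (the server name and key are placeholders; the parameter names follow the public WebPageTest API):

```python
from urllib.parse import urlencode

def build_test_url(server, test_url, api_key, debug=True):
    """Build a WebPageTest runtest.php request URL, optionally with debug=1.

    With debug=1, a recent-enough server has the agent upload its full
    debug log alongside the result.
    """
    params = {"url": test_url, "k": api_key, "f": "json", "runs": 1}
    if debug:
        params["debug"] = 1
    return "{}/runtest.php?{}".format(server.rstrip("/"), urlencode(params))
```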

There have been a few other reports of the wptagent agents running significantly slower than expected so tracking it down would be awesome.

@soulgalore

Yes, we use c4.large. I did some testing early on with Chrome for Linux too, and it seems we need that size.

Ah, cool. We use the WebPageTest API wrapper, so I need to add support for the debug flag there; I can fix that tomorrow (if it isn't already there). I'll turn on the debug log then and run the same tests again.

@soulgalore

I got the logging to work. This test takes a really long time to run (http://wpt.wmftest.org/result/170919_48_8S/), but in the log it looks OK. We don't log anything before the actual browser start, so we need to change that.


pmeenan commented Sep 19, 2017

Since the log is turned on and uploaded with the test run, it can only include the time from when the agent got the job until it goes to upload the result. Outside of that it should just be polling for work and uploading the actual result.

There should be a test timing log file in the server directory for the test, with timestamps for when the job was fetched from the server and when the result was uploaded, which should help narrow it down a bit more.

You can also grep your access log for the i-0a9209caf6a484ac6 instance ID string to see how often it is polling for work and the times between getting work and calling workdone.
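A quick way to do that analysis offline is to pull the timestamps for the instance out of the access log and look at the gaps between them. A rough sketch (the Apache-style log format and request paths below are assumptions, not the thread's actual logs):

```python
import re
from datetime import datetime

# Apache-style access-log timestamp, e.g. [01/Feb/2018:02:22:08 +0000]
TS_RE = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2})")

def poll_times(log_lines, instance_id):
    """Timestamps of every access-log line that mentions instance_id."""
    times = []
    for line in log_lines:
        if instance_id in line:
            match = TS_RE.search(line)
            if match:
                times.append(datetime.strptime(match.group(1),
                                               "%d/%b/%Y:%H:%M:%S"))
    return times

def poll_gaps(times):
    """Seconds between consecutive requests; an hour-long gap marks the stall."""
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]
```

An agent that is healthy polls every few seconds, so any gap measured in minutes points at exactly when it stopped asking for work.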

The agent should be running inside of a script that is basically identical to the GCE one here. At startup, and once every 24 hours, it installs OS, browser, and node updates, and hourly it pulls the agent code from git. One of those might explain a long gap between polls for work, but not if the time falls between when the test started and when it was marked as done.

@soulgalore

Hi Pat, I wanted to check: have you seen this issue since you switched to Linux agents? We want to switch, and I was thinking it should be safe now that most(?) of wpt.org has moved to the Linux version?

Best
Peter


pmeenan commented Jan 22, 2018

I haven't seen it, but I also haven't measured it carefully enough to be positive (certainly nothing on the scale of the initial post, though). Beyond the public instances, the HTTP Archive has been running Linux agents for quite a while and the test throughput has been consistent.

@soulgalore

Ok thanks, let us try it again then :)

@soulgalore

Hi Pat, yesterday we deployed wptagent side by side with the old Windows version. We run the exact same tests, except that Windows also runs IE tests (11 more test URLs). We still have the same problems as before. We use the AWS image for wptagent (ami-68dd6812) on a c4.large.

The Windows runs take 70 minutes to finish; the wptagent runs hit our timeout at 120 minutes (we abort a job after 120 minutes). We push X jobs first, wait for them to finish, and then push Y jobs. When I look in the job log I can see this:

02:22:08 Batch test URLs/scripts from ./scripts/batch/mobile.txt
02:22:08 Add job to the WebPageTest server [us-east-1-linux:Chrome 5 time(s)] for https://en.m.wikipedia.org/wiki/Facebook
02:22:08 Add job to the WebPageTest server [us-east-1-linux:Chrome 5 time(s)] for https://en.m.wikipedia.org/wiki/Barack_Obama
02:22:08 Add job to the WebPageTest server [us-east-1-linux:Chrome 5 time(s)] for https://en.m.wikipedia.org/wiki/Sweden
02:22:08 Add job to the WebPageTest server [us-east-1-linux:Chrome 5 time(s)] for https://en.m.wikipedia.beta.wmflabs.org/wiki/Barack_Obama
03:21:44 Tested URL https://en.m.wikipedia.org/wiki/Facebook for us-east-1-linux:Chrome with connectivity 3GFast result: http://wpt.wmftest.org/results.php?test=180201_7V_35&medianRun=fastest&medianMetric=SpeedIndex

That first batch is the first test we send to the server (the server is up and running; we never take it down). There's a gap of about one hour before the first job comes back. Then the other jobs continue (until the timeout). The full log looks like this: https://gist.github.com/soulgalore/d21b7735a99085a8b2c2e4f24f30ae33

But there are also other variants where the delay happens between runs. Here we get a 500 error after 50 minutes (after having successful runs first):

01:11:57 Tested URL https://en.wikipedia.org/wiki/Sweden for us-east-1-linux:Firefox with connectivity Cable result: http://wpt.wmftest.org/results.php?test=180201_PG_10&medianRun=fastest&medianMetric=SpeedIndex
01:11:57 Tested using Firefox 58.0.1 on Linux 4.4.0-1039-aws
01:11:57 We got 16 working tests and 0 failing tests
01:11:57 Succesfully sent metrics.
01:11:57 Succesfully sent metrics.
01:11:58 Batch test URLs/scripts from ./scripts/batch/login-mobile.txt
01:11:58 Add job to the WebPageTest server [us-east-1-linux:Chrome 5 time(s)] for ./scripts/wptscripts/login-mobile-enwiki-facebook.txt
01:11:58 Add job to the WebPageTest server [us-east-1-linux:Chrome 5 time(s)] for ./scripts/wptscripts/login-mobile-enwiki-sweden.txt
01:25:33 Tested URL https://en.m.wikipedia.org/wiki/Sweden for us-east-1-linux:Chrome with connectivity custom result: http://wpt.wmftest.org/results.php?test=180201_TH_1X&medianRun=fastest&medianMetric=SpeedIndex
01:25:33 Un-mapped entry in devtools timeline metrics:ScriptWrappableVisitor::performLazyCleanup
01:25:33 Un-mapped entry in devtools timeline metrics:ResourceChangePriority
01:25:33 Un-mapped entry in devtools timeline metrics:PlatformResourceSendRequest
01:25:33 Un-mapped entry in devtools timeline metrics:FireIdleCallback
01:25:33 Un-mapped entry in devtools timeline metrics:ScriptWrappableVisitor::performLazyCleanup
01:25:33 Un-mapped entry in devtools timeline metrics:ResourceChangePriority
01:25:33 Un-mapped entry in devtools timeline metrics:PlatformResourceSendRequest
01:25:33 Un-mapped entry in devtools timeline metrics:FireIdleCallback
01:25:33 Tested using Chrome 64.0.3282.119 on unknown
01:25:34 Succesfully sent metrics.
01:25:34 Succesfully sent metrics.
01:25:34 Succesfully sent metrics.
02:15:57 Failing test: { Error
02:15:57     at Object.<anonymous> (/srv/jenkins/workspace/workspace/performance-webpagetest-linux-wmf/node_modules/webpagetest/lib/helper.js:213:25)
02:15:57     at Module._compile (module.js:570:32)
02:15:57     at Object.Module._extensions..js (module.js:579:10)
02:15:57     at Module.load (module.js:487:32)
02:15:57     at tryModuleLoad (module.js:446:12)
02:15:57     at Function.Module._load (module.js:438:3)
02:15:57     at Module.require (module.js:497:17)
02:15:57     at require (internal/module.js:20:19)
02:15:57     at Object.<anonymous> (/srv/jenkins/workspace/workspace/performance-webpagetest-linux-wmf/node_modules/webpagetest/lib/webpagetest.js:14:15)
02:15:57     at Module._compile (module.js:570:32)
02:15:57     at Object.Module._extensions..js (module.js:579:10)
02:15:57     at Module.load (module.js:487:32)
02:15:57     at tryModuleLoad (module.js:446:12)
02:15:57     at Function.Module._load (module.js:438:3)
02:15:57     at Module.require (module.js:497:17)
02:15:57     at require (internal/module.js:20:19)
02:15:57   name: 'WPTAPIError',
02:15:57   code: 500,
02:15:57   message: 'Internal Server Error' }

So I don't think the wait time is always tied to the first run of the batch on the server.

Best
Peter


pmeenan commented Feb 1, 2018

The 500 is a server-level error. Any chance the server is getting heavily loaded while processing the Linux tests, and that is why it starts taking longer? There have been a lot of optimizations (maybe even since the last official release) to have the server not process the raw results and to do all of the processing on the agent.

@soulgalore

Any chance the server is getting heavily loaded while processing

No, I don't think so. Any load we put on it comes from the WebPageTest API Node.js wrapper that checks whether a test is finished. That 500 error happens in maybe 8 out of 10 runs when I went through the log. When I check the CPU in CloudWatch it's been up to 50% and then back down. I'll have a closer look tomorrow, but I have a feeling the wait time isn't caused by high CPU (at least if the monitoring details report it correctly); I need more data.


pmeenan commented Feb 1, 2018

There could be a fair amount of load from actually running the tests (as results get posted back to the server), if the server is post-processing the devtools data (and possibly traces) instead of using the pre-processed versions from the agents.

If there aren't a lot of agents, though, it may not be the source of the issues. Just taking a stab in the dark since I haven't been able to reproduce it.

@soulgalore

OK, I can see now that the agent has been shut down in the AWS console, but our job still waits to get results back. I haven't changed anything in the config; could it be that something changed in how AWS instances are kept alive? We use the server to auto-spin-up one instance. The old Windows agent kept the same instance alive for months (since we keep adding tests to it), but maybe that's not the best way to do it.


pmeenan commented Feb 1, 2018

:-/ I know there have been relatively recent changes (3 months or so) to the auto-scaling logic but I don't use it myself. I'd expect the Windows and Linux instances to behave similarly though.

FWIW, if you just want to keep an instance running, I've found that using a persistent spot request works really well.

@soulgalore

Yes, finally "it just works" :) I've deployed a standalone AWS instance (using your AMI) and everything runs cleanly. I want to keep this open over the weekend, though, so I can verify that it works with our Jenkins instance.

I've found that using a persistent spot request works really well.

So you make one request with a long "valid until"? And do you then push one test at a time so you are sure the tests will run, or what's the best practice?


pmeenan commented Feb 2, 2018

Do you usually keep one server running (min and max) or do you let them burst up as needed to handle the testing? If you just keep one running then a long-lived persistent spot request basically does the same thing. I usually just submit all of the tests at a time and let the queueing handle it but that also assumes you have enough capacity running to handle the testing volume you need (at the intervals you need it).
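As a sketch of what the persistent-spot-request approach could look like with boto3: the important bit is Type="persistent", which tells EC2 to re-launch the instance whenever it is terminated. Everything concrete below (price, interruption behavior, user-data format) is illustrative, not a recommendation from the thread:

```python
def spot_request_params(ami_id, server, location, api_key,
                        instance_type="c4.large", max_price="0.10"):
    """Build parameters for a persistent EC2 spot request.

    Type="persistent" tells EC2 to re-launch the instance whenever it is
    terminated or outbid, so a single long-lived agent keeps polling the
    work queue. All concrete values here are illustrative placeholders.
    """
    user_data = "wpt_server={} wpt_loc={} wpt_key={}".format(
        server, location, api_key)
    return {
        "SpotPrice": max_price,
        "Type": "persistent",                  # re-launch after termination
        "InstanceInterruptionBehavior": "stop",
        "LaunchSpecification": {
            "ImageId": ami_id,
            "InstanceType": instance_type,
            # depending on the SDK/call, UserData may need base64-encoding
            "UserData": user_data,
        },
    }

# Hypothetical usage (requires boto3 and configured AWS credentials):
# import boto3
# ec2 = boto3.client("ec2", region_name="us-east-1")
# ec2.request_spot_instances(**spot_request_params(
#     "ami-68dd6812", "wpt.example.org", "us-east-1-linux", "KEY"))
```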

One thing I forgot is that the Linux AMIs can spin up and down more frequently than hourly (since they are billed per minute), and maybe that's why they were getting shut down differently.

@soulgalore

Ok, thanks!

Hmm, actually the agent was shut down again. Could it be that even though the instance wasn't started by the WebPageTest server (I created it manually), it still gets killed based on EC2.IdleTerminateMinutes?

At the moment we're running both the Windows and the new Linux agent for a while, so I wanted to change the config as little as possible, and then once we only run Linux, just disable the automatic EC2 handling.

@soulgalore

One thing I forgot is that the Linux AMI's can spin up and down more frequently than hourly (since they are billed per minute) and maybe that's why they were getting shut down differently.

Yes, but I'm not sure, because it shut down while it had work in the queue, and that doesn't add up for me?


pmeenan commented Feb 2, 2018

Yes, if the AMI ID matches one of the locations it knows about, then it will kill other instances even if you launched them manually. If you comment out the ami in locations.ini for the Linux location, it should stop trying to manage it. You can also launch an instance, stop it, and make a new AMI (basically copying the AMI) with a new ID.
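In other words, something like this in locations.ini; the surrounding keys are illustrative, the point is only the commented-out ami line:

```ini
; locations.ini (server side) - illustrative fragment. With the ami line
; commented out, the server stops trying to manage (and terminate)
; instances for this location.
[us-east-1-linux]
label="AWS us-east-1 (Linux)"
browser=Chrome,Firefox
;ami=ami-68dd6812
```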


pmeenan commented Feb 2, 2018

@josephscott knows the auto-scaling behavior better than I do and may be able to help.

@josephscott

If the problem is the WPT server killing off EC2 instances, is there anything in the daily ec2/log/ directory? For example - ec2/log/ec2.log.20180203

When I was tracking down oddities with EC2 agents going up/down, I found the log info in there helpful.


soulgalore commented Feb 4, 2018

Hi @josephscott, ah thanks, I forgot to look there once I realized it was the EC2 up/down cycling that was causing the problem.

I can see a pattern. Without any specific configuration changes (meaning EC2.IdleTerminateMinutes was not configured), it always seems to look like this:

17:55:13 - Instance i-0609ab2530ee3357a started: c4.large ami ami-68dd6812 in us-east-1 for us-east-1-linux with user data: wpt_server=IP wpt_loc=us-east-1-linux wpt_key=KEY
18:50:12 - Terminated instance i-0609ab2530ee3357a in us-east-1
18:50:12 - Launching c4.large ami ami-68dd6812 in us-east-1 for us-east-1-linux with user data: wpt_server=IP wpt_loc=us-east-1-linux wpt_key=KEY
18:50:14 - Instance i-086dde645760c7cda started: c4.large ami ami-68dd6812 in us-east-1 for us-east-1-linux with user data: wpt_server=IP wpt_loc=us-east-1-linux wpt_key=KEY
19:45:12 - Terminated instance i-086dde645760c7cda in us-east-1
19:45:12 - Launching c4.large ami ami-68dd6812 in us-east-1 for us-east-1-linux with user data: wpt_server=IP wpt_loc=us-east-1-linux wpt_key=KEY
19:45:13 - Instance i-0df8d4d8bd48f285e started: c4.large ami ami-68dd6812 in us-east-1 for us-east-1-linux with user data: wpt_server=IP wpt_loc=us-east-1-linux wpt_key=KEY
20:40:12 - Terminated instance i-0df8d4d8bd48f285e in us-east-1
20:53:43 - Launching c4.large ami ami-68dd6812 in us-east-1 for us-east-1-linux with user data: wpt_server=wpt.wmftest.org wpt_loc=us-east-1-linux wpt_key=KEY
20:53:45 - Instance i-06945243e08735cb6 started: c4.large ami ami-68dd6812 in us-east-1 for us-east-1-linux with user data: wpt_server=IP wpt_loc=us-east-1-linux wpt_key=KEY

The instance is almost always torn down after 55 minutes, and mostly a new instance is created, but that only happens when there are jobs in the queue for that instance. And the new instance doesn't come up in time, so it never gets any of the jobs. I added EC2.IdleTerminateMinutes and set it to 5 hours in plain desperation, and it worked ... for five hours :) Right now I've disabled the EC2 keys to make sure the instance(s) live forever.
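For anyone hitting the same thing, the server-side settings involved look roughly like this (illustrative fragment; key names follow what's discussed in this thread). Commenting out the EC2 credentials disables the auto-scaling (and auto-termination) entirely, while EC2.IdleTerminateMinutes only stretches the idle-kill timeout:

```ini
; settings.ini (server side) - illustrative fragment
[settings]
; with the EC2 keys disabled the server stops managing instances
;ec2_key=AKIA...
;ec2_secret=...
; idle-kill timeout in minutes (5 hours shown, as tried above)
EC2.IdleTerminateMinutes=300
```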

I can have a look at it late next week. Now that it finally works for us (we want to move to Linux ASAP), I want to focus on getting everything ready on our side (we don't use the scaling daily, only when we test configuration changes on our side).


pmeenan commented Mar 7, 2018

Just checking in to see if this has been resolved with the recent changes to the EC2 autoscaling logic.

@soulgalore

We switched it off and I haven't turned it back on, sorry. It works fine for us to run without the scaling logic.


pmeenan commented Mar 7, 2018

Thanks. Closing it out as there have been a few other threads on the auto-scaling and I think most of the rough edges are fixed (though it's still hard to get just right).
