
Reduce latency in submitting async tasks #1894

Closed
mmcfarland opened this issue May 18, 2017 · 2 comments
@mmcfarland (Contributor)

Original work was done in #1529 to reduce the time taken by the initial request to kick off a MapShed job. However, the latency of submitting async tasks, including MapShed, is still quite long, and an exploration of ELB logs and p95 latency charts shows consistently 10-20 second execution times for these submission views. The main culprit seems to be pinging the workers to get a list via choose_worker; investigate alternatives. Things to keep in mind:

  • We're not trying to reduce the apparent latency of the submission call (such as by invoking Celery on a new thread and responding to the request sooner), but rather the actual time to add the new job. If it takes 15 seconds to queue the job and 5 to execute it, it still takes 20s to get results no matter how quickly the initial HTTP request is resolved.
  • Caching the list of workers for a period of time may not be very effective in practice. If we cache for a period short enough to make it unlikely that workers cycle in and out, it may not be long enough to cover relatively infrequent job submissions. That said, for many of our jobs the submissions come in very close together because the requests are fired off all at once. Caching in a static variable in Python would also be of limited help because requests are likely routed across a number of app server instances, each of which executes choose_worker.
  • The latency is highly variable (2s to 30s) and it's unclear what causes it.
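As a sketch of the caching idea, with the caveats above in mind, a TTL cache around the worker-discovery call could look like the following. `make_cached_worker_list`, `fetch_workers`, and the 30-second TTL are illustrative names and values, not code from this repository; in practice `fetch_workers` would wrap the expensive broadcast ping that choose_worker performs, and because the cache lives in process memory, each app server instance would still perform its own occasional ping.

```python
import time


def make_cached_worker_list(fetch_workers, ttl=30.0, clock=time.monotonic):
    """Wrap an expensive worker-discovery call (e.g. a Celery broadcast
    ping) in a simple TTL cache.

    `fetch_workers` is any zero-argument callable returning a list of
    worker names; `ttl` is how long (seconds) a result stays fresh.
    Both are hypothetical parameters for illustration.
    """
    cache = {"names": None, "at": float("-inf")}

    def get_workers():
        now = clock()
        # Refresh only when the cached list is missing or stale; every
        # other call skips the broadcast round trip entirely.
        if cache["names"] is None or now - cache["at"] > ttl:
            cache["names"] = fetch_workers()
            cache["at"] = now
        return cache["names"]

    return get_workers
```

This doesn't resolve the bullet points above (a short TTL may still miss infrequent submissions, and the cache is per-process), but it would at least collapse the bursts of near-simultaneous submissions described above into a single ping per app server per TTL window.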

Here are the 20 highest-latency requests over 3 days in March. Note they're all to /modeling/start/*. I've anonymized the IP addresses, but kept the labels consistent (AAA is the same IP in all requests).

| rank | time (s) | resource | bytes | user | requested_at |
|---|---|---|---|---|---|
| 1 | 28.99762 | https://app.wikiwatershed.org:443/api/modeling/start/gwlfe/ | 294211 | AAA | 2017-03-15T00:53:50.285289Z |
| 2 | 20.72409 | https://app.wikiwatershed.org:443/api/modeling/start/gwlfe/ | 294211 | AAA | 2017-03-15T00:53:51.858341Z |
| 3 | 20.46592 | https://app.wikiwatershed.org:443/api/modeling/start/analyze/animals/ | 464349 | BBB | 2017-03-15T23:24:02.588588Z |
| 4 | 20.45691 | https://app.wikiwatershed.org:443/api/modeling/start/analyze/catchment-water-quality/ | 464349 | BBB | 2017-03-15T23:24:02.971498Z |
| 5 | 20.41035 | https://app.wikiwatershed.org:443/api/modeling/start/analyze/catchment-water-quality/ | 399480 | CCC | 2017-03-15T15:31:41.272118Z |
| 6 | 20.37707 | https://app.wikiwatershed.org:443/api/modeling/start/analyze/pointsource/ | 399480 | CCC | 2017-03-15T15:31:41.248567Z |
| 7 | 20.31558 | https://app.wikiwatershed.org:443/api/modeling/start/analyze/pointsource/ | 464349 | BBB | 2017-03-15T23:24:02.804521Z |
| 8 | 19.49325 | https://app.wikiwatershed.org:443/api/modeling/start/analyze/animals/ | 399480 | CCC | 2017-03-15T15:31:40.628332Z |
| 9 | 19.469975 | https://app.wikiwatershed.org:443/api/modeling/start/analyze/ | 399480 | CCC | 2017-03-15T15:31:40.572087Z |
| 10 | 14.797632 | https://app.wikiwatershed.org:443/api/modeling/start/analyze/ | 464349 | BBB | 2017-03-15T23:24:02.413951Z |
| 11 | 9.65040 | https://app.wikiwatershed.org:443/api/modeling/start/tr55/ | 45736 | DDD | 2017-03-15T23:20:06.759284Z |
| 12 | 9.63470 | https://app.wikiwatershed.org:443/api/modeling/start/tr55/ | 45736 | DDD | 2017-03-15T23:20:06.773238Z |
| 13 | 8.826231 | https://app.wikiwatershed.org:443/api/modeling/start/analyze/animals/ | 331685 | EEE | 2017-03-15T16:09:50.451538Z |
| 14 | 8.365412 | https://app.wikiwatershed.org:443/api/modeling/start/analyze/ | 331685 | EEE | 2017 |
| 15 | 7.944202 | https://app.wikiwatershed.org:443/api/modeling/start/analyze/catchment-water-quality/ | 185953 | BBB | 2017-03-15T23:23:40.792558Z |
| 16 | 7.815164 | https://app.wikiwatershed.org:443/api/modeling/start/analyze/pointsource/ | 185953 | BBB | 2017-03-15T23:23:40.772179Z |
| 17 | 7.764834 | https://app.wikiwatershed.org:443/api/modeling/start/analyze/pointsource/ | 331685 | EEE | 2017-03-15T16:09:50.702655Z |
| 18 | 7.669558 | https://app.wikiwatershed.org:443/api/modeling/start/analyze/animals/ | 147484 | BBB | 2017-03-15T23:22:03.702829Z |
| 19 | 7.409083 | https://app.wikiwatershed.org:443/api/modeling/start/analyze/ | 185953 | BBB | 2017-03-15T23:23:40.212512Z |
| 20 | 7.363016 | https://app.wikiwatershed.org:443/api/modeling/start/analyze/catchment-water-quality/ | 331685 | EEE | 2017-03-15T16:09:51.335015Z |
@mmcfarland mmcfarland added the 1 label May 18, 2017
@mmcfarland (Contributor, Author)

There may be some transferable improvements to TR55 & Analyze as recorded in #1535

@ajrobbins ajrobbins removed the 1 label May 18, 2017
@rajadain rajadain added this to the WPF 3-1 milestone May 18, 2017
@mmcfarland (Contributor, Author) commented Jun 21, 2017

Additional anecdotal evidence: when running the multi-year model on a small AoI (65 km²), the job submission endpoint took considerably longer than actually executing the model:

(screenshot: gwlfe.png)

A large analyze job:
(screenshot: more-polling)
