
Reduce latency in submitting async tasks #1894

Closed
mmcfarland opened this issue May 18, 2017 · 2 comments
@mmcfarland (Contributor)

Original work was done in #1529 to reduce the time taken by the initial request to kick off a MapShed job. However, the latency of submitting async tasks, including MapShed, is still quite long, and an exploration of ELB logs and p95 latency charts shows consistently 10-20 second execution times for these submission views. The main culprit seems to be pinging the workers to get a list via choose_worker; investigate alternatives. Things to keep in mind:

  • We're not trying to reduce the apparent latency of the submission call (such as by invoking Celery on a new thread and responding to the request sooner), but rather the actual time to add the new job. If it takes 15 seconds to queue the job and 5 to execute it, it still takes 20s to get results no matter how quickly the initial HTTP request is resolved.
  • Caching the list of workers for a period of time may not be very effective in practice. If we cache for a period short enough to make it unlikely that workers cycle in and out, it may not be long enough to cover relatively infrequent job submissions. That said, for many of our jobs the submissions come in very close together because the requests are fired off all at once. Caching in a static variable in Python would also be of limited help because requests are likely routed across a number of app server instances, each of which executes choose_worker.
  • The latency is highly variable (2s to 30s) and it's unclear what causes it.
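As a sketch of the caching idea, with the caveats above in mind, a TTL cache around the worker-discovery call could look like the following. `make_cached_worker_list`, `fetch_workers`, and the 30-second TTL are illustrative names and values, not code from this repository; in practice `fetch_workers` would wrap the expensive broadcast ping that choose_worker performs, and because the cache lives in process memory, each app server instance would still perform its own occasional ping.

```python
import time


def make_cached_worker_list(fetch_workers, ttl=30.0, clock=time.monotonic):
    """Wrap an expensive worker-discovery call (e.g. a Celery broadcast
    ping) in a simple TTL cache.

    `fetch_workers` is any zero-argument callable returning a list of
    worker names; `ttl` is how long (seconds) a result stays fresh.
    Both are hypothetical parameters for illustration.
    """
    cache = {"names": None, "at": float("-inf")}

    def get_workers():
        now = clock()
        # Refresh only when the cached list is missing or stale; every
        # other call skips the broadcast round trip entirely.
        if cache["names"] is None or now - cache["at"] > ttl:
            cache["names"] = fetch_workers()
            cache["at"] = now
        return cache["names"]

    return get_workers
```

This doesn't resolve the bullet points above (a short TTL may still miss infrequent submissions, and the cache is per-process), but it would at least collapse the bursts of near-simultaneous submissions described above into a single ping per app server per TTL window.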

Here are the 20 highest-latency requests over 3 days in March. Note they're all to /modeling/start/*. I've anonymized the IP addresses, but kept the labels consistent (AAA is the same IP in all requests).

| rank | time (s) | resource | bytes | user | requested_at |
|---|---|---|---|---|---|
| 1 | 28.99762 | https://app.wikiwatershed.org:443/api/modeling/start/gwlfe/ | 294211 | AAA | 2017-03-15T00:53:50.285289Z |
| 2 | 20.72409 | https://app.wikiwatershed.org:443/api/modeling/start/gwlfe/ | 294211 | AAA | 2017-03-15T00:53:51.858341Z |
| 3 | 20.46592 | https://app.wikiwatershed.org:443/api/modeling/start/analyze/animals/ | 464349 | BBB | 2017-03-15T23:24:02.588588Z |
| 4 | 20.45691 | https://app.wikiwatershed.org:443/api/modeling/start/analyze/catchment-water-quality/ | 464349 | BBB | 2017-03-15T23:24:02.971498Z |
| 5 | 20.41035 | https://app.wikiwatershed.org:443/api/modeling/start/analyze/catchment-water-quality/ | 399480 | CCC | 2017-03-15T15:31:41.272118Z |
| 6 | 20.37707 | https://app.wikiwatershed.org:443/api/modeling/start/analyze/pointsource/ | 399480 | CCC | 2017-03-15T15:31:41.248567Z |
| 7 | 20.31558 | https://app.wikiwatershed.org:443/api/modeling/start/analyze/pointsource/ | 464349 | BBB | 2017-03-15T23:24:02.804521Z |
| 8 | 19.49325 | https://app.wikiwatershed.org:443/api/modeling/start/analyze/animals/ | 399480 | CCC | 2017-03-15T15:31:40.628332Z |
| 9 | 19.469975 | https://app.wikiwatershed.org:443/api/modeling/start/analyze/ | 399480 | CCC | 2017-03-15T15:31:40.572087Z |
| 10 | 14.797632 | https://app.wikiwatershed.org:443/api/modeling/start/analyze/ | 464349 | BBB | 2017-03-15T23:24:02.413951Z |
| 11 | 9.65040 | https://app.wikiwatershed.org:443/api/modeling/start/tr55/ | 45736 | DDD | 2017-03-15T23:20:06.759284Z |
| 12 | 9.63470 | https://app.wikiwatershed.org:443/api/modeling/start/tr55/ | 45736 | DDD | 2017-03-15T23:20:06.773238Z |
| 13 | 8.826231 | https://app.wikiwatershed.org:443/api/modeling/start/analyze/animals/ | 331685 | EEE | 2017-03-15T16:09:50.451538Z |
| 14 | 8.365412 | https://app.wikiwatershed.org:443/api/modeling/start/analyze/ | 331685 | EEE | 2017 |
| 15 | 7.944202 | https://app.wikiwatershed.org:443/api/modeling/start/analyze/catchment-water-quality/ | 185953 | BBB | 2017-03-15T23:23:40.792558Z |
| 16 | 7.815164 | https://app.wikiwatershed.org:443/api/modeling/start/analyze/pointsource/ | 185953 | BBB | 2017-03-15T23:23:40.772179Z |
| 17 | 7.764834 | https://app.wikiwatershed.org:443/api/modeling/start/analyze/pointsource/ | 331685 | EEE | 2017-03-15T16:09:50.702655Z |
| 18 | 7.669558 | https://app.wikiwatershed.org:443/api/modeling/start/analyze/animals/ | 147484 | BBB | 2017-03-15T23:22:03.702829Z |
| 19 | 7.409083 | https://app.wikiwatershed.org:443/api/modeling/start/analyze/ | 185953 | BBB | 2017-03-15T23:23:40.212512Z |
| 20 | 7.363016 | https://app.wikiwatershed.org:443/api/modeling/start/analyze/catchment-water-quality/ | 331685 | EEE | 2017-03-15T16:09:51.335015Z |
@mmcfarland mmcfarland added the 1 label May 18, 2017
@mmcfarland (Contributor, Author)

There may be some transferable improvements to TR55 & Analyze as recorded in #1535

@ajrobbins ajrobbins removed the 1 label May 18, 2017
@rajadain rajadain added this to the WPF 3-1 milestone May 18, 2017
@mmcfarland (Contributor, Author) commented Jun 21, 2017

Additional anecdotal evidence: when running the multi-year model on a small AoI (65 km²), the job submission endpoint took considerably longer than actually executing the model:

(screenshot: gwlfe.png)

A large analyze job:
(screenshot: more-polling)
