Collections API #2212

Merged
merged 15 commits into develop from feature/collections-api
Aug 31, 2017
rajadain
Member

Overview

Formally includes the Collections API work by merging the feature branch. Includes #2100, #2101, #2102, #2103, #2104, WikiWatershed/mmw-geoprocessing#48, WikiWatershed/mmw-geoprocessing#49, WikiWatershed/mmw-geoprocessing#50, WikiWatershed/mmw-geoprocessing#51, WikiWatershed/mmw-geoprocessing#52, WikiWatershed/mmw-geoprocessing#53.

All the code in this PR has already been reviewed. It only needs a quick run-through before merging.

Connects #2105

Notes

Some areas of interest will not work for MapShed because of an alignment mismatch issue described in #2153. To test values unaffected by that issue, choose a horizontal area of interest rather than a diagonal one. These HUC-12s should work:

image

Testing Instructions

  • Check out this branch. Destroy your worker, and reprovision it.
  • Run through the app. Ensure you can still analyze and model.
  • Ensure TR-55 works. Compare results against staging and ensure they are identical.
  • Ensure MapShed works. Compare results against staging and ensure they are identical.

rajadain and others added 15 commits August 2, 2017 11:42
This is required to run Akka HTTP services natively. Previously
we were running Spark JobServer within a Docker container, so
did not need to install this. Now, for performance reasons, we
will run the service natively, thus necessitating this install.
- adjust geoprocessing jar version and name
- remove Spark Job Server from Ansible config
- rename SJS port & host -> geop_port & geop_host
- configure geoprocessing role
- add upstart geoprocessing job
- declare an explicit dependency on the model-my-watershed.base role in
the geoprocessing role to ensure the mmw user is created before the
service starts
Replace SJS with Akka HTTP server

Connects #2101 
Connects #2103
Since the new geoprocessing service is run as the `mmw` user in
the Worker VM, that user must have access to AWS credentials.
Instead of mounting the developer's credentials into `/aws`, they
are now mounted into the `mmw` user's home folder.

Both `~/.aws/credentials` and `~/.aws/config` must be 644.
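
The mode requirement above can be checked with a few lines of Python. This is a minimal sketch, not part of the PR: the helper names `has_mode` and `check_aws_files` are hypothetical, and the home directory is passed in as an argument rather than looked up for the `mmw` user.

```python
import os
import stat


def has_mode(path, expected=0o644):
    """Return True if the permission bits of `path` equal `expected`."""
    return stat.S_IMODE(os.stat(path).st_mode) == expected


def check_aws_files(home):
    """Report, per file, whether ~/.aws/credentials and ~/.aws/config are 644.

    `home` is the home directory to inspect (hypothetical parameter).
    """
    aws_dir = os.path.join(home, ".aws")
    return {
        name: has_mode(os.path.join(aws_dir, name))
        for name in ("credentials", "config")
    }
```

Calling `check_aws_files` with the `mmw` user's home directory returns a dict such as `{"credentials": True, "config": False}`, which makes it easy to see which file needs a `chmod 644`.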
We add a task `run` and a helper method `geoprocess`. The `run`
task converts the input into the desired format, and `geoprocess`
communicates with the geoprocessing service and returns results.

`run` is a combination of `start` and `finish`: if a result is
cacheable and already cached, it returns the cached value.
Otherwise it runs `geoprocess`.

`geoprocess` is similar to `sjs_submit` in the sense that it is
POSTing to an endpoint. Unlike `sjs_submit`, which gets back a job
id, `geoprocess` receives the actual results and returns them.

`run` is designed to replace `start` and `finish` tasks in Celery
chains. So if a previous celery chain was:

    chain(geoprocessing.start.s(data),
          geoprocessing.finish.s(),
          mytasks.process_results.s())

It will now be:

    chain(geoprocessing.run.s(data),
          mytasks.process_results.s())
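
The cache-then-call flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code merged in this PR: the function signatures, the default host and port, the dict used as a cache, and the `cache_key` and `geoprocess_fn` parameters are all hypothetical, and the Celery task decorator is omitted so the sketch stays self-contained.

```python
import json
from urllib import request


def geoprocess(endpoint, payload, host="localhost", port=8090,
               urlopen=request.urlopen):
    """POST `payload` as JSON and return the parsed results directly.

    Unlike a job-submission call, no job id is involved: the response
    body is the result. Host, port and URL layout are assumptions.
    """
    url = "http://{}:{}/{}".format(host, port, endpoint)
    req = request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),  # data => POST request
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req) as response:
        return json.loads(response.read().decode("utf-8"))


def run(endpoint, payload, cache, cache_key=None, geoprocess_fn=geoprocess):
    """Return a cached result when one exists, otherwise call the service.

    A `cache_key` of None marks the request as not cacheable.
    """
    if cache_key is not None and cache_key in cache:
        return cache[cache_key]
    result = geoprocess_fn(endpoint, payload)
    if cache_key is not None:
        cache[cache_key] = result
    return result
```

Because `run` returns the results themselves, the next task in a chain can consume them directly, which is why the separate `finish` step is no longer needed.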
We likely do not need to use `choose_worker` anymore, since each
request is independent and can be run on any worker (in the right
colored stack). However, this probably needs some more thought,
and thus will be addressed in the separate issue #2117.
These old async operations are no longer used.
This version includes RasterGroupedCount and RasterGroupedAverage
operations, which make it sufficient for Analyze and TR-55 tasks.
We need azavea/ansible-java#27 to solve
AWS access issues with OpenJDK.
…elery

Collections API: Update Celery to be Synchronous

Connects #2102
Due to the implementation of async SJS requests, we established a
mechanism for directly establishing routes that would stay on a single
worker.  This led to complex code and also high latency when we pinged
each worker to determine if they were available.  With recent changes
removing SJS in favor of a synchronous geoprocessing call, we can now
cull these custom routes and exchanges, both simplifying the code and
increasing the speed of submitting a new job.
Remove custom celery routing and exchange
Upgrade to 3.0.0-beta-1, which supports RasterLinesJoin, thus
unlocking MapShed tasks and making the service feature complete.
Contributor

@mmcfarland mmcfarland left a comment

Ran through everything and compared to production: identical results. The Celery improvements make a big impact, but for larger watersheds it takes slightly more polls to finish, which is probably to be expected due to network latency with S3. Looking forward to getting this up on staging and having a better apples-to-apples comparison.

@mmcfarland mmcfarland assigned rajadain and unassigned mmcfarland Aug 31, 2017
@rajadain rajadain merged commit 3d7edbd into develop Aug 31, 2017
@rajadain rajadain deleted the feature/collections-api branch August 31, 2017 18:06
@rajadain rajadain mentioned this pull request Oct 16, 2017
3 participants