Collections API #2212
This is required to run Akka HTTP services natively. Previously we were running Spark JobServer within a Docker container, so did not need to install this. Now, for performance reasons, we will run the service natively, thus necessitating this install.
…l-dependencies Upgrade to Java 8 Connects #2100
- adjust geoprocessing jar version and name
- remove Spark Job Server from Ansible config
- rename SJS port & host -> geop_port & geop_host
- configure geoprocessing role
- add upstart geoprocessing job
- declare an explicit dependency on the model-my-watershed.base role in the geoprocessing role to ensure mmw user's created before the service starts
Since the new geoprocessing service is run as the `mmw` user in the Worker VM, that user must have access to AWS credentials. Instead of mounting the developer's credentials into `/aws`, they are now mounted into the `mmw` user's home folder. Both `~/.aws/credentials` and `~/.aws/config` must be 644.
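The 644 requirement above can be checked with a short script. This is a minimal sketch using only the Python standard library; the helper name `has_mode` is hypothetical and not part of the project.

```python
import os
import stat


def has_mode(path, expected=0o644):
    """Return True if the permission bits of `path` equal `expected`."""
    # S_IMODE strips the file-type bits and leaves only the permission bits.
    return stat.S_IMODE(os.stat(path).st_mode) == expected


def check_aws_files(home):
    """Map each AWS file under `home` to whether it has mode 644."""
    # The two file names come from the commit message above.
    paths = [os.path.join(home, ".aws", name) for name in ("credentials", "config")]
    return {path: has_mode(path) for path in paths}
```

Calling `check_aws_files` with the `mmw` user's home folder would return a dictionary in which any `False` value marks a file whose mode needs correcting.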
We add a task `run` and a helper method `geoprocess`. The `run` task converts the input into the desired format, and `geoprocess` communicates with the geoprocessing service and returns results.

`run` is a combination of `start` and `finish`: it checks whether a result is cacheable and cached or not, and if so returns that. Otherwise it runs `geoprocess`.

`geoprocess` is similar to `sjs_submit` in the sense that it is POSTing to an endpoint. Unlike `sjs_submit`, which gets back a job id, `geoprocess` receives the actual results and returns them.

`run` is designed to replace `start` and `finish` tasks in Celery chains. So if a previous celery chain was:

    chain(geoprocessing.start.s(data), geoprocessing.finish.s(), mytasks.process_results.s())

It will now be:

    chain(geoprocessing.run.s(data), mytasks.process_results.s())
We likely do not need to use `choose_worker` anymore, since each request is independent and can be run on any worker (in the right colored stack). However, this probably needs some more thought, and thus will be addressed in the separate issue #2117.
These old async operations are no longer used.
This version includes RasterGroupedCount and RasterGroupedAverage operations, which make it sufficient for Analyze and TR-55 tasks.
We need azavea/ansible-java#27 to solve AWS access issues with OpenJDK.
…elery Collections API: Update Celery to be Synchronous Connects #2102
Due to the implementation of async SJS requests, we established a mechanism for directly establishing routes that would stay on a single worker. This led to complex code and also to high latency, since we pinged each worker to determine whether it was available. With recent changes removing SJS in favor of a synchronous geoprocessing call, we can now cull these custom routes and exchanges, both simplifying the code and increasing the speed of submitting a new job.
Remove custom celery routing and exchange
Upgrade to 3.0.0-beta-1, which supports RasterLinesJoin, thus unlocking MapShed tasks and becoming feature complete.
…feature/collections-api Connects #2015 Connects WikiWatershed/mmw-geoprocessing#53
Ran through everything and compared to production, identical results. The celery improvements make a big impact, but at larger watersheds it's performing at slightly more polls, which is probably to be expected due to network latency with s3. Looking forward to getting this up on staging and having a better apples to apples comparison.
Overview
Formally includes the Collections API work by merging the feature branch. Includes #2100, #2101, #2102, #2103, #2104, WikiWatershed/mmw-geoprocessing#48, WikiWatershed/mmw-geoprocessing#49, WikiWatershed/mmw-geoprocessing#50, WikiWatershed/mmw-geoprocessing#51, WikiWatershed/mmw-geoprocessing#52, WikiWatershed/mmw-geoprocessing#53.
All the code in this PR has already been reviewed. This only requires a quick run through, before merging.
Connects #2105
Notes
Some areas of interest will not work for MapShed because of an alignment mismatch issue described in #2153. To test the values outside of that issue, try to choose a horizontal area of interest rather than a diagonal one. These HUC-12s should work:
Testing Instructions