Make Use of CPU and GPU Queues #668

Merged
jamesmcclain merged 4 commits into azavea:develop from jamesmcclain:cpu-gpu on Feb 1, 2019

jamesmcclain commented Jan 25, 2019

Overview

Allows jobs to be run on both CPU and GPU instances on AWS.
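
As a rough illustration of the idea (not the code in this PR), the routing could look like the sketch below. `GPU_COMMANDS`, `BatchTarget`, and `pick_target` are hypothetical names, the queue and definition names are placeholders, and which commands actually go to the GPU queue is an assumption.

```python
from typing import NamedTuple

# Assumption: only these command types need a GPU. The real split is
# decided by the runner in this PR and may differ.
GPU_COMMANDS = frozenset({'TRAIN', 'PREDICT'})


class BatchTarget(NamedTuple):
    """A Batch job queue paired with the job definition to run on it."""
    job_queue: str
    job_definition: str


def pick_target(command_type: str, gpu: BatchTarget,
                cpu: BatchTarget) -> BatchTarget:
    """Return the GPU target for GPU-bound commands, else the CPU target."""
    if command_type.upper() in GPU_COMMANDS:
        return gpu
    return cpu


# Placeholder names, for illustration only.
gpu = BatchTarget('exampleGpuQueue', 'exampleGpuJobDefinition')
cpu = BatchTarget('exampleCpuQueue', 'exampleCpuJobDefinition')
train_target = pick_target('train', gpu, cpu)
chip_target = pick_target('chip', gpu, cpu)
```

Keeping the queue and the job definition together in one value makes it harder to submit a CPU job definition to the GPU queue by accident.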

Checklist

  • Updated docs/changelog.rst
  • Added needs-backport label if PR is bug fix that applies to previous minor release
  • Ran scripts/format_code and committed any changes
  • Documentation updated if needed
  • PR has a name that won't get you publicly shamed for vagueness

Closes #634
Closes #649

See also https://github.com/azavea/pfb-network-connectivity/blob/0.8.1/src/django/pfb_analysis/models.py#L712-L716 and https://github.com/azavea/pfb-network-connectivity/blob/0.8.1/src/django/pfb_analysis/models.py#L745-L756

Testing

Tested with Vegas SpaceNet, using this command line:

rastervision run aws_batch -e spacenet.vegas -a test True -a use_remote_data True -a root_uri s3://bucket/prefix -a target buildings -a task_type semantic_segmentation

and this patch on top of this branch:

diff --git a/rastervision/runner/aws_batch_experiment_runner.py b/rastervision/runner/aws_batch_experiment_runner.py
index b365b04..20d998b 100644
--- a/rastervision/runner/aws_batch_experiment_runner.py
+++ b/rastervision/runner/aws_batch_experiment_runner.py
@@ -51,6 +51,9 @@ class AwsBatchExperimentRunner(OutOfProcessExperimentRunner):
                 cpu_job_definition = job_definition
         self.cpu_job_definition = cpu_job_definition
 
+        self.job_definition = 'jamesmcclain-dockerhub-gpu'
+        self.cpu_job_definition = 'jamesmcclain-dockerhub-cpu'
+
         self.submit = self.batch_submit
         self.execution_environment = 'Batch'
 

jamesmcclain added some commits Jan 25, 2019

@jamesmcclain jamesmcclain added the review label Jan 25, 2019

@jamesmcclain jamesmcclain requested a review from lewfish Jan 28, 2019

lewfish left a comment

Tested using azavea/raster-vision-aws#8 and updating ~/.rastervision/default to contain:

[AWS_BATCH]
job_queue=lewfishRasterVisionGpuJobQueue
job_definition=lewfishRasterVisionCustomGpuJobDefinition
cpu_job_queue=lewfishRasterVisionCpuJobQueue
cpu_job_definition=lewfishRasterVisionCustomCpuJobDefinition

The only requested change is to update the docs at: https://github.com/azavea/raster-vision/blob/develop/docs/setup.rst#L203-L219 with the new fields.
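
For anyone writing that doc update, here is a minimal sketch of how the four fields could be read with the standard-library `configparser`. `load_batch_settings` is a hypothetical helper, not Raster Vision's own config loader. The fallback of `cpu_job_definition` to `job_definition` mirrors the context lines in the diff above; applying the same fallback to `cpu_job_queue` is an assumption.

```python
import configparser


def load_batch_settings(text):
    """Parse the [AWS_BATCH] section from INI-formatted text.

    The CPU fields are optional; when missing, the GPU values are reused.
    """
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser['AWS_BATCH']
    job_queue = section['job_queue']
    job_definition = section['job_definition']
    return {
        'job_queue': job_queue,
        'job_definition': job_definition,
        # Assumption: the queue falls back the same way the definition does.
        'cpu_job_queue': section.get('cpu_job_queue', fallback=job_queue),
        'cpu_job_definition': section.get('cpu_job_definition',
                                          fallback=job_definition),
    }


example = """
[AWS_BATCH]
job_queue=lewfishRasterVisionGpuJobQueue
job_definition=lewfishRasterVisionCustomGpuJobDefinition
cpu_job_queue=lewfishRasterVisionCpuJobQueue
cpu_job_definition=lewfishRasterVisionCustomCpuJobDefinition
"""
settings = load_batch_settings(example)
```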

lewfish commented Jan 29, 2019

I thought this worked but when I looked at the Batch console I noticed that the first job is stuck in Runnable. This could be because there's something messed up with the new Batch resources I just created using the new CloudFormation setup. But it also looks like what happened in the past when we had jobs with cross-queue dependencies. When you tested whether this was possible, did you notice if the jobs were actually completed?

[Screenshot: 2019-01-28, 7:09:13 PM]
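
For reference, a cross-queue dependency in Batch is a `dependsOn` entry that points at a job ID submitted to a different queue. The sketch below shows the shape of such a chain. `submit_chain` and `RecordingClient` are hypothetical helpers, not Raster Vision code. `RecordingClient` stands in for a boto3 Batch client so the example runs without AWS credentials, and the queue, definition, and command values are placeholders.

```python
class RecordingClient:
    """Dry-run stand-in for boto3.client('batch'); records each request."""

    def __init__(self):
        self.requests = []

    def submit_job(self, **kwargs):
        self.requests.append(kwargs)
        # Real Batch returns a UUID; a counter is enough for a dry run.
        job_id = 'job-{}'.format(len(self.requests))
        return {'jobName': kwargs['jobName'], 'jobId': job_id}


def submit_chain(client, jobs):
    """Submit jobs in order, each one depending on the previous job.

    jobs: list of (name, queue, definition, command) tuples.
    client: any object with a boto3-style submit_job(**kwargs) method.
    Returns the job IDs in submission order.
    """
    job_ids = []
    for name, queue, definition, command in jobs:
        kwargs = {
            'jobName': name,
            'jobQueue': queue,
            'jobDefinition': definition,
            'containerOverrides': {'command': command},
        }
        if job_ids:
            # The parent may live in a different queue than this job.
            kwargs['dependsOn'] = [{'jobId': job_ids[-1]}]
        response = client.submit_job(**kwargs)
        job_ids.append(response['jobId'])
    return job_ids


client = RecordingClient()
job_ids = submit_chain(client, [
    ('chip', 'exampleCpuQueue', 'exampleCpuJobDefinition', ['echo', 'chip']),
    ('train', 'exampleGpuQueue', 'exampleGpuJobDefinition', ['echo', 'train']),
])
```

With a real client, confirming completion would mean polling `describe_jobs` for each returned ID until its status is `SUCCEEDED` or `FAILED`.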

jamesmcclain commented Jan 29, 2019

> did you notice if the jobs were actually completed?

All completed.

jamesmcclain commented Jan 30, 2019

rastervision run aws_batch -e spacenet.vegas -a test True -a use_remote_data True -a root_uri s3://bucket/prefix -a target buildings -a task_type semantic_segmentation

The first screenshot was taken before the job was submitted.

[Ten screenshots, 2019-01-30: 06:34:21, 06:34:33, 06:38:30, 06:40:45, 06:42:36, 06:49:45, 06:52:38, 06:54:36, 07:00:45, 07:03:28]

jamesmcclain commented Jan 30, 2019

> Tested using azavea/raster-vision-cloudformation#8 and updating ~/.rastervision/default to contain:
>
> [AWS_BATCH]
> job_queue=lewfishRasterVisionGpuJobQueue
> job_definition=lewfishRasterVisionCustomGpuJobDefinition
> cpu_job_queue=lewfishRasterVisionCpuJobQueue
> cpu_job_definition=lewfishRasterVisionCustomCpuJobDefinition
>
> The only requested change is to update the docs at: https://github.com/azavea/raster-vision/blob/develop/docs/setup.rst#L203-L219 with the new fields.

Updated, but still out of date because the instructions should probably reference raster-vision-cloudformation (see #672).

lewfish commented Jan 30, 2019

After making some changes (for one, lowering the requested RAM), I've got the jobs to move past Runnable in the CPU queue, but they still crash. I think there's something wrong with the CloudFormation setup. I have one more idea to try before I contact Ops.
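
One common reason a Batch job sits in Runnable is that its requested vCPUs or memory exceed what any instance in the compute environment can offer once the container agent's share is subtracted. Whether that was the whole story here is not established in this thread, so treat the following as a sketch of the check only. `fits_instance` is a hypothetical helper, and the instance size and the 512 MiB reserve are made-up illustrative numbers.

```python
def fits_instance(job_vcpus, job_memory_mib,
                  instance_vcpus, instance_memory_mib,
                  reserved_memory_mib=512):
    """Return True if a job's request fits on one instance.

    reserved_memory_mib is an illustrative guess at the memory that is
    not available to containers; the real figure depends on the AMI and
    agent configuration.
    """
    usable_memory_mib = instance_memory_mib - reserved_memory_mib
    return (job_vcpus <= instance_vcpus
            and job_memory_mib <= usable_memory_mib)


# Hypothetical instance: 4 vCPUs, 16384 MiB nominal memory.
asks_for_everything = fits_instance(4, 16384, 4, 16384)
asks_for_less = fits_instance(4, 15000, 4, 16384)
```

A job that requests the instance's full nominal memory fails the check, while a slightly smaller request passes, which is consistent with the jobs moving past Runnable after the RAM request was lowered.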

jamesmcclain commented Jan 30, 2019

> After making some changes (for one, lowering the requested RAM), I've got the jobs to move past Runnable in the CPU queue, but they still crash. I think there's something wrong with the CloudFormation setup. I have one more idea to try before I contact Ops.

Okay

@lewfish lewfish referenced this pull request Jan 30, 2019

Merged

Clean up and fix CPU issues #8

@jamesmcclain jamesmcclain merged commit 5ce55ee into azavea:develop Feb 1, 2019

@jamesmcclain jamesmcclain deleted the jamesmcclain:cpu-gpu branch Feb 1, 2019

@jamesmcclain jamesmcclain removed the review label Feb 1, 2019
