Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parallelization of commands (CHIP and PREDICT) #671

Merged
merged 41 commits into from Mar 25, 2019

Conversation

Projects
None yet
2 participants
@lossyrob
Copy link
Member

commented Jan 30, 2019

This PR adds the ability for users to specify a split option for running experiments, which will split certain commands into smaller commands that can be run in parallel.

This introduces several refactors:

  • Utilizing directories instead of specific file paths in the CommandIODefinitions
  • Breaking up the update_for_command from the addition of files to io_defs, which is now done in report_io.

This was tested against Rio chip classification and Vegas semantic segmentation in the examples.

TODO

  • Increase unit test coverage
  • Add documentation
  • Modify examples to use splits Examples need to be updated further as per azavea/raster-vision-examples#42 - Adding splitting can be done as part of the upgrade after this gets merged.
  • Add release notes
  • Document report_io and update_for_command, in general and specifically for plugins.

Closes #654
Closes #268

@lossyrob lossyrob added the review label Jan 30, 2019

for command_config in base_command_config.split(splits):
io_def = command_config.report_io()
command_def = cls(e.id, command_config, io_def)
command_definitions.append(command_def)

This comment has been minimized.

Copy link
@lossyrob

lossyrob Jan 30, 2019

Author Member

This is the crux of the changes - by separating the io_def updates and the update_for_command, we can generate commands from the fully resolved and unsplit experiment, and then split the commands, then ask the commands to provide the io_def updates.

@lewfish

This comment has been minimized.

Copy link
Contributor

commented Jan 31, 2019

I skimmed this and the approach seems sound.

@lossyrob lossyrob changed the title [WIP] Parallelization of commands (CHIP and PREDICT) Parallelization of commands (CHIP and PREDICT) Feb 17, 2019

@lossyrob lossyrob changed the title Parallelization of commands (CHIP and PREDICT) [WIP] Parallelization of commands (CHIP and PREDICT) Feb 17, 2019

job_name = '{}-{}_{}_{}'.format(command_type, command_split_id, exp,
uuid_part)

group_level = 1

This comment has been minimized.

Copy link
@lossyrob

lossyrob Feb 18, 2019

Author Member

@lewfish commented in slack:

You can use Batch "array jobs" for this. https://docs.aws.amazon.com/batch/latest/userguide/array_jobs.html There's some code for this here

This comment has been minimized.

Copy link
@lossyrob

lossyrob Feb 18, 2019

Author Member

We'd have to figure out how to utilize AWS_BATCH_JOB_ARRAY_INDEX inside run_command, as this is the only mechanism for splitting up the command parameters AFAIK.

def run_command(command_config_uri, tempdir):

We could feed an option into run_command that specifies the environment variable that defines the split ID, and then modify the command config appropriately. Thoughts?

This comment has been minimized.

Copy link
@lossyrob

lossyrob Feb 18, 2019

Author Member

This will also affect the mechanism for submission - the submission happens a command at a time, and does not have the visibility of e.g. all CHIP commands at once. This may take a substantial refactor to work out - and may take some reconsideration of the OutOfProcessRunner heirarchy.

Perhaps we should do this in a separate PR?

This comment has been minimized.

Copy link
@lewfish

lewfish Feb 18, 2019

Contributor

Yeah, that would be a big change. I think the way you have it is ok.

@lossyrob lossyrob requested a review from lewfish Mar 3, 2019

@lossyrob lossyrob changed the title Parallelization of commands (CHIP and PREDICT) WIP: Parallelization of commands (CHIP and PREDICT) Mar 4, 2019

@codecov

This comment has been minimized.

Copy link

commented Mar 7, 2019

Codecov Report

Merging #671 into develop will increase coverage by 0.23%.
The diff coverage is 51.47%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #671      +/-   ##
===========================================
+ Coverage    71.66%   71.89%   +0.23%     
===========================================
  Files          171      171              
  Lines         8283     8401     +118     
===========================================
+ Hits          5936     6040     +104     
- Misses        2347     2361      +14
Impacted Files Coverage Δ
rastervision/augmentor/augmentor_config.py 87.5% <ø> (+4.16%) ⬆️
...rvision/runner/out_of_process_experiment_runner.py 36.17% <0%> (-0.79%) ⬇️
...l_source/object_detection_geojson_source_config.py 0% <0%> (ø) ⬆️
.../label_store/semantic_segmentation_raster_store.py 15.11% <0%> (-0.36%) ⬇️
rastervision/runner/inprocess_experiment_runner.py 37.03% <0%> (-1.43%) ⬇️
rastervision/filesystem/s3_filesystem.py 95.86% <100%> (-1.59%) ⬇️
rastervision/filesystem/filesystem.py 100% <100%> (ø) ⬆️
rastervision/command/predict_command_config.py 96.93% <100%> (+15.23%) ⬆️
rastervision/utils/files.py 93.47% <100%> (ø) ⬆️
.../vector_source/vector_tile_vector_source_config.py 90.56% <100%> (ø) ⬆️
... and 52 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9f38cc9...cf540c4. Read the comment docs.

@lossyrob lossyrob changed the title WIP: Parallelization of commands (CHIP and PREDICT) Parallelization of commands (CHIP and PREDICT) Mar 7, 2019

@codecov

This comment has been minimized.

Copy link

commented Mar 9, 2019

Codecov Report

Merging #671 into develop will increase coverage by 0.23%.
The diff coverage is 51.47%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #671      +/-   ##
===========================================
+ Coverage    71.66%   71.89%   +0.23%     
===========================================
  Files          171      171              
  Lines         8283     8401     +118     
===========================================
+ Hits          5936     6040     +104     
- Misses        2347     2361      +14
Impacted Files Coverage Δ
rastervision/augmentor/augmentor_config.py 87.5% <ø> (+4.16%) ⬆️
...rvision/runner/out_of_process_experiment_runner.py 36.17% <0%> (-0.79%) ⬇️
...l_source/object_detection_geojson_source_config.py 0% <0%> (ø) ⬆️
.../label_store/semantic_segmentation_raster_store.py 15.11% <0%> (-0.36%) ⬇️
rastervision/runner/inprocess_experiment_runner.py 37.03% <0%> (-1.43%) ⬇️
rastervision/filesystem/s3_filesystem.py 95.86% <100%> (-1.59%) ⬇️
rastervision/filesystem/filesystem.py 100% <100%> (ø) ⬆️
rastervision/command/predict_command_config.py 96.93% <100%> (+15.23%) ⬆️
rastervision/utils/files.py 93.47% <100%> (ø) ⬆️
.../vector_source/vector_tile_vector_source_config.py 90.56% <100%> (ø) ⬆️
... and 55 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 90f4c5d...3d98352. Read the comment docs.

@lewfish

This comment has been minimized.

Copy link
Contributor

commented Mar 14, 2019

When testing this on a remote object detection experiment, it fails here due to self.temp_dir never getting set

return get_local_path(uri, self.temp_dir)

@lewfish
Copy link
Contributor

left a comment

Looks good aside from some small issues and the bug noted in #671 (comment)

make_dir(directory, check_empty=False)

with open(path, 'w+') as file:
file.write(self.lorem)

This comment has been minimized.

Copy link
@lewfish

lewfish Mar 14, 2019

Contributor

Could use str_to_file instead.

make_dir(dir1, check_empty=False)

with open(path1, 'w+') as file:
file.write(self.lorem)

This comment has been minimized.

Copy link
@lewfish

lewfish Mar 14, 2019

Contributor

Could use str_to_file instead.

if command_type == rv.ANALYZE:
if not self.stats_uri:
self.stats_uri = os.path.join(experiment_config.analyze_uri,
'stats.json')

def report_io(self, command_type, io_def):

This comment has been minimized.

Copy link
@lewfish

lewfish Mar 14, 2019

Contributor

At first I thought this should be named get_io but since it has to recursively call this on all it's children, report_io makes sense.

outputs_to_defs[(output_uri, command_type)] = []
outputs_to_defs[(output_uri, command_type)].append(command_def)
if (output_uri, command_type, split_id) not in outputs_to_defs:
outputs_to_defs[(output_uri, command_type, split_id)] = []

This comment has been minimized.

Copy link
@lewfish
@@ -24,6 +25,13 @@ def report_io(self):
"""
pass

def split(self, num_parts):
"""Split this command config into num_parts parts if possible.
This will return a list containing 1 - num_parts command configurations.

This comment has been minimized.

Copy link
@lewfish

lewfish Mar 14, 2019

Contributor

This doesn't sound correct: 1 - num_parts. Shouldn't it be num_parts?

This comment has been minimized.

Copy link
@lossyrob

lossyrob Mar 15, 2019

Author Member

Ah, should say "1 to num_parts", I'll reword it.

rerunning happening even though output already exists and the ``--rerun`` flag is not used. This
can be a common pitfall for plugin development, and care should be taken to ensure that IO is
properly being reported. The ``--dry-run`` flag with the ``-v`` verbosity flag can be useful here
for ensuring the IO that is reported is what is expected.

This comment has been minimized.

Copy link
@lewfish

lewfish Mar 14, 2019

Contributor

extra space before is

Reporting IO
^^^^^^^^^^^^

Raster Vision requires that configuration reports on it's input and output files, which allows it to tie

This comment has been minimized.

Copy link
@lewfish

lewfish Mar 14, 2019

Contributor

it's -> its

@@ -68,7 +69,8 @@ def split(self, num_parts):
commands = []
t_scenes = list(map(lambda x: (0, x), self.train_scenes))
v_scenes = list(map(lambda x: (1, x), self.val_scenes))
group_size = max(int((len(t_scenes) + len(v_scenes)) / num_parts), 1)
group_size = max(

This comment has been minimized.

Copy link
@lewfish

lewfish Mar 14, 2019

Contributor

It seems like the logic for getting group size should be moved into a utility function since it is duplicated in predict_command_config. You may also want to consider combining that and the grouped function into a single split_into_groups(list, num_groups) since they usually co-occur. In addition, I think this function should be unit tested so we can make sure it is correct in edge cases, like more groups than elements in list, number of elements isn't divisible by number groups, etc.

@@ -11,6 +11,8 @@ env:
- CLEAN_TRAVIS_TAG=${TRAVIS_TAG/[[:space:]]/}
- COMMIT=${CLEAN_TRAVIS_TAG:-${TRAVIS_COMMIT:0:7}}

if: (type = pull_request) OR (tag IS present) OR (branch = develop) OR (branch =~ /^feature.*/)

This comment has been minimized.

Copy link
@lewfish

lewfish Mar 14, 2019

Contributor

Does this fix #268?

This comment has been minimized.

Copy link
@lossyrob

lossyrob Mar 15, 2019

Author Member

Yep! I'll mark it in the PR description.

@codecov

This comment has been minimized.

Copy link

commented Mar 25, 2019

Codecov Report

❗️ No coverage uploaded for pull request base (develop@1c869ce). Click here to learn what that means.
The diff coverage is 51.47%.

Impacted file tree graph

@@            Coverage Diff             @@
##             develop     #671   +/-   ##
==========================================
  Coverage           ?   71.89%           
==========================================
  Files              ?      171           
  Lines              ?     8401           
  Branches           ?        0           
==========================================
  Hits               ?     6040           
  Misses             ?     2361           
  Partials           ?        0
Impacted Files Coverage Δ
rastervision/augmentor/augmentor_config.py 87.5% <ø> (ø)
...rvision/runner/out_of_process_experiment_runner.py 36.17% <0%> (ø)
...l_source/object_detection_geojson_source_config.py 0% <0%> (ø)
.../label_store/semantic_segmentation_raster_store.py 15.11% <0%> (ø)
rastervision/runner/inprocess_experiment_runner.py 37.03% <0%> (ø)
rastervision/filesystem/s3_filesystem.py 95.86% <100%> (ø)
rastervision/filesystem/filesystem.py 100% <100%> (ø)
rastervision/command/predict_command_config.py 96.93% <100%> (ø)
rastervision/utils/files.py 93.47% <100%> (ø)
.../vector_source/vector_tile_vector_source_config.py 90.56% <100%> (ø)
... and 40 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1c869ce...cd1c7e7. Read the comment docs.

@codecov

This comment has been minimized.

Copy link

commented Mar 25, 2019

Codecov Report

❗️ No coverage uploaded for pull request base (develop@1c869ce). Click here to learn what that means.
The diff coverage is 51.47%.

Impacted file tree graph

@@            Coverage Diff             @@
##             develop     #671   +/-   ##
==========================================
  Coverage           ?   71.89%           
==========================================
  Files              ?      171           
  Lines              ?     8401           
  Branches           ?        0           
==========================================
  Hits               ?     6040           
  Misses             ?     2361           
  Partials           ?        0
Impacted Files Coverage Δ
rastervision/augmentor/augmentor_config.py 87.5% <ø> (ø)
...rvision/runner/out_of_process_experiment_runner.py 36.17% <0%> (ø)
...l_source/object_detection_geojson_source_config.py 0% <0%> (ø)
.../label_store/semantic_segmentation_raster_store.py 15.11% <0%> (ø)
rastervision/runner/inprocess_experiment_runner.py 37.03% <0%> (ø)
rastervision/filesystem/s3_filesystem.py 95.86% <100%> (ø)
rastervision/filesystem/filesystem.py 100% <100%> (ø)
rastervision/command/predict_command_config.py 96.93% <100%> (ø)
rastervision/utils/files.py 93.47% <100%> (ø)
.../vector_source/vector_tile_vector_source_config.py 90.56% <100%> (ø)
... and 40 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1c869ce...3175f80. Read the comment docs.

@codecov

This comment has been minimized.

Copy link

commented Mar 25, 2019

Codecov Report

❗️ No coverage uploaded for pull request base (develop@1c869ce). Click here to learn what that means.
The diff coverage is 51.47%.

Impacted file tree graph

@@            Coverage Diff             @@
##             develop     #671   +/-   ##
==========================================
  Coverage           ?   71.89%           
==========================================
  Files              ?      171           
  Lines              ?     8401           
  Branches           ?        0           
==========================================
  Hits               ?     6040           
  Misses             ?     2361           
  Partials           ?        0
Impacted Files Coverage Δ
rastervision/augmentor/augmentor_config.py 87.5% <ø> (ø)
...rvision/runner/out_of_process_experiment_runner.py 36.17% <0%> (ø)
...l_source/object_detection_geojson_source_config.py 0% <0%> (ø)
.../label_store/semantic_segmentation_raster_store.py 15.11% <0%> (ø)
rastervision/runner/inprocess_experiment_runner.py 37.03% <0%> (ø)
rastervision/filesystem/s3_filesystem.py 95.86% <100%> (ø)
rastervision/filesystem/filesystem.py 100% <100%> (ø)
rastervision/command/predict_command_config.py 96.93% <100%> (ø)
rastervision/utils/files.py 93.47% <100%> (ø)
.../vector_source/vector_tile_vector_source_config.py 90.56% <100%> (ø)
... and 40 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1c869ce...2c6e560. Read the comment docs.

1 similar comment
@codecov

This comment has been minimized.

Copy link

commented Mar 25, 2019

Codecov Report

❗️ No coverage uploaded for pull request base (develop@1c869ce). Click here to learn what that means.
The diff coverage is 51.47%.

Impacted file tree graph

@@            Coverage Diff             @@
##             develop     #671   +/-   ##
==========================================
  Coverage           ?   71.89%           
==========================================
  Files              ?      171           
  Lines              ?     8401           
  Branches           ?        0           
==========================================
  Hits               ?     6040           
  Misses             ?     2361           
  Partials           ?        0
Impacted Files Coverage Δ
rastervision/augmentor/augmentor_config.py 87.5% <ø> (ø)
...rvision/runner/out_of_process_experiment_runner.py 36.17% <0%> (ø)
...l_source/object_detection_geojson_source_config.py 0% <0%> (ø)
.../label_store/semantic_segmentation_raster_store.py 15.11% <0%> (ø)
rastervision/runner/inprocess_experiment_runner.py 37.03% <0%> (ø)
rastervision/filesystem/s3_filesystem.py 95.86% <100%> (ø)
rastervision/filesystem/filesystem.py 100% <100%> (ø)
rastervision/command/predict_command_config.py 96.93% <100%> (ø)
rastervision/utils/files.py 93.47% <100%> (ø)
.../vector_source/vector_tile_vector_source_config.py 90.56% <100%> (ø)
... and 40 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1c869ce...2c6e560. Read the comment docs.

@lossyrob lossyrob requested a review from lewfish Mar 25, 2019

@lewfish lewfish merged commit 091feb9 into develop Mar 25, 2019

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

@lewfish lewfish deleted the rde/feature/parallelize branch Mar 25, 2019

@lewfish lewfish removed the review label Mar 25, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.