Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parallelization of commands (CHIP and PREDICT) #671

Merged
merged 41 commits into from Mar 25, 2019
Merged

Conversation

@lossyrob
Copy link
Contributor

@lossyrob lossyrob commented Jan 30, 2019

This PR adds the ability for users to specify a split option for running experiments, which will split certain commands into smaller commands that can be run in parallel.

This introduces several refactors:

  • Utilizing directories instead of specific file paths in the CommandIODefinitions
  • Breaking up the update_for_command from the addition of files to io_defs, which is now done in report_io.

This was tested against Rio chip classification and Vegas semantic segmentation in the examples.

TODO

  • Increase unit test coverage
  • Add documentation
  • Modify examples to use splits Examples need to be updated further as per azavea/raster-vision-examples#42 - Adding splitting can be done as part of the upgrade after this gets merged.
  • Add release notes
  • Document report_io and update_for_command, in general and specifically for plugins.

Closes #654
Closes #268

for command_config in base_command_config.split(splits):
io_def = command_config.report_io()
command_def = cls(e.id, command_config, io_def)
command_definitions.append(command_def)

This comment has been minimized.

@lossyrob

lossyrob Jan 30, 2019
Author Contributor

This is the crux of the changes - by separating the io_def updates and the update_for_command, we can generate commands from the fully resolved and unsplit experiment, and then split the commands, then ask the commands to provide the io_def updates.

@lewfish
Copy link
Contributor

@lewfish lewfish commented Jan 31, 2019

I skimmed this and the approach seems sound.

@lossyrob lossyrob changed the title [WIP] Parallelization of commands (CHIP and PREDICT) Parallelization of commands (CHIP and PREDICT) Feb 17, 2019
@lossyrob lossyrob changed the title Parallelization of commands (CHIP and PREDICT) [WIP] Parallelization of commands (CHIP and PREDICT) Feb 17, 2019
job_name = '{}-{}_{}_{}'.format(command_type, command_split_id, exp,
uuid_part)

group_level = 1

This comment has been minimized.

@lossyrob

lossyrob Feb 18, 2019
Author Contributor

@lewfish commented in slack:

You can use Batch "array jobs" for this. https://docs.aws.amazon.com/batch/latest/userguide/array_jobs.html There's some code for this here

This comment has been minimized.

@lossyrob

lossyrob Feb 18, 2019
Author Contributor

We'd have to figure out how to utilize AWS_BATCH_JOB_ARRAY_INDEX inside run_command, as this is the only mechanism for splitting up the command parameters AFAIK.

def run_command(command_config_uri, tempdir):

We could feed an option into run_command that specifies the environment variable that defines the split ID, and then modify the command config appropriately. Thoughts?

This comment has been minimized.

@lossyrob

lossyrob Feb 18, 2019
Author Contributor

This will also affect the mechanism for submission - the submission happens a command at a time, and does not have the visibility of e.g. all CHIP commands at once. This may take a substantial refactor to work out - and may take some reconsideration of the OutOfProcessRunner heirarchy.

Perhaps we should do this in a separate PR?

This comment has been minimized.

@lewfish

lewfish Feb 18, 2019
Contributor

Yeah, that would be a big change. I think the way you have it is ok.

@lossyrob lossyrob requested a review from lewfish Mar 3, 2019
@lossyrob lossyrob changed the title Parallelization of commands (CHIP and PREDICT) WIP: Parallelization of commands (CHIP and PREDICT) Mar 4, 2019
@codecov
Copy link

@codecov codecov bot commented Mar 7, 2019

Codecov Report

Merging #671 into develop will increase coverage by 0.23%.
The diff coverage is 51.47%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #671      +/-   ##
===========================================
+ Coverage    71.66%   71.89%   +0.23%     
===========================================
  Files          171      171              
  Lines         8283     8401     +118     
===========================================
+ Hits          5936     6040     +104     
- Misses        2347     2361      +14
Impacted Files Coverage Δ
rastervision/augmentor/augmentor_config.py 87.5% <ø> (+4.16%) ⬆️
...rvision/runner/out_of_process_experiment_runner.py 36.17% <0%> (-0.79%) ⬇️
...l_source/object_detection_geojson_source_config.py 0% <0%> (ø) ⬆️
.../label_store/semantic_segmentation_raster_store.py 15.11% <0%> (-0.36%) ⬇️
rastervision/runner/inprocess_experiment_runner.py 37.03% <0%> (-1.43%) ⬇️
rastervision/filesystem/s3_filesystem.py 95.86% <100%> (-1.59%) ⬇️
rastervision/filesystem/filesystem.py 100% <100%> (ø) ⬆️
rastervision/command/predict_command_config.py 96.93% <100%> (+15.23%) ⬆️
rastervision/utils/files.py 93.47% <100%> (ø) ⬆️
.../vector_source/vector_tile_vector_source_config.py 90.56% <100%> (ø) ⬆️
... and 52 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9f38cc9...cf540c4. Read the comment docs.

@lossyrob lossyrob changed the title WIP: Parallelization of commands (CHIP and PREDICT) Parallelization of commands (CHIP and PREDICT) Mar 7, 2019
@codecov
Copy link

@codecov codecov bot commented Mar 9, 2019

Codecov Report

Merging #671 into develop will increase coverage by 0.23%.
The diff coverage is 51.47%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #671      +/-   ##
===========================================
+ Coverage    71.66%   71.89%   +0.23%     
===========================================
  Files          171      171              
  Lines         8283     8401     +118     
===========================================
+ Hits          5936     6040     +104     
- Misses        2347     2361      +14
Impacted Files Coverage Δ
rastervision/augmentor/augmentor_config.py 87.5% <ø> (+4.16%) ⬆️
...rvision/runner/out_of_process_experiment_runner.py 36.17% <0%> (-0.79%) ⬇️
...l_source/object_detection_geojson_source_config.py 0% <0%> (ø) ⬆️
.../label_store/semantic_segmentation_raster_store.py 15.11% <0%> (-0.36%) ⬇️
rastervision/runner/inprocess_experiment_runner.py 37.03% <0%> (-1.43%) ⬇️
rastervision/filesystem/s3_filesystem.py 95.86% <100%> (-1.59%) ⬇️
rastervision/filesystem/filesystem.py 100% <100%> (ø) ⬆️
rastervision/command/predict_command_config.py 96.93% <100%> (+15.23%) ⬆️
rastervision/utils/files.py 93.47% <100%> (ø) ⬆️
.../vector_source/vector_tile_vector_source_config.py 90.56% <100%> (ø) ⬆️
... and 55 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 90f4c5d...3d98352. Read the comment docs.

@lewfish
Copy link
Contributor

@lewfish lewfish commented Mar 14, 2019

When testing this on a remote object detection experiment, it fails here due to self.temp_dir never getting set

return get_local_path(uri, self.temp_dir)

Copy link
Contributor

@lewfish lewfish left a comment

Looks good aside from some small issues and the bug noted in #671 (comment)

make_dir(directory, check_empty=False)

with open(path, 'w+') as file:
file.write(self.lorem)

This comment has been minimized.

@lewfish

lewfish Mar 14, 2019
Contributor

Could use str_to_file instead.

make_dir(dir1, check_empty=False)

with open(path1, 'w+') as file:
file.write(self.lorem)

This comment has been minimized.

@lewfish

lewfish Mar 14, 2019
Contributor

Could use str_to_file instead.

if command_type == rv.ANALYZE:
if not self.stats_uri:
self.stats_uri = os.path.join(experiment_config.analyze_uri,
'stats.json')

def report_io(self, command_type, io_def):

This comment has been minimized.

@lewfish

lewfish Mar 14, 2019
Contributor

At first I thought this should be named get_io but since it has to recursively call this on all it's children, report_io makes sense.

outputs_to_defs[(output_uri, command_type)] = []
outputs_to_defs[(output_uri, command_type)].append(command_def)
if (output_uri, command_type, split_id) not in outputs_to_defs:
outputs_to_defs[(output_uri, command_type, split_id)] = []

This comment has been minimized.

@@ -24,6 +25,13 @@ def report_io(self):
"""
pass

def split(self, num_parts):
"""Split this command config into num_parts parts if possible.
This will return a list containing 1 - num_parts command configurations.

This comment has been minimized.

@lewfish

lewfish Mar 14, 2019
Contributor

This doesn't sound correct: 1 - num_parts. Shouldn't it be num_parts?

This comment has been minimized.

@lossyrob

lossyrob Mar 15, 2019
Author Contributor

Ah, should say "1 to num_parts", I'll reword it.

rerunning happening even though output already exists and the ``--rerun`` flag is not used. This
can be a common pitfall for plugin development, and care should be taken to ensure that IO is
properly being reported. The ``--dry-run`` flag with the ``-v`` verbosity flag can be useful here
for ensuring the IO that is reported is what is expected.

This comment has been minimized.

@lewfish

lewfish Mar 14, 2019
Contributor

extra space before is

Reporting IO
^^^^^^^^^^^^

Raster Vision requires that configuration reports on it's input and output files, which allows it to tie

This comment has been minimized.

@lewfish

lewfish Mar 14, 2019
Contributor

it's -> its

@@ -68,7 +69,8 @@ def split(self, num_parts):
commands = []
t_scenes = list(map(lambda x: (0, x), self.train_scenes))
v_scenes = list(map(lambda x: (1, x), self.val_scenes))
group_size = max(int((len(t_scenes) + len(v_scenes)) / num_parts), 1)
group_size = max(

This comment has been minimized.

@lewfish

lewfish Mar 14, 2019
Contributor

It seems like the logic for getting group size should be moved into a utility function since it is duplicated in predict_command_config. You may also want to consider combining that and the grouped function into a single split_into_groups(list, num_groups) since they usually co-occur. In addition, I think this function should be unit tested so we can make sure it is correct in edge cases, like more groups than elements in list, number of elements isn't divisible by number groups, etc.

@@ -11,6 +11,8 @@ env:
- CLEAN_TRAVIS_TAG=${TRAVIS_TAG/[[:space:]]/}
- COMMIT=${CLEAN_TRAVIS_TAG:-${TRAVIS_COMMIT:0:7}}

if: (type = pull_request) OR (tag IS present) OR (branch = develop) OR (branch =~ /^feature.*/)

This comment has been minimized.

@lewfish

lewfish Mar 14, 2019
Contributor

Does this fix #268?

This comment has been minimized.

@lossyrob

lossyrob Mar 15, 2019
Author Contributor

Yep! I'll mark it in the PR description.

@codecov
Copy link

@codecov codecov bot commented Mar 25, 2019

Codecov Report

No coverage uploaded for pull request base (develop@1c869ce). Click here to learn what that means.
The diff coverage is 51.47%.

Impacted file tree graph

@@            Coverage Diff             @@
##             develop     #671   +/-   ##
==========================================
  Coverage           ?   71.89%           
==========================================
  Files              ?      171           
  Lines              ?     8401           
  Branches           ?        0           
==========================================
  Hits               ?     6040           
  Misses             ?     2361           
  Partials           ?        0
Impacted Files Coverage Δ
rastervision/augmentor/augmentor_config.py 87.5% <ø> (ø)
...rvision/runner/out_of_process_experiment_runner.py 36.17% <0%> (ø)
...l_source/object_detection_geojson_source_config.py 0% <0%> (ø)
.../label_store/semantic_segmentation_raster_store.py 15.11% <0%> (ø)
rastervision/runner/inprocess_experiment_runner.py 37.03% <0%> (ø)
rastervision/filesystem/s3_filesystem.py 95.86% <100%> (ø)
rastervision/filesystem/filesystem.py 100% <100%> (ø)
rastervision/command/predict_command_config.py 96.93% <100%> (ø)
rastervision/utils/files.py 93.47% <100%> (ø)
.../vector_source/vector_tile_vector_source_config.py 90.56% <100%> (ø)
... and 40 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1c869ce...cd1c7e7. Read the comment docs.

@codecov
Copy link

@codecov codecov bot commented Mar 25, 2019

Codecov Report

No coverage uploaded for pull request base (develop@1c869ce). Click here to learn what that means.
The diff coverage is 51.47%.

Impacted file tree graph

@@            Coverage Diff             @@
##             develop     #671   +/-   ##
==========================================
  Coverage           ?   71.89%           
==========================================
  Files              ?      171           
  Lines              ?     8401           
  Branches           ?        0           
==========================================
  Hits               ?     6040           
  Misses             ?     2361           
  Partials           ?        0
Impacted Files Coverage Δ
rastervision/augmentor/augmentor_config.py 87.5% <ø> (ø)
...rvision/runner/out_of_process_experiment_runner.py 36.17% <0%> (ø)
...l_source/object_detection_geojson_source_config.py 0% <0%> (ø)
.../label_store/semantic_segmentation_raster_store.py 15.11% <0%> (ø)
rastervision/runner/inprocess_experiment_runner.py 37.03% <0%> (ø)
rastervision/filesystem/s3_filesystem.py 95.86% <100%> (ø)
rastervision/filesystem/filesystem.py 100% <100%> (ø)
rastervision/command/predict_command_config.py 96.93% <100%> (ø)
rastervision/utils/files.py 93.47% <100%> (ø)
.../vector_source/vector_tile_vector_source_config.py 90.56% <100%> (ø)
... and 40 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1c869ce...3175f80. Read the comment docs.

@codecov
Copy link

@codecov codecov bot commented Mar 25, 2019

Codecov Report

No coverage uploaded for pull request base (develop@1c869ce). Click here to learn what that means.
The diff coverage is 51.47%.

Impacted file tree graph

@@            Coverage Diff             @@
##             develop     #671   +/-   ##
==========================================
  Coverage           ?   71.89%           
==========================================
  Files              ?      171           
  Lines              ?     8401           
  Branches           ?        0           
==========================================
  Hits               ?     6040           
  Misses             ?     2361           
  Partials           ?        0
Impacted Files Coverage Δ
rastervision/augmentor/augmentor_config.py 87.5% <ø> (ø)
...rvision/runner/out_of_process_experiment_runner.py 36.17% <0%> (ø)
...l_source/object_detection_geojson_source_config.py 0% <0%> (ø)
.../label_store/semantic_segmentation_raster_store.py 15.11% <0%> (ø)
rastervision/runner/inprocess_experiment_runner.py 37.03% <0%> (ø)
rastervision/filesystem/s3_filesystem.py 95.86% <100%> (ø)
rastervision/filesystem/filesystem.py 100% <100%> (ø)
rastervision/command/predict_command_config.py 96.93% <100%> (ø)
rastervision/utils/files.py 93.47% <100%> (ø)
.../vector_source/vector_tile_vector_source_config.py 90.56% <100%> (ø)
... and 40 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1c869ce...2c6e560. Read the comment docs.

1 similar comment
@codecov
Copy link

@codecov codecov bot commented Mar 25, 2019

Codecov Report

No coverage uploaded for pull request base (develop@1c869ce). Click here to learn what that means.
The diff coverage is 51.47%.

Impacted file tree graph

@@            Coverage Diff             @@
##             develop     #671   +/-   ##
==========================================
  Coverage           ?   71.89%           
==========================================
  Files              ?      171           
  Lines              ?     8401           
  Branches           ?        0           
==========================================
  Hits               ?     6040           
  Misses             ?     2361           
  Partials           ?        0
Impacted Files Coverage Δ
rastervision/augmentor/augmentor_config.py 87.5% <ø> (ø)
...rvision/runner/out_of_process_experiment_runner.py 36.17% <0%> (ø)
...l_source/object_detection_geojson_source_config.py 0% <0%> (ø)
.../label_store/semantic_segmentation_raster_store.py 15.11% <0%> (ø)
rastervision/runner/inprocess_experiment_runner.py 37.03% <0%> (ø)
rastervision/filesystem/s3_filesystem.py 95.86% <100%> (ø)
rastervision/filesystem/filesystem.py 100% <100%> (ø)
rastervision/command/predict_command_config.py 96.93% <100%> (ø)
rastervision/utils/files.py 93.47% <100%> (ø)
.../vector_source/vector_tile_vector_source_config.py 90.56% <100%> (ø)
... and 40 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1c869ce...2c6e560. Read the comment docs.

@lossyrob lossyrob requested a review from lewfish Mar 25, 2019
@lewfish lewfish merged commit 091feb9 into develop Mar 25, 2019
1 check passed
1 check passed
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@lewfish lewfish deleted the rde/feature/parallelize branch Mar 25, 2019
@lewfish lewfish removed the review label Mar 25, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

2 participants