Merge pull request #671 from azavea/rde/feature/parallelize

Parallelization of commands (CHIP and PREDICT)
azavea · Mar 25, 2019 · 091feb9
2 parents 1c869ce + 2c6e560

Showing 105 changed files with 2,088 additions and 942 deletions.
diff --git a/.codecov.yml b/.codecov.yml
@@ -2,3 +2,5 @@ coverage:
   status:
     project: off
     patch: off
+comment:
+  behavior: new
diff --git a/.travis.yml b/.travis.yml
@@ -11,6 +11,8 @@ env:
       - CLEAN_TRAVIS_TAG=${TRAVIS_TAG/[[:space:]]/}
       - COMMIT=${CLEAN_TRAVIS_TAG:-${TRAVIS_COMMIT:0:7}}
 
+if: (type = pull_request) OR (tag IS present) OR (branch = develop) OR (branch =~ /^feature.*/)
+
 script:
    - .travis/build
    - .travis/test

diff --git a/docs/CONTRIBUTING.rst b/docs/CONTRIBUTING.rst
@@ -5,6 +5,9 @@ We are happy to take contributions! It is best to get in touch with the maintain
 about larger features or design changes *before* starting the work,
 as it will make the process of accepting changes smoother.
 
+Contributor License Agreement (CLA)
+-----------------------------------
+
 Everyone who contributes code to Raster Vision will be asked to sign the
 Azavea CLA, which is based off of the Apache CLA.
 

diff --git a/docs/changelog.rst b/docs/changelog.rst
@@ -16,11 +16,13 @@ Raster Vision 0.9.0
 - Decrease semseg memory usage `#630 <https://github.com/azavea/raster-vision/pull/630>`_
 - Add support for vector tiles in .mbtiles files `#601 <https://github.com/azavea/raster-vision/pull/601>`_
 - Add support for getting labels from zxy vector tiles `#532 <https://github.com/azavea/raster-vision/pull/532>`_
-- Remove custom ``__deepcopy__`` implementation from ``ConfigBuilder``s. `#567 <https://github.com/azavea/raster-vision/pull/567>`_
+- Remove custom ``__deepcopy__`` implementation from ``ConfigBuilder``\s. `#567 <https://github.com/azavea/raster-vision/pull/567>`_
 - Add ability to shift raster images by given numbers of meters.  `#573 <https://github.com/azavea/raster-vision/pull/573>`_
 - Add ability to generate GeoJSON segmentation predictions.  `#575 <https://github.com/azavea/raster-vision/pull/575>`_
 - Add ability to run the DeepLab eval script.  `#653 <https://github.com/azavea/raster-vision/pull/653>`_
 - Submit CPU-only stages to a CPU queue on Aws.  `#668 <https://github.com/azavea/raster-vision/pull/668>`_
+- Parallelize CHIP and PREDICT commands.  `#671 <https://github.com/azavea/raster-vision/pull/671>`_
+- Refactor ``update_for_command`` to split out the IO reporting into ``report_io``. `#671 <https://github.com/azavea/raster-vision/pull/671>`_
 
 Raster Vision 0.8
 -----------------

diff --git a/docs/cli.rst b/docs/cli.rst
@@ -25,8 +25,14 @@ Run is the main interface into running ``ExperimentSet`` workflows.
 
 Some specific parameters to call out:
 
+-\\-arg
+~~~~~~~~~~~
+
 Use ``-a`` to pass arguments into the experiment methods; many of which take a root_uri, which is where Raster Vision will store all the output of the experiment. If you forget to supply this, Raster Vision will remind you.
 
+-\\-dry-run
+~~~~~~~~~~~
+
 Using the ``-n`` or ``--dry-run`` flag is useful to see what you're about to run before you run it. Combine this with the verbose flag for different levels of output:
 
 .. code:: shell
@@ -35,8 +41,17 @@ Using the ``-n`` or ``--dry-run`` flag is useful to see what you're about to run
    > rastervision -v run spacenet.chip_classification -a root_uri s3://example/ --dry-run
    > rastervision -vv run spacenet.chip_classification -a root_uri s3://example/ --dry-run
 
+-\\-skip-file-check
+~~~~~~~~~~~~~~~~~~~
+
+Use ``--skip-file-check`` or ``-x`` to avoid checking if files exist, which can take a long time for large experiments. The check is worth doing on the first run, but if you haven't changed anything about the experiment and are sure the files are there, it's often nice to skip that step.
+
+.. _run split option:
+
+-\\-splits
+~~~~~~~~~~
 
-Use ``-x`` to avoid checking if files exist, which can take a long time for large experiments. This is useful to do the first run, but if you haven't changed anything about the experiment and are sure the files are there, it's often nice to skip that step.
+Use ``-s N`` or ``--splits N``, where ``N`` is the number of splits to create, to run commands that can be divided into independent chunks in parallel. See :ref:`parallelizing commands` for more information.
 
 .. _predict cli command:
 

diff --git a/docs/codebase.rst b/docs/codebase.rst
@@ -54,3 +54,50 @@ Global Registry
 ---------------
 
 Another major design pattern of Raster Vision is the use of a global registry. This is what gives a single interface the ability to construct all subclass builders through the static ``builder()`` method on the ``Config`` via a key, e.g. ``rv.RasterSourceConfig.builder(rv.GEOTIFF_SOURCE)``. The key is used to look up what ConfigBuilders are registered inside the global registry, and the registry determines what builder to return from the ``build()`` call. More importantly, this enables Raster Vision to have a flexible system to create :ref:`plugins` out of anything that has a keyed ConfigBuilder. The registry pattern goes beyond Configs and ConfigBuilders, though: this is also how internal classes and plugins are chosen for :ref:`default provider`, :ref:`experiment runner`, and :ref:`filesystem`.
+
+.. _configuration topics:
+
+Configuration Topics
+--------------------
+
+Configuration objects have a couple of methods that are worth understanding if you want deeper
+knowledge of how Raster Vision works, for example if you are creating plugins.
+
+Implicit Configuration
+^^^^^^^^^^^^^^^^^^^^^^
+
+Configuration values can be set implicitly from other configuration. For example, if my backend
+requires a ``model_uri`` to save a model to, and it is not set, the configuration may set
+it to ``/opt/data/rv_root/train/experiment-name/model.hdf``. This was implicitly set by knowing the
+root URI for the train command is ``/opt/data/rv_root/train/experiment-name``, which is set on the
+experiment (by default constructed from the ``root_uri`` and ``experiment_id``).
+This works because configurations implement a method called ``update_for_command``,
+with the following signature:
+
+.. autoclass:: rastervision.core.Config
+   :members: update_for_command
+
+This method is called before running commands on an experiment, and gives the configuration a
+chance to update any values it needs to based on the experiment and any other context it needs.
+The ``context`` argument is the parent configuration: for example, for a ``RasterSourceConfig``
+it is the ``SceneConfig`` that the raster source is attached to. Context should be set whenever a parent
+configuration calls ``update_for_command`` on a child configuration and that parent is part
+of a collection of configurations (e.g., the collection of ``SceneConfig``\s in a ``DatasetConfig``).
+
+Reporting IO
+^^^^^^^^^^^^
+
+Raster Vision requires each configuration to report its input and output files, which allows it to tie
+together commands into a Directed Acyclic Graph of operations that the ``ExperimentRunner``\s can execute.
+The way this reporting happens is through the ``report_io`` method on configuration:
+
+.. autoclass:: rastervision.core.Config
+   :members: report_io
+
+For each specific command, a configuration should add any input files or directories to the ``io_def`` through its ``add_input`` method, and any output files or directories through the ``add_output`` method.
+
+If a configuration does not correctly report its IO, commands may not run, or may be
+rerun even though their output already exists and the ``--rerun`` flag is not used. This
+is a common pitfall in plugin development, so take care to ensure that IO is
+reported properly. Running with the ``--dry-run`` flag and the ``-v`` verbosity flag is a useful
+way to check that the reported IO is what you expect.
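
The two configuration hooks documented in the codebase.rst changes above can be illustrated with a minimal, self-contained sketch. Everything in it is a hypothetical stand-in, not Raster Vision's API: ``BackendConfig``, ``ExperimentConfig``, ``IODefinition``, the ``TRAIN``/``PREDICT`` constants and the simplified method signatures are assumptions made for illustration (see ``rastervision.core.Config`` for the real signatures). It shows a backend implicitly deriving ``model_uri`` from the train root URI, then reporting that file as an output of TRAIN and an input of PREDICT.

```python
from dataclasses import dataclass, field
from typing import Optional, Set

# Hypothetical command-type constants for this sketch only.
TRAIN = 'TRAIN'
PREDICT = 'PREDICT'


@dataclass
class IODefinition:
    """Hypothetical stand-in for the ``io_def`` object that collects IO."""
    inputs: Set[str] = field(default_factory=set)
    outputs: Set[str] = field(default_factory=set)

    def add_input(self, uri: str) -> None:
        self.inputs.add(uri)

    def add_output(self, uri: str) -> None:
        self.outputs.add(uri)


@dataclass
class ExperimentConfig:
    """Hypothetical: models only the root URI of the train command."""
    train_uri: str


@dataclass
class BackendConfig:
    """Hypothetical backend configuration with one optional field."""
    model_uri: Optional[str] = None

    def update_for_command(self, command_type: str,
                           experiment_config: ExperimentConfig,
                           context=None) -> None:
        # Implicit configuration (mutates in place): if the user did not
        # set model_uri, derive it from the experiment's train root URI.
        # ``context`` is unused here; a RasterSourceConfig would receive
        # the SceneConfig it is attached to.
        if self.model_uri is None:
            self.model_uri = experiment_config.train_uri + '/model.hdf'

    def report_io(self, command_type: str, io_def: IODefinition) -> None:
        # TRAIN writes the model and PREDICT reads it; reporting both lets
        # a runner order the two commands correctly in the DAG.
        if command_type == TRAIN:
            io_def.add_output(self.model_uri)
        elif command_type == PREDICT:
            io_def.add_input(self.model_uri)


experiment = ExperimentConfig(train_uri='/opt/data/rv_root/train/experiment-name')
backend = BackendConfig()
backend.update_for_command(TRAIN, experiment)
io_def = IODefinition()
backend.report_io(TRAIN, io_def)
print(io_def.outputs)  # {'/opt/data/rv_root/train/experiment-name/model.hdf'}
```

Keeping the two hooks separate mirrors the refactor noted in the changelog: deriving values and reporting IO become independent steps, so a runner can ask for IO without re-deriving configuration.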
diff --git a/docs/commands.rst b/docs/commands.rst
@@ -16,6 +16,8 @@ ANALYZE
 
 The ANALYZE command is used to analyze scenes that are part of an experiment and produce some output that can be consumed by later commands. Geospatial raster sources such as GeoTIFFs often contain 16- and 32-bit pixel color values, but many deep learning libraries expect 8-bit values. In order to perform this transformation, we need to know the distribution of pixel values. So one usage of the ANALYZE command is to compute statistics of the raster sources and save them to a JSON file which is later used by the StatsTransformer (one of the available :ref:`raster transformer`) to do the conversion.
 
+.. _chip command:
+
 CHIP
 ^^^^
 
@@ -26,6 +28,8 @@ TRAIN
 
 The TRAIN command is used to train a model using the dataset generated by the CHIP command. The command is a thin wrapper around the train method in the backend that synchronizes files with the cloud, configures and calls the training routine provided by the associated third-party machine learning library, and sets up a log visualization server in some cases (e.g. Tensorboard). The output is a trained model that can be used to make predictions and fine-tune on another dataset.
 
+.. _predict command:
+
 PREDICT
 ^^^^^^^
 

diff --git a/docs/index.rst b/docs/index.rst
@@ -182,7 +182,7 @@ usage patterns.
    codebase
    plugins
    qgis
-   contributing
+   CONTRIBUTING
    release
 
 API Reference

diff --git a/docs/plugins.rst b/docs/plugins.rst
@@ -26,6 +26,17 @@ you later refer to in your experiment configurations. For instance, if you devel
 
 You'll need to implement the ``to_proto`` method on the ``Config`` and the ``from_proto`` method on the ``ConfigBuilder``. In the ``.proto`` files for the entity you are creating a plugin for, you'll see a ``google.protobuf.Struct custom_config`` section. This is the field in the protobuf that can handle arbitrary JSON, and should be used in plugins for configuration.
 
+.. note::
+
+   Be sure to review the :ref:`configuration topics` and ensure you're implementing ``report_io`` and ``update_for_command`` properly in your configuration.
+
+.. note::
+
+   A common pitfall is implementing the ``ConfigBuilder.from_proto`` and ``Config.to_proto`` methods
+   incorrectly. Look to other ``Config`` and ``ConfigBuilder`` implementations in the Raster Vision
+   codebase for examples of how to do this correctly, and use the ``custom_config`` field in the protobufs
+   to set arbitrary configuration that is specific to your plugin implementation.
+
 Registering the Plugin
 ----------------------
 

diff --git a/docs/runners.rst b/docs/runners.rst
@@ -76,3 +76,27 @@ includes plugin files.
 
 .. note::
    To run on AWS Batch, you'll need the proper setup. See :ref:`aws batch setup` for instructions.
+
+.. _parallelizing commands:
+
+Running commands in Parallel
+----------------------------
+
+Raster Vision can run certain commands in parallel, such as the :ref:`chip command` and :ref:`predict command` commands. To do so, use the :ref:`run split option` option in the ``run`` command of the CLI.
+
+Commands implement a ``split`` method that returns either the original command, if it
+cannot be split (e.g. ``TRAIN``), or a sequence of commands split into the given number of groups.
+For instance, using ``--splits 5`` on a ``CHIP`` command over
+50 training scenes and 25 validation scenes will result in 5 ``CHIP`` commands that can be run
+in parallel, each creating chips for 15 scenes.
+
+The command DAG that is given to the experiment runner is constructed such that each split command
+can be run in parallel if the runner supports parallelization, and any command that depends on
+the output of the split command depends on every one of the splits. In the example above,
+a ``TRAIN`` command that depended on a single ``CHIP`` command before the split will depend on each of the
+5 individual ``CHIP`` commands after the split.
+
+Each runner handles parallelization differently. For instance, the local runner runs all
+of the splits simultaneously, so choose a number of splits appropriate to the number of CPUs available.
+The AWS Batch runner submits a job for each command split, and the Batch Compute Environment
+dictates how many resources are available to run Batch jobs simultaneously.
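
The splitting and DAG wiring described in the runners.rst changes above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Raster Vision's implementation: ``split_scenes``, ``build_dag`` and the ``CHIP-<i>`` command names are hypothetical, the round-robin grouping is our own choice (the real ``split`` method may group scenes differently, or split training and validation scenes separately), and upstream dependencies of ``CHIP`` such as ``ANALYZE`` are left out.

```python
from typing import Dict, List, Sequence


def split_scenes(scene_ids: Sequence[str], num_splits: int) -> List[List[str]]:
    """Hypothetical: divide scenes into at most num_splits near-equal groups."""
    # Never create more groups than there are scenes, and always at least one.
    num_groups = max(1, min(num_splits, len(scene_ids)))
    # Round-robin assignment: group sizes differ by at most one scene.
    return [list(scene_ids[i::num_groups]) for i in range(num_groups)]


def build_dag(num_chip_splits: int) -> Dict[str, List[str]]:
    """Hypothetical: map each command to the commands it depends on."""
    chip_commands = ['CHIP-{}'.format(i) for i in range(num_chip_splits)]
    # Upstream dependencies of the CHIP splits are omitted in this sketch.
    dag: Dict[str, List[str]] = {cmd: [] for cmd in chip_commands}
    # Fan-in: TRAIN needs the output of every CHIP split.
    dag['TRAIN'] = list(chip_commands)
    return dag


train_scenes = ['train-{}'.format(i) for i in range(50)]
val_scenes = ['val-{}'.format(i) for i in range(25)]
groups = split_scenes(train_scenes + val_scenes, 5)
print([len(g) for g in groups])  # [15, 15, 15, 15, 15]
print(build_dag(len(groups))['TRAIN'])  # ['CHIP-0', 'CHIP-1', 'CHIP-2', 'CHIP-3', 'CHIP-4']
```

Balanced groups matter because each split runs as its own local process or Batch job, so the slowest split determines when ``TRAIN`` can start.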
diff --git a/integration_tests/chip_classification_tests/experiment.py b/integration_tests/chip_classification_tests/experiment.py
@@ -10,12 +10,15 @@ def get_path(part):
 
         img_path = get_path('scene/image.tif')
         label_path = get_path('scene/labels.json')
+
+        img2_path = get_path('scene/image2.tif')
+        label2_path = get_path('scene/labels2.json')
+
         backend_conf_path = get_path('configs/backend.config')
 
         pretrained_model = (
-            'https://github.com/fchollet/'
-            'deep-learning-models/releases/download/v0.2/'
-            'resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5')
+            'https://github.com/azavea/raster-vision-data/'
+            'releases/download/v0.0.7/chip-classification-test-weights.hdf5')
 
         task = rv.TaskConfig.builder(rv.CHIP_CLASSIFICATION) \
                             .with_chip_size(200) \
@@ -38,31 +41,35 @@ def get_path(part):
                                                       replace_model=True) \
                                   .build()
 
-        label_source = rv.LabelSourceConfig.builder(rv.CHIP_CLASSIFICATION_GEOJSON) \
-                                           .with_uri(label_path) \
-                                           .with_ioa_thresh(0.5) \
-                                           .with_use_intersection_over_cell(False) \
-                                           .with_pick_min_class_id(True) \
-                                           .with_background_class_id(3) \
-                                           .with_infer_cells(True) \
-                                           .build()
-
-        raster_source = rv.RasterSourceConfig.builder(rv.GEOTIFF_SOURCE) \
-                                             .with_uri(img_path) \
-                                             .with_channel_order([0, 1, 2]) \
-                                             .with_stats_transformer() \
-                                             .build()
-
-        scene = rv.SceneConfig.builder() \
-                              .with_task(task) \
-                              .with_id('cc_test') \
-                              .with_raster_source(raster_source) \
-                              .with_label_source(label_source) \
-                              .build()
+        def make_scene(i_path, l_path):
+            label_source = rv.LabelSourceConfig.builder(rv.CHIP_CLASSIFICATION) \
+                                               .with_uri(l_path) \
+                                               .with_ioa_thresh(0.5) \
+                                               .with_use_intersection_over_cell(False) \
+                                               .with_pick_min_class_id(True) \
+                                               .with_background_class_id(3) \
+                                               .with_infer_cells(True) \
+                                               .build()
+
+            raster_source = rv.RasterSourceConfig.builder(rv.GEOTIFF_SOURCE) \
+                                                 .with_uri(i_path) \
+                                                 .with_channel_order([0, 1, 2]) \
+                                                 .with_stats_transformer() \
+                                                 .build()
+
+            return rv.SceneConfig.builder() \
+                                 .with_task(task) \
+                                 .with_id(os.path.basename(i_path)) \
+                                 .with_raster_source(raster_source) \
+                                 .with_label_source(label_source) \
+                                 .build()
+
+        scene_1 = make_scene(img_path, label_path)
+        scene_2 = make_scene(img2_path, label2_path)
 
         dataset = rv.DatasetConfig.builder() \
-                                  .with_train_scene(scene) \
-                                  .with_validation_scene(scene) \
+                                  .with_train_scenes([scene_1, scene_2]) \
+                                  .with_validation_scenes([scene_1, scene_2]) \
                                   .build()
 
         experiment = rv.ExperimentConfig.builder() \