[DET-2252, DET-2858, DET-2861] Support Data Layer #6
Conversation
- What if a user defined their own DataRef? We seem to be missing the abstractions we would need to support that case.
- We have local caches with GCS and S3 storages that seem super opaque to me, and there is no mechanism for garbage collecting them. I'm OK with not GC'ing the S3 bucket, or even the LFS directory, since those are very clearly passed in by the user, but I think this intermediate cache might be a real problem.
- This is a lot of new code, it looks pretty good, and it comes with tests and examples and everything. Good work.
harness/determined/_train_context.py
Outdated
@@ -121,6 +134,12 @@ def get_hparam(self, name: str) -> Any:
            )
        return self.env.hparams[name]

    def get_train_cacheable(self) -> data_layer.CacheableDecorator:
I think we should get a second opinion on these names. I know why they are named the way they are, but even so the name is not particularly clear to me.
But I'm not exactly overwhelmed by great ideas for alternatives...
How about `self.context.data_layer.train_dataset_decorator`?
Adding this to the agenda for Monday's meeting.
if isinstance(ds, tf.data.Dataset):
    ds = ds.repeat()
Does estimator support non-tf.data datasets?
It supports tuples. We do currently support it, and one of our unit tests actually used to do exactly this.
Note: in our current setup we only support creating an iterator from a `tf.data.Dataset` object, since we require `wrap_dataset()`.
I don't understand, it seems like these two statements are in conflict:

> We do currently support it

> Note: in our current setup we only support creating an iterator from a `tf.data.Dataset` object, since we require `wrap_dataset()`.
Our current `wrap_dataset` expects to receive a `tf.data.Dataset` and errors out otherwise; beyond that, we support any input. This means that users can pass a `tf.data.Dataset` into `wrap_dataset` and then pass a `tf.data.Dataset` or `tf.data.Iterator` as their output. Theoretically they could also pass a `tf.data.Dataset` into the wrapper and pass any estimator-supported input to us.
OK, I guess this is fine. It seems like we are setting ourselves up for a very confused user by silently allowing iterators to be passed here, since we can't `repeat()` an iterator.
Agreed but will punt on this issue for now.
harness/determined/data_layer.py
Outdated
if configured_storage_path:
    storage_path = pathlib.Path(cast(str, configured_storage_path))
else:
    storage_path = pathlib.Path("/data/determined/")
Better to use `tempfile` here, because sometimes we don't have access to `/data`.
Yep, switched over to using the home directory.
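For illustration, a minimal sketch of the fallback being discussed here, reusing the `configured_storage_path` name from the snippet above; the helper name and the cache directory name under the home directory are assumptions, not the actual change:

```python
import pathlib
from typing import Optional, cast


def resolve_storage_path(configured_storage_path: Optional[str]) -> pathlib.Path:
    # Prefer an explicitly configured path; otherwise fall back to a directory
    # under the user's home, which is writable even when /data is not available.
    if configured_storage_path:
        return pathlib.Path(cast(str, configured_storage_path))
    return pathlib.Path.home().joinpath("determined_data_layer_cache")  # assumed name
```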
def wrap(make_dataset_fn: Callable) -> Callable:
    def _decorated_fn(*args: Any, **kwargs: Any) -> Any:
        @self._storage.cacheable(  # type: ignore
It seems unnecessary to create a function `make_dataset` just to use this decorator. Can we add a function that accepts a dataset as an argument, rather than only supporting the decorator?
You mean for the user-facing call of `@cacheable`?
The problem with accepting a dataset as a parameter is that the dataset will already have been built, which means you are potentially throwing away a lot of time savings. If you accept a function that creates a dataset, then you don't have to do anything inside of the function when you know you have already cached the output once.
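To make that trade-off concrete, here is a toy sketch of the two API shapes; this is not the real storage/`cacheable` implementation, and the in-memory `_cache` and function names are stand-ins:

```python
from typing import Any, Callable, Dict

_cache: Dict[str, Any] = {}  # stand-in for the real on-disk/S3 cache


def cacheable(name: str) -> Callable[[Callable[[], Any]], Callable[[], Any]]:
    """Wrap a dataset-building function so the build is skipped on a cache hit."""

    def decorator(make_dataset: Callable[[], Any]) -> Callable[[], Any]:
        def wrapped() -> Any:
            if name in _cache:
                # Cache hit: make_dataset() is never called, so the expensive
                # dataset construction is skipped entirely.
                return _cache[name]
            _cache[name] = make_dataset()
            return _cache[name]

        return wrapped

    return decorator


# If the API instead accepted an already-built dataset, e.g. cache(name, dataset),
# the caller would have to construct the dataset up front even on a cache hit,
# losing most of the time savings described above.
@cacheable("mnist-train")
def make_dataset() -> list:
    return list(range(10))  # placeholder for expensive dataset construction
```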
@rb-determined-ai I see. We should probably document in `cacheable` why we only support decorators and what users should put in `make_dataset()` to make the most of the data layer.
Documentation is coming!
from determined_common import check


class _InputManager(metaclass=abc.ABCMeta):
Can we add more context to the docstring to explain when this class is used and why we need it?
Yep good call.
Probably the first thing we should do is reconsider our @Cacheable API based on the feedback we got in the sync yesterday.
)


@pytest.mark.integ3  # type: ignore
Should this be `integ4`? It seems like the rest of the `parameterize("tf2",...)` tests are `integ4`, so maybe there's some efficiency in putting them on the same machine?
Didn't want to overload a branch, but sure.
lol it was not a rhetorical question, I really don't know. So sure.
"github.com/determined-ai/determined/master/pkg/union" | ||
) | ||
|
||
// DataLayerConfig configures data layer storage. |
I don't remember if we have discussed this. Can we merge this configuration with the checkpoint storage configuration?
IMO it's better to leave them separate for now, since `data_layer` is still experimental and we are not sure the final version of this will be compatible with the `checkpoint_storage` config.
        return [TFKerasTensorBoard(update_freq="batch", profile_batch=0, histogram_freq=1)]

    def build_training_data_loader(self) -> tf.data.Dataset:
        @self.context.experimental.cache_train_dataset("mnist-tf-keras", "v1", shuffle=True)
Is it recommended that this decorated function be put specifically inside a trial class?
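For reference, roughly how the decorated builder from the diff above appears intended to be used inside a trial class. This is a sketch, not the actual example code: the inner function name, the placeholder data, and calling the wrapped function directly are my assumptions.

```python
import tensorflow as tf


class MNistTrial:
    # In the real example this class subclasses det.keras.TFKerasTrial, which is
    # what provides self.context; it is shown bare here only to keep the sketch short.
    def build_training_data_loader(self) -> tf.data.Dataset:
        # cache_train_dataset caches the output of the wrapped builder under the
        # ("mnist-tf-keras", "v1") name/version pair, so later runs can skip the build.
        @self.context.experimental.cache_train_dataset("mnist-tf-keras", "v1", shuffle=True)
        def make_train_dataset() -> tf.data.Dataset:
            # Placeholder data; the real example builds the full MNIST pipeline here.
            return tf.data.Dataset.from_tensor_slices(list(range(10)))

        return make_train_dataset()
```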
@@ -39,3 +39,6 @@ def slots_per_trial(self) -> int:

    def experiment_seed(self) -> int:
        return int(self.get("reproducibility", {}).get("experiment_seed", 0))

    def get_data_layer_type(self) -> str:
        return cast(str, self["data_layer"]["type"])
Would it be better to use `str(self.get("data_layer", {}).get("type", ""))`?
This will always be set.
harness/determined/data_layer.py
Outdated
from determined import horovod, workload
from determined_common import check

tensorflow_dataset_type = "tf.data.Dataset"
Are these strings of type names used somewhere?
Ah, good catch. They were being used, but not anymore.
harness/determined/data_layer.py
Outdated
            storage_config = storage.LFSConfigurations(storage_dir_path=str(local_cache_path))
            self._storage = storage.LFSStorage(storage_config, tensorflow_config=session_config)

        elif data_layer_type == StorageTypes.SHARED_FS.value:
This should be S3
YESSS!
    config = conf.set_max_steps(config, 2)
    config = conf.set_slots_per_trial(config, 8)
    config = conf.set_tf2_image(config) if tf2 else conf.set_tf1_image(config)
    if storage_type == "lfs":
Why does this test cover `lfs` and `s3` while the test below only covers `lfs`?
Just to avoid running too many CI tests.
) -> Callable[[], Tuple[tf.Tensor, tf.Tensor]]:
    def _input_fn() -> Tuple[tf.Tensor, tf.Tensor]:
        data, labels = xor_data()
        dataset = tf.data.Dataset.from_tensor_slices((data, labels))
        dataset = context.wrap_dataset(dataset)
        if shuffle:
            dataset = dataset.shuffle(1000)

        def map_dataset(x, y):
It seems this function could be replaced with a `lambda`.
I'm lazy :)
It makes our codebase 10 lines cleaner!
Ran some tests, found some bugs, they got fixed, and I think this is good to go! Nice work!
        self,
        env: det.EnvContext,
        hvd_config: horovod.HorovodContext,
        train_context: Union[NativeContext, TrialContext],
This makes `_DataLayerContext` depend on a `TrainContext`. If it is part of `TFKerasContext`, it shouldn't depend on a `TrainContext`. I think we should move the per_slot_batch_size calculation to `EnvContext` and access that directly. I made the above change in #167. @rb-determined-ai what do you think?
@shiyuann To get the fixes in this PR into this release, I want to land this as is. Happy to discuss this further outside the scope of this PR; I am not dead set on this being the right way to do it. I think once we have a base `Experimental` class, it could clear some of this up.
I don't think it's outside the scope of this PR. Introducing `_DataLayerContext` along with a weird workaround, rather than a simple clean refactor, seems wrong to me and confusing to other developers. And given this refactor is very simple (just a few lines of change), I don't see any reason it should postpone landing this PR. Basically, what you need to do is move `TrainContext._calculate_batch_sizes` to `EnvContext` and use `env.per_slot_batch_size` in `_DataLayerContext`. This won't even take more than 10 minutes to write.
I'm not saying it is outside the scope of this PR; I'm saying I want to postpone fixing this until after this PR, because this needs to land ASAP (without having to re-run CI).
I do disagree that it is a simple refactor, because I am not convinced that passing in the batch size is the best way forward. I need to think about the best way to do this and don't have time to do that today. I filed https://determinedai.atlassian.net/browse/DET-2884 to track this.
Sorry for the confusion. It's not passing in the batch size; it's making `per_slot_batch_size` part of `EnvContext` and passing in `EnvContext`. See code here: https://github.com/determined-ai/determined/blob/69c79eb8468b9ba2b3c3c279d98e4337f5308237/harness/determined/_env_context.py
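A minimal sketch of the refactor being proposed, as I read this thread; constructor signatures and field names other than `per_slot_batch_size` are assumptions, not the actual Determined code:

```python
class EnvContext:
    """Sketch: experiment-wide settings, now including per_slot_batch_size."""

    def __init__(self, global_batch_size: int, slots_per_trial: int) -> None:
        self.global_batch_size = global_batch_size
        # Computed once here, so downstream contexts no longer need a TrainContext.
        self.per_slot_batch_size = global_batch_size // max(slots_per_trial, 1)


class _DataLayerContext:
    """Sketch: depends only on EnvContext under the proposed refactor."""

    def __init__(self, env: EnvContext) -> None:
        self.per_slot_batch_size = env.per_slot_batch_size


env = EnvContext(global_batch_size=64, slots_per_trial=8)
print(_DataLayerContext(env).per_slot_batch_size)  # 8
```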
Yes, I think you are right... this is pretty messy.
Do you think that this is functionally broken? If it is only messy then we can fix it later.
Talked to @shiyuann offline; I will clean this up after his PR with the above changes lands.
        self,
        env: det.EnvContext,
        hvd_config: horovod.HorovodContext,
        train_context: Union[det.NativeContext, det.TrialContext],
ditto
class _TrainingDataLayerTFDatasetManager(_TrainingInputManager):
    def __init__(
        self,
        context: Union[keras.TFKerasTrialContext, keras.TFKerasNativeContext],
keras.TFKerasContext is fine.
Yes, good call
        return None


class _TrainingSequenceAdapterManager(_TrainingInputManager):
`SequenceAdapter` might be merged with this class. Nothing else uses `SequenceAdapter`.
I am hesitant to do this since they serve quite different purposes, and `SequenceAdapter` is a user-facing class, while this is not.
        tf.compat.v1.summary.tensor_summary("features", features)
        tf.compat.v1.summary.tensor_summary("labels", labels)

        def map_dataset(x, y):
ditto.
?
I mean replace it with `lambda x, y: ({"input": x}, y)`. This won't even take more than 1 minute.
This PR is blocked by: determined-ai/yogadl#14