
Use Azure ML pipeline data passing #43

Merged

@tomasvanpottelbergh commented Jan 20, 2023

Resolves #7.

This is a reworked version of my implementation of using Azure ML native data passing between nodes instead of temporary storage.

The idea is simply to save and load the data to and from pickle files located at the path passed to the node via the --az-input and --az-output arguments.
The AzureMLPipelineDataset adds the option to save intermediate datasets in the format specified in the catalog.
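
For illustration, a minimal sketch of the described mechanism, assuming only what is stated above (the helper names and the default file name are hypothetical; the actual implementation wires this through the Kedro catalog):

import pickle
from pathlib import Path

def save_output(data, az_output: str, file_name: str = "data.pickle"):
    # az_output is the folder path Azure ML passes to the node via --az-output
    with open(Path(az_output) / file_name, "wb") as f:
        pickle.dump(data, f)

def load_input(az_input: str, file_name: str = "data.pickle"):
    # az_input is the folder where the upstream node wrote its output
    with open(Path(az_input) / file_name, "rb") as f:
        return pickle.load(f)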

@marrrcin left a comment

Thanks for the updated contribution @tomasvanpottelbergh, it looks much better now. I really like the implementation and I'm eager to test it!

Could you confirm that my understanding of this PR is correct?

  1. If the flag native_data_passing is set to false (default), everything works as before, fully backward compatible with previous plugin versions.
  2. If the flag native_data_passing is set to true, the runner overwrites the paths of the entries defined explicitly in the catalog, swapping their output paths for Azure-native ones. That means that if I, for example, use a pandas.CSVDataSet wrapped by your dataset, like this:
my_data_set:
  type: AzureMLFolderDataset
  dataset:
    type: pandas.CSVDataSet
    filepath: my/local/path/to/file.csv

at runtime, in Azure ML, the filepath will be replaced with the path provided by Azure's runtime.

  • question:
    What happens when I have native_data_passing set to true, but the catalog does not have any AzureMLFolderDataset entries?

⚠️ We will need some docs for this too that explain this feature to users.

@tomasvanpottelbergh commented Mar 23, 2023

> Thanks for the updated contribution @tomasvanpottelbergh, it looks much better now. I really like the implementation and I'm eager to test it!

Thanks for taking the time to review it!

> Could you confirm that my understanding of this PR is correct?
>
> 1. If the flag native_data_passing is set to false (default), everything works as before, fully backward compatible with previous plugin versions.
>
> 2. If the flag native_data_passing is set to true, the runner overwrites the paths of the entries defined explicitly in the catalog, swapping their output paths for Azure-native ones. That means that if I, for example, use a pandas.CSVDataSet wrapped by your dataset, like this:
>
> my_data_set:
>   type: AzureMLFolderDataset
>   dataset:
>     type: pandas.CSVDataSet
>     filepath: my/local/path/to/file.csv
>
> at runtime, in Azure ML, the filepath will be replaced with the path provided by Azure's runtime.

Correct!

> question:
> What happens when I have native_data_passing set to true, but the catalog does not have any AzureMLFolderDataset entries?

In that case the native data passing will use pickle files for the datasets not specified in the catalog. It will not affect any other dataset in the catalog that is not an AzureMLFolderDataset, so you need to make sure those still work on Azure ML.
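
For illustration, a hedged sketch of a catalog covering these cases (dataset names and paths are hypothetical):

wrapped_data:    # AzureMLFolderDataset: filepath swapped at runtime
  type: AzureMLFolderDataset
  dataset:
    type: pandas.CSVDataSet
    filepath: data/wrapped.csv
external_data:   # plain entry: left untouched, must itself work on Azure ML
  type: pandas.CSVDataSet
  filepath: abfs://my-container/external.csv
# any dataset not listed in the catalog is passed between nodes as a pickle file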

> ⚠️ We will need some docs for this too that explain this feature to users.

Definitely, I will add tests and docs once this approach is approved.

@marrrcin left a comment

OK, I've added my final thoughts. I was able to test it out and it works well, but it needs slight polishing.

I was expecting it to work with a catalog entry like this:

my_data_set:
  type: AzureMLFolderDataset
  dataset:
    type: pandas.CSVDataSet
    filepath: my/local/path/to/file.csv

But it doesn't, as it expects the path attribute in the AzureMLFolderDataset, which is needless in my opinion, since you've already done a good job of allowing filepath_arg to be set there - you can easily extract the file name from the internal dataset instead of having to specify it twice. This makes usage much more pleasant. I did some hacking on your codebase to make this possible (by default there will be no need for the path argument, but it can still be set if someone wants full control):

  1. Change the parameter ordering and make path empty by default:
from typing import Any, Dict, Type, Union

from kedro.io import AbstractDataSet

class AzureMLFolderDataset(AbstractDataSet):
    def __init__(
        self,
        dataset: Union[str, Type[AbstractDataSet], Dict[str, Any]],
        path: str = "",
        filepath_arg: str = "filepath",
    ):
        # ...
  2. Add a property to AzureMLFolderDataset:
    @property
    def original_dataset_path(self) -> Path:
        return Path(self._dataset_config[self._filepath_arg])
  3. In the run method of AzurePipelinesRunner:
    def run(
        self,
        pipeline: Pipeline,
        catalog: DataCatalog,
        hook_manager: PluginManager = None,
        session_id: str = None,
    ) -> Dict[str, Any]:
        catalog = catalog.shallow_copy()
        catalog_set = set(catalog.list())

        # Loop over datasets in arguments to set their paths
        for ds_name, azure_dataset_folder in self.data_paths.items():
            if ds_name in catalog_set:
                ds = catalog._get_dataset(ds_name)
                if isinstance(ds, AzureMLFolderDataset):
                    file_name = (
                        ds.original_dataset_path.name
                        if not ds.path
                        else Path(ds.path).name
                    ) # <--- this
                    ds.path = str(Path(azure_dataset_folder) / file_name) # <--- and this
                    catalog.add(ds_name, ds, replace=True)
            else:
                catalog.add(ds_name, self.create_default_data_set(ds_name))

        return super().run(pipeline, catalog, hook_manager, session_id)
  4. In create_default_data_set of AzurePipelinesRunner, the parameter order needs to change (path / dataset), so the return becomes dataset_cls(PickleDataSet, path).
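
A hedged sketch of create_default_data_set after that change (self.data_paths comes from the run snippet above; this is an illustration, not the PR's exact code):

# method of AzurePipelinesRunner (sketch)
from kedro.extras.datasets.pickle import PickleDataSet

def create_default_data_set(self, ds_name: str) -> AzureMLFolderDataset:
    # datasets not declared in the catalog fall back to a PickleDataSet
    # rooted at the folder Azure ML provides for this dataset
    return AzureMLFolderDataset(PickleDataSet, self.data_paths[ds_name])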

Once you apply those changes and add docs and unit tests, we're fine to merge. It will be released as 0.4.0, as this is a big feature 🎉

@AOMoovAI

If I understand correctly, this essentially mounts/attaches an Azure ML Data Asset to a compute/cluster in Azure ML. Is that correct?

@tomasvanpottelbergh

> If I understand correctly, this essentially mounts/attaches an Azure ML Data Asset to a compute/cluster in Azure ML. Is that correct?

This PR only handles mounting (unregistered) Data Assets in order to pass data between pipeline nodes. When this PR is merged, I will extend this to registered Data Assets.

@tomasvanpottelbergh

> I was expecting it to work with a catalog entry like this:
>
> my_data_set:
>   type: AzureMLFolderDataset
>   dataset:
>     type: pandas.CSVDataSet
>     filepath: my/local/path/to/file.csv
>
> But it doesn't, as it expects the path attribute in the AzureMLFolderDataset, which is needless in my opinion, since you've already done a good job of allowing filepath_arg to be set there - you can easily extract the file name from the internal dataset instead of having to specify it twice.

Sorry, you're right - I didn't spot the difference in your example that made it fail. I based the implementation on that of the PartitionedDataSet, but this might indeed not be the best fit for this case.

I have changed the implementation in a slightly different way from what you proposed, but following the same idea. I would actually propose removing the path argument from the dataset constructor, as it is only used by the runner and can also be specified directly via the dataset config. Do you see a good reason to keep it?
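
A hedged sketch of that direction (not the PR's actual code; the class name follows the renamed pipeline_dataset.py file, and _construct_dataset is illustrative): the wrapper derives the path from the inner dataset's own config, so no separate constructor argument is needed:

from typing import Any, Dict

from kedro.io import AbstractDataSet

class AzureMLPipelineDataSet(AbstractDataSet):
    def __init__(self, dataset: Dict[str, Any], filepath_arg: str = "filepath"):
        self._dataset_config = dataset  # the inner dataset config from the catalog
        self._filepath_arg = filepath_arg

    @property
    def path(self) -> str:
        # the path lives only in the inner dataset's config
        return self._dataset_config[self._filepath_arg]

    @path.setter
    def path(self, value: str) -> None:
        # the runner overwrites this with the Azure-provided path at runtime
        self._dataset_config[self._filepath_arg] = value

    def _construct_dataset(self) -> AbstractDataSet:
        return AbstractDataSet.from_config("inner", self._dataset_config)

    def _load(self) -> Any:
        return self._construct_dataset().load()

    def _save(self, data: Any) -> None:
        self._construct_dataset().save(data)

    def _describe(self) -> Dict[str, Any]:
        return {"dataset": self._dataset_config}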

@marrrcin

> I have changed the implementation in a slightly different way from what you proposed, but following the same idea. I would actually propose removing the path argument from the dataset constructor, as it is only used by the runner and can also be specified directly via the dataset config. Do you see a good reason to keep it?

Your implementation looks fine and I see no reason to keep path right now - you can remove it. It works on Azure with both settings (enabled / disabled).

Please fix the 3 comments I've added above. Once unit tests pass, I will merge it 👍🏻

@tomasvanpottelbergh

Thanks again for the review @marrrcin! I think I have addressed all the issues and made an attempt at writing some docs, although I leave it to you to adapt them as you prefer.

A last few things that came up:

  • Using local versioning will currently not work when specifying the new datasets in the catalog, but I think we can address this in a later PR.
  • The kedro azureml init command requires the storage account and container names, which are not needed when using the new feature. Would it make sense to remove those arguments from that command and let it write a dummy config file that the user can adapt?
  • I think it would also be good to get rid of AzureMLPipelineDistributedDataSet, since it makes the catalog specification depend on whether distributed training is used. Would it be OK to handle the is_distributed_environment() logic inside the dataset itself instead of in the runner?

@marrrcin

> Thanks again for the review @marrrcin! I think I have addressed all the issues and made an attempt at writing some docs, although I leave it to you to adapt them as you prefer.
>
> A last few things that came up:
>
> • Using local versioning will currently not work when specifying the new datasets in the catalog, but I think we can address this in a later PR.

Agree, let's address this later.

> • The kedro azureml init command requires the storage account and container names, which are not needed when using the new feature. Would it make sense to remove those arguments from that command and let it write a dummy config file that the user can adapt?

I will take over this next week.

> • I think it would also be good to get rid of AzureMLPipelineDistributedDataSet, since it makes the catalog specification depend on whether distributed training is used. Would it be OK to handle the is_distributed_environment() logic inside the dataset itself instead of in the runner?

Makes sense.

@tomasvanpottelbergh

> Agree, let's address this later.

Created #52 to track this.

> I will take over this next week.

Great, thanks!

> Makes sense.

Done in latest commit.
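
For illustration, a hedged sketch of that change (is_distributed_environment() is the plugin's existing helper; its import path, the method names, and the save policy below are assumptions, not necessarily what the commit implements):

from kedro_azureml.distributed.utils import is_distributed_environment  # assumed import path

class AzureMLPipelineDataSet(AbstractDataSet):
    # ... (constructor and properties as sketched earlier in the thread)
    def _save(self, data) -> None:
        # the dataset, not the runner, now branches on the environment, so the
        # catalog entry is identical with or without distributed training
        if is_distributed_environment():
            self._save_distributed(data)  # hypothetical distributed-aware save
        else:
            self._construct_dataset().save(data)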

@tomasvanpottelbergh tomasvanpottelbergh changed the title [WIP] Use Azure ML native data passing Use Azure ML pipeline data passing Mar 31, 2023
@marrrcin marrrcin merged commit 482c943 into getindata:develop Apr 3, 2023
@marrrcin commented Apr 3, 2023

Merged - awesome work! We will release it later this week. Thanks for the contribution 🎉

@jpoullet2000

When do you think you'll release v0.4.0 with this new feature?

@marrrcin

@jpoullet2000 it will be released by 2023-04-28.
