
Use Azure ML pipeline data passing #43

Merged

@tomasvanpottelbergh commented Jan 20, 2023

Resolves #7.

This is a reworked version of my implementation of using Azure ML native data passing between nodes instead of temporary storage.

The idea is simply to save and load the data to and from pickle files located at the path passed to the node via the --az-input and --az-output arguments.
The AzureMLPipelineDataset adds the option to save intermediate datasets in the format specified in the catalog.
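
For illustration, a minimal sketch of the described mechanism, assuming only what is stated above (the helper names and the default file name are hypothetical; the actual implementation wires this through the Kedro catalog):

import pickle
from pathlib import Path

def save_output(data, az_output: str, file_name: str = "data.pickle"):
    # az_output is the folder path Azure ML passes to the node via --az-output
    with open(Path(az_output) / file_name, "wb") as f:
        pickle.dump(data, f)

def load_input(az_input: str, file_name: str = "data.pickle"):
    # az_input is the folder where the upstream node wrote its output
    with open(Path(az_input) / file_name, "rb") as f:
        return pickle.load(f)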

@marrrcin left a comment

Thanks for the updated contribution @tomasvanpottelbergh, it looks much better now. I really like the implementation and I'm eager to test it!

Could you confirm that my understanding of this PR is correct?

  1. If the flag native_data_passing is set to false (default), everything works as before, fully backward compatible with previous plugin versions.
  2. If the flag native_data_passing is set to true, the runner overwrites the paths of the entries defined explicitly in the catalog, swapping their output paths for Azure-native ones. That means that if I, for example, use a pandas.CSVDataSet wrapped by your dataset, like this:
my_data_set:
  type: AzureMLFolderDataset
  dataset:
    type: pandas.CSVDataSet
    filepath: my/local/path/to/file.csv

at runtime, in Azure ML, the filepath will be replaced with the path provided by Azure's runtime.

  • question:
    What happens when I have native_data_passing set to true, but the catalog does not have any AzureMLFolderDataset entries?

⚠️ We will need some docs for this too that explain this feature to users.

@tomasvanpottelbergh commented Mar 23, 2023

> Thanks for the updated contribution @tomasvanpottelbergh, it looks much better now. I really like the implementation and I'm eager to test it!

Thanks for taking the time to review it!

> Could you confirm that my understanding of this PR is correct?
>
> 1. If the flag native_data_passing is set to false (default), everything works as before, fully backward compatible with previous plugin versions.
>
> 2. If the flag native_data_passing is set to true, the runner overwrites the paths of the entries defined explicitly in the catalog, swapping their output paths for Azure-native ones. That means that if I, for example, use a pandas.CSVDataSet wrapped by your dataset, like this:
>
> my_data_set:
>   type: AzureMLFolderDataset
>   dataset:
>     type: pandas.CSVDataSet
>     filepath: my/local/path/to/file.csv
>
> at runtime, in Azure ML, the filepath will be replaced with the path provided by Azure's runtime.

Correct!

> question:
> What happens when I have native_data_passing set to true, but the catalog does not have any AzureMLFolderDataset entries?

In that case the native data passing will use pickle files for the datasets not specified in the catalog. It will not affect any other dataset in the catalog that is not an AzureMLFolderDataset, so you need to make sure those still work on Azure ML.
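
For illustration, a hedged sketch of a catalog covering these cases (dataset names and paths are hypothetical):

wrapped_data:    # AzureMLFolderDataset: filepath swapped at runtime
  type: AzureMLFolderDataset
  dataset:
    type: pandas.CSVDataSet
    filepath: data/wrapped.csv
external_data:   # plain entry: left untouched, must itself work on Azure ML
  type: pandas.CSVDataSet
  filepath: abfs://my-container/external.csv
# any dataset not listed in the catalog is passed between nodes as a pickle file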

> ⚠️ We will need some docs for this too that explain this feature to users.

Definitely, I will add tests and docs once this approach is approved.

@marrrcin left a comment

OK, I've added my final thoughts. I was able to test it out and it works well, but it needs slight polishing.

I was expecting it to work with a catalog entry like this:

my_data_set:
  type: AzureMLFolderDataset
  dataset:
    type: pandas.CSVDataSet
    filepath: my/local/path/to/file.csv

But it doesn't, as it expects the path attribute in the AzureMLFolderDataset, which is needless in my opinion, since you've already done a good job of allowing filepath_arg to be set there - you can easily extract the file name from the internal dataset instead of having to specify it twice. This makes usage much more pleasant. I did some hacking on your codebase to make this possible (by default there will be no need for the path argument, but it can still be set if someone wants full control):

  1. Change the parameter ordering and make path empty by default:
from typing import Any, Dict, Type, Union

from kedro.io import AbstractDataSet

class AzureMLFolderDataset(AbstractDataSet):
    def __init__(
        self,
        dataset: Union[str, Type[AbstractDataSet], Dict[str, Any]],
        path: str = "",
        filepath_arg: str = "filepath",
    ):
        # ...
  2. Add a property to AzureMLFolderDataset:
    @property
    def original_dataset_path(self) -> Path:
        return Path(self._dataset_config[self._filepath_arg])
  3. In the run method of AzurePipelinesRunner:
    def run(
        self,
        pipeline: Pipeline,
        catalog: DataCatalog,
        hook_manager: PluginManager = None,
        session_id: str = None,
    ) -> Dict[str, Any]:
        catalog = catalog.shallow_copy()
        catalog_set = set(catalog.list())

        # Loop over datasets in arguments to set their paths
        for ds_name, azure_dataset_folder in self.data_paths.items():
            if ds_name in catalog_set:
                ds = catalog._get_dataset(ds_name)
                if isinstance(ds, AzureMLFolderDataset):
                    file_name = (
                        ds.original_dataset_path.name
                        if not ds.path
                        else Path(ds.path).name
                    ) # <--- this
                    ds.path = str(Path(azure_dataset_folder) / file_name) # <--- and this
                    catalog.add(ds_name, ds, replace=True)
            else:
                catalog.add(ds_name, self.create_default_data_set(ds_name))

        return super().run(pipeline, catalog, hook_manager, session_id)
  4. In create_default_data_set of AzurePipelinesRunner, the parameter order needs to change (path / dataset), so the return becomes dataset_cls(PickleDataSet, path).
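
A hedged sketch of create_default_data_set after that change (self.data_paths comes from the run snippet above; this is an illustration, not the PR's exact code):

# method of AzurePipelinesRunner (sketch)
from kedro.extras.datasets.pickle import PickleDataSet

def create_default_data_set(self, ds_name: str) -> AzureMLFolderDataset:
    # datasets not declared in the catalog fall back to a PickleDataSet
    # rooted at the folder Azure ML provides for this dataset
    return AzureMLFolderDataset(PickleDataSet, self.data_paths[ds_name])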

Once you apply those changes and add docs and unit tests, we're fine to merge. It will be released as 0.4.0, as this is a big feature 🎉

@AOMoovAI

If I understand correctly, this essentially mounts/attaches an Azure ML Data Asset to a compute/cluster in Azure ML. Is that correct?

@tomasvanpottelbergh

> If I understand correctly, this essentially mounts/attaches an Azure ML Data Asset to a compute/cluster in Azure ML. Is that correct?

This PR only handles mounting (unregistered) Data Assets in order to pass data between pipeline nodes. When this PR is merged, I will extend this to registered Data Assets.

@tomasvanpottelbergh

> I was expecting it to work with a catalog entry like this:
>
> my_data_set:
>   type: AzureMLFolderDataset
>   dataset:
>     type: pandas.CSVDataSet
>     filepath: my/local/path/to/file.csv
>
> But it doesn't, as it expects the path attribute in the AzureMLFolderDataset, which is needless in my opinion, since you've already done a good job of allowing filepath_arg to be set there - you can easily extract the file name from the internal dataset instead of having to specify it twice.

Sorry, you're right - I didn't spot the difference in your example that made it fail. I based the implementation on that of the PartitionedDataSet, but this might indeed not be the best fit for this case.

I have changed the implementation in a slightly different way from what you proposed, but following the same idea. I would actually propose removing the path argument from the dataset constructor, as it is only used by the runner and can also be specified directly via the dataset config. Do you see a good reason to keep it?
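
A hedged sketch of that direction (not the PR's actual code; the class name follows the renamed pipeline_dataset.py file, and _construct_dataset is illustrative): the wrapper derives the path from the inner dataset's own config, so no separate constructor argument is needed:

from typing import Any, Dict

from kedro.io import AbstractDataSet

class AzureMLPipelineDataSet(AbstractDataSet):
    def __init__(self, dataset: Dict[str, Any], filepath_arg: str = "filepath"):
        self._dataset_config = dataset  # the inner dataset config from the catalog
        self._filepath_arg = filepath_arg

    @property
    def path(self) -> str:
        # the path lives only in the inner dataset's config
        return self._dataset_config[self._filepath_arg]

    @path.setter
    def path(self, value: str) -> None:
        # the runner overwrites this with the Azure-provided path at runtime
        self._dataset_config[self._filepath_arg] = value

    def _construct_dataset(self) -> AbstractDataSet:
        return AbstractDataSet.from_config("inner", self._dataset_config)

    def _load(self) -> Any:
        return self._construct_dataset().load()

    def _save(self, data: Any) -> None:
        self._construct_dataset().save(data)

    def _describe(self) -> Dict[str, Any]:
        return {"dataset": self._dataset_config}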

@marrrcin

> I have changed the implementation in a slightly different way from what you proposed, but following the same idea. I would actually propose removing the path argument from the dataset constructor, as it is only used by the runner and can also be specified directly via the dataset config. Do you see a good reason to keep it?

Your implementation looks fine and I see no reason to keep path right now - you can remove it. It works on Azure with both settings (enabled / disabled).

Please fix the 3 comments I've added above. Once unit tests pass, I will merge it 👍🏻

@tomasvanpottelbergh

Thanks again for the review @marrrcin! I think I have addressed all the issues and made an attempt at writing some docs, although I leave it to you to adapt them as you prefer.

A last few things that came up:

  • Using local versioning will currently not work when specifying the new datasets in the catalog, but I think we can address this in a later PR.
  • The kedro azureml init command requires the storage account and container names, which are not needed when using the new feature. Would it make sense to remove those arguments from that command and let it write a dummy config file that the user can adapt?
  • I think it would also be good to get rid of AzureMLPipelineDistributedDataSet, since it makes the catalog specification depend on whether distributed training is used. Would it be OK to handle the is_distributed_environment() logic inside the dataset itself instead of in the runner?

@marrrcin

> Thanks again for the review @marrrcin! I think I have addressed all the issues and made an attempt at writing some docs, although I leave it to you to adapt them as you prefer.
>
> A last few things that came up:
>
> • Using local versioning will currently not work when specifying the new datasets in the catalog, but I think we can address this in a later PR.

Agree, let's address this later.

> • The kedro azureml init command requires the storage account and container names, which are not needed when using the new feature. Would it make sense to remove those arguments from that command and let it write a dummy config file that the user can adapt?

I will take over this next week.

> • I think it would also be good to get rid of AzureMLPipelineDistributedDataSet, since it makes the catalog specification depend on whether distributed training is used. Would it be OK to handle the is_distributed_environment() logic inside the dataset itself instead of in the runner?

Makes sense.

@tomasvanpottelbergh

> Agree, let's address this later.

Created #52 to track this.

> I will take over this next week.

Great, thanks!

> Makes sense.

Done in latest commit.
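
For illustration, a hedged sketch of that change (is_distributed_environment() is the plugin's existing helper; its import path, the method names, and the save policy below are assumptions, not necessarily what the commit implements):

from kedro_azureml.distributed.utils import is_distributed_environment  # assumed import path

class AzureMLPipelineDataSet(AbstractDataSet):
    # ... (constructor and properties as sketched earlier in the thread)
    def _save(self, data) -> None:
        # the dataset, not the runner, now branches on the environment, so the
        # catalog entry is identical with or without distributed training
        if is_distributed_environment():
            self._save_distributed(data)  # hypothetical distributed-aware save
        else:
            self._construct_dataset().save(data)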

@tomasvanpottelbergh tomasvanpottelbergh changed the title [WIP] Use Azure ML native data passing Use Azure ML pipeline data passing Mar 31, 2023
@marrrcin marrrcin merged commit 482c943 into getindata:develop Apr 3, 2023
@marrrcin commented Apr 3, 2023

Merged - awesome work! We will release it later this week. Thanks for the contribution 🎉

@jpoullet2000

When do you think you'll release v0.4.0 with this new feature?

@marrrcin

@jpoullet2000 it will be released by 2023-04-28.
