Merge pull request #82 from getindata/dataset-and-mlflow-fixes
DataSet to Dataset global renaming for Kedro 0.19 / MLflow min req. update
marrrcin committed Nov 15, 2023
2 parents 8e5979f + d040abd commit 51cf99c
Showing 22 changed files with 4,497 additions and 148 deletions.
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -1,6 +1,8 @@
# Changelog

## [Unreleased]
- [💔 Breaking change] Renamed all `*DataSet` classes to `*Dataset` to follow Kedro's naming convention, which will be introduced in 0.19.
- Upgraded the minimal MLflow requirement to `>=2.0.0,<3.0.0` to stay compatible with `azureml-mlflow`

- Added `--on-job-scheduled` argument to `kedro azureml run` to plug-in custom behaviour after Azure ML job is scheduled [@Gabriel2409](https://github.com/Gabriel2409)

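For downstream projects, the rename is purely a class-name change; the import paths stay the same. A minimal migration sketch, assuming the renamed classes from this release are installed:

    # Before this release:
    # from kedro_azureml.datasets import AzureMLAssetDataSet, AzureMLPipelineDataSet
    # After the rename, the same classes are exposed under Kedro 0.19-style names:
    from kedro_azureml.datasets import AzureMLAssetDataset, AzureMLPipelineDataset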
6 changes: 3 additions & 3 deletions docs/source/03_quickstart.rst
@@ -124,17 +124,17 @@ Adjusting the Data Catalog
.. code:: yaml
companies:
type: pandas.CSVDataSet
type: pandas.CSVDataset
filepath: data/01_raw/companies.csv
layer: raw
reviews:
type: pandas.CSVDataSet
type: pandas.CSVDataset
filepath: data/01_raw/reviews.csv
layer: raw
shuttles:
type: pandas.ExcelDataSet
type: pandas.ExcelDataset
filepath: data/01_raw/shuttles.xlsx
layer: raw
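The same catalog entries can also be constructed programmatically; a minimal sketch covering two of the YAML entries above, assuming a ``kedro-datasets`` version that already ships the renamed ``CSVDataset``:

    from kedro.io import DataCatalog
    from kedro_datasets.pandas import CSVDataset  # renamed from CSVDataSet

    # Build an in-memory catalog equivalent to the "companies" and "reviews" entries.
    catalog = DataCatalog(
        {
            "companies": CSVDataset(filepath="data/01_raw/companies.csv"),
            "reviews": CSVDataset(filepath="data/01_raw/reviews.csv"),
        }
    )
    companies_df = catalog.load("companies")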
18 changes: 9 additions & 9 deletions docs/source/05_data_assets.rst
@@ -3,19 +3,19 @@ Azure Data Assets

``kedro-azureml`` adds support for two new datasets that can be used in the Kedro catalog. Right now we support both Azure ML v1 SDK (direct Python) and Azure ML v2 SDK (fsspec-based) APIs.

**For v2 API (fsspec-based)** - use ``AzureMLAssetDataSet``, which lets you use Azure ML v2 SDK Folder/File datasets for remote and local runs.
**For v2 API (fsspec-based)** - use ``AzureMLAssetDataset``, which lets you use Azure ML v2 SDK Folder/File datasets for remote and local runs.
Currently only the `uri_file` and `uri_folder` types are supported. Because of limitations of the Azure ML SDK, the `uri_file` type can only be used for pipeline inputs,
not for outputs. The `uri_folder` type can be used for both inputs and outputs.

**For v1 API** (deprecated ⚠️) use the ``AzureMLFileDataSet`` and the ``AzureMLPandasDataSet`` which translate to `File/Folder dataset`_ and `Tabular dataset`_ respectively in
**For v1 API** (deprecated ⚠️) use the ``AzureMLFileDataset`` and the ``AzureMLPandasDataset`` which translate to `File/Folder dataset`_ and `Tabular dataset`_ respectively in
Azure Machine Learning. Both fully support the Azure versioning mechanism and can be used in the same way as any
other dataset in Kedro.


Apart from these, ``kedro-azureml`` also adds the ``AzureMLPipelineDataSet``, which is used to pass data between
Apart from these, ``kedro-azureml`` also adds the ``AzureMLPipelineDataset``, which is used to pass data between
pipeline nodes when the pipeline is run on Azure ML and the *pipeline data passing* feature is enabled.
By default, data is then saved and loaded using the ``PickleDataSet`` as the underlying dataset.
Any other underlying dataset can be used instead by adding an ``AzureMLPipelineDataSet`` to the catalog.
By default, data is then saved and loaded using the ``PickleDataset`` as the underlying dataset.
Any other underlying dataset can be used instead by adding an ``AzureMLPipelineDataset`` to the catalog.

All of these can be found under the `kedro_azureml.datasets`_ module.

@@ -35,7 +35,7 @@ Pipeline data passing

⚠️ Cannot be used when run locally.

.. autoclass:: kedro_azureml.datasets.AzureMLPipelineDataSet
.. autoclass:: kedro_azureml.datasets.AzureMLPipelineDataset
:members:

-----------------
@@ -47,7 +47,7 @@ Use the dataset below when you're using Azure ML SDK v2 (fsspec-based).

✅ Can be used for both remote and local runs.

.. autoclass:: kedro_azureml.datasets.asset_dataset.AzureMLAssetDataSet
.. autoclass:: kedro_azureml.datasets.asset_dataset.AzureMLAssetDataset
:members:

V1 SDK
@@ -56,11 +56,11 @@ Use the datasets below when you're using Azure ML SDK v1 (direct Python).

⚠️ Deprecated - will be removed in a future version of `kedro-azureml`.

.. autoclass:: kedro_azureml.datasets.AzureMLPandasDataSet
.. autoclass:: kedro_azureml.datasets.AzureMLPandasDataset
:members:

-----------------

.. autoclass:: kedro_azureml.datasets.AzureMLFileDataSet
.. autoclass:: kedro_azureml.datasets.AzureMLFileDataset
:members:
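As an illustration of the renamed v2 asset dataset, a hedged sketch of declaring ``AzureMLAssetDataset`` directly in Python; the Azure ML data asset name and paths are placeholders, and loading it still requires a configured Azure ML workspace:

    from kedro_azureml.datasets import AzureMLAssetDataset

    shuttles = AzureMLAssetDataset(
        azureml_dataset="my_azureml_folder_dataset",  # placeholder Azure ML data asset
        dataset={"type": "pandas.ParquetDataset", "filepath": "shuttles.pq"},
        root_dir="data/01_raw/some_folder",
        azureml_type="uri_folder",  # "uri_file" is accepted for pipeline inputs only
    )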

16 changes: 8 additions & 8 deletions kedro_azureml/datasets/__init__.py
@@ -1,17 +1,17 @@
from kedro_azureml.datasets.asset_dataset import AzureMLAssetDataSet
from kedro_azureml.datasets.file_dataset import AzureMLFileDataSet
from kedro_azureml.datasets.pandas_dataset import AzureMLPandasDataSet
from kedro_azureml.datasets.pipeline_dataset import AzureMLPipelineDataSet
from kedro_azureml.datasets.asset_dataset import AzureMLAssetDataset
from kedro_azureml.datasets.file_dataset import AzureMLFileDataset
from kedro_azureml.datasets.pandas_dataset import AzureMLPandasDataset
from kedro_azureml.datasets.pipeline_dataset import AzureMLPipelineDataset
from kedro_azureml.datasets.runner_dataset import (
KedroAzureRunnerDataset,
KedroAzureRunnerDistributedDataset,
)

__all__ = [
"AzureMLFileDataSet",
"AzureMLAssetDataSet",
"AzureMLPipelineDataSet",
"AzureMLPandasDataSet",
"AzureMLFileDataset",
"AzureMLAssetDataset",
"AzureMLPipelineDataset",
"AzureMLPandasDataset",
"KedroAzureRunnerDataset",
"KedroAzureRunnerDistributedDataset",
]
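A quick way to confirm which names the package now exports after this change:

    import kedro_azureml.datasets as azureml_datasets

    # Only the Kedro 0.19-style names remain in the public API.
    print(sorted(azureml_datasets.__all__))
    assert "AzureMLAssetDataset" in azureml_datasets.__all__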
32 changes: 16 additions & 16 deletions kedro_azureml/datasets/asset_dataset.py
@@ -11,25 +11,25 @@
from kedro.io.core import (
VERSION_KEY,
VERSIONED_FLAG_KEY,
AbstractDataSet,
AbstractVersionedDataSet,
DataSetError,
DataSetNotFoundError,
AbstractDataset,
AbstractVersionedDataset,
DatasetError,
DatasetNotFoundError,
Version,
VersionNotFoundError,
)

from kedro_azureml.client import _get_azureml_client
from kedro_azureml.config import AzureMLConfig
from kedro_azureml.datasets.pipeline_dataset import AzureMLPipelineDataSet
from kedro_azureml.datasets.pipeline_dataset import AzureMLPipelineDataset

AzureMLDataAssetType = Literal["uri_file", "uri_folder"]
logger = logging.getLogger(__name__)


class AzureMLAssetDataSet(AzureMLPipelineDataSet, AbstractVersionedDataSet):
class AzureMLAssetDataset(AzureMLPipelineDataset, AbstractVersionedDataset):
"""
AzureMLAssetDataSet enables kedro-azureml to use azureml
AzureMLAssetDataset enables kedro-azureml to use azureml
v2-sdk Folder/File datasets for remote and local runs.
Args
@@ -52,21 +52,21 @@ class AzureMLAssetDataSet(AzureMLPipelineDataSet, AbstractVersionedDataSet):
.. code-block:: yaml
my_folder_dataset:
type: kedro_azureml.datasets.AzureMLAssetDataSet
type: kedro_azureml.datasets.AzureMLAssetDataset
azureml_dataset: my_azureml_folder_dataset
root_dir: data/01_raw/some_folder/
versioned: True
dataset:
type: pandas.ParquetDataSet
type: pandas.ParquetDataset
filepath: "."
my_file_dataset:
type: kedro_azureml.datasets.AzureMLAssetDataSet
type: kedro_azureml.datasets.AzureMLAssetDataset
azureml_dataset: my_azureml_file_dataset
root_dir: data/01_raw/some_other_folder/
versioned: True
dataset:
type: pandas.ParquetDataSet
type: pandas.ParquetDataset
filepath: "companies.csv"
"""
@@ -76,7 +76,7 @@ class AzureMLAssetDataSet(AzureMLPipelineDataSet, AbstractVersionedDataSet):
def __init__(
self,
azureml_dataset: str,
dataset: Union[str, Type[AbstractDataSet], Dict[str, Any]],
dataset: Union[str, Type[AbstractDataset], Dict[str, Any]],
root_dir: str = "data",
filepath_arg: str = "filepath",
azureml_type: AzureMLDataAssetType = "uri_folder",
@@ -102,14 +102,14 @@ def __init__(
self._azureml_config = None
self._azureml_type = azureml_type
if self._azureml_type not in get_args(AzureMLDataAssetType):
raise DataSetError(
raise DatasetError(
f"Invalid azureml_type '{self._azureml_type}' in dataset definition. "
f"Valid values are: {get_args(AzureMLDataAssetType)}"
)

# TODO: remove and disable versioning in Azure ML runner?
if VERSION_KEY in self._dataset_config:
raise DataSetError(
raise DatasetError(
f"'{self.__class__.__name__}' does not support versioning of the "
f"underlying dataset. Please remove '{VERSIONED_FLAG_KEY}' flag from "
f"the dataset definition."
@@ -148,7 +148,7 @@ def download_path(self) -> str:
else:
return str(self.path)

def _construct_dataset(self) -> AbstractDataSet:
def _construct_dataset(self) -> AbstractDataset:
dataset_config = self._dataset_config.copy()
dataset_config[self._filepath_arg] = str(self.path)
return self._dataset_type(**dataset_config)
@@ -160,7 +160,7 @@ def _get_latest_version(self) -> str:
) as ml_client:
return ml_client.data.get(self._azureml_dataset, label="latest").version
except ResourceNotFoundError:
raise DataSetNotFoundError(f"Did not find Azure ML Data Asset for {self}")
raise DatasetNotFoundError(f"Did not find Azure ML Data Asset for {self}")

@cachedmethod(cache=attrgetter("_version_cache"), key=partial(hashkey, "load"))
def _fetch_latest_load_version(self) -> str:
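The validation above now raises Kedro's renamed ``DatasetError``; a small sketch of the failure mode (the asset name is a placeholder, and ``pickle.PickleDataset`` is an arbitrary underlying dataset, assuming a ``kedro-datasets`` version with the renamed classes):

    from kedro.io.core import DatasetError
    from kedro_azureml.datasets import AzureMLAssetDataset

    try:
        AzureMLAssetDataset(
            azureml_dataset="my_asset",  # placeholder
            dataset={"type": "pickle.PickleDataset", "filepath": "data.pkl"},
            azureml_type="uri_table",  # not "uri_file" or "uri_folder"
        )
    except DatasetError as err:
        print(err)  # Invalid azureml_type 'uri_table' in dataset definition. ...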
4 changes: 2 additions & 2 deletions kedro_azureml/datasets/file_dataset.py
@@ -1,5 +1,5 @@
from kedro_azureml.datasets.pandas_dataset import AzureMLPandasDataSet
from kedro_azureml.datasets.pandas_dataset import AzureMLPandasDataset


class AzureMLFileDataSet(AzureMLPandasDataSet):
class AzureMLFileDataset(AzureMLPandasDataset):
pass
4 changes: 2 additions & 2 deletions kedro_azureml/datasets/pandas_dataset.py
@@ -1,9 +1,9 @@
from kedro.io import AbstractDataSet
from kedro.io import AbstractDataset

from kedro_azureml.datasets.v1_datasets import REMOVED_DATASETS_WARNING


class AzureMLPandasDataSet(AbstractDataSet):
class AzureMLPandasDataset(AbstractDataset):
def _load(self):
raise REMOVED_DATASETS_WARNING

28 changes: 14 additions & 14 deletions kedro_azureml/datasets/pipeline_dataset.py
@@ -5,8 +5,8 @@
from kedro.io.core import (
VERSION_KEY,
VERSIONED_FLAG_KEY,
AbstractDataSet,
DataSetError,
AbstractDataset,
DatasetError,
parse_dataset_definition,
)

@@ -18,9 +18,9 @@
logger = logging.getLogger(__name__)


class AzureMLPipelineDataSet(AbstractDataSet):
class AzureMLPipelineDataset(AbstractDataset):
"""
Dataset to support pipeline data passing in Azure ML between nodes, using `kedro.io.AbstractDataSet` as base class.
Dataset to support pipeline data passing in Azure ML between nodes, using `kedro.io.AbstractDataset` as base class.
Wraps around an underlying dataset, which can be any dataset supported by Kedro, and adds the ability to modify the
file path of the underlying dataset, to point to the mount paths on the Azure ML compute where the node is run.
@@ -29,7 +29,7 @@ class AzureMLPipelineDataSet(AbstractDataSet):
| - ``dataset``: Underlying dataset definition.
Accepted formats are:
a) object of a class that inherits from ``AbstractDataSet``
a) object of a class that inherits from ``AbstractDataset``
b) a string representing a fully qualified class name to such class
c) a dictionary with ``type`` key pointing to a string from b),
other keys are passed to the Dataset initializer.
@@ -47,26 +47,26 @@ class AzureMLPipelineDataSet(AbstractDataSet):
.. code-block:: yaml
processed_images:
type: kedro_azureml.datasets.AzureMLPipelineDataSet
type: kedro_azureml.datasets.AzureMLPipelineDataset
root_dir: 'data/01_raw'
dataset:
type: pillow.ImageDataSet
type: pillow.ImageDataset
filepath: 'images.png'
"""

def __init__(
self,
dataset: Union[str, Type[AbstractDataSet], Dict[str, Any]],
dataset: Union[str, Type[AbstractDataset], Dict[str, Any]],
root_dir: str = "data",
filepath_arg: str = "filepath",
):
"""Creates a new instance of ``AzureMLPipelineDataSet``.
"""Creates a new instance of ``AzureMLPipelineDataset``.
Args:
dataset: Underlying dataset definition.
Accepted formats are:
a) object of a class that inherits from ``AbstractDataSet``
a) object of a class that inherits from ``AbstractDataset``
b) a string representing a fully qualified class name to such class
c) a dictionary with ``type`` key pointing to a string from b),
other keys are passed to the Dataset initializer.
@@ -75,7 +75,7 @@ def __init__(
If unspecified, defaults to "filepath".
Raises:
DataSetError: If versioning is enabled for the underlying dataset.
DatasetError: If versioning is enabled for the underlying dataset.
"""

dataset = dataset if isinstance(dataset, dict) else {"type": dataset}
@@ -94,7 +94,7 @@

# TODO: remove and disable versioning in Azure ML runner?
if VERSION_KEY in self._dataset_config:
raise DataSetError(
raise DatasetError(
f"'{self.__class__.__name__}' does not support versioning of the "
f"underlying dataset. Please remove '{VERSIONED_FLAG_KEY}' flag from "
f"the dataset definition."
@@ -112,7 +112,7 @@ def _filepath(self) -> str:
"""
return self.path

def _construct_dataset(self) -> AbstractDataSet:
def _construct_dataset(self) -> AbstractDataset:
dataset_config = self._dataset_config.copy()
dataset_config[self._filepath_arg] = str(self.path)
return self._dataset_type(**dataset_config)
@@ -122,7 +122,7 @@ def _load(self) -> Any:

def _save(self, data: Any) -> None:
if is_distributed_environment() and not is_distributed_master_node():
logger.warning(f"DataSet {self} will not be saved on a distributed node")
logger.warning(f"Dataset {self} will not be saved on a distributed node")
else:
self._construct_dataset().save(data)

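The YAML example in the docstring above corresponds roughly to the following Python construction (a sketch; ``pillow.ImageDataset`` assumes the ``kedro-datasets`` pillow extra is installed):

    from kedro_azureml.datasets import AzureMLPipelineDataset

    processed_images = AzureMLPipelineDataset(
        dataset={"type": "pillow.ImageDataset", "filepath": "images.png"},
        root_dir="data/01_raw",
    )
    # On Azure ML, root_dir is remapped to the compute's mount path, so the
    # wrapped ImageDataset reads and writes images.png under that mount.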
6 changes: 3 additions & 3 deletions kedro_azureml/datasets/runner_dataset.py
@@ -8,7 +8,7 @@
import backoff
import cloudpickle
import fsspec
from kedro.io import AbstractDataSet
from kedro.io import AbstractDataset

from kedro_azureml.constants import (
KEDRO_AZURE_BLOB_TEMP_DIR_NAME,
@@ -19,7 +19,7 @@
logger = logging.getLogger(__name__)


class KedroAzureRunnerDataset(AbstractDataSet):
class KedroAzureRunnerDataset(AbstractDataset):
def __init__(
self,
storage_account_name,
@@ -83,5 +83,5 @@ def _save(self, data: Any) -> None:
super()._save(data)
else:
logger.warning(
f"DataSet {self.dataset_name} will not be saved on a distributed node"
f"Dataset {self.dataset_name} will not be saved on a distributed node"
)
2 changes: 1 addition & 1 deletion kedro_azureml/datasets/v1_datasets.py
@@ -3,7 +3,7 @@
REMOVED_DATASETS_WARNING = DeprecationWarning(
(
REMOVED_DATASETS_TEXT := "This dataset was removed in kedro-azureml 0.6.0. "
"Use kedro_azureml.datasets.asset_dataset.AzureMLAssetDataSet instead."
"Use kedro_azureml.datasets.asset_dataset.AzureMLAssetDataset instead."
)
)

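For completeness, the deprecation object above can be inspected directly; its message now points at the renamed class:

    from kedro_azureml.datasets.v1_datasets import REMOVED_DATASETS_WARNING

    print(REMOVED_DATASETS_WARNING)
    # This dataset was removed in kedro-azureml 0.6.0. Use
    # kedro_azureml.datasets.asset_dataset.AzureMLAssetDataset instead.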
