Merge pull request #82 from getindata/dataset-and-mlflow-fixes
DataSet to Dataset global renaming for Kedro 0.19 / MLflow min req. update
marrrcin committed Nov 15, 2023
2 parents 8e5979f + d040abd commit 51cf99c
Showing 22 changed files with 4,497 additions and 148 deletions.
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -1,6 +1,8 @@
# Changelog

## [Unreleased]
- [💔 Breaking change] Renamed all `*DataSet` classes to `*Dataset` to follow Kedro's naming convention, which will be introduced in 0.19.
- Upgraded the minimal MLflow requirement to `>=2.0.0,<3.0.0` to stay compatible with `azureml-mlflow`

- Added `--on-job-scheduled` argument to `kedro azureml run` to plug-in custom behaviour after Azure ML job is scheduled [@Gabriel2409](https://github.com/Gabriel2409)

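For downstream projects, the rename is purely a class-name change; the import paths stay the same. A minimal migration sketch, assuming the renamed classes from this release are installed:

    # Before this release:
    # from kedro_azureml.datasets import AzureMLAssetDataSet, AzureMLPipelineDataSet
    # After the rename, the same classes are exposed under Kedro 0.19-style names:
    from kedro_azureml.datasets import AzureMLAssetDataset, AzureMLPipelineDataset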
6 changes: 3 additions & 3 deletions docs/source/03_quickstart.rst
@@ -124,17 +124,17 @@ Adjusting the Data Catalog
.. code:: yaml
companies:
type: pandas.CSVDataSet
type: pandas.CSVDataset
filepath: data/01_raw/companies.csv
layer: raw
reviews:
type: pandas.CSVDataSet
type: pandas.CSVDataset
filepath: data/01_raw/reviews.csv
layer: raw
shuttles:
type: pandas.ExcelDataSet
type: pandas.ExcelDataset
filepath: data/01_raw/shuttles.xlsx
layer: raw
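The same catalog entries can also be constructed programmatically; a minimal sketch covering two of the YAML entries above, assuming a ``kedro-datasets`` version that already ships the renamed ``CSVDataset``:

    from kedro.io import DataCatalog
    from kedro_datasets.pandas import CSVDataset  # renamed from CSVDataSet

    # Build an in-memory catalog equivalent to the "companies" and "reviews" entries.
    catalog = DataCatalog(
        {
            "companies": CSVDataset(filepath="data/01_raw/companies.csv"),
            "reviews": CSVDataset(filepath="data/01_raw/reviews.csv"),
        }
    )
    companies_df = catalog.load("companies")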
18 changes: 9 additions & 9 deletions docs/source/05_data_assets.rst
@@ -3,19 +3,19 @@ Azure Data Assets

``kedro-azureml`` adds support for two new datasets that can be used in the Kedro catalog. Right now we support both Azure ML v1 SDK (direct Python) and Azure ML v2 SDK (fsspec-based) APIs.

**For v2 API (fsspec-based)** - use ``AzureMLAssetDataSet``, which lets you use Azure ML v2 SDK Folder/File datasets for remote and local runs.
**For v2 API (fsspec-based)** - use ``AzureMLAssetDataset``, which lets you use Azure ML v2 SDK Folder/File datasets for remote and local runs.
Currently only the `uri_file` and `uri_folder` types are supported. Because of limitations of the Azure ML SDK, the `uri_file` type can only be used for pipeline inputs,
not for outputs. The `uri_folder` type can be used for both inputs and outputs.

**For v1 API** (deprecated ⚠️) use the ``AzureMLFileDataSet`` and the ``AzureMLPandasDataSet`` which translate to `File/Folder dataset`_ and `Tabular dataset`_ respectively in
**For v1 API** (deprecated ⚠️) use the ``AzureMLFileDataset`` and the ``AzureMLPandasDataset`` which translate to `File/Folder dataset`_ and `Tabular dataset`_ respectively in
Azure Machine Learning. Both fully support the Azure versioning mechanism and can be used in the same way as any
other dataset in Kedro.


Apart from these, ``kedro-azureml`` also adds the ``AzureMLPipelineDataSet``, which is used to pass data between
Apart from these, ``kedro-azureml`` also adds the ``AzureMLPipelineDataset``, which is used to pass data between
pipeline nodes when the pipeline is run on Azure ML and the *pipeline data passing* feature is enabled.
By default, data is then saved and loaded using the ``PickleDataSet`` as the underlying dataset.
Any other underlying dataset can be used instead by adding an ``AzureMLPipelineDataSet`` to the catalog.
By default, data is then saved and loaded using the ``PickleDataset`` as the underlying dataset.
Any other underlying dataset can be used instead by adding an ``AzureMLPipelineDataset`` to the catalog.

All of these can be found under the `kedro_azureml.datasets`_ module.

@@ -35,7 +35,7 @@ Pipeline data passing

⚠️ Cannot be used when run locally.

.. autoclass:: kedro_azureml.datasets.AzureMLPipelineDataSet
.. autoclass:: kedro_azureml.datasets.AzureMLPipelineDataset
:members:

-----------------
@@ -47,7 +47,7 @@ Use the dataset below when you're using Azure ML SDK v2 (fsspec-based).

✅ Can be used for both remote and local runs.

.. autoclass:: kedro_azureml.datasets.asset_dataset.AzureMLAssetDataSet
.. autoclass:: kedro_azureml.datasets.asset_dataset.AzureMLAssetDataset
:members:

V1 SDK
@@ -56,11 +56,11 @@ Use the datasets below when you're using Azure ML SDK v1 (direct Python).

⚠️ Deprecated - will be removed in a future version of `kedro-azureml`.

.. autoclass:: kedro_azureml.datasets.AzureMLPandasDataSet
.. autoclass:: kedro_azureml.datasets.AzureMLPandasDataset
:members:

-----------------

.. autoclass:: kedro_azureml.datasets.AzureMLFileDataSet
.. autoclass:: kedro_azureml.datasets.AzureMLFileDataset
:members:
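As an illustration of the renamed v2 asset dataset, a hedged sketch of declaring ``AzureMLAssetDataset`` directly in Python; the Azure ML data asset name and paths are placeholders, and loading it still requires a configured Azure ML workspace:

    from kedro_azureml.datasets import AzureMLAssetDataset

    shuttles = AzureMLAssetDataset(
        azureml_dataset="my_azureml_folder_dataset",  # placeholder Azure ML data asset
        dataset={"type": "pandas.ParquetDataset", "filepath": "shuttles.pq"},
        root_dir="data/01_raw/some_folder",
        azureml_type="uri_folder",  # "uri_file" is accepted for pipeline inputs only
    )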

16 changes: 8 additions & 8 deletions kedro_azureml/datasets/__init__.py
@@ -1,17 +1,17 @@
from kedro_azureml.datasets.asset_dataset import AzureMLAssetDataSet
from kedro_azureml.datasets.file_dataset import AzureMLFileDataSet
from kedro_azureml.datasets.pandas_dataset import AzureMLPandasDataSet
from kedro_azureml.datasets.pipeline_dataset import AzureMLPipelineDataSet
from kedro_azureml.datasets.asset_dataset import AzureMLAssetDataset
from kedro_azureml.datasets.file_dataset import AzureMLFileDataset
from kedro_azureml.datasets.pandas_dataset import AzureMLPandasDataset
from kedro_azureml.datasets.pipeline_dataset import AzureMLPipelineDataset
from kedro_azureml.datasets.runner_dataset import (
KedroAzureRunnerDataset,
KedroAzureRunnerDistributedDataset,
)

__all__ = [
"AzureMLFileDataSet",
"AzureMLAssetDataSet",
"AzureMLPipelineDataSet",
"AzureMLPandasDataSet",
"AzureMLFileDataset",
"AzureMLAssetDataset",
"AzureMLPipelineDataset",
"AzureMLPandasDataset",
"KedroAzureRunnerDataset",
"KedroAzureRunnerDistributedDataset",
]
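A quick way to confirm which names the package now exports after this change:

    import kedro_azureml.datasets as azureml_datasets

    # Only the Kedro 0.19-style names remain in the public API.
    print(sorted(azureml_datasets.__all__))
    assert "AzureMLAssetDataset" in azureml_datasets.__all__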
32 changes: 16 additions & 16 deletions kedro_azureml/datasets/asset_dataset.py
@@ -11,25 +11,25 @@
from kedro.io.core import (
VERSION_KEY,
VERSIONED_FLAG_KEY,
AbstractDataSet,
AbstractVersionedDataSet,
DataSetError,
DataSetNotFoundError,
AbstractDataset,
AbstractVersionedDataset,
DatasetError,
DatasetNotFoundError,
Version,
VersionNotFoundError,
)

from kedro_azureml.client import _get_azureml_client
from kedro_azureml.config import AzureMLConfig
from kedro_azureml.datasets.pipeline_dataset import AzureMLPipelineDataSet
from kedro_azureml.datasets.pipeline_dataset import AzureMLPipelineDataset

AzureMLDataAssetType = Literal["uri_file", "uri_folder"]
logger = logging.getLogger(__name__)


class AzureMLAssetDataSet(AzureMLPipelineDataSet, AbstractVersionedDataSet):
class AzureMLAssetDataset(AzureMLPipelineDataset, AbstractVersionedDataset):
"""
AzureMLAssetDataSet enables kedro-azureml to use azureml
AzureMLAssetDataset enables kedro-azureml to use azureml
v2-sdk Folder/File datasets for remote and local runs.
Args
@@ -52,21 +52,21 @@ class AzureMLAssetDataSet(AzureMLPipelineDataSet, AbstractVersionedDataSet):
.. code-block:: yaml
my_folder_dataset:
type: kedro_azureml.datasets.AzureMLAssetDataSet
type: kedro_azureml.datasets.AzureMLAssetDataset
azureml_dataset: my_azureml_folder_dataset
root_dir: data/01_raw/some_folder/
versioned: True
dataset:
type: pandas.ParquetDataSet
type: pandas.ParquetDataset
filepath: "."
my_file_dataset:
type: kedro_azureml.datasets.AzureMLAssetDataSet
type: kedro_azureml.datasets.AzureMLAssetDataset
azureml_dataset: my_azureml_file_dataset
root_dir: data/01_raw/some_other_folder/
versioned: True
dataset:
type: pandas.ParquetDataSet
type: pandas.ParquetDataset
filepath: "companies.csv"
"""
@@ -76,7 +76,7 @@ class AzureMLAssetDataSet(AzureMLPipelineDataSet, AbstractVersionedDataSet):
def __init__(
self,
azureml_dataset: str,
dataset: Union[str, Type[AbstractDataSet], Dict[str, Any]],
dataset: Union[str, Type[AbstractDataset], Dict[str, Any]],
root_dir: str = "data",
filepath_arg: str = "filepath",
azureml_type: AzureMLDataAssetType = "uri_folder",
@@ -102,14 +102,14 @@ def __init__(
self._azureml_config = None
self._azureml_type = azureml_type
if self._azureml_type not in get_args(AzureMLDataAssetType):
raise DataSetError(
raise DatasetError(
f"Invalid azureml_type '{self._azureml_type}' in dataset definition. "
f"Valid values are: {get_args(AzureMLDataAssetType)}"
)

# TODO: remove and disable versioning in Azure ML runner?
if VERSION_KEY in self._dataset_config:
raise DataSetError(
raise DatasetError(
f"'{self.__class__.__name__}' does not support versioning of the "
f"underlying dataset. Please remove '{VERSIONED_FLAG_KEY}' flag from "
f"the dataset definition."
@@ -148,7 +148,7 @@ def download_path(self) -> str:
else:
return str(self.path)

def _construct_dataset(self) -> AbstractDataSet:
def _construct_dataset(self) -> AbstractDataset:
dataset_config = self._dataset_config.copy()
dataset_config[self._filepath_arg] = str(self.path)
return self._dataset_type(**dataset_config)
@@ -160,7 +160,7 @@ def _get_latest_version(self) -> str:
) as ml_client:
return ml_client.data.get(self._azureml_dataset, label="latest").version
except ResourceNotFoundError:
raise DataSetNotFoundError(f"Did not find Azure ML Data Asset for {self}")
raise DatasetNotFoundError(f"Did not find Azure ML Data Asset for {self}")

@cachedmethod(cache=attrgetter("_version_cache"), key=partial(hashkey, "load"))
def _fetch_latest_load_version(self) -> str:
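The validation above now raises Kedro's renamed ``DatasetError``; a small sketch of the failure mode (the asset name is a placeholder, and ``pickle.PickleDataset`` is an arbitrary underlying dataset, assuming a ``kedro-datasets`` version with the renamed classes):

    from kedro.io.core import DatasetError
    from kedro_azureml.datasets import AzureMLAssetDataset

    try:
        AzureMLAssetDataset(
            azureml_dataset="my_asset",  # placeholder
            dataset={"type": "pickle.PickleDataset", "filepath": "data.pkl"},
            azureml_type="uri_table",  # not "uri_file" or "uri_folder"
        )
    except DatasetError as err:
        print(err)  # Invalid azureml_type 'uri_table' in dataset definition. ...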
4 changes: 2 additions & 2 deletions kedro_azureml/datasets/file_dataset.py
@@ -1,5 +1,5 @@
from kedro_azureml.datasets.pandas_dataset import AzureMLPandasDataSet
from kedro_azureml.datasets.pandas_dataset import AzureMLPandasDataset


class AzureMLFileDataSet(AzureMLPandasDataSet):
class AzureMLFileDataset(AzureMLPandasDataset):
pass
4 changes: 2 additions & 2 deletions kedro_azureml/datasets/pandas_dataset.py
@@ -1,9 +1,9 @@
from kedro.io import AbstractDataSet
from kedro.io import AbstractDataset

from kedro_azureml.datasets.v1_datasets import REMOVED_DATASETS_WARNING


class AzureMLPandasDataSet(AbstractDataSet):
class AzureMLPandasDataset(AbstractDataset):
def _load(self):
raise REMOVED_DATASETS_WARNING

28 changes: 14 additions & 14 deletions kedro_azureml/datasets/pipeline_dataset.py
@@ -5,8 +5,8 @@
from kedro.io.core import (
VERSION_KEY,
VERSIONED_FLAG_KEY,
AbstractDataSet,
DataSetError,
AbstractDataset,
DatasetError,
parse_dataset_definition,
)

@@ -18,9 +18,9 @@
logger = logging.getLogger(__name__)


class AzureMLPipelineDataSet(AbstractDataSet):
class AzureMLPipelineDataset(AbstractDataset):
"""
Dataset to support pipeline data passing in Azure ML between nodes, using `kedro.io.AbstractDataSet` as base class.
Dataset to support pipeline data passing in Azure ML between nodes, using `kedro.io.AbstractDataset` as base class.
Wraps around an underlying dataset, which can be any dataset supported by Kedro, and adds the ability to modify the
file path of the underlying dataset, to point to the mount paths on the Azure ML compute where the node is run.
@@ -29,7 +29,7 @@ class AzureMLPipelineDataSet(AbstractDataSet):
| - ``dataset``: Underlying dataset definition.
Accepted formats are:
a) object of a class that inherits from ``AbstractDataSet``
a) object of a class that inherits from ``AbstractDataset``
b) a string representing a fully qualified class name to such class
c) a dictionary with ``type`` key pointing to a string from b),
other keys are passed to the Dataset initializer.
@@ -47,26 +47,26 @@ class AzureMLPipelineDataSet(AbstractDataSet):
.. code-block:: yaml
processed_images:
type: kedro_azureml.datasets.AzureMLPipelineDataSet
type: kedro_azureml.datasets.AzureMLPipelineDataset
root_dir: 'data/01_raw'
dataset:
type: pillow.ImageDataSet
type: pillow.ImageDataset
filepath: 'images.png'
"""

def __init__(
self,
dataset: Union[str, Type[AbstractDataSet], Dict[str, Any]],
dataset: Union[str, Type[AbstractDataset], Dict[str, Any]],
root_dir: str = "data",
filepath_arg: str = "filepath",
):
"""Creates a new instance of ``AzureMLPipelineDataSet``.
"""Creates a new instance of ``AzureMLPipelineDataset``.
Args:
dataset: Underlying dataset definition.
Accepted formats are:
a) object of a class that inherits from ``AbstractDataSet``
a) object of a class that inherits from ``AbstractDataset``
b) a string representing a fully qualified class name to such class
c) a dictionary with ``type`` key pointing to a string from b),
other keys are passed to the Dataset initializer.
@@ -75,7 +75,7 @@ def __init__(
If unspecified, defaults to "filepath".
Raises:
DataSetError: If versioning is enabled for the underlying dataset.
DatasetError: If versioning is enabled for the underlying dataset.
"""

dataset = dataset if isinstance(dataset, dict) else {"type": dataset}
@@ -94,7 +94,7 @@

# TODO: remove and disable versioning in Azure ML runner?
if VERSION_KEY in self._dataset_config:
raise DataSetError(
raise DatasetError(
f"'{self.__class__.__name__}' does not support versioning of the "
f"underlying dataset. Please remove '{VERSIONED_FLAG_KEY}' flag from "
f"the dataset definition."
@@ -112,7 +112,7 @@ def _filepath(self) -> str:
"""
return self.path

def _construct_dataset(self) -> AbstractDataSet:
def _construct_dataset(self) -> AbstractDataset:
dataset_config = self._dataset_config.copy()
dataset_config[self._filepath_arg] = str(self.path)
return self._dataset_type(**dataset_config)
@@ -122,7 +122,7 @@ def _load(self) -> Any:

def _save(self, data: Any) -> None:
if is_distributed_environment() and not is_distributed_master_node():
logger.warning(f"DataSet {self} will not be saved on a distributed node")
logger.warning(f"Dataset {self} will not be saved on a distributed node")
else:
self._construct_dataset().save(data)

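The YAML example in the docstring above corresponds roughly to the following Python construction (a sketch; ``pillow.ImageDataset`` assumes the ``kedro-datasets`` pillow extra is installed):

    from kedro_azureml.datasets import AzureMLPipelineDataset

    processed_images = AzureMLPipelineDataset(
        dataset={"type": "pillow.ImageDataset", "filepath": "images.png"},
        root_dir="data/01_raw",
    )
    # On Azure ML, root_dir is remapped to the compute's mount path, so the
    # wrapped ImageDataset reads and writes images.png under that mount.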
6 changes: 3 additions & 3 deletions kedro_azureml/datasets/runner_dataset.py
@@ -8,7 +8,7 @@
import backoff
import cloudpickle
import fsspec
from kedro.io import AbstractDataSet
from kedro.io import AbstractDataset

from kedro_azureml.constants import (
KEDRO_AZURE_BLOB_TEMP_DIR_NAME,
@@ -19,7 +19,7 @@
logger = logging.getLogger(__name__)


class KedroAzureRunnerDataset(AbstractDataSet):
class KedroAzureRunnerDataset(AbstractDataset):
def __init__(
self,
storage_account_name,
@@ -83,5 +83,5 @@ def _save(self, data: Any) -> None:
super()._save(data)
else:
logger.warning(
f"DataSet {self.dataset_name} will not be saved on a distributed node"
f"Dataset {self.dataset_name} will not be saved on a distributed node"
)
2 changes: 1 addition & 1 deletion kedro_azureml/datasets/v1_datasets.py
@@ -3,7 +3,7 @@
REMOVED_DATASETS_WARNING = DeprecationWarning(
(
REMOVED_DATASETS_TEXT := "This dataset was removed in kedro-azureml 0.6.0. "
"Use kedro_azureml.datasets.asset_dataset.AzureMLAssetDataSet instead."
"Use kedro_azureml.datasets.asset_dataset.AzureMLAssetDataset instead."
)
)

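For completeness, the deprecation object above can be inspected directly; its message now points at the renamed class:

    from kedro_azureml.datasets.v1_datasets import REMOVED_DATASETS_WARNING

    print(REMOVED_DATASETS_WARNING)
    # This dataset was removed in kedro-azureml 0.6.0. Use
    # kedro_azureml.datasets.asset_dataset.AzureMLAssetDataset instead.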
