Commit a4ca7c3

Merge branch 'main', remote-tracking branch 'origin' into fix/shuffle_collate

AbhinavTuli committed Mar 28, 2022
3 parents 55467ee + ff63116 + 4ddb009 commit a4ca7c3
Showing 63 changed files with 3,294 additions and 267 deletions.
41 changes: 29 additions & 12 deletions README.md
@@ -26,18 +26,37 @@

## About Hub

Hub is a dataset format with a simple API for creating, storing, and collaborating on AI datasets of any size. The hub data layout enables rapid transformations and streaming of data while training models at scale. Hub is used by Google, Waymo, Red Cross, Oxford University, and Omdena.
Hub is a dataset format with a simple API for creating, storing, and collaborating on AI datasets of any size. The hub data layout enables rapid transformations and streaming of data while training models at scale. Hub is used by Google, Waymo, Red Cross, Oxford University, and Omdena. Hub includes the following features:

<details>
<summary><b>Storage Agnostic API</b></summary>
Use the same API to upload, download, and stream datasets to/from AWS S3/S3-compatible storage, GCP, Activeloop cloud, local storage, as well as in-memory.
</details>
<details>
<summary><b>Compressed Storage</b></summary>
Store images, audio, and video in their native compression, decompressing them only when needed, e.g., when training a model.
</details>
<details>
<summary><b>Lazy NumPy-like Indexing</b></summary>
Treat your S3 or GCP datasets as if they are a collection of NumPy arrays in your system's memory. Slice them, index them, or iterate through them. Only the bytes you ask for will be downloaded!
</details>
<details>
<summary><b>Dataset Version Control</b></summary>
Commits, branches, checkout - Concepts you are already familiar with in your code repositories can now be applied to your datasets as well!
</details>
<details>
<summary><b>Integrations with Deep Learning Frameworks</b></summary>
Hub comes with built-in integrations for PyTorch and TensorFlow. Train your model with a few lines of code - we even take care of dataset shuffling (see the sketch after the feature list). :)
</details>
<details>
<summary><b>Distributed Transformations</b></summary>
Rapidly apply transformations on your datasets using multi-threading, multi-processing, or our built-in <a href="https://www.ray.io/">Ray</a> integration.
</details>
<details>
<summary><b>Instant Visualization Support in <a href="https://app.activeloop.ai/?utm_source=github&utm_medium=github&utm_campaign=github_readme&utm_id=readme">Activeloop Platform</a></b></summary>
Hub datasets are instantly visualized with bounding boxes, masks, annotations, etc. in <a href="https://app.activeloop.ai/?utm_source=github&utm_medium=github&utm_campaign=github_readme&utm_id=readme">Activeloop Platform</a> (see below).
</details>

Hub includes the following features:

* **Storage agnostic API**: Use the same API to upload, download, and stream datasets to/from AWS S3/S3-compatible storage, GCP, Activeloop cloud, local storage, as well as in-memory.
* **Compressed storage**: Store images, audios and videos in their native compression, decompressing them only when needed, for e.g., when training a model.
* **Lazy NumPy-like slicing**: Treat your S3 or GCP datasets as if they are a collection of NumPy arrays in your system's memory. Slice them, index them, or iterate through them. Only the bytes you ask for will be downloaded!
* **Dataset version control**: Commits, branches, checkout - Concepts you are already familiar with in your code repositories can now be applied to your datasets as well.
* **Third-party integrations**: Hub comes with built-in integrations for Pytorch and Tensorflow. Train your model with a few lines of code - we even take care of dataset shuffling. :)
* **Distributed transforms**: Rapidly apply transformations on your datasets using multi-threading, multi-processing, or our built-in [Ray](https://www.ray.io/) integration.
* **Instant visualization support**: Hub datasets are instantly visualized with bounding boxes, masks, annotations, etc. in [Activeloop Platform](https://app.activeloop.ai/?utm_source=github&utm_medium=github&utm_campaign=github_readme&utm_id=readme) (see below).
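
As a quick illustration of the storage-agnostic API, lazy indexing, and the PyTorch integration described above, here is a minimal sketch. It assumes the publicly hosted `hub://activeloop/mnist-train` dataset and the Hub 2.x API; the exact structure of the batches yielded by `ds.pytorch()` may vary between versions.

```python
import hub

# Lazy, NumPy-like access: only the bytes you index are downloaded.
ds = hub.load("hub://activeloop/mnist-train")
print(ds.tensors)                 # e.g. 'images' and 'labels'
sample = ds.images[0].numpy()     # fetches just this one image

# Built-in PyTorch integration, with shuffling handled for you.
dataloader = ds.pytorch(batch_size=16, shuffle=True, num_workers=2)
for batch in dataloader:
    images, labels = batch["images"], batch["labels"]
    break  # train your model here
```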

<div align="center">
<a href="https://www.linkpicture.com/view.php?img=LPic61b13e5c1c539681810493"><img src="https://www.linkpicture.com/q/ReadMe.gif" type="image"></a>
@@ -323,5 +342,3 @@ activeloop reporting --off

## Acknowledgment
This technology was inspired by our research work at Princeton University. We would like to thank William Silversmith @SeungLab for his awesome [cloud-volume](https://github.com/seung-lab/cloud-volume) tool.

Hub uses FFmpeg for video processing. Many thanks to the [FFmpeg](https://www.ffmpeg.org/) team for developing this amazing solution.
10 changes: 9 additions & 1 deletion hub/__init__.py
@@ -4,6 +4,7 @@
import numpy as np
import multiprocessing
import sys
from hub.util.check_latest_version import warn_if_update_required

if sys.platform == "darwin":
    multiprocessing.set_start_method("fork", force=True)
@@ -39,10 +40,13 @@
compressions = list(SUPPORTED_COMPRESSIONS)
htypes = sorted(list(HTYPE_CONFIGURATIONS))
list = dataset.list
exists = dataset.exists
load = dataset.load
empty = dataset.empty
like = dataset.like
delete = dataset.delete
rename = dataset.rename
copy = dataset.copy
dataset_cl = Dataset
ingest = dataset.ingest
ingest_kaggle = dataset.ingest_kaggle
@@ -57,6 +61,7 @@
"__version__",
"load",
"empty",
"exists",
"compute",
"compose",
"like",
@@ -69,9 +74,12 @@
"htypes",
"config",
"delete",
"copy",
"rename",
]

__version__ = "2.3.1"
__version__ = "2.3.3"
warn_if_update_required(__version__)
__encoded_version__ = np.array(__version__)
config = {"s3": Config(max_pool_connections=50, connect_timeout=300, read_timeout=300)}
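
The newly exported `exists`, `rename`, and `copy` aliases map directly onto the `dataset` static methods added in this commit. Below is a minimal, hedged sketch of how the first two might be called; the `hub://username/...` paths are placeholders.

```python
import hub

# Check that a dataset exists before acting on it.
if hub.exists("hub://username/image_ds"):
    # Renaming is only supported within the same directory/organization.
    hub.rename("hub://username/image_ds", "hub://username/image_ds_v2")
```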

148 changes: 147 additions & 1 deletion hub/api/dataset.py
@@ -22,6 +22,10 @@
AuthorizationException,
)
from hub.util.storage import get_storage_and_cache_chain, storage_provider_from_path
from hub.util.compute import get_compute_provider
from hub.util.remove_cache import get_base_storage
from hub.util.cache_chain import generate_chain
from hub.core.storage.hub_memory_object import HubMemoryObject


class dataset:
@@ -258,6 +262,44 @@ def load(
        except AgreementError as e:
            raise e from None

    @staticmethod
    def rename(
        old_path: str,
        new_path: str,
        creds: Optional[dict] = None,
        token: Optional[str] = None,
    ) -> Dataset:
        """Renames dataset at `old_path` to `new_path`.
        Args:
            old_path (str): The path to the dataset to be renamed.
            new_path (str): Path to the dataset after renaming.
            creds (dict, optional): A dictionary containing credentials used to access the dataset at the path.
                This takes precedence over credentials present in the environment. Currently only works with s3 paths.
                It supports 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token', 'endpoint_url' and 'region' as keys.
            token (str, optional): Activeloop token, used for fetching credentials to the dataset at path if it is a Hub dataset. This is optional; tokens are normally autogenerated.
        Returns:
            Dataset object after renaming.
        Raises:
            DatasetHandlerError: If a dataset does not exist at the given path or if the new path points to a different directory.
        Example::
            hub.rename("hub://username/image_ds", "hub://username/new_ds")
            hub.rename("s3://mybucket/my_ds", "s3://mybucket/renamed_ds")
        """
        if creds is None:
            creds = {}

        feature_report_path(old_path, "rename", {})

        ds = hub.load(old_path, verbose=False, token=token, creds=creds)
        ds.rename(new_path)

        return ds  # type: ignore

    @staticmethod
    def delete(
        path: str,
@@ -350,13 +392,117 @@ def like(
        if isinstance(source, str):
            source_ds = dataset.load(source)

        for tensor_name in source_ds.version_state["meta"].tensors:  # type: ignore
        for tensor_name in source_ds.tensors:  # type: ignore
            destination_ds.create_tensor_like(tensor_name, source_ds[tensor_name])

        destination_ds.info.update(source_ds.info.__getstate__())  # type: ignore

        return destination_ds

    @staticmethod
    def copy(
        src: str,
        dest: str,
        overwrite: bool = False,
        src_creds=None,
        dest_creds=None,
        src_token=None,
        dest_token=None,
        num_workers: int = 0,
        scheduler="threaded",
        progress_bar=True,
        public: bool = True,
    ):
        """Copies dataset at `src` to `dest`.
        Args:
            src (str): Path to the dataset to be copied.
            dest (str): Destination path to copy to.
            overwrite (bool): If True and a dataset exists at `dest`, it will be overwritten. Defaults to False.
            src_creds (dict, optional): A dictionary containing credentials used to access the dataset at `src`.
                If aws_access_key_id, aws_secret_access_key and aws_session_token are present, they take precedence over credentials present in the environment or in the credentials file. Currently only works with s3 paths.
                It supports 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token', 'endpoint_url', 'region' and 'profile_name' as keys.
            src_token (str, optional): Activeloop token, used for fetching credentials to the dataset at `src` if it is a Hub dataset. This is optional; tokens are normally autogenerated.
            dest_creds (dict, optional): Credentials required to create / overwrite datasets at `dest`.
            dest_token (str, optional): Token used for fetching credentials to `dest`.
            num_workers (int): The number of workers to use for copying. Defaults to 0. When set to 0, it will always use serial processing, irrespective of the scheduler.
            scheduler (str): The scheduler to be used for copying. Supported values include: 'serial', 'threaded', 'processed' and 'ray'.
                Defaults to 'threaded'.
            progress_bar (bool): Displays a progress bar if True (default).
            public (bool): Defines if the dataset will have public access. Applicable only if Hub cloud storage is used and a new Dataset is being created. Defaults to True.
        Returns:
            Dataset: New dataset object.
        Raises:
            DatasetHandlerError: If a dataset already exists at the destination path and overwrite is False.
        """
        src_ds = hub.load(src, read_only=True, creds=src_creds, token=src_token)
        src_storage = get_base_storage(src_ds.storage)

        dest_storage, cache_chain = get_storage_and_cache_chain(
            dest,
            creds=dest_creds,
            token=dest_token,
            read_only=False,
            memory_cache_size=DEFAULT_MEMORY_CACHE_SIZE,
            local_cache_size=DEFAULT_LOCAL_CACHE_SIZE,
        )

        if dataset_exists(cache_chain):
            if overwrite:
                cache_chain.clear()
            else:
                raise DatasetHandlerError(
                    f"A dataset already exists at the given path ({dest}). If you want to copy to a new dataset, either specify another path or use overwrite=True."
                )

        def copy_func(keys, progress_callback=None):
            cache = generate_chain(
                src_storage,
                memory_cache_size=DEFAULT_MEMORY_CACHE_SIZE,
                local_cache_size=DEFAULT_LOCAL_CACHE_SIZE,
                path=src,
            )
            for key in keys:
                if isinstance(cache[key], HubMemoryObject):
                    dest_storage[key] = cache[key].tobytes()
                else:
                    dest_storage[key] = cache[key]
                if progress_callback:
                    progress_callback(1)

        def copy_func_with_progress_bar(pg_callback, keys):
            copy_func(keys, pg_callback)

        keys = list(src_storage._all_keys())
        len_keys = len(keys)
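        # Split the keys round-robin across the workers; with num_workers == 0
        # the whole key list stays in a single chunk and is copied serially.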
        if num_workers == 0:
            keys = [keys]
        else:
            keys = [keys[i::num_workers] for i in range(num_workers)]
        compute_provider = get_compute_provider(scheduler, num_workers)
        try:
            if progress_bar:
                compute_provider.map_with_progressbar(
                    copy_func_with_progress_bar,
                    keys,
                    len_keys,
                    "Copying dataset",
                )
            else:
                compute_provider.map(copy_func, keys)
        finally:
            compute_provider.close()

        return dataset_factory(
            path=dest,
            storage=cache_chain,
            read_only=False,
            public=public,
            token=dest_token,
        )

    @staticmethod
    def ingest(
        src,
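
A short usage sketch of the new `copy` entry point, mirroring the parameters documented above; the bucket and dataset names are placeholders, and the worker/scheduler values are illustrative rather than recommended settings.

```python
import hub

# Serial copy: num_workers=0 always copies serially, whatever the scheduler.
ds = hub.copy("hub://username/src_ds", "hub://username/dst_ds")

# Parallel copy with four threads, overwriting any existing destination dataset.
ds = hub.copy(
    "s3://source-bucket/my_ds",
    "s3://dest-bucket/my_ds",
    overwrite=True,
    num_workers=4,
    scheduler="threaded",
    progress_bar=True,
)
```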
11 changes: 9 additions & 2 deletions hub/api/read.py
@@ -33,11 +33,18 @@ def read(
>>> ds.videos.shape
(1, 136, 720, 1080, 3)
>>> image = hub.read("https://picsum.photos/200/300")
>>> image.compression
'jpeg'
>>> ds.create_tensor("images", htype="image", sample_compression="jpeg")
>>> ds.images.append(image)
>>> ds.images[0].shape
(300, 200, 3)
Supported file types:
Image: "bmp", "dib", "gif", "ico", "jpeg", "jpeg2000", "pcx", "png", "ppm", "sgi", "tga", "tiff", "webp", "wmf", "xbm"
Image: "bmp", "dib", "gif", "ico", "jpeg", "jpeg2000", "pcx", "png", "ppm", "sgi", "tga", "tiff", "webp", "wmf", "xbm"
Audio: "flac", "mp3", "wav"
Video: "mp4", "mkv", "avi"
Args: