Commit a4ca7c3

Merge branch 'main', remote-tracking branch 'origin' into fix/shuffle_collate

AbhinavTuli committed Mar 28, 2022
3 parents 55467ee + ff63116 + 4ddb009 commit a4ca7c3
Showing 63 changed files with 3,294 additions and 267 deletions.
41 changes: 29 additions & 12 deletions README.md
@@ -26,18 +26,37 @@

## About Hub

Hub is a dataset format with a simple API for creating, storing, and collaborating on AI datasets of any size. The hub data layout enables rapid transformations and streaming of data while training models at scale. Hub is used by Google, Waymo, Red Cross, Oxford University, and Omdena.
Hub is a dataset format with a simple API for creating, storing, and collaborating on AI datasets of any size. The hub data layout enables rapid transformations and streaming of data while training models at scale. Hub is used by Google, Waymo, Red Cross, Oxford University, and Omdena. Hub includes the following features:

<details>
<summary><b>Storage Agnostic API</b></summary>
Use the same API to upload, download, and stream datasets to/from AWS S3/S3-compatible storage, GCP, Activeloop cloud, local storage, as well as in-memory.
</details>
<details>
<summary><b>Compressed Storage</b></summary>
Store images, audio, and video in their native compression, decompressing them only when needed, e.g., when training a model.
</details>
<details>
<summary><b>Lazy NumPy-like Indexing</b></summary>
Treat your S3 or GCP datasets as if they are a collection of NumPy arrays in your system's memory. Slice them, index them, or iterate through them. Only the bytes you ask for will be downloaded!
</details>
<details>
<summary><b>Dataset Version Control</b></summary>
Commits, branches, checkout - Concepts you are already familiar with in your code repositories can now be applied to your datasets as well!
</details>
<details>
<summary><b>Integrations with Deep Learning Frameworks</b></summary>
Hub comes with built-in integrations for PyTorch and TensorFlow. Train your model with a few lines of code - we even take care of dataset shuffling (see the sketch after the feature list). :)
</details>
<details>
<summary><b>Distributed Transformations</b></summary>
Rapidly apply transformations on your datasets using multi-threading, multi-processing, or our built-in <a href="https://www.ray.io/">Ray</a> integration.
</details>
<details>
<summary><b>Instant Visualization Support in <a href="https://app.activeloop.ai/?utm_source=github&utm_medium=github&utm_campaign=github_readme&utm_id=readme">Activeloop Platform</a></b></summary>
Hub datasets are instantly visualized with bounding boxes, masks, annotations, etc. in <a href="https://app.activeloop.ai/?utm_source=github&utm_medium=github&utm_campaign=github_readme&utm_id=readme">Activeloop Platform</a> (see below).
</details>

Hub includes the following features:

* **Storage agnostic API**: Use the same API to upload, download, and stream datasets to/from AWS S3/S3-compatible storage, GCP, Activeloop cloud, local storage, as well as in-memory.
* **Compressed storage**: Store images, audios and videos in their native compression, decompressing them only when needed, for e.g., when training a model.
* **Lazy NumPy-like slicing**: Treat your S3 or GCP datasets as if they are a collection of NumPy arrays in your system's memory. Slice them, index them, or iterate through them. Only the bytes you ask for will be downloaded!
* **Dataset version control**: Commits, branches, checkout - Concepts you are already familiar with in your code repositories can now be applied to your datasets as well.
* **Third-party integrations**: Hub comes with built-in integrations for Pytorch and Tensorflow. Train your model with a few lines of code - we even take care of dataset shuffling. :)
* **Distributed transforms**: Rapidly apply transformations on your datasets using multi-threading, multi-processing, or our built-in [Ray](https://www.ray.io/) integration.
* **Instant visualization support**: Hub datasets are instantly visualized with bounding boxes, masks, annotations, etc. in [Activeloop Platform](https://app.activeloop.ai/?utm_source=github&utm_medium=github&utm_campaign=github_readme&utm_id=readme) (see below).
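
As a quick illustration of the storage-agnostic API, lazy indexing, and the PyTorch integration described above, here is a minimal sketch. It assumes the publicly hosted `hub://activeloop/mnist-train` dataset and the Hub 2.x API; the exact structure of the batches yielded by `ds.pytorch()` may vary between versions.

```python
import hub

# Lazy, NumPy-like access: only the bytes you index are downloaded.
ds = hub.load("hub://activeloop/mnist-train")
print(ds.tensors)                 # e.g. 'images' and 'labels'
sample = ds.images[0].numpy()     # fetches just this one image

# Built-in PyTorch integration, with shuffling handled for you.
dataloader = ds.pytorch(batch_size=16, shuffle=True, num_workers=2)
for batch in dataloader:
    images, labels = batch["images"], batch["labels"]
    break  # train your model here
```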

<div align="center">
<a href="https://www.linkpicture.com/view.php?img=LPic61b13e5c1c539681810493"><img src="https://www.linkpicture.com/q/ReadMe.gif" type="image"></a>
@@ -323,5 +342,3 @@ activeloop reporting --off

## Acknowledgment
This technology was inspired by our research work at Princeton University. We would like to thank William Silversmith @SeungLab for his awesome [cloud-volume](https://github.com/seung-lab/cloud-volume) tool.

Hub uses FFmpeg for video processing. Many thanks to the [FFmpeg](https://www.ffmpeg.org/) team for developing this amazing solution.
10 changes: 9 additions & 1 deletion hub/__init__.py
@@ -4,6 +4,7 @@
import numpy as np
import multiprocessing
import sys
from hub.util.check_latest_version import warn_if_update_required

if sys.platform == "darwin":
    multiprocessing.set_start_method("fork", force=True)
@@ -39,10 +40,13 @@
compressions = list(SUPPORTED_COMPRESSIONS)
htypes = sorted(list(HTYPE_CONFIGURATIONS))
list = dataset.list
exists = dataset.exists
load = dataset.load
empty = dataset.empty
like = dataset.like
delete = dataset.delete
rename = dataset.rename
copy = dataset.copy
dataset_cl = Dataset
ingest = dataset.ingest
ingest_kaggle = dataset.ingest_kaggle
@@ -57,6 +61,7 @@
"__version__",
"load",
"empty",
"exists",
"compute",
"compose",
"like",
@@ -69,9 +74,12 @@
"htypes",
"config",
"delete",
"copy",
"rename",
]

__version__ = "2.3.1"
__version__ = "2.3.3"
warn_if_update_required(__version__)
__encoded_version__ = np.array(__version__)
config = {"s3": Config(max_pool_connections=50, connect_timeout=300, read_timeout=300)}
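
The newly exported `exists`, `rename`, and `copy` aliases map directly onto the `dataset` static methods added in this commit. Below is a minimal, hedged sketch of how the first two might be called; the `hub://username/...` paths are placeholders.

```python
import hub

# Check that a dataset exists before acting on it.
if hub.exists("hub://username/image_ds"):
    # Renaming is only supported within the same directory/organization.
    hub.rename("hub://username/image_ds", "hub://username/image_ds_v2")
```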

148 changes: 147 additions & 1 deletion hub/api/dataset.py
@@ -22,6 +22,10 @@
AuthorizationException,
)
from hub.util.storage import get_storage_and_cache_chain, storage_provider_from_path
from hub.util.compute import get_compute_provider
from hub.util.remove_cache import get_base_storage
from hub.util.cache_chain import generate_chain
from hub.core.storage.hub_memory_object import HubMemoryObject


class dataset:
@@ -258,6 +262,44 @@ def load(
        except AgreementError as e:
            raise e from None

    @staticmethod
    def rename(
        old_path: str,
        new_path: str,
        creds: Optional[dict] = None,
        token: Optional[str] = None,
    ) -> Dataset:
        """Renames dataset at `old_path` to `new_path`.
        Args:
            old_path (str): The path to the dataset to be renamed.
            new_path (str): Path to the dataset after renaming.
            creds (dict, optional): A dictionary containing credentials used to access the dataset at the path.
                This takes precedence over credentials present in the environment. Currently only works with s3 paths.
                It supports 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token', 'endpoint_url' and 'region' as keys.
            token (str, optional): Activeloop token, used for fetching credentials to the dataset at path if it is a Hub dataset. This is optional; tokens are normally autogenerated.
        Returns:
            Dataset object after renaming.
        Raises:
            DatasetHandlerError: If a dataset does not exist at the given path or if the new path points to a different directory.
        Example::
            hub.rename("hub://username/image_ds", "hub://username/new_ds")
            hub.rename("s3://mybucket/my_ds", "s3://mybucket/renamed_ds")
        """
        if creds is None:
            creds = {}

        feature_report_path(old_path, "rename", {})

        ds = hub.load(old_path, verbose=False, token=token, creds=creds)
        ds.rename(new_path)

        return ds  # type: ignore

    @staticmethod
    def delete(
        path: str,
@@ -350,13 +392,117 @@ def like(
        if isinstance(source, str):
            source_ds = dataset.load(source)

        for tensor_name in source_ds.version_state["meta"].tensors:  # type: ignore
        for tensor_name in source_ds.tensors:  # type: ignore
            destination_ds.create_tensor_like(tensor_name, source_ds[tensor_name])

        destination_ds.info.update(source_ds.info.__getstate__())  # type: ignore

        return destination_ds

    @staticmethod
    def copy(
        src: str,
        dest: str,
        overwrite: bool = False,
        src_creds=None,
        dest_creds=None,
        src_token=None,
        dest_token=None,
        num_workers: int = 0,
        scheduler="threaded",
        progress_bar=True,
        public: bool = True,
    ):
        """Copies dataset at `src` to `dest`.
        Args:
            src (str): Path to the dataset to be copied.
            dest (str): Destination path to copy to.
            overwrite (bool): If True and a dataset exists at `dest`, it will be overwritten. Defaults to False.
            src_creds (dict, optional): A dictionary containing credentials used to access the dataset at `src`.
                If aws_access_key_id, aws_secret_access_key and aws_session_token are present, they take precedence over credentials present in the environment or in the credentials file. Currently only works with s3 paths.
                It supports 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token', 'endpoint_url', 'region' and 'profile_name' as keys.
            src_token (str, optional): Activeloop token, used for fetching credentials to the dataset at `src` if it is a Hub dataset. This is optional; tokens are normally autogenerated.
            dest_creds (dict, optional): Credentials required to create / overwrite datasets at `dest`.
            dest_token (str, optional): Token used for fetching credentials to `dest`.
            num_workers (int): The number of workers to use for copying. Defaults to 0. When set to 0, it will always use serial processing, irrespective of the scheduler.
            scheduler (str): The scheduler to be used for copying. Supported values include: 'serial', 'threaded', 'processed' and 'ray'.
                Defaults to 'threaded'.
            progress_bar (bool): Displays a progress bar if True (default).
            public (bool): Defines if the dataset will have public access. Applicable only if Hub cloud storage is used and a new Dataset is being created. Defaults to True.
        Returns:
            Dataset: New dataset object.
        Raises:
            DatasetHandlerError: If a dataset already exists at the destination path and overwrite is False.
        """
        src_ds = hub.load(src, read_only=True, creds=src_creds, token=src_token)
        src_storage = get_base_storage(src_ds.storage)

        dest_storage, cache_chain = get_storage_and_cache_chain(
            dest,
            creds=dest_creds,
            token=dest_token,
            read_only=False,
            memory_cache_size=DEFAULT_MEMORY_CACHE_SIZE,
            local_cache_size=DEFAULT_LOCAL_CACHE_SIZE,
        )

        if dataset_exists(cache_chain):
            if overwrite:
                cache_chain.clear()
            else:
                raise DatasetHandlerError(
                    f"A dataset already exists at the given path ({dest}). If you want to copy to a new dataset, either specify another path or use overwrite=True."
                )

        def copy_func(keys, progress_callback=None):
            cache = generate_chain(
                src_storage,
                memory_cache_size=DEFAULT_MEMORY_CACHE_SIZE,
                local_cache_size=DEFAULT_LOCAL_CACHE_SIZE,
                path=src,
            )
            for key in keys:
                if isinstance(cache[key], HubMemoryObject):
                    dest_storage[key] = cache[key].tobytes()
                else:
                    dest_storage[key] = cache[key]
                if progress_callback:
                    progress_callback(1)

        def copy_func_with_progress_bar(pg_callback, keys):
            copy_func(keys, pg_callback)

        keys = list(src_storage._all_keys())
        len_keys = len(keys)
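        # Split the keys round-robin across the workers; with num_workers == 0
        # the whole key list stays in a single chunk and is copied serially.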
        if num_workers == 0:
            keys = [keys]
        else:
            keys = [keys[i::num_workers] for i in range(num_workers)]
        compute_provider = get_compute_provider(scheduler, num_workers)
        try:
            if progress_bar:
                compute_provider.map_with_progressbar(
                    copy_func_with_progress_bar,
                    keys,
                    len_keys,
                    "Copying dataset",
                )
            else:
                compute_provider.map(copy_func, keys)
        finally:
            compute_provider.close()

        return dataset_factory(
            path=dest,
            storage=cache_chain,
            read_only=False,
            public=public,
            token=dest_token,
        )

    @staticmethod
    def ingest(
        src,
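
A short usage sketch of the new `copy` entry point, mirroring the parameters documented above; the bucket and dataset names are placeholders, and the worker/scheduler values are illustrative rather than recommended settings.

```python
import hub

# Serial copy: num_workers=0 always copies serially, whatever the scheduler.
ds = hub.copy("hub://username/src_ds", "hub://username/dst_ds")

# Parallel copy with four threads, overwriting any existing destination dataset.
ds = hub.copy(
    "s3://source-bucket/my_ds",
    "s3://dest-bucket/my_ds",
    overwrite=True,
    num_workers=4,
    scheduler="threaded",
    progress_bar=True,
)
```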
11 changes: 9 additions & 2 deletions hub/api/read.py
@@ -33,11 +33,18 @@ def read(
>>> ds.videos.shape
(1, 136, 720, 1080, 3)
>>> image = hub.read("https://picsum.photos/200/300")
>>> image.compression
'jpeg'
>>> ds.create_tensor("images", htype="image", sample_compression="jpeg")
>>> ds.images.append(image)
>>> ds.images[0].shape
(300, 200, 3)
Supported file types:
Image: "bmp", "dib", "gif", "ico", "jpeg", "jpeg2000", "pcx", "png", "ppm", "sgi", "tga", "tiff", "webp", "wmf", "xbm"
Image: "bmp", "dib", "gif", "ico", "jpeg", "jpeg2000", "pcx", "png", "ppm", "sgi", "tga", "tiff", "webp", "wmf", "xbm"
Audio: "flac", "mp3", "wav"
Video: "mp4", "mkv", "avi"
Args: