Merge branch 'master' into feature/tensor_subset
AbhinavTuli committed Mar 12, 2021
2 parents 8b0eb8c + 58c65c4 commit 07ee267
Showing 7 changed files with 141 additions and 106 deletions.
12 changes: 5 additions & 7 deletions README.md
Expand Up @@ -129,14 +129,7 @@ Also, if you need a publicly available dataset that you cannot find in the Hub,
### Upload your dataset and access it from <ins>anywhere</ins> in 3 simple steps

1. Register a free account at [Activeloop](https://app.activeloop.ai/register/?utm_source=github&utm_medium=repo&utm_campaign=readme) and authenticate locally:
```sh
hub register
hub login

# alternatively, add username and password as arguments (use on platforms like Kaggle)
hub login -u username -p password
```
Future release will introduce the `activeloop` command. Here is the syntax for using it:
```sh
activeloop register
activeloop login
Expand Down Expand Up @@ -248,6 +241,11 @@ Using Hub? Add a README badge to let everyone know:
```
[![hub](https://img.shields.io/badge/powered%20by-hub%20-ff5a1f.svg)](https://github.com/activeloopai/Hub)
```
## Usage Tracking
By default, we collect anonymous usage data using Bugout (here's the [code](https://github.com/activeloopai/Hub/blob/853456a314b4fb5623c936c825601097b0685119/hub/__init__.py#L24) that does it). It logs only the Hub library's own actions and parameters; no user or model data is collected.

This helps the Activeloop team understand how the tool is used and how to deliver maximum value to the community by building features that matter to you. You can easily opt out of usage tracking during login.


## Disclaimers

173 changes: 113 additions & 60 deletions docs/source/simple.md
@@ -1,92 +1,101 @@
# Getting Started with Hub

### Intro
### Quickstart

Today we introduce our new API/format for hub package.

Here is some features of new hub:
1. Ability to modify datasets on fly. Datasets are no longer immutable and can be modified over time.
2. Larger datasets can now be uploaded as we removed some RAM limiting components from the hub.
3. Caching is introduced to improve IO performance.
4. Dynamic shaping enables very large images/data support. You can have large images/data stored in hub.
5. Dynamically sized datasets. You will be able to increase number of samples dynamically.
6. Tensors can be added to dataset on the fly.

Hub uses [Zarr](https://zarr.readthedocs.io/en/stable/) as a storage for chunked NumPy arrays.

### Getting Started

1. Install beta version
1. Install Hub
```
pip3 install hub
```

2. Register and authenticate to upload datasets
2. Register and authenticate to upload datasets to [Activeloop](https://app.activeloop.ai/) store
```
activeloop register
activeloop login
# alternatively, add username and password as arguments (use on platforms like Kaggle)
activeloop login -u username -p password
```
3. Load a dataset

3. Lets start by creating a dataset
```python
import numpy as np
```python
import hub

import hub
from hub.schema import ClassLabel, Image
ds = hub.Dataset("activeloop/cifar10_train")
print(ds["label", :10].compute())
print(ds["id", 1234].compute())
print(ds["image", 4321].compute())
ds.copy("./data/examples/cifar10_train")
```

my_schema = {
"image": Image((28, 28)),
"label": ClassLabel(num_classes=10),
}
4. Create a dataset
```python
import numpy as np

url = "./data/examples/new_api_intro" #instead write your {username}/{dataset} to make it public
import hub
from hub.schema import ClassLabel, Image

ds = hub.Dataset(url, shape=(1000,), schema=my_schema)
for i in range(len(ds)):
    ds["image", i] = np.ones((28, 28), dtype="uint8")
    ds["label", i] = 3
my_schema = {
"image": Image((28, 28)),
"label": ClassLabel(num_classes=10),
}

print(ds["image", 5].compute())
print(ds["label", 100:110].compute())
ds.close()
```
url = "./data/examples/quickstart" # write your {username}/{dataset_name} to make it remotely accessible

You can also transfer a dataset from TFDS.
```python
import hub
import tensorflow as tf
ds = hub.Dataset(url, shape=(1000,), schema=my_schema)
for i in range(len(ds)):
    ds["image", i] = np.ones((28, 28), dtype="uint8")
    ds["label", i] = 3

out_ds = hub.Dataset.from_tfds('mnist', split='test+train', num=1000)
res_ds = out_ds.store("username/mnist") # res_ds is now a usable hub dataset
```
print(ds["image", 5].compute())
print(ds["label", 100:110].compute())
ds.flush()
```
This code creates a dataset in the *"./data/examples/quickstart"* folder in overwrite mode. The dataset has a thousand samples, each containing an *image* and a *label*. Once the dataset is ready, you may read, write and loop over it.


You can also transfer a dataset from TFDS (as below) and convert it from/to [Tensorflow](./integrations/tensorflow.md) or [PyTorch](./integrations/pytorch.md).
```python
import hub
import tensorflow as tf

out_ds = hub.Dataset.from_tfds('mnist', split='test+train', num=1000)
res_ds = out_ds.store("username/mnist") # res_ds is now a usable hub dataset
```

### Data Storage

Every [dataset](./concepts/dataset.md) needs to specify where it is located. A Hub Dataset takes its `url` as the first positional argument.

#### Hub

If `url` parameter has the form of `username/dataset`, the dataset will be stored in our cloud storage.

```python
url = 'username/dataset'
ds = hub.Dataset(url, shape=(1000,), schema=my_schema)
```

Besides, you can also create a dataset in *S3*, *MinIO*, *Google Cloud Storage* or *Azure*.
In that case you will need to have the corresponding credentials and provide them as a `token` argument during Dataset creation. It can be a filepath to your credentials or a `dict`.
This is the default way to work with Hub datasets. You can also create or load a dataset locally or in *S3*, *MinIO*, *Google Cloud Storage* or *Azure*.
If you choose another remote storage platform, you will need to provide the corresponding credentials as a `token` argument when creating or loading the Dataset. It can be a file path to your credentials or a `dict`.
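
The two accepted forms of `token` (a file path or a `dict`) can be illustrated with a small sketch. `resolve_token` is a hypothetical helper written for this example, not part of Hub, and it assumes the credentials file is JSON, which may not hold for every provider.

```python
import json
import tempfile
from pathlib import Path
from typing import Union


def resolve_token(token: Union[str, dict]) -> dict:
    """Return credentials as a dict (hypothetical helper, not Hub's API)."""
    if isinstance(token, dict):
        return dict(token)  # copy so the caller's dict is not mutated
    # Assumption: the credentials file is JSON; real files may use other formats.
    return json.loads(Path(token).read_text())


# Demo: the same credentials supplied as a dict and as a file path.
from_dict = resolve_token({"aws_access_key_id": "..."})
with tempfile.TemporaryDirectory() as tmp:
    cred_file = Path(tmp) / "credentials.json"
    cred_file.write_text(json.dumps({"aws_access_key_id": "..."}))
    from_file = resolve_token(str(cred_file))
```

Both calls yield the same dictionary, which is why either form can be passed interchangeably.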

#### Local storage

To store datasets locally, let the `url` parameter be a local path.
```python
url = './datasets/'
ds = hub.Dataset(url, shape=(1000,), schema=my_schema)
```
#### S3
```python
url = 's3://new_dataset' # s3
url = 's3://new_dataset' # your s3 path
ds = hub.Dataset(url, shape=(1000,), schema=my_schema, token={"aws_access_key_id": "...",
"aws_secret_access_key": "...",
...})
```

#### MinIO
```python
url = 's3://new_dataset' # minio also uses `s3://` prefix
url = 's3://new_dataset' # minio also uses *s3://* prefix
ds = hub.Dataset(url, shape=(1000,), schema=my_schema, token={"aws_access_key_id": "your_minio_access_key",
"aws_secret_access_key": "your_minio_secret_key",
"endpoint_url": "your_minio_url:port",
Expand All @@ -95,41 +104,85 @@ ds = hub.Dataset(url, shape=(1000,), schema=my_schema, token={"aws_access_key_id

#### Google Cloud Storage
```python
url = 'gcs://new_dataset' # gcloud
url = 'gcs://new_dataset' # your google storage (gs://) path
ds = hub.Dataset(url, shape=(1000,), schema=my_schema, token="/path/to/credentials")
```

#### Azure
```python
url = 'https://activeloop.blob.core.windows.net/activeloop-hub/dataset' # Azure
url = 'https://activeloop.blob.core.windows.net/activeloop-hub/dataset' # Azure link
ds = hub.Dataset(url, shape=(1000,), schema=my_schema, token="/path/to/credentials")
```
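
The storage options above differ only in the form of `url`. As a rough sketch, the mapping from url to backend could look like the following; `guess_storage` is a hypothetical helper for illustration and the rules are a heuristic, not Hub's actual resolution logic.

```python
def guess_storage(url: str) -> str:
    """Heuristically label the storage backend implied by a dataset url."""
    if url.startswith("s3://"):
        return "s3"  # MinIO uses the same prefix, plus an endpoint_url in the token
    if url.startswith(("gcs://", "gs://")):
        return "gcs"
    if url.startswith("https://") and ".blob.core.windows.net/" in url:
        return "azure"
    if url.startswith(("./", "../", "/", "~")):
        return "local"
    parts = url.split("/")
    if len(parts) == 2 and all(parts):
        return "hub"  # the username/dataset form
    return "local"  # assumption: anything else is treated as a local path


examples = [
    "username/dataset",
    "./datasets/",
    "s3://new_dataset",
    "gcs://new_dataset",
    "https://activeloop.blob.core.windows.net/activeloop-hub/dataset",
]
labels = [guess_storage(u) for u in examples]
```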
### Deletion

You can delete your dataset in [app.activeloop.ai](https://app.activeloop.ai/) in a dataset overview tab.

### Notes
### Schema

[Schema](./concepts/features.md) is a dictionary that describes what a dataset consists of. Every dataset is required to have a schema. This is how you can create a simple schema:

```python
from hub.schema import ClassLabel, Image, BBox, Text

my_schema = {
'kind': ClassLabel(names=["cows", "horses"]),
'animal': Image(shape=(512, 256, 3)),
'eyes': BBox(),
'description': Text(max_shape=(100,))
}
```

### Shape

Shape is another required attribute of a dataset. It simply specifies how large a dataset is. The rules associated with shapes are derived from `numpy`. As you might have noticed, shape is a universal attribute that is also present in schemas, but there it is no longer required. If a schema does not have a well-defined shape, `max_shape` might be required.
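
To make the numpy-style convention concrete, here is a minimal sketch using plain tuples. It only illustrates how a dataset shape and a per-sample shape relate; how Hub combines them internally is not shown here and should not be inferred from it.

```python
import math

dataset_shape = (1000,)  # number of samples, as in hub.Dataset(url, shape=(1000,), ...)
image_shape = (28, 28)   # per-sample shape, as in Image((28, 28))

# Conceptually, the "image" tensor spans the dataset axis followed by the sample axes.
full_shape = dataset_shape + image_shape
total_values = math.prod(full_shape)
```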

### Dataset Access, Modification and Deletion

New hub mimics TFDS data types. Before creating dataset you have to mention the details of what type of data does it contain. This enables us to compress, process and visualize data more efficiently.
In order to access the data from the dataset, you should use `.compute()` on a portion of the dataset: `ds['key', :5].compute()`.
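
The reason for the explicit `.compute()` call is that indexing returns a lazy view rather than the data itself. The toy classes below sketch that pattern; `ToyDataset` and `LazyView` are hypothetical stand-ins and do not reflect Hub's real classes.

```python
class LazyView:
    """Remembers which key and index were requested; reads only on compute()."""

    def __init__(self, dataset, key, index):
        self._dataset = dataset
        self._key = key
        self._index = index

    def compute(self):
        return self._dataset._read(self._key, self._index)


class ToyDataset:
    """In-memory stand-in that counts how often data is actually read."""

    def __init__(self, columns):
        self._columns = columns
        self.reads = 0

    def __getitem__(self, item):
        key, index = item
        return LazyView(self, key, index)  # no data is touched yet

    def _read(self, key, index):
        self.reads += 1
        return self._columns[key][index]


ds = ToyDataset({"label": [3] * 1000})
view = ds["label", :5]       # builds a view only
reads_before = ds.reads
labels = view.compute()      # the read happens here
```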

This code creates dataset in *"./data/examples/new_api_intro"* folder with overwrite mode. Dataset has 1000 samples. In each sample there is an *image* and a *label*.
You can modify the data in the dataset with a regular assignment operator or by performing more sophisticated [transforms](./concepts/transform.md).

After this we can loop over dataset and read/write from it.
You can delete your dataset with `.delete()` or through Activeloop's app on [app.activeloop.ai](https://app.activeloop.ai/) in a dataset overview tab.


### Why flush?
### Flush, Commit and Close

Since caching is in place, you need to tell the program to push final changes to a permanent storage.
Since Hub implements caching, you need to tell the program to push the final changes to permanent storage. Hub Datasets have three methods that let you do that.

`.close()` saves changes from cache to dataset final storage and does not invalidate dataset object.
On the other hand, `.flush()` saves changes to dataset, but invalidates it.
The most fundamental method, `.flush()`, saves changes from the cache to the dataset's final storage and does not invalidate the dataset object. This means that you can continue working on your data and push it later on.

`.commit()` saves the changes into a new version of a dataset that you may go back to later on if you want to.

Alternatively you can use the following style.
In rare cases, you may also use `.close()` to invalidate the dataset object after saving the changes.

If you prefer flushing to be taken care of for you, wrap your operations on the dataset in a `with` statement in this fashion:
```python
with hub.Dataset(...) as ds:
    pass
```
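
How a `with` block can take care of flushing is sketched below with a toy write-back cache. `CachedStore` is a hypothetical class written for this example; it mirrors the behaviour described in this section (flush on exit, `close()` invalidates the object) but is not Hub's implementation.

```python
class CachedStore:
    """Toy write-back cache: writes land in a cache until flushed."""

    def __init__(self):
        self.cache = {}
        self.storage = {}
        self.closed = False

    def __setitem__(self, key, value):
        if self.closed:
            raise ValueError("store is closed")
        self.cache[key] = value

    def flush(self):
        # Push pending writes to permanent storage; the object stays usable.
        self.storage.update(self.cache)
        self.cache.clear()

    def close(self):
        # Save pending writes, then invalidate the object.
        self.flush()
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.flush()  # flushing happens automatically when the block ends
        return False  # do not swallow exceptions


store = CachedStore()
with store as ds:
    ds["label", 0] = 3
    pending = dict(ds.cache)  # still only in the cache at this point
```

After the block ends, the write has moved from `store.cache` to `store.storage` and the object can still be used.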

This works as well.
### Windows FAQ

**Q: Running `activeloop` commands results in an error with a message stating that `'activeloop' is not recognized as an internal or external command, operable program or batch file.` What should I do to use such commands?**

A: If you are having trouble running `activeloop` commands on Windows, it usually means there is an issue with your PATH environment variable, and the `activeloop` commands are just one symptom of that underlying problem. Regardless, there are several ways in which you can still use the CLI.

Option 1. You may try running hub as a module, i.e. `py -m hub` and add arguments as necessary.

Option 2. You may try adding Python scripts to your path. First, you need to find out where your Python installation is located. Start by running:
```py --list-paths```
If your Python interpreter is not on the list but you can run it (despite not knowing its path), you should paste the following excerpt to Python console to find out its location:
```python
import os
import sys
os.path.dirname(sys.executable)
```
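
As a possible shortcut, the standard-library `sysconfig` module can report the scripts directory of the running interpreter directly. This is a sketch, and it assumes `activeloop` was installed into that interpreter's default scripts location; a `--user` install, for example, may place it elsewhere.

```python
import sysconfig

# Directory where console scripts are installed for this interpreter
# (on Windows this is typically the "Scripts" folder of the installation).
scripts_dir = sysconfig.get_path("scripts")
print(scripts_dir)
```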

Once you know the path to the directory with the Python version you are using, adapt it to match the pattern in the command below. If you are unsure whether it is correct, check if the path exists. Finally, run this command in the command prompt (CMD):
<pre>
setx /m PATH "%PATH%;C:\<i>path\to\Python</i>\Python3<i>X</i>\Scripts\"
</pre>

Then refresh your CMD with:
```
start & exit
```
Now you should be able to run `activeloop` commands.
4 changes: 4 additions & 0 deletions hub/__main__.py
@@ -0,0 +1,4 @@
from hub.cli.command import cli

if __name__ == "__main__":
    cli(prog_name="activeloop")
18 changes: 0 additions & 18 deletions hub/cli/utils.py

This file was deleted.

4 changes: 2 additions & 2 deletions hub/client/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,7 @@
from hub import config
from hub.log import logger
from hub.client.token_manager import TokenManager
from hub.cli.utils import get_cli_version
from hub.version import __version__

from hub.exceptions import (
AuthenticationException,
Expand Down Expand Up @@ -58,7 +58,7 @@ def request(
endpoint = config.HUB_REST_ENDPOINT

request_url = urljoin(endpoint, relative_url)
headers["hub-cli-version"] = get_cli_version()
headers["hub-cli-version"] = __version__
if (
"Authorization" not in headers
or headers["Authorization"] != self.auth_header
1 change: 1 addition & 0 deletions requirements-dev.txt
Expand Up @@ -4,3 +4,4 @@ flake8==3.8.4
black==20.8b1
ray==1.2.0
cloudpickle>=1.6.0,<2
cachey>=0.2.1
35 changes: 16 additions & 19 deletions requirements.txt
@@ -1,19 +1,16 @@
click>=6.7,<8
numpy>=1.13.0,<2
requests>=2,<3
cachey==0.2.1
fsspec==0.8.5
s3fs==0.4.2
gcsfs==0.6.2
outdated==0.2.0
lz4>=3,<4
zarr==2.6.1
boto3==1.17.20
tqdm==4.54.1
azure-storage-blob==12.6.0
pathos>=0.2.2
psutil>=5.7.3
Pillow>=8.0.1
cloudpickle==1.6.0
msgpack==1.0.2
humbug==0.1.4
click>=6.7, <8
numpy>=1.17, <2
requests>=2, <3
fsspec>=0.8, <1
gcsfs>=0.6.2, <0.7 # newer versions fail tests #97
s3fs==0.4.2, <0.5.2 # newer versions require Python 3.7+
boto3==1.17.22
lz4>=3, <4
zarr>=2.4, <2.7
tqdm>=4.1, <5
azure-storage-blob>=12, <13
pathos>=0.2, <0.3
humbug>=0.1.4, <0.2
Pillow>=6
msgpack>=0.6
psutil>=5.8 # needed only for deprecated code
