Merge branch 'master' into feature/tensor_subset

activeloopai · Mar 12, 2021 · 07ee267
2 parents 8b0eb8c + 58c65c4
Showing 7 changed files with 141 additions and 106 deletions.
diff --git a/README.md b/README.md
@@ -129,14 +129,7 @@ Also, if you need a publicly available dataset that you cannot find in the Hub,
 ### Upload your dataset and access it from <ins>anywhere</ins> in 3 simple steps
 
 1. Register a free account at [Activeloop](https://app.activeloop.ai/register/?utm_source=github&utm_medium=repo&utm_campaign=readme) and authenticate locally:
-    ```sh
-    hub register
-    hub login
 
-    # alternatively, add username and password as arguments (use on platforms like Kaggle)
-    hub login -u username -p password
-    ```
-    Future release will introduce the `activeloop` command. Here is the syntax for using it:
     ```sh
     activeloop register
     activeloop login
@@ -248,6 +241,11 @@ Using Hub? Add a README badge to let everyone know:
 ```
 [![hub](https://img.shields.io/badge/powered%20by-hub%20-ff5a1f.svg)](https://github.com/activeloopai/Hub)
 ```
+## Usage Tracking
+By default, we collect anonymous usage data using Bugout (here's the [code](https://github.com/activeloopai/Hub/blob/853456a314b4fb5623c936c825601097b0685119/hub/__init__.py#L24) that does it). It logs only the Hub library's own actions and parameters; no user or model data is collected.
+
+This helps the Activeloop team understand how the tool is used and how to deliver maximum value to the community by building features that matter to you. You can easily opt out of usage tracking during login.
+
 
 ## Disclaimers
 

diff --git a/docs/source/simple.md b/docs/source/simple.md
@@ -1,92 +1,101 @@
 # Getting Started with Hub
 
-### Intro
+### Quickstart
 
-Today we introduce our new API/format for hub package.  
-
-Here is some features of new hub:
-1. Ability to modify datasets on fly. Datasets are no longer immutable and can be modified over time.
-2. Larger datasets can now be uploaded as we removed some RAM limiting components from the hub.
-3. Caching is introduced to improve IO performance.
-4. Dynamic shaping enables very large images/data support. You can have large images/data stored in hub. 
-5. Dynamically sized datasets. You will be able to increase number of samples dynamically.
-6. Tensors can be added to dataset on the fly.
-
-Hub uses [Zarr](https://zarr.readthedocs.io/en/stable/) as a storage for chunked NumPy arrays.
-
-### Getting Started
-
-1. Install beta version
+1. Install Hub
     ```
     pip3 install hub
     ```
 
-2. Register and authenticate to upload datasets
+2. Register and authenticate to upload datasets to the [Activeloop](https://app.activeloop.ai/) store
     ```
     activeloop register
     activeloop login
     
     # alternatively, add username and password as arguments (use on platforms like Kaggle)
     activeloop login -u username -p password
     ```
+3. Load a dataset
 
-3. Lets start by creating a dataset
-```python
-import numpy as np
+    ```python
+    import hub
 
-import hub
-from hub.schema import ClassLabel, Image
+    ds = hub.Dataset("activeloop/cifar10_train")
+    print(ds["label", :10].compute())
+    print(ds["id", 1234].compute())
+    print(ds["image", 4321].compute())
+    ds.copy("./data/examples/cifar10_train")
+    ```
 
-my_schema = {
-    "image": Image((28, 28)),
-    "label": ClassLabel(num_classes=10),
-}
+4. Create a dataset
+    ```python
+    import numpy as np
 
-url = "./data/examples/new_api_intro" #instead write your {username}/{dataset} to make it public
+    import hub
+    from hub.schema import ClassLabel, Image
 
-ds = hub.Dataset(url, shape=(1000,), schema=my_schema)
-for i in range(len(ds)):
-    ds["image", i] = np.ones((28, 28), dtype="uint8")
-    ds["label", i] = 3
+    my_schema = {
+        "image": Image((28, 28)),
+        "label": ClassLabel(num_classes=10),
+    }
 
-print(ds["image", 5].compute())
-print(ds["label", 100:110].compute())
-ds.close()
-```
+    url = "./data/examples/quickstart" # write your {username}/{dataset_name} to make it remotely accessible
 
-You can also transfer a dataset from TFDS.
-```python
-import hub
-import tensorflow as tf
+    ds = hub.Dataset(url, shape=(1000,), schema=my_schema)
+    for i in range(len(ds)):
+        ds["image", i] = np.ones((28, 28), dtype="uint8")
+        ds["label", i] = 3
 
-out_ds = hub.Dataset.from_tfds('mnist', split='test+train', num=1000)
-res_ds = out_ds.store("username/mnist") # res_ds is now a usable hub dataset
-```
+    print(ds["image", 5].compute())
+    print(ds["label", 100:110].compute())
+    ds.flush()
+    ```
+    This code creates a dataset in the *"./data/examples/quickstart"* folder in overwrite mode. The dataset has a thousand samples, each containing an *image* and a *label*. Once the dataset is ready, you may read, write and loop over it.
+
+
+    You can also transfer a dataset from TFDS (as below) and convert it from/to [TensorFlow](./integrations/tensorflow.md) or [PyTorch](./integrations/pytorch.md).
+    ```python
+    import hub
+    import tensorflow as tf
+
+    out_ds = hub.Dataset.from_tfds('mnist', split='test+train', num=1000)
+    res_ds = out_ds.store("username/mnist") # res_ds is now a usable hub dataset
+    ```
 
 ### Data Storage
 
+Every [dataset](./concepts/dataset.md) needs to specify where it is located. A Hub Dataset takes this location, its `url`, as the first positional argument.
+
 #### Hub
 
 If `url` parameter has the form of `username/dataset`, the dataset will be stored in our cloud storage.
 
 ```python
 url = 'username/dataset'
+ds = hub.Dataset(url, shape=(1000,), schema=my_schema)
 ```
 
-Besides, you can also create a dataset in *S3*, *MinIO*, *Google Cloud Storage* or *Azure*.
-In that case you will need to have the corresponding credentials and provide them as a `token` argument during Dataset creation. It can be a filepath to your credentials or a `dict`.
+This is the default way to work with Hub datasets. You can also create or load a dataset locally or in *S3*, *MinIO*, *Google Cloud Storage* or *Azure*.
+If you choose one of these other remote storage platforms, you will need to provide the corresponding credentials as a `token` argument when creating or loading the Dataset. The token can be a filepath to your credentials or a `dict`.
+
+#### Local storage
 
+To store datasets locally, let the `url` parameter be a local path.
+```python
+url = './datasets/'
+ds = hub.Dataset(url, shape=(1000,), schema=my_schema)
+```
 #### S3
  ```python
-url = 's3://new_dataset'  # s3
+url = 's3://new_dataset'  # your s3 path
 ds = hub.Dataset(url, shape=(1000,), schema=my_schema, token={"aws_access_key_id": "...",
                                                               "aws_secret_access_key": "...",
                                                               ...})
 ```
 
 #### MinIO
 ```python
-url = 's3://new_dataset'  # minio also uses `s3://` prefix
+url = 's3://new_dataset'  # minio also uses *s3://* prefix
 ds = hub.Dataset(url, shape=(1000,), schema=my_schema, token={"aws_access_key_id": "your_minio_access_key",
                                                               "aws_secret_access_key": "your_minio_secret_key",
                                                               "endpoint_url": "your_minio_url:port",
@@ -95,41 +104,85 @@ ds = hub.Dataset(url, shape=(1000,), schema=my_schema, token={"aws_access_key_id
 
 #### Google Cloud Storage
 ```python
-url = 'gcs://new_dataset' # gcloud
+url = 'gcs://new_dataset' # your google storage (gs://) path
 ds = hub.Dataset(url, shape=(1000,), schema=my_schema, token="/path/to/credentials")
 ```
 
 #### Azure
 ```python
-url = 'https://activeloop.blob.core.windows.net/activeloop-hub/dataset' # Azure
+url = 'https://activeloop.blob.core.windows.net/activeloop-hub/dataset' # Azure link
 ds = hub.Dataset(url, shape=(1000,), schema=my_schema, token="/path/to/credentials")
 ```
-### Deletion
 
-You can delete your dataset in [app.activeloop.ai](https://app.activeloop.ai/) in a dataset overview tab.
 
-### Notes
+### Schema
+
+[Schema](./concepts/features.md) is a dictionary that describes what a dataset consists of. Every dataset is required to have a schema. This is how you can create a simple schema:
+
+```python
+from hub.schema import ClassLabel, Image, BBox, Text
+
+my_schema = {
+    'kind': ClassLabel(names=["cows", "horses"]),
+    'animal': Image(shape=(512, 256, 3)),
+    'eyes': BBox(),
+    'description': Text(max_shape=(100,))
+}
+```
+
+### Shape
+
+Shape is another required attribute of a dataset. It specifies how large the dataset is, and its rules are derived from `numpy`. As you might have noticed, shape also appears in schema entries, but there it is not required. If a schema entry does not have a well-defined shape, `max_shape` might be required.
+
+### Dataset Access, Modification and Deletion
 
-New hub mimics TFDS data types. Before creating dataset you have to mention the details of what type of data does it contain. This enables us to compress, process and visualize data more efficiently.
+In order to access the data from the dataset, you should use `.compute()` on a portion of the dataset: `ds['key', :5].compute()`.
 
-This code creates dataset in *"./data/examples/new_api_intro"* folder with overwrite mode. Dataset has 1000 samples. In each sample there is an *image* and a *label*.
+You can modify the data in the dataset with a regular assignment operator or by performing more sophisticated [transforms](./concepts/transform.md).
 
-After this we can loop over dataset and read/write from it.
+You can delete your dataset with `.delete()` or through Activeloop's app at [app.activeloop.ai](https://app.activeloop.ai/), in the dataset overview tab.
 
 
-### Why flush?
+### Flush, Commit and Close
 
-Since caching is in place, you need to tell the program to push final changes to a permanent storage. 
+Since Hub implements caching, you need to tell the program to push the final changes to permanent storage. Hub Datasets have three methods that let you do that.
 
-`.close()` saves changes from cache to dataset final storage and does not invalidate dataset object.
-On the other hand, `.flush()` saves changes to dataset, but invalidates it.
+The most fundamental method, `.flush()`, saves changes from the cache to the dataset's final storage and does not invalidate the dataset object. This means that you can continue working on your data and push further changes later on.
 
+`.commit()` saves the changes into a new version of a dataset that you may go back to later on if you want to.
 
-Alternatively you can use the following style.
+In rare cases, you may also use `.close()` to invalidate the dataset object after saving the changes.
 
+If you prefer flushing to be taken care of for you, wrap your operations on the dataset in a `with` statement in this fashion:
 ```python
 with hub.Dataset(...) as ds:
     pass
 ```
 
-This works as well.
+### Windows FAQ
+
+**Q: Running `activeloop` commands results in an error with a message stating that `'activeloop' is not recognized as an internal or external command, operable program or batch file.` What should I do to use such commands?**
+
+A: If you are having trouble running `activeloop` commands on Windows, it usually means there is an issue with your PATH environment variable, and the `activeloop` commands are just one symptom of that underlying problem. Regardless, there are several ways in which you can still use the CLI.
+
+Option 1. You may try running hub as a module, i.e. `py -m hub` and add arguments as necessary.
+
+Option 2. You may try adding Python scripts to your path. First, you need to find out where your Python installation is located. Start by running:
+```py --list-paths```
+If your Python interpreter is not on the list but you can run it (despite not knowing its path), paste the following excerpt into a Python console to find out its location:
+```python
+import os
+import sys
+os.path.dirname(sys.executable)
+```
+
+Once you know the path to the directory with the Python version you are using, adapt it to match the pattern in the command below. If you are unsure whether it is correct, check if the path exists. Finally, run this command in the command prompt (CMD):
+<pre>
+setx /m PATH "%PATH%;C:\<i>path\to\Python</i>\Python3<i>X</i>\Scripts\"
+</pre>
+
+Then refresh your CMD with:
+```
+start & exit
+```
+Now, you should be able to run activeloop commands.
diff --git a/hub/__main__.py b/hub/__main__.py
@@ -0,0 +1,4 @@
+from hub.cli.command import cli
+
+if __name__ == "__main__":
+    cli(prog_name="activeloop")
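Review note: the new `hub/__main__.py` is what makes `py -m hub` (Option 1 in the Windows FAQ above) work, and passing `prog_name="activeloop"` keeps help and usage text showing `activeloop` rather than the module path. The sketch below illustrates that effect with the standard library's `argparse` instead of the project's actual CLI framework, so it runs without Hub installed. The `build_cli` helper and its subcommands are hypothetical stand-ins, not Hub's real command definitions.

```python
import argparse


def build_cli(prog_name):
    """Build a hypothetical stand-in CLI; not Hub's real command tree."""
    parser = argparse.ArgumentParser(prog=prog_name)
    commands = parser.add_subparsers(dest="command")
    commands.add_parser("register")
    login = commands.add_parser("login")
    login.add_argument("-u", "--username")
    login.add_argument("-p", "--password")
    return parser


# Without an explicit program name, usage text would be derived from
# sys.argv[0]; fixing it keeps the displayed command name stable no
# matter how the module was launched.
parser = build_cli("activeloop")
args = parser.parse_args(["login", "-u", "username", "-p", "password"])

print(parser.format_usage().strip())
print(args.command, args.username)
```

Under these assumptions, the usage line starts with `usage: activeloop` regardless of whether the entry point was a console script or `-m`.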
diff --git a/hub/cli/utils.py b/hub/cli/utils.py
diff --git a/hub/client/base.py b/hub/client/base.py
@@ -10,7 +10,7 @@
 from hub import config
 from hub.log import logger
 from hub.client.token_manager import TokenManager
-from hub.cli.utils import get_cli_version
+from hub.version import __version__
 
 from hub.exceptions import (
     AuthenticationException,
@@ -58,7 +58,7 @@ def request(
             endpoint = config.HUB_REST_ENDPOINT
 
         request_url = urljoin(endpoint, relative_url)
-        headers["hub-cli-version"] = get_cli_version()
+        headers["hub-cli-version"] = __version__
         if (
             "Authorization" not in headers
             or headers["Authorization"] != self.auth_header

diff --git a/requirements-dev.txt b/requirements-dev.txt
@@ -4,3 +4,4 @@ flake8==3.8.4
 black==20.8b1
 ray==1.2.0
 cloudpickle>=1.6.0,<2
+cachey>=0.2.1
diff --git a/requirements.txt b/requirements.txt
@@ -1,19 +1,16 @@
-click>=6.7,<8
-numpy>=1.13.0,<2 
-requests>=2,<3
-cachey==0.2.1
-fsspec==0.8.5
-s3fs==0.4.2
-gcsfs==0.6.2
-outdated==0.2.0
-lz4>=3,<4
-zarr==2.6.1
-boto3==1.17.20
-tqdm==4.54.1
-azure-storage-blob==12.6.0
-pathos>=0.2.2
-psutil>=5.7.3
-Pillow>=8.0.1
-cloudpickle==1.6.0
-msgpack==1.0.2
-humbug==0.1.4
+click>=6.7, <8
+numpy>=1.17, <2
+requests>=2, <3
+fsspec>=0.8, <1
+gcsfs>=0.6.2, <0.7  # newer versions fail tests #97
+s3fs==0.4.2, <0.5.2  # newer versions require Python 3.7+
+boto3==1.17.22
+lz4>=3, <4
+zarr>=2.4, <2.7
+tqdm>=4.1, <5
+azure-storage-blob>=12, <13
+pathos>=0.2, <0.3
+humbug>=0.1.4, <0.2
+Pillow>=6
+msgpack>=0.6
+psutil>=5.8  # needed only for deprecated code
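Review note: in a requirement specifier, comma-separated clauses must all hold at once, so `s3fs==0.4.2, <0.5.2` still admits only version 0.4.2 and the upper bound adds nothing. The sketch below checks this with a deliberately tiny stdlib-only matcher. The `parse_version` and `satisfies` helpers are hypothetical, handle only plain dotted integer versions, and are not a substitute for pip's full PEP 440 handling.

```python
import operator
import re

# Comparison operators supported by this sketch (a small subset of PEP 440).
_OPS = {
    "==": operator.eq,
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
}


def parse_version(text):
    """Turn '0.4.2' into (0, 4, 2); trailing zeros are dropped so 2.7 == 2.7.0."""
    parts = [int(part) for part in text.strip().split(".")]
    while parts and parts[-1] == 0:
        parts.pop()
    return tuple(parts)


def satisfies(version, spec):
    """Return True only if `version` meets every comma-separated clause."""
    candidate = parse_version(version)
    for clause in spec.split(","):
        match = re.fullmatch(r"\s*(==|>=|<=|>|<)\s*([0-9.]+)\s*", clause)
        if match is None:
            raise ValueError(f"unsupported clause: {clause!r}")
        op, bound = match.groups()
        if not _OPS[op](candidate, parse_version(bound)):
            return False
    return True


S3FS_SPEC = "==0.4.2, <0.5.2"  # copied from requirements.txt in this commit
ZARR_SPEC = ">=2.4, <2.7"      # copied from requirements.txt in this commit

print(satisfies("0.4.2", S3FS_SPEC))  # only this exact version passes
print(satisfies("0.5.0", S3FS_SPEC))  # rejected by the == clause
print(satisfies("2.6.1", ZARR_SPEC))  # inside the allowed range
```

If a range was intended for `s3fs`, the first clause would presumably need to be `>=0.4.2`; that is a guess about intent, not something the commit states.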