# Run CleanVision on a dataset in the cloud

CleanVision can also run its checks on a dataset stored in cloud storage. It currently supports S3, Google Cloud Storage, and Azure Blob Storage. Before using CleanVision on a cloud dataset, access to the relevant storage service needs to be configured on your machine. If it is not already configured, the recommended way for each service is:

- S3 (`"s3://"`): `aws configure` (same as `boto` configuration).
- Google Cloud Storage (`"gs://"`): `gcloud auth login`.
- Azure Blob Storage (`"az://"`): pass your secrets in `storage_opts`. This also works for S3 and Google Cloud Storage but is generally not recommended.

Internally, CleanVision uses the `fsspec` library to support the cloud storage services above. See the [fsspec documentation](https://filesystem-spec.readthedocs.io/en/latest/) to learn more.

If you are running in the cloud, the secrets for dataset access may already exist in your environment. In this case, you may not need to do anything.
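
As a rough way to check whether credentials are already present in the environment, the sketch below looks for a few commonly used environment variables. The helper `find_cloud_credentials` is hypothetical and not part of CleanVision. `GOOGLE_APPLICATION_CREDENTIALS` is the variable Google's client libraries conventionally read, and the AWS and Azure names are the ones used later in this notebook. Credentials can also come from config files or instance roles, so an empty result does not necessarily mean that access will fail.

```python
import os

# Environment variables commonly used for each backend. This mapping is an
# assumption for illustration; config files and instance roles also work.
CREDENTIAL_VARS = {
    "s3": ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"),
    "gs": ("GOOGLE_APPLICATION_CREDENTIALS",),
    "az": ("AZURE_STORAGE_ACCOUNT_NAME", "AZURE_STORAGE_ACCOUNT_KEY"),
}


def find_cloud_credentials(env=None):
    """Return the protocols whose credential variables are all set and non-empty."""
    env = os.environ if env is None else env
    return [
        protocol
        for protocol, names in CREDENTIAL_VARS.items()
        if all(env.get(name) for name in names)
    ]


# Lists the protocols for which this environment appears to hold credentials.
print(find_cloud_credentials())
```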

### Install optional dependencies

In [None]:
!pip install -U pip
!pip install "cleanvision[s3] @ git+https://github.com/cleanlab/cleanvision.git"  # for AWS S3
# !pip install "cleanvision[azure] @ git+https://github.com/cleanlab/cleanvision.git"  # for Azure Blob Storage
# !pip install "cleanvision[gcs] @ git+https://github.com/cleanlab/cleanvision.git"  # for Google Cloud Storage

**After you install these packages, you may need to restart your notebook runtime before running the rest of this notebook.**
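
If it is unclear which optional extra a dataset path needs, the mapping between URL prefix and extra can be sketched as follows. The helper `pip_requirement_for` is hypothetical and only illustrates the correspondence between the prefixes (`s3://`, `gs://`, `az://`) and the extras (`s3`, `gcs`, `azure`) used in the install commands above.

```python
from urllib.parse import urlparse

# URL scheme -> optional extra, taken from the install commands above.
EXTRAS_BY_SCHEME = {"s3": "s3", "gs": "gcs", "az": "azure"}


def pip_requirement_for(data_path):
    """Return the pip requirement string matching the scheme of data_path."""
    scheme = urlparse(data_path).scheme
    if scheme not in EXTRAS_BY_SCHEME:
        raise ValueError(f"Unsupported or missing cloud scheme: {scheme!r}")
    extra = EXTRAS_BY_SCHEME[scheme]
    return f"cleanvision[{extra}] @ git+https://github.com/cleanlab/cleanvision.git"


# Prints the requirement string to pass to `pip install` for an S3 path.
print(pip_requirement_for("s3://amazon-berkeley-objects/images/small/aa/"))
```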

In [None]:
from cleanvision.imagelab import Imagelab

### Set dataset path and optional arguments

To demonstrate CleanVision, we use the [Amazon Berkeley Objects (ABO)](https://amazon-berkeley-objects.s3.amazonaws.com/index.html#) dataset, which is publicly available on [S3](https://amazon-berkeley-objects.s3.amazonaws.com/index.html#aws). Here we access the dataset over an anonymous connection. For a private dataset, you can supply the credentials through a config file or environment variables and set `anon=False`.

In [None]:
cloud_path = "s3://amazon-berkeley-objects/images/small/aa/"
storage_opts = {"anon": True}

### Run CleanVision

In [None]:
imagelab = Imagelab(data_path=cloud_path, storage_opts=storage_opts)
imagelab.find_issues()

In [None]:
imagelab.report()

### Explicitly authenticating with credentials
Authentication can also be done by passing the appropriate credentials in the optional argument `storage_opts`.

We use `python-dotenv` for better handling of the secrets.

Create a `.env` file in the root directory with the following contents:

```
AZURE_STORAGE_ACCOUNT_NAME=XXXXX # your account name
AZURE_STORAGE_ACCOUNT_KEY=XXXX # storage account key to your storage account
```

It's also possible to pass the S3 credentials in the same way:
```
AWS_ACCESS_KEY_ID=XXXX
AWS_SECRET_ACCESS_KEY=XXX
```

Install `python-dotenv`:

```shell
pip install python-dotenv
```

The following code snippet shows how to supply credentials explicitly for Azure Blob Storage.

```python
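import os

from cleanvision.imagelab import Imagelab
from dotenv import load_dotenv

# Load the variables defined in the .env file into the process environment
load_dotenv()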
# the container is called `main` and it contains a folder called `test-dataset`
cloud_path = "az://main/test-dataset/"
ACCOUNT_KEY = os.environ.get("AZURE_STORAGE_ACCOUNT_KEY")
ACCOUNT_NAME = os.environ.get("AZURE_STORAGE_ACCOUNT_NAME")
storage_opts = {"account_name": ACCOUNT_NAME, "account_key": ACCOUNT_KEY}
imagelab = Imagelab(data_path=cloud_path, storage_opts=storage_opts)
imagelab.find_issues()
imagelab.report()
```
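
The same pattern can be sketched for S3 using the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` variables shown earlier. This is a minimal sketch, assuming the S3 backend behind `fsspec` is `s3fs`, which accepts `key` and `secret` options. The helper `s3_storage_opts` and the bucket path in the usage comment are hypothetical.

```python
import os


def s3_storage_opts(env=None):
    """Build storage options for a private S3 bucket from environment variables.

    Assumes the s3fs option names `key` and `secret`; this helper is
    hypothetical and not part of CleanVision.
    """
    env = os.environ if env is None else env
    key = env.get("AWS_ACCESS_KEY_ID")
    secret = env.get("AWS_SECRET_ACCESS_KEY")
    if not key or not secret:
        raise KeyError("AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY must both be set")
    return {"key": key, "secret": secret, "anon": False}


# Usage (requires cleanvision[s3], python-dotenv, and a real private bucket;
# "s3://my-bucket/my-dataset/" is a placeholder path):
# from dotenv import load_dotenv
# from cleanvision.imagelab import Imagelab
# load_dotenv()
# imagelab = Imagelab(data_path="s3://my-bucket/my-dataset/", storage_opts=s3_storage_opts())
# imagelab.find_issues()
```

Failing early when a variable is missing avoids a less obvious authentication error later, once the dataset is actually being read.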