# Run Cleanvision on a cloud dataset

### Before you run

If you're running locally, you may need to configure access to your cloud resource first.

- S3 (`"s3://"`): the recommended way is to use `aws configure` (same as `boto` configuration).
- Google Cloud Storage (`"gs://"`): the recommended way is to use `gcloud auth application-default login`.
- Azure Blob Storage (`"az://"`): you need to pass your secrets in the `storage_opts` argument. This also works for S3 and Google Cloud Storage, but is generally not recommended there.

See the [fsspec](https://filesystem-spec.readthedocs.io/en/latest/) documentation to learn more.


If you are running in the cloud, the secrets for dataset access may already exist in your environment. In that case, you may not need to do anything.

In [None]:
!pip install -U pip
!pip install "cleanvision[s3]"  # use "cleanvision[azure]" for Azure Blob Storage

**After you install these packages, you may need to restart your notebook runtime before running the rest of this notebook.**

In [None]:
from cleanvision.imagelab import Imagelab

### Implicit authentication through config or environment variables

In [None]:
# Access an open dataset (Amazon Berkeley Objects) hosted in S3
cloud_path = "s3://amazon-berkeley-objects/images/small/aa/"

imagelab = Imagelab(data_path=cloud_path)

In [None]:
imagelab.find_issues()

In [None]:
imagelab.report()

### Explicit credential authentication
Alternatively, you can authenticate by passing the appropriate credentials to the optional argument `storage_opts`.

We use `python-dotenv` to load the secrets from a `.env` file instead of hard-coding them in the notebook.

Create a `.env` file with the following contents:

```
AZURE_STORAGE_ACCOUNT_NAME=XXXXX # your account name
AZURE_STORAGE_ACCOUNT_KEY=XXXX # storage account key to your storage account
```

It's also possible to pass the S3 credentials in the same way:
```
AWS_ACCESS_KEY_ID=XXXX
AWS_SECRET_ACCESS_KEY=XXX
```
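
As a rough sketch, these variables could be turned into `storage_opts` for S3 as shown below. The helper name `s3_storage_opts` is hypothetical, and the `key`/`secret` option names are the ones `s3fs` accepts. The sketch assumes that `storage_opts` is forwarded unchanged to the underlying fsspec filesystem.

```python
import os
from typing import Mapping


def s3_storage_opts(env: Mapping[str, str] = os.environ) -> dict:
    """Map the AWS variables from `.env` to s3fs-style options (hypothetical helper)."""
    return {
        "key": env["AWS_ACCESS_KEY_ID"],  # raises KeyError if the variable is unset
        "secret": env["AWS_SECRET_ACCESS_KEY"],
    }


# Example with placeholder values instead of real credentials
example = s3_storage_opts({"AWS_ACCESS_KEY_ID": "XXXX", "AWS_SECRET_ACCESS_KEY": "XXX"})
print(sorted(example))  # ['key', 'secret']
```

Once `load_dotenv()` has run, the result could be passed as `Imagelab(data_path=cloud_path, storage_opts=s3_storage_opts())`.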

In [None]:
!pip install python-dotenv

For example, here we authenticate with Azure Blob Storage, accessing a storage account with its account key. `cloud_path` contains the path to the dataset, from the container down to the folder level.
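
Note that `os.environ.get` returns `None` when a variable is missing, so a misconfigured `.env` file may only surface later as an authentication error. A minimal sketch of a fail-fast check follows. It uses a hypothetical helper `require_env`, which is not part of cleanvision or python-dotenv, and a made-up variable name for the demonstration.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set; check your .env file")
    return value


# Demonstration with a made-up variable so real credentials are left untouched
os.environ["DEMO_ACCOUNT_NAME"] = "XXXXX"
print(require_env("DEMO_ACCOUNT_NAME"))  # XXXXX
```

The same helper could replace the `os.environ.get(...)` calls when reading `AZURE_STORAGE_ACCOUNT_NAME` and `AZURE_STORAGE_ACCOUNT_KEY`.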

In [None]:
import os
from dotenv import load_dotenv

load_dotenv()
# the container is called `main` and it contains a folder called `test-dataset`
cloud_path = "az://main/test-dataset/"
ACCOUNT_KEY = os.environ.get("AZURE_STORAGE_ACCOUNT_KEY")
ACCOUNT_NAME = os.environ.get("AZURE_STORAGE_ACCOUNT_NAME")
storage_opts = {"account_name": ACCOUNT_NAME, "account_key": ACCOUNT_KEY}
imagelab = Imagelab(data_path=cloud_path, storage_opts=storage_opts)
imagelab.find_issues()

In [None]:
imagelab.report()