# Upload a local datafile to add or replace a Dataset in a Collection

This notebook uploads a local datafile to a given Collection (identified by its Collection uuid), where the datafile becomes a Dataset accessible via the Data Portal UI.

In order to use this script, you must...
- have a Curation API key (obtained from the dropdown in the upper right-hand corner of the Data Portal UI after logging in)
- know the id of the Collection to which you wish to upload the datafile (taken from `/collections/<collection_id>` in the url path when viewing the Collection in the Data Portal UI)

**For new Dataset uploads**:
- You must decide upon a string tag (the `curator_tag`) to use to uniquely identify the resultant Dataset within its Collection. This tag must *NOT* be the tag of an existing Dataset within this Collection (read on below), and it must _NOT_ conform to the uuid format.

**For replacing/updating existing Datasets**:
- Uploads to a curator tag for which there already exists a Dataset in the given Collection will result in the existing Dataset being replaced by the new Dataset created from the datafile that you are uploading.
- Alternatively, while not necessarily recommended, an existing dataset _may_ be targeted for replacement by using the Dataset's Cellxgene uuid as the tag when writing to S3.
- You can only add/replace Datasets in private Collections or revision Collections.

For all uploads, the `.h5ad` suffix must be appended to the tag in the S3 write key. See example below.
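
The tag rules above can be checked before uploading. The following is a minimal sketch, not part of the original script: `classify_curator_tag` and `UUID_PATTERN` are hypothetical names, and the sketch assumes that "uuid format" means the canonical 8-4-4-4-12 hexadecimal layout. It only inspects the string; it cannot tell whether a tag already exists in the Collection.

```python
import re

# Assumption: "uuid format" means the canonical 8-4-4-4-12 hexadecimal layout.
UUID_PATTERN = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)


def classify_curator_tag(tag):
    """Return "uuid" if the tag (minus '.h5ad') looks like a uuid, else "tag".

    Raises ValueError when the required '.h5ad' suffix is missing.
    """
    suffix = ".h5ad"
    if not tag.endswith(suffix):
        raise ValueError(f"curator_tag must end with '{suffix}': {tag!r}")
    stem = tag[: -len(suffix)]
    return "uuid" if UUID_PATTERN.match(stem) else "tag"


print(classify_curator_tag("arbitrary/tag/chosen-by-you323.h5ad"))
print(classify_curator_tag("7c6b3ba5-ddb3-465c-adcf-4596c481b994.h5ad"))
```

A result of `"uuid"` means the upload would target an existing Dataset by its uuid, so a tag intended for a new Dataset should always classify as `"tag"`.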

#### <font color='#bc00b0'>Please fill in the required values:</font>

<font color='#bc00b0'>(Required) Provide the path to your api key file</font>

In [102]:
api_key_file_path = "../../dev_key"

<font color='#bc00b0'>(Required) Provide the absolute path to the h5ad datafile to upload</font>

In [103]:
datafile_path = "/Users/danielhegeman/Downloads/valid_schema_2.0.0.h5ad"

<font color='#bc00b0'>(Required) Enter your chosen `curator_tag`, which will serve as a unique identifier _within this Collection_ for the resultant Dataset. **Must possess the '.h5ad' suffix**.</font>
    
_We recommend using a tagging scheme that 1) makes sense to you, and 2) will help organize and facilitate your 
automation of future uploads for adding new Datasets and replacing existing Datasets._

In [104]:
curator_tag = "arbitrary/tag/chosen-by-you323.h5ad"

<font color='#bc00b0'>(Required) Enter the uuid of the Collection to which you wish to add this datafile as a Dataset</font>

_The Collection uuid can be found by looking at the url path in the address bar 
when viewing your Collection in the UI of the Data Portal website:_ `collections/{collection_id}`_. You can only add/replace Datasets in private Collections or revision Collections (and not public ones)._

In [105]:
collection_id = "7c6b3ba5-ddb3-465c-adcf-4596c481b994"
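
If you prefer to paste the full address from the browser rather than copy the id by hand, the id can be pulled out of the url path. This is a small sketch under the assumption that the path contains a `/collections/<collection_id>` segment as described above; `collection_id_from_url` is a hypothetical helper, not part of the Curation API.

```python
import re
from urllib.parse import urlparse


def collection_id_from_url(url):
    """Extract the path segment that follows '/collections/' in a Data Portal url."""
    path = urlparse(url).path
    match = re.search(r"/collections/([^/]+)", path)
    if match is None:
        raise ValueError(f"No '/collections/<collection_id>' segment in {url!r}")
    return match.group(1)


example_url = (
    "https://cellxgene.dev.single-cell.czi.technology"
    "/collections/7c6b3ba5-ddb3-465c-adcf-4596c481b994"
)
print(collection_id_from_url(example_url))
```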

### Import dependencies

In [106]:
import boto3
import os
import requests
import threading
from botocore.credentials import RefreshableCredentials
from botocore.session import get_session
from datetime import datetime, timezone

### Import API url helper functions and set url env vars

In [107]:
%run api_url_env_vars_python.ipynb
set_url_env_vars(env="dev")  # or "staging"

Set 'site_url' env var to https://cellxgene.dev.single-cell.czi.technology
Set 'api_url_base' env var to https://api.cellxgene.dev.single-cell.czi.technology


### Import access token helper function and set access token env var

In [108]:
%run access_token_python.ipynb
set_access_token_env_var(api_key_file_path)

Response status code: 201
Successfully set 'access_token' env var!


### Define the method for retrieving temporary s3 write credentials. These credentials will only work for _this_ Collection.

In [109]:
s3_credentials_path = f"/curation/v1/collections/{collection_id}/datasets/s3-upload-credentials"
s3_credentials_url = f"{os.getenv('api_url_base')}{s3_credentials_path}"
s3_cred_headers = {"Authorization": f"Bearer {os.getenv('access_token')}"}

time_zone_info = datetime.now(timezone.utc).astimezone().tzinfo

def retrieve_s3_credentials_and_path():
    return requests.post(s3_credentials_url, headers=s3_cred_headers).json()

def s3_refreshable_creds_cb():
    res_data = retrieve_s3_credentials_and_path()
    s3_creds = res_data.get("Credentials")
    s3_creds_formatted = {
        "access_key": s3_creds.get("AccessKeyId"),
        "secret_key": s3_creds.get("SecretAccessKey"),
        "token": s3_creds.get("SessionToken"),
        "expiry_time": datetime.fromtimestamp(s3_creds.get("Expiration")).replace(tzinfo=time_zone_info).isoformat(),
    }
    print("Retrieved/refreshed s3 credentials")
    return s3_creds_formatted
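
The reshaping step inside the callback can be separated from the network call, which makes it easy to check in isolation. The sketch below assumes, as the callback above does, that `Expiration` is a POSIX timestamp in seconds. It differs from the callback in one respect: it builds the timezone-aware timestamp directly in UTC instead of attaching the local time zone. `format_s3_credentials` is a hypothetical name and the sample values are placeholders.

```python
from datetime import datetime, timezone


def format_s3_credentials(res_data):
    """Map the endpoint's 'Credentials' object onto the keys botocore expects."""
    creds = res_data["Credentials"]
    # Assumption: 'Expiration' is a POSIX timestamp in seconds.
    expiry = datetime.fromtimestamp(creds["Expiration"], tz=timezone.utc)
    return {
        "access_key": creds["AccessKeyId"],
        "secret_key": creds["SecretAccessKey"],
        "token": creds["SessionToken"],
        "expiry_time": expiry.isoformat(),
    }


# Placeholder response, for illustration only
sample_response = {
    "Credentials": {
        "AccessKeyId": "EXAMPLE_KEY_ID",
        "SecretAccessKey": "EXAMPLE_SECRET",
        "SessionToken": "EXAMPLE_TOKEN",
        "Expiration": 1700000000,
    }
}
print(format_s3_credentials(sample_response)["expiry_time"])
```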

### Define callback method for logging upload progress

In [110]:
def get_progress_cb():
    # Total size of the file being uploaded, used to compute the percentage
    filesize = os.path.getsize(datafile_path)
    lock = threading.Lock()
    uploaded_bytes = 0
    prev_percent = 0

    def progress_cb(num_bytes):
        nonlocal uploaded_bytes
        nonlocal prev_percent
        should_update_progress_printout = False

        # The callback may be invoked from several transfer threads at once
        with lock:
            uploaded_bytes += num_bytes
            percent_of_total_upload = float("{:.1f}".format(uploaded_bytes / filesize * 100))
            if percent_of_total_upload > prev_percent:
                should_update_progress_printout = True
            prev_percent = percent_of_total_upload

        if should_update_progress_printout:
            print(f"{percent_of_total_upload}% uploaded\r", end="")

    return progress_cb
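
The lock matters because boto3 may call the progress callback from several transfer threads, each reporting the number of bytes it has just sent. The sketch below illustrates the same locking idea without touching S3: `ProgressTracker` and `simulate_upload` are hypothetical names, and the "upload" is simulated by worker threads that only report chunk sizes.

```python
import threading


class ProgressTracker:
    """Accumulates byte counts reported from multiple threads."""

    def __init__(self, total_bytes):
        self.total_bytes = total_bytes
        self.uploaded_bytes = 0
        self._lock = threading.Lock()

    def __call__(self, num_bytes):
        # Guard the read-modify-write so concurrent reports are not lost
        with self._lock:
            self.uploaded_bytes += num_bytes
            percent = self.uploaded_bytes / self.total_bytes * 100
        print(f"{percent:.1f}% uploaded\r", end="")


def simulate_upload(total_bytes, chunk_size, num_threads):
    """Report chunks from several threads and return the total that was counted."""
    tracker = ProgressTracker(total_bytes)
    chunks_per_thread = total_bytes // (chunk_size * num_threads)

    def worker():
        for _ in range(chunks_per_thread):
            tracker(chunk_size)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return tracker.uploaded_bytes


counted = simulate_upload(total_bytes=40_000, chunk_size=100, num_threads=4)
print(f"\nCounted {counted} bytes")
```

A callable object like this could be passed as `Callback` in place of the closure; the closure form used in this notebook keeps the state private without defining a class.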

### Use credential endpoint to retrieve formatted upload path

In [111]:
creds_and_path = retrieve_s3_credentials_and_path()
bucket, key_prefix = creds_and_path["Bucket"], creds_and_path["UploadKeyPrefix"]
upload_key = key_prefix + curator_tag
print(f"Full S3 write path is s3://{bucket}/{upload_key}")

Full S3 write path is s3://cellxgene-dataset-submissions-dev/google-oauth2|115898590228662878630/7c6b3ba5-ddb3-465c-adcf-4596c481b994/arbitrary/tag/chosen-by-you323.h5ad
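
If the credentials request fails (for example because the access token has expired), the response will presumably not contain the `Bucket` and `UploadKeyPrefix` fields, and the cell above would stop with a bare `KeyError`. A hedged sketch of a more explicit check follows. `build_upload_key` is a hypothetical helper, the sample prefix is a placeholder, and the shape of an error response is an assumption.

```python
def build_upload_key(creds_and_path, curator_tag):
    """Return (bucket, key) for the upload, failing early on an unexpected response."""
    missing = [k for k in ("Bucket", "UploadKeyPrefix") if k not in creds_and_path]
    if missing:
        raise KeyError(f"Credentials response is missing: {', '.join(missing)}")
    # The key is the prefix returned by the endpoint followed by the curator tag
    return creds_and_path["Bucket"], creds_and_path["UploadKeyPrefix"] + curator_tag


# Placeholder response, for illustration only
sample = {
    "Bucket": "cellxgene-dataset-submissions-dev",
    "UploadKeyPrefix": "example-user/7c6b3ba5-ddb3-465c-adcf-4596c481b994/",
}
bucket_name, key = build_upload_key(sample, "arbitrary/tag/chosen-by-you323.h5ad")
print(f"s3://{bucket_name}/{key}")
```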


### Upload file using temporary s3 credentials

In [112]:
def upload_local_datafile():
    # Credentials that botocore refreshes automatically via the callback when they near expiry
    session_creds = RefreshableCredentials.create_from_metadata(
        metadata=s3_refreshable_creds_cb(),
        refresh_using=s3_refreshable_creds_cb,
        method="sts-assume-role-with-web-identity",
    )
    session = get_session()
    session._credentials = session_creds
    boto3_session = boto3.Session(botocore_session=session)
    s3 = boto3_session.client("s3")

    try:
        print(f"Uploading {datafile_path} to Collection {collection_id} with tag '{curator_tag}'...")
        s3.upload_file(
            Filename=datafile_path,
            Bucket=bucket,
            Key=upload_key,
            Callback=get_progress_cb(),
        )
    except Exception as e:
        print("\n\n\033[1m\033[38;5;9mFAILED\033[0m")  # 'FAILED' in bold red
        print(f"\n\n{e}")
    else:
        print("\n\n\033[1m\033[38;5;10mSUCCESS\033[0m")  # 'SUCCESS' in bold green
        print(f"\nFile {datafile_path} successfully uploaded to Collection {collection_id} with tag '{curator_tag}'")

upload_local_datafile()

Retrieved/refreshed s3 credentials
Uploading /Users/danielhegeman/Downloads/valid_schema_2.0.0.h5ad to Collection 7c6b3ba5-ddb3-465c-adcf-4596c481b994 with tag 'arbitrary/tag/chosen-by-you323.h5ad'...
100.0% uploaded

SUCCESS

File /Users/danielhegeman/Downloads/valid_schema_2.0.0.h5ad successfully uploaded to Collection 7c6b3ba5-ddb3-465c-adcf-4596c481b994 with tag 'arbitrary/tag/chosen-by-you323.h5ad'
