# Upload a local datafile to add or replace a Dataset in a Collection

This script uploads a local datafile to a given Collection (identified by its Collection uuid), where the datafile becomes a Dataset accessible via the Data Portal UI. To use this script, you need a Curation API key, the id of the Collection to which you wish to upload the datafile, and a string tag (the `curator_tag`) that uniquely identifies the resultant Dataset within this Collection going forward.
- Uploads to a tag that has not been used yet in the given Collection will result in a new Dataset being created.
- Uploads to a tag for which there already exists a Dataset in the given Collection will result in the existing Dataset being replaced by the new Dataset created from the datafile that you are uploading.
- You can only add/replace Datasets in private Collections or revision Collections.

#### <font color='#bc00b0'>Please fill in the required values:</font>

<font color='#bc00b0'>(Required) Provide the path to your api key file</font>

In [120]:
api_key_file = "path/to/api-key.txt"

<font color='#bc00b0'>(Required) Provide the absolute path to the h5ad datafile to upload</font>

In [121]:
filename = "/absolute/path/to-datafile.h5ad"

<font color='#bc00b0'>(Required) Enter your chosen `curator_tag`, which will serve as a unique identifier _within this Collection_ for the resultant Dataset. **Must possess the '.h5ad' suffix**.</font>
    
_We recommend using a tagging scheme that 1) makes sense to you, and 2) will help organize and facilitate your 
automation of future uploads for adding new Datasets and replacing existing Datasets._

In [122]:
curator_tag = "arbitrary/tag/chosen-by-you.h5ad"

<font color='#bc00b0'>(Required) Enter the uuid of the Collection to which you wish to add this datafile as a Dataset</font>

_The Collection uuid can be found by looking at the url path in the address bar 
when viewing your Collection in the UI of the Data Portal website:_ `collections/{collection_id}`_. You can only add/replace Datasets in private Collections or revision Collections (and not public ones)._

In [123]:
collection_id = "01234567-89ab-cdef-0123-456789abcdef"
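Before any network call is made, it can help to check the two identifiers locally. The following is a minimal sketch, not part of the original script: `build_upload_key` is a hypothetical helper name, and it assumes only what this notebook states, namely that the tag must end in `.h5ad` and that the upload key has the form `{collection_id}/{curator_tag}`.

```python
import uuid


def build_upload_key(collection_id: str, curator_tag: str) -> str:
    """Validate the inputs and return the key used for the upload.

    Hypothetical helper for illustration; the key layout mirrors the
    `Key=f"{collection_id}/{curator_tag}"` argument used later in this notebook.
    """
    # uuid.UUID raises ValueError if the string is not a well-formed uuid
    uuid.UUID(collection_id)
    if not curator_tag.endswith(".h5ad"):
        raise ValueError("curator_tag must end with the '.h5ad' suffix")
    return f"{collection_id}/{curator_tag}"
```

Calling `build_upload_key(collection_id, curator_tag)` right after filling in the values surfaces a malformed uuid or a missing suffix immediately, rather than after the upload has started.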

### Import dependencies

In [None]:
import boto3
import os
import requests
import threading
from botocore.credentials import RefreshableCredentials
from botocore.session import get_session
from datetime import datetime, timezone

### Use API key to obtain a temporary access token

In [125]:
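# Read the key with a context manager so the file handle is closed promptly
with open(api_key_file) as api_key_fh:
    api_key = api_key_fh.read().strip()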
access_token_headers = {"x-api-key": api_key}
access_token_url = "https://api.cellxgene.dev.single-cell.czi.technology/curation/v1/auth/token"
resp = requests.post(access_token_url, headers=access_token_headers)
access_token = resp.json().get("access_token")

##### (optional, debug) verify status code of response

In [None]:
print(resp.status_code)
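Instead of inspecting the status code by eye, the check can be made explicit. This is a hedged sketch: `extract_access_token` is a hypothetical helper, and it assumes only that a successful response has a 2xx status and carries the token under the `access_token` key, as read in the cell above.

```python
def extract_access_token(status_code: int, payload: dict) -> str:
    """Return the access token, or raise with a readable message.

    Hypothetical helper; the "access_token" key is the one this notebook reads.
    """
    if not 200 <= status_code < 300:
        raise RuntimeError(f"Token request failed with HTTP status {status_code}")
    token = payload.get("access_token")
    if not token:
        raise RuntimeError("Token response did not contain 'access_token'")
    return token
```

It could be used as `access_token = extract_access_token(resp.status_code, resp.json())`, so that a bad API key fails here rather than later as an authorization error.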

### Define the method for retrieving temporary s3 write credentials. These credentials will only work for _this_ Collection.

In [127]:
s3_credentials_url = f"https://api.cellxgene.dev.single-cell.czi.technology/curation/v1/collections/{collection_id}/datasets/s3-upload-credentials"
s3_cred_headers = {"Authorization": f"Bearer {access_token}"}

def retrieve_s3_credentials():
    resp = requests.post(s3_credentials_url, headers=s3_cred_headers)
    resp.raise_for_status()  # fail here rather than on a missing "Credentials" key
    s3_creds = resp.json().get("Credentials")
    s3_creds_formatted = {
        "access_key": s3_creds.get("AccessKeyId"),
        "secret_key": s3_creds.get("SecretAccessKey"),
        "token": s3_creds.get("SessionToken"),
        # Convert the epoch timestamp directly to an aware UTC datetime; this avoids
        # pairing a naive local time with an offset captured at a different moment
        "expiry_time": datetime.fromtimestamp(s3_creds.get("Expiration"), tz=timezone.utc).isoformat(),
    }
    print("Retrieved/refreshed s3 credentials")
    return s3_creds_formatted
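The `expiry_time` entry is passed to botocore as a timezone-aware ISO 8601 string. The conversion can be sketched on its own as follows, assuming `Expiration` is a Unix timestamp in seconds (which the `datetime.fromtimestamp` call above implies); `format_expiry` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def format_expiry(epoch_seconds: float) -> str:
    """Convert a Unix timestamp to a timezone-aware ISO 8601 string in UTC."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()


print(format_expiry(0))  # 1970-01-01T00:00:00+00:00
```

Passing `tz=timezone.utc` yields an aware datetime in a single step, so the result does not depend on the local timezone of the machine running the notebook.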

### Define callback method for logging upload progress

In [128]:
def get_progress_cb():
    lock = threading.Lock()
    uploaded_bytes = 0
    prev_percent = 0

    def progress_cb(num_bytes):
        nonlocal uploaded_bytes
        nonlocal prev_percent
        should_update_progress_printout = False

        # `filesize` is a global that is set in the upload cell, before the upload starts
        with lock:  # the lock is released even if the computation raises
            uploaded_bytes += num_bytes
            percent_of_total_upload = round(uploaded_bytes / filesize * 100, 1)
            if percent_of_total_upload > prev_percent:
                should_update_progress_printout = True
            prev_percent = percent_of_total_upload

        if should_update_progress_printout:
            print(f"{percent_of_total_upload}% uploaded\r", end="")
            
    return progress_cb
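The callback above reads the total file size from a global variable. A variant that takes the size as an argument can be exercised without performing an upload. This is a sketch under stated assumptions: `make_progress_cb` and its `report` parameter are hypothetical names introduced here, not part of the original script.

```python
import threading


def make_progress_cb(total_bytes, report=print):
    """Return a thread-safe callback that reports upload progress in percent.

    Hypothetical variant of get_progress_cb; `report` receives each message
    and defaults to print.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    lock = threading.Lock()
    uploaded_bytes = 0
    prev_percent = 0.0

    def progress_cb(num_bytes):
        nonlocal uploaded_bytes, prev_percent
        with lock:  # the callback may be invoked from several threads
            uploaded_bytes += num_bytes
            percent = round(uploaded_bytes / total_bytes * 100, 1)
            changed = percent > prev_percent
            prev_percent = percent
        if changed:
            report(f"{percent}% uploaded")

    return progress_cb


# Simulate three chunks of a 200-byte upload and collect the messages
messages = []
cb = make_progress_cb(200, report=messages.append)
for chunk in (50, 50, 100):
    cb(chunk)
print(messages)  # ['25.0% uploaded', '50.0% uploaded', '100.0% uploaded']
```

Passing the size explicitly removes the dependency on cell execution order, and the injectable `report` function makes the callback straightforward to test.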

### Upload file using temporary s3 credentials

In [None]:
session_creds = RefreshableCredentials.create_from_metadata(
    metadata=retrieve_s3_credentials(),
    refresh_using=retrieve_s3_credentials,
    method="sts-assume-role",
)
session = get_session()
session._credentials = session_creds
boto3_session = boto3.Session(botocore_session=session)
s3 = boto3_session.client("s3")

filesize = os.path.getsize(filename)

try:
    print(f"Uploading {filename} to Collection {collection_id} with tag '{curator_tag}'...")
    s3.upload_file(
        Filename=filename,
        Bucket="cellxgene-dataset-submissions-dev",
        Key=f"{collection_id}/{curator_tag}",
        Callback=get_progress_cb(),
    )
except Exception as e:
    print("\n\n\033[1m\033[38;5;9mFAILED\033[0m")  # 'FAILED' in bold red
    print(f"\n\n{e}")
else:
    print("\n\n\033[1m\033[38;5;10mSUCCESS\033[0m")  # 'SUCCESS' in bold green
    print(f"\nFile {filename} successfully uploaded to Collection {collection_id} with tag {curator_tag}")