# Upload a local datafile to add or replace a Dataset in a Collection

_**The sample code in this notebook limits S3 upload durations to 12 hours. If you think your large file upload may take longer than that, please use the `upload_local_datafile` function, whose underlying code supports unlimited upload duration, as seen in the `python/create_dataset_from_local_file.ipynb` notebook.**_

The script in this notebook performs the upload of a local datafile to a given Collection (as identified by its Collection id), where the datafile becomes a Dataset accessible via the Data Portal UI.

In order to use this script, you must...
- have a Curation API key (obtained from the upper-right dropdown in the Data Portal UI after logging in)
- know the id of the Collection to which you wish to upload the datafile (taken from `/collections/<collection_id>` in the URL path in the Data Portal UI when viewing the Collection)

_For **NEW** Dataset uploads_:
- You must create a `dataset_id` to use to uniquely identify the Dataset within its Collection.

_For **replacing/updating** existing Datasets_:
- Uploads to a `dataset_id` for which there already exists a Dataset in the given Collection will result in the existing Dataset being replaced by the new Dataset created from the datafile that you are uploading.
- Alternatively, an existing dataset may be targeted for replacement by using the Dataset's Cellxgene id as the identifier when writing to S3.


You can only add/replace Datasets in _private_ Collections or _private revisions_ of published Collections.

See examples of _add_ vs _replace_ behavior with different identifiers:

```
identifier = "new_unused_tag"
# A new Dataset with curator tag 'new_unused_tag' is created from the local datafile and is added to the given Collection

identifier = "existing/Dataset_tag"
# The existing Dataset with curator tag 'existing/Dataset_tag' in the given Collection gets replaced by a new Dataset created from the local datafile

identifier = "abcdef01-2345-6789-abcd-ef0123456789"
# The existing Dataset with id 'abcdef01-2345-6789-abcd-ef0123456789' gets replaced. If no Dataset with this id exists in the given Collection, no action is taken.
```
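
The add-versus-replace rules above depend on whether the identifier is a Dataset id (a UUID) or a curator tag. The following is a minimal sketch of that distinction in base R. The helper name `describe_identifier` and the regular expression are illustrative assumptions, not part of the Curation API, and the server's actual rule for telling the two apart may differ.

```r
# Hypothetical helper (not part of the Curation API): guess how an identifier
# would be treated, based only on whether it is shaped like a UUID.
uuid_pattern <- "^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"

describe_identifier <- function(identifier) {
  if (grepl(uuid_pattern, tolower(identifier))) {
    # UUID-shaped: per the rules above, can only target an existing Dataset
    "dataset id: replace only"
  } else {
    # Anything else is assumed to be a curator tag
    "curator tag: add or replace"
  }
}

print(describe_identifier("new_unused_tag"))
print(describe_identifier("abcdef01-2345-6789-abcd-ef0123456789"))
```

A check like this can catch a mistyped Dataset id before upload, since a malformed UUID would fall through to the curator tag branch here.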

### Import dependencies

In [None]:
library("readr")
library("aws.s3")
library("httr")
library("stringr")

#### <font color='#bc00b0'>Please fill in the required values:</font>

<font color='#bc00b0'>(Required) Provide the path to your api key file</font>

In [None]:
api_key_file_path <- "path/to/api-key.txt"

<font color='#bc00b0'>(Required) Provide the absolute path to the h5ad datafile to upload</font>

In [1]:
filename <- "/absolute/path/to-datafile.h5ad"
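
Because this notebook limits S3 upload durations to 12 hours, it can help to sanity-check the datafile before starting. The sketch below is a hypothetical pre-flight check in base R: the function name `check_datafile` and the default throughput of 10 MB/s are assumptions chosen for illustration, so substitute a figure measured on your own connection.

```r
# Hypothetical pre-flight check; the function name and the throughput figure
# are illustrative assumptions, not part of this notebook or the API.
check_datafile <- function(path, assumed_mb_per_sec = 10, limit_hours = 12) {
  if (!file.exists(path)) {
    stop("File not found: ", path)
  }
  if (tolower(tools::file_ext(path)) != "h5ad") {
    warning("File does not have an .h5ad extension: ", path)
  }
  size_mb <- file.size(path) / 1024^2
  # Rough estimate only: size divided by the assumed sustained throughput
  est_hours <- size_mb / assumed_mb_per_sec / 3600
  list(size_mb = size_mb, est_hours = est_hours,
       within_limit = est_hours < limit_hours)
}

# Demonstration on a small temporary file standing in for a real datafile
demo_path <- tempfile(fileext = ".h5ad")
writeBin(raw(1024), demo_path)
result <- check_datafile(demo_path)
print(result$within_limit)
```

In practice you would call `check_datafile(filename)` on the path set above and switch to the Python notebook's approach if the estimate approaches the limit.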

<font color='#bc00b0'>(Required) Enter your chosen `identifier` (see 'identifier' behavior rules in heading above) which will serve as a unique identifier _within this Collection_ for the resultant Dataset.</font>
    
When using curator tags, we recommend using a tagging scheme that 1) makes sense to you, and 2) will help organize and facilitate your automation of future uploads for adding new Datasets and replacing existing Datasets. Remember that curator tags can be used as the identifier when _adding or replacing_ Datasets, whereas Dataset ids (UUIDs) can only be used as the identifier when _replacing_ Datasets.

In [2]:
identifier <- "arbitrary/tag/chosen-by-you"  # Or "<dataset_id>"

<font color='#bc00b0'>(Required) Enter the id of the Collection to which you wish to add this datafile as a Dataset</font>

_The Collection id can be found by looking at the URL path in the address bar when viewing your Collection in the UI of the Data Portal website:_ `collections/{collection_id}`_. You can only add/replace Datasets in private Collections or private revisions of published Collections. In order to edit a published Collection, you must first create a revision of that Collection._

In [None]:
collection_id <- "01234567-89ab-cdef-0123-456789abcdef"

### Specify domain (and API url)

In [None]:
domain_name <- "cellxgene.cziscience.com"
site_url <- str_interp("https://${domain_name}")
api_url_base <- str_interp("https://api.${domain_name}")

### Use API key to obtain a temporary access token

In [None]:
# Read the API key and strip surrounding whitespace (e.g. a trailing newline)
api_key <- trimws(read_file(api_key_file_path))
access_token_path <- "/curation/v1/auth/token"
access_token_url <- str_interp("${api_url_base}${access_token_path}")
res <- POST(url=access_token_url, add_headers(`x-api-key`=api_key))
stop_for_status(res)
access_token <- content(res)$access_token

##### (optional, debug) verify status code of response

In [None]:
print(res$status_code)

### Retrieve temporary S3 write credentials

In [None]:
s3_credentials_path <- str_interp("/curation/v1/collections/${collection_id}/datasets/s3-upload-credentials")
url <- str_interp("${api_url_base}${s3_credentials_path}")
bearer_token <- str_interp("Bearer ${access_token}")
res <- GET(url=url, add_headers(`Authorization`=bearer_token, `Content-Type`="application/json"))
stop_for_status(res)
res_content <- content(res)
access_key_id <- res_content$Credentials$AccessKeyId
secret_access_key <- res_content$Credentials$SecretAccessKey
session_token <- res_content$Credentials$SessionToken
upload_path <- res_content$UploadPath

### Extract formatted upload path from credentials endpoint response

In [None]:
bucket <- res_content$Bucket
key_prefix <- res_content$UploadKeyPrefix
upload_key <- paste0(key_prefix, identifier)
print(str_interp("Full S3 write path is s3://${bucket}/${upload_key}"))

### Upload file using temporary AWS S3 credentials

In [None]:
Sys.setenv(
    "AWS_ACCESS_KEY_ID" = access_key_id,
    "AWS_SECRET_ACCESS_KEY" = secret_access_key,
    "AWS_SESSION_TOKEN" = session_token,
    "AWS_DEFAULT_REGION" = "us-west-2"
)
put_object(file=filename, object=upload_key, bucket=bucket)
