# Copy Datasets Using Globus


Each attendee has access to their own Guest Collection, which is a shared endpoint that can be used to host and share data.


* Explain how to log in
* Explain how to find the Guest Collection
* Explain how to configure permissions on each folder

Specify the UUID of your Guest Collection in `my_collection_uuid`.

For this tutorial, we assume you're putting all of the datasets under a single folder. A single folder isn't required; it just makes things easier.

Define your collection's base folder using the `my_collection_folder` variable. Reasonable names include "datasets", "data", or "repository". These folder names are just for hierarchical organization.

In [22]:
# Rick's example collection
my_collection_uuid = '85017645-30ef-4519-abbb-a73811b914b7'
my_collection_folder = '/datasets/'

## Authenticate with Globus

Use the Globus SDK to copy the example datasets. The first time you run the `copydataset` method, you'll be prompted to log in to Globus to obtain the tokens needed to copy the data and upload the file manifest for each dataset.

In [23]:
import json
import copydataset

We copy the dataset by providing the name of the dataset, the UUID of the destination Guest Collection, and the folder we want it copied to.

The method returns the base URL of the Guest Collection and the manifest of the files that were copied. The URL will be the same for every dataset we copy because they all go to the same Guest Collection. The method also writes a copy of the manifest to the local directory as `<dataset>-manifest.json`.

In [24]:
url, cmb_manifest = copydataset.copydataset('cmb', my_collection_uuid, my_collection_folder)

Login Here:

https://auth.globus.org/v2/oauth2/authorize?client_id=1dc53da9-4f45-43b2-b75f-54368fed256c&redirect_uri=https%3A%2F%2Fauth.globus.org%2Fv2%2Fweb%2Fauth-code&scope=urn%3Aglobus%3Aauth%3Ascope%3Atransfer.api.globus.org%3Aall+https%3A%2F%2Fauth.globus.org%2Fscopes%2F85017645-30ef-4519-abbb-a73811b914b7%2Fhttps&state=_default&response_type=code&code_challenge=NYx1o56wKYLSaueLtErFYErccMjR2se4rwRtLQkJkec&code_challenge_method=S256&access_type=online


Please enter the code you get after login here:  Wa9JSTPFYyExlqg5Vdrtn2DF3ftvlj



The base URL of collection 85017645-30ef-4519-abbb-a73811b914b7 is https://g-053b28.c2d0f8.bd7c.data.globus.org

Waiting for 14191da8-7ac4-11ef-91c0-f95c6daa0003 to complete copying the data
Task has finished, creating the manifest

Uploading (HTTP PUT) manifest.json to https://g-053b28.c2d0f8.bd7c.data.globus.org/datasets/cmb/manifest.json
manfiest PUT to https://g-053b28.c2d0f8.bd7c.data.globus.org/datasets/cmb/manifest.json with status 200
Manifest file info (HTTP HEAD):
	Content-Length: 2747
	Content-Type: application/json

A copy of the manifest has been stored locally as cmb-manifest.json


The URL returned is the base URL of your Guest Collection. This will be used later.

In [25]:
print(url)

https://g-053b28.c2d0f8.bd7c.data.globus.org


Let's look at a couple of the entries in the file manifest.

In [26]:
print(json.dumps(cmb_manifest[:2], indent=2))

[
  {
    "filename": "cmb_023GHz.fits",
    "length": 2367360,
    "url": "https://g-053b28.c2d0f8.bd7c.data.globus.org/datasets/cmb/cmb_023GHz.fits",
    "sha256": "fc7036188f700f8c420fcb5ca59b6ef20e64d0030c276902b705db984d8ca93f"
  },
  {
    "filename": "cmb_100GHz.fits",
    "length": 2367360,
    "url": "https://g-053b28.c2d0f8.bd7c.data.globus.org/datasets/cmb/cmb_100GHz.fits",
    "sha256": "fc7036188f700f8c420fcb5ca59b6ef20e64d0030c276902b705db984d8ca93f"
  }
]
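
Each manifest entry carries a `length` and a `sha256`, which is enough to check a downloaded file for corruption. The following is a minimal sketch of that check, not part of `copydataset`; the helper name `verify_entry` and the stand-in entry are hypothetical and built in memory so the example runs without network access.

```python
import hashlib


def verify_entry(entry, data):
    """Return True if the bytes match the entry's length and sha256 fields."""
    if len(data) != entry["length"]:
        return False
    return hashlib.sha256(data).hexdigest() == entry["sha256"]


# Hypothetical stand-in for a manifest entry and the bytes of its file.
payload = b"example payload"
entry = {
    "filename": "example.bin",
    "length": len(payload),
    "url": "https://example.org/datasets/example/example.bin",
    "sha256": hashlib.sha256(payload).hexdigest(),
}

print(verify_entry(entry, payload))         # intact data
print(verify_entry(entry, payload + b"x"))  # altered data
```

With a real entry, `data` would be the bytes fetched from the entry's `url`.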


Now we can copy the other two datasets. You won't need to log in again because the tokens have been cached in `~/.cheapandfair.json`.

In [27]:
url, dust_manifest = copydataset.copydataset('dust', my_collection_uuid, my_collection_folder)
url, synch_manifest = copydataset.copydataset('synch', my_collection_uuid, my_collection_folder)

The base URL of collection 85017645-30ef-4519-abbb-a73811b914b7 is https://g-053b28.c2d0f8.bd7c.data.globus.org

Waiting for 248bf8cc-7ac4-11ef-b35e-a1206a7ee65f to complete copying the data
Task has finished, creating the manifest

Uploading (HTTP PUT) manifest.json to https://g-053b28.c2d0f8.bd7c.data.globus.org/datasets/dust/manifest.json
manfiest PUT to https://g-053b28.c2d0f8.bd7c.data.globus.org/datasets/dust/manifest.json with status 200
Manifest file info (HTTP HEAD):
	Content-Length: 2580
	Content-Type: application/json

A copy of the manifest has been stored locally as dust-manifest.json
The base URL of collection 85017645-30ef-4519-abbb-a73811b914b7 is https://g-053b28.c2d0f8.bd7c.data.globus.org

Waiting for 2c0be6f2-7ac4-11ef-b6c6-6d7d1acfb36d to complete copying the data
Task has finished, creating the manifest

Uploading (HTTP PUT) manifest.json to https://g-053b28.c2d0f8.bd7c.data.globus.org/datasets/synch/manifest.json
manfiest PUT to https://g-053b28.c2d0f8.bd7c.data.

We can see that the manifests were saved locally.

In [28]:
!ls *.json

cmb-manifest.json   dust-manifest.json  synch-manifest.json
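
The local manifest copies can also be inspected from Python. The sketch below assumes each `<dataset>-manifest.json` file holds the same JSON list of entries shown earlier (each with a `length` field); the helper name `summarize_manifests` is hypothetical, and the demo writes a made-up manifest to a temporary directory so it does not touch your real files.

```python
import json
import tempfile
from pathlib import Path


def summarize_manifests(directory):
    """Map each dataset name to (file count, total bytes) from *-manifest.json files."""
    summary = {}
    suffix = "-manifest.json"
    for path in sorted(Path(directory).glob("*" + suffix)):
        entries = json.loads(path.read_text())
        dataset = path.name[: -len(suffix)]
        summary[dataset] = (len(entries), sum(e["length"] for e in entries))
    return summary


# Demo with a made-up manifest; real manifests live in your working directory.
with tempfile.TemporaryDirectory() as tmp:
    demo = [
        {"filename": "a.fits", "length": 10},
        {"filename": "b.fits", "length": 32},
    ]
    (Path(tmp) / "demo-manifest.json").write_text(json.dumps(demo))
    demo_summary = summarize_manifests(tmp)

print(demo_summary)
```

Calling `summarize_manifests('.')` in the notebook's working directory would summarize the three manifests listed above, if the layout assumption holds.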


Finally, let's save details about the endpoint in a file so that we can use them in the following steps: the Guest Collection UUID, the domain for HTTPS access, and the root folder of the datasets relative to the root of the Guest Collection.
The file will be used both in a bash script and in Python, so make sure there are no spaces around the `=` sign (bash does not allow them in variable assignments):

In [30]:
# We need to check the path on this, where will they be working? Should they cd to this directory earlier to save the manifests?
%cd cheapandfair-template/

[Errno 2] No such file or directory: 'cheapandfair-template/'
/Users/rpwagner/tmp/cheapandfair-gateways-2024-rpwagner/cheapandfair-template


In [31]:
with open('ENDPOINT.sh', 'w') as f:
    f.write(f'UUID={my_collection_uuid}\n')
    f.write(f'FOLDER={my_collection_folder}\n')
    f.write(f'DOMAIN={url}\n')

In [32]:
!cat ENDPOINT.sh

UUID=85017645-30ef-4519-abbb-a73811b914b7
FOLDER=/datasets/
DOMAIN=https://g-053b28.c2d0f8.bd7c.data.globus.org
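
Bash can read this file with `source ENDPOINT.sh`. On the Python side, one possible way to read the same `KEY=VALUE` lines is sketched below; the helper name `parse_endpoint` is hypothetical, and the sample text mirrors the file contents shown above rather than reading from disk.

```python
def parse_endpoint(text):
    """Parse KEY=VALUE lines (no spaces around '=') into a dict."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        settings[key] = value
    return settings


# Sample text matching the ENDPOINT.sh written above.
sample = (
    "UUID=85017645-30ef-4519-abbb-a73811b914b7\n"
    "FOLDER=/datasets/\n"
    "DOMAIN=https://g-053b28.c2d0f8.bd7c.data.globus.org\n"
)
settings = parse_endpoint(sample)

# Rebuild the HTTPS location of a dataset manifest from the saved values.
manifest_url = settings["DOMAIN"] + settings["FOLDER"] + "cmb/manifest.json"
print(manifest_url)
```

In the notebook you would pass `open('ENDPOINT.sh').read()` instead of `sample`.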
