# Globus and funcX

This notebook uses Globus and funcX to automate uploading data to NeSI, running a job on NeSI and then copying the results back.

Requirements:

* Globus account
* NeSI account

Authentication/setup Steps:

1. Start a FuncX endpoint on NeSI
2. Create a Globus guest collection on NeSI
3. Globus authentication on local machine
4. Start funcX client locally
5. Connect to our Globus guest collection on NeSI
6. Configure HTTPS uploads/downloads for our NeSI guest collection

Processing Steps:

7. Transfer input data to NeSI using Globus
8. Run the workflow using funcX
9. Copy results back using Globus

The tokens generated during step 3 on the local machine are stored in a file and reused, so you should only need to authenticate the first time you run this notebook.

References:

* [Globus tutorial](https://globus-sdk-python.readthedocs.io/en/stable/tutorial.html)
* [funcX endpoint documentation](https://funcx.readthedocs.io/en/latest/endpoints.html)
* [fair-research-login](https://github.com/fair-research/native-login)

## 1. Start a funcx endpoint on NeSI

### Install and configure funcx endpoint if you have not done it before

Connect to a Mahuika login node by SSH and run the following commands to install funcx:

```sh
ssh mahuika
module load Python
pip install --user funcx funcx_endpoint
funcx-endpoint configure
```

During the final command you will be asked to authenticate with Globus Auth so that your endpoint can be made available to funcx running outside of NeSI.

For more details see: https://funcx.readthedocs.io/en/latest/endpoints.html.

### Start the funcx endpoint on NeSI

A default endpoint profile is created during the configure step above, which will suffice for us. We will be using funcx to submit jobs to Slurm or check the status of submitted jobs; no computationally expensive tasks should run directly on the endpoint itself.

```sh
# we are still on the Mahuika login node here...
funcx-endpoint start
```

Now list your endpoints, confirm that the *default* endpoint is "Active" and make a note of your endpoint ID:

```sh
funcx-endpoint list
+---------------+-------------+--------------------------------------+
| Endpoint Name |   Status    |             Endpoint ID              |
+===============+=============+======================================+
| default       | Active      | 3abf6696-8ba4-4ac8-be69-c6c24031373d |
+---------------+-------------+--------------------------------------+
```

In [1]:
# store your funcx endpoint id here
funcx_endpoint = "3abf6696-8ba4-4ac8-be69-c6c24031373d"  # my default endpoint on NeSI

## 2. Create a Globus guest collection on NeSI

Create a guest collection on the NeSI endpoint so that we can use Globus Auth alone, rather than repeating NeSI's two-factor authentication.

In the Globus file manager (link below), navigate to a directory under */nesi/nobackup/[project_code]/*, click sharing and add a shared collection. Make a note of the "Endpoint UUID". Also store the full path on NeSI to the shared collection you just created (`nesi_path`):

https://transfer.nesi.org.nz/file-manager?origin_id=cc45cfe3-21ae-4e31-bad4-5b3e7d6a2ca1

In [2]:
# store your NeSI endpoint and path here
nesi_endpoint = "f456a507-3c5b-41b9-9d7f-2315b9fed386"  # my guest collection on NeSI
nesi_path = "/nesi/nobackup/nesi99999/csco212/funcx_demo"  # the full path to where I created the guest collection
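
Paths seen through the guest collection are relative to the directory the collection was created in, whereas code running on NeSI needs the full filesystem path. As a minimal sketch of that mapping, the helper below (`collection_to_nesi_path` is a hypothetical name, not part of Globus or funcX) joins the two with POSIX separators:

```python
import posixpath


def collection_to_nesi_path(collection_root, relative_path):
    """Map a path inside the guest collection to the full path on the NeSI filesystem."""
    # paths inside a guest collection are rooted at "/", so drop any leading slash
    relative_path = relative_path.lstrip("/")
    # NeSI is a Linux system, so always join with POSIX separators
    return posixpath.join(collection_root, relative_path)


# example values copied from the cell above; the run directory name is illustrative
example_root = "/nesi/nobackup/nesi99999/csco212/funcx_demo"
print(collection_to_nesi_path(example_root, "/20220112T125541/run.sl"))
```

Using `posixpath` rather than `os.path` keeps the result valid even if the notebook itself runs on Windows.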

## 3. Globus authentication on local machine

### Register an app with Globus, if you haven't done it already

Note: I think this is a one-off step; you can reuse the same client ID afterwards.

> Navigate to the [Developer Site](https://developers.globus.org/) and select “Register your app with Globus.” You will be prompted to login – do so with the account you wish to use as your app’s administrator...

In [3]:
# identifier for the app we created on globus website above, can be reused
CLIENT_ID = "6ffc9c02-cf62-4268-a695-d9d100181962"

### Use fair-research-login to authenticate once with Globus for both FuncX and Globus transfer

You have to authenticate the first time you run this cell. The tokens are then stored in `mytokens.json` and loaded from there on subsequent calls.

In [4]:
from fair_research_login import NativeClient, JSONTokenStorage

cli = NativeClient(
    client_id=CLIENT_ID,
    token_storage=JSONTokenStorage('mytokens.json'),  # save/load tokens here
    app_name="FuncX/Globus NeSI Demo",
)

# get the requested scopes (load tokens from file if available, otherwise request new tokens)
search_scope = "urn:globus:auth:scope:search.api.globus.org:all"  # for FuncX
funcx_scope = "https://auth.globus.org/scopes/facd7ccc-c5f4-42aa-916b-a0e270e2c2a9/all"  # for FuncX
openid_scope = "openid"  # for FuncX
transfer_scope = "urn:globus:auth:scope:transfer.api.globus.org:all"  # for Globus transfer client
https_scope = f"https://auth.globus.org/scopes/{nesi_endpoint}/https"  # for HTTPS upload/download to our guest collection on NeSI
tokens = cli.login(
    refresh_tokens=True,
    requested_scopes=[openid_scope, search_scope, funcx_scope, transfer_scope, https_scope]
)

Starting login with Globus Auth, press ^C to cancel.
Opening in existing browser session.



In [5]:
# authorisers for requested scopes
authorisers = cli.get_authorizers_by_scope(requested_scopes=[openid_scope, funcx_scope, search_scope, transfer_scope, https_scope])

## 4. Start funcX client locally

Start the funcX client locally so we can submit jobs to the NeSI funcX endpoint we just created. The client authenticates with Globus Auth using the authorisers obtained in step 3.

In [6]:
from funcx.sdk.client import FuncXClient

fxc = FuncXClient(
    fx_authorizer=authorisers[funcx_scope],
    search_authorizer=authorisers[search_scope],
    openid_authorizer=authorisers[openid_scope],
)

In [7]:
from funcx.sdk.executor import FuncXExecutor

# create a funcX executor
funcx_executor = FuncXExecutor(fxc)

## 5. Connect to our Globus guest collection on NeSI

Connect to the guest collection we created earlier.

In [8]:
import globus_sdk

tc = globus_sdk.TransferClient(authorizer=authorisers[transfer_scope])

In [9]:
# activate the NeSI endpoint
res_nesi_ep = tc.endpoint_autoactivate(nesi_endpoint)
assert res_nesi_ep['code'] != 'AutoActivationFailed'
res_nesi_ep["message"]

'Endpoint activated successfully using Globus Online credentials.'

## 6. Configure HTTPS uploads/downloads for our NeSI guest collection

In [10]:
# get the base URL for uploads and downloads
endpoint = tc.get_endpoint(nesi_endpoint)
https_server = endpoint['https_server']
print(f"Endpoint HTTPS base URL: {https_server}")

Endpoint HTTPS base URL: https://g-dc68ab.c61f4.bd7c.data.globus.org


In [11]:
# set up authentication header
# Globus SDK v2
https_token_dict = cli.load_tokens_by_scope()[https_scope]
https_auth_header = f"{https_token_dict['token_type']} {https_token_dict['access_token']}"
# Globus SDK v3??
#a = authorisers[https_scope]
#https_auth_header = a.get_authorization_header()

In [12]:
# function to download a file via https from the endpoint
def download_file(remote_file, local_file):
    import requests
    import shutil
    import time
    import os

    # file to download and URL
    download_url = f"{https_server}/{remote_file}"
    print(f"Downloading: {download_url}")

    # authorisation
    headers = {
        "Authorization": https_auth_header,
    }

    # download
    start_time = time.perf_counter()
    with requests.get(download_url, headers=headers, stream=True) as r:
        # fail before writing anything if the server returned an error status
        r.raise_for_status()
        with open(local_file, 'wb') as f:
            shutil.copyfileobj(r.raw, f)
    download_time = time.perf_counter() - start_time
    file_size = os.path.getsize(local_file)
    print(f"Downloaded {local_file}: {file_size / 1024 / 1024:.3f} MB in {download_time:.3f} seconds ({file_size / 1024 / 1024 / download_time:.3f} MB/s)")

In [13]:
# function to upload a file via https to the endpoint
def upload_file(local_file, remote_file):
    import requests
    import time
    import os

    # file to upload and URL
    upload_url = f"{https_server}/{remote_file}"
    print(f"Uploading: {upload_url}")

    # authorisation
    headers = {
        "Authorization": https_auth_header,
    }

    # upload
    start_time = time.perf_counter()
    with open(local_file, 'rb') as f:
        r = requests.put(upload_url, data=f, headers=headers)
    r.raise_for_status()
    upload_time = time.perf_counter() - start_time
    file_size = os.path.getsize(local_file)
    print(f"Uploaded {local_file}: {file_size / 1024 / 1024:.3f} MB in {upload_time:.3f} seconds ({file_size / 1024 / 1024 / upload_time:.3f} MB/s)")
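
Both functions build the request URL by joining the HTTPS base URL and a path relative to the guest collection. A minimal sketch of that step, assuming file names may contain spaces or other characters that need percent-encoding (`build_https_url` is a hypothetical helper, not part of the Globus SDK):

```python
from urllib.parse import quote


def build_https_url(base_url, remote_file):
    """Join the collection's HTTPS base URL and a collection-relative path."""
    # avoid doubled slashes between the base URL and the path
    base_url = base_url.rstrip("/")
    remote_file = remote_file.lstrip("/")
    # percent-encode unsafe characters but keep "/" as the path separator
    return f"{base_url}/{quote(remote_file)}"


# the file name here is made up to show the encoding of a space
print(build_https_url("https://g-dc68ab.c61f4.bd7c.data.globus.org", "20220112T125541/my input.pdb"))
```

For the file names used in this notebook the result is identical to the plain f-string, so this only matters for less tidy inputs.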

## 7. Transfer input data to NeSI using Globus

First we create a uniquely named directory for this run on NeSI, then copy the input data into that directory.

In [14]:
# make a directory for running under
from datetime import datetime

# get a unique name for this run
workdirbase = datetime.now().strftime("%Y%m%dT%H%M%S")
workdirname = workdirbase
got_dirname = False
existing_names = [item["name"] for item in tc.operation_ls(nesi_endpoint, path="/")]
count = 0
while not got_dirname:
    # check the directory does not already exist
    if workdirname in existing_names:
        count += 1
        workdirname = f"{workdirbase}.{count:06d}"
    else:
        got_dirname = True
print(f"Directory: {workdirname}")
tc.operation_mkdir(nesi_endpoint, workdirname)

Directory: 20220112T125541


TransferResponse({'DATA_TYPE': 'mkdir_result', 'code': 'DirectoryCreated', 'message': 'The directory was created successfully', 'request_id': 'XFztddPaJ', 'resource': '/operation/endpoint/f456a507-3c5b-41b9-9d7f-2315b9fed386/mkdir'})
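
The naming logic in the cell above can also be written as a small pure function, which makes it easy to check without a Globus connection. This is a sketch of the same approach (timestamp first, then a zero-padded counter suffix on collision); `unique_name` is a hypothetical name:

```python
def unique_name(base, existing_names):
    """Return base, or base with a numbered suffix, so that it is not in existing_names."""
    existing = set(existing_names)
    name = base
    count = 0
    # keep appending an increasing, zero-padded counter until the name is unused
    while name in existing:
        count += 1
        name = f"{base}.{count:06d}"
    return name


# two earlier runs already used this timestamp, so the third gets suffix 000002
print(unique_name("20220112T125541", ["20220112T125541", "20220112T125541.000001"]))
```

In the notebook, `existing_names` would come from `tc.operation_ls(nesi_endpoint, path="/")` as before.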

In [15]:
import os

# can only upload files using HTTPS?
# get the list of source files
src_input_path = "input"
source_files = [f for f in os.listdir(src_input_path) if os.path.isfile(os.path.join(src_input_path, f))]

# transfer the source files
for source_file in source_files:
    upload_file(os.path.join(src_input_path, source_file), f"{workdirname}/{source_file}")

print("transferring source files to NeSI is complete")

Uploading: https://g-dc68ab.c61f4.bd7c.data.globus.org/20220112T125541/apoa1.pdb
Uploaded input/apoa1.pdb: 6.773 MB in 7.535 seconds (0.899 MB/s)
Uploading: https://g-dc68ab.c61f4.bd7c.data.globus.org/20220112T125541/apoa1.namd
Uploaded input/apoa1.namd: 0.001 MB in 5.767 seconds (0.000 MB/s)
Uploading: https://g-dc68ab.c61f4.bd7c.data.globus.org/20220112T125541/apoa1.psf
Uploaded input/apoa1.psf: 12.855 MB in 4.300 seconds (2.990 MB/s)
Uploading: https://g-dc68ab.c61f4.bd7c.data.globus.org/20220112T125541/par_all22_popc.xplor
Uploaded input/par_all22_popc.xplor: 0.000 MB in 3.541 seconds (0.000 MB/s)
Uploading: https://g-dc68ab.c61f4.bd7c.data.globus.org/20220112T125541/par_all22_prot_lipid.xplor
Uploaded input/par_all22_prot_lipid.xplor: 0.149 MB in 5.846 seconds (0.025 MB/s)
Uploading: https://g-dc68ab.c61f4.bd7c.data.globus.org/20220112T125541/run.sl
Uploaded input/run.sl: 0.000 MB in 2.560 seconds (0.000 MB/s)
transferring source files to NeSI is complete


## 8. Run the processing using funcX

Two functions are called using FuncX:

1. Submit job to Slurm
2. Check Slurm job status

In [16]:
print(f"FuncX endpoint id: {funcx_endpoint}")

FuncX endpoint id: 3abf6696-8ba4-4ac8-be69-c6c24031373d


Create a simple test function that returns the hostname of the machine the endpoint is running on:

In [17]:
# test function to see if things are working
def test_function():
    import socket
    return socket.gethostname()

# With the executor, functions are auto-registered
future = funcx_executor.submit(test_function, endpoint_id=funcx_endpoint)

# You can check status of your task without blocking
print("processing done?", future.done())

# Block and wait for the result:
result = future.result()

print("processing done?", future.done())

print(f"FuncX endpoint is running on: {result}")

processing done? False
processing done? True
FuncX endpoint is running on: mahuika01


Now create the two functions for interacting with Slurm (if the Slurm API was available we could use that instead):

In [18]:
# function that submits a job to Slurm (assumes submit script and other required inputs were uploaded via Globus)
def submit_slurm_job(submit_script, work_dir=None):
    """Submit the given script to Slurm with sbatch and return the sbatch output."""
    # modules have to be imported within the function
    import os
    import subprocess
    
    # change to working directory
    if work_dir is not None and os.path.isdir(work_dir):
        os.chdir(work_dir)
        
    print(os.listdir())
    
    # submit the Slurm job and return the job id
    submit_cmd = f'sbatch --priority=9999 {submit_script}'
    with open("submit_cmd.txt", "w") as fh:
        fh.write(submit_cmd + "\n")
    output = subprocess.check_output(submit_cmd, shell=True, universal_newlines=True)
    
    return output

In [19]:
# function that checks Slurm job status
def check_slurm_job_status(jobid):
    """Check Slurm job status."""
    # modules have to be imported within the function
    import subprocess
    
    # query the status of the job using sacct
    cmd = f'sacct -j {jobid} -X -o State -n'
    output = subprocess.check_output(cmd, shell=True, universal_newlines=True)
    
    return output
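
Both Slurm functions pass an f-string to the shell. One possible alternative, sketched here, is to build the command as an argument list and check the job ID first, so nothing unexpected reaches the command line. `build_sacct_command` is a hypothetical helper, and the accepted job ID pattern (digits, optionally followed by `_N` for an array task) is an assumption:

```python
import re


def build_sacct_command(jobid):
    """Build the sacct argument list used to query the state of one job."""
    jobid = str(jobid).strip()
    # assumption: job ids are plain digits, optionally with an array task suffix like 123_4
    if not re.fullmatch(r"\d+(_\d+)?", jobid):
        raise ValueError(f"unexpected Slurm job id: {jobid!r}")
    # same options as the f-string version: allocation only, State column, no header
    return ["sacct", "-j", jobid, "-X", "-o", "State", "-n"]


print(build_sacct_command("23826177"))
```

The list can be passed to `subprocess.check_output(..., universal_newlines=True)` without `shell=True`. Because the function is serialised and run remotely, a helper like this would need to be defined inside the function body, in the same way as the imports.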

### Submit the job to Slurm

In [20]:
import posixpath
import subprocess

# full path on NeSI to where the files were uploaded
# (posixpath keeps "/" separators even if this notebook runs on Windows)
nesi_full_path = posixpath.join(nesi_path, workdirname)

# With the executor, functions are auto-registered
future = funcx_executor.submit(submit_slurm_job, "run.sl", endpoint_id=funcx_endpoint, work_dir=nesi_full_path)

# Block and wait for the result:
try:
    result = future.result()
except subprocess.CalledProcessError as exc:
    print("submitting job failed:")
    print(f"    return code: {exc.returncode}")
    print(f"    cmd: {exc.cmd}")
    print(f"    output: {exc.output}")
    raise  # stop here, there is no job id to continue with
else:
    # get the Slurm Job ID (the last word of the sbatch output)
    jobid = result.split()[-1]
    print(f"Job submitted: {jobid}")

Job submitted: 23826177
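
Taking the last word of the output with `result.split()[-1]` relies on `sbatch` printing exactly `Submitted batch job <id>`. A slightly more defensive sketch uses a regular expression, which also copes with extra text after the ID (`parse_job_id` is a hypothetical helper):

```python
import re


def parse_job_id(sbatch_output):
    """Extract the numeric job id from the text printed by sbatch."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        # fail loudly rather than polling for a job id that does not exist
        raise ValueError(f"could not find a job id in: {sbatch_output!r}")
    return match.group(1)


print(parse_job_id("Submitted batch job 23826177\n"))
```

The job ID is returned as a string, which is what the status check expects.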


### Wait for the job to complete

In [21]:
import time

job_finished = False
while not job_finished:
    future = funcx_executor.submit(check_slurm_job_status, jobid, endpoint_id=funcx_endpoint)
    print("checking Slurm job status via funcX: ", end="")
    result = future.result()
    job_status = result.strip()
    print(job_status)
    if job_status not in ("RUNNING", "PENDING"):  # TODO: check possible statuses
        job_finished = True
    else:
        time.sleep(5)  # only wait if we need to poll again
print("Job finished")

checking Slurm job status via funcX: RUNNING
checking Slurm job status via funcX: RUNNING
checking Slurm job status via funcX: COMPLETED
Job finished
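
As a possible starting point for the TODO in the loop above, the check could be moved into a helper that decides whether to keep polling. The set of active states below is an assumption based on the job state codes listed in the `sacct` documentation, and treating empty output as "not recorded yet" is also an assumption (the accounting record may lag slightly behind submission). `is_job_active` is a hypothetical name:

```python
# assumption: states in which the job can still change, per the sacct state list
ACTIVE_STATES = {"PENDING", "RUNNING", "REQUEUED", "RESIZING", "SUSPENDED"}


def is_job_active(sacct_output):
    """Return True if polling should continue for this sacct State output."""
    words = sacct_output.split()
    if not words:
        # no accounting record yet, so keep polling
        return True
    # use the first word only, so "CANCELLED by 1234" becomes "CANCELLED";
    # a trailing "+" marks a value truncated to the column width
    state = words[0].rstrip("+")
    return state in ACTIVE_STATES


print(is_job_active("RUNNING\n"), is_job_active("COMPLETED\n"))
```

With this helper the loop condition would become `if not is_job_active(result)`. A production version would probably also want a limit on how long empty output is tolerated.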


## 9. Copy results back using Globus

In [22]:
# create directory for storing result
store_path = os.path.join("output", workdirname)
os.makedirs(store_path, exist_ok=True)  # also creates "output" if it is missing

# list and download files
ls = tc.operation_ls(nesi_endpoint, path=workdirname)
for item in ls:
    if item["type"] == "file":
        fn = item["name"]
        download_file(f"{workdirname}/{fn}", os.path.join(store_path, fn))
    else:
        print(f"Skipping: {item['name']} (can only download files over HTTPS)")

print(f"transferring results from NeSI is complete: {store_path}")

Downloading: https://g-dc68ab.c61f4.bd7c.data.globus.org/20220112T125541/apoa1.namd
Downloaded output/20220112T125541/apoa1.namd: 0.001 MB in 4.362 seconds (0.000 MB/s)
Downloading: https://g-dc68ab.c61f4.bd7c.data.globus.org/20220112T125541/apoa1.pdb
Downloaded output/20220112T125541/apoa1.pdb: 6.773 MB in 5.003 seconds (1.354 MB/s)
Downloading: https://g-dc68ab.c61f4.bd7c.data.globus.org/20220112T125541/apoa1.psf
Downloaded output/20220112T125541/apoa1.psf: 12.855 MB in 4.008 seconds (3.207 MB/s)
Downloading: https://g-dc68ab.c61f4.bd7c.data.globus.org/20220112T125541/par_all22_popc.xplor
Downloaded output/20220112T125541/par_all22_popc.xplor: 0.000 MB in 2.460 seconds (0.000 MB/s)
Downloading: https://g-dc68ab.c61f4.bd7c.data.globus.org/20220112T125541/par_all22_prot_lipid.xplor
Downloaded output/20220112T125541/par_all22_prot_lipid.xplor: 0.149 MB in 6.085 seconds (0.024 MB/s)
Downloading: https://g-dc68ab.c61f4.bd7c.data.globus.org/20220112T125541/run.sl
Downloaded output/20220112

## Notes about FuncX so far

* the above uses funcX only to submit a Slurm job and then poll for completion (this could also be done with the Slurm API if that was made available)
* the researcher needs to manually run a funcX endpoint on NeSI (and keep it running there)
  - eventually this should be integrated with a Globus federated endpoint?
  - this runs an endpoint on the login node
  - it could be a pain if the endpoint is killed for some reason and the user needs to reconnect and start it again
* funcX does know about Slurm too, so you could set funcX up to run your function directly in a Slurm job without having to submit anything separately; see this snippet from an endpoint config.py:
  ```python
    from funcx_endpoint.endpoint.utils.config import Config
    from parsl.providers import LocalProvider, SlurmProvider

    config = Config(
        scaling_enabled=True,
        provider=SlurmProvider(
            "large",
            min_blocks=1,
            max_blocks=1,
            nodes_per_block=1,
            cores_per_node=2,
            mem_per_node=16,
            exclusive=False,
            cmd_timeout=120,
            walltime='2:00:00',
        ),
        #max_workers_per_node=2,
        funcx_service_address='https://api.funcx.org/v1'
    )
  ```
* reasons for not using FuncX SlurmProvider directly currently
  - funcx currently has no way to know how much work a function may involve
    - could lead to failures due to wall time exceeded, etc.
  - not "elastic"
    - have to start a new endpoint if need more resources