# Globus and funcX 

This notebook uses Globus and funcX to automate uploading data to NeSI, running a Slurm job there, and copying the results back.

This will demonstrate:

* Globus and funcX Python packages
* Doing Globus authentication using the fair_research_login module
* Uploading and downloading files to a shared collection using HTTPS (no personal endpoint required)
* Running funcX functions to submit and wait for a Slurm job on the remote machine

Requirements:

* Globus account
* NeSI account

Setup Steps:

1. Start a funcX endpoint on NeSI (via SSH or Jupyter)
2. Create a Globus guest collection on NeSI (via Globus web app)
3. Globus authentication on local machine
4. Start funcX client locally
5. Connect to our Globus guest collection on NeSI
6. Configure HTTPS uploads/downloads for our NeSI guest collection
7. Create remote directory using funcX

Processing Steps:

8. Transfer input data to NeSI using Globus
9. Run the workflow using funcX
10. Copy results back using Globus

The tokens generated during step 3 on the local machine are stored in a file and reused, so you should only need to authenticate the first time you run this notebook.

References:

* [Globus tutorial](https://globus-sdk-python.readthedocs.io/en/stable/tutorial.html)
* [funcX endpoint documentation](https://funcx.readthedocs.io/en/latest/endpoints.html)
* [fair-research-login](https://github.com/fair-research/native-login)

## 1. Start a funcX endpoint on NeSI

### Install and configure funcX endpoint if you have not done it before

Connect to a Mahuika login node by SSH and run the following commands to install funcX:

```sh
ssh mahuika
module load funcx-endpoint
funcx-endpoint configure
```

During the final command you will be asked to authenticate with Globus Auth so that your endpoint can be made available to funcX running outside of NeSI.

For more details see: https://funcx.readthedocs.io/en/latest/endpoints.html.

### Start the funcX endpoint on NeSI

A default endpoint profile is created during the configure step above, which will suffice for us. We will only use funcX to submit jobs to Slurm and check the status of submitted jobs; no computationally expensive tasks should run directly on the endpoint itself.

```sh
# we are still on the Mahuika login node here...
funcx-endpoint start
```

Now list your endpoints, confirm that the *default* endpoint is "Running" and make a note of your endpoint ID:

```sh
funcx-endpoint list
+---------------+-----------+--------------------------------------+
| Endpoint Name | Status    |             Endpoint ID              |
+===============+===========+======================================+
| default       | Running   | ffd77d5c-b65f-4479-bbc3-66a2f7346858 |
+---------------+-----------+--------------------------------------+
```
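
If you want to check the endpoint status from a script rather than by eye, the table can be parsed with a few lines of Python. This is only a sketch: `parse_endpoint_table` is a hypothetical helper, and it assumes the three-column table layout shown above, which may differ between funcx-endpoint versions.

```python
def parse_endpoint_table(text):
    """Parse the ASCII table printed by `funcx-endpoint list`.

    Returns a dict mapping endpoint name -> (status, endpoint_id).
    """
    endpoints = {}
    for line in text.splitlines():
        line = line.strip()
        # header and data rows start with "|"; border rows start with "+"
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) != 3 or cells[0] == "Endpoint Name":
            continue  # skip the header row and anything unexpected
        name, status, endpoint_id = cells
        endpoints[name] = (status, endpoint_id)
    return endpoints


# the table from above, pasted in as sample input
sample_output = """\
+---------------+-----------+--------------------------------------+
| Endpoint Name | Status    |             Endpoint ID              |
+===============+===========+======================================+
| default       | Running   | ffd77d5c-b65f-4479-bbc3-66a2f7346858 |
+---------------+-----------+--------------------------------------+
"""

status, endpoint_id = parse_endpoint_table(sample_output)["default"]
print(f"default endpoint is {status}: {endpoint_id}")
```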

In [1]:
# store your funcx endpoint id here
funcx_endpoint = "ffd77d5c-b65f-4479-bbc3-66a2f7346858"  # my default endpoint on NeSI

## 2. Create a Globus guest collection on NeSI

Create a guest collection on the NeSI endpoint so that we can use Globus Auth alone, rather than repeating the NeSI two-factor authentication each time.

Navigate to a directory under */nesi/nobackup/[project_code]/*, click sharing and add a shared collection. Make a note of the "Endpoint UUID". Also store the full path on NeSI to the shared collection you just created (`nesi_path`):

https://transfer.nesi.org.nz/file-manager?origin_id=cc45cfe3-21ae-4e31-bad4-5b3e7d6a2ca1

In [2]:
# store your NeSI endpoint and path here
nesi_endpoint = "3999b2ad-d708-4b0e-9f4d-8f90838c7f23"  # my guest collection on NeSI
nesi_path = "/nesi/nobackup/nesi99999/csco212/rjm"  # the full path to where I created the guest collection
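
Both the funcX endpoint ID and the guest collection ID are UUIDs, and a copy/paste slip is easy to make. A quick sanity check before going further could look like the sketch below; `is_uuid` is a hypothetical helper (not part of Globus or funcX), and it only checks the format, not whether the ID actually exists.

```python
import uuid


def is_uuid(value):
    """Return True if value is a canonical, hyphenated UUID string."""
    try:
        return str(uuid.UUID(value)) == value.lower()
    except (ValueError, AttributeError, TypeError):
        return False


# the two example IDs used in this notebook
ids_to_check = {
    "funcx_endpoint": "ffd77d5c-b65f-4479-bbc3-66a2f7346858",
    "nesi_endpoint": "3999b2ad-d708-4b0e-9f4d-8f90838c7f23",
}
for name, value in ids_to_check.items():
    print(f"{name}: {'ok' if is_uuid(value) else 'NOT a valid UUID'}")
```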

## 3. Globus authentication on local machine

### Register an app with Globus, if you haven't done it already

Note: this appears to be a one-off step; the same client ID can be reused afterwards.

> Navigate to the [Developer Site](https://developers.globus.org/) and select “Register your app with Globus.” You will be prompted to login – do so with the account you wish to use as your app’s administrator...

In [3]:
# identifier for the app we created on globus website above, can be reused
CLIENT_ID = "b7f9ff16-4094-4d2a-8183-6dfd9362096a"

### Use fair-research-login to authenticate once with Globus for both FuncX and Globus transfer

You only have to authenticate the first time; the tokens are then stored in `mytokens.json` and loaded from there on subsequent calls.

In [4]:
from fair_research_login import NativeClient, JSONTokenStorage

cli = NativeClient(
    client_id=CLIENT_ID,
    token_storage=JSONTokenStorage('mytokens.json'),  # save/load tokens here
    app_name="FuncX/Globus NeSI Demo",
)

# get the requested scopes (load tokens from file if available, otherwise request new tokens)
search_scope = "urn:globus:auth:scope:search.api.globus.org:all"  # for FuncX
funcx_scope = "https://auth.globus.org/scopes/facd7ccc-c5f4-42aa-916b-a0e270e2c2a9/all"  # for FuncX
openid_scope = "openid"  # for FuncX
transfer_scope = "urn:globus:auth:scope:transfer.api.globus.org:all"  # for Globus transfer client
https_scope = f"https://auth.globus.org/scopes/{nesi_endpoint}/https"  # for HTTPS upload/download to our guest collection on NeSI
tokens = cli.login(
    refresh_tokens=True,
    requested_scopes=[openid_scope, search_scope, funcx_scope, transfer_scope, https_scope]
)

# authorisers for requested scopes
authorisers = cli.get_authorizers_by_scope(requested_scopes=[openid_scope, funcx_scope, search_scope, transfer_scope, https_scope])

Starting login with Globus Auth, press ^C to cancel.
Opening in existing browser session.



## 4. Start funcX client locally

Start the funcX client locally so we can submit jobs to the NeSI funcX endpoint we just created. The client reuses the Globus Auth authorisers obtained in step 3, so no further login should be needed.

In [6]:
from funcx.sdk.client import FuncXClient
from funcx.sdk.executor import FuncXExecutor

fxc = FuncXClient(
    fx_authorizer=authorisers[funcx_scope],
    search_authorizer=authorisers[search_scope],
    openid_authorizer=authorisers[openid_scope],
)

# create a funcX executor (based on concurrent.futures.Executor)
funcx_executor = FuncXExecutor(fxc)

### Quick test that funcX is working

In [8]:
# test function to see if things are working
def test_function():
    import socket
    return socket.gethostname()

# With the executor, functions are auto-registered
future = funcx_executor.submit(test_function, endpoint_id=funcx_endpoint)

# You can check status of your task without blocking
print("processing done?", future.done())

# Block and wait for the result:
result = future.result()

print("processing done?", future.done())

print(f"FuncX endpoint is running on: {result}")

processing done? False
processing done? True
FuncX endpoint is running on: mahuika01
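
`future.result()` blocks until the endpoint replies, which could be a long wait if the endpoint is not running. Since the funcX executor is based on `concurrent.futures.Executor`, its futures should accept the usual `timeout` argument. The sketch below shows that pattern with a local `ThreadPoolExecutor` standing in for the funcX executor; `slow_hostname` and `result_or_none` are hypothetical helpers for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def slow_hostname(delay):
    """Like test_function above, but sleeps first to simulate a slow endpoint."""
    import socket
    import time

    time.sleep(delay)
    return socket.gethostname()


def result_or_none(future, timeout):
    """Return the future's result, or None if it is not ready within timeout seconds."""
    try:
        return future.result(timeout=timeout)
    except TimeoutError:
        return None


# ThreadPoolExecutor is a local stand-in for the funcX executor here
with ThreadPoolExecutor(max_workers=2) as executor:
    quick = executor.submit(slow_hostname, 0.0)
    slow = executor.submit(slow_hostname, 0.5)
    quick_result = result_or_none(quick, timeout=5.0)
    slow_result = result_or_none(slow, timeout=0.05)

print(f"quick task returned: {quick_result}")
print(f"slow task returned:  {slow_result}")
```

A timed-out future is not cancelled; calling `result()` on it again later still returns the value once the task completes.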


## 5. Connect to our Globus guest collection on NeSI

Connect to the guest collection we created earlier.

In [10]:
import globus_sdk

tc = globus_sdk.TransferClient(authorizer=authorisers[transfer_scope])

# activate the NeSI endpoint
res_nesi_ep = tc.endpoint_autoactivate(nesi_endpoint)
assert res_nesi_ep['code'] != 'AutoActivationFailed'
res_nesi_ep["message"]

'Endpoint activated successfully using Globus Online credentials.'

## 6. Setting up HTTPS uploads/downloads for our NeSI guest collection

In [12]:
# get the base URL for uploads and downloads
endpoint = tc.get_endpoint(nesi_endpoint)
https_server = endpoint['https_server']
print(f"Endpoint HTTPS base URL: {https_server}")

# set up authentication header
https_authoriser = authorisers[https_scope]
https_auth_header = https_authoriser.get_authorization_header()

Endpoint HTTPS base URL: https://g-51ede9.c61f4.bd7c.data.globus.org
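
File URLs on the guest collection are formed by appending a path (relative to the collection root) to this base URL. A small helper can take care of stray slashes and of characters, such as spaces, that need percent-encoding. This is a sketch only; `build_https_url` is a hypothetical name and is not part of the Globus SDK.

```python
from urllib.parse import quote


def build_https_url(base_url, remote_path):
    """Join the collection's HTTPS base URL and a path relative to the collection root."""
    # quote() leaves "/" alone by default and percent-encodes unsafe characters
    return f"{base_url.rstrip('/')}/{quote(remote_path.lstrip('/'))}"


example_url = build_https_url(
    "https://g-51ede9.c61f4.bd7c.data.globus.org",
    "funcx-test-20220324T113946-99ybnaa2/apoa1.pdb",
)
print(example_url)
```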


## 7. Create remote directory using funcX

In [13]:
def make_remote_directory(base_dir, prefix):
    # catch all errors due to problem with exceptions being wrapped in a parsl class,
    # requiring parsl to be installed on the local machine
    # which is not supported on windows for the version of parsl that funcx-endpoint depends on
    try:
        import os
        import tempfile

        remote_dir = tempfile.mkdtemp(prefix=prefix + "-", dir=base_dir)
        remote_name = os.path.basename(remote_dir)
        status = 0

    except Exception as exc:
        remote_dir = repr(exc)
        remote_name = None
        status = 1

    return status, (remote_dir, remote_name)

In [14]:
# make a directory for running under
from datetime import datetime

# get a unique name for this run
prefix = datetime.now().strftime("funcx-test-%Y%m%dT%H%M%S")

# submit the make directory function to funcx
future = funcx_executor.submit(make_remote_directory, nesi_path, prefix, endpoint_id=funcx_endpoint)

# wait for the function to complete
status, (remote_full_path, remote_dir_name) = future.result()

print(f"status: {status}")
print(f"remote directory: {remote_full_path}")

status: 0
remote directory: /nesi/nobackup/nesi99999/csco212/rjm/funcx-test-20220324T113946-99ybnaa2
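
The `(status, payload)` return value is what lets the caller detect a remote failure without an exception having to travel back to the local machine. The pattern can be tried out locally without funcX; the sketch below uses a local copy of the function (renamed `make_directory` for illustration) and a temporary directory in place of `nesi_path`.

```python
import os
import tempfile


def make_directory(base_dir, prefix):
    """Local copy of the pattern above: return (status, payload) and never raise."""
    try:
        new_dir = tempfile.mkdtemp(prefix=prefix + "-", dir=base_dir)
        return 0, (new_dir, os.path.basename(new_dir))
    except Exception as exc:
        # on failure the first payload element carries the error description
        return 1, (repr(exc), None)


with tempfile.TemporaryDirectory() as base_dir:
    # success: the base directory exists
    ok_status, (ok_path, ok_name) = make_directory(base_dir, "funcx-test")

    # failure: the base directory does not exist
    missing_dir = os.path.join(base_dir, "does-not-exist")
    bad_status, (bad_message, bad_name) = make_directory(missing_dir, "funcx-test")

print(f"status: {ok_status}, directory name: {ok_name}")
print(f"status: {bad_status}, error: {bad_message}")
```

In the notebook itself it is worth checking `status` before continuing, since on failure `remote_full_path` holds the error text rather than a path.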


## 8. Transfer input data to NeSI

In [15]:
# function to upload a file via https to the endpoint
def upload_file(local_file, remote_file):
    import requests
    import time
    import os

    # file to upload and URL
    upload_url = f"{https_server}/{remote_file}"
    print(f"Uploading: {upload_url}")

    # authorisation
    headers = {
        "Authorization": https_auth_header,
    }

    # upload
    start_time = time.perf_counter()
    with open(local_file, 'rb') as f:
        r = requests.put(upload_url, data=f, headers=headers)
        r.raise_for_status()
    upload_time = time.perf_counter() - start_time
    file_size = os.path.getsize(local_file)
    print(f"Uploaded {local_file}: {file_size / 1024 / 1024:.3f} MB in {upload_time:.3f} seconds ({file_size / 1024 / 1024 / upload_time:.3f} MB/s)")

In [16]:
import os

# get the list of source files
src_input_path = "input"
source_files = [f for f in os.listdir(src_input_path) if os.path.isfile(os.path.join(src_input_path, f))]

# transfer the source files
for source_file in source_files:
    upload_file(os.path.join(src_input_path, source_file), f"{remote_dir_name}/{source_file}")

print("transferring source files to NeSI is complete")

Uploading: https://g-51ede9.c61f4.bd7c.data.globus.org/funcx-test-20220324T113946-99ybnaa2/apoa1.pdb
Uploaded input/apoa1.pdb: 6.773 MB in 8.118 seconds (0.834 MB/s)
Uploading: https://g-51ede9.c61f4.bd7c.data.globus.org/funcx-test-20220324T113946-99ybnaa2/apoa1.namd
Uploaded input/apoa1.namd: 0.001 MB in 3.535 seconds (0.000 MB/s)
Uploading: https://g-51ede9.c61f4.bd7c.data.globus.org/funcx-test-20220324T113946-99ybnaa2/apoa1.psf
Uploaded input/apoa1.psf: 12.855 MB in 8.227 seconds (1.563 MB/s)
Uploading: https://g-51ede9.c61f4.bd7c.data.globus.org/funcx-test-20220324T113946-99ybnaa2/par_all22_popc.xplor
Uploaded input/par_all22_popc.xplor: 0.000 MB in 2.539 seconds (0.000 MB/s)
Uploading: https://g-51ede9.c61f4.bd7c.data.globus.org/funcx-test-20220324T113946-99ybnaa2/par_all22_prot_lipid.xplor
Uploaded input/par_all22_prot_lipid.xplor: 0.149 MB in 2.647 seconds (0.056 MB/s)
Uploading: https://g-51ede9.c61f4.bd7c.data.globus.org/funcx-test-20220324T113946-99ybnaa2/run.sl
Uploaded inpu

## 9. Run the processing using funcX

Submit the Slurm job, then poll until it finishes.

### Submit the job to Slurm

In [18]:
def submit_slurm_job(submit_script, work_dir=None):
    """Submit the given script to Slurm with sbatch and return sbatch's output."""
    # modules must be imported inside the function because it runs on the remote endpoint
    import os
    import subprocess

    # change to working directory
    if work_dir is not None and os.path.isdir(work_dir):
        os.chdir(work_dir)

    # submit the Slurm job and return the sbatch output, which contains the job ID
    submit_cmd = f'sbatch --priority=9999 {submit_script}'
    output = subprocess.check_output(submit_cmd, shell=True, universal_newlines=True, stderr=subprocess.STDOUT)

    return output

In [19]:
import subprocess

# With the executor, functions are auto-registered
future = funcx_executor.submit(submit_slurm_job, "run.sl", endpoint_id=funcx_endpoint, work_dir=remote_full_path)

# Block and wait for the result:
try:
    result = future.result()
except subprocess.CalledProcessError as exc:
    # allowing exceptions to propagate back to the local machine won't work with windows
    print("submitting job failed:")
    print(f"    return code: {exc.returncode}")
    print(f"    cmd: {exc.cmd}")
    print(f"    output: {exc.output}")
    raise exc

# get the Slurm Job ID
jobid = result.split()[-1]
print(f"Job submitted: {jobid}")

Job submitted: 25596925
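
Taking the last word of the output works when `sbatch` prints only `Submitted batch job <id>`, but it picks up the wrong token if anything follows the ID (for example ` on cluster <name>` when a cluster is specified). A slightly more defensive sketch uses a regular expression; `parse_job_id` is a hypothetical helper, not something funcX or Slurm provides.

```python
import re


def parse_job_id(sbatch_output):
    """Extract the job ID from sbatch output, or raise ValueError if there is none."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"could not find a job ID in sbatch output: {sbatch_output!r}")
    return match.group(1)


print(parse_job_id("Submitted batch job 25596925\n"))
print(parse_job_id("Submitted batch job 25596925 on cluster mahuika\n"))
```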


### Wait for the job to complete

In [23]:
def check_slurm_job_status(jobid):
    """Check Slurm job status."""
    # modules must be imported inside the function because it runs on the remote endpoint
    import subprocess
    
    # query the status of the job using sacct
    cmd = f'sacct -j {jobid} -X -o State -n'
    output = subprocess.check_output(cmd, shell=True, universal_newlines=True, stderr=subprocess.STDOUT)
    
    return output

In [25]:
import time

job_finished = False
while not job_finished:
    print("checking Slurm job status via funcX: ", end="")
    future = funcx_executor.submit(check_slurm_job_status, jobid, endpoint_id=funcx_endpoint)

    try:
        result = future.result()
    except subprocess.CalledProcessError as exc:
        # allowing exceptions to propagate back to the local machine won't work with windows
        print("checking job status failed:")
        print(f"    return code: {exc.returncode}")
        print(f"    cmd: {exc.cmd}")
        print(f"    output: {exc.output}")
        raise exc

    job_status = result.strip()
    print(job_status)
    if job_status not in ("RUNNING", "PENDING"):
        job_finished = True
    time.sleep(5)
print("Job finished")

checking Slurm job status via funcX: checking job status failed:
    return code: 1
    cmd: sacct -j 25596925 -X -o State -n
    output: sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:hpcwslurmdb01:6819: Connection refused
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused



CalledProcessError: Command 'sacct -j 25596925 -X -o State -n' returned non-zero exit status 1.
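
The failure above came from the Slurm accounting database being temporarily unreachable, so a single failed `sacct` call ended the whole wait. One way to make the loop more tolerant is to allow a few consecutive failures before giving up. The sketch below is a generic version under stated assumptions: `wait_for_job` and `fake_check` are hypothetical names, the status check is passed in as a callable, and an empty state string is assumed to mean the job has not reached the accounting database yet.

```python
import time

UNFINISHED_STATES = {"PENDING", "RUNNING"}  # the states the loop above keeps waiting on


def wait_for_job(check_status, interval=5.0, max_failures=3, sleep=time.sleep):
    """Poll check_status() until the job leaves the unfinished states.

    check_status is any callable returning the sacct state string. Up to
    max_failures consecutive exceptions are tolerated before giving up.
    """
    failures = 0
    while True:
        try:
            state = check_status().strip()
        except Exception:
            failures += 1
            if failures > max_failures:
                raise  # too many consecutive failures, give up
            sleep(interval)
            continue
        failures = 0  # a successful query resets the failure count
        # assumption: an empty state means the job is not in the database yet
        if state and state not in UNFINISHED_STATES:
            return state
        sleep(interval)


# stand-in for the funcX call: one transient failure, then a normal job lifecycle
responses = iter([RuntimeError("Connection refused"), "PENDING\n", "RUNNING\n", "COMPLETED\n"])


def fake_check():
    item = next(responses)
    if isinstance(item, Exception):
        raise item
    return item


final_state = wait_for_job(fake_check, interval=0, sleep=lambda seconds: None)
print(f"Job finished with state: {final_state}")
```

In this notebook, `check_status` could be something like `lambda: funcx_executor.submit(check_slurm_job_status, jobid, endpoint_id=funcx_endpoint).result()`.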

## 10. Copy results back using Globus

In [21]:
# function to download a file via https from the endpoint
def download_file(remote_file, local_file):
    import requests
    import time
    import os

    # file to download and URL
    download_url = f"{https_server}/{remote_file}"
    print(f"Downloading: {download_url}")

    # authorisation header
    headers = {
        "Authorization": https_auth_header,
    }

    # download
    start_time = time.perf_counter()
    with requests.get(download_url, headers=headers, stream=True) as r:
        r.raise_for_status()
        with open(local_file, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                if chunk:
                    f.write(chunk)
    download_time = time.perf_counter() - start_time
    file_size = os.path.getsize(local_file)
    print(f"Downloaded {local_file}: {file_size / 1024 / 1024:.3f} MB in {download_time:.3f} seconds ({file_size / 1024 / 1024 / download_time:.3f} MB/s)")

In [22]:
# create directory for storing results (including the parent "output" directory if needed)
store_path = os.path.join("output", remote_dir_name)
os.makedirs(store_path, exist_ok=True)

# list and download files
ls = tc.operation_ls(nesi_endpoint, path=remote_dir_name)
for item in ls:
    if item["type"] == "file":
        fn = item["name"]
        download_file(f"{remote_dir_name}/{fn}", os.path.join(store_path, fn))
    else:
        print(f"Skipping: {item['name']} (can only download files over HTTPS)")

print(f"transferring results from NeSI is complete: {store_path}")

Downloading: https://g-51ede9.c61f4.bd7c.data.globus.org/funcx-test-20220324T110820-q1t724b0/apoa1.namd
Downloaded output/funcx-test-20220324T110820-q1t724b0/apoa1.namd: 0.001 MB in 5.167 seconds (0.000 MB/s)
Downloading: https://g-51ede9.c61f4.bd7c.data.globus.org/funcx-test-20220324T110820-q1t724b0/apoa1.pdb
Downloaded output/funcx-test-20220324T110820-q1t724b0/apoa1.pdb: 6.773 MB in 3.464 seconds (1.955 MB/s)
Downloading: https://g-51ede9.c61f4.bd7c.data.globus.org/funcx-test-20220324T110820-q1t724b0/apoa1.psf
Downloaded output/funcx-test-20220324T110820-q1t724b0/apoa1.psf: 12.855 MB in 7.582 seconds (1.696 MB/s)
Downloading: https://g-51ede9.c61f4.bd7c.data.globus.org/funcx-test-20220324T110820-q1t724b0/par_all22_popc.xplor
Downloaded output/funcx-test-20220324T110820-q1t724b0/par_all22_popc.xplor: 0.000 MB in 5.450 seconds (0.000 MB/s)
Downloading: https://g-51ede9.c61f4.bd7c.data.globus.org/funcx-test-20220324T110820-q1t724b0/par_all22_prot_lipid.xplor
Downloaded output/funcx-tes
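
After downloading, a cheap way to catch incomplete transfers is to compare local file sizes with the sizes reported in the directory listing. The sketch below assumes each listing item has `name`, `type` and `size` keys, as the items returned by `operation_ls` do; the listing here is hand-written sample data, and `find_size_mismatches` is a hypothetical helper.

```python
import os
import tempfile


def find_size_mismatches(listing, local_dir):
    """Return (name, remote_size, local_size) for files that are missing or differ in size."""
    mismatches = []
    for item in listing:
        if item["type"] != "file":
            continue  # directories are not downloaded over HTTPS
        local_file = os.path.join(local_dir, item["name"])
        local_size = os.path.getsize(local_file) if os.path.isfile(local_file) else None
        if local_size != item["size"]:
            mismatches.append((item["name"], item["size"], local_size))
    return mismatches


with tempfile.TemporaryDirectory() as local_dir:
    # pretend a.txt was downloaded correctly and b.txt is missing
    with open(os.path.join(local_dir, "a.txt"), "wb") as f:
        f.write(b"12345")

    # hand-written stand-in for the result of tc.operation_ls(...)
    listing = [
        {"name": "a.txt", "type": "file", "size": 5},
        {"name": "b.txt", "type": "file", "size": 7},
        {"name": "subdir", "type": "dir", "size": 4096},
    ]
    mismatches = find_size_mismatches(listing, local_dir)

print(mismatches)
```

A matching size does not prove the content is identical; it only catches truncated or missing files.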

## Notes about FuncX and Globus HTTPS

* This example mostly uses funcX as a Slurm API, but funcX can be used for much more than that.
* fair_research_login is a convenient interface to Globus Auth.
* HTTPS access to a guest collection is convenient because it doesn't need a personal endpoint running on the local machine.
* Guest collections can be accessed with Globus Auth alone; NeSI two-factor authentication is only needed when first creating the guest collection.
* Any user can set up and use funcX by themselves; they don't need us to do anything to enable it.
* The researcher needs to manually run a funcX endpoint on NeSI (and keep it running there).
  - Eventually this could be integrated with the Globus federated endpoint?
  - It could be a pain if the endpoint is killed for some reason and the user needs to reconnect and start it again (e.g. the login node gets rebooted).
    * Currently a bash script runs as a scrontab job to work around this.
* (THIS EXAMPLE NEEDS UPDATING) funcX does know about Slurm too, so you could set funcX up to run your function directly in a Slurm job without having to submit anything separately; see this snippet from an endpoint config.py:
  ```python
    from funcx_endpoint.endpoint.utils.config import Config
    from parsl.providers import LocalProvider, SlurmProvider

    config = Config(
        scaling_enabled=True,
        provider=SlurmProvider(
            "large",
            min_blocks=1,
            max_blocks=1,
            nodes_per_block=1,
            cores_per_node=2,
            mem_per_node=16,
            exclusive=False,
            cmd_timeout=120,
            walltime='2:00:00',
        ),
        #max_workers_per_node=2,
        funcx_service_address='https://api.funcx.org/v1'
    )
  ```
* Reasons for not using the funcX SlurmProvider directly at present:
  - funcX currently has no way to know how much work a function may involve
    - this could lead to failures due to exceeded wall time, etc.
  - it is not "elastic"
    - a new endpoint has to be started if more resources are needed
  - submitting and checking Slurm jobs can run on a login node, so nothing more is needed in this case