# Globus and funcX

This notebook uses Globus and funcX to automate uploading data to NeSI, running a job there and copying the results back.

Requirements:

* Globus account
* Globus personal endpoint running on the machine where you are executing this notebook
* Globus shared collection created on NeSI on the nobackup filesystem
* NeSI account (username and second factor), for authenticating with the NeSI Globus endpoint and running a funcX endpoint

Authentication/setup Steps:

1. Connect to NeSI cluster and start a funcx endpoint there
2. Globus authentication (including FuncX) on local machine
3. Start funcX client locally
4. Create Globus transfer client
5. Connect to source Globus endpoint
6. Connect to NeSI Globus endpoint

Processing Steps:

7. Transfer input data to NeSI using Globus
8. Run the workflow using funcX
9. Copy result back using Globus

Steps 1, 2 and 6 above require authentication.

The tokens generated in step 2 on the local machine are stored in a file and reused, so you should only need to authenticate the first time.

Connecting to the NeSI Globus endpoint in step 6 requires NeSI two-factor authentication, and you only remain authenticated for ~24 hours.

References:

* [Globus tutorial](https://globus-sdk-python.readthedocs.io/en/stable/tutorial.html)
* [funcX endpoint documentation](https://funcx.readthedocs.io/en/latest/endpoints.html)
* [fair-research-login](https://github.com/fair-research/native-login)

## 1. Start funcx endpoint on NeSI

### Install and configure funcx endpoint if you have not done it before

Connect to a Mahuika login node by SSH and run the following commands to install funcx:

```sh
ssh mahuika
module load Python
pip install --user funcx funcx_endpoint
funcx-endpoint configure
```

During the final command you will be asked to authenticate with Globus Auth so that your endpoint can be made available to funcx running outside of NeSI.

For more details see: https://funcx.readthedocs.io/en/latest/endpoints.html.

### Start the funcx endpoint on NeSI

A default endpoint profile is created during the configure step above, which will suffice for us. We will be using funcx to submit jobs to Slurm or check the status of submitted jobs; no computationally expensive tasks should run directly on the endpoint itself.

```sh
# we are still on the Mahuika login node here...
funcx-endpoint start
```

Now list your endpoints, confirm that the *default* endpoint is "Active" and make a note of your endpoint ID:

```sh
funcx-endpoint list
+---------------+-------------+--------------------------------------+
| Endpoint Name |   Status    |             Endpoint ID              |
+===============+=============+======================================+
| default       | Active      | 3abf6696-8ba4-4ac8-be69-c6c24031373d |
+---------------+-------------+--------------------------------------+
```

In [1]:
# store your funcx endpoint id here
funcx_endpoint = "3abf6696-8ba4-4ac8-be69-c6c24031373d"  # my default endpoint on NeSI

## 2. Globus authentication (including FuncX) on local machine

### Register an app with Globus, if you haven't done it already

Note: this appears to be a one-off step; the same client ID can be reused afterwards.

> Navigate to the [Developer Site](https://developers.globus.org/) and select “Register your app with Globus.” You will be prompted to login – do so with the account you wish to use as your app’s administrator...

In [2]:
# identifier for the app we created on globus website above, can be reused
CLIENT_ID = "6ffc9c02-cf62-4268-a695-d9d100181962"

### Use fair-research-login to authenticate once with Globus for both FuncX and Globus transfer

The first time you run this you have to authenticate; the tokens are then stored in mytokens.json and loaded from there on subsequent calls.

In [3]:
from fair_research_login import NativeClient, JSONTokenStorage

cli = NativeClient(
    client_id=CLIENT_ID,
    token_storage=JSONTokenStorage('mytokens.json'),  # save/load tokens here
    app_name="FuncX/Globus NeSI Demo",
)

# get the requested scopes
search_scope = "urn:globus:auth:scope:search.api.globus.org:all"  # for FuncX
funcx_scope = "https://auth.globus.org/scopes/facd7ccc-c5f4-42aa-916b-a0e270e2c2a9/all"  # for FuncX
openid_scope = "openid"  # for FuncX
transfer_scope = "urn:globus:auth:scope:transfer.api.globus.org:all"  # for Globus transfer
tokens = cli.login(
    refresh_tokens=True,
    requested_scopes=[openid_scope, search_scope, funcx_scope, transfer_scope]
)

In [4]:
# authorisers for requested scopes
authorisers = cli.get_authorizers_by_scope(requested_scopes=[openid_scope, funcx_scope, search_scope, transfer_scope])

## 3. Start funcX client locally

Start the funcX client locally so we can submit jobs to the NeSI funcX endpoint we just created. The client authenticates with Globus Auth using the authorisers obtained in step 2.

In [5]:
from funcx.sdk.client import FuncXClient

fxc = FuncXClient(
    fx_authorizer=authorisers[funcx_scope],
    search_authorizer=authorisers[search_scope],
    openid_authorizer=authorisers[openid_scope],
)

In [6]:
from funcx.sdk.executor import FuncXExecutor

# create a funcX executor
funcx_executor = FuncXExecutor(fxc)

## 4. Create Globus transfer client

In [7]:
import globus_sdk

tc = globus_sdk.TransferClient(authorizer=authorisers[transfer_scope])

In [8]:
print("My Globus Endpoints:")
for ep in tc.endpoint_search(filter_scope="my-endpoints"):
    print("  - [{}] {}".format(ep["id"], ep["display_name"]))

My Globus Endpoints:
  - [6890f1a4-3f21-11eb-b55a-02d9497ca481] cdjs-desktop
  - [d5a64768-cf1c-11eb-bde7-5111456017d9] laptoptestshare
  - [d1afe264-ceee-11eb-8172-2bdce096500e] nesitestshare
  - [0ad12a38-40d9-11ec-beaf-59ff7db44e9b] temptest
  - [106cd58b-e35c-11eb-832a-45cc1b8ccd4a] TestShareOnNeSI01
  - [2d63f434-e35c-11eb-832a-45cc1b8ccd4a] TestShareOnNeSI02
  - [1fdfb7aa-544e-11eb-87b7-02187389bd35] WorkLaptop


## 5. Connect to source Globus endpoint

Connect to the personal endpoint you have on your local machine (e.g. where you are running this notebook).

You should not need to do any authentication since Globus will use the token you generated above.

We list the files in the input data directory to check our access is working.

In [9]:
import os

# put your Globus endpoint id here:
#src_endpoint = "1fdfb7aa-544e-11eb-87b7-02187389bd35"
src_endpoint = "6890f1a4-3f21-11eb-b55a-02d9497ca481"

# paths to input and output
src_base_path = os.getcwd()
src_input_path = os.path.join(src_base_path, "input")
src_output_path = os.path.join(src_base_path, "output")

# activate the endpoint
res_src_ep = tc.endpoint_autoactivate(src_endpoint, if_expires_in=3600)
if res_src_ep['code'] == 'AutoActivationFailed':
    print("Endpoint activation failed!")
    print(res_src_ep)
else:
    print(f"Endpoint activation succeeded: {res_src_ep['code']}")
    
    # list the source directory
    print("Source files:")
    for entry in tc.operation_ls(src_endpoint, path=src_input_path):
        print(f'  {entry["name"]} ({entry["type"]})')

Endpoint activation succeeded: AlreadyActivated
Source files:
  apoa1.namd (file)
  apoa1.pdb (file)
  apoa1.psf (file)
  par_all22_popc.xplor (file)
  par_all22_prot_lipid.xplor (file)
  run.sl (file)


## 6. Connect to the NeSI Globus endpoint

Create a guest collection on the NeSI endpoint so that we can rely on Globus Auth instead of repeating the NeSI two-factor authentication.

In the Globus web interface, navigate to a directory under */nesi/nobackup/[project_code]/*, click sharing and add a shared collection. Make a note of the "Endpoint UUID". Also store the full path on NeSI to the shared collection you just created (`nesi_path`).

In [10]:
#nesi_endpoint = "3064bb28-e940-11e8-8caa-0a1d4c5c824a"  # NeSI endpoint
#nesi_endpoint = "cc45cfe3-21ae-4e31-bad4-5b3e7d6a2ca1"  # NeSI v5 endpoint
#nesi_path = "/nesi/nobackup/nesi99999/csco212/cer_instrument_data/funcx/test-workflow"
nesi_endpoint = "f456a507-3c5b-41b9-9d7f-2315b9fed386"  # shared collection on NeSI
nesi_path = "/nesi/nobackup/nesi99999/csco212/funcx_demo"

In [12]:
# activate the endpoint
res_nesi_ep = tc.endpoint_autoactivate(nesi_endpoint, if_expires_in=3600)
assert res_nesi_ep['code'] != 'AutoActivationFailed'
res_nesi_ep["message"]

'Endpoint is already activated and does not expire before the requested if_expires_in.'

## 7. Transfer input data to NeSI

First we make a directory name that the simulation will be stored under, then copy the data under that directory.

In [14]:
# make a directory for running under
from datetime import datetime

# get a unique name for this run
workdirbase = datetime.now().strftime("%Y%m%dT%H%M%S")
workdirname = workdirbase
got_dirname = False
existing_names = [item["name"] for item in tc.operation_ls(nesi_endpoint, path="/")]
count = 0
while not got_dirname:
    # check the directory does not already exist
    if workdirname in existing_names:
        count += 1
        workdirname = f"{workdirbase}.{count:06d}"
    else:
        got_dirname = True
print(f"Directory: {workdirname}")

Directory: 20211129T131824
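
The naming logic above can also be factored into a small helper that is easy to test without a Globus connection. This is only a sketch: the function name `unique_dirname` is hypothetical, and the hard-coded listing stands in for the names returned by `tc.operation_ls`.

```python
from datetime import datetime


def unique_dirname(base, existing_names):
    """Return `base`, or `base.NNNNNN` with the first free counter if `base` is taken."""
    taken = set(existing_names)
    if base not in taken:
        return base
    count = 1
    while f"{base}.{count:06d}" in taken:
        count += 1
    return f"{base}.{count:06d}"


# example with a hard-coded listing instead of a real tc.operation_ls(...) call
base = datetime(2021, 11, 29, 13, 18, 24).strftime("%Y%m%dT%H%M%S")
print(unique_dirname(base, ["input", base, base + ".000001"]))
```

In the notebook, `existing_names` would be the list built from `tc.operation_ls(nesi_endpoint, path="/")` as in the cell above.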


In [15]:
# initiate the data transfer to NeSI
tdata = globus_sdk.TransferData(tc, src_endpoint,
                                    nesi_endpoint,
                                    label="Sending input data to NeSI",
                                    sync_level="checksum")

# add the input files to the transfer
nesi_relative_path = "/" + workdirname
nesi_full_path = nesi_path + nesi_relative_path
print(f"Working directory on NeSI will be (relative to collection): {nesi_relative_path}")
print(f"Working directory on NeSI will be (full): {nesi_full_path}")
tdata.add_item(src_input_path, nesi_relative_path,
               recursive=True)

# actually start the transfer
transfer_result = tc.submit_transfer(tdata)
task_id = transfer_result["task_id"]
print("task_id =", transfer_result["task_id"])

# the task id can be used to refer to this transfer
# for example, here we wait for the data transfer to complete
while not tc.task_wait(task_id, timeout=10, polling_interval=10):
    print("waiting for transfer to complete...")
print("transfer to NeSI is complete")

Working directory on NeSI will be (relative to collection): /20211129T131824
Working directory on NeSI will be (full): /nesi/nobackup/nesi99999/csco212/funcx_demo/20211129T131824
task_id = dfd70b1e-50a9-11ec-a9ca-91e0e7641750
waiting for transfer to complete...
waiting for transfer to complete...
waiting for transfer to complete...
waiting for transfer to complete...
waiting for transfer to complete...
waiting for transfer to complete...
waiting for transfer to complete...
waiting for transfer to complete...
transfer to NeSI is complete


You can also log in to the Globus web interface and see the status of your transfer there.
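
The transfer status can also be checked from Python. `task_wait` returns once the task is no longer active, which does not by itself mean it succeeded, so the sketch below looks at the final task status as well. The helper name `wait_for_transfer` and its `max_polls` and `poll_seconds` parameters are hypothetical; it assumes `tc` is a `globus_sdk.TransferClient` (or any object with the same `task_wait` and `get_task` methods).

```python
def wait_for_transfer(tc, task_id, max_polls=60, poll_seconds=10):
    """Wait for a Globus transfer task, then confirm that it succeeded."""
    for _ in range(max_polls):
        # task_wait returns True once the task has finished (successfully or not)
        if tc.task_wait(task_id, timeout=poll_seconds, polling_interval=poll_seconds):
            break
        print("waiting for transfer to complete...")
    else:
        # the loop ran out of polls without the task finishing
        raise TimeoutError(f"transfer {task_id} still running after {max_polls} polls")

    # look up the final status of the finished task
    status = tc.get_task(task_id)["status"]
    if status != "SUCCEEDED":
        raise RuntimeError(f"transfer {task_id} ended with status {status}")
    return status
```

With the variables from the cell above this would be called as `wait_for_transfer(tc, task_id)`, in place of the `while not tc.task_wait(...)` loop.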

## 8. Run the processing using funcX

Two functions are called using FuncX:

1. Submit job to Slurm
2. Check Slurm job status

In [16]:
print(f"FuncX endpoint id: {funcx_endpoint}")

FuncX endpoint id: 3abf6696-8ba4-4ac8-be69-c6c24031373d


Create a simple function that returns the hostname where the endpoint is running, to check that everything works:

In [17]:
# test function to see if things are working
def test_function():
    import socket
    return socket.gethostname()

# With the executor, functions are auto-registered
future = funcx_executor.submit(test_function, endpoint_id=funcx_endpoint)

# You can check status of your task without blocking
print("processing done?", future.done())

# Block and wait for the result:
result = future.result()

print("processing done?", future.done())

print(f"FuncX endpoint is running on: {result}")

processing done? False
processing done? True
mahuika01


Now create the two functions for interacting with Slurm (if the Slurm API were available we could use that instead):

In [18]:
# function that submits a job to Slurm (assumes submit script and other required inputs were uploaded via Globus)
def submit_slurm_job(submit_script, work_dir=None):
    """Submit the given script to Slurm with sbatch and return the sbatch output."""
    # imports must happen inside the function, since it is executed on the endpoint
    import os
    import subprocess
    
    # change to working directory
    if work_dir is not None and os.path.isdir(work_dir):
        os.chdir(work_dir)
        
    print(os.listdir())
    
    # submit the Slurm job and return the job id
    submit_cmd = f'sbatch --priority=9999 {submit_script}'
    with open("submit_cmd.txt", "w") as fh:
        fh.write(submit_cmd + "\n")
    output = subprocess.check_output(submit_cmd, shell=True, universal_newlines=True)
    
    return output

In [19]:
# function that checks Slurm job status
def check_slurm_job_status(jobid):
    """Check Slurm job status."""
    # imports must happen inside the function, since it is executed on the endpoint
    import subprocess
    
    # query the status of the job using sacct
    cmd = f'sacct -j {jobid} -X -o State -n'
    output = subprocess.check_output(cmd, shell=True, universal_newlines=True)
    
    return output

### Submit the job to Slurm

In [20]:
import subprocess

# With the executor, functions are auto-registered
future = funcx_executor.submit(submit_slurm_job, "run.sl", endpoint_id=funcx_endpoint, work_dir=nesi_full_path)

# Block and wait for the result:
try:
    result = future.result()
except subprocess.CalledProcessError as exc:
    print("submitting job failed:")
    print(f"    return code: {exc.returncode}")
    print(f"    cmd: {exc.cmd}")
    print(f"    output: {exc.output}")
    raise  # stop here rather than parsing a missing or stale result below

# get the Slurm Job ID
jobid = result.split()[-1]
print(f"Job submitted: {jobid}")

Job submitted: 23296012
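
Taking the last word of the output works for the usual `Submitted batch job <id>` message, but it would silently return the wrong value if sbatch printed anything after the job ID. A slightly more defensive parse is sketched below; the helper name `parse_sbatch_job_id` is hypothetical, and the sketch assumes sbatch is run without `--parsable`, so that the standard message appears somewhere in the output.

```python
import re


def parse_sbatch_job_id(sbatch_output):
    """Extract the job ID from the 'Submitted batch job <id>' line printed by sbatch."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"could not find a job id in: {sbatch_output!r}")
    return match.group(1)


print(parse_sbatch_job_id("Submitted batch job 23296012\n"))
```

In the cell above, `jobid = parse_sbatch_job_id(result)` would replace `result.split()[-1]`.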


### Wait for the job to complete

In [21]:
import time

job_finished = False
while not job_finished:
    future = funcx_executor.submit(check_slurm_job_status, jobid, endpoint_id=funcx_endpoint)
    print("checking Slurm job status via funcX: ", end="")
    result = future.result()
    job_status = result.strip()
    print(job_status)
    if job_status not in ("RUNNING", "PENDING"):  # TODO: check possible statuses
        job_finished = True
    else:
        # only wait when another status check is needed
        time.sleep(5)
print("Job finished")

checking Slurm job status via funcX: RUNNING
checking Slurm job status via funcX: RUNNING
checking Slurm job status via funcX: RUNNING
checking Slurm job status via funcX: RUNNING
checking Slurm job status via funcX: RUNNING
checking Slurm job status via funcX: RUNNING
checking Slurm job status via funcX: COMPLETED
Job finished
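
The loop above treats every state other than `RUNNING` and `PENDING` as finished, which also covers failures and transient states such as `COMPLETING`. One possible way to address the TODO is sketched below. The helper names are hypothetical, the set of active states is taken from the job state codes in the Slurm documentation, and treating empty sacct output as "still active" is an assumption (the accounting record may not exist yet right after submission).

```python
# Slurm job states in which the job has not reached a final state yet
ACTIVE_STATES = {
    "PENDING", "CONFIGURING", "RUNNING", "COMPLETING",
    "SUSPENDED", "REQUEUED", "RESIZING",
}


def job_is_active(sacct_output):
    """Return True while the job is still queued, running or finishing up."""
    text = sacct_output.strip()
    if not text:
        # assumption: no accounting record yet, so keep polling
        return True
    # the first word is the state, e.g. "CANCELLED by 1000" -> "CANCELLED"
    return text.split()[0] in ACTIVE_STATES


def job_succeeded(sacct_output):
    """Return True only if the final state is COMPLETED."""
    text = sacct_output.strip()
    return bool(text) and text.split()[0] == "COMPLETED"
```

The polling loop could then continue while `job_is_active(result)` is true and check `job_succeeded(result)` before copying the results back.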


## 9. Copy result back using Globus

In [22]:
import os

from_path = nesi_relative_path
to_path = os.path.join(src_output_path, workdirname)
tdata = globus_sdk.TransferData(tc, nesi_endpoint,
                                    src_endpoint,
                                    label="Retrieving results from NeSI",
                                    sync_level="checksum")
tdata.add_item(from_path, to_path, recursive=True)
transfer_result = tc.submit_transfer(tdata)
task_id = transfer_result["task_id"]
print("task_id =", transfer_result["task_id"])

# wait for the data transfer to complete
while not tc.task_wait(task_id, timeout=10, polling_interval=10):
    print("waiting for transfer to complete...")
print(f"transfer from NeSI is complete: {to_path}")

task_id = 31d9412a-50aa-11ec-a9ca-91e0e7641750
waiting for transfer to complete...
waiting for transfer to complete...
waiting for transfer to complete...
waiting for transfer to complete...
transfer from NeSI is complete: /home/cdjs/DocumentsSync/work/projects/funcx_globus_demo/funcx-globus-nesi-demo/output/20211129T131824


## Notes about FuncX so far

* above uses FuncX just to submit a Slurm job and then poll for completion (could also be done with Slurm API if that was made available)
* researcher needs to manually run a funcx endpoint on NeSI (and keep it running there)
  - eventually should be integrated with Globus federated endpoint?
  - this runs an endpoint on the login node
  - could be a pain if the endpoint is killed for some reason and the user needs to reconnect and start it again
* FuncX does know about Slurm too, so you could set FuncX up to directly run your function in a Slurm job without having to submit anything separately, see snippet from an endpoint config.py:
  ```python
    from funcx_endpoint.endpoint.utils.config import Config
    from parsl.providers import LocalProvider, SlurmProvider

    config = Config(
        scaling_enabled=True,
        provider=SlurmProvider(
            "large",
            min_blocks=1,
            max_blocks=1,
            nodes_per_block=1,
            cores_per_node=2,
            mem_per_node=16,
            exclusive=False,
            cmd_timeout=120,
            walltime='2:00:00',
        ),
        #max_workers_per_node=2,
        funcx_service_address='https://api.funcx.org/v1'
    )
  ```
* reasons for not currently using the funcX SlurmProvider directly
  - funcx currently has no way to know how much work a function may involve
    - could lead to failures due to wall time exceeded, etc.
  - not "elastic"
    - have to start a new endpoint if need more resources