# PARASCHUT notebook

we'll go through a small-scale example of the `paraschut` tools.

In [1]:
import os
import paraschut as psu
print(psu.config.QFile)
print(psu.config.JobDir)

example/job_queue.db
example/


## generate a job

we'll start from the default template and update it with data relevant to our example. note that these functions may be run offline.

In [2]:
jobinfo = psu.get_job_template(SetID=True)
jobinfo['name'] = 'example'
jobinfo['CodeDir'] = os.path.abspath('.')
jobinfo['JobIndex'] = 0
jobinfo['script'] = 'python example/job.py {BatchID} {JobIndex}'
# jobinfo['script'] = 'example/template.sh'
# jobinfo['pyfile'] = 'example/job.py'
jobinfo

{'BatchID': 20210225224111,
 'JobIndex': 0,
 'priority': 1,
 'name': 'example',
 'batch_type': 'foo',
 'data': None,
 'script': 'python example/job.py {BatchID} {JobIndex}',
 'queue': None,
 'resources': None,
 'state': 'init',
 'CodeDir': '/home/ec2-user/tools/parallel-comp'}

now let's add some random data for the job to operate on. the job will simply output the mean of this data.

In [3]:
from numpy.random import randint
data = randint(1, 100, (1, 10**4))
psu.generate_data(jobinfo, data)
jobinfo['data']

'example/20210225224111/data_0.pkl'
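
to make the data step concrete, here is a minimal standalone sketch of the round trip: the input is pickled to disk, and a worker-side function loads it and returns its mean. it uses only the standard library (a plain list instead of a NumPy array), and the file name and helper functions are hypothetical; the real `example/job.py` is not reproduced here.

```python
import os
import pickle
import random
import statistics
import tempfile

def save_data(path, values):
    """Pickle the input values to disk (stand-in for the stored data file)."""
    with open(path, 'wb') as f:
        pickle.dump(values, f)

def job_mean(path):
    """Load the pickled values and return their mean."""
    with open(path, 'rb') as f:
        loaded = pickle.load(f)
    return statistics.mean(loaded)

random.seed(0)  # fixed seed so the sketch is reproducible
# integers in 1..99, the same range NumPy's randint(1, 100) draws from
values = [random.randint(1, 99) for _ in range(10**4)]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'data_0.pkl')  # hypothetical file name
    save_data(path, values)
    result = job_mean(path)

print(result)
```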

In [4]:
psu.generate_script(jobinfo)
jobinfo['script']

'python example/job.py 20210225224111 0'
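
the substitution seen above can be mimicked with plain `str.format`. this is only a sketch of the idea using a trimmed-down metadata dict; `fill_template` is a hypothetical helper, not the way `paraschut` implements `generate_script`.

```python
# trimmed-down stand-in for the job metadata shown above
jobinfo = {
    'BatchID': 20210225224111,
    'JobIndex': 0,
    'script': 'python example/job.py {BatchID} {JobIndex}',
}

def fill_template(template, info):
    """Replace each {field} placeholder with the matching metadata value."""
    return template.format(**info)

command = fill_template(jobinfo['script'], jobinfo)
print(command)
```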

you may also try setting the 'script' field to 'example/template.sh' and generating the script again. in that case, inspect the script file that gets written.

finally, let's add the job we built to the queue.

In [5]:
psu.add_job_to_queue(jobinfo)

now let's check that a new job (with JobIndex=0) was added to our queue:

In [6]:
psu.get_queue()


20210225224111: example
{'init': [0]}

missing jobs: {}

total jobs on server queue: 0
running/complete/total: 0/0/1


NOTE that the 'total jobs on server queue' counter in the `get_queue` output counts all currently online jobs belonging to your user, including jobs that are not part of the projects currently managed with `paraschut`.

next, let's verify that the metadata has been properly stored:

In [7]:
psu.get_job_info(20210225224111, 0)

{'BatchID': 20210225224111,
 'JobIndex': 0,
 'priority': 1,
 'name': 'example',
 'batch_type': 'foo',
 'data': 'example/20210225224111/data_0.pkl',
 'script': 'python example/job.py 20210225224111 0',
 'queue': None,
 'resources': None,
 'state': 'init',
 'CodeDir': '/home/ec2-user/tools/parallel-comp',
 'md5': 'f4005066d30f34ec323850f0954a3536'}

## multiple jobs and collection

first, we'll add 3 more jobs similar to our first job.

In [8]:
def duplicate_job(jobinfo, i):
    newjob = jobinfo.copy()  # duplicating to keep BatchID and similar fields identical
    newjob['script'] = 'python example/job.py {BatchID} {JobIndex}'
#     newjob['script'] = 'example/template.sh'
    newjob['JobIndex'] = i

    data = randint(1, 100, (1, 10**4))
    psu.generate_data(newjob, data)

    psu.add_job_to_queue(newjob, build_script=True)
    # this will also generate the script

for i in range(3):
    duplicate_job(jobinfo, i+1)
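
`duplicate_job` relies on `dict.copy()` returning a new dictionary, so that setting `JobIndex` on the copy does not alter the original job. the standalone sketch below (with a made-up metadata dict) shows this, along with one caveat: the copy is shallow, so nested mutable values would still be shared.

```python
# 'tags' is a made-up field, included only to demonstrate the shallow-copy caveat
original = {'BatchID': 20210225224111, 'JobIndex': 0, 'tags': ['a']}

duplicate = original.copy()    # shallow copy: new dict, same value objects
duplicate['JobIndex'] = 1      # rebinding a key affects only the copy

print(original['JobIndex'], duplicate['JobIndex'])

duplicate['tags'].append('b')  # mutating a shared nested list affects both dicts
print(original['tags'])
```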

let's verify that we indeed generated additional jobs.

In [9]:
psu.get_queue()
psu.get_job_info(20210225224111, 3)


20210225224111: example
{'init': [0, 1, 2, 3]}

missing jobs: {}

total jobs on server queue: 0
running/complete/total: 0/0/4


{'BatchID': 20210225224111,
 'JobIndex': 3,
 'priority': 1,
 'name': 'example',
 'batch_type': 'foo',
 'data': 'example/20210225224111/data_3.pkl',
 'script': 'python example/job.py 20210225224111 3',
 'queue': None,
 'resources': None,
 'state': 'init',
 'CodeDir': '/home/ec2-user/tools/parallel-comp',
 'md5': '1b27256a0e89a9e6252158bf972ed7c3'}

finally, let's add a collect job that will compute the mean of means. this job will execute only once the first 4 jobs have completed successfully.

In [10]:
newjob = jobinfo.copy()
newjob['priority'] = 0.5  # lower priority gets executed after higher priority jobs are done
newjob['script'] = 'python example/collect_job.py {BatchID} {JobIndex}'
# newjob['script'] = 'example/template.sh'
# newjob['pyfile'] = 'example/collect_job.py'
newjob['JobIndex'] = 4
newjob['data'] = range(4)  # pointing to previous JobIndices to compute the mean of their results

psu.add_job_to_queue(newjob, build_script=True)
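
a side note on the mean of means: it equals the mean over all the data here only because every job received the same number of samples. a small sketch with made-up numbers shows the equal-size case and how the two quantities diverge otherwise.

```python
from statistics import mean

def mean_of_means(chunks):
    """Average the per-chunk means (what a collect step would do)."""
    return mean(mean(chunk) for chunk in chunks)

def pooled_mean(chunks):
    """Average over all samples, ignoring chunk boundaries."""
    return mean(x for chunk in chunks for x in chunk)

equal_chunks = [[1, 2, 3], [4, 5, 6]]   # same size: both quantities agree
unequal_chunks = [[1, 2, 3], [10]]      # different sizes: they differ

print(mean_of_means(equal_chunks), pooled_mean(equal_chunks))
print(mean_of_means(unequal_chunks), pooled_mean(unequal_chunks))
```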

## submit jobs
`submit_jobs` is the only job control function that must run on a server. in our case, LocalJobExecutor is configured to run the jobs on the local machine.

In [12]:
psu.submit_jobs()

submiting:	python example/job.py 20210225224111 0
submiting:	python example/job.py 20210225224111 1
submiting:	python example/job.py 20210225224111 2
submiting:	python example/job.py 20210225224111 3
max jobs: 1000
in queue: 0
submitted: 4


note that only the first 4 jobs were submitted and are currently running. the collect job is waiting for them to complete.
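
this behaviour is consistent with a simple gating rule: only 'init' jobs at the highest priority level that still has unfinished jobs are eligible for submission. the sketch below is a hypothetical reimplementation of that rule for illustration; `eligible_jobs` is not part of `paraschut`, and the library's actual logic may differ.

```python
def eligible_jobs(jobs):
    """Return the JobIndex values that may be submitted right now."""
    unfinished = [j for j in jobs if j['state'] != 'complete']
    if not unfinished:
        return []
    # lower-priority jobs wait until every higher-priority job is complete
    top = max(j['priority'] for j in unfinished)
    return [j['JobIndex'] for j in unfinished
            if j['priority'] == top and j['state'] == 'init']

# four regular jobs plus one lower-priority collect job
jobs = [{'JobIndex': i, 'priority': 1, 'state': 'init'} for i in range(4)]
jobs.append({'JobIndex': 4, 'priority': 0.5, 'state': 'init'})

first_round = eligible_jobs(jobs)

for j in jobs[:4]:
    j['state'] = 'run'
while_running = eligible_jobs(jobs)

for j in jobs[:4]:
    j['state'] = 'complete'
second_round = eligible_jobs(jobs)

print(first_round, while_running, second_round)
```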

## monitor jobs


let's check that the jobs are indeed online and running (note the * next to jobs 0-3 in the batch, which marks them as online):

In [13]:
psu.get_queue()


20210225224111: example
{'run': ['0*', '1*', '2*', '3*'], 'init': [4]}

missing jobs: {}

total jobs on server queue: 4
running/complete/total: 4/0/5
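
the summary above groups job indices by state and appends * to jobs that are currently online. a hypothetical helper that builds a dictionary of the same shape from a list of job records could look as follows; the `online` flag and the function name are assumptions made for this sketch, not `paraschut` internals.

```python
def summarize(jobs):
    """Group job indices by state, marking online jobs with a trailing '*'."""
    summary = {}
    for job in jobs:
        label = f"{job['JobIndex']}*" if job['online'] else job['JobIndex']
        summary.setdefault(job['state'], []).append(label)
    return summary

# made-up records mirroring the batch state shown above
jobs = [{'JobIndex': i, 'state': 'run', 'online': True} for i in range(4)]
jobs.append({'JobIndex': 4, 'state': 'init', 'online': False})

summary = summarize(jobs)
running = len(summary.get('run', []))
complete = len(summary.get('complete', []))
counter = f"running/complete/total: {running}/{complete}/{len(jobs)}"

print(summary)
print(counter)
```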


this is how the output looks once the jobs have finished:

In [14]:
psu.get_queue()


20210225224111: example
{'complete': [0, 1, 2, 3], 'init': [4]}

missing jobs: {}

total jobs on server queue: 0
running/complete/total: 0/4/5


it's time to run the collect job.

In [15]:
psu.submit_jobs()

submiting:	python example/collect_job.py 20210225224111 4
max jobs: 1000
in queue: 0
submitted: 1


after a short while, all jobs should be in the 'complete' state.

In [16]:
psu.get_queue()


20210225224111: example
{'complete': [0, 1, 2, 3, 4]}

missing jobs: {}

total jobs on server queue: 0
running/complete/total: 0/5/5


we can now check the logs created by the jobs (stdout and stderr) and their post-run metadata (which may include a PBS report summary, for example). in this case, the result was printed to the stdout file and also stored in the 'result' field of the job metadata.

In [17]:
psu.print_log(20210225224111, 4, 'stdout')
psu.get_job_info(20210225224111, 4)



[[[stdout log for 20210225224111/example/job_4:]]]

50.183325
max jobs: 1000
in queue: 0
submitted: 0


{'BatchID': 20210225224111,
 'JobIndex': 4,
 'priority': 0.5,
 'name': 'example',
 'batch_type': 'foo',
 'data': range(0, 4),
 'script': 'python example/collect_job.py 20210225224111 4',
 'queue': None,
 'resources': None,
 'state': 'complete',
 'CodeDir': '/home/ec2-user/tools/parallel-comp',
 'subtime': 20210225224451,
 'stdout': ['example/20210225224111/logs/example.o4293091116'],
 'stderr': ['example/20210225224111/logs/example.e4293091116'],
 'hostname': 'ip-10-217-9-184',
 'result': 50.183325,
 'qstat': {},
 'md5': '227158f81a8b5caf5e3d6b1955ed08db'}

finally, we may clear all batches whose jobs have all completed using the following function:

In [18]:
psu.remove_batch_by_state('complete')
psu.get_queue()


missing jobs: {}

total jobs on server queue: 0
running/complete/total: 0/0/0
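
conceptually, this cleanup keeps only the batches that still have at least one job outside the 'complete' state. a minimal sketch of that filter over an in-memory dictionary follows; the queue layout, the second BatchID, and the function name are made up for illustration and do not reflect how `paraschut` stores its queue.

```python
def remove_complete_batches(queue):
    """Return a new queue without batches whose jobs are all 'complete'."""
    return {
        batch_id: jobs
        for batch_id, jobs in queue.items()
        if not all(state == 'complete' for state in jobs.values())
    }

# hypothetical queue: BatchID -> {JobIndex: state}
queue = {
    20210225224111: {i: 'complete' for i in range(5)},
    20210301000000: {0: 'complete', 1: 'run'},  # made-up unfinished batch
}

remaining = remove_complete_batches(queue)
print(remaining)
```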
