# PARASCHUT notebook

we'll go through a small-scale example of the `paraschut` tools.

In [1]:
import os
import paraschut as psu
print(psu.config.QFile)
print(psu.config.JobDir)

example/job_queue.db
example/


## generate a job

we'll start from the default template and update it with data relevant to our example. note that these functions may be run offline.

In [2]:
jobinfo = psu.get_job_template(SetID=True)
jobinfo['name'] = ['paraschut example']
jobinfo['CodeDir'] = os.path.abspath('.')
jobinfo['JobIndex'] = 0
jobinfo['script'] = 'python example/job.py {BatchID} {JobIndex}'
jobinfo

{'BatchID': 20210220122655,
 'JobIndex': 0,
 'priority': 1,
 'name': ['paraschut example'],
 'data_type': 'foo',
 'data': None,
 'script': 'python example/job.py {BatchID} {JobIndex}',
 'queue': 'tamir-nano4',
 'resources': {'mem': '1gb',
  'pmem': '1gb',
  'vmem': '3gb',
  'pvmem': '3gb',
  'cput': '04:59:00'},
 'state': 'init',
 'CodeDir': 't:\\dalon\\misc\\paraschut'}

now let's add some random data for the job to operate on. this job will just output its mean.

In [3]:
from numpy.random import randint
data = randint(1, 100, (1, 10**4))
psu.generate_data(jobinfo, data)
jobinfo['data']

'example//20210220122655/data_0.pkl'
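
for orientation, here is a rough sketch of what a job operating on such a file might do. it assumes the data file is an ordinary pickle (suggested by the `.pkl` extension, but not confirmed here), and it uses a temporary file and the standard library instead of `paraschut` and numpy. the real `example/job.py` is not shown in this notebook and may well differ.

```python
import os
import pickle
import random
import tempfile
from statistics import mean

# stand-in for the data generated above: 10**4 integers in [1, 100)
random.seed(0)
data = [random.randrange(1, 100) for _ in range(10**4)]

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'data_0.pkl')  # hypothetical file location
    with open(path, 'wb') as f:
        pickle.dump(data, f)

    # job side: load the data and compute its mean
    with open(path, 'rb') as f:
        loaded = pickle.load(f)
    result = mean(loaded)

print(result)
```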

In [4]:
psu.generate_script(jobinfo)
jobinfo['script']

'python example/job.py 20210220122655 0'
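
the placeholder substitution seen above can be mimicked with plain `str.format`. this is only a sketch of the idea, assuming the `{BatchID}` and `{JobIndex}` fields are filled from the job's metadata. `fill_template` is a hypothetical helper, not part of `paraschut`, whose actual implementation may differ.

```python
def fill_template(template, jobinfo):
    """Fill {Field} placeholders from a job metadata dict (hypothetical helper)."""
    # str.format ignores unused keyword arguments, so the whole dict can be passed
    return template.format(**jobinfo)

jobinfo = {'BatchID': 20210220122655, 'JobIndex': 0, 'priority': 1}
script = fill_template('python example/job.py {BatchID} {JobIndex}', jobinfo)
print(script)  # python example/job.py 20210220122655 0
```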

you may also try setting the 'script' field to 'example/template.sh' and then generating a script. inspect the script file that is written in this case.

finally, let's add the job we built to the queue.

In [5]:
psu.add_job_to_queue(jobinfo)

now let's check that a new job (with JobIndex=0) was added to our queue:

In [6]:
psu.get_queue()


20210220122655: paraschut example
{'init': [0]}

missing jobs: {}

total jobs on server queue: 0
running/complete/total: 0/0/1


note that the server queue job counter (the `total jobs on server queue` line in the `get_queue` output) counts all currently online jobs associated with one's user, including those that are not part of the projects currently managed using `paraschut`.
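
the per-batch summary groups job indices by their state. the sketch below shows that kind of grouping on a plain list of job records. `summarize_states` is a hypothetical helper written for illustration, and the records are made up to resemble a batch with four completed jobs and one waiting job; this is not how `get_queue` is necessarily implemented.

```python
from collections import defaultdict

def summarize_states(jobs):
    """Group JobIndex values by job state (hypothetical helper)."""
    by_state = defaultdict(list)
    for job in jobs:
        by_state[job['state']].append(job['JobIndex'])
    return dict(by_state)

# made-up job records: jobs 0-3 complete, job 4 still waiting
jobs = [{'JobIndex': i, 'state': 'complete'} for i in range(4)]
jobs.append({'JobIndex': 4, 'state': 'init'})

summary = summarize_states(jobs)
running = len(summary.get('run', []))
complete = len(summary.get('complete', []))
counters = f'{running}/{complete}/{len(jobs)}'

print(summary)   # {'complete': [0, 1, 2, 3], 'init': [4]}
print(counters)  # 0/4/5
```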

next, let's verify that the metadata has been properly stored:

In [7]:
psu.get_job_info(20210220122655, 0)

{'BatchID': 20210220122655,
 'JobIndex': 0,
 'priority': 1,
 'name': ['paraschut example'],
 'data_type': 'foo',
 'data': 'example//20210220122655/data_0.pkl',
 'script': 'python example/job.py 20210220122655 0',
 'queue': 'tamir-nano4',
 'resources': {'mem': '1gb',
  'pmem': '1gb',
  'vmem': '3gb',
  'pvmem': '3gb',
  'cput': '04:59:00'},
 'state': 'init',
 'CodeDir': 't:\\dalon\\misc\\paraschut',
 'md5': 'a5abaab4190f1dd7fab0e540322194c5'}

## multiple jobs and collection

first, we'll add 3 more jobs similar to our first job.

In [8]:
def duplicate_job(jobinfo, i):
    newjob = jobinfo.copy()  # duplicating to keep BatchID and similar fields identical
    newjob['script'] = 'python example/job.py {BatchID} {JobIndex}'
    newjob['JobIndex'] = i

    data = randint(1, 100, (1, 10**4))
    psu.generate_data(newjob, data)

    # build_script=True also generates the script for the new job
    psu.add_job_to_queue(newjob, build_script=True)

for i in range(3):
    duplicate_job(jobinfo, i+1)
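
a word of caution on `dict.copy()`: it makes a shallow copy, so nested objects such as the 'resources' dict are shared between the original and the copy. this is harmless in `duplicate_job`, which only reassigns top-level fields, but it matters if you want to change nested values per job. the sketch below uses a simplified stand-in dict, not a real `paraschut` job.

```python
import copy

# simplified stand-in for a job metadata dict
jobinfo = {'JobIndex': 0, 'resources': {'mem': '1gb'}}

shallow = jobinfo.copy()
shallow['JobIndex'] = 1              # top-level assignment: original untouched
shallow['resources']['mem'] = '2gb'  # nested mutation: leaks into the original
print(jobinfo['JobIndex'], jobinfo['resources']['mem'])  # 0 2gb

deep = copy.deepcopy(jobinfo)
deep['resources']['mem'] = '4gb'     # nested dict was copied, original untouched
print(jobinfo['resources']['mem'])   # 2gb
```

if per-job resource changes are needed, `copy.deepcopy` (or assigning a fresh 'resources' dict) avoids the shared state.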

let's verify that we indeed generated additional jobs.

In [9]:
psu.get_queue()
psu.get_job_info(20210220122655, 3)


20210220122655: paraschut example
{'init': [0, 1, 2, 3]}

missing jobs: {}

total jobs on server queue: 0
running/complete/total: 0/0/4


{'BatchID': 20210220122655,
 'JobIndex': 3,
 'priority': 1,
 'name': ['paraschut example'],
 'data_type': 'foo',
 'data': 'example//20210220122655/data_3.pkl',
 'script': 'python example/job.py 20210220122655 3',
 'queue': 'tamir-nano4',
 'resources': {'mem': '1gb',
  'pmem': '1gb',
  'vmem': '3gb',
  'pvmem': '3gb',
  'cput': '04:59:00'},
 'state': 'init',
 'CodeDir': 't:\\dalon\\misc\\paraschut',
 'md5': '5477cbf6983e33c5ad1e907231bab119'}

finally, let's add a collect job that will compute the mean of means. this job will execute only once the first 4 jobs have completed successfully.

In [10]:
newjob = jobinfo.copy()
newjob['priority'] = 0.5  # lower priority gets executed after higher priority jobs are done
newjob['script'] = 'python example/collect_job.py {BatchID} {JobIndex}'
newjob['JobIndex'] = 4
newjob['data'] = range(4)  # pointing to previous JobIndices to compute the mean of their results

psu.add_job_to_queue(newjob, build_script=True)

In [11]:
psu.get_job_info(20210220122655, 4)

{'BatchID': 20210220122655,
 'JobIndex': 4,
 'priority': 0.5,
 'name': ['paraschut example'],
 'data_type': 'foo',
 'data': range(0, 4),
 'script': 'python example/collect_job.py 20210220122655 4',
 'queue': 'tamir-nano4',
 'resources': {'mem': '1gb',
  'pmem': '1gb',
  'vmem': '3gb',
  'pvmem': '3gb',
  'cput': '04:59:00'},
 'state': 'init',
 'CodeDir': 't:\\dalon\\misc\\paraschut',
 'md5': 'a6e406c9267820d9509a35902eaee20f'}
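
a quick check of why a mean of means is meaningful here: since every job receives the same number of samples (10**4), averaging the per-job means gives the mean of the pooled data. the sketch below illustrates this with the standard library only. it is independent of `paraschut` and of the actual `example/collect_job.py`, which is not shown here.

```python
import random
from fractions import Fraction
from statistics import mean

random.seed(1)
# four equally sized samples, standing in for the data of jobs 0-3
samples = [[random.randrange(1, 100) for _ in range(10**4)] for _ in range(4)]

# exact rational arithmetic avoids floating-point noise in the comparison
job_means = [Fraction(sum(s), len(s)) for s in samples]
mean_of_means = mean(job_means)
pooled_mean = Fraction(sum(sum(s) for s in samples),
                       sum(len(s) for s in samples))

print(float(mean_of_means), mean_of_means == pooled_mean)
```

with unequal sample sizes the two quantities would generally differ, and the per-job means would need to be weighted by sample size.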

## submit jobs
`submit_jobs` is the only job control function that must run on a server. in our case, LocalJobExecutor is configured to run on the local machine.

In [12]:
psu.submit_jobs()

submiting:	python example/job.py 20210220122655 0
submiting:	python example/job.py 20210220122655 1
submiting:	python example/job.py 20210220122655 2
submiting:	python example/job.py 20210220122655 3
max jobs: 1000
in queue: 0
submitted: 4


note that only the first 4 jobs were submitted and are currently running. the collect job is waiting for them to complete.
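
the waiting behavior can be pictured with a small sketch of priority gating: only 'init' jobs at the highest priority level that still has unfinished jobs are eligible. this is an assumption about the rule described above (lower priority jobs run after higher priority jobs are done). `select_submittable` is a hypothetical helper, not the `paraschut` implementation.

```python
def select_submittable(jobs):
    """Return indices of 'init' jobs at the highest unfinished priority level
    (hypothetical helper illustrating priority gating)."""
    pending = [j for j in jobs if j['state'] != 'complete']
    if not pending:
        return []
    top = max(j['priority'] for j in pending)
    return [j['JobIndex'] for j in pending
            if j['priority'] == top and j['state'] == 'init']

# four regular jobs and one lower-priority collect job
jobs = [{'JobIndex': i, 'priority': 1, 'state': 'init'} for i in range(4)]
jobs.append({'JobIndex': 4, 'priority': 0.5, 'state': 'init'})

first_round = select_submittable(jobs)
print(first_round)  # [0, 1, 2, 3]

for j in jobs[:4]:
    j['state'] = 'run'
while_running = select_submittable(jobs)
print(while_running)  # []

for j in jobs[:4]:
    j['state'] = 'complete'
second_round = select_submittable(jobs)
print(second_round)  # [4]
```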

## monitor jobs


let's check that the jobs are indeed online and running (note the * next to jobs 0-3 in the batch, which indicates that they are):

In [14]:
psu.get_queue()


20210220122655: paraschut example
{'run': ['0*', '1*', '2*', '3*'], 'init': [4]}

missing jobs: {}

total jobs on server queue: 4
running/complete/total: 4/0/5


this is how the output looks once the jobs have finished:

In [19]:
psu.get_queue()


20210220122655: paraschut example
{'complete': [0, 1, 2, 3], 'init': [4]}

missing jobs: {}

total jobs on server queue: 0
running/complete/total: 0/4/5


it's time to run the collect job.

In [20]:
psu.submit_jobs()

submiting:	python example/collect_job.py 20210220122655 4
max jobs: 1000
in queue: 0
submitted: 1


after a short while, all jobs should be in the 'complete' state.

In [23]:
psu.get_queue()


20210220122655: paraschut example
{'complete': [0, 1, 2, 3, 4]}

missing jobs: {}

total jobs on server queue: 0
running/complete/total: 0/5/5


we can now check the logs created by the jobs (stdout and stderr), and their post-run metadata (which may include a PBS report summary, for example). in this case, the result was printed to the stdout file as well as stored in the 'result' field of the job metadata.

In [24]:
psu.print_log(20210220122655, 4, 'stdout')
psu.get_job_info(20210220122655, 4)



[[[stdout log for 20210220122655/paraschut example/job_4:]]]

50.058975000000004
max jobs: 1000
in queue: 0
submitted: 0


{'BatchID': 20210220122655,
 'JobIndex': 4,
 'priority': 0.5,
 'name': ['paraschut example'],
 'data_type': 'foo',
 'data': range(0, 4),
 'script': 'python example/collect_job.py 20210220122655 4',
 'queue': 'tamir-nano4',
 'resources': {'mem': '1gb',
  'pmem': '1gb',
  'vmem': '3gb',
  'pvmem': '3gb',
  'cput': '04:59:00'},
 'state': 'complete',
 'CodeDir': 't:\\dalon\\misc\\paraschut',
 'subtime': 20210220122847,
 'stdout': ['example/20210220122655/logs/3816927923.power8.tau.ac.il.OU'],
 'stderr': ['example/20210220122655/logs/3816927923.power8.tau.ac.il.ER'],
 'hostname': 'alond-pc',
 'result': 50.058975000000004,
 'qstat': {},
 'md5': 'e3ca8880b8b6d0047aeca777415e7045'}

finally, we can clear all batches whose jobs have all completed, and confirm that the queue is now empty:

In [25]:
psu.remove_batch_by_state('complete')
psu.get_queue()


missing jobs: {}

total jobs on server queue: 0
running/complete/total: 0/0/0
