Submit ALICE pilot jobs to Work Queue. Configure an AliEn Grid site using Work Queue from CCTools as a backend to manage jobs, without the need for a complex resource management system.
Assumptions:
- CCTools comes from /cvmfs/alice.cern.ch/x86_64-2.6-gnu-4.1.2/Packages/cctools/v5.3.1-2
- alien-workqueue will be installed to $HOME/awq
You need to install CCTools first.
Once you have done that, clone, configure and install:
git clone https://github.com/dberzano/alien-workqueue
cd alien-workqueue
mkdir build
cd build
cmake .. -DCMAKE_INSTALL_PREFIX=$HOME/awq -DCCTOOLS=/cvmfs/alice.cern.ch/x86_64-2.6-gnu-4.1.2/Packages/cctools/v5.3.1-2
make install
This will install it under ~/awq.
In our simple setup we have:
- alien-workqueue acting as a Work Queue master
- a layer of Work Queue foremen to offload job control (they act as workers with respect to the master, and as masters with respect to the workers)
The AliEn submit node must be configured as a Condor node. alien-workqueue comes with a fake condor_q and condor_submit that enable the AliEn Condor backend to perform Work Queue submission as if it were a pure Condor site.
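As an optional sanity check (a suggestion, not part of the original procedure), you can verify that the fake Condor commands resolve from the alien-workqueue installation once its environment is loaded, assuming env.sh prepends the install's bin directory to the PATH:
source $HOME/awq/etc/env.sh
command -v condor_submit   # should point inside $HOME/awq, not to a real Condor installation
command -v condor_q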
Start the foremen as follows (use WQ_NUM_FOREMEN to set the number of desired foremen):
source $HOME/awq/etc/env.sh
WQ_NUM_FOREMEN=12 alien_wq_worker.sh start-foremen
By default they will connect to a master on localhost:9094. If you want to change the master, prepend WQ_MASTER=host:port (the port is mandatory).
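For example, to start 12 foremen that connect to a master running on a different host (the host name below is purely illustrative):
source $HOME/awq/etc/env.sh
# master.example.com is a placeholder for your master host; the port is mandatory
WQ_MASTER=master.example.com:9094 WQ_NUM_FOREMEN=12 alien_wq_worker.sh start-foremen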
Foremen will be started in the background. They can be terminated with:
alien_wq_worker.sh stop-foremen
Used ports default to the range from 9080 to 9080+$WQ_NUM_FOREMEN-1. See the advanced usage section below for configuring this parameter and more.
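As a quick way to verify that the foremen are listening on that range (a sketch, assuming a Linux host with ss available and the defaults above):
# with WQ_NUM_FOREMEN=12 and the default base port 9080, expect ports 9080 through 9091
ss -ltn | grep -E ':90(8[0-9]|9[01]) '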
Now start the AliEn CE service, again after loading the alien-workqueue environment (which is necessary for AliEn to pick up our fake condor_q and condor_submit):
source $HOME/awq/etc/env.sh
alien StartCE
It is now recommended to run alien-workqueue inside a "screen" session. Open a screen:
screen -S awq_screen
Inside the screen:
source $HOME/awq/etc/env.sh
alien_work_queue
Terminate it with Ctrl-C. You can detach from the screen with Ctrl-A D and reattach later with screen -r awq_screen.
Note that by default no Work Queue Catalog is updated. To enable reporting to the catalog, see the --catalog-* parameters under advanced usage.
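For instance, enabling catalog reporting could look like the following (the catalog host name is a placeholder; the options themselves are documented under advanced usage below):
source $HOME/awq/etc/env.sh
# wq-catalog.example.com stands in for your Work Queue Catalog server
alien_work_queue --catalog-host wq-catalog.example.com --catalog-port 9097 --project-name alien_jobs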
The procedure for starting the workers is similar.
source $HOME/awq/etc/env.sh
WQ_NUM_FOREMEN=12 WQ_FOREMEN=foremen_host alien_wq_worker.sh start-workers
Set $WQ_FOREMEN to the hostname or IP address of the host running the foremen.
To stop them:
alien_wq_worker.sh stop-workers
Workers can also connect directly to the master without using any foremen. In this case we start the workers like this:
source $HOME/awq/etc/env.sh
WQ_NUM_FOREMEN=1 WQ_FOREMEN=master_host WQ_MASTER_BASEPORT=9094 alien_wq_worker.sh start-workers
From the point of view of the work_queue_worker there is no difference between connecting to a master or to a foreman. Here we are assuming that the master is running on port 9094 of host master_host.
The following variables can be defined when starting alien_wq_worker.sh:
- WQ_MASTER: host:port of the Work Queue master, i.e. the alien_work_queue process. Port is mandatory. Defaults to 127.0.0.1:9094.
- WQ_FOREMEN: host (IP or hostname) where foremen are running. Defaults to 127.0.0.1.
- WQ_NUM_FOREMEN: number of foremen. Defaults to 2.
- WQ_MASTER_BASEPORT: first listening port of the foremen. Every foreman will listen on $WQ_MASTER_BASEPORT + index. Defaults to 9080.
- WQ_WORKDIR: writable temporary directory for Work Queue foremen and/or workers. Defaults to ~/.alien_wq_tmp. Does not need to be created in advance.
- WQ_DRAIN: drain file (by default ~/.alien_wq_drain). If this file exists, workers and foremen are put in drain mode, i.e. they exit once their current jobs are done.
- WQ_SINGLETASK: if set to 1 (default is 0) the Work Queue worker runs in "single task" mode, i.e. it shuts down after executing the first task. This is useful when workers are frequently scheduled as microservices (as in Mesos).
- WQ_IDLETIMEOUT: number of seconds to wait for a new task before exiting. Defaults to 100.
- WQ_NORESPAWN: if set to 1, do not respawn work_queue_worker after it exits. Defaults to 0 (i.e. respawn indefinitely).
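To put several of these variables together, here is an illustrative worker startup; the host name, work directory and timeout below are example values only:
source $HOME/awq/etc/env.sh
# workers connecting to 12 foremen on foremen.example.com (ports 9080-9091),
# using a custom scratch directory and exiting after 10 minutes without tasks
WQ_FOREMEN=foremen.example.com WQ_NUM_FOREMEN=12 \
  WQ_WORKDIR=/scratch/alien_wq_tmp WQ_IDLETIMEOUT=600 \
  alien_wq_worker.sh start-workers
To drain gracefully later on, create the drain file: workers and foremen will then exit once their current jobs are done.
touch ~/.alien_wq_drain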
The following options are available to alien_work_queue:
- --job-wrapper <full_path>: specify a path to an executable (a simple script will do) used to load the AliEn job agent. The script must at some point load it with something like exec "$@" (therefore keeping the same PID). Before doing so it can, for instance, set up the environment properly. We provide an example, installed under <prefix>/etc/alien_job_wrapper.sh, which is used to conditionally run the job agent through Parrot if the CVMFS mount point is not available in the job environment.
- --work-dir <full_path>: working directory (defaults to /tmp/awq). If you set this parameter you must also export AWQ_WORKDIR to make it visible to the fake condor_* commands.
- --catalog-host <host>, --catalog-port <port>: if specified, report information to the specified Work Queue Catalog server. Port is optional and defaults to 9097. A project name can optionally be specified with --project-name.
- --project-name: project name reported to the catalog server, if enabled. Defaults to alien_jobs.
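As an illustration of the --job-wrapper contract described above, here is a minimal wrapper sketch (the environment setup line is a placeholder; the shipped <prefix>/etc/alien_job_wrapper.sh is a real, more complete example handling Parrot and CVMFS):
#!/bin/bash
# my_wrapper.sh (hypothetical): prepare the job environment, then hand over to
# the AliEn job agent while keeping the same PID
export MY_SITE_VARIABLE=value   # placeholder for site-specific setup
exec "$@"
A corresponding alien_work_queue invocation with a custom working directory might then look like this (paths are examples; remember to export AWQ_WORKDIR so the fake condor_* commands see the same directory):
source $HOME/awq/etc/env.sh
export AWQ_WORKDIR=/data/awq
alien_work_queue --work-dir /data/awq --job-wrapper $HOME/awq/etc/alien_job_wrapper.sh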
alien-workqueue enables Work Queue to be used as a (fake) Condor backend for ALICE AliEn job submission. Since we are submitting pilot jobs we do not need a complex job scheduling mechanism and Work Queue is lightweight enough for us.
We are currently using alien-workqueue in production to opportunistically use the ALICE High Level Trigger clusters outside data taking. With this extremely simple setup, consisting of:
- 12 foremen
- alien-workqueue
- the AliEn VoBox services
all running on a single box, we can easily scale from zero to 10k concurrently running jobs within minutes!
Moreover, with the mesos-workqueue tool we have developed we can run the ALICE AliEn middleware on Mesos-provided resources too.
Thanks to the CCTools people for making such a great tool!