Linux batch scheduler supporting job memory usage limits


org.brutusin:wava

wava is a Linux command-line tool that allows multiple users to securely run batch processes, scheduled in a timely manner under constraints on allocated (resident) memory capacity.



Overview

The wava scheduler is designed to run batch processes on a single Linux machine in an operator-friendly manner, imposing a memory limit on the overall job set (hard limit) and guaranteeing a minimum amount of available memory per job (soft limit).

Traditionally, fixed-size queues have been used for enqueueing jobs, but when applied to a set of jobs with heterogeneous memory demands they show several weaknesses: inefficient resource utilization, system performance degradation, and competition between jobs for resources.

The wava scheduler is designed to overcome these weaknesses without losing ease of use: it provides guarantees for both system stability and job resource allocation while offering better resource utilization.

Features

Capacity guarantees

Jobs are submitted with a minimum memory claim (job size), then enqueued, and finally executed in a sandboxed environment (implemented via cgroups) with the scheduler capacity as the hard limit and the job size as the soft limit.

At runtime, the job process tree may allocate resident memory up to the scheduler capacity (if no other jobs are scheduled); under memory pressure it is forced to swap out.

The scheduler guarantees that each job process tree has at its disposal at least the minimum memory claimed at job submission.
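As a rough illustration of the hard/soft limit pairing described above, the sketch below computes which cgroup v1 memory controller files a per-job cgroup could carry. The file names `memory.limit_in_bytes` and `memory.soft_limit_in_bytes` are the standard cgroup v1 ones; the directory layout and the function name are hypothetical, and this is not necessarily how wava itself configures its cgroups.

```python
import posixpath

def job_memory_limits(cgroup_dir, job_size, scheduler_capacity):
    """Map cgroup v1 memory control files to the values they would hold.

    cgroup_dir is a hypothetical per-job cgroup directory; sizes are in bytes.
    Nothing is written to disk: the function only describes the settings.
    """
    return {
        # Hard limit: the process tree can never hold more resident memory.
        posixpath.join(cgroup_dir, "memory.limit_in_bytes"): str(scheduler_capacity),
        # Soft limit: under memory pressure the kernel reclaims down to this.
        posixpath.join(cgroup_dir, "memory.soft_limit_in_bytes"): str(job_size),
    }

MB = 1000 * 1000
# "/sys/fs/cgroup/memory/wava/1" is an assumed path, used only for illustration.
limits = job_memory_limits("/sys/fs/cgroup/memory/wava/1", 100 * MB, 500 * MB)
for path, value in limits.items():
    print(path, value)
```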

Security

Job processes are run by the same machine user that submitted the job, so the scheduler cannot be used to escalate a user's privileges.

Resource-based scheduling

Scheduling is based on memory. The scheduler is configured to have a certain capacity and each job has an associated minimum memory size.

The main scheduling constraint is the following: the sum of the minimum memory sizes of the running jobs cannot exceed the scheduler capacity.
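The constraint can be written as a one-line admission check. This is a minimal sketch with hypothetical names (`can_start`, `running_sizes`), not the scheduler's actual code; sizes are plain byte counts.

```python
def can_start(job_size, running_sizes, capacity):
    """Return True if starting a job of job_size bytes keeps the sum of the
    minimum memory claims of all running jobs within capacity bytes."""
    return sum(running_sizes) + job_size <= capacity

MB = 1000 * 1000
capacity = 500 * MB
running = [100 * MB, 300 * MB]

print(can_start(100 * MB, running, capacity))  # True: 500 MB fits exactly
print(can_start(150 * MB, running, capacity))  # False: 550 MB exceeds capacity
```

A job that fails the check stays queued until running jobs finish and release their claims.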

Priority-based scheduling

This feature allows jobs to be submitted and scheduled with different priorities.

All jobs belong (implicitly or explicitly) to a priority group that determines their global ordering, which is used for positioning in the queue and for assigning a process niceness when running.

Statistics

Both global and per-job statistics can be recorded. Statistics sensitivity can be tuned with the statsCpuStep, statsRssStep, statsSwapStep and statsIOStep configuration parameters.

#start          end             running queued  cpu(%)  rss(B)  swap(B) io(B/s)
1494930723326   1494930723326   0       0       0.0     0       0       0
1494930723326   1494930731348   1       0       0.0     0       0       0
1494930731348   1494930733355   1       0       92.7    696320  0       0
...
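The sample above is whitespace-separated with a `#`-prefixed header, so it is straightforward to load for analysis. The sketch below assumes the column order shown in the sample and that `start` and `end` are epoch milliseconds; the `StatsRow` type and `parse_stats` function are hypothetical names.

```python
from typing import NamedTuple

class StatsRow(NamedTuple):
    start_ms: int
    end_ms: int
    running: int
    queued: int
    cpu_pct: float
    rss_bytes: int
    swap_bytes: int
    io_bytes_per_s: float

def parse_stats(text):
    """Parse stats lines, skipping blank lines and '#' header lines."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        f = line.split()
        rows.append(StatsRow(int(f[0]), int(f[1]), int(f[2]), int(f[3]),
                             float(f[4]), int(f[5]), int(f[6]), float(f[7])))
    return rows

sample = """\
#start          end             running queued  cpu(%)  rss(B)  swap(B) io(B/s)
1494930723326   1494930723326   0       0       0.0     0       0       0
1494930723326   1494930731348   1       0       0.0     0       0       0
1494930731348   1494930733355   1       0       92.7    696320  0       0
"""
rows = parse_stats(sample)
# Duration of each interval in milliseconds
print([r.end_ms - r.start_ms for r in rows])  # [0, 8022, 2007]
```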

Global statistics

To enable global statistics, set "logStats" : true in the configuration file.

Job statistics

At submit time (wava -r), specify the additional parameter -s with the folder in which to store the stats files for this execution only.

Architecture

The scheduler runs as a centralized process (wava -s), and all interaction with it is performed by separate client processes started when invoking the other command options.

In particular, job submissions are performed by separate peer processes (wava -r) that serve as lightweight placeholders for the real jobs executed by the scheduler. Peer and job processes have their lifecycles bound to each other: if one dies, the other dies too.


The scheduler pipes standard io-streams between the job processes and their respective peer processes. Additionally, it pipes scheduler events to the peer stderr unless an event file has been specified in submission (wava -r -e <file>).

Figure (wava example): running an example command requiring a minimum of 100 MB. The command runs until user 'nacho' cancels it, returning a non-zero return code.

Priority and groups

All submitted jobs belong to a priority group that determines their global ordering, used for positioning in the queue and assigning a process niceness when running.

Besides a priority, groups also have a timeToIdleSeconds property: the time that elapses between the group's last job finishing and the group being removed. If this value is set to -1, the group is eternal.

Jobs that do not specify a group at submit time are assigned to the default group (priority=0, timeToIdleSeconds=-1).

Jobs that specify a non-existent group create a dynamic group at submit time (priority=0, timeToIdleSeconds as specified in the configuration).
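The group lifecycle rule can be sketched as a small predicate. The function and parameter names are hypothetical, and whether removal happens exactly at the threshold or just after it is an assumption of this sketch.

```python
def group_expired(time_to_idle_seconds, idle_since, now):
    """Decide whether an idle group should be removed.

    time_to_idle_seconds: the group's timeToIdleSeconds (-1 means eternal).
    idle_since: time in seconds when the group's last job finished,
                or None while the group still has jobs.
    now: current time in seconds.
    """
    if time_to_idle_seconds == -1:
        return False  # eternal group, e.g. the default group
    if idle_since is None:
        return False  # group still has queued or running jobs
    # Assumption: the group is removed once the idle time has fully elapsed.
    return now - idle_since >= time_to_idle_seconds

print(group_expired(-1, 0, 10_000))  # False: eternal
print(group_expired(10, 100, 105))   # False: idle for only 5 s
print(group_expired(10, 100, 110))   # True: idle for 10 s
```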

Figure (wava group listing): sample output of the command wava -g -l for querying groups.

Job order

Jobs are ordered by the following rules:

  • First by group priority (lower value means higher priority)
  • Then by group id (incremental). In case of same priority, jobs of the oldest group go first.
  • Finally, by job id (incremental). For jobs inside the same group, FIFO ordering.
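These three rules amount to a lexicographic sort key. The sketch below uses a hypothetical `Job` record to show the resulting order; it is an illustration of the rules, not the scheduler's own data model.

```python
from typing import NamedTuple

class Job(NamedTuple):
    job_id: int          # incremental, assigned at submission
    group_id: int        # incremental, assigned at group creation
    group_priority: int  # lower value means higher priority

def order_key(job):
    """Group priority first, then group id (oldest first), then job id (FIFO)."""
    return (job.group_priority, job.group_id, job.job_id)

jobs = [
    Job(job_id=4, group_id=2, group_priority=0),
    Job(job_id=3, group_id=1, group_priority=0),
    Job(job_id=2, group_id=3, group_priority=-10),
    Job(job_id=1, group_id=1, group_priority=0),
]
ordered = sorted(jobs, key=order_key)
print([j.job_id for j in ordered])  # [2, 1, 3, 4]
```

Job 2 goes first because its group has the highest priority (-10), even though that group is the newest.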

Figure (wava job listing): sample output of the command wava -j for querying jobs (white: running, yellow: queued). This scheduler instance has a capacity of 500 MB.

Niceness

The scheduler sets the niceness of the job processes according to their global ordering within the working niceness range. The concrete strategy is determined by the NicenessHandler implementation used (set in configuration).
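To make the idea concrete, here is one plausible strategy: spreading niceness values evenly over the working range according to each job's position in the global ordering. This is only an assumed illustration with a hypothetical function name; the behavior of the actual NicenessHandler implementations may differ.

```python
def spread_niceness(position, total, niceness_range=(1, 19)):
    """Map a job's position in the global ordering (0 = first) to a niceness.

    Assumption: values are spread linearly across the working range, so the
    first job gets the most favorable niceness and the last job the least.
    """
    lowest, highest = niceness_range
    if total <= 1:
        return lowest
    return lowest + round(position * (highest - lowest) / (total - 1))

print([spread_niceness(i, 3) for i in range(3)])  # [1, 10, 19]
```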

Events

Besides stderr and stdout, the scheduler process maintains a dedicated channel (named pipe) for communicating events to client processes. These events are serialized in the form:

${time-millis}:${event-type}[:${event-value}]

For peer processes these events are output to stderr after being formatted as [wava] [date] [${event-type}:${event-value}] unless a file is specified (wava -r -e <file>) for redirecting them.
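A client reading the raw event channel could split each line as follows. This is a minimal sketch based only on the serialized form above; the function name and the sample event values are made up for illustration.

```python
def parse_event(line):
    """Parse '${time-millis}:${event-type}[:${event-value}]'.

    Returns (time_millis, event_type, event_value), where event_value is
    None for events without a value. Splitting at most twice keeps any
    ':' characters inside the value intact.
    """
    parts = line.rstrip("\n").split(":", 2)
    time_millis = int(parts[0])
    event_type = parts[1]
    event_value = parts[2] if len(parts) == 3 else None
    return time_millis, event_type, event_value

print(parse_event("1494930723326:queued:3"))
print(parse_event("1494930723326:ping"))
```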

| Event type (Events.java) | Valued | Description |
| --- | --- | --- |
| id | yes | Id assigned to the job. |
| queued | yes | Position in the queue, if the job is queued. |
| priority | yes | Priority of the job, given by its group. |
| running | yes | Root pId of the job process when started. |
| niceness | yes | Niceness set to the job process. |
| cancelled | yes | User cancelling the job. |
| ping | no | Sent periodically to detect stale peers. |
| exceed_tree | yes | Memory claim exceeds capacity. |
| shutdown | yes | Scheduler is being stopped. |
| maxrss | yes | Max RSS allocated to the process tree. |
| maxswap | yes | Max swap allocated to the process tree. |
| error | yes | To send information about an error. |
| retcode | yes | Return code for the client process to use. |
| starvation_relaunch | yes | Indicates that the job has been re-enqueued due to a starvation scenario (applies to idempotent jobs). |
| starvation_stop | yes | Indicates that the job has been stopped due to a starvation scenario (applies to non-idempotent jobs). |

Job hierarchy

Running jobs can submit more jobs, thus a job hierarchy is established. This potentially can lead to a deadlock scenario, when all parent (running) jobs are waiting for a queued child job to finish.

Blocked state

All parent jobs with no running children are considered blocked, i.e. waiting for their queued children to finish.

Deadlock prevention

To avoid deadlock and prevent starvation (too many jobs blocked by waiting children), the scheduler follows a series of rules that may force a running blocking job to be re-enqueued (if submitted as 'idempotent') or even stopped.

On each scheduler main-loop iteration:

  1. The scheduler chooses a candidate job to be preempted, based on its idempotency (idempotent first) and priority (low priority first).
  2. In case that the ratio of the sum of the sizes of the blocked jobs to the scheduler capacity exceeds a configurable value, the scenario is considered as starving, and the scheduler preempts the candidate job to make room for a potentially blocking job to run.
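The two steps can be sketched as follows. The record type and function names are hypothetical, and the pool of jobs passed to `pick_candidate` is an assumption of this sketch (the text does not spell out exactly which running jobs are eligible); the ratio threshold corresponds to the maxBlockedRssStarvationRatio configuration value.

```python
from typing import NamedTuple

class RunningJob(NamedTuple):
    job_id: int
    size: int         # minimum memory claim, in bytes
    priority: int     # group priority; lower value means higher priority
    idempotent: bool
    blocked: bool     # parent job with no running children

def pick_candidate(jobs):
    """Step 1: idempotent jobs first, then lowest priority (highest value)."""
    if not jobs:
        return None
    return min(jobs, key=lambda j: (not j.idempotent, -j.priority))

def is_starving(jobs, capacity, max_ratio):
    """Step 2: blocked job sizes relative to capacity exceed the threshold."""
    blocked_size = sum(j.size for j in jobs if j.blocked)
    return blocked_size / capacity > max_ratio

jobs = [
    RunningJob(1, 200, 0, idempotent=False, blocked=True),
    RunningJob(2, 150, 10, idempotent=True, blocked=True),
    RunningJob(3, 100, -10, idempotent=True, blocked=False),
]
if is_starving(jobs, capacity=500, max_ratio=0.5):  # 350 / 500 = 0.7
    candidate = pick_candidate(jobs)
    # An idempotent candidate would be re-enqueued; otherwise it is stopped.
    print(candidate.job_id, "relaunch" if candidate.idempotent else "stop")
```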

Requirements

$JAVA_HOME environment variable set pointing to a JRE 8+

Installation

1. Create the WAVA_HOME environment variable pointing to the desired installation folder:

export WAVA_HOME=/opt/wava

Make this variable persistent by adding the previous line to the file ~root/.bashrc.

Multiple scheduler instances are supported on a single machine: just change the current WAVA_HOME value to point to one or another.

2. Execute the installation script:

sudo -E bash -c "$(curl -L https://raw.githubusercontent.com/brutusin/wava/master/wava-core/src/main/scripts/wava-update)"

3. Service registration (systemd):

  1. Create service file /etc/systemd/system/wava.service with the following contents:
[Unit]
#@author Ignacio del Valle Alles
Description=WAVA scheduler

[Service]
Type=forking
ExecStart=/opt/wava/bin/wava -s &
ExecStop=/opt/wava/bin/wava -x
# set delegate yes so that systemd does not reset the cgroups
Delegate=yes

[Install]
WantedBy=multi-user.target
  2. Reload systemd configuration:
sudo systemctl daemon-reload
  3. Enable for running at startup:
sudo systemctl enable wava.service
  4. Init service:
sudo systemctl start wava.service

systemd known issues:

In version 219 it is necessary to set JoinControllers= in /etc/systemd/system.conf to have separate controllers for the cpuacct and cpu cgroup subsystems; otherwise the cgroup of the jobs cannot be changed at runtime. This configuration change may require rebuilding the initial ramdisk image to take effect.

Configuration

Configuration is set in file: $WAVA_HOME/cfg/wava.json. Environment variables can be used in this file.
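As an illustration of what environment-variable substitution in a JSON file can look like, the sketch below expands `$NAME` references before parsing. How wava performs the expansion internally is not stated here, so this is an assumed approach using Python's standard `string.Template`; the `load_config` name is hypothetical.

```python
import json
import os
from string import Template

def load_config(text, env=None):
    """Expand $NAME / ${NAME} references in text, then parse it as JSON.

    Unknown variables are left untouched (safe_substitute), so a missing
    variable shows up literally in the parsed value.
    """
    env = os.environ if env is None else env
    return json.loads(Template(text).safe_substitute(env))

text = '{"schedulerCfg": {"schedulerCapacity": "$DEFAULT_CAPACITY", "maxSwap": "$DEFAULT_SWAP"}}'
# The value "4 GB" is an example; DEFAULT_SWAP is deliberately left unset.
cfg = load_config(text, {"DEFAULT_CAPACITY": "4 GB"})
print(cfg["schedulerCfg"]["schedulerCapacity"])  # 4 GB
print(cfg["schedulerCfg"]["maxSwap"])            # $DEFAULT_SWAP
```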

Default configuration

{
  "tempFolder" : "/dev/shm",
  "uICfg" : {
    "ansiColors" : true,
    "sIMemoryUnits" : true
  },
  "schedulerCfg" : {
    "nicenessHandlerClassName" : "org.brutusin.wava.core.plug.impl.niceness.HomogeneusSpreadNicenessHandler",
    "cgroupRootPath" : "/sys/fs/cgroup",
    "refreshLoopSleepMillisecs" : 1000,
    "pingMillisecs" : 1000,
    "schedulerCapacity" : "$DEFAULT_CAPACITY",
    "maxSwap" : "$DEFAULT_SWAP",
    "maxJobSize" : "$DEFAULT_CAPACITY",
    "outOfMemoryKillerEnabled" : false,
    "maxBlockedRssStarvationRatio" : 0.5,
    "logFolder" : "/tmp/wava",
    "loggingLevel" : "FINE",
    "maxLogSize" : "100MB",
    "maxStatsLogSize" : "100MB",
    "statsCpuStep" : 15,
    "statsRssStep" : "50MB",
    "statsSwapStep" : "50MB",
    "statsIOStep" : "50MB",
    "logStats" : true
  },
  "processCfg" : {
    "nicenessRange" : [ 1, 19 ],
    "cpuAfinity" : "$DEFAULT_CPU_AFINITY"
  },
  "groupCfg" : {
    "dynamicGroupIdleSeconds" : 10,
    "predefinedGroups" : [ {
      "name" : "high",
      "priority" : -10,
      "timeToIdleSeconds" : -1
    }, {
      "name" : "low",
      "priority" : 10,
      "timeToIdleSeconds" : -1
    } ]
  }
}

Configuration description

| Property | Description |
| --- | --- |
| uICfg.ansiColors | Use ANSI escape code sequences to highlight the UI. |
| uICfg.sIMemoryUnits | Use International System units for output memory values. true: kB based, false: KiB based. |
| schedulerCfg.nicenessHandlerClassName | FQN of the NicenessHandler implementation (see impl package) to use. |
| schedulerCfg.memoryCgroupBasePath | Root path to the parent memory cgroup. |
| schedulerCfg.refreshLoopSleepMillisecs | Sleeping time for the main looping thread. |
| schedulerCfg.pingMillisecs | Time interval between ping events to peer processes. |
| schedulerCfg.schedulerCapacity | Scheduler capacity: maximum amount of physical memory permitted for all jobs. Defaults to 3/4 of total memory. Different memory units can be used, for example 4 GB. |
| schedulerCfg.maxSwap | Maximum swap size to be used by all wava jobs. Defaults to the total amount of swap available in the system. |
| schedulerCfg.maxJobSize | Maximum value for a job memory claim. Defaults to the scheduler capacity. |
| schedulerCfg.outOfMemoryKillerEnabled | Enable/disable the Out Of Memory Killer, triggered when a job is forced to page out and there is not enough swap memory available. If disabled, the job is stopped until enough memory is available. |
| schedulerCfg.maxBlockedRssStarvationRatio | Maximum ratio of the sum of the memory claims of the blocked jobs to the scheduler capacity. If exceeded, the starvation prevention mechanism is triggered. |
| schedulerCfg.logFolder | Folder to store logs and global stats (if enabled). |
| schedulerCfg.loggingLevel | Logging level (according to the Java logging levels). |
| schedulerCfg.maxLogSize | Maximum total size allowed for all log files. |
| schedulerCfg.maxStatsLogSize | Maximum total size allowed for all stats files: global stats, and per-job stats if enabled. |
| schedulerCfg.statsCpuStep | Stats CPU percentage precision. |
| schedulerCfg.statsRssStep | Stats RSS memory precision. |
| schedulerCfg.statsSwapStep | Stats swap memory precision. |
| schedulerCfg.statsIOStep | Stats IO bandwidth precision (per second). |
| schedulerCfg.logStats | true to enable global stats logging. |
| processCfg.nicenessRange | Minimum (most favorable) and maximum (least favorable) niceness to be assigned to a job process tree. |
| processCfg.cpuAfinity | CPU affinity to be set on the job processes, in a format supported by the -c parameter of taskset. |
| groupCfg.dynamicGroupIdleSeconds | Idle time for dynamic groups, in seconds. |
| groupCfg.predefinedGroups | Set of groups available from startup. |

Running

Starting scheduler (wava -s)

> wava -s&
Logging to /tmp/wava/logs ...

Submit a job (wava -r)

> wava -r -m 100MB -s /tmp/wava/my-job-stats bash -c "while true; do date; done" &> /dev/null &

List jobs (wava -j)

> wava -j
Jobs: 1 running; 0 bloqued; 0 queued
  Available memory: 23.8 GB / 24.7 GB

 JOB INFO                            100.0 MB   PROCESS TREE STATS                      696.3 kB       0 B        0 B       92.7  COMMAND
  JOB ID   PARENT GROUP    USER        JOB_RSS NICE    MAX_RSS   MAX_SWAP    MAX_IO         RSS       SWAP        IO        CPU%
       1          default  root      100.0 MB     1  696.3 kB       0 B        0 B/s    696.3 kB       0 B        0 B/s     92.7 [bash, -c, while true; do date; done]

Cancel job (wava -c)

> wava -c 1
Running job sucessfully cancelled

Stop scheduler (wava -x)

> wava -x
Stopping scheduler process ...

Other commands

Execute the wava command for detailed help

> wava
___________________________________________
__  __ _____ __ __ _____
/   /  /  _  /  |  /  _  \
|  /\  |  _  \  |  |  _  |
\__/\__\__|__/\___/\__|__/

[W]hen [AVA]ilable scheduler 2.3.1-SNAPSHOT
___________________________________________
usage: wava [option]
    -a,--about      information about the program
    -c,--cancel     cancel a running or enqueued job
    -g,--group      group management commands
    -h,--help       print this message
    -j,--jobs       view jobs
    -r,--run        enqueue a job to be executed when enough physical memory is available
    -s,--start      start core scheduler process
    -t,--status     return core process status
    -u,--update     update to lastest version
    -v,--version    show wava version
    -x,--exit       stop core process, terminating all jobs

Support bugs and requests

https://github.com/brutusin/wava/issues

Authors

Contributions are always welcome and greatly appreciated!

License

Apache License, Version 2.0