Schedulers | |
---------- | |
Study notes for a number of Scheduling-related papers, which served as | |
a preparation for building an SMP scheduling base for my “Cute” kernel. | |
The __primary sources__ (papers from the 60s, 80s, 90s, and 2000s) for
the following topics are discussed:
- Multi-level Feedback Queues: that algorithm is the cornerstone | |
of old & new schedulers like SVR2/3/4, the BSDs, Linux O(1), NT, | |
Staircase, and Staircase Deadline algorithms. | |
- SVR4 and Solaris event-driven scheduling: including the | |
disadvantages I faced while implementing it. | |
- Spinlocks: why they were badly needed and invented by VMS engineers,
  and how an important condition of critical-section theory was missed
  in their original design.
- Per-CPU Data Areas: how VAX/VMS originally handled it, how the
  VMS-based NT kernel did it afterwards, and how the AMD x86-64
  architecture specifically kept the %fs and %gs segments because
  of that original NT design.
- Finding the current thread state using stack masking: why that
  design was engineered by VMS on the VAX architecture, and how it
  was transferred to DEC OSF/1 and, unnecessarily, later to Linux.
- DEC OSF/1 and VAX/VMS symmetric multiprocessing: these kernels | |
were the origin of now-classical ideas like spinlocks, per-CPU | |
runqueues, interactions between spinlocks and kernel preemption, | |
and thread scheduling soft and hard affinity! | |
2010-2011, Ahmed S. Darwish <darwish.07@gmail.com> | |
NOTE: To get things going at first, I start with a very basic overview of | |
general-purpose thread scheduling using course notes by R. Arpaci-Dusseau. | |
If you're familiar with the basics, move directly to the legendary F.J. | |
Corbato's “An Experimental Time-Sharing System” paper. | |
NOTE-2: You can find a cached copy of *all* the discussed papers at | |
http://gitorious.org/cute-os/references/trees/master/papers/sched | |
NOTE-3: Turn UTF-8 on if you're reading this inside a web browser. | |
* TOC: | |
---- | |
- Arpaci00, Operating Systems Notes, 'Scheduling: Introduction' | |
- Corbato62, 'An Experimental Time-Sharing System', MIT | |
- Corbato63, 'The Compatible Time-Sharing System - A Programmer's Guide' | |
- Dingledine09, 'Performance Improvements on Tor' | |
- Arpaci00, OS Notes, 'Scheduling: The Multi-Level Feedback Queue' | |
- Arpaci98, 'Multilevel Feedback Queue Scheduling in Solaris' | |
- Oracle02, Man Pages Section 4: File Formats, ts_dptbl(4) | |
- Black90, 'Concurrency and Parallelism in the Mach Operating System' | |
- Denham94, 'DEC OSF/1 Version 3.0 Symmetric Multiprocessing Implementation' | |
- Gamache88, 'VMS Symmetric Multiprocessing' | |
- Silberschatz08, Bibliography from the 'Operating System Concepts' book
* Remzi Arpaci-Dusseau, Operating System Notes, “Scheduling: | |
Introduction”, University of Wisconsin-Madison, 2000: | |
---------------------------------------------------------- | |
The included portion of Remzi's notes (ch. 6, 7, and 8) builds a
lightweight (undergrad-level) introduction to general-purpose OS
scheduling. These documents also include a number of good references.
Here are the important points in these notes, plus some analysis of mine:
- Metrics for measuring schedulers include 'turnaround time', which is
  the time taken between submitting a job for execution and its
  completion/return/finish. Namely:
T_{turnaround} = T_{finish} - T_{arrival} | |
Another metric, which cares more about latency, is the 'response time':
the time a job waits in the ready queue until it first gets some CPU
time:
T_{response} = T_{firstrun} - T_{arrival} | |
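  (A tiny worked example with made-up numbers: a job arrives at t = 0 ms,
  first runs at t = 10 ms, and finishes at t = 60 ms. Then
      T_{turnaround} = 60 - 0 = 60 ms,   T_{response} = 10 - 0 = 10 ms.)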
- The cost of context switching does not arise from the OS actions of | |
saving and restoring a few registers. For example, we do context | |
switching in our kernel using only 37 x86 opcodes: they are mainly | |
register moves and stack pops and pushes. | |
The real cost is that "when programs run, they build up a great deal | |
of state in CPU caches, TLBs, branch predictors, and other on-chip | |
hardware. Switching to another job causes this state to be flushed | |
and new state relevant to the currently-running job to be brought in, | |
which may exact a noticeable performance cost." | |
- Round-robin, the simplest practical scheduler possible, involves the | |
act of __cycling__ on the runqueue's list of processes, running each | |
job for a fixed time slice (unit). Assuming processes A, B, C, and D, | |
the scheduler runs the jobs as follows
| A | B | C | D | A | B | C | D | A | B | C | D | A | B | C | ... | |
each for an equal amount of time. Advantages include implementation | |
simplicity, good average response times (see disadvantages though), | |
and being starvation-free by design. | |
Disadvantages include very bad behaviour on system saturation, because
the system will swap programs in and out for each (supposedly small)
time slice [*], and response time rising linearly with the number of
active jobs in the system, even for highly-interactive processes.
[*] this is a point handled by the Corbato paper below | |
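  A minimal C sketch of that cycling behaviour (the structure and names
  are mine, not from the notes):

      /* Round-robin: a circular list of runnable jobs, each run for one
         fixed time slice per turn. Assumes the circle is non-empty. */
      struct job { struct job *next; /* ... saved context ... */ };

      static struct job *current_job;        /* any node on the circle */

      /* Called from the timer tick once the current slice is consumed. */
      static struct job *rr_pick_next(void)
      {
          current_job = current_job->next;   /* just advance the circle */
          return current_job;
      }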
- In practice, most processes will issue a number of blocking I/O
  requests during their allocated time quanta, which moves them out of
  the runqueue altogether; they return to that queue after I/O
  completion.
  HOW these jobs are treated after returning can be as simple as
  considering them new processes (classical round-robin), or can involve
  timing calculations that modify their priority (assuming the scheduler
  supports 'priorities' or a similar concept).
- Finally, there's a nice paragraph on the separation of mechanism and | |
policy in OS Design. For example, stuff like programming the system | |
ticks handler and context-switching code are scheduling mechanisms | |
(mechanically how the jobs get interleaved). Scheduling policies | |
(or disciplines) are what you're currently reading about.
In real-time POSIX, for example, there are several programmable
scheduling 'policies' that can be set up through sched_setscheduler(2).
Such policies include SCHED_FIFO, SCHED_RR, and SCHED_OTHER. | |
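  A minimal user-space example of selecting one of those policies (just a
  sketch; it needs the appropriate privileges to succeed, e.g. root or
  CAP_SYS_NICE on Linux):

      #include <sched.h>
      #include <stdio.h>

      int main(void)
      {
          /* Put the calling process under SCHED_FIFO at priority 10. */
          struct sched_param param = { .sched_priority = 10 };

          if (sched_setscheduler(0, SCHED_FIFO, &param) != 0) {
              perror("sched_setscheduler");
              return 1;
          }
          return 0;
      }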
* F.J. Corbato et al., “An Experimental Time-Sharing System”, AIEE-IRE | |
Proceedings of the May 1-3, 1962, Spring Joint Computer Conference, MIT: | |
------------------------------------------------------------------------ | |
This is one of the earliest papers on time-sharing schedulers. The core | |
of the paper (besides its overall historical importance) is the section
'A Multi-Level Scheduling Algorithm', which is in modern terminology now | |
called the 'Multi-level Feedback Queue' (MLFQ). Important points: | |
- The main problem the algorithm tries to solve is system saturation: | |
a) total size of active user programs exceeds high-speed memory
b) too many active user programs to maintain an adequate response at | |
each user console. | |
c) high performance cost of context switching | |
According to the authors, a good solution is to provide a procedure | |
which gives graceful degradation of the response time and effective | |
real-time computation speed of the *large and long-running users*. | |
- The paper argues that a simple round-robin scheduler will not work
  under saturation because it causes an 'abrupt collapse of service'
  due to the sudden burden of time required to swap programs in and out
  of secondary memory such as a disk.
The multi-level scheduling algorithm was thus presented to improve the | |
saturation performance of a time-sharing system. | |
- We now delve into the scheduler algorithm details: from the very first,
  the scheduler partitions its runqueue into several priority queues
  l_0, l_1, ..., l_L. ('L' is to be calculated later)
- In general, the bigger the program memory footprint (size), the lower
  the priority of the queue it gets assigned to. But the lower the queue
  priority, the longer the runtime (in quanta) for each of its tasks.
  Mathematically: l = [ log_2( [w_p/w_q] + 1 ) ]
  where w_p is the program size in words, w_q is the number of words
  that can be swapped in or out of core memory in one quantum of time,
  each task at queue 'l' runs for 2^l quanta, and if l_a < l_b, then
  l_a has a higher priority than l_b. We can deduce several things from
  the above equation:
a) if 0 < w_p < w_q, [w_p/w_q] = 0, runtime = 1 quantum |-- 1 | |
b) if w_q < w_p < 2*w_q, [w_p/w_q] = 1, runtime = 2 quanta -| | |
c) if 2*w_q < w_p < 3*w_q, [w_p/w_q] = 2, runtime = 2 quanta -|-- 2 | |
d) if 3*w_q < w_p < 4*w_q, [w_p/w_q] = 3, runtime = 4 quanta ..| | |
e) if 4*w_q < w_p < 5*w_q, [w_p/w_q] = 4, runtime = 4 quanta ..| | |
f) if 5*w_q < w_p < 6*w_q: [w_p/w_q] = 5, runtime = 4 quanta ..|-- 4 | |
g) if 6*w_q < w_p < 7*w_q: [w_p/w_q] = 6, runtime = 4 quanta ..| | |
h) if 7*w_q < w_p < 8*w_q: [w_p/w_q] = 7, runtime = 8 quanta +++ | |
... +++|-- 8 | |
p) if 15*w_q < w_p < 16*w_q: [w_p/w_q] = 15, runtime = 16 quanta xxxx | |
... xxxx|-- 16 | |
_) if 31*w_q < w_p < 32*w_q: [w_p/w_q] = 31, runtime = 32 quanta @@@@@ | |
... @@@@@|-- 32 | |
_) if 63*w_q < w_p < 64*w_q: [w_p/w_q] = 63, runtime = 64 quanta | |
Note: at point 'a)', queue 'l' is equal to 0: the highest possible | |
queue in priority. | |
By the above design, the bigger the program, the bigger its allocated
number of quanta, making the time-overhead of swapping negligible and
thus optimizing the turnaround time of CPU-bound processes (see the C
sketch below).
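  A minimal C sketch of that level-assignment rule (the helper names and
  integer types are mine, not the paper's):

      #include <stdint.h>

      /* l = [ log_2( [w_p/w_q] + 1 ) ]: wp is the program size in words,
         wq the number of words swappable in one quantum. */
      static inline int ctss_level(uint64_t wp, uint64_t wq)
      {
          uint64_t ratio = wp / wq + 1;     /* [w_p/w_q] + 1 */
          int level = 0;

          while (ratio >>= 1)               /* floor(log2(ratio)) */
              level++;
          return level;
      }

      /* A task entering level 'l' runs for 2^l quanta before demotion. */
      static inline uint64_t ctss_quanta(int level)
      {
          return 1ULL << level;
      }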
- The above point in scheduler design _tries_ to solve the first part of
  the problem the algorithm tackles: the stated point 'a) total size of
  active user programs exceeds high-speed memory' above.
  From now on, the paper tackles the second part of the problem, 'b) too
  many active user programs to maintain an adequate response'.
- The scheduler __always__ begins with the tasks at the head of the
  highest-priority queue (l_0). After that queue gets exhausted, it moves
  to the directly following queue l_1, and so on <== (I)
  Tasks in a single queue get run in a round-robin fashion.
- If, while executing a task at queue 'l' (during its 2**l quanta), a
  task gets assigned to a higher-priority queue 'l*', the currently
  executing task is put at the __head__ of the 'l'th queue, and the
  process at (I) is restarted from queue 'l*', and so on.
  In other words, a task cannot get executed while another task is
  waiting at a higher-priority queue.
- If a program in queue 'l' has not 'completed' (did not issue any I/O)
  within its 2**l-quantum runtime, it's placed at the __tail__ of the
  lower-priority queue 'l+1' (see the sketch below).
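  A compressed C sketch of those rules (serve the highest non-empty level,
  preempt on a higher-level arrival, demote on quantum expiry). All
  structures and names here are mine; only the policy logic mirrors the
  paper:

      #include <stddef.h>

      #define NR_LEVELS  9                       /* l_0 .. l_8, arbitrary */

      struct task  { struct task *next; int level; };
      struct queue { struct task *head, *tail; };

      static struct queue rq[NR_LEVELS];

      static void enqueue_tail(struct queue *q, struct task *t)
      {
          t->next = NULL;
          if (q->tail) q->tail->next = t; else q->head = t;
          q->tail = t;
      }

      static void enqueue_head(struct queue *q, struct task *t)
      {
          t->next = q->head;
          q->head = t;
          if (!q->tail) q->tail = t;
      }

      static struct task *dequeue_head(struct queue *q)
      {
          struct task *t = q->head;
          if (t) {
              q->head = t->next;
              if (!q->head) q->tail = NULL;
          }
          return t;
      }

      /* Rule (I): always serve the highest-priority non-empty level. */
      static struct task *pick_next(void)
      {
          for (int l = 0; l < NR_LEVELS; l++)
              if (rq[l].head)
                  return dequeue_head(&rq[l]);
          return NULL;                           /* nothing runnable */
      }

      /* Ran its full 2^l quanta without blocking: tail of level l+1. */
      static void quantum_expired(struct task *t)
      {
          if (t->level < NR_LEVELS - 1)
              t->level++;
          enqueue_tail(&rq[t->level], t);
      }

      /* A task arrived at a higher-priority level than the running one:
         put the running task back at the HEAD of its own level and
         restart the dispatch from rule (I). */
      static void preempt_current(struct task *curr)
      {
          enqueue_head(&rq[curr->level], curr);
      }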
- A powerful point in the algorithm above (besides broad bounds on system
  performance) is that the classification procedure for tasks is entirely
  automatic: it depends on performance and program size rather than on
  static declarations or (the possibly malicious) users' hopes.
- As stated above, the main theme of this scheduler is improving the
  __saturation performance__ of the system.
  This improvement exists because, as users tax the system, the
  degradation of service occurs progressively, starting with users with
  large or long-running programs (the lowest-priority queues) first.
- Starvation: there's a probable case of a big number of active high | |
priority tasks starving less-priority CPU-bound tasks out of execution. | |
The paper __models__ a worst case of having N tasks, each running at
queue 'l', each relinquishing the CPU __directly__ before the end of
its 2**l-quantum runtime, and thus each returning to the same-priority
queue, with the condition of returning to the runqueue only 't_u'
seconds later (the time for finishing I/O and/or user response).
For the system to always be busy at queue 'l' above, starving all
lower-priority queues 'l + x' for x > 0, queue 'l' must never be
empty; we must have:
       N * 2^l * q >= t_u
           2^l     >= t_u / (N*q)
            l      >= log_2 (t_u / (N*q))
where 'N' is the number of tasks at queue 'l' and 'q' is the quantum
length.
From the above, we've found an upper bound on the priority queues that
can always be serviced: queues > 'l' above can possibly starve if the
number of tasks reaches N. Thus, we have a guarantee that all queues
       l_a <= [ log_2 (t_u / (N*q)) ]
will be serviced and will __not__ get starved (given the estimated
response time t_u), where '[]' denotes the integral part of the
enclosed expression. The author calls this the 'highest serviced
level' queue.
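  (A quick numeric illustration with made-up values: take t_u = 2 s,
  q = 50 ms, and N = 20 tasks. Then
       l >= log_2 ( 2 / (20 * 0.05) ) = log_2 (2) = 1
  so a population of 20 such tasks at queue 1 can keep the CPU permanently
  busy, and only queues l_a <= 1 are guaranteed service.)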
- The equation above has set a bound on the starvation problem. So if we | |
set the max possible 'highest serviced queue' to the max possible queue | |
on the system, we have a __theoretical guarantee__ of no starvation in | |
the system. | |
(Providing this guarantee is very hard for modern-day general purpose | |
operating systems: we cannot easily put a limit on the number of tasks | |
'N', and we cannot even have a valid estimation of 't_u', the I/O | |
or user response finish time. | |
These values can be calculated easily if the target hardware and work | |
load is previously known; this is the case in the paper [IBM 7090], | |
and this can also be the case for small and predictable devices like a | |
small network router.) | |
- By now, we've analyzed most of the paper sans the response time | |
analysis. | |
While these response-time equations are intellectually stimulating,
they offer very broad bounds that are not entirely practical for a
general-purpose scheduler, similar to the paper's way of handling
starvation in the above point.
* F.J. Corbato et al., “The Compatible Time-Sharing System: A Programmer's | |
Guide”, The MIT Computation Center, MIT Press, 1963: | |
------------------------------------------------------------------------ | |
This manual gives a very nice hands-on overview of the historical
time-sharing experience. It's a worthwhile read while studying the CTSS
scheduler paper above.
A useful demonstration of a programmer's session on CTSS is also
presented in the Appendix. Interesting text editor!! It reminds me of
the original UNIX editor 'ed'.
* Roger Dingledine, “Performance Improvements on Tor”, 2009: | |
---------------------------------------------------------- | |
Now why am I including a paper on Tor while we're dealing with | |
schedulers? Because the Tor network tries to solve a problem very | |
similar to the one Corbato was designing his scheduler model against: | |
``Section 1 described mechanisms to let low-volume streams have a | |
chance at competing with high-volume streams. Without those mechanisms, | |
normal web browsing users will always get squeezed out by people | |
pulling down larger content and tolerating high latency. | |
But the next problem is that some users simply add more load than the | |
network can handle. Just making sure that all the load gets handled | |
__fairly__ isn't enough if there’s too much load in the first place.(+) | |
When we originally designed Tor, we aimed for high throughput. We | |
figured that providing high throughput would mean we get good latency | |
properties for free. However, now that it’s clear we have several user | |
profiles trying to use the Tor network at once, we need to consider | |
changing some of those design choices. Some of these changes would | |
aim for better latency and worse throughput. (++) '' | |
The statement marked with (+) matches, almost word for word, the problem
statement of the Multi-Level Feedback Queue paper: to gracefully handle
'system saturation' while maintaining an acceptable level of system
responsiveness, thus avoiding the 'abrupt collapse of service' caused
by an OS scheduler's round-robin algorithm, or by Tor's full fairness
under high overload.
A main theme of the MLFQ solution, and of Tor's one proposed at (++), is
not to let throughput-driven jobs (compilers in the schedulers' case, a
700-MB ISO download in the Tor case) deprive latency-sensitive jobs of
quickly-needed resources. Examples of such latency-sensitive apps
include an editor or web-browser GUI scrolling in the OS schedulers'
case, and servicing HTML websites in Tor's.
* Remzi Arpaci-Dusseau, Operating Systems Notes, The MLFQ, 2000: | |
-------------------------------------------------------------- | |
Here, the MLFQ algorithm we studied from __primary sources__ above is | |
re-discussed in modern terms. Important points: | |
- The hardest point in designing schedulers is to minimize response time
  for interactive jobs, while also minimizing overall system turnaround
  time.
- As we know by now, MLFQ varies the priority of a job based on its
  observed behaviour: learning from jobs as they run, and using history
  to predict their future behaviour. Historical state is kept in the
  scheduler priority queues.
  This is an important systems-design pattern of learning from history:
  observing current trends and learning from the past to predict the
  future. Such a pattern is common in CPU branch prediction, working-set
  rules, and congestion-control mechanisms in networking protocols.
Remzi advises us though that 'such approaches work when jobs have | |
__phases of behavior__ and are thus predictable; of course, one must | |
be careful with such techniques, as they can easily be wrong and drive | |
a system to make worse decisions than they would have with no knowledge | |
at all.' | |
- There are two important problems in MLFQs: | |
a) Starvation: One of the possible ways to handle it is the well-known
'aging principle': increasing the priority as the process waits. Check
the parenthesized notes below though; it's not that simple.
(Note1: The original paper theoretically 'solved' this issue by giving | |
bounds on starvation. These bounds can only work if we have a true | |
estimate of I/O response times _and_ of the jobs number in the system. | |
This is practically impossible for general-purpose scheduling.) | |
(Note2: There are several ways to __practically__ apply aging methods. | |
Some of these ways are discussed in David Black's paper 'Scheduling | |
support for concurrency and parallelism in the Mach Operating System.') | |
b) Gaming the system: A possible solution is to improve the algorithm's
accounting of CPU time, making it less vulnerable to gaming. Check the
notes for further details.
* Remzi Arpaci-Dusseau, “Multilevel Feedback Queue Scheduling in Solaris”, | |
1998 && Solaris, Man Pages Section 4: File Formats, ts_dptbl(4), 2002: | |
------------------------------------------------------------------------ | |
From now on, we check how various real-world kernels implement MLFQ | |
scheduling. Beginning with Solaris, here are the important points: | |
- For regular jobs, the kernel has 60 priority queues. A job begins in | |
the middle, at priority 29. As in classical MLFQ, CPU-bound jobs get | |
assigned smaller priorities over time (with bigger slices), and I/O | |
tasks move up, getting scheduled whenever they have jobs to do. | |
Here are the Solaris 'dispatcher parameter table' defaults (a column
entry of 'v's means the same value as the row above):
Priority q (ms) tqexp slpret maxwait lwait | |
0 200 0 50 0 50 | |
... vvv v vv v vv | |
9 vvv v vv v vv | |
10 160 v 51 v 51 | |
11 vvv 1 vv v vv | |
12 vvv 2 vv v vv | |
13 vvv 3 vv v vv | |
14 vvv 4 vv v vv | |
15 vvv 5 vv v vv | |
16 vvv 6 vv v vv | |
17 vvv 7 vv v vv | |
18 vvv 8 vv v vv | |
19 vvv 9 vv v vv | |
20 120 10 52 v 52 | |
21 vvv 11 vv v vv | |
22 vvv 12 vv v vv | |
23 vvv 13 vv v vv | |
24 vvv 14 vv v vv | |
25 vvv 15 vv v vv | |
26 vvv 16 vv v vv | |
27 vvv 17 vv v vv | |
28 vvv 18 vv v vv | |
29 vvv 19 vv v vv | |
30 80 20 53 v 53 | |
31 vv 21 vv v vv | |
32 vv 22 vv v vv
33 vv 23 vv v vv | |
34 vv 24 vv v vv | |
35 vv 25 54 v 54 | |
36 vv 26 vv v vv | |
37 vv 27 vv v vv | |
38 vv 28 vv v vv | |
39 vv 29 vv v vv | |
40 40 30 55 v 55 | |
41 vv 31 vv v vv | |
42 vv 32 vv v vv | |
43 vv 33 vv v vv | |
44 vv 34 vv v vv | |
45 vv 35 56 v 56 | |
46 vv 36 57 v 57 | |
47 vv 37 58 v 58 | |
48 vv 38 vv v vv | |
49 vv 39 vv v 59 | |
50 vv 40 vv v vv | |
51 vv 41 vv v vv | |
52 vv 42 vv v vv | |
53 vv 43 vv v vv | |
54 vv 44 vv v vv | |
55 vv 45 vv v vv | |
56 vv 46 vv v vv | |
57 vv 47 vv v vv | |
58 vv 48 vv v vv | |
59 20 49 59 32000 vv | |
tqexp: time quantum expired priority, the new priority the thread is set
to when it uses its entire time quantum. As you can see, any process
using its entire slice gets its priority decreased by __10 levels__.
slpret: sleep return priority value, the new priority the thread is set | |
to after waking up (from a sleep event like network or disk I/O.) | |
The thread priority is increased to 50 or above after an I/O operation! | |
maxwait: a starvation avoidance mechanism to compensate threads waiting | |
for a relatively long time in the dispatcher queue. 'maxwait' is such | |
waiting time threshold. If a thread waited more than maxwait in the | |
queues, its priority is boosted up to the respective 'lwait' priority. | |
Each thread has a dispatcher-wait ('dispwait') counter, which is reset
when the thread is inserted into the dispatch queue (following a time-
quantum expiration, or because of a wakeup from I/O or locking). This
field is incremented once per second for all processes waiting in the
dispatcher queues. If that counter exceeds 'maxwait', the thread's
priority is raised to 'lwait' -- a priority of 50 or above by default.
Note that the default 'maxwait' for Solaris is 0: after just 1 second,
all waiting processes' priorities will be boosted above 50. This
avoids starving CPU-bound threads for too long.
[Sol-*]: | |
(The values above have some very interesting details. Note that after
1 second, processes in the lowest (0-9) range will move to priority 50,
processes in the (10-19) range will move to 51, (20-29) to priority 52,
and so on. By this, high-priority processes' response times will NOT be
affected, because they will still run on queue 59 instantly.
Now, how can this prevent starvation? Can't processes keep going in
and out very quickly, starving queues < 59? Indeed, that was the
mathematically modeled worst case in the original CTSS paper.
Now comes the second important detail: even if the worst case above
occurred (highly unlikely in practice, except if done maliciously),
you'll see that the 'lwait' field of the 50-59 priorities is 59. Thus,
in the second __starvation round__, all will move to 59 and execute
in a round-robin manner. This solves the starvation problem elegantly.
Summary: starvation can never exceed 2 seconds + the waiting time in
priority queue 59, all while maintaining system responsiveness.)
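  A rough C sketch of that dispwait/maxwait boost (the structure and field
  names are mine, not Solaris'; it only mirrors the logic described above):

      /* Called once per second for every thread sitting in the TS
         dispatch queues. */
      struct ts_dptbl_entry {
          int quantum;        /* ts_quantum, in ms */
          int tqexp;          /* priority after quantum expiry */
          int slpret;         /* priority after wakeup from sleep */
          int maxwait;        /* seconds to wait before a boost */
          int lwait;          /* priority assigned after maxwait */
      };

      struct ts_thread {
          int prio;           /* current dispatch priority, 0..59 */
          int dispwait;       /* seconds spent waiting in the queues */
      };

      /* Filled from the ts_dptbl(4) defaults shown above. */
      static struct ts_dptbl_entry ts_dptbl[60];

      static void ts_age_waiting_thread(struct ts_thread *t)
      {
          struct ts_dptbl_entry *e = &ts_dptbl[t->prio];

          if (++t->dispwait > e->maxwait) {
              t->prio     = e->lwait;   /* starvation boost: 0-9 -> 50 */
              t->dispwait = 0;
          }
      }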
- NOTE: I've coded such a scheduler in my kernel, and its disadvantages
  began to appear directly after the first day of benchmarking: too many
  heuristics and voodoo constants!
  Scheduling 100% CPU-hog threads showed the ugly side of that algorithm:
  the combination of a big number of dynamic priority levels with hugely
  different slice lengths per priority kept starving the low-priority
  processes until the newly-queued CPU hogs decayed to their equivalent
  low-priority level.
While the new CPU hogs were still decaying, the older low-priority tasks
faced complete starvation for 1+ seconds. The algorithm thus boosted
these threads to high-priority levels, but this __starved__ the newer
threads now residing in relatively lower-priority queues. This
effectively made all system processes enter an infinite loop of
1+-second starvation followed by boosting -- an unpleasant situation.
Decreasing the number of priority levels and making the slices fairer
helped the situation, but kept introducing __new__ corner cases. After
a number of days, I gave up and went on to code a more theoretically
sound scheduling algorithm.
(The benchmark was simply plotting different kernel-thread properties
over time. Such properties included each thread's total runtime, average
runtime per dispatch, total wait in runqueues, average wait in runqueues,
number of priority boosts [due to starvation], number of preemptions
due to a higher-priority thread becoming runnable, and number of
preemptions due to a thread's time slice ending:
(a) code version 1, 6 threads: | |
http://localf.googlepages.com/pub/mlfq/fairness.gif | |
(b) code version 1, 100 threads: | |
http://localf.googlepages.com/pub/mlfq/fairness100.gif | |
(c) code version 2 (after sanitizing the voo-doos), 6 threads: | |
http://localf.googlepages.com/pub/mlfq/fairness-improved.gif | |
(d) code version 2, 100 threads (things became good!!): | |
http://localf.googlepages.com/pub/mlfq/fairness-improved.gif | |
(e) unfortunately, the _new_ heuristics above had corner cases: | |
http://localf.googlepages.com/pub/mlfq/fairness-improved-pathological.gif | |
It seems I'm not the only one facing this though; I've found some refs | |
in academia on how SVR4-based schedulers [like Solaris] were very hard | |
to tweak and adapt to different workloads.) | |
* David L. Black, “Scheduling Support for Concurrency and Parallelism in | |
the Mach Operating System”, IEEE Computer Journal, Carnegie Mellon, 1990: | |
------------------------------------------------------------------------- | |
This is a very interesting article, but we will concentrate on its MLFQ-
deficiencies discussion for now. (Note: always remember Black's point of
creating an idle thread per CPU, so as not to corrupt the kernel stack on
multiprocessor systems.) Important points:
- Multics used a decreasing-priority MLFQ (exactly as described in its
  original paper) and discovered the major deficiency: on a heavily
  loaded system with many short-lived jobs, the priority of a CPU-bound
  job can decrease until little or no CPU time is assigned to it. The
  author notes that this problem (starvation) also existed in a number
  of UNIX kernels.
- To handle this problem, it's necessary to elevate priorities in some
  manner. VAX/VMS elevates a low application priority when it issues an
  event like I/O completion.
  The author states that this negatively affects interactive applications
  which consume large amounts of processor time, since they may not issue
  priority-raising events to an effective degree.
(Nonetheless, this approach is effective, response-time wise, for | |
handling software with 'phases' of I/O and CPU-bound activity if the | |
CPU-bound activity is long but not dominant. This is a point the | |
original CTSS scheduler did not address: a job's transition from a | |
CPU-bound to an I/O-bound phase.) | |
- A (non-exclusive) second way to handle this is processor usage aging. | |
To compensate lower-priority jobs, we forget (decay) their past history | |
in some way, usually in an exponential decay manner. | |
(Exponential decay: N(t) = N(0) * (decay_factor)**t. Assuming a decay
factor of 0.5 and N(0) = 10, then N(1) = 5 and N(2) = (10/2)/2 = 2.5.)
- Solaris, as we've seen, uses BOTH of the prevention methods above:
  processes' priorities get elevated after completion of an I/O event,
  AND the kernel also forgets (ages) __most__ of the current processes'
  usage history after one second (coalescing each 10 priorities into one,
  while maintaining the relative difference between them).
- 4.3BSD: it solely uses the aging method to handle starvation, with a
  CPU-usage decay factor that's based on system load. The bigger the
  system load, the less the usage decay, keeping jobs from getting
  crammed into high-priority queues under high system load.
  The priority of the __currently running__ thread gets recalculated
  after every 4 ticks of its runtime, decreasing linearly with the job's
  CPU usage. If some other thread then has a higher priority, the
  running thread is preempted.
  Routines responsible for waking up a thread or adding a new thread to
  the runqueue check that thread's priority against the currently running
  thread's. If it's higher, a reschedule operation is requested.
(XXX: Unix Internals discusses the problem of classical SVR2/3 decay, | |
its equations, and the new BSD decay method. Add these notes here as | |
a supplement to the original paper discussion.) | |
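  Until those notes get added, here is my recollection of the 4.3BSD-style
  formulas, sketched in C. This is written from memory of the BSD
  literature (PUSER, MAXPRI, and the structure are assumptions), so
  double-check it against Unix Internals:

      /* Once per second, for every thread:
             estcpu   = (2*load / (2*load + 1)) * estcpu + nice
             priority = PUSER + estcpu/4 + 2*nice
         (lower numeric value = higher priority in BSD). */
      #define PUSER    50
      #define MAXPRI  127

      struct bsd_thread {
          double estcpu;      /* decayed CPU-usage estimate */
          int    nice;        /* user niceness */
          int    priority;
      };

      static void bsd_decay_and_reprioritize(struct bsd_thread *t,
                                             double load)
      {
          double decay = (2.0 * load) / (2.0 * load + 1.0);

          t->estcpu   = decay * t->estcpu + t->nice;
          t->priority = PUSER + (int)(t->estcpu / 4) + 2 * t->nice;
          if (t->priority > MAXPRI)
              t->priority = MAXPRI;
      }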
* Jeffrey M. Denham et al., “DEC OSF/1 Version 3.0 Symmetric Multiprocessing | |
Implementation”, Digital Technical Journal, volume 6 number 3, Summer 1994: | |
--------------------------------------------------------------------------- | |
This paper is gold, and is comparable in quality to the CTSS one. It | |
discusses a huge number of now-classical SMP features: | |
+ Accessing the 'current' thread state by masking the stack pointer, | |
regardless of the CPU we're currently running on. | |
+ Per-CPU runqueues for thread scheduling! | |
+ SMP scheduling soft affinity | |
+ Interactions between spinlocks and kernel preemption | |
+ Striking a balance between fine-grained locking and lockless algorithms | |
A good number of the above ideas were of VAX/VMS origin. Important notes:
- The task of an SMP kernel is to improve system performance as CPUs are
  added, without compromising quality. DEC OSF/1's goal was to provide a
  performant Unix SMP system on the Alpha architecture.
That hardware architecture provided shared memory, symmetric access to | |
system I/O busses, atomic read-and-set instructions, and interprocessor | |
interrupts (IPIs). | |
NOTE: This is very close to the Intel MultiProcessor (MP) architecture. | |
- As we know by now, classical Unix kernels ensured their consistency by:
  a] disabling interrupts in critical regions accessed by IRQ handlers
     (and thus handling the concurrency of asynchronous IRQ handlers)
  b] allowing only one process to be in kernel context at a time
     (and thus handling the virtual concurrency provided by time-sharing)
  These protections aren't enough for multicore systems, since such
  systems provide real concurrent execution of kernel code paths.
  An evolutionary solution to that problem was the use of 'funneling':
  allowing certain kernel subsystems to execute on only one CPU at
  _any_ given time. These subsystems were then iteratively and slowly
  moved to local fine-grained locking.
NOTE: Linux used that 'funneling' method in its early days using the | |
'Big Kernel Lock (BKL)'; FreeBSD did the same using a 'Global Mutex'. | |
Both kernels moved to fine-grained locking iteratively. By virtue of | |
being started in Winter of 2009, my kernel incorporates fine-grained | |
locking from the start. | |
- It's interesting to note that the first version of DEC OSF/1 provided | |
full kernel-mode preemption in its real-time version of the kernel. | |
Solaris was also made fully preemptive with fine-grained locking very | |
early on. Linux and FreeBSD caught on around 10 years later. | |
- The paper then discusses how kernel preemption should be disabled upon
  spinlock entrance. It also discusses the now-very-common idea of
  counting the depth of acquired spinlocks using a reference count, and
  preempting the kernel if and only if that reference count = 0.
NOTE 1: This is equivalent to Robert Love's Linux kernel preemption | |
patch. A google search reveals that this patch got merged in 2001. | |
NOTE 2: FreeBSD on the other hand completely disables local interrupts | |
upon spin lock entrance, automatically disabling preemption in the | |
process (at a slight performance cost). I do the same in my kernel. | |
NOTE 3: (cut-and-paste from my kernel comments) Disabling preemption | |
in spinlock-protected regions is a necessity because: | |
a) If a thread holding a spin lock got preempted, it will let a | |
possibly huge number of other threads consume their entire time | |
slice (which is in the order of milliseconds) just spinning! | |
b) If that preempted thread (which is now in the scheduler runqueue) | |
got killed, we can deadlock the entire system! | |
c) Classical race conditions may arise if the newly scheduled thread | |
accessed the same critical-region the older thread was accessing | |
while being preempted. | |
That's why spin locks are also needed in preemptive Uniprocessor | |
kernels: ignoring their spinning behavior, they act as atomic | |
critical region markers. | |
There's also a warning about accessing per-CPU variables while kernel
preemption is enabled. Modern kernels disable preemption around such
accesses.
NOTE 4: (also from my kernel comments) Disabling preemption while | |
accessing a per-CPU variable is necessary: | |
a) We might get preempted and then rescheduled on a different CPU. | |
b) If the _preempting_ thread used the same per-CPU variable, that's | |
a classical race condition. | |
The paper notes that due to the complexities of kernel locking, a | |
"coherent locking architecture with automated debugging facilities | |
was needed to ship a reliable product on time." | |
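  A minimal C sketch of that reference-count idea (roughly what Linux
  calls the preempt count; all names, and the C11-atomics lock itself,
  are mine):

      #include <stdatomic.h>
      #include <stdbool.h>

      typedef struct { atomic_flag locked; } spinlock_t;

      /* Preemption depth: the scheduler may preempt kernel code only
         when this is zero. (Really a per-CPU variable.) */
      static int preempt_count;

      extern bool reschedule_pending(void);   /* set by the tick handler */
      extern void schedule(void);

      static void preempt_disable(void) { preempt_count++; }

      static void preempt_enable(void)
      {
          if (--preempt_count == 0 && reschedule_pending())
              schedule();                     /* safe: no spinlocks held */
      }

      static void spin_lock(spinlock_t *l)
      {
          preempt_disable();          /* never preempt a lock holder */
          while (atomic_flag_test_and_set_explicit(&l->locked,
                                                   memory_order_acquire))
              ;                       /* spin */
      }

      static void spin_unlock(spinlock_t *l)
      {
          atomic_flag_clear_explicit(&l->locked, memory_order_release);
          preempt_enable();           /* may reschedule here */
      }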
- One problem faced by SMP is finding the descriptor of the current | |
process context. In a uniprocessor machine, that can be simply done | |
by a global memory variable that is changed before context switching. | |
On SMP, there's a "current" process descriptor for each CPU. This
descriptor can be located by finding the processor number we're running
on, and then indexing an array of per-CPU data with that number.
In Unix kernels, the current thread state is dereferenced a huge number
of times. Needing to know the current CPU number before each of these
dereferences (a costly hardware operation) was unacceptable.
NOTE 1: On contemporary x86 CPUs, this is costly since it needs a read
of the APIC registers for each dereference, and APIC register accesses
induce higher latency than accesses to general-purpose registers.
Even assuming that this operation wasn't costly from the hardware side, | |
we'll have to disable, then enable, kernel preemption each time we want | |
to access that thread state; also unacceptable. | |
DEC OSF/1 solved this by storing a pointer to the currently running | |
process descriptor at the very bottom of each thread stack. By simply | |
masking the stack pointer, they can instantly access that thread state, | |
__regardless__ of the CPU that thread is currently running on. | |
NOTE 2: The paper states that putting the 'current' thread state pointer
in the per-CPU area, and then setting it before each context switch,
would frequently thrash other CPUs' caches. This is true in the regular
case, but aligning such per-CPU array elements on a cache-line boundary
(64 or 128 bytes in x86 CPUs) would solve that problem. It's possible
that the Alpha architecture had different semantics, though.
NOTE 3: Linux x86-32 used that stack-masking design for a long time. I
guess it was transferred to its developers by folklore.
NOTE: What about the risks, econet anyone? | |
NOTE 4: On the x86 side of the equation, it's common to _permanently_
assign a segment register the address of the relevant per-CPU area
(aligned on a cache-line boundary) for each CPU. This avoids the
inefficiency of accessing the APIC for every per-CPU data dereference.
NOTE: Not all CPU architectures have the luxury of such a register,
one left _intact_ by the basic system ABI definition. The absence of
such a register in the VAX architecture was the main reason for the
stack-masking idea in VMS. Explore the 'VMS Symmetric Multiprocessing'
paper, DTJ vol.1#7, for further historical context.
NOTE: Reading the historical (preliminary) AMD64 architecture document
(January 2001, Rev C), it seems that AMD kept the %fs and %gs registers
because Windows NT used %gs as a per-CPU area pointer. Interestingly,
NT is, after all, a VMS-based kernel!
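  A minimal C sketch of both approaches, assuming 8 KiB kernel stacks
  aligned on their own size (the constant, names, and GCC-style inline
  assembly are mine):

      #include <stdint.h>

      #define STACK_SIZE  0x2000UL        /* 8 KiB, aligned -- assumption */

      struct thread;

      /* DEC OSF/1 / old-Linux style: a pointer to the thread descriptor
         sits at the base of every (aligned) kernel stack, so masking the
         stack pointer finds it on any CPU. */
      struct stack_base { struct thread *current; /* stack grows above */ };

      static inline struct thread *get_current_by_masking(void)
      {
          uintptr_t sp;

          asm volatile ("mov %%rsp, %0" : "=r" (sp));
          return ((struct stack_base *)(sp & ~(STACK_SIZE - 1)))->current;
      }

      /* x86-64 style: %gs permanently points at this CPU's cache-line-
         aligned per-CPU area, with 'current' stored at offset 0. */
      static inline struct thread *get_current_by_gs(void)
      {
          struct thread *t;

          asm volatile ("mov %%gs:0, %0" : "=r" (t));
          return t;
      }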
- Per-CPU runqueues: The paper begins by noting that "scheduling threads
  out of a global runqueue is highly effective at keeping the N highest-
  priority threads running". It then states the disadvantages of that
  method: lock contention over the global runqueue, and losing CPU cache
  state due to uncontrolled bouncing of threads between CPUs.
NOTE: Having a global runqueue and taking CPU affinity into
consideration are not mutually exclusive. The Brainfuck Scheduler (BFS)
and the Linux 2.4 epoch scheduler are examples of this.
The DEC designers thus added "soft affinity" for timeshare threads:
these threads are softly bound to the processor they last ran on.
Nevertheless, if too much time has passed since that last run, they can
get scheduled elsewhere: the CPU cache state was probably lost anyway,
and it's more beneficial to run the thread directly on another,
less-loaded CPU than to wait for the current one.
- Load balancing: As a summary of the presented algorithm, let | |
TS(i): The estimated number of ticks available to time-sharing threads | |
by CPU #i in a scheduling period | |
TTS : Total number of system ticks available to time-sharing threads | |
N : Total number of runnable time-sharing threads in the system | |
D(i) : Ideal number of runnable time-sharing threads assigned to CPU #i | |
C(i) : Current number of time-sharing threads assigned to CPU #i | |
we have | |
TTS = TS(1) + TS(2) + ... + TS(k), k = number of system CPUs | |
D(i) = (TS(i) / TTS) * N | |
Every second, the kernel calculates D(i) for each CPU. If one CPU has
C(i) < D(i) and another has C(i) > D(i), the former is marked as
"out of balance". When such an unbalanced CPU reschedules, it "steals"
tasks from the overloaded CPU. Tasks get stolen only if they have not
run on their softly-bound CPU for a configurable amount of time, thus
preserving the usefulness of soft affinity as much as possible (see the
sketch below).
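  A C sketch of that once-per-second balance check (all names are mine;
  only the D(i) arithmetic comes from the paper):

      struct cpu {
          unsigned long ts_ticks;           /* TS(i) */
          unsigned long nr_ts_threads;      /* C(i) */
          int out_of_balance;
      };

      static void balance_check(struct cpu cpus[], int ncpus)
      {
          unsigned long tts = 0, n = 0;

          for (int i = 0; i < ncpus; i++) {
              tts += cpus[i].ts_ticks;      /* TTS */
              n   += cpus[i].nr_ts_threads; /* N   */
          }

          for (int i = 0; i < ncpus; i++) {
              /* D(i) = (TS(i) / TTS) * N, in integer arithmetic */
              unsigned long d = tts ? (cpus[i].ts_ticks * n) / tts : 0;

              /* Under-loaded CPUs steal work at their next reschedule,
                 but only threads that have been away from their home
                 CPU long enough. */
              cpus[i].out_of_balance = (cpus[i].nr_ts_threads < d);
          }
      }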
- The paper finishes by stating the software engineering tactics needed | |
to safely develop such an ambitious code base. Remember that this was | |
the 90s. After studying that paper, it's clear that the two most | |
important tactics were: | |
- Code review, and | |
- Iterative development | |
So, well, things haven't changed a lot from the software engineering | |
side! | |
* Rodney N. Gamache && Kathleen D. Morse, “VMS Symmetric Multiprocessing”, | |
Digital Technical Journal, volume 1 number 7, pp. 57-73, August 1988: | |
------------------------------------------------------------------------ | |
NOTE: Below discussion assumes reading our notes on DEC OSF/1 v3.0 first. | |
"Systems software engineers must now design effective ways to utilize | |
systems with six, eight, or even more CPUs", begins this late 80s paper! | |
The SMP additions to VAX/VMS were motivated by the then-new VAX 6200
architecture. From other papers in the _same_ journal issue, that
hardware design had:
+ Symmetric shared memory
+ Cache coherency in hardware (always a blessing!)
+ A first-level cache for instructions, and a second-level write-through
one for both data and instructions | |
+ Broadcast and single-destination inter-processor interrupts | |
This paper INTRODUCED a good number of now-classical SMP features: | |
+ Spinlocks! (also _coined_ that term) | |
+ Accessing the Per-CPU area (and thus the current thread state) by | |
masking the CPU stack pointer register. | |
+ SMP scheduling hard affinity! (also _coined_ that term) | |
With the above context in mind, here are the important points:
- VMS engineers first considered using a global lock[*] to protect the | |
kernel in an SMP machine. They (rightfully) ditched the idea later and | |
moved directly to the ambitious goal of fine-grained locking. | |
NOTE: Compare this to the DEC OSF/1 method, where they first used a | |
global lock, and then fine-grained the system iteratively. | |
[*] Big Kernel Lock in Linux jargon, Global Mutex in the BSDs | |
- Inventing spinlocks: Regular VMS mutexes must be acquired in process
  context, but the kernel needed SMP protection in various other
  contexts. That's why spinlocks were created: they can be acquired by
  any CPU in any context (as long as certain sanity rules are held).
  NOTE 1: A classical case is protecting the video-RAM and printf() buffers.
  The kernel usually wants to access such resources in all possible
  contexts, including boot context, exception and even IRQ context, not
  just inside process contexts.
Usually, a mutex is acquired or "owned" by a certain process; spinlocks,
on the other hand, are acquired by the CPU as a whole. Spinlocks were
implemented in VMS using an interlocked bit-test-and-set operation.
NOTE 2: They're still implemented similarly these days, though it might
be faster to use an interlocked swap. A missing issue in the paper's
design was bounding a thread's lock-spinning time; bounded waiting is
an important (third) condition in the theory of critical sections.
NOTE: That got handled in Linux using N. Piggin's ticket spinlocks; see
the sketch below.
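  A minimal ticket-spinlock sketch in C11 atomics, showing how the
  bounded-waiting (FIFO) property gets enforced. This is the general
  idea only, not Linux's actual implementation:

      #include <stdatomic.h>

      /* Each CPU takes a ticket, then waits until it is being served. */
      typedef struct {
          atomic_uint next;     /* next ticket to hand out */
          atomic_uint owner;    /* ticket currently being served */
      } ticket_lock_t;

      static void ticket_lock(ticket_lock_t *l)
      {
          unsigned me = atomic_fetch_add_explicit(&l->next, 1,
                                                  memory_order_relaxed);
          while (atomic_load_explicit(&l->owner,
                                      memory_order_acquire) != me)
              ;                 /* spin until our ticket comes up */
      }

      static void ticket_unlock(ticket_lock_t *l)
      {
          atomic_fetch_add_explicit(&l->owner, 1, memory_order_release);
      }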
- VMS added two important debugging aids to spin locks: | |
a) each spinlock was assigned a rank (order). All spinlocks must be | |
acquired in that order, eliminating any possible deadlocks. | |
b) each spinlock structure saved the last eight program counters (PC/ | |
x86 %RIP) that acquired or released such lock. | |
NOTE: Can this ranking method dynamically work in my kernel? All my | |
locks are strictly local so far, but that might change once VFS and | |
file-systems are added (e.g. inode locks). | |
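  A tiny sketch of how such a rank (ordering) check could look; this is
  my own guess at a mechanism, not the VMS code, and all names are mine:

      #include <assert.h>

      /* Every lock carries a static rank; a CPU may only take locks of
         strictly higher rank than the highest one it already holds. */
      struct ranked_lock {
          int rank;                       /* assigned at lock creation */
          /* ... the actual spinlock ... */
      };

      static int highest_held_rank = -1;  /* really per-CPU state */

      static void ranked_lock_acquire(struct ranked_lock *l)
      {
          assert(l->rank > highest_held_rank);   /* ordering violation? */
          highest_held_rank = l->rank;
          /* ... acquire the underlying spinlock ...
             (a real implementation would keep a small per-CPU stack of
             held ranks so that release can restore the previous value) */
      }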
- Per-CPU Area: I _love_ how the authors state their thinking process in
  that section. As in any SMP-capable kernel, VMS needed a per-CPU data
  area to hold things like the current thread state pointer and other
  per-CPU bookkeeping.
  The most efficient way to do that is to have a general-purpose register
  _permanently_ set to the virtual address of the relevant per-CPU area.
  Such a register did not exist on the VAX, so instead of creating it in
  the VAX hardware, they searched for a way to _overload_ an existing
  register for that purpose.
  Since each CPU must have a unique stack to safely handle interrupts,
  the current stack pointer can be safely overloaded with that task.
  They put a pointer to the per-CPU area at the bottom of such stacks,
  which can be reached by simply masking the stack pointer.
NOTE: See MUCH more details on this in our DEC OSF/1 discussion above. | |
- CPU Affinity: They maintained a mask inside each thread descriptor,
  where each CPU is represented by a single bit. Setting only one bit in
  that mask is equivalent to hard-binding the thread to a single CPU.
  Such affinity rules are naturally enforced by the scheduler; see the
  sketch below.
  NOTE: Solaris and Linux use roughly the same method for hard affinity.
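  A small C sketch of such a per-thread affinity bitmask (names are mine):

      #include <stdint.h>
      #include <stdbool.h>

      /* Bit i set means the thread is allowed to run on CPU i; exactly
         one set bit means hard binding to that CPU. */
      struct sched_thread {
          uint64_t cpu_affinity;               /* up to 64 CPUs */
      };

      static inline void bind_hard(struct sched_thread *t, int cpu)
      {
          t->cpu_affinity = 1ULL << cpu;       /* single allowed CPU */
      }

      /* Consulted by the scheduler before dispatching on a given CPU. */
      static inline bool can_run_on(const struct sched_thread *t, int cpu)
      {
          return (t->cpu_affinity >> cpu) & 1;
      }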
* Bibliography from “Operating System Concepts”, Silberschatz et al., 2007:
------------------------------------------------------------------------ | |
Feedback queues were originally implemented on the CTSS system described | |
in Corbato et al.[1]. This feedback queue scheduling system was analyzed | |
by Schrage[2]. The preemptive priority scheduling algorithm was suggested
by Kleinrock[3]. | |
[1] Fernando J. Corbato et al., 'An Experimental Time-Sharing System', | |
Proceedings of the AFIPS Fall Joint Computer Conference (1962) | |
[2] L. E. Schrage, 'The Queue M/G/1 with Feedback to Lower Priority
    Queues', Management Science, Volume 13 (1967)
[3] L. Kleinrock, 'Queuing Systems, Volume II: Computer Applications',
    Wiley Interscience (1975)
Anderson et al.[4], Lewis and Berg[5], and Philbin et al[6] discuss | |
thread scheduling. | |
[4] T. E. Anderson et al., 'The Performance Implications of Thread | |
Management Alternatives for Shared-Memory Multiprocessors', IEEE | |
Transactions on Computers, Volume 38, Number 12 (1989) | |
[5] B. Lewis and D. Berg, 'Multithreaded Programming with Pthreads',
    Sun Microsystems Press (1998)
[6] J. Philbin et al., 'Thread Scheduling for Cache Locality',
    Architectural Support for Programming Languages and Operating
    Systems (1996)
Scheduling techniques that take into account information regarding | |
process execution times from previous runs are described in Fisher[7], | |
Hall et al.[8], and Lowney et al.[9] | |
[7] J. A. Fisher, 'Trace Scheduling: A Technique for Global Microcode | |
Dispatching', IEEE Transactions on Computers, Volume 30, Number 7 | |
(1981) | |
[8] L. Hall et al., 'Scheduling to Minimize Average Completion Time: | |
Off-line and On-line Algorithms', SODA: ACM-SIAM Symposium on | |
Discrete Algorithms (1996) | |
[9] P. G. Lowney et al., 'The Multiflow Trace Scheduling Compiler', | |
Journal of Supercomputing, Volume 7, Number 1-2 (1993) | |
Fair-share schedulers are covered by Henry[10], Woodside[11], and
Kay and Lauder[12]
[10] G. Henry, 'The Fair Share Scheduler', AT&T Bell Laboratories | |
Technical Journal (1984) | |
[11] C. Woodside, 'Controllability of Computer Performance Tradeoffs | |
Obtained Using Controlled-Share Queue Schedulers', IEEE | |
Transactions on Software Engineering, Volume SE-12, Number 10 | |
(1986) | |
[12] J. Kay and P. Lauder, 'Fair Share Scheduler', Communications of the | |
ACM, Volume 31, Number 1 (1988) | |
Siddha et al.[13] discuss __MODERN__ challenges on multicore systems.
[13] S. Siddha et al, 'Process Scheduling Challenges in the Era of | |
Multi-Core Processors', Intel Technology Journal, Volume 11 (2007) | |
Multiprocessor scheduling was discussed by Tucker and Gupta[14], Zahorjan
and McCann[15], Feitelson and Rudolph[16], Leutenegger and Vernon[17],
Blumofe and Leiserson[18], Polychronopoulos and Kuck[19], and Lucco[20].
[14] A. Tucker and A. Gupta, 'Process Control and Scheduling Issues for | |
Multiprogrammed Shared-Memory Multiprocessors', Proceedings of the | |
ACM Symposium on Operating Systems Principles (1989) | |
[15] J. Zahorjan and C. McCann, 'Processor Scheduling in Shared-Memory | |
Multiprocessors', Proceedings of the Conference on Measurement and | |
Modeling of Computer Systems (1990) | |
[16] D. Feitelson and L. Rudolph, 'Mapping and Scheduling in a Shared
     Parallel Environment Using Distributed Hierarchical Control',
     Proceedings of the International Conference on Parallel Processing
     (1990)
[17] S. Leutenegger and M. Vernon, 'The Performance of Multiprogrammed
Multiprocessor Scheduling Policies', Proceedings of the Conference | |
on Measurement and Modeling of Computer Systems (1990) | |
[18] R. Blumofe and C. Leiserson, 'Scheduling Multithreaded
     Computations by Work Stealing', Proceedings of the Annual Symposium
     on Foundations of Computer Science (1994)
[19] C. D. Polychronopoulos and D. J. Kuck, 'Guided Self-Scheduling:
     A Practical Scheduling Scheme for Parallel Supercomputers', IEEE
     Transactions on Computers, Volume 36, Number 12 (1987)
[20] S. Lucco, 'A Dynamic Scheduling Method for Irregular Parallel
     Programs', Proceedings of the Conference on Programming Language
     Design and Implementation (1992)
The authors then list the standard books covering well-known general | |
purpose operating systems kernels like SVR2, the BSDs, Linux, NT, and | |
Solaris by Bach, Kirk McKusick, Bovet and Cesati, Solomon and
Russinovich, and Mauro and McDougall respectively.