Schedulers
----------
Study notes for a number of Scheduling-related papers, which served as
a preparation for building an SMP scheduling base for my “Cute” kernel.
The __primary sources__ (papers from the 60s, 80s, 90s, and 2000s) of
the following topics are discussed:
- Multi-level Feedback Queues: that algorithm is the cornerstone
of old & new schedulers like SVR2/3/4, the BSDs, Linux O(1), NT,
Staircase, and Staircase Deadline algorithms.
- SVR4 and Solaris event-driven scheduling: including the
disadvantages I faced while implementing it.
- Spinlocks: why they were badly needed and invented by VMS engineers,
  and how their original design missed an important condition of
  critical-section theory.
- Per-CPU Data Areas: how VAX/VMS originally handled it, how the
  VMS-based NT kernel did it afterwards, and how the AMD x86-64
  architecture specifically kept the %fs and %gs segments because
  of that original NT design.
- Finding the current thread state using stack masking: why that
  design was engineered by VMS on the VAX architecture, and how it
  was transferred to DEC OSF/1 and, unnecessarily, later to Linux.
- DEC OSF/1 and VAX/VMS symmetric multiprocessing: these kernels
were the origin of now-classical ideas like spinlocks, per-CPU
runqueues, interactions between spinlocks and kernel preemption,
and thread scheduling soft and hard affinity!
2010-2011, Ahmed S. Darwish <darwish.07@gmail.com>
NOTE: To get things going at first, I start with a very basic overview of
general-purpose thread scheduling using course notes by R. Arpaci-Dusseau.
If you're familiar with the basics, move directly to the legendary F.J.
Corbato's “An Experimental Time-Sharing System” paper.
NOTE-2: You can find a cached copy of *all* the discussed papers at
http://gitorious.org/cute-os/references/trees/master/papers/sched
NOTE-3: Turn UTF-8 on if you're reading this inside a web browser.
* TOC:
----
- Arpaci00, Operating Systems Notes, 'Scheduling: Introduction'
- Corbato62, 'An Experimental Time-Sharing System', MIT
- Corbato63, 'The Compatible Time-Sharing System - A Programmer's Guide'
- Dingledine09, 'Performance Improvements on Tor'
- Arpaci00, OS Notes, 'Scheduling: The Multi-Level Feedback Queue'
- Arpaci98, 'Multilevel Feedback Queue Scheduling in Solaris'
- Oracle02, Man Pages Section 4: File Formats, ts_dptbl(4)
- Black90, 'Concurrency and Parallelism in the Mach Operating System'
- Denham94, 'DEC OSF/1 Version 3.0 Symmetric Multiprocessing Implementation'
- Gamache88, 'VMS Symmetric Multiprocessing'
- Silberschatz08, Bibliography from the 'Operating System Concepts' book
* Remzi Arpaci-Dusseau, Operating System Notes, “Scheduling:
Introduction”, University of Wisconsin-Madison, 2000:
----------------------------------------------------------
The included portion of Remzi's notes (ch. 6, 7, and 8) builds a light-
weight (undergrad-level) introduction to general-purpose OS scheduling.
These documents also include a number of good references.
Here are the important points in these notes + analysis of mine:
- Some of the metrics in measuring schedulers include 'Turnaround time',
which is the time taken between registering a job for execution and
its completion/return/finish. Mainly:
T_{turnaround} = T_{finish} - T_{arrival}
Another metric, caring more about latency, is the 'response time'
which is the time the job waits in the ready queue till some CPU time
is provided:
T_{response} = T_{firstrun} - T_{arrival}
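A tiny C illustration of both metrics (the field names are hypothetical,
not taken from the notes):

    /* Scheduler metrics, straight from the two equations above. */
    struct job_times {
            unsigned long arrival;    /* job registered for execution */
            unsigned long firstrun;   /* job first received CPU time  */
            unsigned long finish;     /* job completed                */
    };

    static unsigned long turnaround_time(const struct job_times *j)
    {
            return j->finish - j->arrival;
    }

    static unsigned long response_time(const struct job_times *j)
    {
            return j->firstrun - j->arrival;
    }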
- The cost of context switching does not arise from the OS actions of
saving and restoring a few registers. For example, we do context
switching in our kernel using only 37 x86 opcodes: they are mainly
register moves and stack pops and pushes.
The real cost is that "when programs run, they build up a great deal
of state in CPU caches, TLBs, branch predictors, and other on-chip
hardware. Switching to another job causes this state to be flushed
and new state relevant to the currently-running job to be brought in,
which may exact a noticeable performance cost."
- Round-robin, the simplest practical scheduler possible, involves the
act of __cycling__ on the runqueue's list of processes, running each
job for a fixed time slice (unit). Assuming processes A, B, C, and D,
the scheduler runs the jobs as follows
| A | B | C | D | A | B | C | D | A | B | C | D | A | B | C | ...
each for an equal amount of time. Advantages include implementation
simplicity, good average response times (see disadvantages though),
and being starvation-free by design.
Disadvantages include very bad behaviour on system saturation, because
the system will swap programs in and out for each (supposedly small)
time slice [*], and response time rising linearly with the number of
active jobs in the system, even for highly-interactive processes.
[*] this is a point handled by the Corbato paper below
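A minimal C sketch of that cycling idea (types and names are
illustrative; blocking, I/O, and job exit are ignored):

    /* Round-robin core: a circular runqueue, one fixed slice per job. */
    struct task {
            int          id;
            struct task *next;       /* circular singly-linked runqueue */
    };

    static struct task *rr_current;  /* job owning the CPU right now */

    /* Called from the timer tick once the current slice is consumed. */
    static struct task *rr_pick_next(void)
    {
            rr_current = rr_current->next;  /* just cycle to the next job */
            return rr_current;
    }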
- In practice, most processes will issue a number of blocking I/O
requests during their allocated time quanta, which leads to moving
them out of the runqueue altogether, and returning to that queue
after I/O completion.
How these jobs are treated after returning can be as simple as
considering them new processes (classical round-robin), or can involve
timing calculations that modify their priority (assuming the scheduler
supports 'priorities', or a similar concept).
- Finally, there's a nice paragraph on the separation of mechanism and
policy in OS design. For example, programming the system tick handler
and the context-switching code are scheduling mechanisms (mechanically,
how the jobs get interleaved). Scheduling policies (or disciplines) are
what you're currently reading about.
In real-time POSIX, for example, there are several programmable
scheduling 'policies' that can be set up through sched_setscheduler(2).
Such policies include SCHED_FIFO, SCHED_RR, and SCHED_OTHER.
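For example, a user-space process can request the SCHED_FIFO policy as
follows (a minimal sketch; the priority value is arbitrary and the call
needs the appropriate privileges):

    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
            struct sched_param param = { .sched_priority = 10 };

            /* pid 0 == the calling process */
            if (sched_setscheduler(0, SCHED_FIFO, &param) < 0) {
                    perror("sched_setscheduler");
                    return 1;
            }
            /* From here on, the kernel schedules us with the FIFO policy. */
            return 0;
    }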
* F.J. Corbato et al., “An Experimental Time-Sharing System”, AIEE-IRE
Proceedings of the May 1-3, 1962, Spring Joint Computer Conference, MIT:
------------------------------------------------------------------------
This is one of the earliest papers on time-sharing schedulers. The core
of the paper (besides its overall historical importance) is the section
'A Multi-Level Scheduling Algorithm', which in modern terminology is now
called the 'Multi-level Feedback Queue' (MLFQ). Important points:
- The main problem the algorithm tries to solve is system saturation:
a) the total size of active user programs exceeds high-speed memory
b) too many active user programs to maintain an adequate response at
each user console.
c) the high performance cost of context switching
According to the authors, a good solution is to provide a procedure
which gives graceful degradation of the response time and effective
real-time computation speed of the *large and long-running users*.
- The paper argues that a simple round-robin scheduler will not work
under saturation because it leads to an 'abrupt collapse of service'
due to the sudden burden of time required to swap programs in and out
of secondary memory such as a disk.
The multi-level scheduling algorithm was thus presented to improve the
saturation performance of a time-sharing system.
- We now delve into the scheduler algorithm details: from the very start,
the scheduler partitions its runqueue into several priority queues
l_0, l_1, ..., l_L ('L' is to be calculated later).
- In general, the bigger the program memory footprint (size), the lower
the priority of the queue it gets assigned to. But the lower the queue
priority, the bigger the runtime (in quanta) of each of its tasks.
Mathematically: l = [ log_2( [w_p/w_q] + 1 ) ]
where w_p is the program size in words, w_q is the number of words
that can be swapped in or out of high-speed memory in one quantum,
each task at queue 'l' runs for 2^l quanta, and if l_a < l_b, then
l_a has a higher priority than l_b. We can deduce several things from
the above equation:
a) if      0 < w_p <   w_q, [w_p/w_q] =  0, runtime =  1 quantum
b) if    w_q < w_p < 2*w_q, [w_p/w_q] =  1, runtime =  2 quanta
c) if  2*w_q < w_p < 3*w_q, [w_p/w_q] =  2, runtime =  2 quanta
d) if  3*w_q < w_p < 4*w_q, [w_p/w_q] =  3, runtime =  4 quanta
e) if  4*w_q < w_p < 5*w_q, [w_p/w_q] =  4, runtime =  4 quanta
f) if  5*w_q < w_p < 6*w_q, [w_p/w_q] =  5, runtime =  4 quanta
g) if  6*w_q < w_p < 7*w_q, [w_p/w_q] =  6, runtime =  4 quanta
h) if  7*w_q < w_p < 8*w_q, [w_p/w_q] =  7, runtime =  8 quanta
...
p) if 15*w_q < w_p < 16*w_q, [w_p/w_q] = 15, runtime = 16 quanta
...
_) if 31*w_q < w_p < 32*w_q, [w_p/w_q] = 31, runtime = 32 quanta
...
_) if 63*w_q < w_p < 64*w_q, [w_p/w_q] = 63, runtime = 64 quanta
Note: at point 'a)', queue 'l' is equal to 0: the highest possible
queue in priority.
By above design, the bigger the program, the bigger its allocated
number of quanta, making the time-overhead of swapping negligible.
Thus, optimizing the turnaround time of CPU-bound processes.
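A small C sketch of the level-assignment formula above (parameter names
follow the paper's w_p/w_q notation; the helpers are my own
illustration):

    /* l = [ log_2( [w_p/w_q] + 1 ) ]; the task then runs 2^l quanta.
     * w_p: program size in words, w_q: words swappable per quantum
     * (w_q must be non-zero). */
    static unsigned int ctss_level(unsigned long w_p, unsigned long w_q)
    {
            unsigned long ratio = w_p / w_q + 1;    /* [w_p/w_q] + 1      */
            unsigned int  l = 0;

            while (ratio >>= 1)                     /* floor(log2(ratio)) */
                    l++;
            return l;
    }

    /* Quanta allotted to a task sitting at level l. */
    static unsigned long ctss_quanta(unsigned int l)
    {
            return 1UL << l;
    }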
- The above point in scheduler design _tries_ to solve the first part of
the problem the algorithm tackles: the stated point 'a) the total size
of active user programs exceeds high-speed memory' above.
From now on, the paper tackles second part of the problem, 'b) too many
active user programs to maintain an adequate response'.
- The scheduler __always__ begins with the tasks at the head of the
highest priority queue (l_0). After that queue gets exhausted, it moves
to the directly following queue l_1, and so on <== (I)
Tasks in a single queue get run in a round-robin fashion.
- If, while executing a task at queue 'l' (during its 2**l quanta), a
task gets assigned to a higher priority queue 'l*', the currently
executing task gets put at the __head__ of the 'l'th queue, and the
process at (I) is restarted from queue 'l*', and so on.
In other words, a task cannot get executed while another task is
waiting at a higher priority queue.
- If a program in queue 'l' has not 'completed' (did not issue any I/O)
within its 2**l-quantum runtime, it's placed at the __tail__ of the
lower-priority queue 'l+1'.
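The three rules above can be condensed into a small C sketch (a fixed
number of levels and an illustrative FIFO queue type; this is my own
restatement, not code from the paper):

    #include <stddef.h>

    #define NR_LEVELS 7

    struct task {
            struct task *next;
            /* ... the rest of the thread descriptor ... */
    };

    struct queue {
            struct task *head, *tail;
    };

    static struct queue level[NR_LEVELS];     /* l_0 .. l_L */

    static void queue_push_tail(struct queue *q, struct task *t)
    {
            t->next = NULL;
            if (q->tail)
                    q->tail->next = t;
            else
                    q->head = t;
            q->tail = t;
    }

    static void queue_push_head(struct queue *q, struct task *t)
    {
            t->next = q->head;
            q->head = t;
            if (!q->tail)
                    q->tail = t;
    }

    static struct task *queue_pop_head(struct queue *q)
    {
            struct task *t = q->head;

            if (t) {
                    q->head = t->next;
                    if (!q->head)
                            q->tail = NULL;
            }
            return t;
    }

    /* (I): always start from the highest-priority non-empty queue. */
    static struct task *mlfq_pick_next(int *out_level)
    {
            for (int l = 0; l < NR_LEVELS; l++) {
                    if (level[l].head) {
                            *out_level = l;
                            return queue_pop_head(&level[l]);
                    }
            }
            return NULL;                      /* nothing runnable */
    }

    /* Used its full 2^l quanta without blocking: demote to tail of l+1. */
    static void mlfq_quantum_expired(struct task *t, int l)
    {
            int next = (l + 1 < NR_LEVELS) ? l + 1 : l;

            queue_push_tail(&level[next], t);
    }

    /* A task arrived at a higher-priority queue: the running task goes
     * back to the *head* of its own queue, and we rescan from (I). */
    static void mlfq_preempt_current(struct task *t, int l)
    {
            queue_push_head(&level[l], t);
    }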
- A powerful point in the algorithm above (besides broad bounds on system
performance) is that the classification procedure for tasks is entirely
automatic: it depends on performance and program size rather than on
static declarations or the (possibly malicious) users' hopes.
- As stated above, the main theme of this scheduler is improving the
__saturation performance__ of the system. This improvement exists
because, as users tax the system, the degradation of service occurs
progressively, starting with the users with large or long-running
programs (the lowest-priority queues) first.
- Starvation: there's a probable case of a big number of active high
priority tasks starving lower-priority CPU-bound tasks out of execution.
The paper __models__ a worst case of having N tasks, each running at
queue 'l', each relinquishing the CPU __directly__ before its 2**l
quantum runtime expires, and thus each returning to the same-priority
queue, with the condition of returning to the runqueue only 't_u'
seconds later (the time for finishing I/O and/or user response).
If the system is to stay busy at queue 'l' above, starving all lower
priority queues 'l + x' for x > 0, then queue 'l' must never be
empty; we must have:
N * 2^l * q >= t_u
2^l >= t_u / (N*q)
l >= log_2( t_u / (N*q) )
where 'N' is the number of tasks at queue 'l' and 'q' is the quantum
length.
From the above, we've found an upper bound on the queue levels that are
guaranteed service: queues deeper than 'l' above can possibly starve
once the number of tasks reaches N. Thus, we have a guarantee that all
queues
l_a <= [ log_2( t_u / (N*q) ) ]
will be serviced and will __not__ get starved (given the estimated
response time t_u), where '[]' denotes the integral part of the
enclosed expression. The author calls this the 'highest serviced level'
queue.
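As a quick numeric illustration (the values are invented, not taken from
the paper): with a quantum q = 0.2 seconds, N = 10 worst-case tasks, and
an estimated response time t_u = 20 seconds, we get
l_a <= [ log_2( 20 / (10 * 0.2) ) ] = [ log_2(10) ] = 3
so queues l_0 through l_3 are guaranteed service, while queues l_4 and
deeper may starve in the modeled worst case.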
- The equation above has set a bound on the starvation problem. So if we
set the max possible 'highest serviced queue' to the max possible queue
on the system, we have a __theoretical guarantee__ of no starvation in
the system.
(Providing this guarantee is very hard for modern-day general purpose
operating systems: we cannot easily put a limit on the number of tasks
'N', and we cannot even have a valid estimation of 't_u', the I/O
or user response finish time.
These values can be calculated easily if the target hardware and work
load is previously known; this is the case in the paper [IBM 7090],
and this can also be the case for small and predictable devices like a
small network router.)
- By now, we've analyzed most of the paper sans the response-time
analysis. While these response-time equations are intellectually
stimulating, they offer very broad bounds that are not entirely
practical for a general purpose scheduler, similar to the paper's way
of handling starvation in the point above.
* F.J. Corbato et al., “The Compatible Time-Sharing System: A Programmer's
Guide”, The MIT Computation Center, MIT Press, 1963:
------------------------------------------------------------------------
This manual gives a very nice overview of the historical time-sharing
hands-on experience. It's a worthwhile read while studying the CTSS
scheduler paper above.
A useful demonstration of a programmer's session on CTSS is also
presented in the Appendix. Interesting text editor!! It reminds me
of the original UNIX editor 'ed'.
* Roger Dingledine, “Performance Improvements on Tor”, 2009:
----------------------------------------------------------
Now why am I including a paper on Tor while we're dealing with
schedulers? Because the Tor network tries to solve a problem very
similar to the one Corbato was designing his scheduler model against:
``Section 1 described mechanisms to let low-volume streams have a
chance at competing with high-volume streams. Without those mechanisms,
normal web browsing users will always get squeezed out by people
pulling down larger content and tolerating high latency.
But the next problem is that some users simply add more load than the
network can handle. Just making sure that all the load gets handled
__fairly__ isn't enough if there’s too much load in the first place.(+)
When we originally designed Tor, we aimed for high throughput. We
figured that providing high throughput would mean we get good latency
properties for free. However, now that it’s clear we have several user
profiles trying to use the Tor network at once, we need to consider
changing some of those design choices. Some of these changes would
aim for better latency and worse throughput. (++) ''
The statement marked with (+) matches, word for word, the problem
statement of the Multi-Level Feedback Queue paper: to gracefully handle
'system saturation' while maintaining an acceptable level of system
responsiveness, thus avoiding the 'abrupt collapse of service' caused
by an OS scheduler's round-robin algorithm, or by Tor's full fairness
under high overload.
A main theme of the MLFQ solution, and of Tor's proposal at (++), is
not to let throughput-driven jobs (compilers in the schedulers' case, a
700-MB ISO download in the Tor case) deprive latency-sensitive jobs
of quickly-needed resources. Examples of such latency-sensitive apps
include editor or web-browser GUI scrolling for OS schedulers, and
servicing HTML websites for Tor.
* Remzi Arpaci-Dusseau, Operating Systems Notes, The MLFQ, 2000:
--------------------------------------------------------------
Here, the MLFQ algorithm we studied from __primary sources__ above is
re-discussed in modern terms. Important points:
- The hardest point in designing schedulers is to minimize response time
for interactive jobs, while also minimizing overall system turnaround
time.
- As we know by now, MLFQ varies the priority of a job based on its
observational behaviour: learning from jobs as they run, and using
history to predict their future behaviour. Historical state is kept in
the scheduler priority queues.
This is an important systems design pattern of learning from history:
observing current trends and learning from the past to predict the
future. Such a pattern is common in CPU branch prediction, working-set
rules, and networking protocols' congestion-control mechanisms.
Remzi advises us though that 'such approaches work when jobs have
__phases of behavior__ and are thus predictable; of course, one must
be careful with such techniques, as they can easily be wrong and drive
a system to make worse decisions than they would have with no knowledge
at all.'
- There are two important problems in MLFQs:
a) Starvation: one of the possible ways to handle it is the well-known
'aging principle': increasing a process's priority as it waits (a tiny
sketch of aging follows this list). Check the parenthesized notes
below, though; it's not that simple.
(Note1: The original paper theoretically 'solved' this issue by giving
bounds on starvation. These bounds can only work if we have a true
estimate of I/O response times _and_ of the jobs number in the system.
This is practically impossible for general-purpose scheduling.)
(Note2: There are several ways to __practically__ apply aging methods.
Some of these ways are discussed in David Black's paper 'Scheduling
support for concurrency and parallelism in the Mach Operating System.')
b) Gaming the system: a possible solution is to improve the algorithm's
accounting of CPU time, making it less vulnerable to gaming. Check the
notes for further details.
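A tiny C sketch of the aging idea mentioned in a) (all constants and
names are invented for illustration):

    /* Aging: periodically boost tasks that waited too long. */
    #define BOOST_WAIT_TICKS 1000     /* the "waited too long" threshold */
    #define BOOST_PRIORITY      5     /* a high-priority level (0 = top) */

    struct task {
            int           prio;         /* current queue level              */
            unsigned long wait_ticks;   /* ticks spent waiting, not running */
            struct task  *next;
    };

    /* Called periodically (e.g. once per second) over all waiting tasks. */
    static void age_waiting_tasks(struct task *list)
    {
            for (struct task *t = list; t; t = t->next)
                    if (t->wait_ticks >= BOOST_WAIT_TICKS &&
                        t->prio > BOOST_PRIORITY)
                            t->prio = BOOST_PRIORITY;   /* move up; requeue */
    }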
* Remzi Arpaci-Dusseau, “Multilevel Feedback Queue Scheduling in Solaris”,
1998 && Solaris, Man Pages Section 4: File Formats, ts_dptbl(4), 2002:
------------------------------------------------------------------------
From now on, we check how various real-world kernels implement MLFQ
scheduling. Beginning with Solaris, here are the important points:
- For regular jobs, the kernel has 60 priority queues. A job begins in
the middle, at priority 29. As in classical MLFQ, CPU-bound jobs get
assigned lower priorities over time (with bigger slices), and I/O-bound
tasks move up, getting scheduled whenever they have work to do.
Here are the Solaris 'dispatcher parameter table' defaults (runs of
'v' repeat the value of the row above):
Priority q (ms) tqexp slpret maxwait lwait
0 200 0 50 0 50
... vvv v vv v vv
9 vvv v vv v vv
10 160 v 51 v 51
11 vvv 1 vv v vv
12 vvv 2 vv v vv
13 vvv 3 vv v vv
14 vvv 4 vv v vv
15 vvv 5 vv v vv
16 vvv 6 vv v vv
17 vvv 7 vv v vv
18 vvv 8 vv v vv
19 vvv 9 vv v vv
20 120 10 52 v 52
21 vvv 11 vv v vv
22 vvv 12 vv v vv
23 vvv 13 vv v vv
24 vvv 14 vv v vv
25 vvv 15 vv v vv
26 vvv 16 vv v vv
27 vvv 17 vv v vv
28 vvv 18 vv v vv
29 vvv 19 vv v vv
30 80 20 53 v 53
31 vv 21 vv v vv
32 vv 22 vv v vv
33 vv 23 vv v vv
34 vv 24 vv v vv
35 vv 25 54 v 54
36 vv 26 vv v vv
37 vv 27 vv v vv
38 vv 28 vv v vv
39 vv 29 vv v vv
40 40 30 55 v 55
41 vv 31 vv v vv
42 vv 32 vv v vv
43 vv 33 vv v vv
44 vv 34 vv v vv
45 vv 35 56 v 56
46 vv 36 57 v 57
47 vv 37 58 v 58
48 vv 38 vv v vv
49 vv 39 vv v 59
50 vv 40 vv v vv
51 vv 41 vv v vv
52 vv 42 vv v vv
53 vv 43 vv v vv
54 vv 44 vv v vv
55 vv 45 vv v vv
56 vv 46 vv v vv
57 vv 47 vv v vv
58 vv 48 vv v vv
59 20 49 59 32000 vv
tqexp: time-quantum-expired priority, the new priority the thread is set
to when it uses its entire time quantum. As you can see, any process
using its entire slice gets its priority decreased by __10 levels__.
slpret: sleep return priority value, the new priority the thread is set
to after waking up (from a sleep event like network or disk I/O.)
The thread priority is increased to 50 or above after an I/O operation!
maxwait: a starvation-avoidance mechanism to compensate threads waiting
for a relatively long time in the dispatcher queue; 'maxwait' is that
waiting-time threshold. If a thread has waited more than maxwait in the
queues, its priority is boosted up to the respective 'lwait' priority.
Each thread has a dispatcher-wait ('dispwait') counter, which is reset
when the thread is inserted into the dispatch queue (following a time-
quantum expiration, or because of a wakeup from I/O or locking). This
field is incremented once per second for all threads waiting in the
dispatcher queues. If the counter exceeds 'maxwait', the thread
priority is raised to 'lwait' -- a priority >= 50 by default.
(A small sketch of this boost logic follows the [Sol-*] note below.)
Note that the default 'maxwait' for Solaris is 0: after just one
second, all waiting threads' priorities will be boosted to 50 or above.
This avoids starving CPU-bound threads for too long.
[Sol-*]:
(The values above have some very interesting details: Note that after
1-second, processes in the lowest (0-9) range will move to priority 50,
processes in the range (10-19) will move to 51, (20-29) to priority 52,
and so on. By this, high-priority processes response times will NOT be
affected because they will still run on queue 59 instantly.
Now, how can this prevent starvation? Can't processes keep going in
and out very quickly, starving queues < 59? Indeed, that was the
mathematically modeled worst case in the original CTSS paper.
Now comes the second important detail: even if the worst case above
occurred (highly unlikely in practice, except if done maliciously),
you'll see that the 'lwait' field of the 50-59 priorities is 59. Thus,
in the second __starvation round__, all will move to 59 and execute
in a round-robin manner. This solves the starvation problem elegantly.
Summary: starvation can never exceed 2 seconds + the waiting time in
priority queue 59, all while maintaining system responsiveness.)
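A C sketch of the maxwait/lwait boost logic described above (structure
and function names are illustrative, not the actual Solaris TS-class
code):

    /* One row of the dispatcher parameter table shown above. */
    struct ts_dptbl_row {
            int quantum_ms;   /* q                                        */
            int tqexp;        /* priority after burning the whole quantum */
            int slpret;       /* priority after waking up from sleep      */
            int maxwait;      /* seconds allowed to wait before a boost   */
            int lwait;        /* priority to boost to after maxwait       */
    };

    static struct ts_dptbl_row ts_dptbl[60];  /* filled from the defaults */

    struct ts_thread {
            int prio;          /* current 0..59 priority                   */
            int dispwait;      /* seconds spent waiting in dispatch queues */
            struct ts_thread *next;
    };

    /* Once per second, for every thread sitting in the dispatch queues. */
    static void ts_update_waiters(struct ts_thread *waiters)
    {
            for (struct ts_thread *t = waiters; t; t = t->next) {
                    t->dispwait++;
                    if (t->dispwait > ts_dptbl[t->prio].maxwait)
                            t->prio = ts_dptbl[t->prio].lwait;  /* boost! */
            }
    }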
- NOTE: I've coded such a scheduler in my kernel, and its disadvantages
began to appear directly after the first day of benchmarking: too many
heuristics and voodoo constants!
Scheduling 100% CPU-hog threads showed the ugly side of that algorithm:
the combination of a big number of dynamic priority levels with huge
and differing slice lengths per priority kept starving the low-priority
processes until the newly-queued CPU hogs reached their equivalent
low-priority level.
While the new CPU hogs were still migrating, the older low-priority
tasks faced complete starvation for 1+ seconds. The algorithm thus
boosted these threads to high-priority levels, but this __starved__ the
newer threads now residing in relatively lower-priority queues. This
effectively made all system processes enter an infinite loop of
1+-second starvation followed by boosting -- an unpleasant situation.
Decreasing the number of priority levels and making the slices more fair
helped the situation, but kept introducing __new__ corner cases. After
a number of days, I gave up and went on to code a more theoretically-
sound scheduling algorithm.
(The benchmark simply plotted different kernel-thread properties over
time. Such properties included each thread's total runtime, average
runtime per dispatch, total wait in runqueues, average wait in
runqueues, number of priority boosts (because of starvation), number of
preemptions because a higher-priority thread became runnable, and
number of preemptions because the thread's time slice ended:
(a) code version 1, 6 threads:
http://localf.googlepages.com/pub/mlfq/fairness.gif
(b) code version 1, 100 threads:
http://localf.googlepages.com/pub/mlfq/fairness100.gif
(c) code version 2 (after sanitizing the voodoo constants), 6 threads:
http://localf.googlepages.com/pub/mlfq/fairness-improved.gif
(d) code version 2, 100 threads (things became good!!):
http://localf.googlepages.com/pub/mlfq/fairness-improved.gif
(e) unfortunately, the _new_ heuristics above had corner cases:
http://localf.googlepages.com/pub/mlfq/fairness-improved-pathological.gif
It seems I'm not the only one facing this though; I've found some refs
in academia on how SVR4-based schedulers [like Solaris] were very hard
to tweak and adapt to different workloads.)
* David L. Black, “Scheduling Support for Concurrency and Parallelism in
the Mach Operating System”, IEEE Computer Journal, Carnegie Mellon, 1990:
-------------------------------------------------------------------------
This is a very interesting article, but we will concentrate on its MLFQ
deficiencies discussion for now. (Note: always remember Black's point
about creating an idle thread per CPU, so as not to corrupt the kernel
stack on multiprocessor systems.) Important points:
- Multics used a decreasing priority MLFQ (as was exactly described in
its original paper) and discovered the major deficiency: on a heavily
loaded system with many short-lived jobs, the priority of a CPU-bound
job can decrease until little or no CPU-time is assigned to it. The
author notes that such problem (starvation) also existed on a number
of UNIX kernels.
- To handle this problem, it's necessary to elevate priorities in some
manner. VAX/VMS elevates a low application priority when it issues an
event like I/O completion.
The author states that this negatively affects interactive applications
which consume large amounts of processor time, since they may not issue
priority-raising events to an effective degree.
(Nonetheless, this approach is effective, response-time wise, for
handling software with 'phases' of I/O and CPU-bound activity if the
CPU-bound activity is long but not dominant. This is a point the
original CTSS scheduler did not address: a job's transition from a
CPU-bound to an I/O-bound phase.)
- A (non-exclusive) second way to handle this is processor-usage aging.
To compensate lower-priority jobs, we forget (decay) their past history
in some way, usually in an exponential-decay manner.
(Exponential decay: N(t) = N(0) * (decay_factor)**t. Assuming a decay
factor of 0.5 and N(0) = 10, then N(1) = 5 and N(2) = (10/2)/2 = 2.5.)
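A one-line C sketch of that decay, using the illustrative 0.5 factor
from the example (integer arithmetic truncates the fractions):

    /* Exponential decay of accumulated CPU usage, applied periodically:
     * N(t+1) = N(t) * 0.5 */
    static unsigned long decay_usage(unsigned long usage)
    {
            return usage / 2;
    }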
- Solaris, as we've seen, uses BOTH of the prevention methods above:
process priorities get elevated after completion of an I/O event,
AND the kernel also forgets (ages) __most__ of the current processes'
usage history after one second (coalescing each 10 priorities into one,
while maintaining the relative differences between them).
- 4.3BSD: it solely uses the aging method to handle starvation, with a
CPU-usage decay factor that's based on system load. The bigger the
system load, the less the usage decays, preventing jobs from getting
crammed into high-priority queues under high system load.
The priority of the __currently running__ thread gets re-calculated
every 4 ticks of its runtime, decreasing linearly with the job's CPU
usage. If another thread ends up with a higher priority after this
decrease, the running thread is preempted.
Routines responsible for waking up a thread or adding a new thread to
the runqueue check that thread's priority against the currently running
thread's; if it's higher, a reschedule operation is requested.
(XXX: Unix Internals discusses the problem of classical SVR2/3 decay,
its equations, and the new BSD decay method. Add these notes here as
a supplement to the original paper discussion.)
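A C sketch of the load-dependent decay and the linear priority
recalculation (the (2*load)/(2*load + 1) factor is the classical BSD
expression as I recall it, and the priority constants are invented;
treat both as assumptions, not quotes from the paper):

    /* Applied once per second to every process's accumulated usage.
     * 'load_scaled' is the load average in fixed point (SCALE == 1.0). */
    #define SCALE 100

    static unsigned long decay_estcpu(unsigned long estcpu,
                                      unsigned long load_scaled)
    {
            /* estcpu *= (2*load) / (2*load + 1): higher load, less decay */
            return (estcpu * 2 * load_scaled) / (2 * load_scaled + SCALE);
    }

    /* Recomputed every 4 ticks for the running thread; priority drops
     * (the number grows) linearly with accumulated usage. */
    static int recompute_prio(unsigned long estcpu, int nice)
    {
            return 50 + (int)(estcpu / 4) + 2 * nice;
    }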
* Jeffrey M. Denham et al., “DEC OSF/1 Version 3.0 Symmetric Multiprocessing
Implementation”, Digital Technical Journal, volume 6 number 3, Summer 1994:
---------------------------------------------------------------------------
This paper is gold, and is comparable in quality to the CTSS one. It
discusses a huge number of now-classical SMP features:
+ Accessing the 'current' thread state by masking the stack pointer,
regardless of the CPU we're currently running on.
+ Per-CPU runqueues for thread scheduling!
+ SMP scheduling soft affinity
+ Interactions between spinlocks and kernel preemption
+ Striking a balance between fine-grained locking and lockless algorithms
A good number of the above ideas were of VAX/VMS origin. Important notes:
- The task of an SMP kernel is to improve system performance as CPUs are
added, without compromising quality. DEC OSF/1's goal was to provide a
performant Unix SMP system on the Alpha architecture.
That hardware architecture provided shared memory, symmetric access to
system I/O busses, atomic read-and-set instructions, and interprocessor
interrupts (IPIs).
NOTE: This is very close to the Intel MultiProcessor (MP) architecture.
- As we know by now, classical Unix kernels ensured their consistency by:
a] disabling interrupts in critical regions accessed by IRQ handlers
(and thus handling the concurrency of asynchronous IRQ handlers)
b] allowing only one process to be in kernel context at a time
(and thus handling the virtual concurrency provided by time-sharing)
These protections aren't enough on multiprocessor systems, since those
provide real concurrent execution of kernel code paths.
An evolutionary solution to that problem was the use of 'funneling':
allowing certain kernel subsystems to execute on only one CPU at
_any_ given time. These subsystems were then iteratively and slowly
moved to local fine-grained locking.
NOTE: Linux used that 'funneling' method in its early days using the
'Big Kernel Lock (BKL)'; FreeBSD did the same using a 'Global Mutex'.
Both kernels moved to fine-grained locking iteratively. By virtue of
being started in Winter of 2009, my kernel incorporates fine-grained
locking from the start.
- It's interesting to note that the first version of DEC OSF/1 provided
full kernel-mode preemption in its real-time version of the kernel.
Solaris was also made fully preemptive with fine-grained locking very
early on. Linux and FreeBSD caught on around 10 years later.
- The paper then discusses how kernel preemption should be disabled on
spinlock entrance. It also discusses the now-very-common idea of
counting the depth of acquired spin locks using a reference count, and
preempting the kernel if and only if that reference count is 0 (a small
sketch of this interplay follows at the end of this bullet).
NOTE 1: This is equivalent to Robert Love's Linux kernel preemption
patch. A google search reveals that this patch got merged in 2001.
NOTE 2: FreeBSD on the other hand completely disables local interrupts
upon spin lock entrance, automatically disabling preemption in the
process (at a slight performance cost). I do the same in my kernel.
NOTE 3: (cut-and-paste from my kernel comments) Disabling preemption
in spinlock-protected regions is a necessity because:
a) If a thread holding a spin lock got preempted, it will let a
possibly huge number of other threads consume their entire time
slice (which is in the order of milliseconds) just spinning!
b) If that preempted thread (which is now in the scheduler runqueue)
got killed, we can deadlock the entire system!
c) Classical race conditions may arise if the newly scheduled thread
accessed the same critical-region the older thread was accessing
while being preempted.
That's why spin locks are also needed in preemptive Uniprocessor
kernels: ignoring their spinning behavior, they act as atomic
critical region markers.
There's also a warning about accessing per-CPU variables while kernel
preemption is enabled; modern kernels disable preemption around such
accesses.
NOTE 4: (also from my kernel comments) Disabling preemption while
accessing a per-CPU variable is necessary:
a) We might get preempted and then rescheduled on a different CPU.
b) If the _preempting_ thread used the same per-CPU variable, that's
a classical race condition.
The paper notes that due to the complexities of kernel locking, a
"coherent locking architecture with automated debugging facilities
was needed to ship a reliable product on time."
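The sketch promised above: preemption count vs. spinlocks and per-CPU
data, in C (the naming follows the common preempt_disable()/enable()
convention, but the bodies are illustrative, not any particular
kernel's code):

    #define NR_CPUS 8

    static volatile int preempt_count;    /* per-CPU in a real kernel */

    static void preempt_disable(void) { preempt_count++; }
    static void preempt_enable(void)  { preempt_count--; }

    struct spinlock { volatile int locked; };

    static void spin_lock(struct spinlock *l)
    {
            preempt_disable();             /* no preemption while held */
            while (__sync_lock_test_and_set(&l->locked, 1))
                    ;                      /* spin */
    }

    static void spin_unlock(struct spinlock *l)
    {
            __sync_lock_release(&l->locked);
            preempt_enable();
    }

    /* Scheduler tick: preempt only if no spinlocks are currently held. */
    static int may_preempt(void)
    {
            return preempt_count == 0;
    }

    /* Accessing a per-CPU variable safely (NOTE 4 above). */
    static int this_cpu(void)
    {
            return 0;      /* placeholder: would return the real CPU id */
    }

    static long per_cpu_counter[NR_CPUS];

    static void bump_local_counter(void)
    {
            preempt_disable();             /* don't migrate mid-access */
            per_cpu_counter[this_cpu()]++;
            preempt_enable();
    }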
- One problem faced by SMP is finding the descriptor of the current
process context. On a uniprocessor machine, that can simply be done
with a global memory variable that is changed on each context switch.
On SMP, there's a "current" process descriptor for each CPU. This
descriptor can be located by finding the number of the processor we're
currently running on, and then indexing an array of per-CPU data with
that number.
In Unix kernels, the current thread state is dereferenced a huge number
of times. Needing to know the current CPU number before each of these
dereferences (a costly hardware operation) was unacceptable.
NOTE 1: On contemporary x86 CPUs, this is costly since it needs a read
of the APIC registers for each dereference, and APIC register accesses
induce a higher latency than the usual general-purpose registers.
Even assuming that this operation wasn't costly from the hardware side,
we'll have to disable, then enable, kernel preemption each time we want
to access that thread state; also unacceptable.
DEC OSF/1 solved this by storing a pointer to the currently running
process descriptor at the very bottom of each thread stack. By simply
masking the stack pointer, the kernel can instantly access that thread
state, __regardless__ of the CPU the thread is currently running on.
(A minimal sketch of this trick follows at the end of this bullet.)
NOTE 2: The paper states that putting the 'current' thread-state pointer
in the per-CPU area, and then setting it before each context switch,
would frequently thrash other CPUs' caches. This is true in the regular
case, but aligning such per-CPU array elements on a cache-line boundary
(64 or 128 bytes on x86 CPUs) would solve that problem. It's possible
that the Alpha architecture had different semantics, though.
NOTE 3: Linux x86-32 used that stack-masking design for a long time. I
guess it was transferred to its developers by folklore.
NOTE: What about the risks, econet anyone?
NOTE 4: On the x86 side of the equation, it's common to _permanently_
assign a segment register the address of the relevant per-CPU area
(aligned on a cache-line boundary) for each CPU. This avoids the
inefficiency of accessing the APIC for every per-CPU data dereference.
NOTE: Not all CPU architectures have the luxury of such a register, one
left _intact_ by the basic system ABI definition. The absence of such a
register in the VAX architecture was the main reason for the
stack-masking idea in VMS. Explore the 'VMS Symmetric Multiprocessing'
paper, DTJ vol.1#7, for further historical context.
NOTE: Reading the historical (preliminary) AMD64 architecture document
(January 2001, Rev C), it seems that AMD kept the %fs and %gs registers
because Windows NT used %gs as a per-CPU area pointer. Interestingly,
NT is, after all, a VMS-based kernel!
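The sketch promised above: 'current' via stack-pointer masking, in C
with x86-64 inline assembly (STACK_SIZE and the bottom-of-stack layout
are invented for illustration):

    #include <stdint.h>

    #define STACK_SIZE (8 * 1024)   /* power of two; stacks are aligned */

    struct thread;                  /* the thread descriptor, opaque here */

    struct stack_bottom {
            struct thread *current; /* set once at thread-stack creation */
    };

    static inline struct thread *get_current(void)
    {
            uintptr_t sp;

            __asm__("mov %%rsp, %0" : "=r" (sp));
            /* Round down to the stack base, where the pointer lives. */
            return ((struct stack_bottom *)
                    (sp & ~(uintptr_t)(STACK_SIZE - 1)))->current;
    }

    /* The x86 alternative from NOTE 4: keep %gs permanently pointing at
     * the per-CPU area and load 'current' relative to it, e.g. with a
     * "mov %%gs:OFFSET, %0" access -- no masking, no APIC read. */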
- Per-CPU runqueues: the paper begins by noting that "scheduling threads
out of a global runqueue is highly effective at keeping the N highest-
priority threads running". It then states the disadvantages of that
method: lock contention over the global runqueue, and losing CPU cache
state due to uncontrolled bouncing of threads between CPUs.
NOTE: Having a global runqueue and taking CPU affinity into
consideration are not mutually exclusive. The Brainfuck Scheduler (BFS)
and the Linux 2.4-era scheduler are examples of this.
The DEC designers thus added "soft affinity" to timeshare threads: these
threads are softly bound to the processor they last ran on. Nevertheless,
if too much time has passed since the last run, they can get scheduled
elsewhere: the CPU cache state was probably lost anyway, and it's more
beneficial to run that thread directly on another, less-loaded CPU than
to wait for the current one.
- Load balancing: As a summary of the presented algorithm, let
TS(i): The estimated number of ticks available to time-sharing threads
by CPU #i in a scheduling period
TTS : Total number of system ticks available to time-sharing threads
N : Total number of runnable time-sharing threads in the system
D(i) : Ideal number of runnable time-sharing threads assigned to CPU #i
C(i) : Current number of time-sharing threads assigned to CPU #i
we have
TTS = TS(1) + TS(2) + ... + TS(k), k = number of system CPUs
D(i) = (TS(i) / TTS) * N
Every second, the kernel calculates D(i) for each CPU. If one CPU has
C(i) < D(i) and another has C(i) > D(i), the former is marked as
"out of balance". When such an unbalanced CPU reschedules, it "steals"
tasks from the overloaded CPU. Tasks get stolen only if they have not
run on their softly-bound CPU for a configurable amount of time, thus
preserving the usefulness of soft affinity as much as possible.
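A C sketch of the D(i) computation and the out-of-balance marking
(integer arithmetic and invented field names):

    #define NR_CPUS 8

    struct cpu_stats {
            unsigned long ts_ticks;  /* TS(i): ticks for timeshare threads  */
            unsigned long nr_ts;     /* C(i): timeshare threads on this CPU */
            unsigned long ideal;     /* D(i), recomputed every second       */
            int           out_of_balance;
    };

    static struct cpu_stats cpu[NR_CPUS];

    /* Called once per second; nr_runnable is N from the formulas above. */
    static void compute_balance(unsigned long nr_runnable)
    {
            unsigned long tts = 0;   /* TTS */

            for (int i = 0; i < NR_CPUS; i++)
                    tts += cpu[i].ts_ticks;

            for (int i = 0; i < NR_CPUS; i++) {
                    cpu[i].ideal = tts ?
                            (cpu[i].ts_ticks * nr_runnable) / tts : 0;
                    /* Fewer threads than the fair share: steal tasks at
                     * the next reschedule on this CPU. */
                    cpu[i].out_of_balance = (cpu[i].nr_ts < cpu[i].ideal);
            }
    }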
- The paper finishes by stating the software engineering tactics needed
to safely develop such an ambitious code base. Remember that this was
the 90s. After studying that paper, it's clear that the two most
important tactics were:
- Code review, and
- Iterative development
So, well, things haven't changed a lot from the software engineering
side!
* Rodney N. Gamache && Kathleen D. Morse, “VMS Symmetric Multiprocessing”,
Digital Technical Journal, volume 1 number 7, pp. 57-73, August 1988:
------------------------------------------------------------------------
NOTE: Below discussion assumes reading our notes on DEC OSF/1 v3.0 first.
"Systems software engineers must now design effective ways to utilize
systems with six, eight, or even more CPUs", begins this late 80s paper!
The SMP additions to VAX/VMS were motivated by the, back then new, VAX
6200 architecture. From other papers in the _same_ journal issue, that
hardware design had:
+ Symmetric shared memory
+ Cache coherency in hardware (always a blessing!)
+ First-level cache for instructions, and a second-level write-through
one for both data and instructions
+ Broadcast and single-destination inter-processor interrupts
This paper INTRODUCED a good number of now-classical SMP features:
+ Spinlocks! (also _coined_ that term)
+ Accessing the Per-CPU area (and thus the current thread state) by
masking the CPU stack pointer register.
+ SMP scheduling hard affinity! (also _coined_ that term)
With the above context in mind, here are the important points:
- VMS engineers first considered using a global lock[*] to protect the
kernel in an SMP machine. They (rightfully) ditched the idea later and
moved directly to the ambitious goal of fine-grained locking.
NOTE: Compare this to the DEC OSF/1 method, where they first used a
global lock, and then fine-grained the system iteratively.
[*] Big Kernel Lock in Linux jargon, Global Mutex in the BSDs
- Inventing spinlocks: regular VMS mutexes must be acquired in process
context, but the kernel needed SMP protection in various other
contexts. That's why spinlocks were created: they can be acquired by
any CPU in any context (as long as certain sanity rules are followed).
NOTE 1: A classical case is protecting the video-RAM and printf()
buffers. The kernel usually wants to access such resources in all
possible contexts, including boot context, exception context, and even
IRQ context, not just process context.
Usually, a mutex is acquired or "owned" by a certain process; spinlocks,
on the other hand, are acquired by the CPU as a whole. Spinlocks were
implemented in VMS using an interlocked bit-test-and-set operation.
NOTE 2: Spinlocks are still implemented similarly these days, but it
might be faster to use an interlocked swap. A missing issue in the
paper's design was bounding a thread's lock-spinning time; bounded
waiting is an important (third) condition in the theory of critical
sections.
NOTE: That got handled in Linux using N. Piggin's ticket spinlocks.
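Since the note above mentions ticket spinlocks as the fix for unbounded
spinning, here is a minimal C sketch of the idea (GCC __sync builtins;
memory barriers omitted; this is illustrative, not the Linux code):

    /* Lockers are served strictly in the order they asked, so the
     * spinning time of each CPU is bounded. */
    struct ticket_lock {
            volatile unsigned int next_ticket;   /* ticket dispenser      */
            volatile unsigned int now_serving;   /* ticket being serviced */
    };

    static void ticket_lock(struct ticket_lock *l)
    {
            /* Atomically grab the next ticket number. */
            unsigned int me = __sync_fetch_and_add(&l->next_ticket, 1);

            while (l->now_serving != me)
                    ;                            /* spin until our turn */
    }

    static void ticket_unlock(struct ticket_lock *l)
    {
            l->now_serving++;                    /* pass the turn along */
    }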
- VMS added two important debugging aids to spin locks:
a) each spinlock was assigned a rank (order). All spinlocks must be
acquired in that order, eliminating any possible deadlocks.
b) each spinlock structure saved the last eight program counters (PC/
x86 %RIP) that acquired or released such lock.
NOTE: Can this ranking method dynamically work in my kernel? All my
locks are strictly local so far, but that might change once VFS and
file-systems are added (e.g. inode locks).
- Per-CPU Area: I _love_ how the authors state their thinking process in
that section. As in any SMP-capable kernel, VMS needed a per-CPU data
area to hold things like the current thread state pointer and other
per-CPU bookkeeping.
The most efficient way to do that is to have a general-purpose register
_permanently_ set to the virtual address of the relevant per-CPU area.
Such a register did not exist on the VAX, so instead of creating one in
the VAX hardware, they searched for a way to _overload_ an existing
register with that purpose.
Since each CPU must have a unique stack to safely handle interrupts,
the current thread's stack pointer can safely be overloaded with that
task. They put a pointer to the per-CPU area at the bottom of such
stacks, which can then be reached by simply masking the stack pointer.
NOTE: See MUCH more detail on this in our DEC OSF/1 discussion above.
- CPU Affinity: they maintained a mask inside each thread descriptor,
where each CPU is represented by a single bit. Setting only one bit in
that mask is equivalent to hard-binding the thread to a single CPU.
Such affinity rules are naturally enforced by the scheduler.
NOTE: Solaris and Linux use roughly the same method for hard affinity.
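A C sketch of the bitmask representation (up to 64 CPUs; field names are
invented):

    #include <stdint.h>

    struct thread {
            uint64_t cpu_affinity;      /* bit i set => may run on CPU i */
    };

    /* Hard-bind a thread to a single CPU: set exactly one bit. */
    static void bind_to_cpu(struct thread *t, int cpu)
    {
            t->cpu_affinity = (uint64_t)1 << cpu;
    }

    /* Scheduler-side check: may this thread run on this CPU? */
    static int can_run_on(const struct thread *t, int cpu)
    {
            return (t->cpu_affinity >> cpu) & 1;
    }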
* Bibliography from “Operating System Concepts”, Silberschatz et al., 2007:
------------------------------------------------------------------------
Feedback queues were originally implemented on the CTSS system described
in Corbato et al.[1]. This feedback queue scheduling system was analyzed
by Schrage[2]. The preemptive priority-scheduling algorithm was suggested
by Kleinrock[3].
[1] Fernando J. Corbato et al., 'An Experimental Time-Sharing System',
Proceedings of the AFIPS Fall Joint Computer Conference (1962)
[2] L. E. Schrage, 'The Queue M/G/1 with Feedback to Lower Priority
    Queues', Management Science, Volume 13 (1967)
[3] L. Kleinrock, 'Queueing Systems, Volume II: Computer Applications',
    Wiley Interscience (1975)
Anderson et al.[4], Lewis and Berg[5], and Philbin et al[6] discuss
thread scheduling.
[4] T. E. Anderson et al., 'The Performance Implications of Thread
Management Alternatives for Shared-Memory Multiprocessors', IEEE
Transactions on Computers, Volume 38, Number 12 (1989)
[5] B. Lewis and D. Berg, 'Multithreaded Programming with Pthreads',
    Sun Microsystems Press (1998)
[6] J. Philbin et al., 'Thread Scheduling for Cache Locality',
Architectural Support for Programming Language and Operating Systems,
(1996)
Scheduling techniques that take into account information regarding
process execution times from previous runs are described in Fisher[7],
Hall et al.[8], and Lowney et al.[9]
[7] J. A. Fisher, 'Trace Scheduling: A Technique for Global Microcode
Dispatching', IEEE Transactions on Computers, Volume 30, Number 7
(1981)
[8] L. Hall et al., 'Scheduling to Minimize Average Completion Time:
Off-line and On-line Algorithms', SODA: ACM-SIAM Symposium on
Discrete Algorithms (1996)
[9] P. G. Lowney et al., 'The Multiflow Trace Scheduling Compiler',
Journal of Supercomputing, Volume 7, Number 1-2 (1993)
Fair-share schedulers are covered by Henry[10], Woodside[11], and
Kay and Lauder[12].
[10] G. Henry, 'The Fair Share Scheduler', AT&T Bell Laboratories
Technical Journal (1984)
[11] C. Woodside, 'Controllability of Computer Performance Tradeoffs
Obtained Using Controlled-Share Queue Schedulers', IEEE
Transactions on Software Engineering, Volume SE-12, Number 10
(1986)
[12] J. Kay and P. Lauder, 'Fair Share Scheduler', Communications of the
ACM, Volume 31, Number 1 (1988)
Siddha et al[13] discusses __MODERN__ challenges on multicore systems.
[13] S. Siddha et al, 'Process Scheduling Challenges in the Era of
Multi-Core Processors', Intel Technology Journal, Volume 11 (2007)
Multiprocessor scheduling was discussed by Tucker and Gupta[14], Zahorjan
and McCann[15], Feitelson and Rudolph[16], Leutenegger and Vernon[17],
Blumofe and Leiserson[18], Polychronopoulos and Kuck[19], and Lucco[20].
[14] A. Tucker and A. Gupta, 'Process Control and Scheduling Issues for
Multiprogrammed Shared-Memory Multiprocessors', Proceedings of the
ACM Symposium on Operating Systems Principles (1989)
[15] J. Zahorjan and C. McCann, 'Processor Scheduling in Shared-Memory
Multiprocessors', Proceedings of the Conference on Measurement and
Modeling of Computer Systems (1990)
[16] D. Feitelson and L. Rudolph, 'Mapping and Scheduling in a Shared
     Parallel Environment Using Distributed Hierarchical Control',
     Proceedings of the International Conference on Parallel Processing
     (1990)
[17] S. Leutenegger and M. Vernon, 'The Performance of Multiprogrammed
     Multiprocessor Scheduling Policies', Proceedings of the Conference
     on Measurement and Modeling of Computer Systems (1990)
[18] R. Blumofe and C. Leiserson, 'Scheduling Multithreaded
     Computations by Work Stealing', Proceedings of the Annual Symposium
     on Foundations of Computer Science (1994)
[19] C. D. Polychronopoulos and D. J. Kuck, 'Guided Self-Scheduling:
     A Practical Scheduling Scheme for Parallel Supercomputers', IEEE
     Transactions on Computers, Volume 36, Number 12 (1987)
[20] S. Lucco, 'A Dynamic Scheduling Method for Irregular Parallel
Programs', Proceedings of the conference on Programming Languages
Design and Implementation, (1992)
The authors then list the standard books covering well-known general-
purpose operating system kernels like SVR2, the BSDs, Linux, NT, and
Solaris, by Bach, McKusick, Bovet and Cesati, Solomon and Russinovich,
and Mauro and McDougall, respectively.