Merge branch 'encore-32-bfs' into encore-32

2 parents d62b764 + f0d14f4 commit f7f1930c1a3b89932e716f6e75cfd0987bf12ecd @fat-tire fat-tire committed May 10, 2011
347 Documentation/scheduler/sched-BFS.txt
@@ -0,0 +1,347 @@
+BFS - The Brain Fuck Scheduler by Con Kolivas.
+The goal of the Brain Fuck Scheduler, referred to as BFS from here on, is to
+completely do away with the complex designs of the past for the cpu process
+scheduler and instead implement one that is very simple in basic design.
+The main focus of BFS is to achieve excellent desktop interactivity and
+responsiveness without heuristics and tuning knobs that are difficult to
+understand, impossible to model and predict the effect of, and when tuned to
+one workload cause massive detriment to another.
+Design summary.
+BFS is best described as a single runqueue, O(n) lookup, earliest effective
+virtual deadline first design, loosely based on EEVDF (earliest eligible virtual
+deadline first) and my previous Staircase Deadline scheduler. Each component
+shall be described in order to understand the significance of, and reasoning for
+it. The codebase when the first stable version was released was approximately
+9000 lines less code than the existing mainline linux kernel scheduler (in
+2.6.31). This does not even take into account the removal of documentation and
+the cgroups code that is not used.
+Design reasoning.
+The single runqueue refers to the queued but not running processes for the
+entire system, regardless of the number of CPUs. The reason for going back to
+a single runqueue design is that once multiple runqueues are introduced,
+per-CPU or otherwise, there will be complex interactions as each runqueue will
+be responsible for the scheduling latency and fairness of the tasks only on its
+own runqueue, and to achieve fairness and low latency across multiple CPUs, any
+advantage in throughput of having CPU local tasks causes other disadvantages.
+This is due to requiring a very complex balancing system to at best achieve some
+semblance of fairness across CPUs and can only maintain relatively low latency
+for tasks bound to the same CPUs, not across them. To increase said fairness
+and latency across CPUs, the advantage of local runqueue locking, which makes
+for better scalability, is lost due to having to grab multiple locks.
+A significant feature of BFS is that all accounting is done purely based on CPU
+used and nowhere is sleep time used in any way to determine entitlement or
+interactivity. Interactivity "estimators" that use some kind of sleep/run
+algorithm are doomed to fail to detect all interactive tasks, and to falsely tag
+tasks that aren't interactive as being so. The reason for this is that it is
+close to impossible to determine whether a sleeping task is sleeping
+voluntarily, as in a userspace application waiting for input in the form of a
+mouse click or otherwise, or involuntarily, because it is waiting for another
+thread, process, I/O, kernel activity or whatever. Thus, such an
+estimator will introduce corner cases, and more heuristics will be required to
+cope with those corner cases, introducing more corner cases and failed
+interactivity detection and so on. Interactivity in BFS is built into the design
+by virtue of the fact that tasks that are waking up have not used up their quota
+of CPU time, and have earlier effective deadlines, thereby making it very likely
+they will preempt any CPU bound task of equivalent nice level. See below for
+more information on the virtual deadline mechanism. Even if they do not preempt
+a running task, the rr interval guarantees an upper bound on how long a task
+will wait, so it will be scheduled within a timeframe that will not cause
+visible interface jitter.
+Design details.
+Task insertion.
+BFS inserts tasks into each relevant queue as an O(1) insertion into a doubly
+linked list. On insertion, *every* running queue is checked to see if the newly
+queued task can run on any idle queue, or preempt the lowest running task on the
+system. This is how the cross-CPU scheduling of BFS achieves significantly lower
+latency per extra CPU the system has. In this case the lookup is, in the worst
+case scenario, O(n) where n is the number of CPUs on the system.
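
The insertion-time check described above can be sketched roughly as follows. This is a simplified illustration, not code from the patch: `struct cpu_state` and `find_cpu_for()` are hypothetical names, all tasks are assumed to share one priority class, and "lowest running task" is modelled as the running task with the latest virtual deadline.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-CPU view: is it idle, and what deadline is it running? */
struct cpu_state {
    int idle;                  /* non-zero if the CPU has nothing to run */
    uint64_t running_deadline; /* deadline of the running task, if any */
};

/*
 * Scan every CPU once, which is O(n) in the number of CPUs.
 * Returns the CPU the new task should run on, or -1 if it stays queued.
 */
static int find_cpu_for(const struct cpu_state *cpus, size_t ncpus,
                        uint64_t new_deadline)
{
    int latest = -1;

    for (size_t i = 0; i < ncpus; i++) {
        /* An idle CPU is always taken first. */
        if (cpus[i].idle)
            return (int)i;
        /* Otherwise remember the CPU running the latest deadline. */
        if (latest < 0 ||
            cpus[i].running_deadline > cpus[latest].running_deadline)
            latest = (int)i;
    }

    /* Preempt only if the new task's deadline is strictly earlier. */
    if (latest >= 0 && new_deadline < cpus[latest].running_deadline)
        return latest;
    return -1;
}
```

With three busy CPUs running deadlines 100, 300 and 50, a newly queued task with deadline 200 would preempt the second CPU, while a task with deadline 400 would remain on the global queue.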
+Data protection.
+BFS has one single lock protecting the process local data of every task in the
+global queue. Thus every insertion, removal and modification of task data in the
+global runqueue needs to grab the global lock. However, once a task is taken by
+a CPU, the CPU has its own local data copy of the running process' accounting
+information which only that CPU accesses and modifies (such as during a
+timer tick) thus allowing the accounting data to be updated lockless. Once a
+CPU has taken a task to run, it removes it from the global queue. Thus the
+global queue only ever holds, at most,
+ (number of tasks requesting cpu time) - (number of logical CPUs) + 1
+tasks. This value is relevant for the time taken to look up tasks during
+scheduling. It will increase if many tasks have CPU affinity set in their
+policy to limit which CPUs they're allowed to run on and those tasks outnumber
+the CPUs. The +1 is because when rescheduling a task, the CPU's currently
+running task is put back on the queue. Lookup will be described after the
+virtual deadline mechanism is explained.
+Virtual deadline.
+The key to achieving low latency, scheduling fairness, and "nice level"
+distribution in BFS is entirely in the virtual deadline mechanism. The one
+tunable in BFS is the rr_interval, or "round robin interval". This is the
+maximum time two SCHED_OTHER (or SCHED_NORMAL, the common scheduling policy)
+tasks of the same nice level will be running for, or looking at it the other
+way around, the longest duration two tasks of the same nice level will be
+delayed for. When a task requests cpu time, it is given a quota (time_slice)
+equal to the rr_interval and a virtual deadline. The virtual deadline is
+offset from the current time in jiffies by this equation:
+ jiffies + (prio_ratio * rr_interval)
+The prio_ratio is determined as a ratio compared to the baseline of nice -20
+and increases by 10% per nice level. The deadline is a virtual one only in that
+no guarantee is placed that a task will actually be scheduled by this time, but
+it is used to compare which task should go next. There are three components to
+how a task is next chosen. First is time_slice expiration. If a task runs out
+of its time_slice, it is descheduled, the time_slice is refilled, and the
+deadline reset to that formula above. Second is sleep, where a task no longer
+is requesting CPU for whatever reason. The time_slice and deadline are _not_
+adjusted in this case and are just carried over for when the task is next
+scheduled. Third is preemption, and that is when a newly waking task is deemed
+higher priority than a currently running task on any cpu by virtue of the fact
+that it has an earlier virtual deadline than the currently running task. The
+earlier deadline is the key to which task is next chosen for the first and
+second cases. Once a task is descheduled, it is put back on the queue, and an
+O(n) lookup of all queued-but-not-running tasks is done to determine which has
+the earliest deadline and that task is chosen to receive CPU next.
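
As a worked illustration of the deadline formula, here is a minimal sketch. Several things in it are assumptions made for the example and are not stated in the text above: the fixed-point baseline of 128 for nice -20, the integer rounding of the 10% step, the division by the baseline to bring the result back to jiffies, and the helper names `prio_ratio()` and `virtual_deadline()`.

```c
#include <assert.h>
#include <stdint.h>

#define NICE_MIN   (-20)
#define NICE_MAX   19
#define BASE_RATIO 128 /* assumed fixed-point baseline for nice -20 */

/* Ratio for a nice level: the baseline grown by 10% per level above -20. */
static uint64_t prio_ratio(int nice)
{
    uint64_t ratio = BASE_RATIO;

    assert(nice >= NICE_MIN && nice <= NICE_MAX);
    for (int level = NICE_MIN; level < nice; level++)
        ratio = ratio * 11 / 10;
    return ratio;
}

/*
 * Virtual deadline: jiffies + (prio_ratio * rr_interval), with the
 * fixed-point ratio scaled back down by the assumed baseline.
 */
static uint64_t virtual_deadline(uint64_t now_jiffies, int nice,
                                 uint64_t rr_interval)
{
    return now_jiffies + prio_ratio(nice) * rr_interval / BASE_RATIO;
}
```

Under these assumptions a nice -20 task queued at jiffy 1000 with an rr_interval of 6 gets a deadline of 1006, and every higher nice level gets a progressively later one, which is what lets lower-nice tasks win the earliest-deadline comparison.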
+The CPU proportion of different nice tasks works out to be approximately the
+ (prio_ratio difference)^2
+The reason it is squared is that a task's deadline does not change while it is
+running unless it runs out of time_slice. Thus, even if the time actually
+passes the deadline of another task that is queued, it will not get CPU time
+unless the current running task deschedules, and the time "base" (jiffies) is
+constantly moving.
+Task lookup.
+BFS has 103 priority queues. 100 of these are dedicated to the static priority
+of realtime tasks, and the remaining 3 are, in order of best to worst priority,
+SCHED_ISO (isochronous), SCHED_NORMAL, and SCHED_IDLEPRIO (idle priority
+scheduling). When a task of these priorities is queued, a bitmap of running
+priorities is set showing which of these priorities has tasks waiting for CPU
+time. When a CPU is made to reschedule, the lookup for the next task to get
+CPU time is performed in the following way:
+First the bitmap is checked to see what static priority tasks are queued. If
+any realtime priorities are found, the corresponding queue is checked and the
+first task listed there is taken (provided CPU affinity is suitable) and lookup
+is complete. If the priority corresponds to a SCHED_ISO task, they are also
+taken in FIFO order (as they behave like SCHED_RR). If the priority corresponds
+to either SCHED_NORMAL or SCHED_IDLEPRIO, then the lookup becomes O(n). At this
+stage, every task in the runlist that corresponds to that priority is checked
+to see which has the earliest set deadline, and (provided it has suitable CPU
+affinity) it is taken off the runqueue and given the CPU. If a task has an
+expired deadline, it is taken and the rest of the lookup aborted (as they are
+chosen in FIFO order).
+Thus, the lookup is O(n) in the worst case only, where n is as described
+earlier, as tasks may be chosen before the whole task list is looked over.
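
The lookup order above can be sketched as follows. This is an illustrative model rather than the patch's code: the flat array stands in for the per-priority lists and bitmap, CPU affinity checks are omitted, `struct task` and `pick_next()` are hypothetical names, and placing SCHED_ISO at priority index 100 is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ISO_PRIO    100 /* assumed index; realtime occupies 0..99 */
#define NORMAL_PRIO 101
#define IDLE_PRIO   102
#define PRIO_LIMIT  103

struct task {
    int prio;          /* 0..102, lower is better */
    uint64_t deadline; /* virtual deadline, used for NORMAL and IDLEPRIO */
};

/* Returns the index of the task to run next, or -1 if the queue is empty. */
static int pick_next(const struct task *queue, size_t n, uint64_t now)
{
    int best_prio = PRIO_LIMIT;
    int best = -1;

    /* Stand-in for the bitmap: find the best priority that has tasks. */
    for (size_t i = 0; i < n; i++)
        if (queue[i].prio < best_prio)
            best_prio = queue[i].prio;
    if (best_prio == PRIO_LIMIT)
        return -1;

    for (size_t i = 0; i < n; i++) {
        if (queue[i].prio != best_prio)
            continue;
        /* Realtime and ISO tasks are taken in FIFO order. */
        if (best_prio <= ISO_PRIO)
            return (int)i;
        /* An expired deadline ends the scan early. */
        if (queue[i].deadline <= now)
            return (int)i;
        /* Otherwise track the earliest deadline seen so far. */
        if (best < 0 || queue[i].deadline < queue[best].deadline)
            best = (int)i;
    }
    return best;
}
```

The early return on an expired deadline is what makes the scan O(n) only in the worst case: the full list is walked only when no queued task at that priority has passed its deadline.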
+The major limitations of BFS will be that of scalability, as the separate
+runqueue designs will have less lock contention as the number of CPUs rises.
+However they do not scale linearly even with separate runqueues as multiple
+runqueues will need to be locked concurrently on such designs to be able to
+achieve fair CPU balancing, to try and achieve some sort of nice-level fairness
+across CPUs, and to achieve low enough latency for tasks on a busy CPU when
+other CPUs would be more suited. BFS has the advantage that it requires no
+balancing algorithm whatsoever, as balancing occurs by proxy simply because
+all CPUs draw off the global runqueue, in priority and deadline order. Despite
+the fact that scalability is _not_ the prime concern of BFS, it both shows very
+good scalability to smaller numbers of CPUs and is likely a more scalable design
+at these numbers of CPUs.
+It also has some very low overhead scalability features built into the design
+when it has been deemed their overhead is so marginal that they're worth adding.
+The first is the local copy of the running process' data to the CPU it's running
+on to allow that data to be updated lockless where possible. Then there is
+deference paid to the last CPU a task was running on, by trying that CPU first
+when looking for an idle CPU to use the next time it's scheduled. Finally there
+is the notion of "sticky" tasks that are flagged when they are involuntarily
+descheduled, meaning they still want further CPU time. This sticky flag is
+used to bias heavily against those tasks being scheduled on a different CPU
+unless that CPU would be otherwise idle. When a cpu frequency governor is used
+that scales with CPU load, such as ondemand, sticky tasks are not scheduled
+on a different CPU at all, preferring instead to go idle. This means the CPU
+they were bound to is more likely to increase its speed while the other CPU
+will go idle, thus speeding up total task execution time and likely decreasing
+power usage. This is the only scenario where BFS will allow a CPU to go idle
+in preference to scheduling a task on the earliest available spare CPU.
+The real cost of migrating a task from one CPU to another is entirely dependent
+on the cache footprint of the task, how cache intensive the task is, how long
+it's been running on that CPU to take up the bulk of its cache, how big the CPU
+cache is, how fast and how layered the CPU cache is, how fast a context switch
+is... and so on. In other words, it's close to random in the real world where we
+do more than just one sole workload. The only thing we can be sure of is that
+it's not free. So BFS uses the principle that an idle CPU is a wasted CPU and
+utilising idle CPUs is more important than cache locality, and cache locality
+only plays a part after that.
+When choosing an idle CPU for a waking task, the cache locality is determined
+according to where the task last ran and then idle CPUs are ranked from best
+to worst to choose the most suitable idle CPU based on cache locality, NUMA
+node locality and hyperthread sibling busyness. They are chosen in the
+following preference (if idle):
+* Same core, idle or busy cache, idle threads
+* Other core, same cache, idle or busy cache, idle threads.
+* Same node, other CPU, idle cache, idle threads.
+* Same node, other CPU, busy cache, idle threads.
+* Same core, busy threads.
+* Other core, same cache, busy threads.
+* Same node, other CPU, busy threads.
+* Other node, other CPU, idle cache, idle threads.
+* Other node, other CPU, busy cache, idle threads.
+* Other node, other CPU, busy threads.
+This shows the SMT or "hyperthread" awareness in the design as well which will
+choose a real idle core first before a logical SMT sibling which already has
+tasks on the physical CPU.
+Early benchmarking of BFS suggested scalability dropped off at the 16 CPU mark.
+However this benchmarking was performed on an earlier design that was far less
+scalable than the current one so it's hard to know how scalable it is in terms
+of both CPUs (due to the global runqueue) and heavily loaded machines (due to
+O(n) lookup) at this stage. Note that in terms of scalability, the number of
+_logical_ CPUs matters, not the number of _physical_ CPUs. Thus, a dual (2x)
+quad core (4X) hyperthreaded (2X) machine is effectively a 16X. Newer benchmark
+results are very promising indeed, without needing to tweak any knobs, features
+or options. Benchmark contributions are most welcome.
+As the initial prime target audience for BFS was the average desktop user, it
+was designed to not need tweaking, tuning or have features set to obtain benefit
+from it. Thus the number of knobs and features has been kept to an absolute
+minimum and should not require extra user input for the vast majority of cases.
+There are precisely 2 tunables, and 2 extra scheduling policies. The rr_interval
+and iso_cpu tunables, and the SCHED_ISO and SCHED_IDLEPRIO policies. In addition
+to this, BFS also uses sub-tick accounting. What BFS does _not_ now feature is
+support for CGROUPS. The average user should neither need to know what these
+are, nor should they need to be using them to have good desktop behaviour.
+There is only one "scheduler" tunable, the round robin interval. This can be
+accessed in
+ /proc/sys/kernel/rr_interval
+The value is in milliseconds, and the default value is set to 6ms. Valid values
+are from 1 to 1000. Decreasing the value will decrease latencies at the cost of
+decreasing throughput, while increasing it will improve throughput, but at the
+cost of worsening latencies. The accuracy of the rr interval is limited by HZ
+resolution of the kernel configuration. Thus, the worst case latencies are
+usually slightly higher than this actual value. BFS uses "dithering" to try and
+minimise the effect the Hz limitation has. The default value of 6 is not an
+arbitrary one. It is based on the fact that humans can detect jitter at
+approximately 7ms, so aiming for much lower latencies is pointless under most
+circumstances. It is worth noting this fact when comparing the latency
+performance of BFS to other schedulers. Worst case latencies being higher than
+7ms are far worse than average latencies not being in the microsecond range.
+Experimentation has shown that rr intervals being increased up to 300 can
+improve throughput but beyond that, scheduling noise from elsewhere prevents
+further demonstrable throughput gains.
+Isochronous scheduling.
+Isochronous scheduling is a unique scheduling policy designed to provide
+near-real-time performance to unprivileged (ie non-root) users without the
+ability to starve the machine indefinitely. Isochronous tasks (which means
+"same time") are set using, for example, the schedtool application like so:
+ schedtool -I -e amarok
+This will start the audio application "amarok" as SCHED_ISO. How SCHED_ISO works
+is that it has a priority level between true realtime tasks and SCHED_NORMAL
+which would allow them to preempt all normal tasks, in a SCHED_RR fashion (ie,
+if multiple SCHED_ISO tasks are running, they purely round robin at rr_interval
+rate). However if ISO tasks run for more than a tunable finite amount of time,
+they are then demoted back to SCHED_NORMAL scheduling. This finite amount of
+time is the percentage of _total CPU_ available across the machine, configurable
+as a percentage in the following "resource handling" tunable (as opposed to a
+scheduler tunable):
+ /proc/sys/kernel/iso_cpu
+and is set to 70% by default. It is calculated over a rolling 5 second average.
+Because it is the total CPU available, it means that on a multi CPU machine, it
+is possible to have an ISO task running as realtime scheduling indefinitely on
+just one CPU, as the other CPUs will be available. Setting this to 100 is the
+equivalent of giving all users SCHED_RR access and setting it to 0 removes the
+ability to run any pseudo-realtime tasks.
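
The effect of measuring iso_cpu against total machine CPU can be illustrated with a small sketch. It assumes a plain fixed window rather than the rolling average the scheduler actually keeps, and `iso_within_budget()` is a hypothetical helper, not a function from the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Returns 1 if SCHED_ISO tasks are still within their budget.
 * iso_ns:    CPU time consumed by ISO tasks during the window
 * window_ns: length of the window (5 seconds in the text above)
 * ncpus:     number of logical CPUs
 * iso_cpu:   the tunable, a percentage from 0 to 100
 */
static int iso_within_budget(uint64_t iso_ns, uint64_t window_ns,
                             unsigned int ncpus, unsigned int iso_cpu)
{
    /* Total CPU time the whole machine could deliver in the window. */
    uint64_t capacity_ns = window_ns * ncpus;

    assert(iso_cpu <= 100);
    /* 0 disables pseudo-realtime scheduling entirely. */
    if (iso_cpu == 0)
        return 0;
    return iso_ns * 100 <= capacity_ns * iso_cpu;
}
```

With the default of 70, one ISO task saturating a single-CPU machine uses 100% of capacity and is over budget, whereas the same task on a two-CPU machine uses only 50% of total capacity and stays within budget indefinitely.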
+A feature of BFS is that it detects when an application tries to obtain a
+realtime policy (SCHED_RR or SCHED_FIFO) and the caller does not have the
+appropriate privileges to use those policies. When it detects this, it will
+give the task SCHED_ISO policy instead. Thus it is transparent to the user.
+Because some applications constantly set their policy as well as their nice
+level, there is potential for them to undo the override specified by the user
+on the command line of setting the policy to SCHED_ISO. To counter this, once
+a task has been set to SCHED_ISO policy, it needs superuser privileges to set
+it back to SCHED_NORMAL. This will ensure the task remains ISO and all child
+processes and threads will also inherit the ISO policy.
+Idleprio scheduling.
+Idleprio scheduling is a scheduling policy designed to give out CPU to a task
+_only_ when the CPU would be otherwise idle. The idea behind this is to allow
+ultra low priority tasks to be run in the background that have virtually no
+effect on the foreground tasks. This is ideally suited to distributed computing
+clients (like setiathome, folding, mprime etc) but can also be used to start
+a video encode or so on without any slowdown of other tasks. To prevent this
+policy from grabbing shared resources and holding them indefinitely, if it
+detects a state where the task is waiting on I/O, the machine is about to
+suspend to ram and so on, it will transiently schedule them as SCHED_NORMAL. As
+per the Isochronous task management, once a task has been scheduled as IDLEPRIO,
+it cannot be put back to SCHED_NORMAL without superuser privileges. Tasks can
+be set to start as SCHED_IDLEPRIO with the schedtool command like so:
+ schedtool -D -e ./mprime
+Subtick accounting.
+It is surprisingly difficult to get accurate CPU accounting, and in many cases,
+the accounting is done by simply determining what is happening at the precise
+moment a timer tick fires off. This becomes increasingly inaccurate as the
+timer tick frequency (HZ) is lowered. It is possible to create an application
+which uses almost 100% CPU, yet by being descheduled at the right time, records
+zero CPU usage. While the main problem with this is that there are possible
+security implications, it is also difficult to determine how much CPU a task
+really does use. BFS tries to use the sub-tick accounting from the TSC clock,
+where possible, to determine real CPU usage. This is not entirely reliable, but
+is far more likely to produce accurate CPU usage data than the existing designs
+and will not show tasks as consuming no CPU usage when they actually are. Thus,
+the amount of CPU reported as being used by BFS will more accurately represent
+how much CPU the task itself is using (as is shown for example by the 'time'
+application), so the reported values may be quite different to other schedulers.
+Values reported as the 'load' are more prone to problems with this design, but
+per process values are closer to real usage. When comparing throughput of BFS
+to other designs, it is important to compare the actual completed work in terms
+of total wall clock time taken and total work done, rather than the reported
+"cpu usage".
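
The gap between tick-sampled and sub-tick accounting can be shown with a small simulation. This is only a model of the idea: the run intervals are supplied by hand instead of coming from a TSC-based clock, and `struct run_interval`, `tick_sampled_ns()` and `subtick_ns()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One stretch of time, in nanoseconds, during which the task was on the CPU. */
struct run_interval {
    uint64_t start_ns;
    uint64_t end_ns;
};

/* Tick sampling: charge a whole tick whenever a tick finds the task running. */
static uint64_t tick_sampled_ns(const struct run_interval *runs, size_t n,
                                uint64_t tick_ns, uint64_t total_ns)
{
    uint64_t charged = 0;

    assert(tick_ns > 0);
    for (uint64_t t = 0; t < total_ns; t += tick_ns) {
        for (size_t i = 0; i < n; i++) {
            if (t >= runs[i].start_ns && t < runs[i].end_ns) {
                charged += tick_ns;
                break;
            }
        }
    }
    return charged;
}

/* Sub-tick accounting: add up the time actually spent running. */
static uint64_t subtick_ns(const struct run_interval *runs, size_t n)
{
    uint64_t used = 0;

    for (size_t i = 0; i < n; i++)
        used += runs[i].end_ns - runs[i].start_ns;
    return used;
}
```

For a task that runs from 1ms to 9ms within every 10ms tick period, tick sampling reports zero usage because no tick ever lands while it is running, while summing the intervals reports the real 80%.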
+Con Kolivas <> Tue, 5 Apr 2011
26 Documentation/sysctl/kernel.txt
@@ -29,6 +29,7 @@ show up in /proc/sys/kernel:
- domainname
- hostname
- hotplug
+- iso_cpu
- java-appletviewer [ binfmt_java, obsolete ]
- java-interpreter [ binfmt_java, obsolete ]
- kstack_depth_to_print [ X86 only ]
@@ -51,6 +52,7 @@ show up in /proc/sys/kernel:
- randomize_va_space
- real-root-dev ==> Documentation/initrd.txt
- reboot-cmd [ SPARC only ]
+- rr_interval
- rtsig-max
- rtsig-nr
- sem
@@ -209,6 +211,16 @@ Default value is "/sbin/hotplug".
+iso_cpu: (BFS CPU scheduler only).
+This sets the percentage cpu that the unprivileged SCHED_ISO tasks can
+run effectively at realtime priority, averaged over a rolling five
+seconds over the -whole- system, meaning all cpus.
+Set to 70 (percent) by default.
l2cr: (PPC only)
This flag controls the L2 cache of G3 processor boards. If
@@ -383,6 +395,20 @@ rebooting. ???
+rr_interval: (BFS CPU scheduler only)
+This is the smallest duration that any cpu process scheduling unit
+will run for. Increasing this value can increase throughput of cpu
+bound tasks substantially but at the expense of increased latencies
+overall. Conversely decreasing it will decrease average and maximum
+latencies but at the expense of throughput. This value is in
+milliseconds and the default value chosen depends on the number of
+cpus available at scheduler initialisation with a minimum of 6.
+Valid values are from 1-5000.
rtsig-max & rtsig-nr:
The file rtsig-max can be used to tune the maximum number
5 arch/powerpc/platforms/cell/spufs/sched.c
@@ -63,11 +63,6 @@ static struct timer_list spusched_timer;
static struct timer_list spuloadavg_timer;
- * Priority of a normal, non-rt, non-niced'd process (aka nice level 0).
- */
-#define NORMAL_PRIO 120
* Frequency of the spu scheduler tick. By default we do one SPU scheduler
* tick for every 10 CPU scheduler ticks.
7 drivers/cpufreq/cpufreq.c
@@ -28,6 +28,7 @@
#include <linux/cpu.h>
#include <linux/completion.h>
#include <linux/mutex.h>
+#include <linux/sched.h>
#define dprintk(msg...) cpufreq_debug_printk(CPUFREQ_DEBUG_CORE, \
"cpufreq-core", msg)
@@ -1536,6 +1537,12 @@ int __cpufreq_driver_target(struct cpufreq_policy *policy,
target_freq, relation);
if (cpu_online(policy->cpu) && cpufreq_driver->target)
retval = cpufreq_driver->target(policy, target_freq, relation);
+ if (likely(retval != -EINVAL)) {
+ if (target_freq == policy->max)
+ cpu_nonscaling(policy->cpu);
+ else
+ cpu_scaling(policy->cpu);
+ }
return retval;
2 fs/proc/base.c
@@ -373,7 +373,7 @@ static int proc_pid_stack(struct seq_file *m, struct pid_namespace *ns,
static int proc_pid_schedstat(struct task_struct *task, char *buffer)
return sprintf(buffer, "%llu %llu %lu\n",
- (unsigned long long)task->se.sum_exec_runtime,
+ (unsigned long long)tsk_seruntime(task),
(unsigned long long)task->sched_info.run_delay,
65 include/linux/init_task.h
@@ -119,6 +119,69 @@ extern struct cred init_cred;
* INIT_TASK is used to set up the first task table, touch at
* your own risk!. Base=0, limit=0x1fffff (=2MB)
+#define INIT_TASK(tsk) \
+{ \
+ .state = 0, \
+ .stack = &init_thread_info, \
+ .usage = ATOMIC_INIT(2), \
+ .flags = PF_KTHREAD, \
+ .lock_depth = -1, \
+ .prio = NORMAL_PRIO, \
+ .static_prio = MAX_PRIO-20, \
+ .normal_prio = NORMAL_PRIO, \
+ .deadline = 0, \
+ .policy = SCHED_NORMAL, \
+ .cpus_allowed = CPU_MASK_ALL, \
+ .mm = NULL, \
+ .active_mm = &init_mm, \
+ .run_list = LIST_HEAD_INIT(tsk.run_list), \
+ .time_slice = HZ, \
+ .tasks = LIST_HEAD_INIT(tsk.tasks), \
+ .pushable_tasks = PLIST_NODE_INIT(tsk.pushable_tasks, MAX_PRIO), \
+ .ptraced = LIST_HEAD_INIT(tsk.ptraced), \
+ .ptrace_entry = LIST_HEAD_INIT(tsk.ptrace_entry), \
+ .real_parent = &tsk, \
+ .parent = &tsk, \
+ .children = LIST_HEAD_INIT(tsk.children), \
+ .sibling = LIST_HEAD_INIT(tsk.sibling), \
+ .group_leader = &tsk, \
+ .real_cred = &init_cred, \
+ .cred = &init_cred, \
+ .cred_guard_mutex = \
+ __MUTEX_INITIALIZER(tsk.cred_guard_mutex), \
+ .comm = "swapper", \
+ .thread = INIT_THREAD, \
+ .fs = &init_fs, \
+ .files = &init_files, \
+ .signal = &init_signals, \
+ .sighand = &init_sighand, \
+ .nsproxy = &init_nsproxy, \
+ .pending = { \
+ .list = LIST_HEAD_INIT(tsk.pending.list), \
+ .signal = {{0}}}, \
+ .blocked = {{0}}, \
+ .alloc_lock = __SPIN_LOCK_UNLOCKED(tsk.alloc_lock), \
+ .journal_info = NULL, \
+ .cpu_timers = INIT_CPU_TIMERS(tsk.cpu_timers), \
+ .fs_excl = ATOMIC_INIT(0), \
+ .pi_lock = __SPIN_LOCK_UNLOCKED(tsk.pi_lock), \
+ .timer_slack_ns = 50000, /* 50 usec default slack */ \
+ .pids = { \
+ }, \
+ .dirties = INIT_PROP_LOCAL_SINGLE(dirties), \
+#else /* CONFIG_SCHED_BFS */
#define INIT_TASK(tsk) \
{ \
.state = 0, \
@@ -185,7 +248,7 @@ extern struct cred init_cred;
+#endif /* CONFIG_SCHED_BFS */
#define INIT_CPU_TIMERS(cpu_timers) \
{ \
2 include/linux/ioprio.h
@@ -64,6 +64,8 @@ static inline int task_ioprio_class(struct io_context *ioc)
static inline int task_nice_ioprio(struct task_struct *task)
+ if (iso_task(task))
+ return 0;
return (task_nice(task) + 20) / 5;
2 include/linux/jiffies.h
@@ -164,7 +164,7 @@ static inline u64 get_jiffies_64(void)
* Have the 32 bit jiffies value wrap 5 minutes after boot
* so jiffies wrap bugs show up earlier.
-#define INITIAL_JIFFIES ((unsigned long)(unsigned int) (-300*HZ))
+#define INITIAL_JIFFIES ((unsigned long)(unsigned int) (-10*HZ))
* Change timeval to jiffies, trying to avoid the
121 include/linux/sched.h
@@ -36,8 +36,15 @@
#define SCHED_FIFO 1
#define SCHED_RR 2
#define SCHED_BATCH 3
-/* SCHED_ISO: reserved but not implemented yet */
+/* SCHED_ISO: Implemented on BFS only */
#define SCHED_IDLE 5
+#define SCHED_ISO 4
+#define SCHED_RANGE(policy) ((policy) <= SCHED_MAX)
/* Can be ORed in to make sure the process is reverted back to SCHED_NORMAL on fork */
#define SCHED_RESET_ON_FORK 0x40000000
@@ -261,9 +268,6 @@ extern asmlinkage void schedule_tail(struct task_struct *prev);
extern void init_idle(struct task_struct *idle, int cpu);
extern void init_idle_bootup_task(struct task_struct *idle);
-extern int runqueue_is_locked(int cpu);
-extern void task_rq_unlock_wait(struct task_struct *p);
extern cpumask_var_t nohz_cpu_mask;
#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ)
extern int select_nohz_load_balancer(int cpu);
@@ -628,6 +632,10 @@ struct signal_struct {
cputime_t utime, stime, cutime, cstime;
cputime_t gtime;
cputime_t cgtime;
+ cputime_t prev_utime, prev_stime;
unsigned long nvcsw, nivcsw, cnvcsw, cnivcsw;
unsigned long min_flt, maj_flt, cmin_flt, cmaj_flt;
unsigned long inblock, oublock, cinblock, coublock;
@@ -1221,17 +1229,34 @@ struct task_struct {
int lock_depth; /* BKL lock depth */
int oncpu;
+#else /* CONFIG_SCHED_BFS */
+ int oncpu;
int prio, static_prio, normal_prio;
unsigned int rt_priority;
+ int time_slice;
+ u64 deadline;
+ struct list_head run_list;
+ u64 last_ran;
+ u64 sched_time; /* sched_clock time spent running */
+#ifdef CONFIG_SMP
+ int sticky; /* Soft affined flag */
+ unsigned long rt_timeout;
+// #else /* CONFIG_SCHED_BFS */
const struct sched_class *sched_class;
struct sched_entity se;
struct sched_rt_entity rt;
+// #endif
/* list of struct preempt_notifier: */
@@ -1330,6 +1355,9 @@ struct task_struct {
int __user *clear_child_tid; /* CLONE_CHILD_CLEARTID */
cputime_t utime, stime, utimescaled, stimescaled;
+ unsigned long utime_pc, stime_pc;
cputime_t gtime;
cputime_t prev_utime, prev_stime;
unsigned long nvcsw, nivcsw; /* context switch counts */
@@ -1541,6 +1569,74 @@ struct task_struct {
unsigned long stack_start;
+extern int grunqueue_is_locked(void);
+extern void grq_unlock_wait(void);
+extern void cpu_scaling(int cpu);
+extern void cpu_nonscaling(int cpu);
+#define tsk_seruntime(t) ((t)->sched_time)
+#define tsk_rttimeout(t) ((t)->rt_timeout)
+#define task_rq_unlock_wait(tsk) grq_unlock_wait()
+static inline void set_oom_timeslice(struct task_struct *p)
+ p->time_slice = HZ;
+static inline void tsk_cpus_current(struct task_struct *p)
+#define runqueue_is_locked(cpu) grunqueue_is_locked()
+static inline void print_scheduler_version(void)
+ printk(KERN_INFO"BFS CPU scheduler v0.401 by Con Kolivas.\n");
+static inline int iso_task(struct task_struct *p)
+ return (p->policy == SCHED_ISO);
+extern int runqueue_is_locked(int cpu);
+extern void task_rq_unlock_wait(struct task_struct *p);
+static inline void cpu_scaling(int cpu)
+static inline void cpu_nonscaling(int cpu)
+#define tsk_seruntime(t) ((t)->se.sum_exec_runtime)
+#define tsk_rttimeout(t) ((t)->rt.timeout)
+static inline void sched_exit(struct task_struct *p)
+static inline void set_oom_timeslice(struct task_struct *p)
+ p->rt.time_slice = HZ;
+static inline void tsk_cpus_current(struct task_struct *p)
+ p->rt.nr_cpus_allowed = current->rt.nr_cpus_allowed;
+static inline void print_scheduler_version(void)
+ printk(KERN_INFO"CFS CPU scheduler.\n");
+static inline int iso_task(struct task_struct *p)
+ return 0;
/* Future-safe accessor for struct task_struct's cpus_allowed. */
#define tsk_cpumask(tsk) (&(tsk)->cpus_allowed)
@@ -1559,9 +1655,19 @@ struct task_struct {
#define MAX_USER_RT_PRIO 100
+#define DEFAULT_PRIO (MAX_RT_PRIO + 20)
+#define PRIO_RANGE (40)
+#define NORMAL_PRIO (MAX_RT_PRIO + 1)
+#define IDLE_PRIO (MAX_RT_PRIO + 2)
+#define PRIO_LIMIT ((IDLE_PRIO) + 1)
+#else /* CONFIG_SCHED_BFS */
#define MAX_PRIO (MAX_RT_PRIO + 40)
-#define DEFAULT_PRIO (MAX_RT_PRIO + 20)
+#endif /* CONFIG_SCHED_BFS */
static inline int rt_prio(int prio)
@@ -1873,7 +1979,7 @@ task_sched_runtime(struct task_struct *task);
extern unsigned long long thread_group_sched_runtime(struct task_struct *task);
/* sched_exec is called by processes performing an exec */
-#ifdef CONFIG_SMP
+#if defined(CONFIG_SMP) && !defined(CONFIG_SCHED_BFS)
extern void sched_exec(void);
#define sched_exec() {}
@@ -2028,6 +2134,9 @@ extern void wake_up_new_task(struct task_struct *tsk,
static inline void kick_process(struct task_struct *tsk) { }
extern void sched_fork(struct task_struct *p, int clone_flags);
+extern void sched_exit(struct task_struct *p);
extern void sched_dead(struct task_struct *p);
extern void proc_caches_init(void);
17 init/Kconfig
@@ -23,6 +23,19 @@ config CONSTRUCTORS
menu "General setup"
+config SCHED_BFS
+ bool "BFS cpu scheduler"
+ ---help---
+ The Brain Fuck CPU Scheduler for excellent interactivity and
+ responsiveness on the desktop and solid scalability on normal
+ hardware. Not recommended for 4096 CPUs.
+ Currently incompatible with the Group CPU scheduler, and RCU TORTURE
+ TEST so these options are disabled.
+ Say Y here.
+ default y
bool "Prompt for development and/or incomplete code/drivers"
@@ -428,7 +441,7 @@ config HAVE_UNSTABLE_SCHED_CLOCK
bool "Group CPU scheduler"
- depends on EXPERIMENTAL
default n
This feature lets CPU scheduler recognize task groups and control CPU
@@ -544,7 +557,7 @@ config PROC_PID_CPUSET
bool "Simple CPU accounting cgroup subsystem"
- depends on CGROUPS
+ depends on CGROUPS && !SCHED_BFS
Provides a simple Resource Controller for monitoring the
total CPU consumed by the tasks in a cgroup.
2 init/main.c
@@ -811,6 +811,8 @@ static noinline int init_post(void)
system_state = SYSTEM_RUNNING;
+ print_scheduler_version();
if (sys_open((const char __user *) "/dev/console", O_RDWR, 0) < 0)
printk(KERN_WARNING "Warning: unable to open an initial console.\n");
2 kernel/delayacct.c
@@ -128,7 +128,7 @@ int __delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
t1 = tsk->sched_info.pcount;
t2 = tsk->sched_info.run_delay;
- t3 = tsk->se.sum_exec_runtime;
+ t3 = tsk_seruntime(tsk);
d->cpu_count += t1;
2 kernel/exit.c
@@ -120,7 +120,7 @@ static void __exit_signal(struct task_struct *tsk)
sig->inblock += task_io_get_inblock(tsk);
sig->oublock += task_io_get_oublock(tsk);
task_io_accounting_add(&sig->ioac, &tsk->ioac);
- sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
+ sig->sum_sched_runtime += tsk_seruntime(tsk);
sig = NULL; /* Marker for below. */
14 kernel/posix-cpu-timers.c
@@ -250,7 +250,7 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
do {
times->utime = cputime_add(times->utime, t->utime);
times->stime = cputime_add(times->stime, t->stime);
- times->sum_exec_runtime += t->se.sum_exec_runtime;
+ times->sum_exec_runtime += tsk_seruntime(t);
t = next_thread(t);
} while (t != tsk);
@@ -517,7 +517,7 @@ static void cleanup_timers(struct list_head *head,
void posix_cpu_timers_exit(struct task_struct *tsk)
- tsk->utime, tsk->stime, tsk->se.sum_exec_runtime);
+ tsk->utime, tsk->stime, tsk_seruntime(tsk));
void posix_cpu_timers_exit_group(struct task_struct *tsk)
@@ -527,7 +527,7 @@ void posix_cpu_timers_exit_group(struct task_struct *tsk)
cputime_add(tsk->utime, sig->utime),
cputime_add(tsk->stime, sig->stime),
- tsk->se.sum_exec_runtime + sig->sum_sched_runtime);
+ tsk_seruntime(tsk) + sig->sum_sched_runtime);
static void clear_dead_task(struct k_itimer *timer, union cpu_time_count now)
@@ -1020,7 +1020,7 @@ static void check_thread_timers(struct task_struct *tsk,
struct cpu_timer_list *t = list_first_entry(timers,
struct cpu_timer_list,
- if (!--maxfire || tsk->se.sum_exec_runtime < t->expires.sched) {
+ if (!--maxfire || tsk_seruntime(tsk) < t->expires.sched) {
tsk->cputime_expires.sched_exp = t->expires.sched;
@@ -1036,15 +1036,15 @@ static void check_thread_timers(struct task_struct *tsk,
unsigned long *soft = &sig->rlim[RLIMIT_RTTIME].rlim_cur;
if (hard != RLIM_INFINITY &&
- tsk->rt.timeout > DIV_ROUND_UP(hard, USEC_PER_SEC/HZ)) {
+ tsk_rttimeout(tsk) > DIV_ROUND_UP(hard, USEC_PER_SEC/HZ)) {
* At the hard limit, we just die.
* No need to calculate anything else now.
__group_send_sig_info(SIGKILL, SEND_SIG_PRIV, tsk);
- if (tsk->rt.timeout > DIV_ROUND_UP(*soft, USEC_PER_SEC/HZ)) {
+ if (tsk_rttimeout(tsk) > DIV_ROUND_UP(*soft, USEC_PER_SEC/HZ)) {
* At the soft limit, send a SIGXCPU every second.
@@ -1367,7 +1367,7 @@ static inline int fastpath_timer_check(struct task_struct *tsk)
struct task_cputime task_sample = {
.utime = tsk->utime,
.stime = tsk->stime,
- .sum_exec_runtime = tsk->se.sum_exec_runtime
+ .sum_exec_runtime = tsk_seruntime(tsk)
if (task_cputime_expired(&task_sample, &tsk->cputime_expires))
4 kernel/sched.c
@@ -1,3 +1,6 @@
+#include "sched_bfs.c"
* kernel/sched.c
@@ -10957,3 +10960,4 @@ void synchronize_sched_expedited(void)
#endif /* #else #ifndef CONFIG_SMP */
+#endif /* CONFIG_SCHED_BFS */
7,072 kernel/sched_bfs.c
7,072 additions, 0 deletions not shown because the diff is too large. Please use a local Git client to view these changes.
35 kernel/sysctl.c
@@ -106,7 +106,12 @@ static int zero;
static int __maybe_unused one = 1;
static int __maybe_unused two = 2;
static unsigned long one_ul = 1;
-static int one_hundred = 100;
+static int __maybe_unused one_hundred = 100;
+extern int rr_interval;
+extern int sched_iso_cpu;
+static int __read_mostly one_thousand = 1000;
static int ten_thousand = 10000;
@@ -244,14 +249,15 @@ static struct ctl_table root_table[] = {
{ .ctl_name = 0 }
+#if defined(CONFIG_SCHED_DEBUG) && !defined(CONFIG_SCHED_BFS)
static int min_sched_granularity_ns = 100000; /* 100 usecs */
static int max_sched_granularity_ns = NSEC_PER_SEC; /* 1 second */
static int min_wakeup_granularity_ns; /* 0 usecs */
static int max_wakeup_granularity_ns = NSEC_PER_SEC; /* 1 second */
static struct ctl_table kern_table[] = {
.ctl_name = CTL_UNNUMBERED,
.procname = "sched_child_runs_first",
@@ -380,6 +386,7 @@ static struct ctl_table kern_table[] = {
.mode = 0644,
.proc_handler = &proc_dointvec,
+#endif /* !CONFIG_SCHED_BFS */
.ctl_name = CTL_UNNUMBERED,
@@ -831,6 +838,30 @@ static struct ctl_table kern_table[] = {
.proc_handler = &proc_dointvec,
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "rr_interval",
+ .data = &rr_interval,
+ .maxlen = sizeof (int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec_minmax,
+ .strategy = &sysctl_intvec,
+ .extra1 = &one,
+ .extra2 = &one_thousand,
+ },
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "iso_cpu",
+ .data = &sched_iso_cpu,
+ .maxlen = sizeof (int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec_minmax,
+ .strategy = &sysctl_intvec,
+ .extra1 = &zero,
+ .extra2 = &one_hundred,
+ },
#if defined(CONFIG_S390) && defined(CONFIG_SMP)
.ctl_name = KERN_SPIN_RETRY,
2 lib/Kconfig.debug
@@ -719,7 +719,7 @@ config BOOT_PRINTK_DELAY
tristate "torture tests for RCU"
- depends on DEBUG_KERNEL
+ depends on DEBUG_KERNEL && !SCHED_BFS
default n
This option provides a kernel module that runs torture tests
2 mm/oom_kill.c
@@ -365,7 +365,7 @@ static void __oom_kill_task(struct task_struct *p, int verbose)
* all the memory it needs. That way it should be able to
* exit() and clear out its resources quickly...
- p->rt.time_slice = HZ;
+ set_oom_timeslice(p);
set_tsk_thread_flag(p, TIF_MEMDIE);
force_sig(SIGKILL, p);
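
Note on the kernel/sysctl.c hunk above: the two new entries are added to kern_table, so on a kernel built with CONFIG_SCHED_BFS they should appear as /proc/sys/kernel/rr_interval (bounded by `one` and `one_thousand`) and /proc/sys/kernel/iso_cpu (bounded by `zero` and `one_hundred`). The following is a minimal userspace sketch, not part of this commit, for reading such a value and checking a candidate setting against those bounds. The helper names `in_range` and `read_sysctl_int` and the `*_MIN`/`*_MAX` macros are hypothetical, introduced only for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Bounds copied from the extra1/extra2 fields of the ctl_table entries
 * in the diff; the macro names themselves are assumptions. */
#define RR_INTERVAL_MIN 1
#define RR_INTERVAL_MAX 1000
#define ISO_CPU_MIN 0
#define ISO_CPU_MAX 100

/* Hypothetical helper: returns 1 if value lies within [min, max]. */
int in_range(int value, int min, int max)
{
	return value >= min && value <= max;
}

/* Hypothetical helper: read a single integer from a sysctl file such as
 * /proc/sys/kernel/rr_interval.
 * Returns 0 on success, -1 if the file is missing or cannot be parsed
 * (for example on a kernel built without CONFIG_SCHED_BFS). */
int read_sysctl_int(const char *path, int *out)
{
	FILE *f;
	int ok;

	assert(path != NULL && out != NULL);

	f = fopen(path, "r");
	if (!f)
		return -1;
	ok = (fscanf(f, "%d", out) == 1);
	fclose(f);
	return ok ? 0 : -1;
}
```

A caller might use `read_sysctl_int("/proc/sys/kernel/rr_interval", &v)` and treat a return of -1 as "BFS tunables not present". How the kernel handles an out-of-range write is left to proc_dointvec_minmax and is not shown in this diff, so checking with `in_range` before writing is only a precaution.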
