Browse files

merge: CONFIG_PREEMPT_RT Patch Set

Signed-off-by: Robert Nelson <>
  • Loading branch information...
RobertCNelson committed Apr 6, 2018
1 parent c8f540d commit 47c98bc88333ea35ac6afd229f137c3b39352267
Showing 407 changed files with 12,976 additions and 3,579 deletions.
@@ -0,0 +1,64 @@
The module hwlat_detector is a special purpose kernel module that is used to
detect large system latencies induced by the behavior of certain underlying
hardware or firmware, independent of Linux itself. The code was developed
originally to detect SMIs (System Management Interrupts) on x86 systems,
however there is nothing x86 specific about this patchset. It was
originally written for use by the "RT" patch since the Real Time
kernel is highly latency sensitive.
SMIs are usually not serviced by the Linux kernel, which typically does not
even know that they are occuring. SMIs are instead are set up by BIOS code
and are serviced by BIOS code, usually for "critical" events such as
management of thermal sensors and fans. Sometimes though, SMIs are used for
other tasks and those tasks can spend an inordinate amount of time in the
handler (sometimes measured in milliseconds). Obviously this is a problem if
you are trying to keep event service latencies down in the microsecond range.
The hardware latency detector works by hogging all of the cpus for configurable
amounts of time (by calling stop_machine()), polling the CPU Time Stamp Counter
for some period, then looking for gaps in the TSC data. Any gap indicates a
time when the polling was interrupted and since the machine is stopped and
interrupts turned off the only thing that could do that would be an SMI.
Note that the SMI detector should *NEVER* be used in a production environment.
It is intended to be run manually to determine if the hardware platform has a
problem with long system firmware service routines.
Loading the module hwlat_detector passing the parameter "enabled=1" (or by
setting the "enable" entry in "hwlat_detector" debugfs toggled on) is the only
step required to start the hwlat_detector. It is possible to redefine the
threshold in microseconds (us) above which latency spikes will be taken
into account (parameter "threshold=").
# modprobe hwlat_detector enabled=1 threshold=100
After the module is loaded, it creates a directory named "hwlat_detector" under
the debugfs mountpoint, "/debug/hwlat_detector" for this text. It is necessary
to have debugfs mounted, which might be on /sys/debug on your system.
The /debug/hwlat_detector interface contains the following files:
count - number of latency spikes observed since last reset
enable - a global enable/disable toggle (0/1), resets count
max - maximum hardware latency actually observed (usecs)
sample - a pipe from which to read current raw sample data
in the format <timestamp> <latency observed usecs>
(can be opened O_NONBLOCK for a single sample)
threshold - minimum latency value to be considered (usecs)
width - time period to sample with CPUs held (usecs)
must be less than the total window size (enforced)
window - total period of sampling, width being inside (usecs)
By default we will set width to 500,000 and window to 1,000,000, meaning that
we will sample every 1,000,000 usecs (1s) for 500,000 usecs (0.5s). If we
observe any latencies that exceed the threshold (initially 100 usecs),
then we write to a global sample ring buffer of 8K samples, which is
consumed by reading from the "sample" (pipe) debugfs file interface.
@@ -1640,6 +1640,15 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
ip= [IP_PNP]
See Documentation/filesystems/nfs/nfsroot.txt.
irqaffinity= [SMP] Set the default irq affinity mask
<cpu number>,...,<cpu number>
<cpu number>-<cpu number>
(must be a positive range in ascending order)
or a mixture
<cpu number>,...,<cpu number>-<cpu number>
irqfixup [HW]
When an interrupt is not handled search all handlers
for it. Intended to get systems with badly broken
@@ -59,10 +59,17 @@ On PowerPC - Press 'ALT - Print Screen (or F13) - <command key>,
On other - If you know of the key combos for other architectures, please
let me know so I can add them to this section.
On all - write a character to /proc/sysrq-trigger. e.g.:
On all - write a character to /proc/sysrq-trigger, e.g.:
echo t > /proc/sysrq-trigger
On all - Enable network SysRq by writing a cookie to icmp_echo_sysrq, e.g.
echo 0x01020304 >/proc/sys/net/ipv4/icmp_echo_sysrq
Send an ICMP echo request with this pattern plus the particular
SysRq command key. Example:
# ping -c1 -s57 -p0102030468
will trigger the SysRq-H (help) command.
* What are the 'command' keys?
'b' - Will immediately reboot the system without syncing or unmounting
@@ -0,0 +1,186 @@
Using the Linux Kernel Latency Histograms
This document gives a short explanation how to enable, configure and use
latency histograms. Latency histograms are primarily relevant in the
context of real-time enabled kernels (CONFIG_PREEMPT/CONFIG_PREEMPT_RT)
and are used in the quality management of the Linux real-time
* Purpose of latency histograms
A latency histogram continuously accumulates the frequencies of latency
data. There are two types of histograms
- potential sources of latencies
- effective latencies
* Potential sources of latencies
Potential sources of latencies are code segments where interrupts,
preemption or both are disabled (aka critical sections). To create
histograms of potential sources of latency, the kernel stores the time
stamp at the start of a critical section, determines the time elapsed
when the end of the section is reached, and increments the frequency
counter of that latency value - irrespective of whether any concurrently
running process is affected by latency or not.
- Configuration items (in the Kernel hacking/Tracers submenu)
* Effective latencies
Effective latencies are actually occuring during wakeup of a process. To
determine effective latencies, the kernel stores the time stamp when a
process is scheduled to be woken up, and determines the duration of the
wakeup time shortly before control is passed over to this process. Note
that the apparent latency in user space may be somewhat longer, since the
process may be interrupted after control is passed over to it but before
the execution in user space takes place. Simply measuring the interval
between enqueuing and wakeup may also not appropriate in cases when a
process is scheduled as a result of a timer expiration. The timer may have
missed its deadline, e.g. due to disabled interrupts, but this latency
would not be registered. Therefore, the offsets of missed timers are
recorded in a separate histogram. If both wakeup latency and missed timer
offsets are configured and enabled, a third histogram may be enabled that
records the overall latency as a sum of the timer latency, if any, and the
wakeup latency. This histogram is called "timerandwakeup".
- Configuration items (in the Kernel hacking/Tracers submenu)
* Usage
The interface to the administration of the latency histograms is located
in the debugfs file system. To mount it, either enter
mount -t sysfs nodev /sys
mount -t debugfs nodev /sys/kernel/debug
from shell command line level, or add
nodev /sys sysfs defaults 0 0
nodev /sys/kernel/debug debugfs defaults 0 0
to the file /etc/fstab. All latency histogram related files are then
available in the directory /sys/kernel/debug/tracing/latency_hist. A
particular histogram type is enabled by writing non-zero to the related
variable in the /sys/kernel/debug/tracing/latency_hist/enable directory.
Select "preemptirqsoff" for the histograms of potential sources of
latencies and "wakeup" for histograms of effective latencies etc. The
histogram data - one per CPU - are available in the files
The histograms are reset by writing non-zero to the file "reset" in a
particular latency directory. To reset all latency data, use
if test -d $HISTDIR
for i in `find . | grep /reset$`
echo 1 >$i
* Data format
Latency data are stored with a resolution of one microsecond. The
maximum latency is 10,240 microseconds. The data are only valid, if the
overflow register is empty. Every output line contains the latency in
microseconds in the first row and the number of samples in the second
row. To display only lines with a positive latency count, use, for
grep -v " 0$" /sys/kernel/debug/tracing/latency_hist/preemptoff/CPU0
#Minimum latency: 0 microseconds.
#Average latency: 0 microseconds.
#Maximum latency: 25 microseconds.
#Total samples: 3104770694
#There are 0 samples greater or equal than 10240 microseconds
#usecs samples
0 2984486876
1 49843506
2 58219047
3 5348126
4 2187960
5 3388262
6 959289
7 208294
8 40420
9 4485
10 14918
11 18340
12 25052
13 19455
14 5602
15 969
16 47
17 18
18 14
19 1
20 3
21 2
22 5
23 2
25 1
* Wakeup latency of a selected process
To only collect wakeup latency data of a particular process, write the
PID of the requested process to
PIDs are not considered, if this variable is set to 0.
* Details of the process with the highest wakeup latency so far
Selected data of the process that suffered from the highest wakeup
latency that occurred in a particular CPU are available in the file
In addition, other relevant system data at the time when the
latency occurred are given.
The format of the data is (all in one line):
<PID> <Priority> <Latency> (<Timeroffset>) <Command> \
<- <PID> <Priority> <Command> <Timestamp>
The value of <Timeroffset> is only relevant in the combined timer
and wakeup latency recording. In the wakeup recording, it is
always 0, in the missed_timer_offsets recording, it is the same
as <Latency>.
When retrospectively searching for the origin of a latency and
tracing was not enabled, it may be helpful to know the name and
some basic data of the task that (finally) was switching to the
late real-tlme task. In addition to the victim's data, also the
data of the possible culprit are therefore displayed after the
"<-" symbol.
Finally, the timestamp of the time when the latency occurred
in <seconds>.<microseconds> after the most recent system boot
is provided.
These data are also reset when the wakeup histogram is reset.
@@ -797,6 +797,9 @@ KBUILD_CFLAGS += $(call cc-option,-Werror=strict-prototypes)
# Prohibit date/time macros, which would make the build non-deterministic
KBUILD_CFLAGS += $(call cc-option,-Werror=date-time)
# enforce correct pointer usage
KBUILD_CFLAGS += $(call cc-option,-Werror=incompatible-pointer-types)
# use the deterministic mode of AR if available
KBUILD_ARFLAGS := $(call ar-option,D)
@@ -9,6 +9,7 @@ config OPROFILE
tristate "OProfile system profiling"
depends on PROFILING
depends on HAVE_OPROFILE
depends on !PREEMPT_RT_FULL
@@ -52,6 +53,7 @@ config KPROBES
bool "Optimize very unlikely/likely branches"
This option enables a transparent branch optimization that
makes certain almost-always-true or almost-always-false branch
@@ -33,7 +33,7 @@ config ARM
select HAVE_ARCH_BITREVERSE if (CPU_32v7M || CPU_32v7) && !CPU_32v6
@@ -68,6 +68,7 @@ config ARM
@@ -3,6 +3,13 @@
#include <linux/thread_info.h>
void switch_kmaps(struct task_struct *prev_p, struct task_struct *next_p);
static inline void
switch_kmaps(struct task_struct *prev_p, struct task_struct *next_p) { }
* For v7 SMP cores running a preemptible kernel we may be pre-empted
* during a TLB maintenance operation, so execute an inner-shareable dsb
@@ -25,6 +32,7 @@ extern struct task_struct *__switch_to(struct task_struct *, struct thread_info
#define switch_to(prev,next,last) \
do { \
__complete_pending_tlbi(); \
switch_kmaps(prev, next); \
last = __switch_to(prev,task_thread_info(prev), task_thread_info(next)); \
} while (0)
@@ -49,6 +49,7 @@ struct cpu_context_save {
struct thread_info {
unsigned long flags; /* low level flags */
int preempt_count; /* 0 => preemptable, <0 => bug */
int preempt_lazy_count; /* 0 => preemptable, <0 => bug */
mm_segment_t addr_limit; /* address limit */
struct task_struct *task; /* main task structure */
__u32 cpu; /* cpu */
@@ -142,7 +143,8 @@ extern int vfp_restore_user_hwstate(struct user_vfp __user *,
#define TIF_SYSCALL_TRACE 4 /* syscall trace active */
#define TIF_SYSCALL_AUDIT 5 /* syscall auditing active */
#define TIF_SYSCALL_TRACEPOINT 6 /* syscall tracepoint instrumentation */
#define TIF_SECCOMP 7 /* seccomp syscall filtering active */
#define TIF_SECCOMP 8 /* seccomp syscall filtering active */
#define TIF_NOHZ 12 /* in adaptive nohz mode */
@@ -152,6 +154,7 @@ extern int vfp_restore_user_hwstate(struct user_vfp __user *,
#define _TIF_UPROBE (1 << TIF_UPROBE)
@@ -167,7 +170,8 @@ extern int vfp_restore_user_hwstate(struct user_vfp __user *,
* Change these and you break ASM code in entry-common.S
#endif /* __KERNEL__ */
#endif /* __ASM_ARM_THREAD_INFO_H */
@@ -65,6 +65,7 @@ int main(void)
DEFINE(TI_FLAGS, offsetof(struct thread_info, flags));
DEFINE(TI_PREEMPT, offsetof(struct thread_info, preempt_count));
DEFINE(TI_PREEMPT_LAZY, offsetof(struct thread_info, preempt_lazy_count));
DEFINE(TI_ADDR_LIMIT, offsetof(struct thread_info, addr_limit));
DEFINE(TI_TASK, offsetof(struct thread_info, task));
DEFINE(TI_CPU, offsetof(struct thread_info, cpu));
Oops, something went wrong.

0 comments on commit 47c98bc

Please sign in to comment.