Commits on Oct 11, 2011
  1. writeback: fix ppc compile warnings on do_div(long long, unsigned long)

    fengguang committed Oct 11, 2011
    Fix powerpc compile warnings
    
    mm/page-writeback.c: In function 'bdi_position_ratio':
    mm/page-writeback.c:622:3: warning: comparison of distinct pointer types lacks a cast [enabled by default]
    mm/page-writeback.c:635:4: warning: comparison of distinct pointer types lacks a cast [enabled by default]
    
    Also fix gcc "uninitialized var" warnings.
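
    The shape of the fix, sketched (illustrative, not the exact hunks):
    kernel min()/max() refuse mixed types, so one side gets a cast or
    min_t() picks a common type.

        /* given: long long x; unsigned long t; then either */
        x = min(x, (long long)t);      /* cast one side, or ...   */
        x = min_t(long long, x, t);    /* ... force a common type */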
    
    Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
    Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Commits on Oct 3, 2011
  1. writeback: per-bdi background threshold

    fengguang committed Nov 18, 2010
    One thing that puzzled me is that in the JBOD case, the per-disk writeout
    performance is smaller than in the corresponding single-disk case even
    when they have comparable bdi_thresh. Tracing shows that in the single
    disk case, bdi_writeback is always kept high, while in the JBOD case it
    could drop low from time to time, and correspondingly bdi_reclaimable
    could sometimes rush high.
    
    The fix is to watch bdi_reclaimable and kick off background writeback as
    soon as it goes high. This resembles the global background threshold,
    but in a per-bdi manner. The trick is, as long as bdi_reclaimable does
    not go high, bdi_writeback naturally won't go low, because
    bdi_reclaimable + bdi_writeback ~= bdi_thresh.
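
    A sketch of the resulting check (assuming an over_bground_thresh()
    that takes the bdi; close to, but not necessarily, the exact patch):

        static bool over_bground_thresh(struct backing_dev_info *bdi)
        {
                unsigned long background_thresh, dirty_thresh;

                global_dirty_limits(&background_thresh, &dirty_thresh);

                /* global background threshold, as before */
                if (global_page_state(NR_FILE_DIRTY) +
                    global_page_state(NR_UNSTABLE_NFS) > background_thresh)
                        return true;

                /* new: per-bdi background threshold */
                if (bdi_stat(bdi, BDI_RECLAIMABLE) >
                    bdi_dirty_limit(bdi, background_thresh))
                        return true;

                return false;
        }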
    
    With less fluctuation in the writeback pages, JBOD performance is
    observed to increase noticeably in various cases.
    
    vmstat:nr_written values before/after patch:
    
      3.1.0-rc4-wo-underrun+      3.1.0-rc4-bgthresh3+  
    ------------------------  ------------------------  
                   125596480       +25.9%    158179363  JBOD-10HDD-16G/ext4-100dd-1M-24p-16384M-20:10-X
                    61790815      +110.4%    130032231  JBOD-10HDD-16G/ext4-10dd-1M-24p-16384M-20:10-X
                    58853546        -0.1%     58823828  JBOD-10HDD-16G/ext4-1dd-1M-24p-16384M-20:10-X
                   110159811       +24.7%    137355377  JBOD-10HDD-16G/xfs-100dd-1M-24p-16384M-20:10-X
                    69544762       +10.8%     77080047  JBOD-10HDD-16G/xfs-10dd-1M-24p-16384M-20:10-X
                    50644862        +0.5%     50890006  JBOD-10HDD-16G/xfs-1dd-1M-24p-16384M-20:10-X
                    42677090       +28.0%     54643527  JBOD-10HDD-thresh=100M/ext4-100dd-1M-24p-16384M-100M:10-X
                    47491324       +13.3%     53785605  JBOD-10HDD-thresh=100M/ext4-10dd-1M-24p-16384M-100M:10-X
                    52548986        +0.9%     53001031  JBOD-10HDD-thresh=100M/ext4-1dd-1M-24p-16384M-100M:10-X
                    26783091       +36.8%     36650248  JBOD-10HDD-thresh=100M/xfs-100dd-1M-24p-16384M-100M:10-X
                    35526347       +14.0%     40492312  JBOD-10HDD-thresh=100M/xfs-10dd-1M-24p-16384M-100M:10-X
                    44670723        -1.1%     44177606  JBOD-10HDD-thresh=100M/xfs-1dd-1M-24p-16384M-100M:10-X
                   127996037       +22.4%    156719990  JBOD-10HDD-thresh=2G/ext4-100dd-1M-24p-16384M-2048M:10-X
                    57518856        +3.8%     59677625  JBOD-10HDD-thresh=2G/ext4-10dd-1M-24p-16384M-2048M:10-X
                    51919909       +12.2%     58269894  JBOD-10HDD-thresh=2G/ext4-1dd-1M-24p-16384M-2048M:10-X
                    86410514       +79.0%    154660433  JBOD-10HDD-thresh=2G/xfs-100dd-1M-24p-16384M-2048M:10-X
                    40132519       +38.6%     55617893  JBOD-10HDD-thresh=2G/xfs-10dd-1M-24p-16384M-2048M:10-X
                    48423248        +7.5%     52042927  JBOD-10HDD-thresh=2G/xfs-1dd-1M-24p-16384M-2048M:10-X
                   206041046       +44.1%    296846536  JBOD-10HDD-thresh=4G/xfs-100dd-1M-24p-16384M-4096M:10-X
                    72312903       -19.4%     58272885  JBOD-10HDD-thresh=4G/xfs-10dd-1M-24p-16384M-4096M:10-X
                    50635672        -0.5%     50384787  JBOD-10HDD-thresh=4G/xfs-1dd-1M-24p-16384M-4096M:10-X
                    68308534      +115.7%    147324758  JBOD-10HDD-thresh=800M/ext4-100dd-1M-24p-16384M-800M:10-X
                    57882933       +14.5%     66269621  JBOD-10HDD-thresh=800M/ext4-10dd-1M-24p-16384M-800M:10-X
                    52183472       +12.8%     58855181  JBOD-10HDD-thresh=800M/ext4-1dd-1M-24p-16384M-800M:10-X
                    53788956       +94.2%    104460352  JBOD-10HDD-thresh=800M/xfs-100dd-1M-24p-16384M-800M:10-X
                    44493342       +35.5%     60298210  JBOD-10HDD-thresh=800M/xfs-10dd-1M-24p-16384M-800M:10-X
                    42641209       +18.9%     50681038  JBOD-10HDD-thresh=800M/xfs-1dd-1M-24p-16384M-800M:10-X
    
    Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
  2. writeback: dirty position control - bdi reserve area

    fengguang committed Aug 5, 2011
    Keep a minimal pool of dirty pages for each bdi, so that the disk IO
    queues won't underrun. Also gently increase a small bdi_thresh to
    avoid it getting stuck at 0 for lightly-dirtied bdis.
    
    It's particularly useful for JBOD and small memory systems.
    
    It may result in (pos_ratio > 1) at the setpoint and push the dirty
    pages high. This is more or less intended, because the bdi is in
    danger of IO queue underrun.
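
    A rough sketch of the idea in bdi_position_ratio() (the reserve size
    and its placement are illustrative assumptions):

        /* reserve a small fraction of write_bw worth of dirty pages */
        reserve = bdi->avg_write_bandwidth / 8;        /* assumed size */
        if (bdi_dirty < reserve)
                /* pos_ratio > 1: throttle less, let the queue refill */
                pos_ratio = pos_ratio * reserve / (bdi_dirty + 1);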
    
    Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
  3. writeback: control dirty pause time

    fengguang committed Jun 12, 2011
    The dirty pause time shall ultimately be controlled by adjusting
    nr_dirtied_pause, since there is the relationship
    
    	pause = pages_dirtied / task_ratelimit
    
    Assuming
    
    	pages_dirtied ~= nr_dirtied_pause
    	task_ratelimit ~= dirty_ratelimit
    
    We get
    
    	nr_dirtied_pause ~= dirty_ratelimit * desired_pause
    
    Here dirty_ratelimit is preferred over task_ratelimit because it's
    more stable.
    
    It's also important to limit possible large transitional errors:
    
    - bw is changing quickly
    - pages_dirtied << nr_dirtied_pause on entering dirty exceeded area
    - pages_dirtied >> nr_dirtied_pause on btrfs (to be improved by a
      separate fix, but still expect non-trivial errors)
    
    So we end up using the above formula inside clamp_val().
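
    In code form, the update looks roughly like this (the desired pause
    and the clamp bounds below are illustrative):

        /* aim for ~100ms of dirtying per balance_dirty_pages() call */
        unsigned long target = dirty_ratelimit / 10;  /* pages per 100ms */

        /* clamp to limit large transitional errors around pages_dirtied */
        current->nr_dirtied_pause = clamp_val(target,
                                              pages_dirtied / 4 + 1,
                                              pages_dirtied * 4 + 8);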
    
    The best test case for this code is to run 100 "dd bs=4M" tasks on
    btrfs and check the pause time distribution.
    
    Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
  4. writeback: limit max dirty pause time

    fengguang committed Jun 12, 2011
    Apply two policies to scale down the max pause time for
    
    1) small number of concurrent dirtiers
    2) small memory system (comparing to storage bandwidth)
    
    MAX_PAUSE=200ms may only be suitable for high-end servers with lots of
    concurrent dirtiers, where the large pause time can greatly reduce
    overhead.
    
    Otherwise, a smaller pause time is desirable whenever possible, so as
    to get good responsiveness and a smooth user experience. It's actually
    required for good disk utilization in the case where all the dirty
    pages can be synced to disk within MAX_PAUSE=200ms.
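
    A sketch of applying the two policies (the constants mirror the ones
    discussed above; details are illustrative):

        static unsigned long bdi_max_pause(struct backing_dev_info *bdi,
                                           unsigned long bdi_dirty)
        {
                unsigned long bw = bdi->avg_write_bandwidth;
                unsigned long hi = ilog2(bw);
                unsigned long lo = ilog2(bdi->dirty_ratelimit);
                unsigned long t;

                /* policy 1: ~20ms for 1 dd, scaled up for 2^N dirtiers */
                t = HZ / 50;
                if (hi > lo)
                        t += (hi - lo) * (20 * HZ) / 1024;

                /* policy 2: never pause longer than it takes to write
                 * out a good chunk of this bdi's dirty pages */
                t = min(t, bdi_dirty * HZ / (8 * bw + 1));

                return clamp_val(t, 4, MAX_PAUSE);
        }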
    
    Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
  5. writeback: IO-less balance_dirty_pages()

    fengguang committed Aug 28, 2010
    As proposed by Chris, Dave and Jan, don't start foreground writeback IO
    inside balance_dirty_pages(). Instead, simply let it idle-sleep for
    some time to throttle the dirtying task. Meanwhile, kick off the
    per-bdi flusher thread to do the background writeback IO.
    
    RATIONALE
    =========
    
    - disk seeks on concurrent writeback of multiple inodes (Dave Chinner)
    
      If every thread doing writes and being throttled starts foreground
      writeback, we get N IO submitters working on at least N different
      inodes at the same time. They end up with N different sets of IO
      being issued with potentially zero locality to each other, resulting
      in much lower elevator sort/merge efficiency, and hence the disk
      seeks all over the place to service the different sets of IO.
      OTOH, if there is only one submission thread, it doesn't jump between
      inodes in the same way when congestion clears - it keeps writing to
      the same inode, resulting in large related chunks of sequential IOs
      being issued to the disk. This is more efficient than the above
      foreground writeback because the elevator works better and the disk
      seeks less.
    
    - lock contention and cache bouncing on concurrent IO submitters (Dave Chinner)
    
      With this patchset, the fs_mark benchmark on a 12-drive software RAID0 goes
      from CPU bound to IO bound, freeing "3-4 CPUs worth of spinlock contention".
    
      * "CPU usage has dropped by ~55%", "it certainly appears that most of
        the CPU time saving comes from the removal of contention on the
        inode_wb_list_lock" (IMHO at least 10% comes from the reduction of
        cacheline bouncing, because the new code is able to call much less
        frequently into balance_dirty_pages() and hence access the global
        page states)
    
      * the user space "App overhead" is reduced by 20%, by avoiding the
        cacheline pollution by the complex writeback code path
    
      * "for a ~5% throughput reduction", "the number of write IOs have
        dropped by ~25%", and the elapsed time reduced from 41:42.17 to
        40:53.23.
    
      * On a simple test of 100 dd, it reduces the CPU %system time from 30% to 3%,
        and improves IO throughput from 38MB/s to 42MB/s.
    
    - IO size too small for fast arrays and too large for slow USB sticks
    
      The write_chunk used by the current balance_dirty_pages() cannot be
      directly set to some large value (eg. 128MB) for better IO
      efficiency, because that could lead to user-perceivable stalls of
      more than 1 second. Even the current 4MB write size may be too large
      for slow USB sticks. The fact that balance_dirty_pages() starts IO
      on itself couples the IO size to the wait time, which makes it hard
      to pick a suitable IO size while keeping the wait time under control.
    
      Now it's possible to increase writeback chunk size proportional to the
      disk bandwidth. In a simple test of 50 dd's on XFS, 1-HDD, 3GB ram,
      the larger writeback size dramatically reduces the seek count to 1/10
      (far beyond my expectation) and improves the write throughput by 24%.
    
    - long block time in balance_dirty_pages() hurts desktop responsiveness
    
      Many of us may have had the experience: it often takes a couple of
      seconds or even longer to stop a heavy writing dd/cp/tar command
      with Ctrl-C or "kill -9".
    
    - IO pipeline broken by bumpy write() progress
    
      There is a broad class of "loop {read(buf); write(buf);}" applications
      whose read() pipeline will be under-utilized or even come to a stop if
      the write()s have long latencies _or_ don't progress at a constant rate.
      The current threshold-based throttling inherently transfers the large
      low-level IO completion fluctuations to bumpy application write()s,
      and further deteriorates with an increasing number of dirtiers and/or
      bdi's.
    
      For example, when doing 50 dd's + 1 remote rsync to an XFS partition,
      the rsync progresses very bumpily in the legacy kernel, and its
      throughput is improved by 67% by this patchset. (With the larger
      write chunk size as well, it becomes a 93% speedup.)
    
      The new rate based throttling can support 1000+ dd's with excellent
      smoothness, low latency and low overheads.
    
    For the above reasons, it's much better to do IO-less and low latency
    pauses in balance_dirty_pages().
    
    Jan Kara, Dave Chinner and I explored a scheme to let
    balance_dirty_pages() wait for enough writeback IO completions to
    safeguard the dirty limit. However, it was found to have two problems:
    
    - in large NUMA systems, the per-cpu counters may have big accounting
      errors, leading to big throttle wait time and jitters.
    
    - NFS may kill a large number of unstable pages with one single COMMIT.
      Because the NFS server serves COMMIT with expensive fsync() IOs, it is
      desirable to delay and reduce the number of COMMITs. So such bursty
      IO completions are not likely to be optimized away, nor are the
      resulting large (and tiny) stall times in IO-completion-based
      throttling.
    
    So here is a pause-time-oriented approach, which tries to control the
    pause time in each balance_dirty_pages() invocation by controlling
    the number of pages dirtied before calling balance_dirty_pages(), for
    smooth and efficient dirty throttling:
    
    - avoid useless (eg. zero pause time) balance_dirty_pages() calls
    - avoid too small pause time (less than   4ms, which burns CPU power)
    - avoid too large pause time (more than 200ms, which hurts responsiveness)
    - avoid big fluctuations of pause times
    
    It can control pause times at will. The default policy (in a followup
    patch) will be to do ~10ms pauses in the 1-dd case, increasing to
    ~100ms in the 1000-dd case.
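
    The conceptual shape of the new loop (a sketch; the real code carries
    much more bookkeeping):

        for (;;) {
                nr_dirty = global_page_state(NR_FILE_DIRTY) +
                           global_page_state(NR_UNSTABLE_NFS);
                if (nr_dirty <= freerun)
                        break;                   /* no throttling needed */

                bdi_start_background_writeback(bdi);  /* flusher does IO */

                pause = HZ * pages_dirtied / task_ratelimit;
                __set_current_state(TASK_UNINTERRUPTIBLE);
                io_schedule_timeout(pause);   /* idle sleep, no IO here */
        }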
    
    BEHAVIOR CHANGE
    ===============
    
    (1) dirty threshold
    
    Users will notice that applications get throttled once they cross the
    global (background + dirty)/2 = 15% threshold, and are then balanced
    around 17.5%. Before this patch, the behavior was to just throttle at
    20% of dirtyable memory in the 1-dd case.
    
    Since a task will be soft-throttled earlier than before, it may be
    perceived by end users as a performance "slow down" if their
    application happens to dirty more than 15% of dirtyable memory.
    
    (2) smoothness/responsiveness
    
    Users will notice a more responsive system during heavy writeback.
    "killall dd" will take effect instantly.
    
    Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
  6. writeback: per task dirty rate limit

    fengguang committed Jun 12, 2011
    Add two fields to task_struct.
    
    1) account dirtied pages in the individual tasks, for accuracy
    2) per-task balance_dirty_pages() call intervals, for flexibility
    
    The balance_dirty_pages() call interval (ie. nr_dirtied_pause) will
    scale roughly with the square root of the safety gap between the dirty
    pages and the threshold.
    
    The main problem with per-task nr_dirtied is that if 1k+ tasks start
    dirtying pages at exactly the same time, each task will be assigned a
    large initial nr_dirtied_pause, so the dirty threshold will be exceeded
    long before each task reaches its nr_dirtied_pause and hence calls
    balance_dirty_pages().
    
    The solution is to watch for the number of pages dirtied on each CPU
    between the calls into balance_dirty_pages(). If it exceeds
    ratelimit_pages (3% of the dirty threshold), force a call to
    balance_dirty_pages() for a chance to set bdi->dirty_exceeded. In
    normal situations, this safeguarding condition is not expected to
    trigger at all.
    
    On the sqrt in dirty_poll_interval():
    
    It will serve as an initial guess when dirty pages are still in the
    freerun area.
    
    When dirty pages are floating inside the dirty control scope [freerun,
    limit], a followup patch will use some refined dirty poll interval to
    get the desired pause time.
    
       thresh-dirty (MB)    sqrt
    		   1      16
    		   2      22
    		   4      32
    		   8      45
    		  16      64
    		  32      90
    		  64     128
    		 128     181
    		 256     256
    		 512     362
    		1024     512
    
    The above table means that, given a 1MB (or 1GB) gap and dd tasks
    polling balance_dirty_pages() every 16 (or 512) pages, the dirty limit
    won't be exceeded as long as there are fewer than 16 (or 512)
    concurrent dd's.
    
    So sqrt naturally leads to lower overhead and more safe concurrent
    tasks on large-memory servers, which have large (thresh - freerun)
    gaps.
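
    The near-sqrt can be computed cheaply by halving the log2, as in this
    sketch of dirty_poll_interval():

        static unsigned long dirty_poll_interval(unsigned long dirty,
                                                 unsigned long thresh)
        {
                /* sqrt(gap) ~= 2^(log2(gap) / 2) */
                if (thresh > dirty)
                        return 1UL << (ilog2(thresh - dirty) >> 1);

                return 1;
        }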
    
    peter: keep the per-CPU ratelimit for safeguarding the 1k+ tasks case
    
    CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Reviewed-by: Andrea Righi <andrea@betterlinux.com>
    Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
  7. writeback: stabilize bdi->dirty_ratelimit

    fengguang committed Aug 26, 2011
    There are some imperfections in balanced_dirty_ratelimit.
    
    1) large fluctuations
    
    The dirty_rate used for computing balanced_dirty_ratelimit is merely
    averaged over the past 200ms (very small compared to the 3s estimation
    period for write_bw), which makes for a rather dispersed distribution
    of balanced_dirty_ratelimit.
    
    It's pretty hard to average out the singular points by increasing the
    estimation period. Considering that the averaging technique would
    introduce very undesirable time lags, I gave up on it entirely. (btw,
    the 3s write_bw averaging time lag is much more acceptable because its
    impact is one-way and therefore won't lead to oscillations.)
    
    The more practical way is filtering -- most singular
    balanced_dirty_ratelimit points can be filtered out by remembering some
    prev_balanced_rate and prev_prev_balanced_rate. However, the more
    reliable way is to guard balanced_dirty_ratelimit with task_ratelimit.
    
    2) due to truncates and fs redirties, the (write_bw <=> dirty_rate)
    match could become unbalanced, which may lead to large systematic
    errors in balanced_dirty_ratelimit. The truncates, due to their
    possibly bumpy nature, can hardly be compensated for smoothly. So
    let's face it. When some over-estimated balanced_dirty_ratelimit
    brings dirty_ratelimit high, dirty pages will go higher than the
    setpoint. task_ratelimit will in turn become lower than
    dirty_ratelimit.  So if we consider both balanced_dirty_ratelimit and
    task_ratelimit and update dirty_ratelimit only when they are on the
    same side of dirty_ratelimit, the systematic errors in
    balanced_dirty_ratelimit won't be able to drag dirty_ratelimit far
    away.
    
    The balanced_dirty_ratelimit estimation may also be inaccurate near
    @limit or @freerun; however, that is less of an issue.
    
    3) since we ultimately want to
    
    - keep the fluctuations of task ratelimit as small as possible
    - keep the dirty pages around the setpoint for as long as possible
    
    the update policy used for (2) also serves the above goals nicely:
    if for some reason the dirty pages are high (task_ratelimit < dirty_ratelimit),
    and dirty_ratelimit is low (dirty_ratelimit < balanced_dirty_ratelimit),
    there is no point in bringing up dirty_ratelimit in a hurry only to
    hurt both of the above goals.
    
    So, we make use of task_ratelimit to limit the update of dirty_ratelimit
    in two ways:
    
    1) avoid changing the dirty rate when it's against the position control
       target (the adjusted rate would slow down the progress of the dirty
       pages going back to the setpoint).
    
    2) limit the step size. task_ratelimit changes values step by step,
       leaving a consistent trace compared to the randomly jumping
       balanced_dirty_ratelimit. task_ratelimit also has the nice property
       of smaller errors in the stable state and typically larger errors
       when there are big errors in rate.  So it's a pretty good limiting
       factor for the step size of dirty_ratelimit.
    
    Note that bdi->dirty_ratelimit is always tracking balanced_dirty_ratelimit.
    task_ratelimit is merely used as a limiting factor.
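
    The same-side guard can be written as below (a sketch; the step is
    then applied to dirty_ratelimit gradually):

        step = 0;
        if (dirty < setpoint) {
                /* dirty pages low: task_ratelimit > dirty_ratelimit,
                 * so only allow raising dirty_ratelimit */
                x = min(balanced_dirty_ratelimit, task_ratelimit);
                if (dirty_ratelimit < x)
                        step = x - dirty_ratelimit;
        } else {
                /* dirty pages high: only allow lowering dirty_ratelimit */
                x = max(balanced_dirty_ratelimit, task_ratelimit);
                if (dirty_ratelimit > x)
                        step = dirty_ratelimit - x;
        }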
    
    Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
  8. writeback: dirty rate control

    fengguang committed Jun 12, 2011
    It's all about bdi->dirty_ratelimit, which aims to be (write_bw / N)
    when there are N dd tasks.
    
    On write() syscall, use bdi->dirty_ratelimit
    ============================================
    
        balance_dirty_pages(pages_dirtied)
        {
            task_ratelimit = bdi->dirty_ratelimit * bdi_position_ratio();
            pause = pages_dirtied / task_ratelimit;
            sleep(pause);
        }
    
    On every 200ms, update bdi->dirty_ratelimit
    ===========================================
    
        bdi_update_dirty_ratelimit()
        {
            task_ratelimit = bdi->dirty_ratelimit * bdi_position_ratio();
            balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate;
            bdi->dirty_ratelimit = balanced_dirty_ratelimit;
        }
    
    Estimation of balanced bdi->dirty_ratelimit
    ===========================================
    
    balanced task_ratelimit
    -----------------------
    
    balance_dirty_pages() needs to throttle tasks dirtying pages such that
    the total amount of dirty pages stays below the specified dirty limit in
    order to avoid memory deadlocks. Furthermore we desire fairness in that
    tasks get throttled proportionally to the amount of pages they dirty.
    
    IOW we want to throttle tasks such that we match the dirty rate to the
    writeout bandwidth; this yields a stable amount of dirty pages:
    
            dirty_rate == write_bw                                          (1)
    
    The fairness requirement gives us:
    
            task_ratelimit = balanced_dirty_ratelimit
                           == write_bw / N                                  (2)
    
    where N is the number of dd tasks.  We don't know N beforehand, but
    can still estimate balanced_dirty_ratelimit within 200ms.
    
    Start by throttling each dd task at rate
    
            task_ratelimit = task_ratelimit_0                               (3)
                             (any non-zero initial value is OK)
    
    After 200ms, we measured
    
            dirty_rate = # of pages dirtied by all dd's / 200ms
            write_bw   = # of pages written to the disk / 200ms
    
    For the aggressive dd dirtiers, the equality holds
    
            dirty_rate == N * task_rate
                       == N * task_ratelimit_0                              (4)
    Or
            task_ratelimit_0 == dirty_rate / N                              (5)
    
    Now we conclude that the balanced task ratelimit can be estimated by
    
                                                          write_bw
            balanced_dirty_ratelimit = task_ratelimit_0 * ----------        (6)
                                                          dirty_rate
    
    Because with (4) and (5) we can get the desired equality (1):
    
                                                           write_bw
            balanced_dirty_ratelimit == (dirty_rate / N) * ----------
                                                           dirty_rate
                                     == write_bw / N
    
    Then using the balanced task ratelimit we can compute task pause times like:
    
            task_pause = task->nr_dirtied / task_ratelimit
    
    task_ratelimit with position control
    ------------------------------------
    
    However, while the above gives us the means of matching the dirty rate
    to the writeout bandwidth, it at best provides us with a stable dirty
    page count (assuming a static system). In order to control the dirty
    page count such that it is high enough to provide performance, but
    does not exceed the specified limit, we need another control.
    
    The dirty position control works by extending (2) to
    
            task_ratelimit = balanced_dirty_ratelimit * pos_ratio           (7)
    
    where pos_ratio is a negative feedback function that is subject to
    
    1) f(setpoint) = 1.0
    2) df/dx < 0
    
    That is, if the dirty pages are ABOVE the setpoint, we throttle each
    task a bit more HEAVILY than balanced_dirty_ratelimit, so that the
    dirty pages are created less fast than they are cleaned, and thus DROP
    back to the setpoint (and the reverse below the setpoint).
    
    Based on (7) and the assumption that both dirty_ratelimit and pos_ratio
    remain CONSTANT for the past 200ms, we get
    
            task_ratelimit_0 = balanced_dirty_ratelimit * pos_ratio         (8)
    
    Putting (8) into (6), we get the formula used in
    bdi_update_dirty_ratelimit():
    
                                                    write_bw
            balanced_dirty_ratelimit *= pos_ratio * ----------              (9)
                                                    dirty_rate
    
    Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
  9. writeback: add bg_threshold parameter to __bdi_update_bandwidth()

    fengguang committed Oct 4, 2011
    No behavior change.
    
    Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
  10. writeback: dirty position control

    fengguang committed Mar 2, 2011
    bdi_position_ratio() provides a scale factor for bdi->dirty_ratelimit,
    so that the resulting task rate limit can drive the dirty pages back
    to the global/bdi setpoints.
    
    The old scheme is:
                                              |
                               free run area  |  throttle area
      ----------------------------------------+---------------------------->
                                        thresh^                  dirty pages
    
    The new scheme is:
    
      ^ task rate limit
      |
      |            *
      |             *
      |              *
      |[free run]      *      [smooth throttled]
      |                  *
      |                     *
      |                         *
      ..bdi->dirty_ratelimit..........*
      |                               .     *
      |                               .          *
      |                               .              *
      |                               .                 *
      |                               .                    *
      +-------------------------------.-----------------------*------------>
                              setpoint^                  limit^  dirty pages
    
    The slope of the bdi control line should be
    
    1) large enough to pull the dirty pages to setpoint reasonably fast
    
    2) small enough to avoid big fluctuations in the resulting pos_ratio
       and hence the task ratelimit
    
    Since the fluctuation range of the bdi dirty pages is typically
    observed to be within one second's worth of data, the bdi control
    line's slope is selected to be a linear function of the bdi write
    bandwidth, so that it can adapt to slow/fast storage devices well.
    
    Assume the bdi control line
    
    	pos_ratio = 1.0 + k * (dirty - bdi_setpoint)
    
    where k is the negative slope.
    
    If we target a 12.5% fluctuation range in pos_ratio when the dirty
    pages are fluctuating in the range
    
    	[bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2],
    
    we get slope
    
    	k = - 1 / (8 * write_bw)
    
    Letting pos_ratio(x_intercept) = 0, we get the parameter used in the code:
    
    	x_intercept = bdi_setpoint + 8 * write_bw
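
    In the code's terms, the bdi control line then multiplies into
    pos_ratio roughly as follows (a sketch; the real code uses fixed-point
    arithmetic):

        x_intercept = bdi_setpoint + 8 * write_bw;
        if (bdi_dirty < x_intercept)
                /* pos_ratio = 1 + k * (dirty - bdi_setpoint), k = -1/span */
                pos_ratio = pos_ratio * (x_intercept - bdi_dirty) /
                                        (x_intercept - bdi_setpoint);
        else
                pos_ratio = 0;        /* fully throttled past x_intercept */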
    
    The global/bdi slopes nicely complement each other when the system has
    only one major bdi (indicated by bdi_thresh ~= thresh):
    
    1) slope of global control line    => scaling to the control scope size
    2) slope of main bdi control line  => scaling to the writeout bandwidth
    
    so that
    
    - in memory-tight systems, (1) becomes strong enough to squeeze the
      dirty pages inside the control scope

    - in large-memory systems, where the "gravity" of (1) for pulling the
      dirty pages to the setpoint is too weak, (2) can back (1) up and
      drive the dirty pages to bdi_setpoint ~= setpoint reasonably fast.
    
    Unfortunately, in JBOD setups the fluctuation range of the bdi
    threshold is related to memory size due to the interference between
    disks.  In this case, the bdi slope will be a weighted sum of write_bw
    and bdi_thresh.
    
    Given equations
    
            span = x_intercept - bdi_setpoint
            k = df/dx = - 1 / span
    
    and the extremum values
    
            span = bdi_thresh
            dx = bdi_thresh
    
    we get
    
            df = - dx / span = - 1.0
    
    That means, when bdi_dirty deviates upward by bdi_thresh, pos_ratio
    and hence the task ratelimit will fluctuate by -100%.
    
    peter: use 3rd order polynomial for the global control line
    
    CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Acked-by: Jan Kara <jack@suse.cz>
    Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
  11. writeback: account per-bdi accumulated dirtied pages

    fengguang committed Jan 23, 2011
    Introduce the BDI_DIRTIED counter. It will be used for estimating the
    bdi's dirty bandwidth.
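
    The accounting itself should be a one-liner next to the existing
    per-zone counter (a sketch of the expected hunk):

        /* in account_page_dirtied() */
        __inc_zone_page_state(page, NR_DIRTIED);
        __inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTIED);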
    
    CC: Jan Kara <jack@suse.cz>
    CC: Michael Rubin <mrubin@google.com>
    CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
  12. Merge branch 'for-linus' of git://git.infradead.org/users/sameo/mfd-2.6

    torvalds committed Oct 3, 2011
    * 'for-linus' of git://git.infradead.org/users/sameo/mfd-2.6:
      mfd: Fix generic irq chip ack function name for jz4740-adc
  13. Merge branch 'for-linus' of git://github.com/tiwai/sound

    torvalds committed Oct 3, 2011
    * 'for-linus' of git://github.com/tiwai/sound:
      ALSA: hda - Fix a regression of the position-buffer check
Commits on Oct 2, 2011
  1. Merge branch 'perf-urgent-for-linus' of git://tesla.tglx.de/git/linux…

    torvalds committed Oct 2, 2011
    …-2.6-tip
    
    * 'perf-urgent-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip:
      perf tools: Fix raw sample reading
Commits on Oct 1, 2011
  1. Merge branches 'irq-urgent-for-linus', 'x86-urgent-for-linus' and 'sc…

    torvalds committed Oct 1, 2011
    …hed-urgent-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip
    
    * 'irq-urgent-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip:
      irq: Fix check for already initialized irq_domain in irq_domain_add
      irq: Add declaration of irq_domain_simple_ops to irqdomain.h
    
    * 'x86-urgent-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip:
      x86/rtc: Don't recursively acquire rtc_lock
    
    * 'sched-urgent-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip:
      posix-cpu-timers: Cure SMP wobbles
      sched: Fix up wchan borkage
      sched/rt: Migrate equal priority tasks to available CPUs
Commits on Sep 30, 2011
  1. Merge branch 'perf/urgent' of git://github.com/acmel/linux into perf/…

    Ingo Molnar committed Sep 30, 2011
    …urgent
  2. posix-cpu-timers: Cure SMP wobbles

    Peter Zijlstra committed with Thomas Gleixner Sep 1, 2011
    David reported:
    
      Attached below is a watered-down version of rt/tst-cpuclock2.c from
      GLIBC.  Just build it with "gcc -o test test.c -lpthread -lrt" or
      similar.
    
      Run it several times, and you will see cases where the main thread
      will measure a process clock difference before and after the nanosleep
      which is smaller than the cpu-burner thread's individual thread clock
      difference.  This doesn't make any sense since the cpu-burner thread
      is part of the top-level process's thread group.
    
      I've reproduced this on both x86-64 and sparc64 (using both 32-bit and
      64-bit binaries).
    
      For example:
    
      [davem@boricha build-x86_64-linux]$ ./test
      process: before(0.001221967) after(0.498624371) diff(497402404)
      thread:  before(0.000081692) after(0.498316431) diff(498234739)
      self:    before(0.001223521) after(0.001240219) diff(16698)
      [davem@boricha build-x86_64-linux]$ 
    
      The diff of 'process' should always be >= the diff of 'thread'.
    
      I make sure to wrap the 'thread' clock measurements the most tightly
      around the nanosleep() call, and that the 'process' clock measurements
      are the outer-most ones.
    
      ---
      #include <unistd.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <time.h>
      #include <fcntl.h>
      #include <string.h>
      #include <errno.h>
      #include <pthread.h>
    
      static pthread_barrier_t barrier;
    
      static void *chew_cpu(void *arg)
      {
    	  pthread_barrier_wait(&barrier);
    	  while (1)
    		  __asm__ __volatile__("" : : : "memory");
    	  return NULL;
      }
    
      int main(void)
      {
    	  clockid_t process_clock, my_thread_clock, th_clock;
    	  struct timespec process_before, process_after;
    	  struct timespec me_before, me_after;
    	  struct timespec th_before, th_after;
    	  struct timespec sleeptime;
    	  unsigned long diff;
    	  pthread_t th;
    	  int err;
    
    	  err = clock_getcpuclockid(0, &process_clock);
    	  if (err)
    		  return 1;
    
    	  err = pthread_getcpuclockid(pthread_self(), &my_thread_clock);
    	  if (err)
    		  return 1;
    
    	  pthread_barrier_init(&barrier, NULL, 2);
    	  err = pthread_create(&th, NULL, chew_cpu, NULL);
    	  if (err)
    		  return 1;
    
    	  err = pthread_getcpuclockid(th, &th_clock);
    	  if (err)
    		  return 1;
    
    	  pthread_barrier_wait(&barrier);
    
    	  err = clock_gettime(process_clock, &process_before);
    	  if (err)
    		  return 1;
    
    	  err = clock_gettime(my_thread_clock, &me_before);
    	  if (err)
    		  return 1;
    
    	  err = clock_gettime(th_clock, &th_before);
    	  if (err)
    		  return 1;
    
    	  sleeptime.tv_sec = 0;
    	  sleeptime.tv_nsec = 500000000;
    	  nanosleep(&sleeptime, NULL);
    
    	  err = clock_gettime(th_clock, &th_after);
    	  if (err)
    		  return 1;
    
    	  err = clock_gettime(my_thread_clock, &me_after);
    	  if (err)
    		  return 1;
    
    	  err = clock_gettime(process_clock, &process_after);
    	  if (err)
    		  return 1;
    
    	  diff = process_after.tv_nsec - process_before.tv_nsec;
    	  printf("process: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
    		 process_before.tv_sec, process_before.tv_nsec,
    		 process_after.tv_sec, process_after.tv_nsec, diff);
    	  diff = th_after.tv_nsec - th_before.tv_nsec;
    	  printf("thread:  before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
    		 th_before.tv_sec, th_before.tv_nsec,
    		 th_after.tv_sec, th_after.tv_nsec, diff);
    	  diff = me_after.tv_nsec - me_before.tv_nsec;
    	  printf("self:    before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
    		 me_before.tv_sec, me_before.tv_nsec,
    		 me_after.tv_sec, me_after.tv_nsec, diff);
    
    	  return 0;
      }
    
    This is due to us using p->se.sum_exec_runtime in
    thread_group_cputime(), where we iterate over the thread group and sum
    all the data. This does not take the time since the last schedule
    operation (tick or otherwise) into account. We can cure this by using
    task_sched_runtime() at the cost of having to take locks.
    
    This also means we can (and must) do away with
    thread_group_sched_runtime() since the modified thread_group_cputime()
    is now more accurate and would deadlock when called from
    thread_group_sched_runtime().
    
    Aside from that, it makes the function safe on 32-bit systems. The old
    code added t->se.sum_exec_runtime unprotected. sum_exec_runtime is a
    64-bit value and could be changed on another cpu at the same time.
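
    The gist of the change in thread_group_cputime(), in diff form (a
    sketch):

        -       times->sum_exec_runtime += t->se.sum_exec_runtime;
        +       times->sum_exec_runtime += task_sched_runtime(t);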
    
    Reported-by: David Miller <davem@davemloft.net>
    Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Cc: stable@kernel.org
    Link: http://lkml.kernel.org/r/1314874459.7945.22.camel@twins
    Tested-by: David Miller <davem@davemloft.net>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  3. ALSA: hda - Fix a regression of the position-buffer check

    tiwai committed Sep 30, 2011
    The commit a810364
        ALSA: hda - Handle -1 as invalid position, too
    caused a regression on some machines that require the position-buffer
    instead of LPIB, e.g. resulting in noises with mic recording with
    PulseAudio.
    
    This patch fixes the detection by delaying the test to the same timing
    as in 3.0, i.e. doing the position check only when requested, in
    azx_position_ok().
    
    Reported-and-tested-by: Rocko Requin <rockorequin@hotmail.com>
    Signed-off-by: Takashi Iwai <tiwai@suse.de>
  4. Resource: fix wrong resource window calculation

    Ram Pai committed with torvalds Sep 22, 2011
    __find_resource() incorrectly returns a resource window which overlaps
    an existing allocated window.  This happens when the parent's
    resource-window spans 0x00000000 to 0xffffffff and is entirely allocated
    to all its children resource-windows.
    
    __find_resource() looks for gaps in resource allocation among the
    children resource windows.  When it encounters the last child window it
    blindly tries the range next to the one allocated to the last child.
    Since the last child's window ends at 0xffffffff the calculation
    overflows, leading the algorithm to believe that any window in the
    range 0x00000000 to 0xffffffff is available for allocation.  This
    leads to a conflicting window allocation.
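
    The failure mode, sketched (illustrative; not the actual patch):

        /* candidate range starts just above the last child window */
        tmp.start = this->end + 1;     /* 0xffffffff + 1 wraps to 0 */
        if (tmp.start <= this->end)    /* wrap check the old code lacked */
                return -EBUSY;         /* no room above the last child */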
    
    Michal Ludvig reported this issue seen on his platform.  The following
    patch fixes the problem and has been verified by Michal.  I believe this
    bug has been there for ages.  It got exposed by git commit 2bbc694
    ("PCI : ability to relocate assigned pci-resources")
    
    Signed-off-by: Ram Pai <linuxram@us.ibm.com>
    Tested-by: Michal Ludvig <mludvig@logix.net.nz>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. Merge branch 'for-linus' of git://github.com/NewDreamNetwork/ceph-client

    torvalds committed Sep 30, 2011
    * 'for-linus' of git://github.com/NewDreamNetwork/ceph-client:
      libceph: fix pg_temp mapping update
      libceph: fix pg_temp mapping calculation
      libceph: fix linger request requeuing
      libceph: fix parse options memory leak
      libceph: initialize ack_stamp to avoid unnecessary connection reset
  6. Merge branch 'v4l_for_linus' of git://linuxtv.org/mchehab/for_linus

    torvalds committed Sep 30, 2011
    * 'v4l_for_linus' of git://linuxtv.org/mchehab/for_linus:
      [media] omap3isp: Fix build error in ispccdc.c
      [media] uvcvideo: Fix crash when linking entities
      [media] v4l: Make sure we hold a reference to the v4l2_device before using it
      [media] v4l: Fix use-after-free case in v4l2_device_release
      [media] uvcvideo: Set alternate setting 0 on resume if the bus has been reset
      [media] OMAP_VOUT: Fix build break caused by update_mode removal in DSS2
  7. Merge branch 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6

    torvalds committed Sep 30, 2011
    * 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6:
      [S390] cio: fix cio_tpi ignoring adapter interrupts
      [S390] gmap: always up mmap_sem properly
      [S390] Do not clobber personality flags on exec
  8. Merge git://github.com/davem330/sparc

    torvalds committed Sep 30, 2011
    * git://github.com/davem330/sparc:
      sparc64: Force the execute bit in OpenFirmware's translation entries.
      sparc: Make '-p' boot option meaningful again.
      sparc, exec: remove redundant addr_limit assignment
      sparc64: Future proof Niagara cpu detection.
  9. Merge branch 'drm-intel-fixes' of git://people.freedesktop.org/~keith…

    torvalds committed Sep 30, 2011
    …p/linux
    
    * 'drm-intel-fixes' of git://people.freedesktop.org/~keithp/linux:
      drm/i915: FBC off for ironlake and older, otherwise on by default
      drm/i915: Enable SDVO hotplug interrupts for HDMI and DVI
      drm/i915: Enable dither whenever display bpc < frame buffer bpc
  10. powerpc: Fix device-tree matching for Apple U4 bridge

    ozbenh committed with torvalds Sep 29, 2011
    The Apple Quad G5 has some oddity in its device-tree which causes the
    new generic matching code to fail to relate nodes for PCI-E devices
    below U4 with their respective struct pci_dev.  This breaks graphics
    on those machines, among others.
    
    This fixes it using a quirk which copies the node pointer from the host
    bridge for the root complex, which makes the generic code work for the
    children afterward.
    
    Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. bootup: move 'usermodehelper_enable()' a little earlier

    udknight committed with torvalds Sep 29, 2011
    Commit d5767c5 ("bootup: move 'usermodehelper_enable()' to the end
    of do_basic_setup()") moved 'usermodehelper_enable()' to the end of
    do_basic_setup(), after the initcalls.  But then uvesafb failed to
    work on my computer, and I lost the boot splash.

    So maybe we could start usermodehelper_enable() a little earlier, to
    let tasks that need early init with the help of user mode work.
    
    [ I would *really* prefer that initcalls not call into user space - even
      the real 'init' hasn't been execve'd yet, after all! But for uvesafb
      it really does look like we don't have much choice.
    
      I considered doing this when we mount the root filesystem, but
      depending on config options that is in multiple places.  We could do
      the usermode helper enable as a rootfs_initcall()..
    
      So I'm just using wang yanqing's trivial patch.  It's not wonderful,
      but it's simple and should work.  We should revisit this some day,
      though.      - Linus ]
    
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commits on Sep 29, 2011
  1. perf tools: Fix raw sample reading

    Jiri Olsa committed with Arnaldo Carvalho de Melo Sep 29, 2011
    The wrong pointer is being passed for raw data sanity checking when
    parsing a sample event.

    This ends up with an invalid event, and perf record gets stuck in the
    __perf_session__process_events function while processing build IDs
    (the process_buildids function).
    
    The following command hangs in my setup:
    	./perf record -e raw_syscalls:sys_enter ls
    
    The fix is to use the proper pointer to the raw data instead of the
    'u' union.
    
    Reviewed-by: David Ahern <dsahern@gmail.com>
    Cc: David Ahern <dsahern@gmail.com>
    Cc: Eric Dumazet <eric.dumazet@gmail.com>
    Cc: Ingo Molnar <mingo@elte.hu>
    Cc: Neil Horman <nhorman@tuxdriver.com>
    Cc: Paul Mackerras <paulus@samba.org>
    Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Link: http://lkml.kernel.org/r/1317308709-9474-2-git-send-email-jolsa@redhat.com
    Signed-off-by: Jiri Olsa <jolsa@redhat.com>
    Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
  2. sparc64: Force the execute bit in OpenFirmware's translation entries.

    davem330 committed Sep 29, 2011
    In the OF 'translations' property, the template TTEs in the mappings
    never specify the executable bit.  This is the case even though some
    of these mappings are for OF's code segment.
    
    Therefore, we need to force the execute bit on in every mapping.
    
    This problem can only really trigger on Niagara/sun4v machines and the
    history behind this is a little complicated.
    
    Previous to sun4v, the sun4u TTE entries lacked a hardware execute
    permission bit.  So OF didn't have to ever worry about setting
    anything to handle executable pages.  Any valid TTE loaded into the
    I-TLB would be respected by the chip.
    
    But sun4v Niagara chips have a real hardware enforced executable bit
    in their TTEs.  So it has to be set or else the I-TLB throws an
    instruction access exception with type code 6 (protection violation).
    
    We've been extremely fortunate to not get bitten by this in the past.
    
    The best I can tell is that OF's mappings for its executable code
    were made using permanent locked mappings on sun4v in the past.
    Therefore, the fact that we didn't have the exec bit set in the OF
    translations we used did not matter in practice.
    
    Thanks to Greg Onufer for helping me track this down.
    
    Signed-off-by: David S. Miller <davem@davemloft.net>
Commits on Sep 28, 2011
  1. bootup: move 'usermodehelper_enable()' to the end of do_basic_setup()

    torvalds committed Sep 28, 2011
    Doing it just before starting to call into cpu_idle() made a sick kind
    of sense only because the original bug we fixed (see commit
    288d5ab: "Boot up with usermodehelper disabled") was about problems
    with some scheduler data structures not being initialized, and they had
    better be initialized at that point.
    
    But it really didn't make any other conceptual sense, and doing it after
    the initial "schedule()" call for the idle thread actually opened up a
    race: what if the main initialization thread did everything without
    needing to sleep, and got all the way into user land too? Without
    actually having scheduled back to the idle thread?
    
    Now, in normal circumstances that doesn't ever happen, but it looks like
    Richard Cochran triggered exactly that on his ARM IXP4xx machines:
    
      "I have some ARM IXP4xx based machines that use the two on chip MAC
       ports (aka NPEs).  The NPE needs a firmware in order to function.
       Ever since the following commit [that 288d5ab one], it is no
       longer possible to bring up the interfaces during the init scripts."
    
    with a call trace showing an ioctl coming from user space. Richard says:
    
      "The init is busybox, and the startup script does mount, syslogd, and
       then ifup, so that all can go by quickly."
    
    The fix is to move the usermodehelper_enable() into the main 'init'
    thread, and just put it after we've done all our initcalls.  By then,
    everything really should be up, but we've obviously not actually started
    the user-mode portion of init yet.
    
    Reported-and-tested-by: Richard Cochran <richardcochran@gmail.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. libceph: fix pg_temp mapping update

    liewegas committed Sep 28, 2011
    The incremental map updates have a record for each pg_temp mapping that
    is to be added/updated (len > 0) or removed (len == 0).  The old code
    was written as if the updates were a complete enumeration; that was
    just wrong.  Update the code to remove 0-length entries and drop the
    rbtree traversal.
    
    This avoids misdirected (and hung) requests that manifest as server
    errors like
    
    [WRN] client4104 10.0.1.219:0/275025290 misdirected client4104.1:129 0.1 to osd0 not [1,0] in e11/11
    
    Signed-off-by: Sage Weil <sage@newdream.net>
  3. libceph: fix pg_temp mapping calculation

    liewegas committed Sep 28, 2011
    We need to apply the modulo pg_num calculation before looking up a pgid in
    the pg_temp mapping rbtree.  This fixes pg_temp mappings, and fixes
    (some) misdirected requests that result in messages like
    
    [WRN] client4104 10.0.1.219:0/275025290 misdirected client4104.1:129 0.1 to osd0 not [1,0] in e11/11
    
    on the server, and make the client stall without getting a reply (at
    least until the pg_temp mapping goes away, but that can take a very
    long time).
    
    Reorder calc_pg_raw() a bit to make more sense.
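
    The essence of the change (names assumed from libceph; a sketch only):

        /* fold the placement seed modulo pg_num *before* the lookup */
        pgid.ps = ceph_stable_mod(pgid.ps, pool->v.pg_num,
                                  pool->pg_num_mask);
        pg = __lookup_pg_mapping(&osdmap->pg_temp, pgid);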
    
    Signed-off-by: Sage Weil <sage@newdream.net>
  4. Merge git://github.com/davem330/net

    torvalds committed Sep 28, 2011
    * git://github.com/davem330/net:
      ipv6-multicast: Fix memory leak in IPv6 multicast.
      ipv6: check return value for dst_alloc
      net: check return value for dst_alloc
      ipv6-multicast: Fix memory leak in input path.
      bnx2x: add missing break in bnx2x_dcbnl_get_cap
      bnx2x: fix WOL by enablement PME in config space
      bnx2x: fix hw attention handling
      net: fix a typo in Documentation/networking/scaling.txt
      ath9k: Fix a dma warning/memory leak
      rtlwifi: rtl8192cu: Fix unitialized struct
      iwlagn: fix dangling scan request
      batman-adv: do_bcast has to be true for broadcast packets only
      cfg80211: Fix validation of AKM suites
      iwlegacy: do not use interruptible waits
      iwlegacy: fix command queue timeout
      ath9k_hw: Fix Rx DMA stuck for AR9003 chips
  5. Merge git://bedivere.hansenpartnership.com/git/scsi-rc-fixes-2.6

    torvalds committed Sep 28, 2011
    * git://bedivere.hansenpartnership.com/git/scsi-rc-fixes-2.6:
      [SCSI] 3w-9xxx: fix iommu_iova leak
      [SCSI] cxgb3i: convert cdev->l2opt to use rcu to prevent NULL dereference
      [SCSI] scsi: qla4xxx needs libiscsi.o
      [SCSI] libsas: fix failure to revalidate domain for anything but the first expander child.
      [SCSI] aacraid: reset should disable MSI interrupt
  6. Merge branch 'for-linus' of git://git.kernel.dk/linux-block

    torvalds committed Sep 28, 2011
    * 'for-linus' of git://git.kernel.dk/linux-block:
      block: Free queue resources at blk_release_queue()