Commits on Jan 13, 2012
  1. Merge branch 'akpm' (aka "Andrew's patch-bomb, take two")

    Andrew explains:
    
     - various misc stuff
    
     - Most of the rest of MM: memcg, threaded hugepages, others.
    
     - cpumask
    
     - kexec
    
     - kdump
    
     - some direct-io performance tweaking
    
     - radix-tree optimisations
    
     - new selftests code
    
       A note on this: often people will develop a new userspace-visible
       feature and will develop userspace code to exercise/test that
       feature.  Then they merge the patch and the selftest code dies.
       Sometimes we paste it into the changelog.  Sometimes the code gets
       thrown into Documentation/(!).
    
       This saddens me.  So this patch creates a bare-bones framework which
       will henceforth allow me to ask people to include their test apps in
       the kernel tree so we can keep them alive.  Then when people enhance
       or fix the feature, I can ask them to update the test app too.
    
       The infrastructure is terribly trivial at present - let's see how it
       evolves.
    
     - checkpoint/restart feature work.
    
       A note on this: this is a project by various mad Russians to perform
       c/r mainly from userspace, with various oddball helper code added
       into the kernel where the need is demonstrated.
    
       So rather than some large central lump of code, what we have is
       little bits and pieces popping up in various places which either
       expose something new or which permit something which is normally
       kernel-private to be modified.
    
       The overall project is an ongoing thing.  I've judged that the size
       and scope of the thing means that we're more likely to be successful
       with it if we integrate the support into mainline piecemeal rather
       than allowing it all to develop out-of-tree.
    
       However I'm less confident than the developers that it will all
       eventually work! So what I'm asking them to do is to wrap each piece
       of new code inside CONFIG_CHECKPOINT_RESTORE.  So if it all
       eventually comes to tears and the project as a whole fails, it should
       be a simple matter to go through and delete all trace of it.
    
    This lot pretty much wraps up the -rc1 merge for me.
    
    * akpm: (96 commits)
      unlzo: fix input buffer free
      ramoops: update parameters only after successful init
      ramoops: fix use of rounddown_pow_of_two()
      c/r: prctl: add PR_SET_MM codes to set up mm_struct entries
      c/r: procfs: add start_data, end_data, start_brk members to /proc/$pid/stat v4
      c/r: introduce CHECKPOINT_RESTORE symbol
      selftests: new x86 breakpoints selftest
      selftests: new very basic kernel selftests directory
      radix_tree: take radix_tree_path off stack
      radix_tree: remove radix_tree_indirect_to_ptr()
      dio: optimize cache misses in the submission path
      vfs: cache request_queue in struct block_device
      fs/direct-io.c: calculate fs_count correctly in get_more_blocks()
      drivers/parport/parport_pc.c: fix warnings
      panic: don't print redundant backtraces on oops
      sysctl: add the kernel.ns_last_pid control
      kdump: add udev events for memory online/offline
      include/linux/crash_dump.h needs elf.h
      kdump: fix crash_kexec()/smp_send_stop() race in panic()
      kdump: crashk_res init check for /sys/kernel/kexec_crash_size
      ...
    torvalds committed Jan 13, 2012
  2. Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (69 commits)
      pptp: Accept packet with seq zero
      RDS: Remove some unused iWARP code
      net: fsl: fec: handle 10Mbps speed in RMII mode
      drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c: add missing iounmap
      drivers/net/ethernet/tundra/tsi108_eth.c: add missing iounmap
      ksz884x: fix mtu for VLAN
      net_sched: sfq: add optional RED on top of SFQ
      dp83640: Fix NOHZ local_softirq_pending 08 warning
      gianfar: Fix invalid TX frames returned on error queue when time stamping
      gianfar: Fix missing sock reference when processing TX time stamps
      phylib: introduce mdiobus_alloc_size()
      net: decrement memcg jump label when limit, not usage, is changed
      net: reintroduce missing rcu_assign_pointer() calls
      inet_diag: Rename inet_diag_req_compat into inet_diag_req
      inet_diag: Rename inet_diag_req into inet_diag_req_v2
      bond_alb: don't disable softirq under bond_alb_xmit
      mac80211: fix rx->key NULL pointer dereference in promiscuous mode
      nl80211: fix old station flags compatibility
      mdio-octeon: use an unique MDIO bus name.
      mdio-gpio: use an unique MDIO bus name.
      ...
    torvalds committed Jan 13, 2012
  3. unlzo: fix input buffer free

    unlzo modifies the pointer to in_buf, so we have to free the original
    buffer, not the modified pointer.
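
    The pattern behind this fix can be sketched in userspace C.  This is
    not the unlzo() code itself: sum_payload() and its parameters are
    hypothetical, and malloc()/free() stand in for the kernel allocator.
    The point is only that the pointer handed to free() must be the saved
    original, because the working pointer is advanced while parsing.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, not the kernel's unlzo(): copies src into a heap
 * buffer, skips header_len bytes, and sums the remaining payload bytes.
 */
size_t sum_payload(const unsigned char *src, size_t len, size_t header_len)
{
	unsigned char *in_buf, *in_buf_save;
	size_t sum = 0;

	assert(src != NULL && header_len <= len);

	in_buf = malloc(len ? len : 1);
	if (!in_buf)
		return 0;
	in_buf_save = in_buf;		/* remember what malloc() returned */
	memcpy(in_buf, src, len);

	in_buf += header_len;		/* parsing advances the working pointer */
	len -= header_len;
	while (len--)
		sum += *in_buf++;

	free(in_buf_save);		/* free the original, not in_buf */
	return sum;
}
```

    Calling free(in_buf) at the end instead would pass a pointer into the
    middle (or past the end) of the allocation, which is undefined
    behaviour.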
    
    Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
    Cc: Lasse Collin <lasse.collin@tukaani.org>
    Cc: Namhyung Kim <namhyung@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    saschahauer committed with torvalds Jan 13, 2012
  4. ramoops: update parameters only after successful init

    If a platform device exists on the system, but ramoops fails to attach to
    it, the module parameters are overridden before ramoops can fall back and
    try to use passed module parameters.  Move update to end of init routine.
    
    Signed-off-by: Kees Cook <keescook@chromium.org>
    Cc: Marco Stornelli <marco.stornelli@gmail.com>
    Cc: Sergiu Iordache <sergiu@chromium.org>
    Cc: Seiji Aguchi <seiji.aguchi@hds.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    kees committed with torvalds Jan 13, 2012
  5. ramoops: fix use of rounddown_pow_of_two()

    The return value of rounddown_pow_of_two wasn't evaluated, so the
    operation was a no-op.
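
    A userspace sketch of this class of bug, assuming a stand-in for the
    kernel's rounddown_pow_of_two() (all names below are hypothetical and
    the ramoops code is not reproduced here):

```c
#include <assert.h>

/* Userspace stand-in for rounddown_pow_of_two(); n must be non-zero. */
unsigned long rounddown_pow_of_two_sketch(unsigned long n)
{
	unsigned long p = 1;

	assert(n != 0);
	while (p <= n / 2)	/* stop before p * 2 would exceed n */
		p *= 2;
	return p;
}

/* Buggy shape: the rounded value is computed and then thrown away. */
unsigned long record_size_buggy(unsigned long requested)
{
	(void)rounddown_pow_of_two_sketch(requested);
	return requested;
}

/* Fixed shape: assign the result back before using it. */
unsigned long record_size_fixed(unsigned long requested)
{
	requested = rounddown_pow_of_two_sketch(requested);
	return requested;
}
```

    With a request of 5000 the buggy variant still returns 5000, while the
    fixed variant returns 4096.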
    
    Signed-off-by: Marco Stornelli <marco.stornelli@gmail.com>
    Reported-by: Andrew Morton <akpm@linux-foundation.org>
    Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Marco Stornelli committed with torvalds Jan 13, 2012
  6. c/r: prctl: add PR_SET_MM codes to set up mm_struct entries

    When we restore a task we need to set up text, data and data heap sizes
    from userspace to the values a task had at checkpoint time.  This patch
    adds auxiliary prctl codes for that.
    
    While most of them are statistical in nature (their values are used to
    calculate the /proc/<pid>/statm output), the start_brk and brk values
    are used to compute the allowed size of program data segment expansion,
    which means that arbitrary changes to these values might be dangerous.
    So to restrict access, the following requirements apply to the prctl
    calls:
    
     - The process has to have CAP_SYS_ADMIN capability granted.
     - For all opcodes except start_brk/brk members an appropriate
       VMA area must exist and should fit certain VMA flags,
       such as:
       - code segment must be executable but not writable;
       - data segment must not be executable.
    
    start_brk/brk values must not intersect with data segment and must not
    exceed RLIMIT_DATA resource limit.
    
    Still the main guard is CAP_SYS_ADMIN capability check.
    
    Note the kernel should be compiled with CONFIG_CHECKPOINT_RESTORE support
    otherwise these prctl calls will return -EINVAL.
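
    The VMA flag requirements listed above can be sketched as a small
    predicate.  This is an illustration only, not the code added by the
    patch: the enum, the function and the SKETCH_* flag bits are
    hypothetical stand-ins for the kernel's VM_WRITE and VM_EXEC checks.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag bits standing in for the kernel's VM_WRITE/VM_EXEC. */
#define SKETCH_VM_WRITE 0x2UL
#define SKETCH_VM_EXEC  0x4UL

enum sketch_segment { SKETCH_SEG_CODE, SKETCH_SEG_DATA };

/* Returns true if a VMA with these flags may back the given segment. */
bool sketch_vma_flags_ok(enum sketch_segment seg, unsigned long vma_flags)
{
	assert(seg == SKETCH_SEG_CODE || seg == SKETCH_SEG_DATA);

	if (seg == SKETCH_SEG_CODE)
		/* code segment: executable but not writable */
		return (vma_flags & SKETCH_VM_EXEC) &&
		       !(vma_flags & SKETCH_VM_WRITE);

	/* data segment: must not be executable */
	return !(vma_flags & SKETCH_VM_EXEC);
}
```

    The capability check and the RLIMIT_DATA check described in the text
    are deliberately left out of this sketch.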
    
    [akpm@linux-foundation.org: cache current->mm in a local, saving 200 bytes text]
    Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
    Reviewed-by: Kees Cook <keescook@chromium.org>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Andrew Vagin <avagin@openvz.org>
    Cc: Serge Hallyn <serge.hallyn@canonical.com>
    Cc: Pavel Emelyanov <xemul@parallels.com>
    Cc: Vasiliy Kulikov <segoon@openwall.com>
    Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: Michael Kerrisk <mtk.manpages@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    cyrillos committed with torvalds Jan 13, 2012
  7. c/r: procfs: add start_data, end_data, start_brk members to /proc/$pid/stat v4

    The mm->start_code/end_code, mm->start_data/end_data, mm->start_brk are
    used to calculate the program text/data segment sizes (which can be
    seen in /proc/<pid>/statm) and the final address of a brk() call.
    
    For restore we need to know all these values.  While
    mm->start_code/end_code are already present in /proc/$pid/stat, the
    remaining members are not, so this patch brings them in.
    
    The restore procedure of these members is addressed in another patch using
    prctl().
    
    Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
    Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
    Reviewed-by: Kees Cook <keescook@chromium.org>
    Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: Alexey Dobriyan <adobriyan@gmail.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Andrew Vagin <avagin@openvz.org>
    Cc: Vasiliy Kulikov <segoon@openwall.com>
    Cc: Alexey Dobriyan <adobriyan@gmail.com>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    cyrillos committed with torvalds Jan 13, 2012
  8. c/r: introduce CHECKPOINT_RESTORE symbol

    For checkpoint/restore we need auxiliary features compiled into the
    kernel, such as additional prctl codes, /proc/<pid>/map_files, etc.
    At the same time these features are not mandatory for a regular kernel,
    so the CHECKPOINT_RESTORE config symbol provides a way to disable them
    all at once if one wishes to get rid of the additional functionality.
    
    Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Andrew Vagin <avagin@openvz.org>
    Cc: Serge Hallyn <serge.hallyn@canonical.com>
    Cc: Vasiliy Kulikov <segoon@openwall.com>
    Reviewed-by: Kees Cook <keescook@chromium.org>
    Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: Alexey Dobriyan <adobriyan@gmail.com>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    cyrillos committed with torvalds Jan 13, 2012
  9. selftests: new x86 breakpoints selftest

    Bring a first selftest in the relevant directory.  This tests several
    combinations of breakpoints and watchpoints in x86, as well as icebp traps
    and int3 traps.  Given the amount of breakpoint regressions we raised
    after we merged the generic breakpoint infrastructure, such selftest
    became necessary and can still serve today as a basis for new patches that
    touch the do_debug() path.
    
    Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Ingo Molnar <mingo@elte.hu>
    Cc: H. Peter Anvin <hpa@zytor.com>
    Cc: Jason Wessel <jason.wessel@windriver.com>
    Cc: Will Deacon <will.deacon@arm.com>
    Cc: Michal Marek <mmarek@suse.cz>
    Cc: Sam Ravnborg <sam@ravnborg.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    fweisbec committed with torvalds Jan 13, 2012
  10. selftests: new very basic kernel selftests directory

    Bring a new kernel selftests directory in tools/testing/selftests.  To
    add a new selftest, create a subdirectory with the sources and a
    makefile that creates a target named "run_test" then add the
    subdirectory name to the TARGET var in tools/testing/selftests/Makefile
    and tools/testing/selftests/run_tests script.
    
    This can help centralize and maintain any useful selftest that
    developers usually tend to let rust in peace on some random server.
    
    Suggested-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Ingo Molnar <mingo@elte.hu>
    Cc: "H. Peter Anvin" <hpa@zytor.com>
    Cc: Jason Wessel <jason.wessel@windriver.com>
    Cc: Will Deacon <will.deacon@arm.com>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Michal Marek <mmarek@suse.cz>
    Cc: Sam Ravnborg <sam@ravnborg.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    fweisbec committed with torvalds Jan 13, 2012
  11. radix_tree: take radix_tree_path off stack

    Down, down in the deepest depths of GFP_NOIO page reclaim, we have
    shrink_page_list() calling __remove_mapping() calling
    __delete_from_swap_cache() or __delete_from_page_cache().
    
    You would not expect those to need much stack, but in fact they call
    radix_tree_delete(): which declares a 192-byte radix_tree_path array on
    its stack (to record the node,offsets it visits when descending, in case
    it needs to ascend to update them).  And if any tag is still set [1],
    that calls radix_tree_tag_clear(), which declares a further such
    192-byte radix_tree_path array on the stack.  (At least we have
    interrupts disabled here, so won't then be pushing registers too.)
    
    That was probably a good choice when most users were 32-bit (array of
    half the size), and adding fields to radix_tree_node would have bloated
    it unnecessarily.  But nowadays many are 64-bit, and each
    radix_tree_node contains a struct rcu_head, which is only used when
    freeing; whereas the radix_tree_path info is only used for updating the
    tree (deleting, clearing tags or setting tags if tagged) when a lock
    must be held, of no interest when accessing the tree locklessly.
    
    So add a parent pointer to the radix_tree_node, in union with the
    rcu_head, and remove all uses of the radix_tree_path.  There would be
    space in that union to save the offset when descending as before (we can
    argue that a lock must already be held to exclude other users), but
    recalculating it when ascending is both easy (a constant shift and a
    constant mask) and uncommon, so it seems better just to do that.
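
    The "constant shift and constant mask" recalculation can be sketched
    as follows.  The SKETCH_* names and the level numbering (0 for the
    bottom level) are assumptions made for illustration, with a map shift
    of 6 as mentioned further down; this is not the radix tree code itself.

```c
#include <assert.h>

#define SKETCH_MAP_SHIFT 6U
#define SKETCH_MAP_SIZE  (1UL << SKETCH_MAP_SHIFT)
#define SKETCH_MAP_MASK  (SKETCH_MAP_SIZE - 1)

/*
 * Slot offset of `index` inside the node at the given level, where
 * level 0 is the bottom of the tree.  Nothing has to be remembered from
 * the descent: the offset is recomputed from the index alone.
 */
unsigned long sketch_slot_offset(unsigned long index, unsigned int level)
{
	unsigned int shift = level * SKETCH_MAP_SHIFT;

	assert(shift < sizeof(unsigned long) * 8);
	return (index >> shift) & SKETCH_MAP_MASK;
}
```

    For an index built as (5 << 12) | (3 << 6) | 7, the offsets at levels
    0, 1 and 2 come out as 7, 3 and 5.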
    
    Two little optimizations: no need to decrement height when descending,
    adjusting shift is enough; and once radix_tree_tag_if_tagged() has set
    tag on a node and its ancestors, it need not ascend from that node
    again.
    
    perf on the radix tree test harness reports radix_tree_insert() as 2%
    slower (now having to set parent), but radix_tree_delete() 24% faster.
    Surely that's an exaggeration from rtth's artificially low map shift 3,
    but forcing it back to 6 still rates radix_tree_delete() 8% faster.
    
    [1] Can a pagecache tag (dirty, writeback or towrite) actually still be
    set at the time of radix_tree_delete()? Perhaps not if the filesystem is
    well-behaved.  But although I've not tracked any stack overflow down to
    this cause, I have observed a curious case in which a dirty tag is set
    and left set on tmpfs: page migration's migrate_page_copy() happens to
    use __set_page_dirty_nobuffers() to set PageDirty on the newpage, and
    that sets PAGECACHE_TAG_DIRTY as a side-effect - harmless to a
    filesystem which doesn't use tags, except for this stack depth issue.
    
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Nai Xia <nai.xia@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Hugh Dickins committed with torvalds Jan 13, 2012
  12. radix_tree: remove radix_tree_indirect_to_ptr()

    It is not used anymore, remove it
    
    Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
    Acked-by: Hugh Dickins <hughd@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Xiao Guangrong committed with torvalds Jan 13, 2012
  13. dio: optimize cache misses in the submission path

    Some investigation of a transaction processing workload showed that a
    major consumer of cycles in __blockdev_direct_IO is the cache miss while
    accessing the block size.  This is because it has to walk the chain from
    block_dev to gendisk to queue.
    
    The block size is needed early on to check alignment and sizes.  It's only
    done if the check for the inode block size fails.  But the costly block
    device state is unconditionally fetched.
    
    - Reorganize the code to only fetch block dev state when actually
      needed.
    
    Then do a prefetch on the block dev early on in the direct IO path.  This
    is worth it, because there is substantial code run before we actually
    touch the block dev now.
    
    - I also added some unlikelies to make it clear to the compiler that
      the block device fetch code is not normally executed.
    
    This gave a small, but measurable improvement on a large database
    benchmark (about 0.3%)
    
    [akpm@linux-foundation.org: coding-style fixes]
    [sfr@canb.auug.org.au: using prefetch requires including prefetch.h]
    Signed-off-by: Andi Kleen <ak@linux.intel.com>
    Cc: Jeff Moyer <jmoyer@redhat.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Andi Kleen committed with torvalds Jan 13, 2012
  14. vfs: cache request_queue in struct block_device

    This makes it possible to get from the inode to the request_queue with one
    less cache miss.  Used in a follow-on optimization.
    
    The lifetime of the pointer is the same as the gendisk.
    
    This assumes that the queue will always stay the same in the gendisk while
    it's visible to block_devices.  I think that's safe, correct?
    
    Signed-off-by: Andi Kleen <ak@linux.intel.com>
    Acked-by: Jeff Moyer <jmoyer@redhat.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Andi Kleen committed with torvalds Jan 13, 2012
  15. fs/direct-io.c: calculate fs_count correctly in get_more_blocks()

    In get_more_blocks(), we use dio_count to calculate fs_count and do some
    tricky things to increase fs_count if dio_count isn't aligned.  But
    actually it still has some corner cases that can't be covered.  See the
    following example:
    
    	dio_write foo -s 1024 -w 4096
    
    (direct write 4096 bytes at offset 1024).  The same goes if the offset
    isn't aligned to fs_blocksize.
    
    In this case, the old calculation counts fs_count to be 1, but actually we
    will write into 2 different blocks (if fs_blocksize=4096).  The old code
    just works, since it will call get_block twice (and may have to allocate
    and create extents twice for filesystems like ext4).  So we'd better call
    get_block just once with the proper fs_count.
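
    The arithmetic of the example can be checked with a small sketch.
    Both functions are hypothetical illustrations rather than the code in
    fs/direct-io.c: the first counts blocks from the length alone, the
    second from the first and last block actually touched.

```c
#include <assert.h>

/* Length-only count: rounds the byte count up to whole blocks. */
unsigned long long sketch_count_by_len(unsigned long long len,
				       unsigned int blkbits)
{
	unsigned long long blksize = 1ULL << blkbits;

	assert(len > 0);
	return (len + blksize - 1) >> blkbits;
}

/* Offset-aware count: last block touched minus first block touched, plus 1. */
unsigned long long sketch_count_by_range(unsigned long long offset,
					 unsigned long long len,
					 unsigned int blkbits)
{
	unsigned long long first, last;

	assert(len > 0);
	first = offset >> blkbits;
	last = (offset + len - 1) >> blkbits;
	return last - first + 1;
}
```

    For a 4096-byte write at offset 1024 with 4096-byte blocks, the
    length-only count is 1 while the range-based count is 2.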
    
    Signed-off-by: Tao Ma <boyu.mt@taobao.com>
    Cc: "Theodore Ts'o" <tytso@mit.edu>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    taoma-tm committed with torvalds Jan 13, 2012
  16. drivers/parport/parport_pc.c: fix warnings

    drivers/parport/parport_pc.c: In function '__check_irq':
    drivers/parport/parport_pc.c:3415: warning: return from incompatible pointer type
    drivers/parport/parport_pc.c: In function '__check_dma':
    drivers/parport/parport_pc.c:3417: warning: return from incompatible pointer type
    
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Andrew Morton committed with torvalds Jan 13, 2012
  17. panic: don't print redundant backtraces on oops

    When an oops causes a panic and panic prints another backtrace, it's pretty
    common for the original oops data to scroll off an 80x50 screen.
    
    The second backtrace is quite redundant and not needed anyways.
    
    So don't print the panic backtrace when oops_in_progress is true.
    
    [akpm@linux-foundation.org: add comment]
    Signed-off-by: Andi Kleen <ak@linux.intel.com>
    Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Andi Kleen committed with torvalds Jan 13, 2012
  18. sysctl: add the kernel.ns_last_pid control

    The sysctl works on the current task's pid namespace, getting and setting
    its last_pid field.
    
    Writing is allowed for CAP_SYS_ADMIN-capable tasks, thus making it possible
    to create a task with a desired pid value.  This ability is badly needed
    for checkpoint/restore in userspace.
    
    This approach suits all the parties for now.
    
    Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
    Acked-by: Tejun Heo <tj@kernel.org>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Cyrill Gorcunov <gorcunov@openvz.org>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Cc: Serge Hallyn <serue@us.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    xemul committed with torvalds Jan 13, 2012
  19. kdump: add udev events for memory online/offline

    Currently no udev events for memory hotplug "online" and "offline" are
    generated:
    
      # udevadm monitor
      # echo offline > /sys/devices/system/memory/memory4/state
      ==> No event
    
    When kdump is loaded, kexec detects the current memory configuration and
    stores it in the pre-allocated ELF core header.  Therefore, for kdump it
    is necessary to reload the kdump kernel with kexec when the memory
    configuration changes (e.g.  for online/offline hotplug memory).
    
    In order to do this automatically, udev rules should be used.  This kernel
    patch adds udev events for "online" and "offline".  Together with this
    kernel patch, the following udev rules for online/offline have to be added
    to "/etc/udev/rules.d/98-kexec.rules":
    
      SUBSYSTEM=="memory", ACTION=="online", PROGRAM="/etc/init.d/kdump restart"
      SUBSYSTEM=="memory", ACTION=="offline", PROGRAM="/etc/init.d/kdump restart"
    
    [sfr@canb.auug.org.au: fixups for class to subsystem conversion]
    Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
    Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
    Cc: Vivek Goyal <vgoyal@redhat.com>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Cc: Kay Sievers <kay.sievers@vrfy.org>
    Cc: Dave Hansen <haveblue@us.ibm.com>
    Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
    Cc: Greg KH <greg@kroah.com>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    michael-holzheu committed with torvalds Jan 13, 2012
  20. include/linux/crash_dump.h needs elf.h

    Building an ARM target we get the following warnings:
    
      CC      arch/arm/kernel/setup.o
      In file included from arch/arm/kernel/setup.c:39:
      arch/arm/include/asm/elf.h:102:1: warning: "vmcore_elf64_check_arch" redefined
      In file included from arch/arm/kernel/setup.c:24:
      include/linux/crash_dump.h:30:1: warning: this is the location of the previous definition
    
    Quoting Russell King:
    
    "linux/crash_dump.h makes no attempt to include asm/elf.h, but it depends
    on stuff in asm/elf.h to determine how stuff inside this file is defined
    at parse time.
    
    So, if asm/elf.h is included after linux/crash_dump.h or not at all, you
    get a different result from the situation where asm/elf.h is included
    before."
    
    So add elf.h header to crash_dump.h to avoid this problem.
    
    The original discussion about this can be found at:
    http://www.spinics.net/lists/arm-kernel/msg154113.html
    
    Signed-off-by: Fabio Estevam <fabio.estevam@freescale.com>
    Cc: Russell King <rmk@arm.linux.org.uk>
    Cc: <stable@vger.kernel.org>	[3.2.1]
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    fabioestevam committed with torvalds Jan 13, 2012
  21. kdump: fix crash_kexec()/smp_send_stop() race in panic()

    When two CPUs call panic at the same time there is a possible race
    condition that can stop kdump.  The first CPU calls crash_kexec() and the
    second CPU calls smp_send_stop() in panic() before crash_kexec() finished
    on the first CPU.  So the second CPU stops the first CPU and therefore
    kdump fails:
    
    1st CPU:
      panic()->crash_kexec()->mutex_trylock(&kexec_mutex)-> do kdump
    
    2nd CPU:
      panic()->crash_kexec()->kexec_mutex already held by 1st CPU
           ->smp_send_stop()-> stop 1st CPU (stop kdump)
    
    This patch fixes the problem by introducing a spinlock in panic that
    allows only one CPU to process crash_kexec() and the subsequent panic
    code.
    
    All other CPUs call the weak function panic_smp_self_stop() that stops the
    CPU itself.  This function can be overloaded by architecture code.  For
    example "tile" can use their lower-power "nap" instruction for that.
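
    The "only one CPU proceeds" idea can be sketched in userspace with a
    C11 atomic_flag in place of the kernel spinlock.  This is a swapped-in
    technique for illustration: sketch_panic_enter() and the variables
    below are hypothetical and are not the code added to panic().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_flag sketch_panic_lock = ATOMIC_FLAG_INIT;
static int sketch_panic_cpu = -1;	/* CPU that won the race, if any */

/*
 * Returns true for the single caller allowed to run the crash path.
 * Every later caller gets false and would stop itself instead of
 * calling smp_send_stop() on the winner.
 */
bool sketch_panic_enter(int cpu)
{
	assert(cpu >= 0);

	if (atomic_flag_test_and_set(&sketch_panic_lock))
		return false;		/* someone else is already panicking */

	sketch_panic_cpu = cpu;
	return true;
}
```

    The first caller wins and keeps the flag set for good; it is never
    cleared, since the winner is not expected to return from the crash
    path.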
    
    Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
    Acked-by: Chris Metcalf <cmetcalf@tilera.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    michael-holzheu committed with torvalds Jan 13, 2012
  22. kdump: crashk_res init check for /sys/kernel/kexec_crash_size

    Currently it is possible to set the crash_size via the sysfs
    /sys/kernel/kexec_crash_size even if no crash kernel memory has been
    defined with the "crashkernel" parameter.  In this case "crashk_res" is
    not initialized and crashk_res.start = crashk_res.end = 0.  Unfortunately
    resource_size(&crashk_res) returns 1 in this case.  This breaks the s390
    implementation of crash_(un)map_reserved_pages().
    
    To fix the problem the correct "old_size" is now calculated in
    crash_shrink_memory().  "old_size" is set to "0" if crashk_res is not
    initialized.  With this change crash_shrink_memory() will do nothing, when
    "crashk_res" is not initialized.  It will return "0" for "echo 0 >
    /sys/kernel/kexec_crash_size" and -EINVAL for "echo [not zero] >
    /sys/kernel/kexec_crash_size".
    
    In addition to that this patch also simplifies the "ret = -EINVAL" vs.
    "ret = 0" logic as suggested by Simon Horman.
    
    Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
    Reviewed-by: Dave Young <dyoung@redhat.com>
    Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com>
    Reviewed-by: Simon Horman <horms@verge.net.au>
    Cc: Vivek Goyal <vgoyal@redhat.com>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    michael-holzheu committed with torvalds Jan 13, 2012
  23. kdump: add missing RAM resource in crash_shrink_memory()

    When shrinking crashkernel memory using /sys/kernel/kexec_crash_size, no
    RAM resource is currently created for the released memory.
    
    Example:
    
      $ cat /proc/iomem
      00000000-bfffffff : System RAM
        00000000-005b7ac3 : Kernel code
        005b7ac4-009743bf : Kernel data
        009bb000-00a85c33 : Kernel bss
      c0000000-cfffffff : Crash kernel
      d0000000-ffffffff : System RAM
    
      $ echo 0 > /sys/kernel/kexec_crash_size
      $ cat /proc/iomem
      00000000-bfffffff : System RAM
        00000000-005b7ac3 : Kernel code
        005b7ac4-009743bf : Kernel data
        009bb000-00a85c33 : Kernel bss
                                       <<-- here is System RAM missing
      d0000000-ffffffff : System RAM
    
    One result of this bug is that the memory chunk can never be set offline
    using memory hotplug.  With this patch I insert a new "System RAM"
    resource for the released memory.  Then the upper example looks like the
    following:
    
      $ echo 0 > /sys/kernel/kexec_crash_size
      $ cat /proc/iomem
      00000000-bfffffff : System RAM
        00000000-005b7ac3 : Kernel code
        005b7ac4-009743bf : Kernel data
        009bb000-00a85c33 : Kernel bss
      c0000000-cfffffff : System RAM   <<-- new resource
      d0000000-ffffffff : System RAM
    
    And now I can set chunk c0000000-cfffffff offline.
    
    Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
    Cc: Vivek Goyal <vgoyal@redhat.com>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
    Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    michael-holzheu committed with torvalds Jan 13, 2012
  24. kexec: remove KMSG_DUMP_KEXEC

    KMSG_DUMP_KEXEC is useless because we already save kernel messages inside
    /proc/vmcore, and it is unsafe to allow modules to do other things in a
    crash dump scenario.
    
    [akpm@linux-foundation.org: fix powerpc build]
    Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com>
    Reported-by: Vivek Goyal <vgoyal@redhat.com>
    Acked-by: Vivek Goyal <vgoyal@redhat.com>
    Acked-by: Jarod Wilson <jarod@redhat.com>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    congwang committed with torvalds Jan 13, 2012
  25. cpumask: update setup_node_to_cpumask_map() comments

    node_to_cpumask() has been replaced by cpumask_of_node(), and wholly
    removed since commit 29c337a ("cpumask: remove obsolete node_to_cpumask
    now everyone uses cpumask_of_node").
    
    So update the comments for setup_node_to_cpumask_map().
    
    Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
    Acked-by: Rusty Russell <rusty@rustcorp.com.au>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    gaowanlong committed with torvalds Jan 13, 2012
  26. mm/vmalloc.c: eliminate extra loop in pcpu_get_vm_areas error path

    If either of the vas or vms arrays are not properly kzalloced, then the
    code jumps to the err_free label.
    
    The err_free label runs a loop to check and free each of the array members
    of the vas and vms arrays which is not required for this situation as none
    of the array members have been allocated till this point.
    
    Eliminate the extra loop we have to go through by introducing a new label
    err_free2 and then jumping to it.
    
    [akpm@linux-foundation.org: remove now-unneeded tests]
    Signed-off-by: Kautuk Consul <consul.kautuk@gmail.com>
    Acked-by: David Rientjes <rientjes@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Kautuk Consul committed with torvalds Jan 13, 2012
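
    The two-label error path described in this commit can be sketched in
    userspace C.  This is only an illustration, not the mm/vmalloc.c code:
    calloc()/free() stand in for kzalloc()/kfree(), and struct area,
    alloc_areas() and free_areas() are hypothetical names.  A failure to
    allocate the arrays themselves jumps straight to err_free2, skipping the
    per-member loop that only matters once members may exist.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the objects that vas[] and vms[] point to. */
struct area { int id; };

/*
 * Allocate two pointer arrays, then one member per slot in each.
 * nr is assumed to be non-zero.  Returns 0 on success and -1 on failure;
 * on failure everything allocated so far has been released.
 */
int alloc_areas(size_t nr, struct area ***vas_out, struct area ***vms_out)
{
	struct area **vas, **vms;
	size_t i;

	assert(vas_out && vms_out);

	vas = calloc(nr, sizeof(*vas));
	vms = calloc(nr, sizeof(*vms));
	if (!vas || !vms)
		goto err_free2;	/* no members exist yet: skip the loop */

	for (i = 0; i < nr; i++) {
		vas[i] = calloc(1, sizeof(**vas));
		vms[i] = calloc(1, sizeof(**vms));
		if (!vas[i] || !vms[i])
			goto err_free;
	}

	*vas_out = vas;
	*vms_out = vms;
	return 0;

err_free:
	/* Slots never reached are still NULL, and free(NULL) is a no-op. */
	for (i = 0; i < nr; i++) {
		free(vas[i]);
		free(vms[i]);
	}
err_free2:
	free(vas);
	free(vms);
	return -1;
}

/* Release everything handed out by a successful alloc_areas() call. */
void free_areas(size_t nr, struct area **vas, struct area **vms)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		free(vas[i]);
		free(vms[i]);
	}
	free(vas);
	free(vms);
}
```

    Falling through from err_free into err_free2 keeps a single exit path
    while letting the early failure avoid the loop entirely.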
  27. mm: rearrange putback_inactive_pages

    There is sometimes confusion between the global putback_lru_pages() in
    migrate.c and the static putback_lru_pages() in vmscan.c: rename the
    latter putback_inactive_pages(): it helps shrink_inactive_list() rather as
    move_active_pages_to_lru() helps shrink_active_list().
    
    Remove unused scan_control arg from putback_inactive_pages() and from
    update_isolated_counts().  Move clear_active_flags() inside
    update_isolated_counts().  Move NR_ISOLATED accounting up into
    shrink_inactive_list() itself, so the balance is clearer.
    
    Do the spin_lock_irq() before calling putback_inactive_pages() and
    spin_unlock_irq() after return from it, so that it better matches
    update_isolated_counts() and move_active_pages_to_lru().
    
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Hugh Dickins committed with torvalds Jan 13, 2012
  28. mm: remove isolate_pages()

    The isolate_pages() level in vmscan.c offers little but indirection: merge
    it into isolate_lru_pages() as the compiler does, and use the names
    nr_to_scan and nr_scanned in each case.
    
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Hugh Dickins committed with torvalds Jan 13, 2012
  29. mm: remove del_page_from_lru, add page_off_lru

    del_page_from_lru() repeats del_page_from_lru_list(), also working out
    which LRU the page was on, clearing the relevant bits.  Decouple those
    functions: remove del_page_from_lru() and add page_off_lru().
    
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Hugh Dickins committed with torvalds Jan 13, 2012
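
    The decoupling described in this commit can be modelled roughly as below.
    This is a simplified userspace sketch, not the kernel implementation:
    struct page_model and its boolean fields are hypothetical stand-ins for
    struct page and its flag bits.  The helper only works out which LRU the
    page was on and clears the matching bits; removing the page from that
    list is left to the caller.

```c
#include <assert.h>
#include <stdbool.h>

/* List indices laid out so that "active" is the inactive index plus one. */
enum lru_list {
	LRU_INACTIVE_ANON,
	LRU_ACTIVE_ANON,
	LRU_INACTIVE_FILE,
	LRU_ACTIVE_FILE,
	LRU_UNEVICTABLE,
	NR_LRU_LISTS
};

/* Hypothetical model of the page state that matters here. */
struct page_model {
	bool active;
	bool unevictable;
	bool file_backed;
};

/*
 * Report which LRU list the page was on and clear the flags that put it
 * there.  The list removal itself is a separate step for the caller.
 */
enum lru_list page_off_lru(struct page_model *page)
{
	enum lru_list lru;

	assert(page);

	if (page->unevictable) {
		page->unevictable = false;
		lru = LRU_UNEVICTABLE;
	} else {
		lru = page->file_backed ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON;
		if (page->active) {
			page->active = false;
			lru = (enum lru_list)(lru + 1);	/* active variant */
		}
	}
	return lru;
}
```

    A caller would then pass the returned index to whatever routine unlinks
    the page from that list, which is the role del_page_from_lru_list() plays
    in the commit message.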
  30. mm: enum lru_list lru

    Mostly we use "enum lru_list lru": change those few "l"s to "lru"s.
    
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Hugh Dickins committed with torvalds Jan 13, 2012
  31. mm: no blank line after EXPORT_SYMBOL in swap.c

    checkpatch rightly protests
    
      WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable
    
    so fix the five offenders in mm/swap.c.
    
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Hugh Dickins committed with torvalds Jan 13, 2012
  32. mm: fewer underscores in ____pagevec_lru_add

    What's so special about ____pagevec_lru_add() that it needs four leading
    underscores?  Nothing, it just helped to distinguish from
    __pagevec_lru_add() in 2.6.28 development.  Cut two leading underscores.
    
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Hugh Dickins committed with torvalds Jan 13, 2012
  33. mm: take pagevecs off reclaim stack

    Replace pagevecs in putback_lru_pages() and move_active_pages_to_lru()
    by lists of pages_to_free: then apply Konstantin Khlebnikov's
    free_hot_cold_page_list() to them instead of pagevec_release().
    
    Which simplifies the flow (no need to drop and retake lock whenever
    pagevec fills up) and reduces stale addresses in stack backtraces
    (which often showed through the pagevecs); but more importantly,
    removes another 120 bytes from the deepest stacks in page reclaim.
    Although I've not recently seen an actual stack overflow here with
    a vanilla kernel, move_active_pages_to_lru() has often featured in
    deep backtraces.
    
    However, free_hot_cold_page_list() does not handle compound pages
    (nor need it: a Transparent HugePage would have been split by the
    time it reaches the call in shrink_page_list()), but it is possible
    for putback_lru_pages() or move_active_pages_to_lru() to be left
    holding the last reference on a THP, so must exclude the unlikely
    compound case before putting on pages_to_free.
    
    Remove pagevec_strip(), its work now done in move_active_pages_to_lru().
    The pagevec in scan_mapping_unevictable_pages() remains in mm/vmscan.c,
    but that is never on the reclaim path, and cannot be replaced by a list.
    
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Reviewed-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
    Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: Mel Gorman <mel@csn.ul.ie>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Hugh Dickins committed with torvalds Jan 13, 2012
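
    The flow simplification described in this commit (collect pages on a
    pages_to_free list, then free them in one pass once the lock is dropped)
    can be illustrated with a toy model.  Everything here is an assumption
    made for the sketch: struct page_model, struct fake_lock and
    putback_pages() are hypothetical, the lock merely counts acquisitions,
    and "freeing" just sets a flag.  It is not the mm/vmscan.c code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical page with a reference count and a list link. */
struct page_model {
	int refcount;
	bool freed;
	struct page_model *next;	/* used only while on pages_to_free */
};

/* Stand-in for the LRU lock: it only records how often it is taken. */
struct fake_lock {
	int acquisitions;
	bool held;
};

static void fake_lock_acquire(struct fake_lock *lock)
{
	assert(!lock->held);
	lock->held = true;
	lock->acquisitions++;
}

static void fake_lock_release(struct fake_lock *lock)
{
	assert(lock->held);
	lock->held = false;
}

/*
 * Drop one reference on each page while holding the lock.  Pages whose
 * count reaches zero are chained onto a local list and freed together
 * after the lock is released, so the lock is taken exactly once.
 * Returns the number of pages freed.
 */
size_t putback_pages(struct fake_lock *lock, struct page_model **pages,
		     size_t nr)
{
	struct page_model *pages_to_free = NULL;
	size_t i, freed = 0;

	fake_lock_acquire(lock);
	for (i = 0; i < nr; i++) {
		struct page_model *page = pages[i];

		if (--page->refcount == 0) {
			page->next = pages_to_free;
			pages_to_free = page;
		}
	}
	fake_lock_release(lock);

	/* One pass over the list, outside the lock. */
	while (pages_to_free) {
		struct page_model *page = pages_to_free;

		pages_to_free = page->next;
		page->next = NULL;
		page->freed = true;
		freed++;
	}
	return freed;
}
```

    With a fixed-size batch container the loop would have to release and
    retake the lock each time the batch filled up; an intrusive list has no
    capacity limit, so that step disappears.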
  34. memcg: fix mem_cgroup_print_bad_page

    If DEBUG_VM, mem_cgroup_print_bad_page() is called whenever bad_page()
    shows a "Bad page state" message, removes page from circulation, adds a
    taint and continues.  This is at a very low level, often when a spinlock
    is held (sometimes when page table lock is held, for example).
    
    We want to recover from this badness, not make it worse: we must not
    kmalloc memory here, we must not do a cgroup path lookup via dubious
    pointers.  No doubt that code was useful to debug a particular case at one
    time, and may be again, but take it out of the mainline kernel.
    
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
    Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Michal Hocko <mhocko@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Hugh Dickins committed with torvalds Jan 13, 2012
  35. memcg: fix split_huge_page_refcounts()

    This patch started off as a cleanup: __split_huge_page_refcounts() has to
    cope with two scenarios, when the hugepage being split is already on LRU,
    and when it is not; but why does it have to split that accounting across
    three different sites?  Consolidate it in lru_add_page_tail(), handling
    evictable and unevictable alike, and use standard add_page_to_lru_list()
    when accounting is needed (when the head is not yet on LRU).
    
    But a recent regression in -next, I guess the removal of PageCgroupAcctLRU
    test from mem_cgroup_split_huge_fixup(), makes this now a necessary fix:
    under load, the MEM_CGROUP_ZSTAT count was wrapping to a huge number,
    messing up reclaim calculations and causing a freeze at rmdir of cgroup.
    
    Add a VM_BUG_ON to mem_cgroup_lru_del_list() when we're about to wrap that
    count - this has not been the only such incident.  Document that
    lru_add_page_tail() is for Transparent HugePages by #ifdef around it.
    
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
    Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Michal Hocko <mhocko@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Hugh Dickins committed with torvalds Jan 13, 2012
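
    The counter wrap described in this commit is ordinary unsigned
    arithmetic: subtracting more than the counter holds yields a huge value
    instead of a negative one.  The sketch below shows the effect and a guard
    placed where the commit says a VM_BUG_ON was added.  The function names
    are hypothetical and the guard returns an error instead of crashing, so
    this is an illustration and not the memcg code.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Unchecked: unsigned subtraction wraps modulo ULONG_MAX + 1. */
unsigned long lru_count_sub_unchecked(unsigned long count, unsigned long nr)
{
	return count - nr;
}

/*
 * Checked: refuse a subtraction that would wrap.  Returns false and
 * leaves *count untouched in that case; a debug assertion such as the
 * VM_BUG_ON mentioned in the commit would sit at the same test.
 */
bool lru_count_sub_checked(unsigned long *count, unsigned long nr)
{
	assert(count);

	if (*count < nr)
		return false;
	*count -= nr;
	return true;
}
```

    For example, lru_count_sub_unchecked(1, 2) returns ULONG_MAX, which is
    the kind of value that would distort any calculation based on the count.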