Commits on Jul 20, 2016
  1. @tcaputi

    Added highbit() and lowbit() macros

    Signed-off-by: Tom Caputi <>
    Signed-off-by: Tony Hutter <>
    Signed-off-by: Brian Behlendorf <>
    Closes #562
    tcaputi committed with Jul 14, 2016
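
The commit above does not show the macro bodies. As a rough illustration only, the sketch below assumes the illumos convention, where the result is the 1-based position of the highest (or lowest) set bit and 0 means no bit is set. The `sketch_` names are hypothetical, and plain functions are used instead of macros to keep the example easy to read.

```c
/* Assumed semantics (illumos convention): 1-based bit position, 0 if v == 0. */
static int sketch_highbit(unsigned long v)
{
    int h = 0;

    while (v != 0) {        /* shift right until no bits remain */
        h++;
        v >>= 1;
    }
    return h;
}

static int sketch_lowbit(unsigned long v)
{
    int l = 1;

    if (v == 0)
        return 0;           /* no bit set */
    while ((v & 1UL) == 0) { /* skip trailing zero bits */
        l++;
        v >>= 1;
    }
    return l;
}
```

Under these assumptions `sketch_highbit(0x10)` is 5 and `sketch_lowbit(0x18)` is 4.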
Commits on Jun 21, 2016
  1. @tonyhutter

    Add _ALIGNMENT_REQUIRED to isa_defs.h for checksums

    _ALIGNMENT_REQUIRED needs to be #defined in isa_defs.h in order to
    port the Illumos checksum code to ZoL:
    4185 add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R
    OpenZFS-commit: openzfs/openzfs@45818ee
    Signed-off-by: Tony Hutter <>
    Signed-off-by: Brian Behlendorf <>
    Closes #561
    tonyhutter committed with Jun 14, 2016
Commits on Jun 1, 2016
  1. Improve spl slab cache alloc

    The policy is to try to allocate with KM_NOSLEEP, which will lead to
    memory allocation with GFP_ATOMIC, and if it fails, it will launch
    a taskq to expand slab space.
    This way it should be able to get better NUMA memory locality and
    reduce the overhead of context switches.
    Signed-off-by: Jinshan Xiong <>
    Signed-off-by: Brian Behlendorf <>
    Closes #551
    Jinshan Xiong committed with May 19, 2016
Commits on May 31, 2016
  1. @tuxoko

    Fix use-after-free in splat_taskq_test7

    This splat_vprint is using tq_arg->name after tq_arg is freed.
    Signed-off-by: Chunwei Chen <>
    Signed-off-by: Brian Behlendorf <>
    Closes #557
    tuxoko committed with May 27, 2016
  2. @tuxoko

    Implement a proper rw_tryupgrade

    The current rw_tryupgrade does rw_exit and then rw_tryenter(RW_WRITER), and then
    does rw_enter(RW_READER) if it fails. This violates the assumption that
    rw_tryupgrade should be atomic and could cause extra contention or even lock
    In this patch we implement a proper rw_tryupgrade. For rwsem-spinlock, we take
    the spinlock to check rwsem->count and rwsem->wait_list. For normal rwsem, we
    use cmpxchg on rwsem->count to change the value from single reader to single
    Signed-off-by: Chunwei Chen <>
    Signed-off-by: Brian Behlendorf <>
    Signed-off-by: Tim Chase <>
    Closes zfsonlinux/zfs#4692
    Closes #554
    tuxoko committed with May 25, 2016
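
The cmpxchg idea described for the normal rwsem case can be sketched in userland with C11 atomics. Everything here is an assumption for illustration: the `sketch_` names are hypothetical, and the counter encoding (n means n readers, -1 means one writer) is made up and is not the kernel's actual rwsem->count layout, which differs between kernel versions.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical encoding: 0 = unlocked, n > 0 = n readers, -1 = one writer. */
#define SKETCH_ONE_READER 1L
#define SKETCH_ONE_WRITER (-1L)

typedef struct sketch_rwlock {
    atomic_long count;
} sketch_rwlock_t;

/*
 * Caller already holds the lock as a reader.  The upgrade succeeds only
 * if the caller is the sole reader, and the transition from "one reader"
 * to "one writer" happens in a single atomic step, so there is no window
 * in which the lock is not held.
 */
static bool sketch_rw_tryupgrade(sketch_rwlock_t *l)
{
    long expected = SKETCH_ONE_READER;

    return atomic_compare_exchange_strong(&l->count, &expected,
        SKETCH_ONE_WRITER);
}
```

If another reader is present the compare-and-exchange fails and the count is left untouched, so the caller still holds its read lock.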
  3. Add isa_defs for MIPS

    GCC for MIPS only defines _LP64 for 64-bit targets and
    does not define _ILP32 for 32-bit targets.
    Signed-off-by: YunQiang Su <>
    Signed-off-by: Brian Behlendorf <>
    Closes #558
    YunQiang Su committed with May 28, 2016
Commits on May 24, 2016
  1. @tuxoko

    Fix taskq_wait_outstanding re-evaluate tq_next_id

    wait_event is a macro, so the current implementation will cause
    re-evaluation of tq_next_id every time it wakes up. This would cause
    taskq_wait_outstanding(tq, 0) to be equivalent to taskq_wait(tq).
    Signed-off-by: Chunwei Chen <>
    Signed-off-by: Brian Behlendorf <>
    Signed-off-by: Tim Chase <>
    Issue #553
    tuxoko committed with May 23, 2016
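
The re-evaluation problem can be reproduced outside the kernel. The sketch below is a simulation under stated assumptions: `SKETCH_WAIT_EVENT` is a hypothetical stand-in for a wait_event-style macro, and the struct fields and helper names are invented for the example. Because the condition text is pasted into the loop, reading `next_id` inside it gives a moving target, while a snapshot taken before the wait does not.

```c
/* Hypothetical stand-in: the condition is re-evaluated on every pass. */
#define SKETCH_WAIT_EVENT(cond, wakeup_stmt) \
    do {                                     \
        while (!(cond)) {                    \
            wakeup_stmt;                     \
        }                                    \
    } while (0)

struct sketch_tq {
    unsigned long next_id;   /* id the next dispatched task will get */
    unsigned long lowest_id; /* lowest id that is still outstanding */
};

/* One simulated wakeup: a task completes, and a new one may be dispatched. */
static unsigned sketch_step(struct sketch_tq *tq, unsigned *new_tasks)
{
    tq->lowest_id++;
    if (*new_tasks > 0) {
        tq->next_id++;
        (*new_tasks)--;
    }
    return 1;
}

/* Buggy form: next_id is read again each time the condition is checked. */
static unsigned sketch_wait_buggy(struct sketch_tq *tq, unsigned new_tasks)
{
    unsigned wakeups = 0;

    SKETCH_WAIT_EVENT(tq->lowest_id > tq->next_id - 1,
        wakeups += sketch_step(tq, &new_tasks));
    return wakeups;
}

/* Fixed form: the target id is captured once, before waiting. */
static unsigned sketch_wait_fixed(struct sketch_tq *tq, unsigned new_tasks)
{
    unsigned wakeups = 0;
    const unsigned long id = tq->next_id - 1;

    SKETCH_WAIT_EVENT(tq->lowest_id > id,
        wakeups += sketch_step(tq, &new_tasks));
    return wakeups;
}
```

Starting from `next_id == 4` and `lowest_id == 1` with five later dispatches, the fixed form returns after 3 wakeups while the buggy form needs 8, because it also waits for every task dispatched in the meantime.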
  2. @tuxoko

    Fix race between taskq_destroy and dynamic spawning thread

    Although taskq_destroy waits for dynamic_taskq to finish its tasks, this
    does not imply that the thread being spawned is up and running. This can cause
    the taskq to be freed before the thread can exit.
    We fix this by using tq_nspawn to indicate how many threads are being spawned
    before they are inserted into the thread list, and have taskq_destroy wait
    for it to drop to zero.
    Signed-off-by: Chunwei Chen <>
    Signed-off-by: Brian Behlendorf <>
    Signed-off-by: Tim Chase <>
    Issue #553
    Closes #550
    tuxoko committed with May 20, 2016
  3. @tuxoko

    Restore CALLOUT_FLAG_ABSOLUTE in cv_timedwait_hires

    In 39cd90e, I mistakenly disabled the ability to use an absolute expire time in
    cv_timedwait_hires. I'm not quite sure why I did that, so let's restore it.
    Signed-off-by: Chunwei Chen <>
    Signed-off-by: Brian Behlendorf <>
    Signed-off-by: Tim Chase <>
    Issue #553
    tuxoko committed with May 20, 2016
Commits on May 20, 2016
  1. @tuxoko

    Linux 4.7 compat: inode_lock() and friends

    Linux 4.7 changes i_mutex to i_rwsem, and we should use inode_lock and
    inode_lock_shared to take the exclusive and shared lock respectively.
    We use spl_inode_lock{,_shared}() to hide the difference. Note that on older
    kernels you'll always take an exclusive lock.
    We also add all the other inode_lock friends. Nested users should now
    explicitly call spl_inode_lock_nested with the correct subclass.
    Signed-off-by: Chunwei Chen <>
    Signed-off-by: Brian Behlendorf <>
    Issue zfsonlinux/zfs#4665
    Closes #549
    tuxoko committed with May 18, 2016
Commits on May 12, 2016
  1. @tuxoko

    Add cv_timedwait_sig_hires to allow interruptible sleep

    Signed-off-by: Chunwei Chen <>
    Signed-off-by: Brian Behlendorf <>
    Closes #548
    tuxoko committed with May 11, 2016
Commits on May 5, 2016
  1. @dpquigl

    Add a macro to convert seconds to nanoseconds and vice-versa

    Required infrastructure for zfsonlinux/zfs#4600.
    Signed-off-by: Brian Behlendorf <>
    Closes #546
    dpquigl committed with May 5, 2016
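
The commit text does not show the macros themselves. A minimal sketch of such a conversion pair, with hypothetical `SKETCH_` names and a 64-bit cast so that the multiplication does not overflow a 32-bit int, might look like this:

```c
#include <stdint.h>

/* Hypothetical names; the macros added by the commit may differ. */
#define SKETCH_NANOSEC      1000000000LL
#define SKETCH_SEC2NSEC(s)  ((int64_t)(s) * SKETCH_NANOSEC)
#define SKETCH_NSEC2SEC(ns) ((int64_t)(ns) / SKETCH_NANOSEC)
```

Note that the nanoseconds-to-seconds direction truncates, so 1999999999 ns converts to 1 s.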
Commits on Apr 26, 2016
  1. @dweeezil

    Clear PF_FSTRANS over spl_filp_fallocate()

    The problem described in 2a5d574 also applies to XFS's file or inode
    fallocate method.  Both paths may trigger writeback and expose this
    issue, see the full stack below.
    When layered on XFS a warning will be emitted under CentOS7 when entering
    either the file or inode fallocate method with PF_FSTRANS already set.
    To avoid triggering this error PF_FSTRANS is cleared and then reset
    in vn_space().
    WARNING: at fs/xfs/xfs_aops.c:982 xfs_vm_writepage+0x58b/0x5d0
    Call Trace:
     [<ffffffff810a1ed5>] warn_slowpath_common+0x95/0xe0
     [<ffffffff810a1f3a>] warn_slowpath_null+0x1a/0x20
     [<ffffffffa0231fdb>] xfs_vm_writepage+0x58b/0x5d0 [xfs]
     [<ffffffff81173ed7>] __writepage+0x17/0x40
     [<ffffffff81176f81>] write_cache_pages+0x251/0x530
     [<ffffffff811772b1>] generic_writepages+0x51/0x80
     [<ffffffffa0230cb0>] xfs_vm_writepages+0x60/0x80 [xfs]
     [<ffffffff81177300>] do_writepages+0x20/0x30
     [<ffffffff8116a5f5>] __filemap_fdatawrite_range+0xb5/0x100
     [<ffffffff8116a6cb>] filemap_write_and_wait_range+0x8b/0xd0
     [<ffffffffa0235bb4>] xfs_free_file_space+0xf4/0x520 [xfs]
     [<ffffffffa023cbce>] xfs_file_fallocate+0x19e/0x2c0 [xfs]
     [<ffffffffa036c6fc>] vn_space+0x3c/0x40 [spl]
     [<ffffffffa0434817>] vdev_file_io_start+0x207/0x260 [zfs]
     [<ffffffffa047170d>] zio_vdev_io_start+0xad/0x2d0 [zfs]
     [<ffffffffa0474942>] zio_execute+0x82/0xe0 [zfs]
     [<ffffffffa036ba7d>] taskq_thread+0x28d/0x5a0 [spl]
     [<ffffffff810c1777>] kthread+0xd7/0xf0
     [<ffffffff8167de2f>] ret_from_fork+0x3f/0x70
    Signed-off-by: Brian Behlendorf <>
    Signed-off-by: Tim Chase <>
    Signed-off-by: Nikolay Borisov <>
    Closes zfsonlinux/zfs#4529
    dweeezil committed with Apr 26, 2016
  2. @dweeezil

    Use vmem_free() in dfl_free() and add dfl_alloc()

    This change was lost, somehow, in e5f9a9a.  Since the arrays can be
    rather large, they need to be allocated with vmem_zalloc() via dfl_alloc()
    and freed with vmem_free() via dfl_free().
    The new dfl_alloc() function should be used to allocate objects of type
    dkioc_free_list_t so that they're allocated from vmem.
    Signed-off-by: Brian Behlendorf <>
    Signed-off-by: Tim Chase <>
    Signed-off-by: Nikolay Borisov <>
    Closes #543
    dweeezil committed with Apr 24, 2016
  3. @tuxoko

    Use kernel provided mutex owner

    To reduce mutex footprint, we detect the existence of owner in kernel mutex,
    and rely on it if it exists.
    Note that before Linux 3.0, mutex owner is of type thread_info. Also note
    that, in Linux 3.18, the condition for owner is changed from
    Signed-off-by: Chunwei Chen <>
    Signed-off-by: Brian Behlendorf <>
    Closes #540
    tuxoko committed with Apr 12, 2016
Commits on Mar 17, 2016
  1. @xnox

    Add support for s390[x].

    Signed-off-by: Dimitri John Ledkov <>
    Signed-off-by: Richard Yao <>
    Signed-off-by: Brian Behlendorf <>
    Closes #537
    xnox committed with Mar 16, 2016
  2. Allow spawning a new thread for TQ_NOQUEUE dispatch with dynamic taskq

    When a TQ_NOQUEUE dispatch is done on a dynamic taskq, allow another
    thread to be spawned.  This will cause TQ_NOQUEUE to behave similarly
    to how it does with non-dynamic taskqs.
    Add support for TQ_NOQUEUE to taskq_dispatch_ent().
    Signed-off-by: Tim Chase <>
    Signed-off-by: Brian Behlendorf <>
    Closes #530
    Tim Chase committed with Feb 8, 2016
Commits on Mar 10, 2016
  1. Add rw_tryupgrade()

    This implementation of rw_tryupgrade() behaves slightly differently
    from its counterparts on other platforms.  It drops the RW_READER lock
    and then acquires the RW_WRITER lock leaving a small window where no
    lock is held.  On other platforms the lock is never released during
    the upgrade process.  This is necessary under Linux because the kernel
    does not provide an upgrade function.
    There are currently no callers in the ZFS code where this change in
    behavior is a problem.  In fact, in most cases the code is already
    written such that if the upgrade fails the RW_READER lock is dropped
    and the caller blocks waiting to acquire the lock as RW_WRITER.
    Signed-off-by: Brian Behlendorf <>
    Signed-off-by: Tim Chase <>
    Signed-off-by: Matthew Thode <>
    Closes zfsonlinux/zfs#4388
    Closes #534
    committed Mar 9, 2016
  2. Remove RPM package restriction

    ZFS on Linux is regularly tested on arm, ppc, ppc64, i686 and x86_64
    architectures.  Given this the artificial architecture restriction in
    the packaging has been removed.
    Signed-off-by: Brian Behlendorf <>
    committed Mar 10, 2016
Commits on Feb 25, 2016
  1. @tcaputi

    Changes to support zfs encryption

    Unused modlinkage struct removed and ntohll functions added.
    Signed-off-by: Tom Caputi <>
    Signed-off-by: Brian Behlendorf <>
    Closes #533
    tcaputi committed with Feb 18, 2016
Commits on Feb 17, 2016
  1. @ryao

    random_get_pseudo_bytes() need not provide cryptographic strength ent…

    Perf profiling of dd on a zvol revealed that my system spent 3.16% of
    its time in random_get_pseudo_bytes(). No SPL consumers need
    cryptographic strength entropy, so we can reduce our overhead by
    changing the implementation to utilize a fast PRNG.
    The Linux kernel did not export a suitable PRNG function until it
    exported get_random_int() in Linux 3.10. While we could implement an
    autotools check so that we use it when it is available or even try to
    access the symbol on older kernels where it is not exported using the
    fact that it is exported on newer ones as justification, we can instead
    implement our own pseudo-random data generator. For this purpose, I have
    written one based on a 128-bit pseudo-random number generator proposed
    in a paper by Sebastiano Vigna that itself was based on work by the late
    George Marsaglia.
    Profiling the same benchmark with an earlier variant of this patch that
    used a slightly different generator (roughly same number of
    instructions) by the same author showed that time spent in
    random_get_pseudo_bytes() dropped to 0.06%. That is a factor of 50
    improvement. This particular generator algorithm is also well known to
    be fast:
    The benchmark numbers there state that it runs at 1.12ns/64-bits or 7.14
    GBps of throughput on an Intel Core i7-4770 in what is presumably a
    single-threaded context. Using it in `random_get_pseudo_bytes()` in the
    manner I have will probably not reach that level of performance, but it
    should be fairly high and many times higher than the Linux
    `get_random_bytes()` function that we use now, which runs at 16.3 MB/s
    on my Intel Xeon E3-1276v3 processor when measured by using dd on
    Also, putting this generator's seed into per-CPU variables allows us to
    eliminate overhead from both spin locks and CPU memory barriers, which
    is NUMA friendly.
    We could have alternatively modified consumers to use something like
    `gethrtime() % 3` as suggested by both Matthew Ahrens and Tim Chase, but
    that has a few potential problems that this approach avoids:
    1. Switching to `gethrtime() % 3` in hot code paths today requires
    diverging from illumos-gate and does nothing about potential future
    patches from illumos-gate that call our slow `random_get_pseudo_bytes()`
    in different hot code paths. Reimplementing `random_get_pseudo_bytes()`
    with a per-CPU PRNG avoids both of those things entirely, which means
    less work for us in the future.
    2.  Looking at the code that implements `gethrtime()`, I think it is
    unlikely to be faster than this per-CPU PRNG implementation of
    `random_get_pseudo_bytes()`. It would be best to go with something fast
    now so that there is no point in revisiting this from a performance
    3. `gethrtime() % 3` can vary in behavior from system to system based on
    kernel version, architecture and clock source. In comparison, this
    per-CPU PRNG is about 40 lines of code in `random_get_pseudo_bytes()`
    that should behave consistently across all systems regardless of kernel
    version, system architecture or machine clock source. It is unlikely
    that we would ever need to revisit this per-CPU PRNG while the same
    cannot be said for `gethrtime() % 3`.
    4. `gethrtime()` uses CPU memory barriers and maybe atomic instructions
    depending on the clock source, so replacing `random_get_pseudo_bytes()`
    with `gethrtime()` in hot code paths could still require a future person
    working on NUMA scalability to reimplement it anyway while this per-CPU
    PRNG would not by virtue of using neither CPU memory barriers nor atomic
    instructions. Note that I did not check various clock sources for the
    presence of atomic instructions. There is simply too much code to read
    and given the drawbacks versus this per-cpu PRNG, there is no point in
    being certain.
    5. I have heard of instances where poor quality pseudo-random numbers
    caused problems for HPC code in ways that took more than a year to
    identify and were remedied by switching to a higher quality source of
    pseudo-random numbers. While filesystems are different than HPC code, I
    do not think it is impossible for us to have instances where poor
    quality pseudo-random numbers can cause problems. Opting for a well
    studied PRNG algorithm that passes tests for statistical randomness over
    changing callers to use `gethrtime() % 3` bypasses the need to think
    about both whether poor quality pseudo-random numbers can cause problems
    and the statistical quality of numbers from `gethrtime() % 3`.
    6. `gethrtime()` calls `getrawmonotonic()`, which uses seqlocks. This is
    probably not a huge issue, but anyone using kgdb would never be able to
    step through a seqlock critical section, which is not a problem either
    now or with the per-CPU PRNG:
    The only downside that I can see is that this code's memory requirement
    is O(N) where N is NR_CPUS, versus the current code and `gethrtime() %
    3`, which are O(1), but that should not be a problem. The seeds will use
    64KB of memory at the high end (i.e. `NR_CPUS == 4096`) and 16 bytes of
    memory at the low end (i.e. `NR_CPUS == 1`).  In either case, we should
    only use a few hundred bytes of code for text, especially since
    `spl_rand_jump()` should be inlined into `spl_random_init()`, which
    should be removed during early boot as part of "Freeing unused kernel
    memory". In either case, the memory requirements are minuscule.
    Signed-off-by: Richard Yao <>
    Signed-off-by: Brian Behlendorf <>
    Signed-off-by: Tim Chase <>
    Closes #372
    ryao committed with Jul 11, 2014
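
The approach described above, a small 128-bit generator with one seed per CPU, can be sketched in userland as follows. This is an illustration under several assumptions, not the code from the commit: the message does not name the exact generator, so a xorshift128+ variant with shift constants 23, 18 and 5 is assumed; a plain array indexed by a caller-supplied CPU number stands in for kernel per-CPU variables; and the seeding is a placeholder for real entropy plus a jump function. All `sketch_` names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_NCPUS 4 /* hypothetical; stands in for the number of CPUs */

/* 16 bytes of generator state per CPU, as in the memory estimate above. */
static uint64_t sketch_seed[SKETCH_NCPUS][2];

/* Placeholder seeding: distinct per-CPU states that are never all zero. */
static void sketch_random_init(uint64_t base)
{
    for (int i = 0; i < SKETCH_NCPUS; i++) {
        sketch_seed[i][0] = (base + (uint64_t)i) | 1;
        sketch_seed[i][1] = ~base ^ ((uint64_t)i << 32);
    }
}

/* One step of the assumed xorshift128+ variant. */
static uint64_t sketch_next(uint64_t s[2])
{
    uint64_t s1 = s[0];
    const uint64_t s0 = s[1];

    s[0] = s0;
    s1 ^= s1 << 23;
    s[1] = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5);
    return s[1] + s0;
}

/* Fill buf using only the state that belongs to the given CPU slot. */
static void sketch_get_pseudo_bytes(int cpu, uint8_t *buf, size_t len)
{
    uint64_t *s = sketch_seed[cpu];

    while (len > 0) {
        uint64_t r = sketch_next(s);
        size_t n = len < sizeof (r) ? len : sizeof (r);

        memcpy(buf, &r, n);
        buf += n;
        len -= n;
    }
}
```

Since each slot is touched only by its own CPU, the sketch needs no locks or atomics, which is the property the commit message relies on. In a kernel the caller would also have to stay on that CPU for the duration of the call.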
Commits on Feb 5, 2016
  1. @tuxoko

    Allow kicking a taskq to spawn more threads

    This patch adds a module parameter, spl_taskq_kick. When a non-zero value is
    written to it, it scans all the taskqs. If a taskq contains a task that has been
    pending for more than 5 seconds, it is forced to spawn a new thread. This is meant
    as an emergency recovery from deadlock, not a general solution.
    Signed-off-by: Chunwei Chen <>
    Signed-off-by: Brian Behlendorf <>
    Closes #529
    tuxoko committed with Jan 27, 2016
Commits on Jan 26, 2016
  1. @infowolfe

    Ensure spl/ only occurs once in core-y

    Update copy-builtin so it may be run multiple times against
    the kernel source tree.  This change makes sed more discriminating
    to ensure spl/ only occurs once in core-y.
    Signed-off-by: Chip Parker <>
    Signed-off-by: Brian Behlendorf <>
    Closes #526
    infowolfe committed with Jan 25, 2016
Commits on Jan 23, 2016
  1. Remove RLIM64_INFINITY assert in vn_rdwr()

    Previous commit be29e6a updated kobj_read_file() so it no longer
    unconditionally passes RLIM64_INFINITY.  The vn_rdwr() function
    needs to be updated accordingly.
    Signed-off-by: Brian Behlendorf <>
    Issue #513
    committed Jan 23, 2016
  2. @ryao

    kobj_read_file: Return -1 on vn_rdwr() error

    I noticed that the SPL implementation of kobj_read_file is not correct
    after comparing it with the userland implementation of kobj_read_file()
    in zfsonlinux/zfs#4104.
    Note that we no longer pass RLIM64_INFINITY with this, but our vn_rdwr
    implementation did not support it anyway, so there is no difference.
    Signed-off-by: Richard Yao <>
    Signed-off-by: Brian Behlendorf <>
    Closes #513
    ryao committed with Dec 15, 2015
Commits on Jan 21, 2016
  1. @ofaaland

    Create spl-kmod-debuginfo rpm with redhat spec file

    Correct the redhat specfile so that working debuginfo rpms are created
    for the kernel modules.  The generic specfile already does the right
    Signed-off-by: Olaf Faaland <>
    Signed-off-by: Brian Behlendorf <>
    Closes zfsonlinux/zfs#4224
    ofaaland committed with Jan 19, 2016
Commits on Jan 20, 2016
  1. @tuxoko

    Use tsd to store tq for taskq_member

    To prevent taskq_member from holding tq_lock and doing a linear search, which
    causes contention, we store the pointer to the taskq the thread belongs to in tsd.
    This way taskq_member does not need to touch tq_lock, and tsd has a per-slot
    spinlock, so contention should be greatly reduced.
    Signed-off-by: Chunwei Chen <>
    Signed-off-by: Brian Behlendorf <>
    Closes #500
    Closes #504
    Closes #505
    tuxoko committed with Dec 2, 2015
  2. Linux 4.5 compat: pfn_t typedef

    The pfn_t typedef was inherited from Illumos but never directly
    used by any SPL consumers.  This didn't cause any issues until
    the Linux 4.5 kernel introduced a typedef of the same name.
    See torvalds/linux/commit/34c0fd54, this patch removes the
    unused Illumos version to prevent a conflict.
    Signed-off-by: Brian Behlendorf <>
    Signed-off-by: Tim Chase <>
    Signed-off-by: Chunwei Chen <>
    Closes #524
    committed Jan 19, 2016
  3. @tuxoko

    Turn on both PF_FSTRANS and PF_MEMALLOC_NOIO in spl_fstrans_mark

    In b4ad50a, we abandoned memalloc_noio_save in favor of spl_fstrans_mark
    because earlier kernels that provide it don't turn off __GFP_FS. However, for newer
    kernels, we would prefer PF_MEMALLOC_NOIO because it also covers allocations
    inside the kernel which we cannot control otherwise. So in this patch, we turn on both
    PF_FSTRANS and PF_MEMALLOC_NOIO in spl_fstrans_mark.
    Signed-off-by: Chunwei Chen <>
    Signed-off-by: Brian Behlendorf <>
    Closes #523
    tuxoko committed with Jan 18, 2016
Commits on Jan 12, 2016
  1. @tuxoko

    Don't hold mutex until release cv in cv_wait

    If a thread is holding the mutex when doing cv_destroy, it might end up waiting
    for a thread in cv_wait. The waiter would wake up trying to acquire the same mutex
    and cause a deadlock.
    We solve this by moving the mutex_enter to the bottom of cv_wait, so that
    the waiter will release the cv first, allowing cv_destroy to succeed and have
    a chance to free the mutex.
    This creates a race condition on the cv_mutex. We use xchg to set and check
    it to ensure we won't be harmed by the race. As a result, cv_mutex
    debugging becomes best-effort.
    Also, the change reveals a race, which was unlikely before, where we call
    mutex_destroy while test threads are still holding the mutex. We use
    kthread_stop to make sure the threads have exited before mutex_destroy.
    Signed-off-by: Chunwei Chen <>
    Signed-off-by: Brian Behlendorf <>
    Signed-off-by: Tim Chase <>
    Issue zfsonlinux/zfs#4166
    Issue zfsonlinux/zfs#4106
    tuxoko committed with Jan 6, 2016
  2. Add spl_kmem_cache_kmem_threads man page entry

    The spl_kmem_cache_kmem_threads module option was accidentally
    omitted from the documentation.  Add it.
    Signed-off-by: Brian Behlendorf <>
    Closes #512
    committed Dec 14, 2015
Commits on Jan 8, 2016
  1. @5YN3R6Y

    _ILP32 is always defined on SPARC

    Signed-off-by: Alex McWhirter <>
    Signed-off-by: Brian Behlendorf <>
    Closes #520
    5YN3R6Y committed with Jan 8, 2016
Commits on Dec 22, 2015
  1. Fix do_div() types in condvar:timeout

    The do_div() macro expects unsigned types and this is detected in
    the powerpc implementation of do_div().
    Signed-off-by: Brian Behlendorf <>
    Closes #516
    committed Dec 22, 2015
Commits on Dec 18, 2015
  1. @tuxoko

    Use spl_fstrans_mark instead of memalloc_noio_save

    For earlier versions of the kernel with memalloc_noio_save, it only turns
    off __GFP_IO but leaves __GFP_FS untouched during direct reclaim. This
    would cause threads to direct reclaim into ZFS and cause deadlock.
    Instead, we should stick to using spl_fstrans_mark. Since we would
    explicitly turn off both __GFP_IO and __GFP_FS before allocation, it
    will work on every version of the kernel.
    This impacts kernel versions 3.9-3.17, see upstream kernel commit
    torvalds/linux@934f307 for reference.
    Signed-off-by: Chunwei Chen <>
    Signed-off-by: Brian Behlendorf <>
    Signed-off-by: Tim Chase <>
    Closes #515
    Issue zfsonlinux/zfs#4111
    tuxoko committed with Dec 17, 2015
Commits on Dec 16, 2015
  1. @dweeezil

    Provide kstat for taskqs

    This patch provides 2 new kstats to display task queues:
      /proc/spl/taskqs-all - Display all task queues
      /proc/spl/taskqs - Display only "active" task queues
    A task queue is considered to be "active" if it currently has active
    (running) threads or if any of its pending, priority, delay or waitq
    lists are not empty.
    If the task queue has running threads, displays each thread function's
    address (symbolically, if possible) and its argument.
    If the task queue has a non-empty list of pending, priority or delayed
    task queue entries (taskq_ent_t), displays each entry's thread function
    address and argument.
    If the task queue has any waiters, displays each waiting task's pid.
    Note: This patch also updates some comments in taskq.h which referred to
    "taskq_t" when they should have referred to "taskq_ent_t".
    Signed-off-by: Tim Chase <>
    Signed-off-by: Brian Behlendorf <>
    Closes #491
    dweeezil committed with Oct 19, 2015