Commits on Jun 29, 2016
  1. Merge branch 'illumos-2605'

    Adds support for resuming interrupted zfs send streams and includes
    all related send/recv bug fixes from upstream OpenZFS.
    
    Unlike the upstream implementation this branch does not change
    the existing ioctl interface.  Instead a new ZFS_IOC_RECV_NEW ioctl
    was added to support resuming zfs send streams.  This was done by
    applying the original upstream patch and then reverting the ioctl
    changes in a follow-up patch.  For this reason there are a handful
    of commits between the relevant patches on this branch which are
    not interoperable.  This was done to make it easier to extract
    the new ZFS_IOC_RECV_NEW and submit it upstream.
    
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Closes #4742
    committed Jun 29, 2016
  2. Vectorized fletcher_4 must be 128-bit aligned

    The fletcher_4_native() and fletcher_4_byteswap() functions may only
    safely use the vectorized implementations when the buffer is 128-bit
    aligned.  This is because both the AVX2 and SSE implementations process
    four 32-bit words per iteration.  Fall back to the scalar implementation
    which only processes a single 32-bit word for unaligned buffers.
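
    The alignment rule above can be sketched in plain C.  This is a
    minimal illustration rather than the code from the commit: the names
    cksum4_t, fletcher4_fn, fletcher4_scalar(), is_128bit_aligned() and
    fletcher4_dispatch() are hypothetical, and the vectorized routine is
    only represented by a function pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Four 64-bit running sums, as used by a Fletcher-4 checksum. */
typedef struct {
    uint64_t word[4];
} cksum4_t;

/* Signature shared by the scalar and (hypothetical) vectorized routines. */
typedef void (*fletcher4_fn)(const void *buf, size_t size, cksum4_t *out);

/* Scalar Fletcher-4: consumes a single 32-bit word per iteration. */
void
fletcher4_scalar(const void *buf, size_t size, cksum4_t *out)
{
    const uint32_t *ip = buf;
    const uint32_t *end = ip + size / sizeof (uint32_t);
    uint64_t a = 0, b = 0, c = 0, d = 0;

    for (; ip < end; ip++) {
        a += *ip;
        b += a;
        c += b;
        d += c;
    }
    out->word[0] = a;
    out->word[1] = b;
    out->word[2] = c;
    out->word[3] = d;
}

/* Nonzero when buf sits on a 16-byte (128-bit) boundary. */
int
is_128bit_aligned(const void *buf)
{
    return (((uintptr_t)buf & 15) == 0);
}

/*
 * Use the vectorized routine only for 128-bit aligned buffers and fall
 * back to the scalar routine otherwise.  Returns 1 if the vector path
 * was taken and 0 if the scalar path was taken.
 */
int
fletcher4_dispatch(const void *buf, size_t size, cksum4_t *out,
    fletcher4_fn vec)
{
    assert(size % sizeof (uint32_t) == 0);

    if (vec != NULL && is_128bit_aligned(buf)) {
        vec(buf, size, out);
        return (1);
    }
    fletcher4_scalar(buf, size, out);
    return (0);
}
```

    Passing a buffer that starts four bytes past a 16-byte boundary makes
    fletcher4_dispatch() take the scalar path even when a vector routine
    is supplied.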
    
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
    Issue #4330
    committed Jun 28, 2016
Commits on Jun 28, 2016
  1. @pcd1193182

    OpenZFS 6876 - Stack corruption after importing a pool with a too-long name
    
    Reviewed by: Prakash Surya <prakash.surya@delphix.com>
    Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
    Reviewed by: George Wilson <george.wilson@delphix.com>
    Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
    Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
    
    Calling dsl_dataset_name on a dataset with a 256 byte buffer is asking
    for trouble. We should check every dataset on import, using a 1024 byte
    buffer and checking each time to see if the dataset's new name is longer
    than 256 bytes.
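
    A rough sketch of that check follows.  It is not the upstream fix:
    dataset_name_fits(), NAME_LIMIT and SCRATCH_LEN are hypothetical
    names, and the sketch assumes the 256-byte limit includes the
    terminating NUL.

```c
#include <stdio.h>

#define NAME_LIMIT  256   /* legacy name buffer size (assumed to include NUL) */
#define SCRATCH_LEN 1024  /* generous scratch buffer used for the check */

/*
 * Build "parent/child" in a 1024-byte scratch buffer and report whether
 * the full name would still fit in a NAME_LIMIT-byte buffer.
 * Returns 1 if it fits and 0 if it is too long.
 */
int
dataset_name_fits(const char *parent, const char *child)
{
    char scratch[SCRATCH_LEN];
    int n;

    /* snprintf() returns the untruncated length, even on truncation. */
    n = snprintf(scratch, sizeof (scratch), "%s/%s", parent, child);
    if (n < 0)
        return (0);
    return ((size_t)n < NAME_LIMIT);
}
```

    Because the length comes from the return value of snprintf(), an
    over-long name is detected without ever writing past the scratch
    buffer.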
    
    OpenZFS-issue: https://www.illumos.org/issues/6876
    OpenZFS-commit: openzfs/openzfs@ca8674e
    pcd1193182 committed with Jun 15, 2016
  2. @ikozhukhov

    OpenZFS 6314 - buffer overflow in dsl_dataset_name

    Reviewed by: George Wilson <george.wilson@delphix.com>
    Reviewed by: Prakash Surya <prakash.surya@delphix.com>
    Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
    Approved by: Dan McDonald <danmcd@omniti.com>
    Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
    
    OpenZFS-issue: https://www.illumos.org/issues/6314
    OpenZFS-commit: openzfs/openzfs@d6160ee
    ikozhukhov committed with Jun 15, 2016
  3. Implement zfs_ioc_recv_new() for OpenZFS 2605

    Adds ZFS_IOC_RECV_NEW for resumable streams and preserves the legacy
    ZFS_IOC_RECV user/kernel interface.  The new interface supports all
    stream options but is currently only used for resumable streams.
    This way updated user space utilities will interoperate with older
    kernel modules.
    
    ZFS_IOC_RECV_NEW is modeled after the existing ZFS_IOC_SEND_NEW
    handler.  Non-Linux OpenZFS platforms have opted to change the
    legacy interface in an incompatible fashion instead of adding a
    new ioctl.
    
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    committed Jun 9, 2016
  4. @danmcd

    OpenZFS 6562 - Refquota on receive doesn't account for overage

    Authored by: Dan McDonald <danmcd@omniti.com>
    Reviewed by: Matthew Ahrens <mahrens@delphix.com>
    Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
    Reviewed by: Toomas Soome <tsoome@me.com>
    Approved by: Gordon Ross <gwr@nexenta.com>
    Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
    
    OpenZFS-issue: https://www.illumos.org/issues/6562
    OpenZFS-commit: openzfs/openzfs@5f7a8e6
    danmcd committed with Jun 9, 2016
  5. @danmcd

    OpenZFS 4986 - receiving replication stream fails if any snapshot exceeds refquota
    
    Authored by: Dan McDonald <danmcd@omniti.com>
    Reviewed by: John Kennedy <john.kennedy@delphix.com>
    Reviewed by: Matthew Ahrens <mahrens@delphix.com>
    Approved by: Gordon Ross <gordon.ross@nexenta.com>
    Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
    
    OpenZFS-issue: https://www.illumos.org/issues/4986
    OpenZFS-commit: openzfs/openzfs@5878fad
    danmcd committed with Jun 9, 2016
  6. OpenZFS 6738 - zfs send stream padding needs documentation

    Authored by: Eli Rosenthal <eli.rosenthal@delphix.com>
    Reviewed by: Matthew Ahrens <mahrens@delphix.com>
    Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
    Reviewed by: Paul Dagnelie <pcd@delphix.com>
    Reviewed by: Dan McDonald <danmcd@omniti.com>
    Approved by: Robert Mustacchi <rm@joyent.com>
    Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
    
    OpenZFS-issue: https://www.illumos.org/issues/6738
    OpenZFS-commit: openzfs/openzfs@c20404f
    Eli Rosenthal committed with Jun 9, 2016
  7. @andy-js

    OpenZFS 6536 - zfs send: want a way to disable setting of DRR_FLAG_FREERECORDS
    
    Authored by: Andrew Stormont <astormont@racktopsystems.com>
    Reviewed by: Anil Vijarnia <avijarnia@racktopsystems.com>
    Reviewed by: Kim Shrier <kshrier@racktopsystems.com>
    Reviewed by: Matthew Ahrens <mahrens@delphix.com>
    Approved by: Dan McDonald <danmcd@omniti.com>
    Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
    
    OpenZFS-issue: https://www.illumos.org/issues/6536
    OpenZFS-commit: openzfs/openzfs@880094b
    andy-js committed with Jun 9, 2016
  8. @pcd1193182

    OpenZFS 6393 - zfs receive a full send as a clone

    Authored by: Paul Dagnelie <pcd@delphix.com>
    Reviewed by: Matthew Ahrens <mahrens@delphix.com>
    Reviewed by: Prakash Surya <prakash.surya@delphix.com>
    Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
    Approved by: Dan McDonald <danmcd@omniti.com>
    Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
    
    OpenZFS-issue: https://www.illumos.org/issues/6394
    OpenZFS-commit: openzfs/openzfs@68ecb2e
    pcd1193182 committed with Jun 9, 2016
  9. OpenZFS 6051 - lzc_receive: allow the caller to read the begin record

    Reviewed by: Matthew Ahrens <mahrens@delphix.com>
    Reviewed by: Paul Dagnelie <pcd@delphix.com>
    Approved by: Robert Mustacchi <rm@joyent.com>
    Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
    
    OpenZFS-issue: https://www.illumos.org/issues/6051
    OpenZFS-commit: openzfs/openzfs@620f322
    committed Jun 16, 2016
  10. @ahrens

    OpenZFS 2605, 6980, 6902

    2605 want to resume interrupted zfs send
    Reviewed by: George Wilson <george.wilson@delphix.com>
    Reviewed by: Paul Dagnelie <pcd@delphix.com>
    Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
    Reviewed by: Xin Li <delphij@freebsd.org>
    Reviewed by: Arne Jansen <sensille@gmx.net>
    Approved by: Dan McDonald <danmcd@omniti.com>
    Ported-by: kernelOfTruth <kerneloftruth@gmail.com>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    
    OpenZFS-issue: https://www.illumos.org/issues/2605
    OpenZFS-commit: openzfs/openzfs@9c3fd12
    
    6980 6902 causes zfs send to break due to 32-bit/64-bit struct mismatch
    Reviewed by: Paul Dagnelie <pcd@delphix.com>
    Reviewed by: George Wilson <george.wilson@delphix.com>
    Approved by: Robert Mustacchi <rm@joyent.com>
    Ported by: Brian Behlendorf <behlendorf1@llnl.gov>
    
    OpenZFS-issue: https://www.illumos.org/issues/6980
    OpenZFS-commit: openzfs/openzfs@ea4a67f
    
    Porting notes:
    - All rsend and snapshot tests enabled and updated for Linux.
    - Fix misuse of input argument in traverse_visitbp().
    - Fix ISO C90 warnings and errors.
    - Fix gcc 'missing braces around initializer' in
      'struct send_thread_arg to_arg =' warning.
    - Replace 4 argument fletcher_4_native() with 3 argument version,
      this change was made in OpenZFS 4185 which has not been ported.
    - Part of the sections for 'zfs receive' and 'zfs send' was
      rewritten and reordered to approximate upstream.
    - Fix mktree xattr creation, 'user.' prefix required.
    - Minor fixes to newly enabled test cases
    - Long holds for volumes allowed during receive for minor registration.
    ahrens committed with Jan 6, 2016
Commits on Jun 24, 2016
  1. Sync DMU_BACKUP_FEATURE_* flags

    Flag 20 was used in OpenZFS as DMU_BACKUP_FEATURE_RESUMING.  The
    DMU_BACKUP_FEATURE_LARGE_DNODE flag must be shifted to 21 and
    then reserved in the upstream OpenZFS implementation.
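
    To illustrate why the bit positions must not collide, here is a small
    sketch.  It assumes "flag 20" and "flag 21" mean bit positions 20 and
    21; stream_is_receivable() and the notion of a "supported" mask are
    hypothetical simplifications, not the actual receive code.

```c
#include <stdint.h>

/* Bit positions as described in the commit message (assumed 1 << N). */
#define DMU_BACKUP_FEATURE_RESUMING    (1ULL << 20)
#define DMU_BACKUP_FEATURE_LARGE_DNODE (1ULL << 21)

/*
 * Hypothetical receiver-side check: a stream is acceptable only if
 * every feature bit it sets is also in the receiver's supported mask.
 */
int
stream_is_receivable(uint64_t stream_flags, uint64_t supported)
{
    return ((stream_flags & ~supported) == 0);
}
```

    If both features shared bit 20, a receiver that only understands
    resumable streams would wrongly accept a stream carrying large
    dnodes.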
    
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Ned Bass <bass6@llnl.gov>
    Closes #4795
    committed Jun 24, 2016
  2. @nedbass

    Implement large_dnode pool feature

    Justification
    -------------
    
    This feature adds support for variable length dnodes. Our motivation is
    to eliminate the overhead associated with using spill blocks.  Spill
    blocks are used to store system attribute data (i.e. file metadata) that
    does not fit in the dnode's bonus buffer. By allowing a larger bonus
    buffer area the use of a spill block can be avoided.  Spill blocks
    potentially incur an additional read I/O for every dnode in a dnode
    block. As a worst case example, reading 32 dnodes from a 16k dnode block
    and all of the spill blocks could issue 33 separate reads. Now suppose
    those dnodes have size 1024 and therefore don't need spill blocks.  Then
    the worst case number of blocks read is reduced from 33 to two--one
    per dnode block. In practice spill blocks may tend to be co-located on
    disk with the dnode blocks so the reduction in I/O would not be this
    drastic. In a badly fragmented pool, however, the improvement could be
    significant.
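
    The worst-case arithmetic above can be reproduced with a few lines of
    C.  The helper worst_case_reads() is a hypothetical name used only
    for this illustration; it counts one read per dnode block plus one
    read per spill block.

```c
#include <assert.h>

#define DNODE_BLOCK_SIZE 16384u  /* 16k dnode block */

/*
 * Worst-case number of reads needed to fetch ndnodes dnodes of the
 * given size, with one additional read per dnode when every dnode
 * needs a spill block.
 */
unsigned
worst_case_reads(unsigned ndnodes, unsigned dnode_size, int needs_spill)
{
    unsigned per_block, blocks;

    assert(dnode_size >= 512 && dnode_size % 512 == 0);
    assert(dnode_size <= DNODE_BLOCK_SIZE);

    per_block = DNODE_BLOCK_SIZE / dnode_size;
    blocks = (ndnodes + per_block - 1) / per_block;  /* round up */
    return (blocks + (needs_spill ? ndnodes : 0));
}
```

    With 32 dnodes of 512 bytes that all spill, the result is 33; with 32
    dnodes of 1024 bytes and no spill blocks, it is 2.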
    
    ZFS-on-Linux systems that make heavy use of extended attributes would
    benefit from this feature. In particular, ZFS-on-Linux supports the
    xattr=sa dataset property which allows file extended attribute data
    to be stored in the dnode bonus buffer as an alternative to the
    traditional directory-based format. Workloads such as SELinux and the
    Lustre distributed filesystem often store enough xattr data to force
    spill blocks when xattr=sa is in effect. Large dnodes may therefore
    provide a performance benefit to such systems.
    
    Other use cases that may benefit from this feature include files with
    large ACLs and symbolic links with long target names. Furthermore,
    this feature may be desirable on other platforms in case future
    applications or features are developed that could make use of a
    larger bonus buffer area.
    
    Implementation
    --------------
    
    The size of a dnode may be a multiple of 512 bytes up to the size of
    a dnode block (currently 16384 bytes). A dn_extra_slots field was
    added to the current on-disk dnode_phys_t structure to describe the
    size of the physical dnode on disk. The 8 bits for this field were
    taken from the zero filled dn_pad2 field. The field represents how
    many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
    This convention results in a value of 0 for 512 byte dnodes which
    preserves on-disk format compatibility with older software.
    
    Similarly, the in-memory dnode_t structure has a new dn_num_slots field
    to represent the total number of dnode_phys_t slots consumed on disk.
    Thus dn->dn_num_slots is 1 greater than the corresponding
    dnp->dn_extra_slots. This difference in convention was adopted
    because, unlike on-disk structures, backward compatibility is not a
    concern for in-memory objects, so we used a more natural way to
    represent size for a dnode_t.
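
    The relationship between the two conventions can be written down as
    follows.  The helper names are hypothetical; only the "extra slots
    plus one" rule and the 512-byte slot size come from the description
    above.

```c
#include <assert.h>
#include <stdint.h>

#define DNODE_SLOT_SIZE  512u  /* one dnode_phys_t slot */
#define DNODE_MAX_SLOTS  32u   /* 16384 / 512 */

/* In-memory convention: total slots is the on-disk extra count plus one. */
unsigned
num_slots_from_extra(uint8_t extra_slots)
{
    return ((unsigned)extra_slots + 1u);
}

/* On-disk convention: a legacy 512-byte dnode is stored as 0. */
uint8_t
extra_from_num_slots(unsigned num_slots)
{
    assert(num_slots >= 1 && num_slots <= DNODE_MAX_SLOTS);
    return ((uint8_t)(num_slots - 1));
}

/* Physical size in bytes of a dnode with the given on-disk extra count. */
unsigned
dnode_bytes(uint8_t extra_slots)
{
    return (num_slots_from_extra(extra_slots) * DNODE_SLOT_SIZE);
}
```

    A zero-filled field therefore decodes as a single 512-byte slot,
    which is why older on-disk dnodes remain valid.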
    
    The default size for newly created dnodes is determined by the value of
    a new "dnodesize" dataset property. By default the property is set to
    "legacy" which is compatible with older software. Setting the property
    to "auto" will allow the filesystem to choose the most suitable dnode
    size. Currently this just sets the default dnode size to 1k, but future
    code improvements could dynamically choose a size based on observed
    workload patterns. Dnodes of varying sizes can coexist within the same
    dataset and even within the same dnode block. For example, to enable
    automatically-sized dnodes, run
    
     # zfs set dnodesize=auto tank/fish
    
    The user can also specify literal values for the dnodesize property.
    These are currently limited to powers of two from 1k to 16k. The
    power-of-2 limitation is only for simplicity of the user interface.
    Internally the implementation can handle any multiple of 512 up to 16k,
    and consumers of the DMU API can specify any legal dnode value.
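
    The two sets of limits can be expressed as simple predicates.  This
    is only a sketch with hypothetical function names; it covers literal
    byte values and ignores the "legacy" and "auto" settings.

```c
/* Sizes the implementation can handle: any multiple of 512 up to 16k. */
int
dnodesize_internal_ok(unsigned bytes)
{
    return (bytes >= 512 && bytes <= 16384 && bytes % 512 == 0);
}

/* Literal property values: powers of two from 1k to 16k. */
int
dnodesize_property_ok(unsigned bytes)
{
    return (bytes >= 1024 && bytes <= 16384 &&
        (bytes & (bytes - 1)) == 0);
}
```

    For example, 1536 passes the internal check but is rejected as a
    property value because it is not a power of two.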
    
    The size of a new dnode is determined at object allocation time and
    stored as a new field in the znode in-memory structure. New DMU
    interfaces are added to allow the consumer to specify the dnode size
    that a newly allocated object should use. Existing interfaces are
    unchanged to avoid having to update every call site and to preserve
    compatibility with external consumers such as Lustre. The new
    interface names are given below. The versions of these functions that
    don't take a dnodesize parameter now just call the _dnsize() versions
    with a dnodesize of 0, which means use the legacy dnode size.
    
    New DMU interfaces:
      dmu_object_alloc_dnsize()
      dmu_object_claim_dnsize()
      dmu_object_reclaim_dnsize()
    
    New ZAP interfaces:
      zap_create_dnsize()
      zap_create_norm_dnsize()
      zap_create_flags_dnsize()
      zap_create_claim_norm_dnsize()
      zap_create_link_dnsize()
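
    The wrapper pattern described above looks roughly like this.  The
    toy_* functions are hypothetical stand-ins with simplified
    signatures; the real DMU and ZAP interfaces take additional
    arguments that are not shown here.

```c
#define LEGACY_DNODE_SIZE 512u

/*
 * Stand-in for a *_dnsize() interface: returns the dnode size the new
 * object would get.  A dnodesize of 0 means "use the legacy size".
 */
unsigned
toy_object_alloc_dnsize(unsigned dnodesize)
{
    if (dnodesize == 0)
        dnodesize = LEGACY_DNODE_SIZE;
    return (dnodesize);
}

/* Stand-in for the unchanged legacy interface: it simply passes 0. */
unsigned
toy_object_alloc(void)
{
    return (toy_object_alloc_dnsize(0));
}
```

    Existing callers keep compiling and keep getting 512-byte dnodes,
    while new callers opt in by calling the _dnsize() variant.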
    
    The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
    spa_maxdnodesize() function should be used to determine the maximum
    bonus length for a pool.
    
    These are a few noteworthy changes to key functions:
    
    * The prototype for dnode_hold_impl() now takes a "slots" parameter.
      When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
      ensure the hole at the specified object offset is large enough to
      hold the dnode being created. The slots parameter is also used
      to ensure a dnode does not span multiple dnode blocks. In both of
      these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
      these failure cases are only possible when using DNODE_MUST_BE_FREE.
    
      If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
      dnode_hold_impl() will check if the requested dnode is already
      consumed as an extra dnode slot by a large dnode, in which case
      it returns ENOENT.
    
    * The function dmu_object_alloc() advances to the next dnode block
      if dnode_hold_impl() returns an error for a requested object.
      This is because the beginning of the next dnode block is the only
      location it can safely assume to either be a hole or a valid
      starting point for a dnode.
    
    * dnode_next_offset_level() and other functions that iterate
      through dnode blocks may no longer use a simple array indexing
      scheme. These now use the current dnode's dn_num_slots field to
      advance to the next dnode in the block. This is to ensure we
      properly skip the current dnode's bonus area and don't interpret it
      as a valid dnode.
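
    The slot-based walk in the last point can be sketched as below.  The
    toy_slot_t type and count_dnodes() are hypothetical and model only
    what the walk needs; they are not the real dnode_phys_t layout.

```c
#include <stddef.h>
#include <stdint.h>

#define SLOTS_PER_BLOCK 32  /* 16384-byte block / 512-byte slots */

/* Toy stand-in for one 512-byte slot of a dnode block. */
typedef struct {
    uint8_t in_use;       /* nonzero if the slot looks like a dnode */
    uint8_t extra_slots;  /* additional slots the dnode occupies */
} toy_slot_t;

/*
 * Count the dnodes in a block.  After visiting a dnode, advance by its
 * total slot count so that the interior (bonus) slots of a large dnode
 * are never interpreted as dnodes of their own.
 */
unsigned
count_dnodes(const toy_slot_t block[SLOTS_PER_BLOCK])
{
    unsigned count = 0;
    size_t i = 0;

    while (i < SLOTS_PER_BLOCK) {
        if (!block[i].in_use) {
            i++;  /* hole: move to the next slot */
            continue;
        }
        count++;
        i += (size_t)block[i].extra_slots + 1;
    }
    return (count);
}
```

    A plain `i++` loop would miscount whenever the bonus area of a large
    dnode happens to contain bytes that resemble a valid dnode.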
    
    zdb
    ---
    The zdb command was updated to display a dnode's size under the
    "dnsize" column when the object is dumped.
    
    For ZIL create log records, zdb will now display the slot count for
    the object.
    
    ztest
    -----
    Ztest chooses a random dnodesize for every newly created object. The
    random distribution is more heavily weighted toward small dnodes to
    better simulate real-world datasets.
    
    Unused bonus buffer space is filled with non-zero values computed from
    the object number, dataset id, offset, and generation number.  This
    helps ensure that the dnode traversal code properly skips the interior
    regions of large dnodes, and that these interior regions are not
    overwritten by data belonging to other dnodes. A new test visits each
    object in a dataset. It verifies that the actual dnode size matches what
    was stored in the ztest block tag when it was created. It also verifies
    that the unused bonus buffer space is filled with the expected data
    patterns.
    
    ZFS Test Suite
    --------------
    Added six new large dnode-specific tests, and integrated the dnodesize
    property into existing tests for zfs allow and send/recv.
    
    Send/Receive
    ------------
    ZFS send streams for datasets containing large dnodes cannot be received
    on pools that don't support the large_dnode feature. A send stream with
    large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
    unrecognized by an incompatible receiving pool so that the zfs receive
    will fail gracefully.
    
    While not implemented here, it may be possible to generate a
    backward-compatible send stream from a dataset containing large
    dnodes. The implementation may be tricky, however, because the send
    object record for a large dnode would need to be resized to a 512
    byte dnode, possibly kicking in a spill block in the process. This
    means we would need to construct a new SA layout and possibly
    register it in the SA layout object. The SA layout is normally just
    sent as an ordinary object record. But if we are constructing new
    layouts while generating the send stream we'd have to build the SA
    layout object dynamically and send it at the end of the stream.
    
    For sending and receiving between pools that do support large dnodes,
    the drr_object send record type is extended with a new field to store
    the dnode slot count. This field was repurposed from unused padding
    in the structure.
    
    ZIL Replay
    ----------
    The dnode slot count is stored in the uppermost 8 bits of the lr_foid
    field. The bits were unused as the object id is currently capped at
    48 bits.
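
    A minimal sketch of that packing is shown below.  The helper names
    are hypothetical, and the sketch stores the slot count as-is in the
    top byte; the exact value encoding used by the real code is not
    stated here.

```c
#include <assert.h>
#include <stdint.h>

#define FOID_OBJ_BITS     48  /* object ids are capped at 48 bits */
#define FOID_SLOTS_SHIFT  56  /* uppermost 8 bits of the 64-bit field */

/* Pack an object id and a slot count into one 64-bit value. */
uint64_t
foid_pack(uint64_t obj, uint8_t slots)
{
    assert(obj < (1ULL << FOID_OBJ_BITS));
    return (obj | ((uint64_t)slots << FOID_SLOTS_SHIFT));
}

/* Recover the object id (low 48 bits). */
uint64_t
foid_obj(uint64_t foid)
{
    return (foid & ((1ULL << FOID_OBJ_BITS) - 1));
}

/* Recover the slot count (top 8 bits). */
uint8_t
foid_slots(uint64_t foid)
{
    return ((uint8_t)(foid >> FOID_SLOTS_SHIFT));
}
```

    Log records written without a slot count have zeros in the top byte,
    so the object id they carry is unchanged by this scheme.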
    
    Resizing Dnodes
    ---------------
    It should be possible to resize a dnode when it is dirtied if the
    current dnodesize dataset property differs from the dnode's size, but
    this functionality is not currently implemented. Clearly a dnode can
    only grow if there are sufficient contiguous unused slots in the
    dnode block, but it should always be possible to shrink a dnode.
    Growing dnodes may be useful to reduce fragmentation in a pool with
    many spill blocks in use. Shrinking dnodes may be useful to allow
    sending a dataset to a pool that doesn't support the large_dnode
    feature.
    
    Feature Reference Counting
    --------------------------
    The reference count for the large_dnode pool feature tracks the
    number of datasets that have ever contained a dnode of size larger
    than 512 bytes. The first time a large dnode is created in a dataset
    the dataset is converted to an extensible dataset. This is a one-way
    operation and the only way to decrement the feature count is to
    destroy the dataset, even if the dataset no longer contains any large
    dnodes. The complexity of reference counting on a per-dnode basis was
    too high, so we chose to track it on a per-dataset basis similarly to
    the large_block feature.
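
    The per-dataset counting rule can be modeled with a small sketch.
    All types and functions here are hypothetical; they only mirror the
    behavior described above, not the feature-flag implementation.

```c
typedef struct {
    unsigned large_dnode_refcount;  /* datasets that ever held a large dnode */
} toy_pool_t;

typedef struct {
    int ever_had_large_dnode;  /* one-way marker, never cleared while alive */
} toy_dataset_t;

/* Called whenever a dnode of the given size is created in a dataset. */
void
toy_dnode_created(toy_pool_t *pool, toy_dataset_t *ds, unsigned size)
{
    if (size > 512 && !ds->ever_had_large_dnode) {
        ds->ever_had_large_dnode = 1;
        pool->large_dnode_refcount++;
    }
}

/* Destroying the dataset is the only way to drop its reference. */
void
toy_dataset_destroyed(toy_pool_t *pool, toy_dataset_t *ds)
{
    if (ds->ever_had_large_dnode) {
        ds->ever_had_large_dnode = 0;
        pool->large_dnode_refcount--;
    }
}
```

    Creating many large dnodes in one dataset bumps the count once, and
    deleting those dnodes later does not lower it.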
    
    Signed-off-by: Ned Bass <bass6@llnl.gov>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Closes #3542
    nedbass committed with Mar 16, 2016
  3. @nedbass

    Backfill metadnode more intelligently

    Only attempt to backfill lower metadnode object numbers if at least
    4096 objects have been freed since the last rescan, and at most once
    per transaction group. This avoids a pathology in dmu_object_alloc()
    that caused O(N^2) behavior for create-heavy workloads and
    substantially improves object creation rates.  As summarized by
    @mahrens in #4636:
    
    "Normally, the object allocator simply checks to see if the next
    object is available. The slow calls happened when dmu_object_alloc()
    checks to see if it can backfill lower object numbers. This happens
    every time we move on to a new L1 indirect block (i.e. every 32 *
    128 = 4096 objects).  When re-checking lower object numbers, we use
    the on-disk fill count (blkptr_t:blk_fill) to quickly skip over
    indirect blocks that don’t have enough free dnodes (defined as an L2
    with at least 393,216 of 524,288 dnodes free). Therefore, we may
    find that a block of dnodes has a low (or zero) fill count, and yet
    we can’t allocate any of its dnodes, because they've been allocated
    in memory but not yet written to disk. In this case we have to hold
    each of the dnodes and then notice that it has been allocated in
    memory.
    
    The end result is that allocating N objects in the same TXG can
    require CPU usage proportional to N^2."
    
    Add a tunable dmu_rescan_dnode_threshold to define the number of
    objects that must be freed before a rescan is performed. Don't bother
    to export this as a module option because testing doesn't show a
    compelling reason to change it. The vast majority of the performance
    gain comes from limiting the rescan to at most once per TXG.
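
    The two conditions can be sketched as a small gate function.  The
    structure and function names are hypothetical, and the threshold is
    taken to be the 4096 objects mentioned above.

```c
#include <stdint.h>

#define RESCAN_THRESHOLD 4096u  /* objects freed before a rescan is allowed */

typedef struct {
    uint64_t freed_since_rescan;  /* objects freed since the last rescan */
    uint64_t last_rescan_txg;     /* txg in which the last rescan ran */
} toy_alloc_state_t;

/*
 * Decide whether the allocator may go back and scan lower object
 * numbers.  Returns 1 (and records the rescan) only if enough objects
 * were freed and no rescan has happened yet in this txg.
 */
int
toy_should_rescan(toy_alloc_state_t *st, uint64_t txg)
{
    if (st->freed_since_rescan < RESCAN_THRESHOLD)
        return (0);
    if (st->last_rescan_txg == txg)
        return (0);

    st->freed_since_rescan = 0;
    st->last_rescan_txg = txg;
    return (1);
}
```

    Every other allocation in the same txg skips the rescan and simply
    tries the next object number.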
    
    Signed-off-by: Ned Bass <bass6@llnl.gov>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    nedbass committed with May 17, 2016
  4. @nedbass

    xattrtest: allow verify with -R and other improvements

    - Use a fixed buffer of random bytes when random xattr values are in
      effect.  This eliminates the potential performance bottleneck of
      reading from /dev/urandom for each file. This also allows us to
      verify xattrs in random value mode.
    
    - Show the rate of operations per second in addition to elapsed time
      for each phase of the test. This may be useful for benchmarking.
    
    - Set default xattr size to 6 so that verify doesn't fail if user
      doesn't specify a size. We need at least six bytes to store the
      leading "size=X" string that is used for verification.
    
    - Allow user to execute just one phase of the test. Acceptable
      values for -o and their meanings are:
    
       1 - run the create phase
       2 - run the setxattr phase
       3 - run the getxattr phase
       4 - run the unlink phase
    
    Signed-off-by: Ned Bass <bass6@llnl.gov>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    nedbass committed with Apr 10, 2015
  5. FreeBSD rS271776 - Persist vdev_resilver_txg changes

    Persist vdev_resilver_txg changes to avoid panic caused by validation
    vs a vdev_resilver_txg value from a previous resilver.
    
    Authored-by: smh <smh@FreeBSD.org>
    Ported-by: Chris Dunlop <chris@onthe.net.au>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    
    OpenZFS-issue: https://www.illumos.org/issues/5154
    FreeBSD-issue: https://reviews.freebsd.org/rS271776
    FreeBSD-commit: freebsd/freebsd@c3c60bf
    Closes #4790
    smh committed with Jun 24, 2016
  6. OpenZFS 6878 - Add scrub completion info to "zpool history"

    Reviewed by: Matthew Ahrens <mahrens@delphix.com>
    Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
    Approved by: Dan McDonald <danmcd@omniti.com>
    Authored by: Nav Ravindranath <nav@delphix.com>
    Ported-by: Chris Dunlop <chris@onthe.net.au>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    
    OpenZFS-issue: https://www.illumos.org/issues/6878
    OpenZFS-commit: openzfs/openzfs@1825bc5
    Closes #4787
    Nav Ravindranath committed with Jun 23, 2016
  7. Revert "Add a test case for dmu_free_long_range() to ztest"

    This reverts commit d0de2e8 which
    introduced a new test case to ztest which is failing occasionally
    during automated testing.  The change is being reverted until
    the issue can be fully investigated.
    
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Issue #4754
    committed Jun 24, 2016
Commits on Jun 21, 2016
  1. @bprotopopov

    Add a test case for dmu_free_long_range() to ztest

    Signed-off-by: Boris Protopopov <bprotopopov@actifio.com>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Closes #4754
    bprotopopov committed with Dec 16, 2015
  2. @pcd1193182

    OpenZFS 6513 - partially filled holes lose birth time

    Reviewed by: Matthew Ahrens <mahrens@delphix.com>
    Reviewed by: George Wilson <george.wilson@delphix.com>
    Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
    Approved by: Richard Lowe <richlowe@richlowe.net>
    Ported by: Boris Protopopov <bprotopopov@actifio.com>
    Signed-off-by: Boris Protopopov <bprotopopov@actifio.com>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    
    OpenZFS-issue: https://www.illumos.org/issues/6513
    OpenZFS-commit: openzfs/openzfs@8df0bcf
    
    If a ZFS object contains a hole at level one, and then a data block is
    created at level 0 underneath that l1 block, l0 holes will be created.
    However, these l0 holes do not have the birth time property set; as a
    result, incremental sends will not send those holes.
    
    Fix is to modify the dbuf_read code to fill in birth time data.
    pcd1193182 committed with May 15, 2016
  3. @tuxoko

    Fix NFS credential

    Commit f74b821 caused a regression where a file created through NFS is
    always owned by root. This is because the patch enabled the KSID code in
    zfs_acl_ids_create, which uses the euid and egid of the current process.
    However, on Linux, we should use fsuid and fsgid for file operations,
    which is the original behaviour. So we revert this part of the code.
    
    The patch also enabled secpolicy_vnode_*. Since they are also used in file
    operations, we change them to use fsuid and fsgid.
    
    Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Closes #4772
    Closes #4758
    tuxoko committed with Jun 17, 2016
  4. @ironMann

    SIMD implementation of vdev_raidz generate and reconstruct routines

    This is a new implementation of RAIDZ1/2/3 routines using x86_64
    scalar, SSE, and AVX2 instruction sets. Included are 3 parity
    generation routines (P, PQ, and PQR) and 7 reconstruction routines,
    for all RAIDZ levels. On module load, a quick benchmark of supported
    routines will select the fastest for each operation and they will
    be used at runtime. Original implementation is still present and
    can be selected via module parameter.
    
    Patch contains:
    - specialized gen/rec routines for all RAIDZ levels,
    - new scalar raidz implementation (unrolled),
    - two x86_64 SIMD implementations (SSE and AVX2 instruction sets),
    - fastest routines selected on module load (benchmark).
    - cmd/raidz_test - verify and benchmark all implementations
    - added raidz_test to the ZFS Test Suite
    
    New zfs module parameters:
    - zfs_vdev_raidz_impl (str): selects the implementation to use. On
      module load, the parameter will only accept first 3 options, and
      the other implementations can be set once module is finished
      loading. Possible values for this option are:
        "fastest" - use the fastest math available
        "original" - use the original raidz code
        "scalar" - new scalar impl
        "sse" - new SSE impl if available
        "avx2" - new AVX2 impl if available
    
    See contents of `/sys/module/zfs/parameters/zfs_vdev_raidz_impl` to
    get the list of supported values. If an implementation is not supported
    on the system, it will not be shown. Currently selected option is
    enclosed in `[]`.
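
    A program that has already read the parameter file into a string
    could pick out the active implementation as sketched below.  The
    helper selected_impl() is hypothetical, and the sketch assumes the
    file holds the option names on one line with exactly one of them in
    brackets.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy the bracketed option from list into out.
 * Returns 0 on success, or -1 if no bracketed option is found or the
 * output buffer is too small.
 */
int
selected_impl(const char *list, char *out, size_t outlen)
{
    const char *lb = strchr(list, '[');
    const char *rb = (lb != NULL) ? strchr(lb, ']') : NULL;
    size_t len;

    if (rb == NULL)
        return (-1);

    len = (size_t)(rb - lb) - 1;  /* characters between '[' and ']' */
    if (len + 1 > outlen)
        return (-1);

    memcpy(out, lb + 1, len);
    out[len] = '\0';
    return (0);
}
```

    Given the string "fastest original [scalar] sse avx2", the function
    stores "scalar" in the output buffer.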
    
    Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Closes #4328
    ironMann committed with Apr 25, 2016
Commits on Jun 17, 2016
  1. @dweeezil

    Linux 4.6 compat: Fall back to d_prune_aliases() if necessary

    As of 4.6, the icache and dcache LRUs are memcg aware insofar as the
    kernel's per-superblock shrinker is concerned.  The effect is that dcache
    or icache entries added by a task in a non-root memcg won't be scanned
    by the shrinker in the context of the root (or NULL) memcg.  This defeats
    the attempts by zfs_sb_prune() to unpin buffers and can allow metadata to
    grow uncontrollably.  This patch reverts to the d_prune_aliases() method
    in case the kernel's per-superblock shrinker is not able to free anything.
    
    Signed-off-by: Tim Chase <tim@chase2k.com>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
    Closes: #4726
    dweeezil committed with Jun 16, 2016
Commits on Jun 16, 2016
  1. Remove libzfs_graph.c

    The libzfs_graph.c source file should have been removed in 330d06f,
    it is entirely unused.
    
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Closes #4766
    committed Jun 15, 2016
Commits on Jun 7, 2016
  1. Add `zfs allow` and `zfs unallow` support

    ZFS allows for specific permissions to be delegated to normal users
    with the `zfs allow` and `zfs unallow` commands.  In addition, non-
    privileged users should be able to run all of the following commands:
    
      * zpool [list | iostat | status | get]
      * zfs [list | get]
    
    Historically this functionality was not available on Linux.  In order
    to add it the secpolicy_* functions needed to be implemented and mapped
    to the equivalent Linux capability.  Only then could the permissions on
    `/dev/zfs` be relaxed and the internal ZFS permission checks used.
    
    Even with this change some limitations remain.  Under Linux only the
    root user is allowed to modify the namespace (unless it's a private
    namespace).  This means the mount, mountpoint, canmount, unmount,
    and remount delegations cannot be supported with the existing code.  It
    may be possible to add this functionality in the future.
    
    This functionality was validated with the cli_user and delegation test
    cases from the ZFS Test Suite.  These tests exhaustively verify each
    of the supported permissions which can be delegated and ensure only
    an authorized user can perform it.
    
    Two minor bug fixes were required for test-running.py.  First, the
    Timer() object cannot be safely created in a `try:` block when there
    is an unconditional `finally` block which references it.  Second,
    when running as a normal user also check for scripts using both the
    .ksh and .sh suffixes.
    
    Finally, existing users who are simulating delegations by setting
    group permissions on the /dev/zfs device should revert that
    customization when updating to a version with this change.
    
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Tony Hutter <hutter2@llnl.gov>
    Closes #362 
    Closes #434 
    Closes #4100
    Closes #4394 
    Closes #4410 
    Closes #4487
    committed Jun 7, 2016
Commits on Jun 6, 2016
  1. @ColinIanKing

    Fix minor spelling mistakes

    Trivial spelling mistake fix in error message text.
    
    * Fix spelling mistake "adminstrator" -> "administrator"
    * Fix spelling mistake "specificed" -> "specified"
    * Fix spelling mistake "interperted" -> "interpreted"
    
    Signed-off-by: Colin Ian King <colin.king@canonical.com>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Closes #4728
    ColinIanKing committed with Jun 6, 2016
Commits on Jun 3, 2016
  1. Fix cstyle.pl warnings

    As of perl v5.22.1 the following warnings are generated:
    
    * Redundant argument in printf at scripts/cstyle.pl line 194
    
    * Unescaped left brace in regex is deprecated, passed through
      in regex; marked by <-- HERE in m/\S{ <-- HERE / at
      scripts/cstyle.pl line 608.
    
    They have been addressed by escaping the left braces and by
    providing the correct number of arguments to printf based on
    the fmt specifier set by the verbose option.
    
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Closes #4723
    committed Jun 3, 2016
Commits on Jun 2, 2016
  1. Implementation of AVX2 optimized Fletcher-4

    New functionality:
    - Preserves existing scalar implementation.
    - Adds AVX2 optimized Fletcher-4 computation.
    - Fastest routines selected on module load (benchmark).
    - Test case for Fletcher-4 added to ztest.
    
    New zcommon module parameters:
    -  zfs_fletcher_4_impl (str): selects the implementation to use.
        "fastest" - use the fastest version available
        "cycle"   - cycle through all available impls for ztest
        "scalar"  - use the original version
        "avx2"    - new AVX2 implementation if available
    
    Performance comparison (Intel i7 CPU, 1MB data buffers):
    - Scalar:  4216 MB/s
    - AVX2:   14499 MB/s
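    
    For context, the scalar Fletcher-4 loop consumes one 32-bit word per
    iteration and folds it into four 64-bit running sums.  The sketch
    below is a stand-alone illustration, not the code from this commit;
    the names toy_fletcher4_t and toy_fletcher4_scalar() are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the four 64-bit running sums. */
typedef struct toy_fletcher4 {
	uint64_t a, b, c, d;
} toy_fletcher4_t;

/*
 * Scalar Fletcher-4 over an array of native-endian 32-bit words.
 * Each word is added to `a`, and each sum is then folded into the next.
 */
static toy_fletcher4_t
toy_fletcher4_scalar(const uint32_t *words, size_t nwords)
{
	toy_fletcher4_t ck = { 0, 0, 0, 0 };

	for (size_t i = 0; i < nwords; i++) {
		ck.a += words[i];
		ck.b += ck.a;
		ck.c += ck.b;
		ck.d += ck.c;
	}
	return (ck);
}
```

    A vectorized variant would keep several such sets of sums in SIMD
    lanes and combine them at the end; that step is not shown here.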
    
    See the contents of `/sys/module/zcommon/parameters/zfs_fletcher_4_impl`
    for the list of supported values.  Implementations not supported on
    the system are not shown.  The currently selected option is enclosed
    in `[]`.
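    
    A minimal sketch of picking the selected option out of that list,
    assuming the file contents have already been read into a string.
    The helper name toy_selected_impl() and the sample layout used in
    the comment are assumptions, not part of this commit.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy the token enclosed in '[' and ']' from `list` into `out`.
 * Example input (assumed layout): "[fastest] scalar avx2".
 * Returns 0 on success, -1 if there is no bracketed token or it
 * does not fit in `out`.
 */
static int
toy_selected_impl(const char *list, char *out, size_t outlen)
{
	const char *lb = strchr(list, '[');
	const char *rb = (lb != NULL) ? strchr(lb, ']') : NULL;
	size_t len;

	if (rb == NULL)
		return (-1);

	len = (size_t)(rb - lb - 1);
	if (len + 1 > outlen)
		return (-1);

	memcpy(out, lb + 1, len);
	out[len] = '\0';
	return (0);
}
```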
    
    Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com>
    Signed-off-by: Andreas Dilger <andreas.dilger@intel.com>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Closes #4330
    Jinshan Xiong committed with Dec 9, 2015
  2. Linux 4.7 compat: handler->set() takes both dentry and inode

    Counterpart to fd4c7b7, the same approach was taken to resolve
    the compatibility issue.
    
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
    Closes #4717 
    Issue #4665
    committed Jun 1, 2016
Commits on May 31, 2016
  1. @tuxoko

    Fix memleak in vdev_config_generate_stats

    fnvlist_add_nvlist will copy the contents of nvx, so we need to
    free it here.
    
    unreferenced object 0xffff8800a6934e80 (size 64):
      comm "zpool", pid 3398, jiffies 4295007406 (age 214.180s)
      hex dump (first 32 bytes):
        60 06 c2 73 00 88 ff ff 00 7c 8c 73 00 88 ff ff  `..s.....|.s....
        00 00 00 00 00 00 00 00 40 b0 70 c0 ff ff ff ff  ........@.p.....
      backtrace:
        [<ffffffff81810c4e>] kmemleak_alloc+0x4e/0xb0
        [<ffffffff811fac7d>] __kmalloc_node+0x17d/0x310
        [<ffffffffc065528c>] spl_kmem_alloc_impl+0xac/0x180 [spl]
        [<ffffffffc0657379>] spl_vmem_alloc+0x19/0x20 [spl]
        [<ffffffffc07056cf>] nv_alloc_sleep_spl+0x1f/0x30 [znvpair]
        [<ffffffffc07006b7>] nvlist_xalloc.part.13+0x27/0xc0 [znvpair]
        [<ffffffffc07007ad>] nvlist_alloc+0x3d/0x40 [znvpair]
        [<ffffffffc0703abc>] fnvlist_alloc+0x2c/0x80 [znvpair]
        [<ffffffffc07b1783>] vdev_config_generate_stats+0x83/0x370 [zfs]
        [<ffffffffc07b1f53>] vdev_config_generate+0x4e3/0x650 [zfs]
        [<ffffffffc07996db>] spa_config_generate+0x20b/0x4b0 [zfs]
        [<ffffffffc0794f64>] spa_tryimport+0xc4/0x430 [zfs]
        [<ffffffffc07d11d8>] zfs_ioc_pool_tryimport+0x68/0x110 [zfs]
        [<ffffffffc07d4fc6>] zfsdev_ioctl+0x646/0x7a0 [zfs]
        [<ffffffff81232e31>] do_vfs_ioctl+0xa1/0x5b0
        [<ffffffff812333b9>] SyS_ioctl+0x79/0x90
    
    Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
    Signed-off-by: Tony Hutter <hutter2@llnl.gov>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Closes #4707
    Issue #4708
    tuxoko committed with May 27, 2016
  2. @tuxoko

    Fix memleak in zpl_parse_options

    strsep() will advance tmp_mntopts, and will set it to NULL on the last
    iteration.  As a result strfree(tmp_mntopts) does not free anything.
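    
    The general pattern can be sketched in user space as follows: keep
    the pointer returned by the allocator and hand strsep() a separate
    cursor.  This is an illustration using libc strdup()/free() rather
    than the kernel helpers; toy_count_options() and the "," delimiter
    are assumptions.

```c
#ifndef _DEFAULT_SOURCE
#define	_DEFAULT_SOURCE	/* expose strsep() and strdup() in glibc */
#endif
#include <stdlib.h>
#include <string.h>

/*
 * Count the non-empty comma-separated options in `mntopts`.
 * strsep() advances `cursor` and leaves it NULL after the last token,
 * so the allocation is remembered in `copy` and freed from there.
 * Returns -1 if the copy cannot be allocated.
 */
static int
toy_count_options(const char *mntopts)
{
	char *copy = strdup(mntopts);
	char *cursor = copy;
	char *tok;
	int count = 0;

	if (copy == NULL)
		return (-1);

	while ((tok = strsep(&cursor, ",")) != NULL) {
		if (*tok != '\0')
			count++;
	}

	/* cursor is NULL here, so free(cursor) would release nothing. */
	free(copy);
	return (count);
}
```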
    
    unreferenced object 0xffff8800883976c0 (size 64):
      comm "mount.zfs", pid 3361, jiffies 4294931877 (age 1482.408s)
      hex dump (first 32 bytes):
        72 77 00 73 74 72 69 63 74 61 74 69 6d 65 00 7a  rw.strictatime.z
        66 73 75 74 69 6c 00 6d 6e 74 70 6f 69 6e 74 3d  fsutil.mntpoint=
      backtrace:
        [<ffffffff81810c4e>] kmemleak_alloc+0x4e/0xb0
        [<ffffffff811f9cac>] __kmalloc+0x16c/0x250
        [<ffffffffc065ce9b>] strdup+0x3b/0x60 [spl]
        [<ffffffffc080fad6>] zpl_parse_options+0x56/0x300 [zfs]
        [<ffffffffc080fe46>] zpl_mount+0x36/0x80 [zfs]
        [<ffffffff81222dc8>] mount_fs+0x38/0x160
        [<ffffffff81240097>] vfs_kern_mount+0x67/0x110
        [<ffffffff812428e0>] do_mount+0x250/0xe20
        [<ffffffff812437d5>] SyS_mount+0x95/0xe0
        [<ffffffff8181aff6>] entry_SYSCALL_64_fastpath+0x1e/0xa8
        [<ffffffffffffffff>] 0xffffffffffffffff
    
    Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
    Signed-off-by: Tony Hutter <hutter2@llnl.gov>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Closes #4706
    Issue #4708
    tuxoko committed with May 27, 2016
  3. @tuxoko

    Fix out-of-bound access in zfs_fillpage

    The original code performs an out-of-bounds access on pl[] during the
    last iteration.
    
     ==================================================================
     BUG: KASAN: stack-out-of-bounds in zfs_getpage+0x14c/0x2d0 [zfs]
     Read of size 8 by task tmpfile/7850
     page:ffffea00017c6dc0 count:0 mapcount:0 mapping:          (null) index:0x0
     flags: 0xffff8000000000()
     page dumped because: kasan: bad access detected
     CPU: 3 PID: 7850 Comm: tmpfile Tainted: G           OE   4.6.0+ #3
      ffff88005f1b7678 0000000006dbe035 ffff88005f1b7508 ffffffff81635618
      ffff88005f1b7678 ffff88005f1b75a0 ffff88005f1b7590 ffffffff81313ee8
      ffffea0001ae8dd0 ffff88005f1b7670 0000000000000246 0000000041b58ab3
     Call Trace:
      [<ffffffff81635618>] dump_stack+0x63/0x8b
      [<ffffffff81313ee8>] kasan_report_error+0x528/0x560
      [<ffffffff81278f20>] ? filemap_map_pages+0x5f0/0x5f0
      [<ffffffff813144b8>] kasan_report+0x58/0x60
      [<ffffffffc12250dc>] ? zfs_getpage+0x14c/0x2d0 [zfs]
      [<ffffffff81312e4e>] __asan_load8+0x5e/0x70
      [<ffffffffc12250dc>] zfs_getpage+0x14c/0x2d0 [zfs]
      [<ffffffffc1252131>] zpl_readpage+0xd1/0x180 [zfs]
    
      [<ffffffff81353c3a>] SyS_execve+0x3a/0x50
      [<ffffffff810058ef>] do_syscall_64+0xef/0x180
      [<ffffffff81d0ee25>] entry_SYSCALL64_slow_path+0x25/0x25
     Memory state around the buggy address:
      ffff88005f1b7500: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      ffff88005f1b7580: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
     >ffff88005f1b7600: 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00 f4
                                                                     ^
      ffff88005f1b7680: f4 f4 f3 f3 f3 f3 00 00 00 00 00 00 00 00 00 00
      ffff88005f1b7700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
     ==================================================================
    
    Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
    Signed-off-by: Tony Hutter <hutter2@llnl.gov>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Closes #4705
    Issue #4708
    tuxoko committed with May 27, 2016
  4. Add isa_defs for MIPS

    GCC for MIPS defines _LP64 for 64-bit targets, but does not define
    _ILP32 for 32-bit targets.
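    
    One way to cover that gap is a preprocessor fallback along the lines
    below.  This is a sketch of the idea and may differ from the actual
    isa_defs.h change; toy_long_bits() is a hypothetical helper that only
    exists to show which data model was selected.

```c
/*
 * On MIPS, treat "neither _LP64 nor _ILP32 announced by the compiler"
 * as the 32-bit data model.
 */
#if defined(__mips__) && !defined(_LP64) && !defined(_ILP32)
#define	_ILP32
#endif

/* Report the width of `long` implied by the selected data model. */
static int
toy_long_bits(void)
{
#if defined(_LP64)
	return (64);
#else
	return (32);
#endif
}
```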
    
    Signed-off-by: YunQiang Su <syq@debian.org>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Closes #4712
    YunQiang Su committed with May 28, 2016
Commits on May 27, 2016
  1. @GeLiXin

    Fix self-healing IO prior to dsl_pool_init() completion

    Async writes triggered by a self-healing IO may be issued before the
    pool finishes the process of initialization.  This results in a NULL
    dereference of `spa->spa_dsl_pool` in vdev_queue_max_async_writes().
    
    George Wilson recommended addressing this issue by initializing the
    passed `dsl_pool_t **` prior to dmu_objset_open_impl().  Since the
    caller is passing the `spa->spa_dsl_pool` this has the effect of
    ensuring it's initialized.
    
    However, since this depends on the caller knowing they must pass
    the `spa->spa_dsl_pool` an additional NULL check was added to
    vdev_queue_max_async_writes().  This guards against any future
    restructuring of the code which might result in dsl_pool_init()
    being called differently.
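    
    The defensive check can be illustrated with stand-in types.  All
    names prefixed toy_ or TOY_ are hypothetical, and both the value
    returned for a NULL pool and the 50 percent threshold are
    assumptions made for this sketch.

```c
#include <stddef.h>

/* Hypothetical stand-ins for dsl_pool_t and spa_t. */
typedef struct toy_dsl_pool {
	int dirty_percent;	/* 0..100, share of dirty data pending */
} toy_dsl_pool_t;

typedef struct toy_spa {
	toy_dsl_pool_t *spa_dsl_pool;	/* NULL until pool init sets it */
} toy_spa_t;

enum { TOY_ASYNC_WRITE_MIN = 1, TOY_ASYNC_WRITE_MAX = 10 };

/* Pick an async write limit, tolerating a pool that is not set up yet. */
static int
toy_max_async_writes(const toy_spa_t *spa)
{
	const toy_dsl_pool_t *dp = spa->spa_dsl_pool;

	/* A self-healing write may arrive before the pointer is assigned. */
	if (dp == NULL)
		return (TOY_ASYNC_WRITE_MAX);

	return (dp->dirty_percent >= 50 ?
	    TOY_ASYNC_WRITE_MAX : TOY_ASYNC_WRITE_MIN);
}
```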
    
    Signed-off-by: GeLiXin <47034221@qq.com>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Closes #4652
    GeLiXin committed with May 21, 2016