branch: for-linus
Commits on Jun 30, 2015
  1. @idryomov

    rbd: use GFP_NOIO in rbd_obj_request_create()

    idryomov authored
    rbd_obj_request_create() is called on the main I/O path, so we need to
    use GFP_NOIO to make sure allocation doesn't blow back on us.  Not all
    callers need this, but I'm still hardcoding the flag inside rather than
    making it a parameter because a) this is going to stable, and b) those
    callers shouldn't really use rbd_obj_request_create() and will be fixed
    in the future.
    
    More memory allocation fixes will follow.
    
    Cc: stable@vger.kernel.org # 3.10+
    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    Reviewed-by: Alex Elder <elder@linaro.org>
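
    A minimal sketch of the change (editorial illustration, not part of
    the commit message; rbd_obj_request_cache is rbd's slab cache for
    these objects):

        /* On the main I/O path: GFP_NOIO keeps memory reclaim from
         * issuing I/O back to this very device and deadlocking. */
        obj_request = kmem_cache_zalloc(rbd_obj_request_cache, GFP_NOIO);
        if (!obj_request)
                return NULL;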
  2. @idryomov

    crush: fix a bug in tree bucket decode

    idryomov authored
    struct crush_bucket_tree::num_nodes is u8, so ceph_decode_8_safe()
    should be used.  -Wconversion catches this, but I guess it went
    unnoticed in all the noise it spews.  The actual problem (at least for
    common crushmaps) isn't the u32 -> u8 truncation though - it's the
    advancement by 4 bytes instead of 1 in the crushmap buffer.
    
    Fixes: http://tracker.ceph.com/issues/2759
    
    Cc: stable@vger.kernel.org
    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    Reviewed-by: Josh Durgin <jdurgin@redhat.com>
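
    A hedged sketch of the difference (illustrative, modeled on the
    ceph_decode_*_safe() macros in include/linux/ceph/decode.h):

        u8 num_nodes;

        /* Wrong: reads 4 bytes and advances the buffer cursor by 4. */
        ceph_decode_32_safe(p, end, num_nodes, bad);

        /* Right: num_nodes is u8 - read 1 byte, advance by 1. */
        ceph_decode_8_safe(p, end, num_nodes, bad);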
Commits on Jun 29, 2015
  1. @benoit-canet @idryomov

    libceph: Fix ceph_tcp_sendpage()'s 'more' boolean usage

    benoit-canet authored idryomov committed
    From struct ceph_msg_data_cursor in include/linux/ceph/messenger.h:
    
    bool    last_piece;     /* current is last piece */
    
    In ceph_msg_data_next():
    
    *last_piece = cursor->last_piece;
    
    A call to ceph_msg_data_next() is followed by:
    
    ret = ceph_tcp_sendpage(con->sock, page, page_offset,
                            length, last_piece);
    
    while ceph_tcp_sendpage() is:
    
    static int ceph_tcp_sendpage(struct socket *sock, struct page *page,
                                 int offset, size_t size, bool more)
    
    The logic is inverted: correct it.
    
    Signed-off-by: Benoît Canet <benoit.canet@nodalink.com>
    Reviewed-by: Alex Elder <elder@linaro.org>
    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
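
    Sketched, the corrected call simply negates the flag (illustrative;
    this mirrors the inversion described above):

        /* more must be the opposite of last_piece: keep MSG_MORE set
         * for every piece except the last one. */
        ret = ceph_tcp_sendpage(con->sock, page, page_offset,
                                length, !last_piece);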
Commits on Jun 25, 2015
  1. @benoit-canet @idryomov

    libceph: Remove spurious kunmap() of the zero page

    benoit-canet authored idryomov committed
    ceph_tcp_sendpage already does the work of mapping/unmapping
    the zero page if needed.
    
    Signed-off-by: Benoît Canet <benoit.canet@nodalink.com>
    Reviewed-by: Alex Elder <elder@linaro.org>
    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
  2. @idryomov

    rbd: queue_depth map option

    idryomov authored
    nr_requests (/sys/block/rbd<id>/queue/nr_requests) is pretty much
    irrelevant in the blk-mq case because each driver sets its own max
    depth that it can handle, and that's the number of tags that gets
    preallocated on setup.  Users can't increase the queue depth beyond
    that value by writing to nr_requests.
    
    For rbd we are happy with the default BLKDEV_MAX_RQ (128) for most
    cases but we want to give users the opportunity to increase it.
    Introduce a new per-device queue_depth option to do just that:
    
        $ sudo rbd map -o queue_depth=1024 ...
    
    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    Reviewed-by: Alex Elder <elder@linaro.org>
  3. @idryomov

    rbd: store rbd_options in rbd_device

    idryomov authored
    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    Reviewed-by: Alex Elder <elder@linaro.org>
  4. @idryomov

    rbd: terminate rbd_opts_tokens with Opt_err

    idryomov authored
    Also nuke useless Opt_last_bool and don't break lines unnecessarily.
    
    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    Reviewed-by: Alex Elder <elder@linaro.org>
  5. @ukernel @idryomov

    ceph: fix ceph_writepages_start()

    ukernel authored idryomov committed
    Before a page gets locked, someone else can write data to the page
    and increase the i_size.  So we should re-check the i_size after the
    pages are locked.
    
    Signed-off-by: Yan, Zheng <zyan@redhat.com>
  6. @idryomov

    rbd: bump queue_max_segments

    idryomov authored
    The default queue_limits::max_segments value (BLK_MAX_SEGMENTS = 128)
    unnecessarily limits bio sizes to 512k (assuming 4k pages).  rbd, being
    a virtual block device, doesn't have any restrictions on the number of
    physical segments, so bump max_segments to max_hw_sectors, in theory
    allowing a sector per segment (although the only case this matters that
    I can think of is some readv/writev style thing).  In practice this is
    going to give us 1M bios - the number of segments in a bio is limited
    in bio_get_nr_vecs() by BIO_MAX_PAGES = 256.
    
    Note that this doesn't result in any improvement on a typical direct
    sequential test.  This is because on a box with not too badly
    fragmented memory the default BLK_MAX_SEGMENTS is enough to see nice
    rbd object size sized requests.  The only difference is the size of
    bios being merged - 512k vs 1M for something like
    
        $ dd if=/dev/zero of=/dev/rbd0 oflag=direct bs=$RBD_OBJ_SIZE
        $ dd if=/dev/rbd0 iflag=direct of=/dev/null bs=$RBD_OBJ_SIZE
    
    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    Reviewed-by: Alex Elder <elder@linaro.org>
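
    Sketched (illustrative; q and segment_size stand in for rbd's queue
    and object size):

        /* Allow as many segments as sectors per request, so segment
         * count is no longer the limiting factor. */
        blk_queue_max_hw_sectors(q, segment_size / SECTOR_SIZE);
        blk_queue_max_segments(q, segment_size / SECTOR_SIZE);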
  7. @ukernel @idryomov

    ceph: rework dcache readdir

    ukernel authored idryomov committed
    Previously, our dcache readdir code relied on the child dentries in
    a directory dentry's d_subdir list being sorted by dentry offset in
    descending order.  When adding dentries to the dcache, if a dentry
    already exists, our readdir code moves it to the head of the
    directory dentry's d_subdir list.  This design relies on dcache
    internals.  Al Viro suggested using ncpfs's approach: keep an array
    of pointers to dentries in the page cache of the directory inode.
    The validity of those pointers is indicated by the directory inode's
    complete and ordered flags.  When a dentry gets pruned, we clear the
    directory inode's complete flag in the d_prune() callback.  Before
    moving a dentry to another directory, we clear the ordered flag for
    both the old and new directories.
    
    Signed-off-by: Yan, Zheng <zyan@redhat.com>
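
    An editorial sketch of the indexing scheme (hypothetical names; the
    real code lives in fs/ceph/dir.c): with dentry pointers packed into
    the directory inode's page cache, the n-th entry is found by simple
    arithmetic:

        unsigned nsize = PAGE_CACHE_SIZE / sizeof(struct dentry *);
        pgoff_t pgoff = n / nsize;          /* which page of pointers */
        unsigned slot = n % nsize;          /* slot within that page  */

        page = find_lock_page(&dir->i_data, pgoff);
        dentry = ((struct dentry **)kmap(page))[slot];
        /* kunmap(page) and unlock the page when done. */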
  8. @idryomov

    crush: sync up with userspace

    idryomov authored
    .. up to ceph.git commit 1db1abc8328d ("crush: eliminate ad hoc diff
    between kernel and userspace").  This fixes a bunch of recently pulled
    coding style issues and makes includes a bit cleaner.
    
    A patch "crush:Make the function crush_ln static" from Nicholas Krause
    <xerofoify@gmail.com> is folded in as crush_ln() has been made static
    in userspace as well.
    
    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
  9. @idryomov

    crush: fix crash from invalid 'take' argument

    idryomov authored
    Verify that the 'take' argument is a valid device or bucket.
    Otherwise ignore it (do not add the value to the working vector).
    
    Reflects ceph.git commit 9324d0a1af61e1c234cc48e2175b4e6320fff8f4.
    
    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
  10. @ukernel @idryomov

    ceph: switch some GFP_NOFS memory allocation to GFP_KERNEL

    ukernel authored idryomov committed
    GFP_NOFS memory allocation is required in the page writeback path,
    but there is no need to use GFP_NOFS in the syscall and readpage
    paths.
    
    Signed-off-by: Yan, Zheng <zyan@redhat.com>
  11. @ukernel @idryomov

    ceph: pre-allocate data structure that tracks caps flushing

    ukernel authored idryomov committed
    Signed-off-by: Yan, Zheng <zyan@redhat.com>
  12. @ukernel @idryomov

    ceph: re-send flushing caps (which are revoked) in reconnect stage

    ukernel authored idryomov committed
    If flushing caps were revoked, we should re-send the cap flush
    during the client reconnect stage.  This guarantees that the MDS
    processes the cap flush message before issuing the caps being
    flushed to another client.
    
    Signed-off-by: Yan, Zheng <zyan@redhat.com>
  13. @ukernel @idryomov

    ceph: send TID of the oldest pending caps flush to MDS

    ukernel authored idryomov committed
    Using this information, the MDS can trim its completed caps flush
    list (which is used to detect duplicated cap flushes).
    
    Signed-off-by: Yan, Zheng <zyan@redhat.com>
  14. @ukernel @idryomov

    ceph: track pending caps flushing globally

    ukernel authored idryomov committed
    So we know the TID of the oldest pending caps flush.  A later patch
    will send this information to the MDS, so that the MDS can trim its
    completed caps flush list.
    
    Tracking pending caps flushing globally also simplifies syncfs code.
    
    Signed-off-by: Yan, Zheng <zyan@redhat.com>
  15. @ukernel @idryomov

    ceph: track pending caps flushing accurately

    ukernel authored idryomov committed
    Previously we did not track an accurate TID for flushing caps.  When
    the MDS fails over, we have no choice but to re-send all flushing
    caps with a new TID.  This can cause problems because the MDS may
    have already flushed some caps and issued the same caps to another
    client.  The re-sent cap flush has a new TID, which makes the MDS
    unable to detect whether it has already processed the cap flush.
    
    This patch adds code to track pending caps flushing accurately.
    When re-sending a cap flush is needed, we use its original flush
    TID.
    
    Signed-off-by: Yan, Zheng <zyan@redhat.com>
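
    Sketched with hypothetical field names, the idea is to record the
    flush TID once at flush time and reuse it on any re-send:

        /* One record per in-flight cap flush, keyed by its TID. */
        struct ceph_cap_flush {
                u64 tid;        /* assigned once, reused on re-send */
                int caps;       /* cap bits being flushed */
                struct rb_node i_node;  /* per-inode tracking */
        };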
  16. @honkiko @idryomov

    libceph: fix wrong name "Ceph filesystem for Linux"

    honkiko authored idryomov committed
    modinfo libceph prints the module name "Ceph filesystem for Linux",
    which is the same as that of the real fs module, ceph.  It's
    confusing.
    
    Signed-off-by: Hong Zhiguo <zhiguohong@tencent.com>
    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
  17. @ukernel @idryomov

    ceph: fix directory fsync

    ukernel authored idryomov committed
    fsync() on a directory should flush dirty caps and wait for any
    uncommitted directory operations to commit.  But ceph_dir_fsync()
    only waits for uncommitted directory operations.
    
    Signed-off-by: Yan, Zheng <zyan@redhat.com>
  18. @ukernel @idryomov

    ceph: fix flushing caps

    ukernel authored idryomov committed
    Currently ceph_fsync() only flushes dirty caps and waits for them to
    be flushed.  It doesn't wait for caps that are already being
    flushed.  This patch makes ceph_fsync() wait for pending flushing
    caps too.  Besides, this patch also makes caps_are_flushed()
    properly handle tid wrapping.
    
    Signed-off-by: Yan, Zheng <zyan@redhat.com>
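
    The standard wrap-safe comparison, sketched (hypothetical helper
    name; same trick as the kernel's time_before()):

        /* True if a is before b even across u64 wraparound. */
        static inline bool tid_before(u64 a, u64 b)
        {
                return (s64)(a - b) < 0;
        }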
  19. @ukernel @idryomov

    ceph: don't include used caps in cap_wanted

    ukernel authored idryomov committed
    When copying files to cephfs, file data may stay in the page cache
    after the corresponding file is closed.  Cached data uses the Fc
    capability.  If we include the Fc capability in cap_wanted, the MDS
    will treat files with cached data as open files, and journal them in
    an EOpen event when trimming the log segment.
    
    Signed-off-by: Yan, Zheng <zyan@redhat.com>
  20. @ukernel @idryomov

    ceph: ratelimit warn messages when the MDS closes a session

    ukernel authored idryomov committed
    Signed-off-by: Yan, Zheng <zyan@redhat.com>
  21. @idryomov

    rbd: timeout watch teardown on unmap with mount_timeout

    idryomov authored
    As part of the unmap sequence, the kernel client has to talk to the
    OSDs to tear down the watch on the header object.  If none of the
    OSDs are available it would hang forever, until interrupted by
    a signal - when that happens we follow through with the rest of the
    unmap procedure (i.e. unregister the device and put all the data
    structures) and the unmap is still considered successful (the rbd
    cli tool exits with 0).  The watch on the userspace side should
    eventually time out, so that's fine.
    
    This isn't very nice, because various userspace tools (the pacemaker
    rbd resource agent, for example) then have to worry about setting up
    their own timeouts.  Time the teardown out with mount_timeout (60
    seconds by default).
    
    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    Reviewed-by: Alex Elder <elder@linaro.org>
    Reviewed-by: Sage Weil <sage@redhat.com>
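
    Sketched (illustrative names), the waits in the teardown path become
    bounded instead of indefinite:

        unsigned long timeout = opts->mount_timeout;  /* in jiffies */
        long ret = wait_for_completion_killable_timeout(&done, timeout);
        if (!ret)
                ret = -ETIMEDOUT;       /* no OSDs answered in time */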
  22. @idryomov

    ceph: simplify two mount_timeout sites

    idryomov authored
    No need to bifurcate wait now that we've got ceph_timeout_jiffies().
    
    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    Reviewed-by: Alex Elder <elder@linaro.org>
    Reviewed-by: Yan, Zheng <zyan@redhat.com>
  23. @idryomov

    libceph: a couple tweaks for wait loops

    idryomov authored
    - return -ETIMEDOUT instead of -EIO in case of timeout
    - wait_event_interruptible_timeout() returns the time left until
      timeout and, since it can be almost LONG_MAX, we had better assign
      it to a long
    
    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    Reviewed-by: Alex Elder <elder@linaro.org>
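
    Both tweaks in one sketch (illustrative):

        /* Returns time left (can be almost LONG_MAX), 0 on timeout,
         * or -ERESTARTSYS - so store it in a long. */
        long left = wait_event_interruptible_timeout(wq, done, timeout);
        if (left == 0)
                return -ETIMEDOUT;      /* was -EIO */
        if (left < 0)
                return left;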
  24. @idryomov

    libceph: store timeouts in jiffies, verify user input

    idryomov authored
    There are currently three libceph-level timeouts that the user can
    specify on mount: mount_timeout, osd_idle_ttl and osdkeepalive.  All of
    these are in seconds and no checking is done on user input: negative
    values are accepted, we multiply them all by HZ which may or may not
    overflow, arbitrarily large jiffies then get added together, etc.
    
    There is also a bug in the way mount_timeout=0 is handled.  It's
    supposed to mean "infinite timeout", but that's not how wait.h APIs
    treat it and so __ceph_open_session() for example will busy loop
    without much chance of being interrupted if none of the ceph-mons
    are there.
    
    Fix all this by verifying user input, storing timeouts capped by
    msecs_to_jiffies() in jiffies and using the new ceph_timeout_jiffies()
    helper for all user-specified waits to handle infinite timeouts
    correctly.
    
    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    Reviewed-by: Alex Elder <elder@linaro.org>
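
    The helper, sketched (a minimal reading of the commit; the real one
    lives in include/linux/ceph/libceph.h):

        /* Map the "0 means infinite" convention onto what the wait.h
         * APIs expect. */
        static inline unsigned long ceph_timeout_jiffies(unsigned long timeout)
        {
                return timeout ?: MAX_SCHEDULE_TIMEOUT;
        }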
  25. @idryomov

    libceph: nuke time_sub()

    idryomov authored
    Unused since ceph got merged into mainline I guess.
    
    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    Reviewed-by: Alex Elder <elder@linaro.org>
  26. @ukernel @idryomov

    ceph: exclude setfilelock requests when calculating oldest tid

    ukernel authored idryomov committed
    setfilelock requests can block for a long time, which can prevent
    the client from advancing its oldest tid.
    
    Signed-off-by: Yan, Zheng <zyan@redhat.com>
  27. @ukernel @idryomov

    ceph: don't pre-allocate space for cap release messages

    ukernel authored idryomov committed
    Previously we pre-allocated cap release messages for each cap.  This
    wastes lots of memory when there is a large number of caps.  This
    patch makes the code not pre-allocate the cap release messages.
    Instead, we add the corresponding ceph_cap struct to a list when
    releasing a cap.  Later, when flushing cap releases is needed, we
    allocate the cap release messages dynamically.
    
    Signed-off-by: Yan, Zheng <zyan@redhat.com>
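
    Sketched with hypothetical list and lock names, the new scheme
    queues the cap itself and builds messages on demand:

        /* Releasing a cap: just queue it, no message allocation. */
        spin_lock(&session->s_cap_lock);
        list_add_tail(&cap->session_caps, &session->s_cap_releases);
        session->s_num_cap_releases++;
        spin_unlock(&session->s_cap_lock);

        /* Later, flushing cap releases walks the list and allocates
         * the release messages dynamically. */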
  28. @ukernel @idryomov

    ceph: make sure syncfs flushes all cap snaps

    ukernel authored idryomov committed
    Signed-off-by: Yan, Zheng <zyan@redhat.com>
  29. @ukernel @idryomov

    ceph: don't trim auth cap when there are cap snaps

    ukernel authored idryomov committed
    Signed-off-by: Yan, Zheng <zyan@redhat.com>
  30. @ukernel @idryomov

    ceph: take snap_rwsem when accessing snap realm's cached_context

    ukernel authored idryomov committed
    When a ceph inode's i_head_snapc is NULL, __ceph_mark_dirty_caps()
    accesses the snap realm's cached_context.  So we need to take the
    read lock of snap_rwsem.
    
    Signed-off-by: Yan, Zheng <zyan@redhat.com>
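
    Sketched (illustrative):

        /* Guard the cached_context access with the read lock. */
        down_read(&mdsc->snap_rwsem);
        __ceph_mark_dirty_caps(ci, mask);  /* may touch realm->cached_context */
        up_read(&mdsc->snap_rwsem);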
  31. @ukernel @idryomov

    ceph: avoid sending unnecessary FLUSHSNAP message

    ukernel authored idryomov committed
    When a snap notification contains no new snapshot, we can avoid
    sending a FLUSHSNAP message to the MDS.  But we still need to create
    the cap_snap in some cases because it's required by the write path
    and the page writeback path.
    
    Signed-off-by: Yan, Zheng <zyan@redhat.com>
  32. @ukernel @idryomov

    ceph: set i_head_snapc when getting CEPH_CAP_FILE_WR reference

    ukernel authored idryomov committed
    In most cases where the snap context is needed, we are holding a
    reference to CEPH_CAP_FILE_WR.  So we can set the ceph inode's
    i_head_snapc when getting the CEPH_CAP_FILE_WR reference, and make
    the code get the snap context from i_head_snapc.  This makes the
    code simpler.
    
    Another benefit of this change is that we can handle snap
    notifications more elegantly, especially when the snap context is
    updated while someone else is writing.  The old queue cap_snap code
    may set a cap_snap's context to either the old context or the new
    snap context, depending on whether i_head_snapc is set.  The new
    queue cap_snap code always sets a cap_snap's context to the old
    snap context.
    
    Signed-off-by: Yan, Zheng <zyan@redhat.com>