Add support for autoexpand property

While the autoexpand property may seem like a small feature, it
depends on a significant amount of system infrastructure.  Enough
of that infrastructure is now in place that, with a few
modifications for Linux, it can be supported.

Auto-expand works as follows: when a block device is modified
(re-sized, closed after being open r/w, etc.) a change uevent is
generated for udev.  The ZED, which is monitoring udev events,
passes the change event along to zfs_deliver_dle() if the disk
or partition contains a zfs_member as identified by blkid.
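
For illustration, this detection path can be exercised by hand.  The
device name below is an example only (any zfs_member partition will
do), and the exact blkid output varies between versions:

  # Watch the change uevents the ZED consumes.
  udevadm monitor --udev --subsystem-match=block --property &

  # Generate a synthetic change event for an example partition.
  echo change > /sys/class/block/sdb1/uevent

  # blkid is what identifies the partition as a ZFS member.
  blkid -p /dev/sdb1        # reports TYPE="zfs_member"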

From here the device is matched against all imported pool vdevs
using the vdev_guid, which was read from the label by blkid.  If
a match is found the ZED reopens the pool vdev.  This re-opening
is important because it allows the vdev to be briefly closed so
the disk partition table can be re-read.  Otherwise, it wouldn't
be possible to report the maximum possible expansion size.
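
The guid used for this match is recorded in the on-disk label and can
be inspected directly; the pool and device names here are
hypothetical:

  # Dump the label from the member device and show its guid.
  zdb -l /dev/sdb1 | grep -w guid

  # Display the imported pool's vdevs by guid for comparison.
  zpool status -g tank

  # The ZED's reopen of the pool vdev is roughly equivalent to:
  zpool reopen tank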

Finally, if the pool property autoexpand=on, a vdev expansion will
be attempted.  After performing some sanity checks on the disk to
verify that it is safe to expand, the primary partition (-part1)
will be expanded and the partition table updated.  The partition
is then re-opened (again) to detect the updated size, which allows
the new capacity to be used.
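
As a rough end-to-end sketch, the behavior can be seen with a
loopback device.  Paths and the pool name are examples, the ZED must
be running, and on some setups the final 'zpool online -e' is still
needed:

  truncate -s 1G /var/tmp/autoexpand.img
  DEV=$(losetup -f --show /var/tmp/autoexpand.img)
  zpool create -o autoexpand=on tank $DEV

  # Grow the backing file and update the loop device capacity; this
  # emits the change uevent described above.
  truncate -s 2G /var/tmp/autoexpand.img
  losetup -c $DEV

  # The pool should expand automatically; if not, force it.
  zpool online -e tank $DEV
  zpool list tank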

In order to make all of the above possible the following changes
were required:

* Updated the zpool_expand_001_pos and zpool_expand_003_pos tests.
  These tests now create a pool which is layered on loopback,
  scsi_debug, and file vdevs.  This allows for testing of a
  non-partitioned block device (loopback), a partitioned block
  device (scsi_debug), and a file which does not receive udev
  change events.  This provides better test coverage, and by
  removing the layering on ZFS volumes the issues surrounding
  layering one pool on another are avoided.  A sketch of this
  layered setup is included after this list.

* zpool_find_vdev_by_physpath() updated to accept a vdev guid.
  This allows for matching by guid rather than path which is a
  more reliable way for the ZED to reference a vdev.

* Fixed zfs_zevent_wait() signal handling which could result
  in the ZED spinning when a signal was not handled.

* Removed vdev_disk_rrpart() functionality which can be abandoned
  in favor of kernel provided blkdev_reread_part() function.

* Added a rwlock which is held as a writer while a disk is being
  reopened.  This is important to prevent errors from occurring
  for any configuration related IOs which bypass the SCL_ZIO lock.
  The zpool_reopen_007_pos.ksh test case was added to verify IO
  errors are never observed when reopening; a rough outline of that
  check is sketched after this list.  This is not expected to
  impact IO performance.
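
A sketch of the layered pool used by the updated zpool_expand tests;
sizes, paths, and the pool name are illustrative, and lsscsi is
assumed to be available for locating the scsi_debug disk:

  # Loopback vdev (non-partitioned block device).
  truncate -s 1G /var/tmp/loop.img
  LOOP=$(losetup -f --show /var/tmp/loop.img)

  # scsi_debug vdev (partitioned block device which can be grown).
  modprobe scsi_debug dev_size_mb=512 virtual_gb=1
  SDEV=$(lsscsi | awk '/scsi_debug/ {print $NF}')

  # File vdev (receives no udev change events).
  truncate -s 1G /var/tmp/file.img

  zpool create -o autoexpand=on testpool $LOOP $SDEV /var/tmp/file.img

  # The scsi_debug disk is later grown by bumping virtual_gb and
  # rescanning the device.
  echo 2 > /sys/bus/pseudo/drivers/scsi_debug/virtual_gb
  echo 1 > /sys/class/block/$(basename $SDEV)/device/rescan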
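
Conceptually, the zpool_reopen_007_pos check does something like the
following (a simplified outline; the pool name and sizes are
illustrative):

  # Generate I/O against the pool while repeatedly reopening it.
  dd if=/dev/urandom of=/testpool/io.dat bs=1M count=256 &
  for i in $(seq 1 10); do
          zpool reopen testpool
          sleep 1
  done
  wait

  # No READ/WRITE/CKSUM errors should be reported for any vdev.
  zpool status -v testpool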

Additional fixes which aren't critical, but which were discovered
and resolved in the course of developing this functionality:

* Added PHYS_PATH="/dev/zvol/dataset" to the vdev configuration for
  ZFS volumes.  This serves as a unique physical path; while the
  volumes are no longer used in the test cases for other reasons,
  this improvement was still included.

Signed-off-by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#120
Issue openzfs#2437
Issue openzfs#5771
Issue openzfs#7366
Issue openzfs#7582
behlendorf committed Jul 12, 2018
1 parent 2e5dc44 commit 21f7c97
Showing 26 changed files with 677 additions and 358 deletions.
85 changes: 55 additions & 30 deletions cmd/zed/agents/zfs_mod.c
@@ -697,51 +697,67 @@ zfsdle_vdev_online(zpool_handle_t *zhp, void *data)
{
char *devname = data;
boolean_t avail_spare, l2cache;
vdev_state_t newstate;
nvlist_t *tgt;
int error;

zed_log_msg(LOG_INFO, "zfsdle_vdev_online: searching for '%s' in '%s'",
devname, zpool_get_name(zhp));

if ((tgt = zpool_find_vdev_by_physpath(zhp, devname,
&avail_spare, &l2cache, NULL)) != NULL) {
char *path, fullpath[MAXPATHLEN];
uint64_t wholedisk = 0ULL;
uint64_t wholedisk;

verify(nvlist_lookup_string(tgt, ZPOOL_CONFIG_PATH,
&path) == 0);
verify(nvlist_lookup_uint64(tgt, ZPOOL_CONFIG_WHOLE_DISK,
&wholedisk) == 0);
error = nvlist_lookup_string(tgt, ZPOOL_CONFIG_PATH, &path);
if (error) {
zpool_close(zhp);
return (0);
}

(void) strlcpy(fullpath, path, sizeof (fullpath));
if (wholedisk) {
char *spath = zfs_strip_partition(fullpath);
boolean_t scrub_restart = B_TRUE;
error = nvlist_lookup_uint64(tgt, ZPOOL_CONFIG_WHOLE_DISK,
&wholedisk);
if (error)
wholedisk = 0;

if (!spath) {
zed_log_msg(LOG_INFO, "%s: Can't alloc",
__func__);
if (wholedisk) {
path = strrchr(path, '/');
if (path != NULL) {
path = zfs_strip_partition(path + 1);
if (path == NULL) {
zpool_close(zhp);
return (0);
}
} else {
zpool_close(zhp);
return (0);
}

(void) strlcpy(fullpath, spath, sizeof (fullpath));
free(spath);
(void) strlcpy(fullpath, path, sizeof (fullpath));
free(path);

/*
* We need to reopen the pool associated with this
* device so that the kernel can update the size
* of the expanded device.
* device so that the kernel can update the size of
* the expanded device. When expanding there is no
* need to restart the scrub from the beginning.
*/
boolean_t scrub_restart = B_FALSE;
(void) zpool_reopen_one(zhp, &scrub_restart);
} else {
(void) strlcpy(fullpath, path, sizeof (fullpath));
}

if (zpool_get_prop_int(zhp, ZPOOL_PROP_AUTOEXPAND, NULL)) {
zed_log_msg(LOG_INFO, "zfsdle_vdev_online: setting "
"device '%s' to ONLINE state in pool '%s'",
fullpath, zpool_get_name(zhp));
if (zpool_get_state(zhp) != POOL_STATE_UNAVAIL)
(void) zpool_vdev_online(zhp, fullpath, 0,
vdev_state_t newstate;

if (zpool_get_state(zhp) != POOL_STATE_UNAVAIL) {
error = zpool_vdev_online(zhp, fullpath, 0,
&newstate);
zed_log_msg(LOG_INFO, "zfsdle_vdev_online: "
"setting device '%s' to ONLINE state "
"in pool '%s': %d", fullpath,
zpool_get_name(zhp), error);
}
}
zpool_close(zhp);
return (1);
@@ -751,23 +767,32 @@ zfsdle_vdev_online(zpool_handle_t *zhp, void *data)
}

/*
* This function handles the ESC_DEV_DLE event.
* This function handles the ESC_DEV_DLE device change event. Use the
* provided vdev guid when looking up a disk or partition, when the guid
* is not present assume the entire disk is owned by ZFS and append the
* expected -part1 partition information then lookup by physical path.
*/
static int
zfs_deliver_dle(nvlist_t *nvl)
{
char *devname;

if (nvlist_lookup_string(nvl, DEV_PHYS_PATH, &devname) != 0) {
zed_log_msg(LOG_INFO, "zfs_deliver_dle: no physpath");
return (-1);
char *devname, name[MAXPATHLEN];
uint64_t guid;

if (nvlist_lookup_uint64(nvl, ZFS_EV_VDEV_GUID, &guid) == 0) {
sprintf(name, "%llu", (u_longlong_t)guid);
} else if (nvlist_lookup_string(nvl, DEV_PHYS_PATH, &devname) == 0) {
strlcpy(name, devname, MAXPATHLEN);
zfs_append_partition(name, MAXPATHLEN);
} else {
zed_log_msg(LOG_INFO, "zfs_deliver_dle: no guid or physpath");
}

if (zpool_iter(g_zfshdl, zfsdle_vdev_online, devname) != 1) {
if (zpool_iter(g_zfshdl, zfsdle_vdev_online, name) != 1) {
zed_log_msg(LOG_INFO, "zfs_deliver_dle: device '%s' not "
"found", devname);
"found", name);
return (1);
}

return (0);
}

19 changes: 0 additions & 19 deletions config/kernel-blkdev-get.m4

This file was deleted.

21 changes: 21 additions & 0 deletions config/kernel-blkdev-reread-part.m4
@@ -0,0 +1,21 @@
dnl #
dnl # 4.1 API, exported blkdev_reread_part() symbol, backported to the
dnl # 3.10.0 CentOS 7.x enterprise kernels.
dnl #
AC_DEFUN([ZFS_AC_KERNEL_BLKDEV_REREAD_PART], [
AC_MSG_CHECKING([whether blkdev_reread_part() is available])
ZFS_LINUX_TRY_COMPILE([
#include <linux/fs.h>
], [
struct block_device *bdev = NULL;
int error;
error = blkdev_reread_part(bdev);
], [
AC_MSG_RESULT(yes)
AC_DEFINE(HAVE_BLKDEV_REREAD_PART, 1,
[blkdev_reread_part() is available])
], [
AC_MSG_RESULT(no)
])
])
17 changes: 0 additions & 17 deletions config/kernel-get-gendisk.m4

This file was deleted.

3 changes: 1 addition & 2 deletions config/kernel.m4
@@ -44,8 +44,8 @@ AC_DEFUN([ZFS_AC_CONFIG_KERNEL], [
ZFS_AC_KERNEL_BLOCK_DEVICE_OPERATIONS_CHECK_EVENTS
ZFS_AC_KERNEL_BLOCK_DEVICE_OPERATIONS_RELEASE_VOID
ZFS_AC_KERNEL_TYPE_FMODE_T
ZFS_AC_KERNEL_3ARG_BLKDEV_GET
ZFS_AC_KERNEL_BLKDEV_GET_BY_PATH
ZFS_AC_KERNEL_BLKDEV_REREAD_PART
ZFS_AC_KERNEL_OPEN_BDEV_EXCLUSIVE
ZFS_AC_KERNEL_LOOKUP_BDEV
ZFS_AC_KERNEL_INVALIDATE_BDEV_ARGS
@@ -73,7 +73,6 @@ AC_DEFUN([ZFS_AC_CONFIG_KERNEL], [
ZFS_AC_KERNEL_BLK_QUEUE_HAVE_BLK_PLUG
ZFS_AC_KERNEL_GET_DISK_AND_MODULE
ZFS_AC_KERNEL_GET_DISK_RO
ZFS_AC_KERNEL_GET_GENDISK
ZFS_AC_KERNEL_HAVE_BIO_SET_OP_ATTRS
ZFS_AC_KERNEL_GENERIC_READLINK_GLOBAL
ZFS_AC_KERNEL_DISCARD_GRANULARITY
14 changes: 14 additions & 0 deletions include/linux/blkdev_compat.h
@@ -364,6 +364,20 @@ bio_set_bi_error(struct bio *bio, int error)
#define vdev_bdev_close(bdev, md) close_bdev_excl(bdev)
#endif /* HAVE_BLKDEV_GET_BY_PATH | HAVE_OPEN_BDEV_EXCLUSIVE */

/*
* 4.1 - x.y.z API,
* 3.10.0 CentOS 7.x API,
* blkdev_reread_part()
*
* For older kernels trigger a re-reading of the partition table by calling
* check_disk_change() which calls flush_disk() to invalidate the device.
*/
#ifdef HAVE_BLKDEV_REREAD_PART
#define vdev_bdev_reread_part(bdev) blkdev_reread_part(bdev)
#else
#define vdev_bdev_reread_part(bdev) check_disk_change(bdev)
#endif /* HAVE_BLKDEV_REREAD_PART */

/*
* 2.6.22 API change
* The function invalidate_bdev() lost it's second argument because
1 change: 1 addition & 0 deletions include/sys/vdev_disk.h
@@ -47,6 +47,7 @@ typedef struct vdev_disk {
ddi_devid_t vd_devid;
char *vd_minor;
struct block_device *vd_bdev;
krwlock_t vd_lock;
} vdev_disk_t;

#endif /* _KERNEL */
72 changes: 59 additions & 13 deletions lib/libzfs/libzfs_import.c
@@ -145,6 +145,21 @@ zfs_device_get_devid(struct udev_device *dev, char *bufptr, size_t buflen)
return (0);
}

/*
* For volumes use the persistent /dev/zvol/dataset identifier
*/
entry = udev_device_get_devlinks_list_entry(dev);
while (entry != NULL) {
const char *name;

name = udev_list_entry_get_name(entry);
if (strncmp(name, ZVOL_ROOT, strlen(ZVOL_ROOT)) == 0) {
(void) strlcpy(bufptr, name, buflen);
return (0);
}
entry = udev_list_entry_get_next(entry);
}

/*
* NVME 'by-id' symlinks are similar to bus case
*/
@@ -187,26 +202,57 @@ int
zfs_device_get_physical(struct udev_device *dev, char *bufptr, size_t buflen)
{
const char *physpath = NULL;
struct udev_list_entry *entry;

/*
* Normal disks use ID_PATH for their physical path. Device mapper
* devices are virtual and don't have a physical path. For them we
* use ID_VDEV instead, which is setup via the /etc/vdev_id.conf file.
* ID_VDEV provides a persistent path to a virtual device. If you
* don't have vdev_id.conf setup, you cannot use multipath autoreplace.
* Normal disks use ID_PATH for their physical path.
*/
if (!((physpath = udev_device_get_property_value(dev, "ID_PATH")) &&
physpath[0])) {
if (!((physpath =
udev_device_get_property_value(dev, "ID_VDEV")) &&
physpath[0])) {
return (ENODATA);
physpath = udev_device_get_property_value(dev, "ID_PATH");
if (physpath != NULL && strlen(physpath) > 0) {
(void) strlcpy(bufptr, physpath, buflen);
return (0);
}

/*
* Device mapper devices are virtual and don't have a physical
* path. For them we use ID_VDEV instead, which is setup via the
* /etc/vdev_id.conf file. ID_VDEV provides a persistent path
* to a virtual device. If you don't have vdev_id.conf setup,
* you cannot use multipath autoreplace with device mapper.
*/
physpath = udev_device_get_property_value(dev, "ID_VDEV");
if (physpath != NULL && strlen(physpath) > 0) {
(void) strlcpy(bufptr, physpath, buflen);
return (0);
}

/*
* For ZFS volumes use the persistent /dev/zvol/dataset identifier
*/
entry = udev_device_get_devlinks_list_entry(dev);
while (entry != NULL) {
physpath = udev_list_entry_get_name(entry);
if (strncmp(physpath, ZVOL_ROOT, strlen(ZVOL_ROOT)) == 0) {
(void) strlcpy(bufptr, physpath, buflen);
return (0);
}
entry = udev_list_entry_get_next(entry);
}

(void) strlcpy(bufptr, physpath, buflen);
/*
* For all other devices fallback to using the by-uuid name.
*/
entry = udev_device_get_devlinks_list_entry(dev);
while (entry != NULL) {
physpath = udev_list_entry_get_name(entry);
if (strncmp(physpath, "/dev/disk/by-uuid", 17) == 0) {
(void) strlcpy(bufptr, physpath, buflen);
return (0);
}
entry = udev_list_entry_get_next(entry);
}

return (0);
return (ENODATA);
}

boolean_t
14 changes: 11 additions & 3 deletions lib/libzfs/libzfs_pool.c
@@ -2283,17 +2283,25 @@ vdev_to_nvlist_iter(nvlist_t *nv, nvlist_t *search, boolean_t *avail_spare,
}

/*
* Given a physical path (minus the "/devices" prefix), find the
* associated vdev.
* Given a physical path or guid, find the associated vdev.
*/
nvlist_t *
zpool_find_vdev_by_physpath(zpool_handle_t *zhp, const char *ppath,
boolean_t *avail_spare, boolean_t *l2cache, boolean_t *log)
{
nvlist_t *search, *nvroot, *ret;
uint64_t guid;
char *end;

verify(nvlist_alloc(&search, NV_UNIQUE_NAME, KM_SLEEP) == 0);
verify(nvlist_add_string(search, ZPOOL_CONFIG_PHYS_PATH, ppath) == 0);

guid = strtoull(ppath, &end, 0);
if (guid != 0 && *end == '\0') {
verify(nvlist_add_uint64(search, ZPOOL_CONFIG_GUID, guid) == 0);
} else {
verify(nvlist_add_string(search, ZPOOL_CONFIG_PHYS_PATH,
ppath) == 0);
}

verify(nvlist_lookup_nvlist(zhp->zpool_config, ZPOOL_CONFIG_VDEV_TREE,
&nvroot) == 0);
