OpenZFS 9486 - reduce memory used by device removal on fragmented pools

Device removal allocates a new location for each allocated segment on
the disk that's being removed.  Each allocation results in one entry in
the mapping table, which maps from old location + length to new
location.  When a fragmented disk is removed, this can result in a large
number of mapping entries, and thus a large amount of memory consumed by
the mapping table.  In the worst real-world cases, we've seen around 1GB
of RAM per 1TB of storage removed.
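
Conceptually, each entry in the mapping table records where a segment
used to live on the removed device and where its copy now lives.  The
sketch below illustrates that idea in a minimal form; the struct and
field names are hypothetical and do not correspond to the actual ZFS
indirect-mapping structures.

    #include <stdint.h>

    /*
     * Illustrative only: a simplified view of one removal-mapping entry.
     * The real on-disk and in-core ZFS structures differ.
     */
    typedef struct removal_mapping_entry {
            uint64_t rme_old_offset;   /* offset on the device being removed */
            uint64_t rme_size;         /* length of the remapped segment */
            uint64_t rme_new_vdev;     /* vdev the data was copied to */
            uint64_t rme_new_offset;   /* offset of the copy on that vdev */
    } removal_mapping_entry_t;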

We can improve on this situation by allocating larger segments, which
span across both allocated and free regions of the device being removed.
By including free regions in the allocation (and thus mapping), we
reduce the number of mapping entries.  For example, if we have a 4K
allocation followed by 1K free and then 4K allocated, we would allocate
4+1+4 = 9KB, and then move the entire region (including allocated and
free parts).  In this case we used one mapping where previously we would
have used two, but often the ratio is much higher (up to 20:1 in
real-world use).  We then need to mark the regions that were free on the
removing device as free in the new locations, and also obsolete in the
mapping entry.
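
How much free space may be carried along is governed by the new
vdev_removal_max_span tunable, and a single mapping is also bounded by
zfs_remove_max_segment (both declared in vdev_removal.h below).  The
helper below is a minimal sketch of that gating rule only; it is a
hypothetical function, not the code added by this commit.

    #include <sys/zfs_context.h>
    #include <sys/vdev_removal.h>  /* vdev_removal_max_span, zfs_remove_max_segment */

    /*
     * Illustrative only: decide whether the free gap between two allocated
     * segments should be copied as part of the current mapping (and later
     * marked obsolete), or whether the next allocated segment should start
     * a new mapping entry.
     */
    static boolean_t
    merge_across_gap(uint64_t gap_size, uint64_t span_so_far)
    {
            /* Large gaps are not worth copying. */
            if (gap_size > vdev_removal_max_span)
                    return (B_FALSE);

            /* Keep the combined mapping within the per-segment copy limit. */
            if (span_so_far + gap_size > zfs_remove_max_segment)
                    return (B_FALSE);

            return (B_TRUE);
    }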

This method preserves the fragmentation of the removing device, rather
than consolidating its allocated space into a small number of chunks
where possible.  But it results in drastic reduction of memory used by
the mapping table - around 20x in the most-fragmented cases.

In the most fragmented real-world cases, this reduces memory used by the
mapping from ~1GB to ~50MB of RAM per 1TB of storage removed.  Less
fragmented cases will typically also see around 50-100MB of RAM per 1TB
of storage.

Porting notes:

    Add the following as module parameters:
        zfs_condense_indirect_vdevs_enable
        zfs_condense_max_obsolete_bytes

    Document the following module parameters:
        zfs_condense_indirect_vdevs_enable
        zfs_condense_max_obsolete_bytes
        zfs_condense_min_mapping_bytes

External-issue: DLPX-57962

Authored by: Matthew Ahrens <mahrens@delphix.com>
OpenZFS-issue: https://illumos.org/issues/9486
OpenZFS-commit: ahrens/illumos@07152e1
OpenZFS-PR: openzfs/openzfs#627
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
ahrens authored and dweeezil committed May 20, 2018
1 parent 43eb39d commit 272a990
Showing 7 changed files with 309 additions and 33 deletions.
4 changes: 4 additions & 0 deletions include/sys/range_tree.h
@@ -93,9 +93,13 @@ range_seg_t *range_tree_find(range_tree_t *rt, uint64_t start, uint64_t size);
void range_tree_resize_segment(range_tree_t *rt, range_seg_t *rs,
uint64_t newstart, uint64_t newsize);
uint64_t range_tree_space(range_tree_t *rt);
boolean_t range_tree_is_empty(range_tree_t *rt);
void range_tree_verify(range_tree_t *rt, uint64_t start, uint64_t size);
void range_tree_swap(range_tree_t **rtsrc, range_tree_t **rtdst);
void range_tree_stat_verify(range_tree_t *rt);
uint64_t range_tree_min(range_tree_t *rt);
uint64_t range_tree_max(range_tree_t *rt);
uint64_t range_tree_span(range_tree_t *rt);

void range_tree_add(void *arg, uint64_t start, uint64_t size);
void range_tree_remove(void *arg, uint64_t start, uint64_t size);
3 changes: 3 additions & 0 deletions include/sys/vdev_removal.h
@@ -86,6 +86,9 @@ extern void spa_vdev_remove_suspend(spa_t *);
extern int spa_vdev_remove_cancel(spa_t *);
extern void spa_vdev_removal_destroy(spa_vdev_removal_t *svr);

extern int vdev_removal_max_span;
extern int zfs_remove_max_segment;

#ifdef __cplusplus
}
#endif
59 changes: 59 additions & 0 deletions man/man5/zfs-module-parameters.5
@@ -425,6 +425,24 @@ create) will return ENOSPC.
Default value: \fB5\fR.
.RE

.sp
.ne 2
.na
\fBvdev_removal_max_span\fR (int)
.ad
.RS 12n
During top-level vdev removal, chunks of data are copied from the vdev
being removed; these chunks may include free space in order to trade
bandwidth for IOPS.
This parameter determines the maximum span of free space (in bytes)
which will be included as "unnecessary" data in a chunk of copied data.

The default value here was chosen to align with
\fBzfs_vdev_read_gap_limit\fR, which is a similar concept when doing
regular reads (but there's no reason it has to be the same).
.sp
Default value: \fB32,768\fR.
.RE

.sp
.ne 2
.na
@@ -868,6 +886,47 @@ transaction record (itx).
Default value: \fB5\fR%.
.RE

.sp
.ne 2
.na
\fBzfs_condense_indirect_vdevs_enable\fR (int)
.ad
.RS 12n
Enable condensing indirect vdev mappings. When set to a non-zero value,
attempt to condense indirect vdev mappings if the mapping uses more than
\fBzfs_condense_min_mapping_bytes\fR bytes of memory and if the obsolete
space map object uses more than \fBzfs_condense_max_obsolete_bytes\fR
bytes on-disk. The condensing process is an attempt to save memory by
removing obsolete mappings.
.sp
Default value: \fB1\fR.
.RE

.sp
.ne 2
.na
\fBzfs_condense_max_obsolete_bytes\fR (ulong)
.ad
.RS 12n
Only attempt to condense indirect vdev mappings if the on-disk size
of the obsolete space map object is greater than this number of bytes
(see \fBzfs_condense_indirect_vdevs_enable\fR).
.sp
Default value: \fB1,073,741,824\fR.
.RE

.sp
.ne 2
.na
\fBzfs_condense_min_mapping_bytes\fR (ulong)
.ad
.RS 12n
Minimum size of vdev mapping to attempt to condense (see
\fBzfs_condense_indirect_vdevs_enable\fR).
.sp
Default value: \fB131,072\fR.
.RE

.sp
.ne 2
.na
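
Taken together, the three zfs_condense_* parameters above gate condensing
roughly as sketched below.  This is a simplified illustration based on the
descriptions in this man page, not the function from vdev_indirect.c, and
it omits the percentage-based trigger (zfs_indirect_condense_obsolete_pct)
that also exists in the code.

    #include <sys/zfs_context.h>

    extern int zfs_condense_indirect_vdevs_enable;
    extern unsigned long zfs_condense_min_mapping_bytes;
    extern unsigned long zfs_condense_max_obsolete_bytes;

    /*
     * Illustrative only: condensing is considered when it is enabled, the
     * in-core mapping is large enough to be worth rewriting, and the
     * obsolete space map object has outgrown its on-disk budget.
     */
    static boolean_t
    should_condense_sketch(uint64_t mapping_bytes, uint64_t obsolete_sm_bytes)
    {
            if (!zfs_condense_indirect_vdevs_enable)
                    return (B_FALSE);
            if (mapping_bytes <= zfs_condense_min_mapping_bytes)
                    return (B_FALSE);
            if (obsolete_sm_bytes <= zfs_condense_max_obsolete_bytes)
                    return (B_FALSE);
            return (B_TRUE);
    }
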
30 changes: 28 additions & 2 deletions module/zfs/range_tree.c
@@ -491,15 +491,14 @@ range_tree_resize_segment(range_tree_t *rt, range_seg_t *rs,
static range_seg_t *
range_tree_find_impl(range_tree_t *rt, uint64_t start, uint64_t size)
{
avl_index_t where;
range_seg_t rsearch;
uint64_t end = start + size;

VERIFY(size != 0);

rsearch.rs_start = start;
rsearch.rs_end = end;
return (avl_find(&rt->rt_root, &rsearch, &where));
return (avl_find(&rt->rt_root, &rsearch, NULL));
}

range_seg_t *
@@ -599,6 +598,13 @@ range_tree_space(range_tree_t *rt)
return (rt->rt_space);
}

boolean_t
range_tree_is_empty(range_tree_t *rt)
{
ASSERT(rt != NULL);
return (range_tree_space(rt) == 0);
}

/* Generic range tree functions for maintaining segments in an AVL tree. */
void
rt_avl_create(range_tree_t *rt, void *arg)
@@ -643,3 +649,23 @@ rt_avl_vacate(range_tree_t *rt, void *arg)
*/
rt_avl_create(rt, arg);
}

/*
 * The three helpers below return the lowest start offset, the highest end
 * offset, and the overall span of the segments in the tree.  Each returns
 * 0 for an empty tree.
 */
uint64_t
range_tree_min(range_tree_t *rt)
{
range_seg_t *rs = avl_first(&rt->rt_root);
return (rs != NULL ? rs->rs_start : 0);
}

uint64_t
range_tree_max(range_tree_t *rt)
{
range_seg_t *rs = avl_last(&rt->rt_root);
return (rs != NULL ? rs->rs_end : 0);
}

uint64_t
range_tree_span(range_tree_t *rt)
{
return (range_tree_max(rt) - range_tree_min(rt));
}
12 changes: 10 additions & 2 deletions module/zfs/vdev_indirect.c
@@ -171,7 +171,7 @@
* object.
*/

boolean_t zfs_condense_indirect_vdevs_enable = B_TRUE;
int zfs_condense_indirect_vdevs_enable = B_TRUE;

/*
* Condense if at least this percent of the bytes in the mapping is
@@ -188,7 +188,7 @@ int zfs_indirect_condense_obsolete_pct = 25;
* consumed by the obsolete space map; the default of 1GB is small enough
* that we typically don't mind "wasting" it.
*/
uint64_t zfs_condense_max_obsolete_bytes = 1024 * 1024 * 1024;
unsigned long zfs_condense_max_obsolete_bytes = 1024 * 1024 * 1024;

/*
* Don't bother condensing if the mapping uses less than this amount of
@@ -1701,10 +1701,18 @@ EXPORT_SYMBOL(vdev_obsolete_counts_are_precise);
EXPORT_SYMBOL(vdev_obsolete_sm_object);

/* CSTYLED */
module_param(zfs_condense_indirect_vdevs_enable, int, 0644);
MODULE_PARM_DESC(zfs_condense_indirect_vdevs_enable,
"Whether to attempt condensing indirect vdev mappings");

module_param(zfs_condense_min_mapping_bytes, ulong, 0644);
MODULE_PARM_DESC(zfs_condense_min_mapping_bytes,
"Minimum size of vdev mapping to condense");

module_param(zfs_condense_max_obsolete_bytes, ulong, 0644);
MODULE_PARM_DESC(zfs_condense_max_obsolete_bytes,
"Minimum size obsolete spacemap to attempt condensing");

module_param(zfs_condense_indirect_commit_entry_delay_ms, int, 0644);
MODULE_PARM_DESC(zfs_condense_indirect_commit_entry_delay_ms,
"Delay while condensing vdev mapping");
23 changes: 17 additions & 6 deletions module/zfs/vdev_label.c
Expand Up @@ -515,22 +515,33 @@ vdev_config_generate(spa_t *spa, vdev_t *vd, boolean_t getstats,
* histograms.
*/
uint64_t seg_count = 0;
uint64_t to_alloc = vd->vdev_stat.vs_alloc;

/*
* There are the same number of allocated segments
* as free segments, so we will have at least one
* entry per free segment.
* entry per free segment. However, small free
* segments (smaller than vdev_removal_max_span)
* will be combined with adjacent allocated segments
* as a single mapping.
*/
for (int i = 0; i < RANGE_TREE_HISTOGRAM_SIZE; i++) {
seg_count += vd->vdev_mg->mg_histogram[i];
if (1ULL << (i + 1) < vdev_removal_max_span) {
to_alloc +=
vd->vdev_mg->mg_histogram[i] <<
(i + 1);
} else {
seg_count +=
vd->vdev_mg->mg_histogram[i];
}
}

/*
* The maximum length of a mapping is SPA_MAXBLOCKSIZE,
* so we need at least one entry per SPA_MAXBLOCKSIZE
* of allocated data.
* The maximum length of a mapping is
* zfs_remove_max_segment, so we need at least one entry
* per zfs_remove_max_segment of allocated data.
*/
seg_count += vd->vdev_stat.vs_alloc / SPA_MAXBLOCKSIZE;
seg_count += to_alloc / zfs_remove_max_segment;

fnvlist_add_uint64(nv, ZPOOL_CONFIG_INDIRECT_SIZE,
seg_count *
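
As a rough illustration of this estimate (the numbers are hypothetical,
and a 16 MB zfs_remove_max_segment is assumed): suppose 1 TB is
allocated and there are 2 million free segments, of which 1.5 million
are smaller than vdev_removal_max_span and together span 20 GB.  Then:

    to_alloc  ≈ 1 TB + 20 GB              (allocated data plus small free gaps)
    seg_count ≈ 500,000                   (free segments too large to merge)
              + 1.02 TB / 16 MB ≈ 66,800  (one entry per zfs_remove_max_segment)
              ≈ 567,000 mapping entries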
