Account for ashift when gathering buffers to be written to l2arc device

If we don't account for it, we might end up overwriting the disk
area of buffers that have not been evicted yet, because l2arc_evict
operates in terms of disk addresses.

The discrepancy between the write size calculation and the actual
increment to l2ad_hand was introduced in commit 3a17a7a.
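
To make the discrepancy concrete, here is a minimal sketch, not the actual arc.c code. The helpers `round_up_ashift` and `select_buffers` are hypothetical names introduced only for illustration; the real code calls vdev_psize_to_asize(). It assumes the budget check either charges raw buffer sizes (the behaviour this commit replaces) or rounded sizes (the behaviour it introduces), while the write hand always moves by rounded sizes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: round size up to a multiple of (1 << ashift). */
static uint64_t
round_up_ashift(uint64_t size, unsigned ashift)
{
	uint64_t align = (uint64_t)1 << ashift;

	assert(ashift < 63);
	return ((size + align - 1) & ~(align - 1));
}

/*
 * Hypothetical model of the buffer selection loop.  Returns how many
 * leading buffers fit into target_sz bytes.  With use_aligned == 0 the
 * budget is charged with raw sizes; otherwise with rounded sizes.
 * *hand_advance receives the number of bytes the write hand would move,
 * which is always the sum of the rounded sizes.
 */
static size_t
select_buffers(const uint64_t *sizes, size_t n, unsigned ashift,
    uint64_t target_sz, int use_aligned, uint64_t *hand_advance)
{
	uint64_t budget = 0;
	size_t i;

	*hand_advance = 0;
	for (i = 0; i < n; i++) {
		uint64_t a_sz = round_up_ashift(sizes[i], ashift);
		uint64_t charge = use_aligned ? a_sz : sizes[i];

		if (budget + charge > target_sz)
			break;
		budget += charge;
		*hand_advance += a_sz;
	}
	return (i);
}
```

With four 512-byte buffers, ashift = 12 and a 4096-byte target, charging raw sizes selects all four buffers and moves the hand by 16384 bytes, well past the region that eviction cleared. Charging rounded sizes selects one buffer and moves the hand by exactly 4096 bytes.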

The change that introduced l2ad_hand alignment was almost correct
as the write size was accumulated as a sum of rounded buffer sizes.
See commit illumos/illumos-gate@e14bb32.

Also, we now consistently use asize / a_sz for the allocated size and
psize / p_sz for the physical size.  The latter accounts for a
possible size reduction because of the compression, whereas the
former accounts for a possible subsequent size expansion because of
the alignment requirements.
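
The relationship between the two sizes can be sketched as follows. The type `buf_sizes_t` and the function `buf_sizes` are illustrative names that do not exist in ZFS, and the sketch assumes the allocated size is simply the physical size rounded up to the device block size of 2^ashift bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative names only; these are not ZFS definitions. */
typedef struct buf_sizes {
	uint64_t p_sz;	/* physical size: payload bytes after compression */
	uint64_t a_sz;	/* allocated size: p_sz rounded up to a block */
} buf_sizes_t;

/* Derive both sizes from a (possibly compressed) payload size. */
static buf_sizes_t
buf_sizes(uint64_t compressed_sz, unsigned ashift)
{
	buf_sizes_t s;
	uint64_t block = (uint64_t)1 << ashift;

	assert(ashift < 63);
	s.p_sz = compressed_sz;
	s.a_sz = (compressed_sz + block - 1) / block * block;
	return (s);
}
```

For example, a buffer that compresses to 5000 bytes has p_sz = 5000 and, on an ashift = 12 device, a_sz = 8192; on an ashift = 9 device the same buffer has a_sz = 5120.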

The code still assumes that either the underlying storage subsystem or
the hardware is able to do read-modify-write when an L2ARC buffer size
is not a multiple of the disk's block size.  This is true for 4KB sector
disks that provide 512B sector emulation, but may not be true in general.
In other words, we currently do not have any code to make sure that
an L2ARC buffer used for physical I/O, whether compressed or not,
has a suitable size.
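
A check of the kind described as missing could look like the sketch below. `is_sector_multiple` is a hypothetical helper, not part of this commit or of ZFS; it only tests whether a write size is a whole number of sectors and does nothing to fix a size that is not.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: return nonzero if size is a whole number of
 * (1 << sector_shift)-byte sectors, i.e. the write would not require
 * read-modify-write from the layers below.
 */
static int
is_sector_multiple(uint64_t size, unsigned sector_shift)
{
	uint64_t mask = ((uint64_t)1 << sector_shift) - 1;

	assert(sector_shift < 63);
	return ((size & mask) == 0);
}
```

A 1536-byte buffer passes the check for 512-byte sectors (sector_shift = 9) but fails it for 4KB sectors (sector_shift = 12), which is the case where the code relies on sector emulation.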

Note that the cache device utilization is currently calculated based
on the physical size, not the allocated size.  The same applies to
the l2_asize kstat.  That is wrong, but this commit does not fix it.
The accounting problem was introduced partially in commit 3a17a7a
and partially in 3038a2b (accounting became consistent but in favour
of the wrong size).

Porting Notes:

Reworked to be C90 compatible; the 'write_psize' variable was
removed because it is now unused.

References:
  https://reviews.csiden.org/r/229/
  https://reviews.freebsd.org/D2764

Ported-by: kernelOfTruth <kerneloftruth@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#3400
Closes openzfs#3433
Closes openzfs#3451

Cherry-pick ported by: Tim Chase <tim@chase2k.com>
Cherry-picked from ef56b07
avg-I authored and Tim Chase committed Aug 20, 2015
1 parent 44b5ec8 commit 02eb405
Showing 1 changed file with 62 additions and 16 deletions.
78 changes: 62 additions & 16 deletions module/zfs/arc.c
@@ -4957,8 +4957,9 @@ l2arc_write_buffers(spa_t *spa, l2arc_dev_t *dev, uint64_t target_sz,
{
arc_buf_hdr_t *ab, *ab_prev, *head;
list_t *list;
uint64_t write_asize, write_psize, write_sz, headroom,
buf_compress_minsz;
arc_buf_hdr_t *hdr, *hdr_prev, *head;
uint64_t write_asize, write_sz, headroom, buf_compress_minsz,
stats_size;
void *buf_data;
kmutex_t *list_lock = NULL;
boolean_t full;
@@ -4974,7 +4975,7 @@ l2arc_write_buffers(spa_t *spa, l2arc_dev_t *dev, uint64_t target_sz,
*headroom_boost = B_FALSE;

pio = NULL;
write_sz = write_asize = write_psize = 0;
write_sz = write_asize = 0;
full = B_FALSE;
head = kmem_cache_alloc(hdr_cache, KM_PUSHPAGE);
head->b_flags |= ARC_L2_WRITE_HEAD;
@@ -5013,6 +5014,7 @@ l2arc_write_buffers(spa_t *spa, l2arc_dev_t *dev, uint64_t target_sz,
l2arc_buf_hdr_t *l2hdr;
kmutex_t *hash_lock;
uint64_t buf_sz;
uint64_t buf_a_sz;

if (arc_warm == B_FALSE)
ab_prev = list_next(list, ab);
@@ -5041,7 +5043,15 @@ l2arc_write_buffers(spa_t *spa, l2arc_dev_t *dev, uint64_t target_sz,
continue;
}

if ((write_sz + ab->b_size) > target_sz) {
/*
* Assume that the buffer is not going to be compressed
* and could take more space on disk because of a larger
* disk block size.
*/
buf_sz = hdr->b_size;
buf_a_sz = vdev_psize_to_asize(dev->l2ad_vdev, buf_sz);

if ((write_asize + buf_a_sz) > target_sz) {
full = B_TRUE;
mutex_exit(hash_lock);
break;
@@ -5081,13 +5091,34 @@ l2arc_write_buffers(spa_t *spa, l2arc_dev_t *dev, uint64_t target_sz,
* can't access without holding the ARC list locks
* (which we want to avoid during compression/writing)
*/
l2hdr->b_compress = ZIO_COMPRESS_OFF;
l2hdr->b_asize = ab->b_size;
l2hdr->b_tmp_cdata = ab->b_buf->b_data;
HDR_SET_COMPRESS(hdr, ZIO_COMPRESS_OFF);
l2hdr->b_asize = hdr->b_size;
l2hdr->b_hits = 0;
hdr->b_l1hdr.b_tmp_cdata = hdr->b_l1hdr.b_buf->b_data;

buf_sz = ab->b_size;
ab->b_l2hdr = l2hdr;
/*
* Explicitly set the b_daddr field to a known
* value which means "invalid address". This
* enables us to differentiate which stage of
* l2arc_write_buffers() the particular header
* is in (e.g. this loop, or the one below).
* ARC_FLAG_L2_WRITING is not enough to make
* this distinction, and we need to know in
* order to do proper l2arc vdev accounting in
* arc_release() and arc_hdr_destroy().
*
* Note, we can't use a new flag to distinguish
* the two stages because we don't hold the
* header's hash_lock below, in the second stage
* of this function. Thus, we can't simply
* change the b_flags field to denote that the
* IO has been sent. We can change the b_daddr
* field of the L2 portion, though, since we'll
* be holding the l2ad_mtx; which is why we're
* using it to denote the header's state change.
*/
l2hdr->b_daddr = L2ARC_ADDR_UNSET;
hdr->b_flags |= ARC_FLAG_HAS_L2HDR;

list_insert_head(dev->l2ad_buflist, ab);

@@ -5101,6 +5132,7 @@ l2arc_write_buffers(spa_t *spa, l2arc_dev_t *dev, uint64_t target_sz,
mutex_exit(hash_lock);

write_sz += buf_sz;
write_asize += buf_a_sz;
}

mutex_exit(list_lock);
@@ -5117,6 +5149,19 @@ l2arc_write_buffers(spa_t *spa, l2arc_dev_t *dev, uint64_t target_sz,
return (0);
}

/*
* Note that elsewhere in this file arcstat_l2_asize
* and the used space on l2ad_vdev are updated using b_asize,
* which is not necessarily rounded up to the device block size.
* To keep accounting consistent we do the same here as well:
* stats_size accumulates the sum of b_asize of the written buffers,
* while write_asize accumulates the sum of b_asize rounded up
* to the device block size.
* The latter sum is used only to validate the correctness of the code.
*/
stats_size = 0;
write_asize = 0;

/*
* Now start writing the buffers. We're starting at the write head
* and work backwards, retracing the course of the buffer selector
@@ -5164,7 +5209,7 @@ l2arc_write_buffers(spa_t *spa, l2arc_dev_t *dev, uint64_t target_sz,

/* Compression may have squashed the buffer to zero length. */
if (buf_sz != 0) {
uint64_t buf_p_sz;
uint64_t buf_a_sz;

wzio = zio_write_phys(pio, dev->l2ad_vdev,
dev->l2ad_hand, buf_sz, buf_data, ZIO_CHECKSUM_OFF,
@@ -5175,13 +5220,14 @@ l2arc_write_buffers(spa_t *spa, l2arc_dev_t *dev, uint64_t target_sz,
zio_t *, wzio);
(void) zio_nowait(wzio);

write_asize += buf_sz;
stats_size += buf_sz;

/*
* Keep the clock hand suitably device-aligned.
*/
buf_p_sz = vdev_psize_to_asize(dev->l2ad_vdev, buf_sz);
write_psize += buf_p_sz;
dev->l2ad_hand += buf_p_sz;
buf_a_sz = vdev_psize_to_asize(dev->l2ad_vdev, buf_sz);
write_asize += buf_a_sz;
dev->l2ad_hand += buf_a_sz;
}
}

@@ -5191,8 +5237,8 @@ l2arc_write_buffers(spa_t *spa, l2arc_dev_t *dev, uint64_t target_sz,
ARCSTAT_BUMP(arcstat_l2_writes_sent);
ARCSTAT_INCR(arcstat_l2_write_bytes, write_asize);
ARCSTAT_INCR(arcstat_l2_size, write_sz);
ARCSTAT_INCR(arcstat_l2_asize, write_asize);
vdev_space_update(dev->l2ad_vdev, write_asize, 0, 0);
ARCSTAT_INCR(arcstat_l2_asize, stats_size);
vdev_space_update(dev->l2ad_vdev, stats_size, 0, 0);

/*
* Bump device hand to the device start if it is approaching the end.