cluster/ec: fix EIO error for concurrent writes on sparse files
EC doesn't allow concurrent writes on overlapping areas; they are
serialized. However, non-overlapping writes are serviced in parallel.
When a write is not aligned, EC first needs to read the entire chunk
from disk, apply the modified fragment, and write it back.

The problem appears on sparse files because a write to an offset
implicitly creates data on offsets below it (so, in some way, they
are overlapping). For example, if a file is empty and we read 10 bytes
from offset 10, read() will return 0 bytes. Now, if we write one byte
at offset 1M and retry the same read, the system call will return 10
bytes (all containing 0's).

So if we have two writes, the first one at offset 10 and the second one
at offset 1M, EC will send both in parallel because they do not overlap.
However, the first one will try to read missing data from the first chunk
(i.e. offsets 0 to 9) to recombine the entire chunk and do the final write.
This read happens in parallel with the write at offset 1M. It can happen
that half of the bricks process the write before the read, while the
other half process the read before the write. The former will return 10
bytes of zeros, while the latter will return 0 bytes (because the file
on those bricks has not been expanded yet).

When EC tries to recombine the answers from the bricks, it can't, because
it needs more than half consistent answers to recover the data. So this
read fails with EIO error. This error is propagated to the parent write,
which is aborted and EIO is returned to the application.

The issue happened because EC assumed that a write to a given offset
implies that offsets below it exist.

This fix prevents the read of the chunk from bricks if the current size
of the file is smaller than the read chunk offset. This size is
correctly tracked, so this fixes the issue.

This patch also updates test gluster#13 in ec-stripe.t. With this change,
if the file size is smaller than the offset we are writing to, we fill
the head and tail with zeros and do not count it as a stripe cache miss.
This makes sense because we already know what data that region holds, so
there is no need to read it from the bricks.

Upstream-patch: https://review.gluster.org/c/glusterfs/+/23066
Change-Id: Ic342e8c35c555b8534109e9314c9a0710b6225d6
BUG: 1732779
Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
Reviewed-on: https://code.engineering.redhat.com/gerrit/176870
Tested-by: RHGS Build Bot <nigelb@redhat.com>
Reviewed-by: Ashish Pandey <aspandey@redhat.com>
Reviewed-by: Sunil Kumar Heggodu Gopala Acharya <sheggodu@redhat.com>
xhernandez authored and sunilheggodu committed Jul 29, 2019
1 parent a494578 commit 20af152
1 changed file with 13 additions and 6 deletions: xlators/cluster/ec/src/ec-inode-write.c
@@ -1891,15 +1891,22 @@ void ec_writev_start(ec_fop_data_t *fop)
         goto failed_fd;
     }
 
+    tail = fop->size - fop->user_size - fop->head;
     if (fop->head > 0) {
-        if (ec_make_internal_fop_xdata (&xdata)) {
-            err = -ENOMEM;
-            goto failed_xdata;
+        if (current > fop->offset) {
+            if (ec_make_internal_fop_xdata (&xdata)) {
+                err = -ENOMEM;
+                goto failed_xdata;
+            }
+            ec_readv(fop->frame, fop->xl, -1, EC_MINIMUM_MIN,
+                     ec_writev_merge_head, NULL, fd, ec->stripe_size,
+                     fop->offset, 0, xdata);
+        } else {
+            memset(fop->vector[0].iov_base, 0, fop->head);
+            memset(fop->vector[0].iov_base + fop->size - tail, 0, tail);
         }
-        ec_readv(fop->frame, fop->xl, -1, EC_MINIMUM_MIN, ec_writev_merge_head,
-                 NULL, fd, ec->stripe_size, fop->offset, 0, xdata);
     }
-    tail = fop->size - fop->user_size - fop->head;
 
     if ((tail > 0) && ((fop->head == 0) || (fop->size > ec->stripe_size))) {
         /* Current locking scheme will make sure the 'current' below will
          * never decrease while the fop is in progress, so the checks will
