After connection problems: cannot receive incremental stream: destination ... has been modified #203

Open
pschwarzpp opened this issue Aug 14, 2023 · 12 comments


pschwarzpp commented Aug 14, 2023

We have been using zrep for multiple years to replicate over SSH, but in the last few months we have suffered from less-than-optimal connectivity, which is causing pain with zrep.

The error message is inaccurate, resulting from zrep being confused by its own error handling. When we encounter the issue, manual steps are needed to resolve it.

Order of events

Connection problem

We encounter a connection problem during zrep synconly; this log is from the source host, where we run the command:

packet_write_wait: Connection to x.x.x.x port 22: Broken pipe
packet_write_wait: Connection to UNKNOWN port 65535: Broken pipe
Error: Problem doing sync for esrepo/snap_esids@zrep_000bc6. Renamed to esrepo/snap_esids@zrep_000bc6_unsent

Next run fails

This happens with all subsequent runs:

cannot receive incremental stream: destination esrepo-targethost/snap_esids has been modified since most recent snapshot
warning: cannot send 'esrepo/snap_esids@zrep_000bc6_unsent': signal received
Error: Problem doing sync for esrepo/snap_esids@zrep_000bc7. Renamed to esrepo/snap_esids@zrep_000bc7_unsent

We usually do zrep snaponly followed by zrep synconly to ensure consistency with the systems running on top of the replicated dataset.
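
Roughly, each cycle looks like the following sketch (the quiesce/resume hooks are placeholders for our own consistency handling, not real commands, and the dataset name is just the one from this report):

app_quiesce                        # placeholder: flush/pause whatever runs on top of the dataset
zrep snaponly esrepo/snap_esids    # snapshot while the application state is consistent
app_resume                         # placeholder: let the application continue
zrep synconly esrepo/snap_esids    # replicate separately, so a slow link does not block the application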

Failing run with ZREP_VERBOSE

# ZREP_VERBOSE=true zrep synconly esrepo/snap_esids
sending esrepo/snap_esids@zrep_000bfb to targethost.domain.com:esrepo-targethost/snap_esids
send from @zrep_000bc5 to esrepo/snap_esids@zrep_000bc6_unsent estimated size is 2.90M
send from @zrep_000bc6_unsent to esrepo/snap_esids@zrep_000bc7_unsent estimated size is 6.42M
# more in between
send from @zrep_000bf8_unsent to esrepo/snap_esids@zrep_000bf9_unsent estimated size is 11.3M
send from @zrep_000bf9_unsent to esrepo/snap_esids@zrep_000bfa_unsent_unsent_unsent estimated size is 27.4M
# multiple _unsent suffixes because we did some failing synconly calls without snaponly in between
send from @zrep_000bfa_unsent_unsent_unsent to esrepo/snap_esids@zrep_000bfb estimated size is 13.6M
total estimated size is 2.57G
TIME        SENT   SNAPSHOT
cannot receive incremental stream: destination esrepo-targethost/snap_esids has been modified
since most recent snapshot
warning: cannot send 'esrepo/snap_esids@zrep_000bc6_unsent': signal received
Error: Problem doing sync for esrepo/snap_esids@zrep_000bfb. Renamed to esrepo/snap_esids@zrep_000bfb_unsent

Further info / workaround

Latest snapshot on target

$ zfs list -t snap -r esrepo-targethost/snap_esids | tail -1
esrepo-targethost/snap_esids@zrep_000bc6     0B      -     12.3G  -

But we see above that zrep thinks it needs to send from @zrep_000bc5 instead of @zrep_000bc6. @zrep_000bc6 has been renamed to @zrep_000bc6_unsent on the source.

Comparing @zrep_000bc6

Source

# zfs get guid esrepo/snap_esids@zrep_000bc6_unsent
NAME                                  PROPERTY  VALUE  SOURCE
esrepo/snap_esids@zrep_000bc6_unsent  guid      12096180512352983159  -

Target

$ zfs get guid esrepo-targethost/snap_esids@zrep_000bc6
NAME                                        PROPERTY  VALUE  SOURCE
esrepo-targethost/snap_esids@zrep_000bc6  guid      12096180512352983159  -

Fixing replication

Rename

Since @zrep_000bc6 has actually been successfully transferred (same guid), we rename it on the source so it has the same name as on the target:

# zfs rename esrepo/snap_esids@zrep_000bc6_unsent esrepo/snap_esids@zrep_000bc6
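
For a dataset that has accumulated several _unsent snapshots, a generalized version of this manual fix could look like the sketch below (untested; host and dataset names are the ones from this example, and it assumes the same SSH access zrep itself uses):

SRC_DS=esrepo/snap_esids
DST_DS=esrepo-targethost/snap_esids
DST_HOST=targethost.domain.com

# guid of the newest snapshot that actually exists on the target
LAST_DST=$(ssh "$DST_HOST" zfs list -H -t snapshot -o name -s creation -d 1 "$DST_DS" | tail -1)
DST_GUID=$(ssh "$DST_HOST" zfs get -H -o value guid "$LAST_DST")

# find the source snapshot with the same guid and strip any trailing _unsent suffixes
for snap in $(zfs list -H -t snapshot -o name -s creation -d 1 "$SRC_DS"); do
    if [ "$(zfs get -H -o value guid "$snap")" = "$DST_GUID" ]; then
        fixed=$(echo "$snap" | sed 's/\(_unsent\)*$//')
        [ "$fixed" != "$snap" ] && zfs rename "$snap" "$fixed"
    fi
done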

Rerun

... and run it again, successfully

# ZREP_VERBOSE=true zrep synconly esrepo/snap_esids
sending esrepo/snap_esids@zrep_000bfb_unsent to targethost.domain.com:esrepo-targethost/snap_esids
send from @zrep_000bc5 to esrepo/snap_esids@zrep_000bc6 estimated size is 2.90M
send from @zrep_000bc6 to esrepo/snap_esids@zrep_000bc7_unsent estimated size is 6.42M
send from @zrep_000bc7_unsent to esrepo/snap_esids@zrep_000bc8_unsent estimated size is 7.53M
send from @zrep_000bc8_unsent to esrepo/snap_esids@zrep_000bc9_unsent estimated size is 2.73M
send from @zrep_000bc9_unsent to esrepo/snap_esids@zrep_000bca_unsent estimated size is 4.49M
send from @zrep_000bca_unsent to esrepo/snap_esids@zrep_000bcb_unsent estimated size is 6.07M
# more in between
send from @zrep_000bf4_unsent to esrepo/snap_esids@zrep_000bf5_unsent estimated size is 6.74M
send from @zrep_000bf5_unsent to esrepo/snap_esids@zrep_000bf6_unsent estimated size is 6.10M
send from @zrep_000bf6_unsent to esrepo/snap_esids@zrep_000bf7_unsent estimated size is 11.9M
send from @zrep_000bf7_unsent to esrepo/snap_esids@zrep_000bf8_unsent estimated size is 10.6M
send from @zrep_000bf8_unsent to esrepo/snap_esids@zrep_000bf9_unsent estimated size is 11.3M
send from @zrep_000bf9_unsent to esrepo/snap_esids@zrep_000bfa_unsent_unsent_unsent estimated size is 27.4M
send from @zrep_000bfa_unsent_unsent_unsent to esrepo/snap_esids@zrep_000bfb_unsent estimated size is 13.6M
total estimated size is 2.57G
TIME        SENT   SNAPSHOT
TIME        SENT   SNAPSHOT
07:41:56   2.48M   esrepo/snap_esids@zrep_000bc7_unsent
TIME        SENT   SNAPSHOT
TIME        SENT   SNAPSHOT
TIME        SENT   SNAPSHOT
TIME        SENT   SNAPSHOT
07:42:00   5.89M   esrepo/snap_esids@zrep_000bcb_unsent
# cut out some progress
07:43:31   1013M   esrepo/snap_esids@zrep_000bf4_unsent
07:43:32   1.04G   esrepo/snap_esids@zrep_000bf4_unsent
07:43:33   1.09G   esrepo/snap_esids@zrep_000bf4_unsent
TIME        SENT   SNAPSHOT
07:43:35   3.16M   esrepo/snap_esids@zrep_000bf5_unsent
07:43:36   3.16M   esrepo/snap_esids@zrep_000bf5_unsent
07:43:37   3.16M   esrepo/snap_esids@zrep_000bf5_unsent
07:43:38   3.16M   esrepo/snap_esids@zrep_000bf5_unsent
07:43:39   3.16M   esrepo/snap_esids@zrep_000bf5_unsent
TIME        SENT   SNAPSHOT
TIME        SENT   SNAPSHOT
TIME        SENT   SNAPSHOT
07:43:42   6.27M   esrepo/snap_esids@zrep_000bf8_unsent
TIME        SENT   SNAPSHOT
TIME        SENT   SNAPSHOT
07:43:44   21.8M   esrepo/snap_esids@zrep_000bfa_unsent_unsent_unsent
TIME        SENT   SNAPSHOT
07:43:45   3.00M   esrepo/snap_esids@zrep_000bfb_unsent
Also running expire on targethost.domain.com:esrepo-targethost/snap_esids now...
Expiring zrep snaps on esrepo-targethost/snap_esids

Version used

We are running zrep 1.9.0 on the source and 1.8.0 on the target. We are going to update (and align versions) soon, but from the changelog it's not obvious whether there were any related changes.

ppbrown (Member) commented Aug 15, 2023 via email

pschwarzpp (Author) commented Aug 16, 2023

I reproduced the issue with freshly installed VMs (Ubuntu Server 22.04) and the latest version of zrep:

root@zrep-source:~# zfs --version
zfs-2.1.5-1ubuntu6~22.04.1
zfs-kmod-2.1.5-1ubuntu6~22.04.1

The problem only occurs when the connection is dropped after the actual data has been copied by zfs; if it drops mid-stream, the next run won't face the problem. I assume this explains why, in our production setup, we almost only see the problem for a dataset with low activity and basically never for a much busier one.

To reproduce, I killed the SSH session on the target host with kill $(ps -ef | grep "sshd: root@notty" | grep -v grep | gawk '{print $2}').

Another transcript with a few annotations

Successful run before reproducing

root@zrep-source:~# zrep snaponly rpool/ds1
rpool/ds1@zrep_000014
root@zrep-source:~# zrep synconly rpool/ds1
sending rpool/ds1@zrep_000014 to zrep-target:rpool/ds1
send from @zrep_000012 to rpool/ds1@zrep_000013_unsent estimated size is 100M
send from @zrep_000013_unsent to rpool/ds1@zrep_000014 estimated size is 624B
total estimated size is 100M
TIME        SENT   SNAPSHOT rpool/ds1@zrep_000013_unsent
08:02:49   3.13M   rpool/ds1@zrep_000013_unsent
[...]
08:03:02   97.5M   rpool/ds1@zrep_000013_unsent
Also running expire on zrep-target:rpool/ds1 now...
Expiring zrep snaps on rpool/ds1

Write a bit of data

root@zrep-source:~# dd bs=1M if=/dev/urandom of=/rpool/ds1/50M.bin count=50 conv=fsync
50+0 records in
50+0 records out
52428800 bytes (52 MB, 50 MiB) copied, 12.9225 s, 4.1 MB/s

Next run, triggering the problem

Note that I kill SSH on the target just as the process stalls briefly at the end. You know you killed it at the right time if rpool/ds1@zrep_000015 becomes visible on the target a few seconds later (zfs list -t snap -r rpool/ds1).

root@zrep-source:~# zrep snaponly rpool/ds1
rpool/ds1@zrep_000015
root@zrep-source:~# zrep synconly rpool/ds1
sending rpool/ds1@zrep_000015 to zrep-target:rpool/ds1
send from @zrep_000014 to rpool/ds1@zrep_000015 estimated size is 50.1M
total estimated size is 50.1M
TIME        SENT   SNAPSHOT rpool/ds1@zrep_000015
08:04:33   6.40M   rpool/ds1@zrep_000015
08:04:34   16.0M   rpool/ds1@zrep_000015
08:04:35   25.6M   rpool/ds1@zrep_000015
08:04:36   35.1M   rpool/ds1@zrep_000015
08:04:37   44.6M   rpool/ds1@zrep_000015
Connection to zrep-target closed by remote host.
Error: Problem doing sync for rpool/ds1@zrep_000015. Renamed to rpool/ds1@zrep_000015_unsent

Subsequent runs fail

root@zrep-source:~# zrep synconly rpool/ds1
sending rpool/ds1@zrep_000015_unsent to zrep-target:rpool/ds1
send from @zrep_000014 to rpool/ds1@zrep_000015_unsent estimated size is 50.1M
total estimated size is 50.1M
cannot receive incremental stream: destination rpool/ds1 has been modified
since most recent snapshot
Error: Problem doing sync for rpool/ds1@zrep_000015_unsent. Renamed to rpool/ds1@zrep_000015_unsent_unsent
root@zrep-source:~# zrep snaponly rpool/ds1
rpool/ds1@zrep_000016
root@zrep-source:~# zrep synconly rpool/ds1
sending rpool/ds1@zrep_000016 to zrep-target:rpool/ds1
send from @zrep_000014 to rpool/ds1@zrep_000015_unsent_unsent estimated size is 50.1M
send from @zrep_000015_unsent_unsent to rpool/ds1@zrep_000016 estimated size is 624B
total estimated size is 50.1M
cannot receive incremental stream: destination rpool/ds1 has been modified
since most recent snapshot
Error: Problem doing sync for rpool/ds1@zrep_000016. Renamed to rpool/ds1@zrep_000016_unsent

Fix by renaming the latest actually transferred snapshot

root@zrep-source:~# zfs rename rpool/ds1@zrep_000015_unsent_unsent rpool/ds1@zrep_000015

Run again successfully


root@zrep-source:~# zrep synconly rpool/ds1
sending rpool/ds1@zrep_000016_unsent to zrep-target:rpool/ds1
send from @zrep_000014 to rpool/ds1@zrep_000015 estimated size is 50.1M
send from @zrep_000015 to rpool/ds1@zrep_000016_unsent estimated size is 624B
total estimated size is 50.1M
TIME        SENT   SNAPSHOT rpool/ds1@zrep_000015
08:05:46   11.5M   rpool/ds1@zrep_000015
08:05:47   21.2M   rpool/ds1@zrep_000015
08:05:48   30.7M   rpool/ds1@zrep_000015
08:05:49   40.2M   rpool/ds1@zrep_000015
08:05:50   49.7M   rpool/ds1@zrep_000015
Also running expire on zrep-target:rpool/ds1 now...
Expiring zrep snaps on rpool/ds1

ppbrown (Member) commented Aug 16, 2023 via email

pschwarzpp (Author) commented Aug 17, 2023

I see two main ways in which this could be resolved... but I am not an expert on zfs, so...

Potential solutions

a) Improve detection of snapshots on the target

To me it seems that, in the output above, zrep still transfers too much after the snapshot has been renamed on the source:

  • initially the transfer of rpool/ds1@zrep_000014 --> rpool/ds1@zrep_000015 failed (or rather: seemed to fail but actually completed), causing it to be renamed to rpool/ds1@zrep_000015_unsent
  • we then manually rename it back to rpool/ds1@zrep_000015 on the source
  • the next run seems to again transfer the delta from rpool/ds1@zrep_000014 --> rpool/ds1@zrep_000015

If zrep detected that rpool/ds1@zrep_000015_unsent on the source and rpool/ds1@zrep_000015 on the target are identical (they are, as we can tell from zfs get guid, see the first post), it would never try to transfer that delta again and we would probably not see the error at all. It would also be more efficient in this situation.
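
Not zrep's actual logic, just a sketch of what such a guid-aware base selection could look like, using the names from the reproduction above:

# guid of the newest snapshot present on the target
DST_GUID=$(ssh zrep-target zfs list -H -t snapshot -o guid -s creation -d 1 rpool/ds1 | tail -1)
# source snapshot with the same guid, regardless of any _unsent renaming
BASE=$(zfs list -H -t snapshot -o name,guid -d 1 rpool/ds1 | awk -v g="$DST_GUID" '$2 == g {print $1}')
# send only the deltas the target is really missing, up to the newest source snapshot
NEWEST=$(zfs list -H -t snapshot -o name -s creation -d 1 rpool/ds1 | tail -1)
zfs send -I "$BASE" "$NEWEST" | ssh zrep-target zfs recv rpool/ds1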

b) Don't rename to _unsent

Maybe we can get around the problem by not renaming upon error. I see that there is ZREP_RENAME_UNSENT, but since it does not just skip the rename but also destroys the (in our case successfully transferred) snapshot on the source (https://github.com/bolthole/zrep/blob/v2.0.2/zrep#L1866), I believe this option would make it even worse for us (we would then need to delete the snapshot on the target).

Preference

To me, a) seems much better; b) is more of a workaround. But I am not aware of how the current logic decides which deltas need to be transferred.

ppbrown (Member) commented Aug 17, 2023 via email

pschwarzpp (Author) commented Aug 17, 2023

I disagree with calling my suggestion above a workaround. Connection issues do happen, and a replication tool intended to be used across a network should recover from them. I would consider zrep not recognizing a snapshot it has transferred to be a bug.

We create read-only clones (zfs clone -o readonly=on) of the replicated snapshots for verification purposes (clones are mentioned in http://www.bolthole.com/solaris/zrep/zrep.documentation.html#tuning). Does this make "force" unfeasible for us?
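
For reference, the clones on the target look roughly like this (the clone name here is made up for illustration):

$ zfs clone -o readonly=on esrepo-targethost/snap_esids@zrep_000bc6 esrepo-targethost/verify_esids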

In the simple example above where the issue was reproduced, nothing was done on the target except killing SSH to simulate connection loss. Our mechanism that creates and cleans up the clones has never caused a "destination has been modified" error over the years (in fact, we never saw that error at all). The problem we are seeing now is purely due to the connection issues paired with zrep's error handling.

pschwarzpp (Author) commented Aug 17, 2023

Maybe a third strategy ("do nothing"), next to "rename to _unsent" and "delete if the transfer seems to have failed", would be a very easy fix.

If zrep simply keeps the snapshot but does not rename it, the system would be in the same state as it is after we manually rename the problematic snapshot on the source. Or am I missing something?

... maybe -r would behave that way?
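
Not zrep's actual code, just a sketch of the three on-error strategies being discussed here (the variable name is made up):

# $SNAP is the snapshot whose send just failed
case "$ON_SYNC_ERROR" in
    rename)  zfs rename "$SNAP" "${SNAP}_unsent" ;;  # current default behaviour
    destroy) zfs destroy "$SNAP" ;;                  # roughly what disabling ZREP_RENAME_UNSENT does
    keep)    : ;;                                    # proposed "do nothing": keep the snapshot as-is
esac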

ppbrown (Member) commented Aug 17, 2023 via email

pschwarzpp (Author) commented:

I have now tried ZREP_RESUME=1, which also does not rename the snapshot when a transfer fails. Using this option, I was unable to reproduce the problem (without it, I can reproduce it reliably).

I noticed that this is not exposed as a normal CLI argument for synconly (or am I making a mistake?), so I wonder if there is a reason not to use it.
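
If there are no known risks, the only change to our regular job would be setting the variable before the sync step, roughly like this (dataset name taken from the test VMs, not our real setup):

export ZREP_RESUME=1
zrep snaponly rpool/ds1 && zrep synconly rpool/ds1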

Transcript with ZREP_RESUME=1

Simulate connection error

root@zrep-source:~# dd bs=1M if=/dev/urandom of=/rpool/ds1/50M.bin count=50 conv=fsync && zrep snaponly rpool/ds1
50+0 records in
50+0 records out
52428800 bytes (52 MB, 50 MiB) copied, 12.2732 s, 4.3 MB/s
rpool/ds1@zrep_000026
root@zrep-source:~# ZREP_RESUME=1 zrep synconly rpool/ds1
sending rpool/ds1@zrep_000026 to zrep-target:rpool/ds1
send from @zrep_000025 to rpool/ds1@zrep_000026 estimated size is 50.1M
total estimated size is 50.1M
TIME        SENT   SNAPSHOT rpool/ds1@zrep_000026
07:42:06   11.7M   rpool/ds1@zrep_000026
07:42:07   21.2M   rpool/ds1@zrep_000026
07:42:08   30.7M   rpool/ds1@zrep_000026
07:42:09   40.2M   rpool/ds1@zrep_000026
07:42:10   49.9M   rpool/ds1@zrep_000026
Connection to zrep-target closed by remote host.
Error: Problem doing sync for rpool/ds1@zrep_000026.

State after connection error

A few seconds after the failed synconly, the snapshot rpool/ds1@zrep_000026 does show up on the target. Only when this happens is the error triggered (or rather, would be triggered):

iso@zrep-target:~$ zfs list -t snap
NAME                    USED  AVAIL     REFER  MOUNTPOINT
rpool/ds1@zrep_000021  50.0M      -     1.12G  -
rpool/ds1@zrep_000022    11K      -     1.12G  -
rpool/ds1@zrep_000023    11K      -     2.10G  -
rpool/ds1@zrep_000024  50.0M      -     2.10G  -
rpool/ds1@zrep_000025  50.0M      -     2.10G  -
rpool/ds1@zrep_000026     0B      -     2.10G  -

Run after connection error

Note that in this situation, too, zrep transfers the delta from @zrep_000025 to @zrep_000026 even though the snapshot already exists on the target (coming back to "a)" in #203 (comment)).

But since we at least have not renamed it on the source after it was copied to the target, this re-transmission works.

root@zrep-source:~# dd bs=1M if=/dev/urandom of=/rpool/ds1/50M.bin count=50 conv=fsync && zrep snaponly rpool/ds1
50+0 records in
50+0 records out
52428800 bytes (52 MB, 50 MiB) copied, 12.505 s, 4.2 MB/s
rpool/ds1@zrep_000027
root@zrep-source:~#
root@zrep-source:~# ZREP_RESUME=1 zrep synconly rpool/ds1
sending rpool/ds1@zrep_000027 to zrep-target:rpool/ds1
send from @zrep_000025 to rpool/ds1@zrep_000026 estimated size is 50.1M
send from @zrep_000026 to rpool/ds1@zrep_000027 estimated size is 50.1M
total estimated size is 100M
TIME        SENT   SNAPSHOT rpool/ds1@zrep_000026
07:43:19   11.5M   rpool/ds1@zrep_000026
07:43:20   21.1M   rpool/ds1@zrep_000026
07:43:21   30.7M   rpool/ds1@zrep_000026
07:43:22   40.2M   rpool/ds1@zrep_000026
07:43:23   49.7M   rpool/ds1@zrep_000026
TIME        SENT   SNAPSHOT rpool/ds1@zrep_000027
07:43:24   6.89M   rpool/ds1@zrep_000027
07:43:25   17.2M   rpool/ds1@zrep_000027
07:43:26   26.7M   rpool/ds1@zrep_000027
07:43:27   36.2M   rpool/ds1@zrep_000027
07:43:28   45.7M   rpool/ds1@zrep_000027
Also running expire on zrep-target:rpool/ds1 now...
Expiring zrep snaps on rpool/ds1
root@zrep-source:~#

Other

> your most recent debug output was described as replicating the problem, so I only looked deeply at that.
> But now you say that the "destination modified" problem is not part of your real problem.

No, sorry if my last post caused confusion. The "destination modified" error after connection loss is the problem affecting us.

ppbrown (Member) commented Aug 18, 2023 via email

pschwarzpp (Author) commented:

Could you please give insight on ZREP_RESUME=1? Are there additional risks / known reasons not to use it?

> I'm presuming this is because of your readonly clone making.
> Even if it theoretically shouldn't... I think it is pretty clear that it is.

It is unrelated. Again: on the VMs where I reproduced the issue, I am not doing any cloning.

ppbrown (Member) commented Aug 18, 2023 via email
