Save migrated replicator checkpoint documents immediately #721

Merged
nickva merged 1 commit into apache:master from cloudant:migrate-replication-checkpoints on Jul 31, 2017

3 participants
@nickva
Contributor

nickva commented Jul 29, 2017

Previously, if the replication id algorithm was updated, the replicator would
migrate checkpoint documents but keep them only in memory. They would be
written to their respective databases only when a checkpoint needed to be
updated, which doesn't happen unless the source database changes. As a result
it was possible for checkpoints to be lost. Here is how it could happen:

  1. Checkpoints were created with the current version (3) of the replication
    id. Assume the replication document contains some credentials that look like
    'adm:pass', and the computed v3 replication id is "3abc...".

  2. Replication id algorithm is updated to version 4. Version 4 ignores
    passwords, such that changing authentication from 'adm:pass' to 'adm:pass2'
    would not change the replication ids.

  3. Server code is updated with version 4. Replicator looks for checkpoints
    with the new version 4 id, which it calculates to be "4def...". It can't
    find one, so it looks for v3, finds "3abc..." and decides to migrate it.
    However, migration only happens in memory. That is, the checkpoint document
    is updated, but it needs a checkpoint to happen before it is written to disk.

  4. There are no changes to the source db. So no checkpoints are forced to
    happen.

  5. User hears that in the new replicator version passwords no longer alter
    the replication ids, so all the checkpoints should be reused. They update
    the replication document with their new credentials - adm:pass2.

  6. The updated document with 'adm:pass2' credentials is processed by the
    replicator. It computes the v4 replication id - "4def...". It's the same as
    before since it wasn't affected by the pass -> pass2 change. That
    replication checkpoint document is found on neither source nor target.
    Replicator then computes v3 of the id to find the older version. However,
    v3 is affected by the passwords, so this time it computes "3ghi...", which
    is different from the previous v3 id, "3abc...". It cannot find it. It
    computes v2 and checks, then v1, and eventually gives up without finding a
    checkpoint and restarts the changes feed from 0 again.
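
The id behavior in the steps above can be illustrated with a toy model. This is only a sketch: `toy_rep_id`, its arguments and the hashing scheme are hypothetical and are not CouchDB's actual algorithm. The only property taken from the description is that ids before version 4 depend on the password and version 4 ids do not.

```python
import hashlib


def toy_rep_id(version, source, target, user, password):
    # Hypothetical stand-in for a versioned replication id; not the real algorithm.
    parts = [source, target, user]
    if version < 4:
        # Assumption taken from the description: older versions are
        # affected by the password, version 4 ignores it.
        parts.append(password)
    digest = hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()
    return f"{version}{digest[:8]}"


old_v3 = toy_rep_id(3, "src", "tgt", "adm", "pass")
new_v3 = toy_rep_id(3, "src", "tgt", "adm", "pass2")
old_v4 = toy_rep_id(4, "src", "tgt", "adm", "pass")
new_v4 = toy_rep_id(4, "src", "tgt", "adm", "pass2")

print(old_v3 != new_v3)  # the v3 id follows the password change
print(old_v4 == new_v4)  # the v4 id is unaffected by it
```

After the password change, a checkpoint stored only under the old v3 id can no longer be reached through either the v4 id or the newly computed v3 id.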

To fix it, update find_replication_logs to also write the migrated
replication checkpoint documents to their respective databases as soon as it
finds them.
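
A minimal sketch of the idea, assuming a database can be modeled as a plain dict of documents keyed by id. The function `find_checkpoint`, the `save_immediately` flag and the `source_last_seq` field are illustrative names only; the actual change is in the Erlang function `find_replication_logs`, which is not reproduced here, and which deals with both source and target rather than a single database.

```python
def find_checkpoint(db, candidate_ids, save_immediately=True):
    """Look up a checkpoint by trying ids from newest version to oldest.

    db is a dict standing in for a database (hypothetical model).
    candidate_ids is ordered newest version first, e.g. ["4def", "3abc"].
    """
    newest_key = "_local/" + candidate_ids[0]
    for rep_id in candidate_ids:
        key = "_local/" + rep_id
        doc = db.get(key)
        if doc is None:
            continue
        if key == newest_key:
            return doc
        # Found under an older id: migrate it to the newest id.
        migrated = dict(doc, _id=newest_key)
        if save_immediately:
            # The fix: persist the migrated document right away instead
            # of waiting for the next checkpoint.
            db[newest_key] = migrated
        return migrated
    return None


db = {"_local/3abc": {"_id": "_local/3abc", "source_last_seq": 42}}

# Upgrade to v4: the checkpoint is found under the v3 id and migrated.
find_checkpoint(db, ["4def", "3abc"])

# Password change: the v3 id is now "3ghi", but the v4 id still matches.
print(find_checkpoint(db, ["4def", "3ghi"]))
```

With `save_immediately=False` the second lookup in this model finds nothing, which corresponds to the lost-checkpoint scenario described above.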

Related to issue #689

@rnewson

Good catch here, though I can't guess what percentage of replication docs represent quiescent replication sources.

@davisp since this is a change to how we upgrade checkpoints, could you cast an eye over it? It looks fine to me, we're just saving what we would have saved later.

@davisp

davisp approved these changes Jul 31, 2017

+1

LGTM once that catch/ignore clause is removed.

@nickva nickva merged commit 1022c25 into apache:master Jul 31, 2017

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

@nickva nickva deleted the cloudant:migrate-replication-checkpoints branch Jul 31, 2017
