Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GEODE-8338: Redis commands may be repeated when server dies #5351

Merged
merged 16 commits into from Jul 10, 2020

Conversation

sabbey37
Copy link
Member

@sabbey37 sabbey37 commented Jul 7, 2020

Since we have one redundant copy of the data, and since we modify the data using a function, I think we may have a data corruption issue with non-idempotent operations. What can happen is that an operation like APPEND can:
0) executor called on non-primary redis server,

  1. modify the primary (by sending a function exec to it),
  2. modify the secondary (by sending a geode delta to it),
  3. the primary server fails now (before the function executing on it completes),
  4. the non-primary redis server sees the function fail and that it is marked as HA so it retries it. This time it sends it the secondary, which is the new primary, but the operation was actually done on the secondary so this retry will end up doing the operation twice.

This may be okay for certain ops (like SADD) that are idempotent (but even they could cause extra key events in the future), but for ops like APPEND we end up appending twice.

This will only happen when a server executing a function dies and our function service retries the function on another server because it is marked HA. The easy way to fix this is to change our function to not be HA. This is just a single one line change.
Note that our clients can already see exceptions/errors if the server they are connected to dies. When that happens the operation they requested may have happened, and if they have multiple geode redis servers running it may have been stored and still in memory. So clients will need some logic to decide if they should redo such an operation or not (because it is already done).

Note: By making the function non-HA, it should just give the client another case in which they need to handle a server crash. It can now be for servers they were not connected to but that were involved in performing the operation they requested.

@dschneider-pivotal dschneider-pivotal added the redis Issues related to the geode-for-redis module label Jul 7, 2020
@dschneider-pivotal dschneider-pivotal added this to In progress in Redis Module via automation Jul 7, 2020
@dschneider-pivotal dschneider-pivotal marked this pull request as ready for review July 9, 2020 19:41
the CommandFunction and RenameFunction would have consistent retry logic
if it fails to lock the primary down for computation.
The function invoke retry logic now only retries if it sees
a PrimaryBucketLockException or if the function is not
registered yet.
@dschneider-pivotal
Copy link
Contributor

the dunit failure is the known issue: https://issues.apache.org/jira/browse/GEODE-8172 and unrelated to this PR

Redis Module automation moved this from In progress to Reviewer approved Jul 10, 2020
}
}
} while (!sesssionObtained);
return responseBody;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to see this retry logic happen in the Controller, but we can make that change later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the Controller part of the product or part of this test app?

@dschneider-pivotal dschneider-pivotal merged commit 25bb3b5 into apache:develop Jul 10, 2020
Redis Module automation moved this from Reviewer approved to Done Jul 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
redis Issues related to the geode-for-redis module
Projects
Redis Module
  
Done
3 participants