GEODE-8338: Redis commands may be repeated when server dies #5351
Conversation
Added a sleep before returning an error when a server crashed.
Also fixed a race in registering the function after the region is created, and added retry logic to some tests.
We no longer retry on this exception, since the application needs to decide whether the operation needs to be done again.
Removed the sleep that was added before failing due to a server departing.
Changed the CommandFunction and RenameFunction to have consistent retry logic when they fail to lock the primary bucket. The function invoke retry logic now only retries if it sees a PrimaryBucketLockException or if the function is not yet registered.
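A minimal sketch of what retry logic restricted to those two failures could look like. The exception classes, method names, and retry bound below are illustrative stand-ins, not Geode's actual internals:

```java
import java.util.function.Supplier;

public class RetrySketch {
  // Hypothetical stand-ins for the two retryable failure modes described above.
  static class PrimaryBucketLockException extends RuntimeException {}
  static class FunctionNotRegisteredException extends RuntimeException {}

  // Retries only the two transient failures; any other exception
  // propagates to the caller unchanged. Assumes maxAttempts >= 1.
  static <T> T invokeWithRetry(Supplier<T> invocation, int maxAttempts) {
    RuntimeException last = null;
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      try {
        return invocation.get();
      } catch (PrimaryBucketLockException | FunctionNotRegisteredException e) {
        last = e; // transient: safe to retry
      }
    }
    throw last;
  }

  public static void main(String[] args) {
    int[] calls = {0};
    String result = invokeWithRetry(() -> {
      if (calls[0]++ < 2) {
        throw new PrimaryBucketLockException(); // fails twice, then succeeds
      }
      return "OK";
    }, 5);
    System.out.println(result + " after " + calls[0] + " attempts"); // OK after 3 attempts
  }
}
```

The key design point is that the retry set is a whitelist: anything outside those two exceptions is not safe to blindly repeat, for the idempotency reasons discussed below.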
The dunit failure is the known issue https://issues.apache.org/jira/browse/GEODE-8172 and is unrelated to this PR.
        }
      }
    } while (!sesssionObtained);
    return responseBody;
I'd prefer to see this retry logic happen in the Controller, but we can make that change later.
Is the Controller part of the product or part of this test app?
Since we have one redundant copy of the data, and since we modify the data using a function, I think we may have a data corruption issue with non-idempotent operations. What can happen is that an operation like APPEND can:
0) the executor is called on a non-primary redis server,
This may be okay for certain ops (like SADD) that are idempotent (but even they could cause extra key events in the future), but for ops like APPEND we end up appending twice.
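The idempotency distinction can be illustrated with plain Java collections standing in for the Redis data structures. This is only a demonstration of the retry semantics, not geode-redis code:

```java
import java.util.HashSet;
import java.util.Set;

public class IdempotencyDemo {
  public static void main(String[] args) {
    // SADD-like operation: adding the same member twice is harmless.
    Set<String> set = new HashSet<>();
    set.add("member");
    set.add("member"); // retry after a crash: no change
    System.out.println(set.size()); // 1

    // APPEND-like operation: applying it twice corrupts the value.
    StringBuilder value = new StringBuilder("abc");
    value.append("xyz");
    value.append("xyz"); // retry after a crash: duplicated suffix
    System.out.println(value); // abcxyzxyz
  }
}
```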
This will only happen when a server executing a function dies and our function service retries the function on another server, because the function is marked HA. The easy fix is to make our function non-HA, which is a one-line change.
Note that our clients can already see exceptions/errors if the server they are connected to dies. When that happens, the operation they requested may already have completed, and if multiple geode redis servers are running, the result may have been stored and still be in memory. So clients will need some logic to decide whether to redo such an operation (because it may already be done).
Note: making the function non-HA just gives the client one more case in which it needs to handle a server crash. The crash can now be on a server the client was not connected to but that was involved in performing the requested operation.
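One hedged sketch of the client-side "should I redo this?" decision the comments above describe, using a plain Map as a stand-in for the Redis server (a real client would issue GET/APPEND over the wire; `appendIfMissing` and the suffix check are hypothetical, not part of any client library):

```java
import java.util.HashMap;
import java.util.Map;

public class RedoDecisionSketch {
  // After a server-crash error on APPEND, the client cannot know whether
  // the append was applied. One recovery strategy: read the value back
  // and re-append only if the suffix is not already present.
  static void appendIfMissing(Map<String, String> store, String key, String suffix) {
    String current = store.getOrDefault(key, "");
    if (!current.endsWith(suffix)) {
      store.put(key, current + suffix);
    }
  }

  public static void main(String[] args) {
    Map<String, String> store = new HashMap<>();
    store.put("k", "abc");

    // Case 1: the crash happened before the append took effect.
    appendIfMissing(store, "k", "xyz");
    System.out.println(store.get("k")); // abcxyz

    // Case 2: the crash happened after the append; the retry must not duplicate.
    appendIfMissing(store, "k", "xyz");
    System.out.println(store.get("k")); // abcxyz (unchanged)
  }
}
```

The suffix check is only a heuristic, since a value could legitimately end with the appended text; a more robust design would tag each request with a client-generated idempotency token, but that is beyond what this PR requires of clients.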