Resiliency of OM for state machine crash #6717
sumitagrawl started this conversation in Ideas
Concerns of current state machine:
Impact: the OM crashes and the service becomes unavailable.
Solution Points:
1. Support skipping a transaction through configuration
Recovering the system often requires skipping a particular transaction; otherwise the system becomes inoperable.
When the failure comes from a code bug in the operation, users need to be able to decide to skip the transaction and recover the system.
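As a rough illustration of the skip mechanism, the sketch below parses a comma-separated list of transaction indexes from a configuration value and answers whether a given transaction should be skipped. The class name `TransactionSkipList` and the comma-separated format are assumptions made for this example; they are not existing Ozone APIs or configuration keys.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical helper: decides whether a transaction should be skipped during apply. */
final class TransactionSkipList {
    private final Set<Long> skipped;

    /** Parses a comma-separated list of transaction indexes, e.g. "1024, 2048". */
    TransactionSkipList(String configValue) {
        String value = configValue == null ? "" : configValue;
        this.skipped = Arrays.stream(value.split(","))
            .map(String::trim)
            .filter(s -> !s.isEmpty())   // tolerate empty entries and trailing commas
            .map(Long::parseLong)
            .collect(Collectors.toSet());
    }

    /** Returns true if the operator asked for this transaction to be skipped. */
    boolean shouldSkip(long transactionIndex) {
        return skipped.contains(transactionIndex);
    }
}
```

The apply path would consult `shouldSkip` before executing a transaction and log every skip, so that the decision made by the operator stays visible afterwards.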
2. Fail operations gracefully (without terminating) for specific transactions
Operations can be segregated into those that must crash the OM and those that can simply fail:
Critical operations
create/commit and other critical operations whose failure can leave the system inconsistent.
Non-critical operations
Internal cleanup, experimental features, and other operations that are repetitive in nature and whose failure has no major impact on the system, causes no data loss, and triggers no further failures.
The classification of operations should be set in the configuration file for easy control.
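The classification above could be sketched as follows. This is only an illustration under stated assumptions: the class `OperationFailurePolicy`, its `Outcome` enum, and the operation names used as plain strings are hypothetical, and any operation not listed as non-critical is treated as critical so that an unknown operation never fails silently.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical policy: maps a failed operation to "fail" or "terminate". */
final class OperationFailurePolicy {
    enum Outcome { SUCCESS, FAILED, TERMINATE }

    private final Set<String> nonCritical;

    /** Takes a comma-separated list of non-critical operation names from configuration. */
    OperationFailurePolicy(String nonCriticalOperations) {
        String value = nonCriticalOperations == null ? "" : nonCriticalOperations;
        this.nonCritical = Arrays.stream(value.split(","))
            .map(String::trim)
            .filter(s -> !s.isEmpty())
            .collect(Collectors.toSet());
    }

    /** Anything not explicitly configured as non-critical is treated as critical. */
    boolean isCritical(String operationType) {
        return !nonCritical.contains(operationType);
    }

    /** Runs the operation and reports what the caller should do with the result. */
    Outcome apply(String operationType, Runnable operation) {
        try {
            operation.run();
            return Outcome.SUCCESS;
        } catch (RuntimeException e) {
            // Critical failures ask the caller to terminate; others just fail.
            return isCritical(operationType) ? Outcome.TERMINATE : Outcome.FAILED;
        }
    }
}
```

With a configured value such as `"PurgeKeys, DeleteOpenKeys"`, a failing cleanup operation would return `FAILED`, while a failing create or commit would return `TERMINATE`.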
3. Failing operations on timeout
An operation that takes longer than a threshold (for example, 10 minutes) should be terminated and marked as failed. Such an operation is likely stuck, or the system cannot complete it for lack of memory or CPU.
These operations should be failed using an interrupt (critical: crash the system; non-critical: fail the operation).
We already capture metrics for the time taken by these operations.
The threshold needs to be configurable.
Operations should be idempotent: the user gets a "server busy" response, and the server should check for duplicates.
To discuss: whether this approach can handle the snapshot chain getting corrupted, or whether something else is needed.
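One possible shape for the timeout handling is sketched below with the standard `java.util.concurrent` API: the operation runs on a worker thread, and if it exceeds the threshold it is cancelled with an interrupt. The class name `TimedOperationRunner` and the millisecond constructor parameter are assumptions for illustration only. Note that an interrupt stops the operation only if its code responds to interruption (blocking calls or explicit checks), so this is not a guarantee for every stuck operation.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical runner: fails an operation that exceeds a configured threshold. */
final class TimedOperationRunner implements AutoCloseable {
    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    private final long timeoutMillis;

    TimedOperationRunner(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Runs the operation; throws TimeoutException if it exceeds the threshold. */
    <T> T run(Callable<T> operation) throws Exception {
        Future<T> future = executor.submit(operation);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);   // interrupts the worker thread
            throw e;
        } catch (ExecutionException e) {
            // Surface the operation's own exception where possible.
            Throwable cause = e.getCause();
            throw cause instanceof Exception ? (Exception) cause : e;
        }
    }

    @Override
    public void close() {
        executor.shutdownNow();
    }
}
```

The caller would map a `TimeoutException` to a "server busy" response for the client and decide, based on whether the operation is critical, between failing it and terminating the process.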
4. Logging the failed operation
An operation that is terminated abruptly should be logged with its operation type and transaction ID, so that it is clear which transaction failed. (Currently, this is logged only for normal failures.)
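A minimal sketch of such logging is shown below. It wraps the operation, logs the operation type and transaction index for any `Throwable`, and rethrows so that the existing failure handling is unchanged. The class `FailedOperationLogger` and the logger name are hypothetical, and `java.util.logging` is used here only to keep the example self-contained.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical wrapper: logs operation type and transaction index on any failure. */
final class FailedOperationLogger {
    private static final Logger LOG = Logger.getLogger("om.statemachine.sketch");

    /** Builds the log line that identifies the failed transaction. */
    static String describe(String operationType, long transactionIndex, Throwable cause) {
        return String.format("Operation %s failed at transaction index %d: %s",
            operationType, transactionIndex, cause);
    }

    /** Runs the operation; on any failure, logs it and rethrows unchanged. */
    static void runAndLog(String operationType, long transactionIndex, Runnable operation) {
        try {
            operation.run();
        } catch (Throwable t) {
            // Catching Throwable also covers Errors such as OutOfMemoryError.
            LOG.log(Level.SEVERE, describe(operationType, transactionIndex, t), t);
            throw t;
        }
    }
}
```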
5. Alternative approach to crashing
Crashes mostly happen during a Ratis transaction (write operation). Instead of crashing, the OM can disable write operations and serve only reads. This requires some way to elect a leader (or otherwise identify the node providing service) for the read service.
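The read-only fallback could look roughly like the sketch below: a failed write flips a flag instead of terminating the process, later writes are rejected, and reads continue to be served. The class `WriteGate` is hypothetical, and the sketch leaves out the open question of how the serving node is elected or identified.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

/** Hypothetical gate: switches to read-only mode after a write failure. */
final class WriteGate {
    private final AtomicBoolean readOnly = new AtomicBoolean(false);

    boolean isReadOnly() {
        return readOnly.get();
    }

    /** Runs a write; a failure disables further writes instead of crashing. */
    void write(Runnable operation) {
        if (readOnly.get()) {
            throw new IllegalStateException("Writes are disabled; serving reads only");
        }
        try {
            operation.run();
        } catch (RuntimeException e) {
            readOnly.set(true);   // stop accepting writes, keep the process alive
            throw e;
        }
    }

    /** Reads are served regardless of the write state. */
    <T> T read(Supplier<T> operation) {
        return operation.get();
    }
}
```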
other suggestion: