Crash simulator facility without actually SIGKILL #423
Conversation
m_crashed.update([](auto* s) { *s = true; });

// We can restart on a new thread to allow other operations to continue
std::thread t([this]() { m_restart_cb(); });
Will creating a thread cause an issue? We are shutting down the old homestore instance and starting a new one, but we are still on the stack of the old homestore instance. How do we make sure the caller of crash_if_flip_set won't use the old homestore instance?
Actually, we are creating this thread precisely to avoid that issue. In test_common, the new thread shuts down the old homestore instance, sleeps for about 5 seconds, and then starts homestore again. By that time, the caller of crash_if_flip_set() would have returned and done a regular shutdown, right? I am sure we will see some teething issues when we actually use this facility for crash tests, but I think we can fix them then.
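The shutdown/sleep/restart sequence described above can be sketched as follows. This is a minimal illustration, not the actual test_common code; the function name and the configurable delay are assumptions, and the real code may detach the thread instead of joining it:

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical sketch of the restart-on-a-new-thread pattern: tear down the
// old instance, wait long enough for the flip-hitting caller to unwind, then
// bring up a fresh instance.
void restart_on_new_thread(std::function<void()> shutdown_old,
                           std::function<void()> start_new,
                           std::chrono::milliseconds delay) {
    std::thread t([shutdown_old = std::move(shutdown_old),
                   start_new = std::move(start_new), delay] {
        shutdown_old();                      // shut down the old homestore instance
        std::this_thread::sleep_for(delay);  // give the caller's stack time to unwind
        start_new();                         // start the new instance
    });
    t.join();  // joined here for a deterministic sketch; real code may detach
}
```

The delay is what gives the old stack a chance to return out of crash_if_flip_set() before the old instance disappears.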
Yes, the `this` of the caller's class would have been destroyed by the time crash_if_flip_set returned. If it works for you, fine. Ideally the return from crash_if_flip_set would jump to a dummy stack using setjmp. Also, in homestore today, I am not sure how we wait for shutdown to start while I/Os are ongoing.
This is really interesting for simulating a crash without terminating the current process and creating a new one. From my side, I have two concerns:
1. The same as @sanebay's: when the caller returns from crash_if_flip_set(), the host class of the caller might have already been destroyed. But since the function table (code instructions) of every class resides in the static area of the process address space, which is determined by the compiler and the system loader, I think the caller will continue after crash_if_flip_set() returns and will not terminate the whole process.
2. We need to check, for the top-layer caller, whether there is any other memory access after the caller of crash_if_flip_set() returns. If yes, that is dangerous, since that memory may have been freed or reallocated once the class was destroyed, and the consequence of accessing it is undefined.
The concept here isn't that fancy, and I don't think it needs to be. Here is the flow:
Test code:
=> Set a flip to crash at specific points
=> Wait for homestore to restart
=> Proceed with the next set of tests after restart
Homestore (the service which hits the flip):
=> During regular operation, hits the flip
=> In a separate thread, we initiate the homestore shutdown
=> The current thread that hit the flip proceeds as normal and returns to its caller, except that all writes from now until restart will report success without actually being written
=> Homestore shutdown is triggered in parallel
One thing @sanebay mentioned that still needs to be validated is our behavior on a clean shutdown concurrent with I/O. That is needed anyway, because we must support it regardless of how we simulate a crash.
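The Homestore side of the flow above can be sketched as a small class. This is an illustrative shape, not the PR's actual implementation: the class and member names are assumptions, and the real facility uses a wrapped flag (`m_crashed.update(...)`) and different thread ownership:

```cpp
#include <atomic>
#include <functional>
#include <thread>

// Hypothetical crash simulator: when the flip fires, mark the instance as
// crashed, fence all subsequent writes, and run the restart callback on a
// separate thread so the flip-hitting thread can unwind normally.
class CrashSimulator {
public:
    explicit CrashSimulator(std::function<void()> restart_cb)
        : m_restart_cb(std::move(restart_cb)) {}

    // Called from the I/O path when the crash flip is set.
    void crash() {
        m_crashed.store(true, std::memory_order_release);
        // Restart on a new thread to allow other operations to continue.
        std::thread t([this]() { m_restart_cb(); });
        t.join();  // joined here for a deterministic sketch; may be detached
    }

    bool is_crashed() const { return m_crashed.load(std::memory_order_acquire); }

    // After a simulated crash, writes report success but persist nothing.
    bool write(int /*data*/) {
        if (is_crashed()) { return true; }  // fenced: pretend success
        // ... real persistence would happen here ...
        return true;
    }

private:
    std::atomic<bool> m_crashed{false};
    std::function<void()> m_restart_cb;
};
```

The key property is that the thread that hit the flip never blocks on the restart path; it only observes the fenced-write behavior until the new instance is up.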
Clean shutdown needs to wait for outstanding I/O to complete, no matter whether it is user I/O or homestore meta I/O.
#1 HomeObj only knows about its own outstanding I/O (user I/O: get/put blob, create shard, etc.). Once shutdown is received, HO should reject new incoming I/Os, wait for the outstanding ones to complete, and then send the shutdown API down to HomeStore.
#2 HomeStore itself then stops all the services (each service needs to wait on its own outstanding I/Os, in this case only I/Os generated by homestore itself). I think only log/meta/replication and CP need a double check on whether they already wait on outstanding I/Os before stopping.
Also, we should have a test case for shutdown with concurrent I/Os.
These are irrespective of this PR.
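The reject-new/drain-outstanding behavior described in #1 and #2 is essentially a drain gate. A minimal sketch, with assumed names (not HomeStore's actual API):

```cpp
#include <condition_variable>
#include <mutex>

// Hypothetical drain gate: once shutdown begins, new I/Os are rejected and
// the shutting-down thread blocks until every outstanding I/O has ended.
class IoDrainGate {
public:
    // Returns false if shutdown has already begun; the caller must not
    // issue the I/O in that case.
    bool try_begin_io() {
        std::lock_guard<std::mutex> lk(m_mtx);
        if (m_shutting_down) { return false; }
        ++m_outstanding;
        return true;
    }

    // Called on I/O completion; wakes the drainer when the count hits zero.
    void end_io() {
        std::lock_guard<std::mutex> lk(m_mtx);
        if (--m_outstanding == 0) { m_cv.notify_all(); }
    }

    // Reject new I/Os, then block until outstanding I/Os complete.
    void shutdown_and_drain() {
        std::unique_lock<std::mutex> lk(m_mtx);
        m_shutting_down = true;
        m_cv.wait(lk, [this] { return m_outstanding == 0; });
    }

private:
    std::mutex m_mtx;
    std::condition_variable m_cv;
    int m_outstanding{0};
    bool m_shutting_down{false};
};
```

Each layer (HO, then each HomeStore service) would hold its own gate and drain before telling the layer below to shut down.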
I am not quite getting how to use this. A call to crash_if_flip_set / crash will set the crashed flag, resulting in all the writes failing. This part I get.
Who should take the responsibility to shut down the "crashed" HS? Should it be the callback, or the caller of crash?
If the caller, it should do something like:
crash();
shutdown();
wait_until_restart(); // how do we implement this? should we have another `restart_done_cb`?
check_recovery();
Moving the shutdown into the callback seems cleaner:
Caller:
crash();
wait_until_restart();
check_recovery();
Callback:
shutdown();
start_homestore();
Homestore needs to be tested with crash recovery. One roundabout way of doing this is to actually crash, run the tests in a separate process (maybe python), restart the test with another flag, somehow identify that flag, and validate the crash recovery. This is not scalable, and generating new test cases is difficult (especially for unit tests).
With this change, we simulate the crash by fencing all writes after the simulation point and then doing a reboot.