Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash simulator facility without actually SIGKILL #423

Merged
merged 3 commits into from
Jun 6, 2024

Conversation

hkadayam
Copy link
Collaborator

Homestore needs to test with crash recovery. One round about way of doing is to actually crash and then run the tests in a separate process (may be python) and restart the test with another flag and somehow identify that and validate the crash recovery. This is not scalable and generating new test cases is difficult (especially for unit tests)

With this change, we will simulate the crash, by fencing all writes after the simulation and then do reboot.

m_crashed.update([](auto* s) { *s = true; });

// We can restart on a new thread to allow other operations to continue
std::thread t([this]() { m_restart_cb(); });
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will creating thread cause issue, as we shutting down the old homestore instance and starting new one. But we are still in stack of old homestore instance. How do we make sure caller of crash_if_flip_set wont use old home store instance ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually we are creating this thread to avoid this issue. In the test_common, in the new thread we are shutting down old homestore instance, sleep for some 5 seconds or so and then starting homestore. By the time, the caller of crash_if_flip_set() would have been returned and would have done a regular shutdown, right? I am sure we will see some teething issues when we actually use this facility to do crash test, but we can fix it then, I think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. this of the caller class would have been destroyed by the time crash_if_flip_set returned. If it works for you. Ideally the return from crash_if_flip_set should jump to dummy stack using setjmp. Today in homestore, not sure how we wait for shutdown to start as IO's are on going.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is really very interesting for simulating a crash without terminating the current process and creating a new one. from my side, I have two concerns.

1 the same as @sanebay , when the caller returns from crash_if_flip_set(), the host class of the caller might have already been destroyed. but, since the function table(code instruction) of all the class resides in the static area of the process address space, which is determined by the compiler and the system loader, I think the caller will continue after crash_if_flip_set() returns and will not terminate the whole process.

2 we need to check for the top layer caller if there is any other memory access after the caller of crash_if_flip_set() returns. if yes , that might be a danger since the memory address is occupied or allocated by the destroyed class , and the consequence is not determined to access that memory address now.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The concept here isn't that fancy and I don't think it need to be. Here is the flow:

Test code:
=> Set a flip to crash at specific points
=> Wait for homestore restart
=> Proceed with next set of tests after restart

Homestore (Service which hits the flip)
=> During regular operation, hits the flip
=> In a separate thread, we initiate the homestore_shutdown
=> the current thread that hits the flip, will proceed as normal and goes back to the caller, except that all writes from now on, until restart, will say it is successful, but it will not be written.
=> Homestore is triggered a shutdown in parallel.

There is one thing @sanebay mentioned is still required to be validated is our behavior on clean shutdown in concurrent with IO. That is anyway needed, because we must support irrespective of how we are simulating crash.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean shutdown needs to wait for outstanding I/O to complete, no matter it is user I/O or homestore meta I/O.

#1 In HomeObj, if there is no outstanding I/O in HomeObj's knowledge (e.g. HO only knows about user I/O, get/put blob, create shard, etc). Once shutdown is received, HO should reject new incoming I/Os, and wait for outstanding to complete, then send down shutdown API to HomeStore.

#2 HomeStore itself then stop all the services (each services needs to wait on if there is outstanding I/Os there, in this case only I/Os generated by homestore itself), in this case, I think only log/meta/replication and CP need to double check whether it already waits on outstanding I/Os before stop.

Also we should have a test case with shutdown with concurrent I/Os.

These are irrespective to this PR.

Copy link
Contributor

@xiaoxichen xiaoxichen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not quite getting how to use this.

call to crash_if_flip_set/crash will set the crashed flag and resulting all the write failure. This part I get.

Who should take the responsibility to shutdown the "crashed" HS ? Should it be the callback or the caller to crash?
the caller should do like

crash();
shutdown();
wait_until_restart();  // how we implement this ? shold we have another `restart_done_cb`?  
check_recovery();

Moving the shutdown into callback seems more clean.
caller

crash();
wait_until_restart(); 
check_recovery();

callback

shutdown();
start_homestore();

yamingk
yamingk previously approved these changes Jun 6, 2024
@hkadayam hkadayam merged commit 3748a90 into eBay:master Jun 6, 2024
21 checks passed
@hkadayam hkadayam deleted the crash_simulate branch June 6, 2024 23:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants