Crash simulator facility without actually SIGKILL #423

hkadayam · 2024-05-15T01:14:42Z

Homestore needs to test with crash recovery. One round about way of doing is to actually crash and then run the tests in a separate process (may be python) and restart the test with another flag and somehow identify that and validate the crash recovery. This is not scalable and generating new test cases is difficult (especially for unit tests)

With this change, we will simulate the crash, by fencing all writes after the simulation and then do reboot.

sanebay · 2024-05-15T01:33:06Z

src/lib/common/crash_simulator.hpp

+            m_crashed.update([](auto* s) { *s = true; });
+
+            // We can restart on a new thread to allow other operations to continue
+            std::thread t([this]() { m_restart_cb(); });


Will creating thread cause issue, as we shutting down the old homestore instance and starting new one. But we are still in stack of old homestore instance. How do we make sure caller of crash_if_flip_set wont use old home store instance ?

Actually we are creating this thread to avoid this issue. In the test_common, in the new thread we are shutting down old homestore instance, sleep for some 5 seconds or so and then starting homestore. By the time, the caller of crash_if_flip_set() would have been returned and would have done a regular shutdown, right? I am sure we will see some teething issues when we actually use this facility to do crash test, but we can fix it then, I think.

Yes. this of the caller class would have been destroyed by the time crash_if_flip_set returned. If it works for you. Ideally the return from crash_if_flip_set should jump to dummy stack using setjmp. Today in homestore, not sure how we wait for shutdown to start as IO's are on going.

this is really very interesting for simulating a crash without terminating the current process and creating a new one. from my side, I have two concerns.

1 the same as @sanebay , when the caller returns from crash_if_flip_set(), the host class of the caller might have already been destroyed. but, since the function table(code instruction) of all the class resides in the static area of the process address space, which is determined by the compiler and the system loader, I think the caller will continue after crash_if_flip_set() returns and will not terminate the whole process.

2 we need to check for the top layer caller if there is any other memory access after the caller of crash_if_flip_set() returns. if yes , that might be a danger since the memory address is occupied or allocated by the destroyed class , and the consequence is not determined to access that memory address now.

The concept here isn't that fancy and I don't think it need to be. Here is the flow:

Test code:
=> Set a flip to crash at specific points
=> Wait for homestore restart
=> Proceed with next set of tests after restart

Homestore (Service which hits the flip)
=> During regular operation, hits the flip
=> In a separate thread, we initiate the homestore_shutdown
=> the current thread that hits the flip, will proceed as normal and goes back to the caller, except that all writes from now on, until restart, will say it is successful, but it will not be written.
=> Homestore is triggered a shutdown in parallel.

There is one thing @sanebay mentioned is still required to be validated is our behavior on clean shutdown in concurrent with IO. That is anyway needed, because we must support irrespective of how we are simulating crash.

Clean shutdown needs to wait for outstanding I/O to complete, no matter it is user I/O or homestore meta I/O.

#1 In HomeObj, if there is no outstanding I/O in HomeObj's knowledge (e.g. HO only knows about user I/O, get/put blob, create shard, etc). Once shutdown is received, HO should reject new incoming I/Os, and wait for outstanding to complete, then send down shutdown API to HomeStore.

#2 HomeStore itself then stop all the services (each services needs to wait on if there is outstanding I/Os there, in this case only I/Os generated by homestore itself), in this case, I think only log/meta/replication and CP need to double check whether it already waits on outstanding I/Os before stop.

Also we should have a test case with shutdown with concurrent I/Os.

These are irrespective to this PR.

xiaoxichen

I am not quite getting how to use this.

call to crash_if_flip_set/crash will set the crashed flag and resulting all the write failure. This part I get.

Who should take the responsibility to shutdown the "crashed" HS ? Should it be the callback or the caller to crash?
the caller should do like

crash();
shutdown();
wait_until_restart();  // how we implement this ? shold we have another `restart_done_cb`?  
check_recovery();

Moving the shutdown into callback seems more clean.
caller

crash();
wait_until_restart(); 
check_recovery();

callback

shutdown();
start_homestore();

Crash simulator facility without actually SIGKILL

f74b0ec

hkadayam requested review from sanebay, yamingk and shosseinimotlagh May 15, 2024 01:14

sanebay reviewed May 15, 2024

View reviewed changes

Resolve build failure and conflicting files

95916cc

xiaoxichen reviewed May 17, 2024

View reviewed changes

yamingk previously approved these changes Jun 6, 2024

View reviewed changes

Merge branch 'master' into crash_simulate

6ebc863

hkadayam dismissed yamingk’s stale review via 6ebc863 June 6, 2024 22:15

hkadayam merged commit 3748a90 into eBay:master Jun 6, 2024
21 checks passed

hkadayam deleted the crash_simulate branch June 6, 2024 23:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Crash simulator facility without actually SIGKILL #423

Crash simulator facility without actually SIGKILL #423

hkadayam commented May 15, 2024

sanebay May 15, 2024

hkadayam May 15, 2024

sanebay May 15, 2024

JacksonYao287 May 17, 2024

hkadayam May 30, 2024

yamingk May 30, 2024

xiaoxichen left a comment •

edited

Loading

Crash simulator facility without actually SIGKILL #423

Crash simulator facility without actually SIGKILL #423

Conversation

hkadayam commented May 15, 2024

sanebay May 15, 2024

Choose a reason for hiding this comment

hkadayam May 15, 2024

Choose a reason for hiding this comment

sanebay May 15, 2024

Choose a reason for hiding this comment

JacksonYao287 May 17, 2024

Choose a reason for hiding this comment

hkadayam May 30, 2024

Choose a reason for hiding this comment

yamingk May 30, 2024

Choose a reason for hiding this comment

xiaoxichen left a comment • edited Loading

Choose a reason for hiding this comment

xiaoxichen left a comment •

edited

Loading