-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Crash simulator facility without actually SIGKILL #423
Merged
Merged
Changes from 1 commit
Commits
Show all changes
3 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,43 @@ | ||
#pragma once | ||
|
||
#ifdef _PRERELEASE | ||
#include <functional> | ||
#include <sisl/utility/urcu_helper.hpp> | ||
#include <iomgr/iomgr_flip.hpp> | ||
|
||
namespace homestore { | ||
|
||
class CrashSimulator { | ||
public: | ||
CrashSimulator(std::function< void(void) > cb = nullptr) : m_restart_cb{cb} {} | ||
~CrashSimulator() = default; | ||
|
||
void crash() { | ||
if (m_restart_cb) { | ||
m_crashed.update([](auto* s) { *s = true; }); | ||
|
||
// We can restart on a new thread to allow other operations to continue | ||
std::thread t([this]() { m_restart_cb(); }); | ||
t.detach(); | ||
} else { | ||
raise(SIGKILL); | ||
} | ||
} | ||
|
||
bool is_crashed() const { return ((m_restart_cb != nullptr) && *(m_crashed.access().get())); } | ||
|
||
bool crash_if_flip_set(const std::string& flip_name) { | ||
if (iomgr_flip::instance()->test_flip(flip_name)) { | ||
this->crash(); | ||
return true; | ||
} else { | ||
return false; | ||
} | ||
} | ||
|
||
private: | ||
std::function< void(void) > m_restart_cb{nullptr}; | ||
sisl::urcu_scoped_ptr< bool > m_crashed; | ||
}; | ||
} // namespace homestore | ||
#endif |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will creating thread cause issue, as we shutting down the old homestore instance and starting new one. But we are still in stack of old homestore instance. How do we make sure caller of crash_if_flip_set wont use old home store instance ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually we are creating this thread to avoid this issue. In the test_common, in the new thread we are shutting down old homestore instance, sleep for some 5 seconds or so and then starting homestore. By the time, the caller of crash_if_flip_set() would have been returned and would have done a regular shutdown, right? I am sure we will see some teething issues when we actually use this facility to do crash test, but we can fix it then, I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. this of the caller class would have been destroyed by the time crash_if_flip_set returned. If it works for you. Ideally the return from crash_if_flip_set should jump to dummy stack using setjmp. Today in homestore, not sure how we wait for shutdown to start as IO's are on going.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is really very interesting for simulating a crash without terminating the current process and creating a new one. from my side, I have two concerns.
1 the same as @sanebay , when the caller returns from
crash_if_flip_set()
, the host class of the caller might have already been destroyed. but, since the function table(code instruction) of all the class resides in the static area of the process address space, which is determined by the compiler and the system loader, I think the caller will continue aftercrash_if_flip_set()
returns and will not terminate the whole process.2 we need to check for the top layer caller if there is any other memory access after the caller of
crash_if_flip_set()
returns. if yes , that might be a danger since the memory address is occupied or allocated by the destroyed class , and the consequence is not determined to access that memory address now.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The concept here isn't that fancy and I don't think it need to be. Here is the flow:
Test code:
=> Set a flip to crash at specific points
=> Wait for homestore restart
=> Proceed with next set of tests after restart
Homestore (Service which hits the flip)
=> During regular operation, hits the flip
=> In a separate thread, we initiate the homestore_shutdown
=> the current thread that hits the flip, will proceed as normal and goes back to the caller, except that all writes from now on, until restart, will say it is successful, but it will not be written.
=> Homestore is triggered a shutdown in parallel.
There is one thing @sanebay mentioned is still required to be validated is our behavior on clean shutdown in concurrent with IO. That is anyway needed, because we must support irrespective of how we are simulating crash.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clean shutdown needs to wait for outstanding I/O to complete, no matter it is user I/O or homestore meta I/O.
#1 In HomeObj, if there is no outstanding I/O in HomeObj's knowledge (e.g. HO only knows about user I/O, get/put blob, create shard, etc). Once shutdown is received, HO should reject new incoming I/Os, and wait for outstanding to complete, then send down shutdown API to HomeStore.
#2 HomeStore itself then stop all the services (each services needs to wait on if there is outstanding I/Os there, in this case only I/Os generated by homestore itself), in this case, I think only log/meta/replication and CP need to double check whether it already waits on outstanding I/Os before stop.
Also we should have a test case with shutdown with concurrent I/Os.
These are irrespective to this PR.