## Added cluster shut down scenario (#25)
```diff
@@ -18,6 +18,8 @@ kraken:
         - litmus_scenarios:                              # List of litmus scenarios to load
             - - https://hub.litmuschaos.io/api/chaos/1.10.0?file=charts/generic/node-cpu-hog/rbac.yaml
               - scenarios/node_hog_engine.yaml
+        - cluster_shut_down_scenario:
+            - scenarios/cluster_shut_down_scenario.yml

 cerberus:
     cerberus_enabled: False                              # Enable it when cerberus is previously installed
```

**Review comment** (on `cluster_shut_down_scenario:`): The scenario type in run_kraken is looking for a scenario type that ends with an "s". Just need to add an "s" here to be: `cluster_shut_down_scenarios`
#### Kubernetes/OpenShift cluster shut down scenario
Scenario to shut down all the nodes, including the masters, and restart them after a specified duration. The cluster shut down scenario can be injected by placing the shut_down config file under the cluster_shut_down_scenario option in the kraken config. Refer to the [cluster_shut_down_scenario](https://github.com/openshift-scale/kraken/blob/master/scenarios/cluster_shut_down_scenario.yml) config file.

```
cluster_shut_down_scenario:                              # Scenario to stop all the nodes for specified duration and restart the nodes
    runs: 1                                              # Number of times to execute the cluster_shut_down scenario
    shut_down_duration: 120                              # Duration in seconds to shut down the cluster
    cloud_type: aws                                      # Cloud type on which Kubernetes/OpenShift runs
```
```diff
@@ -12,7 +12,7 @@
 import kraken.invoke.command as runcommand
 import kraken.litmus.common_litmus as common_litmus
 import kraken.node_actions.common_node_functions as nodeaction
-from kraken.node_actions.aws_node_scenarios import aws_node_scenarios
+from kraken.node_actions.aws_node_scenarios import AWS, aws_node_scenarios
 from kraken.node_actions.general_cloud_node_scenarios import general_node_scenarios
 from kraken.node_actions.gcp_node_scenarios import gcp_node_scenarios
 import kraken.time_actions.common_time_functions as time_actions
```
```diff
@@ -277,6 +277,57 @@ def litmus_scenarios(scenarios_list, config, litmus_namespaces, litmus_uninstall
     return litmus_namespaces


+# Inject the cluster shut down scenario
+def cluster_shut_down(shut_down_config, config):
+    runs = shut_down_config["runs"]
+    shut_down_duration = shut_down_config["shut_down_duration"]
+    cloud_type = shut_down_config["cloud_type"]
+    if cloud_type == "aws":
```
**Review comment:** Can we add in the other cloud types that have been added?
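One way to address the comment above is a small dispatch table instead of the `if cloud_type == "aws":` branch. A minimal sketch, assuming each provider exposes a wrapper class with the same stop/start interface as `AWS`; the `registry` argument and the `NotImplementedError` behavior are illustrative, not kraken's actual API:

```python
def get_cloud_object(cloud_type, registry):
    """Instantiate the cloud wrapper registered for cloud_type.

    registry maps cloud_type strings (e.g. "aws") to wrapper classes;
    entries for gcp/azure would be hypothetical until matching
    stop/start wrappers exist in kraken.
    """
    try:
        return registry[cloud_type]()
    except KeyError:
        raise NotImplementedError(
            "cluster_shut_down does not support cloud_type %r" % cloud_type)
```

With this shape, supporting a new provider is a one-line registry entry rather than another `elif` branch.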
```diff
+        cloud_object = AWS()
+
+    nodes = set(kubecli.list_nodes())
+    node_id = {}
+    for node in nodes:
+        node_id[node] = cloud_object.get_instance_id(node)
+
+    for _ in range(runs):
+        logging.info("Starting cluster_shut_down scenario injection")
+        for node in nodes:
+            cloud_object.stop_instances(node_id[node])
+        logging.info("Waiting for 250s to shut down all the nodes")
+        time.sleep(250)
```
**Review comment:** Are we able to have the user set this in their config, or use 250 as a default? Is there a specific reason we chose 250 seconds here?

**Author reply:** There's no specific reason; it took around 2 minutes on a 10 node cluster, so I chose 250 seconds to accommodate clusters of bigger sizes. The start_instance function can be called on a node only when it is in the stopped state, else it throws an error. However, I have added a try/except condition such that we sleep for 10 additional seconds when a node isn't in the stopped state even after 250 seconds.
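The retry behavior described in that reply can be factored into a small helper. A hedged sketch, where `start_fn` stands in for the PR's `cloud_object.start_instances` and the delay/attempt defaults are illustrative, not values from the PR:

```python
import time

def start_with_retry(start_fn, instance_id, retry_delay=10, max_attempts=30):
    """Call start_fn(instance_id) until the start request is accepted.

    AWS rejects a start request while an instance is still stopping, so
    we retry with a short sleep, mirroring the try/except loop in this
    PR. Returns the number of attempts used; raises after max_attempts.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            start_fn(instance_id)
            return attempt
        except Exception:
            time.sleep(retry_delay)
    raise RuntimeError(
        "instance %s did not reach a startable state" % instance_id)
```

Bounding the attempts (unlike the PR's open-ended `while` loop) also guards against spinning forever on an instance that never stops cleanly.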
```diff
+        logging.info("Shutting down the cluster for the specified duration: %s"
+                     % (shut_down_duration))
+        time.sleep(shut_down_duration)
+        logging.info("Restarting the nodes")
+        restarted_nodes = set()
+        stopped_nodes = nodes
+        while restarted_nodes != nodes:
+            for node in stopped_nodes:
+                try:
+                    cloud_object.start_instances(node_id[node])
+                    restarted_nodes.add(node)
+                except Exception:
+                    time.sleep(10)
+                    continue
+            stopped_nodes = nodes - restarted_nodes
+        logging.info("Waiting for 250s to allow cluster component initialization")
+        time.sleep(250)
+        logging.info("Successfully injected cluster_shut_down scenario!")
```
**Review comment:** Are we able to add in a verification here that the nodes are all back up and ready? Is that too much for kraken, such that it can just be handled in cerberus? Thoughts?

**Author reply:** I think this part would be handled by cerberus. With cerberus integration, when we receive a true, kraken proceeds with the next scenario, indicating all the nodes are ready; with a false, we terminate kraken, indicating some components aren't healthy. But it can be explicitly specified after this line when cerberus integration is enabled, if needed. Thoughts?
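If an explicit check is wanted in addition to cerberus, a readiness poll can stay small. A minimal sketch under assumptions: `fetch_conditions` would be backed by something like `CoreV1Api().list_node()` from the official kubernetes Python client, and the function names and timeouts here are illustrative, not kraken's API:

```python
import time

def all_ready(node_conditions):
    """True when every node reports a Ready condition with status "True".

    node_conditions: {node_name: {condition_type: status}}, the shape you
    would extract from each node's status.conditions.
    """
    return bool(node_conditions) and all(
        conds.get("Ready") == "True" for conds in node_conditions.values())

def wait_until_ready(fetch_conditions, timeout=600, interval=10):
    """Poll fetch_conditions() until every node is Ready or timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if all_ready(fetch_conditions()):
            return True
        time.sleep(interval)
    return False
```

Returning a bool (rather than raising) would let the caller decide whether an unhealthy cluster aborts kraken or just logs a warning, matching the cerberus true/false convention described above.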
```diff
+        cerberus_integration(config)
+        logging.info("")
+
+
+def cluster_shut_down_scenarios(scenarios_list, config):
+    for shut_down_config in scenarios_list:
+        with open(shut_down_config, 'r') as f:
+            shut_down_config = yaml.full_load(f)
+        shut_down_config = shut_down_config["cluster_shut_down_scenario"]
+        cluster_shut_down(shut_down_config, config)
+        logging.info("Waiting for the specified duration: %s" % (wait_duration))
+        time.sleep(wait_duration)
+
+
 # Main function
 def main(cfg):
     # Start kraken

@@ -329,6 +380,7 @@ def main(cfg):
     failed_post_scenarios = []
     litmus_namespaces = []
     litmus_installed = False
+
     # Loop to run the chaos starts here
     while (int(iteration) < iterations):
         # Inject chaos scenarios specified in the config

@@ -350,6 +402,7 @@ def main(cfg):
             # Inject time skew chaos scenarios specified in the config
             elif scenario_type == "time_scenarios":
                 time_scenarios(scenarios_list, config)
+
             elif scenario_type == "litmus_scenarios":
                 if not litmus_installed:
                     common_litmus.install_litmus(litmus_version)

@@ -359,8 +412,13 @@ def main(cfg):
                                                   litmus_namespaces,
                                                   litmus_uninstall)

+            # Inject cluster shut down scenario specified in the config
+            elif scenario_type == "cluster_shut_down_scenarios":
+                cluster_shut_down_scenarios(scenarios_list, config)
+
             iteration += 1
             logging.info("")

     if litmus_uninstall and litmus_installed:
         for namespace in litmus_namespaces:
             common_litmus.delete_chaos(namespace)
```
New file `scenarios/cluster_shut_down_scenario.yml`:

```
cluster_shut_down_scenario:                              # Scenario to stop all the nodes for specified duration and restart the nodes
    runs: 1                                              # Number of times to execute the cluster_shut_down scenario
    shut_down_duration: 120                              # Duration in seconds to shut down the cluster
    cloud_type: aws                                      # Cloud type on which Kubernetes/OpenShift runs
```
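For reference, this file parses the way `cluster_shut_down_scenarios` in this PR reads it, with PyYAML; `safe_load` is used here in place of the PR's `full_load`, which is equivalent for a plain-scalar file like this one:

```python
import yaml

# Parse the scenario file and pull out the cluster_shut_down_scenario
# block, mirroring cluster_shut_down_scenarios in run_kraken.py.
doc = yaml.safe_load("""
cluster_shut_down_scenario:
    runs: 1
    shut_down_duration: 120
    cloud_type: aws
""")
shut_down_config = doc["cluster_shut_down_scenario"]
# shut_down_config is now a plain dict with int durations parsed by YAML.
```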
**Review comment:** Is this an open slack channel that anyone can get on?