FMEA Testing #899

Closed
alexec opened this issue Oct 1, 2020 · 6 comments
Labels
enhancement New feature or request


alexec commented Oct 1, 2020

Perform FMEA testing similar to Argo Workflows:

argoproj/argo-workflows#3751

Expect to need to make changes as a result.

@alexec alexec added the enhancement New feature or request label Oct 1, 2020

alexec commented Nov 5, 2020

@whynowy, as a Q2 deliverable, we need to think about how to progress this. Should I set up a meeting?


whynowy commented Nov 9, 2020

Understand the behaviour under the following disruptions:

Environment Disruptions

| Disruption | Behavior | Recovery |
| --- | --- | --- |
| Kubernetes API network issues | Errors are printed in the controller logs. The EventBus runs well. The EventSource Pod prints errors when it is a resource type and works well for other types. The Sensor Pod prints errors when it has a workflow or k8s type trigger and runs well for other trigger types. | Full recovery |
| Controller Pod deleted | A new controller Pod is created | Full recovery |
| EventSource Pod deleted | A new EventSource Pod is created | Full recovery |
| Sensor Pod deleted | A new Sensor Pod is created | Full recovery |
| Kubernetes node deleted - EventSource | A new EventSource Pod is created | Full recovery |
| Kubernetes node deleted - Sensor | A new Sensor Pod is created | Full recovery |
| Kubernetes node deleted - EventBus | A new EventBus Pod is created | Full recovery |
| Node unavailable | Waits until the node is available | Full recovery |
| Not enough resources to schedule Pods | Waits until resources are available | Full recovery |
| EventSource Pod offline | CalendarEventSource: if catchup is enabled, no messages are lost; otherwise, past messages are lost.<br>SQSEventSource: no messages are lost.<br>ResourceEventSource: messages are lost (#985). | - |
| Sensor Pod offline | If the offline time is no longer than the EventBus configured message expiration time (defaults to 72h), no messages are lost. Otherwise, messages older than the expiration time are lost. The user can control the EventBus message expiration time (#901). | - |
| EventBus offline | Use anti-affinity to schedule the Pods on different nodes, and use volumes, to mitigate the issue. As long as no more than 1 Pod goes offline, there is no impact.<br><br>CalendarEventSource: if catchup is enabled, no messages are lost; otherwise, past messages are lost.<br>SQSEventSource: no messages are lost.<br>ResourceEventSource: no messages are lost if the EventSource Pod keeps running. | - |
| Sensor fails to trigger an action due to an external environment change (e.g. connection issue) | There is no retry (#984) | - |
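
To make the Sensor-offline row concrete, here is a minimal sketch of the retention rule it describes: an event survives a Sensor outage only if it is no older than the EventBus expiration time when the Sensor comes back. This is a simplified model, not Argo Events code; the names `surviving_events`, `lost_events`, and `DEFAULT_MAX_AGE` are hypothetical, and only the 72h default comes from the table.

```python
from datetime import datetime, timedelta

# Default EventBus message expiration time stated in the table (72h).
DEFAULT_MAX_AGE = timedelta(hours=72)


def surviving_events(published_at, sensor_back_at, max_age=DEFAULT_MAX_AGE):
    """Return publish times of events still retained when the Sensor returns.

    Simplified model (an assumption): an event is retained if and only if
    its age at `sensor_back_at` does not exceed `max_age`.
    """
    cutoff = sensor_back_at - max_age
    return [t for t in published_at if t >= cutoff]


def lost_events(published_at, sensor_back_at, max_age=DEFAULT_MAX_AGE):
    """Return publish times of events that expired before the Sensor returned."""
    cutoff = sensor_back_at - max_age
    return [t for t in published_at if t < cutoff]


if __name__ == "__main__":
    start = datetime(2020, 11, 9)
    published = [start, start + timedelta(hours=50), start + timedelta(hours=90)]
    back = start + timedelta(hours=100)  # Sensor offline for 100h in total
    print("survived:", surviving_events(published, back))
    print("lost:", lost_events(published, back))
```

Under this model, a longer expiration time widens the outage window the Sensor can tolerate, at the cost of the EventBus retaining more messages.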

User Disruptions

| Disruption | Behavior | Recovery |
| --- | --- | --- |
| EventSource object deleted | No new events | Manual recovery |
| Sensor object deleted | No action is triggered. When the Sensor object is recreated, it automatically picks up the missed actions. | Manual recovery |
| EventBus object delete/recreate | The controller does not delete the EventBus if it has an EventSource or Sensor connected (#1066) | - |
| EventBus object updated with different auth strategy | EventBus Pods remain unchanged; the NATS configmap changes. Killing the EventBus Pods to reload the config led to many errors in the EventSource and Sensor Pods. This update is not allowed if events-webhook is installed (#986). | |
| EventBus object updated with different spec other than auth | Could disallow these updates, as with auth strategy (#986) | |
| Malformed EventSource spec | Validation errors in the eventsource-controller log, or in the terminal if events-webhook is installed | Manual recovery |
| Malformed Sensor spec | Validation errors in the sensor-controller log, or in the terminal if events-webhook is installed | Manual recovery |
| Malformed EventBus spec | Validation errors in the eventbus-controller log, or in the terminal if events-webhook is installed | Manual recovery |
| EventSource external dependencies change (e.g. SQS config change) | | |


alexec commented Nov 9, 2020

Looks good so far. Can you set up a meeting in the next two weeks or so with me and @VaibhavPage?

@whynowy whynowy added this to the v1.2 milestone Nov 10, 2020

alexec commented Nov 24, 2020

Could you please list more disruptions:

  • What happens to a message from different event sources (especially calendar) when the event source pod is offline?
  • What happens to a message from the event sources when the event bus is offline?
  • What happens to a message from a trigger to an external system when we cannot connect to that system?

Start with just the top 2 or 3?
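
For the calendar case in the first question, the sketch below illustrates what catch-up means for a fixed-interval schedule: ticks that fell inside the offline window are replayed on restart when catch-up is on, and dropped when it is off. It is only an illustration under stated assumptions, not the Argo Events implementation; `missed_ticks`, `events_on_restart`, and the fixed-interval schedule are all hypothetical.

```python
from datetime import datetime, timedelta


def missed_ticks(last_fired, now, interval):
    """List the schedule ticks that fell in (last_fired, now]."""
    ticks = []
    t = last_fired + interval
    while t <= now:
        ticks.append(t)
        t += interval
    return ticks


def events_on_restart(last_fired, now, interval, catchup):
    """Return the past ticks emitted when the event source pod restarts.

    With catch-up, every missed tick is replayed; without it, the past
    ticks are dropped and only future ticks will fire.
    """
    return missed_ticks(last_fired, now, interval) if catchup else []


if __name__ == "__main__":
    last = datetime(2020, 11, 24, 10, 0)
    now = datetime(2020, 11, 24, 10, 35)
    every = timedelta(minutes=10)
    print(events_on_restart(last, now, every, catchup=True))   # three missed ticks
    print(events_on_restart(last, now, every, catchup=False))  # nothing replayed
```

Replay needs the last fired time to be stored somewhere that survives the pod restart; this sketch takes it as an input.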

@alexec alexec modified the milestones: v1.2, v1.3 Jan 15, 2021

alexec commented Feb 18, 2021


whynowy commented Mar 31, 2021

Closing this, as all the related issues have been addressed.

@whynowy whynowy closed this as completed Mar 31, 2021