[BUG] Fix undeploy router error #194
krithika369
left a comment
Thanks for this fix! I agree with the overall approach but left some suggestions on the implementation. Please feel free to merge when they have been reviewed / addressed.
Thanks for the comments! I've refactored the changes with the |
Context
This PR addresses a bug that occurs when, after a failed router deployment attempt, Turing API cleans up any k8s resources created in the process using the deployment service method UndeployRouterVersion. Instead of cleaning up only the resources created during the deployment process, UndeployRouterVersion attempts to clean up all resources associated with a generic router deployment (as if the router in question had been created successfully).
In other words, should the router deployment fail during the secret deployment stage, which is the first k8s resource to be created in DeployRouterVersion, Turing API will attempt to remove not only the secrets but also other components such as ensemblers, enrichers, routers, fluentd services and experiment engine plugin servers using UndeployRouterVersion, which assumes that these components have already been created successfully.
Unsurprisingly, any attempt to delete these nonexistent components results in an error (since the components cannot be found), making it seem as if the undeployment process has also failed.
Fix
To address the bug, an isCleanUp flag has been added to the UndeployRouterVersion signature. It is set to true when the method is called as part of a cleanup process (i.e. remove as many components as can be found), and false when the method is called as part of a regular undeployment process (i.e. remove all components normally associated with a working Turing Router deployment).
This flag is passed to the individual component removal helper functions such as deleteSecret, deleteK8sService, deletePVC and deleteKnServices. These helper functions then call controller methods to perform the corresponding actions with an additional ignoreNotFound flag. Essentially, in a cleanup operation (isCleanUp is true), any attempt to delete a resource that does not exist returns nil instead of an error, provided the ignoreNotFound flag is true.
However, this change also necessitated some minor refactoring of the existing helper functions (and their dependent functions), as the resource existence check now has to be performed at each deletion step (each resource creation step is a potential point of failure in the router deployment). In particular, DeleteKubernetesService of the cluster controller had to be separated into two methods, DeleteKubernetesService (as before) and DeleteKubernetesDeployment, each of which deletes only the k8s service or the k8s deployment resource, respectively.
Unrelated Minor Bug Fix
PR #192 introduced a method that verifies, when a router is set up using the SDK, that the specified default route actually exists. This method throws an error when the specified default route is not found among the routes defined for the router. The SDK sample script has been modified accordingly to specify a default route that exists in the router configuration. In addition, the images of the Enricher and Ensembler now have docker.io/ prefixed to the existing image name so that the DockerRegistryPopover of the UI renders correctly.
Modifications
api/turing/api/deployment_controller.go - Addition of isCleanUp flag argument to UndeployRouterVersion calls
api/turing/cluster/controller.go - Addition of helper methods to determine whether a k8s resource exists and separation of DeleteKubernetesService into DeleteKubernetesService and DeleteKubernetesDeployment
api/turing/service/router_deployment_service.go - Addition of logical checks for each helper function that deletes a k8s component
sdk/samples/router/general.py - Minor bug fixes to the sample script