Latest capa-controller is not working properly with our cluster-eks and spawns an unbounded number of VPCs due to an error #3048

Closed
AndiDog opened this issue Dec 14, 2023 · 5 comments
Labels
area/kaas Mission: Cloud Native Platform - Self-driving Kubernetes as a Service kind/bug provider/cluster-api-aws Cluster API based running on AWS team/phoenix Team Phoenix topic/capi

Comments


AndiDog commented Dec 14, 2023

During network creation, the controller fails, which causes the process to restart and create a VPC again.

Probably related:

E1213 13:17:04.251075       1 controller.go:324] "Reconciler error" err="failed to reconcile network for AWSManagedControlPlane org-giantswarm/vac0eks: failed to patch conditions: AWSManagedControlPlane.controlplane.cluster.x-k8s.io \"vac0eks\" is invalid: spec.network.subnets[7]: Duplicate value: map[string]interface {}{\"id\":\"\"}" controller="awsmanagedcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="AWSManagedControlPlane" AWSManagedControlPlane="org-giantswarm/vac0eks" namespace="org-giantswarm" name="vac0eks" reconcileID="12841c2f-7435-4f3b-9106-782a05e97702"

AndiDog commented Dec 14, 2023

The mess on the test AWS account is cleaned up. I don't have an effort estimate yet for CAPA, given how its AWSManagedControlPlane reconciler has zero unit tests covering the Reconcile or reconcileNormal functions and testability (e.g. dependency injection for mockability) must first be added.


AndiDog commented Jan 4, 2024

The bug is very basic: CAPA creates a VPC but fails to store it for whatever reason. On the next reconciliation, CAPA pretends it doesn't know anything about the VPC (which it really doesn't without making AWS requests) and happily creates a new one. In our case of repeated errors, this happens again and again. I implemented a basic unit test for EKS, as well as VPC creation idempotence (which applies to EC2- and EKS-based clusters alike). Upstream PR coming up soon.
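
To make the failure mode above concrete, here is a minimal sketch in Python. It is not CAPA's code and not the upstream fix: `FakeCloud`, `OWNER_TAG`, `failing_save`, and both `reconcile_*` functions are hypothetical names, and looking up an existing VPC by an ownership tag is an assumption about how idempotence could be achieved.

```python
import itertools
from typing import Callable, Dict, Optional

OWNER_TAG = "example.invalid/cluster"  # hypothetical tag key, not CAPA's real one


class FakeCloud:
    """In-memory stand-in for the cloud API (illustration only)."""

    def __init__(self) -> None:
        self._next_id = itertools.count(1)
        self.vpcs: Dict[str, Dict[str, str]] = {}

    def create_vpc(self, cluster: str) -> str:
        vpc_id = f"vpc-{next(self._next_id)}"
        self.vpcs[vpc_id] = {OWNER_TAG: cluster}
        return vpc_id

    def find_vpc(self, cluster: str) -> Optional[str]:
        for vpc_id, tags in self.vpcs.items():
            if tags.get(OWNER_TAG) == cluster:
                return vpc_id
        return None


def failing_save(stored: Dict[str, str], vpc_id: str) -> None:
    # Models the repeated storage error: the VPC ID never gets persisted.
    raise RuntimeError("failed to patch conditions")


def reconcile_naive(cloud: FakeCloud, stored: Dict[str, str],
                    cluster: str, save: Callable) -> str:
    # Trusts only the stored state, so a failed save means "no VPC exists".
    if stored.get("vpc_id"):
        return stored["vpc_id"]
    vpc_id = cloud.create_vpc(cluster)
    save(stored, vpc_id)
    return vpc_id


def reconcile_idempotent(cloud: FakeCloud, stored: Dict[str, str],
                         cluster: str, save: Callable) -> str:
    # Asks the cloud for an already-created VPC before creating a new one.
    vpc_id = stored.get("vpc_id") or cloud.find_vpc(cluster)
    if vpc_id is None:
        vpc_id = cloud.create_vpc(cluster)
    save(stored, vpc_id)
    return vpc_id


def run(reconcile: Callable, attempts: int = 3) -> int:
    """Run several failing reconciliations and count the VPCs left behind."""
    cloud, stored = FakeCloud(), {}
    for _ in range(attempts):
        try:
            reconcile(cloud, stored, "vac0eks", failing_save)
        except RuntimeError:
            pass  # the controller would requeue and try again
    return len(cloud.vpcs)
```

With three failing attempts, `run(reconcile_naive)` leaves three VPCs behind, while `run(reconcile_idempotent)` leaves one, at the cost of an extra lookup request per reconciliation.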


AndiDog commented Jan 5, 2024

Unfortunately, my pending PR kubernetes-sigs/cluster-api-provider-aws#4637 blocks opening the follow-up which adds the EKS unit test.

However, I managed to extract the fix for the blatant bug into a small, separate PR, kubernetes-sigs/cluster-api-provider-aws#4723, so we can go on nevertheless and fix the terrifying issue.


AndiDog commented Jan 10, 2024

Image pull errors fixed via giantswarm/cluster-api-provider-aws-app#211, so now this should be done (except for developer-reserved MCs where Flux is paused, currently golem).

AndiDog closed this as completed Jan 10, 2024

AndiDog commented Jan 11, 2024

The storage error will be solved via #2870.
