Target group handling rewrite #457
```go
if err != nil {
    return err
}
```
Could just using one outer `for _, resTargetGroup := range resTargetGroups {` loop make this clearer and more concise?
For example:
```go
for _, resTargetGroup := range resTargetGroups {
    prefix := model.TgNamePrefix(resTargetGroup.Spec)
    if bool(performDeletes) && resTargetGroup.IsDeleted {
        err := t.targetGroupManager.Delete(ctx, resTargetGroup)
        if err != nil {
            t.log.Infof("Failed TargetGroupManager.Delete %s due to %s", prefix, err)
            returnErr = true
        }
    }
    if bool(performUpserts) && !resTargetGroup.IsDeleted {
        tgStatus, err := t.targetGroupManager.Upsert(ctx, resTargetGroup)
        if err == nil {
            resTargetGroup.Status = &tgStatus
        } else {
            t.log.Debugf("Failed TargetGroupManager.Upsert %s due to %s", prefix, err)
            returnErr = true
        }
    }
}
```
I had it this way before, but I agree it's not optimal. I think I may have a better option and will just separate them entirely.
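For illustration, a minimal sketch of what fully separating the two passes could look like. The function names, the trivial in-memory manager, and the error handling here are hypothetical, not the controller's actual code; each pass now joins all of its failures instead of keeping only a boolean flag:

```go
package main

import (
	"errors"
	"fmt"
)

// Minimal stand-ins for the real model and manager types (hypothetical).
type TargetGroup struct {
	Name      string
	IsDeleted bool
}

type tgManager struct{ failOn string }

func (m *tgManager) Delete(tg *TargetGroup) error {
	if tg.Name == m.failOn {
		return errors.New("delete failed")
	}
	return nil
}

func (m *tgManager) Upsert(tg *TargetGroup) error { return nil }

// synthesizeDeletes handles only deletions.
func synthesizeDeletes(m *tgManager, tgs []*TargetGroup) error {
	var errs error
	for _, tg := range tgs {
		if tg.IsDeleted {
			if err := m.Delete(tg); err != nil {
				errs = errors.Join(errs, fmt.Errorf("delete %s: %w", tg.Name, err))
			}
		}
	}
	return errs
}

// synthesizeUpserts handles only creates/updates.
func synthesizeUpserts(m *tgManager, tgs []*TargetGroup) error {
	var errs error
	for _, tg := range tgs {
		if !tg.IsDeleted {
			if err := m.Upsert(tg); err != nil {
				errs = errors.Join(errs, fmt.Errorf("upsert %s: %w", tg.Name, err))
			}
		}
	}
	return errs
}

func main() {
	tgs := []*TargetGroup{{Name: "a", IsDeleted: true}, {Name: "b"}}
	m := &tgManager{}
	fmt.Println(synthesizeDeletes(m, tgs), synthesizeUpserts(m, tgs))
}
```

Each pass is now independently testable, at the cost of iterating the resource list twice.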
pkg/model/lattice/targetgroup.go
Outdated
```go
}

type TargetGroupTagFields struct {
    EKSClusterName   string        `json:"eksclustername"`
    K8SParentRefType ParentRefType `json:"k8sparentreftype"`
```
It seems the values of K8SParentRefType can be ServiceExport, HTTPRoute, or GRPCRoute, which is not related to the Gateway API Route's parentRefs at all.
Do we need to change to another tag name here to avoid confusion (e.g., K8SResourceRefType)?
Great point. Will update the name to be more appropriate.
pkg/model/lattice/targetgroup.go
Outdated
```go
    TargetGroupTagFields
}

type TargetGroupTagFields struct {
    EKSClusterName string `json:"eksclustername"`
```
It could be a more general tag name: K8sClusterName (or ClusterName)? Users can install the controller in a self-managed k8s cluster on AWS (for example, one built with kops).
BTW, we already had the DefaultTags:
```go
func getManagedByTag(cfg CloudConfig) string {
    return fmt.Sprintf("%s/%s/%s", cfg.AccountId, cfg.ClusterName, cfg.VpcId)
}
```
Do you still need a new EKSClusterName tag?
Agree on removing "EKS" from the name. I'll have to think about whether there are any downsides to using the managed-by tag.
pkg/model/lattice/targetgroup.go
Outdated
```go
    return t.K8SParentRefType == ParentRefTypeSvcExport
}

func (t *TargetGroupTagFields) IsRoute() bool {
```
IsRefByRoute()?
Will change this based on the updated tag name.
```go
resp, err := vpcLatticeSess.ListTargetGroupsAsList(ctx, &targetGroupListInput)
listInput := vpclattice.ListTargetGroupsInput{}
resp, err := s.cloud.Lattice().ListTargetGroupsAsList(ctx, &listInput)
if err != nil {
```
Nit:
It seems we have 2 places that invoke findTargetGroup():
- (s *defaultTargetGroupManager) Upsert()
- (s *defaultTargetGroupManager) Delete()
I assume these 2 places only manage the "local" cluster's target groups (i.e., the intended-to-create/intended-to-delete TG's VpcId == config.VpcID)?
In that case, can we add a VpcIdentifier filtering condition when listing TGs, to improve performance?
```go
listInput := vpclattice.ListTargetGroupsInput{
    VpcIdentifier: &config.VpcID,
}
resp, err := s.cloud.Lattice().ListTargetGroupsAsList(ctx, &listInput)
```
(We could consider doing it in a separate PR.)
I'll double-check, but this should work. I didn't realize we had Vpc as an input to the list call, thanks for calling this out!
Having a closer look, it's possible that in the service export case the VPC will differ from what's in the controller's config. Using the VPC id of the target group we're looking for should work, though.
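A small sketch of that idea, reduced to the decision being discussed (the helper name is hypothetical, not the controller's actual code): when looking up a target group, filter the List call by the VPC recorded on the model target group when one is present, and fall back to the controller's configured VPC otherwise.

```go
package main

import "fmt"

// chooseLookupVpc picks the VPC id to set as VpcIdentifier on the
// ListTargetGroups call. Service-export target groups can live in a VPC
// other than the one in the controller config, so the VPC recorded on
// the model target group wins when present. (Hypothetical helper.)
func chooseLookupVpc(modelTgVpc, configVpc string) string {
	if modelTgVpc != "" {
		return modelTgVpc
	}
	return configVpc
}

func main() {
	fmt.Println(chooseLookupVpc("vpc-export", "vpc-local")) // the TG's own VPC wins
	fmt.Println(chooseLookupVpc("", "vpc-local"))           // fall back to config
}
```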
mikhail-aws
left a comment
I don't see blockers for merge; there are many good changes and the code is getting into better shape. It will be easier to improve it further after this PR. I see there are behavioral changes, but I can't tell what the impact is just by looking at the code.
If it were a number of smaller PRs I would have left more comments; this one is just too big to talk details.
So, a high-level thought not related to your PR in particular: I don't see a reason for the generic Stack to exist. Reading the code and seeing how we use it, it makes things harder to comprehend and trace. It seems like an unnecessary abstraction.
```go
// today, target registration is handled through the route controller
// if that proves not to be fast enough, we can look to do quicker
// target registration with the service controller
```
Looks like this controller doesn't do anything now; do we need it?
I actually removed it then put it back. I do think we'll eventually want something responding just to service changes to do target group updates, though I don't know when we would add that logic. Previously, this controller was handling certain cases the route controller logic was missing, but now it's all covered in the route controller.
Still curious why a k8s Service is not reconciled into a lattice service, and k8s Routes into target groups. Why is everything in Routes?
this controller will have to handle removing old finalizers I think.
The finalizers issue is a great point. Without this, services could never be deleted without manually removing the finalizers after an upgrade.
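The cleanup itself is simple slice surgery. A minimal sketch, assuming a hypothetical finalizer name (a real controller would do this through the k8s client and then update the object):

```go
package main

import "fmt"

// removeFinalizer returns finalizers with the named finalizer removed,
// and whether anything changed. This mirrors what the service controller
// would need to do to strip stale finalizers left by a previous version.
// (The finalizer string below is illustrative.)
func removeFinalizer(finalizers []string, name string) ([]string, bool) {
	out := finalizers[:0:0] // fresh slice backed by new storage
	removed := false
	for _, f := range finalizers {
		if f == name {
			removed = true
			continue
		}
		out = append(out, f)
	}
	return out, removed
}

func main() {
	fs := []string{"lattice.aws/old-tg-finalizer", "other"}
	fs, changed := removeFinalizer(fs, "lattice.aws/old-tg-finalizer")
	fmt.Println(fs, changed)
}
```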
pkg/model/core/stack.go
Outdated
```go
// Get a resource by its id and type
GetResource(id string, resType Resource) (Resource, error)
```
I think we should remove the generic stack and replace it with exact models of lattice resources. Our dependency relationship is strict and one-way; we don't need to model it as an arbitrary graph. Maybe in some distant future we will manage something other than lattice, and we can revisit it then.
Also, the philosophy of controllers is the same as APIs: do a job on a single resource type. If we build a complex graph for a single controller, it might be a good idea to split it.
I agree it would probably simplify things to simply remove the modeling step entirely. Something else I noticed was the code has to be more complicated because it handles both delete and upsert scenarios. Eliminating the model may allow us to further simplify the delete logic as well.
Makes sense to remove the generic stack; created an issue for it: #463
```go
}

func (r *ruleSynthesizer) adjustPriorities(ctx context.Context, snlStackRules map[snlKey]ruleIdMap, resRule []*model.Rule) error {
    var lastUpdateErr error
```
Rather than reporting only the last error, you can collect all of them with about the same amount of code using errors.Join. It looks like this:
```go
var updateErr error
for ... {
    err := doUpdate()
    if err != nil {
        updateErr = errors.Join(updateErr, err)
    }
}
if updateErr != nil { ... }
```
nice!
```go
resListener, err := r.stack.GetResource(rule.Spec.StackListenerId, &model.Listener{})
if err != nil {
    return nil, nil, err
}

listener, ok := resListener.(*model.Listener)
if !ok {
    return nil, nil, errors.New("unexpected type conversion failure for listener stack object")
}
```
Can we enforce the type check inside GetResource, so we don't need to do a type assertion later?
I think there must be a more idiomatic way to do this - I'll sync up with you offline to get your input.
You can pass a reference and let the stack populate the contents of the struct, the same as the k8s client does in the first lines of Reconcile functions.
```go
resListener := &model.Listener{}
err := r.stack.GetResource(id, resListener)
if err != nil { ... }
// use listener
resListener.Foo
```
OK, after much hacking I think I have something nicer here.
Least invasive, though a bit ugly. It works on top of the new GetResource method, but changes that method's signature to take the type explicitly:
```go
func GetStackResource[T any](s stack, id string) (T, error) {
    var zero T // need zero in case T is a struct; that creates an empty struct, for a pointer type it will be nil
    r, err := s.GetResource(id, reflect.TypeOf(zero)) // change method signature to GetResource(id string, resType reflect.Type)
    if err != nil {
        return zero, err
    }
    t, ok := r.(T)
    if !ok {
        return zero, errors.New("...")
    }
    return t, nil
}
```
```go
// if IpAddressTypeIpv4 is not set, then default to nil
if targetGroup.Spec.Config.IpAddressType == "" {
    ipAddressType = nil

func (s *defaultTargetGroupManager) create(ctx context.Context, modelTg *model.TargetGroup, err error) (model.TargetGroupStatus, error) {
```
Why do you need err as an argument?
This may have been a Goland refactoring artifact. I'll have a look and thanks for picking it up.
```go
err := t.stack.ListResources(&resTargetGroups)
if err != nil {
    return err

func (t *TargetGroupSynthesizer) isControllerManaged(latticeTg tgListOutput) (model.TargetGroupTagFields, bool) {
```
A bit of an over-complicated function. I would make it return bool, and then copy tagFields separately:
```go
if !isControllerManaged(latticeTg) {
    return ...
}
tagFields := model.TGTagFieldsFromTags(latticeTg.targetGroupTags.Tags)
```
I've separated and simplified this logic
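In miniature, the separated shape looks something like this. The tag key, the tagFields type, and the check itself are stand-ins, not the controller's actual definitions; the point is that the predicate and the copy are now independent steps:

```go
package main

import "fmt"

const managedByKey = "ManagedBy" // stand-in for the real tag key

// isControllerManaged now answers only the yes/no question.
func isControllerManaged(tags map[string]string, managedBy string) bool {
	return tags[managedByKey] == managedBy
}

// tagFields mirrors TargetGroupTagFields in spirit only.
type tagFields struct {
	RouteName string
}

// tagFieldsFromTags copies the interesting tags separately,
// analogous to model.TGTagFieldsFromTags.
func tagFieldsFromTags(tags map[string]string) tagFields {
	return tagFields{RouteName: tags["K8SRouteName"]}
}

func main() {
	tags := map[string]string{managedByKey: "acct/cluster/vpc", "K8SRouteName": "web"}
	if isControllerManaged(tags, "acct/cluster/vpc") {
		fmt.Println(tagFieldsFromTags(tags).RouteName)
	}
}
```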
```go
        s.log.Errorf("Deregistering targets for target group %s failed due to %s", tg.ID, err)
    }
}
s.deregisterStaleTargets(ctx, modelTargets, modelTg, listTargetsOutput)
```
I think it should return errors, not log them. Probably ignore NotFound errors.
I'll have another look at this. I believe the logic previously was to ignore these, but in the spirit of ensuring all the desired changes are made, we should consider erroring instead.
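One way to express "return errors, but tolerate NotFound" is with a sentinel error, sketched below with stand-in functions (the real code would match the SDK's NotFound error type instead of a local sentinel):

```go
package main

import (
	"errors"
	"fmt"
)

var errNotFound = errors.New("target not found")

// deregister is a stand-in for the real SDK call: "gone" simulates a
// NotFound response, "bad" simulates a genuine failure.
func deregister(target string) error {
	switch target {
	case "gone":
		return fmt.Errorf("deregister %s: %w", target, errNotFound)
	case "bad":
		return errors.New("deregister bad: access denied")
	}
	return nil
}

// deregisterAll returns real failures, but treats NotFound as success:
// the target being absent is the desired end state anyway.
func deregisterAll(targets []string) error {
	var errs error
	for _, t := range targets {
		if err := deregister(t); err != nil && !errors.Is(err, errNotFound) {
			errs = errors.Join(errs, err)
		}
	}
	return errs
}

func main() {
	fmt.Println(deregisterAll([]string{"ok", "gone"})) // <nil>
	fmt.Println(deregisterAll([]string{"bad"}))
}
```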
pkg/deploy/stack_deployer.go
Outdated
```go
d.log.Infof("Error during tg synthesis %s", err)
return err
```
Rule of thumb: log or throw, not both. If the error is not informative enough, use errors.Join or fmt.Errorf("%w: new information", err).
Thanks for finding this. I had a few of these in while I was debugging but tried to clean them up - must have missed this one. Will have another look to see if there are any others still lingering about.
```go
    HealthCheckConfig *vpclattice.HealthCheckConfig `json:"healthcheckconfig"`
    TargetGroupTagFields
}

type TargetGroupTagFields struct {
```
Are TagFields Tags, Keys, or Values?
These are the fields that become tags on the target group. Totally open to a better name here.
Sort of a typed placeholder for the tags that will eventually merge with other tags?
Having them in a separate struct is really just a convenience that allows us to operate on them independently of all the other spec fields. None of the other spec fields become tags. This structure/name is internal to the controller code and could be changed later with zero impact.
```go
    return false, nil
}

// one last check - ProtocolVersion is not present on TargetGroupSummary, so we have to do a Get
```
I'm curious whether we could save a call by putting ProtocolVersion in the tags.
I actually think we should just return ProtocolVersion in the API as part of List
> I like how the tests are written and that we have much more test coverage, but it looks like the style across the overall code is quite inconsistent. These are not really critical, but this might be a good opportunity to think about having a good unit testing practice.
```go
) (core.Stack, error) {
    stack := core.NewDefaultStack(core.StackID(k8s.NamespacedName(route.K8sObject())))

    if b.brTgBuilder == nil {
```
Why could b.brTgBuilder possibly be nil? We already initialize it in NewLatticeServiceBuilder().
Thanks for flagging this. I actually wanted to move all this initialization outside and instead take it as a parameter in NewLatticeServiceBuilder(), but it looks like I missed it. Will follow up on this.
```go
TargetGroupID string `json:"latticeServiceID"`
Name          string `json:"name"`
Arn           string `json:"arn"`
Id            string `json:"id"`
```
Nit: In TargetGroupStatus, Arn and Id are duplicated; do we just need the Arn?
We pretty much use Id everywhere, though we sometimes use the Arn in logging. It doesn't really hurt anything to have both, but it does mean we need to set both. I think we can probably defer this until later.
This is 100% true. In general, I spent the minimum amount of time I thought I could get away with while still getting the testing that was needed. In some cases this meant modifying the tests that already existed; in others it meant writing new ones from scratch. I also took the path of least resistance for the tests. For ones that don't have much mocking, I felt an "inputs and outputs" format works better, and sometimes I created subtests when they helped me organize my thoughts without slowing me down. When there is any kind of conditional logic involved, I like the table-based tests less, since the logic for the test starts to get complicated and it's harder to tell what's actually being tested.
I'll see if I can add some more subtests in the longer tests.
Yes. This is one I was able to get away with making small modifications, so this mostly looks like what's on the main branch.
With the changes, we now have nested model builders. So, now, the service builder calls the listener builder which calls the target group and rule builders. Rather than provide test inputs so these all process successfully, I've changed the logic to just accept a separate (mock) builder. This helps separate unit testing and reduces overall complexity of the tests.
Appreciate the feedback here. I think overall these are an improvement over what was there previously, but I'll see if there are any small adjustments I can make to further improve. Please just let me know if there are any other specifics you'd like me to address before the merge. @solmonk ^^
```go
prefix := TgNamePrefix(spec)
randomSuffix := make([]rune, RandomSuffixLength)
for i := range randomSuffix {
    randomSuffix[i] = rune(rand.Intn(26) + 'a')
```
So it looks like the random suffix is made of letters; there's a chance of hitting some bad name here. One mitigation I've heard of before is to not use vowels.
pkg/model/lattice/targetgroup.go
Outdated
```go
K8SServiceNameKey      = "K8SServiceName"
K8SServiceNamespaceKey = "K8SServiceNamespace"
K8SRouteNameKey        = "K8SRouteName"
K8SRouteNamespaceKey   = "K8SRouteNamespace"
```
Looks like you are updating some of the tag keys anyway; have you thought about adding a common prefix to these as well? Something like gateway-api: (I don't know exactly, but a colon seems like a common practice) gives a sense that these tags are managed by the controller.
Yeah, we talked about it briefly in a separate thread. We could use the same prefix as the managedBy tag, so application-networking.k8s.aws. If we think it's better to just make that change now (which it probably is), we can maybe sync up quickly and decide. It would be a quick thing for me to change, I think.
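What the prefixed keys might look like if the managed-by domain is reused. The exact prefix and separator are still under discussion in this thread, so this is illustrative only:

```go
package main

import "fmt"

// tagPrefix reuses the controller's domain, as discussed above;
// the final choice of prefix/separator is not settled.
const tagPrefix = "application-networking.k8s.aws/"

const (
	K8SServiceNameKey      = tagPrefix + "K8SServiceName"
	K8SServiceNamespaceKey = tagPrefix + "K8SServiceNamespace"
	K8SRouteNameKey        = tagPrefix + "K8SRouteName"
	K8SRouteNamespaceKey   = tagPrefix + "K8SRouteNamespace"
)

func main() {
	fmt.Println(K8SRouteNameKey)
}
```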
> I think tests are something we could improve on later (thanks for the explanation), and the decisions on the overall refactoring make a lot of sense to me. TG tags and names are the key user-impacting changes, so I'm more focused on those. I think we are good if this behavior is (quickly) e2e tested and the comments on this part are addressed.
I've made updates based on yesterday's comments. I also tested upgrade/downgrade using the example in the "Getting Started" doc (up to the point where it goes to multicluster).
lgtm
What type of PR is this?
Huge. I'm sorry.
Which issue does this PR fix:
#425
What does this PR do / Why do we need it:
This PR changes how the controller manages target groups, which in turn impacts everything that references target groups, which is a lot.
Previously, target groups and other objects were stored in a local cache/hashmap called the lattice data store. The logic around the data store was replicated in many places and prone to error. This change aims to improve target group handling by changing the following:
As part of this change, I have done a lot of refactoring as well as updating logic around services, listeners, and rules, including logic that removes unused resources. Basically everything that builds or deploys a model relating to these objects has had a major overhaul. My aim was consistency across the different model builder/synthesizer/manager components.
Additionally, I have aimed to improve unit tests to be more focused on outcomes. Some unit test files I've completely rewritten because porting the originals to the new logic just didn't make sense.
Overall, our basic approach has not changed. We have a controller which uses model builders to build a model, then deployers which use synthesizers. Synthesizers use managers. What has changed is that previously these model builders appeared more nicely decoupled, at least at the model build stage (see model_build_lattice_service.go). This decoupled logic comes at the cost of needing to replicate referential logic throughout each builder. For example, when building a rule we needed to know how to reference a target group that was created by a different builder. Similarly, in the synthesizers/managers, we would need to reconstruct references (typically hash keys) to find the objects we needed stored during synthesis.
The core of this change is a shift to a recursive, referential approach. When building the model now, we do not, for example, build model rules separately at the top level alongside services. Instead, we build rules for specific listeners in the stack. Similarly, we now build target groups as we build the rules that need those target groups. This approach keeps pointers back to the object in the stack that will later be required during synthesis.
Then, during synthesis, we start with services, which will yield the actual service ARN and update the modeled service in the stack via the Status field. When we synthesize the listeners, we will resolve the reference to the service in the stack, then pull the ARN from the status. The same approach works for rules, where we pull service, listener, and target group Ids out of the objects stored in the stack using the references we have.
So, previously we had something like (gross oversimplification):
Now it looks something like:
This does complicate the model-building logic, as builders are now nested, but it greatly simplifies later resource identification and consolidates any "find resource X" logic into that resource's synthesizer+manager alone.
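In miniature, the referential approach described above looks something like this. The types are radically simplified stand-ins, not the controller's actual model: builders store pointers into the stack at build time, and synthesis resolves references through the Status fields filled in earlier.

```go
package main

import "fmt"

// Radically simplified stack model (hypothetical shapes): a service whose
// listeners point back to it, with ARNs filled in during synthesis.
type ServiceStatus struct{ Arn string }

type Service struct {
	Name   string
	Status *ServiceStatus
}

type Listener struct {
	Port    int
	Service *Service // reference kept from build time
}

// buildStack nests the builders: the service builder creates listeners
// that hold a pointer to their service rather than a reconstructed key.
func buildStack() (*Service, []*Listener) {
	svc := &Service{Name: "web"}
	listeners := []*Listener{{Port: 80, Service: svc}, {Port: 443, Service: svc}}
	return svc, listeners
}

// synthesize the service first (populating Status), then the listeners,
// which resolve the service ARN through the stored reference.
func synthesize(svc *Service, listeners []*Listener) []string {
	svc.Status = &ServiceStatus{Arn: "arn:fake:service/" + svc.Name}
	var out []string
	for _, l := range listeners {
		out = append(out, fmt.Sprintf("listener %d -> %s", l.Port, l.Service.Status.Arn))
	}
	return out
}

func main() {
	svc, listeners := buildStack()
	for _, line := range synthesize(svc, listeners) {
		fmt.Println(line)
	}
}
```

No hash keys or lookups are needed during synthesis: the pointer established at build time is the reference.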
Testing done on this change:
Automated tests were also run, though I had issues with an unrelated test around the vpc+service network association target group not being cleaned up. e2e tests required minimal updates, other than the logic for identifying the correct target group, which is expected. I also added in a small optimization to speed up route deletion by first deleting all rules. This first disassociates target groups from services meaning they are not required to drain.
Automation added to e2e:
Will this PR introduce any new dependencies?:
No
Will this break upgrades or downgrades. Has updating a running cluster been tested?:
Target group naming is incompatible with previous versions. A one-time cleanup would be required. This is true of upgrade or downgrade, but this is otherwise a non-impactful change for existing configurations.
Does this PR introduce any user-facing change?:
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.