Contributor

@shulin-sq shulin-sq commented Feb 14, 2024

What type of PR is this?
bug

Which issue does this PR fix:
In the scenario where

  • account A is managed by the lattice controller
  • account A shares a service to account B
  • account B associates this service to its own service network

Account A can end up in a situation where it does not have permission to read the tags of that association.

Therefore this results in an error during route reconciliation.

error during service synthesis failed ServiceManager.Upsert xxx due to NotFoundException: Resource was not found
	status code: 404, request id: a400c824-64e3-4694-94ce-1b5698e1ee35

What does this PR do / Why do we need it:

If the request to get the tags of a service network association fails, skip deleting that association and log the failure instead of returning an error.

If an issue # is not available please add repro steps and logs from aws-gateway-controller showing the issue:

see above scenario

Testing done on this change:

Tested the fix in a staging environment (on top of the most recent master), but open to suggestions on how to test it via unit tests (it looks like the current tests don't cover this scenario).

sample error message:

{"level":"warn","ts":"2024-02-26T01:47:25.568Z","logger":"controller.route","caller":"lattice/service_manager.go:232","msg":"skipping update associations  service: xxx, association:xxx, error: NotFoundException: Resource was not found\n\tstatus code: 404, request id: 086cba48-4421-4eda-bcae-c543acd81a85"}

Automation added to e2e:

Will this PR introduce any new dependencies?:
no

Will this break upgrades or downgrades. Has updating a running cluster been tested?:
no, yes

Does this PR introduce any user-facing change?:

* fix reconciling lattice services that have service network associations created in a different account (via RAM)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

  assocs, err := m.getAllAssociations(ctx, svcSum)
  if err != nil {
-     return err
+     return fmt.Errorf("in updateAssociations, getAllAssociations failed with %w", err)
Contributor Author

Adding some more detailed error messages, since the err here is just

status code: 404, request id: ...

Contributor

An idiomatic error string would be "update associations: get all associations: %w": lower case for function names, chained by colons. It's also better to pick one of these approaches for error composition:

  1. the function that returns the error adds its own name to the error
  2. or the caller adds the names of the functions it calls

Otherwise you might get duplicates in the final error message. Here both are mixed: updateAssociations adds itself and getAllAssociations. You probably want to move the getAllAssociations part into its own function. For example:
in updateAssociations:

return fmt.Errorf("update associations: %w", err)

in getAllAssociations:

return fmt.Errorf("get all associations: %w", err)

Also, don't use the word "failed" unless it's part of the original error. There is a tendency to add "failed", "unsuccessful", etc. when wrapping errors, so by the time the error pops up to the top of the stack it has 5-6 "failed" words in a single sentence. So, back to the first point: chain method names with colons, and put the actual error last in the chain.

  var svcErr error
  for _, resService := range resServices {
-     svcName := fmt.Sprintf("%s-%s", resService.Spec.RouteName, resService.Spec.RouteNamespace)
+     svcName := utils.LatticeServiceName(resService.Spec.RouteName, resService.Spec.RouteNamespace)
Contributor Author

Minor fix to the error messaging: it was confusing to see a non-truncated service name here.

-   if isManaged {
+       m.log.Errorf("in updateAssociations failed when attempting IsArnManaged check. Skipping delete association. service: %s, association: %s, %s", svc.LatticeServiceName(), assoc.Arn, err)
+   } else if isManaged {
        err = m.deleteAssociation(ctx, assoc.Arn)
Contributor

@zijun726911 zijun726911 Feb 16, 2024

Should we mimic this logic and ignore only the get-tags AccessDenied error, while still returning other errors? For example:

	for _, assoc := range toDelete {
		isManaged, err := m.cloud.IsArnManaged(ctx, *assoc.Arn)
		if err != nil {
			aerr, ok := err.(awserr.Error)
			if ok && aerr.Code() == vpclattice.ErrCodeAccessDeniedException {
				// In a scenario that the service association is created by a foreign account,
				// the owner account's controller cannot read the tags of this ServiceNetworkServiceAssociation,
				// and AccessDeniedException is expected.
				continue
			} else {
				return err
			}
		}
		if isManaged {
			err = m.deleteAssociation(ctx, assoc.Arn)
			if err != nil {
				return err
			}
		}
	}

@solmonk can you review this PR?

Contributor

yes, I think this way makes more sense in terms of consistency.

Contributor

@solmonk solmonk left a comment

I believe service network has the same issue, I'm curious if you are also affected by it? https://github.com/aws/aws-application-networking-k8s/blob/main/pkg/deploy/lattice/service_network_manager.go#L101

@shulin-sq
Contributor Author

@solmonk Interesting. I have not run into this with the VPC association, but it looks like the same issue. I will update the PR.

@shulin-sq shulin-sq force-pushed the shulin/fix-get-tags branch 2 times, most recently from 17a4ec5 to 4942650 Compare February 26, 2024 01:48
@shulin-sq
Contributor Author

@mikhail-aws @zijun726911 it seems the presubmit hook is stuck, any advice on how to fix it?

@zijun726911
Contributor

presubmit passed, will merge it

@zijun726911
Contributor

@shulin-sq Do you need us to cut a new controller release (new helm chart) that includes this fix, or do you build your own image in a private image repo?

@zijun726911 zijun726911 merged commit 2a04399 into aws:main Feb 27, 2024
@shulin-sq
Contributor Author

@zijun726911 We already created an internal build that includes this change. No rush on a release version.

@shulin-sq
Contributor Author

I had to redact some fields from a comment I left earlier. Here is the comment:

some notes:

  1. I was unable to reproduce the VPC problem. My VPC association is set up in another account, so I'm unsure why I'm not running into the same issue. Specifically, my scenario is: account A owns the sn, account A shares the sn with account B, and account B does the VPC association. Account A does not have access to the VPC association. The controller has IAM permissions for account A.
  2. I tried checking for the error code, but unfortunately getTags returns a 404 NotFoundException, which is different from the ResourceNotFoundException that is already a static string in the vpclattice Go SDK. For now I kept the err check and added a TODO message. Alternatives would be to check against the status code 404 (but that's not exposed as part of awserr.Error) or to check for NotFoundException but temporarily define the string somewhere else. Please let me know what your preference is.
  3. I was able to test the service association in my staging environment and saw the following error messages:

before while debugging (when inspecting the awserr.Error)

{"level":"error","ts":"2024-02-26T01:31:06.420Z","logger":"controller.route","caller":"lattice/service_manager.go:227","msg":"shulin_was_here NotFoundException, %!w(*awserr.requestError=&{0xc01324f640 404 5cd0afde-496b-4343-abaa-e239df73eea7 []}), %!w(*awserr.requestError=&{0xc01324f640 404 5cd0afde-496b-4343-abaa-e239df73eea7 []})"}
after

{"level":"warn","ts":"2024-02-26T01:47:25.568Z","logger":"controller.route","caller":"lattice/service_manager.go:232","msg":"skipping update associations service: REDACTED, association: REDACTED, error: NotFoundException: Resource was not found\n\tstatus code: 404, request id: 086cba48-4421-4eda-bcae-c543acd81a85"}

4 participants