
Conversation

Contributor

@erikfuller erikfuller commented Oct 30, 2024

What type of PR is this?
Feature/Improvement

Which issue does this PR fix:
#671

What does this PR do / Why do we need it:
Adds a configurable value for the maximum number of workers for route controller instances. Note that this worker count applies to each route type: HTTP, GRPC, and TLS. This change also updates the locking mechanism for target group garbage collection to be more performant and to allow route reconciliations to execute in parallel.
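For readers unfamiliar with how this knob is usually plumbed through, here is a minimal sketch, assuming a config helper named ParseRouteMaxConcurrentReconciles (the exact names in the PR may differ): the environment variable is parsed once at startup, the controller errors out on an invalid value, and each route controller passes the result to controller-runtime.

package config

import (
	"fmt"
	"os"
	"strconv"
)

// RouteMaxConcurrentReconciles defaults to 1, so behavior is unchanged when
// ROUTE_MAX_CONCURRENT_RECONCILES is not set.
var RouteMaxConcurrentReconciles = 1

// ParseRouteMaxConcurrentReconciles reads the environment variable and rejects
// non-numeric or non-positive values.
func ParseRouteMaxConcurrentReconciles() error {
	raw, ok := os.LookupEnv("ROUTE_MAX_CONCURRENT_RECONCILES")
	if !ok || raw == "" {
		return nil
	}
	n, err := strconv.Atoi(raw)
	if err != nil || n < 1 {
		return fmt.Errorf("invalid ROUTE_MAX_CONCURRENT_RECONCILES value %q", raw)
	}
	RouteMaxConcurrentReconciles = n
	return nil
}

Each route controller (HTTP, GRPC, TLS) would then build with controller.Options{MaxConcurrentReconciles: config.RouteMaxConcurrentReconciles} via the controller-runtime builder's WithOptions, which is what produces the "worker count" value in the log line shown below.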

Testing done on this change:
Created 50 HTTP routes. On startup or change, all 50 are reconciled. Previously, with max workers = 1 (the default), the reconciliation rate was 1/s and the total time was roughly 50s. With 5 workers, the rate was closer to 5/s and the total time was around 12s. When set to very large values (e.g. 25 or 100), the controller encounters throttling from the Lattice APIs. At the default API rate limits, the controller maxes out between 5-10/s and handles the throttling well enough. If customers need higher reconciliation rates, they can request API limit increases.

I also checked the Helm install, including a --dry-run to verify the environment variable, as well as an actual deployment. When configured correctly, the max worker count shows up in the logs.

# example log line when worker count set to 10
2024-10-29T16:37:47.088-0700	INFO	runtime	controller/controller.go:234	Starting workers	{"controller": "grpcroute", "controllerGroup": "gateway.networking.k8s.io", "controllerKind": "GRPCRoute", "worker count": 10}

This is how I was eyeballing the reconciliation rate and found the issue with the target group GC lock.

# pull the metrics for the HTTPRoute queue depth
while true; do curl -s http://127.0.0.1:8080/metrics | grep 'workqueue_depth{name="httproute"}' | awk -v d="$(date -j +%H:%M:%S)"  '{ print d, $2 }'; sleep 1; done

Also validated that the controller errors out when an invalid value is supplied.

Automation added to e2e:
n/a

Will this PR introduce any new dependencies?:
No

Will this break upgrades or downgrades. Has updating a running cluster been tested?:
No. The default worker count is still 1 if the value is unspecified.

Does this PR introduce any user-facing change?:
Yes, new optional/opt-in feature.

Allow configuring concurrency for route reconciliation via the ROUTE_MAX_CONCURRENT_RECONCILES environment variable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

…rs. Target GC now uses RW lock to accommodate parallel Deploy operations.
Comment on lines +3 to +4
const ReconcileStart = "reconcile_start"
const ReconcileEnd = "reconcile_end"
Contributor Author


This is somewhat unrelated, more of an aesthetic change. The improved logging hasn't been released yet, so changing this now should be fine.


type TgGc struct {
- lock sync.Mutex
+ lock sync.RWMutex
Contributor Author

@erikfuller erikfuller Oct 30, 2024


The intent of this change is to allow many reconciliations to proceed in parallel, so long as target group GC isn't happening, and vice versa. Previously, all reconciliations were serialized by this lock. Since every backendRef gets a unique target group in Lattice, even when the backendRefs reference the same Kubernetes service, route reconciliations should be safe to run in parallel since they aren't mutating any of the same resources.

As the logic grows in complexity, we may need to consider something like locks on individual target groups rather than a single "lock them all" lock.
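
To make the pattern above concrete, here is a minimal sketch assuming hypothetical helper names (withDeployLock and withGcLock are illustrative, not the PR's actual methods): reconciliations take the read lock and can run concurrently with each other, while target group GC takes the write lock and runs exclusively.

package lattice

import "sync"

type TgGc struct {
	lock sync.RWMutex
}

// Used by route reconciliation (Deploy): multiple readers may hold the lock at once.
func (g *TgGc) withDeployLock(fn func()) {
	g.lock.RLock()
	defer g.lock.RUnlock()
	fn()
}

// Used by target group garbage collection: excludes all readers and other writers.
func (g *TgGc) withGcLock(fn func()) {
	g.lock.Lock()
	defer g.lock.Unlock()
	fn()
}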

@erikfuller erikfuller merged commit 6c6addc into aws:main Nov 1, 2024
2 checks passed
@erikfuller erikfuller deleted the rec-concurrency branch November 1, 2024 17:42