Concourse deployment-wide component tracking #4534
This table should track the following components' current state.
There will be a new API that gives people visibility into each component (a healthcheck?) and lets them turn components on and off on their own.
I'd suggest that this be done as a lock table: a running process can create a lock record (or, to your point, a lease record with a guaranteed expiry time) for a global work item. So ATC #2 might grab the pipeline syncer, ATC #4 the resource checker, and so on. The nice thing is that this also helps with extracting those components into standalone processes further down the line. In terms of disabling, instead of an … I'd also suggest, as a small point, that you should use …
@jchesterpivotal This table serves another purpose too: it maintains the interval at which work should actually happen. It's not simply about making sure that no two ATCs are doing the same work at the same time; it's about making sure that none of them repeat the same work within, say, a 30s interval. I don't think the virtual process solves this, since after ATC-1 does the work there's no guarantee that the virtual process will get the lock next (instead of ATC-2 or ATC-3). So this means we'd need the …
@pivotal-jwinters I think there's still a need to ensure that no other ATC is still running the component after it has been disabled, though. Say ATC 3 starts scheduling, then ATC 1 comes up and disables scheduling. We would probably want to wait for ATC 3 to finish scheduling before continuing. Acquiring the scheduling lock after disabling it would be one way to do that. 🤔
@vito Yeah, for the purpose of running migrations I totally agree; I just think we have different requirements for our normal use case.
The way Concourse works right now, each ATC runs every component, such as the scheduler, lidar/radar, the build tracker and GC, once per tick. This means that if you have 4 ATCs, GC runs 4 times every 30 seconds. That doesn't really make sense: each time a component runs, it does a full sweep across the whole deployment (all pipelines, all builds, or all GC work), so running it 4 times within one interval (usually 30 seconds) is not much more effective than running it once, especially given the large load it adds. As a result, the more users scale their ATCs horizontally, the more load they put on their DB without much return.
So to avoid this, we want to add deployment-wide component tracking, which ensures that each component runs only once per interval tick across the whole deployment.
One suggestion for the implementation is a table with `last_ran` and `enabled` columns. A component only runs if `enabled` is true and at least the configured interval for that component has elapsed since `last_ran`.

This change will also help with the 6.0.0 migration (#4214), where we need to batch-migrate a whole new table and want to stop all components from running while that migration runs.