Concourse deployment wide component tracking #4534

Closed
clarafu opened this issue Oct 1, 2019 · 6 comments · Fixed by #4583
Labels
core enhancement release/documented Documentation and release notes have been updated.
Milestone

Comments

@clarafu
Contributor

clarafu commented Oct 1, 2019

The way Concourse works right now, each ATC runs every component (scheduler, lidar/radar, build tracker, GC) once per tick. So if you have 4 ATCs, GC runs 4 times every 30 seconds. This doesn't really make sense: each run already does a whole sweep of the pipelines, builds, or GC candidates for the entire deployment, so running it 4 times within one interval (usually 30 seconds) adds a huge load without making it much more effective. As a result, the more users scale their ATCs horizontally, the more load they put on their DB without much return.

To avoid this, we want to add deployment-wide component tracking. This will ensure that each component runs only once per interval tick.

A suggestion for the implementation is a table with last_ran and enabled columns. Each component will only run if enabled is true and the time elapsed since last_ran is at least the configured interval for that component.
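
As a rough illustration of the table idea, here is a minimal sketch in Python. It uses an in-memory SQLite database as a stand-in for Concourse's Postgres DB. Only `last_ran` and `enabled` come from the proposal above; the table name, the `name` and `interval_seconds` columns, and the helper functions are assumptions made up for this example. It also stamps `last_ran` in the same statement that checks it, which is one possible way to keep two ATCs from both deciding to run, not necessarily how this would be implemented.

```python
import sqlite3

# Hypothetical schema: the proposal only names `last_ran` and `enabled`.
SCHEMA = """
CREATE TABLE components (
    name             TEXT PRIMARY KEY,
    interval_seconds REAL NOT NULL,
    enabled          INTEGER NOT NULL DEFAULT 1,
    last_ran         REAL              -- Unix timestamp; NULL means "never ran"
)
"""


def open_db():
    """Create an in-memory database holding the (hypothetical) components table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


def register(conn, name, interval_seconds):
    """Add a component row with its configured interval."""
    conn.execute(
        "INSERT INTO components (name, interval_seconds) VALUES (?, ?)",
        (name, interval_seconds),
    )


def try_claim_run(conn, name, now):
    """Return True if the caller should run the component at time `now`.

    The UPDATE only matches when the component is enabled and the interval
    has elapsed since last_ran, so at most one caller per interval wins.
    """
    cur = conn.execute(
        "UPDATE components SET last_ran = ? "
        "WHERE name = ? AND enabled = 1 "
        "AND (last_ran IS NULL OR ? - last_ran >= interval_seconds)",
        (now, name, now),
    )
    return cur.rowcount == 1
```

With four ATCs all ticking against the same row, only the first call in each 30 second window would get `True`; the rest would skip the run.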

This change will also help the 6.0.0 migration where we need to batch migrate a whole new table, but we want to stop all components from running while we run that migration (#4214).

@xtremerui
Contributor

xtremerui commented Oct 1, 2019

This table should track the following components' current state.

pipeline syncer
build tracker
resource checker
pipeline scheduler
garbage collector
log collector
log drainer

There will be a new API to give people visibility into each component (healthcheck?) and to let people turn components on and off on their own.

@jchesterpivotal
Contributor

I'd suggest that this be done as a lock table: a running process can create a lock record (or, to your point, a lease record that has a guaranteed expiry time) for the global work item. So ATC #2 might grab pipeline syncer, ATC #4 grabs resource checker, and so on. The nice thing is that this also helps with extracting those into standalone processes further down the line.

In terms of disabling, instead of an enabled column, I would have a virtual process that can grab the locks/leases instead. Logically, "ATC #2 has it, but ... uh ... it's disabled" isn't consistent. If it's disabled, nobody can have it. The alternative would be "null has it and it's disabled", which is begging for trouble. A virtual process is fulfilling the Null Object pattern here.
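
To make the lease and virtual-process idea concrete, here is a small in-memory sketch in Python. Everything in it (`LeaseTable`, `DISABLED`, the method names, the use of plain numbers for time) is hypothetical and only meant to show the shape of the idea: disabling a component is just the virtual holder taking a lease that never expires.

```python
from dataclasses import dataclass
from typing import Dict

# Hypothetical name for the virtual process that "holds" disabled components.
DISABLED = "virtual-disabled-holder"


@dataclass
class Lease:
    holder: str
    expires_at: float  # float("inf") for the virtual holder


class LeaseTable:
    """In-memory stand-in for a lease table keyed by component name."""

    def __init__(self):
        self._leases: Dict[str, Lease] = {}

    def try_acquire(self, component, holder, now, ttl):
        """Take or renew the lease unless another holder's lease is still live."""
        current = self._leases.get(component)
        if current is not None and current.expires_at > now and current.holder != holder:
            return False
        self._leases[component] = Lease(holder, now + ttl)
        return True

    def release(self, component, holder):
        """Drop the lease, but only if `holder` actually owns it."""
        current = self._leases.get(component)
        if current is not None and current.holder == holder:
            del self._leases[component]

    def disable(self, component, now):
        """Disabling means the virtual holder takes a lease that never expires."""
        return self.try_acquire(component, DISABLED, now, float("inf"))

    def enable(self, component):
        """Enabling means the virtual holder gives the lease back."""
        self.release(component, DISABLED)
```

In this sketch `disable` fails while a real ATC still holds a live lease, so the caller would have to retry until the current holder finishes or its lease expires.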

I'd also suggest as a small point that you should use SERIALIZABLE transactions to avoid anomalies.

@jwntrs
Contributor

jwntrs commented Oct 1, 2019

@jchesterpivotal This table serves another purpose too: it maintains the interval at which work should actually happen. It's not simply about making sure that no two ATCs are doing the same work at the same time; it's about making sure that none of them do the same work within, say, a 30s interval.

I don't think the virtual process solves this, since after ATC-1 does the work there's no guarantee that the virtual process will get the lock next (instead of ATC-2 or ATC-3). That means we'd need the last_ran column in this table anyway. So to me it makes sense for each component to scan this table. If the configured interval has elapsed since last_ran and the component isn't paused, it acquires an advisory lock, runs, and then updates the last_ran column before releasing the lock.
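
The flow described above could be sketched roughly like this in Python. A `threading.Lock` stands in for the database advisory lock, and `Component`, `is_due`, and `tick` are names invented for this example. The second due check after taking the lock is an addition of this sketch, there to cover the case where another ATC finished a run between the first check and the lock acquisition.

```python
import threading
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Component:
    """One row of the hypothetical tracking table, plus a stand-in lock."""
    name: str
    interval: float
    paused: bool = False
    last_ran: Optional[float] = None
    lock: threading.Lock = field(default_factory=threading.Lock)


def is_due(c: Component, now: float) -> bool:
    """True if the component never ran or the interval has elapsed."""
    return c.last_ran is None or now - c.last_ran >= c.interval


def tick(c: Component, now: float, work: Callable[[], None]) -> bool:
    """Run `work` at most once per interval; return True if it ran."""
    if c.paused or not is_due(c, now):
        return False
    # Non-blocking acquire: if another ATC holds the lock, skip this tick.
    if not c.lock.acquire(blocking=False):
        return False
    try:
        # Re-check under the lock in case another ATC just finished a run.
        if c.paused or not is_due(c, now):
            return False
        work()
        c.last_ran = now  # stamp before the lock is released
        return True
    finally:
        c.lock.release()
```

Calling `tick` four times in the same instant, as four ATCs would, runs the work once; the other three calls see a fresh `last_ran` and return `False`.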

@vito
Member

vito commented Oct 1, 2019

@pivotal-jwinters I think there's still a need to ensure no other ATCs are actually in the process of running the component after disabling it, though. Say ATC 3 starts scheduling, then ATC 1 comes up and disables scheduling. We would probably want to wait for ATC 3 to finish scheduling before continuing on. Acquiring the scheduling lock after disabling it would be one way to do that. 🤔
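
A rough Python sketch of that "disable, then acquire the lock to wait" idea, with a `threading.Lock` standing in for the scheduling lock. `ComponentGate` and its method names are made up for illustration and are not part of Concourse.

```python
import threading
from typing import Callable


class ComponentGate:
    """Hypothetical guard around one component (for example, scheduling)."""

    def __init__(self):
        self.enabled = True
        self.run_lock = threading.Lock()

    def run_once(self, work: Callable[[], None]) -> bool:
        """Run `work` under the lock if the component is still enabled."""
        if not self.enabled:
            return False
        with self.run_lock:
            # Re-check: the component may have been disabled while we waited.
            if not self.enabled:
                return False
            work()
            return True

    def disable_and_drain(self) -> None:
        """Disable new runs, then block until any in-flight run has finished."""
        self.enabled = False
        # Taking the lock only succeeds once the current holder releases it,
        # so returning from here means no run is still in progress.
        with self.run_lock:
            pass
```

A migration would call `disable_and_drain()` first and only start its own work after it returns.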

@jwntrs
Contributor

jwntrs commented Oct 1, 2019

@vito yeah for the purpose of running migrations I totally agree, I just think we have different requirements for our normal use case.

@kcmannem
Member

kcmannem commented Oct 1, 2019

subscribe
