Concourse deployment wide component tracking #4534

clarafu · 2019-10-01T14:54:13Z

The way Concourse works right now, each ATC runs every component (scheduler, lidar/radar, build tracker, GC) once per tick. That means if you have 4 ATCs, GC runs 4 times every 30 seconds. This doesn't really make sense: each run already does a whole sweep of all the pipelines, builds, or garbage for the whole deployment, so running it 4 times within one interval (usually 30 seconds) adds a huge load without making it much more effective. As a result, the more users scale their ATCs horizontally, the more load they put on their DB without much return.

To avoid this, we want to add deployment-wide component tracking. This will ensure that each component runs only once per interval across the whole deployment, no matter how many ATCs there are.

A suggestion for the implementation is to have a table with last_ran and enabled columns. Each component will only run if enabled is true and at least the configured interval for that component has elapsed since its last_ran timestamp.
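A minimal sketch of that check, assuming one row per component. Only `last_ran` and `enabled` come from the proposal; the `Component` class, the `name` and `interval` fields, and `should_run` are hypothetical names used for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Component:
    # Hypothetical row shape: only last_ran and enabled are from the proposal.
    name: str
    interval: timedelta
    last_ran: datetime
    enabled: bool = True


def should_run(component: Component, now: datetime) -> bool:
    """Return True when the component is enabled and its interval has elapsed."""
    if not component.enabled:
        return False
    return now - component.last_ran >= component.interval
```

Every ATC would evaluate the same row, so once one of them updates `last_ran`, the others skip the component until the next interval.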

This change will also help the 6.0.0 migration where we need to batch migrate a whole new table, but we want to stop all components from running while we run that migration (#4214).


xtremerui · 2019-10-01T15:30:32Z

This table should track the following components' current state.

pipeline syncer
build tracker
resource checker
pipeline scheduler
garbage collector
log collector
log drainer

There will be a new API to give people visibility into each component (healthcheck?) and to let them turn components on and off on their own.

jchesterpivotal · 2019-10-01T15:43:42Z

I'd suggest that this be done as a lock table: some running process can create a lock record (or to your point, a lease record that has a guaranteed expiry time) for the global work item. So ATC #2 might grab pipeline syncer, ATC #4 grabs resource checker and so on. The nice thing is that this also helps with extracting those into standalone processes further down the line.

In terms of disabling, instead of an enabled column, I would have a virtual process that can grab the locks/leases instead. Logically, "ATC #2 has it, but ... uh ... it's disabled" isn't consistent. If it's disabled, nobody can have it. The alternative would be "null has it and it's disabled", which is begging for trouble. A virtual process is fulfilling the Null Object pattern here.
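A rough in-memory sketch of the lease idea, with disabling modelled as a virtual holder taking a lease that never expires. `LeaseTable`, `DISABLED`, and the method names are hypothetical, and a real version would live in the database rather than in a dict.

```python
from datetime import datetime, timedelta
from typing import Dict, Optional, Tuple

DISABLED = "disabled"  # hypothetical name for the virtual process


class LeaseTable:
    def __init__(self) -> None:
        # work item -> (holder, expires_at); expires_at of None never expires
        self._leases: Dict[str, Tuple[str, Optional[datetime]]] = {}

    def acquire(self, item: str, holder: str, now: datetime,
                ttl: Optional[timedelta]) -> bool:
        """Take the lease unless another holder has a live one."""
        current = self._leases.get(item)
        if current is not None:
            _, expires_at = current
            if expires_at is None or expires_at > now:
                return False
        self._leases[item] = (holder, None if ttl is None else now + ttl)
        return True

    def release(self, item: str, holder: str) -> bool:
        """Drop the lease, but only if this holder owns it."""
        current = self._leases.get(item)
        if current is None or current[0] != holder:
            return False
        del self._leases[item]
        return True
```

Disabling is then just `acquire(item, DISABLED, now, None)`: it fails while an ATC holds a live lease and, once it succeeds, no ATC can take the item until the virtual holder releases it.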

I'd also suggest as a small point that you should use SERIALIZABLE transactions to avoid anomalies.

jwntrs · 2019-10-01T16:11:36Z

@jchesterpivotal This table serves another purpose too. It maintains an interval at which work should actually happen. It's not simply about making sure that no two ATCs are doing the same work at the same time, it's about making sure that none of them do the same work within say a 30s interval.

I don't think the virtual process solves this, since after ATC-1 does the work there's no guarantee that the virtual process will get the lock next (instead of ATC-2 or ATC-3). So we'd need the last_ran column in this table anyway. To me, it makes sense for each component to scan this table: if the configured interval has elapsed since last_ran and the component isn't paused, it acquires an advisory lock, runs, and then updates the last_ran column before releasing the lock.
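The flow described above could look roughly like this. It is an in-memory sketch: a `threading.Lock` stands in for the database advisory lock, `TrackedComponent` and `run_if_due` are hypothetical names, and the second `_due` check after taking the lock is an added assumption so that a caller who lost the race does not rerun the work.

```python
import threading
from datetime import datetime, timedelta
from typing import Callable, Optional


class TrackedComponent:
    def __init__(self, name: str, interval: timedelta) -> None:
        self.name = name
        self.interval = interval
        self.paused = False
        self.last_ran: Optional[datetime] = None  # None until the first run
        self._lock = threading.Lock()  # stand-in for a database advisory lock

    def _due(self, now: datetime) -> bool:
        if self.paused:
            return False
        return self.last_ran is None or now - self.last_ran >= self.interval

    def run_if_due(self, work: Callable[[], None], now: datetime) -> bool:
        """Run work at most once per interval; return True if it ran."""
        if not self._due(now):
            return False
        if not self._lock.acquire(blocking=False):
            return False  # someone else is already running this component
        try:
            if not self._due(now):  # re-check: last_ran may have just changed
                return False
            work()
            self.last_ran = now  # updated before the lock is released
            return True
        finally:
            self._lock.release()
```

With four ATCs ticking at the same moment, only the first call does the work; the rest see a fresh `last_ran` and skip until the interval elapses again.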

vito · 2019-10-01T16:57:46Z

@pivotal-jwinters I think there's still a need to ensure no other ATCs are actually in the process of running the component after disabling it, though. Say ATC 3 starts scheduling, then ATC 1 comes up and disables scheduling. We would probably want to wait for ATC 3 to finish scheduling before continuing on. Acquiring the scheduling lock after disabling it would be one way to do that. 🤔
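One way to picture "disable, then acquire the lock" is the sketch below. It again uses a `threading.Lock` as a stand-in for the scheduling lock, and `LockedComponent`, `run_component`, and `disable_and_drain` are hypothetical names. The sketch assumes the enabled check happens while holding the lock.

```python
import threading
from typing import Callable


class LockedComponent:
    def __init__(self) -> None:
        self.enabled = True
        self.lock = threading.Lock()  # stand-in for the component's database lock


def run_component(component: LockedComponent, work: Callable[[], None]) -> bool:
    """Run work under the lock, unless the component has been disabled."""
    with component.lock:
        if not component.enabled:
            return False
        work()
        return True


def disable_and_drain(component: LockedComponent) -> None:
    """Disable the component, then wait for any in-flight run to finish."""
    component.enabled = False
    # Blocks until the current holder (if any) releases the lock. Anyone who
    # takes the lock afterwards sees enabled == False and skips the work.
    with component.lock:
        pass
```

When `disable_and_drain` returns, no run is in progress and none can start, which is the state a migration would want before continuing.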

jwntrs · 2019-10-01T17:00:39Z

@vito yeah for the purpose of running migrations I totally agree, I just think we have different requirements for our normal use case.

kcmannem · 2019-10-01T20:30:35Z

subscribe

clarafu added enhancement core labels Oct 1, 2019

clarafu mentioned this issue Oct 1, 2019

Incrementally migrate build inputs/outputs to successful_build_outputs #4214

Closed

xtremerui mentioned this issue Oct 8, 2019

Feature 4534 component tracking #4583

Merged

9 tasks

vito closed this as completed in #4583 Oct 22, 2019

jamieklassen added this to the v5.7.0 milestone Oct 30, 2019

jamieklassen added the release/documented Documentation and release notes have been updated. label Oct 30, 2019
