New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discussion: Runtime refactoring #2926

Open
topherbullock opened this Issue Dec 10, 2018 · 15 comments

Comments

5 participants
@topherbullock
Copy link
Member

topherbullock commented Dec 10, 2018

(This may turn into an RFC for a more refined & well defined runtime interface.)

Recently there have been a few examples of features which have been difficult to implement because of the structure of the worker package of ATC.

We should look at the architecture of the code, and the interfaces available to create containers/volumes, find containers/volumes, select a worker for containers/volumes, etc.

@topherbullock topherbullock created this issue from a note in Runtime (Backlog) Dec 10, 2018

@topherbullock

This comment has been minimized.

Copy link
Member Author

topherbullock commented Dec 10, 2018

Some notes I have on the worker package:

The worker.Worker interface's responsibility is a little unclear.

Two implementations are worker.gardenWorker and worker.pool

Given these two implementations and the fact that many of the member methods don't even work for both ( eg, eg ) implementations; it stands to reason that there's really two responsibilities being expressed: Worker Selection and Operations on a Single Worker
..
We also assert that worker.pool implements worker.Client, which seems bonkers.

@marco-m

This comment has been minimized.

Copy link

marco-m commented Dec 11, 2018

Hello, first of all, thanks for all the work around scheduling :-)

I would like to give some input as the originator of #2577 to help prioritize this refactoring work.

Looking at least-build-containers (#2577), it will help us a lot. On the other hand, until #2928 (Add ATC/worker flags to limit max build containers for workers) (blocked by this refactoring) is done, we will still need to keep the workers alive by "controlling" the load by (ab)using the Concourse resource pool resource (I wrote a post at https://discuss.concourse-ci.org/t/ann-concourse-pool-boy-detect-and-release-stale-pool-resource-locks/765 describing the unavoidable flakiness of this approach).

Thanks again!

@topherbullock

This comment has been minimized.

Copy link
Member Author

topherbullock commented Dec 19, 2018

@marco-m Yeah, we're hoping this refactor will make the work for #2928 ( and all future work to improve the runtime ) much easier to complete; hopefully this area of the code will be easier to contribute to via PRs as well.

Currently the code is a little daunting and hard to trace through. Especially due to things like this :

worker := NewGardenWorker(
p.gardenClient,
p.baggageclaimClient,
p,
p.volumeClient,
p.worker,
p.clock,
0,
)

Its not super obvious what that constructor is doing with container provider, and its passing the caller into the constructor.. SUPER WEIRD.. and what is that zero for !? (its an optimization for #2577 to pass down the number of build containers when constructing worker objects).

We ideally want to split up, or trim down the interfaces at play in implementing these features, so that we can easily extend them without impedance.

@topherbullock

This comment has been minimized.

Copy link
Member Author

topherbullock commented Dec 19, 2018

As a start, @ddadlani and I are pulling on the threads of worker.Worker, and seeing where things can be cleaned up.

There's a lot of fields in the gardenWorker struct which are just pulled off the dbWorker to construct it in NewGardenWorker:

func NewGardenWorker(
gardenClient garden.Client,
baggageclaimClient baggageclaim.Client,
containerProvider ContainerProvider,
volumeClient VolumeClient,
dbWorker db.Worker,
clock clock.Clock,
numBuildContainers int,
) Worker {
return &gardenWorker{
gardenClient: gardenClient,
baggageclaimClient: baggageclaimClient,
volumeClient: volumeClient,
containerProvider: containerProvider,
clock: clock,
activeContainers: dbWorker.ActiveContainers(),
activeVolumes: dbWorker.ActiveVolumes(),
buildContainers: numBuildContainers,
resourceTypes: dbWorker.ResourceTypes(),
platform: dbWorker.Platform(),
tags: dbWorker.Tags(),
teamID: dbWorker.TeamID(),
name: dbWorker.Name(),
startTime: dbWorker.StartTime(),
version: dbWorker.Version(),
ephemeral: dbWorker.Ephemeral(),

makes a lot of sense for the gardenWorker struct to just have a dbWorker on it to pluck those off as needed; and it clarifies the purpose of gardenWorker (gluing together garden, baggageclaim, and the database)

@topherbullock topherbullock pinned this issue Dec 19, 2018

@pivotal-jwinters

This comment has been minimized.

Copy link
Contributor

pivotal-jwinters commented Dec 20, 2018

Another thing that we should consider is decoupling lifecycle of volumes from the lifecycle of containers. The FindOrCreateContainer method does A LOT of stuff, including creating some volumes that it might need.

There are downstream operations which try and figure out what volumes the FindOrCreateContainer method might have created, so that it can associate the volume with another resource (e.g. task cache, or resource cache). This happens so the volume doesn't get gc'ed after the container goes away.

If we created all the volumes ahead of time we could decouple the task execution from the volume creation and we wouldn't have to figure out what volumes got created after the fact.

@topherbullock

This comment has been minimized.

Copy link
Member Author

topherbullock commented Jan 4, 2019

some more logs for the fire here; there are 2 functions that have high cyclomatic complexity in container_provider.go, according to gocyclo

23 worker (*containerProvider).createGardenContainer ./container_provider.go:314:1
18 worker (*containerProvider).FindOrCreateContainer ./container_provider.go:82:1

And both the test setup and constructor have a lot of dependancies.

Here's some good spots to look at for a refactor at the moment:

  • splitting that method up into smaller parts
  • passing a set of volumes as arguments (or maybe, an image volume and a set of bind mounts?) to FindOrCreateContainer and dealing with volume creation at the call site. This would mitigate the need for a lot of dependancies and complexity, but really hinges on whether we can know all the mounts before hand.
  • reconsider what purpose ContainerProvider serves and what it relies on to fill that purpose; is "provides containers" an unclear purpose (it will likely no longer depend on VolumeClient, or ImageFactory, so is it more of a container client? ) ?

@jama22 jama22 moved this from Backlog to In Flight in Runtime Jan 7, 2019

@ddadlani

This comment has been minimized.

Copy link
Contributor

ddadlani commented Jan 8, 2019

Some more confusing things in build scheduling:
build_factory.go: in constructPlanFromJob, why is there a special case for len(planSequence)==1? It appears to be an optimization for the do step logic which wraps multiple steps, but it seems odd to have a separate case for it.

build_step.go: execBuild.buildDoStep is super confusing. It wraps the last item in an on-success hook and then wraps all that in another on-success and so on, so that when the do step executes, the first thing that runs is the outermost layer of the onion (the first step), and then we go all the way inside the layers of the onion until we hit the identity step. This can probably be written more clearly.

@ddadlani

This comment has been minimized.

Copy link
Contributor

ddadlani commented Jan 14, 2019

Small note: This issue can still be used for discussion. To track progress, the chunks of work that arise from this discussion will be captured in other issues (such as #3043) which will be linked back here.

@marco-m

This comment has been minimized.

Copy link

marco-m commented Jan 15, 2019

@ddadlani thanks for the update, I was exactly wondering about the overall status :-)

@ddadlani

This comment has been minimized.

Copy link
Contributor

ddadlani commented Jan 15, 2019

@marco-m No problem! :)

The next issue we are trying to tackle is the creation of volumes. Currently, volumes are created in containerProvider.createGardenContainer. The reason for this is that you need a container handle to create volumes in the database, which we can only obtain from the creatingContainer.

However, this makes it hard to pull out volume creation to an earlier stage (as per #2926 (comment)). It also means that the containerProvider requires access to a volumeClient, which is odd because the atc/worker also has a volumeClient and containerProvider methods are called from atc/worker methods.

It almost seems like a big reason for this confusion is that worker and containerProvider are sharing responsibilities that are not easy to split. For this reason, we are going to try moving all the container/volume creation logic to the worker and then see what can be split out into the containerProvider, potentially making the containerProvider a collection of helper functions for the worker.

@ddadlani

This comment has been minimized.

Copy link
Contributor

ddadlani commented Jan 15, 2019

Another interesting idea from @topherbullock was that we could maybe have a containerProvider "helper" equivalent for db/worker as well, which handles the database operations required to create containers. Hopefully this would make things easier at a later stage.

@ddadlani

This comment has been minimized.

Copy link
Contributor

ddadlani commented Jan 21, 2019

Update in child issue: #3052 (comment)

@marco-m

This comment has been minimized.

Copy link

marco-m commented Feb 10, 2019

Hello @ddadlani, do you have a high-level view of what will come next ? I would like to understand what is in the way of #2928. Thanks!

@ddadlani

This comment has been minimized.

Copy link
Contributor

ddadlani commented Feb 11, 2019

Hi @marco-m, we hadn't planned to implement #2928 directly after this issue, it hasn't even been prioritized yet. Could you maybe give us more context on #2928 in the other issue? I posted a question there as well.

To answer your question, we are currently refactoring atc/worker/worker.go and container_provider.go to simplify container and volume creation, and are considering ways of simplifying the volume lifecycle as part of #3079.

@marco-m

This comment has been minimized.

Copy link

marco-m commented Feb 12, 2019

hello @ddadlani, thanks for the status update, greatly appreciated :-) I will answer in #2928.

@kcmannem kcmannem added the paused label Feb 12, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment