
Commit 322b029

docx: internals content
Signed-off-by: Kevin Bimonte <kbimonte@gmail.com>
1 parent d558403 commit 322b029

File tree

6 files changed: +258 -4 lines changed


docs/docs/internals/assets/.gitkeep

Whitespace-only changes.
File renamed without changes.

docs/docs/internals/checker.md

Lines changed: 40 additions & 1 deletion

title: Resource Checker
---
[Resources](../resources/index.md) represent external state such as a git repository, files in an S3 bucket, or anything else that changes over time. Modelling these as resources allows you to use this external state as inputs (or triggers) to your workloads.
## When are resources checked?
The component that schedules resource checks is called the **resource checker**. The rate at which these checks happen is called the check interval (configurable via `CONCOURSE_LIDAR_SCANNER_INTERVAL`). There's an obvious tradeoff: the more frequently you poll, the bigger the strain on Concourse (as well as on the external source). However, if you want to pick up new commits as quickly as possible, you need to poll as often as possible.

The resource checker uses the [`resource.check_every`](../resources/index.md#resource-schema) interval to figure out whether a resource needs to be checked. A resource's `check_every` interval dictates how often it should be checked for new versions, with a default of 1 minute. If that seems like a lot of checking, it is, but it's how Concourse keeps everything snappy. You can configure this interval independently for each resource.

If your external service supports it, you can set [`resource.webhook_token`](../resources/index.md#resource-schema) to eliminate the need for frequent periodic checking. With a `webhook_token` configured, the external service can notify Concourse when to check for new versions. Note that configuring a `webhook_token` alone will not stop Concourse from periodically checking your resource; if you wish to rely solely on webhooks for detecting new versions, set `check_every` to `never`.
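
As an illustrative sketch, here is how these options might be combined in a pipeline (the resource names, URIs, and token below are made up):

```yaml
resources:
# Polled resource: checked at most once every 10 minutes.
- name: my-repo
  type: git
  source:
    uri: https://github.com/example/my-repo.git
  check_every: 10m

# Webhook-driven resource: never polled periodically.
- name: webhook-repo
  type: git
  source:
    uri: https://github.com/example/webhook-repo.git
  check_every: never
  webhook_token: some-secret-token
```

With `check_every: never`, a check only happens when the external service hits the webhook endpoint with the configured token (or when one is requested manually, e.g. via `fly check-resource`).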
On every interval tick, the resource checker will see if there are any resources that need to be checked. It does this by first finding resources which are used as inputs to jobs, and then comparing the current time against the last time each resource was checked. If it has been longer than a resource's configured `check_every` interval, a new check will be scheduled. In practice this means that if a resource has a `check_every` of `1m`, it is not guaranteed to be checked precisely every 60 seconds. `check_every` simply sets a lower bound on the time between checks.

When the resource checker finds a resource to check (either because its `check_every` interval elapsed, or because its configured `webhook_token` was triggered), it schedules a new build that invokes the [`check` script](../resource-types/implementing.md#check-check-for-new-versions) of the resource's underlying [resource type](../resource-types/index.md).
## What do resource checks produce?
The whole point of running checks is to produce versions. Concourse's [Build Scheduler](scheduler.md) is centered around the idea of resource versions: they are how Concourse determines that something is new and a new build needs to be triggered.

The versions produced by each resource are unique to the underlying [resource type](../resource-types/index.md). For instance, the `git` resource type uses commit SHAs as versions, while the `registry-image` resource uses the image digest and tag in the version.
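
As a sketch of what these versions look like in practice, here is how they might appear when pinning resources in a pipeline (the URIs, SHA, digest, and tag values are made up):

```yaml
resources:
- name: my-repo
  type: git
  source:
    uri: https://github.com/example/my-repo.git
  # git versions are commit SHAs
  version: { ref: 61cbef9a8e6d7f0f9c1a2b3c4d5e6f7a8b9c0d1e }
- name: my-image
  type: registry-image
  source:
    repository: example/my-image
  # registry-image versions carry the image digest and tag
  version: { digest: "sha256:94af1cf8...", tag: "1.2.0" }
```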

docs/docs/internals/garbage-collector.md

Lines changed: 37 additions & 1 deletion
title: Garbage Collector
---

Concourse runs everything in isolated environments, creating fresh containers and volumes to ensure things can safely run in a repeatable environment, isolated from other workloads running on the same worker.

This introduces a new problem: knowing when Concourse should remove these containers and volumes. Safely identifying things for removal and then getting rid of them, releasing their resources, is the process of _garbage collection_.
## Goals
Let's define our metrics for success:

* **Safe.** There should never be a case where a build is running and a container or volume is removed out from under it, causing the build to fail. Resource checking should also never result in errors from check containers being removed. No one should even know garbage collection is happening.
* **Airtight.** Everything Concourse creates, whether it's a container or volume on a worker or an entry in the database, should never leak. Each object should have a fully defined lifecycle such that there is a clear end to its use. The ATC should be interruptible at any point in time and at the very least be able to remove any state it had created beforehand.
* **Resilient.** Garbage collection should never be outpaced by the workload. A single misbehaving worker should not prevent garbage collection from being performed on other workers. A slow delete of a volume should not prevent garbage collection of other things on the same worker.
## How it Works
The garbage collector is a batch operation that runs on an interval, with a default of 30 seconds. The collector must run frequently enough not to be outpaced by the workload producing things, so the batch operation should be able to complete fairly quickly.
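
The interval is configurable on the web node via `CONCOURSE_GC_INTERVAL`. A minimal sketch, assuming a Docker Compose deployment (the service name and values are illustrative):

```yaml
services:
  web:
    image: concourse/concourse
    command: web
    environment:
      # How often the garbage collection batch operation runs (default 30s).
      CONCOURSE_GC_INTERVAL: 30s
```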
The batch operation first performs garbage collection within the database alone, removing rows that are no longer needed. The removal of rows in one stage will often result in removals in a later stage. There are individual collectors for each object, such as the volume collector or the container collector, and they are all run asynchronously.

After the initial pass of garbage collection in the database, there should now be a set of containers and volumes that meet the criteria for garbage collection. These two are a bit more complicated to garbage-collect; they both require talking to a worker and waiting on a potentially slow delete.

Containers and volumes are the costliest resources consumed by Concourse, and many of them are created over time as builds execute and pipelines perform their resource checking. It is therefore important to parallelize this aspect of garbage collection so that one slow delete or one slow worker does not cause them to pile up.

docs/docs/internals/index.md

Lines changed: 122 additions & 1 deletion
title: Internals
---

This section provides a deeper understanding of some of the concepts surrounding Concourse.

An understanding of the basics of Concourse, such as pipelines and jobs, is recommended, as parts of this section assume that level of knowledge. This section is not necessary for using Concourse; it is aimed at experienced users who want to dig deeper into how Concourse works.
## Basic architecture
Concourse is a fairly simple distributed system built up from the following components. You'll see them referenced here and there throughout the documentation, so you may want to skim this page just to get an idea of what they are.

![](assets/index-01.png)
## ATC: web UI & build scheduler
The ATC is the heart of Concourse. It runs the web UI and API and is responsible for all pipeline scheduling. It connects to PostgreSQL, which it uses to store pipeline data (including build logs).

Multiple ATCs can be running as one cluster; as long as they're all pointing to the same database, they'll synchronize using basic locking mechanisms and roughly spread work across the cluster.

The ATC by default listens on port `8080`, and is usually co-located with the [TSA](#tsa-worker-registration-forwarding), sitting behind a load balancer.

!!! note

    For [`fly intercept`](../builds.md#fly-intercept) to function, make sure your load balancer is configured to do TCP or SSL forwarding, not HTTP or HTTPS.
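
As a rough sketch of how a single web node (ATC plus TSA) might be wired up to PostgreSQL, assuming a Docker Compose deployment (hostnames and credentials are placeholders):

```yaml
services:
  web:
    image: concourse/concourse
    command: web
    ports:
      - "8080:8080"   # ATC web UI & API
      - "2222:2222"   # TSA worker registration
    environment:
      CONCOURSE_EXTERNAL_URL: https://ci.example.com
      CONCOURSE_POSTGRES_HOST: db
      CONCOURSE_POSTGRES_USER: concourse
      CONCOURSE_POSTGRES_PASSWORD: changeme
      CONCOURSE_POSTGRES_DATABASE: atc
```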
There are multiple components within the ATC that each have their own set of responsibilities. The main components consist of the [checker](checker.md), [scheduler](scheduler.md), [build tracker](build-tracker.md), and the [garbage collector](garbage-collector.md).

The [checker](checker.md)'s responsibility is to continuously check for new versions of resources. The [scheduler](scheduler.md) is responsible for scheduling builds for a job, and the [build tracker](build-tracker.md) is responsible for running any scheduled builds. The [garbage collector](garbage-collector.md) is the cleanup mechanism for removing any unused or outdated objects, such as containers and volumes.

As of v5.7.0, all the components in a Concourse deployment can be viewed in the _components_ table in the database. The intervals that the components run at can be adjusted by editing that table, and a component can be paused from running entirely in the same way.
## TSA: worker registration & forwarding
The TSA is a custom-built SSH server that is used solely for securely registering [workers](../install/running-worker.md) with the [ATC](#atc-web-ui-build-scheduler).

The TSA by default listens on port `2222`, and is usually co-located with the [ATC](#atc-web-ui-build-scheduler), sitting behind a load balancer.
The TSA implements a CLI over the SSH connection, supporting the following commands:

* The `forward-worker` command is used to reverse-tunnel a worker's addresses through the TSA and register the forwarded connections with the ATC. This allows workers running in arbitrary networks to register securely, so long as they can reach the TSA. This is much safer than opening the worker up to the outside world.
* The `land-worker` command is sent from the worker when landing, and initiates the state change to `LANDING` through the ATC.
* The `retire-worker` command is sent from the worker when retiring, and initiates the state change to `RETIRING` through the ATC.
* The `delete-worker` command is sent from the worker when draining is interrupted while a worker is retiring. It removes the worker from the ATC.
* The `sweep-containers` command is sent periodically to facilitate garbage collection of containers which can be removed from the worker. It returns a list of handles for containers in the `DESTROYING` state, and it is the worker's job to subsequently destroy them.
* The `report-containers` command is sent along with the list of all container handles on the worker. The ATC uses this to update the database, removing any `DESTROYING` containers which are no longer in the set of handles, and marking any `CREATED` containers that are not present as missing.
* The `sweep-volumes` command is sent periodically to facilitate garbage collection of volumes which can be removed from the worker. It returns a list of handles for volumes in the `DESTROYING` state, and it is the worker's job to subsequently destroy them.
* The `report-volumes` command is sent along with the list of all volume handles on the worker. The ATC uses this to update the database, removing any `DESTROYING` volumes which are no longer in the set of handles, and marking any `CREATED` volumes that are not present as missing.
## Workers Architecture
Workers are machines running [Garden](https://github.com/cloudfoundry-incubator/garden) and [Baggageclaim](https://github.com/concourse/concourse/tree/master/worker/baggageclaim) servers and registering themselves via the [TSA](#tsa-worker-registration-forwarding).

!!! note

    Windows and Darwin workers also run Garden and Baggageclaim servers but do not run containers. They both use [houdini](https://github.com/vito/houdini) to fake making containers. Windows containers are not supported and Darwin does not have native container technology.

Workers have no important state configured on their machines, as everything runs in a container and thus shouldn't care about what packages are installed on the host (well, except for those that allow it to be a worker in the first place). This is very different from workers in other non-containerized CI solutions, where the state of packages on the worker is crucial to whether your pipeline works or not.

Each worker registers itself with the Concourse cluster via the [TSA](#tsa-worker-registration-forwarding).

Workers by default listen on port `7777` for Garden and port `7788` for Baggageclaim. Connections to both servers are forwarded over the SSH connection made to the [TSA](#tsa-worker-registration-forwarding).
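
Continuing the sketch above, a worker pointed at the TSA might be configured like this, assuming a Docker Compose deployment (hostnames and key paths are placeholders):

```yaml
services:
  worker:
    image: concourse/concourse
    command: worker
    privileged: true   # needed so the worker can create containers
    environment:
      CONCOURSE_TSA_HOST: web.example.com:2222
      CONCOURSE_TSA_PUBLIC_KEY: /keys/tsa_host_key.pub
      CONCOURSE_TSA_WORKER_PRIVATE_KEY: /keys/worker_key
```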
### The worker lifecycle
#### **RUNNING**

: A worker in this state is registered with the cluster and ready to start running containers and storing volumes.

#### **STALLED**

: A worker in this state was previously registered with the cluster, but stopped advertising itself for some reason. Usually this is due to network connectivity issues, or the worker stopping unexpectedly.

: If the worker remains in this state and cannot be recovered, it can be removed using the [`fly prune-worker`](../operation/administration.md#fly-prune-worker) command.

#### **LANDING**

: The `concourse land-worker` command will put a worker in the `LANDING` state to safely drain its assignments for temporary downtime.

: The ATC will wait for any uninterruptible builds running on the worker to finish, and then transition the worker into the `LANDED` state.

#### **LANDED**

: A worker in this state has successfully waited for all uninterruptible jobs on it after having `concourse land-worker` called. It will no longer be used to schedule any new containers or create volumes until it registers as `RUNNING` again.

#### **RETIRING**

: The `concourse retire-worker` command will put a worker in the `RETIRING` state to remove it from the cluster permanently.

: The ATC will wait for any uninterruptible builds running on the worker to finish, and then remove the worker.

docs/docs/internals/scheduler.md

Lines changed: 59 additions & 1 deletion
title: Build Scheduler
---
!!! warning

    The scheduler changed significantly in the v6.0.0 release, so this documentation should only be relied upon for Concourse deployments v6.0.0 and above.

Builds represent each execution of a [job](../jobs.md). Figuring out when to schedule a new job build is the responsibility of the **build scheduler**. The scheduling of new job builds can depend on many different factors, such as when a new version of a resource is discovered, when a dependent upstream build finishes, or when a user manually triggers a build.

The build scheduler is a global component that deals with all the jobs within a deployment. It runs on an interval, with a default of 10 seconds. If there are multiple ATCs, only one ATC's scheduler component will run per interval tick, ensuring that no work is duplicated between ATC nodes.

The subcomponent used to figure out whether a build can be scheduled is called the [algorithm](#algorithm).
## Algorithm
The algorithm is a subcomponent of the scheduler which is used to determine the input versions to the next build of a job. Many factors contribute to figuring out the next input versions: anything that affects which resource versions will be used to schedule a build, such as `version` constraints or `passed` constraints in a [`get` step](../steps/get.md), disabling versions through the web UI, and so on. The algorithm can also fail to determine a successful set of input versions, in which case the error will be propagated to the preparation view on the build page.

If the algorithm computes a successful set of input versions, it will figure out whether the versions it computed can be used to produce a new build. This is done by comparing the [trigger-able](../steps/get.md) input versions to the versions used by the previous build; if any of them differ, the scheduler knows to schedule a new build. Conversely, if the input versions produced by the algorithm are the same as the previous build's, the scheduler will not create a new build.
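
For illustration, a hypothetical job whose next build's input versions are subject to both kinds of constraint (the job and resource names are made up):

```yaml
jobs:
- name: deploy
  plan:
  # Only versions that passed the unit-tests job are candidates,
  # and any new candidate triggers a build of this job.
  - get: my-repo
    trigger: true
    passed: [unit-tests]
  # Consume every version in order, not just the latest.
  - get: release-notes
    version: every
```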
## Scheduling behavior
The scheduler will schedule a new build if any of the versions produced by the algorithm for `trigger: true` resources has not been used in any previous build of the job.

This means that when the algorithm runs and computes an input version, the scheduler will create a new build only if that version has not been used for that same input by any previous build. If that version was used by a build even two months ago, the scheduler will **not** schedule a new build, because that version has previously been used in a build of the job.

If any input versions are different from every previous build, a new build will be triggered.
## Scheduling on demand
The scheduler runs on an interval, but rather than scheduling all the jobs within a deployment on every tick, it only schedules the jobs that need to be _scheduled_.

First, the scheduler determines which jobs need to be scheduled. Below are all the reasons why Concourse will think a job needs to be scheduled:

* Detecting new versions of a resource through a check
* Saving a new version through a put
* A build finishing for an upstream job (through `passed` constraints)
* Enabling/disabling a resource version that has not been used in a previous build
* Pinning/unpinning a resource version that has not been used in a previous build
* Setting a pipeline
* Updating a resource's `resource_config`
* Manually triggering a build
* Rerunning a build
* Multiple versions being available for a `version: every` constraint

Each job that is scheduled will use the algorithm to determine what inputs its next build should have. Then the build is scheduled and picked up by the [Build Tracker](build-tracker.md).
