
Commit 322b029

docx: internals content
Signed-off-by: Kevin Bimonte <kbimonte@gmail.com>
1 parent d558403 commit 322b029

File tree

6 files changed: +258 -4 lines changed


docs/docs/internals/assets/.gitkeep

Whitespace-only changes.
File renamed without changes.

docs/docs/internals/checker.md

Lines changed: 40 additions & 1 deletion

title: Resource Checker
---
[Resources](../resources/index.md) represent external state such as a git repository, files in an S3 bucket, or anything else that changes over time. Modelling these as resources allows you to use this external state as inputs (or triggers) to your workloads.
## When are resources checked?
The component that schedules resource checks is called the **resource checker**. The rate at which these checks happen is called the check interval (configurable via `CONCOURSE_LIDAR_SCANNER_INTERVAL`). There's an obvious tradeoff: the more frequently you poll, the bigger the strain on Concourse (as well as on the external source). However, if you want to pick up new commits as quickly as possible, you need to poll as often as possible.

The resource checker uses the [`resource.check_every`](../resources/index.md#resource-schema) interval to figure out whether a resource needs to be checked. A resource's `check_every` interval dictates how often it should be checked for new versions, with a default of 1 minute. If that seems like a lot of checking, it is, but it's how Concourse keeps everything snappy. You can configure this interval independently for each resource.

If your external service supports it, you can set [`resource.webhook_token`](../resources/index.md#resource-schema) to eliminate the need for frequent periodic checking. With a `webhook_token` configured, the external service can notify Concourse when to check for new versions. Note that configuring a `webhook_token` alone will not stop Concourse from periodically checking your resource; if you wish to rely solely on webhooks for detecting new versions, set `check_every` to `never`.
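
As an illustrative sketch, here is how these options might be combined in a pipeline (the resource names, URIs, and token below are made up):

```yaml
resources:
# Polled resource: checked at most once every 10 minutes.
- name: my-repo
  type: git
  source:
    uri: https://github.com/example/my-repo.git
  check_every: 10m

# Webhook-driven resource: never polled periodically.
- name: webhook-repo
  type: git
  source:
    uri: https://github.com/example/webhook-repo.git
  check_every: never
  webhook_token: some-secret-token
```

With `check_every: never`, a check only happens when the external service hits the webhook endpoint with the configured token (or when one is requested manually, e.g. via `fly check-resource`).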
On every interval tick, the resource checker will see if there are any resources that need to be checked. It does this by first finding resources which are used as inputs to jobs, and then comparing the current time against the last time each resource was checked. If it has been longer than a resource's configured `check_every` interval, a new check will be scheduled. In practice this means that if a resource has a `check_every` of `1m`, it is not guaranteed to be checked precisely every 60 seconds. `check_every` simply sets a lower bound on the time between checks.

When the resource checker finds a resource to check (either because its `check_every` interval elapsed, or because its configured `webhook_token` was triggered), it schedules a new build that invokes the [`check` script](../resource-types/implementing.md#check-check-for-new-versions) of the resource's underlying [resource type](../resource-types/index.md).
## What do resource checks produce?
The whole point of running checks is to produce versions. Concourse's [Build Scheduler](scheduler.md) is centered around the idea of resource versions: they are how Concourse determines that something is new and a new build needs to be triggered.

The versions produced by each resource are unique to the underlying [resource type](../resource-types/index.md). For instance, the `git` resource type uses commit SHAs as versions, while the `registry-image` resource uses the image digest and tag in the version.
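
As a sketch of what these versions look like in practice, here is how they might appear when pinning resources in a pipeline (the URIs, SHA, digest, and tag values are made up):

```yaml
resources:
- name: my-repo
  type: git
  source:
    uri: https://github.com/example/my-repo.git
  # git versions are commit SHAs
  version: { ref: 61cbef9a8e6d7f0f9c1a2b3c4d5e6f7a8b9c0d1e }
- name: my-image
  type: registry-image
  source:
    repository: example/my-image
  # registry-image versions carry the image digest and tag
  version: { digest: "sha256:94af1cf8...", tag: "1.2.0" }
```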

docs/docs/internals/garbage-collector.md

Lines changed: 37 additions & 1 deletion
title: Garbage Collector
---

Concourse runs everything in isolated environments, creating fresh containers and volumes to ensure things can safely run in a repeatable environment, isolated from other workloads running on the same worker.

This introduces a new problem: knowing when Concourse should remove these containers and volumes. Safely identifying things for removal and then getting rid of them, releasing their resources, is the process of _garbage collection_.
## Goals
Let's define our metrics for success:

* **Safe.** There should never be a case where a build is running and a container or volume is removed out from under it, causing the build to fail. Resource checking should also never result in errors from check containers being removed. No one should even know garbage collection is happening.
* **Airtight.** Everything Concourse creates, whether it's a container or volume on a worker or an entry in the database, should never leak. Each object should have a fully defined lifecycle such that there is a clear end to its use. The ATC should be interruptible at any point in time and at the very least be able to remove any state it had created beforehand.
* **Resilient.** Garbage collection should never be outpaced by the workload. A single misbehaving worker should not prevent garbage collection from being performed on other workers. A slow delete of a volume should not prevent garbage collection of other things on the same worker.
## How it Works
The garbage collector is a batch operation that runs on an interval, with a default of 30 seconds. The collector must run frequently enough not to be outpaced by the workload producing things, so the batch operation should be able to complete fairly quickly.
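
The interval is configurable on the web node via `CONCOURSE_GC_INTERVAL`. A minimal sketch, assuming a Docker Compose deployment (the service name and values are illustrative):

```yaml
services:
  web:
    image: concourse/concourse
    command: web
    environment:
      # How often the garbage collection batch operation runs (default 30s).
      CONCOURSE_GC_INTERVAL: 30s
```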
The batch operation first performs garbage collection within the database alone, removing rows that are no longer needed. The removal of rows in one stage will often result in removals in a later stage. There are individual collectors for each object, such as the volume collector or the container collector, and they are all run asynchronously.

After the initial pass of garbage collection in the database, there should now be a set of containers and volumes that meet the criteria for garbage collection. These two are a bit more complicated to garbage-collect; they both require talking to a worker and waiting on a potentially slow delete.

Containers and volumes are the costliest resources consumed by Concourse, and many of them are created over time as builds execute and pipelines perform their resource checking. It is therefore important to parallelize this aspect of garbage collection so that one slow delete or one slow worker does not cause them to pile up.

docs/docs/internals/index.md

Lines changed: 122 additions & 1 deletion
title: Internals
---

This section provides a deeper understanding of some of the concepts surrounding Concourse.

An understanding of the basics of Concourse, such as pipelines and jobs, is recommended, as parts of this section assume that level of knowledge. This section is not necessary for using Concourse; it is aimed at experienced users who want to dig deeper into how Concourse works.
## Basic architecture
Concourse is a fairly simple distributed system built up from the following components. You'll see them referenced here and there throughout the documentation, so you may want to skim this page just to get an idea of what they are.

![](assets/index-01.png)
## ATC: web UI & build scheduler
The ATC is the heart of Concourse. It runs the web UI and API and is responsible for all pipeline scheduling. It connects to PostgreSQL, which it uses to store pipeline data (including build logs).

Multiple ATCs can be running as one cluster; as long as they're all pointing to the same database, they'll synchronize using basic locking mechanisms and roughly spread work across the cluster.

The ATC by default listens on port `8080`, and is usually co-located with the [TSA](#tsa-worker-registration-forwarding), sitting behind a load balancer.

!!! note

    For [`fly intercept`](../builds.md#fly-intercept) to function, make sure your load balancer is configured to do TCP or SSL forwarding, not HTTP or HTTPS.
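
As a rough sketch of how a single web node (ATC plus TSA) might be wired up to PostgreSQL, assuming a Docker Compose deployment (hostnames and credentials are placeholders):

```yaml
services:
  web:
    image: concourse/concourse
    command: web
    ports:
      - "8080:8080"   # ATC web UI & API
      - "2222:2222"   # TSA worker registration
    environment:
      CONCOURSE_EXTERNAL_URL: https://ci.example.com
      CONCOURSE_POSTGRES_HOST: db
      CONCOURSE_POSTGRES_USER: concourse
      CONCOURSE_POSTGRES_PASSWORD: changeme
      CONCOURSE_POSTGRES_DATABASE: atc
```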
There are multiple components within the ATC that each have their own set of responsibilities. The main components consist of the [checker](checker.md), [scheduler](scheduler.md), [build tracker](build-tracker.md), and the [garbage collector](garbage-collector.md).

The [checker](checker.md)'s responsibility is to continuously check for new versions of resources. The [scheduler](scheduler.md) is responsible for scheduling builds for a job, and the [build tracker](build-tracker.md) is responsible for running any scheduled builds. The [garbage collector](garbage-collector.md) is the cleanup mechanism for removing any unused or outdated objects, such as containers and volumes.

As of v5.7.0, all the components in a Concourse deployment can be viewed in the _components_ table in the database. The intervals that the components run at can be adjusted by editing that table, and a component can be paused from running entirely in the same way.
## TSA: worker registration & forwarding
The TSA is a custom-built SSH server that is used solely for securely registering [workers](../install/running-worker.md) with the [ATC](#atc-web-ui-build-scheduler).

The TSA by default listens on port `2222`, and is usually co-located with the [ATC](#atc-web-ui-build-scheduler), sitting behind a load balancer.
The TSA implements a CLI over the SSH connection, supporting the following commands:

* The `forward-worker` command is used to reverse-tunnel a worker's addresses through the TSA and register the forwarded connections with the ATC. This allows workers running in arbitrary networks to register securely, so long as they can reach the TSA. This is much safer than opening the worker up to the outside world.
* The `land-worker` command is sent from the worker when landing, and initiates the state change to `LANDING` through the ATC.
* The `retire-worker` command is sent from the worker when retiring, and initiates the state change to `RETIRING` through the ATC.
* The `delete-worker` command is sent from the worker when draining is interrupted while a worker is retiring. It removes the worker from the ATC.
* The `sweep-containers` command is sent periodically to facilitate garbage collection of containers which can be removed from the worker. It returns a list of handles for containers in the `DESTROYING` state, and it is the worker's job to subsequently destroy them.
* The `report-containers` command is sent along with the list of all container handles on the worker. The ATC uses this to update the database, removing any `DESTROYING` containers which are no longer in the set of handles, and marking any `CREATED` containers that are not present as missing.
* The `sweep-volumes` command is sent periodically to facilitate garbage collection of volumes which can be removed from the worker. It returns a list of handles for volumes in the `DESTROYING` state, and it is the worker's job to subsequently destroy them.
* The `report-volumes` command is sent along with the list of all volume handles on the worker. The ATC uses this to update the database, removing any `DESTROYING` volumes which are no longer in the set of handles, and marking any `CREATED` volumes that are not present as missing.
## Workers Architecture
Workers are machines running [Garden](https://github.com/cloudfoundry-incubator/garden) and [Baggageclaim](https://github.com/concourse/concourse/tree/master/worker/baggageclaim) servers and registering themselves via the [TSA](#tsa-worker-registration-forwarding).

!!! note

    Windows and Darwin workers also run Garden and Baggageclaim servers but do not run containers. They both use [houdini](https://github.com/vito/houdini) to fake making containers. Windows containers are not supported and Darwin does not have native container technology.

Workers have no important state configured on their machines, as everything runs in a container and thus shouldn't care about what packages are installed on the host (well, except for those that allow it to be a worker in the first place). This is very different from workers in other non-containerized CI solutions, where the state of packages on the worker is crucial to whether your pipeline works or not.

Each worker registers itself with the Concourse cluster via the [TSA](#tsa-worker-registration-forwarding).

Workers by default listen on port `7777` for Garden and port `7788` for Baggageclaim. Connections to both servers are forwarded over the SSH connection made to the [TSA](#tsa-worker-registration-forwarding).
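
Continuing the sketch above, a worker pointed at the TSA might be configured like this, assuming a Docker Compose deployment (hostnames and key paths are placeholders):

```yaml
services:
  worker:
    image: concourse/concourse
    command: worker
    privileged: true   # needed so the worker can create containers
    environment:
      CONCOURSE_TSA_HOST: web.example.com:2222
      CONCOURSE_TSA_PUBLIC_KEY: /keys/tsa_host_key.pub
      CONCOURSE_TSA_WORKER_PRIVATE_KEY: /keys/worker_key
```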
### The worker lifecycle
#### **RUNNING**

: A worker in this state is registered with the cluster and ready to start running containers and storing volumes.

#### **STALLED**

: A worker in this state was previously registered with the cluster, but stopped advertising itself for some reason. Usually this is due to network connectivity issues, or the worker stopping unexpectedly.

: If the worker remains in this state and cannot be recovered, it can be removed using the [`fly prune-worker`](../operation/administration.md#fly-prune-worker) command.

#### **LANDING**

: The `concourse land-worker` command will put a worker in the `LANDING` state to safely drain its assignments for temporary downtime.

: The ATC will wait for any uninterruptible builds running on the worker to finish, and then transition the worker into the `LANDED` state.

#### **LANDED**

: A worker in this state has successfully waited for all uninterruptible jobs on it after having `concourse land-worker` called. It will no longer be used to schedule any new containers or create volumes until it registers as `RUNNING` again.

#### **RETIRING**

: The `concourse retire-worker` command will put a worker in the `RETIRING` state to remove it from the cluster permanently.

: The ATC will wait for any uninterruptible builds running on the worker to finish, and then remove the worker.

docs/docs/internals/scheduler.md

Lines changed: 59 additions & 1 deletion
title: Build Scheduler
---
!!! warning

    The scheduler changed significantly in the v6.0.0 release, so this documentation should only be relied upon for Concourse deployments v6.0.0 and above.

Builds represent each execution of a [job](../jobs.md). Figuring out when to schedule a new job build is the responsibility of the **build scheduler**. The scheduling of new job builds can depend on many different factors, such as when a new version of a resource is discovered, when a dependent upstream build finishes, or when a user manually triggers a build.

The build scheduler is a global component that deals with all the jobs within a deployment. It runs on an interval, with a default of 10 seconds. If there are multiple ATCs, only one ATC's scheduler component will run per interval tick, ensuring that no work is duplicated between ATC nodes.

The subcomponent used to figure out whether a build can be scheduled is called the [algorithm](#algorithm).
## Algorithm
The algorithm is a subcomponent of the scheduler which is used to determine the input versions to the next build of a job. Many factors contribute to figuring out the next input versions: anything that affects which resource versions will be used to schedule a build, such as `version` constraints or `passed` constraints in a [`get` step](../steps/get.md), disabling versions through the web UI, and so on. The algorithm can also fail to determine a successful set of input versions, in which case the error will be propagated to the preparation view on the build page.

If the algorithm computes a successful set of input versions, it will figure out whether the versions it computed can be used to produce a new build. This is done by comparing the [trigger-able](../steps/get.md) input versions to the versions used by the previous build; if any of them differ, the scheduler knows to schedule a new build. Conversely, if the input versions produced by the algorithm are the same as the previous build's, the scheduler will not create a new build.
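
For illustration, a hypothetical job whose next build's input versions are subject to both kinds of constraint (the job and resource names are made up):

```yaml
jobs:
- name: deploy
  plan:
  # Only versions that passed the unit-tests job are candidates,
  # and any new candidate triggers a build of this job.
  - get: my-repo
    trigger: true
    passed: [unit-tests]
  # Consume every version in order, not just the latest.
  - get: release-notes
    version: every
```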
## Scheduling behavior
The scheduler will schedule a new build if any of the versions produced by the algorithm for `trigger: true` resources has not been used in any previous build of the job.

This means that when the algorithm runs and computes an input version, the scheduler will create a new build only if that version has not been used for that same input by any previous build. If that version was used by a build even two months ago, the scheduler will **not** schedule a new build, because that version has previously been used in a build of the job.

If any input versions are different from every previous build, a new build will be triggered.
## Scheduling on demand
The scheduler runs on an interval, but rather than scheduling all the jobs within a deployment on every tick, it only schedules the jobs that need to be _scheduled_.

First, the scheduler determines which jobs need to be scheduled. Below are all the reasons why Concourse will think a job needs to be scheduled:

* Detecting new versions of a resource through a check
* Saving a new version through a put
* A build finishing for an upstream job (through `passed` constraints)
* Enabling/disabling a resource version that has not been used in a previous build
* Pinning/unpinning a resource version that has not been used in a previous build
* Setting a pipeline
* Updating a resource's `resource_config`
* Manually triggering a build
* Rerunning a build
* Multiple versions being available for a `version: every` constraint

Each job that is scheduled will use the algorithm to determine what inputs its next build should have. Then the build is scheduled and picked up by the [Build Tracker](build-tracker.md).
