Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rename instance types #1514

Merged
merged 2 commits into from
Oct 30, 2023
Merged

Rename instance types #1514

merged 2 commits into from
Oct 30, 2023

Conversation

luxe
Copy link
Contributor

@luxe luxe commented Oct 30, 2023

This is the current hierarchy. Some of the names make it confusing.

  • ShardInstance is only used by Server. It should mention server so we don't confuse it with worker logic.
  • AbstractServerInstance is used by workers - not just servers

Screenshot from 2023-10-29 23-23-35

After renames:
Screenshot from 2023-10-29 23-25-30

@luxe luxe requested a review from werkt as a code owner October 30, 2023 13:34
Copy link
Member

@werkt werkt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a fine change, though the Server in this naming was intended to be for a gRPC Server, not necessarily a pure buildfarm Server role.
Also, this has reminded me that we should have further abstraction of the action validator - it deserves to be its own interface + implementation with configuration and maybe expected connectivity. The behavior is a) useful to test in absentia of a working NodeInstance, and b) will reduce the complexity of this class which should amount to a collection of injected objects to access AC/CAS/Queues/etc

@luxe luxe merged commit bd740c9 into main Oct 30, 2023
6 checks passed
amishra-u pushed a commit to amishra-u/bazel-buildfarm that referenced this pull request Oct 30, 2023
Trying to get more info on the Lombok stamping issue on Windows CI.
See also bazelbuild/bazel#10363 and
bazelbuild/bazel#18185

Rename instance types (buildfarm#1514)

feat: Implement CAS lease extension

Run formatter

Remove ExecutorPool, Instead make fire and forget call to worker:fmb
amishra-u pushed a commit to amishra-u/bazel-buildfarm that referenced this pull request Nov 8, 2023
Don't get transitive grpc dependencies, use the ones from our `maven_install(...)`

chore(deps): bump protobuf runtime to 3.19.1

chore(deps) add transitive dependencies

feat: add Proto reflection service to shard worker

To aid connection troubleshooting

Bug: Fix Blocked thread in WriteStreamObserver Caused by CASFile Write (buildfarm#1486)

* Add unit test
* Signal Write on complete

Pin the Java toolchain to `remotejdk_17` (buildfarm#1509)

Closes buildfarm#1508

Cleanups:
- remove the unused `ubuntu-bionic` base image
- replace `ubuntu-jammy:jammy-java11-gcc` with `ubuntu-mantic:mantic-java17-gcc`
- replace `amazoncorretto:19` with `ubuntu-mantic:mantic-java17-gcc`
- swap inverted log file names in a file

docs: add markdown language specifiers for code blocks

Support OutputPaths in OutputDirectory

Specifying any number of OutputPaths will ignore OutputFiles (consistent
with uploads). Where an OutputPath specifies an output directory, the
action must be able to create the directory itself.

Permit Absolute Symlink Targets with configuration

Partial specification of the absolute symlink response per REAPI.
Remaining work will be in output identification.

chore: update bazel to 6.4.0 (buildfarm#1513)

Trying to get more info on the Lombok stamping issue on Windows CI.
See also bazelbuild/bazel#10363 and
bazelbuild/bazel#18185

Rename instance types (buildfarm#1514)

Create SymlinkNode outputs during upload (buildfarm#1515)

Default disabled, available with createSymlinkOutputs option in Worker
config.

feat: Implement CAS lease extension (buildfarm#1455)

Problem

    Enabling the findMissingBlobsViaBackplane flag in BuildfarmServer eliminates the need for the BuildfarmWorker's fmb API call. This BuildfarmWorker:fmb call was also responsible for tracking CAS entry access. As result, our CAS cache eviction strategy shifted from LRU to FIFO.
    When the findMissingBlobsViaBackplane flag is enabled, the buildfarm relies on the backplane as the definitive source for CAS availability. Since we don't update CAS expiry on each access, the backplane will independently expire CAS entries based on the specified cas_expire duration, even if they are actively being read.

Solution

Updated bfServer:fmb call to perform non-blocking fmb calls to workers, allowing these workers to record access for the relevant CAS entries.

Extended expiry duration for available CAS entries in the backplane on each fmb call.

With these changes, we can utilize Bazel's experimental_remote_cache_lease_extension and experimental_remote_cache_ttl flags for incremental builds.

Closes buildfarm#1428

Bump org.json:json from 20230227 to 20231013 in /admin/main (buildfarm#1516)

Bumps [org.json:json](https://github.com/douglascrockford/JSON-java) from 20230227 to 20231013.
- [Release notes](https://github.com/douglascrockford/JSON-java/releases)
- [Changelog](https://github.com/stleary/JSON-java/blob/master/docs/RELEASES.md)
- [Commits](https://github.com/douglascrockford/JSON-java/commits)

---
updated-dependencies:
- dependency-name: org.json:json
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

Re-add missing graceful shutdown functionality (buildfarm#1520)

Technically correct to unwrap EE on lock failure

Bump rules_oss_audit and patch for py3.11

Prevent healthStatusManager NPE on start failure

Consistent check for publicName presence

Read through external with query THROUGH=true

Specifying a correlated invocation id with a uri containing a
THROUGH=true query param will cause the CFC to read a blob through an
external input stream, populating locally along the way. This permits
client-based replication of blobs, and can enable N+1 replication and
traffic balancing for reads.

Add --port option to worker

Option to run the worker with a cmdline specification for its gRPC
server port.

Restore worker --root cmdline specification

Root cmdline specification has been broken since the config change of
v2.

Make bf-executor small blob names consistent

Remove the size identification for small blobs when uploading with
bf-executor.

feat: Hot CAS Entries - Update read counts in Redis
amishra-u added a commit to amishra-u/bazel-buildfarm that referenced this pull request Nov 23, 2023
author Anshuman Mishra <a.mishra@uber.com> 1696277984 -0700
committer Anshuman Mishra <a.mishra@uber.com> 1700707781 -0800

parent 90439ca
author Anshuman Mishra <a.mishra@uber.com> 1696277984 -0700
committer Anshuman Mishra <a.mishra@uber.com> 1700707757 -0800

parent 90439ca
author Anshuman Mishra <a.mishra@uber.com> 1696277984 -0700
committer Anshuman Mishra <a.mishra@uber.com> 1700707511 -0800

feat: Hot CAS Entries - Implement CAS access metrics recorder

Log on write errors

Use integer ids for Sqlite bidirectional index

The cost in size for a single table bidirectional index is vast compared
to the use of 3nf integer keys. Experimental estimates offer a decrease
in file size of 90%.

Update graceful shutdown functionality to better handle worker terminations (buildfarm#1462)

Manipulate worker set directly in RSB

Avoid dependency on subscriber to update state changes when removing
workers. This prevents an NPE which will occur invariably when workers
are allowably configured with subscribeToBackplane: false.

Remove publishTtlMetric option

The individual metric controls for ttl are not necessary either for
performance or feature support. Use Files' attributes acquisition
mechanism for modified time.

Config-compatible behavior for publishTtlMetric

Correct logging advisements for current Java

Java logging definitions must now match java.util.logging.config.file,
update these specifications in our README.md

Rename GracefulShutdownTest

Remove WebController

Interrupt+Join operationQueuer/dispatchMonitor

Use interrupt to halt the OperationQueuer.
Join on both operationQueuer and dispatchMonitor before instance stop
return.

Present operationNames by stage

Include Match and Report Result stages in output
Record the active operationName occupying slots in each of the stages
and present them with WorkerProfile
Avoid several unnecessary casts with interfaces for operation slot
stages.

Remove subscribeToBackplane, adjust failsafe op

A shard server is impractical without operation subscription, partition
subscription confirmation between servers and workers.
The failsafe execution is configuration that is likely not desired on
workers. This change removes the failsafe behavior from workers via
backplane config, and relegates the setting of failsafe boolean to
server config. If the option is restored for workers, it can be added to
worker configs so that configs may continue to be shared between workers
and servers and retain independent addressability.

Removing AWS/GCP Metrics and Admin controls

Internally driven metrics and scaling controls have low, if any, usage
rates. Prometheus has largely succeeded independent publication of
metrics, and externally driven scaling is the norm. These modules have
been incomplete between cloud providers, and for the functional side of
AWS, bind us to springboot. Removing them for the sake of reduced
dependencies and complexity.

Remove unused setOnCancelHandler

Remove this unused OperationQueue feature which provides no invocations
on any use.

Update BWoB docs for ensureOutputsPresent

Improve unit test

Disable Bzlmod explicitly in .bazelrc

Log write errors with worker address

Revert "Use integer ids for Sqlite bidirectional index"

This reverts commit f651cdb.

Common String.format for PipelineStage

Cleanup matched logic in SWC listener

Continue the loop while we have *not* matched successfully and avoid a
confusing inversion in getMatched()

Refactor SWC matcher and clarify Nullable

Distinguish the valid/unique/propagating methods of entry listening.

Interrupt matchStage to induce prepare shutdown

The only signal to a waiting match that will halt its current listen
loop for a valid unique operation is an interrupt.

Specify example config with grpc target

Distinguish target param with GRPC type storage from FILESYSTEM
definition

Remove SpringBoot usage

Reinstate prior usage of LoggingMain for safe shutdown, with added
release mechanism for interrupted processes. All invoked shutdowns are
graceful, with vastly improved shutdown speed for empty workers waiting
for pipeline stages.

Enable graceful shutdown for server (buildfarm#1490)

refactor: code cleanup

Tiny code cleanup

Log paths created on putDirectory

Will include operation root and inform directory cache effectiveness.

Permit regex realInputDirectories

Selecting realInputDirectories by regex permits flexible patterns that
can yield drastic improvements in directory reuse for specialized
deployments. runfiles in particular are hazardous expansions of
nearly-execroot in the case of bazel.

Care must be taken to match directories exclusively.
The entire input tree is traversed for matches against expanded paths
under the root, to allow for nested selection.
Each match thus costs the number of input directories.
Counterintuitively, OutputFiles are augmented to avoid the recursive
check for OutputDirectories which only applies to actual reported
results, resulting in a path match when creating the exec root.
Regex style is java.util.Pattern, and must match the full input
directory.

Log execPath rather than the cache dir path

This will include the path to the missed directory and the operation
which required it.

Shore up OutputDirectory for silence on duplicates

Prevent adding duplicate realInputDirectories matches

Trigger realInputDirectories to have empty files

Ensure that the last leg of the execution presents a directory, rather
than the parent, per OutputDirectory's stamping.

Switch to positive check for linkInputDirectories

docs(configuration): document --prometheus_port CLI argument

docs(configuration): readability and typos

style(configuration.md): table formatting

feat: support --redis_uri command line option

Support a `--redis_uri` command line option for start-up.

docs(configuration): document the --redis_uri command line options

also fixed some spelling typos.

Example should use `container_image` instead of `java_image`

chore: bump rules_jvm_external

Bumping 4.2 -> 5.3

chore: bump rules_cc

Bump fro 0.0.6 -> 0.0.9

Implement local resources for workers (buildfarm#1282)

Suppress unused warning

Bump bazel version, otherwise some test fail with System::setSecurityManager

Revert bazel upgrade

New line at end of file

feat: Hot CAS Entries - Update read counts in Redis

feat: Hot CAS Entries - Final Integration

build: override grpc dependencies with our dependencies

Don't get transitive grpc dependencies, use the ones from our `maven_install(...)`

chore(deps): bump protobuf runtime to 3.19.1

chore(deps) add transitive dependencies

feat: add Proto reflection service to shard worker

To aid connection troubleshooting

Bug: Fix Blocked thread in WriteStreamObserver Caused by CASFile Write (buildfarm#1486)

* Add unit test
* Signal Write on complete

Pin the Java toolchain to `remotejdk_17` (buildfarm#1509)

Closes buildfarm#1508

Cleanups:
- remove the unused `ubuntu-bionic` base image
- replace `ubuntu-jammy:jammy-java11-gcc` with `ubuntu-mantic:mantic-java17-gcc`
- replace `amazoncorretto:19` with `ubuntu-mantic:mantic-java17-gcc`
- swap inverted log file names in a file

docs: add markdown language specifiers for code blocks

Support OutputPaths in OutputDirectory

Specifying any number of OutputPaths will ignore OutputFiles (consistent
with uploads). Where an OutputPath specifies an output directory, the
action must be able to create the directory itself.

Permit Absolute Symlink Targets with configuration

Partial specification of the absolute symlink response per REAPI.
Remaining work will be in output identification.

chore: update bazel to 6.4.0 (buildfarm#1513)

Trying to get more info on the Lombok stamping issue on Windows CI.
See also bazelbuild/bazel#10363 and
bazelbuild/bazel#18185

Rename instance types (buildfarm#1514)

Create SymlinkNode outputs during upload (buildfarm#1515)

Default disabled, available with createSymlinkOutputs option in Worker
config.

feat: Implement CAS lease extension (buildfarm#1455)

Problem

    Enabling the findMissingBlobsViaBackplane flag in BuildfarmServer eliminates the need for the BuildfarmWorker's fmb API call. This BuildfarmWorker:fmb call was also responsible for tracking CAS entry access. As result, our CAS cache eviction strategy shifted from LRU to FIFO.
    When the findMissingBlobsViaBackplane flag is enabled, the buildfarm relies on the backplane as the definitive source for CAS availability. Since we don't update CAS expiry on each access, the backplane will independently expire CAS entries based on the specified cas_expire duration, even if they are actively being read.

Solution

Updated bfServer:fmb call to perform non-blocking fmb calls to workers, allowing these workers to record access for the relevant CAS entries.

Extended expiry duration for available CAS entries in the backplane on each fmb call.

With these changes, we can utilize Bazel's experimental_remote_cache_lease_extension and experimental_remote_cache_ttl flags for incremental builds.

Closes buildfarm#1428

Bump org.json:json from 20230227 to 20231013 in /admin/main (buildfarm#1516)

Bumps [org.json:json](https://github.com/douglascrockford/JSON-java) from 20230227 to 20231013.
- [Release notes](https://github.com/douglascrockford/JSON-java/releases)
- [Changelog](https://github.com/stleary/JSON-java/blob/master/docs/RELEASES.md)
- [Commits](https://github.com/douglascrockford/JSON-java/commits)

---
updated-dependencies:
- dependency-name: org.json:json
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

Re-add missing graceful shutdown functionality (buildfarm#1520)

Technically correct to unwrap EE on lock failure

Bump rules_oss_audit and patch for py3.11

Prevent healthStatusManager NPE on start failure

Consistent check for publicName presence

Read through external with query THROUGH=true

Specifying a correlated invocation id with a uri containing a
THROUGH=true query param will cause the CFC to read a blob through an
external input stream, populating locally along the way. This permits
client-based replication of blobs, and can enable N+1 replication and
traffic balancing for reads.

Add --port option to worker

Option to run the worker with a cmdline specification for its gRPC
server port.

Restore worker --root cmdline specification

Root cmdline specification has been broken since the config change of
v2.

Make bf-executor small blob names consistent

Remove the size identification for small blobs when uploading with
bf-executor.

feat: Hot CAS Entries - Update read counts in Redis

chore(deps): bump protobuf runtime to 3.19.1

chore(deps) add transitive dependencies

feat: add Proto reflection service to shard worker

To aid connection troubleshooting

Bug: Fix Blocked thread in WriteStreamObserver Caused by CASFile Write (buildfarm#1486)

* Add unit test
* Signal Write on complete

Pin the Java toolchain to `remotejdk_17` (buildfarm#1509)

Closes buildfarm#1508

Cleanups:
- remove the unused `ubuntu-bionic` base image
- replace `ubuntu-jammy:jammy-java11-gcc` with `ubuntu-mantic:mantic-java17-gcc`
- replace `amazoncorretto:19` with `ubuntu-mantic:mantic-java17-gcc`
- swap inverted log file names in a file

docs: add markdown language specifiers for code blocks

Support OutputPaths in OutputDirectory

Specifying any number of OutputPaths will ignore OutputFiles (consistent
with uploads). Where an OutputPath specifies an output directory, the
action must be able to create the directory itself.

chore: update bazel to 6.4.0 (buildfarm#1513)

Trying to get more info on the Lombok stamping issue on Windows CI.
See also bazelbuild/bazel#10363 and
bazelbuild/bazel#18185

Rename instance types (buildfarm#1514)

Create SymlinkNode outputs during upload (buildfarm#1515)

Default disabled, available with createSymlinkOutputs option in Worker
config.

feat: Implement CAS lease extension (buildfarm#1455)

Problem

    Enabling the findMissingBlobsViaBackplane flag in BuildfarmServer eliminates the need for the BuildfarmWorker's fmb API call. This BuildfarmWorker:fmb call was also responsible for tracking CAS entry access. As result, our CAS cache eviction strategy shifted from LRU to FIFO.
    When the findMissingBlobsViaBackplane flag is enabled, the buildfarm relies on the backplane as the definitive source for CAS availability. Since we don't update CAS expiry on each access, the backplane will independently expire CAS entries based on the specified cas_expire duration, even if they are actively being read.

Solution

Updated bfServer:fmb call to perform non-blocking fmb calls to workers, allowing these workers to record access for the relevant CAS entries.

Extended expiry duration for available CAS entries in the backplane on each fmb call.

With these changes, we can utilize Bazel's experimental_remote_cache_lease_extension and experimental_remote_cache_ttl flags for incremental builds.

Closes buildfarm#1428

Bump org.json:json from 20230227 to 20231013 in /admin/main (buildfarm#1516)

Bumps [org.json:json](https://github.com/douglascrockford/JSON-java) from 20230227 to 20231013.
- [Release notes](https://github.com/douglascrockford/JSON-java/releases)
- [Changelog](https://github.com/stleary/JSON-java/blob/master/docs/RELEASES.md)
- [Commits](https://github.com/douglascrockford/JSON-java/commits)

---
updated-dependencies:
- dependency-name: org.json:json
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

Re-add missing graceful shutdown functionality (buildfarm#1520)

Technically correct to unwrap EE on lock failure

Bump rules_oss_audit and patch for py3.11

Prevent healthStatusManager NPE on start failure

Consistent check for publicName presence

Read through external with query THROUGH=true

Specifying a correlated invocation id with a uri containing a
THROUGH=true query param will cause the CFC to read a blob through an
external input stream, populating locally along the way. This permits
client-based replication of blobs, and can enable N+1 replication and
traffic balancing for reads.

Add --port option to worker

Option to run the worker with a cmdline specification for its gRPC
server port.

Restore worker --root cmdline specification

Root cmdline specification has been broken since the config change of
v2.

Make bf-executor small blob names consistent

Remove the size identification for small blobs when uploading with
bf-executor.

Configured output size operation failure

Permit installations to control the failure process for operations which
produce outputs larger than the maxEntrySizeBytes. A default value
of false retains the existing behavior which appears transient and
blacklists the executed action key. When enabled, the action will fail
under an invalid violation that indicates user error.

Restore abbrev port as -p

Update zstd-jni for latest version

There's been a few releases of it by now and this pulls the latest. For
buildfarm, notable changes included performance enhancments during
decompression.

See:
https://github.com/facebook/zstd/releases/tag/v1.5.5

Attempt to resolve windows stamping

Bug: Fix workerSet update logic for RemoteCasWriter

Detail storage requirements

Update for further docs related to storage+type functionality
Remove outdated Operation Queue worker definitions

Fix worker execution env title

Add storage example descriptions

Check for context cancelled before responding to error (buildfarm#1526)

When a write fails because the write was already cancelled before due to something like deadline exceeded, we get an unknown error. The exception comes from here and when it gets to errorResponse(), it only checks if status code is cancelled. In this case the status code is unknown, so we need to check if context is cancelled to prevent responseObserver from being invoked

The code change adds checking if context is cancelled and a unit test testing when the exception has context cancelled.

chore(deps): bump com.google.errorprone:error-prone

Release notes: https://github.com/google/error-prone/releases/tag/v2.22.0

Write logs and cleaup

Run formatter

Fix main merge

remove cleanup

Minor updates

Worker name execution properties matching

 updates

 updates

 updates

 updates

 updates

Update ShardWorkerContext.java

Update ShardWorkerContext.java

Release resources when not keeping an operation (buildfarm#1535)

Update queues.md

Refer to new camelized DMS fields.
Expand predefined dynamic execution property name matches.

Implement custom label header support for Grpc metrics interceptor (buildfarm#1530)

Add an option to provide a list of custom label headers to add to metrics.

Specify direct guava dependency usage (buildfarm#1538)

Testing with bazel HEAD using jdk21 compilation has revealed new
direct dependencies on guava.

Update lombok dependency for jdk21 (buildfarm#1540)

Annotations under lombok were fixed for jdk21 in 1.18.28, update to
current.

Reorganize DequeueMatchEvaluator (buildfarm#1537)

Remove acceptEverything DequeueMatchSetting
Place worker name in workerProvisions
Only enable allowUnmatched effects on key mismatch
Only acquire resources after asserting compatibility
Update documentation to match changes

Upgrade com_google_protobuf for jvm compatibility (buildfarm#1539)

Correct deprecated AccessController usage warning
Requires a newer bazel than 6.4.0 for macos to choose unix toolchain with C++ std=c++14 specification for protobuf->absl dependency.

Create buildfarm-worker-base-build-and-deploy.yml (buildfarm#1534)

Create a github workflow to build base buildfarm worker image.

Add base image generation scripts (buildfarm#1532)

Fix buildfarm-worker-base-build-and-deploy.yml (buildfarm#1541)

Add public buildfarm image generation actions (buildfarm#1542)

Update base image building action (buildfarm#1544)

Add release image generation action (buildfarm#1545)

Limit workflow to canonical repository (buildfarm#1547)

Check for "cores" exec property as min-cores match (buildfarm#1548)

The execution platform property "cores" is detailed in documentation as
specifying "min-cores" and "max-cores". Match this definition and
prevent "cores" from being evaluated as a strict match with the worker
provision properties (with likely rejection).

Consider output_* as relative to WD (buildfarm#1550)

Per the REAPI spec:

`The paths are relative to the working directory of the action
execution.`

Prefix the WorkingDirectory to paths used as OutputDirectory parameters,
and verify that these are present in the layout of the directory for
use.

Implement Persistent Workers as an execution path (buildfarm#1260)

Followup to buildfarm#1195

Add a new execution pathway in worker/Executor.java to use persistent workers via PersistentExecutor, like DockerExecutor.

Mostly unchanged from the form we used to experiment back at Twitter, but now with tests.

Co-authored-by: Shane Delmore shane@delmore.io

Locate Output Paths relative to WorkingDirectory (buildfarm#1553)

* Locate Output Paths relative to WorkingDirectory

Required as a corollary to OutputDirectory changes to consider outputs
as relative to working directory.

* Windows builds emit relativize paths with native separators

Remove incorrect external resolve of WD on upload (buildfarm#1554)

Previous patch included a change in actionRoot parameter, expecting it
to prefer the working directory rooted path to discover outputs. Might
want to reapply this later, but for now leave the resolution in
uploadOutputs.

empty
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants