Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
author Anshuman Mishra <a.mishra@uber.com> 1696277984 -0700 committer Anshuman Mishra <a.mishra@uber.com> 1700707781 -0800 parent 90439ca author Anshuman Mishra <a.mishra@uber.com> 1696277984 -0700 committer Anshuman Mishra <a.mishra@uber.com> 1700707757 -0800 parent 90439ca author Anshuman Mishra <a.mishra@uber.com> 1696277984 -0700 committer Anshuman Mishra <a.mishra@uber.com> 1700707511 -0800 feat: Hot CAS Entries - Implement CAS access metrics recorder Log on write errors Use integer ids for Sqlite bidirectional index The cost in size for a single table bidirectional index is vast compared to the use of 3nf integer keys. Experimental estimates offer a decrease in file size of 90%. Update graceful shutdown functionality to better handle worker terminations (buildfarm#1462) Manipulate worker set directly in RSB Avoid dependency on subscriber to update state changes when removing workers. This prevents an NPE which will occur invariably when workers are allowably configured with subscribeToBackplane: false. Remove publishTtlMetric option The individual metric controls for ttl are not necessary either for performance or feature support. Use Files' attributes acquisition mechanism for modified time. Config-compatible behavior for publishTtlMetric Correct logging advisements for current Java Java logging definitions must now match java.util.logging.config.file, update these specifications in our README.md Rename GracefulShutdownTest Remove WebController Interrupt+Join operationQueuer/dispatchMonitor Use interrupt to halt the OperationQueuer. Join on both operationQueuer and dispatchMonitor before instance stop return. Present operationNames by stage Include Match and Report Result stages in output Record the active operationName occupying slots in each of the stages and present them with WorkerProfile Avoid several unnecessary casts with interfaces for operation slot stages. Remove subscribeToBackplane, adjust failsafe op A shard server is impractical without operation subscription, partition subscription confirmation between servers and workers. The failsafe execution is configuration that is likely not desired on workers. This change removes the failsafe behavior from workers via backplane config, and relegates the setting of failsafe boolean to server config. If the option is restored for workers, it can be added to worker configs so that configs may continue to be shared between workers and servers and retain independent addressability. Removing AWS/GCP Metrics and Admin controls Internally driven metrics and scaling controls have low, if any, usage rates. Prometheus has largely succeeded independent publication of metrics, and externally driven scaling is the norm. These modules have been incomplete between cloud providers, and for the functional side of AWS, bind us to springboot. Removing them for the sake of reduced dependencies and complexity. Remove unused setOnCancelHandler Remove this unused OperationQueue feature which provides no invocations on any use. Update BWoB docs for ensureOutputsPresent Improve unit test Disable Bzlmod explicitly in .bazelrc Log write errors with worker address Revert "Use integer ids for Sqlite bidirectional index" This reverts commit f651cdb. Common String.format for PipelineStage Cleanup matched logic in SWC listener Continue the loop while we have *not* matched successfully and avoid a confusing inversion in getMatched() Refactor SWC matcher and clarify Nullable Distinguish the valid/unique/propagating methods of entry listening. Interrupt matchStage to induce prepare shutdown The only signal to a waiting match that will halt its current listen loop for a valid unique operation is an interrupt. Specify example config with grpc target Distinguish target param with GRPC type storage from FILESYSTEM definition Remove SpringBoot usage Reinstate prior usage of LoggingMain for safe shutdown, with added release mechanism for interrupted processes. All invoked shutdowns are graceful, with vastly improved shutdown speed for empty workers waiting for pipeline stages. Enable graceful shutdown for server (buildfarm#1490) refactor: code cleanup Tiny code cleanup Log paths created on putDirectory Will include operation root and inform directory cache effectiveness. Permit regex realInputDirectories Selecting realInputDirectories by regex permits flexible patterns that can yield drastic improvements in directory reuse for specialized deployments. runfiles in particular are hazardous expansions of nearly-execroot in the case of bazel. Care must be taken to match directories exclusively. The entire input tree is traversed for matches against expanded paths under the root, to allow for nested selection. Each match thus costs the number of input directories. Counterintuitively, OutputFiles are augmented to avoid the recursive check for OutputDirectories which only applies to actual reported results, resulting in a path match when creating the exec root. Regex style is java.util.Pattern, and must match the full input directory. Log execPath rather than the cache dir path This will include the path to the missed directory and the operation which required it. Shore up OutputDirectory for silence on duplicates Prevent adding duplicate realInputDirectories matches Trigger realInputDirectories to have empty files Ensure that the last leg of the execution presents a directory, rather than the parent, per OutputDirectory's stamping. Switch to positive check for linkInputDirectories docs(configuration): document --prometheus_port CLI argument docs(configuration): readability and typos style(configuration.md): table formatting feat: support --redis_uri command line option Support a `--redis_uri` command line option for start-up. docs(configuration): document the --redis_uri command line options also fixed some spelling typos. Example should use `container_image` instead of `java_image` chore: bump rules_jvm_external Bumping 4.2 -> 5.3 chore: bump rules_cc Bump fro 0.0.6 -> 0.0.9 Implement local resources for workers (buildfarm#1282) Suppress unused warning Bump bazel version, otherwise some test fail with System::setSecurityManager Revert bazel upgrade New line at end of file feat: Hot CAS Entries - Update read counts in Redis feat: Hot CAS Entries - Final Integration build: override grpc dependencies with our dependencies Don't get transitive grpc dependencies, use the ones from our `maven_install(...)` chore(deps): bump protobuf runtime to 3.19.1 chore(deps) add transitive dependencies feat: add Proto reflection service to shard worker To aid connection troubleshooting Bug: Fix Blocked thread in WriteStreamObserver Caused by CASFile Write (buildfarm#1486) * Add unit test * Signal Write on complete Pin the Java toolchain to `remotejdk_17` (buildfarm#1509) Closes buildfarm#1508 Cleanups: - remove the unused `ubuntu-bionic` base image - replace `ubuntu-jammy:jammy-java11-gcc` with `ubuntu-mantic:mantic-java17-gcc` - replace `amazoncorretto:19` with `ubuntu-mantic:mantic-java17-gcc` - swap inverted log file names in a file docs: add markdown language specifiers for code blocks Support OutputPaths in OutputDirectory Specifying any number of OutputPaths will ignore OutputFiles (consistent with uploads). Where an OutputPath specifies an output directory, the action must be able to create the directory itself. Permit Absolute Symlink Targets with configuration Partial specification of the absolute symlink response per REAPI. Remaining work will be in output identification. chore: update bazel to 6.4.0 (buildfarm#1513) Trying to get more info on the Lombok stamping issue on Windows CI. See also bazelbuild/bazel#10363 and bazelbuild/bazel#18185 Rename instance types (buildfarm#1514) Create SymlinkNode outputs during upload (buildfarm#1515) Default disabled, available with createSymlinkOutputs option in Worker config. feat: Implement CAS lease extension (buildfarm#1455) Problem Enabling the findMissingBlobsViaBackplane flag in BuildfarmServer eliminates the need for the BuildfarmWorker's fmb API call. This BuildfarmWorker:fmb call was also responsible for tracking CAS entry access. As result, our CAS cache eviction strategy shifted from LRU to FIFO. When the findMissingBlobsViaBackplane flag is enabled, the buildfarm relies on the backplane as the definitive source for CAS availability. Since we don't update CAS expiry on each access, the backplane will independently expire CAS entries based on the specified cas_expire duration, even if they are actively being read. Solution Updated bfServer:fmb call to perform non-blocking fmb calls to workers, allowing these workers to record access for the relevant CAS entries. Extended expiry duration for available CAS entries in the backplane on each fmb call. With these changes, we can utilize Bazel's experimental_remote_cache_lease_extension and experimental_remote_cache_ttl flags for incremental builds. Closes buildfarm#1428 Bump org.json:json from 20230227 to 20231013 in /admin/main (buildfarm#1516) Bumps [org.json:json](https://github.com/douglascrockford/JSON-java) from 20230227 to 20231013. - [Release notes](https://github.com/douglascrockford/JSON-java/releases) - [Changelog](https://github.com/stleary/JSON-java/blob/master/docs/RELEASES.md) - [Commits](https://github.com/douglascrockford/JSON-java/commits) --- updated-dependencies: - dependency-name: org.json:json dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Re-add missing graceful shutdown functionality (buildfarm#1520) Technically correct to unwrap EE on lock failure Bump rules_oss_audit and patch for py3.11 Prevent healthStatusManager NPE on start failure Consistent check for publicName presence Read through external with query THROUGH=true Specifying a correlated invocation id with a uri containing a THROUGH=true query param will cause the CFC to read a blob through an external input stream, populating locally along the way. This permits client-based replication of blobs, and can enable N+1 replication and traffic balancing for reads. Add --port option to worker Option to run the worker with a cmdline specification for its gRPC server port. Restore worker --root cmdline specification Root cmdline specification has been broken since the config change of v2. Make bf-executor small blob names consistent Remove the size identification for small blobs when uploading with bf-executor. feat: Hot CAS Entries - Update read counts in Redis chore(deps): bump protobuf runtime to 3.19.1 chore(deps) add transitive dependencies feat: add Proto reflection service to shard worker To aid connection troubleshooting Bug: Fix Blocked thread in WriteStreamObserver Caused by CASFile Write (buildfarm#1486) * Add unit test * Signal Write on complete Pin the Java toolchain to `remotejdk_17` (buildfarm#1509) Closes buildfarm#1508 Cleanups: - remove the unused `ubuntu-bionic` base image - replace `ubuntu-jammy:jammy-java11-gcc` with `ubuntu-mantic:mantic-java17-gcc` - replace `amazoncorretto:19` with `ubuntu-mantic:mantic-java17-gcc` - swap inverted log file names in a file docs: add markdown language specifiers for code blocks Support OutputPaths in OutputDirectory Specifying any number of OutputPaths will ignore OutputFiles (consistent with uploads). Where an OutputPath specifies an output directory, the action must be able to create the directory itself. chore: update bazel to 6.4.0 (buildfarm#1513) Trying to get more info on the Lombok stamping issue on Windows CI. See also bazelbuild/bazel#10363 and bazelbuild/bazel#18185 Rename instance types (buildfarm#1514) Create SymlinkNode outputs during upload (buildfarm#1515) Default disabled, available with createSymlinkOutputs option in Worker config. feat: Implement CAS lease extension (buildfarm#1455) Problem Enabling the findMissingBlobsViaBackplane flag in BuildfarmServer eliminates the need for the BuildfarmWorker's fmb API call. This BuildfarmWorker:fmb call was also responsible for tracking CAS entry access. As result, our CAS cache eviction strategy shifted from LRU to FIFO. When the findMissingBlobsViaBackplane flag is enabled, the buildfarm relies on the backplane as the definitive source for CAS availability. Since we don't update CAS expiry on each access, the backplane will independently expire CAS entries based on the specified cas_expire duration, even if they are actively being read. Solution Updated bfServer:fmb call to perform non-blocking fmb calls to workers, allowing these workers to record access for the relevant CAS entries. Extended expiry duration for available CAS entries in the backplane on each fmb call. With these changes, we can utilize Bazel's experimental_remote_cache_lease_extension and experimental_remote_cache_ttl flags for incremental builds. Closes buildfarm#1428 Bump org.json:json from 20230227 to 20231013 in /admin/main (buildfarm#1516) Bumps [org.json:json](https://github.com/douglascrockford/JSON-java) from 20230227 to 20231013. - [Release notes](https://github.com/douglascrockford/JSON-java/releases) - [Changelog](https://github.com/stleary/JSON-java/blob/master/docs/RELEASES.md) - [Commits](https://github.com/douglascrockford/JSON-java/commits) --- updated-dependencies: - dependency-name: org.json:json dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Re-add missing graceful shutdown functionality (buildfarm#1520) Technically correct to unwrap EE on lock failure Bump rules_oss_audit and patch for py3.11 Prevent healthStatusManager NPE on start failure Consistent check for publicName presence Read through external with query THROUGH=true Specifying a correlated invocation id with a uri containing a THROUGH=true query param will cause the CFC to read a blob through an external input stream, populating locally along the way. This permits client-based replication of blobs, and can enable N+1 replication and traffic balancing for reads. Add --port option to worker Option to run the worker with a cmdline specification for its gRPC server port. Restore worker --root cmdline specification Root cmdline specification has been broken since the config change of v2. Make bf-executor small blob names consistent Remove the size identification for small blobs when uploading with bf-executor. Configured output size operation failure Permit installations to control the failure process for operations which produce outputs larger than the maxEntrySizeBytes. A default value of false retains the existing behavior which appears transient and blacklists the executed action key. When enabled, the action will fail under an invalid violation that indicates user error. Restore abbrev port as -p Update zstd-jni for latest version There's been a few releases of it by now and this pulls the latest. For buildfarm, notable changes included performance enhancments during decompression. See: https://github.com/facebook/zstd/releases/tag/v1.5.5 Attempt to resolve windows stamping Bug: Fix workerSet update logic for RemoteCasWriter Detail storage requirements Update for further docs related to storage+type functionality Remove outdated Operation Queue worker definitions Fix worker execution env title Add storage example descriptions Check for context cancelled before responding to error (buildfarm#1526) When a write fails because the write was already cancelled before due to something like deadline exceeded, we get an unknown error. The exception comes from here and when it gets to errorResponse(), it only checks if status code is cancelled. In this case the status code is unknown, so we need to check if context is cancelled to prevent responseObserver from being invoked The code change adds checking if context is cancelled and a unit test testing when the exception has context cancelled. chore(deps): bump com.google.errorprone:error-prone Release notes: https://github.com/google/error-prone/releases/tag/v2.22.0 Write logs and cleaup Run formatter Fix main merge remove cleanup Minor updates Worker name execution properties matching updates updates updates updates updates Update ShardWorkerContext.java Update ShardWorkerContext.java Release resources when not keeping an operation (buildfarm#1535) Update queues.md Refer to new camelized DMS fields. Expand predefined dynamic execution property name matches. Implement custom label header support for Grpc metrics interceptor (buildfarm#1530) Add an option to provide a list of custom label headers to add to metrics. Specify direct guava dependency usage (buildfarm#1538) Testing with bazel HEAD using jdk21 compilation has revealed new direct dependencies on guava. Update lombok dependency for jdk21 (buildfarm#1540) Annotations under lombok were fixed for jdk21 in 1.18.28, update to current. Reorganize DequeueMatchEvaluator (buildfarm#1537) Remove acceptEverything DequeueMatchSetting Place worker name in workerProvisions Only enable allowUnmatched effects on key mismatch Only acquire resources after asserting compatibility Update documentation to match changes Upgrade com_google_protobuf for jvm compatibility (buildfarm#1539) Correct deprecated AccessController usage warning Requires a newer bazel than 6.4.0 for macos to choose unix toolchain with C++ std=c++14 specification for protobuf->absl dependency. Create buildfarm-worker-base-build-and-deploy.yml (buildfarm#1534) Create a github workflow to build base buildfarm worker image. Add base image generation scripts (buildfarm#1532) Fix buildfarm-worker-base-build-and-deploy.yml (buildfarm#1541) Add public buildfarm image generation actions (buildfarm#1542) Update base image building action (buildfarm#1544) Add release image generation action (buildfarm#1545) Limit workflow to canonical repository (buildfarm#1547) Check for "cores" exec property as min-cores match (buildfarm#1548) The execution platform property "cores" is detailed in documentation as specifying "min-cores" and "max-cores". Match this definition and prevent "cores" from being evaluated as a strict match with the worker provision properties (with likely rejection). Consider output_* as relative to WD (buildfarm#1550) Per the REAPI spec: `The paths are relative to the working directory of the action execution.` Prefix the WorkingDirectory to paths used as OutputDirectory parameters, and verify that these are present in the layout of the directory for use. Implement Persistent Workers as an execution path (buildfarm#1260) Followup to buildfarm#1195 Add a new execution pathway in worker/Executor.java to use persistent workers via PersistentExecutor, like DockerExecutor. Mostly unchanged from the form we used to experiment back at Twitter, but now with tests. Co-authored-by: Shane Delmore shane@delmore.io Locate Output Paths relative to WorkingDirectory (buildfarm#1553) * Locate Output Paths relative to WorkingDirectory Required as a corollary to OutputDirectory changes to consider outputs as relative to working directory. * Windows builds emit relativize paths with native separators Remove incorrect external resolve of WD on upload (buildfarm#1554) Previous patch included a change in actionRoot parameter, expecting it to prefer the working directory rooted path to discover outputs. Might want to reapply this later, but for now leave the resolution in uploadOutputs. empty
- Loading branch information