
[pull] master from ray-project:master #826

Merged
pull[bot] merged 7 commits into garymm:master from ray-project:master
Mar 14, 2026

Conversation


@pull pull bot commented Mar 14, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

Sparks0219 and others added 7 commits March 13, 2026 14:14
…61663)

#61210 changed some log lines that were used to detect when the memory
pressure monitor killed a worker, causing the memory pressure test to
consistently fail, since the test conditions waited for those lines to
appear. This updates the memory pressure test to wait on the new log
lines instead.
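
The wait-on-log-lines pattern the test relies on can be sketched as a small polling helper (hypothetical names; not Ray's actual test utilities, and the log message below is illustrative only):

```python
import time


def wait_for_log_line(read_log, substring, timeout_s=30.0, poll_s=0.5):
    """Poll read_log() until substring appears in its output, or time out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if substring in read_log():
            return True
        time.sleep(poll_s)
    raise TimeoutError(f"log line {substring!r} never appeared")


# Example with a fake log source that already contains the expected line.
log = ["worker killed: node running low on memory"]
assert wait_for_log_line(lambda: "\n".join(log), "low on memory", timeout_s=1.0)
```

The point of the fix is that the `substring` argument must track whatever the monitor actually emits; when the log wording changes, a test built this way starts timing out instead of failing loudly.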

Signed-off-by: Joshua Lee <joshlee@anyscale.com>
…ests (#61668)

java_test targets do not produce a _deploy.jar in Bazel 7+. Add a
companion java_binary (all_tests_bin) that produces
all_tests_bin_deploy.jar in both Bazel 6 and Bazel 7, and update all
references accordingly.

The java_binary includes //cpp:counter.so and //cpp:plus.so as resources
so that CrossLanguageInvocationTest.getResourceAsStream("/cpp/counter.so")
finds them in the deploy jar classpath.

Signed-off-by: andrew <andrew@anyscale.com>
## Description

1. Inline `ActorPoolResizingPolicy`
2. Rebase `_ActorPool` to compute utilization based on all actors, not
just running ones
3. Allow the autoscaler to scale up while pending actors are still
starting
4. Update tests
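
A rough sketch of the utilization change in item 2 (hypothetical field names; the real `_ActorPool` differs): counting pending actors in the denominator keeps reported utilization below saturation while actors are starting, which is what lets the autoscaler in item 3 keep scaling up.

```python
from dataclasses import dataclass


@dataclass
class PoolSnapshot:
    running_actors: int
    pending_actors: int           # scheduled but not yet ready
    active_tasks: int
    max_tasks_per_actor: int = 1


def utilization(s: PoolSnapshot) -> float:
    # Old behavior: divide by running actors only.
    # New behavior (sketched): divide by all actors, running + pending.
    total = s.running_actors + s.pending_actors
    if total == 0:
        return 0.0
    return s.active_tasks / (total * s.max_tasks_per_actor)


# 4 busy running actors with 4 more pending: utilization is 0.5, not 1.0,
# so the pool does not look saturated while actors are still starting.
print(utilization(PoolSnapshot(running_actors=4, pending_actors=4, active_tasks=4)))
```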



---------

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
`test_data_parallel_trainer::test_config_accelerator_type` has been
timing out in CI ([Buildkite
#61885](https://buildkite.com/ray-project/premerge/builds/61885#019ce2b6-7daa-4dfb-932c-d687cc33edac)).
This PR deflakes the test by replacing the expensive 6-node
heterogeneous cluster with a single-node `ray.init` cluster and reducing
the parameter space from 6 cases to 2. This cuts runtime significantly
while preserving the core coverage of the `accelerator_type` scheduling
constraint.

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
… store budget with outputs (#61605)" (#61729)

Reverts #61605. That PR, which more strictly caps the object store
memory budget per operator, caused regressions in the batch inference
benchmark `image_classification_fixed_size` and the training ingest
`preserve_order=True` benchmarks.
…#61374)

## Description
This PR reduces the number of calls to `_try_schedule_one` that were
causing the autoscaler to hang. It lowers the time complexity of fitting
resource requests onto in-flight nodes by grouping requests by their
shape. Currently, the v2 scheduler evaluates every individual request
against every node, for a time complexity of approximately O(N^2*M).

By using `SerializeToString(deterministic=True)` to generate a
deterministic hash, we cache infeasible request shapes per node. If a
shape fails to fit on a given node, the scheduler now skips the
expensive `_try_schedule_one` check for all subsequent identical
requests on that node.
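
A simplified sketch of the per-node infeasible-shape cache (hypothetical names; the real scheduler hashes protobufs with `SerializeToString(deterministic=True)`, for which a sorted-tuple key stands in here):

```python
def schedule(requests, nodes, try_schedule_one):
    """Greedily place requests, caching shapes known not to fit per node."""
    infeasible = [set() for _ in nodes]  # per-node cache of failed shape keys
    placements = []
    for req in requests:
        # Stand-in for a deterministic serialized hash of the request shape.
        shape_key = tuple(sorted(req.items()))
        for i, node in enumerate(nodes):
            if shape_key in infeasible[i]:
                continue  # identical shape already failed here: skip the check
            if try_schedule_one(req, node):
                placements.append((req, i))
                break
            infeasible[i].add(shape_key)
    return placements


calls = [0]


def try_schedule_one(req, node):
    calls[0] += 1
    if all(node.get(k, 0) >= v for k, v in req.items()):
        for k, v in req.items():
            node[k] -= v
        return True
    return False


# One node with 2 free CPUs; 100 identical infeasible requests cost a
# single probe instead of 100.
node = {"CPU": 2}
requests = [{"CPU": 1}, {"CPU": 1}] + [{"CPU": 4}] * 100
placed = schedule(requests, [node], try_schedule_one)
print(len(placed), calls[0])  # 2 placements, 3 probes
```

The cache is sound in this loop because node resources only shrink as requests are placed: a shape that failed to fit once can never fit later on the same node.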

This PR includes a unit test in `test_scheduler.py` to verify that the
caching logic correctly short-circuits redundant evaluations; a manual
test is included under Additional information.

## Related issues
[#3794](ray-project/kuberay#3794)

## Additional information
Can verify the optimization by running the below test on a RayCluster
with Autoscaler V2 enabled:
```python
import ray
import time
import logging

logging.getLogger("ray").setLevel(logging.DEBUG)


@ray.remote
def ten_minute_task(task_id):
    # Busy-loop for ~5 minutes (300 s) to keep the cluster saturated.
    start = time.time()
    while time.time() - start < 300:
        _ = sum(i * i for i in range(10000))
        time.sleep(0.1)
    return task_id


def main():
    # 4000 concurrent tasks force the autoscaler to evaluate many
    # identical resource requests against in-flight nodes.
    tasks = [ten_minute_task.remote(i) for i in range(4000)]
    ray.get(tasks)


if __name__ == "__main__":
    main()
```

---------

Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com>
Co-authored-by: Rueian <rueiancsie@gmail.com>
#61731)

When a deployment starts up and replicas are scheduled but not yet
RUNNING (`current_num_replicas=0`), the autoscaling policy runs with
`total_num_requests=0`. The cold start fast path returns `None` (no
traffic), so the core policy returns `target_num_replicas` and it flows
into `_apply_scaling_factors`.

The scaling formula is: `ceil(current + factor * (desired - current))`

When `current=0`, this becomes `ceil(factor * desired)`, which amplifies
the entire target as if it were growth. Combined with the delay bypass
for `current==0`, this compounds every tick:

| Tick | target\_in | formula | target\_out |
|------|-----------|---------|------------|
| 0 | 2 | `ceil(2.0 × 2)` | **4** |
| 1 | 4 | `ceil(2.0 × 4)` | **8** |
| 2 | 8 | `ceil(2.0 × 8) → 16, clamped` | **10 (max)** |

In 3 ticks, with zero traffic, a `min_replicas=2, max_replicas=10,
upscaling_factor=2.0` deployment scales to `max_replicas`.
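
The amplification in the table above can be reproduced in a few lines (illustrative only; not Serve's actual code):

```python
import math


def apply_scaling_factor(current, desired, factor):
    # The formula from above: ceil(current + factor * (desired - current)).
    return math.ceil(current + factor * (desired - current))


targets = []
target, max_replicas, factor = 2, 10, 2.0
for _ in range(3):
    # With current_num_replicas == 0 the formula degenerates to
    # ceil(factor * target), doubling the target every tick.
    target = min(apply_scaling_factor(0, target, factor), max_replicas)
    targets.append(target)
print(targets)  # [4, 8, 10]
```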

This was introduced in #60851 which removed the cold start fallback
(`return ctx.target_num_replicas` when `current==0` and no traffic) so
that custom policies like `AsyncInferenceAutoscalingPolicy` could detect
queue work. That change was correct for custom policies but exposed the
default policy to the amplification loop.

## Fix

Skip scaling factor amplification when `current_num_replicas == 0` in
`_apply_scaling_factors`. Scaling factors control the *rate of change
from a baseline* — when there is no baseline, amplifying the full target
as delta is incorrect. The cold start fast path already handles the
`current==0` with-traffic case separately (applying `upscaling_factor`
once), so this is consistent.
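
The guard can be sketched as (hypothetical standalone function; the real `_apply_scaling_factors` lives inside Serve's autoscaling policy):

```python
import math


def apply_scaling_factors(current, desired, factor):
    # Fix: with no baseline (current == 0) there is no rate of change
    # to amplify, so pass the policy's target through unchanged.
    if current == 0:
        return desired
    return math.ceil(current + factor * (desired - current))


# Cold start with no traffic: the target is not amplified.
assert apply_scaling_factors(0, 2, 2.0) == 2
# With a baseline, the factor scales the delta as before.
assert apply_scaling_factors(2, 4, 2.0) == 6
```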

This preserves the async inference scale-from-zero behavior: custom
policies still run, return their desired value (e.g. `1` for queue work,
`0` for idle), and the delay bypass lets legitimate scale-ups through
immediately.

Signed-off-by: abrar <abrar@anyscale.com>
@pull pull bot locked and limited conversation to collaborators Mar 14, 2026
@pull pull bot added the ⤵️ pull label Mar 14, 2026
@pull pull bot merged commit 495220a into garymm:master Mar 14, 2026