[pull] master from ray-project:master#327
Merged
pull[bot] merged 19 commits intodwongdev:masterfrom May 28, 2025
Merged
Conversation
We currently have many flaky tests, especially on Windows. This causes the builds to be extremely long and sometimes time out. This change reduces the number of baseline runs per CI job from 1 to 2. This should ~halve the run time for this build at the cost of it taking two CI jobs instead of one to promote a test out of flaky status. Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…3353) ## Why are these changes needed? Fix the path differences in windows test ## Related issue number Closes https://anyscale1.atlassian.net/browse/SERVE-804 ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Gene Su <e870252314@gmail.com>
…lth() (#53358) Signed-off-by: Linkun Chen <github@lkchen.net>
…53002) ## Problem description See #53012 for minimal reproduction of the issue. Currently the shutdown sequence looks approximately like: 1. Blindly decrement all local reference counts. 2. Drain ongoing _submitted_ tasks. 3. Drain the local task executor. This drains ongoing _executed_ tasks if the worker is part of a concurrent actor. For single-threaded actors and task workers, it's basically a no-op. 4. Drain the reference counter. Step (1) essentially invalidates the state of the reference counter, meaning that subsequently executing any user code that interacts with the reference counter will cause undefined behavior and/or crashes. This is the problem observed in the two linked issues. In both cases, the worker is executing concurrent actor tasks which are not drained until step (3). ## Original motivation for `ReleaseAllLocalReferences` The original motivation for this code is documented in: #19639. However, I haven't been able to fully understand the rationale or the fix. I also haven't been able to reproduce the issue described, even when substituting the driver for an owner actor to avoid job-level worker cleanup. What happens in the described setup inside the worker is (confirmed empirically): 1. Before the worker starts to execute the task, we [increase the local reference count](https://github.com/ray-project/ray/blob/871cc89ceda42caaa4819112343ff7b0e005e0e4/src/ray/core_worker/core_worker.cc#L3762) for the argument as part of [GetAndPinArgsForExecutor](https://github.com/ray-project/ray/blob/871cc89ceda42caaa4819112343ff7b0e005e0e4/src/ray/core_worker/core_worker.cc#L3739). 2. The worker executes the task, blocks in `time.sleep`, then raises `SystemExit` when it receives a `SIGTERM`. 3. After `SystemExit` is raised, the rest of `ExecuteTask` is executed, which [decrements the local reference count](https://github.com/ray-project/ray/blob/871cc89ceda42caaa4819112343ff7b0e005e0e4/src/ray/core_worker/reference_count.cc#L968) as part of [PopAndClearLocalBorrowers](https://github.com/ray-project/ray/blob/871cc89ceda42caaa4819112343ff7b0e005e0e4/src/ray/core_worker/core_worker.cc#L3403). 4. The worker calls [CoreWorker::Exit()](https://github.com/ray-project/ray/blob/871cc89ceda42caaa4819112343ff7b0e005e0e4/src/ray/core_worker/core_worker.cc#L3455). During the shutdown sequence in (4), the reference has already been cleared, so the call to `ReleaseAllLocalReferences` is a no-op. However, there is another edge case that `ReleaseAllLocalReferences` does currently handle, which is when a normal task (not an actor task) stores an `ObjectRef` in a global variable before exiting. We currently [assume this never happens](https://github.com/ray-project/ray/blob/871cc89ceda42caaa4819112343ff7b0e005e0e4/src/ray/core_worker/core_worker.cc#L1229). In this case, the call to `ReleaseAllLocalReferences` will decrement the globally-held reference(s) and allow the worker to exit. ## Solution and alternatives Originally I planned to fully delete the call to `ReleaseAllLocalReferences`, but it would cause a behavior change in the edge case described above pertaining to `ObjectRef`s stored in global variables. Therefore, I'm opting for a safe option for now: move the call to `ReleaseAllLocalReferences` _after_ we drain the task executor, which should be a no-op change for single-threaded actors and normal task workers, but solve the issue of tasks executing after the reference counter has been invalidated in the case of concurrent actors. ## Follow-up items - We should solve the global `ObjectRef` issue for tasks separately, then remove the `ReleaseAllLocalReferences` logic. (TODO: file ticket before merging -- 1 ref and multiple refs) Closes #51408 Closes #53012 --------- Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
This is a series of PR to migrate metric collection from opencencus to openlemetry. For context on the existing components, see #53098. ------------ This PR add the open telemetry grpc service to the DashboardAgent component. The grpc service is behind a feature flag so this PR by default is a no-op, but it allows me to make progress in a piece meal fashion. Test: - CI - e2e testing on top of this PR --------- Signed-off-by: can <can@anyscale.com>
When looking for a resource deadlock, the first thing done is to check if there's any workers available. Right now that involves getting a vector of all the workers and then checking. Created a new worker pool IsAvailableForScheduling that will just return out if there's one available. AnyPendingTasksForResourceAcquisition (also called to check for resource deadlock) was a kind of messy with a return value and four out parameters. Simplified down to a return value and two out parameters. The moves in FillResourceUsage weren't actually moving anything because it was using the const protobuf accessors. Fixed that. Signed-off-by: dayshah <dhyey2019@gmail.com>
or smaller. it does not run with multiple runs. Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
…ionFnV2 (#53152) <!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> Using AggregationFnV2 in RayData is more efficient to compute stats instead of iter_batches. This PR modifies internal logic in vectorizer, encoder and imputer preprocessor use AggregationFnV2 <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? <!-- Please give a short summary of the change and the problem this solves. --> ## Related issue number <!-- For example: "Closes #1234" --> ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Praveen Gorthy <praveeng@anyscale.com>
Currently failing on Windows: https://buildkite.com/ray-project/postmerge/builds/10446#01971548-9687-4f68-b7e6-127b45b29d55/10172-10839 AI tells me it's because we can't re-open the file inside the `tempfile` context manager, so refactored to accommodate this. I also cleaned up the test to use a fixture. --------- Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
The job supervisor was accessing the deprecated `ray.worker.global_worker` attribute, causing confusing warning messages. This PR "fixes" the problem by using the private path instead. --------- Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
<!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? <!-- Please give a short summary of the change and the problem this solves. --> ## Related issue number <!-- For example: "Closes #1234" --> ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
…3351) Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Test was failing due to permissions error trying to re-open file on Windows. Refactored fixture to avoid holding the file descriptor open. --------- Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
## Why are these changes needed? If the named deployment isn't found in a Ray serve instance, it's difficult to troubleshoot the names of deployments that are available. This error lists the known ones out in the error message instead. For example, we now can see: ``` ray.serve.exceptions.RayServeException: Deployment 'capitalizer' for application 'capitalizer' does not exist. Available: ['Capitalizer', 'Uncapitalizer', 'CapitalizerEndpoint'] ``` vs. previously: ``` (ServeController pid=99265) File ... error in override_deployment_info (ServeController pid=99265) info = deployment_infos[deployment_name] (ServeController pid=99265) ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^ (ServeController pid=99265) KeyError: 'capitalizer' ``` ## Related issue number <!-- For example: "Closes #1234" --> (N/A) ## Checks - [X] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [X] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] ~~I've included any doc changes needed for https://docs.ray.io/en/master/.~~ - [ ] ~~I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.~~ - [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [X] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Brandon Shelley <brandon@pacificaviator.co> Signed-off-by: Seiji Eicher <seiji@anyscale.com> Co-authored-by: Seiji Eicher <seiji@anyscale.com>
…na dashboards (#53236) Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
no lower than 0.16.0 --------- Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com> Co-authored-by: Timothy Seah <tseah@anyscale.com>
this will relax protobuf version and allow using protobuf v4 or above Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
See Commits and Changes for more details.
Created by
pull[bot] (v2.0.0-alpha.1)
Can you help keep this open source service alive? 💖 Please sponsor : )