Skip to content

[pull] master from ray-project:master#327

Merged
pull[bot] merged 19 commits intodwongdev:masterfrom
ray-project:master
May 28, 2025
Merged

[pull] master from ray-project:master#327
pull[bot] merged 19 commits intodwongdev:masterfrom
ray-project:master

Conversation

@pull
Copy link
Copy Markdown

@pull pull bot commented May 28, 2025

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.1)

Can you help keep this open source service alive? 💖 Please sponsor : )

edoakes and others added 19 commits May 28, 2025 08:54
We currently have many flaky tests, especially on Windows. This causes
the builds to be extremely long and sometimes time out.

This change reduces the number of baseline runs per CI job from 1 to 2.
This should ~halve the run time for this build at the cost of it taking
two CI jobs instead of one to promote a test out of flaky status.

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…3353)

## Why are these changes needed?

Fix the path differences in windows test

## Related issue number

Closes https://anyscale1.atlassian.net/browse/SERVE-804

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Gene Su <e870252314@gmail.com>
…lth() (#53358)

Signed-off-by: Linkun Chen <github@lkchen.net>
…53002)

## Problem description

See #53012 for minimal
reproduction of the issue.

Currently the shutdown sequence looks approximately like:

1. Blindly decrement all local reference counts.
2. Drain ongoing _submitted_ tasks.
3. Drain the local task executor. This drains ongoing _executed_ tasks
if the worker is part of a concurrent actor. For single-threaded actors
and task workers, it's basically a no-op.
4. Drain the reference counter.

Step (1) essentially invalidates the state of the reference counter,
meaning that subsequently executing any user code that interacts with
the reference counter will cause undefined behavior and/or crashes. This
is the problem observed in the two linked issues. In both cases, the
worker is executing concurrent actor tasks which are not drained until
step (3).

## Original motivation for `ReleaseAllLocalReferences`

The original motivation for this code is documented in:
#19639. However, I haven't been
able to fully understand the rationale or the fix. I also haven't been
able to reproduce the issue described, even when substituting the driver
for an owner actor to avoid job-level worker cleanup.

What happens in the described setup inside the worker is (confirmed
empirically):

1. Before the worker starts to execute the task, we [increase the local
reference
count](https://github.com/ray-project/ray/blob/871cc89ceda42caaa4819112343ff7b0e005e0e4/src/ray/core_worker/core_worker.cc#L3762)
for the argument as part of
[GetAndPinArgsForExecutor](https://github.com/ray-project/ray/blob/871cc89ceda42caaa4819112343ff7b0e005e0e4/src/ray/core_worker/core_worker.cc#L3739).
2. The worker executes the task, blocks in `time.sleep`, then raises
`SystemExit` when it receives a `SIGTERM`.
3. After `SystemExit` is raised, the rest of `ExecuteTask` is executed,
which [decrements the local reference
count](https://github.com/ray-project/ray/blob/871cc89ceda42caaa4819112343ff7b0e005e0e4/src/ray/core_worker/reference_count.cc#L968)
as part of
[PopAndClearLocalBorrowers](https://github.com/ray-project/ray/blob/871cc89ceda42caaa4819112343ff7b0e005e0e4/src/ray/core_worker/core_worker.cc#L3403).
4. The worker calls
[CoreWorker::Exit()](https://github.com/ray-project/ray/blob/871cc89ceda42caaa4819112343ff7b0e005e0e4/src/ray/core_worker/core_worker.cc#L3455).

During the shutdown sequence in (4), the reference has already been
cleared, so the call to `ReleaseAllLocalReferences` is a no-op.

However, there is another edge case that `ReleaseAllLocalReferences`
does currently handle, which is when a normal task (not an actor task)
stores an `ObjectRef` in a global variable before exiting. We currently
[assume this never
happens](https://github.com/ray-project/ray/blob/871cc89ceda42caaa4819112343ff7b0e005e0e4/src/ray/core_worker/core_worker.cc#L1229).
In this case, the call to `ReleaseAllLocalReferences` will decrement the
globally-held reference(s) and allow the worker to exit.

## Solution and alternatives

Originally I planned to fully delete the call to
`ReleaseAllLocalReferences`, but it would cause a behavior change in the
edge case described above pertaining to `ObjectRef`s stored in global
variables.

Therefore, I'm opting for a safe option for now: move the call to
`ReleaseAllLocalReferences` _after_ we drain the task executor, which
should be a no-op change for single-threaded actors and normal task
workers, but solve the issue of tasks executing after the reference
counter has been invalidated in the case of concurrent actors.

## Follow-up items

- We should solve the global `ObjectRef` issue for tasks separately,
then remove the `ReleaseAllLocalReferences` logic. (TODO: file ticket
before merging -- 1 ref and multiple refs)

Closes #51408
Closes #53012

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
This is a series of PR to migrate metric collection from opencencus to
openlemetry. For context on the existing components, see
#53098.

------------

This PR add the open telemetry grpc service to the DashboardAgent
component. The grpc service is behind a feature flag so this PR by
default is a no-op, but it allows me to make progress in a piece meal
fashion.

Test:
- CI
- e2e testing on top of this PR

---------

Signed-off-by: can <can@anyscale.com>
When looking for a resource deadlock, the first thing done is to check
if there's any workers available. Right now that involves getting a
vector of all the workers and then checking.
Created a new worker pool IsAvailableForScheduling that will just return
out if there's one available.

AnyPendingTasksForResourceAcquisition (also called to check for resource
deadlock) was a kind of messy with a return value and four out
parameters. Simplified down to a return value and two out parameters.

The moves in FillResourceUsage weren't actually moving anything because
it was using the const protobuf accessors. Fixed that.

Signed-off-by: dayshah <dhyey2019@gmail.com>
)

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
or smaller. it does not run with multiple runs.

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
…ionFnV2 (#53152)

<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

Using AggregationFnV2 in RayData is more efficient to compute stats
instead of iter_batches. This PR modifies internal logic in vectorizer,
encoder and imputer preprocessor use AggregationFnV2

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

<!-- Please give a short summary of the change and the problem this
solves. -->

## Related issue number

<!-- For example: "Closes #1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Praveen Gorthy <praveeng@anyscale.com>
Currently failing on Windows:
https://buildkite.com/ray-project/postmerge/builds/10446#01971548-9687-4f68-b7e6-127b45b29d55/10172-10839

AI tells me it's because we can't re-open the file inside the `tempfile`
context manager, so refactored to accommodate this.

I also cleaned up the test to use a fixture.

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
The job supervisor was accessing the deprecated
`ray.worker.global_worker` attribute, causing confusing warning
messages. This PR "fixes" the problem by using the private path instead.

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

<!-- Please give a short summary of the change and the problem this
solves. -->

## Related issue number

<!-- For example: "Closes #1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Test was failing due to permissions error trying to re-open file on
Windows. Refactored fixture to avoid holding the file descriptor open.

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
## Why are these changes needed?

If the named deployment isn't found in a Ray serve instance, it's
difficult to troubleshoot the names of deployments that are available.
This error lists the known ones out in the error message instead.

For example, we now can see:

```
ray.serve.exceptions.RayServeException: Deployment 'capitalizer' for application 'capitalizer' 
does not exist. Available: ['Capitalizer', 'Uncapitalizer', 'CapitalizerEndpoint']
```

vs. previously:

```
(ServeController pid=99265)   File ... error in override_deployment_info
(ServeController pid=99265)     info = deployment_infos[deployment_name]
(ServeController pid=99265)            ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
(ServeController pid=99265) KeyError: 'capitalizer'
```


## Related issue number

<!-- For example: "Closes #1234" -->

(N/A)

## Checks

- [X] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [X] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] ~~I've included any doc changes needed for
https://docs.ray.io/en/master/.~~
- [ ] ~~I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.~~
- [X] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [X] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Brandon Shelley <brandon@pacificaviator.co>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Co-authored-by: Seiji Eicher <seiji@anyscale.com>
…na dashboards (#53236)

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
no lower than 0.16.0

---------

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Co-authored-by: Timothy Seah <tseah@anyscale.com>
this will relax protobuf version and allow using protobuf v4 or above

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
@pull pull bot added the ⤵️ pull label May 28, 2025
@pull pull bot merged commit 5da9757 into dwongdev:master May 28, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.