
[pull] master from ray-project:master #958

Merged: 39 commits into ddelange:master from ray-project:master on Jul 11, 2024

Conversation

@pull pull bot commented Jul 10, 2024

See Commits and Changes for more details.


Created by pull[bot]

Can you help keep this open source service alive? 💖 Please sponsor : )

hongchaodeng and others added 22 commits July 9, 2024 19:59
Signed-off-by: hongchaodeng <hongchaodeng1@gmail.com>
since the tests are currently flaky and slow

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
Signed-off-by: liuxsh9 <liuxiaoshuang4@huawei.com>
…s in ray.init(). (#46516)

This helps with debugging GCS connection issues.

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
for standard bazel build file formatting

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
the pycache and tests dirs are not useful, are non-deterministic, and just
make the wheel larger.

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
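A minimal sketch of the pruning idea described above; `prune_wheel_tree` and its path handling are illustrative assumptions, not Ray's actual build tooling.

```
# Hypothetical sketch: drop __pycache__ and tests directories from the
# package tree before the wheel is assembled, since they are not useful,
# are non-deterministic, and only inflate the wheel.
import shutil
from pathlib import Path

def prune_wheel_tree(pkg_root: Path) -> None:
    """Remove wheel-bloating, non-deterministic directories in place."""
    candidates = list(pkg_root.rglob("__pycache__")) + list(pkg_root.rglob("tests"))
    for d in candidates:
        if d.is_dir():  # may already be gone if nested under a removed dir
            shutil.rmtree(d)
```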
to balance the review load

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
to master and release branches only

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
…nosecond (#46518)

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
move it into the ray package and remove the one from the root.

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
remove stuff that no longer works, and fix the Ray image
building parts

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
Did some bisecting and found that this commit was causing Serve's performance test latency to spike. Reverting 05067f4 to go back to the previous state.
- Do not block Windows on release automation runs anymore
- Make the block on Windows + Linux arm64 consistent

Test:
- CI

Signed-off-by: can <can@anyscale.com>
…perly (#46484)

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Split Windows flaky test jobs into core, serve, and serverless. A couple of
reasons:
- the script `ci/build/upload_build_info.sh` fails on Windows when called
repeatedly
https://buildkite.com/ray-project/postmerge/builds/5319#0190996e-9a8d-4385-bb22-0d51ff2cd9cd/7990-7991
- it can take hours and the whole thing keeps retrying

Test:
- CI
- postmerge: https://buildkite.com/ray-project/postmerge/builds/5349

Signed-off-by: can <can@anyscale.com>
…ed memory write operation (#46508)

Signed-off-by: kaihsun <kaihsun@anyscale.com>
Signed-off-by: hongchaodeng <hongchaodeng1@gmail.com>
Forgot to fix the bazel version in the new Windows flaky test jobs

Test:
- CI

Signed-off-by: can <can@anyscale.com>
Signed-off-by: can <can@anyscale.com>
so that the effect of `refreshenv` is preserved

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
@pull pull bot added the ⤵️ pull label Jul 10, 2024
khluu and others added 7 commits July 10, 2024 05:11
```
New release perf metrics missing file scalability/object_store.json
REGRESSION 6.74%: placement_group_create/removal (THROUGHPUT) regresses from 840.8257707443967 to 784.1202913310515 in microbenchmark.json
REGRESSION 4.55%: client__1_1_actor_calls_sync (THROUGHPUT) regresses from 534.2825013844715 to 509.9599816194958 in microbenchmark.json
REGRESSION 4.46%: single_client_get_calls_Plasma_Store (THROUGHPUT) regresses from 10593.772848299006 to 10121.103242219997 in microbenchmark.json
REGRESSION 4.06%: single_client_put_gigabytes (THROUGHPUT) regresses from 20.28764104367834 to 19.46333348333893 in microbenchmark.json
REGRESSION 4.04%: multi_client_put_calls_Plasma_Store (THROUGHPUT) regresses from 13048.216108376133 to 12520.58965968965 in microbenchmark.json
REGRESSION 3.95%: n_n_actor_calls_with_arg_async (THROUGHPUT) regresses from 2713.0325692965866 to 2605.856362562882 in microbenchmark.json
REGRESSION 3.83%: client__tasks_and_put_batch (THROUGHPUT) regresses from 11759.788796582228 to 11309.127935041968 in microbenchmark.json
REGRESSION 3.08%: single_client_get_object_containing_10k_refs (THROUGHPUT) regresses from 13.16167615938565 to 12.756244682120503 in microbenchmark.json
REGRESSION 2.54%: client__put_calls (THROUGHPUT) regresses from 814.6764560093619 to 794.0222907625882 in microbenchmark.json
REGRESSION 1.58%: tasks_per_second (THROUGHPUT) regresses from 588.1590100663536 to 578.8766226882515 in benchmarks/many_tasks.json
REGRESSION 1.58%: single_client_tasks_and_get_batch (THROUGHPUT) regresses from 8.033801054151493 to 7.9070880635954 in microbenchmark.json
REGRESSION 1.54%: n_n_actor_calls_async (THROUGHPUT) regresses from 27657.83033159681 to 27232.414296780542 in microbenchmark.json
REGRESSION 1.41%: single_client_wait_1k_refs (THROUGHPUT) regresses from 5.378868872174563 to 5.302957674144409 in microbenchmark.json
REGRESSION 1.39%: single_client_put_calls_Plasma_Store (THROUGHPUT) regresses from 5300.894918847503 to 5227.298677681264 in microbenchmark.json
REGRESSION 1.22%: pgs_per_second (THROUGHPUT) regresses from 22.96731187832995 to 22.687659485012095 in benchmarks/many_pgs.json
REGRESSION 1.16%: multi_client_tasks_async (THROUGHPUT) regresses from 23557.51911206466 to 23283.706392178385 in microbenchmark.json
REGRESSION 0.76%: tasks_per_second (THROUGHPUT) regresses from 346.9124752975113 to 344.2841239720449 in benchmarks/many_nodes.json
REGRESSION 0.57%: single_client_tasks_sync (THROUGHPUT) regresses from 987.4363632697047 to 981.7983599799647 in microbenchmark.json
REGRESSION 35.04%: dashboard_p95_latency_ms (LATENCY) regresses from 1221.413 to 1649.419 in benchmarks/many_tasks.json
REGRESSION 25.70%: dashboard_p95_latency_ms (LATENCY) regresses from 8.221 to 10.334 in benchmarks/many_pgs.json
REGRESSION 18.31%: dashboard_p99_latency_ms (LATENCY) regresses from 281.247 to 332.736 in benchmarks/many_pgs.json
REGRESSION 9.82%: dashboard_p50_latency_ms (LATENCY) regresses from 128.651 to 141.285 in benchmarks/many_tasks.json
REGRESSION 7.17%: dashboard_p95_latency_ms (LATENCY) regresses from 63.659 to 68.223 in benchmarks/many_nodes.json
REGRESSION 5.59%: dashboard_p99_latency_ms (LATENCY) regresses from 133.071 to 140.505 in benchmarks/many_nodes.json
REGRESSION 5.56%: stage_2_avg_iteration_time (LATENCY) regresses from 62.212187099456784 to 65.67325186729431 in stress_tests/stress_test_many_tasks.json
REGRESSION 4.42%: avg_pg_remove_time_ms (LATENCY) regresses from 0.8868805465475326 to 0.9261005015020346 in stress_tests/stress_test_placement_group.json
REGRESSION 3.93%: dashboard_p99_latency_ms (LATENCY) regresses from 3317.765 to 3448.302 in benchmarks/many_tasks.json
REGRESSION 3.92%: avg_iteration_time (LATENCY) regresses from 1.0120761251449586 to 1.0517002582550048 in stress_tests/stress_test_dead_actors.json
REGRESSION 3.80%: 3000_returns_time (LATENCY) regresses from 5.560233610000012 to 5.771739185999991 in scalability/single_node.json
REGRESSION 3.46%: 10000_get_time (LATENCY) regresses from 22.85316222099999 to 23.645023898000005 in scalability/single_node.json
REGRESSION 3.03%: 1000000_queued_time (LATENCY) regresses from 182.31759296599998 to 187.84350834300002 in scalability/single_node.json
REGRESSION 1.79%: stage_3_time (LATENCY) regresses from 3011.46821808815 to 3065.2378103733063 in stress_tests/stress_test_many_tasks.json
REGRESSION 1.20%: dashboard_p50_latency_ms (LATENCY) regresses from 3.924 to 3.971 in benchmarks/many_nodes.json
REGRESSION 1.05%: 10000_args_time (LATENCY) regresses from 17.234402031000002 to 17.415384294000006 in scalability/single_node.json
REGRESSION 0.38%: dashboard_p50_latency_ms (LATENCY) regresses from 3.377 to 3.39 in benchmarks/many_pgs.json
```

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
Co-authored-by: Lonnie Liu <lonnie@anyscale.com>
…ack and two separate optimizers (w/ different learning rates). (#46540)
linux://python/ray/dashboard:test_serve_dashboard has recently become flaky
and prone to timeouts (#46459);
not sure if it has anything to do with
#45943. I just increase its timeout
in this PR

Test:
- CI
- https://buildkite.com/ray-project/postmerge/builds/5358

Signed-off-by: can <can@anyscale.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
…Ref`s in-memory (#46369)

Currently, the implementation of `Dataset.count()` retrieves the entire
list of `BlockRef`s associated with the Dataset when calculating the
number of rows per block. This PR is a minor performance improvement: it
iterates over the `BlockRef`s instead, so each ref can be dropped as soon
as its block's row count is read, and the entire list of `BlockRef`s never
needs to be held in memory.

Signed-off-by: sjl <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
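A hedged sketch of the pattern; the helper names (`get_num_rows`, `iter_block_refs`) are hypothetical stand-ins, not Ray Data's actual internals.

```
# Hypothetical sketch: sum per-block row counts from an iterator so each
# BlockRef becomes droppable right after its count is read, instead of
# holding the full list of BlockRefs in memory.
from typing import Iterator

def count_rows(per_block_counts: Iterator[int]) -> int:
    total = 0
    for num_rows in per_block_counts:  # earlier refs are now unreferenced
        total += num_rows
    return total

# Usage idea: count_rows(get_num_rows(ref) for ref in iter_block_refs())
```

Passing a generator rather than a materialized list is what lets each `BlockRef` be released as soon as its row count has been consumed.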
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
This dependency is needed to test video-related APIs.

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
can-anyscale and others added 10 commits July 10, 2024 20:53
Currently `api_policy_check` only obtains APIs from the head rst file and
from rsts included directly in the head file. However, we now have rsts
that include other rsts, and so on.

This PR updates the logic to recursively collect all rsts reachable from
the head file.

Test:
- CI

---------

Signed-off-by: can <can@anyscale.com>
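A rough sketch of the recursive collection, assuming plain `.. include::` directives; the real script's parsing rules (e.g. toctree handling) may differ.

```
# Hypothetical sketch: follow rst include directives recursively so every
# rst reachable from the head file is collected, not just direct includes.
import re
from pathlib import Path

INCLUDE_RE = re.compile(r"^\.\.\s+include::\s+(\S+)", re.MULTILINE)

def collect_rsts(head: Path, seen: set | None = None) -> set:
    if seen is None:
        seen = set()
    head = head.resolve()
    if head in seen or not head.is_file():
        return seen
    seen.add(head)
    for ref in INCLUDE_RE.findall(head.read_text()):
        collect_rsts(head.parent / ref, seen)  # recurse into included rsts
    return seen
```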
closes #46350

Signed-off-by: Superskyyy <yihaochen@apache.org>
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
simplify things

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
…ndles` (#46547)

The name is misleading. The value represents bundles, not blocks.

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Fix the api policy check for auto-generated API docs. For the check to work
properly, we first need to compile the Ray docs to generate all API docs.

Test:
- CI

Signed-off-by: can <can@anyscale.com>
closes #46482

Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
…GcsJobManager::HandleGetAllJobInfo (#46335)

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
@pull pull bot merged commit bb1759a into ddelange:master Jul 11, 2024
1 check passed