Upgrade to Ray 2.0 #1635
Conversation
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository
Load test above is expected to fail, as it runs against a 1.x cluster
        return read_tasks

    def create_reader(self, **kwargs: Dict[str, Any]) -> Reader[Any]:
        """Return a Reader for the given read arguments."""
        return _ParquetDatasourceReader(**kwargs)  # type: ignore
Does this reader automatically handle partitioning the way we need it to be handled?
Yep, it adds partitions as columns
I wish it was this simple, but I highly doubt that this class is doing everything the previous implementation was doing in terms of partitioning:
- Our method (_add_table_partitions) not only adds partitions, but also converts them to the categorical type (with .dictionary_encode()), like in the non-distributed version.
- I don't see how this would honour the dataset equals True vs False case: you have removed the path_root argument from the call, which we were using to distinguish between the two cases.
The Ray implementation does not read partitions the way we want it to; you can use this script to see the differences:
import awswrangler as wr
if wr.config.distributed:
import modin.pandas as pd
else:
import pandas as pd
bucket = "my-bucket"
df = pd.DataFrame({"c0": [0, 1, 2], "c1": [3, 4, 5], "c2": [6, 7, 8]})
wr.s3.delete_objects(f"s3://{bucket}/pq2/")
wr.s3.to_parquet(df=df, path=f"s3://{bucket}/pq2/", dataset=True, partition_cols=["c1", "c2"])
print(wr.s3.read_parquet(path=f"s3://{bucket}/pq2/", dataset=True))
print(wr.s3.read_parquet(path=f"s3://{bucket}/pq2/c1=3/", dataset=True))
print(wr.s3.read_parquet(path=f"s3://{bucket}/pq2/"))
print(wr.s3.read_parquet(path=f"s3://{bucket}/pq2/c1=3/"))
If you test the above against your changes vs the current implementation, you will see how they differ in behaviour
Side note, I think this is where not having the parquet tests in the distributed case is causing us issues as the failing tests would have highlighted the above
I have actually run the two.
Current implementation:

   c0  c1  c2
0   0   3   6
1   1   4   7
2   2   5   8

   c0  c1  c2
0   0   3   6

   c0
0   0
1   1
2   2

   c0
0   0

Partition columns are of type categorical
Suggested (Ray 2.0) implementation:

   c0  c1  c2
0   0   3   6
1   1   4   7
2   2   5   8

   c0    c1    c2
0   0  <NA>  <NA>

For #3, the following error is thrown:

  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Invalid column index to add field.

   c0
0   0

Partition columns are of type Int64
Thanks, both good catches.
(1) I'll add the conversion to categorical, which is missing in the 2.0 implementation.
(2) So, as I understand it, the expected behaviour is: if dataset=False, then even if partitions are detected, we should still only load the data under that specific prefix and not load partitions as columns, because the current implementation does greedy partition loading.
Both doable. Yeah, in the absence of tests covering those scenarios, the 2.0 upgrade was looking deceptively easy :)
Yes, when dataset=False, the partition columns should not be added.
But also notice the difference in behaviour for the second case of dataset=True:
print(wr.s3.read_parquet(path=f"s3://{bucket}/pq2/c1=3/", dataset=True))
i.e. reading from one specific partition. We expect c1 and c2 to be added as partition columns, but for some reason the Ray implementation just returns <NA>.
- Update Ray 2.0, Modin 0.14.1
- Update datasources to 2.0 API
- Detect an existing cluster or create local otherwise
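The cluster-detection step in the last bullet can be sketched like so. The function name is made up here, and treating a missing cluster as a ConnectionError is an assumption about Ray's behaviour, not a quote from the PR:

```python
# Hedged sketch: ray.init(address="auto") attaches to an already-running
# Ray cluster; if none is reachable, Ray raises ConnectionError and we
# fall back to starting a local single-node instance instead.
def initialize_ray() -> None:
    import ray  # imported lazily so the sketch has no hard dependency

    if ray.is_initialized():
        return
    try:
        ray.init(address="auto")
    except ConnectionError:
        ray.init()
```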
Left comments on the read-parquet refactoring, which I doubt would work.
Looking good, could you just remove parallelism from the load tests please?
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.