(feat): Refactor to distribute s3.read_parquet #1513

jaidisido · 2022-08-10T15:07:18Z

Feature or Bugfix

Feature
Refactoring

Detail

Refactor wr.s3.read_parquet and the other methods in the _read_parquet S3 module to reduce technical debt:

Leverage thread pool executor when possible
Simplify chunk generation logic
Reduce number of conditionals by generalising edge cases
Improve documentation

Distribute both read_file_metadata and read_parquet calls

read_file_metadata is distributed as a @ray_remote method via the executor
read_parquet is distributed using a custom datasource and the read_datasource Ray public API
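
As an illustration of the executor idea above, here is a minimal sketch that fans a metadata-reading function out over a thread pool. `ThreadPoolMapper` and the body of `read_file_metadata` are hypothetical stand-ins, not the library's actual code. It covers only the thread-pool path; the Ray path described above would dispatch the same function as a remote task instead.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List


class ThreadPoolMapper:
    """Hypothetical executor wrapper: maps a function over inputs in threads."""

    def __init__(self, max_workers: int) -> None:
        self._max_workers = max_workers

    def map(self, func: Callable[..., Any], *iterables: Any) -> List[Any]:
        # Executor.map preserves input order, so results line up with the paths.
        with ThreadPoolExecutor(max_workers=self._max_workers) as pool:
            return list(pool.map(func, *iterables))


def read_file_metadata(path: str) -> Dict[str, Any]:
    # Placeholder body: a real implementation would fetch the Parquet footer
    # from S3 and return its schema.
    return {"path": path, "is_parquet": path.endswith(".parquet")}


paths = ["s3://bucket/a.parquet", "s3://bucket/dir/b.parquet"]
metadata = ThreadPoolMapper(max_workers=4).map(read_file_metadata, paths)
```

Keeping the call site limited to `executor.map(func, items)` is what would let a thread-based and a Ray-based executor be swapped without touching the reading logic.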

Testing

Standard tests are passing with minimal changes to the tests
Two tests are added to the load_test (simple and partitioned cases)

Related Issue

Enable Ray distribution on read_parquet method #1490

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

malachi-constant · 2022-08-10T15:17:40Z

AWS CodeBuild CI Report

CodeBuild project: GitHubLoadTests5656BB24-s6u9F3qN9oFy
Commit ID: 41c79fc
Result: FAILED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

malachi-constant · 2022-08-10T15:34:45Z

AWS CodeBuild CI Report

CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
Commit ID: 41c79fc
Result: SUCCEEDED
Build Logs (available for 30 days)

malachi-constant · 2022-08-10T21:59:44Z

AWS CodeBuild CI Report

CodeBuild project: GitHubLoadTests5656BB24-s6u9F3qN9oFy
Commit ID: dd1a0dc
Result: SUCCEEDED
Build Logs (available for 30 days)

malachi-constant · 2022-08-10T22:17:02Z

AWS CodeBuild CI Report

CodeBuild project: GitHubLoadTests5656BB24-s6u9F3qN9oFy
Commit ID: aa6d689
Result: SUCCEEDED
Build Logs (available for 30 days)

malachi-constant · 2022-08-10T22:44:06Z

AWS CodeBuild CI Report

CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
Commit ID: 0157c10
Result: FAILED
Build Logs (available for 30 days)

malachi-constant · 2022-08-11T00:01:19Z

AWS CodeBuild CI Report

CodeBuild project: GitHubLoadTests5656BB24-s6u9F3qN9oFy
Commit ID: b80bfcd
Result: SUCCEEDED
Build Logs (available for 30 days)

malachi-constant · 2022-08-11T00:11:41Z

AWS CodeBuild CI Report

CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
Commit ID: b80bfcd
Result: FAILED
Build Logs (available for 30 days)

malachi-constant · 2022-08-11T11:01:52Z

AWS CodeBuild CI Report

CodeBuild project: GitHubLoadTests5656BB24-s6u9F3qN9oFy
Commit ID: a9c6f07
Result: SUCCEEDED
Build Logs (available for 30 days)

malachi-constant · 2022-08-11T11:12:57Z

AWS CodeBuild CI Report

CodeBuild project: GitHubLoadTests5656BB24-s6u9F3qN9oFy
Commit ID: 6864cd1
Result: FAILED
Build Logs (available for 30 days)

malachi-constant · 2022-08-11T13:48:18Z

AWS CodeBuild CI Report

CodeBuild project: GitHubLoadTests5656BB24-s6u9F3qN9oFy
Commit ID: 17a8e79
Result: FAILED
Build Logs (available for 30 days)

malachi-constant · 2022-08-11T15:35:54Z

AWS CodeBuild CI Report

CodeBuild project: GitHubLoadTests5656BB24-s6u9F3qN9oFy
Commit ID: 3a6d75c
Result: FAILED
Build Logs (available for 30 days)

malachi-constant · 2022-08-11T15:46:55Z

AWS CodeBuild CI Report

CodeBuild project: GitHubLoadTests5656BB24-s6u9F3qN9oFy
Commit ID: 5369e6f
Result: FAILED
Build Logs (available for 30 days)

malachi-constant · 2022-08-11T15:52:44Z

AWS CodeBuild CI Report

CodeBuild project: GitHubLoadTests5656BB24-s6u9F3qN9oFy
Commit ID: 3a6d75c
Result: FAILED
Build Logs (available for 30 days)

malachi-constant · 2022-08-15T18:25:17Z

AWS CodeBuild CI Report

CodeBuild project: GitHubLoadTests5656BB24-s6u9F3qN9oFy
Commit ID: 092f14a
Result: FAILED
Build Logs (available for 30 days)

malachi-constant · 2022-08-15T21:30:28Z

AWS CodeBuild CI Report

CodeBuild project: GitHubLoadTests5656BB24-s6u9F3qN9oFy
Commit ID: 8413d4e
Result: FAILED
Build Logs (available for 30 days)

malachi-constant · 2022-08-15T21:44:48Z

AWS CodeBuild CI Report

CodeBuild project: GitHubLoadTests5656BB24-s6u9F3qN9oFy
Commit ID: 8413d4e
Result: FAILED
Build Logs (available for 30 days)

malachi-constant · 2022-08-15T21:45:23Z

AWS CodeBuild CI Report

CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
Commit ID: 8413d4e
Result: SUCCEEDED
Build Logs (available for 30 days)

malachi-constant · 2022-08-15T22:01:58Z

AWS CodeBuild CI Report

CodeBuild project: GitHubLoadTests5656BB24-s6u9F3qN9oFy
Commit ID: 1d3fdad
Result: SUCCEEDED
Build Logs (available for 30 days)

malachi-constant

Great work Abdel, this is impressive.

kukushking · 2022-08-16T09:18:45Z

awswrangler/distributed/_utils.py

 ) -> pa.Table:
    block = ArrowBlockAccessor.for_block(block)
-    df = block._table.to_pandas(**kwargs)  # pylint: disable=protected-access
-    return df.astype(dtype=dtype) if dtype else df


I think this was added to feature-match with non-distributed version. Do you propose to handle this differently or just remove for now?

I thought so but then I could not find any other reference in the library. The only one I found was here. And even if there was one, I would move it inside this new _table_to_df method in order to standardise it

Yeah, I think this type conversion was done in a different way (probably using map_types when going from a pyarrow table to a dataframe), but that wasn't available in the distributed case, so this was the only crude way to do it

Right ok, but do you agree that it's now solved since we are using the same _table_to_df method for both the distributed and standard implementations?

kukushking · 2022-08-16T09:42:53Z

awswrangler/s3/_read_parquet.py

-    categories: Optional[List[str]],
-    safe: bool,
-    map_types: bool,
+def _read_parquet_chunked(


Is my understanding correct here that the refactoring in this method is:

Schema validation was removed

Chunking case for pyarrow < 3 was removed as we're supporting from version 6 now

Has the validation been moved somewhere else?

Ok found the validation

Yeah, validation is now centralised and done at the very beginning. Before it was done in multiple places and done much later

And yes, I think it's time to drop pyarrow < 3, so I suggest we also drop any logic that handled older versions

Do all our major runtimes support pyarrow above 3 now? Like, Glue, Lambda?

Lambda and Python Shell for sure, not sure about the rest. I created an issue so one of us can check. To be clear though, ray and modin only support pyarrow 6+, so we would need to do some gymnastics in the pyproject.toml to support older versions...
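
To make the "validate once at the very beginning, then chunk" flow from this thread concrete, here is a minimal stdlib-only sketch. The function names, the column-name comparison, and the in-memory row lists are all assumptions for illustration; the actual module works on Parquet files and pyarrow tables rather than lists of dicts.

```python
from typing import Dict, Iterator, List

Row = Dict[str, int]


def validate_schemas(schemas: List[List[str]]) -> List[str]:
    """Fail fast if the files disagree on column names (hypothetical check)."""
    if not schemas:
        raise ValueError("no files to read")
    first = schemas[0]
    for idx, schema in enumerate(schemas[1:], start=1):
        if schema != first:
            raise ValueError(f"schema mismatch in file {idx}: {schema} != {first}")
    return first


def iter_chunks(files: List[List[Row]], chunk_size: int) -> Iterator[List[Row]]:
    """Yield chunks of chunk_size rows, carrying leftovers across files."""
    buffer: List[Row] = []
    for rows in files:
        buffer.extend(rows)
        while len(buffer) >= chunk_size:
            yield buffer[:chunk_size]
            buffer = buffer[chunk_size:]
    if buffer:
        yield buffer  # final chunk may be shorter than chunk_size


files = [[{"a": 1}, {"a": 2}, {"a": 3}], [{"a": 4}, {"a": 5}]]
# Validation happens once, before any chunk is produced.
columns = validate_schemas([sorted(rows[0]) for rows in files])
chunks = list(iter_chunks(files, chunk_size=2))
```

Because the schema check runs before the generator starts, the chunking loop itself needs no per-file conditionals, which is one way to read the "reduce number of conditionals" goal in the description.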

malachi-constant · 2022-08-16T16:22:52Z

AWS CodeBuild CI Report

CodeBuild project: GitHubLoadTests5656BB24-s6u9F3qN9oFy
Commit ID: 23410dc
Result: SUCCEEDED
Build Logs (available for 30 days)

malachi-constant · 2022-08-16T16:30:55Z

AWS CodeBuild CI Report

CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
Commit ID: 23410dc
Result: SUCCEEDED
Build Logs (available for 30 days)

jaidisido added the WIP, major release, and feature labels Aug 10, 2022

jaidisido added this to the 3.0.0 milestone Aug 10, 2022

jaidisido self-assigned this Aug 10, 2022

jaidisido added this to In progress in AWS SDK for pandas roadmap Aug 10, 2022

jaidisido linked an issue Aug 10, 2022 that may be closed by this pull request

Enable Ray distribution on read_parquet method #1490

Closed

jaidisido changed the base branch from release-3.0.0 to main August 11, 2022 10:54

jaidisido changed the base branch from main to release-3.0.0 August 11, 2022 10:55

jaidisido changed the base branch from release-3.0.0 to ray-experiments August 11, 2022 10:55

jaidisido changed the base branch from ray-experiments to release-3.0.0 August 11, 2022 10:55

jaidisido changed the base branch from release-3.0.0 to main August 11, 2022 15:37

jaidisido changed the base branch from main to release-3.0.0 August 11, 2022 15:37

jaidisido changed the base branch from release-3.0.0 to main August 11, 2022 15:44

jaidisido changed the base branch from main to release-3.0.0 August 11, 2022 15:45

jaidisido force-pushed the feat-3.0/distributed-s3-read-parquet branch from 5369e6f to 3a6d75c Compare August 11, 2022 15:50

LeonLuttenberger reviewed Aug 12, 2022

malachi-constant moved this from In progress to In Review in AWS SDK for pandas roadmap Aug 15, 2022

jaidisido added 2 commits August 15, 2022 22:16

(feat): Refactor and distribute s3.read_parquet

90b2eea

Merge branch 'feat-3.0/distributed-s3-read-parquet' of https://github…

d89a584

….com/awslabs/aws-data-wrangler into feat-3.0/distributed-s3-read-parquet

jaidisido changed the base branch from release-3.0.0 to main August 15, 2022 21:20

jaidisido changed the base branch from main to release-3.0.0 August 15, 2022 21:20

Minor - Fix casing in S3 Select

8413d4e

Minor - Increase parallelism and benchmark time in load tests

1d3fdad

malachi-constant reviewed Aug 15, 2022

malachi-constant reviewed Aug 15, 2022

malachi-constant approved these changes Aug 15, 2022
