(feat): Refactor to distribute s3.read_parquet #1513

Merged
merged 6 commits into from
Aug 17, 2022

Conversation

jaidisido
Contributor

@jaidisido jaidisido commented Aug 10, 2022

Feature or Bugfix

  • Feature
  • Refactoring

Detail

  1. Refactor wr.s3.read_parquet and other methods in _read_parquet S3 module to reduce technical debt:
       • Leverage thread pool executor when possible
       • Simplify chunk generation logic
       • Reduce number of conditionals by generalising edge cases
       • Improve documentation
  2. Distribute both read_file_metadata and read_parquet calls:
       • read_file_metadata is distributed as a @ray_remote method via the executor
       • read_parquet is distributed using a custom datasource and the read_datasource Ray public API
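
As a rough illustration of the executor idea in the list above, the sketch below maps a per-file function over several S3 paths through a small executor object. The names `ThreadPoolMapExecutor` and `fake_read_file_metadata`, and the example paths, are hypothetical and not taken from the library. A Ray-backed executor could presumably expose the same `map` method and submit remote tasks instead of threads, but that part is an assumption and is not shown.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, Iterable, List


class ThreadPoolMapExecutor:
    """Hypothetical executor: runs a function over many inputs using threads."""

    def __init__(self, max_workers: int = 4) -> None:
        self._max_workers = max_workers

    def map(self, func: Callable[[str], Any], items: Iterable[str]) -> List[Any]:
        # ThreadPoolExecutor.map returns results in the order of the inputs.
        with ThreadPoolExecutor(max_workers=self._max_workers) as pool:
            return list(pool.map(func, items))


def fake_read_file_metadata(path: str) -> Dict[str, str]:
    """Stand-in for a per-file metadata read; a real one would call S3."""
    return {"path": path, "format": path.rsplit(".", 1)[-1]}


# Hypothetical paths, used only to exercise the sketch.
paths = ["s3://bucket/a.parquet", "s3://bucket/b.parquet"]
metadata = ThreadPoolMapExecutor(max_workers=2).map(fake_read_file_metadata, paths)
```

Keeping the calling code dependent only on `map` means the caller does not need to know whether the work runs on local threads or on a cluster.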

Testing

  • Standard tests pass with only minimal changes to the tests themselves
  • Two tests are added to the load_test (a simple case and a partitioned case)

Related Issue

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@jaidisido jaidisido added the WIP (Work in progress), major release (Will be addressed in the next major release), and feature labels Aug 10, 2022
@jaidisido jaidisido added this to the 3.0.0 milestone Aug 10, 2022
@jaidisido jaidisido self-assigned this Aug 10, 2022
@jaidisido jaidisido added this to In progress in AWS SDK for pandas roadmap Aug 10, 2022
@jaidisido jaidisido linked an issue Aug 10, 2022 that may be closed by this pull request
@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubLoadTests5656BB24-s6u9F3qN9oFy
  • Commit ID: 41c79fc
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: 41c79fc
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubLoadTests5656BB24-s6u9F3qN9oFy
  • Commit ID: dd1a0dc
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubLoadTests5656BB24-s6u9F3qN9oFy
  • Commit ID: aa6d689
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: 0157c10
  • Result: FAILED
  • Build Logs (available for 30 days)

@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubLoadTests5656BB24-s6u9F3qN9oFy
  • Commit ID: b80bfcd
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: b80bfcd
  • Result: FAILED
  • Build Logs (available for 30 days)

@jaidisido jaidisido changed the base branch from release-3.0.0 to main August 11, 2022 10:54
@jaidisido jaidisido changed the base branch from main to release-3.0.0 August 11, 2022 10:55
@jaidisido jaidisido changed the base branch from release-3.0.0 to ray-experiments August 11, 2022 10:55
@jaidisido jaidisido changed the base branch from ray-experiments to release-3.0.0 August 11, 2022 10:55
@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubLoadTests5656BB24-s6u9F3qN9oFy
  • Commit ID: a9c6f07
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubLoadTests5656BB24-s6u9F3qN9oFy
  • Commit ID: 6864cd1
  • Result: FAILED
  • Build Logs (available for 30 days)

@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubLoadTests5656BB24-s6u9F3qN9oFy
  • Commit ID: 17a8e79
  • Result: FAILED
  • Build Logs (available for 30 days)

@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubLoadTests5656BB24-s6u9F3qN9oFy
  • Commit ID: 3a6d75c
  • Result: FAILED
  • Build Logs (available for 30 days)

@jaidisido jaidisido changed the base branch from release-3.0.0 to main August 11, 2022 15:37
@jaidisido jaidisido changed the base branch from main to release-3.0.0 August 11, 2022 15:37
@jaidisido jaidisido changed the base branch from release-3.0.0 to main August 11, 2022 15:44
@jaidisido jaidisido changed the base branch from main to release-3.0.0 August 11, 2022 15:45
@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubLoadTests5656BB24-s6u9F3qN9oFy
  • Commit ID: 5369e6f
  • Result: FAILED
  • Build Logs (available for 30 days)

@jaidisido jaidisido force-pushed the feat-3.0/distributed-s3-read-parquet branch from 5369e6f to 3a6d75c Compare August 11, 2022 15:50
@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubLoadTests5656BB24-s6u9F3qN9oFy
  • Commit ID: 3a6d75c
  • Result: FAILED
  • Build Logs (available for 30 days)

awswrangler/s3/_read_parquet.py: three outdated review threads, resolved
@malachi-constant malachi-constant moved this from In progress to In Review in AWS SDK for pandas roadmap Aug 15, 2022
@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubLoadTests5656BB24-s6u9F3qN9oFy
  • Commit ID: 092f14a
  • Result: FAILED
  • Build Logs (available for 30 days)

@jaidisido jaidisido changed the base branch from release-3.0.0 to main August 15, 2022 21:20
@jaidisido jaidisido changed the base branch from main to release-3.0.0 August 15, 2022 21:20
@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubLoadTests5656BB24-s6u9F3qN9oFy
  • Commit ID: 8413d4e
  • Result: FAILED
  • Build Logs (available for 30 days)

@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: 8413d4e
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubLoadTests5656BB24-s6u9F3qN9oFy
  • Commit ID: 1d3fdad
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Contributor

@malachi-constant malachi-constant left a comment

Great work Abdel, this is impressive.

) -> pa.Table:
    block = ArrowBlockAccessor.for_block(block)
    df = block._table.to_pandas(**kwargs)  # pylint: disable=protected-access
    return df.astype(dtype=dtype) if dtype else df
Contributor

I think this was added for feature parity with the non-distributed version. Do you propose handling it differently, or just removing it for now?

Contributor Author

I thought so, but then I could not find any other reference in the library; the only one I found was here. Even if there were one, I would move it inside the new _table_to_df method in order to standardise it.

Contributor

Yeah, I think this type conversion was done in a different way (probably using map_types when going from a pyarrow table to a dataframe), but that wasn't available in the distributed case, so this was the only crude way to do it.

Contributor Author

Right ok, but do you agree that it's now solved since we are using the same _table_to_df method for both the distributed and standard implementations?

Contributor

Yep

categories: Optional[List[str]],
safe: bool,
map_types: bool,
def _read_parquet_chunked(
Contributor

Is my understanding correct here that the refactoring in this method is:

  1. Schema validation was removed
  2. The chunking case for pyarrow < 3 was removed, as we now support version 6 and above

Has the validation been moved somewhere else?

Contributor

Ok found the validation

Contributor Author

Yeah, validation is now centralised and done at the very beginning. Previously it was done in multiple places and much later.
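
The "validate once, up front" pattern described in this comment can be sketched as follows. This is only an illustration under simplified assumptions: schemas are modelled as plain dictionaries of column name to type name, and `validate_schemas` and `read_all` are hypothetical names, not the library's functions.

```python
from typing import Dict, List

Schema = Dict[str, str]  # column name -> type name (simplified for the sketch)


def validate_schemas(schemas: Dict[str, Schema]) -> Schema:
    """Check once, before any data is read, that every file shares one schema."""
    if not schemas:
        raise ValueError("No files to validate")
    items = list(schemas.items())
    first_path, reference = items[0]
    for path, schema in items[1:]:
        if schema != reference:
            raise ValueError(f"Schema of {path} differs from {first_path}")
    return reference


def read_all(schemas: Dict[str, Schema]) -> List[str]:
    """Validate first, then 'read'; the read step needs no further checks."""
    validate_schemas(schemas)
    return [f"read {path}" for path in schemas]


good = {"a.parquet": {"id": "int64"}, "b.parquet": {"id": "int64"}}
bad = {"a.parquet": {"id": "int64"}, "b.parquet": {"id": "string"}}

result = read_all(good)
try:
    read_all(bad)
    error = None
except ValueError as exc:
    error = str(exc)
```

Failing before any file is read avoids doing part of the work and then discovering a mismatch halfway through a chunked read.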

Contributor Author

And yes, I think it's time to drop arrow < 3, so I suggest we also drop any logic that handled older versions

Contributor

Do all our major runtimes support pyarrow above 3 now? Like, Glue, Lambda?

Contributor Author

@jaidisido jaidisido Aug 16, 2022

Lambda and Python Shell for sure, not sure about the rest. I created an issue so one of us can check. To be clear though, ray and modin only support pyarrow 6+, so we would need to do some gymnastics in the pyproject.toml to support older versions...

@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubLoadTests5656BB24-s6u9F3qN9oFy
  • Commit ID: 23410dc
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: 23410dc
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@jaidisido jaidisido merged commit 5a1f275 into release-3.0.0 Aug 17, 2022
@jaidisido jaidisido deleted the feat-3.0/distributed-s3-read-parquet branch August 17, 2022 09:29
@malachi-constant malachi-constant moved this from In Review to Done in AWS SDK for pandas roadmap Aug 23, 2022
Labels
feature, major release (Will be addressed in the next major release)
Development

Successfully merging this pull request may close these issues.

Enable Ray distribution on read_parquet method
4 participants