-
Notifications
You must be signed in to change notification settings - Fork 670
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
(feat): Refactor to distribute s3.read_parquet #1513
(feat): Refactor to distribute s3.read_parquet #1513
Conversation
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
5369e6f
to
3a6d75c
Compare
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
….com/awslabs/aws-data-wrangler into feat-3.0/distributed-s3-read-parquet
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work Abdel, this is impressive.
) -> pa.Table: | ||
block = ArrowBlockAccessor.for_block(block) | ||
df = block._table.to_pandas(**kwargs) # pylint: disable=protected-access | ||
return df.astype(dtype=dtype) if dtype else df |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this was added to feature-match with non-distributed version. Do you propose to handle this differently or just remove for now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought so but then I could not find any other reference in the library. The only one I found was here. And even if there was one, I would move it inside this new _table_to_df
method in order to standardise it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think this type conversion was done in a different way (probably using map_types when going from pyarrow table to a dataframe), but it wasn't available in distributed case so this was the only crude way to do it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right ok, but do you agree that it's now solved since we are using the same _table_to_df
method for both the distributed and standard implementations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep
categories: Optional[List[str]], | ||
safe: bool, | ||
map_types: bool, | ||
def _read_parquet_chunked( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is my understanding correct here that the refactoring in this method is:
- Schema validation was removed
- Chunking case for pyarrow < 3 was removed as we're supporting from version 6 now
Has the validation been moved somewhere else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok found the validation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, validation is now centralised and done at the very beginning. Before it was done in multiple places and done much later
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And yes, I think it's time to drop arrow < 3, so I suggest we also drop any logic that handled older versions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do all our major runtimes support pyarrow above 3 now? Like, Glue, Lambda?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lambda and Python Shell for sure, not sure about the rest. I created an issue so one of us can check. To be clear though, ray and modin only support pyarrow 6+, so we would need to do some gymnastics in the pyproject.toml to support older versions...
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
Feature or Bugfix
Detail
wr.s3.read_parquet
and other methods in_read_parquet
S3 module to reduce technical debt:read_file_metadata
andread_parquet
callsread_file_metadata
is distributed as a@ray_remote
method via the executorread_parquet
is distributed using a custom datasource and theread_datasource
Ray public APITesting
load_test
(simple and partitioned case)Related Issue
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.