Add deltalake support in AWS S3 with Pandas #1834
Conversation
awswrangler/s3/_read_deltalake.py (Outdated)
_logger: logging.Logger = logging.getLogger(__name__)


def read_deltalake(
Keeping with the spirit of awswrangler, perhaps we should combine read_deltalake and read_deltalake_from_glue into a single method. For instance, wr.s3.read_parquet currently handles both reading from S3 directly and reading via the Glue catalog. We did face some overloading issues with this strategy in the past, though, so I am happy to be convinced otherwise and keen to hear others' thoughts on this.
I kept the implementations separate for the moment, since deltalake already has its own way of getting the S3 location of a Glue table in Rust. As a side note: we could also drop the Glue integration and start with S3 only.
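A minimal sketch of the single-entry-point idea discussed above, assuming a dispatch on path vs. database/table arguments; the signature and the Glue lookup are illustrative, not the final awswrangler API:

from typing import Optional

import boto3
import pandas as pd
from deltalake import DeltaTable


def read_deltalake(
    path: Optional[str] = None,
    database: Optional[str] = None,
    table: Optional[str] = None,
    version: Optional[int] = None,
    boto3_session: Optional[boto3.Session] = None,
) -> pd.DataFrame:
    """Read from S3 when a path is given, otherwise resolve the location via the Glue catalog."""
    if path is not None:
        dt = DeltaTable(path, version=version)
    elif database is not None and table is not None:
        # Look up the Delta table's S3 location in the Glue Data Catalog.
        session = boto3_session or boto3.Session()
        glue = session.client("glue")
        response = glue.get_table(DatabaseName=database, Name=table)
        dt = DeltaTable(response["Table"]["StorageDescriptor"]["Location"], version=version)
    else:
        raise ValueError("Provide either path or both database and table.")
    return dt.to_pandas()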
awswrangler/s3/_read_deltalake.py (Outdated)
    without_files: bool = False,
    partitions: Optional[List[Tuple[str, str, Any]]] = None,
    columns: Optional[List[str]] = None,
    filesystem: Optional[Union[str, pa_fs.FileSystem]] = None,
awswrangler is entirely based on boto3 for credentials management. Potentially we could build the PyArrow FileSystem from the boto3 session, although it might not be necessary at all (see the comment on to_pandas below).
I changed the implementation by relying only on the boto3 session to build the storage_options for the storage backend of the DeltaTable, while keeping the possibility to add more configuration via s3_additional_kwargs.
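A minimal sketch of that approach, assuming the boto3 credentials are mapped onto the storage_options keys understood by the delta-rs S3 backend; the helper name and the table path are illustrative:

from typing import Dict, Optional

import boto3
from deltalake import DeltaTable


def _build_storage_options(
    boto3_session: Optional[boto3.Session] = None,
    s3_additional_kwargs: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Translate boto3 credentials into storage_options for the Delta Lake S3 backend."""
    session = boto3_session or boto3.Session()
    credentials = session.get_credentials().get_frozen_credentials()
    storage_options: Dict[str, str] = {
        "AWS_ACCESS_KEY_ID": credentials.access_key,
        "AWS_SECRET_ACCESS_KEY": credentials.secret_key,
    }
    if credentials.token:
        storage_options["AWS_SESSION_TOKEN"] = credentials.token
    if session.region_name:
        storage_options["AWS_REGION"] = session.region_name
    # Let the caller extend or override the backend configuration.
    storage_options.update(s3_additional_kwargs or {})
    return storage_options


dt = DeltaTable("s3://my-bucket/my-delta-table/", storage_options=_build_storage_options())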
awswrangler/s3/_read_deltalake.py (Outdated)
        storage_options=storage_options,
        without_files=without_files,
    )
    return table.to_pandas(partitions=partitions, columns=columns, filesystem=filesystem)
Looking into the source for this to_pandas implementation, it seems that it goes S3 -> Arrow Table -> pandas DataFrame. The user-provided conversion arguments seem limited though (partitions, columns, filesystem). Within awswrangler we tend to provide some sane defaults for this conversion and then delegate to the user to override whatever arguments they want via a pyarrow_additional_kwargs argument; that dict is then forwarded to the to_pandas method. For consistency we could follow something similar here.
Good idea! I created pyarrow_additional_kwargs and s3_additional_kwargs based on your suggestion!
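A sketch of how the two pass-through dicts could be wired together, reusing the hypothetical _build_storage_options helper from the earlier sketch; the DeltaTable is first materialised as an Arrow table, and pyarrow_additional_kwargs is forwarded to pyarrow.Table.to_pandas:

from typing import Any, Dict, List, Optional, Tuple

import boto3
import pandas as pd
from deltalake import DeltaTable


def read_deltalake(
    path: str,
    version: Optional[int] = None,
    partitions: Optional[List[Tuple[str, str, Any]]] = None,
    columns: Optional[List[str]] = None,
    without_files: bool = False,
    boto3_session: Optional[boto3.Session] = None,
    s3_additional_kwargs: Optional[Dict[str, str]] = None,
    pyarrow_additional_kwargs: Optional[Dict[str, Any]] = None,
) -> pd.DataFrame:
    """Load a Delta Lake table from S3 into a pandas DataFrame."""
    # Credentials handling as in the storage_options sketch above (hypothetical helper).
    storage_options = _build_storage_options(boto3_session, s3_additional_kwargs)
    table = DeltaTable(
        path,
        version=version,
        storage_options=storage_options,
        without_files=without_files,
    )
    # Sane defaults for the read, then let the user tune the Arrow -> pandas conversion.
    arrow_table = table.to_pyarrow_table(partitions=partitions, columns=columns)
    return arrow_table.to_pandas(**(pyarrow_additional_kwargs or {}))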
I have compiled my suggested changes in this commit. I would have created a PR against your branch, but I get permission denied on your fork. If you are happy with the changes, you can either grant me access to the fork or I can create a new PR from this branch.
The only outstanding item in my view is deciding whether to support the from_data_catalog method or drop it, and, if we keep it, whether it should live within a single method or not.
I would vote for dropping it. One reason is that I was unable to create an integration test that reads from the Glue catalog, because the write_deltalake API does not seem to support it for now. If we cannot test it, I would rather not include it.
Thanks @jaidisido, I am totally aligned with your changes and your suggestions. All good for me; I have included your commits in this PR 👍
Feature or Bugfix
Detail
deltalake in pandas makes the support work out of the box for pandas users in Lambda.
Relates
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
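For illustration only, a hedged sketch of how an end user might call the new reader described in the Detail section above once merged; the argument names follow the review discussion, and the bucket path and column names are made up:

import awswrangler as wr

# Read a Delta Lake table stored in S3 straight into a pandas DataFrame.
df = wr.s3.read_deltalake(
    path="s3://my-bucket/my-delta-table/",
    columns=["id", "value"],
    pyarrow_additional_kwargs={"split_blocks": True},
)
print(df.head())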