Add deltalake support in AWS S3 with Pandas#1834
Conversation
fa8c7e6 to
fea90fb
Compare
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
fea90fb to
59b6757
Compare
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
59b6757 to
2d409c6
Compare
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
2d409c6 to
23b5e4b
Compare
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
bf211fa to
1dd60da
Compare
1dd60da to
cc03a66
Compare
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
awswrangler/s3/_read_deltalake.py
Outdated
| _logger: logging.Logger = logging.getLogger(__name__) | ||
|
|
||
|
|
||
| def read_deltalake( |
There was a problem hiding this comment.
Keeping with the spirit of awswrangler, perhaps we should combine read_deltalake and read_deltalake_from_glue into a single method. For instance, wr.s3.read_parquet currently handles both reading from S3 directly or via the Glue catalog.
We did face some overloading issues with this strategy in the past though, so happy to be convinced otherwise and keen to hear others thoughts on this
There was a problem hiding this comment.
I kept the implementation separated for the moment since deltalake already has its own way of getting the S3 location of a Glue table in Rust.
As a side note: we could also remove the Glue integration to start only with S3.
awswrangler/s3/_read_deltalake.py
Outdated
| without_files: bool = False, | ||
| partitions: Optional[List[Tuple[str, str, Any]]] = None, | ||
| columns: Optional[List[str]] = None, | ||
| filesystem: Optional[Union[str, pa_fs.FileSystem]] = None, |
There was a problem hiding this comment.
awswrangler is entirely based on boto3 for credentials management. Potentially we could build the Pyarrow FileSystem from the boto3 session. However, it might not be necessary altogether (see comment on to_pandas below).
There was a problem hiding this comment.
I changed the implementation by relying only on the boto3 session to build the storage_options for the storage backend of the DeltaTable while keeping the possibility to add more configuration via s3_additional_kwargs.
awswrangler/s3/_read_deltalake.py
Outdated
| storage_options=storage_options, | ||
| without_files=without_files, | ||
| ) | ||
| return table.to_pandas(partitions=partitions, columns=columns, filesystem=filesystem) |
There was a problem hiding this comment.
Looking into the source for this to_pandas implementation, it seems that it goes S3 -> Arrow Table -> Pandas DF. The user-provided conversion arguments seem limited though (partitions, columns, filesystem). Within awswrangler we tend to provide some sane defaults for this conversion and then simply delegate to the user to provide whatever arguments they want to override via a pyarrow_additional_kwargs argument. That dict is then forwarded to the to_pandas method. For consistency we could follow something similar here.
There was a problem hiding this comment.
Good idea! I created pyarrow_additional_kwargs and s3_additional_kwargs based on your suggestion!
3eef47f to
62c88e1
Compare
…ers to provide more flexibility for the user when reading a DeltaTable
62c88e1 to
d1267db
Compare
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
There was a problem hiding this comment.
I have compiled my suggested changes in this commit.
I would have created a PR against your branch but I get permission denied on your fork.
If you are happy with the changes, you can either grant me access to the fork or I can create a new PR with this branch.
The only outstanding item in my view is making a decision about whether to support the from_data_catalog method or drop it, and if so whether it should be within a single method or not.
I would vote for dropping it. One reason being that I was unable to create an integration test that would read from the Glue catalog. That is because it does not seem like the write_deltalake API support it for now. So if we cannot test it, I would rather not include it.
|
Thanks @jaidisido I am totally aligned with your changes and your suggestions. All good for me, I included your commits in this PR 👍 |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
Feature or Bugfix
Detail
deltalakein pandas makes the support work out of the box for pandas users in lambda.Relates
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.