Add deltalake support in AWS S3 with Pandas #1834

Merged: 11 commits on Dec 12, 2022

Conversation

Contributor

@fvaleye fvaleye commented Dec 2, 2022

Feature or Bugfix

  • Feature

Detail

  • Delta Lake is an open-source storage framework, and the delta-rs Python library (deltalake) can read it into Pandas. Integrating deltalake here makes Delta Lake support work out of the box for pandas users on AWS Lambda.
  • The first version supports only a read operation that loads a Delta table from AWS S3 or AWS Glue into a Pandas DataFrame. Other methods could be added later (the writer is still experimental at the moment).

Relates

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@fvaleye fvaleye force-pushed the feature/add-deltalake-support-in-s3 branch 2 times, most recently from fa8c7e6 to fea90fb Compare December 2, 2022 18:34
@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: fea90fb
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@fvaleye fvaleye force-pushed the feature/add-deltalake-support-in-s3 branch from fea90fb to 59b6757 Compare December 2, 2022 22:16
@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: 59b6757
  • Result: FAILED
  • Build Logs (available for 30 days)

@fvaleye fvaleye force-pushed the feature/add-deltalake-support-in-s3 branch from 59b6757 to 2d409c6 Compare December 2, 2022 22:21
@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: 2d409c6
  • Result: FAILED
  • Build Logs (available for 30 days)

@malachi-constant malachi-constant marked this pull request as draft December 2, 2022 23:31
@fvaleye fvaleye force-pushed the feature/add-deltalake-support-in-s3 branch from 2d409c6 to 23b5e4b Compare December 5, 2022 18:58
@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: 23b5e4b
  • Result: FAILED
  • Build Logs (available for 30 days)

@fvaleye fvaleye force-pushed the feature/add-deltalake-support-in-s3 branch from bf211fa to 1dd60da Compare December 5, 2022 19:40
@fvaleye fvaleye force-pushed the feature/add-deltalake-support-in-s3 branch from 1dd60da to cc03a66 Compare December 5, 2022 19:42
@fvaleye fvaleye marked this pull request as ready for review December 5, 2022 19:49
@malachi-constant malachi-constant added the enhancement New feature or request label Dec 5, 2022
@malachi-constant malachi-constant added this to the 2.19.0 milestone Dec 5, 2022
@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: bf211fa
  • Result: FAILED
  • Build Logs (available for 30 days)

_logger: logging.Logger = logging.getLogger(__name__)


def read_deltalake(
Contributor

@jaidisido jaidisido Dec 6, 2022

In keeping with the spirit of awswrangler, perhaps we should combine read_deltalake and read_deltalake_from_glue into a single method. For instance, wr.s3.read_parquet currently handles reading both from S3 directly and via the Glue catalog.

We did face some overloading issues with this strategy in the past, though, so I am happy to be convinced otherwise and keen to hear others' thoughts on this.
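
To make the single-method idea concrete, here is a minimal sketch of how one entry point could accept either an S3 path or a catalog reference. It is not code from this PR: `resolve_delta_location` and the `catalog_lookup` hook are hypothetical names, and the hook only stands in for whatever would actually query the Glue catalog.

```python
from typing import Callable, Optional


def resolve_delta_location(
    path: Optional[str] = None,
    database: Optional[str] = None,
    table: Optional[str] = None,
    catalog_lookup: Optional[Callable[[str, str], str]] = None,
) -> str:
    """Return the S3 location to read, from either a path or a catalog entry."""
    has_catalog_ref = database is not None or table is not None
    if path is not None and has_catalog_ref:
        raise ValueError("Pass either `path` or `database`/`table`, not both.")
    if path is not None:
        return path
    if database is None or table is None:
        raise ValueError("Pass `path`, or both `database` and `table`.")
    if catalog_lookup is None:
        raise ValueError("A catalog lookup is required to resolve `database`/`table`.")
    # Hypothetical hook: in practice this would ask the catalog for the table location.
    return catalog_lookup(database, table)
```

The mutually exclusive arguments are where the overloading concern shows up: every caller pays for the validation branches, whereas two separate methods keep each signature narrow.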

Contributor Author

@fvaleye fvaleye Dec 6, 2022

I kept the implementations separate for the moment, since deltalake already has its own way, implemented in Rust, of resolving the S3 location of a Glue table.

As a side note: we could also remove the Glue integration to start only with S3.

without_files: bool = False,
partitions: Optional[List[Tuple[str, str, Any]]] = None,
columns: Optional[List[str]] = None,
filesystem: Optional[Union[str, pa_fs.FileSystem]] = None,
Contributor

awswrangler is entirely based on boto3 for credentials management. Potentially we could build the PyArrow FileSystem from the boto3 session. However, it might not be necessary at all (see the comment on to_pandas below).

Contributor Author

@fvaleye fvaleye Dec 6, 2022

I changed the implementation by relying only on the boto3 session to build the storage_options for the storage backend of the DeltaTable while keeping the possibility to add more configuration via s3_additional_kwargs.
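
As a rough illustration of that approach, the sketch below derives a storage_options dict from a boto3-style session and lets s3_additional_kwargs override it. It is an assumption-laden sketch, not the code from this PR: `build_storage_options` is a hypothetical helper, and the key names (`AWS_ACCESS_KEY_ID` and so on) are assumed to be the ones the deltalake S3 backend reads.

```python
from typing import Any, Dict, Optional


def build_storage_options(
    session: Any,
    s3_additional_kwargs: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Derive DeltaTable storage options from a boto3-style session."""
    options: Dict[str, str] = {}
    credentials = session.get_credentials()  # None when no credentials are configured
    if credentials is not None:
        frozen = credentials.get_frozen_credentials()
        options["AWS_ACCESS_KEY_ID"] = frozen.access_key
        options["AWS_SECRET_ACCESS_KEY"] = frozen.secret_key
        if frozen.token:
            # Only temporary credentials carry a session token.
            options["AWS_SESSION_TOKEN"] = frozen.token
    if session.region_name:
        options["AWS_REGION"] = session.region_name
    # User-supplied entries win over the values derived from the session.
    options.update(s3_additional_kwargs or {})
    return options
```

With a real `boto3.Session()`, the result could then be passed as `storage_options=` when constructing the DeltaTable, as in the diff fragment that follows.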

storage_options=storage_options,
without_files=without_files,
)
return table.to_pandas(partitions=partitions, columns=columns, filesystem=filesystem)
Contributor

Looking into the source for this to_pandas implementation, it seems that it goes S3 -> Arrow Table -> Pandas DF. The user-provided conversion arguments seem limited though (partitions, columns, filesystem). Within awswrangler we tend to provide some sane defaults for this conversion and then simply delegate to the user to provide whatever arguments they want to override via a pyarrow_additional_kwargs argument. That dict is then forwarded to the to_pandas method. For consistency we could follow something similar here.

Contributor Author

Good idea! I created pyarrow_additional_kwargs and s3_additional_kwargs based on your suggestion!

@fvaleye fvaleye requested review from jaidisido and kukushking and removed request for cnfait, malachi-constant and LeonLuttenberger December 7, 2022 19:46
@fvaleye fvaleye force-pushed the feature/add-deltalake-support-in-s3 branch from 3eef47f to 62c88e1 Compare December 7, 2022 20:01
…ers to provide more flexibility for the user when reading a DeltaTable
@fvaleye fvaleye force-pushed the feature/add-deltalake-support-in-s3 branch from 62c88e1 to d1267db Compare December 7, 2022 20:01
@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: e5b4694
  • Result: FAILED
  • Build Logs (available for 30 days)

Contributor

@jaidisido jaidisido left a comment

I have compiled my suggested changes in this commit.

I would have created a PR against your branch but I get permission denied on your fork.

If you are happy with the changes, you can either grant me access to the fork or I can create a new PR with this branch.

The only outstanding item in my view is deciding whether to support the from_data_catalog method or drop it and, if we keep it, whether it should be merged into a single method.

I would vote for dropping it. One reason is that I was unable to create an integration test that reads from the Glue catalog, because the write_deltalake API does not seem to support it for now. If we cannot test it, I would rather not include it.

@fvaleye
Contributor Author

fvaleye commented Dec 9, 2022

Thanks @jaidisido

I am totally aligned with your changes and your suggestions. All good for me, I included your commits in this PR 👍

@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: 6dcde18
  • Result: FAILED
  • Build Logs (available for 30 days)

@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: a0516ea
  • Result: FAILED
  • Build Logs (available for 30 days)

@fvaleye fvaleye requested review from jaidisido and kukushking and removed request for kukushking and jaidisido December 9, 2022 18:34
@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: 79a6088
  • Result: FAILED
  • Build Logs (available for 30 days)

Labels
enhancement New feature or request

4 participants