Wrangler to support Hudi/Iceberg datasets read/write #1470

Open
anandshah123 opened this issue Jul 22, 2022 · 8 comments
Labels
backlog, enhancement (New feature or request), help wanted (Extra attention is needed)
Milestone

Comments

@anandshah123

Is your idea related to a problem? Please describe.
No

Describe the solution you'd like
It would be good to have support for CDC data lake formats such as Apache Hudi, Apache Iceberg, or Delta Lake.

P.S. Please do not attach files as it's considered a security risk. Add code snippets directly in the message body as much as possible.

@anandshah123 anandshah123 added the enhancement (New feature or request) label Jul 22, 2022
@jaidisido
Contributor

Thanks for raising this. Two options are currently available in the library to handle CDC operations:

  • AWS Glue Governed Tables
  • Apache Iceberg is natively supported via Athena, meaning you can use existing wr.athena.* methods to create, update and delete Iceberg tables
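As a rough illustration of the Athena route mentioned above, the sketch below builds an Iceberg `CREATE TABLE` statement and submits it through `wr.athena.start_query_execution`. The table name, columns, bucket, and the helper names `iceberg_create_table_sql` and `run_on_athena` are assumptions made for this example, not part of the library.

```python
from typing import Dict


def iceberg_create_table_sql(table: str, columns: Dict[str, str], location: str) -> str:
    """Build an Athena DDL statement that creates an Iceberg table.

    Hypothetical helper for illustration; it only formats a string.
    """
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns.items())
    return (
        f"CREATE TABLE {table} ({cols}) "
        f"LOCATION '{location}' "
        "TBLPROPERTIES ('table_type' = 'ICEBERG')"
    )


def run_on_athena(sql: str, database: str) -> str:
    """Submit a statement to Athena and block until it finishes.

    Hypothetical wrapper; calling it requires AWS credentials and awswrangler.
    """
    import awswrangler as wr  # imported lazily so the SQL builder works offline

    query_id = wr.athena.start_query_execution(sql=sql, database=database)
    wr.athena.wait_query(query_execution_id=query_id)
    return query_id
```

Under the same assumptions, row-level changes could be submitted the same way, for example `run_on_athena("DELETE FROM events WHERE id = 1", database="my_db")`.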

Delta Lake and Hudi are not on our roadmap at the moment because they lack native support in AWS Glue. That being said, PRs are always welcome if you have a specific implementation in mind :)

@github-actions

Marking this issue as stale due to inactivity. This helps our maintainers find and focus on the active issues. If this issue receives no comments in the next 7 days it will automatically be closed.

@malachi-constant malachi-constant added the help wanted (Extra attention is needed), blocked (Something is blocking the development), and backlog labels and removed the blocked (Something is blocking the development) and closing-soon labels Sep 20, 2022
@AdrianoNicolucci

With the release of Glue 4.0, it appears there is "support for Apache Hudi, Apache Iceberg, and Delta Lake formats" in AWS Glue. Does this make it possible to implement this feature now?

@cdelamocepsa

  • Apache Iceberg is natively supported via Athena, meaning you can use existing wr.athena.* methods to create, update and delete Iceberg tables

@jaidisido is there any way to use wr.athena.* to update an Iceberg table with a pandas DataFrame, then? In the docs I can only see examples for reading DataFrames...

@nicor88

nicor88 commented Feb 13, 2023

Once apache/iceberg#6564 is implemented, it might be possible to write in Iceberg format natively from Python, without any help from external processing systems like Spark/Athena/Trino.

@apopata-aws

Without Athena, could we have a more seamless integration in Wrangler for all transactional formats? For example:

  • wr.create_table (format='hudi' ...)
  • wr.create_table (format='iceberg' ...)
  • wr.create_table (format='deltalake' ...)

@jaidisido jaidisido added this to the 3.3.0 milestone Jun 6, 2023
@kukushking
Contributor

kukushking commented Jun 8, 2023

Hi @cdelamocepsa, it is now possible to write into Iceberg using Athena since release 3.1; see the Athena Iceberg tutorial.
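A minimal sketch of what such a write might look like with `wr.athena.to_iceberg`, assuming awswrangler 3.1 or later and valid AWS credentials. The wrapper name `write_dataframe_to_iceberg` and all argument values are placeholders chosen for this example.

```python
from typing import Any


def write_dataframe_to_iceberg(
    df: Any,
    database: str,
    table: str,
    table_location: str,
    temp_path: str,
) -> None:
    """Write a pandas DataFrame into an Iceberg table through Athena.

    Hypothetical wrapper for illustration; it simply forwards its
    arguments to wr.athena.to_iceberg.
    """
    import awswrangler as wr  # imported lazily; needs AWS access when called

    wr.athena.to_iceberg(
        df=df,
        database=database,
        table=table,
        table_location=table_location,  # S3 prefix where the Iceberg table lives
        temp_path=temp_path,  # S3 prefix used for intermediate files
    )
```

A call could then look like `write_dataframe_to_iceberg(df, "my_db", "events", "s3://my-bucket/events/", "s3://my-bucket/tmp/")`, where the database, table, and bucket names are made up.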

@cdelamocepsa

cdelamocepsa commented Jun 8, 2023

Hi @cdelamocepsa, it is now possible to write into Iceberg since release 3.1: Athena Iceberg tutorial.

@kukushking I see that you need to specify a temp_path. I suppose this method writes the data to a temporary Glue table and then inserts into the Iceberg table from that temp table.

I'm concerned about the efficiency of this. Do you have any input on how it will behave in terms of latency and cost?
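The staging pattern supposed in this comment can be sketched as a pair of Athena statements. This only illustrates the commenter's assumption and is not a description of the library's internals; the helper `staged_insert_statements` and the table names are hypothetical.

```python
from typing import List


def staged_insert_statements(target: str, staging: str) -> List[str]:
    """Return the SQL a stage-then-insert write would need to run.

    Assumes the DataFrame was already written under a temporary S3 path
    and registered as the Glue table named by `staging`.
    """
    return [
        # Copy every staged row into the Iceberg table.
        f"INSERT INTO {target} SELECT * FROM {staging}",
        # Remove the temporary table once the insert has finished.
        f"DROP TABLE IF EXISTS {staging}",
    ]
```

If the write does work this way, each call would store the data twice (once in the staging location, once in the Iceberg table) and run at least one Athena query over the staged rows, which is where the latency and cost concern comes from.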

@jaidisido jaidisido modified the milestones: 3.3.0, 3.4.0 Aug 1, 2023
@LeonLuttenberger LeonLuttenberger modified the milestones: 3.4.0, 3.5.0 Sep 20, 2023

9 participants