✨ [destination-DuckDB] use pyarrow batch insert to replace executemany #36715

Merged 8 commits on Apr 22, 2024

Conversation

@hrl20 hrl20 commented Apr 1, 2024

What

Improve the DuckDB connector performance by replacing the .executemany calls.

How

The DuckDB docs recommend against using .executemany for ingestion: https://duckdb.org/docs/api/python/dbapi. PyArrow is a convenient way to represent the buffered records so they can be bulk-inserted into DuckDB.
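
For readers who want to see the general pattern, here is a minimal sketch of an Arrow-based bulk insert. It is an illustration under stated assumptions, not the connector's actual code: the helper names `rows_to_columns` and `flush_buffer`, the view name `buffered_batch`, and the table layout are all hypothetical. It assumes the `duckdb` and `pyarrow` packages are installed and that the buffer's column order matches the target table, because `SELECT *` maps columns by position.

```python
from typing import Any


def rows_to_columns(rows: list[dict[str, Any]], columns: list[str]) -> dict[str, list[Any]]:
    """Pivot buffered row dicts into the column-oriented dict PyArrow expects.

    Missing keys become None so every column has the same length.
    """
    return {col: [row.get(col) for row in rows] for col in columns}


def flush_buffer(con, table_name: str, buffer: dict[str, list[Any]]) -> None:
    """Insert one buffered batch with a single INSERT ... SELECT statement.

    `con` is an open duckdb connection; `table_name` must already exist and
    is assumed to be a trusted identifier (it is interpolated into the SQL).
    """
    # Third-party import kept local so rows_to_columns stays dependency-free.
    import pyarrow as pa

    arrow_table = pa.Table.from_pydict(buffer)
    # Expose the Arrow table to SQL under a temporary (hypothetical) view name.
    con.register("buffered_batch", arrow_table)
    try:
        # One statement for the whole batch instead of one per row.
        con.execute(f'INSERT INTO "{table_name}" SELECT * FROM buffered_batch')
    finally:
        con.unregister("buffered_batch")
```

A caller would open a connection with `duckdb.connect(...)`, create the target table, and then call `flush_buffer(con, "my_table", rows_to_columns(rows, ["id", "name"]))` each time the buffer fills. The point of the design is that each flush issues one set-based statement, whereas `.executemany` issues one insert per record.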

User Impact

No functional changes. Better ingestion performance.

In the MotherDuck integration test:
5,000 records in batches of 1,000: before: ~258s; after: ~1.8s
1M records in batches of 1,000: before: (not measured); after: 70s

CLAassistant commented Apr 1, 2024

CLA assistant check: all committers have signed the CLA.

@hrl20 hrl20 force-pushed the duckdb-batching branch 3 times, most recently from 31f7967 to ddda446 Compare April 2, 2024 05:17
@hrl20 hrl20 changed the title use pyarrow to batch insert into duckdb ✨ Destination DuckDB: use pyarrow batch insert to replace executemany Apr 2, 2024
@octavia-squidington-iii octavia-squidington-iii added the area/documentation Improvements or additions to documentation label Apr 2, 2024
@hrl20 hrl20 marked this pull request as ready for review April 2, 2024 15:57
@aaronsteers aaronsteers self-requested a review April 2, 2024 17:41

@aaronsteers aaronsteers left a comment

@hrl20 - This looks great, first of all. Thanks very much for contributing! The performance boost is very exciting. 🔥

One question inline about complex/messy schemas.

@hrl20 hrl20 requested a review from aaronsteers April 4, 2024 18:58
@marcosmarxm

@aaronsteers and @hrl20, please make sure all contributors have signed the CLA so this contribution can be merged.

@marcosmarxm marcosmarxm changed the title ✨ Destination DuckDB: use pyarrow batch insert to replace executemany ✨ [destination-DuckDB] use pyarrow batch insert to replace executemany Apr 10, 2024

@aaronsteers aaronsteers left a comment

Hi, @hrl20! Thanks very much for this submission!

A few minor changes requested. Then, I think this is ready to ship.

Once you've replied and/or updated the PR, please re-request my review, and I'll try to review it and get it merged without much additional delay.

Thanks!

@aaronsteers aaronsteers left a comment

Approved. 🎉 🚀

We still need to make sure CI tests pass. I'll try to get that done tomorrow or Wed.
cc @marcosmarxm in case you have any cycles to contribute towards this. (Totally ok either way.)

github-actions bot commented Apr 18, 2024

✅ Tests passed. 📎

github-actions bot commented Apr 22, 2024

PR test job started... Check job output.

✅ Tests passed. 📎

@aaronsteers

aaronsteers commented Apr 22, 2024

I've run this through the airbyte-ci tests and applied the format_fix manual CI workflow action.

All have passed so I'm proceeding to merge.

Connector should auto-publish after merge.

@aaronsteers aaronsteers merged commit d4944a3 into airbytehq:master Apr 22, 2024
25 of 29 checks passed
strosek pushed a commit that referenced this pull request Apr 24, 2024
FVidalCarneiro pushed a commit to AgiData/airbyte that referenced this pull request May 2, 2024
Labels: area/connectors, area/documentation, community, connectors/destination/duckdb