-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
✨ [destination-DuckDB] use pyarrow
batch insert to replace executemany
#36715
Conversation
The latest updates on your projects. Learn more about Vercel for Git ↗︎ 1 Ignored Deployment
|
31f7967
to
ddda446
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hrl20 - This looks great, first of all. Thanks very much for contributing! The performance boost is very exciting. 🔥
One question inline about complex/messy schemas.
airbyte-integrations/connectors/destination-duckdb/destination_duckdb/destination.py
Outdated
Show resolved
Hide resolved
airbyte-integrations/connectors/destination-duckdb/destination_duckdb/destination.py
Show resolved
Hide resolved
@aaronsteers and @hrl20 make sure all contributors had sign the CLA to merge this contribution. |
pyarrow
batch insert to replace executemany
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, @hrl20! Thanks very much for this submission!
A few minor changes requested. Then, I think this is ready to ship.
Once you've replied and/or updated the PR, if you can re-request my review, I'll try to review and get merged without much additional delay.
Thanks!
airbyte-integrations/connectors/destination-duckdb/destination_duckdb/destination.py
Outdated
Show resolved
Hide resolved
airbyte-integrations/connectors/destination-duckdb/destination_duckdb/destination.py
Show resolved
Hide resolved
airbyte-integrations/connectors/destination-duckdb/destination_duckdb/destination.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved. 🎉 🚀
We still need to make sure CI tests pass. I'll try to get that done tomorrow or Wed.
cc @marcosmarxm in case you have any cycles to contribute towards this. (Totally ok either way.)
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
[1]: |
|
|
|
|
|
I've run this through the All have passed so I'm proceeding to merge. Connector should auto-publish after merge. |
…t as replacement of `executemany` (#36715)
…t as replacement of `executemany` (airbytehq#36715)
What
Improve the DuckDB connector performance by replacing the
.executemany
calls.How
DuckDB docs recommends against using
.executemany
for ingestion https://duckdb.org/docs/api/python/dbapi. Pyarrow is a convenient way to represent the buffered records to do bulk insert into duckdb.User Impact
No functional changes. Better ingestion performance.
In the MotherDuck integration test:
5000 in batches of 1000: before: ~258s; after: ~1.8s
1M records in batches of 1000: before:(not measured); after: 70s