AirbyteLib: DuckDB Perf Boost #34589

Merged Jan 30, 2024 (64 commits)


aaronsteers (Collaborator) commented Jan 28, 2024

Note: This should be merged after:


This is a relatively small change, but it gives a nice performance boost for DuckDB caches.

Based on the DuckDB Parquet loading docs here:

After this update, 10 million records from 1,001 files loaded in 12 seconds. The "read" phase from Faker takes just over 6 minutes, including the time to write the intermediate Parquet files.

Read Progress

Started reading at 21:36:50.
Read 10,000,135 records over 6min 7s (27,248.3 records / second).
Wrote 10,000,100 records over 1,001 batches.
Finished reading at 21:42:58.
Started finalizing streams at 21:42:58.
Finalized 1001 batches over 12 seconds.
Completed 3 out of 3 streams:
 • users
 • purchases
 • products
Completed writing at 21:43:10. Total time elapsed: 6min 19s
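
For readers unfamiliar with DuckDB's Parquet loading, here is a minimal sketch of a bulk load that hands every file to `read_parquet` in one statement instead of inserting file by file. It is an illustration only and not necessarily how this PR implements the change: the helper names `build_parquet_load_sql` and `load_parquet_files` are hypothetical, and the target table is assumed to already exist with matching columns.

```python
from pathlib import Path
from typing import Iterable, Union


def build_parquet_load_sql(table_name: str, files: Iterable[Union[str, Path]]) -> str:
    """Build one INSERT that reads every Parquet file in a single statement.

    Hypothetical helper for illustration; not part of AirbyteLib.
    """
    # Quote each path as a SQL string literal, doubling embedded single quotes.
    file_literals = ", ".join(
        "'" + str(f).replace("'", "''") + "'" for f in files
    )
    if not file_literals:
        raise ValueError("expected at least one Parquet file")
    # Quote the table name as an identifier, doubling embedded double quotes.
    quoted_table = '"' + table_name.replace('"', '""') + '"'
    # DuckDB's read_parquet accepts a list of file paths.
    return (
        f"INSERT INTO {quoted_table} "
        f"SELECT * FROM read_parquet([{file_literals}])"
    )


def load_parquet_files(connection, table_name: str, files) -> None:
    """Run the bulk load on any object exposing an `execute(sql)` method."""
    connection.execute(build_parquet_load_sql(table_name, files))
```

With the `duckdb` package installed, `load_parquet_files(duckdb.connect("cache.duckdb"), "users", paths)` would issue a single statement for all files; the database file name and table name here are placeholders.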

────────────────────────────

Related: I also created a NullCache and NullWriter implementation to determine how much time is used when writing files:
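
The idea of measuring with a null writer can be sketched roughly as follows. This is a hypothetical illustration, not the NullCache or NullWriter code mentioned above: the class `DiscardingWriter`, its `write_batch` method, and the function `time_read_phase` are invented names for the example.

```python
import time
from typing import Iterable, List


class DiscardingWriter:
    """Hypothetical no-op writer: counts records but never touches disk."""

    def __init__(self) -> None:
        self.records_seen = 0

    def write_batch(self, batch: List[dict]) -> None:
        # Only count the records; nothing is serialized or written.
        self.records_seen += len(batch)


def time_read_phase(batches: Iterable[List[dict]], writer) -> float:
    """Return the seconds spent draining `batches` into `writer`."""
    started = time.perf_counter()
    for batch in batches:
        writer.write_batch(batch)
    return time.perf_counter() - started
```

Running the same source once with a discarding writer and once with a real file writer, then subtracting the two timings, gives a rough estimate of how much of the read phase is spent writing files.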

flash1293 (Contributor) left a comment:

tested and works as expected, LGTM

octavia-squidington-iv requested a review from a team January 29, 2024 14:16
Base automatically changed from aj/airbyte-lib/progress-print to master January 30, 2024 06:39
aaronsteers enabled auto-merge (squash) January 30, 2024 06:49
aaronsteers merged commit b37bde8 into master Jan 30, 2024
19 checks passed
aaronsteers deleted the aj/airbyte-lib/duckdb-perf-boost branch January 30, 2024 06:56
clnoll pushed a commit that referenced this pull request Jan 30, 2024
jbfbell pushed a commit that referenced this pull request Feb 1, 2024
jatinyadav-cc pushed a commit to ollionorg/datapipes-airbyte that referenced this pull request Feb 21, 2024
jatinyadav-cc pushed a commit to ollionorg/datapipes-airbyte that referenced this pull request Feb 26, 2024
Labels: airbyte-lib (Related to AirbyteLib)