Support loading datasets of 10GB in Snowflake in less than 5 min #430

Closed
4 tasks
tatiana opened this issue Jun 6, 2022 · 1 comment · Fixed by #544

Comments

@tatiana
Collaborator

tatiana commented Jun 6, 2022

Dependencies

Acceptance criteria

  • Make changes to how we load data into Snowflake.
  • Re-run the benchmark and identify the performance improvements (?)
  • The changes must work for all file types supported (CSV, JSON, NDJSON, Parquet)
  • The changes must also work for files stored on S3 and GCS
@kaxil kaxil added this to the 1.0.0 milestone Jun 6, 2022
@tatiana tatiana changed the title from "Support loading datasets of 10GB in Snowflake in less than 2 min" to "Support loading datasets of 10GB in Snowflake in less than 5 min" Jun 7, 2022
@tatiana tatiana added the improvement Enhancement or improvement in an existing feature label Jun 7, 2022
@kaxil kaxil assigned kaxil and sunank200 and unassigned tatiana and kaxil Jun 22, 2022
@tatiana tatiana self-assigned this Jul 13, 2022
tatiana added a commit that referenced this issue Jul 14, 2022
Refactor how tables are created in BaseDatabase.load_file_to_table

We should prioritise creating the table using the `table.columns` if they are specified by the user and have the dataframe autodetection as a fallback.

Most of the complexity of #487 was the creation of tables, and this step aims to simplify the Snowflake `load_file` optimization.

Relates to: #430, #481, #493, #494
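
A minimal sketch of the prioritisation described in this commit message, not the actual `BaseDatabase.load_file_to_table` code. The `Table` class, `infer_columns`, `resolve_columns` and `build_create_table_statement` are hypothetical names, and a list of row dicts stands in for the dataframe used for autodetection:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple

Column = Tuple[str, str]  # (column name, SQL type)


@dataclass
class Table:
    """Hypothetical stand-in for the SDK's table object."""
    name: str
    columns: List[Column] = field(default_factory=list)


def infer_columns(sample_rows: List[Dict[str, Any]]) -> List[Column]:
    """Fallback: guess column types from sample rows (stand-in for dataframe autodetection)."""
    type_map = {bool: "BOOLEAN", int: "INTEGER", float: "FLOAT", str: "VARCHAR"}
    inferred: Dict[str, str] = {}
    for row in sample_rows:
        for key, value in row.items():
            if value is None:
                continue  # a null tells us nothing about the type
            # Keep the first non-null type seen for each column.
            inferred.setdefault(key, type_map.get(type(value), "VARCHAR"))
    return list(inferred.items())


def resolve_columns(table: Table, sample_rows: List[Dict[str, Any]]) -> List[Column]:
    """Prefer user-specified columns; autodetect only when none are given."""
    if table.columns:
        return list(table.columns)
    return infer_columns(sample_rows)


def build_create_table_statement(table: Table, sample_rows: List[Dict[str, Any]]) -> str:
    columns = resolve_columns(table, sample_rows)
    if not columns:
        raise ValueError("No columns specified and none could be inferred")
    column_sql = ", ".join(f"{name} {sql_type}" for name, sql_type in columns)
    return f"CREATE TABLE IF NOT EXISTS {table.name} ({column_sql})"
```

With this shape, a table that declares its columns never triggers autodetection, which keeps the table-creation step independent of how the file is later loaded.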
tatiana added a commit that referenced this issue Jul 15, 2022
Fix: #430

Co-authored-by: Ankit Chaurasia <ankit.chaurasia@astronomer.io>
tatiana added a commit that referenced this issue Jul 18, 2022
Fix: #430

Co-authored-by: Ankit Chaurasia <ankit.chaurasia@astronomer.io>
tatiana added a commit that referenced this issue Jul 19, 2022
Fix: #430

Co-authored-by: Ankit Chaurasia <ankit.chaurasia@astronomer.io>
tatiana added a commit that referenced this issue Jul 19, 2022
Fix: #430

Co-authored-by: Ankit Chaurasia <ankit.chaurasia@astronomer.io>
kaxil pushed a commit that referenced this issue Jul 20, 2022
Fix: #430

Co-authored-by: Ankit Chaurasia <ankit.chaurasia@astronomer.io>
tatiana added a commit that referenced this issue Jul 20, 2022
Fix: #430

Co-authored-by: Ankit Chaurasia <ankit.chaurasia@astronomer.io>
@tatiana tatiana reopened this Jul 20, 2022
@tatiana
Collaborator Author

tatiana commented Jul 20, 2022

I believe this was closed by mistake; we haven't merged the changes into master yet!

sunank200 pushed a commit that referenced this issue Jul 22, 2022
Fix: #430

Co-authored-by: Ankit Chaurasia <ankit.chaurasia@astronomer.io>
utkarsharma2 pushed a commit that referenced this issue Jul 25, 2022
Fix: #430

Co-authored-by: Ankit Chaurasia <ankit.chaurasia@astronomer.io>
tatiana added a commit that referenced this issue Jul 25, 2022
Fix: #430

Co-authored-by: Ankit Chaurasia <ankit.chaurasia@astronomer.io>
tatiana added a commit that referenced this issue Jul 25, 2022
Fix: #430

Co-authored-by: Ankit Chaurasia <ankit.chaurasia@astronomer.io>
tatiana added a commit that referenced this issue Jul 26, 2022
Fix: #430 

Reduce the time to load to Snowflake by about 78% for 5GB datasets (from 24.46 min to 5.49 min, i.e. to roughly 22% of the original time). Further details are in the PR results file.

Co-authored-by: Ankit Chaurasia <ankit.chaurasia@astronomer.io>
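
For reference, the relative improvement implied by the two timings quoted in the commit message (24.46 min before, 5.49 min after) can be checked with a few lines of arithmetic. The function name is illustrative only:

```python
def percent_reduction(before: float, after: float) -> float:
    """Percentage of the original duration that was saved."""
    return (before - after) / before * 100


before_min = 24.46  # load time before the change, from the commit message
after_min = 5.49    # load time after the change, from the commit message

reduction = percent_reduction(before_min, after_min)
remaining = 100 - reduction       # share of the original time still needed
speedup = before_min / after_min  # how many times faster the load became

print(f"reduction: {reduction:.1f}%, remaining: {remaining:.1f}%, speedup: {speedup:.2f}x")
```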