Make Polars a first class citizen for pulling and loading data to snowflake #295

Merged
gladysteh99 merged 17 commits into develop from polars
Sep 5, 2024

@gladysteh99
Contributor

pyproject.toml and requirements.txt: add polars as a dependency.

For pulling data from snowflake:

  • to_dataframe in database.py and snowflake.py: added a df_type argument that lets users specify the dataframe type of the output. It defaults to "pandas" to prevent any breaking changes for existing users. The snowflake.py version fetches results in pyarrow format instead of using native fetching, to improve performance.
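
The df_type switch described above could look roughly like the following. This is a minimal sketch, not the code from this PR: rows_to_dataframe is a hypothetical helper, and its rows and columns inputs stand in for whatever the cursor returns. The libraries are imported lazily so that an unsupported df_type is rejected before either import is attempted.

```python
def rows_to_dataframe(rows, columns, df_type="pandas"):
    """Build a dataframe of the requested type from fetched rows.

    Hypothetical helper for illustration only; not part of locopy's API.
    `rows` is a list of tuples and `columns` is a list of column names.
    """
    if df_type not in ("pandas", "polars"):
        raise ValueError(f"df_type must be 'pandas' or 'polars', got {df_type!r}")

    if df_type == "pandas":
        import pandas as pd  # imported lazily; only needed on this branch

        return pd.DataFrame(rows, columns=columns)

    import polars as pl  # imported lazily; only needed on this branch

    # orient="row" tells polars that each element of `rows` is one record.
    return pl.DataFrame(rows, schema=columns, orient="row")
```

Because "pandas" is the default, existing calls that omit df_type keep returning a pandas DataFrame.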

For loading data to snowflake:

  • insert_dataframe_to_table in snowflake.py and redshift.py: allow the dataframe argument to accept polars.DataFrame and polars.LazyFrame.
  • find_column_type in utility.py uses singledispatch to extend the function with polars support.
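
For readers unfamiliar with functools.singledispatch, the pattern can be sketched as below. This is an illustration under stated assumptions, not the implementation in utility.py: PandasLike and PolarsLike are hypothetical stand-ins (so the sketch runs without either library installed), column_types is a hypothetical name, and the dtype-to-SQL mappings are made up for the example. In real code the registered types would be pandas.DataFrame and polars.DataFrame.

```python
from dataclasses import dataclass
from functools import singledispatch


@dataclass
class PandasLike:
    """Hypothetical stand-in for a pandas DataFrame: column name -> dtype string."""

    dtypes: dict


@dataclass
class PolarsLike:
    """Hypothetical stand-in for a polars DataFrame: column name -> dtype name."""

    schema: dict


@singledispatch
def column_types(dataframe):
    # Fallback used when no implementation is registered for the argument's type.
    raise TypeError(f"Unsupported dataframe type: {type(dataframe).__name__}")


@column_types.register
def _(dataframe: PandasLike):
    # Illustrative mapping only; unknown dtypes fall back to varchar.
    mapping = {"int64": "int", "float64": "float", "bool": "boolean"}
    return {col: mapping.get(dtype, "varchar") for col, dtype in dataframe.dtypes.items()}


@column_types.register
def _(dataframe: PolarsLike):
    # Illustrative mapping only; unknown dtypes fall back to varchar.
    mapping = {"Int64": "int", "Float64": "float", "Boolean": "boolean"}
    return {col: mapping.get(dtype, "varchar") for col, dtype in dataframe.schema.items()}
```

Callers use a single entry point, column_types(df), and the implementation is chosen from the type of the first argument, so supporting another dataframe library means registering one more function rather than adding an isinstance branch.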

Contributor

@jdawang left a comment

Overall good work with the logic.

I have a few requested changes regarding undeleted comments.

Then a few questions/comments on potentially using more dispatching and removing support for LazyFrame that I would love to hear your thoughts on.

Comment thread locopy/utility.py Outdated
Comment thread locopy/snowflake.py Outdated
Comment thread pyproject.toml Outdated
]
license = {text = "Apache Software License"}
-dependencies = ["boto3<=1.35.9,>=1.9.92", "PyYAML<=6.0.1,>=5.1", "pandas<=2.2.2,>=0.25.2", "numpy<=2.0.2,>=1.22.0"]
+dependencies = ["boto3<=1.35.9,>=1.9.92", "PyYAML<=6.0.1,>=5.1", "pandas<=2.2.2,>=0.25.2", "numpy<=2.0.2,>=1.22.0", "polars>=1.5.0"]
Contributor

Is 1.5 actually the min version that works? I think it's worth looking into this as I think people are still using versions before 1.0.

Contributor Author

@gladysteh99 Sep 4, 2024

The minimum version needed is 1.0 because one of the methods used (collect_schema) requires it, and that method is necessary for supporting both LazyFrame and DataFrame. It really boils down to whether we want to support both. I personally prefer setting it to at least 1.0.0 because of the significant upgrades from 0.20.0 to 1.0.0.

Comment thread locopy/snowflake.py Outdated
Comment thread locopy/snowflake.py Outdated

if columns:
    dataframe = dataframe[columns]
try:
Contributor

Looking at the method, would it be possible to use a single dispatch helper here for the first part of this function? E.g. a dispatched function for line 431 to 456 - creating the to_insert. Once we have the list of tuples, the rest is the same.

Contributor Author

Yes, it is possible. My only concern was readability for future developers, but I'm happy to refactor it into a dispatched function if there is any significant upside.

Contributor

I think it would be more readable to select the columns inside the isinstance checks below, i.e. move dataframe = dataframe[columns] into the pandas branch and .select(columns) into the polars branch. I don't think anything in between requires the selection to happen first. That, or do the dispatch. Let me know if I'm wrong here.

Contributor Author

@gladysteh99 Sep 4, 2024

Actually, a polars DataFrame supports dataframe[columns] as of version 1.0.0; the .select(columns) call was intended for LazyFrame. Since we are no longer supporting LazyFrame, it can be removed completely.

Contributor

Sounds good to me

Comment thread locopy/redshift.py Outdated
try:
    length = len(dataframe)
except TypeError:
    length = dataframe.select(pl.len()).collect().item()  # for polars lazyframe
Contributor

Same comment here on LazyFrame, if we want to consider not supporting it.

Same comment here on considering a dispatch helper method to get to_insert.

Comment thread locopy/database.py Outdated
@gladysteh99 requested a review from jdawang September 5, 2024 15:54
Contributor

@jdawang left a comment

Last comment that I missed last time, everything else looks good to me!

Comment thread locopy/database.py
@jdawang self-requested a review September 5, 2024 17:21
Contributor

@jdawang left a comment

LGTM, good work!

@gladysteh99 merged commit 57568fa into develop Sep 5, 2024
@gladysteh99 deleted the polars branch September 5, 2024 17:33