Make Polars a first class citizen for pulling and loading data to snowflake #295
gladysteh99 merged 17 commits into develop
jdawang left a comment
Overall good work with the logic.
I have a few requested changes regarding undeleted comments.
Then a few questions/comments on potentially using more dispatching and removing support for LazyFrame; I would love to hear your thoughts on those.
  ]
  license = {text = "Apache Software License"}
- dependencies = ["boto3<=1.35.9,>=1.9.92", "PyYAML<=6.0.1,>=5.1", "pandas<=2.2.2,>=0.25.2", "numpy<=2.0.2,>=1.22.0"]
+ dependencies = ["boto3<=1.35.9,>=1.9.92", "PyYAML<=6.0.1,>=5.1", "pandas<=2.2.2,>=0.25.2", "numpy<=2.0.2,>=1.22.0", "polars>=1.5.0"]
Is 1.5 actually the min version that works? I think it's worth looking into this as I think people are still using versions before 1.0.
The minimum version needed is 1.0, because one of the methods used (collect_schema) is necessary for supporting both LazyFrame and DataFrame. It really boils down to whether we want to support both. I personally prefer setting it to at least 1.0.0 because of the significant upgrades from 0.20.0 to 1.0.0.
if columns:
    dataframe = dataframe[columns]
try:
Looking at the method, would it be possible to use a single dispatch helper here for the first part of this function? E.g. a dispatched function for line 431 to 456 - creating the to_insert. Once we have the list of tuples, the rest is the same.
Yes, it is possible. My only concern was readability for future developers, but I'm happy to refactor it into a dispatched function if there is any significant upside.
I think it's more readable to use the isinstance checks below to select columns from dataframe. I don't think anything between requires this to be done first. That or doing the dispatch. Lmk if I'm wrong here. Like moving the dataframe=dataframe[columns] into the pandas branch and the .select(columns) into the polars branch.
Actually polars dataframe can read dataframe[columns] as of version 1.0.0, the .select(columns) was intended for LazyFrame. Since we are not supporting LazyFrame now, it can be removed completely
try:
    length = len(dataframe)
except TypeError:
    length = dataframe.select(pl.len()).collect().item()  # for polars lazyframe
Same comment here on LazyFrame. If we want to consider not supporting that.
Same comment here on considering dispatch helper method to get to_insert.
jdawang left a comment
Last comment that I missed last time, everything else looks good to me!
pyproject.toml & requirements.txt: add polars as a dependency
For pulling data from snowflake:
to_dataframe in database.py and snowflake.py: added an argument called df_type for users to specify the dataframe type of the output. It defaults to "pandas" to prevent any breaking changes for existing users. The snowflake.py version uses the pyarrow form instead of native fetching to improve performance.
For loading data to snowflake:
insert_dataframe_to_table in snowflake.py and redshift.py: allow the dataframe argument to take in polars.DataFrame and polars.LazyFrame
find_column_type in utility.py uses singledispatch to extend the function with polars functionality.