[RFC] How should the Snowflake I/O manager handle Pandas time data? #13969
Replies: 1 comment
Thanks to the Dagster team for your hard work on creating and enhancing this great tool.

Use Cases and Expectations

I see two main categories of use cases for uploading dataframes with the Snowflake Resource / IO Manager:

1. Round-tripping: Snowflake is used as a storage backend, and a dataframe uploaded by one asset should come back unchanged when downloaded by another.
2. Downstream operations: the uploaded data is operated on in Snowflake itself, e.g., via SQL, dbt, or Snowpark.
The needs and expectations of the Snowflake Resource / IO Manager's behavior depend on the use case.

For Use Case 1: The expectation is that the dataframe can make a round-trip (to and from Snowflake) without any impactful changes. It is acceptable for datetimes to be stored as text in Snowflake as long as they are converted back to datetimes when downloading. Any datatype or timezone change during the round-trip would be unexpected and undesirable.

For Use Case 2: The expectation is that the dataframe types can be translated to the closest and most sensible types in Snowflake, so operations can be done (e.g., in SQL, dbt, or Snowpark) without friction. The data type mapping should be sensible, known, and reproducible, taking into account timezones and other special cases when possible.

Dagster should ideally support both use cases, but doing so would involve increased complexity in the implementation.

Issues

The main issue affecting both use cases is that types are often translated unexpectedly when uploading dataframes with snowflake-connector-python or sqlalchemy.

Issue Examples
Solution Ideas

For Use Case 1: One possible implementation would be for Dagster to add a new column to the DataFrame containing the serialized values of the other columns when uploading a dataframe. This would ensure that round-trips do not change the data types or timezones.

```python
import pandas as pd
import pickle

# Example dataframe (illustrative)
df = pd.DataFrame({'year': [2008], 'ts': pd.to_datetime(['2008-01-01 12:00'])})

# Function to serialize a row
def serialize_row(row):
    return pickle.dumps(row.to_dict())

# Add a new column with the serialized values
df['Serialized'] = df.apply(serialize_row, axis=1)

# Function to deserialize a row
def deserialize_row(serialized_row):
    return pd.Series(pickle.loads(serialized_row))

# Rebuild the DataFrame using the 'Serialized' column
rebuilt_df = df['Serialized'].apply(deserialize_row)
```

Pros:
Cons:
For Use Case 2: I don't see a simple solution that would fix all issues. One possibility is to create tables explicitly when uploading a dataframe. If the table does not already exist in Snowflake, Dagster could create it explicitly, using its own datatype mapping. This would avoid relying on snowflake-connector-python, which often infers types incorrectly. A sketch of this idea follows the pros and cons below.

Pros:
Cons:
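To make the explicit-creation idea concrete, here is a minimal sketch, assuming a snowflake-connector-python connection `conn`; the `DTYPE_TO_SNOWFLAKE` mapping and helper names are illustrative, not part of Dagster or the connector:

```python
import pandas as pd
from snowflake.connector.pandas_tools import write_pandas

# Illustrative Pandas-dtype -> Snowflake-type mapping (not Dagster's actual mapping)
DTYPE_TO_SNOWFLAKE = {
    "int64": "NUMBER",
    "float64": "FLOAT",
    "bool": "BOOLEAN",
    "datetime64[ns]": "TIMESTAMP_NTZ",
    "datetime64[ns, UTC]": "TIMESTAMP_TZ",
    "object": "VARCHAR",
}

def create_table_explicitly(conn, df: pd.DataFrame, table_name: str) -> None:
    # Build the column list from a known mapping instead of letting the
    # connector infer column types
    columns = ", ".join(
        f'"{col}" {DTYPE_TO_SNOWFLAKE.get(str(dtype), "VARCHAR")}'
        for col, dtype in df.dtypes.items()
    )
    conn.cursor().execute(f'CREATE TABLE IF NOT EXISTS "{table_name}" ({columns})')

# Usage: create the table with known types first, then append the rows
# create_table_explicitly(conn, df, "MY_TABLE")
# write_pandas(conn, df, "MY_TABLE")
```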
I hope these suggestions can contribute to the ongoing improvement of Dagster, and I look forward to seeing the project's continued growth and success. Keep up the good work!
---
Over the past few months, a number of issues with how the `snowflake-connector-python` handles time data have been uncovered. We have made some attempts to work around these issues, but each solution has presented its own problems.

Background information

Storing Pandas timestamp data without a timezone via the `snowflake-connector-python` will cause the data to become mangled (for example, the year 2008 will be changed to 52170; when you try to load the data in a downstream asset, Pandas will error because 52170 is an invalid year). The maintainers have decided they will not prioritize fixing this issue because some workarounds exist.

Our attempts to work around the issue
Approach 1: Convert all time data to strings
In this solution, we used the Pandas library to determine if a column contained timestamp data. If it did, we converted that data to a string using the `%Y-%m-%d %H:%M:%S.%f %z` pattern. When loading data from Snowflake, we converted these timestamp strings back to timestamps.

The main issue with this solution is that Dagster is changing the type of the data in a user's DataFrame. Some users were confused about why their time data was being stored as strings.
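For illustration, a minimal sketch of this kind of round-trip conversion; `df` stands for the dataframe being stored, and the helper names are ours, not the I/O manager's actual implementation:

```python
import pandas as pd

TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f %z"

def timestamps_to_strings(df: pd.DataFrame) -> pd.DataFrame:
    # Before upload: render every timestamp column as text
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_datetime64_any_dtype(out[col]):
            out[col] = out[col].dt.strftime(TIME_FORMAT)
    return out

def strings_to_timestamps(df: pd.DataFrame, time_cols: list[str]) -> pd.DataFrame:
    # After download: parse the known time columns back to timestamps
    # (strip() removes the trailing space %z leaves for tz-naive values)
    out = df.copy()
    for col in time_cols:
        out[col] = pd.to_datetime(out[col].str.strip())
    return out
```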
Approach 2: Add timezones to timestamp data
As pointed out in this [comment](snowflakedb/snowflake-connector-python#319 (comment)) in the `snowflake-connector-python` issue, Snowflake will properly store timestamp data if it has a timezone attached. We updated the Snowflake I/O manager to automatically add the UTC timezone to timestamp data without timezones. To maintain backwards compatibility and to allow users to choose how they want their data stored, we retained the ability to store timestamp data as strings via configuration.

The main issue with this solution is that timestamp data with non-UTC timezones is automatically converted to UTC by the `snowflake-connector-python`, and we have no way to convert the data back to the original timezone when loading it from Snowflake. Additionally, we are still modifying user data with this approach, though no longer changing its type.
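For illustration, a minimal sketch of the UTC-localization step; the helper name is ours:

```python
import pandas as pd

def add_utc_timezone(df: pd.DataFrame) -> pd.DataFrame:
    # Attach UTC to tz-naive timestamp columns before upload; tz_localize
    # labels the values without shifting the wall-clock times
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_datetime64_any_dtype(out[col]) and out[col].dt.tz is None:
            out[col] = out[col].dt.tz_localize("UTC")
    return out
```

Columns that already carry a non-UTC timezone are left alone here, and it is exactly those columns that the connector then normalizes to UTC on upload.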
Reflections on the attempted solutions

Neither of the above solutions is perfect, and until Snowflake fixes the underlying issues in its Python connector, it is unlikely that we will be able to provide a solution that works in all cases for all users. So we'd like to understand more about how the community is using Snowflake, and what kinds of tradeoffs you are willing to make. Here are some proposals:
Proposed solutions
Proposal 1: Snowflake I/O manager simply forwards user data to Snowflake
In this solution, the Snowflake I/O manager would not attempt to modify user data to avoid the problems in the `snowflake-connector-python`. Users would be responsible for modifying their data (i.e., converting to strings, adding UTC timezones, etc.) before returning it from their assets. Dagster would provide documentation on the existing issues and potential solutions to assist users in dealing with them.

This solution would give users full control over how their data is stored in Snowflake, but it would also require them to understand the issues and perform their own data quality checks.
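Under this proposal, a user-side workaround might look like the following sketch; the asset and column names are hypothetical:

```python
import pandas as pd
from dagster import asset

@asset
def orders() -> pd.DataFrame:
    df = pd.DataFrame(
        {
            "order_id": [1, 2],
            "ordered_at": pd.to_datetime(["2008-01-01 12:00", "2008-06-01 08:30"]),
        }
    )
    # The user, not the I/O manager, attaches UTC before returning the dataframe
    df["ordered_at"] = df["ordered_at"].dt.tz_localize("UTC")
    return df
```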
Proposal 2: Maintain the status quo
In this solution, we would maintain the currently implemented solutions and make some usability improvements. For example, we could do a better job of warning users in logs when their timestamp data has non-UTC timezones that will be normalized to UTC by Snowflake.
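As a sketch of what that warning could look like (the helper name and logging setup are illustrative):

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)

def warn_on_non_utc_timezones(df: pd.DataFrame) -> None:
    # Flag columns whose timezone information will be normalized away
    for col in df.columns:
        if pd.api.types.is_datetime64_any_dtype(df[col]):
            tz = df[col].dt.tz
            if tz is not None and str(tz) != "UTC":
                logger.warning(
                    "Column %s has timezone %s; Snowflake will convert it to UTC.",
                    col,
                    tz,
                )
```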
Some other cases to consider
There are other instances where the Snowflake connector does not store data as expected:
We'd like our solution to the timestamp issue to be applicable to these cases as well. For example, if we choose not to modify user data at all, we would not intervene in either of the above cases. But if we do modify timestamp data, we would need to consider modifying data in these cases as well.
Conclusion
So we’d like to hear from you! Do the issues with storing time data affect you? Have you attempted to solve them in the past? If so, what was your approach? Which of the above proposals would you like to see implemented? Do you have other ideas for how we can solve this issue?