[RFC] How should the Snowflake I/O manager handle Pandas time data? #13969
Replies: 1 comment
Thanks to the Dagster team for your hard work on creating and enhancing this great tool.

Use Cases and Expectations

I see two main categories of use cases for uploading dataframes with the Snowflake Resource / IO Manager:

1. Round-tripping: Snowflake is used as a storage backend, and a dataframe uploaded by one asset should come back unchanged when downloaded by another.
2. Downstream operations: the uploaded data is operated on in Snowflake itself, e.g., via SQL, dbt, or Snowpark.
The needs and expectations of the Snowflake Resource / IO Manager's behavior depend on the use case.

For Use Case 1: The expectation is that the dataframe can make a round-trip (to and from Snowflake) without any impactful changes. It is acceptable for datetimes to be stored as text in Snowflake as long as they are converted back to datetimes when downloading. Any datatype or timezone change during the round-trip would be unexpected and undesirable.

For Use Case 2: The expectation is that the dataframe types can be translated to the closest and most sensible types in Snowflake, so operations can be done (e.g., in SQL, dbt, or Snowpark) without friction. The data type mapping should be sensible, known, and reproducible, taking into account timezones and other special cases when possible.

Dagster should ideally support both use cases, but doing so would involve increased complexity in the implementation.

Issues

The main issue affecting both use cases is that types are often translated unexpectedly when uploading dataframes with snowflake-connector-python or sqlalchemy.

Issue Examples
Solution Ideas

For Use Case 1: One possible implementation would be for Dagster to add a new column to the DataFrame containing the serialized values of the other columns when uploading a dataframe. This would ensure that round-trips do not change the data types or timezones.

```python
import pandas as pd
import pickle

# Example dataframe (illustrative)
df = pd.DataFrame({'year': [2008], 'ts': pd.to_datetime(['2008-01-01 12:00'])})

# Function to serialize a row
def serialize_row(row):
    return pickle.dumps(row.to_dict())

# Add a new column with the serialized values
df['Serialized'] = df.apply(serialize_row, axis=1)

# Function to deserialize a row
def deserialize_row(serialized_row):
    return pd.Series(pickle.loads(serialized_row))

# Rebuild the DataFrame using the 'Serialized' column
rebuilt_df = df['Serialized'].apply(deserialize_row)
```

Pros:
Cons:
For Use Case 2: I don't see a simple solution that would fix all issues. One possibility is to create tables explicitly when uploading a dataframe. If the table does not already exist in Snowflake, Dagster could create it explicitly, using its own datatype mapping. This would avoid relying on snowflake-connector-python, which often infers types incorrectly. A sketch of this idea follows the pros and cons below.

Pros:
Cons:
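To make the explicit-creation idea concrete, here is a minimal sketch, assuming a snowflake-connector-python connection `conn`; the `DTYPE_TO_SNOWFLAKE` mapping and helper names are illustrative, not part of Dagster or the connector:

```python
import pandas as pd
from snowflake.connector.pandas_tools import write_pandas

# Illustrative Pandas-dtype -> Snowflake-type mapping (not Dagster's actual mapping)
DTYPE_TO_SNOWFLAKE = {
    "int64": "NUMBER",
    "float64": "FLOAT",
    "bool": "BOOLEAN",
    "datetime64[ns]": "TIMESTAMP_NTZ",
    "datetime64[ns, UTC]": "TIMESTAMP_TZ",
    "object": "VARCHAR",
}

def create_table_explicitly(conn, df: pd.DataFrame, table_name: str) -> None:
    # Build the column list from a known mapping instead of letting the
    # connector infer column types
    columns = ", ".join(
        f'"{col}" {DTYPE_TO_SNOWFLAKE.get(str(dtype), "VARCHAR")}'
        for col, dtype in df.dtypes.items()
    )
    conn.cursor().execute(f'CREATE TABLE IF NOT EXISTS "{table_name}" ({columns})')

# Usage: create the table with known types first, then append the rows
# create_table_explicitly(conn, df, "MY_TABLE")
# write_pandas(conn, df, "MY_TABLE")
```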
I hope these suggestions can contribute to the ongoing improvement of Dagster, and I look forward to seeing the project's continued growth and success. Keep up the good work!
---
Over the past few months, a number of issues with how the `snowflake-connector-python` handles time data have been uncovered. We have made some attempts to work around these issues, but each solution has presented its own problems.

Background information

Storing Pandas timestamp data without a timezone via the `snowflake-connector-python` will cause the data to become mangled (for example, the year 2008 will be changed to 52170; when you try to load the data in a downstream asset, Pandas will error because 52170 is an invalid year). The maintainers have decided they will not prioritize fixing this issue because some workarounds exist.

Our attempts to work around the issue
Approach 1: Convert all time data to strings
In this solution, we used the Pandas library to determine if a column contained timestamp data. If it did, we converted that data to a string using the `%Y-%m-%d %H:%M:%S.%f %z` pattern. When loading data from Snowflake, we converted these timestamp strings back to timestamps.

The main issue with this solution is that Dagster is changing the type of the data in a user's DataFrame. Some users were confused about why their time data was being stored as strings.
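For illustration, a minimal sketch of this kind of round-trip conversion; `df` stands for the dataframe being stored, and the helper names are ours, not the I/O manager's actual implementation:

```python
import pandas as pd

TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f %z"

def timestamps_to_strings(df: pd.DataFrame) -> pd.DataFrame:
    # Before upload: render every timestamp column as text
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_datetime64_any_dtype(out[col]):
            out[col] = out[col].dt.strftime(TIME_FORMAT)
    return out

def strings_to_timestamps(df: pd.DataFrame, time_cols: list[str]) -> pd.DataFrame:
    # After download: parse the known time columns back to timestamps
    # (strip() removes the trailing space %z leaves for tz-naive values)
    out = df.copy()
    for col in time_cols:
        out[col] = pd.to_datetime(out[col].str.strip())
    return out
```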
Approach 2: Add timezones to timestamp data
As pointed out in this [comment](snowflakedb/snowflake-connector-python#319 (comment)) in the `snowflake-connector-python` issue, Snowflake will properly store timestamp data if it has a timezone attached. We updated the Snowflake I/O manager to automatically add the UTC timezone to timestamp data without timezones. To maintain backwards compatibility and to allow users to choose how they want their data stored, we retained the ability to store timestamp data as strings via configuration.

The main issue with this solution is that timestamp data with non-UTC timezones is automatically converted to UTC by the `snowflake-connector-python`, and we have no way to convert the data back to the original timezone when loading it from Snowflake. Additionally, we are still modifying user data with this approach, though no longer changing its type.
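For illustration, a minimal sketch of the UTC-localization step; the helper name is ours:

```python
import pandas as pd

def add_utc_timezone(df: pd.DataFrame) -> pd.DataFrame:
    # Attach UTC to tz-naive timestamp columns before upload; tz_localize
    # labels the values without shifting the wall-clock times
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_datetime64_any_dtype(out[col]) and out[col].dt.tz is None:
            out[col] = out[col].dt.tz_localize("UTC")
    return out
```

Columns that already carry a non-UTC timezone are left alone here, and it is exactly those columns that the connector then normalizes to UTC on upload.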
Reflections on the attempted solutions

Neither of the above solutions is perfect, and until Snowflake fixes the underlying issues in its Python connector, it is unlikely that we will be able to provide a solution that works in all cases for all users. So we'd like to understand more about how the community is using Snowflake, and what kinds of tradeoffs you are willing to make. Here are some proposals:
Proposed solutions
Proposal 1: Snowflake I/O manager simply forwards user data to Snowflake
In this solution, the Snowflake I/O manager would not attempt to modify user data to avoid the problems in the `snowflake-connector-python`. Users would be responsible for modifying their data (i.e., converting to strings, adding UTC timezones, etc.) before returning it from their assets. Dagster would provide documentation on the existing issues and potential solutions to assist users in dealing with them.

This solution would give users full control over how their data is stored in Snowflake, but it would also require them to understand the issues and perform their own data quality checks.
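Under this proposal, a user-side workaround might look like the following sketch; the asset and column names are hypothetical:

```python
import pandas as pd
from dagster import asset

@asset
def orders() -> pd.DataFrame:
    df = pd.DataFrame(
        {
            "order_id": [1, 2],
            "ordered_at": pd.to_datetime(["2008-01-01 12:00", "2008-06-01 08:30"]),
        }
    )
    # The user, not the I/O manager, attaches UTC before returning the dataframe
    df["ordered_at"] = df["ordered_at"].dt.tz_localize("UTC")
    return df
```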
Proposal 2: Maintain the status quo
In this solution, we would maintain the currently implemented solutions and make some usability improvements. For example, we could do a better job of warning users in logs when their timestamp data has non-UTC timezones that will be normalized to UTC by Snowflake.
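As a sketch of what that warning could look like (the helper name and logging setup are illustrative):

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)

def warn_on_non_utc_timezones(df: pd.DataFrame) -> None:
    # Flag columns whose timezone information will be normalized away
    for col in df.columns:
        if pd.api.types.is_datetime64_any_dtype(df[col]):
            tz = df[col].dt.tz
            if tz is not None and str(tz) != "UTC":
                logger.warning(
                    "Column %s has timezone %s; Snowflake will convert it to UTC.",
                    col,
                    tz,
                )
```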
Some other cases to consider
There are other instances where the Snowflake connector does not store data as expected:
We'd like our solution to the timestamp issue to be applicable to these cases as well. For example, if we choose not to modify user data at all, we would not intervene in either of the above cases. But if we do modify timestamp data, we would need to consider modifying data in these cases as well.
Conclusion
So we’d like to hear from you! Do the issues with storing time data affect you? Have you attempted to solve them in the past? If so, what was your approach? Which of the above proposals would you like to see implemented? Do you have other ideas for how we can solve this issue?