Dataframes with datetime indexes and timezone can be written but not read - TypeError: data type not understood #433
Comments
There was a similar issue not too long ago, which was fixed. I don't know if the index was treated as part of that fix. Can you try with fastparquet from master?
I have the same issue. I installed fastparquet from master and the problem is still there.
Copying the solution for columns looks like this; would you mind trying it?
--- a/fastparquet/dataframe.py
+++ b/fastparquet/dataframe.py
@@ -119,7 +119,17 @@ def empty(types, size, cats=None, cols=None, index_types=None, index_names=None,
             views[col] = vals
             views[col+'-catdef'] = index._data
         else:
+            if hasattr(t, 'base'):
+                # funky pandas not-dtype
+                t = t.base
             d = np.empty(size, dtype=t)
+            if d.dtype.kind == "M" and six.text_type(col) in timezones:
+                try:
+                    d = Series(d).dt.tz_localize(timezones[six.text_type(col)])
+                except:
+                    warnings.warn("Inferring time-zone from %s in column %s "
+                                  "failed, using time-zone-agnostic"
+                                  "" % (timezones[six.text_type(col)], col))
             index = Index(d)
             views[col] = index.values
     else:
I was having the same issue, but the above fix works for me.
Thanks for the ping, I forgot about this.
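To make the idea behind the proposed patch easier to follow, here is a minimal standalone sketch of what the patched branch does. It is not fastparquet's actual code: the helper name `empty_index` and its `tz` parameter are hypothetical, and `np.zeros` stands in for `np.empty` so the result is deterministic. It assumes only that a pandas timezone-aware dtype exposes a NumPy dtype through its `base` attribute, as the patch itself relies on.

```python
import numpy as np
import pandas as pd


def empty_index(dtype, size, tz=None):
    """Hypothetical helper mirroring the patched index branch.

    NumPy cannot allocate an array from a pandas timezone-aware dtype,
    so the dtype is first reduced to its NumPy base and the timezone
    is re-attached afterwards.
    """
    if hasattr(dtype, "base"):
        # e.g. DatetimeTZDtype -> its underlying NumPy datetime64 dtype
        dtype = dtype.base
    # np.zeros instead of np.empty: an assumption made for determinism
    d = np.zeros(size, dtype=dtype)
    if d.dtype.kind == "M" and tz is not None:
        d = pd.Series(d).dt.tz_localize(tz)
    return pd.Index(d)


tz_dtype = pd.DatetimeTZDtype(tz="UTC")

# Without the `.base` step, NumPy rejects the pandas dtype with a TypeError.
try:
    np.zeros(3, dtype=tz_dtype)
except TypeError as exc:
    print("NumPy refused the pandas dtype:", exc)

idx = empty_index(tz_dtype, 3, tz="UTC")
print(idx.tz)
```

The exact wording of the `TypeError` depends on the NumPy version, so the sketch prints it rather than matching on it.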
Description
This bug concerns pandas DataFrames that contain timezone-aware datetime data. When such data is in a DataFrame column, both writing to and reading from a parquet file succeed. When the same data is the DataFrame's index, writing succeeds but reading fails. Note that if the datetime data is timezone-naive (e.g. pd.date_range(start="2019-01-01", end="2019-05-01", freq="M", tz=None)), everything works fine.
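Since timezone-aware data round-trips as a column but not as an index, one possible workaround is to move the index into a column before writing and restore it after reading. The sketch below shows only that conversion step, without touching parquet. The helper names and the `__index__` column name are hypothetical, and it assumes a single-level index.

```python
import numpy as np
import pandas as pd

INDEX_COL = "__index__"  # hypothetical name for the temporary column


def index_to_column(df):
    """Move a (possibly timezone-aware) index into a regular column."""
    return df.rename_axis(INDEX_COL).reset_index()


def column_to_index(df, name=None):
    """Restore the index from the temporary column."""
    out = df.set_index(INDEX_COL)
    out.index.name = name
    return out


index = pd.date_range(start="2019-01-01", periods=4, freq="D", tz="UTC")
df = pd.DataFrame(np.nan, index=index, columns=["A", "B", "C"])

flat = index_to_column(df)        # this is the frame that would be written
restored = column_to_index(flat)  # apply this to the frame read back

print(restored.index.equals(df.index))
```

In practice `flat` would be passed to `to_parquet` and `column_to_index` applied to the result of `read_parquet`. Any `freq` attribute on the original index is not preserved by this conversion.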
Package versions
pandas=0.24.2
fastparquet=0.3.1 (conda-forge)
How to reproduce the error
import numpy as np
import pandas as pd

# Prepare dataframe
data_array = np.empty((4, 3))
data_array[:] = np.nan
index = pd.DatetimeIndex(pd.date_range(start="2019-01-01", end="2019-05-01", freq="M", tz="UTC"))
df = pd.DataFrame(index=index, data=data_array, columns=["A", "B", "C"])
# Move Datetime data to the columns
A = df.reset_index()
# Keep Datetime on the index
B = df
# Output path used for both round trips
path = "/tmp/test_pkl_to_par.parquet"
# Write and read back A (works fine)
A.to_parquet(path, engine="fastparquet", compression='snappy', file_scheme='simple')
_ = pd.read_parquet(path, engine='fastparquet', columns=None)
print("Save and read back A: SUCCESS.")
# Write and read back B (raises an error on read)
B.to_parquet(path, engine="fastparquet", compression='snappy', file_scheme='simple')
_ = pd.read_parquet(path, engine='fastparquet', columns=None)
Output