Dataframes with datetime indexes and timezone can be written but not read - TypeError: data type not understood #433

LorisMarini · 2019-05-16T04:01:54Z

Description

This bug concerns pandas dataframes that contain datetime data with timezone information. When such data is in a dataframe column, both writing to and reading from a parquet file succeed. When the same data is the frame's index, writing succeeds but reading fails. Note that if the datetime data is timezone-naive (e.g. pd.date_range(start="2019-01-01", end="2019-05-01", freq="M", tz=None)), everything works fine.

Packages versions

pandas=0.24.2
fastparquet=0.3.1 (conda-forge)

How to reproduce the error

import numpy as np
import pandas as pd

# Prepare dataframe
data_array = np.empty((4,3))
data_array[:] = np.nan
index = pd.DatetimeIndex(pd.date_range(start="2019-01-01", end="2019-05-01", freq="M", tz="UTC"))
df = pd.DataFrame(index=index, data=data_array, columns=["A", "B", "C"])

# Move Datetime data to the columns
A = df.reset_index()

# Keep Datetime on the index
B = df

# Path used for both round trips
path = "/tmp/test_pkl_to_par.parquet"

# Write and read back A (works fine)
A.to_parquet(path, engine="fastparquet", compression='snappy', file_scheme='simple')
_ = pd.read_parquet(path, engine='fastparquet', columns=None)
print("Save and read back A: SUCCESS.")

# Write and read back B (Raises error)
B.to_parquet(path, engine="fastparquet", compression='snappy', file_scheme='simple')
_ = pd.read_parquet(path, engine='fastparquet', columns=None)

Output

Save and read back A: SUCCESS.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-105-332bf3f633c3> in <module>()
     16 # Save and read back B
     17 B.to_parquet(path, engine="fastparquet", compression='snappy', file_scheme='simple')
---> 18 pd.read_parquet(path, engine='fastparquet', columns=None)

/opt/conda/envs/myenv/lib/python3.6/site-packages/pandas/io/parquet.py in read_parquet(path, engine, columns, **kwargs)
    280 
    281     impl = get_engine(engine)
--> 282     return impl.read(path, columns=columns, **kwargs)

/opt/conda/envs/myenv/lib/python3.6/site-packages/pandas/io/parquet.py in read(self, path, columns, **kwargs)
    209             parquet_file = self.api.ParquetFile(path)
    210 
--> 211         return parquet_file.to_pandas(columns=columns, **kwargs)
    212 
    213 

/opt/conda/envs/myenv/lib/python3.6/site-packages/fastparquet/api.py in to_pandas(self, columns, categories, filters, index)
    416             columns += [i for i in index if i not in columns]
    417         check_column_names(self.columns + list(self.cats), columns, categories)
--> 418         df, views = self.pre_allocate(size, columns, categories, index)
    419         start = 0
    420         if self.file_scheme == 'simple':

/opt/conda/envs/myenv/lib/python3.6/site-packages/fastparquet/api.py in pre_allocate(self, size, columns, categories, index)
    440         categories = self.check_categories(categories)
    441         return _pre_allocate(size, columns, categories, index, self.cats,
--> 442                              self._dtypes(categories), self.tz)
    443 
    444     @property

/opt/conda/envs/myenv/lib/python3.6/site-packages/fastparquet/api.py in _pre_allocate(size, columns, categories, index, cs, dt, tz)
    555     dtypes.extend(['category'] * len(cs))
    556     df, views = dataframe.empty(dtypes, size, cols=cols, index_names=index,
--> 557                                 index_types=index_types, cats=cats, timezones=tz)
    558     return df, views
    559 

/opt/conda/envs/myenv/lib/python3.6/site-packages/fastparquet/dataframe.py in empty(types, size, cats, cols, index_types, index_names, timezones)
    114             views[col+'-catdef'] = index._data
    115         else:
--> 116             d = np.empty(size, dtype=t)
    117             index = Index(d)
    118             views[col] = index.values

TypeError: data type not understood
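
Since the round trip of A succeeds, one possible workaround until a fix lands is to move the timezone-aware index into a column before writing and restore it after reading. The snippet below is only a sketch of that idea and has not been checked against fastparquet itself: the parquet calls are left as comments, and it assumes the column comes back from the file with its timezone intact, as the report for A suggests. The column name "index" is the pandas default for an unnamed index.

```python
import numpy as np
import pandas as pd

# Small frame with a timezone-aware index, similar in shape to the report
index = pd.date_range("2019-01-31", periods=4, freq="D", tz="UTC")
df = pd.DataFrame(np.nan, index=index, columns=["A", "B", "C"])

# Move the index into an ordinary column (named "index" by default)
flat = df.reset_index()

# Assumed round trip, as in the report for A (not executed here):
# flat.to_parquet(path, engine="fastparquet")
# flat = pd.read_parquet(path, engine="fastparquet")

# Restore the index and drop the temporary name
restored = flat.set_index("index")
restored.index.name = None
```

With this approach the file stores the timestamps as a regular column, so any reader of the file has to know to promote that column back to the index.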


martindurant · 2019-05-16T12:31:33Z

There was a similar issue not too long ago, which was fixed. I don't know if the index was treated as part of that fix. Can you try with fastparquet from master?

otmezger · 2019-05-23T12:32:06Z

I have the same issue, installed fastparquet from master and the problem is still there...

martindurant · 2019-05-23T13:13:15Z

Copying the solution for columns looks like this; would you mind trying it?

--- a/fastparquet/dataframe.py
+++ b/fastparquet/dataframe.py
@@ -119,7 +119,17 @@ def empty(types, size, cats=None, cols=None, index_types=None, index_names=None,
             views[col] = vals
             views[col+'-catdef'] = index._data
         else:
+            if hasattr(t, 'base'):
+                # funky pandas not-dtype
+                t = t.base
             d = np.empty(size, dtype=t)
+            if d.dtype.kind == "M" and six.text_type(col) in timezones:
+                try:
+                    d = Series(d).dt.tz_localize(timezones[six.text_type(col)])
+                except:
+                    warnings.warn("Inferring time-zone from %s in column %s "
+                                  "failed, using time-zone-agnostic"
+                                  "" % (timezones[six.text_type(col)], col))
             index = Index(d)
             views[col] = index.values
     else:
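
The idea behind the patch can be illustrated outside fastparquet. The sketch below is not the library's code: it reproduces the failure with plain NumPy and pandas, then applies the same three steps as the diff (take the dtype's NumPy base, allocate, localize). It uses np.zeros instead of np.empty so the values are deterministic, and the timezone "UTC" and length 4 are illustrative choices.

```python
import numpy as np
import pandas as pd

# The dtype of a timezone-aware index is a pandas extension dtype
tz_dtype = pd.DatetimeTZDtype(tz="UTC")

# NumPy cannot interpret it, which is where the traceback ends
try:
    np.empty(4, dtype=tz_dtype)
    numpy_accepts = True
except TypeError:
    numpy_accepts = False

# Step 1: fall back to the underlying NumPy dtype (datetime64[ns])
base = tz_dtype.base if hasattr(tz_dtype, "base") else tz_dtype

# Step 2: allocate with the NumPy dtype (zeros here, for determinism)
d = np.zeros(4, dtype=base)

# Step 3: re-attach the timezone before building the index
localized = pd.Series(d).dt.tz_localize("UTC")
index = pd.Index(localized)
```

Here numpy_accepts ends up False, while index is a timezone-aware DatetimeIndex of the requested length.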

callumstew · 2019-06-12T16:28:08Z

I was having the same issue, but the above fix works for me

Fixes dask#433

martindurant · 2019-06-12T16:36:19Z

Thanks for the ping, I forgot about this.

Fixes #433

martindurant pushed a commit to martindurant/fastparquet that referenced this issue Jun 12, 2019

Apply timezone to index

a9c754f

Fixes dask#433

martindurant mentioned this issue Jun 12, 2019

Apply timezone to index #439

Merged

martindurant closed this as completed in #439 Jun 13, 2019

martindurant added a commit that referenced this issue Jun 13, 2019

Apply timezone to index (#439)

6c20042

Fixes #433
