Currently the min and max column statistics are returned as formatted strings of the physical type. This makes them a bit tricky to use from Python, as each string needs to be parsed back into the proper logical type. Observe:
In [20]: import pandas as pd

In [21]: df = pd.DataFrame({'a': [1, 2, 3],
    ...:                    'b': ['a', 'b', 'c'],
    ...:                    'c': [pd.Timestamp('1991-01-01')]*3})
    ...:

In [22]: df.to_parquet('temp.parquet', engine='pyarrow')

In [23]: from pyarrow import parquet as pq

In [24]: f = pq.ParquetFile('temp.parquet')

In [25]: rg = f.metadata.row_group(0)

In [26]: rg.column(0).statistics.min  # string instead of integer
Out[26]: '1'

In [27]: rg.column(1).statistics.min  # weird space added after value due to formatter
Out[27]: 'a '

In [28]: rg.column(2).statistics.min  # formatted as physical type (int) instead of logical (datetime)
Out[28]: '662688000000'
Since the type information is known, it should be possible to convert these to arrow values instead of strings.
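Until that happens, the parsing has to be done by hand. A minimal sketch of such a workaround, using only the standard library, might look like the following. The `parse_stat` helper and its `logical_type` labels are hypothetical names made up for this illustration, not part of pyarrow, and the sketch assumes the formatter appends exactly one trailing space to strings and that timestamps are stored as milliseconds since the Unix epoch, as in the session above.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def parse_stat(raw, logical_type):
    """Convert a formatted statistics string into a Python value.

    `logical_type` is a hypothetical label used only in this sketch;
    it is not a pyarrow type object.
    """
    if logical_type == "int64":
        return int(raw)
    if logical_type == "string":
        # Assumption: the formatter appends exactly one trailing space.
        return raw[:-1] if raw.endswith(" ") else raw
    if logical_type == "timestamp[ms]":
        # Assumption: the physical value is milliseconds since the epoch.
        return EPOCH + timedelta(milliseconds=int(raw))
    raise ValueError(f"unsupported logical type: {logical_type!r}")


# The three values from the session above.
min_a = parse_stat("1", "int64")                       # 1
min_b = parse_stat("a ", "string")                     # 'a'
min_c = parse_stat("662688000000", "timestamp[ms]")    # datetime(1991, 1, 1, 0, 0)
```

Stripping the trailing space is lossy if a real string value ends in a space, which is one more reason to return typed values directly.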
Reporter: Jim Crist / @jcrist
Assignee: Wes McKinney / @wesm
PRs and other links:
Note: This issue was originally created as ARROW-1982. Please see the migration documentation for further details.