import os

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

df = pd.DataFrame(dict(symbol=["A", "B", "C", "D"], year=[2017, 2018, 2019, 2020], close=np.arange(4)))
root_path = "test"
os.makedirs(root_path, exist_ok=True)

table = pa.Table.from_pandas(df)
print(f"\nbefore:\n{table.schema.to_string(show_field_metadata=False)}")

# write as a hive-partitioned dataset, then read it back through the dataset API
pq.write_to_dataset(table, root_path=root_path, partition_cols=["symbol", "year"])
dataset = ds.dataset(root_path, format="parquet", partitioning="hive")
table2 = dataset.to_table()
print(f"\nafter:\n{table2.schema.to_string(show_field_metadata=False)}")
before:
symbol: string
year: int64
close: int64
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 582
after:
close: int64
symbol: string
year: int32
-- schema metadata --
pandas: '{"index_columns": [], "column_indexes": [{"name": null, "field_n' + 300
The schema comes back different after the round trip, i.e. the column ordering and types are not preserved. I suspect this is due to the partitioning. Should I be storing additional metadata myself and using it when subsequently reading the data back?
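
For now, the workaround I have in mind is to keep the original schema around myself and reorder/cast the table after reading it back, roughly like this (a rough sketch reusing table and root_path from above; I'm not sure this is the intended approach):

import pyarrow.dataset as ds

# schema saved from the original table, before writing the partitioned dataset
original_schema = table.schema

dataset = ds.dataset(root_path, format="parquet", partitioning="hive")
restored = dataset.to_table()

# put the columns back into the original order, then cast the types
# (e.g. year comes back as int32 and is cast back to int64)
restored = restored.select(original_schema.names).cast(original_schema)
print(restored.schema.to_string(show_field_metadata=False))

Even with that workaround, I'd like to know whether there is a built-in way to have the dataset preserve the original schema.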
Thanks