I noticed an issue with the row order when writing to a dataset store. The code below works in version 6.0.1 but fails in 8.0.0, apparently because the rows (and therefore the index) are stored or read back in a different order. Calling DataFrame.sort_index() on the result solves the problem, but I would like to avoid this computationally expensive operation.
import pandas as pd
import numpy as np
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq
from pathlib import Path
rows = 500
columns = 300
project_path = '/TBD'
path = Path(project_path) / 'sort_issue'
path.mkdir()
data = np.random.normal(size=(rows, columns))
index = pd.date_range('19900101', periods=rows, freq='B')  # business days
data = pd.DataFrame(data=data, index=index).stack().to_frame('a')
year = [x.year for x in data.index.get_level_values(0)]
data['year'] = year
tbl = pa.Table.from_pandas(data) # NOQA
pq.write_to_dataset(tbl, root_path=path, partition_cols=['year'])
tmp_ds = ds.dataset(path, format="parquet")
data_disk = tmp_ds.to_table().to_pandas()
data_disk.loc['19910101': '19911231']  # works with 6.0.1, fails with 8.0.0 because of the row order
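One possible way to sidestep the sort, sketched here with pandas only: select the date range with a boolean mask on the first index level, which does not depend on the rows being ordered. The helper name select_date_range, the reduced column count, and the shuffled frame standing in for the data read from disk are all assumptions made for illustration; this does not address why the order changed between versions.

```python
import numpy as np
import pandas as pd


def select_date_range(frame, start, end):
    """Return rows whose first index level lies in [start, end].

    Hypothetical helper: it builds a boolean mask in a single pass,
    so it does not require the index to be sorted.
    """
    dates = frame.index.get_level_values(0)
    mask = (dates >= pd.Timestamp(start)) & (dates <= pd.Timestamp(end))
    return frame[mask]


# Small stand-in for the frame in the report (3 columns instead of 300).
index = pd.date_range("19900101", periods=500, freq="B")
values = np.random.default_rng(0).normal(size=(500, 3))
frame = pd.DataFrame(values, index=index).stack().to_frame("a")

# Shuffle the rows to imitate an unordered read from disk.
shuffled = frame.sample(frac=1.0, random_state=0)

# Mask-based selection on the shuffled frame.
subset = select_date_range(shuffled, "1991-01-01", "1991-12-31")

# Label-based slice on the original, sorted frame for comparison.
expected = frame.loc["19910101":"19911231"]
```

The mask returns the same rows as the label slice, although in whatever order they were read, so a sort of the (much smaller) subset may still be needed if order matters downstream.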