Sort order lost in pq.write_to_dataset() in version 8.0.0 vs. 6.0.1 #13262

@MMCMA

I noticed an issue with the row order when writing to a partitioned dataset. The code below works in version 6.0.1 but fails in 8.0.0, apparently because the rows (and therefore the index) are stored in a different order. Calling sort_index() on the resulting DataFrame solves the problem, but I would like to avoid this computationally expensive operation.

    import pandas as pd
    import numpy as np
    import pyarrow as pa
    import pyarrow.dataset as ds
    import pyarrow.parquet as pq
    from pathlib import Path

    rows = 500
    columns = 300
    project_path =  '/TBD'
    path = Path(project_path) / 'sort_issue'
    path.mkdir()

    # Build a long-format frame indexed by (business day, column number)
    data = np.random.normal(size=(rows, columns))
    index = pd.date_range('19900101', periods=rows, freq='B')
    data = pd.DataFrame(data=data, index=index).stack().to_frame('a')
    # Partition key: the calendar year of the first index level
    data['year'] = data.index.get_level_values(0).year


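    # Write the table as a dataset partitioned by year, then read it back
    tbl = pa.Table.from_pandas(data)  # NOQA
    pq.write_to_dataset(tbl, root_path=path, partition_cols=['year'])
    tmp_ds = ds.dataset(path, format="parquet")

    data_disk = tmp_ds.to_table().to_pandas()

    # Works in 6.0.1; fails in 8.0.0 because the rows come back in a different order
    data_disk.loc['19910101': '19911231']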
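
For reference, a minimal sketch of the sort_index() workaround mentioned above, written so that the sort only runs when the index actually came back out of order. It uses pandas only and simulates the reordering by reversing a small frame. The helper name ensure_sorted is hypothetical, and this does not remove the cost of sorting when the order really was lost; it only skips the sort when the order was preserved.

```python
import pandas as pd


def ensure_sorted(frame: pd.DataFrame) -> pd.DataFrame:
    """Return frame unchanged if its index is already sorted, else a sorted copy.

    Hypothetical helper for illustration; the monotonicity check is a single
    pass over the index, which is cheaper than an unconditional sort.
    """
    if frame.index.is_monotonic_increasing:
        return frame
    return frame.sort_index()


# Small stand-in for the stacked frame in the report: (business day, column number)
dates = pd.date_range("19900101", periods=6, freq="B")
idx = pd.MultiIndex.from_product([dates, range(2)])
frame = pd.DataFrame({"a": [float(i) for i in range(12)]}, index=idx)

# Simulate rows coming back from disk in a different order
shuffled = frame.iloc[::-1]

restored = ensure_sorted(shuffled)

# Label-based date slicing works again once the index is sorted
subset = restored.loc["19900102":"19900103"]
```

If the frame read from disk is already in order, ensure_sorted returns it as-is, so the expensive path is only taken when it is needed.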
