pyarrow.dataset.write_dataset do not preserve order #39030

xquyvu · 2023-12-01T14:13:45Z

Describe the bug, including details regarding any error messages, version, and platform.

As described in the title, when writing a dataset with pyarrow.dataset.write_dataset, the row order is not preserved. I have tested this with both the Parquet and CSV file formats.

import pyarrow.parquet as pq
import numpy as np
import pandas as pd
import pyarrow.dataset
from pathlib import Path


data_load_path = './data.parquet'
pyarrow_dataset_write_path = './pyarrow_saved_data.parquet'

data = pd.DataFrame({'col': np.arange(1e7)})
data.to_parquet(data_load_path)

# Check if data loaded with pandas and pyarrow are the same
pyarrow_dataset = pyarrow.dataset.dataset(data_load_path, format='parquet')
pyarrow_dataset_df = pyarrow_dataset.to_table().to_pandas()

print((pyarrow_dataset_df['col'] == data['col']).all()) # True

# Write with pyarrow.dataset.write_dataset
pyarrow.dataset.write_dataset(
    pyarrow_dataset,
    pyarrow_dataset_write_path,
    format='parquet',
)

loaded_pyarrow_dataset = pyarrow.dataset.dataset(pyarrow_dataset_write_path, format='parquet')
loaded_pyarrow_dataset_df = loaded_pyarrow_dataset.to_table().to_pandas()
print((loaded_pyarrow_dataset_df['col'] == data['col']).all()) # False
print((loaded_pyarrow_dataset_df['col'] == data['col']).mean()) # 0.29

# Write with pq.write_to_dataset
pq.write_to_dataset(
    pyarrow_dataset,
    'x.parquet',
    existing_data_behavior='delete_matching'
)

print((pyarrow.dataset.dataset('x.parquet').to_table().to_pandas()['col'] == data['col']).all()) # True

Component(s)

Python


mapleFU · 2023-12-01T15:20:11Z

Just curious, does to_parquet guarantee the ordering?

xquyvu · 2023-12-01T15:30:28Z

> Just curious, does to_parquet guarantee the ordering?

Yes.

xquyvu · 2024-01-09T17:52:28Z

Hello, any updates on this? Thanks!

u3Izx9ql7vW4 · 2024-07-12T20:54:33Z

Interested in this as well. It would be great if there were a way to ensure ordering for datasets.

xquyvu added the Type: bug label Dec 1, 2023

github-actions bot added the Component: Python label Dec 1, 2023

mikeburkat mentioned this issue Jul 12, 2024

Write monotonic sequence, but read is non monotonic delta-io/delta-rs#2659

Open

This was referenced Jul 12, 2024

[C++][Dataset] Preserve order when writing dataset #26818

Open

[Python] Dataset sorting_columns support request #43239

Open
