pyarrow.dataset.write_dataset do not preserve order #39030

Open
xquyvu opened this issue Dec 1, 2023 · 4 comments

xquyvu commented Dec 1, 2023

Describe the bug, including details regarding any error messages, version, and platform.

As the title says, when writing data with pyarrow.dataset.write_dataset, the row order is not preserved. I have tested this with both the Parquet and CSV file formats.

import pyarrow.parquet as pq
import numpy as np
import pandas as pd
import pyarrow.dataset
from pathlib import Path


data_load_path = './data.parquet'
pyarrow_dataset_write_path = './pyarrow_saved_data.parquet'

data = pd.DataFrame({'col': np.arange(1e7)})
data.to_parquet(data_load_path)

# Check if data loaded with pandas and pyarrow are the same
pyarrow_dataset = pyarrow.dataset.dataset(data_load_path, format='parquet')
pyarrow_dataset_df = pyarrow_dataset.to_table().to_pandas()

print((pyarrow_dataset_df['col'] == data['col']).all()) # True

# Write with pyarrow.dataset.write_dataset
pyarrow.dataset.write_dataset(
    pyarrow_dataset,
    pyarrow_dataset_write_path,
    format='parquet',
)

loaded_pyarrow_dataset = pyarrow.dataset.dataset(pyarrow_dataset_write_path, format='parquet')
loaded_pyarrow_dataset_df = loaded_pyarrow_dataset.to_table().to_pandas()
print((loaded_pyarrow_dataset_df['col'] == data['col']).all()) # False
print((loaded_pyarrow_dataset_df['col'] == data['col']).mean()) # 0.29

# Write with pq.write_to_dataset
pq.write_to_dataset(
    pyarrow_dataset,
    'x.parquet',
    existing_data_behavior='delete_matching'
)

print((pyarrow.dataset.dataset('x.parquet').to_table().to_pandas()['col'] == data['col']).all()) # True

Component(s)

Python

mapleFU (Member) commented Dec 1, 2023

Just curious, does to_parquet guarantee the ordering?

xquyvu (Author) commented Dec 1, 2023

> Just curious, does to_parquet guarantee the ordering?

Yes.

xquyvu (Author) commented Jan 9, 2024

Hello, are there any updates on this? Thanks!

u3Izx9ql7vW4 commented

Interested in this as well. It would be great if there were a way to ensure ordering for datasets.
