
Write monotonic sequence, but read is non monotonic #2659

Closed
mikeburkat opened this issue Jul 11, 2024 · 2 comments
Labels
binding/python Issues for the Python package bug Something isn't working

Comments

@mikeburkat

Environment

Delta-rs version: 0.18.2

Binding: python

Environment:

  • Cloud provider: None
  • OS: macOS Ventura Version 13.6.2 (22G320)
  • Other: pyarrow==16.1.0 pyarrow-hotfix==0.6 pandas==2.2.2

Bug

What happened:
I wrote a monotonically increasing sequence into a Delta Lake table using the pyarrow engine.
When reading the table back, the data is no longer monotonically increasing.

What you expected to happen: I expect the data to remain monotonically increasing.
The rust engine works as expected; however, the pyarrow engine appears to re-order the data.

How to reproduce it:
A minimal example that reproduces the bug consistently on my laptop:

# out_of_order.py

import argparse
import os

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

from deltalake import write_deltalake, DeltaTable


def write_data_file(file, schema, length, batch_size):
    with pq.ParquetWriter(file,
                          schema=schema,
                          compression='gzip',
                          compression_level=6) as writer:

        for i in range(0, length, batch_size):
            rows = min(i + batch_size, length) - i
            df = pd.DataFrame(range(i, i + rows, 1), columns=['increment'])
            batch = pa.record_batch(schema=schema, data=df)
            writer.write_batch(batch)

    df = pd.read_parquet(file)
    assert df['increment'].is_monotonic_increasing, 'data file not monotonic'


def write_delta(engine, uri, schema, file, batch_size):
    with pq.ParquetFile(file) as data:
        write_deltalake(table_or_uri=uri,
                        data=data.iter_batches(batch_size=batch_size),
                        schema=schema,
                        mode='overwrite',
                        engine=engine)


def assert_monotonic(engine, uri):
    dt = DeltaTable(uri)
    assert dt.to_pandas()['increment'].is_monotonic_increasing, f'{engine} not monotonic'


if __name__ == '__main__':
    parser = argparse.ArgumentParser(usage=__doc__, formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument('--path', required=True, help='Deltalake table path')
    parser.add_argument('--length', default=62914561, type=int, help='Dataset length')
    parser.add_argument('--batch-size', default=100_000, type=int, help='Batch size')
    args = parser.parse_args()

    schema = pa.schema([
        pa.field('increment', pa.int64(), nullable=False),
    ])

    os.makedirs(args.path, exist_ok=True)
    file = args.path + '/monotonic'
    write_data_file(file, schema, args.length, args.batch_size)

    uri = args.path + '/rust'
    write_delta('rust', uri, schema, file, args.batch_size)
    assert_monotonic('rust', uri)

    uri = args.path + '/pyarrow'
    write_delta('pyarrow', uri, schema, file, args.batch_size)
    assert_monotonic('pyarrow', uri)

Run using the command:

python out_of_order.py --path $PWD/out_of_order

The following exception is raised:

Traceback (most recent call last):
  File ".../out_of_order.py", line 64, in <module>
    assert_monotonic('pyarrow', uri)
  File ".../out_of_order.py", line 40, in assert_monotonic
    assert dt.to_pandas()['increment'].is_monotonic_increasing, f'{engine} not monotonic'
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: pyarrow not monotonic

More details:
The file that appears to be out of order on my machine is part 5 ($PWD/out_of_order/pyarrow/0-6e748f6b-f69e-47dd-8857-dc652c73cfef-5.parquet).
The failure reproduces across multiple runs, implying it is somewhat deterministic; however, a different row group appears to be out of order each time, since inspecting the file shows a different increment value breaking monotonicity.

@mikeburkat mikeburkat added the bug Something isn't working label Jul 11, 2024
@rtyler rtyler added the binding/python Issues for the Python package label Jul 11, 2024
@mikeburkat
Author

mikeburkat commented Jul 12, 2024

Looks like this is mainly due to a known issue in pyarrow: apache/arrow#39030

However, I also found that a _delta_log transaction with multiple add actions can list its part files in unsorted order within the transaction. This contributes to the problem as well: if the add-action files are read in "transaction order", the data can appear unsorted even though the data within each individual file is ordered.

@ion-elgreco
Collaborator

This is something we cannot guarantee in the first place, and the PyArrow engine will be deprecated as of v0.19, so I am closing this issue.
