CSV/JSON Writes Not Guaranteed to Preserve Expected Ordering #7536

@devinjdangelo

Describe the bug

The initial implementation of #7452 was intended to preserve the ordering of rows in CSV/JSON files when a user runs a query like:

COPY (select * from my_table order by my_col)
TO 'my_file.csv';

It is reasonable to expect the CSV to be ordered by my_col. When this function:

https://github.com/apache/arrow-datafusion/blob/561e0d7e87825aba224bf2eb9c3b8b5e1b725597/datafusion/core/src/datasource/file_format/write.rs#L310-L393

was updated to include the mpsc::channel, I believe we lost the guarantee that the expected file order is preserved. The channel is useful: it introduces backpressure, so memory requirements do not grow without bound when ObjectStore writes fall behind. However, I am not sure how to preserve the ordering of serialized RecordBatches within the channel construct.
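
One possible direction, offered only as a rough sketch and not as the code in write.rs: send the task handles through the bounded channel in submission order, instead of sending finished buffers as they complete. The consumer then waits on each handle in turn, so the output order matches the input order no matter which serialization finishes first, and the bounded channel still limits how much work is queued. The sketch below uses std threads and std::sync::mpsc::sync_channel as stand-ins for tokio tasks and tokio's mpsc::channel. The names serialize_batch and write_ordered are hypothetical, a Vec<u32> stands in for a RecordBatch, and a Vec<u8> stands in for the ObjectStore writer.

```rust
use std::sync::mpsc::sync_channel;
use std::thread::{self, JoinHandle};

/// Hypothetical stand-in for serializing one RecordBatch to CSV bytes.
fn serialize_batch(rows: &[u32]) -> Vec<u8> {
    let mut out = Vec::new();
    for r in rows {
        out.extend_from_slice(format!("{}\n", r).as_bytes());
    }
    out
}

/// Serializes batches in parallel but appends them to the output in input order.
/// `capacity` bounds how many pending handles may sit in the channel.
fn write_ordered(batches: Vec<Vec<u32>>, capacity: usize) -> Vec<u8> {
    // The channel carries *handles* to in-flight work, not finished buffers,
    // so the queue order is the submission order.
    let (tx, rx) = sync_channel::<JoinHandle<Vec<u8>>>(capacity);

    let producer = thread::spawn(move || {
        for batch in batches {
            let handle = thread::spawn(move || serialize_batch(&batch));
            // `send` blocks while the channel is full: this is the backpressure.
            if tx.send(handle).is_err() {
                break; // consumer went away
            }
        }
        // `tx` is dropped here, which ends the consumer loop below.
    });

    // Stand-in for the ObjectStore writer.
    let mut file = Vec::new();
    for handle in rx {
        // Joining handles in arrival order restores the original batch order,
        // even if a later batch finished serializing first.
        file.extend(handle.join().expect("serializer thread panicked"));
    }
    producer.join().expect("producer thread panicked");
    file
}
```

Calling write_ordered(vec![vec![1, 2], vec![3], vec![4, 5, 6]], 2) should yield the rows 1 through 6 in that order. Whether the same shape translates cleanly to the async writer here (for example, a channel of tokio JoinHandles) is an assumption that would need checking against the actual code.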

To Reproduce

I have not yet verified a specific case where order is not preserved (TODO), but I don't see any reason why it is guaranteed, since the channel does not preserve ordering across the tasks sending into it (the result will depend on how tokio schedules those tasks).

Expected behavior

We should guarantee that file ordering is preserved regardless of parallelization.

Additional context

No response

Labels: bug