Description
Describe the bug
The initial implementation of #7452 was intended to preserve the ordering of rows in CSV/JSON files when a user runs a query like:
COPY (select * from my_table order by my_col)
TO my_file.csv
It is reasonable to expect that the CSV should be ordered by my_col. When this function: https://github.com/apache/arrow-datafusion/blob/561e0d7e87825aba224bf2eb9c3b8b5e1b725597/datafusion/core/src/datasource/file_format/write.rs#L310-L393
was updated to include the mpsc::channel, I believe we lost the guarantee of preserving the expected file order. The channel is nice because it introduces backpressure and ensures memory requirements do not grow without bound if ObjectStore writes fall behind, but I am not sure how to preserve the ordering of serialized RecordBatches with the channel construct.
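One possible direction, sketched here under stated assumptions rather than taken from the DataFusion code: send the handle of each serialization task through the bounded channel, in submission order, instead of sending the serialized bytes. The writer then waits on the handles in the order it receives them, so output order matches input order even when the tasks finish out of order. To stay self-contained, the sketch uses std threads and std::sync::mpsc::sync_channel in place of tokio tasks and tokio's mpsc::channel; serialize_batch and write_in_order are hypothetical names, and a Vec<u32> stands in for a RecordBatch.

```rust
use std::sync::mpsc::sync_channel;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical stand-in for serializing one RecordBatch to CSV text.
fn serialize_batch(rows: Vec<u32>) -> String {
    // Smaller batches sleep longer, so workers tend to finish out of order.
    let delay_ms = 5 * 4u64.saturating_sub(rows.len() as u64);
    thread::sleep(Duration::from_millis(delay_ms));
    rows.iter().map(|r| format!("{r}\n")).collect()
}

/// Serializes batches in parallel and concatenates the results in
/// submission order. `capacity` bounds how many handles may be queued.
fn write_in_order(batches: Vec<Vec<u32>>, capacity: usize) -> String {
    // Bounded channel: `send` blocks once `capacity` handles are waiting,
    // which keeps the backpressure the original channel provided.
    let (tx, rx) = sync_channel::<JoinHandle<String>>(capacity);

    let producer = thread::spawn(move || {
        for batch in batches {
            // Start the work, then enqueue its handle (not its result).
            let handle = thread::spawn(move || serialize_batch(batch));
            if tx.send(handle).is_err() {
                break; // the receiver was dropped; stop producing
            }
        }
        // `tx` is dropped here, which ends the receiver's iteration.
    });

    let mut out = String::new();
    // Handles arrive in the order they were sent; joining each one in turn
    // yields the serialized batches in their original order.
    for handle in rx {
        out.push_str(&handle.join().expect("serializer thread panicked"));
    }
    producer.join().expect("producer thread panicked");
    out
}
```

Calling write_in_order(vec![vec![1], vec![2, 3], vec![4, 5, 6]], 2) should return the rows in input order even though the first batch is the slowest to serialize. The trade-off is that a slow early batch delays writing of later batches that are already finished.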
To Reproduce
I have not yet verified a specific case where order is not preserved (TODO), but I don't see any reason why it is guaranteed, since the channel does not preserve ordering (it will depend on how tokio schedules the tasks).
Expected behavior
We should guarantee that file ordering is preserved regardless of parallelization.
Additional context
No response