ARROW-6274: [Rust] [DataFusion] Add support for writing results to CSV #5577
Conversation
@hengruo This is a great start! I think what we need is a struct for the CSV writer that users can create and then repeatedly call a method to write batches. This would allow me to easily integrate this with DataFusion, for example. The DataFusion integration can be a separate PR.
I have a question. If
I think that the writer should keep track of whether it has written the header yet and just write it on the first call to |
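The suggestion above could be sketched roughly like this. This is a simplified illustration of the idea, not the actual arrow API: the struct and method names (`CsvWriter`, `write_batch`) are hypothetical, and rows are plain strings rather than `RecordBatch`es.

```rust
use std::io::Write;

// Hypothetical sketch: a CSV writer that users create once and then call
// repeatedly, writing the header only on the first call.
struct CsvWriter<W: Write> {
    writer: W,
    header_written: bool,
}

impl<W: Write> CsvWriter<W> {
    fn new(writer: W) -> Self {
        CsvWriter {
            writer,
            header_written: false,
        }
    }

    // Write one "batch" (rows of strings here, standing in for a RecordBatch).
    fn write_batch(&mut self, header: &[&str], rows: &[Vec<String>]) -> std::io::Result<()> {
        if !self.header_written {
            writeln!(self.writer, "{}", header.join(","))?;
            self.header_written = true;
        }
        for row in rows {
            writeln!(self.writer, "{}", row.join(","))?;
        }
        Ok(())
    }
}

fn main() -> std::io::Result<()> {
    let mut buf: Vec<u8> = Vec::new();
    {
        let mut w = CsvWriter::new(&mut buf);
        // Two calls, but the header is emitted only once.
        w.write_batch(&["a", "b"], &[vec!["1".to_string(), "2".to_string()]])?;
        w.write_batch(&["a", "b"], &[vec!["3".to_string(), "4".to_string()]])?;
    }
    print!("{}", String::from_utf8(buf).unwrap());
    Ok(())
}
```

Because the header state lives in the struct, downstream code (e.g. DataFusion) can stream batches to the same writer without worrying about duplicate headers.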
Hi @hengruo, I initially worked on the CSV writer. I'm happy with the changes that you made, but I wasn't able to follow the StringWriter, likely because there are no docs/comments there.
I think overall we should take care not to generate vectors of strings for large datasets before writing them, since some batches might be large (if I understand your comment correctly).
I like the approach of writing batches individually, because it allows downstream implementors to change the behaviour of writing files (e.g. if we want to write partitioned CSV files, with each record batch as a file and each file having headers).
Admittedly, it is not an elegant implementation. I had planned to implement a function converting Vec<&RecordBatch> to a string, but it has the same issue: it is a duplicate.
Another option is to convert Vec<&RecordBatch> to a Vec first and then convert that into a string or write it to a file, but that may have performance issues when there are many rows.
What would the type in Vec<Vec<?>> be?
LGTM
To avoid having other commits appear in your PR, you should preferably rebase on upstream changes instead of merging master into your PR branch.
Something like
git rebase origin/master
git push --force
LGTM ... I think we could get rid of some of these unwrap calls, but I'm happy to do this in a separate PR.
We also might need to rebase before merging?
Thanks for rebasing! This can be merged once CI is green.
I think that if we implement a function converting a Relation to a CSV string, it would duplicate some functions in arrow/csv/writer.rs, so I just added a function to export all RecordBatches in a Relation.
Also, I encapsulated StringWriter in arrow/utils/string_writer.rs as an implementation of the Write trait, which can be passed as the parameter to arrow::csv::Writer.
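For context, a minimal sketch of what such a StringWriter can look like. This illustrates the idea of an in-memory `std::io::Write` target backed by a `String`, and is not necessarily the exact code in this PR (the `into_string` accessor here is my own naming):

```rust
use std::io::{Result, Write};

// Sketch: collect written bytes into a String so that any io::Write-based
// writer (such as a CSV writer) can produce an in-memory string instead of
// writing to a file.
pub struct StringWriter {
    data: String,
}

impl StringWriter {
    pub fn new() -> Self {
        StringWriter {
            data: String::new(),
        }
    }

    // Consume the writer and return the accumulated string.
    pub fn into_string(self) -> String {
        self.data
    }
}

impl Write for StringWriter {
    fn write(&mut self, buf: &[u8]) -> Result<usize> {
        // Assumes the input is (mostly) valid UTF-8; invalid bytes are
        // replaced rather than treated as errors in this sketch.
        self.data.push_str(&String::from_utf8_lossy(buf));
        Ok(buf.len())
    }

    fn flush(&mut self) -> Result<()> {
        // Nothing to flush: everything is already in memory.
        Ok(())
    }
}

fn main() -> Result<()> {
    let mut w = StringWriter::new();
    w.write_all(b"a,b\n1,2\n")?;
    print!("{}", w.into_string());
    Ok(())
}
```

Because it implements the Write trait, this type can be dropped in anywhere a file or socket writer would normally go, which is what makes it usable as the underlying writer for CSV output.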
Admittedly, it is not an elegant implementation. I had planned to implement a function converting Vec<&RecordBatch> to a string, but it has the same issue: it is a duplicate.
Another option is to convert Vec<&RecordBatch> to a Vec<Vec> first and then convert that into a string or write it to a file, but that may have performance issues when there are many rows.
Any suggestions are welcome. Thanks!