ARROW-6274: [Rust] [DataFusion] Add support for writing results to CSV #5577
Conversation
@hengruo This is a great start! I think what we need is a struct for the CSV writer that users can create and then repeatedly call a method to write batches. This would allow me to easily integrate this with DataFusion, for example. The DataFusion integration can be a separate PR.
I have a question. If
I think that the writer should keep track of whether it has written the header yet and just write it on the first call to |
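The suggestion above could be sketched roughly like this. This is a simplified illustration of the idea, not the actual arrow API: the struct and method names (`CsvWriter`, `write_batch`) are hypothetical, and rows are plain strings rather than `RecordBatch`es.

```rust
use std::io::Write;

// Hypothetical sketch: a CSV writer that users create once and then call
// repeatedly, writing the header only on the first call.
struct CsvWriter<W: Write> {
    writer: W,
    header_written: bool,
}

impl<W: Write> CsvWriter<W> {
    fn new(writer: W) -> Self {
        CsvWriter {
            writer,
            header_written: false,
        }
    }

    // Write one "batch" (rows of strings here, standing in for a RecordBatch).
    fn write_batch(&mut self, header: &[&str], rows: &[Vec<String>]) -> std::io::Result<()> {
        if !self.header_written {
            writeln!(self.writer, "{}", header.join(","))?;
            self.header_written = true;
        }
        for row in rows {
            writeln!(self.writer, "{}", row.join(","))?;
        }
        Ok(())
    }
}

fn main() -> std::io::Result<()> {
    let mut buf: Vec<u8> = Vec::new();
    {
        let mut w = CsvWriter::new(&mut buf);
        // Two calls, but the header is emitted only once.
        w.write_batch(&["a", "b"], &[vec!["1".to_string(), "2".to_string()]])?;
        w.write_batch(&["a", "b"], &[vec!["3".to_string(), "4".to_string()]])?;
    }
    print!("{}", String::from_utf8(buf).unwrap());
    Ok(())
}
```

Because the header state lives in the struct, downstream code (e.g. DataFusion) can stream batches to the same writer without worrying about duplicate headers.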
Hi @hengruo, I initially worked on the CSV writer. I'm happy with the changes that you made, but I wasn't able to follow the StringWriter, likely because there are no docs/comments there.
I think overall we should take care not to generate vectors of strings for large datasets before writing them, since some batches might be large (if I understand your comment correctly).
I like the approach of writing batches individually, because it allows downstream implementors to change the behaviour of writing files (e.g. if we want to write partitioned CSV files, with each record batch as a file and each file having headers).
Admittedly, it is not an elegant implementation. I had planned to implement a function converting Vec<&RecordBatch> to a string, but it has the same issue: it is a duplicate.
Another option is to convert Vec<&RecordBatch> to a Vec first and then convert that into a string or write it to a file, but that may have performance issues when there are many rows.
What would the type in Vec<Vec<?>> be?
LGTM
To avoid having other commits appear in your PR, you should preferably rebase on upstream changes instead of merging master into your PR branch.
Something like
git rebase origin/master
git push --force
LGTM ... I think we could get rid of some of these unwrap calls, but I'm happy to do this in a separate PR.
We also might need to rebase before merging?
Thanks for rebasing! This can be merged once CI is green.
I think that if we implement a function converting a Relation to a CSV string, it would duplicate some functions in arrow/csv/writer.rs, so I just added a function to export all RecordBatches in a Relation.
Also, I encapsulated StringWriter in arrow/utils/string_writer.rs as an implementation of the Write trait, which can be passed as the parameter to arrow::csv::Writer.
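For context, a minimal sketch of what such a StringWriter can look like. This illustrates the idea of an in-memory `std::io::Write` target backed by a `String`, and is not necessarily the exact code in this PR (the `into_string` accessor here is my own naming):

```rust
use std::io::{Result, Write};

// Sketch: collect written bytes into a String so that any io::Write-based
// writer (such as a CSV writer) can produce an in-memory string instead of
// writing to a file.
pub struct StringWriter {
    data: String,
}

impl StringWriter {
    pub fn new() -> Self {
        StringWriter {
            data: String::new(),
        }
    }

    // Consume the writer and return the accumulated string.
    pub fn into_string(self) -> String {
        self.data
    }
}

impl Write for StringWriter {
    fn write(&mut self, buf: &[u8]) -> Result<usize> {
        // Assumes the input is (mostly) valid UTF-8; invalid bytes are
        // replaced rather than treated as errors in this sketch.
        self.data.push_str(&String::from_utf8_lossy(buf));
        Ok(buf.len())
    }

    fn flush(&mut self) -> Result<()> {
        // Nothing to flush: everything is already in memory.
        Ok(())
    }
}

fn main() -> Result<()> {
    let mut w = StringWriter::new();
    w.write_all(b"a,b\n1,2\n")?;
    print!("{}", w.into_string());
    Ok(())
}
```

Because it implements the Write trait, this type can be dropped in anywhere a file or socket writer would normally go, which is what makes it usable as the underlying writer for CSV output.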
Admittedly, it is not an elegant implementation. I had planned to implement a function converting Vec<&RecordBatch> to a string, but it has the same issue: it is a duplicate.
Another option is to convert Vec<&RecordBatch> to a Vec<Vec> first and then convert that into a string or write it to a file, but that may have performance issues when there are many rows.
Any suggestions are welcome. Thanks!