Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-6274: [Rust] [DataFusion] Add support for writing results to CSV #5577

Closed
wants to merge 2 commits into from
Closed

ARROW-6274: [Rust] [DataFusion] Add support for writing results to CSV #5577

wants to merge 2 commits into from

Conversation

hengruo
Copy link

@hengruo hengruo commented Oct 4, 2019

I think if we implement a function converting Relation to CSV string, it is a duplicate of some functions in arrow/csv/writer.rs, so I just added a function to export all RecordBatches in Relation.
Also, I encapsulated StringWriter in arrow/utils/string_writer.rs, as a Write trait's implementation, which can be set as arrow::csv::Writer 's parameter.

In fact, it is not an elegant implementation. I have planned to implement a function converting Vec<& RecordBatch> to string, however, the same issue, it is duplicate.

And another way is that we can convert Vec<& RecordBatch> to Vec<Vec> first and convert it into a string or write it into a file, but it may have performance issue when there are many rows.

Any suggestions are welcome. Thanks!

@github-actions
Copy link

github-actions bot commented Oct 4, 2019

@fsaintjacques fsaintjacques changed the title ARROW-6274 ARROW-6274: [Rust] [DataFusion] Add support for writing results to CSV Oct 4, 2019
@andygrove
Copy link
Member

@hengruo This is a great start! I think what we need is a struct for the CSV writer that users can create and then repeatedly call a method to write batches. This would allow me to easily integrate this with DataFusion for example. The DataFusion integration can be a separate PR.

@hengruo
Copy link
Author

hengruo commented Oct 7, 2019

@hengruo This is a great start! I think what we need is a struct for the CSV writer that users can create and then repeatedly call a method to write batches. This would allow me to easily integrate this with DataFusion for example. The DataFusion integration can be a separate PR.

I have a question. If write() writes one batch once, how to deal with the header? Should we add a new argument to decide whether it is the first calling so that it needs to write the header?

@andygrove
Copy link
Member

I think that the writer should keep track of whether it has written the header yet and just write it on the first call to write()

Copy link
Contributor

@nevi-me nevi-me left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @hengruo, I initially worked on the CSV writer, I'm happy with the changes that you made, but I wasn't able to follow the StringWriter, likely due to there not being docs/comments there.

I think overall, we could take care not to generate vecs of strings for large datasets (as some batches might be large, also, if I understand your comment correctly) before writing them.

I like the approach of writing batches individually, because it allows downstream implementors to change the behaviour of writing files (e.g. if we want to write partitioned csv files with each record batch as a file, and with each file having headers).

In fact, it is not an elegant implementation. I have planned to implement a function converting Vec<& RecordBatch> to string, however, the same issue, it is duplicate.

And another way is that we can convert Vec<& RecordBatch> to Vec first and convert it into a string or write it into a file, but it may have performance issue when there are many rows.

What would the type in Vec<Vec<?>> be?

rust/datafusion/src/converter/csv.rs Outdated Show resolved Hide resolved
rust/arrow/src/util/string_writer.rs Show resolved Hide resolved
Copy link
Contributor

@nevi-me nevi-me left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

To avoid having other commits appearing in your PR, you should preferably rebase upstream changes instead of merging master into your PR branch.

Something like

git rebase origin/master
git push --force

Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM ... I think we could get rid of some of these unwrap calls but happy to do this in a separate PR

rust/arrow/src/csv/writer.rs Outdated Show resolved Hide resolved
@andygrove
Copy link
Member

We also might need to rebase before merging?

@andygrove
Copy link
Member

Thanks for rebasing! This can be merged once CI is green.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants