Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change return type of 'DataFrame.collect()' #442

Closed
wants to merge 1 commit into from

Conversation

kination
Copy link
Contributor

Which issue does this PR close?

It is for following issue #47

What changes are included in this PR?

Change return type of 'DataFrame.collect()' to 'SendableRecordBatchStream'.

Currently this is draft change, so please let me know if this change is not the one you've intended 🙏

@alamb alamb added api change Changes the API exposed to users of the crate datafusion Changes in the datafusion crate labels Jun 1, 2021
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR appears to change the signature of the free function datafusion::physical_plan::collect -- I think the definition of DataFrame::collect is here:
https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/dataframe.rs#L221

@andygrove is this what you had in mind in #47?

@andygrove
Copy link
Member

andygrove commented Jun 5, 2021

The change in this PR looks like a good start and probably necessary, but yes, the goal is to update DataFrame::collect

@kination
Copy link
Contributor Author

kination commented Jun 6, 2021

@alamb @andygrove thanks for review. Will fix soon 🙏

As far as I review from now, seems it needs to fix return format of collect inside dataframe_impl. And I've fixed collect inside physical_plan because they are calling this function inside...

Please let me know if there's any misunderstanding 🙇

@alamb alamb added the stale-pr label Jul 13, 2021
@alamb
Copy link
Contributor

alamb commented Jul 13, 2021

@djKooks do you plan to keep working on this PR?

@kination
Copy link
Contributor Author

kination commented Jul 14, 2021

@alamb sure~
But because I'm still on reviewing the project, so want to make sure I'm going in right way...

As far as I review from now, seems it needs to fix return format of collect inside dataframe_impl. And I've fixed collect inside physical_plan because they are calling this function inside...
Please let me know if there's any misunderstanding 🙇

=> Am I thinking correctly?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @djKooks

#47
I think is referring to DataFrame::collect(), https://docs.rs/datafusion/4.0.0/datafusion/dataframe/trait.DataFrame.html#tymethod.collect not physical_plan::collect() which is what this PR currently does

@andygrove can you please weigh in here on the desired API?

#47 says collect() , however, collect() elsewhere in the code returns Vec<RecordBatch>

Perhaps would it make sense to add a new function like DataFrame::execute() that returned a SendableRecordBatch stream and leave collect() the way it is?

`

@andygrove
Copy link
Member

Sorry for the delayed response. I have been working on related areas and plan to look at this later today or tomorrow.

@andygrove
Copy link
Member

I have created #789 to implement streaming versions of the collect methods.

@kination
Copy link
Contributor Author

@andygrove okay. Will close this

@kination kination closed this Jul 31, 2021
@alamb
Copy link
Contributor

alamb commented Jul 31, 2021

Thanks for prodding this along @djKooks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate datafusion Changes in the datafusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants